Saturday, December 11, 2021

Troubleshooting SD-WAN cEdge IPsec Replay Failures

 

Introduction

 

IPsec authentication provides built-in anti-replay protection against old or duplicated IPsec packets by checking the sequence number in the ESP header on the receiver. Anti-replay packet drops are one of the most common data-plane issues with IPsec, and they typically occur when packets are delivered out of order and fall outside of the anti-replay window. A general troubleshooting approach for IPsec anti-replay drops can be found here, and the general technique applies to SD-WAN as well. However, there are some implementation differences between traditional IPsec and the IPsec used in the Cisco SD-WAN solution. This article explains these differences and the troubleshooting approach on the cEdge platforms running IOS-XE.

 

SDWAN Replay Detection Considerations

 

Group key vs. Pairwise key

 

Unlike traditional IPsec, where IPsec SAs are negotiated between two peers using the IKE protocol, SD-WAN uses a group key concept. In this model, an SD-WAN edge device periodically generates a data plane inbound SA per TLOC and sends these SAs to the vSmart controller, which in turn propagates them to the rest of the edge devices in the SD-WAN network. For a more detailed description of the SD-WAN data plane operations, see SD-WAN Data Plane Security Overview.

 

Note: Starting from IOS-XE 16.12.1a/SD-WAN 19.2, IPsec pairwise keys are supported. See IPsec Pairwise Keys Overview. With pairwise keys, IPsec anti-replay protection works exactly like traditional IPsec. This article focuses primarily on the replay check using the group key model.

 

SPI Encoding

 

In the IPsec ESP header, the SPI (Security Parameter Index) is a 32-bit value that the receiver uses to identify the SA with which an incoming packet should be decrypted. With SD-WAN, this inbound SPI can be identified with show crypto ipsec sa:

 

cedge-2#show crypto ipsec sa | se inbound
     inbound esp sas:
      spi: 0x123(291)
        transform: esp-gcm 256 ,
        in use settings ={Transport UDP-Encaps, esn}
        conn id: 2083, flow_id: CSR:83, sibling_flags FFFFFFFF80000008, crypto map: Tunnel1-vesen-head-0
        sa timing: remaining key lifetime 9410 days, 4 hours, 6 mins
        Kilobyte Volume Rekey has been disabled
        IV size: 8 bytes
        replay detection support: Y
        Status: ACTIVE(ACTIVE)

Note: The SPI displayed with this command may not be the actual SA used in the data plane due to CSCvt06182.

 

Notice that even though this inbound SPI is the same for all the tunnels, the receiver has a different SA, and a corresponding replay-window object, for each peer edge device. This is because the SA is identified by the 4-tuple of source and destination IP addresses and source and destination ports, together with the SPI number. So essentially, each peer has its own anti-replay window object.

 

When looking at the actual packet sent by the peer device, one may notice the SPI value is different from the above output. Here is an example from the packet-trace output with the packet copy option enabled:

 

Packet Copy In
  45000102 0cc64000 ff111c5e ac127cd0 ac127cd1 3062303a 00eea51b 04000123
  00000138 78014444 f40d7445 3308bf7a e2c2d4a3 73f05304 546871af 8d4e6b9f

The actual SPI in the ESP header is 0x04000123. The reason is that the leading bits of the SPI field in SD-WAN are encoded with additional information, and only the low-order bits of the field are allocated for the actual SPI.
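As a rough illustration, the header fields in the packet copy above can be pulled out with a few lines of Python. This is only a sketch: it assumes the copy starts at the IPv4 header and that the ESP header directly follows the 8-byte UDP header, which matches this particular dump. The helper name parse_packet_copy is made up for this example.

```python
import ipaddress
import struct

# First 36 bytes of the "Packet Copy In" dump shown above.
PACKET_COPY = """
45000102 0cc64000 ff111c5e ac127cd0 ac127cd1 3062303a 00eea51b 04000123
00000138
"""

def parse_packet_copy(hex_dump):
    """Decode IPv4/UDP/ESP header fields from a hex dump (illustrative helper)."""
    data = bytes.fromhex("".join(hex_dump.split()))
    ihl = (data[0] & 0x0F) * 4              # IPv4 header length in bytes
    if data[9] != 17:                       # IP protocol 17 = UDP
        raise ValueError("expected UDP-encapsulated ESP")
    sport, dport = struct.unpack_from("!HH", data, ihl)
    # Assumption: ESP starts right after the UDP header (SPI, then sequence number).
    spi, seq = struct.unpack_from("!II", data, ihl + 8)
    return {
        "src": str(ipaddress.IPv4Address(data[12:16])),
        "dst": str(ipaddress.IPv4Address(data[16:20])),
        "sport": sport,
        "dport": dport,
        "spi": spi,
        "seq": seq,
    }

fields = parse_packet_copy(PACKET_COPY)
print(fields["src"], fields["dst"], hex(fields["spi"]), fields["seq"])
```

For this dump the raw SPI comes out as 0x4000123 and the ESP sequence number as 312 (0x138). With ESN enabled, only the low 32 bits of the sequence number are carried in the packet.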

 

Traditional IPsec:

 

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               Security Parameters Index (SPI)                 | 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 

SD-WAN:

 

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
|  CTR  | MSNS|         Security Parameters Index (SPI)         | 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 

Where:

 

  • CTR (first 4 bits, bits 0-3) - Control Bits, used to indicate specific types of control packets. For example, control bit 0x80000000 is used for BFD.
  • MSNS (next 3 bits, bits 4-6) - Multiple Sequence Number Space Index. This is used to locate the correct sequence counter in the sequence counter array when checking a given packet for replay. For SD-WAN, the 3 MSNS bits allow 8 different traffic classes to be mapped into their own sequence number spaces. This implies that the effective SPI value that can be used for SA selection is reduced to the low-order 25 bits of the full 32-bit field. More on this below.
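The field layout above can be expressed as a few shifts and masks. The following Python sketch splits a raw 32-bit SPI value into its CTR, MSNS, and effective SPI parts, assuming the bit numbering in the diagram (bit 0 is the most significant bit). The function name decode_sdwan_spi is hypothetical.

```python
def decode_sdwan_spi(raw_spi):
    """Split a raw 32-bit ESP SPI field into (ctr, msns, spi) per the layout above."""
    ctr = (raw_spi >> 28) & 0xF      # bits 0-3: control bits
    msns = (raw_spi >> 25) & 0x7     # bits 4-6: sequence number space index
    spi = raw_spi & 0x01FFFFFF       # low-order 25 bits: effective SPI
    return ctr, msns, spi

ctr, msns, spi = decode_sdwan_spi(0x04000123)
print(f"ctr={ctr:#x} msns={msns} spi={spi:#x}")   # ctr=0x0 msns=2 spi=0x123
```

Applied to the 0x04000123 value from the packet copy, this gives MSNS 2 and an effective SPI of 0x123, which matches the SPI reported by show crypto ipsec sa.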

 

 

Multiple Sequence Number Space for QoS

 

It is common to observe IPsec replay failures in an environment where packets are delivered out of order due to QoS, e.g., LLQ, since QoS always runs after IPsec encryption and encapsulation. The Multiple Sequence Number Space solution solves this problem by maintaining multiple sequence number spaces, mapped to different QoS traffic classes, for a given Security Association. Each sequence number space is indexed by the MSNS bits encoded in the ESP packet SPI field as depicted above. For a more detailed description, please see IPsec Anti Replay Mechanism for QoS.
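To see why separate sequence number spaces help, consider the toy simulation below. It is a simplified model, not the actual data plane implementation: the sender alternates between a best-effort class and a priority class, strict priority queuing is approximated by delivering all priority packets of a burst first, and the receiver runs a basic sliding-window replay check. The class ReplayWindow, the function simulate, and the window and burst sizes are all assumptions chosen for illustration.

```python
class ReplayWindow:
    """Minimal sliding-window anti-replay check (illustrative only)."""

    def __init__(self, size):
        self.size = size
        self.highest = 0          # right edge of the window
        self.seen = set()         # sequence numbers accepted inside the window

    def accept(self, seq):
        if seq > self.highest:    # new right edge: slide the window forward
            self.highest = seq
            self.seen.add(seq)
            self.seen = {s for s in self.seen if s > self.highest - self.size}
            return True
        if seq <= self.highest - self.size or seq in self.seen:
            return False          # too old, or a duplicate
        self.seen.add(seq)
        return True


def simulate(per_class_spaces, window=64, burst=200):
    """Return the number of replay drops for one burst of mixed traffic."""
    counters, packets = {}, []
    for i in range(burst):
        cls = i % 2                               # 0 = best effort, 1 = priority
        space = cls if per_class_spaces else 0    # which sequence counter to use
        counters[space] = counters.get(space, 0) + 1
        packets.append((cls, space, counters[space]))

    # Crude LLQ model: priority packets leave first; the sort is stable,
    # so the order within each class is preserved.
    delivered = sorted(packets, key=lambda p: -p[0])

    windows, drops = {}, 0
    for _cls, space, seq in delivered:
        win = windows.setdefault(space, ReplayWindow(window))
        if not win.accept(seq):
            drops += 1
    return drops


print("single sequence space :", simulate(per_class_spaces=False))
print("per-class spaces      :", simulate(per_class_spaces=True))
```

With a single shared sequence space, the delayed best-effort packets arrive after the priority packets have already pushed the right edge of the window far ahead, so many of them fall outside the window and are dropped. With one space per class, each class is checked against its own counter and the reordering between classes causes no drops.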

 

As noted above, this Multiple Sequence Number Space implementation implies that the effective SPI value that can be used for SA selection is reduced to the low-order 25 bits. Another practical consideration when configuring the replay window size with this implementation is that the configured replay-window size applies to the aggregate replay window, so the effective replay window size for each Sequence Number Space is 1/8 of the aggregate. For example, with the following configuration:

 

config-t
security
ipsec
replay-window 1024
commit

 

The effective replay window size for each Sequence Number Space is 1024/8 = 128!

 

Note: Starting from IOS-XE 17.2.1, the maximum aggregate replay window size has been increased to 8192, so that each Sequence Number Space can have a maximum replay window of 8192/8 = 1024 packets. This change was introduced with CSCvs51630.
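The arithmetic above can be written down as a small helper, which is convenient when sizing the window for a given amount of expected reordering. This is only a sketch of the 1/8 rule described in this section; the function name per_space_window is hypothetical.

```python
SEQUENCE_NUMBER_SPACES = 8   # 3 MSNS bits give 2**3 sequence number spaces

def per_space_window(aggregate_window):
    """Effective replay window per sequence number space for a configured aggregate."""
    return aggregate_window // SEQUENCE_NUMBER_SPACES

for aggregate in (1024, 8192):
    print(f"replay-window {aggregate} -> {per_space_window(aggregate)} packets per space")
```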

 

On an IOS-XE cEdge device, the last sequence number received for each sequence number space can be obtained from the following IPsec dataplane output:

 

cedge-2#show crypto ipsec sa peer 172.18.124.208 platform

<snip>

------------------ show platform hardware qfp active feature ipsec datapath crypto-sa 5 ------------------

 Crypto Context Handle: ea54f530
 peer sa handle: 0
 anti-replay enabled
 esn enabled
 Inbound SA
 Total SNS: 8
 Space                highest ar number
 ----------------------------------------
   0                               39444
   1                                   0
   2                                1355
   3                                   0
   4                                   0
   5                                   0
   6                                   0
   7                                   0
<snip>

In the above example, the highest anti-replay sequence number (the right edge of the anti-replay sliding window) for MSNS 0 (0x00) is 39444, and for MSNS 2 (0x04) it is 1355. These counters are used to check whether the sequence number of a packet in the same sequence number space is inside the replay window.
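When comparing captured packets against this output, it can help to parse the table and classify a sequence number relative to the window. The Python sketch below does that under two assumptions: the table format is exactly as shown above, and the per-space window is 128 packets (the 1024/8 example from this article). The helper names parse_highest_ar and classify are made up for this example. Duplicate detection inside the window is not modeled, since the per-packet bitmap is not part of this output.

```python
import re

SAMPLE_OUTPUT = """\
 Inbound SA
 Total SNS: 8
 Space                highest ar number
 ----------------------------------------
   0                               39444
   1                                   0
   2                                1355
<snip>
"""

def parse_highest_ar(output):
    """Return {space: highest anti-replay number} from the table shown above."""
    spaces, in_table = {}, False
    for line in output.splitlines():
        if "highest ar number" in line:
            in_table = True
            continue
        if in_table:
            match = re.fullmatch(r"\s*(\d+)\s+(\d+)\s*", line)
            if match:
                spaces[int(match.group(1))] = int(match.group(2))
            elif spaces:          # first non-row line after the rows ends the table
                break
    return spaces

def classify(seq, highest, window):
    """Place a sequence number relative to the replay window [highest-window+1, highest]."""
    if seq > highest:
        return "ahead of window"
    if seq <= highest - window:
        return "outside window"
    return "inside window"

right_edges = parse_highest_ar(SAMPLE_OUTPUT)
print(right_edges)
print(classify(1200, right_edges[2], window=128))   # outside window
print(classify(1300, right_edges[2], window=128))   # inside window
```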

 

Note: There are implementation differences between the ASR1k platform and the rest of the IOS-XE routing platforms (ISR4k, ISR1k, CSR1kv). As a result, there are some discrepancies in the show commands and their output for these platforms. Currently, there is no command that displays the inbound top replay window edge on the ASR1k platform. This will hopefully be addressed in 17.3 as part of our serviceability effort.

 

Troubleshooting Replay Drop Failures

 

Troubleshooting Data Collection

 

When dealing with IPsec anti-replay drops, it's important to understand the conditions and potential triggers of the problem. At a minimum, collect the following set of information to provide the context:

 

  • Device information for both the sender and receiver for the replay packet drops, including type of device, cEdge vs. vEdge, software version, and configuration.
  • Problem history. How long has the deployment been in place? How long has the problem been happening? Any recent changes to the network or traffic conditions.
  • Any pattern to the replay drops, e.g., is it sporadic or constant? Time of the problem and/or significant event, e.g., does it only happen during high traffic peak production hours, or only during rekey, etc.?


With the above information collected, proceed with the following troubleshooting workflow.

 

Troubleshooting workflow

 

The general troubleshooting approach for IPsec replay issues is the same as for traditional IPsec, while taking into account the per-peer SA sequence space and the Multiple Sequence Number Space explained above. Follow these steps:

 

1. First, identify the peer for the replay drops from the syslog, and determine the drop rate. For drop statistics, always collect multiple timestamped snapshots of the output so that the drop rate can be quantified:

 

*Feb 19 21:28:25.006: %IOSXE-3-PLATFORM: R0/0: cpp_cp: QFP:0.0 Thread:000 TS:00001141238701410779 %IPSEC-3-REPLAY_ERROR: IPSec SA receives anti-replay error, DP Handle 6, src_addr 172.18.124.208, dest_addr 172.18.124.209, SPI 0x123

cedge-2#show platform hardware qfp active feature ipsec datapath drops
Load for five secs: 1%/0%; one minute: 1%; five minutes: 1%
No time source, *11:25:53.524 EDT Wed Feb 26 2020
------------------------------------------------------------------------
Drop Type  Name                                     Packets
------------------------------------------------------------------------
        4  IN_US_V4_PKT_SA_NOT_FOUND_SPI                              30
       19  IN_CD_SW_IPSEC_ANTI_REPLAY_FAIL                            41

It's not uncommon to see occasional replay drops due to packet reordering in the network, but persistent replay drops that are service impacting should be investigated.

 

2a. For a relatively low traffic rate, take a packet-trace with the condition set to the peer IPv4 address and the copy packet option enabled. Compare the sequence number of the dropped packet against the current right edge of the replay window and against the sequence numbers of the adjacent packets to confirm whether the packet is indeed a duplicate or outside of the replay window.

 

2b. For a high traffic rate with no predictable trigger, set up an EPC capture using a circular buffer, with EEM to stop the capture when replay errors are detected. Since EEM is not supported on vManage as of 19.3, the cEdge has to be in CLI mode when this troubleshooting task is performed. Once the capture is taken, use the BDB IPsec replay analyzer to analyze the packet capture for replay conditions.

 

3. Collect the output of show crypto ipsec sa peer x.x.x.x platform on the receiver, ideally at the same time the packet capture or packet-trace is collected. This output should include the real-time dataplane replay window information for both the inbound and outbound SA.

 

4. If the packet dropped is indeed out of order, then take simultaneous captures from both the sender and receiver to identify if the problem is with the source or with the underlay network delivery layer.

 

5. If the packets are dropped even though they are neither duplicate nor outside of the replay window, then it's usually indicative of a software problem on the receiver.
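For step 1, the drop rate can be computed from two timestamped snapshots of the datapath drops output. The Python sketch below shows one way to do this. The first snapshot is taken from the output above; the second snapshot, one minute later with a counter of 101, is invented purely for illustration. The helper names parse_snapshot and drop_rate are hypothetical, and the parsing assumes the clock line and counter line formats shown in step 1.

```python
import re
from datetime import datetime

COUNTER = "IN_CD_SW_IPSEC_ANTI_REPLAY_FAIL"

SNAPSHOT_1 = """\
No time source, *11:25:53.524 EDT Wed Feb 26 2020
       19  IN_CD_SW_IPSEC_ANTI_REPLAY_FAIL                            41
"""

# Hypothetical second snapshot (not from a real device), one minute later.
SNAPSHOT_2 = """\
No time source, *11:26:53.524 EDT Wed Feb 26 2020
       19  IN_CD_SW_IPSEC_ANTI_REPLAY_FAIL                           101
"""

def parse_snapshot(text):
    """Return (timestamp, replay drop counter) from one show command snapshot."""
    clock = re.search(r"(\d{2}:\d{2}:\d{2}\.\d+) \S+ (\w{3} \w{3} +\d+ \d{4})", text)
    count = re.search(rf"{COUNTER}\s+(\d+)", text)
    if not clock or not count:
        raise ValueError("snapshot is missing the clock line or the counter")
    when = datetime.strptime(f"{clock.group(1)} {clock.group(2)}",
                             "%H:%M:%S.%f %a %b %d %Y")
    return when, int(count.group(1))

def drop_rate(first, second):
    """Replay drops per second between two snapshots."""
    t1, c1 = parse_snapshot(first)
    t2, c2 = parse_snapshot(second)
    return (c2 - c1) / (t2 - t1).total_seconds()

print(f"{drop_rate(SNAPSHOT_1, SNAPSHOT_2):.2f} replay drops per second")
```

The timezone abbreviation is ignored here, so both snapshots are assumed to come from the same device clock.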

 

Known Issues/Enhancements

 

  • CSCvq31153  SDWAN BFD session stuck and packet drops due to IN_CD_SW_IPSEC_ANTI_REPLAY_FAIL drops
  • CSCvr64231  BFD down with IPSec SA receives anti-replay error after NAT session flap sometimes
  • CSCvs48535  %IPSEC-3-REPLAY_ERROR: + BFD down and drops IN_CD_COPROC_ANTI_REPLAY_FAIL (vEdge incorrectly resets ESP seq.)
  • CSCvn79788  Incorrect syslog for anti-replay error on TSN1100 platform with SDWAN per-Tunnel QoS
  • CSCvs51630  cEdge: 'security ipsec replay-window' needs to support 8192
  • CSCvq75871  SDWAN ipsec anti-replay drops for all packets when NAT session flap
  • CSCvn67507  Packet drops due to IPSec-input and anti-replay when remote TLOC flaps
  • CSCvx15750  SD-WAN:cEdge ipsec replay-window size decreases to 128 after a peer reloading
  • CSCvw00044  20.4-EFT: BFD sessions down on vEdge due to rx_replay_integrity_drops - Polaris side commit
  • CSCvs98389  Packet drops in XE-SDWAN because of "IN_CD_COPROC_ANTI_REPLAY_FAIL" errors

 

 

