Thursday, February 20, 2020

Troubleshooting AD Redistribution (Part 2)

When we configure redistribution, we expect to have full connectivity. Take a look at the diagram below:
[Topology diagram: R1–R2 run RIP, R2–R4 and R3–R4 run EIGRP, R2–R3 runs OSPF; R2 redistributes between RIP and EIGRP, R3 between EIGRP and OSPF]
Above you can see where we will perform redistribution. Let’s be brainless and configure redistribution without thinking too much about this topology:
R2(config)#router rip
R2(config-router)#redistribute eigrp 1 metric 1

R2(config)#router eigrp 1
R2(config-router)#redistribute rip metric 1000 100 255 1 1500
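A quick note on these seed metrics: RIP and EIGRP generally use a default seed metric of infinity for redistributed routes, so without a metric the redistributed routes would never be advertised. The five EIGRP values are bandwidth (in kbps), delay (in tens of microseconds), reliability (1-255), load (1-255), and MTU (in bytes). As an alternative sketch, you can also set the seed metric once with the default-metric command instead of on each redistribute command:
R2(config)#router eigrp 1
R2(config-router)#default-metric 1000 100 255 1 1500
R2(config-router)#redistribute rip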
R2 is now redistributing between RIP and EIGRP. Let’s configure R3:
R3(config)#router eigrp 1
R3(config-router)#redistribute ospf 1 metric 1500 100 255 1 1500

R3(config)#router ospf 1
R3(config-router)#redistribute eigrp 1 subnets
Easy enough! Redistribution has been configured, so we should have full connectivity, right? Let's check if everything is reachable…

Reachability Verification

When you configure redistribution and your goal is to have full connectivity then you should try pinging all IP addresses in the topology. A quick method to do this is to fetch all IP addresses using show ip aliases and then use a TCLSH script. Here’s how to do it:
R1#show ip aliases
Address Type             IP Address      Port
Interface                1.1.1.1
Interface                192.168.12.1
R2#show ip aliases
Address Type             IP Address      Port
Interface                192.168.12.2
Interface                192.168.23.2
Interface                192.168.24.2
R3#show ip aliases
Address Type             IP Address      Port
Interface                192.168.23.3
Interface                192.168.34.3
R4#show ip aliases
Address Type             IP Address      Port
Interface                192.168.24.4
Interface                192.168.34.4
This gives us a nice overview of all IP addresses. Copy/paste them into notepad and turn them into a simple TCLSH script, then run it on all routers (you enter the TCL shell with the tclsh command in privileged EXEC mode). To reduce the output, I removed all the successful pings and only kept the failed ones:
R1(tcl)#foreach address {
+>(tcl)#1.1.1.1
+>(tcl)#192.168.12.1
+>(tcl)#192.168.12.2
+>(tcl)#192.168.23.2
+>(tcl)#192.168.24.2
+>(tcl)#192.168.23.3
+>(tcl)#192.168.34.3
+>(tcl)#192.168.24.4
+>(tcl)#192.168.34.4
+>(tcl)#} { ping $address repeat 3 }

Sending 3, 100-byte ICMP Echos to 192.168.23.2, timeout is 2 seconds:
...
Success rate is 0 percent (0/3)
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 192.168.23.3, timeout is 2 seconds:
...
Success rate is 0 percent (0/3)
R1(tcl)#
R1 can ping everything with the exception of 192.168.23.2 and 192.168.23.3. That's the link between R2 and R3. Let's make a mental note of this and continue with the other routers:
R2(tcl)#foreach address {
+>(tcl)#1.1.1.1
+>(tcl)#192.168.12.1
+>(tcl)#192.168.12.2
+>(tcl)#192.168.23.2
+>(tcl)#192.168.24.2
+>(tcl)#192.168.23.3
+>(tcl)#192.168.34.3
+>(tcl)#192.168.24.4
+>(tcl)#192.168.34.4
+>(tcl)#} { ping $address repeat 3 }

Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
...
Success rate is 0 percent (0/3)

R2(tcl)#
R2 is able to reach everything with the exception of 1.1.1.1. Make a mental note of this and continue:
R3(tcl)#foreach address {
+>(tcl)#1.1.1.1
+>(tcl)#192.168.12.1
+>(tcl)#192.168.12.2
+>(tcl)#192.168.23.2
+>(tcl)#192.168.24.2
+>(tcl)#192.168.23.3
+>(tcl)#192.168.34.3
+>(tcl)#192.168.24.4
+>(tcl)#192.168.34.4
+>(tcl)#} { ping $address repeat 3 }

Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
...
Success rate is 0 percent (0/3)
R3(tcl)#
R3 has the same issue as R2. What about R4?
R4(tcl)#foreach address {
+>(tcl)#1.1.1.1
+>(tcl)#192.168.12.1
+>(tcl)#192.168.12.2
+>(tcl)#192.168.23.2
+>(tcl)#192.168.24.2
+>(tcl)#192.168.23.3
+>(tcl)#192.168.34.3
+>(tcl)#192.168.24.4
+>(tcl)#192.168.34.4
+>(tcl)#} { ping $address repeat 3 }

Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
...
Success rate is 0 percent (0/3)
R4(tcl)#
We see that R1 has some issues and that R2, R3 and R4 are unable to reach 1.1.1.1. Let’s focus on these last three routers first.
R2 is learning network 1.1.1.0 /24 directly from R1 through RIP; there's nothing in between these two routers. Let's take a quick look at the routing table of R2:
R2#show ip route

C    192.168.12.0/24 is directly connected, FastEthernet1/0
C    192.168.24.0/24 is directly connected, FastEthernet0/1
C    192.168.23.0/24 is directly connected, FastEthernet0/0
D    192.168.34.0/24 [90/307200] via 192.168.24.4, 01:03:07, FastEthernet0/1
Depending on when you check the routing table, you might or might not see an entry for 1.1.1.0 /24. Above you can see that it’s gone…a few seconds later this is what I see:
R2#show ip route

C    192.168.12.0/24 is directly connected, FastEthernet1/0
     1.0.0.0/24 is subnetted, 1 subnets
O E2    1.1.1.0 [110/20] via 192.168.23.3, 00:00:02, FastEthernet0/0
C    192.168.24.0/24 is directly connected, FastEthernet0/1
C    192.168.23.0/24 is directly connected, FastEthernet0/0
D    192.168.34.0/24 [90/307200] via 192.168.24.4, 01:03:45, FastEthernet0/1
Interesting, R2 now has an OSPF entry for network 1.1.1.0 /24. If you check the routing tables of R3 and R4, you will see 1.1.1.0 /24 appearing and disappearing as well.
It is annoying to check the routing tables of all these routers just to “hunt” for routes that disappear like a thief in the night, especially in a large topology. Fortunately, there are a couple of useful tools that can help us.
The first one is route profiling. This tells us how often the routing table changes. You have to enable it before you can look at the results:
R2, R3 & R4(config)#ip route profile
Let it run for a couple of minutes and you will see results:
R2#show ip route profile
IP routing table change statistics:
Frequency of changes in a 5 second sampling interval
-------------------------------------------------------------
Change/   Fwd-path  Prefix   Nexthop  Pathcount  Prefix
interval  change    add      change   change     refresh
-------------------------------------------------------------
0         975       983      1165     983        1165
1         16        182      0        182        0
2         174       0        0        0          0
3         0         0        0        0          0
4         0         0        0        0          0
5         0         0        0        0          0
Route profiling works by checking the routing table at a 5 second interval. The changes to the routing table are categorized as follows:
  • Fwd-path change: the number of changes in the forwarding path. This value is an accumulation of the prefix add, next-hop change, and pathcount change counters.
  • Prefix add: new prefix has been added to the routing table.
  • Nexthop change: the next hop of a prefix has changed.
  • Pathcount change: the number of paths in the routing table has changed.
  • Prefix refresh: this is standard routing table maintenance, prefixes are refreshed every now and then. No changes to the routing table have been made.
The output of route profiling is not very easy to read. Let me explain it:
  • The column on the left (Change/interval) is the frequency. For example, the value of 182 in row 1 of the Prefix add column means that there were 182 intervals in which one prefix was added.
  • The value of 174 in row 2 of the Fwd-path change column means that there were 174 intervals in which the forwarding path changed twice.
  • The value of 1165 in row 0 of the Nexthop change column means that there were 1165 intervals in which no next hop changed.
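As a sanity check, each column should add up to the same total number of sampling intervals. In the output of R2 above, the Fwd-path change column gives 975 + 16 + 174 = 1165 intervals and the Prefix add column gives 983 + 182 = 1165 intervals, so R2 has been sampling for 1165 x 5 seconds, roughly an hour and a half.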
If you see a lot of intervals in row 1 or higher then we know something is going on and that the routing table is unstable. Here’s what R3 and R4 look like:
R3#show ip route profile
IP routing table change statistics:
Frequency of changes in a 5 second sampling interval
-------------------------------------------------------------
Change/   Fwd-path  Prefix   Nexthop  Pathcount  Prefix
interval  change    add      change   change     refresh
-------------------------------------------------------------
0         987       987      1170     1170       1170
1         183       183      0        0          0
2         0         0        0        0          0
3         0         0        0        0          0
4         0         0        0        0          0
5         0         0        0        0          0
R4#show ip route profile
IP routing table change statistics:
Frequency of changes in a 5 second sampling interval
-------------------------------------------------------------
Change/   Fwd-path  Prefix   Nexthop  Pathcount  Prefix
interval  change    add      change   change     refresh
-------------------------------------------------------------
0         837       837      1021     1021       1021
1         184       184      0        0          0
2         0         0        0        0          0
3         0         0        0        0          0
4         0         0        0        0          0
5         0         0        0        0          0
Here we see something similar to R2: there have been quite a few intervals where one prefix changed.
Now we know something is going on but we still don’t know what. Enabling a debug will help:
R2, R3 & R4#debug ip routing
IP routing debugging is on
Now we can see in real-time what is happening to the routing table. Let’s take a look at R2:
R2#
RT: add 1.1.1.0/24 via 192.168.12.1, rip metric [120/1]
RT: NET-RED 1.1.1.0/24
Periodic IP routing statistics collection
RT: closer admin distance for 1.1.1.0, flushing 1 routes
RT: NET-RED 1.1.1.0/24
RT: add 1.1.1.0/24 via 192.168.23.3, ospf metric [110/20]
RT: NET-RED 1.1.1.0/24
Periodic IP routing statistics collection
RT: del 1.1.1.0/24 via 192.168.23.3, ospf metric [110/20]
RT: delete subnet route to 1.1.1.0/24
RT: NET-RED 1.1.1.0/24
RT: delete network route to 1.0.0.0
RT: NET-RED 1.0.0.0/8
Take a close look at the output above; this debug gives us a lot of valuable information. Let me describe what is happening on R2:
  • R2 learns prefix 1.1.1.0 /24 from R1 through RIP and adds it to its routing table.
  • R2 learns prefix 1.1.1.0 /24 from R3 through OSPF.
  • R2 removes the RIP entry for 1.1.1.0 /24 and installs the OSPF entry.
  • R2 deletes the OSPF entry for 1.1.1.0 /24 from the routing table.
This is very interesting. I'll describe in a minute why this is happening…let's first take a look at R3 and R4:
R3#
RT: add 1.1.1.0/24 via 192.168.34.4, eigrp metric [170/2636800]
RT: NET-RED 1.1.1.0/24
RT: delete route to 1.1.1.0 via 192.168.34.4, eigrp metric [170/2636800]
RT: no routes to 1.1.1.0
RT: NET-RED 1.1.1.0/24
RT: delete subnet route to 1.1.1.0/24
RT: NET-RED 1.1.1.0/24
RT: delete network route to 1.0.0.0
RT: NET-RED 1.0.0.0/8
R3 learns 1.1.1.0 /24 through EIGRP and then deletes this entry from its routing table…interesting, what about R4?
R4#
RT: add 1.1.1.0/24 via 192.168.24.2, eigrp metric [170/2611200]
RT: NET-RED 1.1.1.0/24
RT: delete route to 1.1.1.0 via 192.168.24.2, eigrp metric [170/2611200]
RT: no routes to 1.1.1.0
RT: NET-RED 1.1.1.0/24
RT: delete subnet route to 1.1.1.0/24
RT: NET-RED 1.1.1.0/24
RT: delete network route to 1.0.0.0
RT: NET-RED 1.0.0.0/8
R4 has the same issue: it installs the 1.1.1.0 /24 prefix that it learned from R2 and then deletes it from the routing table. So what exactly is going on here? Let me explain this story step-by-step with some images.

AD Based Redistribution Problem

Let’s describe the problem step-by-step. In the beginning, R2 redistributes network 1.1.1.0 /24 that it has learned from R1 through RIP into EIGRP:
[Diagram: R2 redistributes the RIP route 1.1.1.0 /24 into EIGRP and advertises it to R4]
R4 learns the prefix and is now able to advertise it to R3 through EIGRP. R3 will redistribute the prefix into OSPF:
[Diagram: R3 learns 1.1.1.0 /24 from R4 through EIGRP and redistributes it into OSPF]
R2 now has a decision to make:
[Diagram: R2 has to choose between the RIP route and the OSPF route based on AD]
R2 has two sources for 1.1.1.0 /24: the correct RIP route from R1 and the incorrect OSPF route from R3. It will select the route from R3 since the AD of OSPF (110) is lower than that of RIP (120).
As a result, R2 will remove the RIP entry from its routing table and now it is no longer able to redistribute 1.1.1.0 /24 from RIP into EIGRP:
[Diagram: R2 no longer redistributes 1.1.1.0 /24 from RIP into EIGRP]
Since R2 doesn’t have a RIP entry for 1.1.1.0 /24, there’s nothing to redistribute into EIGRP. R4 and R3 will remove their EIGRP entry for 1.1.1.0 /24 from their routing tables and R3 is now unable to redistribute 1.1.1.0 /24 into OSPF:
[Diagram: R3 no longer redistributes 1.1.1.0 /24 from EIGRP into OSPF]
R2 won’t learn about 1.1.1.0 /24 through OSPF anymore, and after a short while, R2 will install the RIP entry for 1.1.1.0 /24 in its routing table again, and the whole problem repeats itself. What I just described is an administrative distance based redistribution problem: R2 installs incorrect routing information in its routing table because the administrative distance is lower.
Before we start talking about solutions, let me get back to the issue with R1 that was unable to ping 192.168.23.2 and 192.168.23.3.
To understand why this doesn’t work, take a look at the routing table of R2:
R2#show ip route connected
C    192.168.12.0/24 is directly connected, FastEthernet1/0
C    192.168.24.0/24 is directly connected, FastEthernet0/1
C    192.168.23.0/24 is directly connected, FastEthernet0/0
On R2, network 192.168.23.0 /24 is directly connected and not advertised in RIP. It has been advertised in OSPF, and R3 redistributes this network into EIGRP so that R4 can learn about it.
R2, however, won’t install this route since it already has a directly connected entry for it. Because of this, R1 never learns about this network. You could fix this with a network command or with redistribute connected under the RIP process on R2; see the sketch below.
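Here is a minimal sketch of both options; the seed metric value is just an example, and you may want a passive-interface so R2 doesn't start sending RIP updates toward R3:
R2(config)#router rip
R2(config-router)#network 192.168.23.0
R2(config-router)#passive-interface FastEthernet0/0
!
! or, alternatively:
!
R2(config)#router rip
R2(config-router)#redistribute connected metric 1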
Back to our problem with network 1.1.1.0 /24…before we start looking at solutions, let’s think about the “core” issue of our problem.
In our particular scenario, R2 learned network 1.1.1.0 /24 from RIP, which has an AD of 120. Let’s call this the “internal route”.
After redistribution, R2 learns about 1.1.1.0 /24 through OSPF. Let’s call this the “external route”.
The problem is that R2 should never accept the external route; it should always prefer the internal route. This is a redistribution rule that you should follow:
Redistribution Rule: Always prefer your “internal” routes over “external” routes.
If R2 preferred its internal route from RIP over the external route from OSPF, then we wouldn’t have any problems. Are there any other scenarios where something like this could occur?
  • RIP doesn’t have a clue about “internal” and “external” routes; it treats them all the same, so it’s vulnerable to selecting the wrong route.
  • OSPF uses the same AD for internal and external routes, but it always gives preference to internal routes.
  • EIGRP uses a different AD for internal (90) and external (170) routes.
Whenever your internal route has a higher AD than the external route, you have to be careful! Some examples:
  • The internal route was learned through RIP (AD 120) and the external route through OSPF (AD 110).
  • The internal route was learned through EIGRP external (AD 170) and the external route through RIP (AD 120).
  • The internal route was learned through internal BGP (AD 200) and the external route through OSPF (AD 110).
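For reference, these are the default administrative distances on Cisco IOS:
  • Connected: 0
  • Static: 1
  • eBGP: 20
  • EIGRP internal: 90
  • OSPF: 110
  • RIP: 120
  • EIGRP external: 170
  • iBGP: 200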
Now let’s take a look at some solutions!

Solutions

Since this redistribution problem is related to the administrative distance, our options are limited. Playing with metrics won’t help since the AD is the decision maker here. To fix this problem, we have to change the AD.

Decrease the AD of the internal route(s)

To comply with our redistribution rule of “preferring the internal route” we have to change the AD on R2. Let’s lower the AD of the RIP route:
R2(config)#ip access-list standard R1_LOOPBACK
R2(config-std-nacl)#permit 1.1.1.0 0.0.0.255

R2(config)#router rip
R2(config-router)#distance 100 0.0.0.0 255.255.255.255 R1_LOOPBACK
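The distance command takes the new AD, a source match (neighbor IP address plus wildcard, where 0.0.0.0 255.255.255.255 matches any source), and an optional access list that selects which prefixes get the new AD. The same command, annotated:
R2(config-router)#distance 100 0.0.0.0 255.255.255.255 R1_LOOPBACK
! 100                     = new administrative distance
! 0.0.0.0 255.255.255.255 = routes learned from any RIP neighbor
! R1_LOOPBACK             = only prefixes permitted by this access list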
This lowers the AD of the RIP route for 1.1.1.0 /24 to 100. Now check the routing table:
R2#show ip route | include 1.1.1.0
R       1.1.1.0 [100/1] via 192.168.12.1, 00:00:07, FastEthernet1/0
R2 will now always prefer the internal route from R1. R3 and R4 are now also stable:
R3#show ip route | include 1.1.1.0
D EX    1.1.1.0 [170/2636800] via 192.168.34.4, 00:01:09, FastEthernet0/1
R4#show ip route | include 1.1.1.0
D EX    1.1.1.0 [170/2611200] via 192.168.24.2, 00:01:19, FastEthernet0/0
This is looking good! If you left the debug enabled, you will only see something like this:
R2, R3 & R4#
Periodic IP routing statistics collection
Periodic IP routing statistics collection
Periodic IP routing statistics collection
There are no changes in the routing tables anymore so this proves that the topology is stable. Problem solved!
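Once you are done troubleshooting, don't forget to disable the debug again:
R2, R3 & R4#undebug all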
After implementing your solution, it’s always a good idea to use your TCLSH script again to verify that you have full reachability (if required).
Are there any other methods? Let’s continue…

Increase the AD of the external route(s)

Instead of decreasing the AD of the internal RIP routes, we can also increase the AD of the OSPF external routes. The result will be the same.
Let’s get rid of that RIP AD:
R2(config)#router rip
R2(config-router)#no distance 100 0.0.0.0 255.255.255.255 R1_LOOPBACK
And now we make some changes to OSPF:
R2(config)#router ospf 1
R2(config-router)#distance ospf external 180
We mimic the behavior of EIGRP with this setting. OSPF will now use an AD of 180 for all external routes, which makes R2 prefer the internal RIP route once again…problem solved!
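As a quick check: the AD is the first number between the brackets in a routing table entry, so an OSPF external route would now show up as [180/20] instead of [110/20]. The output of show ip protocols also lists the configured distances for each routing process.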
Are there any other solutions that we can use? There’s one more…

Redistribution into OSPF

R2 was able to learn network 1.1.1.0 /24 through OSPF because it didn’t advertise this network into OSPF itself. If R2 redistributes the network into OSPF, it won’t install the external route from R3.
This solution will work, but if this is a CCIE R&S lab you have to be careful not to break any of the requirements. Let me show you this solution anyway.
First let’s get rid of the distance command:
R2(config)#router ospf 1
R2(config-router)#no distance ospf external 180
Now we redistribute the RIP routes into OSPF:
R2(config)#router ospf 1
R2(config-router)#redistribute rip subnets
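Without an explicit metric, OSPF uses a default seed metric of 20 (type E2) for redistributed routes, which matches the [110/20] entries we saw earlier. Purely as an illustration, you could control the seed metric and metric type like this:
R2(config-router)#redistribute rip subnets metric 50 metric-type 1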
R2 will now keep using the 1.1.1.0 /24 prefix from RIP. Also:
  • R3 will use the OSPF (AD 110) entry for 1.1.1.0 /24 instead of the EIGRP external (AD 170) entry from R4.
  • R4 will use the EIGRP external entry (AD 170) for 1.1.1.0 /24. R2 and R3 both redistribute 1.1.1.0 /24 into EIGRP; the route that R4 installs depends on the seed metric.
Problem solved!

Conclusion

Redistribution is a difficult topic but now you have seen the different problems that can occur. Before you configure redistribution, look at the topology and try to spot what possible issues you might encounter. This makes it a lot easier than just redistributing and solving problems later.
When you face these administrative distance based issues, remember the holy redistribution rule: “always prefer internal routes over external routes”. The best way to enforce this is by changing the administrative distance.
I hope this lesson has been helpful, if you have any questions feel free to leave a comment!
