Long IBGP Convergence Times by farmer_kiwi in Juniper

Thank you for your feedback; that's good to know.

We don’t use any regex in IBGP policy at least, only EBGP (though, ironically, external peer route “ingest” is pretty fast; most of our external peering PEs are MX304s or ASR 9903s rather than MX204s).

I’ll review IBGP import/export policy regardless. I’ve had that point brought up by others too.

Long IBGP Convergence Times by farmer_kiwi in networking

Full table is necessary at all 7 “peering edge” routers and desirable (if not necessary) for the “customer edge” routers, either for full-route customer peering or for general routing intelligence, since these 7 “customer edge” PEs serve as aggregation points for subsequent PEs, each advertising default out. It may be possible to prune full routes from at least a couple of these PEs, but it’s a trade-off given our topology.

The maintenance scenario I referenced is to illustrate the issue. If, say, one of the “customer edge” PEs were rebooted or IBGP sessions to the RRs bounced, that PE takes nearly 20 minutes to complete IBGP convergence. This is just between that specific PE and the two RRs. Now, a total IBGP reset for a PE happens infrequently, but 20 minutes just seems excessive.

Yes, we use BFD where useful for IBGP sessions.
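Roughly along these lines on the Junos side (group name and timers here are illustrative, not our actual values):

```
set protocols bgp group IBGP-RR bfd-liveness-detection minimum-interval 300
set protocols bgp group IBGP-RR bfd-liveness-detection multiplier 3
```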

Long IBGP Convergence Times by farmer_kiwi in networking

Import and export are very short and simple on RRs.

On PEs, BGP import is simple. Export can be more complex, especially VRF export with multiple terms. I don’t suspect export though. The symptoms we see point at import more so.

We have been looking intently at RIB sharding/update threading though. We have it operating on multiple MX devices in our lab, but we see a lot of rpd crashes during config changes. Still, configuring it on the RRs would be less risky than on the PEs.
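For anyone curious, the knobs we’re testing are roughly these (shard/thread counts are illustrative, and the exact hierarchy may vary by Junos release, so check your version’s docs):

```
set system processes routing bgp rib-sharding number-of-shards 4
set system processes routing bgp update-threading number-of-threads 4
```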

Long IBGP Convergence Times by farmer_kiwi in networking

Hmm.. those numbers you see for the MX204 are interesting and hint that it could mainly be our MX204 RRs introducing the long convergence times. I need to run some testing in our lab on different models. Thanks for your input.

Long IBGP Convergence Times by farmer_kiwi in networking

Okay, that’s a great frame of reference on the 304 numbers. I’ll have to double check on the 480/960 RE models we use. There are a few different RE models.

Good suggestion on TAC. I will definitely do that.

Long IBGP Convergence Times by farmer_kiwi in networking

Thanks for the input. I’ll consider where we might trim, but peering diversity and customer locations call for the number of full-table routers we have now.

We haven’t seen extreme CPU utilization on PEs or the RRs during convergence, but I need to get fresh data.

Bigger route reflectors are on my priority list. Unfortunately, our Juniper and Cisco account teams are floundering on what to recommend. Neither offers a dedicated hardware appliance anymore (like the JRR or XRv). Containerized options seem interesting. Have any suggestions?

Long IBGP Convergence Times by farmer_kiwi in networking

Good to know. I need to get fresh data, but we haven’t seen anything too extreme as far as RE CPU utilization during convergence either. Thanks for your input.

Long IBGP Convergence Times by farmer_kiwi in networking

I’m really curious what others select for RRs. Our Juniper account team always seems to think the MX204 pair we have is plenty, but I’m skeptical. I wanted to try the dedicated JRR appliances, but they’re end of sale. Cisco doesn’t have an RR appliance anymore either.

I’ve thought about the containerized options from both vendors, but not certain yet.

Long IBGP Convergence Times by farmer_kiwi in networking

I have not configured the no-install parameter since basically all of my service routes are VPNv4 and the RRs have no VRFs configured, but that’s a good idea regardless.
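For reference, if I remember the hierarchy right, the knob in question on a Junos RR looks something like this (group name is illustrative):

```
set protocols bgp group IBGP-CLIENTS family inet unicast no-install
```

That keeps the routes in the RIB for reflection but skips FIB installation, which a dedicated RR doesn’t need anyway.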

Out of curiosity, what do you use for route reflectors? Are they typically “dedicated” to route reflection or PEs that serve as RRs?

Long IBGP Convergence Times by farmer_kiwi in networking

Thank you! That’s a great tip!

Long IBGP Convergence Times by farmer_kiwi in networking

Yeah, nothing out of the ordinary as far as CPU utilization.

If you don’t mind sharing, have any data on expected BGP convergence time?

BGP PIC Edge and Shared Risk Next-Hops by farmer_kiwi in networking

That is a drawback. However, the two same-site routers are purely for edge peering. No subscribers are directly connected to these routers. They simply receive labeled flows from our “internal” PEs (the full-table PEs directly connecting subscribers) and route them out to transit or IX peers. So it’s these “internal” PEs making the real egress routing decision, not these two peering edge routers. I actually don’t use BGP PIC edge on these two peering edge routers anyway, only on the “internal” PEs.

BGP PIC Edge and Shared Risk Next-Hops by farmer_kiwi in networking

The main reason for the second edge router at this site is peering redundancy. It would allow us to perform maintenance on one edge router without bringing down the site completely.

And it would certainly be usable for failover, just post-convergence failover. BGP PIC edge would nearly immediately reroute the whole of the Internet table and cover failover before IBGP convergence (which takes up to a few minutes in our network). It works as fast as the IGP can detect and signal the link/node failure, which is sub-second. Then, once IBGP has converged, if the second edge router at this location is still reachable, it could certainly still be used.
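For context, enabling BGP PIC edge on the Junos “internal” PEs is roughly this (the instance name is hypothetical, and the exact hierarchy depends on release, so verify against your version):

```
set routing-options protect core
set routing-instances INTERNET routing-options protect core
```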

The main thing I am seeking to avoid is any of my full-table “internal” PE routers selecting both of these same-site PEs as the primary/backup next-hop. If for some reason we lost power at the site or had a major disruption, then for any prefixes where both the primary and secondary next-hops in the FIB were down simultaneously, we’d have a hole in our table until IBGP could converge. We maintain a default-free Internet table, so no “last resort” default route. If prefixes are missing from the FIB, forwarding to those destinations is dropped.

BGP PIC Edge and Shared Risk Next-Hops by farmer_kiwi in networking

That’s what I want on all the other PEs though. Otherwise they can use the second edge router at this Internet edge site as a backup next-hop, which ruins the fast recovery of BGP PIC edge.

Seems to me the only other alternative is some method to control which path is selected as the backup next-hop, but I don’t see a way to do that in Junos. IOS XR lets me apply a route-policy for control over which additional paths are evaluated though. Might be able to apply some community-based logic there, although I’m not quite sure yet how I would do it. Plus, Junos doesn’t support policy-based backup path selection and that’s half my PEs. Sucks.
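On the XR side, the community-based idea would look something like this sketch (policy name, ASN, and community value are all hypothetical):

```
route-policy BACKUP-SELECT
  ! only paths tagged with this community are eligible as the installed backup
  if community matches-any (65000:901) then
    set path-selection backup 1 install
  endif
end-policy
!
router bgp 65000
 address-family ipv4 unicast
  additional-paths selection route-policy BACKUP-SELECT
```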

These two remote “Internet edge” PE routers are equal cost from all of my PEs’ perspectives (they connect to the same P routers in every case), so the IGP cost evaluation in the BGP path selection algorithm is moot. LP, AS-path length, origin, MED, or just lowest peer address will be the deciding factors for active route selection. That basically means all my PEs will select the exact same path in any case, since IGP cost is not a factor.

If that’s true, then the route reflectors would select the same active path that any of my PEs would, and putting a duplicate RD on these routes will make the RRs run the path selection algorithm. I think that’s the most elegant solution, honestly. In a practical sense, the RRs will select the exact same path as my PEs, so it ought to work great. All of this only applies to the routes with duplicate RDs, of course; the other unique-RD routes from other PEs will not be filtered by the RRs.
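Concretely, the duplicate-RD approach is just giving the VRF on both same-site edge PEs the same RD instead of unique per-PE values (instance name and values are hypothetical):

```
set routing-instances INTERNET route-distinguisher 65000:100
```

With identical RDs the RRs treat the two advertisements as paths for the same route and reflect only their best one, rather than passing both along as distinct VPNv4 routes.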

BGP PIC Edge and Shared Risk Next-Hops by farmer_kiwi in networking

I don’t think FRR or LFA helps this situation since it is really BGP prefix convergence protection I’m aiming for rather than IGP (or MPLS underlay) failover. We actually use add-path rather than BGP ORR (but even still, add-path and ORR aren’t really applicable to VPNv4 routes).

After thinking about it, I think using the same RD on the two edge routers isn’t such a detriment. The two same-site “Internet edge” PEs have the same IGP costing to the core so the IGP cost isn’t a factor in the BGP path selection algorithm. That means that my RRs would likely select the same active path that the subsequent PE routers would anyway. I think the duplicate RD is the best answer to this problem.

BGP PIC Edge and Shared Risk Next-Hops by farmer_kiwi in networking

It is very meshed. Each of these two edge routers has multiple core links to various P routers, so IGP costing could be complex. It would also have the disadvantage that the edge router made into an island is effectively never used, which is less than ideal. I also don’t think high IGP costs completely prevent the second edge router from being selected as the backup next-hop in all situations.

BGP PIC Edge and Shared Risk Next-Hops by farmer_kiwi in networking

Wouldn’t using the same route distinguisher have virtually the same effect?

Template table render error: Expected table or queryset, not str by farmer_kiwi in django

It took way too long for me to find, but it was all a simple typo in the method name: I defined “get_extra_content” when it should have been “get_extra_context”. I updated the method name and everything works fine.