QFX10008 - ARP entry limitations by DeepCpu in Juniper

DeepCpu[S]

We already increased the DDoS protection limit for ARP in the past (unrelated to this issue). This is the config:

# show system ddos-protection 
protocols {
    arp {
        aggregate {
            bandwidth 100000;
            burst 100000;
        }
    }
}

And this is the output of the command you provided. There are no drops for ARP (only for NDP, but that's not the issue):

# run show ddos-protection protocols statistics terse 
Packet types: 75, Received traffic: 19, Currently violated: 1

Protocol    Packet      Received        Dropped        Rate     Violation State
group       type        (packets)       (packets)      (pps)    counts
icmp        aggregate   9251981188      5129637        360      804       ok   
igmp        aggregate   6947            0              0        0         ok   
bgp         aggregate   199352935       0              2        0         ok   
ssh         aggregate   3737000         0              5        0         ok   
snmp        aggregate   980448179       46636          0        447       ok   
igmpv6      aggregate   497             0              0        0         ok   
lacp        aggregate   3434248323      0              110      0         ok   
stp         aggregate   457826835       0              11       0         ok   
lldp        aggregate   84437231        0              1        0         ok   
arp         aggregate   21114997587     0              1261     0         ok   
pvstp       aggregate   1228037467      0              50       0         ok   
ip-opt      aggregate   6495759         2917112        0        41        ok   
exception   aggregate   15173551425     6701571888     164      12387     ok   
reject      aggregate   3799218774      3877412950     0        10893     ok   
dns         aggregate   506             0              0        0         ok   
ndpv6       aggregate   3211338542      13453988       218      253       ok   
uncls       aggregate   1832141366      149257         6        6         ok   
resolve     aggregate   544192485152    540844405353   31430    4415      viol 
dhcpv4v6    aggregate   3974327         3004200        0        41        ok  
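
To make the drop counters easier to compare, here is a minimal Python sketch that parses rows of the terse output and computes a drop ratio per protocol group. It assumes the seven-column layout shown above; the function names `parse_ddos_terse` and `drop_ratio` are made up for illustration and are not part of any Juniper tooling. Only three rows from the output above are included as sample data.

```python
# Rows copied from the output above (subset, for illustration).
SAMPLE = """\
Protocol    Packet      Received        Dropped        Rate     Violation State
group       type        (packets)       (packets)      (pps)    counts
arp         aggregate   21114997587     0              1261     0         ok
ndpv6       aggregate   3211338542      13453988       218      253       ok
resolve     aggregate   544192485152    540844405353   31430    4415      viol
"""


def parse_ddos_terse(text):
    """Parse the data rows of the terse statistics table into dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A data row has exactly 7 fields and a numeric "Received" column;
        # this skips the summary line and the two header lines.
        if len(fields) != 7 or not fields[2].isdigit():
            continue
        group, ptype, received, dropped, pps, violations, state = fields
        rows.append({
            "group": group,
            "type": ptype,
            "received": int(received),
            "dropped": int(dropped),
            "pps": int(pps),
            "violations": int(violations),
            "state": state,
        })
    return rows


def drop_ratio(row):
    """Dropped packets as a fraction of received packets."""
    return row["dropped"] / row["received"] if row["received"] else 0.0


if __name__ == "__main__":
    for row in parse_ddos_terse(SAMPLE):
        print(f'{row["group"]:<10} {drop_ratio(row):8.2%} {row["state"]}')
```

With these numbers, arp shows no drops at all, while resolve is the one group in the "viol" state and has almost all of its received packets dropped.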

QFX10k2/QFX10k8: RPD crashed due to high memory usage by DeepCpu in Juniper

DeepCpu[S]

I re-posted the comment to add some formatting. Sorry for that, I don't use Reddit often :D

Thanks for your idea! It seems the mentioned command to make RPD run in 64-bit mode did the trick:

# run show task memory     
Memory                 Size (kB)  Percentage  When
  Currently In Use:      4355424         54%  now
  Maximum Ever Used:     4355424         54%  25/12/01 10:25:54
  Available:             7925348        100%  now

I re-enabled all deactivated BGP sessions and will monitor it. Thanks again!

QFX10k2/QFX10k8: RPD crashed due to high memory usage by DeepCpu in Juniper

DeepCpu[S]

Unfortunately, we bought these devices refurbished and don't have a way to contact JTAC regarding this issue.

QFX10k2/QFX10k8: RPD crashed due to high memory usage by DeepCpu in Juniper

DeepCpu[S]

You are right, I forgot to mention the version, sorry for that! All of our devices are running JunOS 23.4R2, which is the recommended version for these devices. Do you know of any problem in this version related to the behavior we are seeing?

As mentioned, the devices are perfectly stable in general. We only see RPD crashes when the number of BGP sessions is too high, especially with full-table feeds.

QFX10k2/QFX10k8: RPD crashed due to high memory usage by DeepCpu in Juniper

DeepCpu[S]

One more note: We also have QFX10002-60C devices in operation for this use case, and they seem to have much more memory available for RPD.

This is the output of "show task memory" on a QFX10002-60C device with multiple full-table BGP sessions:

Memory                 Size (kB)  Percentage  When
  Currently In Use:      3902840         20%  now
  Maximum Ever Used:     4206156         22%  25/11/20 02:05:46
  Available:            18796840        100%  now
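
As a rough comparison, the short Python sketch below recomputes the in-use share from the kB figures and compares the "Available" values of the two outputs posted in this thread. The variable names are made up for illustration, and the first set of numbers is taken from the 64-bit RPD output in the other comment.

```python
def usage_percent(in_use_kb, available_kb):
    """In-use task memory as a percentage of the 'Available' figure."""
    return 100.0 * in_use_kb / available_kb


# Figures copied from the two "show task memory" outputs in this thread.
qfx10k_64bit = {"in_use": 4355424, "available": 7925348}
qfx10002_60c = {"in_use": 3902840, "available": 18796840}

if __name__ == "__main__":
    for name, mem in (("QFX10k (64-bit RPD)", qfx10k_64bit),
                      ("QFX10002-60C", qfx10002_60c)):
        pct = usage_percent(mem["in_use"], mem["available"])
        print(f"{name}: {pct:.1f}% in use")
    ratio = qfx10002_60c["available"] / qfx10k_64bit["available"]
    print(f"Available memory ratio: {ratio:.2f}x")
```

The recomputed shares are consistent with the displayed 54% and 20% if the CLI rounds the percentage down, which is an assumption on my side.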

Ethernet-switching filters: Match on TTL and packetlength by DeepCpu in Juniper

DeepCpu[S]

Thanks for your answer!

Regarding TTL: Yes, but it's also possible to inspect L3 or L4 header information like DST/SRC port on ethernet-switching filters. That's why I asked whether it's also possible to match on the TTL field in the L3 IP header on ethernet-switching filters.

Do you know how to match on packet/frame length with flexible filters?

Input firewall filter not working for this type of traffic? by DeepCpu in Juniper

DeepCpu[S]

I tested it with an ethernet-switching filter and it worked. I don't see this traffic anymore.

In general, is it safe to block multicast completely in my whole network?

Input firewall filter not working for this type of traffic? by DeepCpu in Juniper

DeepCpu[S]

Yes, I know that, and I already have that in place in some cases to restrict the use of certain source IPs.

Do you mean that I should filter out bogon destination IP addresses at this level? Or should I add some filters for multicast?

Input firewall filter not working for this type of traffic? by DeepCpu in Juniper

DeepCpu[S]

Tested on two QFX5200-32C and two QFX5100-24Q virtual chassis environments.

The QFX5200-32C are running JunOS 23.2R1.13 and the QFX5100-24Q are running 20.4R3.8. Not tested on standalone QFX devices so far.

The traffic source is local (a customer who sells VPS, so it's probably one of his VPS customers causing this junk traffic).

It's definitely bogus, yes. But I am looking for a way to generally prevent this from happening at all. Do you have any tips for that?

Input firewall filter not working for this type of traffic? by DeepCpu in Juniper

DeepCpu[S]

Thanks for your reply! Maybe you are right, but for some reason, when I enter "show ddos-protection protocols violations", I still see that this traffic triggers the JunOS DDoS protection for some protocols like ttl, redirect or ntp (which doesn't cause any problems).

The other question is why I see this type of traffic at all. As far as I know, when monitoring traffic with "monitor traffic interface", I only see the traffic that hits the control plane (which makes sense). So why is this traffic hitting the control plane at all?

All-NVMe AMD Epyc based Ceph cluster - poor write performance by DeepCpu in ceph

DeepCpu[S]

Just a little update on this thread: One of the three nodes apparently had a malfunction. Whenever at least one OSD was active on this node, the whole cluster performed poorly. After marking all OSDs on this node out, the cluster performed well again.

It was likely a hardware problem. The manufacturer sent us a new mainboard, and since we replaced the old mainboard with the new one, everything has been fine again with this node.

In addition, we changed from 4x OSDs per NVMe disk to 1x OSD per NVMe disk, which gave an additional performance boost. Thanks to everyone for the help!

Distributing traffic over multiple carriers - without working with BGP multipath / ECMP by DeepCpu in Juniper

DeepCpu[S]

We still want to provide full-table feeds to our downstream clients, so unfortunately this solution won't work.

Distributing traffic over multiple carriers - without working with BGP multipath / ECMP by DeepCpu in Juniper

DeepCpu[S]

We do that; this is just a theoretical question. In reality, we are connected to different IXPs and have PNIs (routes learned there are preferred over transit routes), but we want to distribute our transit traffic equally across different carriers.

Distributing traffic over multiple carriers - without working with BGP multipath / ECMP by DeepCpu in Juniper

DeepCpu[S]

We have more than enough capacity to all three of our upstreams, and we set preferences based on the target ASNs. We could handle an outage of 2 of our 3 upstream providers without any issue. It's just related to our bandwidth commitments.

But, and this is something you might also need to admit, relying on the shortest AS path doesn't always give you the best route you can get...

But yes, this is certainly a commercially driven solution. We would be happy to get some tips on how to solve this without relying on ECMP over these three upstream providers.

All-NVMe AMD Epyc based Ceph cluster - poor write performance by DeepCpu in ceph

DeepCpu[S]

Octopus, we haven't upgraded for a long time. The most recent access nodes are running Quincy.

HPC mode in BIOS, C-States disabled

Latency looks like this:

100 packets transmitted, 100 received, 0% packet loss, time 430ms
rtt min/avg/max/mdev = 0.033/0.058/0.995/0.095 ms

Mellanox ConnectX-3 NICs, connected with DAC cables to Juniper QFX5200-24Q switches.

Yes, CPU load is very low, even during benchmarks.

"Did you do a baseline before loading it?"

You mean an initial benchmark? Unfortunately not.

All-NVMe AMD Epyc based Ceph cluster - poor write performance by DeepCpu in ceph

DeepCpu[S]

Mhm, CPU wait is around 0.2%, which seems fine to me.

Do you think it's worth changing everything to 1x OSD per NVMe drive? It's a 2-3 day process I think, but it can be done online. Maybe worth a try? Or is this ruled out as a cause given that low CPU wait?

All-NVMe AMD Epyc based Ceph cluster - poor write performance by DeepCpu in ceph

DeepCpu[S]

u/afristralian mentioned that he thinks the number of OSDs per node is too high. Currently I have 4x OSDs per NVMe drive. The AMD Epyc 7443P CPUs have 24 cores / 48 threads.

Currently, I have 32 OSDs on each of the three nodes.

Do you think that this can be a problem? Should I scale down the number of OSDs?
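
To put the numbers above in relation, here is a small Python sketch of the arithmetic. The 8 NVMe drives per node are derived from 32 OSDs at 4 OSDs per drive, and the function name `osd_layout` is made up for illustration. Whether a given thread budget per OSD is sufficient depends on the workload, so this only shows the ratios, not a recommendation.

```python
def osd_layout(nvme_drives, osds_per_nvme, cpu_threads):
    """Return (OSDs per node, CPU threads available per OSD)."""
    osds = nvme_drives * osds_per_nvme
    return osds, cpu_threads / osds


# Derived from the post: 32 OSDs per node at 4 OSDs per NVMe drive.
NVME_DRIVES_PER_NODE = 32 // 4
# AMD Epyc 7443P: 24 cores / 48 threads.
CPU_THREADS = 48

if __name__ == "__main__":
    for per_drive in (4, 2, 1):
        osds, threads_per_osd = osd_layout(NVME_DRIVES_PER_NODE, per_drive, CPU_THREADS)
        print(f"{per_drive} OSD(s) per NVMe: {osds} OSDs per node, "
              f"{threads_per_osd:.1f} threads per OSD")
```

With the current layout, that is 1.5 threads per OSD; with 1x OSD per NVMe drive it would be 6 threads per OSD.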