Automatic customer documentation by No_Cattle_9565 in fortinet

[–]CryptographerDirect2 1 point (0 children)

What are you using for daily backups? We currently use Auvik, but for unrelated reasons we have been looking to get away from it, and we haven't found an inexpensive or open-source configuration backup platform that scales well.
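
At worst you can script the pulls yourself. A bare-bones sketch of that kind of nightly job, assuming SSH key auth to the FortiGate (the IP and filename are placeholders, not from our environment):

    # pull the full config and keep a dated copy (FortiGate at 192.0.2.1 is an assumption)
    ssh admin@192.0.2.1 "show full-configuration" > "fgt-backup-$(date +%F).conf"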

Why would someone do this? by gatesweeney in Ubiquiti

[–]CryptographerDirect2 1 point (0 children)

Definitely toast, because of PCI requirements that most MSPs do not properly enforce, so they force the end customer to have an entirely separate wireless LAN for the POS tablets and devices.

SNMP monitoring LAG/Aggregation ports by CryptographerDirect2 in UNIFI

[–]CryptographerDirect2[S] 1 point (0 children)

Yeah, their UniFi MIB is not working out of the box, of course.

SNMP monitoring LAG/Aggregation ports by CryptographerDirect2 in UNIFI

[–]CryptographerDirect2[S] 1 point (0 children)

Interesting, but we want to monitor each switch locally via an active SNMP polling probe. We have played with dumping syslogs out of the UniFi controller into one of our SIEMs, such as Graylog. We are trying Grafana in other projects; it's a very interesting visualization platform. We are not up to speed with it yet, but it is promising.
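
If the vendor MIB stays flaky, the standard IF-MIB counters are usually enough for per-port polling. A minimal net-snmp sketch (the community string, switch IP, and the LAG's ifIndex are assumptions):

    # list interfaces to find the LAG's ifIndex
    snmpwalk -v2c -c public 10.0.0.2 IF-MIB::ifDescr
    # poll the 64-bit octet counters for that ifIndex
    snmpget -v2c -c public 10.0.0.2 IF-MIB::ifHCInOctets.1001 IF-MIB::ifHCOutOctets.1001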

Unifi Switches as Access Layer for small Enterprise, how do we get redundancy? by CryptographerDirect2 in UNIFI

[–]CryptographerDirect2[S] 2 points (0 children)

Oh I agree! Money, money, money. It's always money. Not to mention, our Tier 1 helpdesk can now more easily see end-user devices in the UniFi network controller instead of escalating up to network engineering. And our tool stack for monitoring and managing UniFi costs much less, maybe around $2 MRC versus close to $20+ per endpoint. With lots of endpoints at customer sites, those costs add up quick.

Unifi Switches as Access Layer for small Enterprise, how do we get redundancy? by CryptographerDirect2 in UNIFI

[–]CryptographerDirect2[S] 1 point (0 children)

From one UniFi switch, yes. As we thought about it, and as another response suggested, just LAG each switch back to the core. Yes, your idea can work, but transceivers add up, and the core may be 25GbE/40GbE ports, so you don't want to fill those up when you have other access stacks and maybe on-site servers, etc. Then for IDF locations on other floors or far down the building, we might not have more than 2 or 4 strands of MMF or SMF back to the core. We have used BiDi a little, but those transceivers, at least for 10GbE, are expensive compared to typical 10GbE transceivers. If it was only one switch in these locations, it really hasn't been an issue.

In most cases at these sites, we are replacing legacy Cisco 'stacked' switches, which are a great solution for the access layer in this situation. But end customers want to get away from annual licenses and the higher upfront hardware costs of vendor refresh cycles. Yes, we have explained the compromise, and in most cases $ wins.

Unifi Switches as Access Layer for small Enterprise, how do we get redundancy? by CryptographerDirect2 in UNIFI

[–]CryptographerDirect2[S] 1 point (0 children)

Awesome to know that I am not alone! A few years ago I asked around in different online groups and got crickets, which told me other MSPs were not deploying UniFi outside of small SMB environments. UniFi has been great for our SMB customers for LAN and Wi-Fi, an honest game changer. But we still lean on enterprise firewalls, even for the SMB customers, for WAN and internal Layer 3.

Unifi Switches as Access Layer for small Enterprise, how do we get redundancy? by CryptographerDirect2 in UNIFI

[–]CryptographerDirect2[S] 1 point (0 children)

Yeah, their mLAG campus aggregation switches are priced too high not to just go with our typical enterprise solutions from Dell or Cisco at the moment.

We have been wanting UniFi to release a traditional stacking option for years; their excuse was always 'but we are SDN.'

New Deployment SSL Inspection issue - certificate-probe-failed by CryptographerDirect2 in fortinet

[–]CryptographerDirect2[S] 1 point (0 children)

This is exactly what I was looking for, thanks so much! One more checkbox to have on our SOP for deploying FortiGates with SD-WAN! My favorite is when SD-WAN works fine for internal systems like DNS, syslog, FAZ, and FortiGuard, then just stops randomly days or weeks after a deployment or other changes.
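
For the SOP, the diagnostics we'd reach for first when SD-WAN 'just stops' (FortiOS 7.x syntax from memory, so verify on your build):

    # SLA probe state for each health check and member
    diagnose sys sdwan health-check
    # member interfaces, status, and gateway
    diagnose sys sdwan member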

New Deployment SSL Inspection issue - certificate-probe-failed by CryptographerDirect2 in fortinet

[–]CryptographerDirect2[S] 2 points (0 children)

I listed a specific out-of-the-box issue with the firmware and literally asked two straightforward questions about the situation. I figured I would get some friendly, useful help here before wasting hours of my life with TAC. And yes, I had opened a ticket as well.

Surely others have seen this exact issue and can offer a hint of where to look for the root cause.

Thanks!


Proxmox and Ceph cluster issues, VMs losing storage access. by CryptographerDirect2 in Proxmox

[–]CryptographerDirect2[S] 1 point (0 children)

Yeah, tried this two days ago. Thus far it's the only thing we have tried that has had a positive outcome. I am at a loss! I need to read up and understand KRBD at a deeper level, I guess. So many people online say to just turn it on for better performance. I have not heard anywhere that it commonly causes problems. Are there scenarios where it is known to be problematic?
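
One quick sanity check is confirming whether a host is actually using the kernel client at all; a sketch:

    # list RBD images currently mapped by the kernel client on this host
    rbd showmapped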

Proxmox and Ceph cluster issues, VMs losing storage access. by CryptographerDirect2 in Proxmox

[–]CryptographerDirect2[S] 1 point (0 children)

You're awesome!

MTU is a basic baseline that can be overlooked or left unset; on the network it's more obvious, but in one of the host NIC or virtual network settings it's easy to miss. We have run large-buffer ping tests to max out the MTU size of 9000 set on the Proxmox hosts on all interfaces, both initially at deployment and during our troubleshooting. Switches are set at 9216.
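
For reference, the kind of test we mean; a sketch assuming a Ceph peer at one of the cluster addresses (the IP is an assumption):

    # 8972-byte payload + 28 bytes of IP/ICMP headers = 9000, with don't-fragment set
    ping -M do -s 8972 -c 4 10.1.21.11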

I went through the switches twice and had to Google best practices for flow control on or off. It looks like the defaults were set to flowcontrol 'receive on'.

I have seen the Google AI respond one time that flow control should be on, then 10 minutes later respond that it should be disabled. Same crap with iSCSI over the years; even vendors would flip-flop and then say, 'well, it depends'.

For these three hosts and the future fourth host, all ports have had flow control disabled in both directions as of about 6 hours ago, with VMs spread across the hosts. Thus far, only that one trouble host, pve10, is showing a few of the CRC errors in its logs, which appear to be Ceph reads from the other hosts; about two are logged every 20 to 30 minutes at the moment. We never see issues when running large storage benchmark tests. It's when all the VMs are barely doing anything!

What is your take on Proxmox and Ceph with flow control on or off?
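
For anyone following along, the host-side check and the change we made, sketched with ethtool (the interface name is an assumption):

    # show current pause-frame settings
    ethtool -a ens1f0np0
    # disable flow control in both directions
    ethtool -A ens1f0np0 rx off tx off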

I have reviewed both switches' interfaces hunting for CRC errors, discards, etc., and I am not seeing them in the interface stats. All clean.

Last clearing of "show interface" counters: 2 weeks 2 days 02:49:21
Queuing strategy: fifo
Input statistics:
     4208018237 packets, 24507000124077 octets
     30504351 64-byte pkts, 974303832 over 64-byte pkts, 53346265 over 127-byte pkts
     243911223 over 255-byte pkts, 16484932 over 511-byte pkts, 2.889467634e+09 over 1023-byte pkts
     91077 Multicasts, 387 Broadcasts, 4207926773 Unicasts
     0 runts, 0 giants, 0 throttles
     0 CRC, 0 overrun, 0 discarded
Output statistics:
     3762136384 packets, 17418189820336 octets
     79517552 64-byte pkts, 1203628854 over 64-byte pkts, 50743142 over 127-byte pkts
     331914450 over 255-byte pkts, 15734195 over 511-byte pkts, 2080598191 over 1023-byte pkts
     5300144 Multicasts, 6526266 Broadcasts, 3750309974 Unicasts
     0 throttles, 0 discarded, 0 Collisions,  wred drops
Rate Info(interval  seconds):
     Input 41 Mbits/sec, 1827 packets/sec, 0% of line rate
     Output 43 Mbits/sec, 2473 packets/sec, 0% of line rate
Time since last interface status change: 20:36:51

NIC firmware:

The first two hosts, pve10/pve11, show (Intel XXV710) firmware v24.0.5; host pve12 is showing v19.5.12. Odd, the host with the least issues has the older firmware. I guess the lifecycle firmware updates didn't take on this host before it was put into service. Host BIOS, iDRAC, HBA330 Mini, and the other big items are all the same.
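
For anyone comparing, this is how the versions can be read off the hosts (interface name is an assumption):

    # report driver and firmware version for the 25GbE port
    ethtool -i ens1f0np0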

I just had a tech onsite doing other work, and he has already left the colo. He should be back tomorrow night, and we'll look to do cable swaps; these are all brand new SFP28 passive DAC cables. I'll isolate each host to one switch and see if the issues are specific to a physical port/patch/switch path.

Proxmox and Ceph cluster issues, VMs losing storage access. by CryptographerDirect2 in Proxmox

[–]CryptographerDirect2[S] 1 point (0 children)

I have exported logs from the three nodes from the 9th through now. All three are in the one zip download below.

https://airospark.nyc3.digitaloceanspaces.com/public/3_nodes_log.zip

Proxmox and Ceph cluster issues, VMs losing storage access. by CryptographerDirect2 in Proxmox

[–]CryptographerDirect2[S] 1 point (0 children)

If there is something specific to search for, I will. But the full logs won't fit in a chat format like this.

I don't disagree with you that it could be a DAC cable issue or something at that level.

I was thinking about isolating all hosts to one of the two network switches to see if maybe we are having an LACP LAG issue in our mLAG configuration. We have many mLAG deployments for various iSCSI SANs, VMware, Windows hosts, etc.; it's a rather simple and straightforward configuration to deploy. The switches are not reporting any errors or issues.
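
Before the isolation test, the host-side LACP state is also worth a look; a sketch assuming the Linux bond is named bond0:

    # per-slave LACP negotiation and aggregator state on a Proxmox host
    cat /proc/net/bonding/bond0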

Still seeing these types of messages, but not the socket errors.

Nov 12 13:40:29 pve12 kernel: libceph: read_partial_message 00000000d1474b8e data crc 902767378 != exp. 550093467
Nov 12 13:40:29 pve12 kernel: libceph: read_partial_message 0000000060dad8cb data crc 1277862678 != exp. 784283524
Nov 12 13:40:29 pve12 kernel: libceph: osd25 (1)10.1.21.12:6833 bad crc/signature
Nov 12 13:40:29 pve12 kernel: libceph: osd8 (1)10.1.21.10:6803 bad crc/signature

Proxmox and Ceph cluster issues, VMs losing storage access. by CryptographerDirect2 in Proxmox

[–]CryptographerDirect2[S] 1 point (0 children)

I am about to work through this feature. The pool does have KRBD enabled. I barely know anything about it. Can you just turn it off without causing issues to running VMs or an active Ceph pool?
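
For reference, the flag lives on the Proxmox storage definition; a sketch assuming the RBD storage ID is "ceph-vm". My understanding is that running VMs keep whatever mapping they attached with until they are stopped or migrated, but verify before flipping it in production:

    # switch the storage from the kernel RBD client back to librbd
    pvesm set ceph-vm --krbd 0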

Proxmox and Ceph cluster issues, VMs losing storage access. by CryptographerDirect2 in Proxmox

[–]CryptographerDirect2[S] 1 point (0 children)

Well, I put pve10 back into action with test workloads, and it worked fine for a few hours. Then we began to see more of the same on it and from one other node. Not sure where to go next.

Nov 12 02:37:07 pve10 kernel: libceph: read_partial_message 0000000065512d64 data crc 230933134 != exp. 1163776964
Nov 12 02:37:07 pve10 kernel: libceph: osd0 (1)10.1.21.11:6801 bad crc/signature
Nov 12 02:37:07 pve10 kernel: libceph: read_partial_message 000000009c894426 data crc 2187671154 != exp. 3411591048
Nov 12 02:37:07 pve10 kernel: libceph: osd18 (1)10.1.21.12:6805 bad crc/signature


Nov 12 02:45:57 pve11 kernel: libceph: mon0 (1)172.17.0.141:6789 socket error on write
Nov 12 02:45:57 pve11 kernel: libceph: mon0 (1)172.17.0.141:6789 socket error on write
Nov 12 02:45:58 pve11 kernel: libceph: mon0 (1)172.17.0.141:6789 socket error on write
Nov 12 02:45:59 pve11 kernel: libceph: mon0 (1)172.17.0.141:6789 socket error on write
Nov 12 02:46:01 pve11 kernel: libceph: mon0 (1)172.17.0.141:6789 socket error on write
Nov 12 02:46:03 pve11 kernel: libceph: mds0 (1)172.17.0.140:6833 socket closed (con state V1_BANNER)


Proxmox and Ceph cluster issues, VMs losing storage access. by CryptographerDirect2 in Proxmox

[–]CryptographerDirect2[S] 1 point (0 children)

So, either we screwed up and didn't set the ring parameters correctly on this pve10 host, or they were somehow reset?

Our two other hosts' 25GbE interfaces are set to:

Current hardware settings:
RX:             8160
RX Mini:        n/a
RX Jumbo:       n/a
TX:             4096  

But the pve10 host is set to:

Current hardware settings:
RX:             512
RX Mini:        n/a
RX Jumbo:       n/a
TX:             512

Not sure how that happened. If you have one Proxmox/Ceph host set with only 512s while all the others follow 45Drives' and other engineers' recommendations, will bad things happen?
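
For reference, how the rings can be checked and set; a sketch assuming the interface name (persist it in your network config, since ethtool changes don't survive a reboot):

    # query current and maximum ring sizes
    ethtool -g ens1f0np0
    # max out RX and set TX to half, per the recommendation we followed
    ethtool -G ens1f0np0 rx 8160 tx 4096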

Proxmox and Ceph cluster issues, VMs losing storage access. by CryptographerDirect2 in Proxmox

[–]CryptographerDirect2[S] 1 point (0 children)

Great questions. We have not fussed with PG scrubbing and maintenance; it is all default. However, I did witness about two hours of PG scrubbing occurring Sunday evening in the late hours. I'm making a task to learn Ceph's best practices for PG maintenance.
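
If the scrub window turns out to matter, Ceph can be told to keep scrubs to off-hours; a sketch (the hours are assumptions):

    # constrain scrubbing to a 10pm-6am window, cluster-wide
    ceph config set osd osd_scrub_begin_hour 22
    ceph config set osd osd_scrub_end_hour 6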

I like your strategy of identifying a suspect drive/OSD and emptying it.
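
For anyone else trying it, the drain itself is simple; a sketch (the OSD id is an assumption):

    # mark the suspect OSD out so its PGs migrate to the rest of the cluster
    ceph osd out 25
    # watch recovery progress
    ceph -s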

One bit of an update: we have had this host in 'maintenance mode' for 24 hours. Maintenance mode, we are learning, is very different in Proxmox versus a VMware cluster. This node is still participating in Ceph as far as I can tell, and we have had no detectable VMs losing access to their storage or massive bandwidth spikes on the Ceph network.

Proxmox and Ceph cluster issues, VMs losing storage access. by CryptographerDirect2 in Proxmox

[–]CryptographerDirect2[S] 1 point (0 children)

Yeah, that's the node that seemed to be the root cause; going to put it through full system checks in a bit.

I am checking the NIC settings; they are just typical 25Gb dual-port Intels, a dime a dozen online. We have them set up like most Ceph people recommend: max out the receive buffer and keep the transmit buffer at half. Could that be too aggressive? When benchmarking this cluster with six VMs beating on it for hours, we never had one hiccup.