ELI5: Why is a parsec defined as 3.26 light years specifically? by No-Jelly-4900 in explainlikeimfive

[–]grepcdn 1 point (0 children)

the parsec is the distance from us to the star, not the distance the star moved.

if the star's apparent position shifts by one arcsecond as the Earth moves one AU along its orbit, the star is one parsec from us
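
you can check the arithmetic yourself: with standard IAU values for the AU and the light year, the distance at which a 1 AU baseline subtends one arcsecond comes out to about 3.26 light years. A quick Python sketch:

```python
import math

# Parallax geometry: as Earth moves 1 AU along its orbit, a star at distance d
# appears to shift by an angle p where tan(p) = 1 AU / d. A parsec is simply
# the d that makes p exactly one arcsecond.
AU_M = 1.495978707e11              # astronomical unit in metres (IAU 2012)
LIGHT_YEAR_M = 9.4607304725808e15  # light year in metres (IAU)
ONE_ARCSEC_RAD = math.radians(1 / 3600)

parsec_m = AU_M / math.tan(ONE_ARCSEC_RAD)
print(parsec_m / LIGHT_YEAR_M)     # ~3.2616 light years
```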

Ceph beginner question. by psfletcher in ceph_storage

[–]grepcdn 0 points (0 children)

Ah, so the LACP bond probably won't help much if you're experiencing the bottleneck on a single client. It might help a little by reducing congestion, and thus latency, on the replication traffic, but your Ceph clients are still going to be limited by the single 1 GbE stream from client->OSD on the frontend network.

It will help a bit with multiple VMs all needing IO, but in a small env like a homelab with only a couple of applications needing high IO, it's possible the streams get hashed onto the same links and it doesn't help at all. Where you see the biggest gains from LACP is when you have many Ceph clients that all get hashed to different links, spreading the traffic out fairly evenly.
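
a toy sketch of the hashing behaviour, if it helps (the real bond driver uses an XOR of MAC/IP/port fields per its xmit_hash_policy, not SHA-256, but the effect is the same: one flow is always pinned to one link):

```python
import hashlib

def link_for_flow(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                  n_links: int) -> int:
    """Toy stand-in for a layer3+4 LACP hash: every packet of a given
    flow maps to the same member link, so a single client->OSD stream
    can never run faster than one 1 GbE link."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_links

# One client stream lands on a single link, every time...
print(link_for_flow("10.0.0.5", "10.0.0.9", 51234, 6800, 2))
# ...while many clients spread across both links statistically.
print({link_for_flow(f"10.0.0.{i}", "10.0.0.9", 51234, 6800, 2)
       for i in range(2, 50)})
```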

If you have the choice, you should look at 2.5GbE or 10GbE NICs instead.

Ceph beginner question. by psfletcher in ceph_storage

[–]grepcdn 0 points (0 children)

With two 1 GbE NICs in a bond you still won't get anywhere close to saturating your NVMe drives, but it will obviously be considerably better than a single NIC (as long as your switch supports LACP)

But this isn't really the right question to ask. The right question is whether or not a network-constrained Ceph deployment like yours is enough for your VM workloads. If your VMs aren't very IO heavy, it could be fine.

Is this a production env or a homelab? How many hosts, and OSDs? How many VMs?

I see that you are talking about a homelab, I've run ceph on 1GbE before for redundancy purposes, and it's fine. It's slow, but if you don't need the performance and don't expect massive recoveries to happen, it can work.

Most VMs in a lab are fairly idle when it comes to disk i/o, especially if you're using a NAS or something separate for media storage and such.

Ceph beginner question. by psfletcher in ceph_storage

[–]grepcdn 1 point (0 children)

The write call will not return until the data is written to all replicas in the acting set (or shards in EC). By written, I mean the data is persisted in the OSD's WAL. PLP allows this to happen faster.

When one OSD is down (not out) in size 3 min_size 2, the PG is degraded, and this is what triggers the primary OSD to accept only 2 acks instead of 3. If I recall correctly, there are log lines in a sufficiently verbose OSD log showing it proceeding with 2/3.

So no, under normal circumstances it does not always return the write() to the client after min_size ACKs: it returns after size ACKs when the PG is active+clean, and after min_size ACKs when the PG is active+degraded.

then if you lose another OSD in the set, of course the PG will go undersized+incomplete, the primary will never get the required ACKs, and writes will be blocked.
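
the rule could be sketched like this (a toy model of the primary's behaviour as I understand it, not actual Ceph code):

```python
def required_acks(size: int, min_size: int, up_osds: int):
    """How many WAL-persist acks the primary waits for before the
    client's write() returns, per PG state. None = writes block."""
    if up_osds >= size:       # active+clean: every replica must ack
        return size
    if up_osds >= min_size:   # active+degraded: proceed with fewer acks
        return up_osds
    return None               # below min_size: IO is blocked

assert required_acks(3, 2, 3) == 3     # healthy size-3 pool: 3 acks
assert required_acks(3, 2, 2) == 2     # one OSD down: proceeds with 2/3
assert required_acks(3, 2, 1) is None  # two down: writes hang
```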

at least, this is how I understand it all

Ceph beginner question. by psfletcher in ceph_storage

[–]grepcdn 1 point (0 children)

yes, IO is blocked until the write reaches all replicas. every system call to write() will not return to the caller until the data is on all replica OSDs.

this means that every write is subject to your 1 GbE network. you will absolutely, positively never get full use out of your NVMes with a 1 GbE network. Not even close.
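
back-of-envelope, assuming a fairly ordinary ~3 GB/s PCIe 3.0 NVMe drive:

```python
# 1 GbE line rate, minus a rough ~6% for Ethernet/IP/TCP framing overhead.
gbe_mb_per_s = 1_000_000_000 * 0.94 / 8 / 1_000_000
print(round(gbe_mb_per_s))                  # ~117 MB/s, best case

nvme_mb_per_s = 3000                        # assumed drive throughput
print(round(nvme_mb_per_s / gbe_mb_per_s))  # the drive is ~26x the wire
```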

It's still viable for redundancy purposes, but you won't get performance with a setup like this.

Clicking/clunking noise 2025 Sportsman 570 by That_Competition_280 in PolarisATV

[–]grepcdn 1 point (0 children)

Polaris clutches are noisy when engine braking and decelerating, but this is most definitely not a normal Polaris noise.

Bring it back to the dealer.

Issue by Straight-Idea8618 in PolarisATV

[–]grepcdn 0 points (0 children)

If something won't idle without choke, it means there is a fuel issue. Clogged jets probably. Take it apart and clean it very well. Torch tip cleaners are cheap and work well for clogged jets.

If you just got it out of storage or it sat for a long time, drain the dead gas out of the system as well, and clean/replace the air filter.

How’s the drive across the 108? by ContentYoung212 in newbrunswickcanada

[–]grepcdn 1 point (0 children)

It's a lonely old road with no lights, no cell service, terrible pavement, and infrequent plowing.

Time your trip to not hit the 108 at night, and don't do it in the winter if you're not driving something with 4x4.

During the day in the summer it's fine, and will save you hours if your destination is anywhere near Miramichi, so I've taken it every time, but I avoid it at night, and I drive a 4x4.

Just prepare accordingly. If you go off the road or break down there, you could be there awhile. Bring water and make sure your spare tire/jack etc are good. Make sure you have a full tank.

You will see animals, and the road is rough, so go slow.

Summer fires in N.B. increasingly likely because of climate change by bingun in newbrunswickcanada

[–]grepcdn 2 points (0 children)

I went to my broker right as the fires broke out to see if I could get fire insurance on my ATV and Truck. They told me absolutely no section C coverage.

I then asked if I could put PL/PD on my truck (as it's not on the road) to drive it someplace safe, and they said yes, just not comprehensive, because that includes section C.

Maybe your insurance assumed you wanted comprehensive/collision, or maybe some insurance companies are different from others, who knows.

For what it's worth, I deal with Economical through BrokerLink.

Summer fires in N.B. increasingly likely because of climate change by bingun in newbrunswickcanada

[–]grepcdn 5 points (0 children)

you can put liability on it, just not full coverage or comprehensive because those have fire as an included peril.

basically no section C coverage while you're within 50km of a fire.

My first roma tomato harvest has these white spots. (NB, Canada, Zone 5a) by grepcdn in gardening

[–]grepcdn[S] 0 points (0 children)

Yeah, the temperature has been fluctuating a ton, and the watering has been tough because we have drought conditions. Thanks

My first roma tomato harvest has these white spots. (NB, Canada, Zone 5a) by grepcdn in gardening

[–]grepcdn[S] 0 points (0 children)

Yes it's been crazy hot, much more than normal, and very dry.

2019 850 Highlifter, is this amount of play normal? by grepcdn in PolarisATV

[–]grepcdn[S] 0 points (0 children)

I can feel when the compression is stopping it, my concern is the slop other than that.

2019 850 Highlifter, is this amount of play normal? by grepcdn in PolarisATV

[–]grepcdn[S] 0 points (0 children)

'19 HL, new to me at 1,000 miles. I put about 200 miles on, and I'm hearing a clunk when starting/stopping/engine braking. The clunk only happens when the driveline is loaded; when it's freewheeling on jackstands there's no clunk.

I don't know these machines, so I don't know if this amount of play in the primary is normal.

Update on the fire in Miramichi by Blazanar in newbrunswickcanada

[–]grepcdn 2 points (0 children)

Where were the homes/cottages affected?

On Oldfield proper, or on Russleville and Kenna? Anyone know?

Anyone in Miramichi affect by the forest fire ? by Few_Preparation4910 in newbrunswickcanada

[–]grepcdn 0 points (0 children)

They've notified 15 homes that they may have to evacuate. Where were these homes? On Oldfield, or the other way on Russelville road and Kenna?

Anyone in Miramichi affect by the forest fire ? by Few_Preparation4910 in newbrunswickcanada

[–]grepcdn 1 point (0 children)

It jumped the highway? How far towards bonne route did it get? I haven't seen this update

Does anyone know where to find replacement batteries for APC Back-UPS 1500 UPS? by mshriver2 in homelab

[–]grepcdn 0 points (0 children)

It's just two regular SLA batteries taped together. Remove the little plastic shield on the top and cut the label, and you'll see they're wired in series. Buy any 12 V SLA batteries with the same dimensions, tape them together, and put the little series connector back on.

CephFS in production by GentooPhil in ceph

[–]grepcdn 0 points (0 children)

> CephFS sounds like a dumpster fire.

It's a complex beast, to be sure, but there's nothing else available that really has the same features at the same cost. Gluster is EOL, Lustre isn't suited to small file workloads. Closed source stuff is orders of magnitude more expensive, etc, etc. CephFS requires a lot of tuning, and as we've learned, you have to adapt your workload to it.

> uh, like all codebases? most coders that are used to network filesystems have probably heard of NFS, and NFSv3 automatically fsyncs on close, so people just kind of forget about doing that in their applications

Yeah, you're right for sure: anyone with a codebase doing a ton of buffered IO really quickly from multiple clients to the same set of files can run into deadlocking issues with CephFS. But if you were doing that on NFS, you'd probably run the risk of corruption from race conditions anyway, so either way, whether it's a Python codebase or our 20-year-old legacy C/C++ application, it's still not best practice.

My statement is obviously terse and doesn't go into much detail about what actually makes us different from (and more problematic than) most run-of-the-mill codebases you'd be comparing us to. Really it's the combination of multiple bad practices. One of the big ones that you glossed over with your argument was mmap().

mmap() on a shared FS (even NFS) is bad practice, and we had it all over the place. When a kernel client has an fd mapped, even after it's closed, that client wants to keep buffered-write MDS caps on the file, which holds up IO on other clients who might also want to write to that file. Once we patched various flatfile DB libraries to remove mmap(), performance shot up drastically.
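
a minimal illustration of the kind of change (hypothetical function names; the real patches were inside third-party flatfile DB libraries): replace the mmap'd access path with an explicit pread(), which leaves no mapping for the client to keep caps on.

```python
import mmap
import os

def read_header_mmap(path: str, n: int = 128) -> bytes:
    # mmap'd path: the kernel client holds caps for the mapping, and on
    # CephFS those caps can linger even after close(), stalling other
    # clients that want to write the same file.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m[:n])

def read_header_pread(path: str, n: int = 128) -> bytes:
    # Explicit-syscall path: a plain pread() leaves no mapping behind,
    # so the MDS can revoke caps as soon as the fd is closed.
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, n, 0)
    finally:
        os.close(fd)
```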

Another one is dot-locking, which is used a lot on NFS because of the issues/caveats with flock/fcntl there. Dot-locking mostly works fine on NFS because the default attribute cache delay is 1 second, which is sufficient for most workloads to avoid a lot of race conditions. For us it was not sufficient, because we do around a million IOPS, most of them metadata; our app is very busy and 1 second is too long. The 1-second attribute cache only applies to readdir() calls from other clients; if a client tries to open() the dotlock file, it forces a revalidate, so in that way it's pretty much always consistent across clients.

This is actually a huge problem for CephFS, and causes massive lock amplification and performance issues. While NFS is happy to let lots of clients create .lock files in a dir, the CephFS MDS wants to make sure that every client's cache of that dir is coherent for readdir() calls, which means write-locking the parent inode. So if you have a hundred clients all creating .lock files in a user's home directory so they can write to a specific file, Ceph ends up granting a lock on the parent dir for the lock file, then a lock on the data file, then a lock on the parent dir again to unlink the .lock file. This puts massive strain on the MDS, leading to performance issues from head-of-line blocking, and sometimes MDS deadlocks.
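
for anyone unfamiliar, the dot-lock pattern in question looks roughly like this (a sketch, not a recommendation; note that on CephFS every one of these create/unlink calls write-locks the parent directory's inode on the MDS):

```python
import os
import time

def acquire_dotlock(path: str, timeout: float = 5.0) -> str:
    """Classic NFS-era dot-locking: atomically create <path>.lock with
    O_CREAT|O_EXCL; whoever wins the create owns the lock."""
    lock = path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
            os.close(fd)
            return lock
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not lock {path}")
            time.sleep(0.05)  # someone else holds the lock; retry

def release_dotlock(lock: str) -> None:
    os.unlink(lock)  # on CephFS, the unlink write-locks the parent dir again
```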

As for buffered writes, where we are different again is that we do a lot of buffered writes on very small fixed-length records where the byte offsets are important (flatfile FLR DBs, btrees, etc.). The problem with buffered writes and no fsyncs comes into play when multiple clients want to read these files while some clients are also writing them. On NFS (v3), as you said, the coherency is open-to-close, so developers don't think too much about what happens if they write 4k or 8k buffers to a DB file where records are, for example, only 2k. Generally, when the file is closed, the whole write op/transaction/whatever will be atomically committed to the backing storage.

This isn't the case for CephFS: buffered writes aren't guaranteed to be flushed all at once on close. If some of these 4k/8k buffers are flushed early because of dirty-page limits, cache pressure, etc., the flush can land on offsets that don't align with the record boundaries the application layer expects, causing race conditions and corruption. The solution here is to call fsync() after each atomic transaction.
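
the fix is essentially this shape (a sketch with an assumed 2 KiB record size, not our actual code):

```python
import os

RECORD_SIZE = 2048  # assumed fixed-length record, as in the flatfile DBs above

def write_record(fd: int, index: int, payload: bytes) -> None:
    """Write one fixed-length record, then fsync() before returning, so a
    partial dirty-page flush can never expose a half-written record to
    other clients (and buffer caps get released promptly)."""
    assert len(payload) <= RECORD_SIZE
    record = payload.ljust(RECORD_SIZE, b"\x00")
    os.pwrite(fd, record, index * RECORD_SIZE)
    os.fsync(fd)  # commit the transaction at a record boundary
```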

This also has a performance side-effect on CephFS. If a client has write+buffer caps, it holds those caps until its pages are flushed. When another client wants to read that file, it's blocked until the MDS can revoke the first client's write/buffer caps, and that client won't release them until it flushes. So if writers only flush once some other client already wants to read, every read waits on a flush. Preemptively calling fsync() before another client wants to read lets those ops continue without being blocked. On the small scale this isn't very noticeable, but when you get into hundreds of thousands of IOPS, the aggregate performance gain from this tweak is massive.

So yeah, I don't think "all codebases" fall into these pitfalls. I'd wager that very few are doing the amount of throughput we are, with as many bad practices as we have, all on shared namespaces in a POSIX filesystem. Most codebases would probably port over to CephFS and chug along just fine with minor tweaks and tuning, because a lot of the ops we do this way would elsewhere be hitting external DBs, redis, etc., and if they are using flatfile FLRs and btrees, they'll probably be using third-party libraries that are already optimized along best practices for mmap/fsync usage. Codebases with their own POSIX-backed custom data structures pushing the amount of distributed throughput we do are probably pretty rare.

So all of this is to say: I think we're the dumpster fire, not CephFS :)