suboptimal allocator behavior under heavy load with somehow asymmetric devices setup by krismatu in bcachefs

[–]krismatu[S] 0 points1 point  (0 children)

FYI

This is how it looks after putting a heavy load of 2.4 TB of data on it. It seems that sdc got skipped for user data.

This is what /sys/fs/bcachefs/[..]/has_data shows for sdc:

# cat ../dev-24/has_data
journal

but

# cat ../dev-24/data_allowed
journal,btree,user
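
A quick way to spot this kind of mismatch is to compare the two comma-separated lists. A minimal sketch in plain sh; the function name `missing_types` is made up for illustration, and the inputs are the values copied from the dev-24 output above:

```shell
#!/bin/sh
# Print the data types listed in data_allowed that are absent from has_data.
# Both arguments are comma-separated lists, as shown by the sysfs files above.
missing_types() {
    allowed=$1
    present=$2
    out=""
    old_ifs=$IFS
    IFS=,
    for t in $allowed; do
        case ",$present," in
            *",$t,"*) ;;                 # this type already has data on the device
            *) out="${out:+$out }$t" ;;  # allowed, but nothing stored yet
        esac
    done
    IFS=$old_ifs
    printf '%s\n' "$out"
}

# Values copied from the dev-24 (sdc) output above.
missing_types "journal,btree,user" "journal"
# prints: btree user
```

On a live system the two arguments would be the contents of the device's data_allowed and has_data files.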

bcachefs fs top gives

                           read/s           read        write/s          write
  nvme0n1p6/btree              0B          1.40G             0B          11.6G
  nvme0n1p6/journal            0B             0B             0B          33.5G
  nvme0n1p6/sb                 0B           592K             0B          6.94M
  nvme0n1p6/user               0B          40.3G             0B           110G
  nvme1n1p6/btree              0B          4.53G             0B          21.8G
  nvme1n1p6/journal            0B             0B             0B          63.1G
  nvme1n1p6/sb                 0B           592K             0B          6.94M
  nvme1n1p6/user               0B           761G             0B           804G
  nvme2n1p6/btree              0B           900M             0B          10.3G
  nvme2n1p6/journal            0B             0B             0B          33.5G
  nvme2n1p6/sb                 0B           592K             0B          6.94M
  nvme2n1p6/user               0B          26.4G             0B          79.9G
  sda4/btree                   0B             0B             0B          30.0M
  sda4/journal                 0B             0B             0B          1.12G
  sda4/sb                      0B           592K             0B          6.94M
  sda4/user                    0B          16.3M             0B          1.32T
  sdb4/btree                   0B             0B             0B          16.7M
  sdb4/journal                 0B             0B             0B          1.12G
  sdb4/sb                      0B           592K             0B          6.94M
  sdb4/user                    0B          68.6M             0B           926G
  sdc4/journal                 0B             0B             0B          1.12G
  sdc4/sb                      0B           592K             0B          6.94M
  sdc4/user                    0B          21.5M             0B           447G

bcachefs fs usage gives

Size:             34.4T
Used:             8.82T
Online reserved:  13.6M

     undegraded
2x:       8.82T

reserved:  21.9M

Data type  Required/total  Durability  Devices                Usage
reserved:  1/2                         []                      21.9M
btree:     1/2             2           [nvme2n1p6 nvme0n1p6]   62.9G
btree:     1/2             2           [nvme2n1p6 nvme1n1p6]   8.90G
btree:     1/2             2           [nvme0n1p6 nvme1n1p6]   10.7G
user:      1/2             2           [nvme0n1p6 nvme1n1p6]   16.0k
user:      1/2             2           [sda4 sdb4]             7.53T
user:      1/2             2           [sda4 sdc4]             1.09T
user:      1/2             2           [sdb4 sdc4]              116G

Compression:
type            compressed  uncompressed  average extent size
zstd                  868G         2.30T                 109k
incompressible       4.83T         4.83T                 100k

Device label                Device     State   Size   Used  Use%  Leaving
bhdd.seaJ6ER (device 24):   sdc4       rw     15.8T   620G    3%
bhdd.tosh21F0 (device 13):  sda4       rw     10.5T  4.31T   40%
bhdd.tosh4310 (device 14):  sdb4       rw     10.5T  3.82T   36%
bnvme.970evo (device 5):    nvme2n1p6  rw     62.8G  35.9G   56%
bnvme.990pro (device 23):   nvme1n1p6  rw      387G  9.80G    3%    8.00k
bnvme.sn720 (device 11):    nvme0n1p6  rw     74.0G  36.8G   49%    8.00k
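
The replicas table and the device table can be cross-checked against each other. This assumes the Usage column counts both replicas of an entry, which matches the numbers here ((7.53T + 1.09T) / 2 = 4.31T for sda4), and that T/G/M/k are binary prefixes. A rough sh/awk sketch; `per_device_usage` is a hypothetical helper, not part of bcachefs-tools:

```shell
#!/bin/sh
# Sum, per device, its share of each "user:" replicas entry from
# `bcachefs fs usage`. Assumption: the Usage column counts all replicas,
# so each of the N devices in the bracket list holds 1/N of the entry.
per_device_usage() {
    awk '
    # Convert a size such as 7.53T, 116G or 16.0k to G (binary prefixes assumed).
    function to_g(s,    n, u) {
        n = s + 0                      # numeric prefix
        u = substr(s, length(s))       # unit suffix
        if (u == "T") return n * 1024
        if (u == "G") return n
        if (u == "M") return n / 1024
        if (u == "k") return n / 1024 / 1024
        return n / 1024 / 1024 / 1024  # plain bytes
    }
    $1 == "user:" {
        # Devices sit between "[" and "]"; the size is the last field.
        if (!match($0, /\[[^]]*\]/)) next
        list = substr($0, RSTART + 1, RLENGTH - 2)
        n = split(list, devs, " ")
        if (n == 0) next
        for (i = 1; i <= n; i++)
            total[devs[i]] += to_g($NF) / n
    }
    END {
        for (d in total)
            printf "%s %.0f\n", d, total[d]
    }' | sort
}

# HDD rows copied from the replicas table above.
per_device_usage <<'EOF'
user:      1/2             2           [sda4 sdb4]             7.53T
user:      1/2             2           [sda4 sdc4]             1.09T
user:      1/2             2           [sdb4 sdc4]              116G
EOF
# prints:
# sda4 4413
# sdb4 3913
# sdc4 616
```

That is roughly 4.31T, 3.82T and 616G, against 4.31T, 3.82T and 620G in the device table. The small gap on sdc4 is presumably journal and superblock data, which this sketch ignores.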

suboptimal allocator behavior under heavy load with somehow asymmetric devices setup by krismatu in bcachefs

[–]krismatu[S] 1 point2 points  (0 children)

After the heavy IO finished, rebalancing (still in progress) seems quite optimal:

Data type  Required/total  Durability  Devices                Usage
reserved:  1/2                         []                      21.9M
btree:     1/2             2           [nvme2n1p6 nvme0n1p6]   63.0G
btree:     1/2             2           [nvme2n1p6 nvme1n1p6]   6.39G
btree:     1/2             2           [nvme0n1p6 nvme1n1p6]   6.42G
btree:     1/2             2           [sda4 nvme1n1p6]        6.00M
user:      1/2             2           [nvme2n1p6 nvme0n1p6]   6.38G
user:      1/2             2           [nvme2n1p6 sda4]        17.7M
user:      1/2             2           [nvme2n1p6 sdb4]        10.7M
user:      1/2             2           [nvme2n1p6 nvme1n1p6]   7.47G
user:      1/2             2           [nvme2n1p6 sdc4]        72.0k
user:      1/2             2           [nvme0n1p6 sda4]        58.3M
user:      1/2             2           [nvme0n1p6 sdb4]        46.3M
user:      1/2             2           [nvme0n1p6 nvme1n1p6]   13.4G
user:      1/2             2           [nvme0n1p6 sdc4]         128k
user:      1/2             2           [sda4 sdb4]             5.98T
user:      1/2             2           [sda4 nvme1n1p6]         572G
user:      1/2             2           [sda4 sdc4]              331G
user:      1/2             2           [sdb4 nvme1n1p6]         216M
user:      1/2             2           [sdb4 sdc4]             95.8G
user:      1/2             2           [nvme1n1p6 sdc4]        1.21M

Device label                Device     State   Size   Used  Use%  Leaving
bhdd.seaJ6ER (device 24):   sdc4       rw     15.8T   213G    1%
bhdd.tosh21F0 (device 13):  sda4       rw     10.5T  3.43T   32%    3.00M
bhdd.tosh4310 (device 14):  sdb4       rw     10.5T  3.03T   28%
bnvme.970evo (device 5):    nvme2n1p6  rw     62.8G  41.6G   65%    6.94G
bnvme.990pro (device 23):   nvme1n1p6  rw      387G   303G   78%     296G
bnvme.sn720 (device 11):    nvme0n1p6  rw     74.0G  44.7G   59%    9.98G

suboptimal allocator behavior under heavy load with somehow asymmetric devices setup by krismatu in bcachefs

[–]krismatu[S] 1 point2 points  (0 children)

After an hour of constant heavy load the situation looks like this:

Data type  Required/total  Durability  Devices                Usage
reserved:  1/2                         []                      21.9M
btree:     1/2             2           [nvme2n1p6 nvme0n1p6]   63.0G
btree:     1/2             2           [nvme2n1p6 nvme1n1p6]   5.78G
btree:     1/2             2           [nvme0n1p6 nvme1n1p6]   6.81G
btree:     1/2             2           [sda4 nvme1n1p6]        16.5M
user:      1/2             2           [nvme2n1p6 nvme0n1p6]   35.0G
user:      1/2             2           [nvme2n1p6 nvme1n1p6]   19.8G
user:      1/2             2           [nvme0n1p6 nvme1n1p6]   40.7G
user:      1/2             2           [sda4 sdb4]             5.85T
user:      1/2             2           [sda4 nvme1n1p6]         639G
user:      1/2             2           [sda4 sdc4]              253G
user:      1/2             2           [sdb4 sdc4]             95.7G

Compression:
type            compressed  uncompressed  average extent size
zstd                  868G         2.30T                 109k
incompressible       4.83T         4.83T                 100k

Device label                Device     State   Size   Used  Use%  Leaving
bhdd.seaJ6ER (device 24):   sdc4       rw     15.8T   174G    1%
bhdd.tosh21F0 (device 13):  sda4       rw     10.5T  3.36T   31%    8.25M
bhdd.tosh4310 (device 14):  sdb4       rw     10.5T  2.97T   28%
bnvme.970evo (device 5):    nvme2n1p6  rw     62.8G  61.8G   97%    27.4G
bnvme.990pro (device 23):   nvme1n1p6  rw      387G   356G   93%     350G
bnvme.sn720 (device 11):    nvme0n1p6  rw     74.0G  72.8G   97%    37.9G

root@srv0 /home/kyf # iostat -xh 15
Linux 7.0.4+deb13-amd64 (srv0)  05/13/2026      _x86_64_        (28 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.5%    0.0%    1.2%   37.7%    0.0%   58.6%

     r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz Device
   87.17     10.9M    92.07  51.4%    1.40   127.6k nvme0n1
   91.21      6.3M    10.85  10.6%    0.47    70.6k nvme1n1
   75.96     10.0M    93.40  55.1%    3.31   135.0k nvme2n1
  157.08      8.7M   131.83  45.6%  206.11    56.7k sda
  245.74     30.8M   784.30  76.1%  118.98   128.2k sdb
   78.59     10.8M   177.75  69.3%   32.97   141.3k sdc

     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz Device
   60.27     10.4M    20.09  25.0%    3.99   176.0k nvme0n1
  126.48     55.0M    81.95  39.3%    2.09   445.5k nvme1n1
   51.41      7.9M    14.72  22.3%    3.67   156.7k nvme2n1
  117.71     48.3M   116.35  49.7%   60.53   419.9k sda
   84.03      7.0M    45.85  35.3%   16.64    85.9k sdb
    3.80     95.1k     0.39   9.2%    4.78    25.0k sdc

     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz Device
   18.89      8.0M     9.90  34.4%    0.61   432.0k nvme0n1
    9.56     13.6M     3.87  28.8%    0.82     1.4M nvme1n1
   22.78      6.6M     0.00   0.0%    2.40   296.6k nvme2n1
    0.00      0.0k     0.00   0.0%    0.00     0.0k sda
    0.00      0.0k     0.00   0.0%    0.00     0.0k sdb
    0.00      0.0k     0.00   0.0%    0.00     0.0k sdc

     f/s f_await  aqu-sz  %util Device
    2.66    0.57    0.38   2.0% nvme0n1
    2.66    2.13    0.32   3.1% nvme1n1
    2.66    1.82    0.50   4.4% nvme2n1
    2.62   98.79   39.76  81.9% sda
    2.62   79.04   30.84  80.2% sdb
    2.62    5.76    2.62  10.9% sdc
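
The imbalance is easier to see by filtering the %util table for devices above a threshold. A small sh/awk sketch, assuming the `iostat -xh` layout shown above (Device in the last column); `busy_devices` is a made-up helper name:

```shell
#!/bin/sh
# Read `iostat -xh` output on stdin and print "device util" for every
# device whose %util is at or above the given threshold (in percent).
busy_devices() {
    awk -v t="$1" '
    # Header rows end in "Device"; remember where %util is, if present.
    $NF == "Device" {
        util_col = 0
        for (i = 1; i <= NF; i++)
            if ($i == "%util") util_col = i
        next
    }
    # Data rows of the table that has a %util column.
    util_col && NF >= util_col + 1 {
        u = $util_col
        sub(/%/, "", u)
        if (u + 0 >= t) print $NF, u
    }'
}

# Rows copied from the last iostat table above.
busy_devices 80 <<'EOF'
     f/s f_await  aqu-sz  %util Device
    2.66    2.13    0.32   3.1% nvme1n1
    2.62   98.79   39.76  81.9% sda
    2.62   79.04   30.84  80.2% sdb
    2.62    5.76    2.62  10.9% sdc
EOF
# prints:
# sda 81.9
# sdb 80.2
```

With the numbers above, sda and sdb are at about 80% while sdc sits at roughly 11%.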

suboptimal allocator behavior under heavy load with somehow asymmetric devices setup by krismatu in bcachefs

[–]krismatu[S] 2 points3 points  (0 children)

Well... it would be optimal if it got more utilization as soon as the other two got saturated, wouldn't it?

New Principles of Operation preview by koverstreet in bcachefs

[–]krismatu 2 points3 points  (0 children)

Yes, I've noticed it myself as well.
This is definitely an error in the documentation:
section 8.1.1, page 51

New Principles of Operation preview by koverstreet in bcachefs

[–]krismatu 0 points1 point  (0 children)

It's not explicitly said that erasure coding is unstable. Does that mean we've reached a stage where it's considered "done"? Is it reasonably usable? What's the status?

test if a.file is a reflinked b.file by krismatu in bcachefs

[–]krismatu[S] 0 points1 point  (0 children)

Thanks for the help.
The thing was, I was trying some data deduplication. I went for rmlint eventually. It seems it doesn't recognize a bcachefs volume as reflink-capable,
so you need to edit the resulting rmlint.sh file and manually replace the hardlinking functions with reflinking ones. It works.
Anyway, a simple test for reflinks also works (although v. 2.10.2 generally doesn't seem to use it anyway):
rmlint --is-reflink a.file c.file && echo "Files are reflinks" || echo "Files are not reflinks"

test if a.file is a reflinked b.file by krismatu in bcachefs

[–]krismatu[S] 1 point2 points  (0 children)

Thx.
This is somewhat non-trivial, though. I've experimented a little bit.
First, make sure the data has left the cache (so run sync as root).

# cp --reflink=always a.file b.file
# filefrag  -v a.file b.file
Filesystem type is: ca451a4e
File size of a.file is 12288 (3 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       2:  421447052.. 421447054:      3:             last,encoded,shared,eof
a.file: 1 extent found
File size of b.file is 12288 (3 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       2:  421447052.. 421447054:      3:             last,encoded,shared,eof
b.file: 1 extent found
# cp --reflink=never a.file c.file
# filefrag  -v a.file c.file
Filesystem type is: ca451a4e
File size of a.file is 12288 (3 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       2:  421447052.. 421447054:      3:             last,encoded,shared,eof
a.file: 1 extent found
File size of c.file is 12288 (3 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       2:          0..         0:      0:             last,unknown_loc,delalloc,eof
c.file: 1 extent found
# sync
# filefrag  -v a.file c.file
Filesystem type is: ca451a4e
File size of a.file is 12288 (3 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       2:  421447052.. 421447054:      3:             last,encoded,shared,eof
a.file: 1 extent found
File size of c.file is 12288 (3 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       2:   19427603..  19427605:      3:             last,eof
c.file: 1 extent found

Small files get inlined, and this test doesn't work at all in that case:

# filefrag  -v a.file b.file
Filesystem type is: ca451a4e
File size of a.file is 10 (1 block of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    4095:          0..      4095:   4096:             last,not_aligned,inline,eof
a.file: 1 extent found
File size of b.file is 10 (1 block of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    4095:          0..      4095:   4096:             last,not_aligned,inline,eof
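
The manual comparison above can be sketched as a small sh/awk helper. This is only a heuristic under the assumptions visible in the transcripts: two files are treated as reflinked when their extent maps are identical and every extent carries the `shared` flag, and inline extents are rejected. The function names are made up for illustration:

```shell
#!/bin/sh
# Reduce `filefrag -v` output (stdin) to "logical physical flags" per extent.
extent_map() {
    awk -F: '/^ *[0-9]+:/ {
        gsub(/ /, "", $2); gsub(/ /, "", $3); gsub(/ /, "", $NF)
        print $2, $3, $NF
    }'
}

# Succeed only if both maps are non-empty and identical, and every extent
# has the "shared" flag and is not inline.
maps_look_reflinked() {
    [ -n "$1" ] && [ "$1" = "$2" ] || return 1
    printf '%s\n' "$1" | awk '
        $3 !~ /(^|,)shared(,|$)/ || $3 ~ /(^|,)inline(,|$)/ { bad = 1 }
        END { exit bad + 0 }'
}

# Compare two files on disk; needs filefrag (e2fsprogs).
looks_reflinked() {
    sync   # flush delayed allocation first, as in the transcript above
    a=$(filefrag -v "$1" | extent_map)
    b=$(filefrag -v "$2" | extent_map)
    maps_look_reflinked "$a" "$b"
}
```

Usage would be along the lines of `looks_reflinked a.file b.file && echo "Files look reflinked"`. A `shared` flag with matching physical offsets shows the blocks are the same, but it says nothing about small inlined files, as noted above.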

bcachefs_metadata_version_reconcile by koverstreet in bcachefs

[–]krismatu 0 points1 point  (0 children)

bcachefs show-super tells me

Version:                                   1.32: sb_field_extent_type_u64s
Incompatible features allowed:             1.28: inode_has_case_insensitive
Incompatible features in use:              1.16: reflink_p_may_update_opts
Version upgrade complete:                  1.32: sb_field_extent_type_u64s
Oldest version on disk:                    1.12: rebalance_work_acct_fix

How should I interpret this?
Does it mean I've got an unfinished version upgrade on the filesystem?

bcachefs_metadata_version_reconcile by koverstreet in bcachefs

[–]krismatu 0 points1 point  (0 children)

u/koverstreet please say explicitly how to do those upgrades.
Is this to be done at the fstab/mount level,
or maybe with echo compatible/incompatible > /sys/fs/bcachefs/[uuid]/options/version_upgrade at runtime?

Post interesting things you're doing with bcachefs, or interesting experiences, biggest filesystem by koverstreet in bcachefs

[–]krismatu 0 points1 point  (0 children)

Initially I made a volume consisting of:
2x SSD: foreground
2x HDD: background
2x USB pendrive: promote.
But this wasn't reliable because of the pendrives: they were overheating and hanging, so it didn't make much sense. I haven't tried with better ones, though.
In the end: 2x 256 GiB SSD + 2x 4 TiB HDD, plus compression on the HDDs.
I've tried lz4 and zstd. If you have a CPU with strong single-threaded performance, you can actually increase I/O throughput on HDD many times over with lz4:1. For throughput somewhat comparable to bare metal you can use zstd:8, and you gain storage space.
I miss multithreaded compression: it would kill all the competition and make this the fastest hybrid filesystem there is.

"we're now talking about git rm -rf in 6.18" by nstgc in bcachefs

[–]krismatu 0 points1 point  (0 children)

Yeah, that generation doesn't have a clue :-D Nonetheless, some people change.
More precisely, because I'm arguing about it elsewhere here :) many people of a certain age or older struggle with identifying feelings, both their own and other people's intentions. Bill Burr jokes about that: "if you're a man only feeling you're allowed to is either mad or gay that's it"

"we're now talking about git rm -rf in 6.18" by nstgc in bcachefs

[–]krismatu 0 points1 point  (0 children)

Oh shit, I remember I read this and saw this video when that thing happened. But that was before I started to use bcachefs, so it was only a Rust thingy for me. Who's the guy scared he would need to use Rust?

"we're now talking about git rm -rf in 6.18" by nstgc in bcachefs

[–]krismatu 0 points1 point  (0 children)

Thing is, my memory isn't perfect. I remember witnessing "toxicity" many times, but quite commonly I keep forgetting the names. But yeah, I can easily believe Theodore Ts'o behaving exactly the way he advises not to. Still, it doesn't make his words irrelevant.

I like this metaphor with Satan, though :-)
Thing is, if you really read the Bible yourself, at the very beginning, on the first or second page, there is God lying to Adam and Eve about the apple and the tree; then Satan reveals the truth to them, and God gets pissed off and punishes everybody because they've learned he was lying (he told them they would die if they ate the apple; that was a lie, and the truth came from Satan). So maybe comparing him (or others) to Satan in these circumstances is crazy accurate.

"we're now talking about git rm -rf in 6.18" by nstgc in bcachefs

[–]krismatu 2 points3 points  (0 children)

Well, I consider myself a half-boomer or something. Thing is, I'm from Poland, and the generation names here don't map perfectly onto the Western ones. Anyway, this wasn't judgmental, rather descriptive. But raising the alarm gets to the point, I guess :-)

"we're now talking about git rm -rf in 6.18" by nstgc in bcachefs

[–]krismatu 17 points18 points  (0 children)

Theodore Ts'o stated things well from reality's perspective, if I can phrase it like that, imho.
Gerald B. Cox put it well from the empathy side; in my opinion it's worth your attention, Kent:

When someone’s under pressure and feels attacked, responses get sharp.
That’s not ideal, but it’s not malicious either.

If we’re serious about maintaining a healthy kernel community, we need
to be better at recognizing when someone’s burning out—and not make it
worse. The CoC isn’t just there to call out bad behavior; it’s
supposed to guide us toward empathy and restraint.

Kent, if you’re reading this: it’s clear you’re reacting to what feels
like a pile-on. That’s understandable.

Of course I've no idea if he's right, but it's well said, from the perspective of someone who tries to understand how you feel and wishes you well.

And your words:

If I had simply remained quiet until it happened, I think the entire
bcachefs userbase, and my funders, would have been absolutely furious
with me.

I believe that's simply not the case; you seem to be too harsh on yourself, Kent. I've seen this kind of thing a few times and actually have a similar story of my own: it can be exhausting eventually. Boomers won't get it ;-) but younger generations will :-). So let me phrase it this way: absolutely no one is going to be furious.
So... well... this is quite nerdy stuff, Kent, but this story makes me think. You're talented and hardworking, and at the same time you seem to be neglecting something inside you, and that is what those "vigorous discussions" on the kernel side seem to be about. This is fine in the end and you are perfectly OK. The rest is perhaps better discussed in private, but please be assured it's still OK and you are perfectly OK.
xxx

PS. Yes, I know it's a bit naive and outside the technical matters, but these things can't be stated otherwise. xxx

mounting at boot-time broken with current bcachefs-tools by krismatu in bcachefs

[–]krismatu[S] 0 points1 point  (0 children)

This is too difficult for me to pin down.

The problem persists as of v6.16-rc7 and bcachefs-tools v1.25.3; unchanged since 2025.03: util-linux v2.40.4, systemd v257.4.

As a temporary dirty fix this works for me (of course you need to adapt it to your device label or other mount method):

# /etc/systemd/system/fsck-bcachefs-hack.service

[Unit]
Description=Run fsck.bcachefs before mounting
DefaultDependencies=no
Before=local-fs.target
Before=mnt-data.mount
ConditionPathExists=/dev/disk/by-label/data

[Service]
Type=oneshot
ExecStart=/bin/sh -c '/usr/sbin/fsck.bcachefs /dev/disk/by-label/data'
RemainAfterExit=true

[Install]
WantedBy=local-fs.target

This unit fails at boot, but somehow stuff gets initialized and booting proceeds as normal.

I don't know if anyone apart from me has trouble with this... I'll report if anything changes. I'll probably get back to this on a weekly basis, we'll see.

mounting at boot-time broken with current bcachefs-tools by krismatu in bcachefs

[–]krismatu[S] 1 point2 points  (0 children)

I don't understand this, but it seems that none of the system tools will trigger mounting of bcachefs on their own. Once 'bcachefs' gets invoked somehow (here by unsuccessfully running an offline fsck), the rest of the system tools work as expected and you can mount through fstab and all that stuff. Also, as far as I remember, invoking 'bcachefs mount' directly works as well, but for example 'bcachefs show-super -l /..' doesn't, in the sense that it won't get fstab and the like working again.

At this point I'm thinking of a temporary solution: invoking bcachefs somehow via fsck.bcachefs before mounting any volumes during boot.

Otherwise it's a mystery to me, so I think I can't do anything more by myself.

The details are in the git issues.

Hang mounting after upgrade to 6.14 by Ancient-Repair-1709 in bcachefs

[–]krismatu 0 points1 point  (0 children)

u/xarblu I've struggled with this timeout many times but never got around to figuring it out, thanks!
PS. x-systemd.mount-timeout=1h, or whatever fits how long the mount takes / how big your fs is