Experimenting with a modern filesystem (btrfs)
November 25, 2023
Tags: btrfs linux

Recently I bought myself a beefy new desktop computer as an upgrade from my 12 year old computer, which was still running strong but showing signs of its age. Over the 12 years that I had this computer, I accumulated 5 HDDs and 4 SSDs (plus a few more that died). After going through every single one of them, reviewing the data on it, and copying what was worth keeping around, I was left with 12.5 terabytes spread across 9 devices. I knew I was going to use at least some of the disks and the old computer as a data storage server. Managing all the disks was the issue: having 9 (or more) different mount points is not fun to manage, and it is very easy to lose data, considering that some hard drives were 8+ years old and had seen pretty much constant use. What I needed was a way to use multiple devices as one, in a raid1 setup if possible. Encryption was also something I set as a requirement.
On linux there are quite a few ways of doing this, for example using a combination of luks, lvm and/or mdadm and putting a filesystem (such as ext4) on top. Having done this in the past, but with a single device, I really didn't want to use this approach again since I found it to be a bit too clunky to manage. Another issue is the overall lack of assurance regarding how to handle disk failures, since the setup would be custom-ish and assembled from some basic building blocks.
The solution? Using a "modern" filesystem! Some of the options available are zfs at 18 years old, btrfs at 15 years old, and bcachefs at 8 years old. All three support encryption (either directly or through luks), cow (copy on write), which is very useful for taking snapshots at very low cost, handling multiple devices, and checksums for data and metadata. zfs, being the oldest of the trio, is probably the most well tested and stable option, and is often used on servers (for example on rsync.net). It isn't available as part of the linux kernel due to legal/license constraints, and, a bigger issue for me, it doesn't handle pools made of heterogeneous devices efficiently. A zfs raid1 of 2x1 and 2x2 terabyte disks will have a total storage of 2 terabytes, since it normalizes down to the smallest device. In contrast, btrfs and bcachefs can both use heterogeneous disks more efficiently. bcachefs is more feature packed, notably tracking disk latencies and optimizing reads from faster disks. It also supports using disks as cache layers, where for example an nvme drive could be used to receive writes, a set of HDDs for storing data, and an SSD for caching reads. However, it is much newer and was only recently merged into the mainline kernel.
The alternative left was btrfs, which seems to have a reputation for eating away the data it stores. A bit of searching around reveals that these reports are usually from several years ago and that the situation has changed since. Several blog posts exist with people telling their stories of using btrfs without much issue. One notable post which helped make up my mind was this.
The setup I went for consisted of encrypting the full disks using luks, which can be done with cryptsetup luksFormat /dev/sda --label foo for example. This has to be run for each device to be used. Once encrypted, the device needs to be opened (as a mapped device, through which we can read and write, with the data being encrypted/decrypted transparently). This is done with cryptsetup open /dev/sda foo.
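As a concrete sketch (the device names and labels here are just placeholders), the per-disk sequence looks roughly like this:

    # encrypt each physical disk and give it a label
    cryptsetup luksFormat /dev/sda --label pool_a
    cryptsetup luksFormat /dev/sdb --label pool_b
    # open them as mapped devices at /dev/mapper/pool_a and /dev/mapper/pool_b
    cryptsetup open /dev/sda pool_a
    cryptsetup open /dev/sdb pool_b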
The next step is setting up the btrfs filesystem. It starts by running mkfs.btrfs on the mapped devices, eg mkfs.btrfs /dev/mapper/foo. It is possible to pass all the devices in a single invocation.
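For example (again with placeholder names), assuming all the mapped devices are already open, something like:

    # create a single btrfs filesystem spanning several mapped devices
    mkfs.btrfs /dev/mapper/pool_a /dev/mapper/pool_b /dev/mapper/pool_c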
Alternatively, it is possible to add devices one by one to the pool. First it needs to be mounted, with mount /dev/mapper/foo /mnt/disk_pool/ for example, and then we can add devices using btrfs device add /dev/mapper/bar /mnt/disk_pool. By default, this will use the sum of all storage capacity, also known as JBOD (Just a Bunch of Disks). While handy, it couples failures between disks, where losing one may cause more data to be lost than if they were simply being used individually. This happens because the filesystem might store chunks and metadata across devices (I haven't confirmed this, to be honest). To switch to a raid1 setup, which can tolerate the loss of a single drive, we need to run btrfs balance start --verbose -dconvert=raid1,soft -mconvert=raid1 /mnt/disk_pool.
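Putting it together, growing the pool one device at a time looks roughly like this (paths are illustrative; the soft filter skips chunks that already have the target profile):

    # mount the pool through any one of its member devices
    mkdir -p /mnt/disk_pool
    mount /dev/mapper/pool_a /mnt/disk_pool/
    # add another mapped device to the pool
    btrfs device add /dev/mapper/pool_b /mnt/disk_pool
    # convert existing data and metadata chunks to raid1
    btrfs balance start --verbose -dconvert=raid1,soft -mconvert=raid1 /mnt/disk_pool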
The dconvert option sets the profile for the data, while mconvert sets the profile for the metadata. If metadata is lost, it is game over. If the pool consists of more than two disks, we can use -mconvert=raid1c3 or even raid1c4 to replicate the metadata to three or four disks. This can be done for data too for a higher replication rate, but that might be overkill. One cool thing is that if the motherboard supports hotswapping the disks, then in case of failure it is possible to swap the disks without downtime or unmounting the btrfs pool. The partition could even be actively in use, although this will make the rebalancing process slower and decrease the performance of the applications using the btrfs partition. Whenever the profiles change, we need to run a balance again to apply the new profile to the data that is already allocated.
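As an example (using the same mount point as above), bumping the metadata profile on a pool with at least three disks and checking the result would look something like this:

    # replicate metadata across three devices
    btrfs balance start -mconvert=raid1c3 /mnt/disk_pool
    # inspect the per-profile allocation afterwards
    btrfs filesystem usage /mnt/disk_pool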
A small mishap
The initial setup had 3 disks: one HDD with 3 terabytes and two SSDs of 1 terabyte each. In total there were 5 terabytes of storage, but only 2 terabytes were usable. The first terabyte from the HDD was mirrored on one of the SSDs, and the second terabyte was mirrored on the second SSD. The third terabyte doesn't have a match to mirror against, so it isn't utilized.
After a few days of use and storing a bunch of data on it, the pool got full.
Based on my grafana logs, one of the SSDs got filled, and then the second one.
At this point the system switched to readonly. I bought a new 4 terabyte SSD,
plugged it in, and ran btrfs device add
to add it to the pool. To my surprise
it failed because there was no space left. I tried mounting and unmounting the disk a couple of times, and any command I ran put the filesystem in readonly mode. After some annoyance I decided to delete some of the data, and after that I got the new device properly added in. Rebalancing took several hours, but it worked well. Going forward, it is definitely worth adding or replacing disks before the pool gets full.
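To catch this earlier, it is enough to keep an eye on how much raw space is already allocated, for example with:

    # per-device and per-profile allocation, including unallocated space
    btrfs filesystem usage /mnt/disk_pool
    # a more compact data/metadata breakdown
    btrfs filesystem df /mnt/disk_pool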
A bigger mishap
After many years without a windows install, I decided to install it for the
tooling available for tuning cpu and ram. The process was quite annoying and
took me several hours (mostly due to my own incompetence / ignorance).
Naturally, this broke my linux boot. After fixing it, I noticed that the btrfs
filesystem failed to mount. Logs showed that it was due to a missing disk (or
in this case, the device mapper being missing). Running fdisk -l
quickly
revealed the problem: there was an EFI partition on the disk, which overwrote the LUKS header and basically nuked the data on that disk. Well, this is what the raid1 setup was for hehe. btrfs has a command specifically for replacing a failed disk, but it is intended for when you are adding a new device to replace a physical one. In my case, it was the same device, but it was wiped clean.
Instead of using the replace
command, I ran btrfs device delete missing
. I
expected this to run immediately, or at least very quickly. To my surprise, it
took a few minutes, then a few hours, and then it kept running. By running
watch -d btrfs filesystem usage /mnt/disk_pool
I could see that it was apparently going over all the data allocated on the disk, which was the 3 terabyte one. Not sure what it was doing, but I would imagine it was verifying that all the data was present or something similar.
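As an aside, the replace command mentioned above (which I ended up not using) would look roughly like this; the devid and device name below are illustrative:

    # find the devid of the missing device
    btrfs filesystem show /mnt/disk_pool
    # replace the missing device (identified by its devid) with the new one
    btrfs replace start <devid> /dev/mapper/new_disk /mnt/disk_pool
    # check progress
    btrfs replace status /mnt/disk_pool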
After 14 hours and only having processed 1 terabyte out of 3, I decided it would be a good idea to stop the process, just add the disk to the pool, and see what would happen. What happened is that btrfs still had 1.3 terabytes marked as missing, and the disk was showing up as 3 terabytes of unallocated
space. Running a balance with soft
updated zero chunks. Then I remembered that the filesystem had to be mounted in degraded mode. Perhaps this was preventing the balance from running? Well, back to running btrfs device delete missing again. Interestingly, this time I could see space from the 3 terabyte device being used, so it seems it was already replicating the data that was present on only one disk. The final 1.3 terabytes took another 8 hours. Since
this was my first time replacing a disk, I decided to run a rebalance
afterwards to ensure data was properly replicated. This is also a good
opportunity to go back to raid1c4
for metadata. This rebalancing process took
21 minutes, which tells me that it didn't do much, so I guess it wasn't really
necessary.
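For completeness, a sketch of the commands involved in this recovery (the mount point and device name are from my setup; yours will differ):

    # mount the pool even though a device is missing
    mount -o degraded /dev/mapper/pool_a /mnt/disk_pool
    # drop the missing device and re-replicate its chunks onto the remaining disks
    btrfs device delete missing /mnt/disk_pool
    # put metadata back on raid1c4
    btrfs balance start -mconvert=raid1c4 /mnt/disk_pool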
Conclusion?
Overall, using btrfs has been a positive experience. It allowed me to give new life to my random collection of storage devices, while making them more "ergonomic" by letting me use all the devices as if they were one. Encryption adds a layer of security in case my computer gets stolen or something. The raid1 setup with the checksums adds an extra layer of resiliency to the system. Compression improves data density and throughput by putting more load on the cpu and reducing the load on disk bandwidth, which in this setup is the much more limited resource. In less than a month I had two situations where btrfs really helped me, one when the pool got full, and another when I nuked one of the disks and lost all of the data there.
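In case it is useful, compression in btrfs is typically enabled through a mount option; a minimal sketch (compress=zstd is one common choice, zlib and lzo are also supported):

    mount -o compress=zstd /dev/mapper/pool_a /mnt/disk_pool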