From 51a31dea331bade54e12e7f93bb36957b6592817 Mon Sep 17 00:00:00 2001
From: "jbranso@dismail.de" <jbranso@dismail.de>
Date: Sat, 6 Jan 2024 14:59:40 -0500
Subject: open_issues/bcachefs.mdwn: new file.

Well, we might as well document our conversation with Kent about
bcachefs.

Message-ID: <20240106200039.2043-1-jbranso@dismail.de>
---
 open_issues/bcachefs.mdwn | 326 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 326 insertions(+)
 create mode 100644 open_issues/bcachefs.mdwn
(limited to 'open_issues')

diff --git a/open_issues/bcachefs.mdwn b/open_issues/bcachefs.mdwn
new file mode 100644
index 00000000..aa39bce0
--- /dev/null
+++ b/open_issues/bcachefs.mdwn
@@ -0,0 +1,326 @@
+[[!meta copyright="Copyright © 2007, 2008, 2010, 2011 Free Software Foundation,
+Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+[[!tag open_issue_hurd]]
+
+The Hurd's primary filesystem is ext2, which works but lacks modern
+features. Ext2 does not have a journal, so Hurd users regularly have
+to deal with filesystem corruption. `fsck` can fix most of the issues
+(with loss of random data), but without a proper journal the Hurd is
+currently not a good OS for long-term data storage.
+
+Bcachefs is a modern COW (copy-on-write) open source filesystem for
+Linux, which aims to replace Btrfs and ZFS while offering the
+performance of ext4 or XFS. It is almost 100,000 lines of code;
+Btrfs is 150,000 lines of code. Bcachefs is structured as a
+filesystem built on top of a database, with a clean, small database
+transaction layer. That core database library is maybe 25,000 lines
+of code.
+
+Some Hurd developers recently [[talked with
+bcachefs|https://youtube.com/watch?v=bcWsrYvc5Fg]] author Kent
+Overstreet about porting bcachefs to the Hurd. There are currently no
+concrete plans to do so, due to lack of developer manpower.
+
+90% of the bcachefs filesystem code builds and runs in userspace. It
+uses a shim layer that maps kernel locking primitives to pthreads,
+maps the kernel I/O API to AIO, and so on. The bcachefs project
+intends to eventually rewrite most or all of its current codebase in
+Rust.
+
+Kent is OK with us merging a shim layer for libstore that maps to the
+Unix filesystem API. That would be a header file that goes into the
+bcachefs code.
+
+There is a somewhat working FUSE port of bcachefs, but Kent is not
+certain that it is a good way to run bcachefs in userspace. Kent
+wants to use the FUSE port to help in debugging: if bcachefs starts
+acting up, you could switch to running it in userspace and attach GDB
+to the running process. This is currently not possible.
+
+We could port bcachefs to the Hurd's native filesystem API: libdiskfs.
+
+One interesting aspect of the conversation was Kent's goal of re-using
+kernel code in userspace. The Linux kernel hashtable code is
+high-performance, resizable, lockless, and builds and runs in
+userspace. As long as you have liburcu, you can use the kernel
+hashtable in userspace, which might be useful on the Hurd.
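+
+To make the shim approach described above more concrete, here is a
+minimal sketch of how a userspace build can map kernel-style locking
+primitives onto pthreads with a small header. This is purely
+illustrative: the header name and the exact set of wrappers are made
+up for this example, and bcachefs-tools ships its own, much more
+complete shims.
+
+    /* shim_locking.h -- illustrative sketch only, not the real
+     * bcachefs-tools code.  Kernel-style mutex and spinlock calls are
+     * mapped straight onto pthread primitives. */
+    #ifndef SHIM_LOCKING_H
+    #define SHIM_LOCKING_H
+
+    #include <pthread.h>
+
+    struct mutex {
+            pthread_mutex_t lock;
+    };
+
+    static inline void mutex_init(struct mutex *m)
+    {
+            pthread_mutex_init(&m->lock, NULL);
+    }
+
+    static inline void mutex_lock(struct mutex *m)
+    {
+            pthread_mutex_lock(&m->lock);
+    }
+
+    static inline void mutex_unlock(struct mutex *m)
+    {
+            pthread_mutex_unlock(&m->lock);
+    }
+
+    /* In userspace there is little reason to spin; a "spinlock" can
+     * simply be a pthread mutex as well. */
+    typedef struct {
+            pthread_mutex_t lock;
+    } spinlock_t;
+
+    static inline void spin_lock_init(spinlock_t *l)
+    {
+            pthread_mutex_init(&l->lock, NULL);
+    }
+
+    static inline void spin_lock(spinlock_t *l)
+    {
+            pthread_mutex_lock(&l->lock);
+    }
+
+    static inline void spin_unlock(spinlock_t *l)
+    {
+            pthread_mutex_unlock(&l->lock);
+    }
+
+    #endif /* SHIM_LOCKING_H */
+
+The shim layer for libstore mentioned above could take the same
+shape: a single header translating between the two APIs, carried in
+the bcachefs tree.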
+
+Bcachefs is licensed as GPLv2, and many of Kent's previous employers,
+including Google, own the patents. Kent is OK with potentially making
+the license GPLv2+, as long as there was no promise to keep bcachefs
+GPLv2-only.
+
+# IRC logs
+
+https://logs.guix.gnu.org/hurd/2023-09-26.log
+
    maybe I'm wrong though, do you know much about fuse? or file systems?
    no i dont know much about filesystems
    what is bcachefs?
    see? :D
    I agree that someone intimate in the Mach pager api, libdiskfs and fuse would be great at that meeting
    I do kind of understand Mach VM / paging, I must say
    from the looks of it, I even understand it best among those who have looked at it recently
    and I mostly understand libdiskfs
    so go to the meeting
    what is fuse? do we even need it for hurd?
    file systems in userspace
    FUSE is "filesystem in user space", it's both the name for the concept, and the name of Linux's specific mechanism, of offloading fs to userland
    yeah, i think it may be unneeded for filesystem on hurd
    it's basically a giant hack that pretends to be a fileystem implementation to the rest of the kernel, and then sends requests and receives responses from a userland program that _actually_ implements the fs
    on the Hurd, *of course* filesystems are implemented in userland, that's the only and tnhe natural way everything works
    but that's where the similarities end
    you cannot just take a linux fuse fs, using libfuse, and run it on the Hurd
    there has been a project make a library that would have the same API as libfuse, but act as a Hurd translator, specifically to facilitate porting linux filesystems
    i imagine fuse has an api
    last I heard, it was never completed, but who knows
    it has a kerne <->userland protocol and a userspace library (libfuse) for implementing that protocol, yes
    solid_black: you seem to know more about fuse than you admitted
    https://www.gnu.org/software/hurd/hurd/libfuse.html
    I know the basics, around as much as I have just told you
    I think that gnucode idea was that this would be the easiest to port bcachefs to the Hurd, but I doubt it would be the best
    I have also hacked on a C++ fuse fs (darling-dmg), though I don't think I interacted with the fuse parts of it much
    Or even the easier
    yeah, I don't think it'd be the best or the easiest one either
    if someone implemented libfuse api and made it as a hurd translator, surely it would work natively?
    zacts: the main problem seems to be the interactions between the fuse file system and virtual memory (including caching)
    something the hurd doesn't excel at
    it *may* be possible to find existing userspace implementations that don't use the system cache (e.g. implement their own)
    Yes, that’s a possibility that needs to be kept open for discussion
    Sounds interesting
    youpi: ping
    pong
    hello!
    any thoughts on the above discussion? are you going to participate in the call that's being set up?
    I don't have time for it
    (AFAIK the fuse hurd implementation does work to some extent)
    I should at least try out Hurd's fuse before the call, good idea
    maybe read up on the Linux's fuse
    thoughts on using fuse vs libdiskfs for bcachefs?
    using fuse would probably be less work
    and it'd probably mean fixing things in libfuse, which can benefit many other FS anyway
    is it true that the "low level" API of libfuse is unimplemented and unimplementable?
+ I don't know what that "low level" API is + this IIUC https://github.com/libfuse/libfuse/blob/master/include/fuse_lowlevel.h + > libfuse offers two APIs: a "high-level", synchronous API, and a "low-level" asynchronous API. In both cases, incoming requests from the kernel are passed to the main program using callbacks. When using the high-level API, the callbacks may work with file names and paths instead of inodes, and processing of a request finishes when the callback function returns. When using the low-level API, the callbacks must work with inodes and responses must be se + nt explicitly using a separate set of API functions. + where did you read that it'd be unimplementable ? + https://git.savannah.gnu.org/cgit/hurd/incubator.git/tree/README?h=libfuse/master + > This is simply because it is to specific to the Linux kernel and (besides that) it is not farly used now. + In case the latter should change in the future, we might want to re-think about that issue though. + so, sounds like it's perhaps implementable in theory, but that'd require additional work and design + see the sentence below... + the low-level API is what bcachefs uses + well, additional work and design, of course + seems to, at least, from a quick glance + any async API needs some + but I don't see why it would not be possible + mig precisely supports asynchronous stubs + bcachefs-tools/cmd_fusermount.c is just 1274 lines, which inspires some hope + asynchrony is not the problem, I imagine (but I haven't looked), but being too tied to Linux might be + it's not really tied, as in it doesn't seem to use linux-specific functions + but it uses linux-like notions, which indeed need to be translated to the hurdish notions + but that's not something really tough + just needs to be worked on + +https://logs.guix.gnu.org/hurd/2023-09-27.log#103329 + + libfuse as shipped as Debian doesn't seem very + functional, I can't even build a simple program against it: + 'i386-gnu/libfuse.so: undefined reference to `assert'' + + (assert is of course a macro in glibc) + and it segfaults in fuse_main_real + lowleve fuse ops do seem to map to netfs concept nicely, as far as I can see so far + and (again, so far) I don't see any asynchrony in how bcachefs uses fuse, i.e. they always fuse_reply() inside the method implementation + + but if we had to implement low-level fuse API, this would be an issue + because netfs is syncronous + this is again a place where I don't think netfs is actually that useful + libfuse should be its own standalone tranlator library, a peer to lib{disk,net,triv}fs + yell at me if you disagree + or perhaps make it use libdiskfs ? + there's significant code in libdiskfs that you'd probably not want to reimplement in libfuse + like what? + starting a translator + all the posix semantic bits + (this is another thing, I don't believe there is a significant difference that explains libdiskfs and libnetfs being two separate libraries. but it's too late to merge them, and I'm not an fs dev) + + starting a translator is abstracted into libfshelp specifically so it can be easily reused? + is libdiskfs synchronous? + I'm just saying things out of my memory + scratch that, diskfs does not work like that at all + piece of it is in fshelp yes + it works on pagers, always + but significant pieces are in libdiskfs too + and you are saying you are not an FS person :) + you do know libdiskfs etc. well beyond the average + perhaps not the ext2 FS structure, but that's not really important here + see e.g. 
the short-circuits in file-get-trans.c + I may understand how the Hurd's translator libraries work, somewhat better than the avergae person :) + and the code around fshelp_fetch_root + but I don't know about how filesystems are actually organized, on-disk (beyond the basics that there any inodes and superblocks and journaled writes and btrees etc) + you don't really need to know more about that + nor do I know the million little things about how filesystem code should be written to be robust and performant + yeah so as I was saying, libdiskfs expects files to be mappable (diskfs_get_filemap_pager_struct), and then all I/O is implemented on top of that + e.g. to read, libdiskfs queries that pager from the impl, maps it into memory, and copies data from there to the reply message + I must have mentioned that already, I'd like to rewrite that code path some day to do less copying + I imagine this might speed up I/O heavy workloads + ? it doesn't copy into the reply + it transfers map + it does, let me find the code + in some corner cases yes + but not normal case + https://darnassus.sceen.net/~hurd-web/hurd/io_path/ + libdiskfs/rdwr-internal.c, it does pager_memcpy, which is a glorified memcpy + fault handling + don't trust that wiki page + why not ? + not, pager_memcpy is not just a memcpy + it's using vm_copy whenever it can + i.e. map transfer + well yes, but doesn't the regular memcpy also attempt to do that? + it happens to do so indeed + but that' doesn't matter: I do mean it's trying *not* copying + by going through the mm + note: if a wiki page is bogus, propose a fix + I think there was another copy on the path somewhere (in the server, there's yet another in the client of course), but I can't quite remember where + and I wouldn't rely on that vm_copy optimization + it's may be useful when it working, but we have to design for there to not be a need to make a copy in the first place + ah well, pager_read_page does the other copy + when things are not aligned etC. you'll have to do a copy anyway + but then again, this is all my idle observations, I'm not an fs person, I haven't done any profiling, and perhaps indeed all these copies are optimized away with vm_copy + where in pager_read_page do you see a copy? + it should be doing a store_read + passing the pointer to the driver + ext2fs/pager.c:file_pager_read_page (at line 220 here, but I haven't pulled in a while) + it does do a store_read, and that returns a buffer, and then it may have to copy that into the buffer it's trying to return + though in the common case hopefully it'll read everything in a single read op + it's in the new_buf != *buf + offs case + which is not supposed to be the usual case + but now imagine how much overhead this all is + what? the ifs? + we're inside io_read, we already have a buffer where we should put the data into + I have to go give a course, gotta go + we could just device_read() into there + you also want to use a cache + otherwise it'll be the disk that'll kill yiour performance + so at some point you do have to copy from the cache to the application + that's unavoidable + or if it's large, you can vm_copy + copy-on-write + but basically, the presence of the cache means you can have to do copies + and that's far less costly than re-reading from the disk + why can't you return the cache page directly from io_read RPC? 
+ that's vm_copy, yes + but then if the app modifies the piece, you have to copy-on-write + anywauy, really gottago + that part is handled by Mach + right, so once you're back: my conclusion from looking at libfuse is that it should be rewritten, and should not be using netfs (nor diskfs), but be its own independent translator framework + and it just sounds like I'm going to be the one who is going to do it + and we could indeed use bcachefs as a testbed for the low level api, and darling-dmg for the high level api + I installed avfs from Debian (one of the few packages that depend on libfuse), and sure enough: avfs: symbol lookup error: /lib/i386-gnu/libfuse.so.1: undefined symbol: assert_perror + upstream fuse is built with Meson 🤩️ + I'm wondering whether this would be better done as a port in the upstream libfuse, or as a Hurd-specific libfuse lookalike that borrows some code from the upstream one (as now) + solid_black: what is your argument to rewrite a translator framework for fuse? + i dont understand + hi + hi + basically, 1. while the concepts of libfuse *lowlevel* api seem to match that of hurd / netfs, they seem sufficiently different to not be easily implementable on top of netfs + particularly, the async-ness of it, while netfs expects you to do everything synchronously + is that a bug in netfs? + this could be maybe made to work, by putting the netfs thread doing the request to sleep on a condition variable that would get signalled once the answer is provided via the fuse api... but I don't think that's going to be any nicer than designing for the asynchrony from the start + it's not a bug, it's just a design decision, most Hurd tranalators are structured that way + maybe you can rewrite netfs to be asynchronous and replace it + i.e.: it's rare that translators use MIG_NO_REPLY + explicit reply, it's much more common to just block the thread + 2. the current state is not "somewhat working", it's "clearly broken" + why not start by trying to implement rumpdisk async + and see what parts are missing + wdym rumpdisk async? + rumpdisk has a todo to make it asynchronous + let me find the stub + * FIXME: + * Long term strategy: + * + * Call rump_sys_aio_read/write and return MIG_NO_REPLY from + * device_read/write, and send the mig reply once the aio request has + * completed. That way, only the aio request will be kept in rumpdisk + * memory instead of a whole thread structure. + ah right, that reminds me: we still don't have proper mig support for returning errors asynchronously + if the disk driver is not asynchronous, what is the point of making the filesystem asynchronous? + the way this works, being asynchronous or not is an implementatin detail of a server + it doesn't matter to others, the RPC format is the same + there's probably not much point in asynchrony for a real disk fs like bcachefs, which must be why they don't use it and reply immediately + but imagine you're implementing an over-the-network fs with fuse, then you'd want asynchrony + what is your goal here? do you want to fix libfuse? + I don't know + I'm preparing for the call with Kent + but it looks like I'm going to have to rewrite libfuse, yes + possibly the caching is important + ie, where does it happen + maybe, yes + does fuse support mmap? 
+ idk + good q for kent + one essential fs property is coherence between mmap and r/w + so it you change a byte in an mmaped file area, a read() of that byte after that should already return the new value + same for write() + read from memory + this is why libdiskfs insists on reading/writing files via the pager and not via callbacks + I wonder how fuse deals with this + good point, no idea + does fuse really make the kernel handle O_CREAT / O_EXCL? I can't imagine how that would work without racing + guess it could be done by trying opening/creating in a loop, if creation itself is atomic, but this is not nice + something is still slowing down smp + it cant possibly be executing as fast as possible on all cores + if more cores are available to run threads, it should boot faster not slower + Hi damo22, your reasoning would hold if the kernel wouldn’t be “wasting” most of its time running in kernel mode tasks + If replacing CPU_NUMBER by a better implementation gave you a two digits improvement, that kind of implies that the kernel is indeed taking most of the cpu + yes i mean, something in the kernel is slowing down smp + What about vm_map and all thread tasks synchronization + ? + i dont understand how the scheduler can halt the APs in machine_idle() and not end up wasting time + how does anything ever run after HLT + in that code path + if the idle thread halts the processor the only way it can wake up is with an interrupt + but then, does MARK_CPU_ACTIVE() ever run? + hmm it does + I think that normally the cpu would be running scheduler code and get a thread by itself. + thats not how it works + most of the cpus are in idle_continue + then on a clock interrupt or ast interrupt, they are woken to choose a thread i think + s/choose/run + If they are in cpu_idle then that’s what happens, yea + But normally they wouldn’t be in cpu idle but running the schedule and just a thread on their own + Cpu_idle basically turns off the cpu + To save power + every time i interrupt the kernel debugger, its in cpu-idle + i dont know if it waits until it is in that state so maybe thats why + That means that there is nothing to schedule + Or yea that’s another explanation + yes, exactly i think it is seemingly running out of threads to schedule + A bug in the debugger + i need to print the number of threads in the queue + adding a show subcommand for the scheduler state would probably be useful + solid_black: btw, about copies, there's a todo in rumpdisk's rumpdisk_device_read : /* directly write at *data when it is aligned */ + youpi: indeed, that looks relevant, and wouldn't be hard to do + ideally, it should all be zero-copy (or: minimal number of copies), from the device buffer (DMA? idk how this works, can dma pages be then used as regular vm pages?) 
all the way to the data a unix process receives from read() or something like that + without "slow" memcpies, and ideally with little vm_copies too, though transferring ages in Mach messages is ok + s/ages/pages/ + read() requires ones copy purely because it writes into the provided buffer (and not returns a new one), and we don't have mach_msg_overwrite + though again one would hope vm_copy would help there + ...I do think that it'd be easier to port bcachefs over to netfs than to rewrite libfuse though + but then nothing is going to motivate me to work on libfuse + solid_black: I never work on things that don’t motivate me somehow + Btw, if you want zerocopy for IO, I think you need to do asynchronous io + At least that’s the only way for me to make sense of zerocopy + I don't think sync vs async has much to do with zero-copy-ness? w + + -- cgit v1.2.3