[[!meta copyright="Copyright © 2007, 2008, 2010, 2011 Free Software Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable id="license" text="Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled [[GNU Free Documentation License|/fdl]]."]]"""]]

[[!tag open_issue_hurd]]

The Hurd's primary filesystem is ext2, which works but lacks modern features. Ext2 has no journal, so Hurd users regularly have to deal with filesystem corruption: `fsck` can fix most of the issues (with loss of random data), but without a proper journal the Hurd is currently not a good OS for long-term data storage.

Bcachefs is a modern copy-on-write (COW) open source filesystem for Linux, which aims to replace Btrfs and ZFS while matching the performance of ext4 or XFS. It is almost 100,000 lines of code; Btrfs, for comparison, is about 150,000. Bcachefs is structured as a filesystem built on top of a database: there is a clean, small database transaction layer, and that core database library is maybe 25,000 lines of code.

Some Hurd developers recently [[talked with|https://youtube.com/watch?v=bcWsrYvc5Fg]] bcachefs author Kent Overstreet about porting bcachefs to the Hurd. There are currently no concrete plans to do so, due to lack of developer manpower.

90% of the bcachefs filesystem code builds and runs in userspace. It uses a shim layer that maps kernel locking primitives to pthreads, the kernel I/O API to AIO, and so on (see the sketch below). Bcachefs does intend to eventually rewrite most or all of its current codebase in Rust.

Kent is OK with us merging a shim layer that maps the Unix filesystem API onto libstore; that would be a header file that goes into the bcachefs code (also sketched below).

There is a somewhat working FUSE port of bcachefs, but Kent is not certain that is a good way to run bcachefs in userspace. Kent wants to use the FUSE port to help in debugging: if bcachefs starts acting up, you could switch to running it in userspace and attach GDB to the running process, which is currently not possible. Alternatively, we could port bcachefs to the Hurd's native filesystem library, libdiskfs.

One interesting aspect of the conversation was Kent's goal of re-using kernel code in userspace. The Linux kernel hashtable code is high-performance, resizable, lockless, and builds and runs in userspace: as long as you have liburcu, you can use the kernel hashtable in userspace on the Hurd (see the example below).

Bcachefs is licensed GPLv2, and many of Kent's previous employers, including Google, own the patents. Kent is OK with potentially making the license GPLv2+, as long as no promise is made to keep bcachefs GPLv2-only.
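To give an idea of what such a shim looks like, here is a minimal sketch of mapping the kernel's spinlock API onto pthreads. This is a hypothetical illustration, not the actual bcachefs-tools code, which is more elaborate:

    /* Hypothetical sketch of a userspace shim for a kernel locking
       primitive; the real shim in bcachefs-tools is more elaborate.  */
    #include <pthread.h>

    typedef struct
    {
      pthread_mutex_t lock;
    } spinlock_t;

    static inline void
    spin_lock_init (spinlock_t *l)
    {
      pthread_mutex_init (&l->lock, NULL);
    }

    static inline void
    spin_lock (spinlock_t *l)
    {
      /* In userspace there is no reason to actually spin; a mutex
         provides the same exclusion semantics.  */
      pthread_mutex_lock (&l->lock);
    }

    static inline void
    spin_unlock (spinlock_t *l)
    {
      pthread_mutex_unlock (&l->lock);
    }

Kernel code that takes and releases spinlocks can then compile unchanged against such a header.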
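The libstore shim could look similar in spirit: bcachefs would do its block I/O through a few pread/pwrite-style calls, and a Hurd-specific header would route them to libstore. A minimal sketch, assuming block-aligned offsets; `bcachefs_pread` is an invented name, while `store_read` is the real libstore entry point:

    /* Hypothetical sketch: routing pread-style block I/O to libstore.  */
    #include <hurd/store.h>
    #include <string.h>
    #include <sys/mman.h>

    static ssize_t
    bcachefs_pread (struct store *store, void *buf, size_t len, off_t offset)
    {
      /* libstore addresses are in units of store->block_size; this
         sketch assumes OFFSET is block-aligned.  */
      store_offset_t addr = offset / store->block_size;
      void *data = buf;
      size_t data_len = len;
      error_t err = store_read (store, addr, len, &data, &data_len);
      if (err)
        return -1;
      if (data != buf)
        {
          /* store_read may hand back a freshly allocated buffer.  */
          memcpy (buf, data, data_len);
          munmap (data, data_len);
        }
      return data_len;
    }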
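As for the hashtable, note that liburcu itself also ships a lockless, resizable hash table, `cds_lfht`, with essentially the same design as the kernel one. A minimal usage sketch (link flags may vary; typically `-lurcu -lurcu-cds`):

    /* Minimal sketch of liburcu's lockless, resizable hash table.  */
    #include <urcu.h>
    #include <urcu/rculfhash.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct entry
    {
      int key;
      struct cds_lfht_node node;
    };

    static int
    match (struct cds_lfht_node *node, const void *key)
    {
      struct entry *e = caa_container_of (node, struct entry, node);
      return e->key == *(const int *) key;
    }

    int
    main (void)
    {
      rcu_register_thread ();
      struct cds_lfht *ht
        = cds_lfht_new (1, 1, 0, CDS_LFHT_AUTO_RESIZE, NULL);

      struct entry *e = malloc (sizeof *e);
      e->key = 42;
      cds_lfht_node_init (&e->node);

      /* Additions and lookups run under the RCU read lock; the "hash"
         here is just the key itself, for brevity.  */
      rcu_read_lock ();
      cds_lfht_add (ht, (unsigned long) e->key, &e->node);
      rcu_read_unlock ();

      int key = 42;
      struct cds_lfht_iter iter;
      rcu_read_lock ();
      cds_lfht_lookup (ht, (unsigned long) key, match, &key, &iter);
      if (cds_lfht_iter_get_node (&iter))
        printf ("found %d\n", key);
      rcu_read_unlock ();

      rcu_unregister_thread ();
      return 0;
    }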
# IRC logs

https://logs.guix.gnu.org/hurd/2023-09-26.log

    maybe I'm wrong though, do you know much about fuse? or file systems?
    no i dont know much about filesystems
    what is bcachefs?
    see? :D
    I agree that someone intimate with the Mach pager api, libdiskfs and fuse would be great at that meeting
    I do kind of understand Mach VM / paging, I must say
    from the looks of it, I even understand it best among those who have looked at it recently
    and I mostly understand libdiskfs
    so go to the meeting
    what is fuse?
    do we even need it for hurd?
    file systems in userspace
    FUSE is "filesystem in user space", it's both the name for the concept, and the name of Linux's specific mechanism, of offloading fs to userland
    yeah, i think it may be unneeded for filesystems on hurd
    it's basically a giant hack that pretends to be a filesystem implementation to the rest of the kernel, and then sends requests and receives responses from a userland program that _actually_ implements the fs
    on the Hurd, *of course* filesystems are implemented in userland, that's the only and the natural way everything works
    but that's where the similarities end
    you cannot just take a linux fuse fs, using libfuse, and run it on the Hurd
    there has been a project to make a library that would have the same API as libfuse, but act as a Hurd translator, specifically to facilitate porting linux filesystems
    i imagine fuse has an api
    last I heard, it was never completed, but who knows
    it has a kernel <-> userland protocol and a userspace library (libfuse) for implementing that protocol, yes
    solid_black: you seem to know more about fuse than you admitted
    https://www.gnu.org/software/hurd/hurd/libfuse.html
    I know the basics, around as much as I have just told you
    I think that gnucode's idea was that this would be the easiest way to port bcachefs to the Hurd, but I doubt it would be the best
    I have also hacked on a C++ fuse fs (darling-dmg), though I don't think I interacted with the fuse parts of it much
    or even the easiest
    yeah, I don't think it'd be the best or the easiest one either
    if someone implemented the libfuse api and made it as a hurd translator, surely it would work natively?
    zacts: the main problem seems to be the interactions between the fuse file system and virtual memory (including caching)
    something the hurd doesn't excel at
    it *may* be possible to find existing userspace implementations that don't use the system cache (e.g. implement their own)
    Yes, that's a possibility that needs to be kept open for discussion
    Sounds interesting
    youpi: ping
    pong
    hello!
    any thoughts on the above discussion? are you going to participate in the call that's being set up?
    I don't have time for it
    (AFAIK the fuse hurd implementation does work to some extent)
    I should at least try out Hurd's fuse before the call, good idea
    maybe read up on Linux's fuse
    thoughts on using fuse vs libdiskfs for bcachefs?
    using fuse would probably be less work
    and it'd probably mean fixing things in libfuse, which can benefit many other FS anyway
    is it true that the "low level" API of libfuse is unimplemented and unimplementable?
    I don't know what that "low level" API is
    this IIUC: https://github.com/libfuse/libfuse/blob/master/include/fuse_lowlevel.h
    > libfuse offers two APIs: a "high-level", synchronous API, and a "low-level" asynchronous API. In both cases, incoming requests from the kernel are passed to the main program using callbacks. When using the high-level API, the callbacks may work with file names and paths instead of inodes, and processing of a request finishes when the callback function returns. When using the low-level API, the callbacks must work with inodes and responses must be sent explicitly using a separate set of API functions.
    where did you read that it'd be unimplementable ?
    https://git.savannah.gnu.org/cgit/hurd/incubator.git/tree/README?h=libfuse/master
    > This is simply because it is to specific to the Linux kernel and (besides that) it is not farly used now. In case the latter should change in the future, we might want to re-think about that issue though.
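To make the two-API distinction quoted above concrete: with the low-level API, a callback receives a request handle and must answer it explicitly with one of the `fuse_reply_*` functions, which is what allows (but does not require) replying asynchronously from another thread. A minimal sketch against the Linux libfuse 3 API:

    /* Minimal low-level libfuse example: replies are sent explicitly
       via fuse_reply_*, not by returning a value.  */
    #define FUSE_USE_VERSION 34
    #include <fuse_lowlevel.h>
    #include <sys/stat.h>
    #include <string.h>
    #include <errno.h>

    static void
    hello_getattr (fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
    {
      struct stat st;
      memset (&st, 0, sizeof st);
      if (ino != FUSE_ROOT_ID)
        {
          fuse_reply_err (req, ENOENT);  /* explicit error reply */
          return;
        }
      st.st_ino = ino;
      st.st_mode = S_IFDIR | 0755;
      st.st_nlink = 2;
      fuse_reply_attr (req, &st, 1.0);   /* explicit reply, could be deferred */
    }

    static const struct fuse_lowlevel_ops hello_ops = {
      .getattr = hello_getattr,
    };

    int
    main (int argc, char *argv[])
    {
      struct fuse_args args = FUSE_ARGS_INIT (argc, argv);
      struct fuse_session *se
        = fuse_session_new (&args, &hello_ops, sizeof hello_ops, NULL);
      if (argc < 2 || se == NULL || fuse_session_mount (se, argv[1]) != 0)
        return 1;
      int err = fuse_session_loop (se);  /* dispatch requests */
      fuse_session_unmount (se);
      fuse_session_destroy (se);
      return err;
    }

As noted below, bcachefs-tools is written against this low-level API but always replies inside the callback, so it does not actually exploit the asynchrony.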
    so, sounds like it's perhaps implementable in theory, but that'd require additional work and design
    see the sentence below... the low-level API is what bcachefs uses
    well, additional work and design, of course
    seems to, at least, from a quick glance
    any async API needs some
    but I don't see why it would not be possible
    mig precisely supports asynchronous stubs
    bcachefs-tools/cmd_fusermount.c is just 1274 lines, which inspires some hope
    asynchrony is not the problem, I imagine (but I haven't looked), but being too tied to Linux might be
    it's not really tied, as in it doesn't seem to use linux-specific functions
    but it uses linux-like notions, which indeed need to be translated to the hurdish notions
    but that's not something really tough
    just needs to be worked on

https://logs.guix.gnu.org/hurd/2023-09-27.log#103329

    libfuse as shipped in Debian doesn't seem very functional, I can't even build a simple program against it: 'i386-gnu/libfuse.so: undefined reference to `assert'' (assert is of course a macro in glibc)
    and it segfaults in fuse_main_real
    low-level fuse ops do seem to map to netfs concepts nicely, as far as I can see so far
    and (again, so far) I don't see any asynchrony in how bcachefs uses fuse, i.e. they always fuse_reply() inside the method implementation
    but if we had to implement the low-level fuse API, this would be an issue, because netfs is synchronous
    this is again a place where I don't think netfs is actually that useful
    libfuse should be its own standalone translator library, a peer to lib{disk,net,triv}fs
    yell at me if you disagree
    or perhaps make it use libdiskfs ?
    there's significant code in libdiskfs that you'd probably not want to reimplement in libfuse
    like what?
    starting a translator
    all the posix semantic bits
    (this is another thing, I don't believe there is a significant difference that explains libdiskfs and libnetfs being two separate libraries. but it's too late to merge them, and I'm not an fs dev)
    starting a translator is abstracted into libfshelp specifically so it can be easily reused?
    is libdiskfs synchronous?
    I'm just saying things out of my memory
    scratch that, diskfs does not work like that at all
    a piece of it is in fshelp, yes
    it works on pagers, always
    but significant pieces are in libdiskfs too
    and you are saying you are not an FS person :)
    you do know libdiskfs etc. well beyond the average
    perhaps not the ext2 FS structure, but that's not really important here
    see e.g. the short-circuits in file-get-trans.c
    I may understand how the Hurd's translator libraries work, somewhat better than the average person :)
    and the code around fshelp_fetch_root
    but I don't know about how filesystems are actually organized, on-disk (beyond the basics: that there are inodes and superblocks and journaled writes and btrees etc)
    you don't really need to know more about that
    nor do I know the million little things about how filesystem code should be written to be robust and performant
    yeah
    so as I was saying, libdiskfs expects files to be mappable (diskfs_get_filemap_pager_struct), and then all I/O is implemented on top of that
    e.g. to read, libdiskfs queries that pager from the impl, maps it into memory, and copies data from there to the reply message
    I must have mentioned that already, I'd like to rewrite that code path some day to do less copying
    I imagine this might speed up I/O heavy workloads ?
    it doesn't copy into the reply
    it transfers the map
    it does, let me find the code
    in some corner cases yes
    but not in the normal case
    https://darnassus.sceen.net/~hurd-web/hurd/io_path/
    libdiskfs/rdwr-internal.c, it does pager_memcpy, which is a glorified memcpy + fault handling
    don't trust that wiki page
    why not ?
    no, pager_memcpy is not just a memcpy
    it's using vm_copy whenever it can
    i.e. map transfer
    well yes, but doesn't the regular memcpy also attempt to do that?
    it happens to do so indeed
    but that doesn't matter: I do mean it's trying *not* to copy, by going through the mm
    note: if a wiki page is bogus, propose a fix
    I think there was another copy on the path somewhere (in the server, there's yet another in the client of course), but I can't quite remember where
    and I wouldn't rely on that vm_copy optimization
    it may be useful when it's working, but we have to design for there to not be a need to make a copy in the first place
    ah well, pager_read_page does the other copy, when things are not aligned etc.
    you'll have to do a copy anyway
    but then again, these are all my idle observations, I'm not an fs person, I haven't done any profiling, and perhaps indeed all these copies are optimized away with vm_copy
    where in pager_read_page do you see a copy?
    it should be doing a store_read passing the pointer to the driver
    ext2fs/pager.c:file_pager_read_page (at line 220 here, but I haven't pulled in a while)
    it does do a store_read, and that returns a buffer, and then it may have to copy that into the buffer it's trying to return
    though in the common case hopefully it'll read everything in a single read op
    it's in the new_buf != *buf + offs case
    which is not supposed to be the usual case
    but now imagine how much overhead this all is
    what? the ifs?
    we're inside io_read, we already have a buffer where we should put the data into
    I have to go give a course, gotta go
    we could just device_read() into there
    you also want to use a cache
    otherwise it'll be the disk that'll kill your performance
    so at some point you do have to copy from the cache to the application
    that's unavoidable
    or if it's large, you can vm_copy + copy-on-write
    but basically, the presence of the cache means you can have to do copies
    and that's far less costly than re-reading from the disk
    why can't you return the cache page directly from the io_read RPC?
    that's vm_copy, yes
    but then if the app modifies the piece, you have to copy-on-write anyway
    really gottago
    that part is handled by Mach
    right
    so once you're back: my conclusion from looking at libfuse is that it should be rewritten, and should not be using netfs (nor diskfs), but be its own independent translator framework
    and it just sounds like I'm going to be the one who is going to do it
    and we could indeed use bcachefs as a testbed for the low level api, and darling-dmg for the high level api
    I installed avfs from Debian (one of the few packages that depend on libfuse), and sure enough: avfs: symbol lookup error: /lib/i386-gnu/libfuse.so.1: undefined symbol: assert_perror
    upstream fuse is built with Meson 🤩️
    I'm wondering whether this would be better done as a port in the upstream libfuse, or as a Hurd-specific libfuse lookalike that borrows some code from the upstream one (as now)
    solid_black: what is your argument to rewrite a translator framework for fuse? i dont understand
    hi
    hi
    basically: 1. while the concepts of the libfuse *lowlevel* api seem to match those of hurd / netfs, they seem sufficiently different to not be easily implementable on top of netfs
    particularly, the async-ness of it, while netfs expects you to do everything synchronously
    is that a bug in netfs?
    this could maybe be made to work, by putting the netfs thread doing the request to sleep on a condition variable that would get signalled once the answer is provided via the fuse api... but I don't think that's going to be any nicer than designing for the asynchrony from the start
    it's not a bug, it's just a design decision, most Hurd translators are structured that way
    maybe you can rewrite netfs to be asynchronous and replace it
    i.e.: it's rare that translators use MIG_NO_REPLY + explicit reply, it's much more common to just block the thread
    2. the current state is not "somewhat working", it's "clearly broken"
    why not start by trying to implement rumpdisk async and see what parts are missing
    wdym rumpdisk async?
    rumpdisk has a todo to make it asynchronous
    let me find the stub
    /* FIXME:
     * Long term strategy:
     *
     * Call rump_sys_aio_read/write and return MIG_NO_REPLY from
     * device_read/write, and send the mig reply once the aio request has
     * completed. That way, only the aio request will be kept in rumpdisk
     * memory instead of a whole thread structure. */
    ah right, that reminds me: we still don't have proper mig support for returning errors asynchronously
    if the disk driver is not asynchronous, what is the point of making the filesystem asynchronous?
    the way this works, being asynchronous or not is an implementation detail of a server
    it doesn't matter to others, the RPC format is the same
    there's probably not much point in asynchrony for a real disk fs like bcachefs, which must be why they don't use it and reply immediately
    but imagine you're implementing an over-the-network fs with fuse, then you'd want asynchrony
    what is your goal here? do you want to fix libfuse?
    I don't know
    I'm preparing for the call with Kent
    but it looks like I'm going to have to rewrite libfuse, yes, possibly
    the caching is important
    ie, where does it happen
    maybe, yes
    does fuse support mmap?
    idk, good q for kent
    one essential fs property is coherence between mmap and r/w
    so if you change a byte in an mmaped file area, a read() of that byte after that should already return the new value
    same for write() + read from memory
    this is why libdiskfs insists on reading/writing files via the pager, and not via callbacks
    I wonder how fuse deals with this
    good point, no idea
    does fuse really make the kernel handle O_CREAT / O_EXCL?
    I can't imagine how that would work without racing
    guess it could be done by trying opening/creating in a loop, if creation itself is atomic, but this is not nice
    something is still slowing down smp
    it can't possibly be executing as fast as possible on all cores
    if more cores are available to run threads, it should boot faster, not slower
    Hi damo22, your reasoning would hold if the kernel weren't "wasting" most of its time running in kernel mode tasks
    If replacing CPU_NUMBER by a better implementation gave you a two-digit improvement, that kind of implies that the kernel is indeed taking most of the cpu
    yes
    i mean, something in the kernel is slowing down smp
    What about vm_map and all the thread/task synchronization ?
    i dont understand how the scheduler can halt the APs in machine_idle() and not end up wasting time
    how does anything ever run after HLT in that code path?
    if the idle thread halts the processor, the only way it can wake up is with an interrupt
    but then, does MARK_CPU_ACTIVE() ever run?
    hmm, it does
    I think that normally the cpu would be running scheduler code and get a thread by itself.
    thats not how it works
    most of the cpus are in idle_continue
    then on a clock interrupt or ast interrupt, they are woken to run a thread i think
    If they are in cpu_idle then that's what happens, yea
    But normally they wouldn't be in cpu_idle, but running the scheduler and just a thread on their own
    Cpu_idle basically turns off the cpu
    To save power
    every time i interrupt the kernel debugger, its in cpu_idle
    i dont know if it waits until it is in that state, so maybe thats why
    That means that there is nothing to schedule
    Or yea, that's another explanation
    yes, exactly
    i think it is seemingly running out of threads to schedule
    A bug in the debugger
    i need to print the number of threads in the queue
    adding a show subcommand for the scheduler state would probably be useful
    solid_black: btw, about copies, there's a todo in rumpdisk's rumpdisk_device_read: /* directly write at *data when it is aligned */
    youpi: indeed, that looks relevant, and wouldn't be hard to do
    ideally, it should all be zero-copy (or: a minimal number of copies), from the device buffer (DMA? idk how this works, can dma pages be then used as regular vm pages?) all the way to the data a unix process receives from read()
    or something like that
    without "slow" memcpies, and ideally with few vm_copies too, though transferring pages in Mach messages is ok
    read() requires one copy purely because it writes into the provided buffer (and not returns a new one), and we don't have mach_msg_overwrite
    though again one would hope vm_copy would help there
    ...I do think that it'd be easier to port bcachefs over to netfs than to rewrite libfuse though
    but then nothing is going to motivate me to work on libfuse
    solid_black: I never work on things that don't motivate me somehow
    Btw, if you want zerocopy for IO, I think you need to do asynchronous io
    At least that's the only way for me to make sense of zerocopy
    I don't think sync vs async has much to do with zero-copy-ness?
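Two footnotes on the discussion above. First, `vm_copy` is the Mach primitive that pager_memcpy uses for page-aligned ranges: logically a memcpy, but implemented in the VM system by remapping the destination pages copy-on-write rather than touching the data. A small standalone illustration:

    /* Illustration of Mach's vm_copy: for page-aligned ranges the data
       is not copied eagerly; destination pages are remapped
       copy-on-write.  */
    #include <mach.h>
    #include <stdio.h>

    int
    main (void)
    {
      vm_address_t src = 0, dst = 0;
      vm_size_t size = vm_page_size * 4;

      vm_allocate (mach_task_self (), &src, size, TRUE);
      vm_allocate (mach_task_self (), &dst, size, TRUE);
      ((char *) src)[0] = 'x';

      /* Logically equivalent to memcpy ((void *) dst, (void *) src, size).  */
      kern_return_t kr = vm_copy (mach_task_self (), src, size, dst);
      printf ("vm_copy: %d, dst[0] = '%c'\n", kr, ((char *) dst)[0]);
      return 0;
    }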
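Second, the rumpdisk FIXME quoted earlier describes the standard Mach pattern for asynchronous servers: the server routine returns MIG_NO_REPLY, keeps the reply port, and sends the reply itself once the operation completes, so no server thread stays blocked. A schematic sketch; `aio_state`, `start_async_read`, and `send_read_reply` are invented names, and in rumpdisk the reply would go through the MIG-generated device reply routine:

    /* Schematic of the MIG_NO_REPLY pattern; not real rumpdisk code.  */
    #include <mach.h>
    #include <mach/mig_errors.h>
    #include <stdlib.h>

    struct aio_state
    {
      mach_port_t reply_port;
      mach_msg_type_name_t reply_type;
    };

    /* Hypothetical: kick off the asynchronous read, e.g. via
       rump_sys_aio_read.  */
    extern void start_async_read (struct aio_state *st);
    /* Hypothetical: the MIG-generated reply stub for device_read.  */
    extern void send_read_reply (struct aio_state *st, kern_return_t err);

    kern_return_t
    S_example_device_read (mach_port_t reply_port,
                           mach_msg_type_name_t reply_type)
    {
      struct aio_state *st = malloc (sizeof *st);
      st->reply_port = reply_port;
      st->reply_type = reply_type;
      start_async_read (st);
      /* Tell MIG not to send a reply now; only ST stays in memory,
         not a blocked server thread.  */
      return MIG_NO_REPLY;
    }

    /* Later, from the aio completion path: */
    void
    read_completed (struct aio_state *st, kern_return_t err)
    {
      send_read_reply (st, err);
      free (st);
    }

This is also why, as noted in the log, being asynchronous or not is an implementation detail of the server: the RPC format on the wire stays the same.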