[[!meta copyright="Copyright © 2012, 2013, 2014 Free Software Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable id="license" text="Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled [[GNU Free Documentation License|/fdl]]."]]"""]]

[[!tag open_issue_documentation open_issue_gnumach]]

[[!toc]]

# Deficiencies

## IRC, freenode, #hurd, 2012-06-29

I do not understand what are the deficiencies of Mach, the content I find on this is vague... the major problems are that the IPC architecture offers poor performance; and that resource usage can not be properly accounted to the right parties antrik: the more i study it, the more i think ipc isn't the problem when it comes to performance, not directly i mean, the implementation is a bit heavy, yes, but it's fine the problems are resource accounting/scheduling and still too much stuff inside kernel space and with a very good implementation, the performance problem would come from crossing address spaces (and even more on SMP, i've been thinking about it lately, since it would require syncing mmu state on each processor currently using an address space being modified) braunr: the problem with Mach IPC is that it requires too many indirections to ever be performant AIUI antrik: can you mention them ?
the semantics are generally quite complex, compared to Coyotos for example, or even Viengoos antrik: the semantics are related to the message format, which can be simplified i think everybody agrees on that i'm more interested in the indirections but then it's not Mach IPC anymore :-) right 22:03 < braunr> i mean, the implementation is a bit heavy, yes, but it's fine that's not an implementation issue that's what i meant by heavy :) well, yes and no Mach IPC have changed over time it would be newer Mach IPC ... :) the fact that data types are (supposed to be) transparent to the kernel is a major part of the concept, not just an implementation detail but it's not just the message format transparent ? but they're not :/ the option to buffer in the kernel also adds a lot of complexity buffer in the kernel ? ah you mean message queues yes braunr: eh? the kernel parses all the type headers during transfer yes, so it's not transparent at all maybe you have a different understanding of "transparent" ;-) i guess I think most of the other complex semantics are kinda related to the in-kernel buffering... i fail to see why :/ well, it allows ports rights to be destroyed while a message is in transfer. a lot of semantics revolve around what happens in that case yes but it doesn't affect performance a lot sure it does. it requires a lot of extra code and indirections not a lot of it "a lot" is quite a relative term :-) compared to L4 for example, it *is* a lot and those indirections (i think you refer to more branching here) are taken only when appropriate, and can be isolated, improved through locality, etc.. the features they add are also huge L4 is clearly insufficient all current L4 forks have added capabilities .. (that, with the formal verification, make se4L one of the "hottest" recent system projects) seL4* yes, but with very few extra indirection I think... 
similar to EROS (which claims to have IPC almost as efficient as the original L4) possibly I still fail to see much real benefit in formal verification :-) but compared to other problems, this added code is negligible antrik: for a microkernel, me too :/ the kernel is already so small you can simply audit it :) no, it's not neglible, if you go from say two cache lines touched per IPC (original L4) to dozens (Mach) every additional variable that needs to be touched to resolve some indirection, check some condition adds significant overhead if you compare the dozens to the huge amount of inter processor interrupt you get each time you change the kernel map, it's next to nothing .. change the kernel map? not sure what you mean syncing address spaces on hundreds of processors each time you send a message is a real scalability issue here (as an example), where Mach to L4 IPC seem like microoptimization braunr: modify, you mean? yes (not switchp ) but that's only one example yes, modify, not switch also, we could easily get rid of the ihash library making the message provide the address of the object associated to a receive right so the only real indirection is the capability, like in other systems, and yes, buffering adds a bit of complexity there are other optimizations that could be made in mach, like merging structures to improve locality "locality"? having rights close to their target port when there are only a few pinotree: locality of reference for cache efficiency hundreds of processors? let's stay realistic here :-) i am .. a microkernel based system is also a very good environment for RCU (i yet have to understand how liburcu actually works on linux) I'm not interested in systems for supercomputers. and I doubt desktop machines will get that many independant cores any time soon. we still lack software that could even romotely exploit that hum, the glibc build system ? 
:> lol we have done a survey over the nix linux distribution quite few packages actually benefit from a lot of cores and we already know them :) what i'm trying to say is that, whenever i think or even measure system performance, both of the hurd and others, i never actually see the IPC as being the real performance problem there are many other sources of overhead to overcome before getting to IPC I completely agree and with the advent of SMP, it's even more important to focus on contention (also, 8 cores aren't exactly a lot...) antrik: s/8/7/ , or even 6 ;) braunr: it depends a lot on the use case. most of the problems we see in the Hurd are probably not directly related to IPC performance; but I pretty sure some are (such as X being hardly usable with UNIX domain sockets) antrik: these have more to do with the way mach blocks than IPC itself similar to the ext2 "sleep storm" a lot of overhead comes from managing ports (for for example), which also mostly comes down to IPC performance antrik: yes, that's the main indirection antrik: but you need such management, and the related semantics in the kernel interface (although i wonder if those should be moved away from the message passing call) you mean a different interface for kernel calls than for IPC to other processes? that would break transparency in a major way. not sure we really want that... antrik: no antrik: i mean calls specific to right management admittedly, transparency for port management is only useful in special cases such as rpctrace, and that probably could be served better with dedicated debugging interfaces... antrik: i.e. not passing rights inside messages passing rights inside messages is quite essential for a capability system. the problem with Mach IPC in regard to that is that the message format allows way more flexibility than necessary in that regard... 
antrik: right antrik: i don't understand why passing rights inside messages is important though antrik: essential even braunr: I guess he means you need at least one way to pass rights braunr: well, for one, you need to pass a reply port with each RPC request... youpi: well, as he put, the message passing call is overpowered, and this leads to many branches in the code antrik: the reply port is obvious, and can be optimized antrik: but the case i worry about is passing references to objects between tasks antrik: rights and identities with the auth server for example antrik: well ok forget it, i just recall how it actually works :) antrik: don't forget we lack thread migration antrik: you may not think it's important, but to me, it's a major improvement for RPC performance braunr: how can seL4 be the most interesting microkernel then?... ;-) antrik: hm i don't know the details, but if it lacks thread migration, something is wrong :p antrik: they should work on viengoos :) (BTW, AIUI thread migration is quite related to passive objects -- something Hurd folks never dared seriously consider...) i still don't know what passive objects are, or i have forgotten it :/ no own control threads hm, i'm still missing something what do you refer to by control thread ? with* i.e. no main loop etc.; only activated by incoming calls ok well, if i'm right, thomas bushnel himself wrote (recently) that the ext2 "sleep" performance issue was expected to be solved with thread migration so i guess they definitely considered having it braunr: don't know what the "sleep peformance issue" is... http://lists.gnu.org/archive/html/bug-hurd/2011-12/msg00032.html antrik: also, the last message in the thread, http://lists.gnu.org/archive/html/bug-hurd/2011-12/msg00050.html antrik: do you consider having a reply port being an avoidable overhead ? braunr: not sure. I don't remember hearing of any capability system doing this kind of optimisation though; so I guess there are reasons for that... 
antrik: yes me too, even more since neal talked about it on viengoos I wonder whether thread management is also such a large overhead with fully sync IPC, on L4 or EROS for example... antrik: it's still a very handy optimization for thread scheduling antrik: it makes solving priority inversions a lot easier actually, is thread scheduling a problem at all with a thread activation approach like in Viengoos? antrik: thread activation is part of thread migration antrik: actually, i'd say they both refer to the same thing err... scheduler activation was the term I wanted to use same well scheduler activation is too vague to assert that antrik: do you refer to scheduler activations as described in http://en.wikipedia.org/wiki/Scheduler_activations ? my understanding was that Viengoos still has traditional threads; they just can get scheduled directly on incoming IPC braunr: that Wikipedia article is strange. it seems to use "scheduler activations" as a synonym for N:M multithreading, which is not at all how I understood it antrik: I used to try to keep a look at those pages, to fix such wrong things, but left it antrik: that's why i ask IIRC Viengoos has a thread associated with each receive buffer. after copying the message, the kernel would activate the processes activation handler, which in turn could decide to directly schedule the thead associated with the buffer or something along these lines antrik: that's similar to mach handoff antrik: generally enough, all the thread-related pages on wikipedia are quite bogus nah, handoff just schedules the process; which is not useful, if the right thread isn't activated in turn... antrik: but i think it's more than that, even in viengoos for instance, the french "thread" page was basically saying that they were invented for GUIs to overlap computation with user interaction .. :) youpi: good to know... 
antrik: the "misunderstanding" comes from the fact that scheduler activations is the way N:M threading was implemented on netbsd youpi: that's a refreshing take on the matter... ;-) antrik: i'll read the critique and viengoos doc/source again to be sure about what we're talking :) antrik: as threading is a major issue in mach, and one of the things i completely changed (and intend to change) in x15, whenever i get to work on that again ..... :) antrik: interestingly, the paper about scheduler activations was written (among others) by brian bershad, in 92, when he was actively working on research around mach braunr: BTW, I have little doubt that making RPC first-class would solve a number of problems... I just wonder how many others it would open

# X15

## IRC, freenode, #hurd, 2012-09-04

it was intended as a mach clone, but now that i have better knowledge of both mach and the hurd, i don't want to retain mach compatibility and unlike viengoos, it's not really experimental it's focused on memory and cpu scalability, and performance, with techniques likes thread migration and rcu the design i have in mind is closer to what exists today, with strong emphasis on scalability and performance, that's all and the reason the hurd can't be modified first is that my design relies on some important design changes so there is a strong dependency on these mechanisms that requires the kernel to exists first

## IRC, freenode, #hurd, 2012-09-06

In context of [[open_issues/multithreading]] and later [[open_issues/select]].

And you will address the design flaws or implementation faults with x15?
no i'll address the implementation details :p and some design issues like cpu and memory resource accounting but i won't implement generic resource containers assuming it's completed, my work should provide a hurd system on par with modern monolithic systems (less performant of course, but performant, scalable, and with about the same kinds of problems) for example, thread migration should be mandatory which would make client calls behave exactly like a userspace task asking a service from the kernel you have to realize that, on a monolithic kernel, applications are clients, and the kernel is a server and when performing a system call, the calling thread actually services itself by running kernel code which is exactly what thread migration is for a multiserver system thread migration also implies sync IPC and sync IPC is inherently more performant because it only requires one copy, no in kernel buffering sync ipc also avoids message floods, since client threads must run server code and this is not achievable with evolved gnumach and/or hurd? 
well that's not entirely true, because there is still a form of async ipc, but it's a lot less likely it probably is but there are so many things to change i prefer starting from scratch scalability itself probably requires a revamp of the hurd core libraries and these libraries are like more than half of the hurd code mach ipc and vm are also very complicated it's better to get something new and simpler from the start a major task nevertheless:-D at least with the vm, netbsd showed it's easier to achieve good results from new code, as other mach vm based systems like freebsd struggled to get as good well yes but at least it's not experimental everything i want to implement already exists, and is tested on production systems it's just time to assemble those ideas and components together into something that works you could see it as a qnx-like system with thread migration, the global architecture of the hurd, and some improvements from linux like rcu :)

### IRC, freenode, #hurd, 2012-09-07

braunr: thread migration is tested on production systems? BTW, I don't think that generally increasing the priority of servers is a good idea in most cases, IPC should actually be sync. slpz looked at it at some point, and concluded that the implementation actually has a fast-path for that case. I wonder what happens to scheduling in this case -- is the receiver sheduled immediately? if not, that's something to fix...
antrik: qnx does something very close to thread migration, yes antrik: i agree increasing the priority isn't a good thing, but it's the best of the quick and dirty ways to reduce message floods the problem isn't sync ipc in mach the problem is the notifications (in our cases the dead name notifications) that are by nature async and a malicious program could send whatever it wants at the fastest rate it can braunr: malicious programs can do any number of DOS attacks on the Hurd; I don't see how increasing priority of system servers is relevant in that context (BTW, I don't think dead name notifications are async by nature... just like for most other IPC, the *usual* case is that a server thread is actively waiting for the message when it's generated) antrik: it's async with respect to the client antrik: and malicious programs shouldn't be able to do that kind of dos but this won't be fixed any time soon on the other hand, a higher priority helps servers not create too many threads because of notifications, and that's a good thing gnu_srs: the "fix" for this will be to rewrite select so that it's synchronous btw replacing dead name notifications with something like cancelling a previously installed select request no idea what "async with respect to the client" means it means the client doesn't wait for anything what is the client? what scenario are you talking about? how does it affect scheduling? for notifications, it's usually the kernel it doesn't directly affect scheduling it affects the amount of messages a hurd server has to take care of and the more messages, the more threads i'm talking about event loops and non blocking (or very short) selects the amount of messages is always the same. the question is whether they can be handled before more come in. which would be the case if be default the receiver gets scheduled as soon as a message is sent... 
no scheduling handoff doesn't imply the thread will be ready to service the next message by the time a client sends a new one the rate at which a message queue gets filled has nothing to do with scheduling handoff I very much doubt rates come into play at all well they do in my understanding the problem is that a lot of messages are sent before the receive ever has a chance to handle them. so no matter how fast the receiver is, it looses a lot of non blocking selects means a lot of reply ports destroyed, a lot of dead name notifications, and what i call message floods at server side no it used to work fine with cthreads it doesn't any more with pthreads because pthreads are slightly slower if the receiver gets a chance to do some work each time a message arrives, in most cases it would be free to service the next request with the same thread no, because that thread won't have finished soon enough no, it *never* worked fine. it might have been slighly less terrible. ok it didn't work fine, it worked ok it's entirely a matter of rate here and that's the big problem, because it shouldn't I'm pretty sure the thread would finish before the time slice ends in almost all cases no too much contention and in addition locking a contended spin lock depresses priority so servers really waste a lot of time because of that I doubt contention would be a problem if the server gets a chance to handle each request before 100 others come in i don't see how this is related handling a request doesn't mean entirely processing it there is *no* relation between handoff and the rate of incoming message rate unless you assume threads can always complete their task in some fixed and low duration sure there is. we are talking about a single-processor system here. 
which is definitely not the case i don't see what it changes I'm pretty sure notifications can generally be handled in a very short time if the server thread is scheduled as soon as it gets a message, it can also get preempted by the kernel before replying no, notifications can actually be very long hurd_thread_cancel calls condition_broadcast so if there are a lot of threads on that .. (this is one of the optimizations i have in mind for pthreads, since it's possible to precisely select the target thread with a doubly linked list) but even if that's the case, there is no guarantee you can't assume it will be "quick enough" there is no guarantee. but I'm pretty sure it will be "quick enough" in the vast majority of cases. which is all it needs. ok that's also the idea behind raising server priorities braunr: so you are saying the storms are all caused by select(), and once this is fixed, the problem should be mostly gone and the workaround not necessary anymore? yes let's hope you are right :-) :) (I still think though that making hand-off scheduling default is the right thing to do, and would improve performance in general...) sure well no it's just a hack ;p but it's a right one the right thing to do is a lot more complicated as roland wrote a long time ago, the hurd doesn't need dead-name notifications, or any notification other than the no-sender (which can be replaced by a synchronous close on fd like operation) well, yes... I still think the viengoos approach is promising. I meant the right thing to do in the existing context ;-) better than this priority hack oh? you happen to have a link? never heard of that... 
i didn't want to do it initially, even resorting to priority depression on trhead creation to work around the problem hm maybe it wasn't him, i can't manage to find it antrik: http://lists.gnu.org/archive/html/l4-hurd/2003-09/msg00009.html "Long ago, in specifying the constraints of what the Hurd needs from an underlying IPC system/object model we made it very clear that we only need no-senders notifications for object implementors (servers)" "We don't in general make use of dead-name notifications, which are the general kind of object death notification Mach provides and what serves as task death notification." "In the places we do, it's to serve some particular quirky need (and mostly those are side effects of Mach's decouplable RPCs) and not a semantic model we insist on having."

### IRC, freenode, #hurd, 2012-09-08

The notion that seemed appropriate when we thought about these issues for Fluke was that the "alert" facility be a feature of the IPC system itself rather than another layer like the Hurd's io_interrupt protocol. braunr: funny, that's *exactly* what I was thinking when looking at the io_interrupt mess :-) (and what ultimately convinced me that the Hurd could be much more elegant with a custom-tailored kernel rather than building around Mach)

## IRC, freenode, #hurd, 2012-09-24

my initial attempt was a mach clone but now i want a mach-like kernel, without compability which new licence ? and some very important changes like sync ipc gplv3 (or later) cool 8) yes it is gplv2+ since i didn't take the time to read gplv3, but now that i have, i can't use anything else for such a project: ) what is mach-like ? (how it is different from Pistachio like ?) l4 doesn't provide capabilities hmmm.. you need a userspace for that +server and it relies on complete external memory management how much work is done ?
my kernel will provide capabilities, similar to mach ports, but simpler (less overhead) i want the primitives right like multiprocessor, synchronization, virtual memory, etc..

### IRC, freenode, #hurd, 2012-09-30

for those interested, x15 is now a project of its own, with no gnumach compability goal, and covered by gplv3+

### IRC, freenode, #hurd, 2012-12-31

bits of news about x15: it can now create tasks, threads, vm_maps, physical maps (cpu-specific page tables) for user tasks, and stack tracing (in addition to symbol resolution when symbols are available) were recently added

### IRC, freenode, #hurd, 2013-01-15

Anarchos: as a side note, i'm currently working on a hurd clone with a microkernel that takes a lot from mach but also completely changes the ipc interface (making it not mach at all in the end) it's something between mach and qnx neutrino braunr: do you have a git repo of your new clone? http://git.sceen.net/rbraun/x15.git/ neat it's far from complete and hasn't reached a status where it can be publically announced ok but progress has been constant so far, the ideas i'm using have proven applicable on other systems, i don't expect the kind of design issues that blocked HurdNG (also, my attempt doesn't aim at the same goals as hurdng did) (e.g. denial of service remains completely possible) so x15 will use the current hurd translators? you are only replacing gnumach? that was the plan some years ago, but now that i know the hurd better, i think the main issues are in the hurd, so there isn't much point rewriting mach so, if the hurd needs a revamp, it's better to also make the underlying interface better if possible zacts: in other words: it's a completely different beast ok the main goal is to create a hurd-like system that overcomes the current major defficiencies, most of them being caused by old design decisions like async ipc? yes time for a persistent hurd ?
:) no way i don't see a point to persistence for a general purpose system and it easily kills performance on the other hand, it would be nice to have a truely scalable, performant, and free microkernel based system (and posix compatible) there is currently none zacts: the projects focuses mostly on performance and scalability, while also being very easy to understand and maintain (something i think the current hurd has failed at :/) project* very cool i think so, but we'll have to wait for an end result :) what's currently blocking me is the IDL earlier research has shown that an idl must be optmized the same way compilers are for the best performances i'm not sure i can write something good enough :/ the first version will probably be very similar to mig, small and unoptimized

### IRC, freenode, #hurd, 2013-01-18

braunr: so how exactly do the goals of x15 differ from viengoos? zacts: viengoos is much more ambitious about the design tbh, i still don't clearly see how its half-sync ipc work x15 is much more mach-like, e.g. a hybrid microkernel with scheduling and virtual memory in the kernel its goals are close to those of mach, adding increased scalability and performance to the list that's neat that's different in a way, you could consider x15 is to mach what linux is to unix, a clone with a "slightly" different interface ah, ok. cool! viengoos is rather a research project, with very interesting goals, i think they're both neat :p

### IRC, freenode, #hurd, 2013-01-19

for now, it provides kernel memory allocation and basic threading it already supports both i386 and amd64 processors (from i586 onwards), and basic smp oh wow how easily can it be ported to other archs?
the current focus is smp load balancing, so that thread migration is enabled during development hard to say everything that is arch-specific is cleanly separated, the same way it is in mach and netbsd but the arch-specific interfaces aren't well defined yet because there is only one (and incomplete) arch

### IRC, freenode, #hurd, 2013-03-08

BTW, what is your current direction? did you follow through with abandonning Mach resemblance?... no it's very similar to mach in many ways unless mach is defined by its ipc in which case it's not mach at all the ipc interface will be similar to the qnx one well, Mach is pretty much defined by it's IPC and VM interface... the vm interface remains its although vm maps will be first class objects so that it will be possible to move parts of the vm server outside the kernel some day if it feels like a good thing to do i.e. vm maps won't be inferred from tasks not implicitely the kernel will be restricted to scheduling, memory management, and ipc, much as mach is (notwithstanding drivers) hm... going with QNX IPC still seems risky to me... it's designed for simple embedded environments, not for general-purpose operating systems in my understanding no, the qnx ipc interface is very generic they can already call remote services the system can scale well on multiprocessor machines that's not risky at all, on the contrary yeah, I'm sure it's generic... but I don't think anybody tried to build a Hurd-like system on top of it; so it's not at all clear whether it will work out at all... clueless question: does x15 have any inspiration from helenos?
absolutely none i'd say x15 is almost an opposite to helenos it's meant as a foundation for unix systems, like mach some unix interfaces considered insane by helenos people (such as fork and signals) will be implemented (although not completely in the kernel) ipc will be mostly synchronous they're very different well, helenos is very different cool x15 and actually propel (the current name i have for the final system), are meant to create a hurd clone another clueless question: any similarities of x15 to minix? and since we're few, implementing posix efficiently is a priority goal for me again, absolutely none for the same reasons minix targets resilience in embedded environments propel is a hurd clone propel aims at being a very scalable and performant hurd clone that's all neato unfortunately, i couldn't find a name retaining all the cool properties of the hurd feel free to suggest ideas :) propel? as in to launch forward? push forward, yes that's very likely a better name than anything i could conjure up x15 is named after mach (the first aircraft to break mach 4, reaching a bit less than mach 7) servers will be engines, and together to push the system forward ..... :) nice thrust might be a bit too generic i guess oh i'm looking for something like "hurd" doubly recursive acronym, related to gnu and short, so it can be used as a c namespace antrik: i've thought about it a lot, and i'm convinced this kind of interface is fine for a hurd like system the various discussions i found about the hurd requirements (remember roland talking about notifications) all went in this direction note however the interface isn't completely synchronous and that's very important well, I'm certainly curious. 
but if you are serious about this, you'd better start building a prototype as soon as possible, rather than perfecting SMP ;-) i'm not perfecting smp but i consider it very important to have migrations and preemption actually working before starting the prototype so that tricky mistakes about concurrency can be catched early my current hunch is that you are trying to do too much at the same time... improving both the implementation details and redoing the system design so, for example, there is (or will be soon, actually) thread migratio, but the scheduler doesn't take processor topology into account that's why i'm starting from scratch i don't delve too deep into the details just the ones i consider very important what do you mean by thread migration here? didn't you say you don't even have IPC?... i mean migration between cpus OK the other is too confusing and far too unused and unknown to be used and i won't actually implement it the way it was done in mach again, it will be similar to qnx oh? now that's news for me :-) you seemed pretty hooked on thread migration when we talked about these things last time... i still am i'm just saying it won't be implemented the same way instead of upcalls from the kernel into userspace, i'll "simply" make server threads inherit from the caller's scheduling context the ideas i had about stack management are impossible to apply in practice which make the benefit i imagined unrealistic and the whole idea was very confusing when compared and integrated into a unix like view so stack usage will be increased that's ok but thread migration is more or less equivalent with first-class RPCs AIUI. does that work with the QNX IPC model?... the very important property that threads don't block and wake a server when sending, and the server again blocks and wake the client on reply, is preserved (in fact I find the term "first-class RPC" much clearer...) 
i dont there are two benefits in practice: since the scheduling context is inherited, the client is charged for the cpu time consumed and since there are no wakeups and blockings, but a direct hand off in the scheduler, the cost of crossing task space is closer to the system call which can be problematic too... but still it's the solution chosen by EROS for example AIUI (inheriting scheduling contexts I mean) by practically all modern microkernel based systems actually, as noted by shapiro braunr: well, both benefits can be achieved by other means as well... scheduler activations like in Viengoos should handle the hand-off part AIUI, and scheduling contexts can be inherited explicitly too, like in EROS (and in a way in Viengoos) i don't understand viengoos well enough to do it that way

## IRC, freenode, #hurd, 2013-04-13

a microkernel loosely based on mach for a future hurd-like system ok. no way! Are you in the process of building a micro-kernel that the hurd may someday run on? not the hurd, a hurd-like system ok wow. sounds pretty cool, and tricky the hurd could, but would require many changes too, and the point of this rewrite is to overcome the most difficult technical performance and scalability problems of the current hurd doing that requires deep changes in the low level interfaces imo, a rewrite is more appropriate sometimes, things done in x15 can be ported to the hurd but it still requires a good deal of effort

## IRC, freenode, #hurd, 2013-04-26

braunr: Did I see that you are back tinkering with X15? well yes i am and i'm very satisfied with it currently, i hope i can maintain the same level of quality in the future it can already handle hundreds of processors with hundreds of GB of RAM in a very scalable way most algorithms are O(1) even waking up multiple threads is O(1) :) i'd like to implement rcu this summer Nice. When are you gonna replace gnumach?
;-P never it's x15, not x15mach now it's not meant to be compatible Who says it has to be compatible? :) i don't know, my head the point is, the project is about rewriting the hurd now, not just the kernel new kernel, new ipc, new interfaces, new libraries, new everything Yikes, now that is some work. :) well yes and no ipc shouldn't be that difficult/long, considering how simple i want the interface to be Cool. networking and drivers will simply be reused from another code base like dde or netbsd so besides the kernel, it's a few libraries (e.g. a libports-like library), sysdeps parts in the c library, and a file system For inclusion in glibc or are you not intending on using glibc? i intend to use glibc, but not for upstream integration, if that's what you meant so a private, local branch i assume i expect that part to be the hardest

## IRC, freenode, #hurd, 2013-05-02

braunr: also, will propel/x15 use netbsd drivers or netdde linux drivers? or both? probably netbsd drivers and if netbsd, will it utilize rump?

[[open_issues/user-space_device_drivers]], *External Projects*, *The Anykernel and Rump Kernels*.

i don't know yet ok device drivers and networking will arrive late the system first has to run in ram, with a truly configurable boot process (i.e. a boot process that doesn't use anything static, and can boot from either disk or network) rump looks good but it still requires some work since it doesn't take care of messaging as well as we'd want e.g. signal relaying isn't that great I personally feel like using linux drivers would be cool, just because linux supports more hardware than netbsd iirc.. zacts: But it could be problematic as you should take quite a lot of code from linux kernel to add support even for a single driver. zacts: netbsd drivers are far more portable oh wow, interesting. yeah I did have the idea that netbsd would be more portable.
mcsim: that doesn't seem to be as big a problem as you might suggest the problem is providing the drivers with their requirements there are a lot of different execution contexts in linux (hardirq, softirq, bh, threads to name a few) being portable (as implied in netbsd) also means being less demanding on the execution context which allows reusing code in userspace more easily, as demonstrated by rump i don't really care about extensive hardware support, since this is required only for very popular projects such as linux and hardware support actually comes with popularity (the driver code base is related with the user base) so you think that more users will contribute if the projects takes off? i care about clean and maintainable code well yes I think that's a good attitude what i mean is, there is no need for extensive hardware support braunr: TBH, I did not really got idea of rump. Do they try to run the whole kernel or some chosen subsystems as user tasks? mcsim: some subsystems well all the subsystems required by the code they actually want to run (be it a file system or a network stack) braunr: What's the difference with dde? it's not kernel oriented what do you mean? it's not only meant to run on top of a microkernel as the author named it, it's "anykernel" if you remember at fosdem, he run code inside a browser ran* and also, netbsd drivers wouldn't restrict the license although not a priority, having a (would be) gnu system under gplv3+ would be nice that would be cool x15 is already gplv3+ iirc yes cool yeah, I would agree netbsd drivers do look more attractive in that case again, that's clearly not the main reason for choosing them ok it could also cause other problems, such as accepting a bsd license when contributing back but the main feature of the hurd isn't drivers, and what we want to protect with the gpl is the main features I see drivers, as well as networking, would be third party code, the same way you run e.g. 
firefox on linux with just a bit of glue braunr: what do you think of the idea of being able to do updates for propel without rebooting the machine? would that be possible down the road? simple answer: no that would probably require persistence, and i really don't want that does persistence add a lot of complexity to the system? not with the code, but at execution, yes interesting we could add per-program serialization that would allow it but that's clearly not a priority for me updating with a reboot is already complex enough :)

## IRC, freenode, #hurd, 2013-05-09

the thing is, i consider the basic building blocks of the hurd too crappy to build anything really worth such effort over them mach is crappy, mig is crappy, signal handling is crappy, hurd libraries are ok but incur a lot of contention, which is crappy today Understood but it is all we have currently. i know and it's good as a prototype We have already had L4, viengoos, etc and nothing has ever come to fruition. :( my approach is completely different it's not a new design a few things like ipc and signals are redesigned, but that's minor compared to what was intended for hurdng propel is simply meant to be a fast, scalable implementation of the hurd high level architecture bddebian: imagine a mig you don't fear using imagine interfaces not constrained to 100 calls ... imagine per-thread signalling from the start braunr: I am with you 100% but it's vaporware so far.. ;-) bddebian: i'm just explaining why i don't want to work on large scale projects on the hurd fixing local bugs is fine fixing paging is mandatory usb could be implemented with dde, perhaps by sharing the pci handling code (i.e. have one big dde server with drivers inside, a bit ugly but straightforward compared to a full-fledged pci server) braunr: But this is the problem I see. Those of you that have the skills don't have the time or energy to put into fixing that kind of stuff. braunr: That was my thought.
bddebian: well i have time, and i'm currently working :p but not on that bddebian: also, it won't be vaporware for long, i may have ipc working well by the end of the year, and optimized and developer-friendly by next year

## IRC, freenode, #hurd, 2013-06-05

i'll soon add my radix tree with support for lockless lookups :> a tree organized based on the values of the keys themselves, and not how they relatively compare to each other also, a tree of arrays, which takes advantage of cache locality without the burden of expensive resizes you seem to be applying good algorithmic techniques that is nice that's one goal of the project you can't achieve performance and scalability without the appropriate techniques see http://git.sceen.net/rbraun/librbraun.git/blob/HEAD:/rdxtree.c for the existing userspace implementation in kern/work.c I see one TODO "allocate numeric IDs to better identify worker threads" yes and i'm adding my radix tree now exactly for that (well not only, since radix tree will also back VM objects and IPC spaces, two major data structures of the kernel)

## IRC, freenode, #hurd, 2013-06-11

and also starting paging anonymous memory in x15 :> well, i've merged my radix tree code, made it safe for lockless access (or so i hope), added generic concurrent work queues and once the basic support for anonymous memory is done, x15 will be able to load modules passed from grub into userspace :> but i've also been thinking about how to solve a major scalability issue with capability based microkernels that no one else seems to have seen or bothered thinking about for those interested, the problem is contention at the port level unlike on a monolithic kernel, or a microkernel with thread-based ipc such as l4, mach and similar kernels use capabilities (port rights in mach terminology) to communicate the kernel then has to "translate" that reference into a thread to process the request this is done by using a port set, putting many ports inside, and making worker
threads receive messages on the port set and in practice, this gets very similar to a traditional thread pool model one thread actually waits for a message, while others sit on a list when a message arrives, the receiving thread wakes another from that list so it receives the next message this is all done with a lock Maybe they thought about it but couldn't or were to lazy to find a better way? :) braunr: what do you mean under "unlike .... a microkernel with thread-based ipc such as l4, mach and similar kernels use capabilities"? L4 also has capabilities. mcsim: not directly capabilities are implemented by a server on top of l4 unless it's OKL4 or another variant with capabilities back in the kernel i don't know how fiasco does it so the problem with this lock is potentially very heavy contention and contention in what is the equivalent of a system call .. it's also hard to make it real-time capable for example, in qnx, they temporarily apply priority inheritance to *every* server thread since they don't know which one is going to be receiving next braunr: in fiasco you have capability pool for each thread and this pool is stored in tread control block. When one allocates capability kernel just marks slot in a pool as busy mcsim: ok but, there *is* a thread for each capability i mean, when doing ipc, there can only be one thread receiving the message (iirc, this was one of the big issue for l4-hurd) ok. i see the difference. well i'm asking i'm not so sure about fiasco but that's what i remember from the generic l4 spec sorry, but where is the question? 16:04 < braunr> i mean, when doing ipc, there can only be one thread receiving the message yes, you specify capability to thread you want to send message to i'll rephrase: when you send a message, do you invoke a capability (as in mach), or do you specify the receiving thread ? 
you specify a thread that's my point but you use local name (that is basically capability) i see from wikipedia: "Furthermore, Fiasco contains mechanisms for controlling communication rights as well as kernel-level resource consumption" not certain that's what it refers to, but that's what i understand from it more capability features in the kernel but you still send to one thread yes that's what makes it "easily" real time capable a microkernel that would provide mach-like semantics (object-oriented messaging) but without contention at the messsage passing level (and with resource preallocation for real time) would be really great bddebian: i'm not sure anyone did braunr: Well you can be the hero!! ;) the various papers i could find that were close to this subject didn't take contention into account exception for network-distributed ipc on slow network links bddebian: eh well i think it's doable acctually braunr: can you elaborate on where contention is, because I do not see this clearly? mcsim: let's take a practical example a file system such as ext2fs, that you know well enough imagine a large machine with e.g. 
64 processors and an ignorant developer like ourselves issuing make -j64 every file access performed by the gcc tools will look up files, and read/write/close them, concurrently at the server side, thread creation isn't a problem we could have as many threads as clients the problem is the port set for each port class/bucket (let's assume they map 1:1), a port set is created, and all receive rights for the objects managed by the server (the files) are inserted in this port set then, the server uses ports_manage_port_operations_multithread() to service requests on that port set with as many threads required to process incoming messages, much the same way a work queue does it but you can't have *all* threads receiving at the same time there can only be one the others are queued i did a change about the queue order a few months ago in mach btw mcsim: see ipc/ipc_thread.c in gnumach this queue is shared and must be modified, which basically means a lock, and contention so the 64 concurrent gcc processes will suffer from contenion at the server while they're doing something similar to a system call by that, i mean, even before the request is received mcsim: if you still don't understand, feel free to ask braunr: I'm thinking on it :) give me some time "Fiasco.OC is a third generation microkernel, which evolved from its predecessor L4/Fiasco. Fiasco.OC is capability based" ok so basically, there are no more interesting l4 variants strictly following the l4v2 spec any more "The completely redesigned user-land environment running on top of Fiasco.OC is called L4 Runtime Environment (L4Re). It provides the framework to build multi-component systems, including a client/server communication framework" so yes, client/server communication is built on top of the kernel something i really want to avoid actually So when 1 core wants to pull something out of queue it has to lock it, and the problem arrives when other 63 cpus are waiting in the same lock. Right? 
mcsim: yes could this be solved by implementing per cpu queues? Like in slab allocator solved, no reduced, yes by using multiple port sets, each with their own thread pool but this would still leave core problems unsolved (those making real-time hard) to make it real-time is not really essential to solve this problem that's the other way around we just need to guarantee that locking protocol is fair solving this problem is required for quality real-time what you refer to is similar to what i described in qnx earlier it's ugly keep in mind that message passing is the equivalent of system calls on monolithic kernels os ideally, we'd want something as close as possible to an actually system call so* mcsim: do you see why it's ugly ? no i meant exactly opposite, I meant to use some deterministic locking protocol please elaborate because what qnx does is deterministic We know in what sequences threads will acquire the lock, so we will not have to apply inheritance to all threads hwo do you know ? there are different approaches, like you use ticket system or MCS lock (http://portal.acm.org/citation.cfm?id=103729) that's still locking a system call has 0 contention 0 potential contention in linux? everywhere i assume than why do they need locks? they need locks after the system call the system call itself is a stupid trap that makes the thread "jump" in the kernel and the reason why it's so simple is the same as in fiasco: threads (clients) communicate directly with the "server thread" (themselves in kernel mode) so 1/ they don't go through a capability or any other abstraction and 2/ they're even faster than on fiasco because they don't need to find the destination, it's implied by the trap mechanism) 2/ is only an optimization that we can live without but 1/ is a serious bottleneck for microkernels Do you mean that there system call that process without locks or do you mean that there are no system calls that use locks? 
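
The "ticket system" suggested in the exchange above can be sketched in a few lines of C11. This is only an illustration of the general technique, not code from gnumach, x15 or QNX; the names `ticket_lock`, `ticket_lock_acquire` and `ticket_lock_release` are assumptions made up for the example. Each thread takes a ticket with one atomic increment and waits until its number comes up, so the order of acquisition is known in advance (FIFO).

```c
#include <stdatomic.h>

/* Hypothetical ticket lock: threads are served strictly in ticket order. */
struct ticket_lock {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently allowed to hold the lock */
};

static void ticket_lock_init(struct ticket_lock *lock)
{
    atomic_init(&lock->next, 0);
    atomic_init(&lock->owner, 0);
}

/* Take a ticket, then spin until it is our turn. Returns the ticket. */
static unsigned ticket_lock_acquire(struct ticket_lock *lock)
{
    unsigned ticket = atomic_fetch_add_explicit(&lock->next, 1,
                                                memory_order_relaxed);

    while (atomic_load_explicit(&lock->owner, memory_order_acquire) != ticket)
        ;  /* busy wait; a real kernel would add a pause/relax hint here */

    return ticket;
}

/* Let the holder of the next ticket in. */
static void ticket_lock_release(struct ticket_lock *lock)
{
    atomic_fetch_add_explicit(&lock->owner, 1, memory_order_release);
}
```

A lock like this makes the waiting order deterministic, but every sender still serializes on the same two counters, which is the objection raised in the log ("that's still locking").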
this is what makes papers such as https://www.kernel.org/doc/ols/2007/ols2007v1-pages-251-262.pdf valid i mean the system call (the mechanism used to query system services) doesn't have to grab any lock the idea i have is to make the kernel transparently (well, as much as it can be) associate a server thread to a client thread at the port level at the server side, it would work practically the same the first time a server thread services a request, it's automatically associated to a client, and subsequent request will directly address this thread when the client is destroyed, the server gets notified and destroys the associated server trhead for real-time tasks, i'm thinking of using a signal that gets sent to all servers, notifying them of the thread creation so that they can preallocate the server thread or rather, a signal to all servers wishing to be notified or perhaps the client has to reserve the resources itself i don't know, but that's the idea and who will send this signal? the kernel x15 will provide unix like signals but i think the client doing explicit reservation is better more complicated, but better real time developers ought to know what they're doing anyway mcsim: the trick is using lockless synchronization (like rcu) at the port so that looking up the matching server thread doesn't grab any lock there would still be contention for the very first access, but that looks much better than having it every time (potential contention) it also simplifies writing servers a lot, because it encourages the use of a single port set for best performance instead of burdening the server writer with avoiding contention with e.g. a hierarchical scheme "looking up the matching server" -- looking up where? in the port but why can't you just take first? that's what triggers contention you have to look at the first > (16:34:13) braunr: mcsim: do you see why it's ugly ? 
BTW, not really imagine serveral clients send concurrently mcsim: well, qnx doesn't do it every time qnx boosts server threads only when there are no thread currently receiving, and a sender with a higher priority arrives since qnx can't know which server thread is going to be receiving next, it boosts every thread boosting priority is expensive, and boosting everythread is linear with the number of threads so on a big system, it would be damn slow for a system call :) ok and grabbing "the first" can't be properly done without serialization if several clients send concurrently, only one of them gets serviced by the "first server thread" the second client will be serviced by the "second" (or the first if it came back) making the second become the first (i call it the manager) must be atomic that's the core of the problem i think it's very important because that's currently one of the fundamental differences wih monolithic kernels so looking up for server is done without contention. And just assigning task to server requires lock, right? mcsim: basically yes i'm not sure it's that easy in practice but that's what i'll aim at almost every argument i've read about microkernel vs monolithic is full of crap Do you mean lock on the whole queue or finer grained one? the whole port (including the queue) why the whole port? how can you make it finer ? is queue a linked list? yes than can we just lock current element in the queue and elements that point to current that's two lock and every sender will want "current" which then becomes coarse grained but they want different current let's call them the manager and the spare threads yes, that's why there is a lock so they don't all get the same the manager is the one currently waiting for a message, while spare threads are available but not doing anything when the manager finally receives a message, it takes the first spare, which becomes the new manager exactly like in a common thread pool so what are you calling current ? 
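
The manager/spare hand-off described above can be sketched as follows. This is a simplified, hypothetical model in C11, not the actual gnumach code in ipc/ipc_thread.c; `struct port`, `struct server_thread`, `port_init` and `port_deliver` are names invented for the illustration, and a spinning `atomic_flag` stands in for whatever lock a real kernel would use. The point it shows is that promoting the first spare to manager has to happen inside a critical section shared by every sender.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical server thread descriptor. */
struct server_thread {
    int id;
    struct server_thread *next;  /* link in the list of spare threads */
};

/* Hypothetical port (or port set) with its receiving threads. */
struct port {
    atomic_flag lock;               /* taken by every sender */
    struct server_thread *manager;  /* thread currently waiting for a message */
    struct server_thread *spares;   /* idle threads, first one is next manager */
};

static void port_init(struct port *port, struct server_thread *manager,
                      struct server_thread *spares)
{
    atomic_flag_clear(&port->lock);
    port->manager = manager;
    port->spares = spares;
}

/*
 * Deliver one message: the manager receives it and the first spare becomes
 * the new manager. Returns the thread servicing the message, or NULL if no
 * thread is receiving.
 */
static struct server_thread *port_deliver(struct port *port)
{
    struct server_thread *receiver;

    while (atomic_flag_test_and_set_explicit(&port->lock,
                                             memory_order_acquire))
        ;  /* all concurrent senders contend here */

    receiver = port->manager;
    port->manager = port->spares;

    if (port->spares != NULL) {
        port->spares = port->spares->next;
        port->manager->next = NULL;
    }

    atomic_flag_clear_explicit(&port->lock, memory_order_release);
    return receiver;
}
```

With 64 clients sending at once, all 64 calls to `port_deliver` go through the same lock before any request is even received, which is the contention discussed in the log.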
we have in a port queue of threads that wait for message: t1 -> t2 -> t3 -> t4; kernel decided to assign message to t3, than t3 and t2 are locked. why not t1 and t2 ? i was calling t3 in this example as current some heuristics yeah well no it wouldn't be deterministic then for instance client runs on core 3 and wants server that also runs on core 3 i really want the operation as close as a true system call as possible, so O(1) what if there are none ? it looks up forward up to the end of queue: t1->t2->t4; takes t4 than it starts from the beginning that becomes linear in the worst case no so 4095 attempts on a 4096 cpus machine ? you're right unfortunately :/ a per-cpu scheme could be good and applicable with much more thought and the problem is that, unlike the kernel, which is naturally a one thread per cpu server, userspace servers may have less or more threads than cpu possibly unbalanced too so it would result in complicated code one good thing with microkernels is that they're small they don't pollute the instruction cache much keeping the code small is important for performance too so forgetting this kind of optimization makes for not too complicated code, and we rely on the scheduler to properly balance threads mcsim: also note that, with your idea, the worst cast is twice more expensive than a single lock and on a machine with few processors, this worst case would be likely so, you propose every time try to take first server from the queue? braunr: ^ no that's what is done already i propose doing that the first time a client sends a message but then, the server thread that replied becomes strongly associated to that client (it cannot service requests from other clients) and it can be recycled only when the client dies (which generates a signal indicating the server it can now recycle the server thread) (a signal similar to the no-sender or dead-name notifications in mach) that signal would be sent from the kernel, in the traditional unix way (i.e. 
no dedicated signal thread since it would be another source of contention) and the server thread would directly receive it, not interfering with the other threads in the server in any way => contention on first message only now, for something like make -j64, which starts a different process for each compilation (itself starting subprocesses for preprocessing/compiling/assembling) it wouldn't be such a big win so even this first access should be optimized if you ever get an idea, feel free to share :) May mach block thread when it performs asynchronous call? braunr: ^ sure but that's unrelated in mach, a sender is blocked only when the message queue is full So we can introduce per cpu queues at the sender side (and mach_msg wasn't called in non blocking mode obviously) no they need to be delivered in order In what order? messages can't be reorder once queued reordered so fifo order if you break the queue in per cpu queues, you may break that, or need work to rebuild the order which negates the gain from using per cpu queues Messages from the same thread will be kept in order are you sure ? and i'm not sure it's enough thes cpu queues will be put to common queue once context switch occurs *all* messages must be received in order these* uh ? you want each context switch to grab a global lock ? if you have parallel threads that send messages that do not have dependencies than they are unordered always the problem is they might consider auth for example you have one client attempting to authenticate itself to a server through the auth server if message order is messed up, it just won't work but i don't have this problem in x15, since all ipc (except signals) is synchronous but it won't be messed up. 
You just "send" messages in O(1), but than you put these messages that are not actually sent in queue all at once i think i need more details please you have lock on the port as it works now, not the kernel lock the idea is to batch these calls i see batching can be effective, but it would really require queueing x15 only queues clients when there is no receiver i don't think batching can be applied there you batch messages only from one client that's what i'm saying so client can send several messages during his time slice and than you put them into queue all together x15 ipc is synchronous, no more than 1 message per client at any time there also are other problems with this strategy problems we have on the hurd, such as priority handling if you delay the reception of messages, you also delay priority inheritance to the server thread well not the reception, the queueing actually but since batching is about delaying that, it's the same if you use synchronous ipc than there is no sence in batching, at least as I see it. yes 18:08 < braunr> i don't think batching can be applied there and i think sync ipc is the only way to go for a system intended to provide messaging performance as close as possible to the system call do you have as many server thread as many cores you have? no as many server threads as clients which matches the monolithic model in current implementation? no currently i don't have userspace :> and what is in hurd atm? in gnumach asyn ipc async with message queues no priority inheritance, simple "handoff" on message delivery, that's all I managed to read the conversation :-) eh anatoly: any opinion on this ? braunr: I have no opinion. I understand it partially :-) But association of threads sounds for me as good idea But who am I to say what is good or what is not in that area :-) there still is this "first time" issue which needs at least one atomic instruction I see. Does mach do this "first time" thing every time? 
yes but gnumach is uniprocessor so it doesn't matter if we have 1:1 relation for client and server threads we need only per-cpu queues mcsim: explain that please and the problem here is establishing this relation with a lockless lookup, i don't even need per cpu queues you said: (18:11:16) braunr: as many server threads as clients how do you create server threads? pthread_create :) ok :) why and when do you create a server thread? there must be at least one unbound thread waiting for a message when a message is received, that thread knows it's now bound with a client, and if needed wakes up/spawns another thread to wait for incoming messages when it gets a signal indicating the death of the client, it knows it's now unbound, and goes back to waiting for new messages becoming either the manager or a spare thread if there already is a manager a timer could be used as it's done on the hurd to make unbound threads die after a timeout the distinction between the manager and spare threads would only be done at the kernel level the server would simply make unbound threads wait on the port set How client sends signal to thread about its death (as I understand signal is not message) (sorry for noob question) in what you described there are no queues at all anatoly: the kernel does it mcsim: there is, in the kernel the queue of spare threads anatoly: don't apologize for noob questions eh braunr: is that client is a thread of some user space task? i don't think it's a newbie topic at all anatoly: a thread make these queue per cpu why ? 
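
The binding of a server thread to a client, as described above, can be sketched like this. It is a rough model under stated assumptions, not x15 code: a fixed-size array of atomic pointers stands in for the radix tree and RCU mentioned in the log, and `MAX_CLIENTS`, `port_lookup`, `port_bind` and `port_unbind` are hypothetical names. Lookups on an already bound client are a single atomic load; only the first message needs an atomic read-modify-write.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_CLIENTS 64  /* hypothetical limit, for the sketch only */

struct server_thread {
    int id;
};

/* Hypothetical port: one slot per client, holding its bound server thread. */
struct port {
    _Atomic(struct server_thread *) bound[MAX_CLIENTS];
};

/* Fast path, every message after the first: no lock, one atomic load. */
static struct server_thread *port_lookup(struct port *port, unsigned client)
{
    return atomic_load_explicit(&port->bound[client], memory_order_acquire);
}

/*
 * Slow path, first message only: try to bind an unbound thread to the
 * client. Returns the thread that ends up bound, which is the candidate
 * unless another thread was bound first.
 */
static struct server_thread *port_bind(struct port *port, unsigned client,
                                       struct server_thread *candidate)
{
    struct server_thread *expected = NULL;

    if (atomic_compare_exchange_strong_explicit(&port->bound[client],
                                                &expected, candidate,
                                                memory_order_acq_rel,
                                                memory_order_acquire))
        return candidate;

    return expected;
}

/* Client death: clear the slot so the server thread can be recycled. */
static struct server_thread *port_unbind(struct port *port, unsigned client)
{
    return atomic_exchange_explicit(&port->bound[client], NULL,
                                    memory_order_acq_rel);
}
```

The sketch leaves out how an unbound candidate is picked in the first place; that step is where the remaining "first time" contention discussed in the log would sit.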
there can be a lot less spare threads than processors i don't think it's a good idea to spawn one thread per cpu per port set on a large machine you'd have tons of useless threads if you have many useless threads, than assign 1 thread to several core, thus you will have twice less threads i mean dynamically that becomes a hierarchical model it does reduce contention, but it's complicated, and for now i'm not sure it's worth it it could be a tunable though if you want something fast you should use something complicated. really ? a system call is very simple and very fast :p why is it fast? you still have a lot of threads in kernel but they don't interact during the system call the system call itself is usually a simple instruction with most of it handled in hardware if you invoke "write" system call, what do you do in kernel? you look up the function address in a table you still have queues no sorry wait by system call, i mean "the transition from userspace to kernel space" and the return not the service itself the equivalent on a microkernel system is sending a message from a client, and receiving it in a server, not processing the request ideally, that's what l4 does: switching from one thread to another, as simply and quickly as the hardware can so just a context and address space switch at some point you put something in queue even in monolithic kernel and make request to some other kernel thread the problem here is the indirection that is the capability yes but that's the service i don't care about the service here i care about how the request reaches the server this division exist for microkernels for monolithic it's all mixed What does thread do when it receive a message? anatoly: what it wants :p the service mcsim: ? mixed ? braunr: hm, is it a thread of some server? if you have several working threads in monolithic kernel you have to put request in queue anatoly: yes mcsim: why would you have working threads ? 
and there is no difference either you consider it as service or just "transition from userspace to kernel space" i mean, it's a good thing to have, they usually do, but they're not implied they're completely irrelevant to the discussion here of course there is you might very well perform system calls that don't involve anything shared you can also have only one working thread in microkernel yes and all clients will wait for it you're mixing up work queues in the discussion here server threads are very similar to a work queue, yes but you gave me an example with 64 cores and each core runs some server thread they're a thread pool handling requests you can have only one thread in a pool they have to exist in a microkernel system to provide concurrency monolithic kernels can process concurrently without them though why? because on a monolithic system, _every client thread is its own server_ a thread making a system call is exactly like a client requesting a service on a monolithic kernel, the server is the kernel and it *already* has as many threads as clients and that's pretty much the only thing beautiful about monolithic kernels right have to think about it :) that's why they scale so easily compared to microkernel based systems and why l4 people chose to have thread-based ipc but this just moves the problems to an upper level and is probably why they've realized one of the real values of microkernel systems is capabilities and if you want to make them fast enough, they should be handled directly by the kernel

## IRC, freenode, #hurd, 2013-06-13

Heya Richard. Solve the world's problems yet? :) bddebian: I fear the world's problems are NP-complete ;) heh bddebian: i wish i could solve mine at least :p braunr: I meant the contention thing you were discussing the other day :) bddebian: oh i have a solution that improves the behaviour yes, but there is still contention the first time a thread performs an ipc Any thread or the first time there is contention?
there may be contention the first time a thread sends a message to a server (assuming a server uses a single port set to receive requests) Oh aye i think it's as much as can be done considering there is a translation from capability to thread other schemes are just too heavy, and thus don't scale well this translation is one of the two important nice properties of microkernel based systems, and translations (or indrections) usually have a cost so we want to keep them and we have to accept that cost the amount of code in the critical section should be so small it should only matter for machines with several hundreds or thousands processors so it's not such a bit problem OK but it would have been nice to have an additional valid theoretical argument to explain how ipc isn't that slow compared to system calls s/bit/big/ people keep saying l4 made ipc as fast as system calls without taking that stuff into account which makes the community look lame in the eyes of those familiar with it heh with my solution, persistent applications like databases should perform as fast as on an l4 like kernel but things like parallel builds, which start many different processes for each file, will suffer a bit more from contention seems like a fair compromise to me Aye as mcsim said, there is a lot of contention about everywhere in almost every application and lockless stuff is hard to correctly implement os it should be all right :) ... :) braunr: What if we have at least 1 thread for each core that stay in per-core queue. When we decide to kill a thread and this thread is last in a queue we replace it with load balancer. This is still worse than with monolithic kernel, but it is simplier to implement from kernel perspective. mcsim: it doesn't scale well you end up with one thread per cpu per port set load balancer is only one thread why would it end up like you said? 
remember the goal is to avoid contention your proposition is to set per cpu queues the way i understand what you said, it means clients will look up a server thread in these queues one of them actually, the one for the cpu they're currently running one so 1/ it disables migration or 2/ you have one server thread per client per cpu i don't see what a "load balancer" would do here client either finds server thread without contention or it sends message to load balancer, that redirects message to thread from global queue. Where global queue is concatenation of local ones. you can't concatenate local queues in a global one if you do that, you end up with a global queue, and a global lock again not global load balancer is just one then you serialize all remote messaging through a single thread so contention will be only among local thread and load balancer i don't see how it doesn't make the load balancer global it makes but it just makes bootstraping harder i'm not following and i don't see how it improves on my solution in your example with make -j64 very soon there will be local threads at any core yes, hence the lack of scalability but that's your goal: create as many server thread as many clients you have, isn't it? your solution may create a lot more again, one per port set (or server) per cpu imagine this worst case: you have a single client with one thread which gets migrated to every cpu on the machine it will spawn one thread per cpu at the server side why would it migrate all the time? 
it's a worst case if it can migrate, consider it will murphy's law, you know also keep in mind contention doesn't always occur with a global lock i'm talking about potential contention and same things apply: if it can happen, consider it will than we can make load balancer that also migrates server threads ok so in addition to worker threads, we'll add an additional per server load balancer which may have to lock several queues at once doesn't it feel completely overkill to you ? load balancer is global, not per-cpu there could be contention for it again, keep in mind this problem becomes important for several hundreds processors, not below yes but it has to balance which means it has to lock cpu queues and at least two of them to "migrate" server threads and i don't know why it would do that i don't see the point of the load balancer so, you start make -j64. First 64 invocations of gcc will suffer from contention for load balancer, but later on it will create enough server threads and contention will disappear no that's the best case : there is always one server thread per cpu queue how do you guarantee your 64 server threads don't end up in the same cpu queue ? (without disabling migration) load balancer will try to put some server thread to the core where load balancer was invoked so there is no guarantee LB can pin server thread unless we invoke it regularly, in a way similar to what is already done in the SMP scheduler :/ and this also means one balancer per cpu then why one balance per cpu? 15:56 < mcsim> load balancer will try to put some server thread to the core where load balancer was invoked why only where it was invoked ? because it assumes that if some one asked for server at core x, it most likely will ask for the same service from the same core i'm not following LB just tries to prefetch were next call will be what you're describing really looks like per-cpu work queues ... 
i don't see how you make sure there aren't too many threads i don't see how a load balancer helps this is just an heuristic when server thread is created? who creates it? and it may be useless, depending on how threads are migrated and when they call the server same answer as yesterday there must be at least one thread receiving messages on a port set when a message arrives, if there aren't any spare threads, it spawns one to receive messages while it processes the request at the moment server threads are killed by timeout, right? yes well no there is a debian patch that disables that because there is something wrong with thread destruction but that's an implementation bug, not a design issue so it is the mechanism how we insure that there aren't too many threads it helps because yesterday I proposed to hierarchical scheme, were one server thread could wait in cpu queues of several cores but this has to be implemented in kernel a hierarchical scheme would help yes a bit i propose scheme that could be implemented in userspace ? kernel should not distinguish among load balancer and server thread sorry this is too confusing please start describing what you have in mind from the start ok so my starting point was to use hierarchical management but the drawback was that to implement it you have to do this in kernel right? 
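
The thread management rule described above (at least one thread always receiving on a port set, a new one spawned when the last spare thread picks up a message, surplus threads destroyed on timeout) can be modelled with a few counters. This is only a toy sketch under those assumptions: `server_pool` and its functions are hypothetical names, not Hurd or Mach interfaces, and real code would create and destroy actual threads.

```c
#include <assert.h>

/* Toy model of a server's thread pool on one port set (hypothetical names). */
struct server_pool {
    unsigned idle;     /* threads waiting for a message on the port set */
    unsigned busy;     /* threads currently processing a request */
    unsigned spawned;  /* threads created on demand so far */
};

void pool_init(struct server_pool *p)
{
    p->idle = 1;       /* invariant: at least one receiver at all times */
    p->busy = 0;
    p->spawned = 0;
}

/* A message arrives: an idle thread takes it, and if that leaves no
 * receiver, a replacement is spawned before the request is processed. */
void pool_message_arrived(struct server_pool *p)
{
    assert(p->idle >= 1);
    p->idle--;
    p->busy++;
    if (p->idle == 0) {
        p->idle = 1;
        p->spawned++;
    }
}

/* A request completes: the thread goes back to waiting. */
void pool_request_done(struct server_pool *p)
{
    assert(p->busy >= 1);
    p->busy--;
    p->idle++;
}

/* Timeout: destroy surplus idle threads but keep one receiver.
 * Returns the number of threads destroyed. */
unsigned pool_reap_idle(struct server_pool *p)
{
    unsigned killed = (p->idle > 1) ? p->idle - 1 : 0;

    p->idle -= killed;
    return killed;
}
```

The cost being debated is the spawn inside `pool_message_arrived`: it happens on the IPC path, which is why the discussion then narrows to the case where enough server threads already exist.
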
no so I thought how can this be implemented in user space being in kernel isn't the problem contention is on the contrary, i want ipc in kernel exactly because that's where you have the most control over how it happens and can provide the best performance ipc is the main kernel responsibility but if you have few clients you have low contention the goal was "0 potential contention" and if you have many clients, you have many servers let's say server threads for me, a server is a server task or process right so i think 0 potential contention is just impossible or it requires too many resources that make the solution not scalable 0 contention is impossible, since you have imbalance in numbers of client threads and server threads well no it *can* be achieved imagine servers register themselves to the kernel and the kernel signals them when a client thread is spawned you'd effectively have one server thread per client (there would be other problems like e.g. when a server thread becomes the client of another, etc..) so it's actually possible but we clearly don't want that, unless perhaps for real time threads but please continue what does "and the kernel signals them when a client thread is spawned" mean? it means each time a thread not part of a server thread is created, servers receive a signal meaning "hey, there's a new thread out there, you might want to preallocate a server thread for it" and what is the difference with creating thread on demand? on demand can occur when receiving a message i.e. during syscall I will continue, I just want to be sure that I'm not basing on wrong assumptions. and what is bad in that? (just to clarify, i use the word "syscall" with the same meaning as "RPC" on a microkernel system, whereas it's a true syscall on a monolithic one) contention whether you have contention on a list of threads or on map entries when allocating a stack doesn't matter the problem is contention and if we create server thread always?
and do not keep them in queue? always ? yes again you'd have to allocate a stack for it every time so two potentially heavy syscalls to allocate/free the stack not to mention the thread itself, its associations with its task, ipc space, maintaining reference counts (moar contention) creating threads was considered cheap at the time the process was the main unit of concurrency ok, then we will have the same contention if we will create a thread when "the kernel signals them when a client thread is spawned" now we have work queues / thread pools just to avoid that no because that contention happens at thread creation not during a syscall i'll redefine the problem: the problem is contention during a system call / IPC ok note that my current solution is very close to signalling every server it's the lazy version match at first IPC time so I was basing my plan on the case when we create new thread when client makes syscall and there is not enough server threads the problem exists even when there is enough server threads we shouldn't consider the case where there aren't enough server threads real time tasks are the only ones which want that, and can preallocate resources explicitly I think that real time tasks should be really separated For them resource availability is much more important than good resource utilisation. So if we talk about real time tasks we should apply one policy and for non-real time another So it shouldn't be critical if thread is created during syscall agreed that's what i was saying : :) 16:23 < braunr> we shouldn't consider the case where there aren't enough server threads in this case, we spawn a thread, and that's ok it will live on long enough that we really don't care about the cost of lazily creating it so let's concentrate only on the case where there already are enough server threads So if client makes a request to ST (is it ok to use abbreviations?)
there are several cases: 1/ There is ST waiting on local queue (trivial case) 2/ There is no ST, only load balancer (LB). LB decides to create a new thread 3/ Like in previous case, but LB decides to perform migration migration of what ? migration of ST from other core the only case effectively solving the problem is 1 others introduce contention, and worse, complex code i mean a complex solution not only code even the addition of a load balancer per port set thr data structures involved for proper migration But 2 and 3 in long run will lead to having enough threads on all cores then you end up having 1 per client per cpu migration is needed in any case no why would it be ? to balance load not only for this case there already is load balancing in the scheduler we don't want to duplicate its function what kind of load balancing? *has scheduler thread weight / cpu and does it perform migration? sure so scheduler can be simplified if policy "when to migrate" will be moved to user space this is becoming a completely different problem and i don't want to do that it's very complicated for no real world benefit but all this will be done in userspace ? all what ? migration decisions in your scheme you mean ? yes explain how LB will decide when thread will migrate and LB is user space task what does it bring ? imagine that, in the mean time, the scheduler then decides the client should migrate to another processor for fairness you'd have migrated a server thread once for no actual benefit or again, you need to disable migration for long durations, which sucks also 17:06 < mcsim> But 2 and 3 in long run will lead to having enough threads on all cores contradicts the need for a load balancer if you have enough threads every where, why do you need to balance ? and how are you going to deal with the case when client will migrate all the time? 
i intend to implement something close to thread migration because some of them can die because of timeout something l4 already does iirc the thread scheduler manages scheduling contexts which can be shared by different threads which means the server thread bound to its client will share the scheduling context the only thing that gets migrated is the scheduling context the same way a thread can be migrated indifferently on a monolithic system, whether it's in user of kernel space (with kernel preemption enabled ofc) or* but how server thread can process requests from different clients? mcsim: load becomes a problem when there are too many threads, not when they're dying they can't at first message, they're *bound* => one server thread per client when the client dies, the server thread is ubound and can be recycled unbound* and you intend to put recycled threads to global queue, right? yes and I propose to put them in local queues in hope that next client will be on the same core the thing is, i don't see the benefit next client could be on another in which case it gets a lot heavier than the extremely small critical section i have in mind but most likely it could be on the same uh, no becouse on this load on this core is decreased *because well, ok, it would likely remain on the same cpu but what happens when it migrates ? and what about memory usage ? one queue per cpu per port set can get very large (i understand the proposition better though, i think) we can ask also "What if random access in memory will be more usual than sequential?", but we still optimise sequential one, making random sometimes even worse. The real question is "How can we maximise benefit of knowledge where free server thread resides?" previous was reply to: "(17:17:08) braunr: but what happens when it migrates ?" i understand you optimize for the common case where a lot more ipc occurs than migrations agreed now, what happens when the server thread isn't in the local queue ? 
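
The binding scheme described above (a server thread is bound to a client at its first message, then unbound and recycled when the client dies) might look roughly like this. It is a single-threaded toy model, not x15 code: every name is hypothetical, client ids are assumed to be non-negative integers, and the linear search only stands in for whatever direct reference a real kernel would keep.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4

struct server_thread {
    int bound_client;               /* client id, or -1 when unbound */
    struct server_thread *next;     /* link in the unbound list */
};

struct port_set {
    struct server_thread threads[POOL_SIZE];
    struct server_thread *unbound;  /* global list of recyclable threads */
};

void port_set_init(struct port_set *ps)
{
    ps->unbound = NULL;
    for (size_t i = 0; i < POOL_SIZE; i++) {
        ps->threads[i].bound_client = -1;
        ps->threads[i].next = ps->unbound;
        ps->unbound = &ps->threads[i];
    }
}

/* Return the server thread for `client`, binding one at the first message. */
struct server_thread *port_set_lookup(struct port_set *ps, int client)
{
    assert(client >= 0);

    /* Already bound: the shared unbound list is not touched at all. */
    for (size_t i = 0; i < POOL_SIZE; i++)
        if (ps->threads[i].bound_client == client)
            return &ps->threads[i];

    /* First message: this is the only step that needs the shared list. */
    struct server_thread *st = ps->unbound;
    if (st == NULL)
        return NULL;                /* a real server would spawn a thread here */
    ps->unbound = st->next;
    st->next = NULL;
    st->bound_client = client;
    return st;
}

/* The client died: unbind its server thread and make it recyclable. */
void port_set_client_died(struct port_set *ps, int client)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        struct server_thread *st = &ps->threads[i];

        if (st->bound_client == client) {
            st->bound_client = -1;
            st->next = ps->unbound;
            ps->unbound = st;
        }
    }
}
```

In this model only the first message from a client and the client's death touch the shared `unbound` list, which is the small critical section the discussion refers to.
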
than client request will be handled to LB why not search directly itself ? (and btw, the right word is "then") LB can decide whom to migrate right, sorry i thought you were improving on my scheme which implies there is a 1:1 mapping for client and server threads If job of LB is too small than it can be removed and everything will be done in kernel it can't be done in userspace anyway these queues are in the port / port set structures it could be done though i mean using per cpu queues server threads could be both in per cpu queues and in a global queue as long as they exist there should be no global queue, because there again will be contention for it mcsim: accessing a load balancer implies contention there is contention anyway what you're trying to do is reduce it in the first message case if i'm right braunr: yes well then we have to revise a few assumptions 17:26 < braunr> you optimize for the common case 17:26 < braunr> where a lot more ipc occurs than migrations that actually becomes wrong the first message case occurs for newly created threads for make -j64 this is actually common case and those are usually not spawn on the processor their parent runs on yes if you need all processors, yes i don't think taking into account this property changes many things per cpu queues still remain the best way to avoid contention my problem with this solution is that you may end up with one unbound thread per processor per server also, i say "per server", but it's actually per port set and even per port depending on how a server is written (the system will use one port set for one server in the common case but still) so i'll start with a global queue for unbound threads and the day we decide it should be optimized with local (or hierarchical) queues, we can still do it without changing the interface or by simply adding an option at port / port set creation whicih is a non intrusive change ok. your solution should be simplier. 
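
A rough sketch of the compromise reached here: unbound threads go to a single global queue by default, and per-CPU queues are an opt-in chosen when the port set is created. All names, including the `PSET_PERCPU` flag, are hypothetical, and the fallback to the global queue when a local queue is full is an assumption made for this sketch. The model is single-threaded; in a kernel the global queue would need a lock, while the local queues would be accessed with preemption disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NR_CPUS     4
#define QUEUE_CAP   8
#define PSET_PERCPU 0x1u            /* hypothetical creation-time option */

struct tqueue {
    int ids[QUEUE_CAP];             /* unbound thread ids (non-negative) */
    size_t count;
};

struct pset {
    unsigned flags;
    struct tqueue global;           /* would be protected by a lock */
    struct tqueue local[NR_CPUS];   /* would need preemption disabled only */
};

static bool tqueue_push(struct tqueue *q, int id)
{
    if (q->count == QUEUE_CAP)
        return false;
    q->ids[q->count++] = id;
    return true;
}

static int tqueue_pop(struct tqueue *q)
{
    return (q->count > 0) ? q->ids[--q->count] : -1;
}

void pset_init(struct pset *ps, unsigned flags)
{
    memset(ps, 0, sizeof(*ps));
    ps->flags = flags;
}

/* Recycle an unbound thread from the given CPU. */
bool pset_put(struct pset *ps, unsigned cpu, int id)
{
    assert(cpu < NR_CPUS && id >= 0);
    if ((ps->flags & PSET_PERCPU) && tqueue_push(&ps->local[cpu], id))
        return true;
    return tqueue_push(&ps->global, id);
}

/* Take an unbound thread for a client on `cpu`; -1 if none is available. */
int pset_take(struct pset *ps, unsigned cpu)
{
    assert(cpu < NR_CPUS);
    if (ps->flags & PSET_PERCPU) {
        int id = tqueue_pop(&ps->local[cpu]);

        if (id >= 0)
            return id;
    }
    return tqueue_pop(&ps->global);
}
```

Because the option only changes which queue is used internally, switching a port set from the serialized default to per-CPU queues does not change the interface seen by callers of `pset_put` and `pset_take`.
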
And TBH, what I propose is not clearly much more gainful. well it is actually for big systems it is because instead of grabbing a lock, you disable preemption which means writing to a local, uncontended variable with 0 risk of cache line bouncing this actually looks very good to me now using an option to control this behaviour and yes, in the end, it gets very similar to the slab allocator, where you can disable the cpu pool layer with a flag :) (except the serialized case would be the default one here) mcsim: thanks for insisting or being persistent braunr: thanks for conversation :) and probably I had to start from statement that I wanted to improve common case

## IRC, freenode, #hurd, 2013-06-20

braunr: how about your x15, it is improvement for mach or redesign? I really want to know that:) it's both largely based on mach and now quite far from it based on mach from a functional point of view i.e. the kernel assumes practically the same functions, with a close interface Good point:) except for ipc which is entirely rewritten why ? :) for from a functional point of view:) I think each design has it intrinsic advantage and disadvantage but why is it good ? if redesign , I may need wait more time to a new function hurd you'll have to wait a long time anyway :p Improvement was better sometimes, although redesign was more attraction sometimes :) I will wait :) i wouldn't put that as a reason for it being good this is a departure from what current microkernel projects are doing i.e.
x15 is a hybrid Sure, it is good from design too:) yes but i don't see why you say that Sorry, i did not show my view clear, it is good from design too:) you're just saying it's good, you're not saying why you think it's good I would like to talk hybrid, I want to talk that, but I am a litter afraid that you are all enthusiasm microkernel fans well no i'm not on the contrary, i'm personally opposed to the so called "microkernel dogma" but i can give you reasons why, i'd like you to explain why *you* think a hybrid design is better so, when I talk apple or nextstep, I got one soap :) that's different these are still monolithic kernels well, monolithic systems running on a microkernel yes, I view this as one type of hybrid no it's not microkernel wan't to divide process ( task ) from design view, It is great as implement view or execute view, we have one cpu and some physic memory, as the simplest condition, we can't change that that what resource the system has what's your point ? I view this as follow I am cpu and computer application are the things I need to do for running the program and finish the job, which way is the best way for me I need keep all the thing as simple as possible, divide just from application design view, for me no different desgin was microkernel , run just for one cpu and these resource. (well there can be many processors actually) I know, I mean hybrid at some level, we can't escape that braunr: I show my point? well l4 systems showed we somehow can no you didn't x15's api was rpc, right? yes well a few system calls, and mostly rpcs on top of the ipc one jsu tas with mach and you hope the target logic run locally just like in process function call, right? no it can't run locally you need thread context switch and address space context switch but you cut down the cost how so ? I mean you do it, right? 
x15 yes but no in this way in every other way :p I know, you remeber performance anywhere :p i still don't see your point i'd like you to tell, in one sentence, why you think hybrids are better balance the design and implement problem :p which is ? hybird for kernel arc you're stating the solution inside the problem you are good at mathmatics sorry, I am not native english speaker braunr: I will find some more suitable sentence to show my point some day, but I can't find one if you think I did not show my point:) for today too bad If i am computer I hope the arch was monolithic, If i am programer I hope the arch was microkernel, that's my idea ok let's get a bit faster monolithic for performance ? braunr: sorry for that, and thank you for the talk:) (a computer doesn't "hope") braunr: you need very clear answer, I can't give you that, sorry again why do you say "If i am computer I hope the arch was monolithic" ? I know you can slove any single problem no i don't, and it's not about me i'm just curious I do the work for myself, as my own view, all the resource belong to me, I does not think too much arch related divide was need, if I am the computer :P separating address spaces helps avoiding serious errors like corrupting memory of unrelated subsystems how does one not want that ? (except for performance) braunr: I am computer when I say that words! a computer doesn't want anything users (including developers) on the other way are the point of view you should have I am engineer other time we create computer, but they are lifeable just my feeling, hope not talk this topic what ? 
I mark computer as life things please don't and even, i'll make a simple example in favor of isolating resources if we, humans, were able to control all of our "resources", we could for example shut down our heart by mistake back to the topic, I think monolithic was easy to understand, and cut the combinatorial problem count for the perfect software the reason the body have so many involuntary functions is probably because those who survived did so because these functions were involuntary and controlled by separated physiological functions now that i've made this absurd point, let's just not consider computers as life forms microkernels don't make a system that more complicated they does no do they create isolation and another layer of indirection with capabilities that's it it's not that more complicated view the kernel function from more nature view, execute some code what ? I know the benefit of the microkernel and the os it's complicated not that much I agree with you microkernel was the idea of organization yes but always keep in mind your goal when thinking about means to achieve them we do the work at diferent view what's quite complicated is making a microkernel design without too much performances loss, but aside from that performances issue, it's not really much more complicated hurd do the work at os level even a monolithic kernel is made of several subsystems that communicated with each others using an API i'm reading this conversation for some time now and I have to agree with braunr microkernels simplify the design yes and no i think it depends a lot on the availability of capabilities i have experience mostly with QNX and i can say it is far more easier to write a driver for QNX, compared to Linux/BSD for example ... which are the major feature microkernels usually add qnx >= 5 do provide capabilities (in the form of channels) yeah ... 
it's the basic communication mechanism but my initial and still unanswered question was: why do people think a hybrid kernel is batter than a true microkernel, or not better* I does not say what is good or not, I just say hybird was accept core-ix: and if i'm right, they're directly implemented by the kernel, and not a userspace system server braunr: evolution is more easily accepted than revolution :) braunr: yes, message passing is in the QNX kernel not message passing, capabilities l4 does message passing in kernel too, but you need to go through a capability server (for the l4 variants i have in mind at least) the operating system evolve for it's application. congzhang: about evolution, that's one explanation, but other than that ? core-ix: ^ braunr: by capability you mean (for the lack of a better word i'll use) access control mechanisms? i mean reference-rights the "trusted" functionality available in other OS? http://en.wikipedia.org/wiki/Capability-based_security i don't know what other systems refer to with "trusted" functionnality yeah, the same thing for now, I am searching one way to make hurd arm edition suitable for Raspberry Pi I hope design or the arch itself cant scale can be scale braunr: i think (!!!) that those are implemented in the Secure Kernel (http://www.qnx.com/products/neutrino-rtos/secure-kernel.html) never used it though ... rpc make intercept easy :) core-ix: regular channels are capabilities yes, and by extensions - they are in the kenrel that's my understanding too and that one thing that, for me, makes qnx an hybrid as well just need intercept in kernel, braunr: i would dive the academic aspects of this ... in my mind a microkernel is system that provides minimal hardware abstraction, communication primitives (usually message passing), virtual memory protection *wouldn't ... 
i think it's very important on the contrary what you describe is the "microkernel dogma" precisely that doesn't include capabilities that's why l4 messaging is thread-based and that's why l4 based systems are so slow (except okl4 which put back capabilities in the kernel) so the compromise here is to include capabilities implementation in the kernel, thus making the final product hybrid? not only because now that you have them in kernel the kernel probably has to manage memory for itself so you need more features in the virtual memory system true ... that's what makes it a hybrid other ways being making each client provide memory, but that's when your system becomes very complicated but I believe this is true for pretty much any "general OS" case and some resources just can't be provided by a client e.g. a client can't provide virtual memory to another process okl4 is actually the only pragmatic real-world implementation of l4 and they also added unix-like signals so that's an interesting model as well as qnx the good thing about the hurd is that, although it's not kernel agnostic, it doesn't require a lot from the underlying kernel about hurd? yes i really need to dig into this code at some point :) well you may but you may not see that property from the code itself

## IRC, freenode, #hurd, 2013-06-28

so tell me about x15 if you are in the mood to talk about that what do you want to know ? well, the high level stuff first like what's the big picture the big picture is that x15 is intended to be a "better mach for the hurd" mach is too general purpose its ipc mechanism too powerful too complicated, error prone, and slow so i intend to build something a lot simpler and faster :p so your big picture includes actually porting hurd?
i thought i read somewhere that you have a rewrite in mind it's a clone, yes x15 will feature mostly sync ipc, and no high level types inside messages the ipc system call will look like what qnx does send-recv from the client, recv/reply/reply-recv from the server but doesn't sync mean that your context switch will have to be quite fast? how does that differ from the async approach ? (keep in mind that almost all hurd RPCs are synchronous) yes, I know, and it also affects async mode, but a slow switch is worse for the sync case, isn't it? ok so your ipc will be more agnostic wrt to what it transports? unlike mig I presume? no it's the same yes input will be an array, each entry denoting either memory or port rights (or directly one entry for fast ipcs) memory as in pointers? (well fast ipc when there is only one entry to avoid hitting a table) pointer/size yes hm, surely you want a way to avoid copying that, right? the only operation will be copy (i.e. unlike mach which allows sharing) why ? copy doesn't exclude zero copy (zero copy being adjusting page tables with copy on write techniques) right but isn't that too coarse, like in cow a whole page? depends on the message size or options provided by the caller, i don't know yet oh, you are going to pack the memory anyway? depends on the caller i'm not yet sure about these details ideally, i'd like to avoid serialization altogether wouldn't that be like cheating b/c it's the first copy? directly pass pointers/sizes from the sender address space, and either really copy or use zero copy right, but then you're back at the page size issue yes it's not a real issue the kernel must support both ways the minor issue is determining which way to choose it's not a critical issue my current plan is to always copy, unless the caller has explicitely set a flag and is passing properly aligned buffers u sure? I mean the caller is free to arange the stuff he intends to send anyway he likes, how are you going to cow that then? 
ok right properly aligned buffers :) otherwise the kernel rejects the request that's reasonable, yes in addition to being synchronous, ipc will also take a special path in the scheduler to directly use the client scheduling context avoiding the sleep/wakeup overhead, and providing priority inheritence by side effect uh, but wouldn't dropping serialization create security and reliability issues? if the receiver isn't doing a proper job sanitizing its stuff why would the client not sanitize ? err server it has to anyway sure, but a proper parser written once might be more robust, even if it adds overhead the serialization i mean it's just a layer even with high level types, you still need to sanitize the real downside is loosing cross architecture portability making the potential implementation of a single system image a lot more restricted or difficult but i don't care about that much mach was built with this in mind though it's a nice idea, but i don't believe anyone does ssi anymore i don't know and certainly not across architectures there are few projects anyway it's irrelevant currently and my interface just restricts it, it doesn't prevent it so i consider it an acceptable compromise so, does it run? what does it do? 
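
The message layout and transfer policy sketched in this exchange (an array of entries, each denoting memory as pointer/size or a port right; always copy unless the caller sets a flag and passes properly aligned buffers, otherwise reject) could be written down as follows. This is only an illustration of the stated plan, not the x15 interface: the type names, the `IPC_ZERO_COPY` flag, the 4 KiB page size, and the requirement that the size be a whole number of pages are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPC_PAGE_SIZE 4096u         /* assumed page size */
#define IPC_ZERO_COPY 0x1u          /* hypothetical caller-supplied flag */

enum ipc_entry_type { IPC_ENTRY_MEMORY, IPC_ENTRY_PORT_RIGHT };

/* One message entry: either a memory range or a port right. */
struct ipc_entry {
    enum ipc_entry_type type;
    union {
        struct {
            uintptr_t addr;         /* address in the sender's address space */
            size_t size;
        } mem;
        unsigned long port_name;
    } u;
};

enum ipc_xfer {
    IPC_XFER_COPY,                  /* default: really copy the bytes */
    IPC_XFER_REMAP,                 /* zero copy: adjust page tables, copy on write */
    IPC_XFER_REJECT                 /* flag set but buffer not properly aligned */
};

/* Decide how a memory entry is transferred. */
enum ipc_xfer ipc_choose_transfer(const struct ipc_entry *e, unsigned flags)
{
    assert(e->type == IPC_ENTRY_MEMORY);

    if (!(flags & IPC_ZERO_COPY))
        return IPC_XFER_COPY;
    if (e->u.mem.addr % IPC_PAGE_SIZE != 0 || e->u.mem.size % IPC_PAGE_SIZE != 0)
        return IPC_XFER_REJECT;
    return IPC_XFER_REMAP;
}
```

Rejecting a misaligned zero-copy request, rather than silently falling back to a copy, keeps the caller's cost expectations explicit, which matches the "otherwise the kernel rejects the request" remark above.
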
it certainly is, yes for now, it manages memory (physical, virtual, kernel, and soon, anonymous) support multiple processors with the required posix scheduling policies (it uses a cute proportionally fair time sharing algorithm) there are locks (spin locks, mutexes, condition variables) and lockless stuff (à la rcu) both x86 and x86_64 are supported (even pae) work queues sounds impressive :) :) i also added basic debugging stack trace (including getting the symbol table) handling so yes, it's much much better than what i previously did and on the right track it already scales a lot better than mach for what it does there are generic data structures (linked list, red-black tree, radix tree) the radix tree supports lockless lookups, so looking up both the page cache and the ipc spaces is lockless) that's nice :) there are a few things using global locks, but there are TODOs about them even with that, it should be scalable enough for a start and improving those parts shouldn't be too difficult

## IRC, freenode, #hurd, 2013-07-10

braunr: From what I have understood you aim for x15 to be a production ready μ-kernel for usage in the Hurd? Or is it unrelated to the Hurd? nlightnfotis: it's for a hurd clone braunr: I see. Is it close to any of the existing microkernels as far as its design is concerned (L4, Viengoos) or is it new research? it's close to mach and qnx

## IRC, freenode, #hurd, 2013-07-29

making progress on x15 pmap module factoring code for mapping creation/removal on current/kernel and remote processes also started "swap emulation" by reserving some physical memory to act as swap backing store which will allow creating memory pressure very early in the development process

## IRC, freenode, #hurd, 2013-08-23

< nlightnfotis> braunr: something a little bit irrelevant: how many things are missing from mach to be considered a solid base for the Hurd? Is it only SMP and x86_64 support?
< braunr> define "solid base for the hurd"
< nlightnfotis> solid enough to not look for a replacement for it
< braunr> then i'd say, from my very personal point of view, that you want x15
< nlightnfotis> I didn't understand this. Are you planning for x15 to be a better mach?
< braunr> with a different interface, so not compatible
< braunr> and thus, not mach
< nlightnfotis> is the source code for it available? Can I read it somewhere?
< braunr> the implied answer being: no, mach isn't a solid base for the hurd considering your definition
< braunr> http://git.sceen.net/rbraun/x15.git/
< nlightnfotis> thanks. for that. So it's definite that mach won't stay for long as the Hurd's base, right?
< braunr> it will, for long
< braunr> my opinion is that it needs to be replaced
< nlightnfotis> is it possible that it (slowly) gets rearchitected into what's being considered a second generation microkernel, or is it hopeless?
< braunr> it would require a new interface
< braunr> you can consider x15 to be a modern mach, with that new interface
< braunr> from a high level view, it's very similar (it's a hybrid, with both scheduling and virtual memory management in the kernel)
< braunr> ipc change a lot

## IRC, freenode, #hurd, 2013-09-23

for those of us interested in x15 and scalability in general: http://darnassus.sceen.net/~rbraun/radixvm_scalable_address_spaces_for_multithreaded_applications.pdf finally an implementation allowing memory mapping to occur concurrently (which is another contention issue when using mach-like ipc, which often do need to allocate/release virtual memory)

## IRC, freenode, #hurd, 2013-09-28

braunr: http://git.sceen.net/rbraun/x15.git/blob/HEAD:/README "X15 is a free microkernel." braunr: what distinguishes it from existing microkernels?

## IRC, freenode, #hurd, 2013-09-29

rah: the next part maybe ? "Its purpose is to provide a foundation for a Hurd-like operating system."
braunr: there are already microkernels that canbe used as the foundatin for Hurd=like operating systems; why are you creating another one? braunr: what distinguishes your microkernel from existing microkernels? rah: http://www.gnu.org/software/hurd/microkernel/mach/deficiencies.html rah: it's better :) rah: and please, cite one suitable kernel for the hurd tschwinge: those are deficiencies in Mach; I'm asking about x15 braunr: in what way is it better exactly? rah: more performant, more scalable braunr: how? better algorithms, better interfaces for example, it supports smp ah it supports SMP ok that's one thing it implements lockless synchronization à la rcu are there any others? ok lockless sync anything else? it can scale from 4MB of physical memory up to several hundreds GiB ipc is completely different, leading to simpler code, less data involved, faster context switches (although there is no code for that yet) how can it support larger memory while other microkernels can't? how is the ipc "different"? others can gnumach doesn't how can it support larger memory while gnumach can't? because it's not the same code base? gnumach doesn't support temporary kernel mapping ok, so x15 supports temporary kernel mapping not exactly virtual memory is completely different how so? gnumach does the same as linux, physical memory is mapped in kernel space so you can't have more physical memory than you have kernel space which is why gnumach can't handle more than 1.8G right now it's a 2/2 split in x15, the kernel maps what it needs and can map it from anywhere in physical memory rah: I think basically all this has already been discussed before and captured on that page? 
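
The memory limit mentioned above follows from direct mapping: if every physical page must have a fixed kernel virtual address, usable physical memory cannot exceed the kernel's share of the address space. The sketch below only illustrates that arithmetic for a 32-bit 2/2 split. The macro and function names are hypothetical, and the 200 MiB set aside for other kernel mappings is an assumed figure chosen to land near the 1.8G quoted in the discussion, not a value taken from GNU Mach.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define KERNEL_BASE     UINT32_C(0x80000000)    /* 2/2 split: kernel owns the upper 2 GiB */
#define KERNEL_END      UINT64_C(0x100000000)   /* end of a 32-bit address space */
#define KERNEL_OTHER    (UINT64_C(200) << 20)   /* assumed: kept for other kernel mappings */
#define DIRECT_MAP_SIZE (KERNEL_END - KERNEL_BASE - KERNEL_OTHER)

/* Translate a physical address through the direct mapping.
 * Fails for any page that lies beyond the direct-mapped window. */
bool direct_map_phys_to_virt(uint64_t pa, uint32_t *va)
{
    assert(va != NULL);
    if (pa >= DIRECT_MAP_SIZE)
        return false;
    *va = (uint32_t)(KERNEL_BASE + pa);
    return true;
}
```

With these assumed numbers the window is 1848 MiB, so physical memory above that is simply unreachable for the kernel. Mapping pages on demand, as described for x15, removes the fixed window at the price of more frequent kernel page table updates.
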
it already supports i386/pae/amd64 I see the drawback is that it needs to update kernel page tables more often on linux, a small part of the kernel space is reserved for temporary mappings, which need page table updates too but most allocations don't use that it's complicated also, i plan to make virtual memory operations completely concurrent on x15, similar to what is described in radixvm ok which means mapping operations on non overlapping regions won't be serialized a big advantage for microkernels which base their messaging optimizations on mapping so simply put, better performance because of simpler ipc and data structures, and better scalability because of improved data structure algorithms and concurrency tschwinge: yes but that page is no use to someone who wants a summary of what distinguishes x15 x15 is still far from complete, which is why i don't advertise it other than here "release early, release often"? give it a few more years :p release what ? something that doesn't work ? 
software yes this release early practice applies to maintenance release something that doesn't work so that others can help make it work not big developments i don't want that for now i have a specific idea of what i want, and both explaining and defending it would take time, better spent in development itself just wait for a first prototype and then you'll see if you want to help or not * rah does not count himself as one of the "others" who might help make it work one big difference with other microkernels is that x15 is specifically intended to run a unix like system a hurd like system providing a posix interface more accurately and efficiently so for example, while many microkernels provide only sync ipc, x15 provides both sync ipc and signals and then, there are a lot of small optimizations, like port names which will transparently identify as file descriptors light reference counting a restriction on ipc that only allows reliable transfers across network to machines with same arch and endianness etc..

## Summary

Created on 2013-09-29 by wiki user *BobHam*, *rah* on IRC.

> The x15 microkernel is under development by Richard Braun. Overall, x15 is intended to provide better performance because of simpler IPC and data structures and better scalability because of improved data structure algorithms and concurrency.
>
> The following specific features are intended to distinguish x15 from other microkernels. However, it should be noted that the microkernel is under heavy development and so the list may (and almost certainly will) change.
>
> * SMP support
> * Lockless synchronisation à la RCU
> * Support for large amounts of physical memory. GNU Mach does the same as Linux: physical memory is mapped in kernel space, so you can't have more physical memory than you have kernel space, which is why GNU Mach can't handle more than 1.8G right now (it's a 2/2 split). In x15, the kernel maps what it needs and can map it from anywhere in physical memory; the drawback is that it needs to update kernel page tables more often.
> * Virtual memory operations are planned to be completely concurrent on x15, similar to what is described in radixvm
> * Intended to efficiently run a Hurd-like system providing a POSIX interface
> * Providing both synchronous IPC and signals, as opposed to just synchronous IPC
> * Port names which will transparently identify as file descriptors
> * Light reference counting
> * A restriction on IPC that only allows reliable transfers across network to machines with same arch and endianness
> * etc.

## IRC, freenode, #hurd, 2013-10-12

braunr: are you still working on x15/propel? * zacts checks the git logs zacts: taking a break for now, will be back on it when i have a clearer view of the new vm system

## IRC, freenode, #hurd, 2013-10-15

braunr, few questions about x15. I was reading IRC logs on hurd site, and in the latest part, you say (or I misunderstood) that x15 is now hybrid kernel. So what made you change design... or did you? gnufreex: i always intended to go for a hybrid

## IRC, freenode, #hurd, 2013-10-19

braunr: when do you plan to start on x15/propel again? zacts: after i'm done with thread destruction on the hurd [[open_issues/libpthread/t/fix_have_kernel_resources]]. and do you plan to actually run hurd on top of x15, or are you still going to reimplement hurd as propel? and no, i don't intend to run the hurd on top of x15

## IRC, freenode, #hurd, 2013-10-24

braunr: What is your Mach replacement doing? "what" ? :) you mean how i guess Sure. well it's not a mach replacement any more and for now it's stalled while i'm working on the hurd that could be positive :) it's in good shape how did it diverge? sync ipc, with unix-like signals and qnx-like bare data messages hmm, like okl5?
(with scatter gather) okl4 yes btw, if you can find a document that explains this property of okl4, i'm interested, since i can't find it again on my own :/ basically, x15 has a much lighter ipc interface capabilities? mach ports are mostly retained but reference counting will be simplified hmm I don't like the reference counting part port names will be plain integers, to directly be usable as file descriptors and avoid a useless translation layer (same as in qnx) this sounds like future tense there is no ipc code yet so I guess this stuff is not implemented ok. next step is virtual memory and i'm taking my time because i want it to be a killer feature so if you don't IPC and you don't have VM, what do you have? :) i have multiprocessor multithreading I see. mutexes, condition variables, rcu-like lockless synchronization, work queues basic bsd-like virtual memory which i want to rework I ignored all of that in Viengoos :) and since ipc will still depend on virtual memory for zero-copy, i want the vm system to be right well, i'm more interested in the implementation than the architecture for example, i have unpublished code that features a lockless radix tree for vm_object lookups that's quite new for a microkernel based system, but the ipc interface itself is very similar to what already exists your half-sync ipc are original :) I'm considering getting back in the OS game. oh But, I'm not going to write a kernel this time. did anyone here consider starting a company for such things, like genode did ? I was considering using genode as a base. neal: why genode ? I want to build a secure system. I think the best way to do that is using capabilities. Genode runs on Fiasco.OC, for instance and it provides a lot of infrastructure neal: why not l4re for example ? neal: how important is the ability to revoke capabilities ? 
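
Editorial note: the remark above that port names will be plain integers, directly usable as file descriptors, can be illustrated with a small sketch. This is not x15 code (the log itself says there is no ipc code yet). The names `struct right`, `struct name_space`, `name_alloc`, `name_lookup`, `name_release` and the fixed table size are hypothetical, and the lowest-free allocation policy is an assumption borrowed from how POSIX hands out file descriptors.

```c
#include <assert.h>
#include <stddef.h>

#define NAME_TABLE_SIZE 16      /* hypothetical fixed size, for illustration only */

struct right {                  /* hypothetical stand-in for a port right */
    int in_use;
    void *object;
};

struct name_space {             /* hypothetical per-task name table */
    struct right slots[NAME_TABLE_SIZE];
};

/* Allocate the lowest free name (assumed policy, as POSIX does for fds).
 * Returns the name, or -1 if the table is full. */
static int name_alloc(struct name_space *ns, void *object)
{
    assert(ns != NULL);
    for (int name = 0; name < NAME_TABLE_SIZE; name++) {
        if (!ns->slots[name].in_use) {
            ns->slots[name].in_use = 1;
            ns->slots[name].object = object;
            return name;
        }
    }
    return -1;
}

/* Lookup is a bounds check plus an array index: the integer name is used
 * directly, with no separate fd-to-port translation step. */
static void *name_lookup(const struct name_space *ns, int name)
{
    assert(ns != NULL);
    if (name < 0 || name >= NAME_TABLE_SIZE || !ns->slots[name].in_use)
        return NULL;
    return ns->slots[name].object;
}

/* Release a name so that it can be handed out again. Returns 0 or -1. */
static int name_release(struct name_space *ns, int name)
{
    assert(ns != NULL);
    if (name < 0 || name >= NAME_TABLE_SIZE || !ns->slots[name].in_use)
        return -1;
    ns->slots[name].in_use = 0;
    ns->slots[name].object = NULL;
    return 0;
}
```

With this policy a released name is reused by the very next allocation, which keeps the table dense but also means a stale name can silently end up referring to a different right.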
In the discussion on [[community/gsoc/project_ideas/object_lookups]], *IRC, freenode, #hurd, 2013-10-24*:

and, with some effort, getting rid of the hash table lookup by letting the kernel provide the address of the object (iirc neil knew the proper term for that) teythoon: that is a big interface change how so optimizing libihash and libpthread should already be a good start well how do you intend to add this information ? ok, "big" is overstatement, but still, it's a low level interface change that would probably break a lot of things store a pointer in the port structure in gnumach, make that accessible somehow yes but how ? interesting question indeed my plan for x15 is to make this "label" part of received messages which means you need to change the format of messages that is what i call a big change

### IRC, freenode, #hurd, 2013-10-31

neal: you mentioned you want to use Genode as a base... what exactly would you want to build on top of it, different than what the Genode folks are doing? [[Genode]]. antrik: I want to build a secure operating system. antrik: One focused on user security. braunr: You mean revoke individual send rights? braunr: Or, what do you mean? Or do you mean the ability to receive a notification on revocation? neal: yes, revoking individual send rights I don't think it is needed in practice. neal: ok But, you need a membrane object Here's the idea: like a peropen ? you have say a file server and a proxy a process only talks to the file server via the proxy for the proxy to revoke access to the file object it gave out, it needs to either use your revoke interpose on all ipcs (which is expensive) or use a proxy object/membrane which basically forwards messages to the underlying object isn't that also interposing ? of course but if it is done in the kernel, it is fast ah in the kernel you just walk a linked list what's the difference with a peropen object ?
That's another option you use a peropen and then provide a call to force the per-open to be closed so the proxy now invokes the server the issue here is that the proxy has to trust the server yes how can you not trust servers ? that is, if the intent is to prevent further communication between the server and the process, the server may ignore the request in this case, you probably trust the server hum but it could be that you have two processes communicating if the intent is to prevent communication, doesn't the client just need to humm not communicate ? :) the point is that the two processes are colluding what are these two processes ? I'm not sure this case is of practical relevance ok https://www.cs.cornell.edu/courses/cs513/2002sp/L10.html thanks

### IRC, freenode, #hurd, 2013-11-14

neal: hm... I was under the impression that the Genode themselves are also interested in user security... what is missing from their version that you want to add? err... the Genode folks antrik: I'm missing some context neal: a while back you said that you want to build a secure system on top of Genode yes the fact that they are doing what I want is great but there is more to a secure system than an operating system ah, so it's about applications+ ? yes, that is part of it it's also about secure messaging and hiding "meta-data" i'm still wondering how you envision the powerbox when a program wants the user to select a file, it makes an upcall to the power box application braunr: you can probably find some paper from Shapiro ;-) well, sure, it looks easy but is there always a power box application ? is there always a guarantee there won't be recursive calls made by that application ? how does it integrate with the various interfaces a system can have ? there is always a power box application I don't know what you mean by recursive calls are techniques such as remembering for some time like sudo does applicable to a powerbox application ?
if you mean many calls, then it is possible to rate limit it well, the powerbox will use messaging itself is it always privileged ? privileged enough it is privileged such like the X11 display manager is privileged and can see all of the video content what else other than accessing a file would it be used for ? one case i think of is accessing the address space of another application, in debuggers 14:56 < neal> there is always a power box application what would it be when logging on a terminal ? braunr: when running pure command line tools, you can already pass the authority as part of the command line. however, I'm wondering whether it really makes sense to apply this to traditional shell tools... that's one of my concerns when does it really make sense ? for interactive use (opening new files from within a running program), I don't think it can be accomplished in a pure terminal interaction model... and you say "you pass the authority" braunr: it makes sense for interactive applications i thought the point of the powerbox is precisely not to do that no, it's still possible and often reasonable to pass some initial authority on startup. the powerbox is only necessary when further access needs to be provided at runtime ok the power box enables dynamic delegation of authority, as antrik said ok but how practical is it ? applications whose required authority is known a priori and max(required authority) is approximately min(required authority) can be handled with static policies don't applications sometimes need a lot of additional authority ? ok actually, thinking about it, a powerbox should also be possible on a simple terminal, if we make sure the application doesn't get full control of the terminal, but rather allow the powerbox to temporarily take over input/output without the application being able to interpose... so not quite a traditional UNIX terminal, but close enough I'd say the terminal itself maybe ? hm...
that would avoid having to implement a more generic multiplexing approach -- but it would mix things that are normally quite orthogonal... BTW, I personally believe terminals need to get smarter anyways :-) ok the traditional fully linear dialog has some nice properties; but it is also pretty limited, leading to usability problems soon. I have some vague ideas for an approach that still looks mostly like a linear dialog, but is actually more structured

## IRC, freenode, #hurd, 2013-11-04

yes the learning curve [of the Hurd] is too hard that's an entry barrier this is why i use well known posix-like (or other well established) apis in x15 also why i intend to make port rights blend into file descriptors right well the real reason is efficiency but matching existing practices is very good too

## IRC, freenode, #hurd, 2013-11-08

braunr, how is work on x-15 progressing? Is there some site to check what is new? gnufreex: stalled for 2 months i'm working on the hurd for now, will get back to it later no site well so, you hit some design problem, or what? I mean why stalled http://git.sceen.net/rbraun/x15.git/ :p Thanks something like that yes i came across http://darnassus.sceen.net/~rbraun/radixvm_scalable_address_spaces_for_multithreaded_applications.pdf I read that, I think I found it on Hurd site. and since x15 aims at being performant and scalable, it seems like a major feature to bring in but it's not simple to integrate So you want to add that? gnufreex: yes braunr, but what are the problems? ? ah you really want to know ? :) Well... yeah you need to know both x15 and radixvm for that for one, refcache, as described in the radixvm paper, doesn't seem scalable it is in practice in their experiments, but only because they didn't push some parameters too high so i need to rethink it I don't know x15...
but I read radixvm paper next, the bsd-like vm used by x15 uses a red-black tree to store memory areas, which doesn't need external storage radixvm as implemented in xv6 is only used for user processes, not the kernel which means the kernel allocator is a separate implementation, as it's done in linux x15 uses the same implementation for both the kernel and user maps which results in a recursion problem because a radix tree uses external nodes that must be dynamically allocated so you would pretty much need to rewrite x15 no just vm/ and $arch/pmap and yes, pmap needs to handle per-core page tables something i wanted to add already but couldn't because of similar recursion problems Yeah, vm system... but what else did you do with x15... it is at early stage... multithreading That doesn't need to be rewritten? no Ok... good. physical memory allocation neither only virtual memory is x15 in runnable state? I mean in virtual machine? you can start it but you won't go far :) What do you use as development platform? it basically detects memory and processors, starts idle, migration and worker threads, and leaves Is it compilable on fedora 19 probably i use debian stable and unstable on the hurd ok, I will probably try it in KVM... better do it on real hardware too in case you find a bug I can't make new partition now... it seems my hard drive is dying. When I get a new one I will try on real hardware. you don't need a new partition the reason radixvm is important is twofold 1/ ipc will probably make use of the core vm operations used by mmap and munmap 2/ no other system currently provides scalable mmap/munmap/mprotect Yes, that would make x15 pretty special... But I read somewhere that you wanted to implement RCU during summer Did you do that?

## IRC, freenode, #hurd, 2013-11-12

neal: about secure operating systems i assume you consider clients providing their own memory a strong requirement for that, right ?
no I'm less interested in availability or performance guarantees ok but i thought it was a requirement to avoid denial of service of course then why don't you consider it required ? I want something working in a reasonable amount of time :) agreed more seriously: my primary requirement is that a program cannot access information that the user has not authorized it to access ok the requirement that you are suggesting is that a program be able to access information that the user has authorized it to access this is availability i'm not following what's the difference ? assume we have two programs: A and B on Unix, if they run under the same uid, they can access each other's files I want to fix this ok, that's not explicit authorization but is that what you mean ? Now, assuming that A cannot access B's data and vice versa we have an availability problem A could prevent B from accessing its data via a DoS attack I'm not going to try to fix that. ok and how do you intend to allow A to access B's data ? i guess the powerbox mentioned in the critique but do you have a more precise description about something practical to use ?

## IRC, freenode, #hurd, 2013-11-14

In context of [[hurd/libports]], *Open Issues*, *IRC, freenode, #hurd, 2013-11-14*.

fyi, x15 will not provide port renaming teythoon: also, i'm considering enforcing port names to be as close as possible to 0 when being allocated as part of the interface what do you think about that ? braunr: that's probably wise, yes you could hand out receive ports close to 0 and send ports close to ~0 teythoon: what for ?
well, if one stores only one kind in an array, it won't waste as much space this also means you need to separate receive from send rights in the interface so that you know where to look for them i'm not sure it's worth the effort using the same code for them both looks more efficient the right lookup code is probably one of the hottest paths in the system right one of the nice things about not reusing port names is that it helps catch bugs you don't want to accidentally send a message to the wrong recipient how could you, if the same name at different times denotes different rights ? you forget to clean up something if you don't clean, how could you get the same name for a right you didn't release ? that's not hard to do :) ah, you cleaned up the port right but not the name ah ok destroy the port and forget that a thread is still working on a response the data structure says use the port at index X X is reallocated in the mean time excuse my ignorance, but gnumach *is* reusing port names, isn't it? that policy is why i'm not sure i want to enforce allocation policy in the interface :/ This is not about a security property of the system this is about failing fast you want to fail as close to the source of the problem as possible we could make the kernel use different allocation policies for names, to catch bugs, yes make the index X valid again and you've potentially masked the bug braunr: if you were to merge your radix tree implementation into gnumach and replace the splay tree with it, would that make using renamed ports fast enough so we can just rename all receive ports doing away with the extra lookup like mach-defpager does ? i don't think so the radix tree code is able to compress its size when keys are close to 0 using addresses would add 1, 2, maybe 3 levels of internal nodes for every right we could use a true integer hash table for that though hm no, hurd packages crash ...
:/ but malloc allocates stuff in a contiguous space, so the pointers should be similar in the most significant bits if you use malloc, yes sure but that'd make the radix tree representation compact, no? it could the current code only compresses near 0 oh better compression could be implemented though

## IRC, freenode, #hurd, 2013-11-21

have you seen liburcu ? a bit, yes it might be worth investigating to use it in some servers it is the proc server comes to mind personally, i think all hurd servers should use rcu libports should use rcu yes lockless synchronization should be a major feature of x15/propel present even during message passing

## IRC, freenode, #hurd, 2013-12-09

improving our page cache with arc would be great it's on the todo list for x15 :> not sure you referred to virtual memory management though (actually, it's CAR, not ARC that is planned for x15)

## IRC, freenode, #hurd, 2013-12-30

zacts: http://darnassus.sceen.net/~rbraun/x15/qemu_x15.sh

## IRC, freenode, #hurd, 2014-01-03

oh, btw, i've started working on x15 again :> saw that :) first item on the list: per-cpu page tables the magic that will make ipc extremely scalable :) i'm worried about your approach tbh too much overhead ? not on any technical level but haven't there been enough reimplementation efforts that got nowhere ? oh that ^^ well, i have personal constraints and frustrations with the existing code, and my goal isn't to actually produce anything serious until it actually gets there which, yes, it might not really, i'm doing it for fun well sure that's a damn good reason ;) and if it ever reaches a state where it can actually be used to run stuff, i would be very happy and considering how it's done, i'm pretty sure things could be built a lot faster on such a system but you need to reimplement all the userspace servers as well, and the libc stuff yes do you plan to reimplement this from scratch or do you have plans to 'bootstrap' propel from hurd ? from scratch well...
i'm not sure that this is feasible or even a good idea. that's what i meant in a nutshell i guess. i'm familiar with that criticism and you may be right this is also why i keep working on the hurd at the same time we could also talk about making hurd more easily portable portable with regard to what ? evolving hurd and mach to the point where it might be feasible to port hurd to another ukernel not so easy i know i'm not even sure i would want that well, since the hurd isn't optimized at all, why not why would it necessarily hinder optimization ? because in practice, it's rare for a microkernel to provide all the features the hurd would require to run really well the most severe issue being that they either provide asynchronous ipc, used for signals, or only synchronous ipc, making signal and other event-driven code hard to emulate (usually requiring separate threads)

## IRC, freenode, #hurd, 2014-01-20

[[open_issues/translate_fd_or_port_to_file_name]]:

i wonder if it would not be best to add a description to mach tasks i think it would to aid fixing these kind of issues in x15, i actually add descriptions (names) to all kernel objects that's probably a good idea, yes well, not all, but many i'd like to push x15 this year it currently is the only design of a truly scalable microkernel that i know of push how? spend time on it k do you think it will make sense to solicit outside contributions at one point? yes the roadmap is vm system -> ipc system -> userspace (including RPC handling) once we can actually do things in userspace, the priority will be getting a shell with glibc people will be able to help "easily" at that point just wondering, apart from scalability, did you write it for performance, for hackability, or something else? it's basically the hurd architecture, including improvements from the critique, with performance and scalability in mind ok the main improvements i think of currently are resource containers, lexical ..
resolution, and lists of trusted users with which to communicate it's strongly oriented for posix compatibility though sounds nice, i like it already ;) is it compatible with Mach to some degree? so things like running without an identity will be forbidden in the default system personality no, not compatible with mach at all this sounds like it is doing more than Mach did braunr: ah, ok it's not "x15mach" any more :) right, I missed out on that

### IRC, freenode, #hurd, 2014-01-21

i also don't write anything that would prevent real-time b/c that's a potential market for such an operating system ? yes well, i can't say i don't like the sound of that ;) the ipc interface should be close to that of qnx

## IRC, freenode, #hurd, 2014-02-23

braunr: have you looked at genode? braunr: i sometimes wonder how hard it'd be to port hurd atop it because i find some similarities with what l4/fiasco/viengoos provided cluck: i have, but genode seems a bit too far from posix for our tastes (and yes, i realize we'd be getting farther from the hw) ah you really mean running the hurd on top of it i personally don't like the idea braunr: well, true, but their noux implementation proves it's not a dealbreaker braunr: at least initially that'd be the best implementation approach, no? as time went on integrating hurd servers more tightly at a lower level makes sense but doing so from the get go would be foolhardy braunr: or am i missing something obvious? cluck: why would it be ? braunr: going by what happened with l4 it's too much code to port and optimize at once cluck: i don't think it is cluck: problems with l4 didn't have much to do with "too much code" braunr: i won't debate that, you have more experience than me with hurd code. anyway that's how i'd go about it, first get it all running then get it running fast. breakage is bad and you think moving from something like linux or genode to an implementation closer to hardware won't break things ?
braunr: yes, i read the paper, obvious unexpected shortcomings but even had them not been there the paradigms are too different and creating proper mappings from one model to the other would at least be time consuming ye yes i'm convinced the simple approach of a small microkernel with the proper interface along with the corresponding sysdeps layer in glibc would be enough to get a small hurd like system quickly experience with other systems shows how to directly optimize a lot of things from the start, without much effort braunr: sorry. back to our talk, i mentioned genode because of the nice features it has that'd be useful on hurd cluck: which ones do you refer to ? braunr: the security model is the biggest one how does it differ from the hurd, except for revocation ? braunr: then there's the ease of portability ? braunr: it's more strict how would that help us ? braunr: if hurd was running atop it we'd get extra platforms supported almost for free whenever they did (since we'd be using the same primitives) why not choose the underlying microkernel directly ? call me crazy but i believe code reuse is a good thing, i see little point in duplicating existing code just because you can what part of genode should be reused then ? that's what got me thinking about genode in the first place, ideologically they share a lot (if not most) of hurd's goals and code wise they feel close enough to make a merge of sorts not seem crazy talk, thus my asking if i'm missing something obvious i think the design is incompatible with our goals of posix compatibility braunr: oh, ok.
braunr: i was assuming that wasn't an issue, as i mentioned before they have noux already and if hurd's servers got ported they'd provide whatever else that was missing noux looks like a unix server for binary compatibility i'm not sure it is but that's what the description makes me think and if it really is, then it's no different than running linux on top of an hypervisor ok it's not for binary compatibility but it definitely is a (partial) unix server i much prefer the way the hurd is posix compliant without any additional layer for compatibility or virtualization braunr: noux is a runtime, as i understand it there's no binary compatibility just source (ie library/api calls) yes i corrected that just now sorry, i'm having lag issues no worries braunr: anyway, how's x15 coming along? still far from being a practical replacement? yes .. :( and it's not a replacement (for mach) no huh? it's not a replacement for the hurd err, for mach braunr: i thought you were writing it to be compatible with mach's interfaces no it used to be that way but no braunr: what changed? mach ipc is too ccmplicated complicated* its supposed benefit (of allowing the creation of computer clusters for single system images) is outdated and not very interesting it's error prone and it incurs more overhead than it should no arguing there braunr: are you still targeting being able to run hurd atop x15 or is it just your pet project now? i don't intend the hurd to run on top of it the reason it's a rewrite is to fix a whole bunch of major issues in one go