From 9667351422dec0ca40a784a08dec7ce128482aba Mon Sep 17 00:00:00 2001 From: Thomas Schwinge Date: Wed, 10 Jul 2013 23:39:29 +0200 Subject: IRC. --- microkernel/mach/deficiencies.mdwn | 1427 ++++++++++++++++++++++- microkernel/mach/gnumach/memory_management.mdwn | 15 + 2 files changed, 1440 insertions(+), 2 deletions(-) (limited to 'microkernel') diff --git a/microkernel/mach/deficiencies.mdwn b/microkernel/mach/deficiencies.mdwn index 1294b8b3..d1cdeb54 100644 --- a/microkernel/mach/deficiencies.mdwn +++ b/microkernel/mach/deficiencies.mdwn @@ -260,9 +260,9 @@ License|/fdl]]."]]"""]] solve a number of problems... I just wonder how many others it would open -# IRC, freenode, #hurd, 2012-09-04 +# X15 -X15 +## IRC, freenode, #hurd, 2012-09-04 it was intended as a mach clone, but now that i have better knowledge of both mach and the hurd, i don't want to retain mach @@ -767,3 +767,1426 @@ In context of [[open_issues/multithreading]] and later [[open_issues/select]]. imo, a rewrite is more appropriate sometimes, things done in x15 can be ported to the hurd but it still requires a good deal of effort + + +## IRC, freenode, #hurd, 2013-04-26 + + braunr: Did I see that you are back tinkering with X15? + well yes i am + and i'm very satisfied with it currently, i hope i can maintain + the same level of quality in the future + it can already handle hundreds of processors with hundreds of GB + of RAM in a very scalable way + most algorithms are O(1) + even waking up multiple threads is O(1) :) + i'd like to implement rcu this summer + Nice. When are you gonna replace gnumach? ;-P + never + it's x15, not x15mach now + it's not meant to be compatible + Who says it has to be compatible? :) + i don't know, my head + the point is, the project is about rewriting the hurd now, not + just the kernel + new kernel, new ipc, new interfaces, new libraries, new everything + Yikes, now that is some work. 
:) + well yes and no + ipc shouldn't be that difficult/long, considering how simple i + want the interface to be + Cool. + networking and drivers will simply be reused from another code + base like dde or netbsd + so besides the kernel, it's a few libraries (e.g. a libports like + library), sysdeps parts in the c library, and a file system + For inclusion in glibc or are you not intending on using glibc? + i intend to use glibc, but not for upstream integration, if that's + what you meant + so a private, local branch i assume + i expect that part to be the hardest + + +## IRC, freenode, #hurd, 2013-05-02 + + braunr: also, will propel/x15 use netbsd drivers or netdde linux + drivers? + or both? + probably netbsd drivers + and if netbsd, will it utilize rump? + i don't know yet + ok + device drivers and networking will arrive late + the system first has to run in ram, with a truely configurable + boot process + (i.e. a boot process that doesn't use anything static, and can + boot from either disk or network) + rump looks good but it still requires some work since it doesn't + take care of messaging as well as we'd want + e.g. signal relaying isn't that great + I personally feel like using linux drivers would be cool, just + because linux supports more hardware than netbsd iirc.. + zacts: But it could be problematic as you should take quite a lot + code from linux kernel to add support even for a single driver. + zacts: netbsd drivers are far more portable + oh wow, interesting. yeah I did have the idea that netbsd would be + more portable. 
+ mcsim: that doesn't seem to be as big a problem as you might + suggest + the problem is providing the drivers with their requirements + there are a lot of different execution contexts in linux (hardirq, + softirq, bh, threads to name a few) + being portable (as implied in netbsd) also means being less + demanding on the execution context + which allows reusing code in userspace more easily, as + demonstrated by rump + i don't really care about extensive hardware support, since this + is required only for very popular projects such as linux + and hardware support actually comes with popularity (the driver + code base is related with the user base) + so you think that more users will contribute if the projects takes + off? + i care about clean and maintainable code + well yes + I think that's a good attitude + what i mean is, there is no need for extensive hardware support + braunr: TBH, I did not really got idea of rump. Do they try to run + the whole kernel or some chosen subsystems as user tasks? + mcsim: some subsystems + well + all the subsystems required by the code they actually want to run + (be it a file system or a network stack) + braunr: What's the difference with dde? + it's not kernel oriented + what do you mean? 
+ it's not only meant to run on top of a microkernel + as the author named it, it's "anykernel" + if you remember at fosdem, he run code inside a browser + ran* + and also, netbsd drivers wouldn't restrict the license + although not a priority, having a (would be) gnu system under + gplv3+ would be nice + that would be cool + x15 is already gplv3+ + iirc + yes + cool + yeah, I would agree netbsd drivers do look more attractive in that + case + again, that's clearly not the main reason for choosing them + ok + it could also cause other problems, such as accepting a bsd + license when contributing back + but the main feature of the hurd isn't drivers, and what we want + to protect with the gpl is the main features + I see + drivers, as well as networking, would be third party code, the + same way you run e.g. firefox on linux + with just a bit of glue + braunr: what do you think of the idea of being able to do updates + for propel without rebooting the machine? would that be possible down the + road? + simple answer: no + that would probably require persistence, and i really don't want + that + does persistence add a lot of complexity to the system? + not with the code, but at execution, yes + interesting + we could add per-program serialization that would allow it but + that's clearly not a priority for me + updating with a reboot is already complex enough :) + + +## IRC, freenode, #hurd, 2013-05-09 + + the thing is, i consider the basic building blocks of the hurd too + crappy to build anything really worth such effort over them + mach is crappy, mig is crappy, signal handling is crappy, hurd + libraries are ok but incur a lot of contention, which is crappy today + Understood but it is all we have currently. + i know + and it's good as a prototype + We have already had L4, viengoos, etc and nothing has ever come + to fruition. 
    :(
    my approach is completely different
    it's not a new design
    a few things like ipc and signals are redesigned, but that's minor
    compared to what was intended for hurdng
    propel is simply meant to be a fast, scalable implementation of
    the hurd high level architecture
    bddebian: imagine a mig you don't fear using
    imagine interfaces not constrained to 100 calls ...
    imagine per-thread signalling from the start
    braunr: I am with you 100% but it's vaporware so far.. ;-)
    bddebian: i'm just explaining why i don't want to work on large
    scale projects on the hurd
    fixing local bugs is fine
    fixing paging is mandatory
    usb could be implemented with dde, perhaps by sharing the pci
    handling code
    (i.e. have one big dde server with drivers inside, a bit ugly but
    straightforward compared to a full fledged pci server)
    braunr: But this is the problem I see.  Those of you that have
    the skills don't have the time or energy to put into fixing that kind of
    stuff.
    braunr: That was my thought.
    bddebian: well i have time, and i'm currently working :p
    but not on that
    bddebian: also, it won't be vaporware for long, i may have ipc
    working well by the end of the year, and optimized and developer-friendly
    by next year)


## IRC, freenode, #hurd, 2013-06-05

    i'll soon add my radix tree with support for lockless lookups :>
    a tree organized based on the values of the keys themselves, and
    not how they relatively compare to each other
    also, a tree of arrays, which takes advantage of cache locality
    without the burden of expensive resizes
    you seem to be applying good algorithmic techniques
    that is nice
    that's one goal of the project
    you can't achieve performance and scalability without the
    appropriate techniques
    see http://git.sceen.net/rbraun/librbraun.git/blob/HEAD:/rdxtree.c
    for the existing userspace implementation
    in kern/work.c I see one TODO "allocate numeric IDs to better
    identify worker threads"
    yes
    and i'm adding my radix tree now exactly for that
    (well not only, since radix tree will also back VM objects and IPC
    spaces, two major data structures of the kernel)


## IRC, freenode, #hurd, 2013-06-11

    and also starting paging anonymous memory in x15 :>
    well, i've merged my radix tree code, made it safe for lockless
    access (or so i hope), added generic concurrent work queues
    and once the basic support for anonymous memory is done, x15 will
    be able to load modules passed from grub into userspace :>
    but i've also been thinking about how to solve a major scalability
    issue with capability based microkernels that no one else seems to have
    seen or bothered thinking about
    for those interested, the problem is contention at the port level
    unlike on a monolithic kernel, or a microkernel with thread-based
    ipc such as l4, mach and similar kernels use capabilities (port rights in
    mach terminology) to communicate
    the kernel then has to "translate" that reference into a thread to
    process the
request + this is done by using a port set, putting many ports inside, and + making worker threads receive messages on the port set + and in practice, this gets very similar to a traditional thread + pool model + one thread actually waits for a message, while others sit on a + list + when a message arrives, the receiving thread wakes another from + that list so it receives the next message + this is all done with a lock + Maybe they thought about it but couldn't or were to lazy to find + a better way? :) + braunr: what do you mean under "unlike .... a microkernel with + thread-based ipc such as l4, mach and similar kernels use capabilities"? + L4 also has capabilities. + mcsim: not directly + capabilities are implemented by a server on top of l4 + unless it's OKL4 or another variant with capabilities back in the + kernel + i don't know how fiasco does it + so the problem with this lock is potentially very heavy contention + and contention in what is the equivalent of a system call .. + it's also hard to make it real-time capable + for example, in qnx, they temporarily apply priority inheritance + to *every* server thread since they don't know which one is going to be + receiving next + braunr: in fiasco you have capability pool for each thread and this + pool is stored in tread control block. When one allocates capability + kernel just marks slot in a pool as busy + mcsim: ok but, there *is* a thread for each capability + i mean, when doing ipc, there can only be one thread receiving the + message + (iirc, this was one of the big issue for l4-hurd) + ok. i see the difference. + well i'm asking + i'm not so sure about fiasco + but that's what i remember from the generic l4 spec + sorry, but where is the question? 
+ 16:04 < braunr> i mean, when doing ipc, there can only be one + thread receiving the message + yes, you specify capability to thread you want to send message to + i'll rephrase: + when you send a message, do you invoke a capability (as in mach), + or do you specify the receiving thread ? + you specify a thread + that's my point + but you use local name (that is basically capability) + i see + from wikipedia: "Furthermore, Fiasco contains mechanisms for + controlling communication rights as well as kernel-level resource + consumption" + not certain that's what it refers to, but that's what i understand + from it + more capability features in the kernel + but you still send to one thread + yes + that's what makes it "easily" real time capable + a microkernel that would provide mach-like semantics + (object-oriented messaging) but without contention at the messsage + passing level (and with resource preallocation for real time) would be + really great + bddebian: i'm not sure anyone did + braunr: Well you can be the hero!! ;) + the various papers i could find that were close to this subject + didn't take contention into account + exception for network-distributed ipc on slow network links + bddebian: eh + well i think it's doable acctually + braunr: can you elaborate on where contention is, because I do not + see this clearly? + mcsim: let's take a practical example + a file system such as ext2fs, that you know well enough + imagine a large machine with e.g. 
64 processors
    and an ignorant developer like ourselves issuing make -j64
    every file access performed by the gcc tools will look up files,
    and read/write/close them, concurrently
    at the server side, thread creation isn't a problem
    we could have as many threads as clients
    the problem is the port set
    for each port class/bucket (let's assume they map 1:1), a port set
    is created, and all receive rights for the objects managed by the server
    (the files) are inserted in this port set
    then, the server uses ports_manage_port_operations_multithread()
    to service requests on that port set
    with as many threads required to process incoming messages, much
    the same way a work queue does it
    but you can't have *all* threads receiving at the same time
    there can only be one
    the others are queued
    i did a change about the queue order a few months ago in mach btw
    mcsim: see ipc/ipc_thread.c in gnumach
    this queue is shared and must be modified, which basically means a
    lock, and contention
    so the 64 concurrent gcc processes will suffer from contention at
    the server while they're doing something similar to a system call
    by that, i mean, even before the request is received
    mcsim: if you still don't understand, feel free to ask
    braunr: I'm thinking on it :) give me some time
    "Fiasco.OC is a third generation microkernel, which evolved from
    its predecessor L4/Fiasco.  Fiasco.OC is capability based"
    ok
    so basically, there are no more interesting l4 variants strictly
    following the l4v2 spec any more
    "The completely redesigned user-land environment running on top of
    Fiasco.OC is called L4 Runtime Environment (L4Re).
It provides the + framework to build multi-component systems, including a client/server + communication framework" + so yes, client/server communication is built on top of the kernel + something i really want to avoid actually + So when 1 core wants to pull something out of queue it has to lock + it, and the problem arrives when other 63 cpus are waiting in the same + lock. Right? + mcsim: yes + could this be solved by implementing per cpu queues? Like in slab + allocator + solved, no + reduced, yes + by using multiple port sets, each with their own thread pool + but this would still leave core problems unsolved + (those making real-time hard) + to make it real-time is not really essential to solve this problem + that's the other way around + we just need to guarantee that locking protocol is fair + solving this problem is required for quality real-time + what you refer to is similar to what i described in qnx earlier + it's ugly + keep in mind that message passing is the equivalent of system + calls on monolithic kernels + os ideally, we'd want something as close as possible to an + actually system call + so* + mcsim: do you see why it's ugly ? + no i meant exactly opposite, I meant to use some deterministic + locking protocol + please elaborate + because what qnx does is deterministic + We know in what sequences threads will acquire the lock, so we will + not have to apply inheritance to all threads + hwo do you know ? + there are different approaches, like you use ticket system or MCS + lock (http://portal.acm.org/citation.cfm?id=103729) + that's still locking + a system call has 0 contention + 0 potential contention + in linux? + everywhere i assume + than why do they need locks? 
they need locks after the system call
    the system call itself is a stupid trap that makes the thread
    "jump" in the kernel
    and the reason why it's so simple is the same as in fiasco:
    threads (clients) communicate directly with the "server thread"
    (themselves in kernel mode)
    so 1/ they don't go through a capability or any other abstraction
    and 2/ they're even faster than on fiasco because they don't need
    to find the destination, it's implied by the trap mechanism)
    2/ is only an optimization that we can live without
    but 1/ is a serious bottleneck for microkernels
    Do you mean that there are system calls that proceed without
    locks, or do you mean that there are no system calls that use locks?
    this is what makes papers such as
    https://www.kernel.org/doc/ols/2007/ols2007v1-pages-251-262.pdf valid
    i mean the system call (the mechanism used to query system
    services) doesn't have to grab any lock
    the idea i have is to make the kernel transparently (well, as much
    as it can be) associate a server thread to a client thread at the port
    level
    at the server side, it would work practically the same
    the first time a server thread services a request, it's
    automatically associated to a client, and subsequent requests will
    directly address this thread
    when the client is destroyed, the server gets notified and
    destroys the associated server thread
    for real-time tasks, i'm thinking of using a signal that gets sent
    to all servers, notifying them of the thread creation so that they can
    preallocate the server thread
    or rather, a signal to all servers wishing to be notified
    or perhaps the client has to reserve the resources itself
    i don't know, but that's the idea
    and who will send this signal?
the kernel
    x15 will provide unix like signals
    but i think the client doing explicit reservation is better
    more complicated, but better
    real time developers ought to know what they're doing anyway
    mcsim: the trick is using lockless synchronization (like rcu) at
    the port so that looking up the matching server thread doesn't grab any
    lock
    there would still be contention for the very first access, but
    that looks much better than having it every time
    (potential contention)
    it also simplifies writing servers a lot, because it encourages
    the use of a single port set for best performance
    instead of burdening the server writer with avoiding contention
    with e.g. a hierarchical scheme
    "looking up the matching server" -- looking up where?
    in the port
    but why can't you just take first?
    that's what triggers contention
    you have to look at the first
    > (16:34:13) braunr: mcsim: do you see why it's ugly ?
    BTW, not really
    imagine several clients send concurrently
    mcsim: well, qnx doesn't do it every time
    qnx boosts server threads only when there are no threads currently
    receiving, and a sender with a higher priority arrives
    since qnx can't know which server thread is going to be receiving
    next, it boosts every thread
    boosting priority is expensive, and boosting every thread is
    linear with the number of threads
    so on a big system, it would be damn slow for a system call :)
    ok
    and grabbing "the first" can't be properly done without
    serialization
    if several clients send concurrently, only one of them gets
    serviced by the "first server thread"
    the second client will be serviced by the "second" (or the first
    if it came back)
    making the second become the first (i call it the manager) must be
    atomic
    that's the core of the problem
    i think it's very important because that's currently one of the
    fundamental differences with monolithic kernels
    so looking up for server is done without contention.
And just + assigning task to server requires lock, right? + mcsim: basically yes + i'm not sure it's that easy in practice but that's what i'll aim + at + almost every argument i've read about microkernel vs monolithic is + full of crap + Do you mean lock on the whole queue or finer grained one? + the whole port + (including the queue) + why the whole port? + how can you make it finer ? + is queue a linked list? + yes + than can we just lock current element in the queue and elements + that point to current + that's two lock + and every sender will want "current" + which then becomes coarse grained + but they want different current + let's call them the manager and the spare threads + yes, that's why there is a lock + so they don't all get the same + the manager is the one currently waiting for a message, while + spare threads are available but not doing anything + when the manager finally receives a message, it takes the first + spare, which becomes the new manager + exactly like in a common thread pool + so what are you calling current ? + we have in a port queue of threads that wait for message: t1 -> t2 + -> t3 -> t4; kernel decided to assign message to t3, than t3 and t2 are + locked. + why not t1 and t2 ? + i was calling t3 in this example as current + some heuristics + yeah well no + it wouldn't be deterministic then + for instance client runs on core 3 and wants server that also runs + on core 3 + i really want the operation as close as a true system call as + possible, so O(1) + what if there are none ? + it looks up forward up to the end of queue: t1->t2->t4; takes t4 + than it starts from the beginning + that becomes linear in the worst case + no + so 4095 attempts on a 4096 cpus machine + ? 
you're right
    unfortunately :/
    a per-cpu scheme could be good
    and applicable
    with much more thought
    and the problem is that, unlike the kernel, which is naturally a
    one thread per cpu server, userspace servers may have fewer or more
    threads than cpus
    possibly unbalanced too
    so it would result in complicated code
    one good thing with microkernels is that they're small
    they don't pollute the instruction cache much
    keeping the code small is important for performance too
    so forgetting this kind of optimization makes for not too
    complicated code, and we rely on the scheduler to properly balance
    threads
    mcsim: also note that, with your idea, the worst case is twice as
    expensive as a single lock
    and on a machine with few processors, this worst case would be
    likely
    so, you propose to try to take the first server from the queue
    every time?
    braunr: ^
    no
    that's what is done already
    i propose doing that the first time a client sends a message
    but then, the server thread that replied becomes strongly
    associated to that client (it cannot service requests from other clients)
    and it can be recycled only when the client dies
    (which generates a signal indicating to the server that it can now
    recycle the server thread)
    (a signal similar to the no-sender or dead-name notifications in
    mach)
    that signal would be sent from the kernel, in the traditional unix
    way (i.e.
no dedicated signal thread since it would be another source of + contention) + and the server thread would directly receive it, not interfering + with the other threads in the server in any way + => contention on first message only + now, for something like make -j64, which starts a different + process for each compilation (itself starting subprocesses for + preprocessing/compiling/assembling) + it wouldn't be such a big win + so even this first access should be optimized + if you ever get an idea, feel free to share :) + May mach block thread when it performs asynchronous call? + braunr: ^ + sure + but that's unrelated + in mach, a sender is blocked only when the message queue is full + So we can introduce per cpu queues at the sender side + (and mach_msg wasn't called in non blocking mode obviously) + no + they need to be delivered in order + In what order? + messages can't be reorder once queued + reordered + so fifo order + if you break the queue in per cpu queues, you may break that, or + need work to rebuild the order + which negates the gain from using per cpu queues + Messages from the same thread will be kept in order + are you sure ? + and i'm not sure it's enough + thes cpu queues will be put to common queue once context switch + occurs + *all* messages must be received in order + these* + uh ? + you want each context switch to grab a global lock ? + if you have parallel threads that send messages that do not have + dependencies than they are unordered + always + the problem is they might + consider auth for example + you have one client attempting to authenticate itself to a server + through the auth server + if message order is messed up, it just won't work + but i don't have this problem in x15, since all ipc (except + signals) is synchronous + but it won't be messed up. 
You just "send" messages in O(1), but
    then you put these messages that are not actually sent in the queue
    all at once
    i think i need more details please
    you have lock on the port as it works now, not the kernel lock
    the idea is to batch these calls
    i see
    batching can be effective, but it would really require queueing
    x15 only queues clients when there is no receiver
    i don't think batching can be applied there
    you batch messages only from one client
    that's what i'm saying
    so client can send several messages during his time slice and then
    you put them into the queue all together
    x15 ipc is synchronous, no more than 1 message per client at any
    time
    there also are other problems with this strategy
    problems we have on the hurd, such as priority handling
    if you delay the reception of messages, you also delay priority
    inheritance to the server thread
    well not the reception, the queueing actually
    but since batching is about delaying that, it's the same
    if you use synchronous ipc then there is no sense in batching, at
    least as I see it.
    yes
    18:08 < braunr> i don't think batching can be applied there
    and i think sync ipc is the only way to go for a system intended
    to provide messaging performance as close as possible to the system call
    do you have as many server threads as cores?
    no
    as many server threads as clients
    which matches the monolithic model
    in current implementation?
    no
    currently i don't have userspace :>
    and what is in hurd atm?
    in gnumach
    async ipc
    with message queues
    no priority inheritance, simple "handoff" on message delivery,
    that's all
    I managed to read the conversation :-)
    eh
    anatoly: any opinion on this ?
    braunr: I have no opinion.
I understand it partially :-) But + association of threads sounds for me as good idea + But who am I to say what is good or what is not in that area :-) + there still is this "first time" issue which needs at least one + atomic instruction + I see. Does mach do this "first time" thing every time? + yes + but gnumach is uniprocessor so it doesn't matter + if we have 1:1 relation for client and server threads we need only + per-cpu queues + mcsim: explain that please + and the problem here is establishing this relation + with a lockless lookup, i don't even need per cpu queues + you said: (18:11:16) braunr: as many server threads as clients + how do you create server threads? + pthread_create + :) + ok :) + why and when do you create a server thread? + there must be at least one unbound thread waiting for a message + when a message is received, that thread knows it's now bound with + a client, and if needed wakes up/spawns another thread to wait for + incoming messages + when it gets a signal indicating the death of the client, it knows + it's now unbound, and goes back to waiting for new messages + becoming either the manager or a spare thread if there already is + a manager + a timer could be used as it's done on the hurd to make unbound + threads die after a timeout + the distinction between the manager and spare threads would only + be done at the kernel level + the server would simply make unbound threads wait on the port set + How client sends signal to thread about its death (as I + understand signal is not message) (sorry for noob question) + in what you described there are no queues at all + anatoly: the kernel does it + mcsim: there is, in the kernel + the queue of spare threads + anatoly: don't apologize for noob questions eh + braunr: is that client is a thread of some user space task? + i don't think it's a newbie topic at all + anatoly: a thread + make these queue per cpu + why ? 
+ there can be a lot less spare threads than processors + i don't think it's a good idea to spawn one thread per cpu per + port set + on a large machine you'd have tons of useless threads + if you have many useless threads, than assign 1 thread to several + core, thus you will have twice less threads + i mean dynamically + that becomes a hierarchical model + it does reduce contention, but it's complicated, and for now i'm + not sure it's worth it + it could be a tunable though + if you want something fast you should use something complicated. + really ? + a system call is very simple and very fast + :p + why is it fast? + you still have a lot of threads in kernel + but they don't interact during the system call + the system call itself is usually a simple instruction with most + of it handled in hardware + if you invoke "write" system call, what do you do in kernel? + you look up the function address in a table + you still have queues + no + sorry wait + by system call, i mean "the transition from userspace to kernel + space" + and the return + not the service itself + the equivalent on a microkernel system is sending a message from a + client, and receiving it in a server, not processing the request + ideally, that's what l4 does: switching from one thread to + another, as simply and quickly as the hardware can + so just a context and address space switch + at some point you put something in queue even in monolithic kernel + and make request to some other kernel thread + the problem here is the indirection that is the capability + yes but that's the service + i don't care about the service here + i care about how the request reaches the server + this division exist for microkernels + for monolithic it's all mixed + What does thread do when it receive a message? + anatoly: what it wants :p + the service + mcsim: ? + mixed ? + braunr: hm, is it a thread of some server? 
+ if you have several working threads in monolithic kernel you have + to put request in queue + anatoly: yes + mcsim: why would you have working threads ? + and there is no difference either you consider it as service or + just "transition from userspace to kernel space" + i mean, it's a good thing to have, they usually do, but they're + not implied + they're completely irrelevant to the discussion here + of course there is + you might very well perform system calls that don't involve + anything shared + you can also have only one working thread in microkernel + yes + and all clients will wait for it + you're mixing up work queues in the discussion here + server threads are very similar to a work queue, yes + but you gave me an example with 64 cores and each core runs some + server thread + they're a thread pool handling requests + you can have only one thread in a pool + they have to exist in a microkernel system to provide concurrency + monolithic kernels can process concurrently without them though + why? + because on a monolithic system, _every client thread is its own + server_ + a thread making a system call is exactly like a client requesting + a service + on a monolithic kernel, the server is the kernel + and it *already* has as many threads as clients + and that's pretty much the only thing beautiful about monolithic + kernels + right + have to think about it :) + that's why they scale so easily compared to microkernel based + systems + and why l4 people chose to have thread-based ipc + but this just moves the problems to an upper level + and is probably why they've realized one of the real values of + microkernel systems is capabilities + and if you want to make them fast enough, they should be handled + directly by the kernel + + +## IRC, freenode, #hurd, 2013-06-13 + + Heya Richard. Solve the worlds problems yet? 
:)
+ bddebian: I fear the world's problems are NP-complete ;)
+ heh
+ bddebian: i wish i could solve mine at least :p
+ braunr: I meant the contention thing you were discussing the other day :)
+ bddebian: oh
+ i have a solution that improves the behaviour yes, but there is still contention the first time a thread performs an ipc
+ Any thread or the first time there is contention?
+ there may be contention the first time a thread sends a message to a server
+ (assuming a server uses a single port set to receive requests)
+ Oh aye
+ i think it's as much as can be done considering there is a translation from capability to thread
+ other schemes are just too heavy, and thus don't scale well
+ this translation is one of the two important nice properties of microkernel based systems, and translations (or indirections) usually have a cost
+ so we want to keep them
+ and we have to accept that cost
+ the amount of code in the critical section should be so small it should only matter for machines with several hundreds or thousands of processors
+ so it's not such a bit problem
+ OK
+ but it would have been nice to have an additional valid theoretical argument to explain how ipc isn't that slow compared to system calls
+ s/bit/big/
+ people keep saying l4 made ipc as fast as system calls without taking that stuff into account
+ which makes the community look lame in the eyes of those familiar with it
+ heh
+ with my solution, persistent applications like databases should perform as fast as on an l4 like kernel
+ but things like parallel builds, which start many different processes for each file, will suffer a bit more from contention
+ seems like a fair compromise to me
+ Aye
+ as mcsim said, there is a lot of contention about everywhere in almost every application
+ and lockless stuff is hard to correctly implement
+ so it should be all right :)
+ ... :)
+ braunr: What if we have at least 1 thread for each core that stay in per-core queue.
When we decide to kill a thread and this thread is + last in a queue we replace it with load balancer. This is still worse + than with monolithic kernel, but it is simplier to implement from kernel + perspective. + mcsim: it doesn't scale well + you end up with one thread per cpu per port set + load balancer is only one thread + why would it end up like you said? + remember the goal is to avoid contention + your proposition is to set per cpu queues + the way i understand what you said, it means clients will look up + a server thread in these queues + one of them actually, the one for the cpu they're currently + running one + so 1/ it disables migration + or 2/ you have one server thread per client per cpu + i don't see what a "load balancer" would do here + client either finds server thread without contention or it sends + message to load balancer, that redirects message to thread from global + queue. Where global queue is concatenation of local ones. + you can't concatenate local queues in a global one + if you do that, you end up with a global queue, and a global lock + again + not global + load balancer is just one + then you serialize all remote messaging through a single thread + so contention will be only among local thread and load balancer + i don't see how it doesn't make the load balancer global + it makes + but it just makes bootstraping harder + i'm not following + and i don't see how it improves on my solution + in your example with make -j64 very soon there will be local + threads at any core + yes, hence the lack of scalability + but that's your goal: create as many server thread as many clients + you have, isn't it? + your solution may create a lot more + again, one per port set (or server) per cpu + imagine this worst case: you have a single client with one thread + which gets migrated to every cpu on the machine + it will spawn one thread per cpu at the server side + why would it migrate all the time? 
+ it's a worst case + if it can migrate, consider it will + murphy's law, you know + also keep in mind contention doesn't always occur with a global + lock + i'm talking about potential contention + and same things apply: if it can happen, consider it will + than we can make load balancer that also migrates server threads + ok so in addition to worker threads, we'll add an additional per + server load balancer which may have to lock several queues at once + doesn't it feel completely overkill to you ? + load balancer is global, not per-cpu + there could be contention for it + again, keep in mind this problem becomes important for several + hundreds processors, not below + yes but it has to balance + which means it has to lock cpu queues + and at least two of them to "migrate" server threads + and i don't know why it would do that + i don't see the point of the load balancer + so, you start make -j64. First 64 invocations of gcc will suffer + from contention for load balancer, but later on it will create enough + server threads and contention will disappear + no + that's the best case : there is always one server thread per cpu + queue + how do you guarantee your 64 server threads don't end up in the + same cpu queue ? + (without disabling migration) + load balancer will try to put some server thread to the core where + load balancer was invoked + so there is no guarantee + LB can pin server thread + unless we invoke it regularly, in a way similar to what is already + done in the SMP scheduler :/ + and this also means one balancer per cpu then + why one balance per cpu? + 15:56 < mcsim> load balancer will try to put some server thread to + the core where load balancer was invoked + why only where it was invoked ? 
+ because it assumes that if some one asked for server at core x, it + most likely will ask for the same service from the same core + i'm not following + LB just tries to prefetch were next call will be + what you're describing really looks like per-cpu work queues ... + i don't see how you make sure there aren't too many threads + i don't see how a load balancer helps + this is just an heuristic + when server thread is created? + who creates it? + and it may be useless, depending on how threads are migrated and + when they call the server + same answer as yesterday + there must be at least one thread receiving messages on a port set + when a message arrives, if there aren't any spare threads, it + spawns one to receive messages while it processes the request + at the moment server threads are killed by timeout, right? + yes + well no + there is a debian patch that disables that + because there is something wrong with thread destruction + but that's an implementation bug, not a design issue + so it is the mechanism how we insure that there aren't too many + threads + it helps because yesterday I proposed to hierarchical scheme, were + one server thread could wait in cpu queues of several cores + but this has to be implemented in kernel + a hierarchical scheme would help yes + a bit + i propose scheme that could be implemented in userspace + ? + kernel should not distinguish among load balancer and server thread + sorry this is too confusing + please start describing what you have in mind from the start + ok + so my starting point was to use hierarchical management + but the drawback was that to implement it you have to do this in + kernel + right? 
+ no
+ so I thought how can this be implemented in user space
+ being in kernel isn't the problem
+ contention is
+ on the contrary, i want ipc in kernel exactly because that's where you have the most control over how it happens
+ and can provide the best performance
+ ipc is the main kernel responsibility
+ but if you have few clients you have low contention
+ the goal was "0 potential contention"
+ and if you have many clients, you have many servers
+ let's say server threads
+ for me, a server is a server task or process
+ right
+ so i think 0 potential contention is just impossible
+ or it requires too many resources that make the solution not scalable
+ 0 contention is impossible, since you have imbalance in numbers of client threads and server threads
+ well no
+ it *can* be achieved
+ imagine servers register themselves to the kernel
+ and the kernel signals them when a client thread is spawned
+ you'd effectively have one server thread per client
+ (there would be other problems like e.g. when a server thread becomes the client of another, etc..)
+ so it's actually possible
+ but we clearly don't want that, unless perhaps for real time threads
+ but please continue
+ what does "and the kernel signals them when a client thread is spawned" mean?
+ it means each time a thread not part of a server thread is created, servers receive a signal meaning "hey, there's a new thread out there, you might want to preallocate a server thread for it"
+ and what is the difference with creating thread on demand?
+ on demand can occur when receiving a message
+ i.e. during syscall
+ I will continue, I just want to be sure that I'm not basing on wrong assumptions.
+ and what is bad in that?
+ (just to clarify, i use the word "syscall" with the same meaning as "RPC" on a microkernel system, whereas it's a true syscall on a monolithic one)
+ contention
+ whether you have contention on a list of threads or on map entries when allocating a stack doesn't matter
+ the problem is contention
+ and if we create server thread always?
+ and do not keep them in queue?
+ always ?
+ yes
+ again
+ you'd have to allocate a stack for it
+ every time
+ so two potentially heavy syscalls to allocate/free the stack
+ not to mention the thread itself, its associations with its task, ipc space, maintaining reference counts
+ (moar contention)
+ creating threads was considered cheap at the time the process was the main unit of concurrency
+ ok, then we will have the same contention if we will create a thread when "the kernel signals them when a client thread is spawned"
+ now we have work queues / thread pools just to avoid that
+ no
+ because that contention happens at thread creation
+ not during a syscall
+ i'll redefine the problem: the problem is contention during a system call / IPC
+ ok
+ note that my current solution is very close to signalling every server
+ it's the lazy version
+ match at first IPC time
+ so I was basing my plan on the case when we create new thread when client makes syscall and there is not enough server threads
+ the problem exists even when there is enough server threads
+ we shouldn't consider the case where there aren't enough server threads
+ real time tasks are the only ones which want that, and can preallocate resources explicitly
+ I think that real time tasks should be really separated
+ For them resource availability is much more important than good resource utilisation.
+ So if we talk about real time tasks we should apply one police and + for non-real time another + So it shouldn't be critical if thread is created during syscall + agreed + that's what i was saying : + :) + 16:23 < braunr> we shouldn't consider the case where there aren't + enough server threads + in this case, we spawn a thread, and that's ok + it will live on long enough that we really don't care about the + cost of lazily creating it + so let's concentrate only on the case where there already are + enough server threads + So if client makes a request to ST (is it ok to use abbreviations?) + there are several cases: + 1/ There is ST waiting on local queue (trivial case) + 2/ There is no ST, only load balancer (LB). LB decides to create a + new thread + 3/ Like in previous case, but LB decides to perform migration + migration of what ? + migration of ST from other core + the only case effectively solving the problem is 1 + others introduce contention, and worse, complex code + i mean a complex solution + not only code + even the addition of a load balancer per port set + thr data structures involved for proper migration + But 2 and 3 in long run will lead to having enough threads on all + cores + then you end up having 1 per client per cpu + migration is needed in any case + no + why would it be ? + to balance load + not only for this case + there already is load balancing in the scheduler + we don't want to duplicate its function + what kind of load balancing? + *has scheduler + thread weight / cpu + and does it perform migration? + sure + so scheduler can be simplified if policy "when to migrate" will be + moved to user space + this is becoming a completely different problem + and i don't want to do that + it's very complicated for no real world benefit + but all this will be done in userspace + ? + all what ? + migration decisions + in your scheme you mean ? 
+ yes + explain how + LB will decide when thread will migrate + and LB is user space task + what does it bring ? + imagine that, in the mean time, the scheduler then decides the + client should migrate to another processor for fairness + you'd have migrated a server thread once for no actual benefit + or again, you need to disable migration for long durations, which + sucks + also + 17:06 < mcsim> But 2 and 3 in long run will lead to having enough + threads on all cores + contradicts the need for a load balancer + if you have enough threads every where, why do you need to balance + ? + and how are you going to deal with the case when client will + migrate all the time? + i intend to implement something close to thread migration + because some of them can die because of timeout + something l4 already does iirc + the thread scheduler manages scheduling contexts + which can be shared by different threads + which means the server thread bound to its client will share the + scheduling context + the only thing that gets migrated is the scheduling context + the same way a thread can be migrated indifferently on a + monolithic system, whether it's in user of kernel space (with kernel + preemption enabled ofc) + or* + but how server thread can process requests from different clients? + mcsim: load becomes a problem when there are too many threads, not + when they're dying + they can't + at first message, they're *bound* + => one server thread per client + when the client dies, the server thread is ubound and can be + recycled + unbound* + and you intend to put recycled threads to global queue, right? 
+ yes + and I propose to put them in local queues in hope that next client + will be on the same core + the thing is, i don't see the benefit + next client could be on another + in which case it gets a lot heavier than the extremely small + critical section i have in mind + but most likely it could be on the same + uh, no + becouse on this load on this core is decreased + *because + well, ok, it would likely remain on the same cpu + but what happens when it migrates ? + and what about memory usage ? + one queue per cpu per port set can get very large + (i understand the proposition better though, i think) + we can ask also "What if random access in memory will be more usual + than sequential?", but we still optimise sequential one, making random + sometimes even worse. The real question is "How can we maximise benefit + of knowledge where free server thread resides?" + previous was reply to: "(17:17:08) braunr: but what happens when it + migrates ?" + i understand + you optimize for the common case + where a lot more ipc occurs than migrations + agreed + now, what happens when the server thread isn't in the local queue + ? + than client request will be handled to LB + why not search directly itself ? 
+ (and btw, the right word is "then")
+ LB can decide whom to migrate
+ right, sorry
+ i thought you were improving on my scheme
+ which implies there is a 1:1 mapping for client and server threads
+ If job of LB is too small then it can be removed and everything will be done in kernel
+ it can't be done in userspace anyway
+ these queues are in the port / port set structures
+ it could be done though
+ i mean
+ using per cpu queues
+ server threads could be both in per cpu queues and in a global queue as long as they exist
+ there should be no global queue, because there again will be contention for it
+ mcsim: accessing a load balancer implies contention
+ there is contention anyway
+ what you're trying to do is reduce it in the first message case if i'm right
+ braunr: yes
+ well then we have to revise a few assumptions
+ 17:26 < braunr> you optimize for the common case
+ 17:26 < braunr> where a lot more ipc occurs than migrations
+ that actually becomes wrong
+ the first message case occurs for newly created threads
+ for make -j64 this is actually the common case
+ and those are usually not spawned on the processor their parent runs on
+ yes
+ if you need all processors, yes
+ i don't think taking into account this property changes many things
+ per cpu queues still remain the best way to avoid contention
+ my problem with this solution is that you may end up with one unbound thread per processor per server
+ also, i say "per server", but it's actually per port set
+ and even per port depending on how a server is written
+ (the system will use one port set for one server in the common case but still)
+ so i'll start with a global queue for unbound threads
+ and the day we decide it should be optimized with local (or hierarchical) queues, we can still do it without changing the interface
+ or by simply adding an option at port / port set creation
+ which is a non intrusive change
+ ok. your solution should be simpler.
And TBH, what I propose is not clearly much more gainful.
+ well it is actually for big systems
+ it is because instead of grabbing a lock, you disable preemption
+ which means writing to a local, uncontended variable
+ with 0 risk of cache line bouncing
+ this actually looks very good to me now
+ using an option to control this behaviour
+ and yes, in the end, it gets very similar to the slab allocator, where you can disable the cpu pool layer with a flag :)
+ (except the serialized case would be the default one here)
+ mcsim: thanks for insisting
+ or being persistent
+ braunr: thanks for conversation :)
+ and probably I had to start from statement that I wanted to improve common case
+
+
+## IRC, freenode, #hurd, 2013-06-20
+
+ braunr: how about your x15, is it improvement for mach or redesign? I really want to know that:)
+ it's both largely based on mach and now quite far from it
+ based on mach from a functional point of view
+ i.e. the kernel assumes practically the same functions, with a close interface
+ Good point:)
+ except for ipc which is entirely rewritten
+ why ? :)
+ for from a functional point of view:) I think each design has its intrinsic advantages and disadvantages
+ but why is it good ?
+ if redesign, I may need wait more time to a new function hurd
+ you'll have to wait a long time anyway :p
+ Improvement was better sometimes, although redesign was more attractive sometimes :)
+ I will wait :)
+ i wouldn't put that as a reason for it being good
+ this is a departure from what current microkernel projects are doing
+ i.e.
x15 is a hybrid + Sure, it is good from design too:) + yes but i don't see why you say that + Sorry, i did not show my view clear, it is good from design + too:) + you're just saying it's good, you're not saying why you think it's + good + I would like to talk hybrid, I want to talk that, but I am a + litter afraid that you are all enthusiasm microkernel fans + well no i'm not + on the contrary, i'm personally opposed to the so called + "microkernel dogma" + but i can give you reasons why, i'd like you to explain why *you* + think a hybrid design is better + so, when I talk apple or nextstep, I got one soap :) + that's different + these are still monolithic kernels + well, monolithic systems running on a microkernel + yes, I view this as one type of hybrid + no it's not + microkernel wan't to divide process ( task ) from design view, + It is great + as implement view or execute view, we have one cpu and some + physic memory, as the simplest condition, we can't change that + that what resource the system has + what's your point ? + I view this as follow + I am cpu and computer + application are the things I need to do + for running the program and finish the job, which way is the + best way for me + I need keep all the thing as simple as possible, divide just + from application design view, for me no different + desgin was microkernel , run just for one cpu and these + resource. + (well there can be many processors actually) + I know, I mean hybrid at some level, we can't escape that + braunr: I show my point? + well l4 systems showed we somehow can + no you didn't + x15's api was rpc, right? + yes + well a few system calls, and mostly rpcs on top of the ipc one + jsu tas with mach + and you hope the target logic run locally just like in process + function call, right? + no + it can't run locally + you need thread context switch + and address space context switch + but you cut down the cost + how so ? + I mean you do it, right? 
+ x15 + yes but no in this way + in every other way :p + I know, you remeber performance anywhere :p + i still don't see your point + i'd like you to tell, in one sentence, why you think hybrids are + better + balance the design and implement problem :p + which is ? + hybird for kernel arc + you're stating the solution inside the problem + you are good at mathmatics + sorry, I am not native english speaker + braunr: I will find some more suitable sentence to show my + point some day, but I can't find one if you think I did not show my + point:) + for today + too bad + If i am computer I hope the arch was monolithic, If i am + programer I hope the arch was microkernel, that's my idea + ok let's get a bit faster + monolithic for performance ? + braunr: sorry for that, and thank you for the talk:) + (a computer doesn't "hope") + braunr: you need very clear answer, I can't give you that, + sorry again + why do you say "If i am computer I hope the arch was monolithic" ? + I know you can slove any single problem + no i don't, and it's not about me + i'm just curious + I do the work for myself, as my own view, all the resource + belong to me, I does not think too much arch related divide was need, if + I am the computer :P + separating address spaces helps avoiding serious errors like + corrupting memory of unrelated subsystems + how does one not want that ? + (except for performance) + braunr: I am computer when I say that words! + a computer doesn't want anything + users (including developers) on the other way are the point of + view you should have + I am engineer other time + we create computer, but they are lifeable just my feeling, hope + not talk this topic + what ? 
+ I mark computer as life things + please don't + and even, i'll make a simple example in favor of isolating + resources + if we, humans, were able to control all of our "resources", we + could for example shut down our heart by mistake + back to the topic, I think monolithic was easy to understand, + and cut the combinatorial problem count for the perfect software + the reason the body have so many involuntary functions is probably + because those who survived did so because these functions were + involuntary and controlled by separated physiological functions + now that i've made this absurd point, let's just not consider + computers as life forms + microkernels don't make a system that more complicated + they does + no + do + they create isolation + and another layer of indirection with capabilities + that's it + it's not that more complicated + view the kernel function from more nature view, execute some + code + what ? + I know the benefit of the microkernel and the os + it's complicated + not that much + I agree with you + microkernel was the idea of organization + yes + but always keep in mind your goal when thinking about means to + achieve them + we do the work at diferent view + what's quite complicated is making a microkernel design without + too much performances loss, but aside from that performances issue, it's + not really much more complicated + hurd do the work at os level + even a monolithic kernel is made of several subsystems that + communicated with each others using an API + i'm reading this conversation for some time now + and I have to agree with braunr + microkernels simplify the design + yes and no + i think it depends a lot on the availability of capabilities + i have experience mostly with QNX and i can say it is far more + easier to write a driver for QNX, compared to Linux/BSD for example ... + which are the major feature microkernels usually add + qnx >= 5 do provide capabilities + (in the form of channels) + yeah ... 
it's the basic communication mechanism + but my initial and still unanswered question was: why do people + think a hybrid kernel is batter than a true microkernel, or not + better* + I does not say what is good or not, I just say hybird was + accept + core-ix: and if i'm right, they're directly implemented by the + kernel, and not a userspace system server + braunr: evolution is more easily accepted than revolution :) + braunr: yes, message passing is in the QNX kernel + not message passing, capabilities + l4 does message passing in kernel too, but you need to go through + a capability server + (for the l4 variants i have in mind at least) + the operating system evolve for it's application. + congzhang: about evolution, that's one explanation, but other than + that ? + core-ix: ^ + braunr: by capability you mean (for the lack of a better word + i'll use) access control mechanisms? + i mean reference-rights + the "trusted" functionality available in other OS? + http://en.wikipedia.org/wiki/Capability-based_security + i don't know what other systems refer to with "trusted" + functionnality + yeah, the same thing + for now, I am searching one way to make hurd arm edition + suitable for Raspberry Pi + I hope design or the arch itself cant scale + can be scale + braunr: i think (!!!) that those are implemented in the Secure + Kernel (http://www.qnx.com/products/neutrino-rtos/secure-kernel.html) + never used it though ... + rpc make intercept easy :) + core-ix: regular channels are capabilities + yes, and by extensions - they are in the kenrel + that's my understanding too + and that one thing that, for me, makes qnx an hybrid as well + just need intercept in kernel, + braunr: i would dive the academic aspects of this ... in my mind + a microkernel is system that provides minimal hardware abstraction, + communication primitives (usually message passing), virtual memory + protection + *wouldn't ... 
+ i think it's very important on the contrary + what you describe is the "microkernel dogma" + precisely + that doesn't include capabilities + that's why l4 messaging is thread-based + and that's why l4 based systems are so slow + (except okl4 which put back capabilities in the kernel) + so the compromise here is to include capabilities implementation + in the kernel, thus making the final product hybrid? + not only + because now that you have them in kernel + the kernel probably has to manage memory for itself + so you need more features in the virtual memory system + true ... + that's what makes it a hybrid + other ways being making each client provide memory, but that's + when your system becomes very complicated + but I believe this is true for pretty much any "general OS" case + and some resources just can't be provided by a client + e.g. a client can't provide virtual memory to another process + okl4 is actually the only pragmatic real-world implementation of + l4 + and they also added unix-like signals + so that's an interesting model + as well as qnx + the good thing about the hurd is that, although it's not kernel + agnostic, it doesn't require a lot from the underlying kernel + about hurd? + yes + i really need to dig into this code at some point :) + well you may but you may not see that property from the code + itself diff --git a/microkernel/mach/gnumach/memory_management.mdwn b/microkernel/mach/gnumach/memory_management.mdwn index 4e237269..477f0a18 100644 --- a/microkernel/mach/gnumach/memory_management.mdwn +++ b/microkernel/mach/gnumach/memory_management.mdwn @@ -188,3 +188,18 @@ License|/fdl]]."]]"""]] patch (more kernel memory, thus more physical memory - up to 1.8 GiB - but then, less user memory) + + +# IRC, freenode, #hurd, 2013-06-06 + + braunr: quick question, what memory allocation algorithms + does the Mach use? I know it uses slab allocation, so I can guess buddy + allocators too? 
+ no
+ slab allocator for kernel memory (allocation of buffers used by the kernel itself)
+ a simple freelist for physical pages
+ and a custom allocator based on a red-black tree, a linked list and a hint for virtual memory
+ (which is practically the same in all BSD variants)
+ and linux does something very close too