[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]] [[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable id="license" text="Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled [[GNU Free Documentation License|/fdl]]."]]"""]] [[!tag open_issue_gnumach]] [[!toc]] # IRC, freenode, #hurd, 2011-05-07 things that are referred to as "system calls" in glibc are actually RPCs to the kernel or other tasks, those RPCs have too lookup port rights the main services have tens of thousands of ports, looking up one is slow There is a [[!FF_project 268]][[!tag bounty]] on this task. # IRC, freenode, #hurd, 2011-04-23 youpi: is there any use of the port renaming facility ? I don't know at least, did you see such use ? i wonder why mach mach_port_insert_right() lets the caller specify the port name ../hurd-debian/hurd/serverboot/default_pager.c: kr = mach_port_rename( default_pager_self, mach_port_rename() is used only once, in the default pager so it's not that important but mach_port_insert_right() lets userspace task decide the port name value just to repeat myself again, I don't know port stuff very much :) well you know that a port denotes a right, which denotes a port yes, but I don't have any real experience with it err port name the only reason I see is that the caller, say /hurd/exec running a fork() hm no, i don't even see the reason here port names should be allocated by the kernel only, like file descriptors you can choose file descriptor values too really ? with dup2, yes oh hm what's the data structure in current unices to store file descriptors ? a hash table ? I don't know i'll have to look at that FYI, i'm asking these questions because i'm thinking of reworking ipc spaces i believe the use of splay trees completely destroys performance of tasks with many many port names such as the root file system that can be a problem yes since there are 3 ports per opened file, and like 3 per thread too + the page cache with a few thousand opened files and threads, that makes a lot by "opened file" I meant page cache actually i saw numbers up to 30k ok on buildds I easily see 100k ports for a single task ? wow yes the page cache is 4k files so that's definitely worth the try so that already makes 12k ports and 4k is not so big it's limited to 4K ? I haven't been able to check where the 100k come from yet braunr: yas could be leaks :/ yes omg, a hard limit on the page cache .. vm/vm_object.c:int vm_object_cached_max = 4000; /* may be patched*/ mach is really old :( I've raised it before it was 200 ... oO I tried to dro pthe limit, but then I was lacking memory which I believe have fixed the other day, but I have to test again that implementation doesn't know how to deal with memory pressure yes i saw your recent changes about adding warnings in such cases so, back to ipc spaces i think splay trees 1/ can get very unbalanced easily which isn't hard to imagine and 2/ make poor usage of the cpu caches because they're BST and write a lot to memory maybe you could write a patch which would dump statistics on that? that's part of the job i'm assigning to myself ok i'd like to try replacing splay trees with radix trees I can run it on the buildds buildds are very good stress-tests :) :) 22h building -> 77k ports 26h building -> 97k ports the problem is that when I add leak debugging (backtraces), I'm getting out of memory :) that will be a small summer of code outside the gsoc :p :/ backtraces are very consuming but that's only because of hardcoded limits I'll have to test again with bigger limits again .. evil hard limits well, actually we could as well just drop them but we'd also need to easily get statistics on zone/vm_maps usage because else we don't see leaks (except that the machine eventually crashes) hm i haven't explained why i was asking my questions actually so, i want radix trees, because they're nice they reduce the paths lengths they don't get too unbalanced (they're invariant wrt the order of operations) they don't need to write to memory on lookups the only drawback is that they can create much overhead if their usage pattern isn't appropriate elements in such a structure should be close, so that they share common nodes the common usage pattern in ext2fs is a big bunch of ever-open ports :) if there is one entry per node, it's a big waste yes there are 3, actually but the port names have low values they're allocated sequentially, beginning at 0 (or 1 actually) which is perfect for radix trees yes 97989: send but if anyone can rename this introduces a new potential weakness ah, if it's just a weakness it's probably not a problem I thought it was even a no-go i think so I guess port rename is very seldom but in a future version, it would be nice not to allow port renaming unless there are similar issues in current unix kernels in which case i'd say it's acceptable there are of that order ? and it'd be useful for e.g. processing tracing/debugging/tweaking/whatever it's also used to hide fds from a process port renaming you mean ? you allocate them very high yes ok choosing your port name, generally to match what the process expects for instance then it would be a matter of resource limiting (which we totally lack afaik) along the number of maximum open files, you would have a number of maximum rights does that seem fine to you ? if done throught rlimits, sure something similar yes (_no_ PORTS_MAX ;) ) oh and, in addition, i remember gnumach has a special configuration of the processor in which caching is limited like write-through only didn't I fix that recently ? i don't know :) CR0=e001003b i don't think it's fixed I mean, in the git ah not in the debian package didn't tried the git version yet last time i tried (which was a long time ago), it made the kernel crash have you figured why ? I'm not aware of that anyway, splay trees write a lot, and most trees write a lot even at insertion/removal to rebalance braunr: Mmm, there's no clearance of CD in the kernel actually with radix trees, even if caching can't be fully enabled, it would make much better use of it so if port renaming isn't a true issue, i'll choose that data structure that'd probably be better yes I'm surprised by the CD, I do remember fixing something like this lately there are several levels where CD can be set the processors ORs all those if i'm right to determine if caching is enabled I know ok but in my memory that was at the CR* level, precisely maybe for xen only ? no well good luck if you hunt that one, i'm off, see you :) braunr: ah, no, it was the PGE flag that I had fixed braunr: explicit port naming is used for example to pass some initial ports to a new task at well-known places IIRC braunr: but these tend to be low numbers, so I don't see a problem there (I'm not familiar with radix trees... why would high numbers be a problem?) braunr: iirc the ipc space is limited to ~192k ports antrik: in most cases i've seen, the insert_right() call is used on task_self() and if there really are special ports (like the bootstrap or device ports), they should have special names IIRC, these ports are given through command line expansion by the kernel at boot time but it seems reasonable to think of port renaming as a potentially useful feature antrik: the problem with radix trees isn't them being high, it's them being sparse you get the most efficient trees when entries have keys that are close to each other because radix trees are a type of tries (the path in the tree is based on the elements composing the key) so the more common prefixes you have, the less external nodes you need here, keys are port names, but they can be memory addresses or offsets in memory objects (like in the page cache) the radix algorithm takes a few bits, say 4 or 6, at a time from a key, and uses that as an index in a node if keys are sparse, there can be as little as one entry per node IIRC, the worst case (on entry per node with the maximum possible number of nodes for a 32-bits key) is 2% entries the reste being null entries and almost-empty nodes containing them so if you leave the ability to give port rights the names you want, you can create such worst case trees which may consume several MiB of memory per tree tens of MiB i'd say on the other hand, in the current state, almost all hurd applications use sequentially allocated port names, close to 0 (which allows a nice optimization) so a radix ree would be the most efficient well, if some processes really feel they must use random numbers for port names, they *ought* to be penalized ;-) # IRC, freenode, #hurd, 2011-04-27 antrik: remember when you asked why high numbers would be a problem with radix trees ? here is a radix tree with one entry, which key is around 5000 [ 656.296412] tree height: 3 [ 656.296412] index: 0, level: 0, height: 3, count: 1, bitmap: 0000000000000002 [ 656.296412] index: 1, level: 1, height: 2, count: 1, bitmap: 0000000000004000 [ 656.296412] index: 14, level: 2, height: 1, count: 1, bitmap: 0000000000000080 three levels, each with an external node (dynamically allocated), for one entry so in the worst case of entries with keys close to the highest values, the could be many external nodes with higher paths lengths than when keys are close to 0 which also brings the problem of port name allocation can someone with access to a buildd which has an uptime of at least a few days (and did at least one build) show me the output of portinfo 3 | tail ? port names are allocated linearly IIRC, like PIDs, and some parts of the kernel may rely on them not being reused often but for maximum effifiency, they should be efficiency* 00:00 < braunr> can someone with access to a buildd which has an uptime of at least a few days (and did at least one build) show me the output of portinfo 3 | tail ? :) it's almost like wc -l 4905: receive vs 4647 for / 52902: receive vs 52207 for the chroot even after several builds ? and several days ? that's after 2 days it's not so many builds rossini is not so old (7h) but many builds 70927: send vs 70938 ok so it seems port names are reused good yes they are clearly i think i remember a comment about why the same port name shouldn't be reused too soon well, it could help catching programming errors that it helped catch bugs in applications that could deallocate/reallote quickly reallocate* without carefuly synchronization careful damn, i'm tired :/ but that's about debugging so we don't care about performance there yes i'll try to improve allocation performance too using e.g. bitmaps in each external node back to the root so that unused slots are quickly found i thknk that's what idr does in linux braunr: idr? antrik: a data structure used to map integers to pointers http://fxr.watson.org/fxr/source/lib/idr.c?v=linux-2.6 # IRC, freenode, #hurd, 2011-06-08 hm, reverse space/port to name lookups also suck having separate types for simple ipc entries and splay tree entries really makes many parts of the ipc code complicated and a global hash table for these operations is scary # IRC, freenode, #hurd, 2011-06-09 hm nice, my radix tree code runs inside gnumach, along with the original splay tree code and assertions making sure results are the same there is this "collision" thing i'm not sure to understand but once this is solved, replacing the splay trees should be easy youpi: is there a way to easily know the number of send rights associated to a port ? portinfo ? portinfo gives information in a space but this is specific to a port is there an option for that ? -v hm ok 25: send (refs: 550) nice youpi: if you have time, could you give me the min/max/avg numbers of send rights referring to the same port on buildds ? i'm trying to estimate if it's better to have space->list_of_ports or port->list_of_spaces to replace the global ipc hash table the latter seems better but there could be unexpected cases on machines using large amounts of resources like the buildds max is 64k min is 1 of course :) 64k then it's not what i'm looking for avg is 55 isn't this the number of urefs ? I don't know hmm what i'm looking for is the number of *pure send rights* for the same port i don't think portinfo can give it there can only be one such send right per task for the same port 64k would mean there are 64k tasks ok, so it's more difficult it means using -t ahh and run n^2 portinfo over the n processes i see Mmm, that will however still show any duplicate send right but then by min/max/avg, you mean, over time ? i'll change the source code, simpler e.g. min would be right after boot? min is 1 1 what ? 1 send right to a port ah, 1 for a given port yes ok, it becomes really hairy to compute, I don't hav ethe time :) avg and max are more interesting :) no worries braunr: I wouldn't be surprised that max is the number of tasks e.g. for a send port to the proc server for instance youpi: it is, but i'm not looking for potential numbers I'm not talking about a potential number, but an actual number that's almost always true for one port, yes but yes, ok for max this makes choosing an appropriate data structure difficult braunr: actually, min number of send rights to a port is 0... but I'm sure you know that already :-) youpi: normally each client gets a separate port. I'm not sure there are any ports with send rights distributed over many tasks... antrik, what about / ? antrik: not necessarily jkoenig: not sure... isn't the "/" port authenticated to the specific user? antrik, I guess so, but a single user could still have many tasks. (wrt /) jkoenig: well, in theory the tasks having exactly the same UIDs and GITs could probably share an auth token... but that's not how things are handled in general at least I don't think so tasks are authenticated, not users err... GIDs :-) antrik, still, my quick glance to the fork() code seemed to indicate the port is inherited as-is, maybe authentication happens only when something is actually looked up? hmm "rpctrace ls -d /" does not show any authentication calls, only a lookup("") on the root which returns a different port so I guess the root port is "deauthenticated" or something when the uid of a process is changed. too bad cfhammer isn't around, he digged into all this stuff... I know that there is a mechanism which reauths all FDs when the IDs of a process change but I'm not sure the "/" port uses this mechanism antrik: but the radix tree codee is really used as is, which means no locking, no preloading before locking, all of this because simple locks don't exist on UP, and because the kernel isn't preemptible antrik: and yes, min number is 0, but in that case you don't need (space, port) -> right lookups :) antrik: or put in another way, whichever reasonable structure you use, when it's empty, you don't care much which also means that the min number has actually no value here because the same applies to 1 then what seems to take most time is forks and i hope my upcoming kernel changes will help the situation a bit what are your incoming gnumach changes about? the ipc translation layer speed which basically means operating on port names (looking them up, inserting, removing, renaming, looking up send rights to a specific ports) will be faster and i believe forks are (one of) the most demanding use cases wrt ipc space manipulation i'm really surprised how badly the splay trees are used the worst case for this data structure is traversal and this is done in many situations leaving the tree in its worst case shape and i didn't mentioned the bunch of memory writes occurring, event for things like lookups or traversals this is slow and can disrupt many cpu cache lines and when there are 10k to 100+k (e.g. in some ext2fs instances on buildds), just imagine the number of operations involved in those operations a simple traversal_next involves a rotation *gasp* this means potentially writing on 3 different cache lines, for *one* next operation what are you replacing that splay tree with? radix trees much shorter paths extremely few memory writes locality of reference when traversing good cache usage (as many of the top nodes are reused) the two drawbacks are 1/ memory allocation for external nodes, which means the tree must be preloaded before locking and 2/ high memory overhead if the keys are sparse but this isn't the case with port names, unless someone messes it up by assigning random names to many rights braunr: so, when will we see the first performance comparision? :-) antrik: that's a topic of itself, how to compare ? antrik: the thing is, my current gnumach patches only makes assertions this is the best way i found to test my tree in real life conditions much cleanup is needed and what i'd like to do is to completely replace all teh translation layer structures with it which means removing much code, making sure it still works as expected this is tedious so one month at least braunr: comparing shouldn't be too hard... the average configure script does a lot of forking, which should be a good benchmark according to your observations rough estimates are easy, yes but my observations my be wrong :p braunr: well, we don't really need precise numbers... unless you need to do some kind of fine-tuning? i don't know yet # IRC, freenode, #hurd, 2011-06-18 < braunr> hmm, i'm having a problem with integrating my radix tree code in gnumach < braunr> inserting into such a tree can trigger memory allocation < braunr> so commonly, the tree i loaded with nodes before insertion, usually if it requires strong locking < braunr> ipc spaces are locked using "simple locks" (which are spin locks) < braunr> but spin locks are noops on UP, and gnumach is always UP .. < braunr> so, should i still include preloading code, even if it'll end up dead code ? < antrik> hm... I think we discussed this before; but isn't gnumach supposed to be SMP-capable, minus bugs?... < braunr> it is < braunr> but ofc, if i choose not to include preloading, i'll write #errors so that the day gnumach is built for SMP again, such support will be included < antrik> oh, sorry, I think I misread. what is UP? < braunr> uniprocessor < antrik> well, if it's just bugs forcing the current UP state, I think saying that gnumach is always UP is a stretch... < braunr> sure, but it's a practical consideration < antrik> does the locking complicate stuff? or is it just performance considerations? < braunr> no it's about correctness and completeness < braunr> if you don't preload a tree before locking < braunr> and memory allocation occurs while you're holding a simple lock < braunr> and memory allocation requires the kernel to sleep < braunr> you're screwed < braunr> but i hate the idea of including code that won't be used and which won't be easy to test < braunr> so i'm wondering if it's ok for now to just put this in a TODO comment and write it when the time is right < braunr> or if i should spens the week adding this and tweaking the userspace implementation to "emulate" spin locks < antrik> well, it's tricky situation. on one hand, it seems stupid to include handling for something that presently isn't used, and it's not clear when it will. on the other hand, I'd rather not see additional problems introduced that will make fixing SMP even harder... < braunr> that's why i'm asking here < antrik> of course, you could resolve this question by fixing SMP first... ;-) < braunr> ew < antrik> well, I guess it would be best first to make the code work... and we can still decide about the locking thing before it goes mainline I'd say? < braunr> "make the code work" ? < antrik> I mean make gnumach work with your radix tree code < braunr> without preloading then < antrik> yeah... as a first step... I guess adding it later won't be any harder than adding it right now? < braunr> not much < braunr> testing is what requires time really # IRC, freenode, #hurd, 2011-06-27 < braunr> ok, here is the radix tree code: http://git.sceen.net/rbraun/libbraunr.git/ < braunr> the preloading stuff will be added in the kernel only, as it's really pointless and not easily doable in userspace < youpi> preloading? < braunr> youpi: yes, preloading < braunr> radix trees allocate external nodes < youpi> well, providing a url at some random time of some random day is not a great way to get eyes on it :) < braunr> and ipc spaces are locked when inserting/allocating names < braunr> we normally don't need preloading in gnumach < braunr> since there is no preemption nor SMP < braunr> but in case someone changes that, i'd like the code to be mostly ready < braunr> and correctly handle those ugly simple locks < braunr> youpi: is what i say clear enough or do you need more background on what is done ? < youpi> about preloading? < braunr> yes < youpi> I guess it means allocating nodes in advance? < braunr> yes < youpi> k < braunr> before locking the ipc spaces # IRC, freenode, #hurd, 2011-06-28 < braunr> antrik: i think i won't write the code for the preloading stuff actually < braunr> antrik: it's not very difficult, but i really hate the idea of not being able to reliably test it < braunr> antrik: and i'd rather concentrate on integrating the radix tree code in gnu mach now < braunr> (i've already removed much code, even some files which weren't actually used before my changes !) < braunr> hmm, i won't be able not to write the preloading code after all < antrik> braunr: not able not to write? how's that? < braunr> antrik: it's actually required < braunr> there are three functions, ipc_entry_get, ipc_entry_alloc, and ipc_entry_grow_table < braunr> ipc_entry_get cannot allocate memory < braunr> if it fails, ipc_entry_grow_table is called, which will allocate memory < braunr> ipc_entry_alloc calls both of them depending on the result of ipc_entry_get < braunr> this is the equivalent of the preloading thing i had in mind < braunr> not a bad thing after all < braunr> the only thing i'm afraid of are the "optimized" version of those ipc functions in te so-called fast paths < braunr> i'm afraid if i don't deal right with those, the kernel may end up using mostly slow paths < braunr> but considering the purpose of those fast paths was merely to avoid the overhead of function calls and some locking functions, it shouldn't be that bad < braunr> this is such a mess eh < antrik> hurray microoptimisations ;-) < braunr> there, the preload functions are done, easy :) < antrik> braunr: seems you spent less time implementing it than pondering whether you should implement it ;-) < braunr> well, i couldn't implement it correctly before knowing what should have been done exactly < braunr> and there are still other problems :/ < braunr> and the other problems make me reconsider if this was useful at all eh < braunr> youpi: i'm unable to find where ipc tree entries are released except in ipc_entry_alloc_name(), which could mean they're leaked ... < braunr> youpi: would you have time to take a look ? < youpi> they aren't in ipc_entry_dealloc() ? < braunr> no ..... < youpi> it's not so unprobable that they're only freed when the task quits < braunr> i don't see that either < braunr> i only see them released in ipc_entry_alloc_name() < braunr> so maybe they're reused < braunr> but i'm not sure about that when reading the code < braunr> oh wait, yes, they are :/ < braunr> my bad < youpi> in the ipc_splay_tree_* fucntions I guess? < braunr> yes < braunr> it's just surprsing to see them allocated outside the tree code only < braunr> but released in both the entry and the splay tree code ... # IRC, freenode, #hurd, 2011-06-29 < braunr> hmm i missed an important thing :/ < braunr> and it's scary < braunr> it looks like the splay tree is mainly used when names are provided < braunr> whereas the entry table is used when names are allocated < braunr> which means the table is the main ipc data structure, even for tasks with lots of rights < braunr> i can make my root ext2fs have more than 10k rights, and i see the ipc table table grow along that number ... < braunr> now thetable has 15k+ entries < braunr> IOW there is no point to put the radix tree code in gnumach :( < antrik> braunr: what do you mean by "provided" and "allocated"? < antrik> and what is that table you are talking about? < braunr> antrik: provided means the user space tasks gives the name of the new right < braunr> antrik: allocated means the kernel generates it < braunr> antrik: the table i'm talking about is is_table in struct ipc_space < braunr> 55 * Every space has a non-NULL is_table with is_table_size entries. < braunr> 56 * A space may have a NULL is_tree. is_tree_small records the < braunr> 57 * number of entries in the tree that, if the table were to grow < braunr> 58 * to the next larger size, would move from the tree to the table. < braunr> here is the description which mislead me (in addition of the obscure code) < braunr> 50 * Spaces hold capabilities for ipc_object_t's (ports and port sets). < braunr> 51 * Each ipc_entry_t records a capability. Most capabilities have < braunr> 52 * small names, and the entries are elements of a table. < braunr> 53 * Capabilities can have large names, and a splay tree holds < braunr> 54 * those entries. The cutoff point between the table and the tree < braunr> 55 * is adjusted dynamically to minimize memory consumption. < antrik> ah, so the rights with a low name are in a linear table, and only those with "random" high names are stored in the splay tree instead? < antrik> seems a rather complex design... I guess though there isn't much room for further optimisation there :-( < antrik> (well, except for code size optimisation -- which could in fact make a considerable difference...) < braunr> well there are problems with this approach, but most don't concern performance < braunr> when the table gets big (close to the page size or more), it gets remapped when reallocated < braunr> which will incur some penalty because of the tlb < braunr> but it's annoying even for small tables < braunr> the initial table size is 4 entries, and from what i can see, most tables are 128 entries wide when tasks are destroyed < braunr> an obvious simple optimization is to set a larger default size < braunr> the same applies for the dead name tables < braunr> those reallocations are a pain, and they're due to this design < braunr> they can also fail because of fragmentation < braunr> there would be a point to radix trees if they would replace all that, and not just the splay tree < braunr> but this would cause a lot of changes in a lot of places, and in particular the "optimized" fast paths i mentioned yesterday < braunr> we'll see how they perform in x15 :> < braunr> there is a slight noticeable improvement when increasing the initial size of the entry table < antrik> braunr: well, if you use them in a completely different implementation, there will be no way of telling whether they make a difference < antrik> how did you test the improvement? < braunr> antrik: no actually it's completely negligeable < braunr> hm < braunr> is that a valid word ? :) < braunr> negligible < braunr> youpi: did you see my comments about the ipc stuff this earlier today ? < braunr> youpi: well to make things short, when port names are allocated, the right they refer to is allocated from the ipc table < braunr> youpi: the splay tree is only used for user provided names < braunr> youpi: i had tables as large as the number of rights in a space (i could easily reach 20k) < braunr> youpi: whereas the splay trees had at most ~40 entries ..