[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
id="license" text="Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
is included in the section entitled [[GNU Free Documentation
License|/fdl]]."]]"""]]

[[!tag open_issue_gnumach]]

IRC, freenode, #hurd, 2011-05-07

    things that are referred to as "system calls" in glibc are actually RPCs
    to the kernel or other tasks
    those RPCs have to look up port rights
    the main services have tens of thousands of ports; looking up one is slow

There is a [[!FF_project 268]][[!tag bounty]] on this task.

IRC, freenode, #hurd, 2011-04-23

    youpi: is there any use of the port renaming facility ?
    I don't know
    at least, did you see such use ?
    i wonder why mach's mach_port_insert_right() lets the caller specify the
    port name
    ../hurd-debian/hurd/serverboot/default_pager.c: kr = mach_port_rename( default_pager_self,
    mach_port_rename() is used only once, in the default pager
    so it's not that important
    but mach_port_insert_right() lets the userspace task decide the port name
    value
    just to repeat myself again, I don't know port stuff very much :)
    well you know that a port denotes a right, which denotes a port
    yes, but I don't have any real experience with it
    err, port name
    the only reason I see is that the caller, say /hurd/exec running a fork()
    hm no, i don't even see the reason here
    port names should be allocated by the kernel only, like file descriptors
    you can choose file descriptor values too
    really ?
    with dup2, yes
    oh
    hm
    what's the data structure in current unices to store file descriptors ?
    a hash table ?
    I don't know
    i'll have to look at that
    FYI, i'm asking these questions because i'm thinking of reworking ipc
    spaces
    i believe the use of splay trees completely destroys performance of tasks
    with many many port names
    such as the root file system
    that can be a problem yes
    since there are 3 ports per opened file, and like 3 per thread too
    + the page cache
    with a few thousand opened files and threads, that makes a lot
    by "opened file" I meant page cache actually
    i saw numbers up to 30k
    ok
    on buildds I easily see 100k ports
    for a single task ?
    wow
    yes
    the page cache is 4k files
    so that's definitely worth the try
    so that already makes 12k ports
    and 4k is not so big
    it's limited to 4K ?
    I haven't been able to check where the 100k come from yet
    braunr: yes, could be leaks :/
    yes
    omg, a hard limit on the page cache ..
    vm/vm_object.c:int vm_object_cached_max = 4000;   /* may be patched */
    mach is really old :(
    I've raised it before
    it was 200 ... oO
    I tried to drop the limit, but then I was lacking memory
    which I believe I have fixed the other day, but I have to test again
    that implementation doesn't know how to deal with memory pressure
    yes
    i saw your recent changes about adding warnings in such cases
    so, back to ipc spaces
    i think splay trees 1/ can get very unbalanced easily, which isn't hard
    to imagine
    and 2/ make poor usage of the cpu caches, because they're BSTs and write
    a lot to memory
    maybe you could write a patch which would dump statistics on that?
    that's part of the job i'm assigning to myself
    ok
    i'd like to try replacing splay trees with radix trees
    I can run it on the buildds
    buildds are very good stress-tests :)
    :)
    22h building -> 77k ports
    26h building -> 97k ports
    the problem is that when I add leak debugging (backtraces), I'm getting
    out of memory :)
    that will be a small summer of code outside the gsoc :p
    :/
    backtraces are very consuming
    but that's only because of hardcoded limits
    I'll have to test again with bigger limits again ..
    evil hard limits
    well, actually we could as well just drop them
    but we'd also need to easily get statistics on zone/vm_maps usage
    because otherwise we don't see leaks (except that the machine eventually
    crashes)
    hm i haven't explained why i was asking my questions actually
    so, i want radix trees, because they're nice
    they reduce path lengths
    they don't get too unbalanced (they're invariant wrt the order of
    operations)
    they don't need to write to memory on lookups
    the only drawback is that they can create much overhead if their usage
    pattern isn't appropriate
    elements in such a structure should be close, so that they share common
    nodes
    the common usage pattern in ext2fs is a big bunch of ever-open ports :)
    if there is one entry per node, it's a big waste
    yes there are 3, actually
    but the port names have low values
    they're allocated sequentially, beginning at 0 (or 1 actually)
    which is perfect for radix trees
    yes
    97989: send
    but if anyone can rename
    this introduces a new potential weakness
    ah, if it's just a weakness it's probably not a problem
    I thought it was even a no-go
    i think so
    I guess port rename is very seldom used
    but in a future version, it would be nice not to allow port renaming
    unless there are similar issues in current unix kernels
    in which case i'd say it's acceptable
    there are
    of that order ?
    and it'd be useful for e.g. process tracing/debugging/tweaking/whatever
    it's also used to hide fds from a process
    port renaming you mean ?
    you allocate them very high
    yes
    ok
    choosing your port name, generally
    to match what the process expects for instance
    then it would be a matter of resource limiting (which we totally lack
    afaik)
    alongside the maximum number of open files, you would have a maximum
    number of rights
    does that seem fine to you ?
    if done through rlimits, sure
    something similar yes
    (_no_ PORTS_MAX ;) )
    oh and, in addition, i remember gnumach has a special configuration of
    the processor in which caching is limited
    like write-through only
    didn't I fix that recently ?
    i don't know :)
    CR0=e001003b
    i don't think it's fixed
    I mean, in the git
    ah, not in the debian package
    didn't try the git version yet
    last time i tried (which was a long time ago), it made the kernel crash
    have you figured out why ?
    I'm not aware of that
    anyway, splay trees write a lot, and most trees write a lot even at
    insertion/removal, to rebalance
    braunr: Mmm, there's no clearance of CD in the kernel actually
    with radix trees, even if caching can't be fully enabled, it would make
    much better use of it
    so if port renaming isn't a true issue, i'll choose that data structure
    that'd probably be better, yes
    I'm surprised by the CD, I do remember fixing something like this lately
    there are several levels where CD can be set
    the processor ORs all those, if i'm right, to determine if caching is
    enabled
    I know
    ok
    but in my memory that was at the CR* level, precisely
    maybe for xen only ?
    no
    well good luck if you hunt that one, i'm off, see you :)
    braunr: ah, no, it was the PGE flag that I had fixed
    braunr: explicit port naming is used for example to pass some initial
    ports to a new task at well-known places IIRC
    braunr: but these tend to be low numbers, so I don't see a problem there
    (I'm not familiar with radix trees... why would high numbers be a
    problem?)
    braunr: iirc the ipc space is limited to ~192k ports
    antrik: in most cases i've seen, the insert_right() call is used on
    task_self()
    and if there really are special ports (like the bootstrap or device
    ports), they should have special names
    IIRC, these ports are given through command line expansion by the kernel
    at boot time
    but it seems reasonable to think of port renaming as a potentially
    useful feature
    antrik: the problem with radix trees isn't names being high, it's them
    being sparse
    you get the most efficient trees when entries have keys that are close
    to each other
    because radix trees are a type of tries (the path in the tree is based
    on the elements composing the key)
    so the more common prefixes you have, the fewer external nodes you need
    here, keys are port names, but they can be memory addresses or offsets
    in memory objects (like in the page cache)
    the radix algorithm takes a few bits, say 4 or 6, at a time from a key,
    and uses that as an index in a node
    if keys are sparse, there can be as few as one entry per node
    IIRC, the worst case (one entry per node with the maximum possible
    number of nodes for a 32-bit key) is 2% entries
    the rest being null entries and almost-empty nodes containing them
    so if you leave the ability to give port rights the names you want, you
    can create such worst case trees
    which may consume several MiB of memory per tree
    tens of MiB i'd say
    on the other hand, in the current state, almost all hurd applications
    use sequentially allocated port names, close to 0
    (which allows a nice optimization)
    so a radix tree would be the most efficient
    well, if some processes really feel they must use random numbers for
    port names, they *ought* to be penalized ;-)

2011-04-27

    antrik: remember when you asked why high numbers would be a problem with
    radix trees ?
    here is a radix tree with one entry, whose key is around 5000
    [  656.296412] tree height: 3
    [  656.296412] index: 0, level: 0, height: 3, count: 1, bitmap: 0000000000000002
    [  656.296412] index: 1, level: 1, height: 2, count: 1, bitmap: 0000000000004000
    [  656.296412] index: 14, level: 2, height: 1, count: 1, bitmap: 0000000000000080
    three levels, each with an external node (dynamically allocated), for
    one entry
    so in the worst case of entries with keys close to the highest values,
    there could be many external nodes, with longer paths than when keys
    are close to 0
    which also brings the problem of port name allocation
    can someone with access to a buildd which has an uptime of at least a
    few days (and did at least one build) show me the output of
    portinfo 3 | tail ?
    port names are allocated linearly IIRC, like PIDs, and some parts of
    the kernel may rely on them not being reused often
    but for maximum effifiency, they should be
    efficiency*
    00:00 < braunr> can someone with access to a buildd which has an uptime
    of at least a few days (and did at least one build) show me the output
    of portinfo 3 | tail ?
    :)
    it's almost like wc -l
    4905: receive
    vs 4647 for /
    52902: receive
    vs 52207 for the chroot
    even after several builds ?
    and several days ?
    that's after 2 days
    it's not so many builds
    rossini is not so old (7h) but many builds
    70927: send
    vs 70938
    ok so it seems port names are reused
    good
    yes they are clearly
    i think i remember a comment about why the same port name shouldn't be
    reused too soon
    well, it could help catching programming errors
    that it helped catch bugs in applications that could deallocate/reallote
    quickly
    reallocate*
    without carefuly synchronization
    careful*
    damn, i'm tired :/
    but that's about debugging so we don't care about performance there
    yes
    i'll try to improve allocation performance too
    using e.g. bitmaps in each external node, back to the root
    so that unused slots are quickly found
    i think that's what idr does in linux
    braunr: idr?
    antrik: a data structure used to map integers to pointers
    http://fxr.watson.org/fxr/source/lib/idr.c?v=linux-2.6

2011-06-18

    < braunr> hmm, i'm having a problem with integrating my radix tree code
      in gnumach
    < braunr> inserting into such a tree can trigger memory allocation
    < braunr> so commonly, the tree is loaded with nodes before insertion,
      usually if it requires strong locking
    < braunr> ipc spaces are locked using "simple locks" (which are spin
      locks)
    < braunr> but spin locks are noops on UP, and gnumach is always UP ..
    < braunr> so, should i still include preloading code, even if it'll end
      up dead code ?
    < antrik> hm... I think we discussed this before; but isn't gnumach
      supposed to be SMP-capable, minus bugs?...
    < braunr> it is
    < braunr> but ofc, if i choose not to include preloading, i'll write
      #errors so that the day gnumach is built for SMP again, such support
      will be included
    < antrik> oh, sorry, I think I misread. what is UP?
    < braunr> uniprocessor
    < antrik> well, if it's just bugs forcing the current UP state, I think
      saying that gnumach is always UP is a stretch...
    < braunr> sure, but it's a practical consideration
    < antrik> does the locking complicate stuff? or is it just performance
      considerations?
    < braunr> no, it's about correctness and completeness
    < braunr> if you don't preload a tree before locking
    < braunr> and memory allocation occurs while you're holding a simple
      lock
    < braunr> and memory allocation requires the kernel to sleep
    < braunr> you're screwed
    < braunr> but i hate the idea of including code that won't be used and
      which won't be easy to test
    < braunr> so i'm wondering if it's ok for now to just put this in a TODO
      comment and write it when the time is right
    < braunr> or if i should spend the week adding this and tweaking the
      userspace implementation to "emulate" spin locks
    < antrik> well, it's a tricky situation. on one hand, it seems stupid to
      include handling for something that presently isn't used, and it's
      not clear when it will. on the other hand, I'd rather not see
      additional problems introduced that will make fixing SMP even
      harder...
    < braunr> that's why i'm asking here
    < antrik> of course, you could resolve this question by fixing SMP
      first... ;-)
    < braunr> ew
    < antrik> well, I guess it would be best first to make the code work...
      and we can still decide about the locking thing before it goes
      mainline I'd say?
    < braunr> "make the code work" ?
    < antrik> I mean make gnumach work with your radix tree code
    < braunr> without preloading then
    < antrik> yeah... as a first step... I guess adding it later won't be
      any harder than adding it right now?
    < braunr> not much
    < braunr> testing is what requires time really

2011-06-27

    < braunr> ok, here is the radix tree code:
      http://git.sceen.net/rbraun/libbraunr.git/
    < braunr> the preloading stuff will be added in the kernel only, as it's
      really pointless and not easily doable in userspace
    < youpi> preloading?
    < braunr> youpi: yes, preloading
    < braunr> radix trees allocate external nodes
    < youpi> well, providing a url at some random time of some random day is
      not a great way to get eyes on it :)
    < braunr> and ipc spaces are locked when inserting/allocating names
    < braunr> we normally don't need preloading in gnumach
    < braunr> since there is no preemption nor SMP
    < braunr> but in case someone changes that, i'd like the code to be
      mostly ready
    < braunr> and correctly handle those ugly simple locks
    < braunr> youpi: is what i say clear enough or do you need more
      background on what is done ?
    < youpi> about preloading?
    < braunr> yes
    < youpi> I guess it means allocating nodes in advance?
    < braunr> yes
    < youpi> k
    < braunr> before locking the ipc spaces

2011-06-28

    < braunr> antrik: i think i won't write the code for the preloading
      stuff actually
    < braunr> antrik: it's not very difficult, but i really hate the idea of
      not being able to reliably test it
    < braunr> antrik: and i'd rather concentrate on integrating the radix
      tree code in gnu mach now
    < braunr> (i've already removed much code, even some files which weren't
      actually used before my changes !)
    < braunr> hmm, i won't be able not to write the preloading code after
      all
    < antrik> braunr: not able not to write? how's that?
    < braunr> antrik: it's actually required
    < braunr> there are three functions, ipc_entry_get, ipc_entry_alloc, and
      ipc_entry_grow_table
    < braunr> ipc_entry_get cannot allocate memory
    < braunr> if it fails, ipc_entry_grow_table is called, which will
      allocate memory
    < braunr> ipc_entry_alloc calls both of them depending on the result of
      ipc_entry_get
    < braunr> this is the equivalent of the preloading thing i had in mind
    < braunr> not a bad thing after all
    < braunr> the only thing i'm afraid of are the "optimized" versions of
      those ipc functions in the so-called fast paths
    < braunr> i'm afraid if i don't deal right with those, the kernel may
      end up using mostly slow paths
    < braunr> but considering the purpose of those fast paths was merely to
      avoid the overhead of function calls and some locking functions, it
      shouldn't be that bad
    < braunr> this is such a mess eh
    < antrik> hurray microoptimisations ;-)
    < braunr> there, the preload functions are done, easy :)
    < antrik> braunr: seems you spent less time implementing it than
      pondering whether you should implement it ;-)
    < braunr> well, i couldn't implement it correctly before knowing what
      should have been done exactly
    < braunr> and there are still other problems :/
    < braunr> and the other problems make me reconsider if this was useful
      at all eh
    < braunr> youpi: i'm unable to find where ipc tree entries are released
      except in ipc_entry_alloc_name(), which could mean they're leaked ...
    < braunr> youpi: would you have time to take a look ?
    < youpi> they aren't in ipc_entry_dealloc() ?
    < braunr> no .....
    < youpi> it's not so improbable that they're only freed when the task
      quits
    < braunr> i don't see that either
    < braunr> i only see them released in ipc_entry_alloc_name()
    < braunr> so maybe they're reused
    < braunr> but i'm not sure about that when reading the code
    < braunr> oh wait, yes, they are :/
    < braunr> my bad
    < youpi> in the ipc_splay_tree_* functions I guess?
    < braunr> yes
    < braunr> it's just surprising to see them allocated outside the tree
      code only
    < braunr> but released in both the entry and the splay tree code ...

2011-06-29

    < braunr> hmm i missed an important thing :/
    < braunr> and it's scary
    < braunr> it looks like the splay tree is mainly used when names are
      provided
    < braunr> whereas the entry table is used when names are allocated
    < braunr> which means the table is the main ipc data structure, even for
      tasks with lots of rights
    < braunr> i can make my root ext2fs have more than 10k rights, and i see
      the ipc table grow along that number ...
    < braunr> now the table has 15k+ entries
    < braunr> IOW there is no point to put the radix tree code in gnumach :(
    < antrik> braunr: what do you mean by "provided" and "allocated"?
    < antrik> and what is that table you are talking about?
    < braunr> antrik: provided means the user space task gives the name of
      the new right
    < braunr> antrik: allocated means the kernel generates it
    < braunr> antrik: the table i'm talking about is is_table in struct
      ipc_space
    < braunr> 55  *	Every space has a non-NULL is_table with is_table_size entries.
    < braunr> 56  *	A space may have a NULL is_tree.  is_tree_small records the
    < braunr> 57  *	number of entries in the tree that, if the table were to grow
    < braunr> 58  *	to the next larger size, would move from the tree to the table.
    < braunr> here is the description which misled me (in addition to the
      obscure code)
    < braunr> 50  *	Spaces hold capabilities for ipc_object_t's (ports and port sets).
    < braunr> 51  *	Each ipc_entry_t records a capability.  Most capabilities have
    < braunr> 52  *	small names, and the entries are elements of a table.
    < braunr> 53  *	Capabilities can have large names, and a splay tree holds
    < braunr> 54  *	those entries.  The cutoff point between the table and the tree
    < braunr> 55  *	is adjusted dynamically to minimize memory consumption.
    < antrik> ah, so the rights with a low name are in a linear table, and
      only those with "random" high names are stored in the splay tree
      instead?
    < antrik> seems a rather complex design... I guess though there isn't
      much room for further optimisation there :-(
    < antrik> (well, except for code size optimisation -- which could in
      fact make a considerable difference...)
    < braunr> well there are problems with this approach, but most don't
      concern performance
    < braunr> when the table gets big (close to the page size or more), it
      gets remapped when reallocated
    < braunr> which will incur some penalty because of the tlb
    < braunr> but it's annoying even for small tables
    < braunr> the initial table size is 4 entries, and from what i can see,
      most tables are 128 entries wide when tasks are destroyed
    < braunr> an obvious simple optimization is to set a larger default size
    < braunr> the same applies for the dead name tables
    < braunr> those reallocations are a pain, and they're due to this design
    < braunr> they can also fail because of fragmentation
    < braunr> there would be a point to radix trees if they would replace
      all that, and not just the splay tree
    < braunr> but this would cause a lot of changes in a lot of places, and
      in particular the "optimized" fast paths i mentioned yesterday
    < braunr> we'll see how they perform in x15 :>
    < braunr> there is a slight noticeable improvement when increasing the
      initial size of the entry table
    < antrik> braunr: well, if you use them in a completely different
      implementation, there will be no way of telling whether they make a
      difference
    < antrik> how did you test the improvement?
    < braunr> antrik: no actually it's completely negligeable
    < braunr> hm
    < braunr> is that a valid word ? :)
    < antrik> negligible
    < braunr> youpi: did you see my comments about the ipc stuff earlier
      today ?
    < braunr> youpi: well to make things short, when port names are
      allocated, the right they refer to is allocated from the ipc table
    < braunr> youpi: the splay tree is only used for user-provided names
    < braunr> youpi: i had tables as large as the number of rights in a
      space (i could easily reach 20k)
    < braunr> youpi: whereas the splay trees had at most ~40 entries ..