[[!meta copyright="Copyright © 2013, 2014 Free Software Foundation, Inc."]]
[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
id="license" text="Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
is included in the section entitled [[GNU Free Documentation
[[!meta title="Improved System Object Lookups"]]
The Hurd currently uses its ihash library ([[hurd/libihash]]) as a generic
container for various objects. While it does its job, it has been reported
to suffer from high collision rates. In addition, the "one size fits all"
approach contributes to slow things down. One particular use case is looking
up an object from a Mach port name, which basically translates to getting the
file or socket associated with a file descriptor in traditional Unix systems.
It's particular because there are actually two lookups for each object, the
first being finding the Mach port from a client port name, which is done in
the GNU Mach kernel, and the second being finding the server object from a
server port name. The best strategy would probably be to directly associate
the address of an object to the receive right of its port, eliminating the
need to look up again, but this is quite an intrusive change in the code base.
For the time being, optimizing lookups would already be an improvement.
The goal of this project is to increase system performance by speeding up
object lookups, with a particular focus on name-to-object lookups. Note that
there is little room for improvement in the kernel name-to-port lookups because
of the various optimizations IPC has received in the past. Looking up server
objects from port names could use an algorithm highly tuned for this task,
perhaps with better locking (shared/exclusive instead of always mutually
exclusive for example). Then, the libihash algorithm could be replaced with a
better one, not necessarily a hash based one, to improve all the other users.
This task requires proper knowledge of data structure algorithms, taking into
account machine properties such as processor caches, as well as the appropriate
skills in C and assembly to check the generated code. Being able to perform
accurate measurements in a system that lacks modern profiling tools would also
Possible mentors: Richard Braun
# IRC, freenode, #hurd, 2013-09-18
In context of [[!message-id "20130918081345.GA13789@dalaran.sceen.net"]].
<teythoon> braunr: (wrt the gnumach HACK) funny, I was thinking about doind
the same for userspace servers, renaming ports to the address of the
associated object, saving the need for the hash table...
<braunr> teythoon: see
<braunr> teythoon: my idea is to allow servers to set a label per port,
obtained at mesage recv time
<braunr> because, yes, looking up an object twice is ridiculous
<braunr> you normally still want port names to be close to 0 because it
allows some data structure optimizations
<teythoon> braunr: yes, I feared that ports should normally be smallish
integers and contigious at best
<teythoon> braunr: interesting that you say there that libihash suffers
from high collision rates
<teythoon> I've a theory to why that is, libihash doesn't do any hashing at
<pinotree> there are notes about that in the open_issues section of the
<teythoon> but I figured that this is probably ok for port names, as they
are small and contigious
<neal> braunr: That's called protected payload.
<neal> braunr: The idea is that the kernel appends data to the message in
## IRC, freenode, #hurd, 2013-10-24
<teythoon> and, with some effort, getting rid of the hash table lookup by
letting the kernel provide the address of the object (iirc neil knew the
proper term for that)
<braunr> teythoon: that is a big interface change
<teythoon> how so
<braunr> optimizing libihash and libpthread should already be a good start
<braunr> well how do you intend to add this information ?
<braunr> ok, "big" is overstatement, but still, it's a low level interface
change that would probably break a lot of things
<teythoon> store a pointer in the port structure in gnumach, make that
<braunr> yes but how ?
<teythoon> interesting question indeed
<braunr> my plan for x15 is to make this "label" part of received messages
<braunr> which means you need to change the format of messages
<braunr> that is what i call a big change
<teythoon> ok, so we need to provide an update path
<teythoon> but once done, the change to hurd will be minimal, patching
libports should cover most of that
<braunr> normally yes
<teythoon> so this amounts to messing with gnumach and mig and designing a
clever way to make the update process safe
<braunr> libihash is known to show high collision rates
<teythoon> right, libihash
<teythoon> it could use an integer hash function on the keys to distribute
<braunr> i think that's already what it tries to do
<braunr> so merely using a better hash algorithm such as murmur should do
<braunr> or use another data structure altogether
<teythoon> no, it does no hashing of its own on the keys
<braunr> are you sure ?
<teythoon> well, it uses only prime numbers as sizes, and computes key %
<braunr> well that's hashing .. :)
<teythoon> but this is not really a good hash
<braunr> isn't that what i said ?
<teythoon> ok, I didn't get that ;)
<teythoon> also, the sizes start quite small, 3, 7, 19...
<teythoon> and each time the hash table is grown, all items will have to be
<braunr> which is why we could consider another data structure
<teythoon> or, for starters, to thin out that list of sizes
<braunr> my personal preference being radix trees
<teythoon> I assume you have an implementation handy?
<teythoon> cool :D
<braunr> but good hashing is excellent too
<braunr> radix trees have their own issues
<teythoon> braunr: http://burtleburtle.net/bob/hash/integer.html
<braunr> i use thomas wang's hashing function in x15
<braunr> or rather, my own personal c utility library, since x15 doesn't
hash anything currently
<braunr> but murmur is better
<braunr> we prefer distribution over hashing performances
## IRC, freenode, #hurd, 2013-11-21
<teythoon> btw, about protected payloads in mach
<teythoon> I'm thinking about adding a flag to indicate that mach_msg
should return the protected payload pointer instead of the local port
field in the message header
<braunr> when would you set it ?
<braunr> i mean, how is it set ?
<teythoon> we don't really need the port name, right? and when we do, we
look it up in the referenced data structure
<teythoon> using a new rpc perhaps
<teythoon> what do you think?
<braunr> a new rpc on ports themselves, like mach_port_mod_refs i assume ?
<braunr> i think it's a good solution
<teythoon> the field is a natural_t, as far as i can see, pointers should
fit into it, right?
<braunr> the big problem is backward compatibility
<braunr> and your solution solves that
<braunr> natural_t was originally intended to be as large as the machine
<braunr> but it may no longer stay true
<braunr> i think youpi decided to keep it an int and not a long in his
<braunr> mach uses a trick for in-kernel port rights
<braunr> where the right is the port address
<teythoon> yes, I've seen that
<braunr> but i remember hearing about problems with this strategy in
<braunr> or maybe compat problems in mig interfaces
<braunr> i don't remember exactly
<braunr> so youpi looked at how macosx mach deals with the problem
<teythoon> well, but so far there is no 64 bit mach, so we do not need to
worry about compatibility there, no ?
<braunr> and if i'm right, they forced the ports on 32-bits
<braunr> no you're right
<braunr> we can simply force the field to 64-bits, whatever it contains
<teythoon> or change the message format from the beginning to include both
the name and the payload
<teythoon> then again, why bother
<braunr> have a 64-bits specific message format ?
<teythoon> well, it's worth discussing, no?
<braunr> i personally don't like the idea
<teythoon> we could fix stuff
<braunr> forcing the field to 64-bits should be enough
<teythoon> do you think the idea is worth prototyping ?
<braunr> teythoon: yes
<teythoon> braunr: cool :)
<braunr> teythoon: definitely :p
<braunr> actually, doing that can remove a large part (if not all)
contention from libports
<braunr> i still think we should work on libihash first
<braunr> converting libihash to murmur2/3 impacts more data structures
<braunr> it's also much easier
<teythoon> what exactly do you mean by that
<braunr> libports uses libihash
<braunr> but it's not the only user
<braunr> libihash is known to have high collision rates
<braunr> that should be fixed
<teythoon> right, but what do you mean by using murmur2/3
<braunr> that's a hashing algorithm name
<teythoon> using the integer finalizer used by murmur?
<braunr> i didn't dig the details
<braunr> and simply assumed it could be used for integer hashing as well
<teythoon> the way i see it, murmur can hash arbitrary ammounts
<braunr> if there are better integer hashing algorithms, let's just use
<teythoon> but that is not what we need
<teythoon> we have a fixed size integer
<braunr> but from what i remember, it's also very efficient for integer
## IRC, freenode, #hurd, 2013-11-22
<teythoon> braunr: /test-pp: msgh_protected_payload is 0x12345678
<teythoon> but currently I do another ipc_port_translate which is clearly
<teythoon> the msg handling in the kernel is... involved...
<teythoon> here is the thing... there are two (kernel) threads involved,
the sender and the receiver
<teythoon> for the sender, kmsg->ikm_header.msgh_remote_port is a pointer
(thanks to ipc_port_translate) to the destination's ipc_port_t
<teythoon> that's where the protected_payload is stored
<teythoon> but at the receiving thread, the pointer is gone, replaced by a
<teythoon> so currently I'm doing the lookup there again
<braunr> are you sure kmsg is the general structure for all messages ?
<braunr> or is it only for kernel messages ?
<braunr> i don't remember exactly
<teythoon> no, for all messages
<teythoon> I just need to get this pointer across cleanly
<braunr> i thought you wanted to replace that port name in the receiving
thread with the payload
<teythoon> I do
<braunr> i don't see the problem then
<teythoon> well, only the sending thread has the pointer, the receiving
thread only has the name
<braunr> i don't see what makes it hard to change it
<braunr> since that's what you want to do
<braunr> the sending thread doesn't have the pointer
<teythoon> yes it has
<braunr> only for in kernel objects
<braunr> and it's an optimization
<braunr> and you shouldn't have to care much about it
<braunr> your work only involves changing how messages are received
<teythoon> let me push it somewhere, so I can show you the patches
<braunr> teythoon: where should i look at ?
<teythoon> the last commit
<braunr> see what calls mach_msg_receive
<braunr> the payload flag must be handled before, when the message is
<braunr> ipc_kmsg_copyin perhaps
<teythoon> but this is the tricky part
<braunr> i'm not sure which of the sender or receiver code actually
performs these translations
<teythoon> b/c at this point it is not known whether the receiver has
specified the MACH_RCV_PROTECTED_PAYLOAD flag
<teythoon> or my understanding of the whole process is still somewhat off,
which might very well be...
<braunr> it's not something the receiver should set
<braunr> i.e. the flag shouldn't be set at mach_msg time
<braunr> because it's asynchronous
<braunr> it's a flag that should be set near port creation time
<teythoon> right, I can see how that could work
<braunr> mach_reply_port(); mach_port_set_payload(); mach_msg();
<teythoon> I think I found the right spot
<braunr> teythoon: looks better indeed :)
<teythoon> braunr: yes, thanks for the hint :)
<braunr> teythoon: keep in mind gnumach supports moving receive rights
<braunr> i don't think it's much of a burden but don't forget :)
<teythoon> right, if that happens, the protected payload field should
probably be just reset to 0
<teythoon> that preserves the old default behavior
<braunr> teythoon: you shouldn't name the payload "address" though
<braunr> but really "payload" or "label"
<braunr> vm_offset_t isn't the appropriate type either
<braunr> i suggest unsigned long payload
<teythoon> braunr: noted
<braunr> what i mean is
<braunr> the payload isn't the userspace structure you want to use
<braunr> it's the value stored in that unsigned long
<braunr> whether it's used as a pointer or an array index or whatever
should be at the application discretion
<teythoon> yes, I got that
<braunr> concerning vm_offset_t, it's misused a lot, mostly for historical
<braunr> vm_offset_t is actually the ancestor of off_t
<braunr> i.e. an offset inside a *memory object*
<braunr> the size of which may differ from the size of a pointer
<braunr> historically, physical and virtual addresses, in addition to
memory object sizes, were the same, hence the confusion
<braunr> today we might have 32-bits virtual addresses, 36-bits physical
addresses, and 44- to 64-bits file offsets
<braunr> (e.g. PAE with large file support)
## IRC, freenode, #hurd, 2013-11-25
<teythoon> braunr: the object lookup problem is worse than i assumed
<teythoon> the lookup is actually done twice...
<braunr> teythoon: isn't that usually the case :) ?
<braunr> inside gnumach ?
<teythoon> once in libports, once in the intrans function
<braunr> which intrans function ?
<braunr> can you point at an example ?
<teythoon> routine file_get_fs_options ( file: file_t;
<teythoon> file_t is special
<teythoon> mig magic
<teythoon> type file_t = mach_port_copy_send_t
<teythoon> #ifdef FILE_INTRAN
<teythoon> intran: FILE_INTRAN
<braunr> trivfs_begin_using_protid ?
<teythoon> for example, yes
<teythoon> however, I believe that can be fixed cleanly
<teythoon> I revised my gnumach changes
<teythoon> it works surprisingly well
<braunr> gnumach is largely clean code
<teythoon> i patched libports to use the new falicilty
<teythoon> all the fs translators i tested work fine
<teythoon> tmpfs, ext2fs, nfs, hello*
<teythoon> so does exec
<braunr> very nice
<teythoon> howevcer, the bootstrap fails
<braunr> a lot more straightforward than i expected
<teythoon> i believe proc crashes
<braunr> you can use mach_print to manually trace the bootstrap process
<teythoon> i did that
<teythoon> it's nice
<braunr> bare knives are :)
<teythoon> uh oh, this lookup fix requires some mig changes
<braunr> teythoon: have you built some packages on it ?
<teythoon> braunr: some clang test builds
<braunr> where is mig getting in the way ?
<teythoon> yes, and i compiled lots of stuff
<braunr> any debian package ?
<teythoon> let me just push my changes somewhere...
<teythoon> no, no deb
<teythoon> braunr: in particular,
## IRC, freenode, #hurd, 2013-11-27
<teythoon> btw, my protected payload work is progressing nicely
<teythoon> the system actually boots now :)
<braunr> that's great
<braunr> looking forward to seeing it in action
<teythoon> I'd love to quickly discuss my mig changes if you've got a
<teythoon> first, please look at this
<teythoon> in line 165, the msgh_local_port is restored
<teythoon> b/c later some intrans function might use this for the object
<braunr> yes ok
<braunr> makes sense
<teythoon> this makes mig payload aware
<teythoon> we'd specify another intrans function that takes a label and
returns an object
<braunr> let me remind
<braunr> you said there were 3 lookups actually
<braunr> the mach one
<braunr> the libports one
<braunr> and is the intran one the last, right ?
<teythoon> so now i optimized away the second one, the one in libports
<braunr> ok so you need intran aware functions to replace that lookup
<braunr> payload aware intran functions
<teythoon> mostly cast the label, ports_port_ref the object
<braunr> i assume they'd be pretty straightforward
<braunr> and easy to add for all existing intran functions
<teythoon> most likely
<braunr> the proposed change looks very appropriate
<braunr> i'd never thought about intran functions because i didn't want
that in my clone ;p
<braunr> they do add a bit of complexity
<braunr> but this upgrade path looks right
<teythoon> I think so too
<braunr> nothing more to say :)
<braunr> it's so simple i actually don't understand how i could miss it
last time i looked
<braunr> i guess i was exhausted heh
<teythoon> thanks for the review :)
<braunr> thanks for your work
<braunr> it's been a long time since we had someone spend that much time on
## IRC, freenode, #hurd, 2013-11-29
<teythoon> I came to believe that there is actually a lot of room for
improvement in our rpc path
## IRC, freenode, #hurd, 2013-12-19
<braunr> teythoon_: how is protected payload branch now ?
<braunr> ready for review ?
<teythoon_> the kernel and mig patch are
<braunr> so pending for approval rathr
<teythoon_> documentation is still missing for those ofc
<braunr> the last parts are the mig mutations iirc
<teythoon_> err, you lost me
<teythoon_> i haven't continued to work on the hurd patch series
<teythoon_> the patch series for gnu mach and mig are feature complete from
my point of view
<braunr> i mean the changes needed to remove the third lookup in the
<teythoon_> to do that in hurd, we need a patched mig
<braunr> i was just trying to remember correctly
<teythoon_> those patches need to be reviewed
<teythoon_> the hurd patch series is not yet working, but you can see the
approach i've taken
<teythoon_> the next thing i'd do in this regard is to fix all object
<braunr> so it didn't change from last time i looked
<teythoon_> some code, notoriously the control port handling in the *fs
libs, uses mach_port_t for the receiver and do the lookup themself. i'd
fix that next.
## IRC, freenode, #hurd, 2014-01-20
<teythoon> i've tied up enough loose ends, that i can start working on the
protected payload stuff again
<teythoon> the next step is fixing the receiver lookups everywhere
<braunr> good :)
<teythoon> if everyone uses mig magic for that, the switch will be easy
<teythoon> undoing the hack in mach-defpager too
## IRC, freenode, #hurd, 2014-01-24
<braunr> teythoon: what are you currently working on ? protected payload ?
<teythoon> braunr: yes, i'm working with coccinelle to fix all object
lookups in the hurd
<teythoon> i figured it is easier and cleaner to just fix them instead of
converting pointers back to port names for those functions that want port
## IRC, freenode, #hurd, 2014-02-17
<teythoon> braunr: do you think it's okay to make the 0 payload special ?
<braunr> teythoon: for our needs, sure
<braunr> it's similar to NULL or MACH_PORT_NULL
<teythoon> maybe i should add a symbolic name for that
<teythoon> for consistency
<braunr> but is it wise to let mach_port_set_protected_payload reset the
behaviour on a zero payload ?
<braunr> i don't think a symbolic name is needed
<braunr> or maybe
<braunr> as you want
<teythoon> what else should it do then ?
<braunr> 00:25 < braunr> but is it wise to let
mach_port_set_protected_payload reset the behaviour on a zero payload ?
<braunr> it could return invalid argument instead
<braunr> and the documentation would clearly state 0 is invalid
<braunr> but that would also prevent reverting the mode
<teythoon> yes, i consider that not really useful, but i'd be okay with the
<teythoon> but yes, the documentation should make that clear
## IRC, freenode, #hurd, 2014-02-22
<teythoon> braunr: once the pp patch set is in gnumach, i'll make
mach-defpager use it
<teythoon> it's a good target, as it does not use libports
<teythoon> and it's currently abusing the port rename procedure for the
lookup, making the rights spill into the splay tree
<teythoon> braunr: the wiki mentioned that you once considered to remove
the ability to rename ports altogether
<braunr> teythoon: ah right
<braunr> i actually intend to keep it for x15, but only because i want port
names to blend as file descriptors
<teythoon> right, for dup and friends
<braunr> and the radix tree is a data structure that can cope decently with
not too sparsed distributions
## IRC, freenode, #hurd, 2014-02-27
<braunr> teythoon: ah, just saw the commit that will make our network
<teythoon> network ?
<braunr> eh no, it's about ioctls actually
<braunr> i read a bit too quickly
<teythoon> one more small step towards fixing all receiver lookups in the
<teythoon> i did not anticipate how much the hurd has to be changed first
in order to make use of the protected payloads
<braunr> that was my main reason not to do it actually :/
<braunr> but you're almost finished with it, aren't you ?
<teythoon> not sure tbh
<teythoon> i believe the fsys stuff was the largest chunk
## IRC, freenode, #hurd, 2014-03-02
<teythoon> youpi: i cleaned up most of the receiver lookups in hurd by now
<teythoon> but there are some tricky cases left
<teythoon> 1/ the pager stuff
<teythoon> the mig declarations are in gnumach, and do not have the
necessary intran declarations that we can mutate
<teythoon> 2/ some uses of mach_port_t instead of some_type_t in the hurd
<teythoon> (e.g. fsys_forward)
<teythoon> for 1/, i'd like to extend the definitions in gnumach
<teythoon> i'm not so sure what to do for the second case
<teythoon> we could introduce some types for each case
<teythoon> or, we do not touch the definitions
<teythoon> my protected payload prototype allows us to map payloads back to
port names for the functions that want a name
<teythoon> i did this by redefining the mach_port_t type for mig that uses
the payload to port-name intran function
<teythoon> mig allows type redefinitions, but emits a warning message
<teythoon> though i consider that a useful feature, it allows one to refine
## IRC, freenode, #hurd, 2014-03-04
<teythoon> braunr: i fixed the object lookups in libpager yesterday, a
pretty mechanic change
<braunr> teythoon: can't be bad :)
<teythoon> amusingly, the resulting packages boot about half way through
<braunr> teythoon: ?
<teythoon> it hangs while cleaning left-over files from /tmp
<braunr> ugh, libpager ..
<teythoon> tricky pager stuff is tricky ?
<braunr> tricky buggy pager stuff is tricky and buggy
<braunr> let's assume you made things faster, increasing the likelihood of
<braunr> we had a patch once for a libpager deadlock
<teythoon> well, i'm not yet at the point where things might get faster
<braunr> see 901c61a1d25e7c8963e51012760a82730eda1910
<braunr> the same problem exists elsewhere i think, you might have hit it
<teythoon> i'm still just moving the object lookups from the server
functions to the mig translation functions
<teythoon> but yes, i might have influenced the timing, not sure
<braunr> shouldn't cost too much to add some prints
<braunr> iirc, the other potential deadlock is in libpager/pager-attr.c
<braunr> when memory_object_change_attributes is called
<braunr> (which loops back into libpager when the kernel sends data back
<braunr> tricky ..
<teythoon> i'll try that when i get home
<braunr> aren't you almost done ?
<teythoon> not sure tbh
<braunr> althouhg libpager would be really great
<teythoon> and mach-defpager
<braunr> since this is actually one of the biggest points of contention
<teythoon> i'll do that next, and return to libpager later
<teythoon> for both i needed to change some rpc type definitions in gnu
<braunr> skipping lookups in libpager would make it harder to suffer
writeback thread storms
<teythoon> so i want to make sure that these changes are fine so that i can