community/gsoc/project_ideas/object_lookups.mdwn


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134

[[!meta copyright="Copyright © 2013 Free Software Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
id="license" text="Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
is included in the section entitled [[GNU Free Documentation
License|/fdl]]."]]"""]]

[[!meta title="Improved System Object Lookups"]]

The Hurd currently uses its ihash library ([[hurd/libihash]]) as a generic
container for various objects. While it does its job, it has been reported
to suffer from high collision rates. In addition, the "one size fits all"
approach contributes to slow things down. One particular use case is looking
up an object from a Mach port name, which basically translates to getting the
file or socket associated with a file descriptor in traditional Unix systems.
It's particular because there are actually two lookups for each object, the
first being finding the Mach port from a client port name, which is done in
the GNU Mach kernel, and the second being finding the server object from a
server port name. The best strategy would probably be to directly associate
the address of an object to the receive right of its port, eliminating the
need to look up again, but this is quite an intrusive change in the code base.
For the time being, optimizing lookups would already be an improvement.

The goal of this project is to increase system performance by speeding up
object lookups, with a particular focus on name-to-object lookups. Note that
there is little room for improvement in the kernel name-to-port lookups because
of the various optimizations IPC has received in the past. Looking up server
objects from port names could use an algorithm highly tuned for this task,
perhaps with better locking (shared/exclusive instead of always mutually
exclusive for example). Then, the libihash algorithm could be replaced with a
better one, not necessarily a hash based one, to improve all the other users.

This task requires proper knowledge of data structure algorithms, taking into
account machine properties such as processor caches, as well as the appropriate
skills in C and assembly to check the generated code. Being able to perform
accurate measurements in a system that lacks modern profiling tools would also
be helpful.

Possible mentors: Richard Braun


# IRC, freenode, #hurd, 2013-09-18

In context of [[!message-id "20130918081345.GA13789@dalaran.sceen.net"]].

    <teythoon> braunr: (wrt the gnumach HACK) funny, I was thinking about doind
      the same for userspace servers, renaming ports to the address of the
      associated object, saving the need for the hash table...
    <braunr> teythoon: see
      http://darnassus.sceen.net/~hurd-web/community/gsoc/project_ideas/object_lookups/
    <braunr> teythoon: my idea is to allow servers to set a label per port,
      obtained at mesage recv time
    <braunr> because, yes, looking up an object twice is ridiculous
    <braunr> you normally still want port names to be close to 0 because it
      allows some data structure optimizations
    <teythoon> braunr: yes, I feared that ports should normally be smallish
      integers and contigious at best
    <teythoon> braunr: interesting that you say there that libihash suffers
      from high collision rates
    <teythoon> I've a theory to why that is, libihash doesn't do any hashing at
      all
    <pinotree> there are notes about that in the open_issues section of the
      wiki
    <teythoon> but I figured that this is probably ok for port names, as they
      are small and contigious
    <neal> braunr: That's called protected payload.
    <neal> braunr: The idea is that the kernel appends data to the message in
      flight.


## IRC, freenode, #hurd, 2013-10-24

    <teythoon> and, with some effort, getting rid of the hash table lookup by
      letting the kernel provide the address of the object (iirc neil knew the
      proper term for that)
    <braunr> teythoon: that is a big interface change
    <teythoon> how so
    <braunr> optimizing libihash and libpthread should already be a good start
    <braunr> well how do you intend to add this information ?
    <braunr> ok, "big" is overstatement, but still, it's a low level interface
      change that would probably break a lot of things
    <teythoon> store a pointer in the port structure in gnumach, make that
      accessible somehow
    <braunr> yes but how ?
    <teythoon> interesting question indeed
    <braunr> my plan for x15 is to make this "label" part of received messages
    <braunr> which means you need to change the format of messages
    <braunr> that is what i call a big change
    <teythoon> ok, so we need to provide an update path
    <teythoon> but once done, the change to hurd will be minimal, patching
      libports should cover most of that
    <braunr> normally yes
    <teythoon> so this amounts to messing with gnumach and mig and designing a
      clever way to make the update process safe

    <braunr> libihash is known to show high collision rates
    <teythoon> right, libihash
    <teythoon> it could use an integer hash function on the keys to distribute
      them better
    <braunr> i think that's already what it tries to do
    <braunr> so merely using a better hash algorithm such as murmur should do
      the job
    <braunr> or use another data structure altogether
    <teythoon> no, it does no hashing of its own on the keys
    <braunr> are you sure ?
    <teythoon> well, it uses only prime numbers as sizes, and computes key %
      size
    <braunr> well that's hashing .. :)
    <teythoon> but this is not really a good hash
    <braunr> yes
    <braunr> isn't that what i said ?
    <teythoon> right
    <teythoon> ok, I didn't get that ;)
    <teythoon> also, the sizes start quite small, 3, 7, 19...
    <teythoon> and each time the hash table is grown, all items will have to be
      updated
    <braunr> which is why we could consider another data structure
    <teythoon> or, for starters, to thin out that list of sizes
    <braunr> my personal preference being radix trees
    <teythoon> I assume you have an implementation handy?
    <braunr> yes
    <teythoon> cool :D
    <braunr> but good hashing is excellent too
    <braunr> radix trees have their own issues
    <teythoon> braunr: http://burtleburtle.net/bob/hash/integer.html
    <braunr> i use thomas wang's hashing function in x15
    <braunr> or rather, my own personal c utility library, since x15 doesn't
      hash anything currently
    <braunr> but murmur is better
    <braunr> we prefer distribution over hashing performances
    <braunr> https://131002.net/siphash/