[[!meta copyright="Copyright © 2011, 2012 Free Software Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
id="license" text="Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
is included in the section entitled [[GNU Free Documentation
License|/fdl]]."]]"""]]

[[!tag open_issue_gnumach]]

Mach is not always able to merge/coalesce mappings (VM entries) that
are made next to each other, which can lead to very large numbers of VM
entries and may slow down VM operations. This is said to particularly
affect ext2fs and bash.

The basic idea of the Mach designers is that entry coalescing is only an
optimization anyway, not a hard guarantee. We can apply it in the
common simple case, and just refuse to do it in any remotely complex
cases (copies, shadows, multiply referenced objects, pageout in
progress, ...).

To check this, one can either write a dedicated test program that
intentionally maps parts of a file next to each other and watches the
resulting VM map entries, or simply run a full Hurd system and observe
the results.

One can stress test ext2fs in particular to check for VM entry
merging:

     # grep NR -r /usr &> /dev/null
     # vminfo 8 | wc -l

The grep opens and reads lots of files, simulating a long-running
machine (perhaps a build server); afterwards one can look at the number
of mappings in the ext2fs translator (the `8` above is the PID of the
root ext2fs on that particular system; substitute the actual PID).
Depending on how populated your /usr is, you will get different
numbers.  On an older Hurd from, say, 2022, the above command would
result in 5,000-20,000 entries depending on the machine!  In June 2023,
GNU Mach gained some forward merging functionality, which lowered the
number of mappings down to 93 entries!

(It is a separate question why ext2fs makes that many mappings in the
first place. There could possibly be a leak in ext2fs responsible for
this, but none has been found so far. Another possible factor is that
libdiskfs has an unbounded node cache and Mach caches VM objects, which
in turn keeps the nodes alive.)

These are the simple forward merging cases, and related optimizations,
that GNU Mach now supports:

- Forward merging: in `vm_map_enter`, merging with the next entry, in
  addition to merging with the previous entry as was already done (see
  the sketch of the merge-eligibility checks after this list);

- For forward merging, a `VM_OBJECT_NULL` entry can be merged in front
  of a non-null VM object, provided the second entry has a large enough
  offset into the object to "mount" the first entry in front of it;

- A VM object can always be merged with itself (provided offsets/sizes
  match); this allows merging entries referencing non-anonymous VM
  objects too, such as file mappings;

- Operations such as `vm_protect` do "clipping", which means splitting
  up VM map entries in case the specified region lands in the middle of
  an entry, but they never did the opposite, "gluing" (merging,
  coalescing) entries back together when the region is later
  vm_protect'ed back. Now this is done (and we try to coalesce in some
  other cases too). This should particularly help with the "program
  break" (brk) in glibc, which vm_protect's the pages allocated for the
  brk back and forth all the time.

- As another optimization, throw away unmapped physical pages when
  there are no other references to the object (provided there is no
  pager). Previously the pages would remain in core until the object
  was either unmapped completely, or until another mapping was to be
  created in place of the unmapped one and coalescing kicked in.

- Also shrink the size of `struct vm_page` somewhat. This was
  low-hanging fruit.
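
As an illustration, here is a minimal, self-contained sketch of the kind
of eligibility check involved in forward merging. This is not the actual
GNU Mach code; the structure and its field names are simplified
stand-ins for the corresponding members of `struct vm_map_entry`:

    #include <stdbool.h>
    #include <stddef.h>

    /* Simplified stand-in for struct vm_map_entry (illustrative only). */
    struct entry {
        unsigned long start;        /* vme_start */
        unsigned long end;          /* vme_end */
        void *object;               /* VM object, or NULL for VM_OBJECT_NULL */
        unsigned long offset;       /* offset of `start` within the object */
        int protection;
        int max_protection;
        int inheritance;
        int wired;
    };

    /* Can two adjacent entries be coalesced into one?  Roughly: the
       address ranges must touch, the administrative attributes must
       match, and the object/offset pairs must describe one contiguous
       range -- with null objects merging freely, as described above. */
    static bool
    can_coalesce(const struct entry *prev, const struct entry *next)
    {
        if (prev->end != next->start)
            return false;
        if (prev->protection != next->protection
            || prev->max_protection != next->max_protection
            || prev->inheritance != next->inheritance
            || prev->wired != next->wired)
            return false;

        if (prev->object == NULL && next->object == NULL)
            return true;
        if (prev->object == NULL)
            /* Null object in front of a non-null one: the next entry
               must have a large enough offset to "mount" the previous
               entry in front of it.  */
            return next->offset >= prev->end - prev->start;
        if (prev->object == next->object)
            /* Same object (anonymous or file-backed): offsets must
               line up so the merged entry stays contiguous.  */
            return prev->offset + (prev->end - prev->start) == next->offset;
        return false;
    }

The real kernel code additionally refuses to merge in the remotely
complex cases mentioned above (copies, shadows, multiply referenced
objects, pageout in progress, and so on).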

`vm_map_coalesce_entry()` is analogous to `vm_map_simplify_entry()` in
other versions of Mach, but different enough to warrant a different
name. The same "coalesce" wording was used as in
`vm_object_coalesce()`, which is appropriate given that the former is
a wrapper for the latter.

### The following clarifies some inaccuracies in the old IRC logs:

    any request, be it e.g. `mmap()`, or `mprotect()`, can easily split
    entries

`mmap ()` cannot split entries to my knowledge, unless we're talking about
`MAP_FIXED` and unmapping parts of the existing mappings.
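
For instance, here is a hypothetical user-space illustration (not taken
from the logs) of how splitting happens: change the protection of the
middle page of a three-page anonymous mapping, and the kernel has to
clip the single entry into three. A `MAP_FIXED` mapping placed over part
of an existing mapping has a similar effect, since it first unmaps that
part.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int
    main(void)
    {
        size_t pg = (size_t) sysconf(_SC_PAGESIZE);

        /* One anonymous mapping: a single VM map entry covering 3 pages. */
        char *p = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Changing the protection of the middle page clips the entry:
           the kernel now needs three entries (RW, R, RW) for the range. */
        if (mprotect(p + pg, pg, PROT_READ) != 0) {
            perror("mprotect");
            return 1;
        }

        printf("inspect the map now, e.g.: vminfo %d\n", (int) getpid());
        pause();   /* keep the task around so the map can be inspected */
        return 0;
    }

After the `mprotect` call one would expect `vminfo` on the process to
show three entries where there used to be one.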

    my ext2fs has ~6500 entries, but I guess this is related to
    mapping blocks from the filesystem, right?

No. Neither libdiskfs nor ext2fs ever map the store contents into memory
(arguably maybe they should); they just read them with `store_read ()`,
and then dispose of the read buffers properly. The excessive number
of VM map entries, as far as I can see, is just heap memory.

    (I'm perplexed about how the kernel can merge two memory objects if
    disctinct port names exist in the tasks' name space -- that's what
    `mem_obj` is, right?)

    if, say, 584 and 585 above are port names which the task expects to be
    able to access and do stuff with, what will happen to them when the
    memory objects are merged?

`mem_obj` in `vminfo` output is the VM object *name* port, not the
pager port (arguably `vminfo` should name it something other than
`mem_obj`). The name port is basically useful for seeing if two VM
regions have the exact same VM object mapped, and not much
else. Previously, it was also possible, as a GNU Mach extension, to
pass the name port into `vm_map ()`, but this was dropped for security
reasons. When Mach is built with `MACH_VM_DEBUG`, a name port can also
be used to query information about a VM object.

Mach can't merge two memory objects. Mach doesn't merge *memory objects*
at all; it only merges/coalesces *VM objects*. The difference is subtle,
but important in certain contexts like this one: a "VM object" refers to
Mach's internal representation (`struct vm_object`), and a "memory object"
refers to the memory manager's implementation. There is normally a
1-to-1 correspondence between the two, but this is not always the case:
internal VM objects start without a memory object (pager) port at all,
and only get one created if/when they're paged out. There can be
multiple VM objects referencing the same backing memory object due to
copying and shadowing.

So what Mach could do is merge the internal VM objects, by altering
page offsets to paste the pages of one of the objects after the pages of
the other. But this is not implemented yet. What Mach actually does is
it avoids creating those internal VM objects and entries in the first
place, instead extending an already existing VM object and entry to
cover the new mapping.

    but at least, if two `vm_objects` are created but reference the same
    externel memory object, the vm should be able to merge them back

That never ever happens. There can only be a single `vm_object` for a
memory object. (In a single instance of Mach, that is -- if multiple
Machs access the same memory object over network-transparent IPC, each
is going to have its own `vm_object` representing the memory object.)

See the `vm_object_enter()` function, which looks up an existing VM
object for a memory object, and creates one if it doesn't exist yet.
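
The pattern is the usual look-up-or-create one. The sketch below is not
the real `vm_object_enter()` (which uses the kernel's pager-to-object
hash table and has to handle locking and termination races); it is only
an illustration, with made-up simplified types, of why a single memory
object can never end up represented by two VM objects within one kernel:

    #include <stddef.h>
    #include <stdlib.h>

    /* Illustrative stand-ins; not the real GNU Mach types. */
    typedef unsigned int memory_object_port_t;

    struct vm_object {
        memory_object_port_t pager;   /* the memory object it caches */
        struct vm_object *hash_next;
    };

    /* Toy stand-in for the kernel's pager-to-VM-object hash table. */
    #define NBUCKETS 64
    static struct vm_object *buckets[NBUCKETS];

    /* Return the VM object already associated with this memory object,
       or create one if none exists yet. */
    struct vm_object *
    vm_object_enter_sketch(memory_object_port_t pager)
    {
        struct vm_object **bucket = &buckets[pager % NBUCKETS];

        for (struct vm_object *obj = *bucket; obj; obj = obj->hash_next)
            if (obj->pager == pager)
                return obj;               /* already known: reuse it */

        struct vm_object *obj = calloc(1, sizeof *obj);
        if (obj == NULL)
            return NULL;
        obj->pager = pager;
        obj->hash_next = *bucket;         /* register it in the table */
        *bucket = obj;
        return obj;
    }

Within a single Mach instance, this lookup is what guarantees the
1-to-1 correspondence described above.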

    ok so if I get it right, the entries shown by `vmstat` are the
    `vm_object`, and the mem_obj listed is a send right to the memory
    object they're referencing ?

    yes

No. The entries shown are VM map entries (`struct vm_map_entry`). There
can be entries that reference no VM object at all (`VM_OBJECT_NULL`), or
multiple entries that reference the same VM object. In fact this is
visible in the `vminfo` example quoted in the IRC log below: the two
entries mapped at `0x1311000` and at `0x1314000` reference the same VM
object, whose name port is 586.

`mem_obj` listed is a send right to the *name* port of the VM object, not
to the memory object. Letting a task get the memory object port would be
disastrous for security (see the "No read-only mappings" vulnerability).

    i'm not sure about the type of the integer showed (port name or simply
    an index)

It is a port name (in vminfo's IPC name space) of the VM object name
port.

    if every `vm_allocate` request implies the creation of a memory object
    from the default pager

Not immediately, no. Only if the memory has to be paged out. Otherwise
an internal VM object is created without a memory object.

    and a `vm_object` is not a capability, but just an internal kernel
    structure used to record the composition of the address space

It is a kernel structure, but it also is a capability in the same way as
a task or a thread is a capability -- it is exposed as a port.
Specifically, a `memory_object_control_t` port is directly converted to a
`struct vm_object` by MIG. This would perhaps be clearer if
`memory_object_control_t` were instead named `vm_object_t`. The VM object
name port is also converted to a VM object, but this is only used in the
`MACH_VM_DEBUG` RPCs.

    i wonder when `vm_map_enter()` gets null objects though :/

Whenever you do `vm_map ()` with `MACH_PORT_NULL` for the object, or
call `vm_allocate ()`, which is a shortcut for the same.
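
For reference, a minimal user-space sketch of the two equivalent calls,
assuming the usual GNU Mach user interfaces as exposed through
`<mach.h>` on the Hurd (error handling omitted for brevity):

    #include <mach.h>
    #include <stdio.h>

    int
    main(void)
    {
        vm_address_t addr1 = 0, addr2 = 0;
        vm_size_t size = 4 * vm_page_size;
        kern_return_t kr;

        /* Anonymous zero-filled memory via vm_allocate(). */
        kr = vm_allocate(mach_task_self(), &addr1, size, TRUE /* anywhere */);
        printf("vm_allocate: kr=%d addr=0x%lx\n", kr, (unsigned long) addr1);

        /* The same thing spelled out via vm_map() with a null memory object. */
        kr = vm_map(mach_task_self(), &addr2, size,
                    0 /* mask */, TRUE /* anywhere */,
                    MACH_PORT_NULL /* memory object */, 0 /* offset */,
                    FALSE /* copy */,
                    VM_PROT_READ | VM_PROT_WRITE, VM_PROT_ALL,
                    VM_INHERIT_DEFAULT);
        printf("vm_map:      kr=%d addr=0x%lx\n", kr, (unsigned long) addr2);

        return 0;
    }

Both calls end up in `vm_map_enter()` with a null VM object, which is
where the entry extension described above can kick in.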

    the default pager backs `vm_objects` providing zero filled memory

If that were the case, there would be no need for a pager; Mach could
just hand out zero-filled pages. Anonymous mappings do start out
zero-filled, that is true. The default pager only gets involved when the
pages have been dirtied (so they are no longer zero-filled) and there is
a memory shortage, so the pages have to be paged out.


# IRC, freenode, #hurd, 2011-07-20

    <braunr> could we add gnumach forward map entry merging as an open issue ?
    <braunr> probably hurting anything using bash extensively, like build most
      build systems
    <braunr> mcsim: this map entry merging problem might interest you
    <braunr> tschwinge: see vm/vm_map.c, line ~905
    <braunr> "See whether we can avoid creating a new entry (and object) by
      extending one of our neighbors.  [So far, we only attempt to extend from
      below.]"
    <braunr> and also vm_object_coalesce
    <braunr> "NOTE:   Only works at the moment if the second object is NULL -
      if it's not, which object do we lock first?"
    <braunr> although map entry merging should be enough
    <braunr> this seems to be the cause for bash having between 400 and 1000+
      map entries
    <braunr> thi makes allocations and faults slow, and forks even more
    <braunr> but again, this should be checked before attempting anything
    <braunr> (for example, this comment still exists in freebsd, although they
      solved the problem, so who knows)
    <antrik> braunr: what exactly would you want to check?
    <antrik> braunr: this rather sounds like something you would just have to
      try...
    <braunr> antrik: that map merging is actually incomplete
    <braunr> and that entries can actually be merged
    <antrik> hm, I see...
    <braunr> (i.e. they are adjacent and have compatible properties
    <braunr> )
    <braunr> antrik: i just want to avoid the "hey, splay trees mak fork slow,
      let's work on it for a month to see it wasn't the problem"
    <antrik> so basically you need a dump of a task's map to check whether
      there are indeed entries that could/should be merged?
    <antrik> hehe :-)
    <braunr> well, vminfo should give that easily, i just didn't take the time
      to check it
    <jkoenig> braunr, as you pointed out, "vminfo $$" seems to indicate that
      merging _is_ incomplete.
    <braunr> this could actually have a noticeable impact on package builds
    <braunr> hm
    <braunr> the number of entries for instances of bash running scripts don't
      exceed 50-55 :/
    <braunr> the issue seems to affect only certain instances (login shells,
      and su -)
    <braunr> jkoenig: i guess dash is just much lighter than bash in many ways
      :)
    <jkoenig> braunr, the number seems to increase with usage (100 here for a
      newly started interactive shell, vs. 150 in an old one)
    <braunr> yes, merging is far from complete in the vm_map code
    <braunr> it only handles null objects (private zeroed memory), and only
      tries to extend a previous entry (this isn't even a true merge)
    <braunr> this works well for the kernel however, which is why there are so
      few as 25 entries
    <braunr> but any request, be it e.g. mmap(), or mprotect(), can easily
      split entries
    <braunr> making their number larger
    <jkoenig> my ext2fs has ~6500 entries, but I guess this is related to
      mapping blocks from the filesystem, right?
    <braunr> i think so
    <braunr> hm not sure actually
    <braunr> i'd say it's fragmentation due to copy on writes when client have
      mapped memory from it
    <braunr> there aren't that many file mappings though :(
    <braunr> jkoenig: this might just be the same problem as in bash
    <braunr>  0x1308000[0x3000] (prot=RW, max_prot=RWX, mem_obj=584)
    <braunr>  0x130b000[0x6000] (prot=RW, max_prot=RWX, mem_obj=585)
    <braunr>  0x1311000[0x3000] (prot=RX, max_prot=RWX, mem_obj=586)
    <braunr>  0x1314000[0x1000] (prot=RW, max_prot=RWX, mem_obj=586)
    <braunr>  0x1315000[0x2000] (prot=RX, max_prot=RWX, mem_obj=587)
    <braunr> the first two could be merged but not the others
    <jkoenig> theoritically, those may correspond to memory objects backed by
      different portions of the disk, right?
    <braunr> jkoenig: this looks very much like the same issue (many private
      mappings not merged)
    <braunr> jkoenig: i'm not sure
    <braunr> jkoenig: normally there is an offset when the object is valid
    <braunr> but vminfo may simply not display it if 0
    * jkoenig goes read about memory object
    <braunr> ok, vminfo can't actually tell if the object is anonymous or
      file-backed memory
    <jkoenig> (I'm perplexed about how the kernel can merge two memory objects
      if disctinct port names exist in the tasks' name space -- that's what
      mem_obj is, right?)
    <braunr> i don't see why
    <braunr> jkoenig: can you be more specific ?
    <jkoenig> braunr, if, say, 584 and 585 above are port names which the task
      expects to be able to access and do stuff with, what will happen to them
      when the memory objects are merged?
    <braunr> good question
    <braunr> but hm
    <braunr> no it's not really a problem
    <braunr> memory objects aren't directly handled by the vm system
    <braunr> vm_object and memory_object are different things
    <braunr> vm_objects can be split and merged
    <braunr> and shadow objects form chains ending on a final vm_object
    <braunr> which references a memory object
    <braunr> hm
    <braunr> jkoenig: ok no solution, they can't be merged :)
    <jkoenig> braunr, I'm confused :-)
    <braunr> jkoenig: but at least, if two vm_objects are created but reference
      the same externel memory object, the vm should be able to merge them back
    <braunr> external*
    <braunr> are created as a result of a split
    <braunr> say, you map a memory object, mprotect part of it (=split), then
      mprotect the reste of it (=merge), it should work
    <braunr> jkoenig: does that clarify things a bit ?
    <jkoenig> ok so if I get it right, the entries shown by vmstat are the
      vm_object, and the mem_obj listed is a send right to the memory object
      they're referencing ?
    <braunr> yes
    <braunr> i'm not sure about the type of the integer showed (port name or
      simply an index)
    <braunr> jkoenig: another possibility explaining the high number of entries
      is how anonymous memory is implemented
    <braunr> if every vm_allocate request implies the creation of a memory
      object from the default pager
    <braunr> the vm has no way to merge them
    <jkoenig> and a vm_object is not a capability, but just an internal kernel
      structure used to record the composition of the address space
    <braunr> jkoenig: not exactly the address space, but close enough
    <braunr> jkoenig: it's a container used to know what's in physical memory
      and what isn't
    <jkoenig> braunr, ok I think I'm starting to get it, thanks.
    <braunr> glad i could help
    <braunr> i wonder when vm_map_enter() gets null objects though :/
    <braunr> "If this port is MEMORY_OBJECT_NULL, then zero-filled memory is
      allocated instead"
    <braunr> which means vm_allocate()
    <jkoenig> braunr, when the task uses vm_allocate(), or maybe vm_map_enter()
      with MEMORY_OBJECT_NULL, there's an opportunity to extend an existing
      object though, is that what you referred to earlier ?
    <braunr> jkoenig: yes, and that's what is done
    <jkoenig> but how does that play out with the default pager? (I'm thinking
      aloud, as always feel free to ignore ;-)
    <braunr> the default pager backs vm_objects providing zero filled memory
    <braunr> hm, guess it wasn't your question
    <braunr> well, swap isn't like a file, pages can be placed dynamically,
      which is why the offset is always 0 for this type of memory
    <jkoenig> hmm I see, apparently a memory object does not have a size
    <braunr> are you sure ?
    <jkoenig> from what I can gather from
      http://www.gnu.org/software/hurd/gnumach-doc/External-Memory-Management.html,
      but I looked very quickly
    <braunr> vm_objects have a size
    <braunr> and each map entry recors the offset within the object where the
      mapping begins
    <braunr> offset and sizes are used by the kernel when querying the memory
      object pager
    <braunr> see memory_object_data_request for example
    <jkoenig> right.
    <braunr> but the default pager has another interface
    <braunr> jkoenig: after some simple tests, i haven't seen a simple case
      where forward merging could be applied :(
    <braunr> which means it's a lot harder than it first looked
    <braunr> hm
    <braunr> actually, there seems to be cases where this can be done
    <braunr> all of them occurring after a first merge was done
    <braunr> (which means a mapping request perfectly fits between two map
      entries)


# IRC, freenode, #hurd, 2011-07-21

    <braunr> tschwinge: you may remove the forward map entry merging issue :/
    <pinotree> what did you discover?
    <braunr> tschwinge: it's actually much more complicated than i thought, and
      needs major changes in the vm, and about the way anonymous memory is
      handled
    <braunr> from what i could see, part of the problem still exists in freebsd
    <braunr> for the same reasons (shadow objects being one of them)

[[mach_shadow_objects]].


# GCC build time using bash vs. dash

<http://gcc.gnu.org/ml/gcc/2011-07/msg00444.html>


# Procedure

  * Analyze.

  * Measure.

  * Fix.

  * Measure again.

  * Have Samuel measure on the buildd.