1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
|
[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
id="license" text="Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
is included in the section entitled [[GNU Free Documentation
License|/fdl]]."]]"""]]
[[!tag open_issue_gnumach]]
There is a [[!FF_project 266]][[!tag bounty]] on this task.
[[!toc]]
# IRC, freenode, #hurd, 2011-04-12
<antrik> braunr: do you think the allocator you wrote for x15 could be used
for gnumach? and would you be willing to mentor this? :-)
<braunr> antrik: to be willing to isn't my current problem
<braunr> antrik: and yes, I think my allocator can be used
<braunr> it's a slab allocator after all, it only requires reap() and
grow()
<braunr> or mmap()/munmap() whatever you want to call it
<braunr> a backend
<braunr> antrik: although i've been having other ideas recently
<braunr> that would have more impact on our usage patterns I think
<antrik> mcsim: have you investigated how the zone allocator works and how
it's hooked into the system yet?
<braunr> mcsim: now let me give you a link
<braunr> mcsim:
http://git.sceen.net/rbraun/libbraunr.git/?a=blob;f=mem.c;h=330436e799f322949bfd9e2fedf0475660309946;hb=HEAD
<braunr> mcsim: this is an implementation of the slab allocator i've been
working on recently
<braunr> mcsim: i haven't made it public because i reworked the per
processor layer, and this part isn't complete yet
<braunr> mcsim: you could use it as a reference for your project
<mcsim> braunr: ok
<braunr> it used to be close to the 2001 vmem paper
<braunr> but after many tests, fragmentation and accounting issues have
been found
<braunr> so i rewrote it to be closer to the linux implementation (cache
filling/draining in bukl transfers)
<braunr> bulk*
<braunr> they actually use the word draining in linux too :)
<mcsim> antrik: not complete yet.
<antrik> braunr: oh, it's unfinished? that's unfortunate...
<braunr> antrik: only the per processor part
<braunr> antrik: so it doesn't matter much for gnumach
<braunr> and it's not difficult to set up
<antrik> mcsim: hm, OK... but do you think you will have a fairly good
understanding in the next couple of days?...
<antrik> I'm asking because I'd really like to see a proposal a bit more
specific than "I'll look into things..."
<antrik> i.e. you should have an idea which things you will actually have
to change to hook up a new allocator etc.
<antrik> braunr: OK. will the interface remain unchanged, so it could be
easily replaced with an improved implementation later?
<braunr> the zone allocator in gnumach is a badly written bare object
allocator actually, there aren't many things to understand about it
<braunr> antrik: yes
<antrik> great :-)
<braunr> and the per processor part should be very close to the phys
allocator sitting next to it
<braunr> (with the slight difference that, as per cpu caches have variable
sizes, they are allocated on the free path rather than on the allocation
path)
<braunr> this is a nice trick in the vmem paper i've kept in mind
<braunr> and the interface also allows to set a "source" for caches
<antrik> ah, good point... do you think we should replace the physmem
allocator too? and if so, do it in one step, or one piece at a time?...
<braunr> no
<braunr> too many drivers currently depend on the physical allocator and
the pmap module as they are
<braunr> remember linux 2.0 drivers need a direct virtual to physical
mapping
<braunr> (especially true for dma mappings)
<antrik> OK
<braunr> the nice thing about having a configurable memory source is that
<antrik> whot do you mean by "allocated on the free path"?
<braunr> even if most caches will use the standard vm_kmem module as their
backend
<braunr> there is one exception in the vm_map module, allowing us to get
rid of either a static limit, or specific allocation code
<braunr> antrik: well, when you allocate a page, the allocator will lookup
one in a per cpu cache
<braunr> if it's empty, it fills the cache
<braunr> (called pools in my implementations)
<braunr> it then retries
<braunr> the problem in the slab allocator is that per cpu caches have
variable sizes
<braunr> so per cpu pools are allocated from their own pools
<braunr> (remember the magazine_xx caches in the output i showed you, this
is the same thing)
<braunr> but if you allocate them at allocation time, you could end up in
an infinite loop
<braunr> so, in the slab allocator, when a per cpu cache is empty, you just
fall back to the slab layer
<braunr> on the free path, when a per cpu cache doesn't exist, you allocate
it from its own cache
<braunr> this way you can't have an infinite loop
<mcsim> antrik: I'll try, but I have exams now.
<mcsim> As I understand amount of elements which could be allocated we
determine by zone initialization. And at this time memory for zone is
reserved. I'm going to change this. And make something similar to kmalloc
and vmalloc (support for pages consecutive physically and virtually). And
pages in zones consecutive always physically.
<mcsim> Am I right?
<braunr> mcsim: don't try to do that
<mcsim> why?
<braunr> mcsim: we just need a slab allocator with an interface close to
the zone allocator
<antrik> mcsim: IIRC the size of the complete zalloc map is fixed; but not
the number of elements per zone
<braunr> we don't need two allocators like kmalloc and vmalloc
<braunr> actually we just need vmalloc
<braunr> IIRC the limits are only present because the original developers
wanted to track leaks
<braunr> they assumed zones would be large enough, which isn't true any
more today
<braunr> but i didn't see any true reservation
<braunr> antrik: i'm not sure i was clear enough about the "allocation of
cpu caches on the free path"
<braunr> antrik: for a better explanation, read the vmem paper ;)
<antrik> braunr: you mean there is no fundamental reason why the zone map
has a limited maximal size; and it was only put in to catch cases where
something eats up all memory with kernel object creation?...
<antrik> braunr: I think I got it now :-)
<braunr> antrik: i'm pretty certin of it yes
<antrik> I don't see though how it is related to what we were talking
about...
<braunr> 10:55 < braunr> and the per processor part should be very close to
the phys allocator sitting next to it
<braunr> the phys allocator doesn't have to use this trick
<braunr> because pages have a fixed size, so per cpu caches all have the
same size too
<braunr> and the number of "caches", that is, physical segments, is limited
and known at compile time
<braunr> so having them statically allocated is possible
<antrik> I see
<braunr> it would actually be very difficult to have a phys allocator
requiring dynamic allocation when the dynamic allocator isn't yet ready
<antrik> hehe :-)
<mcsim> total size of all zone allocations is limited to 12 MB. And is "was
only put in to catch cases where something eats up all memory with kernel
object creation?"
<braunr> mcsim: ah right, there could be a kernel submap backing all the
zones
<braunr> but this can be increased too
<braunr> submaps are kind of evil :/
<antrik> mcsim: I think it's actually 32 MiB or something like that in the
Debian version...
<antrik> braunr: I'm not sure I ever fully understood what the zalloc map
is... I looked through the code once, and I think I got a rough
understading, but I was still pretty uncertain about some bits. and I
don't remember the details anyways :-)
<braunr> antrik: IIRC, it's a kernel submap
<braunr> it's named kmem_map in x15
<antrik> don't know what a submap is
<braunr> submaps are vm_map objects
<braunr> in a top vm_map, there are vm_map_entries
<braunr> these entries usually point to vm_objects
<braunr> (for the page cache)
<braunr> but they can point to other maps too
<braunr> the goal is to reduce fragmentation by isolating allocations
<braunr> this also helps reducing contention
<braunr> for exemple, on BSD, there is a submap for mbufs, so that the
network code doesn't interfere too much with other kernel allocations
<braunr> antrik: they are similar to spans in vmem, but vmem has an elegant
importing mechanism which eliminates the static limit problem
<antrik> so memory is not directly allocated from the physical allocator,
but instead from another map which in turn contains physical memory, or
something like that?...
<braunr> no, this is entirely virtual
<braunr> submaps are almost exclusively used for the kernel_map
<antrik> you are using a lot of identifies here, but I don't remember (or
never knew) what most of them mean :-(
<braunr> sorry :)
<braunr> the kernel map is the vm_map used to represent the ~1 GiB of
virtual memory the kernel has (on i386)
<braunr> vm_map objects are simple virtual space maps
<braunr> they contain what you see in linux when doing /proc/self/maps
<braunr> cat /proc/self/maps
<braunr> (linux uses entirely different names but it's roughly the same
structure)
<braunr> each line is a vm_map_entry
<braunr> (well, there aren't submaps in linux though)
<braunr> the pmap tool on netbsd is able to show the kernel map with its
submaps, but i don't have any image around
<mcsim> braunr: is limit for zones is feature and shouldn't be changed?
<braunr> mcsim: i think we shouldn't have fixed limits for zones
<braunr> mcsim: this should be part of the debugging facilities in the slab
allocator
<braunr> is this fixed limit really a major problem ?
<braunr> i mean, don't focus on that too much, there are other issues
requiring more attention
<antrik> braunr: at 12 MiB, it used to be, causing a lot of zalloc
panics. after increasing, I don't think it's much of a problem anymore...
<antrik> but as memory sizes grow, it might become one again
<antrik> that's the problem with a fixed size...
<braunr> yes, that's the issue with submaps
<braunr> but gnumach is full of those, so let's fix them by order of
priority
<antrik> well, I'm still trying to digest what you wrote about submaps :-)
<braunr> i'm downloading netbsd, so you can have a good view of all this
<antrik> so, when the kernel allocates virtual address space regions
(mostly for itself), instead of grabbing chunks of the address space
directly, it takes parts out of a pre-reserved region?
<braunr> not exactly
<braunr> both statements are true
<mcsim> antrik: only virtual addresses are reserved
<braunr> it grabs chunks of the address space directly, but does so in a
reserved region of the address space
<braunr> a submap is like a normal map, it has a start address, a size, and
is empty, then it's populated with vm_map_entries
<braunr> so instead of allocating from 3-4 GiB, you allocate from, say,
3.1-3.2 GiB
<antrik> yeah, that's more or less what I meant...
<mcsim> braunr: I see two problems: limited zones and absence of caching.
<mcsim> with caching absence of readahead paging will be not so significant
<braunr> please avoid readahead
<mcsim> ok
<braunr> and it's not about paging, it's about kernel memory, which is
wired
<braunr> (well most of it)
<braunr> what about limited zones ?
<braunr> the whole kernel space is limited, there has to be limits
<braunr> the problem is how to handle them
<antrik> braunr: almost all. I looked through all zones once, and IIRC I
found exactly one that actually allows paging...
<braunr> currently, when you reach the limit, you have an OOM error
<braunr> antrik: yes, there are
<braunr> i don't remember which implementation does that but, when
processes haven't been active for a minute or so, they are "swapedout"
<braunr> completely
<braunr> even the kernel stack
<braunr> and the page tables
<braunr> (most of the pmap structures are destroyed, some are retained)
<antrik> that might very well be true... at least inactive processes often
show up with 0 memory use in top on Hurd
<braunr> this is done by having a pageable kernel map, with wired entries
<braunr> when the swapper thread swaps tasks out, it unwires them
<braunr> but i think modern implementations don't do that any more
<antrik> well, I was talking about zalloc only :-)
<braunr> oh
<braunr> so the zalloc_map must be pageable
<braunr> or there are two submaps ?
<antrik> not sure whether "morden implementations" includes Linux ;-)
<braunr> no, i'm talking about the bsd family only
<antrik> but it's certainly true that on Linux even inactive processes
retain some memory
<braunr> linux doesn't make any difference between processor-bound and
I/O-bound processes
<antrik> braunr: I have no idea how it works. I just remember that when
creating zones, one of the optional flags decides whether the zone is
pagable. but as I said, IIRC there is exactly one that actually is...
<braunr> zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max,
zone_map_size, FALSE);
<braunr> kmem_suballoc(parent, min, max, size, pageable)
<braunr> so the zone_map isn't
<antrik> IIRC my conclusion was that pagable zones do not count in the
fixed zone map limit... but I'm not sure anymore
<braunr> zinit() has a memtype parameter
<braunr> with ZONE_PAGEABLE as a possible flag
<braunr> this is wierd :)
<mcsim> There is no any zones which use ZONE_PAGEABLE flag
<antrik> mcsim: are you sure? I think I found one...
<braunr> if (zone->type & ZONE_PAGEABLE) {
<antrik> admittedly, it is several years ago that I looked into this, so my
memory is rather dim...
<braunr> if (kmem_alloc_pageable(zone_map, &addr, ...
<braunr> calling kmem_alloc_pageable() on an unpageable submap seems wrong
<mcsim> I've greped gnumach code and there is no any zinit procedure call
with ZONE_PAGEABLE flag
<braunr> good
<antrik> hm... perhaps it was in some code that has been removed
alltogether since ;-)
<antrik> actually I think it would be pretty neat to have pageable kernel
objects... but I guess it would require considerable effort to implement
this right
<braunr> mcsim: you also mentioned absence of caching
<braunr> mcsim: the zone allocator actually is a bare caching object
allocator
<braunr> antrik: no, it's easy
<braunr> antrik: i already had that in x15 0.1
<braunr> antrik: the problem is being sure the objects you allocate from a
pageable backing store are never used when resolving a page fault
<braunr> that's all
<antrik> I wouldn't expect that to be easy... but surely you know better
:-)
<mcsim> braunr: indeed. I was wrong.
<antrik> braunr: what is a caching object allocator?...
<braunr> antrik: ok, it's not easy
<braunr> antrik: but once you have vm_objects implemented, having pageable
kernel object is just a matter of using the right options, really
<braunr> antrik: an allocator that caches its buffers
<braunr> some years ago, the term "object" would also apply to
preconstructed buffers
<antrik> I have no idea what you mean by "caches its buffers" here :-)
<braunr> well, a memory allocator which doesn't immediately free its
buffers caches them
<mcsim> braunr: but can it return objects to system?
<braunr> mcsim: which one ?
<antrik> yeah, obviously the *implementation* of pageable kernel objects is
not hard. the tricky part is deciding which objects can be pageable, and
which need to be wired...
<mcsim> Can zone allocator return cached objects to system as in slab?
<mcsim> I mean reap()
<braunr> well yes, it does so, and it does that too often
<braunr> the caching in the zone allocator is actually limited to the
pagesize
<braunr> once page is completely free, it is returned to the vm
<mcsim> this is bad caching
<braunr> yes
<mcsim> if object takes all page than there is now caching at all
<braunr> caching by side effect
<braunr> true
<braunr> but the linux slab allocator does the same thing :p
<braunr> hm
<braunr> no, the solaris slab allocator does so
<mcsim> linux's slab returns objects only when system ask
<antrik> without preconstructed objects, is there actually any point in
caching empty slabs?...
<mcsim> Once I've changed my allocator to slab and it cached more than 1GB
of my memory)
<braunr> ok wait, need to fix a few mistakes first
<mcsim> s/ask/asks
<braunr> the zone allocator (in gnumach) actually has a garbage collector
<antrik> braunr: well, the Solaris allocator follows the slab/magazine
paper, right? so there is caching at the magazine layer... in that case
caching empty slabs too would be rather redundant I'd say...
<braunr> which is called when running low on memory, similar to the slab
allocaotr
<braunr> antrik: yes
<antrik> (or rather the paper follows the Solaris allocator ;-) )
<braunr> mcsim: the zone allocator reap() is zone_gc()
<antrik> braunr: hm, right, there is a "collectable" flag for zones... but
I never understood what it means
<antrik> braunr: BTW, I heard Linux has yet another allocator now called
"slob"... do you happen to know what that is?
<braunr> slob is a very simple allocator for embedded devices
<mcsim> AFAIR this is just heap allocator
<braunr> useful when you have a very low amount of memory
<braunr> like 1 MiB
<braunr> yes
<antrik> just googled it :-)
<braunr> zone and slab are very similar
<antrik> sounds like a simple heap allocator
<mcsim> there is another allocator that calls slub, and it better than slab
in many cases
<braunr> the main difference is the data structures used to store slabs
<braunr> mcsim: i disagree
<antrik> mcsim: ah, you already said that :-)
<braunr> mcsim: slub is better for systems with very large amounts of
memory and processors
<braunr> otherwise, slab is better
<braunr> in addition, there are accounting issues with slub
<braunr> because of cache merging
<mcsim> ok. This strange that slub is default allocator
<braunr> well both are very good
<braunr> iirc, linus stated that he really doesn't care as long as its
works fine
<braunr> he refused slqb because of that
<braunr> slub is nice because it requires less memory than slab, while
still being as fast for most cases
<braunr> it gets slower on the free path, when the cpu performing the free
is different from the one which allocated the object
<braunr> that's a reasonable cost
<mcsim> slub uses heap for large object. Are there any tests that compare
what is better for large objects?
<antrik> well, if slub requires less memory, why do you think slab is
better for smaller systems? :-)
<braunr> antrik: smaller is relative
<antrik> mcsim: for large objects slab allocation is rather pointless, as
you don't have multiple objects in a page anyways...
<braunr> antrik: when lameter wrote slub, it was intended for systems with
several hundreds processors
<antrik> BTW, was slqb really refused only because the other ones are "good
enough"?...
<braunr> yes
<antrik> wow, that's a strange argument...
<braunr> linus is already unhappy of having "so many" allocators
<antrik> well, if the new one is better, it could replace one of the others
:-)
<antrik> or is it useful only in certain cases?
<braunr> that's the problem
<braunr> nobody really knows
<antrik> hm, OK... I guess that should be tested *before* merging ;-)
<antrik> is anyone still working on it, or was it abandonned?
<antrik> mcsim: back to caching...
<antrik> what does caching in the kernel object allocator got to do with
readahead (i.e. clustered paging)?...
<mcsim> if we cached some physical pages we don't need to find new ones for
allocating new object. And that's why there will not be a page fault.
<mcsim> antrik: Regarding kam. Hasn't he finished his project?
<antrik> err... what?
<antrik> one of us must be seriously confused
<antrik> I totally fail to see what caching of physical pages (which isn't
even really a correct description of what slab does) has to do with page
faults
<antrik> right, KAM didn't finish his project
<mcsim> If we free the physical page and return it to system we need
another one for next allocation. But if we keep it, we don't need to find
new physical page.
<mcsim> And physical page is allocated only then when page fault
occurs. Probably, I'm wrong
<antrik> what does "return to system" mean? we are talking about the
kernel...
<antrik> zalloc/slab are about allocating kernel objects. this doesn't have
*anything* to do with paging of userspace processes
<antrik> only thing the have in common is that they need to get pages from
the physical page allocator. but that's yet another topic
<mcsim> Under "return to system" I mean ability to use this page for other
needs.
<braunr> mcsim: consider kernel memory to be wired
<braunr> here, return to system means releasing a page back to the vm
system
<braunr> the vm_kmem module then unmaps the physical page and free its
virtual address in the kernel map
<mcsim> ok
<braunr> antrik: the problem with new allocators like slqb is that it's
very difficult to really know if they're better, even with extensive
testing
<braunr> antrik: there are papers (like wilson95) about the difficulties in
making valuable results in this field
<braunr> see
http://www.sceen.net/~rbraun/dynamic_storage_allocation_a_survey_and_critical_review.pdf
<mcsim> how can be allocated physically continuous object now?
<braunr> mcsim: rephrase please
<mcsim> what is similar to kmalloc in Linux to gnumach?
<braunr> i know memory is reserved for dma in a direct virtual to physical
mapping
<braunr> so even if the allocation is done similarly to vmalloc()
<braunr> the selected region of virtual space maps physical memory, so
memory is physically contiguous too
<braunr> for other allocation types, a block large enough is allocated, so
it's contiguous too
<mcsim> I don't clearly understand. If we have fragmentation in physical
ram, so there aren't 2 free pages in a row, but there are able apart, we
can't to allocate these 2 pages along?
<braunr> no
<braunr> but every system has this problem
<mcsim> But since we have only 12 or 32 MB of memory the problem becomes
more significant
<braunr> you're confusing virtual and physical memory
<braunr> those 32 MiB are virtual
<braunr> the physical pages backing them don't have to be contiguous
<mcsim> Oh, indeed
<mcsim> So the only problem are limits?
<braunr> and performance
<braunr> and correctness
<braunr> i find the zone allocator badly written
<braunr> antrik: mcsim: here is the content of the kernel pmap on NetBSD
(which uses a virtual memory system close to the Mach VM)
<braunr> antrik: mcsim: http://www.sceen.net/~rbraun/pmap.out
[[pmap.out]]
<braunr> you can see the kmem_map (which is used for most general kernel
allocations) is 128 MiB large
<braunr> actually it's not the kernel pmap, it's the kernel_map
<antrik> braunr: why is it called pmap.out then? ;-)
<braunr> antrik: because the tool is named pmap
<braunr> for process map
<braunr> it also exists under Linux, although direct access to
/proc/xx/maps gives more info
<mcsim> braunr: I've said that this is kernel_map. Can I see kernel_map for
Linux?
<braunr> mcsim: I don't know how to do that
<mcsim> s/I've/You've
<braunr> but Linux doesn't have submaps, and uses a direct virtual to
physical mapping, so it's used differently
<antrik> how are things (such as zalloc zones) entered into kernel_map?
<braunr> in zone_init() you have
<braunr> zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max,
zone_map_size, FALSE);
<braunr> so here, kmem_map is named zone_map
<braunr> then, in zalloc()
<braunr> kmem_alloc_wired(zone_map, &addr, zone->alloc_size)
<antrik> so, kmem_alloc just deals out chunks of memory referenced directly
by the address, and without knowing anything about the use?
<braunr> kmem_alloc() gives virtual pages
<braunr> zalloc() carves them into buffers, as in the slab allocator
<braunr> the difference is essentially the lack of formal "slab" object
<braunr> which makes the zone code look like a mess
<antrik> so kmem_suballoc() essentially just takes a bunch of pages from
the main kernel_map, and uses these to back another map which then in
turn deals out pages just like the main kernel_map?
<braunr> no
<braunr> kmem_suballoc creates a vm_map_entry object, and sets its start
and end address
<braunr> and creates a vm_map object, which is then inserted in the new
entry
<braunr> maybe that's what you meant with "essentially just takes a bunch
of pages from the main kernel_map"
<braunr> but there really is no allocation at this point
<braunr> except the map entry and the new map objects
<antrik> well, I'm trying to understand how kmem_alloc() manages things. so
it has map_entry structures like the maps of userspace processes? do
these also reference actual memory objects?
<braunr> kmem_alloc just allocates virtual pages from a vm_map, and backs
those with physical pages (unless the user requested pageable memory)
<braunr> it's not "like the maps of userspace processes"
<braunr> these are actually the same structures
<braunr> a vm_map_entry can reference a memory object or a kernel submap
<braunr> in netbsd, it can also referernce nothing (for pure wired kernel
memory like the vm_page array)
<braunr> maybe it's the same in mach, i don't remember exactly
<braunr> antrik: this is actually very clear in vm/vm_kern.c
<braunr> kmem_alloc() creates a new kernel object for the allocation
<braunr> allocates a new entry (or uses a previous existing one if it can
be extended) through vm_map_find_entry()
<braunr> then calls kmem_alloc_pages() to back it with wired memory
<antrik> "creates a new kernel object" -- what kind of kernel object?
<braunr> kmem_alloc_wired() does roughly the same thing, except it doesn't
need a new kernel object because it knows the new area won't be pageable
<braunr> a simple vm_object
<braunr> used as a container for anonymous memory in case the pages are
swapped out
<antrik> vm_object is the same as memory object/pager? or yet something
different?
<braunr> antrik: almost
<braunr> antrik: a memory_object is the user view of a vm_object
<braunr> as in the kernel/user interfaces used by external pagers
<braunr> vm_object is a more internal name
<mcsim> Is fragmentation a big problem in slab allocator?
<mcsim> I've tested it on my computer in Linux and for some caches it
reached 30-40%
<antrik> well, fragmentation is a major problem for any allocator...
<antrik> the original slab allocator was design specifically with the goal
of reducing fragmentation
<antrik> the revised version with the addition of magazines takes a step
back on this though
<antrik> have you compared it to slub? would be pretty interesting...
<mcsim> I have an idea how can it be decreased, but it will hurt by
performance...
<mcsim> antrik: no I haven't, but there will be might the same, I think
<mcsim> if each cache will handle two types of object: with sizes that will
fit cache sizes (or I bit smaller) and with sizes which are much smaller
than maximal cache size. For first type of object will be used standard
slab allocator and for latter type will be used (within page) heap
allocator.
<mcsim> I think that than fragmentation will be decreased
<antrik> not at all. heap allocator has much worse fragmentation. that's
why slab allocator was invented
<antrik> the problem is that in a long-running program (such an the
kernel), objects tend to have vastly varying lifespans
<mcsim> but we use heap only for objects of specified sizes
<antrik> so often a few old objects will keep a whole page hostage
<mcsim> for example for 32 byte cache it could be 20-28 byte objects
<antrik> that's particularily visible in programs such as firefox, which
will grow the heap during use even though actual needs don't change
<antrik> the slab allocator groups objects in a fashion that makes it more
likely adjacent objects will be freed at similar times
<antrik> well, that's pretty oversimplyfied, but I hope you get the
idea... it's about locality
<mcsim> I agree, but I speak not about general heap allocation. We have
many heaps for objects with different sizes.
<mcsim> Could it be better?
<antrik> note that this has been a topic of considerable research. you
shouldn't seek to improve the actual algorithms -- you would have to read
up on the existing research at least before you can contribute anything
to the field :-)
<antrik> how would that be different from the slab allocator?
<mcsim> slab will allocate 32 byte for both 20 and 32 byte requests
<mcsim> And if there was request for 20 bytes we get 12 unused
<antrik> oh, you mean the implementation of the generic allocator on top of
slabs? well, that might not be optimal... but it's not an often used case
anyways. mostly the kernel uses constant-sized objects, which get their
own caches with custom tailored size
<antrik> I don't think the waste here matters at all
<mcsim> affirmative. So my idea is useless.
<antrik> does the statistic you refer to show the fragmentation in absolute
sizes too?
<mcsim> Can you explain what is absolute size?
<mcsim> I've counted what were requested (as parameter of kmalloc) and what
was really allocated (according to best fit cache size).
<antrik> how did you get that information?
<mcsim> I simply wrote a hook
<antrik> I mean total. i.e. how many KiB or MiB are wasted due to
fragmentation alltogether
<antrik> ah, interesting. how does it work?
<antrik> BTW, did you read the slab papers?
<mcsim> Do you mean articles from lwn.net?
<antrik> no
<antrik> I mean the papers from the Sun hackers who invented the slab
allocator(s)
<antrik> Bonwick mostly IIRC
<mcsim> Yes
<antrik> hm... then you really should know the rationale behind it...
<mcsim> There he says about 11% percent of memory waste
<antrik> you didn't answer my other questions BTW :-)
<mcsim> I've corrupted kernel tree with patch, and tomorrow I'm going to
read myself up for exam (I have it on Thursday). But than I'll send you a
module which I've used for testing.
<antrik> OK
<mcsim> I can send you module now, but it will not work without patch.
<mcsim> It would be better to rewrite it using debugfs, but when I was
writing this test I didn't know about trace_* macros
# IRC, freenode, #hurd, 2011-04-15
<mcsim> There is a hack in zone_gc when it allocates and frees two
vm_map_kentry_zone elements to make sure the gc will be able to allocate
two in vm_map_delete. Isn't it better to allocate memory for these
entries statically?
<youpi> mcsim: that's not the point of the hack
<youpi> mcsim: the point of the hack is to make sure vm_map_delete will be
able to allocate stuff
<youpi> allocating them statically will just work once
<youpi> it may happen several times that vm_map_delete needs to allocate it
while it's empty (and thus zget_space has to get called, leading to a
hang)
<youpi> funnily enough, the bug is also in macos X
<youpi> it's still in my TODO list to manage to find how to submit the
issue to them
<braunr> really ?
<braunr> eh
<braunr> is that because of map entry splitting ?
<youpi> it's git commit efc3d9c47cd744c316a8521c9a29fa274b507d26
<youpi> braunr: iirc something like this, yes
<braunr> netbsd has this issue too
<youpi> possibly
<braunr> i think it's a fundamental problem with the design
<braunr> people think of munmap() as something similar to free()
<braunr> whereas it's really unmap
<braunr> with a BSD-like VM, unmap can easily end up splitting one entry in
two
<braunr> but your issue is more about harmful recursion right ?
<youpi> I don't remember actually
<youpi> it's quite some time ago :)
<braunr> ok
<braunr> i think that's why i have "sources" in my slab allocator, the
default source (vm_kern) and a custom one for kernel map entries
# IRC, freenode, #hurd, 2011-04-18
<mcsim> braunr: you've said that once page is completely free, it is
returned to the vm.
<mcsim> who else, besides zone_gc, can return free pages to the vm?
<braunr> mcsim: i also said i was wrong about that
<braunr> zone_gc is the only one
# IRC, freenode, #hurd, 2011-04-19
<braunr> antrik: mcsim: i added back a new per-cpu layer as planned
<braunr>
http://git.sceen.net/rbraun/libbraunr.git/?a=blob;f=mem.c;h=c629b2b9b149f118a30f0129bd8b7526b0302c22;hb=HEAD
<braunr> mcsim: btw, in mem_cache_reap(), you can clearly see there are two
loops, just as in zone_gc, to reduce contention and avoid deadlocks
<braunr> this is really common in memory allocators
# IRC, freenode, #hurd, 2011-04-23
<mcsim> I've looked through some allocators and all of them use different
per cpu cache policy. AFAIK gnuhurd doesn't support multiprocessing, but
still multiprocessing must be kept in mind. So, what do you think what
kind of cpu caches is better? As for me I like variant with only per-cpu
caches (like in slqb).
<antrik> mcsim: well, have you looked at the allocator braunr wrote
himself? :-)
<antrik> I'm not sure I suggested that explicitly to you; but probably it
makes most sense to use that in gnumach
# IRC, freenode, #hurd, 2011-04-24
<mcsim> antrik: Yes, I have. He uses both global and per cpu caches. But he
also suggested to look through slqb, where there are only per cpu
caches.\
<braunr> i don't remember slqb in detail
<braunr> what do you mean by "only per-cpu caches" ?
<braunr> a whole slab sytem for each cpu ?
<mcsim> I mean that there are no global queues in caches, but there are
special queues for each cpu.
<mcsim> I've just started investigating slqb's code, but I've read an
article on lwn about it. And I've read that it is used for zen kernel.
<braunr> zen ?
<mcsim> Here is this article http://lwn.net/Articles/311502/
<mcsim> Yes, this is linux kernel with some patches which haven't been
approved to torvald's tree
<mcsim> http://zen-kernel.org/
<braunr> i see
<braunr> well it looks nice
<braunr> but as for slub, the problem i can see is cross-CPU freeing
<braunr> and I think nick piggins mentions it
<braunr> piggin*
<braunr> this means that sometimes, objects are "burst-free" from one cpu
cache to another
<braunr> which has the same bad effects as in most other allocators, mainly
fragmentation
<mcsim> There is a special list for freeing object allocated for another
CPU
<mcsim> And garbage collector frees such object on his own
<braunr> so what's your question ?
<mcsim> It is described in the end of article.
<mcsim> What cpu-cache policy do you think is better to implement?
<braunr> at this point, any
<braunr> and even if we had a kernel that perfectly supports
multiprocessor, I wouldn't care much now
<braunr> it's very hard to evaluate such allocators
<braunr> slqb looks nice, but if you have the same amount of fragmentation
per slab as other allocators do (which is likely), you have tat amount of
fragmentation multiplied by the number of processors
<braunr> whereas having shared queues limit the problem somehow
<braunr> having shared queues mean you have a bit more contention
<braunr> so, as is the case most of the time, it's a tradeoff
<braunr> by the way, does pigging say why he "doesn't like" slub ? :)
<braunr> piggin*
<mcsim> http://lwn.net/Articles/311093/
<mcsim> here he describes what slqb is better.
<braunr> well it doesn't describe why slub is worse
<mcsim> but not very particularly
<braunr> except for order-0 allocations
<braunr> and that's a form of fragmentation like i mentioned above
<braunr> in mach those problems have very different impacts
<braunr> the backend memory isn't physical, it's the kernel virtual space
<braunr> so the kernel allocator can request chunks of higher than order-0
pages
<braunr> physical pages are allocated one at a time, then mapped in the
kernel space
<mcsim> Doesn't order of page depend on buffer size?
<braunr> it does
<mcsim> And why does gnumach allocates higher than order-0 pages more?
<braunr> why more ?
<braunr> i didn't say more
<mcsim> And why in mach those problems have very different impact?
<braunr> ?
<braunr> i've just explained why :)
<braunr> 09:37 < braunr> physical pages are allocated one at a time, then
mapped in the kernel space
<braunr> "one at a time" means order-0 pages, even if you allocate higher
than order-0 chunks
<mcsim> And in Linux they allocated more than one at time because of
prefetching page reading?
<braunr> do you understand what virtual memory is ?
<braunr> linux allocators allocate "physical memory"
<braunr> mach kernel allocator allocates "virtual memory"
<braunr> so even if you allocate a big chunk of virtual memory, it's backed
by order-0 physical pages
<mcsim> yes, I understand this
<braunr> you don't seem to :/
<braunr> the problem of higher than order-0 page allocations is
fragmentation
<braunr> do you see why ?
<mcsim> yes
<braunr> so
<braunr> fragmentation in the kernel space is less likely to create issues
than it does in physical memory
<braunr> keep in mind physical memory is almost always full because of the
page cache
<braunr> and constantly under some pressure
<braunr> whereas the kernel space is mostly empty
<braunr> so allocating higher then order-0 pages in linux is more dangerous
than it is in Mach or BSD
<mcsim> ok
<braunr> on the other hand, linux focuses pure performance, and not having
to map memory means less operations, less tlb misses, quicker allocations
<braunr> the Mach VM must map pages "one at a time", which can be expensive
<braunr> it should be adapted to handle multiple page sizes (e.g. 2 MiB) so
that many allocations can be made with few mappings
<braunr> but that's not easy
<braunr> as always: tradeoffs
<mcsim> There are other benefits of physical allocating. In big DMA
transfers can be needed few continuous physical pages. How does mach
handles such cases?
<braunr> gnumach does that awfully
<braunr> it just reserves the whole DMA-able memory and uses special
allocation functions on it, IIRC
<braunr> but kernels which have a MAch VM like memory sytem such as BSDs
have cleaner methods
<braunr> NetBSD provides a function to allocate contiguous physical memory
<braunr> with many constraints
<braunr> FreeBSD uses a binary buddy system like Linux
<braunr> the fact that the kernel allocator uses virtual memory doesn't
mean the kernel has no mean to allocate contiguous physical memory ...
# IRC, freenode, #hurd, 2011-05-02
<braunr> hm nice, my allocator uses less memory than glibc (squeeze
version) on both 32 and 64 bits systems
<braunr> the new per-cpu layer is proving effective
<neal> braunr: Are you reimplementation malloc?
<braunr> no
<braunr> it's still the slab allocator for mach, but tested in userspace
<braunr> so i wrote malloc wrappers
<neal> Oh.
<braunr> i try to heavily test most of my code in userspace now
<neal> it's easier :-)
<neal> I agree
<braunr> even the physical memory allocator has been implemented this way
<neal> is this your mach version?
<braunr> virtual memory allocation will follow
<neal> or are you working on gnu mach?
<braunr> for now it's my version
<braunr> but i intend to spend the summer working on ipc port names
management
[[rework_gnumach_IPC_spaces]].
<braunr> and integrate the result in gnu mach
<neal> are you keeping the same user-space API?
<neal> Or are you experimenting with something new?
<antrik> braunr: to be fair, it's not terribly hard to use less memory than
glibc :-)
<braunr> yes
<braunr> antrik: well ptmalloc3 received some nice improvements
<braunr> neal: the goal is to rework some of the internals only
<braunr> neal: namely, i simply intend to replace the splay tree with a
radix tree
<antrik> braunr: the glibc allocator is emphasising performace, unlike some
other allocators that trade some performance for much better memory
utilisation...
<antrik> ptmalloc3?
<braunr> that's the allocator used in glibc
<braunr> http://www.malloc.de/en/
<antrik> OK. haven't seen any recent numbers... the comparision I have in
mind is many years old...
<braunr> i also made some additions to my avl and red-black trees this week
end, which finally make them suitable for almost all generic uses
<braunr> the red-black tree could be used in e.g. gnu mach to augment the
linked list used in vm maps
<braunr> which is what's done in most modern systems
<braunr> it could also be used to drop the overloaded (and probably over
imbalanced) page cache hash table
# IRC, freenode, #hurd, 2011-05-03
<mcsim> antrik: How should I start porting? Have I just include rbraun's
allocator to gnumach and make it compile?
<antrik> mcsim: well, basically yes I guess... but you will have to look at
the code in question first before we know anything more specific :-)
<antrik> I guess braunr might know better how to start, but he doesn't
appear to be here :-(
<braunr> mcsim: you can't juste put my code into gnu mach and make it run,
it really requires a few careful changes
<braunr> mcsim: you will have to analyse how the current zone allocator
interacts with regard to locking
<braunr> if it is used in interrupt handlers
<braunr> what kind of locks it should use instead of the pthread stuff
available in userspace
<braunr> you will have to change the reclamiing policy, so that caches are
reaped on demand
<braunr> (this basically boils down to calling the new reclaiming function
instead of zone_gc())
<braunr> you must be careful about types too
<braunr> there is work to be done ;)
<braunr> (not to mention the obvious about replacing all the calls to the
zone allocator, and testing/debugging afterwards)
# IRC, freenode, #hurd, 2011-07-14
<braunr> can you make your patch available ?
<mcsim> it is available in gnumach repository at savannah
<mcsim> tree mplaneta/libbraunr/master
<braunr> mcsim: i'll test your branch
<mcsim> ok. I'll give you a link in a minute
<braunr> hm why balloc ?
<mcsim> Braun's allocator
<braunr> err
<braunr>
http://git.sceen.net/rbraun/x15mach.git/?a=blob;f=kern/kmem.c;h=37173fa0b48fc9d7e177bf93de531819210159ab;hb=HEAD
<braunr> mcsim: this is the interface i had in mind for a kernel version :)
<braunr> very similar to the original slab allocator interface actually
<braunr> well, you've been working
<mcsim> But I have a problem with this patch. When I apply it to gnumach
code from debian repository. I have to make a change in file ramdisk.c
with sed -i 's/kernel_map/\&kernel_map/' device/ramdisk.c
<mcsim> because in git repository there is no such file
<braunr> mcsim: how do you configure the kernel before building ?
<braunr> mcsim: you should keep in touch more often i think, so that you
get feedback from us and don't spend too much time "off course"
<mcsim> I didn't configure it. I just run dpkg-buildsource -b.
<braunr> oh you build the debian package
<braunr> well my version was by configure --enable-kdb --enable-rtl8139
<braunr> and it seems stuck in an infinite loop during bootstrap
<mcsim> and printf doesn't work. The first function called by c_boot_entry
is printf(version).
<braunr> mcsim: also, you're invited to get the x15mach version of my
files, which are gplv2+ licensed
<braunr> be careful of my macros.h file, it can conflict with the
macros_help.h file from gnumach iirc
<mcsim> There were conflicts with MACRO_BEGIN and MACRO_END. But I solved
it
<braunr> ok
<braunr> it's tricky
<braunr> mcsim: try to find where the first use of the allocator is made
|