summaryrefslogtreecommitdiff
path: root/open_issues/gnumach_memory_management.mdwn
blob: 1fe2f9be96eb91afb9e9b0d23f9e664ff522e8b0 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
id="license" text="Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
is included in the section entitled [[GNU Free Documentation
License|/fdl]]."]]"""]]

[[!tag open_issue_gnumach]]

There is a [[!FF_project 266]][[!tag bounty]] on this task.

[[!toc]]


# IRC, freenode, #hurd, 2011-04-12

    <antrik> braunr: do you think the allocator you wrote for x15 could be used
      for gnumach? and would you be willing to mentor this? :-)
    <braunr> antrik: to be willing to isn't my current problem
    <braunr> antrik: and yes, I think my allocator can be used
    <braunr> it's a slab allocator after all, it only requires reap() and
      grow()
    <braunr> or mmap()/munmap() whatever you want to call it
    <braunr> a backend
    <braunr> antrik: although i've been having other ideas recently
    <braunr> that would have more impact on our usage patterns I think
    <antrik> mcsim: have you investigated how the zone allocator works and how
      it's hooked into the system yet?
    <braunr> mcsim: now let me give you a link
    <braunr> mcsim:
      http://git.sceen.net/rbraun/libbraunr.git/?a=blob;f=mem.c;h=330436e799f322949bfd9e2fedf0475660309946;hb=HEAD
    <braunr> mcsim: this is an implementation of the slab allocator i've been
      working on recently
    <braunr> mcsim: i haven't made it public because i reworked the per
      processor layer, and this part isn't complete yet
    <braunr> mcsim: you could use it as a reference for your project
    <mcsim> braunr: ok
    <braunr> it used to be close to the 2001 vmem paper
    <braunr> but after many tests, fragmentation and accounting issues have
      been found
    <braunr> so i rewrote it to be closer to the linux implementation (cache
      filling/draining in bukl transfers)
    <braunr> bulk*
    <braunr> they actually use the word draining in linux too :)
    <mcsim> antrik: not complete yet.
    <antrik> braunr: oh, it's unfinished? that's unfortunate...
    <braunr> antrik: only the per processor part
    <braunr> antrik: so it doesn't matter much for gnumach
    <braunr> and it's not difficult to set up
    <antrik> mcsim: hm, OK... but do you think you will have a fairly good
      understanding in the next couple of days?...
    <antrik> I'm asking because I'd really like to see a proposal a bit more
      specific than "I'll look into things..."
    <antrik> i.e. you should have an idea which things you will actually have
      to change to hook up a new allocator etc.
    <antrik> braunr: OK. will the interface remain unchanged, so it could be
      easily replaced with an improved implementation later?
    <braunr> the zone allocator in gnumach is a badly written bare object
      allocator actually, there aren't many things to understand about it
    <braunr> antrik: yes
    <antrik> great :-)
    <braunr> and the per processor part should be very close to the phys
      allocator sitting next to it
    <braunr> (with the slight difference that, as per cpu caches have variable
      sizes, they are allocated on the free path rather than on the allocation
      path)
    <braunr> this is a nice trick in the vmem paper i've kept in mind
    <braunr> and the interface also allows to set a "source" for caches
    <antrik> ah, good point... do you think we should replace the physmem
      allocator too? and if so, do it in one step, or one piece at a time?...
    <braunr> no
    <braunr> too many drivers currently depend on the physical allocator and
      the pmap module as they are
    <braunr> remember linux 2.0 drivers need a direct virtual to physical
      mapping
    <braunr> (especially true for dma mappings)
    <antrik> OK
    <braunr> the nice thing about having a configurable memory source is that
    <antrik> whot do you mean by "allocated on the free path"?
    <braunr> even if most caches will use the standard vm_kmem module as their
      backend
    <braunr> there is one exception in the vm_map module, allowing us to get
      rid of either a static limit, or specific allocation code
    <braunr> antrik: well, when you allocate a page, the allocator will lookup
      one in a per cpu cache
    <braunr> if it's empty, it fills the cache
    <braunr> (called pools in my implementations)
    <braunr> it then retries
    <braunr> the problem in the slab allocator is that per cpu caches have
      variable sizes
    <braunr> so per cpu pools are allocated from their own pools
    <braunr> (remember the magazine_xx caches in the output i showed you, this
      is the same thing)
    <braunr> but if you allocate them at allocation time, you could end up in
      an infinite loop
    <braunr> so, in the slab allocator, when a per cpu cache is empty, you just
      fall back to the slab layer
    <braunr> on the free path, when a per cpu cache doesn't exist, you allocate
      it from its own cache
    <braunr> this way you can't have an infinite loop
    <mcsim> antrik: I'll try, but I have exams now.
    <mcsim> As I understand amount of elements which could be allocated we
      determine by zone initialization. And at this time memory for zone is
      reserved. I'm going to change this. And make something similar to kmalloc
      and vmalloc (support for pages consecutive physically and virtually). And
      pages in zones consecutive always physically.
    <mcsim> Am I right?
    <braunr> mcsim: don't try to do that
    <mcsim> why?
    <braunr> mcsim: we just need a slab allocator with an interface close to
      the zone allocator
    <antrik> mcsim: IIRC the size of the complete zalloc map is fixed; but not
      the number of elements per zone
    <braunr> we don't need two allocators like kmalloc and vmalloc
    <braunr> actually we just need vmalloc
    <braunr> IIRC the limits are only present because the original developers
      wanted to track leaks
    <braunr> they assumed zones would be large enough, which isn't true any
      more today
    <braunr> but i didn't see any true reservation
    <braunr> antrik: i'm not sure i was clear enough about the "allocation of
      cpu caches on the free path"
    <braunr> antrik: for a better explanation, read the vmem paper ;)
    <antrik> braunr: you mean there is no fundamental reason why the zone map
      has a limited maximal size; and it was only put in to catch cases where
      something eats up all memory with kernel object creation?...
    <antrik> braunr: I think I got it now :-)
    <braunr> antrik: i'm pretty certin of it yes
    <antrik> I don't see though how it is related to what we were talking
      about...
    <braunr> 10:55 < braunr> and the per processor part should be very close to
      the phys allocator sitting next to it
    <braunr> the phys allocator doesn't have to use this trick
    <braunr> because pages have a fixed size, so per cpu caches all have the
      same size too
    <braunr> and the number of "caches", that is, physical segments, is limited
      and known at compile time
    <braunr> so having them statically allocated is possible
    <antrik> I see
    <braunr> it would actually be very difficult to have a phys allocator
      requiring dynamic allocation when the dynamic allocator isn't yet ready
    <antrik> hehe :-)
    <mcsim> total size of all zone allocations is limited to 12 MB. And is "was
      only put in to catch cases where something eats up all memory with kernel
      object creation?"
    <braunr> mcsim: ah right, there could be a kernel submap backing all the
      zones
    <braunr> but this can be increased too
    <braunr> submaps are kind of evil :/
    <antrik> mcsim: I think it's actually 32 MiB or something like that in the
      Debian version...
    <antrik> braunr: I'm not sure I ever fully understood what the zalloc map
      is... I looked through the code once, and I think I got a rough
      understading, but I was still pretty uncertain about some bits. and I
      don't remember the details anyways :-)
    <braunr> antrik: IIRC, it's a kernel submap
    <braunr> it's named kmem_map in x15
    <antrik> don't know what a submap is
    <braunr> submaps are vm_map objects
    <braunr> in a top vm_map, there are vm_map_entries
    <braunr> these entries usually point to vm_objects
    <braunr> (for the page cache)
    <braunr> but they can point to other maps too
    <braunr> the goal is to reduce fragmentation by isolating allocations
    <braunr> this also helps reducing contention
    <braunr> for exemple, on BSD, there is a submap for mbufs, so that the
      network code doesn't interfere too much with other kernel allocations
    <braunr> antrik: they are similar to spans in vmem, but vmem has an elegant
      importing mechanism which eliminates the static limit problem
    <antrik> so memory is not directly allocated from the physical allocator,
      but instead from another map which in turn contains physical memory, or
      something like that?...
    <braunr> no, this is entirely virtual
    <braunr> submaps are almost exclusively used for the kernel_map
    <antrik> you are using a lot of identifies here, but I don't remember (or
      never knew) what most of them mean :-(
    <braunr> sorry :)
    <braunr> the kernel map is the vm_map used to represent the ~1 GiB of
      virtual memory the kernel has (on i386)
    <braunr> vm_map objects are simple virtual space maps
    <braunr> they contain what you see in linux when doing /proc/self/maps
    <braunr> cat /proc/self/maps
    <braunr> (linux uses entirely different names but it's roughly the same
      structure)
    <braunr> each line is a vm_map_entry
    <braunr> (well, there aren't submaps in linux though)
    <braunr> the pmap tool on netbsd is able to show the kernel map with its
      submaps, but i don't have any image around
    <mcsim> braunr: is limit for zones is feature and shouldn't be changed?
    <braunr> mcsim: i think we shouldn't have fixed limits for zones
    <braunr> mcsim: this should be part of the debugging facilities in the slab
      allocator
    <braunr> is this fixed limit really a major problem ?
    <braunr> i mean, don't focus on that too much, there are other issues
      requiring more attention
    <antrik> braunr: at 12 MiB, it used to be, causing a lot of zalloc
      panics. after increasing, I don't think it's much of a problem anymore...
    <antrik> but as memory sizes grow, it might become one again
    <antrik> that's the problem with a fixed size...
    <braunr> yes, that's the issue with submaps
    <braunr> but gnumach is full of those, so let's fix them by order of
      priority
    <antrik> well, I'm still trying to digest what you wrote about submaps :-)
    <braunr> i'm downloading netbsd, so you can have a good view of all this
    <antrik> so, when the kernel allocates virtual address space regions
      (mostly for itself), instead of grabbing chunks of the address space
      directly, it takes parts out of a pre-reserved region?
    <braunr> not exactly
    <braunr> both statements are true
    <mcsim> antrik: only virtual addresses are reserved
    <braunr> it grabs chunks of the address space directly, but does so in a
      reserved region of the address space
    <braunr> a submap is like a normal map, it has a start address, a size, and
      is empty, then it's populated with vm_map_entries
    <braunr> so instead of allocating from 3-4 GiB, you allocate from, say,
      3.1-3.2 GiB
    <antrik> yeah, that's more or less what I meant...
    <mcsim> braunr: I see two problems: limited zones and absence of caching. 
    <mcsim> with caching absence of readahead paging will be not so significant
    <braunr> please avoid readahead
    <mcsim> ok
    <braunr> and it's not about paging, it's about kernel memory, which is
      wired
    <braunr> (well most of it)
    <braunr> what about limited zones ?
    <braunr> the whole kernel space is limited, there has to be limits
    <braunr> the problem is how to handle them
    <antrik> braunr: almost all. I looked through all zones once, and IIRC I
      found exactly one that actually allows paging...
    <braunr> currently, when you reach the limit, you have an OOM error
    <braunr> antrik: yes, there are
    <braunr> i don't remember which implementation does that but, when
      processes haven't been active for a minute or so, they are "swapedout"
    <braunr> completely
    <braunr> even the kernel stack
    <braunr> and the page tables
    <braunr> (most of the pmap structures are destroyed, some are retained)
    <antrik> that might very well be true... at least inactive processes often
      show up with 0 memory use in top on Hurd
    <braunr> this is done by having a pageable kernel map, with wired entries
    <braunr> when the swapper thread swaps tasks out, it unwires them
    <braunr> but i think modern implementations don't do that any more
    <antrik> well, I was talking about zalloc only :-)
    <braunr> oh
    <braunr> so the zalloc_map must be pageable
    <braunr> or there are two submaps ?
    <antrik> not sure whether "morden implementations" includes Linux ;-)
    <braunr> no, i'm talking about the bsd family only
    <antrik> but it's certainly true that on Linux even inactive processes
      retain some memory
    <braunr> linux doesn't make any difference between processor-bound and
      I/O-bound processes
    <antrik> braunr: I have no idea how it works. I just remember that when
      creating zones, one of the optional flags decides whether the zone is
      pagable. but as I said, IIRC there is exactly one that actually is...
    <braunr> zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max,
      zone_map_size, FALSE);
    <braunr> kmem_suballoc(parent, min, max, size, pageable)
    <braunr> so the zone_map isn't
    <antrik> IIRC my conclusion was that pagable zones do not count in the
      fixed zone map limit... but I'm not sure anymore
    <braunr> zinit() has a memtype parameter
    <braunr> with ZONE_PAGEABLE as a possible flag
    <braunr> this is wierd :)
    <mcsim> There is no any zones which use ZONE_PAGEABLE flag
    <antrik> mcsim: are you sure? I think I found one...
    <braunr> if (zone->type & ZONE_PAGEABLE) {
    <antrik> admittedly, it is several years ago that I looked into this, so my
      memory is rather dim...
    <braunr> if (kmem_alloc_pageable(zone_map, &addr, ...
    <braunr> calling kmem_alloc_pageable() on an unpageable submap seems wrong
    <mcsim> I've greped gnumach code and there is no any zinit procedure call
      with ZONE_PAGEABLE flag
    <braunr> good
    <antrik> hm... perhaps it was in some code that has been removed
      alltogether since ;-)
    <antrik> actually I think it would be pretty neat to have pageable kernel
      objects... but I guess it would require considerable effort to implement
      this right
    <braunr> mcsim: you also mentioned absence of caching
    <braunr> mcsim: the zone allocator actually is a bare caching object
      allocator
    <braunr> antrik: no, it's easy
    <braunr> antrik: i already had that in x15 0.1
    <braunr> antrik: the problem is being sure the objects you allocate from a
      pageable backing store are never used when resolving a page fault
    <braunr> that's all
    <antrik> I wouldn't expect that to be easy... but surely you know better
      :-)
    <mcsim> braunr: indeed. I was wrong.
    <antrik> braunr: what is a caching object allocator?...
    <braunr> antrik: ok, it's not easy
    <braunr> antrik: but once you have vm_objects implemented, having pageable
      kernel object is just a matter of using the right options, really
    <braunr> antrik: an allocator that caches its buffers
    <braunr> some years ago, the term "object" would also apply to
      preconstructed buffers
    <antrik> I have no idea what you mean by "caches its buffers" here :-)
    <braunr> well, a memory allocator which doesn't immediately free its
      buffers caches them
    <mcsim> braunr: but can it return objects to system?
    <braunr> mcsim: which one ?
    <antrik> yeah, obviously the *implementation* of pageable kernel objects is
      not hard. the tricky part is deciding which objects can be pageable, and
      which need to be wired...
    <mcsim> Can zone allocator return cached objects to system as in slab?
    <mcsim> I mean reap()
    <braunr> well yes, it does so, and it does that too often
    <braunr> the caching in the zone allocator is actually limited to the
      pagesize
    <braunr> once page is completely free, it is returned to the vm
    <mcsim> this is bad caching
    <braunr> yes
    <mcsim> if object takes all page than there is now caching at all
    <braunr> caching by side effect
    <braunr> true
    <braunr> but the linux slab allocator does the same thing :p
    <braunr> hm
    <braunr> no, the solaris slab allocator does so
    <mcsim> linux's slab returns objects only when system ask
    <antrik> without preconstructed objects, is there actually any point in
      caching empty slabs?...
    <mcsim> Once I've changed my allocator to slab and it cached more than 1GB
      of my memory)
    <braunr> ok wait, need to fix a few mistakes first
    <mcsim> s/ask/asks
    <braunr> the zone allocator (in gnumach) actually has a garbage collector
    <antrik> braunr: well, the Solaris allocator follows the slab/magazine
      paper, right? so there is caching at the magazine layer... in that case
      caching empty slabs too would be rather redundant I'd say...
    <braunr> which is called when running low on memory, similar to the slab
      allocaotr
    <braunr> antrik: yes
    <antrik> (or rather the paper follows the Solaris allocator ;-) )
    <braunr> mcsim: the zone allocator reap() is zone_gc()
    <antrik> braunr: hm, right, there is a "collectable" flag for zones... but
      I never understood what it means
    <antrik> braunr: BTW, I heard Linux has yet another allocator now called
      "slob"... do you happen to know what that is?
    <braunr> slob is a very simple allocator for embedded devices
    <mcsim> AFAIR this is just heap allocator
    <braunr> useful when you have a very low amount of memory
    <braunr> like 1 MiB
    <braunr> yes
    <antrik> just googled it :-)
    <braunr> zone and slab are very similar
    <antrik> sounds like a simple heap allocator
    <mcsim> there is another allocator that calls slub, and it better than slab
      in many cases
    <braunr> the main difference is the data structures used to store slabs
    <braunr> mcsim: i disagree
    <antrik> mcsim: ah, you already said that :-)
    <braunr> mcsim: slub is better for systems with very large amounts of
      memory and processors
    <braunr> otherwise, slab is better
    <braunr> in addition, there are accounting issues with slub
    <braunr> because of cache merging
    <mcsim> ok. This strange that slub is default allocator
    <braunr> well both are very good
    <braunr> iirc, linus stated that he really doesn't care as long as its
      works fine
    <braunr> he refused slqb because of that
    <braunr> slub is nice because it requires less memory than slab, while
      still being as fast for most cases
    <braunr> it gets slower on the free path, when the cpu performing the free
      is different from the one which allocated the object
    <braunr> that's a reasonable cost
    <mcsim> slub uses heap for large object. Are there any tests that compare
      what is better for large objects?
    <antrik> well, if slub requires less memory, why do you think slab is
      better for smaller systems? :-)
    <braunr> antrik: smaller is relative
    <antrik> mcsim: for large objects slab allocation is rather pointless, as
      you don't have multiple objects in a page anyways...
    <braunr> antrik: when lameter wrote slub, it was intended for systems with
      several hundreds processors
    <antrik> BTW, was slqb really refused only because the other ones are "good
      enough"?...
    <braunr> yes
    <antrik> wow, that's a strange argument...
    <braunr> linus is already unhappy of having "so many" allocators
    <antrik> well, if the new one is better, it could replace one of the others
      :-)
    <antrik> or is it useful only in certain cases?
    <braunr> that's the problem
    <braunr> nobody really knows
    <antrik> hm, OK... I guess that should be tested *before* merging ;-)
    <antrik> is anyone still working on it, or was it abandonned?
    <antrik> mcsim: back to caching...
    <antrik> what does caching in the kernel object allocator got to do with
      readahead (i.e. clustered paging)?...
    <mcsim> if we cached some physical pages we don't need to find new ones for
      allocating new object. And that's why there will not be a page fault.
    <mcsim> antrik: Regarding kam. Hasn't he finished his project?
    <antrik> err... what?
    <antrik> one of us must be seriously confused
    <antrik> I totally fail to see what caching of physical pages (which isn't
      even really a correct description of what slab does) has to do with page
      faults
    <antrik> right, KAM didn't finish his project
    <mcsim> If we free the physical page and return it to system we need
      another one for next allocation. But if we keep it, we don't need to find
      new physical page. 
    <mcsim> And physical page is allocated only then when page fault
      occurs. Probably, I'm wrong
    <antrik> what does "return to system" mean? we are talking about the
      kernel...
    <antrik> zalloc/slab are about allocating kernel objects. this doesn't have
      *anything* to do with paging of userspace processes
    <antrik> only thing the have in common is that they need to get pages from
      the physical page allocator. but that's yet another topic
    <mcsim> Under "return to system" I mean ability to use this page for other
      needs.
    <braunr> mcsim: consider kernel memory to be wired
    <braunr> here, return to system means releasing a page back to the vm
      system
    <braunr> the vm_kmem module then unmaps the physical page and free its
      virtual address in the kernel map
    <mcsim> ok
    <braunr> antrik: the problem with new allocators like slqb is that it's
      very difficult to really know if they're better, even with extensive
      testing
    <braunr> antrik: there are papers (like wilson95) about the difficulties in
      making valuable results in this field
    <braunr> see
      http://www.sceen.net/~rbraun/dynamic_storage_allocation_a_survey_and_critical_review.pdf
    <mcsim> how can be allocated physically continuous object now?
    <braunr> mcsim: rephrase please
    <mcsim> what is similar to kmalloc in Linux to gnumach?
    <braunr> i know memory is reserved for dma in a direct virtual to physical
      mapping
    <braunr> so even if the allocation is done similarly to vmalloc()
    <braunr> the selected region of virtual space maps physical memory, so
      memory is physically contiguous too
    <braunr> for other allocation types, a block large enough is allocated, so
      it's contiguous too
    <mcsim> I don't clearly understand. If we have fragmentation in physical
      ram, so there aren't 2 free pages in a row, but there are able apart, we
      can't to allocate these 2 pages along?
    <braunr> no
    <braunr> but every system has this problem
    <mcsim> But since we have only 12 or 32 MB of memory the problem becomes
      more significant
    <braunr> you're confusing virtual and physical memory
    <braunr> those 32 MiB are virtual
    <braunr> the physical pages backing them don't have to be contiguous
    <mcsim> Oh, indeed 
    <mcsim> So the only problem are limits?
    <braunr> and performance
    <braunr> and correctness
    <braunr> i find the zone allocator badly written
    <braunr> antrik: mcsim: here is the content of the kernel pmap on NetBSD
      (which uses a virtual memory system close to the Mach VM)
    <braunr> antrik: mcsim: http://www.sceen.net/~rbraun/pmap.out

[[pmap.out]]

    <braunr> you can see the kmem_map (which is used for most general kernel
      allocations) is 128 MiB large
    <braunr> actually it's not the kernel pmap, it's the kernel_map
    <antrik> braunr: why is it called pmap.out then? ;-)
    <braunr> antrik: because the tool is named pmap
    <braunr> for process map
    <braunr> it also exists under Linux, although direct access to
      /proc/xx/maps gives more info
    <mcsim> braunr: I've said that this is kernel_map. Can I see kernel_map for
      Linux?
    <braunr> mcsim: I don't know how to do that
    <mcsim> s/I've/You've
    <braunr> but Linux doesn't have submaps, and uses a direct virtual to
      physical mapping, so it's used differently
    <antrik> how are things (such as zalloc zones) entered into kernel_map?
    <braunr> in zone_init() you have
    <braunr> zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max,
      zone_map_size, FALSE);
    <braunr> so here, kmem_map is named zone_map
    <braunr> then, in zalloc()
    <braunr> kmem_alloc_wired(zone_map, &addr, zone->alloc_size)
    <antrik> so, kmem_alloc just deals out chunks of memory referenced directly
      by the address, and without knowing anything about the use?
    <braunr> kmem_alloc() gives virtual pages
    <braunr> zalloc() carves them into buffers, as in the slab allocator
    <braunr> the difference is essentially the lack of formal "slab" object
    <braunr> which makes the zone code look like a mess
    <antrik> so kmem_suballoc() essentially just takes a bunch of pages from
      the main kernel_map, and uses these to back another map which then in
      turn deals out pages just like the main kernel_map?
    <braunr> no
    <braunr> kmem_suballoc creates a vm_map_entry object, and sets its start
      and end address
    <braunr> and creates a vm_map object, which is then inserted in the new
      entry
    <braunr> maybe that's what you meant with "essentially just takes a bunch
      of pages from the main kernel_map"
    <braunr> but there really is no allocation at this point
    <braunr> except the map entry and the new map objects
    <antrik> well, I'm trying to understand how kmem_alloc() manages things. so
      it has map_entry structures like the maps of userspace processes? do
      these also reference actual memory objects?
    <braunr> kmem_alloc just allocates virtual pages from a vm_map, and backs
      those with physical pages (unless the user requested pageable memory)
    <braunr> it's not "like the maps of userspace processes"
    <braunr> these are actually the same structures
    <braunr> a vm_map_entry can reference a memory object or a kernel submap
    <braunr> in netbsd, it can also referernce nothing (for pure wired kernel
      memory like the vm_page array)
    <braunr> maybe it's the same in mach, i don't remember exactly
    <braunr> antrik: this is actually very clear in vm/vm_kern.c
    <braunr> kmem_alloc() creates a new kernel object for the allocation
    <braunr> allocates a new entry (or uses a previous existing one if it can
      be extended) through vm_map_find_entry()
    <braunr> then calls kmem_alloc_pages() to back it with wired memory
    <antrik> "creates a new kernel object" -- what kind of kernel object?
    <braunr> kmem_alloc_wired() does roughly the same thing, except it doesn't
      need a new kernel object because it knows the new area won't be pageable
    <braunr> a simple vm_object
    <braunr> used as a container for anonymous memory in case the pages are
      swapped out
    <antrik> vm_object is the same as memory object/pager? or yet something
      different?
    <braunr> antrik: almost
    <braunr> antrik: a memory_object is the user view of a vm_object
    <braunr> as in the kernel/user interfaces used by external pagers
    <braunr> vm_object is a more internal name
    <mcsim> Is fragmentation a big problem in slab allocator?
    <mcsim> I've tested it on my computer in Linux and for some caches it
      reached 30-40%
    <antrik> well, fragmentation is a major problem for any allocator...
    <antrik> the original slab allocator was design specifically with the goal
      of reducing fragmentation
    <antrik> the revised version with the addition of magazines takes a step
      back on this though
    <antrik> have you compared it to slub? would be pretty interesting...
    <mcsim> I have an idea how can it be decreased, but it will hurt by
      performance...
    <mcsim> antrik: no I haven't, but there will be might the same, I think
    <mcsim> if each cache will handle two types of object: with sizes that will
      fit cache sizes (or I bit smaller) and with sizes which are much smaller
      than maximal cache size. For first type of object will be used standard
      slab allocator and for latter type will be used (within page) heap
      allocator.
    <mcsim> I think that than fragmentation will be decreased
    <antrik> not at all. heap allocator has much worse fragmentation. that's
      why slab allocator was invented
    <antrik> the problem is that in a long-running program (such an the
      kernel), objects tend to have vastly varying lifespans
    <mcsim> but we use heap only for objects of specified sizes
    <antrik> so often a few old objects will keep a whole page hostage
    <mcsim> for example for 32 byte cache it could be 20-28 byte objects
    <antrik> that's particularily visible in programs such as firefox, which
      will grow the heap during use even though actual needs don't change
    <antrik> the slab allocator groups objects in a fashion that makes it more
      likely adjacent objects will be freed at similar times
    <antrik> well, that's pretty oversimplyfied, but I hope you get the
      idea... it's about locality
    <mcsim> I agree, but I speak not about general heap allocation. We have
      many heaps for objects with different sizes.
    <mcsim> Could it be better?
    <antrik> note that this has been a topic of considerable research. you
      shouldn't seek to improve the actual algorithms -- you would have to read
      up on the existing research at least before you can contribute anything
      to the field :-)
    <antrik> how would that be different from the slab allocator?
    <mcsim> slab will allocate 32 byte for both 20 and 32 byte requests
    <mcsim> And if there was request for 20 bytes we get 12 unused
    <antrik> oh, you mean the implementation of the generic allocator on top of
      slabs? well, that might not be optimal... but it's not an often used case
      anyways. mostly the kernel uses constant-sized objects, which get their
      own caches with custom tailored size
    <antrik> I don't think the waste here matters at all
    <mcsim> affirmative. So my idea is useless. 
    <antrik> does the statistic you refer to show the fragmentation in absolute
      sizes too?
    <mcsim> Can you explain what is absolute size?
    <mcsim> I've counted what were requested (as parameter of kmalloc) and what
      was really allocated (according to best fit cache size).
    <antrik> how did you get that information?
    <mcsim> I simply wrote a hook
    <antrik> I mean total. i.e. how many KiB or MiB are wasted due to
      fragmentation alltogether
    <antrik> ah, interesting. how does it work?
    <antrik> BTW, did you read the slab papers?
    <mcsim> Do you mean articles from lwn.net?
    <antrik> no 
    <antrik> I mean the papers from the Sun hackers who invented the slab
      allocator(s)
    <antrik> Bonwick mostly IIRC
    <mcsim> Yes
    <antrik> hm... then you really should know the rationale behind it...
    <mcsim> There he says about 11% percent of memory waste
    <antrik> you didn't answer my other questions BTW :-)
    <mcsim> I've corrupted kernel tree with patch, and tomorrow I'm going to
      read myself up for exam (I have it on Thursday). But than I'll send you a
      module which I've used for testing.
    <antrik> OK
    <mcsim> I can send you module now, but it will not work without patch.
    <mcsim> It would be better to rewrite it using debugfs, but when I was
      writing this test I didn't know about trace_* macros


# IRC, freenode, #hurd, 2011-04-15

    <mcsim> There is a hack in zone_gc when it allocates and frees two
      vm_map_kentry_zone elements to make sure the gc will be able to allocate
      two in vm_map_delete. Isn't it better to allocate memory for these
      entries statically?
    <youpi> mcsim: that's not the point of the hack
    <youpi> mcsim: the point of the hack is to make sure vm_map_delete will be
      able to allocate stuff
    <youpi> allocating them statically will just work once
    <youpi> it may happen several times that vm_map_delete needs to allocate it
      while it's empty (and thus zget_space has to get called, leading to a
      hang)
    <youpi> funnily enough, the bug is also in macos X
    <youpi> it's still in my TODO list to manage to find how to submit the
      issue to them
    <braunr> really ?
    <braunr> eh
    <braunr> is that because of map entry splitting ?
    <youpi> it's git commit efc3d9c47cd744c316a8521c9a29fa274b507d26
    <youpi> braunr: iirc something like this, yes
    <braunr> netbsd has this issue too
    <youpi> possibly
    <braunr> i think it's a fundamental problem with the design
    <braunr> people think of munmap() as something similar to free()
    <braunr> whereas it's really unmap
    <braunr> with a BSD-like VM, unmap can easily end up splitting one entry in
      two
    <braunr> but your issue is more about harmful recursion right ?
    <youpi> I don't remember actually
    <youpi> it's quite some time ago :)
    <braunr> ok
    <braunr> i think that's why i have "sources" in my slab allocator, the
      default source (vm_kern) and a custom one for kernel map entries


# IRC, freenode, #hurd, 2011-04-18

    <mcsim> braunr: you've said that once page is completely free, it is
      returned to the vm.
    <mcsim> who else, besides zone_gc, can return free pages to the vm?
    <braunr> mcsim: i also said i was wrong about that
    <braunr> zone_gc is the only one


# IRC, freenode, #hurd, 2011-04-19

    <braunr> antrik: mcsim: i added back a new per-cpu layer as planned
    <braunr>
      http://git.sceen.net/rbraun/libbraunr.git/?a=blob;f=mem.c;h=c629b2b9b149f118a30f0129bd8b7526b0302c22;hb=HEAD
    <braunr> mcsim: btw, in mem_cache_reap(), you can clearly see there are two
      loops, just as in zone_gc, to reduce contention and avoid deadlocks
    <braunr> this is really common in memory allocators


# IRC, freenode, #hurd, 2011-04-23

    <mcsim> I've looked through some allocators and all of them use different
      per cpu cache policy. AFAIK gnuhurd doesn't support multiprocessing, but
      still multiprocessing must be kept in mind. So, what do you think what
      kind of cpu caches is better? As for me I like variant with only per-cpu
      caches (like in slqb).
    <antrik> mcsim: well, have you looked at the allocator braunr wrote
      himself? :-)
    <antrik> I'm not sure I suggested that explicitly to you; but probably it
      makes most sense to use that in gnumach


# IRC, freenode, #hurd, 2011-04-24

    <mcsim> antrik: Yes, I have. He uses both global and per cpu caches. But he
      also suggested to look through slqb, where there are only per cpu
      caches.\
    <braunr> i don't remember slqb in detail
    <braunr> what do you mean by "only per-cpu caches" ?
    <braunr> a whole slab sytem for each cpu ?
    <mcsim> I mean that there are no global queues in caches, but there are
      special queues for each cpu.
    <mcsim> I've just started investigating slqb's code, but I've read an
      article on lwn about it. And I've read that it is used for zen kernel.
    <braunr> zen ?
    <mcsim> Here is this article http://lwn.net/Articles/311502/
    <mcsim> Yes, this is linux kernel with some patches which haven't been
      approved to torvald's tree
    <mcsim> http://zen-kernel.org/
    <braunr> i see
    <braunr> well it looks nice
    <braunr> but as for slub, the problem i can see is cross-CPU freeing
    <braunr> and I think nick piggins mentions it
    <braunr> piggin*
    <braunr> this means that sometimes, objects are "burst-free" from one cpu
      cache to another
    <braunr> which has the same bad effects as in most other allocators, mainly
      fragmentation
    <mcsim> There is a special list for freeing object allocated for another
      CPU
    <mcsim> And garbage collector frees such object on his own
    <braunr> so what's your question ?
    <mcsim> It is described in the end of article.
    <mcsim> What cpu-cache policy do you think is better to implement?
    <braunr> at this point, any
    <braunr> and even if we had a kernel that perfectly supports
      multiprocessor, I wouldn't care much now
    <braunr> it's very hard to evaluate such allocators
    <braunr> slqb looks nice, but if you have the same amount of fragmentation
      per slab as other allocators do (which is likely), you have tat amount of
      fragmentation multiplied by the number of processors
    <braunr> whereas having shared queues limit the problem somehow
    <braunr> having shared queues mean you have a bit more contention
    <braunr> so, as is the case most of the time, it's a tradeoff
    <braunr> by the way, does pigging say why he "doesn't like" slub ? :)
    <braunr> piggin*
    <mcsim> http://lwn.net/Articles/311093/
    <mcsim> here he describes what slqb is better.
    <braunr> well it doesn't describe why slub is worse
    <mcsim> but not very particularly 
    <braunr> except for order-0 allocations
    <braunr> and that's a form of fragmentation like i mentioned above
    <braunr> in mach those problems have very different impacts
    <braunr> the backend memory isn't physical, it's the kernel virtual space
    <braunr> so the kernel allocator can request chunks of higher than order-0
      pages
    <braunr> physical pages are allocated one at a time, then mapped in the
      kernel space
    <mcsim> Doesn't order of page depend on buffer size?
    <braunr> it does
    <mcsim> And why does gnumach allocates higher than order-0 pages more?
    <braunr> why more ?
    <braunr> i didn't say more
    <mcsim> And why in mach those problems have very different impact?
    <braunr> ?
    <braunr> i've just explained why :)
    <braunr> 09:37 < braunr> physical pages are allocated one at a time, then
      mapped in the kernel space
    <braunr> "one at a time" means order-0 pages, even if you allocate higher
      than order-0 chunks
    <mcsim> And in Linux they allocated more than one at time because of
      prefetching page reading?
    <braunr> do you understand what virtual memory is ?
    <braunr> linux allocators allocate "physical memory"
    <braunr> mach kernel allocator allocates "virtual memory"
    <braunr> so even if you allocate a big chunk of virtual memory, it's backed
      by order-0 physical pages
    <mcsim> yes, I understand this
    <braunr> you don't seem to :/
    <braunr> the problem of higher than order-0 page allocations is
      fragmentation
    <braunr> do you see why ?
    <mcsim> yes
    <braunr> so
    <braunr> fragmentation in the kernel space is less likely to create issues
      than it does in physical memory
    <braunr> keep in mind physical memory is almost always full because of the
      page cache
    <braunr> and constantly under some pressure
    <braunr> whereas the kernel space is mostly empty
    <braunr> so allocating higher then order-0 pages in linux is more dangerous
      than it is in Mach or BSD
    <mcsim> ok
    <braunr> on the other hand, linux focuses pure performance, and not having
      to map memory means less operations, less tlb misses, quicker allocations
    <braunr> the Mach VM must map pages "one at a time", which can be expensive
    <braunr> it should be adapted to handle multiple page sizes (e.g. 2 MiB) so
      that many allocations can be made with few mappings
    <braunr> but that's not easy
    <braunr> as always: tradeoffs
    <mcsim> There are other benefits of physical allocating. In big DMA
      transfers can be needed few continuous physical pages. How does mach
      handles such cases?
    <braunr> gnumach does that awfully
    <braunr> it just reserves the whole DMA-able memory and uses special
      allocation functions on it, IIRC
    <braunr> but kernels which have a MAch VM like memory sytem such as BSDs
      have cleaner methods
    <braunr> NetBSD provides a function to allocate contiguous physical memory
    <braunr> with many constraints
    <braunr> FreeBSD uses a binary buddy system like Linux
    <braunr> the fact that the kernel allocator uses virtual memory doesn't
      mean the kernel has no mean to allocate contiguous physical memory ...


# IRC, freenode, #hurd, 2011-05-02

    <braunr> hm nice, my allocator uses less memory than glibc (squeeze
      version) on both 32 and 64 bits systems
    <braunr> the new per-cpu layer is proving effective
    <neal> braunr: Are you reimplementation malloc?
    <braunr> no
    <braunr> it's still the slab allocator for mach, but tested in userspace
    <braunr> so i wrote malloc wrappers
    <neal> Oh.
    <braunr> i try to heavily test most of my code in userspace now
    <neal> it's easier :-)
    <neal> I agree
    <braunr> even the physical memory allocator has been implemented this way
    <neal> is this your mach version?
    <braunr> virtual memory allocation will follow
    <neal> or are you working on gnu mach?
    <braunr> for now it's my version
    <braunr> but i intend to spend the summer working on ipc port names
      management

[[rework_gnumach_IPC_spaces]].

    <braunr> and integrate the result in gnu mach
    <neal> are you keeping the same user-space API?
    <neal> Or are you experimenting with something new?
    <antrik> braunr: to be fair, it's not terribly hard to use less memory than
      glibc :-)
    <braunr> yes
    <braunr> antrik: well ptmalloc3 received some nice improvements
    <braunr> neal: the goal is to rework some of the internals only
    <braunr> neal: namely, i simply intend to replace the splay tree with a
      radix tree
    <antrik> braunr: the glibc allocator is emphasising performace, unlike some
      other allocators that trade some performance for much better memory
      utilisation...
    <antrik> ptmalloc3?
    <braunr> that's the allocator used in glibc
    <braunr> http://www.malloc.de/en/
    <antrik> OK. haven't seen any recent numbers... the comparision I have in
      mind is many years old...
    <braunr> i also made some additions to my avl and red-black trees this week
      end, which finally make them suitable for almost all generic uses
    <braunr> the red-black tree could be used in e.g. gnu mach to augment the
      linked list used in vm maps
    <braunr> which is what's done in most modern systems
    <braunr> it could also be used to drop the overloaded (and probably over
      imbalanced) page cache hash table


# IRC, freenode, #hurd, 2011-05-03

    <mcsim> antrik: How should I start porting? Have I just include rbraun's
      allocator to gnumach and make it compile?
    <antrik> mcsim: well, basically yes I guess... but you will have to look at
      the code in question first before we know anything more specific :-)
    <antrik> I guess braunr might know better how to start, but he doesn't
      appear to be here :-(
    <braunr> mcsim: you can't juste put my code into gnu mach and make it run,
      it really requires a few careful changes
    <braunr> mcsim: you will have to analyse how the current zone allocator
      interacts with regard to locking
    <braunr> if it is used in interrupt handlers
    <braunr> what kind of locks it should use instead of the pthread stuff
      available in userspace
    <braunr> you will have to change the reclamiing policy, so that caches are
      reaped on demand
    <braunr> (this basically boils down to calling the new reclaiming function
      instead of zone_gc())
    <braunr> you must be careful about types too
    <braunr> there is work to be done ;)
    <braunr> (not to mention the obvious about replacing all the calls to the
      zone allocator, and testing/debugging afterwards)


# IRC, freenode, #hurd, 2011-07-14

    <braunr> can you make your patch available ?
    <mcsim> it is available in gnumach repository at savannah 
    <mcsim> tree mplaneta/libbraunr/master
    <braunr> mcsim: i'll test your branch
    <mcsim> ok. I'll give you a link in a minute
    <braunr> hm why balloc ?
    <mcsim> Braun's allocator
    <braunr> err
    <braunr>
      http://git.sceen.net/rbraun/x15mach.git/?a=blob;f=kern/kmem.c;h=37173fa0b48fc9d7e177bf93de531819210159ab;hb=HEAD
    <braunr> mcsim: this is the interface i had in mind for a kernel version :)
    <braunr> very similar to the original slab allocator interface actually
    <braunr> well, you've been working
    <mcsim> But I have a problem with this patch. When I apply it to gnumach
      code from debian repository. I have to make a change in file ramdisk.c
      with sed -i 's/kernel_map/\&kernel_map/' device/ramdisk.c
    <mcsim> because in git repository there is no such file
    <braunr> mcsim: how do you configure the kernel before building ?
    <braunr> mcsim: you should keep in touch more often i think, so that you
      get feedback from us and don't spend too much time "off course"
    <mcsim> I didn't configure it. I just run dpkg-buildsource -b.
    <braunr> oh you build the debian package
    <braunr> well my version was by configure --enable-kdb --enable-rtl8139
    <braunr> and it seems stuck in an infinite loop during bootstrap
    <mcsim> and printf doesn't work. The first function called by c_boot_entry
      is printf(version).
    <braunr> mcsim: also, you're invited to get the x15mach version of my
      files, which are gplv2+ licensed
    <braunr> be careful of my macros.h file, it can conflict with the
      macros_help.h file from gnumach iirc
    <mcsim> There were conflicts with MACRO_BEGIN and MACRO_END. But I solved
      it
    <braunr> ok
    <braunr> it's tricky
    <braunr> mcsim: try to find where the first use of the allocator is made


# IRC, freenode, #hurd, 2011-07-22

    <mcsim> braunr, hello. Kernel with your allocator already compiles and
      runs. There still some problems, but, certainly, I'm on the final stage
      already. I hope I'll finish in a few days.
    <tschwinge> mcsim: Oh, cool!  Have you done some measurements already?
    <mcsim> Not yet
    <tschwinge> OK.
    <tschwinge> But if it able to run a GNU/Hurd system, then that already is
      something, a big milestone!
    <braunr> nice
    <braunr> although you'll probably need to tweak the garbage collecting
      process
    <mcsim> tschwinge: thanks
    <mcsim> braunr: As back-end for allocating memory I use
      kmem_alloc_wired. But in zalloc was an opportunity to use as back-end
      kmem_alloc_pageable. Although there was no any zone that used
      kmem_alloc_pageable. Do I need to implement this functionality?
    <braunr> mcsim: do *not* use kmem_alloc_pageable()
    <mcsim> braunr: Ok. This is even better)
    <braunr> mcsim: in x15, i've taken this even further: there is *no* kernel
      vm object, which means all kernel memory is wired and unmanaged
    <braunr> making it fast and safe
    <braunr> pageable kernel memory was useful back when RAM was really scarce
    <braunr> 20 years ago
    <braunr> but it's a source of deadlock
    <mcsim> Indeed. I'll won't use kmem_alloc_pageable.


# IRC, freenode, #hurd, 2011-08-09

    < braunr> mcsim: what's the "bug related to MEM_CF_VERIFY" you refer to in
      one of your commits ?
    < braunr> mcsim: don't use spin_lock_t as a member of another structure
    < mcsim> braunr: I confused with types in *_verify functions, so they
      didn't work. Than I fixed it in the commit you mentioned.
    < braunr> in gnumach, most types are actually structure pointers
    < braunr> use simple_lock_data_t
    < braunr> mcsim: ok
    < mcsim> > use simple_lock_data_t
    < mcsim> braunr: ok
    < braunr> mcsim: don't make too many changes to the code base, and if
      you're unsure, don't hesitate to ask
    < braunr> also, i really insist you rename the allocator, as done in x15
      for example
      (http://git.sceen.net/rbraun/x15mach.git/?a=blob;f=vm/kmem.c), instead of
      a name based on mine :/
    < mcsim> braunr: Ok. It was just work name. When I finish I'll rename the
      allocator.
    < braunr> other than that, it's nice to see progress
    < braunr> although again, it would be better with some reports along
    < braunr> i won't be present at the meeting tomorrow unfortunately, but you
      should use those to report the status of your work
    < mcsim> braunr: You've said that I have to tweak gc process. Did you mean
      to call mem_gc() when physical memory ends instead of calling it every x
      seconds? Or something else?
    < braunr> there are multiple topics, alhtough only one that really matters
    < braunr> study how zone_gc was called
    < braunr> reclaiming memory should happen when there is pressure on the VM
      subsystem
    < braunr> but it shouldn't happen too ofte, otherwise there is trashing
    < braunr> and your caches become mostly useless
    < braunr> the original slab allocator uses a 15-second period after a
      reclaim during which reclaiming has no effect
    < braunr> this allows having a somehow stable working set for this duration
    < braunr> the linux slab allocator uses 5 seconds, but has a more
      complicated reclaiming mechanism
    < braunr> it releases memory gradually, and from reclaimable caches only
      (dentry for example)
    < braunr> for x15 i intend to implement the original 15 second interval and
      then perform full reclaims
    < mcsim> In zalloc mem_gc is called by vm_pageout_scan, but not often than
      once a second.
    < mcsim> In balloc I've changed interval to once in 15 seconds.
    < braunr> don't use the code as it is
    < braunr> the version you've based your work on was meant for userspace
    < braunr> where there isn't memory pressure
    < braunr> so a timer is used to trigger reclaims at regular intervals
    < braunr> it's different in a kernel
    < braunr> mcsim: where did you see vm_pageout_scan call the zone gc once a
      second ?
    < mcsim> vm_pageout_scan calls consider_zone_gc and consider_zone_gc checks
      if second is passed.
    < braunr> where ?
    < mcsim> Than zone_gc can be called.
    < braunr> ah ok, it's in zaclloc.c then
    < braunr> zalloc.c
    < braunr> yes this function is fine
    < mcsim> so old gc didn't consider vm pressure. Or I missed something.
    < braunr> it did
    < mcsim> how?
    < braunr> well, it's called by the pageout daemon
    < braunr> under memory pressure
    < braunr> so it's fine
    < mcsim> so if mem_gc is called by pageout daemon is it fine?
    < braunr> it must be changed to do something similar to what
      consider_zone_gc does
    < mcsim> It does. mem_gc does the same work as consider_zone_gc and
      zone_gc.
    < braunr> good
    < mcsim> so gc process is fine?
    < braunr> should be
    < braunr> i see mem.c only includes mem.h, which then includes other
      headers
    < braunr> don't do that
    < braunr> always include all the headers you need where you need them
    < braunr> if you need avltree.h in both mem.c and mem.h, include it in both
      files
    < braunr> and by the way, i recommend you use the red black tree instead of
      the avl type
    < braunr> (it's the same interface so it shouldn't take long)
    < mcsim> As to report. If you won't be present at the meeting, I can tell
      you what I have to do now.
    < braunr> sure
    < braunr> in addition, use GPLv2 as the license, teh BSD one is meant for
      the userspace version only
    < braunr> GPLv2+ actually
    < braunr> hm you don't need list.c
    < braunr> it would only add dead code
    < braunr> "Zone for dynamical allocator", don't mix terms
    < braunr> this comment refers to a vm_map, so call it a map
    < mcsim> 1. Change constructor for kentry_alloc_cache.
    < mcsim> 2. Make measurements.
    < mcsim> +
    < mcsim> 3. Use simple_lock_data_t
    < mcsim> 4. Replace license
    < braunr> kentry_alloc_cache <= what is that ?
    < braunr> cache for kernel map entries in vm_map ?
    < braunr> the comment for mem_cpu_pool_get doesn't apply in gnumach, as
      there is no kernel preemption
    < braunr> "Don't attempt mem GC more frequently than hz/MEM_GC_INTERVAL
      times a second.
    < braunr> "
    < mcsim> sorry. I meant vm_map_kentry_cache
    < braunr> hm nothing actually about this comment
    < braunr> mcsim: ok
    < braunr> yes kernel map entries need special handling
    < braunr> i don't know how it's done in gnumach though
    < braunr> static preallocation ?
    < mcsim> yes
    < braunr> that's ugly :p
    < mcsim> but it uses dynamic allocation further even for vm_map kernel
      entries
    < braunr> although such bootstrapping issues are generally difficult to
      solve elegantly
    < braunr> ah
    < mcsim> now I use only static allocation, but I'll add dynamic allocation
      too
    < braunr> when you have time, mind the coding style (convert everything to
      gnumach style, which mostly implies using tabs instead of 4-spaces
      indentation)
    < braunr> when you'll work on dynamic allocation for the kernel map
      entries, you may want to review how it's done in x15
    < braunr> the mem_source type was originally intended for that purpose, but
      has slightly changed once the allocator was adapted to work in my kernel
    < mcsim> ok
    < braunr> vm_map_kentry_zone is the only zone created with ZONE_FIXED
    < braunr> and it is zcram()'ed immediately after
    < braunr> so you can consider it a statically allocated zone
    < braunr> in x15 i use another strategy: there is a special kernel submap
      named kentry_map which contains only one map entry (statically allocated)
    < braunr> this map is the backend (mem_source) for the kentry_cache
    < braunr> the kentry_cache is created with a special flag that tells it
      memory can't be reclaimed
    < braunr> when the cache needs to grow, the single map entry is extended to
      cover the allocated memory
    < braunr> it's similar to the way pmap_growkernel() works for kernel page
      table pages
    < braunr> (and is actually based on that idea)
    < braunr> it's a compromise between full static and dynamic allocation
      types
    < braunr> the advantage is that the allocator code can be used (so there is
      no need for a special allocator like in netbsd)
    < braunr> the drawback is that some resources can never be returned to
      their source (and under peaks, the amount of unfreeable resources could
      become large, but this is unexpected)
    < braunr> mcsim: for now you shouldn't waste your time with this
    < braunr> i see the number of kernel map entries is fixed at 256
    < braunr> and i've never seen the kernel use more than around 30 entries
    < mcsim> Do you think that I have to left this problem to the end?
    < braunr> yes


# IRC, freenode, #hurd, 2011-08-11

    < mcsim> braunr: Hello. Can you give me an advice how can I make
      measurements better?
    < braunr> mcsim: what kind of measurements
    < mcsim> braunr: How much is your allocator better than zalloc.
    < braunr> slightly :p
    < braunr> that's why i never took the time to put it in gnumach
    < mcsim> braunr: Just I thought that there are some rules or
      recommendations of such measurements. Or I can do them any way I want?
    < braunr> mcsim: i don't know
    < braunr> mcsim: benchmarking is an art of its own, and i don't even know
      how to use the bits of profiling code available in gnumach (if it still
      works)
    < antrik> mcsim: hm... are you saying you already have a running system
      with slab allocator?... :-)
    < braunr> mcsim: the main advantage i can see is the removal of many
      arbitrary hard limits
    < mcsim> antrik: yes
    < antrik> \o/
    < antrik> nice work!
    < braunr> :)
    < braunr> the cpu layer should also help a bit, but it's hard to measure
    < braunr> i guess it could be seen on the ipc path for very small buffers
    < mcsim> antrik: Thanks. But I still have to 1. Change constructor for
      kentry_alloc_cache. and 2. Make measurements.
    < braunr> and polish the whole thing :p
    < antrik> mcsim: I'm not sure this can be measured... the performance
      differente in any real live usage is probably just a few percent at most
      -- it's hard to construct a benchmark giving enough precision so it's not
      drowned in noise...
    < antrik> perhaps it conserves some memory -- but that too would be hard to
      measure I fear
    < braunr> yes
    < braunr> there *should* be better allocation times, less fragmentation,
      better accounting ... :)
    < braunr> and no arbitrary limits !
    < antrik> :-)
    < braunr> oh, and the self debugging features can be nice too
    < mcsim> But I need to prove that my work wasn't useless
    < braunr> well it wasn't, but that's hard to measure
    < braunr> it's easy to prove though, since there are additional features
      that weren't present in the zone allocator
    < mcsim> Ok. If there are some profiling features in gnumach can you give
      me a link with their description?
    < braunr> mcsim: sorry, no
    < braunr> mcsim: you could still write the basic loop test, which counts
      the number of allocations performed in a fixed time interval
    < braunr> but as it doesn't match many real life patterns, it won't be very
      useful
    < braunr> and i'm afraid that if you consider real life patterns, you'll
      see how negligeable the improvement can be compared to other operations
      such as memory copies or I/O (ouch)
    < mcsim> Do network drivers use this allocator?
    < mcsim> ok. I'll scrape up some test and than I'll report results.


# IRC, freenode, #hurd, 2011-08-26

    < mcsim> hello. Are there any analogs of copy_to_user and copy_from_user in
      linux for gnumach?
    < mcsim> Or how can I determine memory map if I know address? I need this
      for vm_map_copyin
    < guillem> mcsim: vm_map_lookup_entry?
    < mcsim> guillem: but I need to transmit map to this function and it will
      return an entry which contains specified address.
    < mcsim> And I don't know what map have I transmit.
    < mcsim> I need to transfer static array from kernel to user. What map
      contains static data?
    < antrik> mcsim: Mach doesn't have copy_{from,to}_user -- instead, large
      chunks of data are transferred as out-of-line data in IPC messages
      (i.e. using VM magic)
    < mcsim> antrik: can you give me an example? I just found using
      vm_map_copyin in host_zone_info.
    < antrik> no idea what vm_map_copyin is to be honest...


# IRC, freenode, #hurd, 2011-08-27

    < braunr> mcsim: the primitives are named copyin/copyout, and they are used
      for messages with inline data
    < braunr> or copyinmsg/copyoutmsg
    < braunr> vm_map_copyin/out should be used for chunks larger than a page
      (or roughly a page)
    < braunr> also, when writing to a task space, see which is better suited:
      vm_map_copyout or vm_map_copy_overwrite
    < mcsim> braunr: and what will be src_map for vm_map_copyin/out?
    < braunr> the caller map
    < braunr> which you can get with current_map() iirc
    < mcsim> braunr: thank you
    < braunr> be careful not to leak anything in the transferred buffers
    < braunr> memset() to 0 if in doubt
    < mcsim> braunr:ok
    < braunr> antrik: vm_map_copyin() is roughly vm_read()
    < antrik> braunr: what is it used for?
    < braunr> antrik: 01:11 < antrik> mcsim: Mach doesn't have
      copy_{from,to}_user -- instead, large chunks of data are transferred as
      out-of-line data in IPC messages (i.e. using VM magic)
    < braunr> antrik: that "VM magic" is partly implemented using vm_map_copy*
      functions
    < antrik> braunr: oh, you mean it doesn't actually copy data, but only page
      table entries? if so, that's *not* really comparable to
      copy_{from,to}_user()...


# IRC, freenode, #hurd, 2011-08-28

    < braunr> antrik: the equivalent of copy_{from,to}_user are
      copy{in,out}{,msg}
    < braunr> antrik: but when the data size is about a page or more, it's
      better not to copy, of course
    < antrik> braunr: it's actually not clear at all that it's really better to
      do VM magic than to copy...


# IRC, freenode, #hurd, 2011-08-29

    < braunr> antrik: at least, that used to be the general idea, and with a
      simpler VM i suspect it's still true
    < braunr> mcsim: did you progress on your host_zone_info replacement ?
    < braunr> mcsim: i think you should stick to what the original
      implementation did
    < braunr> which is making an inline copy if caller provided enough space,
      using kmem_alloc_pageable otherwise
    < braunr> specify ipc_kernel_map if using kmem_alloc_pageable
    < mcsim> braunr: yes. And it works. But I use kmem_alloc, not pageable. Is
      it worse?
    < mcsim> braunr: host_zone_info replacement is pushed to savannah
      repository. 
    < braunr> mcsim: i'll have a look
    < mcsim> braunr: I've pushed one more commit just now, which has attitude
      to host_zone_info.
    < braunr> mem_alloc_early_init should be renamed mem_bootstrap
    < mcsim> ok
    < braunr> mcsim: i don't understand your call to kmem_free
    < mcsim> braunr: It shouldn't be there?
    < braunr> why should it be there ?
    < braunr> you're freeing what the copy object references
    < braunr> it's strange that it even works
    < braunr> also, you shouldn't pass infop directly as the copy object
    < braunr> i guess you get a warning for that
    < braunr> do what the original code does: use an intermediate copy object
      and a cast
    < mcsim> ok
    < braunr> another error (without consequence but still, you should mind it)
    < braunr> simple_lock(&mem_cache_list_lock);
    < braunr> [...]
    < braunr> kr = kmem_alloc(ipc_kernel_map, &info, info_size);
    < braunr> you can't hold simple locks while allocating memory
    < braunr> read how the original implementation works around this
    < mcsim> ok
    < braunr> i guess host_zone_info assumes the zone list doesn't change much
      while unlocked
    < braunr> or that's it's rather unimportant since it's for debugging
    < braunr> a strict snapshot isn't required
    < braunr> list_for_each_entry(&mem_cache_list, cache, node) max_caches++;
    < braunr> you should really use two separate lines for readability
    < braunr> also, instead of counting each time, you could just maintain a
      global counter
    < braunr> mcsim: use strncpy instead of strcpy for the cache names
    < braunr> not to avoid overflow but rather to clear the unused bytes at the
      end of the buffer
    < braunr> mcsim: about kmem_alloc vs kmem_alloc_pageable, it's a minor
      issue
    < braunr> you're handing off debugging data to a userspace application
    < braunr> a rather dull reporting tool in most cases, which doesn't require
      wired down memory
    < braunr> so in order to better use available memory, pageable memory
      should be used
    < braunr> in the future i guess it could become a not-so-minor issue though
    < mcsim> ok. I'll fix it
    < braunr> mcsim: have you tried to run the kernel with MC_VERIFY always on
      ?
    < braunr> MEM_CF_VERIFY actually
    < mcsim1> yes.
    < braunr> oh
    < braunr> nothing wrong 
    < braunr> ?
    < mcsim1> it is always set
    < braunr> ok
    < braunr> ah, you set it in macros.h ..
    < braunr> don't
    < braunr> put it in mem.c if you want, or better, make it a compile-time
      option
    < braunr> macros.h is a tiny macro library, it shouldn't define such
      unrelated options
    < mcsim1> ok.
    < braunr> mcsim1: did you try fault injection to make sure the checking
      code actually works and how it behaves when an error occurs ?
    < mcsim1> I think that when I finish I'll merge files cpu.h and macros.h
      with mem.c
    < braunr> yes that would simplify things
    < mcsim1> Yes. When I confused with types mem_buf_fill worked wrong and
      panic occurred.
    < braunr> very good
    < braunr> have you progressed concerning the measurements you wanted to do
      ?
    < mcsim1> not much.
    < braunr> ok
    < mcsim1> I think they will be ready in a few days.
    < antrik> what measurements are these?
    < mcsim1> braunr: What maximal size for static data and stack in kernel?
    < braunr> what do you mean ?
    < braunr> kernel stacks are one page if i'm right
    < braunr> static data (rodata+data+bss) are limited by grub bugs only :)
    < mcsim1> braunr: probably they are present, because when I created too big
      array I couldn't boot kernel
    < braunr> local variable or static ?
    < mcsim1> static
    < braunr> how large ?
    < mcsim1> 4Mb
    < braunr> hm
    < braunr> it's not a grub bug then
    < braunr> i was able to embed as much as 32 MiB in x15 while doing this
      kind of tests
    < braunr> I guess it's the gnu mach boot code which only preallocates one
      page for the initial kernel mapping
    < braunr> one PTP (page table page) maps 4 MiB
    < braunr> (x15 does this completely dynamically, unlike mach or even
      current BSDs)
    < mcsim1> antrik: First I want to measure time of each cache
      creation/allocation/deallocation and then compile kernel.
    < braunr> cache creation is irrelevant
    < braunr> because of the cpu pools in the new allocator, you should test at
      least two different allocation patterns
    < braunr> one with quick allocs/frees
    < braunr> the other with large numbers of allocs then their matching frees
    < braunr> (larger being at least 100)
    < braunr> i'd say the cpu pool layer is the real advantage over the
      previous zone allocator
    < braunr> (from a performance perspective)
    < mcsim1> But there is only one cpu
    < braunr> it doesn't matter
    < braunr> it's stil a very effective cache
    < braunr> in addition to reducing contention
    < braunr> compare mem_cpu_pool_pop() against mem_cache_alloc_from_slab()
    < braunr> mcsim1: work is needed to polish the whole thing, but getting it
      actually working is a nice achievement for someone new on the project
    < braunr> i hope it helped you learn about memory allocation, virtual
      memory, gnu mach and the hurd in general :)
    < antrik> indeed :-)


# IRC, freenode, #hurd, 2011-09-06

    [some performance testing]
    <braunr> i'm not sure such long tests are relevant but let's assume balloc
      is slower
    <braunr> some tuning is needed here
    <braunr> first, we can see that slab allocation occurs more often in balloc
      than page allocation does in zalloc
    <braunr> so yes, as slab allocation is slower (have you measured which part
      actually is slow ? i guess it's the kmem_alloc call)
    <braunr> the whole process gets a bit slower too
    <mcsim> I used alloc_size = 4096 for zalloc
    <braunr> i don't know what that is exactly
    <braunr> but you can't hold 500 16 bytes buffers in a page so zalloc must
      have had free pages around for that
    <mcsim> I use kmem_alloc_wired
    <braunr> if you have time, measure it, so that we know how much it accounts
      for
    <braunr> where are the results for dealloc ?
    <mcsim> I can't give you result right now because internet works very
      bad. But for first DEALLOC result are the same, exept some cases when it
      takes balloc for more than 1000 ticks
    <braunr> must be the transfer from the cpu layer to the slab layer
    <mcsim> as to kmem_alloc_wired. I think zalloc uses this function too for
      allocating objects in zone I test.
    <braunr> mcsim: yes, but less frequently, which is why it's faster
    <braunr> mcsim: another very important aspect that should be measured is
      memory consumption, have you looked into that ?
    <mcsim> I think that I made too little iterations in test SMALL
    <mcsim> If I increase constant SMALL_TESTS will it be good enough?
    <braunr> mcsim: i don't know, try both :)
    <braunr> if you increase the number of iterations, balloc average time will
      be lower than zalloc, but this doesn't remove the first long
      initialization step on the allocated slab
    <mcsim> SMALL_TESTS to 500, I mean
    <braunr> i wonder if maintaining the slabs sorted through insertion sort is
      what makes it slow
    <mcsim> braunr: where do you sort slabs? I don't see this.
    <braunr> mcsim: mem_cache_alloc_from_slab and its free counterpart
    <braunr> mcsim: the mem_source stuff is useless in gnumach, you can remove
      it and directly call the kmem_alloc/free functions
    <mcsim> But I have to make special allocator for kernel map entries.
    <braunr> ah right
    <mcsim> btw. It turned out that 256 entries are not enough.
    <braunr> that's weird
    <braunr> i'll make a patch so that the mem_source code looks more like what
      i have in x15 then
    <braunr> about the results, i don't think the slab layer is that slow
    <braunr> it's the cpu_pool_fill/drain functions that take time
    <braunr> they preallocate many objects (64 for your objects size if i'm
      right) at once
    <braunr> mcsim: look at the first result page: some times, a number around
      8000 is printed
    <braunr> the common time (ticks, whatever) for a single object is 120
    <braunr> 8132/120 is 67, close enough to the 64 value
    <mcsim> I forgot about SMALL tests here are they:
      http://paste.debian.net/128533/ (balloc) http://paste.debian.net/128534/
      (zalloc)
    <mcsim> braunr: why do you divide 8132 by 120?
    <braunr> mcsim: to see if it matches my assumption that the ~8000 number
      matches the cpu_pool_fill call
    <mcsim> braunr: I've got it
    <braunr> mcsim: i'd be much interested in the dealloc results if you can
      paste them too
    <mcsim> dealloc: http://paste.debian.net/128589/
      http://paste.debian.net/128590/
    <braunr> mcsim: thanks
    <mcsim> second dealloc: http://paste.debian.net/128591/
      http://paste.debian.net/128592/
    <braunr> mcsim: so the main conclusion i retain from your tests is that the
      transfers from the cpu and the slab layers are what makes the new
      allocator a bit slower
    <mcsim> OPERATION_SMALL dealloc: http://paste.debian.net/128593/
      http://paste.debian.net/128594/
    <braunr> mcsim: what needs to be measured now is global memory usage
    <mcsim> braunr: data from /proc/vmstat after kernel compilation will be
      enough?
    <braunr> mcsim: let me check
    <braunr> mcsim: no it won't do, you need to measure kernel memory usage
    <braunr> the best moment to measure it is right after zone_gc is called
    <mcsim> Are there any facilities in gnumach for memory measurement?
    <braunr> it's specific to the allocators
    <braunr> just count the number of used pages
    <braunr> after garbage collection, there should be no free page, so this
      should be rather simple
    <mcsim> ok
    <mcsim> braunr: When I measure memory usage in balloc, what formula is
      better cache->nr_slabs * cache->bufs_per_slab * cache->buf_size or
      cache->nr_slabs * cache->slab_size?
    <braunr> the latter