[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
id="license" text="Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
is included in the section entitled [[GNU Free Documentation
License|/fdl]]."]]"""]]
[[!tag open_issue_gnumach]]
There is a [[!FF_project 266]][[!tag bounty]] on this task.
[[!toc]]
# IRC, freenode, #hurd, 2011-04-12
<antrik> braunr: do you think the allocator you wrote for x15 could be used
for gnumach? and would you be willing to mentor this? :-)
<braunr> antrik: to be willing to isn't my current problem
<braunr> antrik: and yes, I think my allocator can be used
<braunr> it's a slab allocator after all, it only requires reap() and
grow()
<braunr> or mmap()/munmap() whatever you want to call it
<braunr> a backend
<braunr> antrik: although i've been having other ideas recently
<braunr> that would have more impact on our usage patterns I think
<antrik> mcsim: have you investigated how the zone allocator works and how
it's hooked into the system yet?
<braunr> mcsim: now let me give you a link
<braunr> mcsim:
http://git.sceen.net/rbraun/libbraunr.git/?a=blob;f=mem.c;h=330436e799f322949bfd9e2fedf0475660309946;hb=HEAD
<braunr> mcsim: this is an implementation of the slab allocator i've been
working on recently
<braunr> mcsim: i haven't made it public because i reworked the per
processor layer, and this part isn't complete yet
<braunr> mcsim: you could use it as a reference for your project
<mcsim> braunr: ok
<braunr> it used to be close to the 2001 vmem paper
<braunr> but after many tests, fragmentation and accounting issues have
been found
<braunr> so i rewrote it to be closer to the linux implementation (cache
filling/draining in bukl transfers)
<braunr> bulk*
<braunr> they actually use the word draining in linux too :)
<mcsim> antrik: not complete yet.
<antrik> braunr: oh, it's unfinished? that's unfortunate...
<braunr> antrik: only the per processor part
<braunr> antrik: so it doesn't matter much for gnumach
<braunr> and it's not difficult to set up
<antrik> mcsim: hm, OK... but do you think you will have a fairly good
understanding in the next couple of days?...
<antrik> I'm asking because I'd really like to see a proposal a bit more
specific than "I'll look into things..."
<antrik> i.e. you should have an idea which things you will actually have
to change to hook up a new allocator etc.
<antrik> braunr: OK. will the interface remain unchanged, so it could be
easily replaced with an improved implementation later?
<braunr> the zone allocator in gnumach is a badly written bare object
allocator actually, there aren't many things to understand about it
<braunr> antrik: yes
<antrik> great :-)
<braunr> and the per processor part should be very close to the phys
allocator sitting next to it
<braunr> (with the slight difference that, as per cpu caches have variable
sizes, they are allocated on the free path rather than on the allocation
path)
<braunr> this is a nice trick in the vmem paper i've kept in mind
<braunr> and the interface also allows to set a "source" for caches
<antrik> ah, good point... do you think we should replace the physmem
allocator too? and if so, do it in one step, or one piece at a time?...
<braunr> no
<braunr> too many drivers currently depend on the physical allocator and
the pmap module as they are
<braunr> remember linux 2.0 drivers need a direct virtual to physical
mapping
<braunr> (especially true for dma mappings)
<antrik> OK
<braunr> the nice thing about having a configurable memory source is that
<antrik> what do you mean by "allocated on the free path"?
<braunr> even if most caches will use the standard vm_kmem module as their
backend
<braunr> there is one exception in the vm_map module, allowing us to get
rid of either a static limit, or specific allocation code
<braunr> antrik: well, when you allocate a page, the allocator will lookup
one in a per cpu cache
<braunr> if it's empty, it fills the cache
<braunr> (called pools in my implementations)
<braunr> it then retries
<braunr> the problem in the slab allocator is that per cpu caches have
variable sizes
<braunr> so per cpu pools are allocated from their own pools
<braunr> (remember the magazine_xx caches in the output i showed you, this
is the same thing)
<braunr> but if you allocate them at allocation time, you could end up in
an infinite loop
<braunr> so, in the slab allocator, when a per cpu cache is empty, you just
fall back to the slab layer
<braunr> on the free path, when a per cpu cache doesn't exist, you allocate
it from its own cache
<braunr> this way you can't have an infinite loop
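
The free-path trick described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not braunr's actual code: every name (`struct magazine`, `struct cache`, `mag_cache`, `cache_alloc`, `cache_free`, `MAG_ROUNDS`) is hypothetical, there is a single "CPU", and `malloc()`/`free()` stand in for the slab layer. The point it tries to show is that the allocation path never creates a magazine, so creating one on the free path (from the magazines' own cache) cannot recurse forever.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAG_ROUNDS 4 /* hypothetical magazine capacity */

/* A magazine: a small per-CPU stack of cached objects. */
struct magazine {
    size_t nr_rounds;
    void *rounds[MAG_ROUNDS];
};

/* Hypothetical cache descriptor with a single per-CPU magazine. */
struct cache {
    size_t obj_size;
    struct magazine *mag;      /* NULL until the first free */
    unsigned long slab_allocs; /* allocations that reached the slab layer */
    unsigned long slab_frees;  /* frees that reached the slab layer */
};

/* Magazines are objects too, so they come from their own cache. */
static struct cache mag_cache = { sizeof(struct magazine), NULL, 0, 0 };

/* Stand-ins for the slab layer: malloc()/free() replace real slabs. */
static void *slab_alloc(struct cache *c)
{
    c->slab_allocs++;
    return malloc(c->obj_size);
}

static void slab_free(struct cache *c, void *obj)
{
    c->slab_frees++;
    free(obj);
}

void *cache_alloc(struct cache *c)
{
    struct magazine *mag = c->mag;

    if (mag != NULL && mag->nr_rounds > 0)
        return mag->rounds[--mag->nr_rounds];

    /* Missing or empty magazine: never create one here, fall back. */
    return slab_alloc(c);
}

void cache_free(struct cache *c, void *obj)
{
    assert(obj != NULL);

    if (c->mag == NULL) {
        /* Free path: create the magazine from its own cache. Since
         * cache_alloc() never allocates a magazine, this cannot loop. */
        c->mag = cache_alloc(&mag_cache);
        if (c->mag != NULL)
            c->mag->nr_rounds = 0;
    }

    if (c->mag != NULL && c->mag->nr_rounds < MAG_ROUNDS) {
        c->mag->rounds[c->mag->nr_rounds++] = obj;
        return;
    }

    /* No magazine available, or it is full: hand the object back. */
    slab_free(c, obj);
}
```

In this sketch the first `cache_alloc()` on a fresh cache goes straight to the slab layer, the first `cache_free()` creates the magazine, and the next `cache_alloc()` is served from it.
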
<mcsim> antrik: I'll try, but I have exams now.
<mcsim> As I understand amount of elements which could be allocated we
determine by zone initialization. And at this time memory for zone is
reserved. I'm going to change this. And make something similar to kmalloc
and vmalloc (support for pages consecutive physically and virtually). And
pages in zones consecutive always physically.
<mcsim> Am I right?
<braunr> mcsim: don't try to do that
<mcsim> why?
<braunr> mcsim: we just need a slab allocator with an interface close to
the zone allocator
<antrik> mcsim: IIRC the size of the complete zalloc map is fixed; but not
the number of elements per zone
<braunr> we don't need two allocators like kmalloc and vmalloc
<braunr> actually we just need vmalloc
<braunr> IIRC the limits are only present because the original developers
wanted to track leaks
<braunr> they assumed zones would be large enough, which isn't true any
more today
<braunr> but i didn't see any true reservation
<braunr> antrik: i'm not sure i was clear enough about the "allocation of
cpu caches on the free path"
<braunr> antrik: for a better explanation, read the vmem paper ;)
<antrik> braunr: you mean there is no fundamental reason why the zone map
has a limited maximal size; and it was only put in to catch cases where
something eats up all memory with kernel object creation?...
<antrik> braunr: I think I got it now :-)
<braunr> antrik: i'm pretty certain of it yes
<antrik> I don't see though how it is related to what we were talking
about...
<braunr> 10:55 < braunr> and the per processor part should be very close to
the phys allocator sitting next to it
<braunr> the phys allocator doesn't have to use this trick
<braunr> because pages have a fixed size, so per cpu caches all have the
same size too
<braunr> and the number of "caches", that is, physical segments, is limited
and known at compile time
<braunr> so having them statically allocated is possible
<antrik> I see
<braunr> it would actually be very difficult to have a phys allocator
requiring dynamic allocation when the dynamic allocator isn't yet ready
<antrik> hehe :-)
<mcsim> total size of all zone allocations is limited to 12 MB. And is "was
only put in to catch cases where something eats up all memory with kernel
object creation?"
<braunr> mcsim: ah right, there could be a kernel submap backing all the
zones
<braunr> but this can be increased too
<braunr> submaps are kind of evil :/
<antrik> mcsim: I think it's actually 32 MiB or something like that in the
Debian version...
<antrik> braunr: I'm not sure I ever fully understood what the zalloc map
is... I looked through the code once, and I think I got a rough
understanding, but I was still pretty uncertain about some bits. and I
don't remember the details anyways :-)
<braunr> antrik: IIRC, it's a kernel submap
<braunr> it's named kmem_map in x15
<antrik> don't know what a submap is
<braunr> submaps are vm_map objects
<braunr> in a top vm_map, there are vm_map_entries
<braunr> these entries usually point to vm_objects
<braunr> (for the page cache)
<braunr> but they can point to other maps too
<braunr> the goal is to reduce fragmentation by isolating allocations
<braunr> this also helps reducing contention
<braunr> for example, on BSD, there is a submap for mbufs, so that the
network code doesn't interfere too much with other kernel allocations
<braunr> antrik: they are similar to spans in vmem, but vmem has an elegant
importing mechanism which eliminates the static limit problem
<antrik> so memory is not directly allocated from the physical allocator,
but instead from another map which in turn contains physical memory, or
something like that?...
<braunr> no, this is entirely virtual
<braunr> submaps are almost exclusively used for the kernel_map
<antrik> you are using a lot of identifiers here, but I don't remember (or
never knew) what most of them mean :-(
<braunr> sorry :)
<braunr> the kernel map is the vm_map used to represent the ~1 GiB of
virtual memory the kernel has (on i386)
<braunr> vm_map objects are simple virtual space maps
<braunr> they contain what you see in linux when doing /proc/self/maps
<braunr> cat /proc/self/maps
<braunr> (linux uses entirely different names but it's roughly the same
structure)
<braunr> each line is a vm_map_entry
<braunr> (well, there aren't submaps in linux though)
<braunr> the pmap tool on netbsd is able to show the kernel map with its
submaps, but i don't have any image around
<mcsim> braunr: is limit for zones is feature and shouldn't be changed?
<braunr> mcsim: i think we shouldn't have fixed limits for zones
<braunr> mcsim: this should be part of the debugging facilities in the slab
allocator
<braunr> is this fixed limit really a major problem ?
<braunr> i mean, don't focus on that too much, there are other issues
requiring more attention
<antrik> braunr: at 12 MiB, it used to be, causing a lot of zalloc
panics. after increasing, I don't think it's much of a problem anymore...
<antrik> but as memory sizes grow, it might become one again
<antrik> that's the problem with a fixed size...
<braunr> yes, that's the issue with submaps
<braunr> but gnumach is full of those, so let's fix them by order of
priority
<antrik> well, I'm still trying to digest what you wrote about submaps :-)
<braunr> i'm downloading netbsd, so you can have a good view of all this
<antrik> so, when the kernel allocates virtual address space regions
(mostly for itself), instead of grabbing chunks of the address space
directly, it takes parts out of a pre-reserved region?
<braunr> not exactly
<braunr> both statements are true
<mcsim> antrik: only virtual addresses are reserved
<braunr> it grabs chunks of the address space directly, but does so in a
reserved region of the address space
<braunr> a submap is like a normal map, it has a start address, a size, and
is empty, then it's populated with vm_map_entries
<braunr> so instead of allocating from 3-4 GiB, you allocate from, say,
3.1-3.2 GiB
<antrik> yeah, that's more or less what I meant...
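
As a rough model of what a submap does to address space, the sketch below treats a map as a bare address range with a bump pointer. It is an assumption-laden toy, not the Mach `vm_map` code: `struct range_map`, `range_alloc` and `range_suballoc` are hypothetical names, nothing is ever freed, and no physical memory is involved. It only tries to show that a submap is a reserved sub-range of its parent, and that the submap can run out while the parent still has room, which is the fixed-limit problem discussed above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, purely virtual address range with a bump pointer. */
struct range_map {
    uint64_t start; /* first address covered by the map */
    uint64_t end;   /* one past the last covered address */
    uint64_t next;  /* next unallocated address */
};

/* Allocate `size` bytes of address space from `map`.
 * Returns 0 on failure; 0 is never a valid address in these maps. */
uint64_t range_alloc(struct range_map *map, uint64_t size)
{
    assert(map->start <= map->next && map->next <= map->end);

    if (size == 0 || size > map->end - map->next)
        return 0;

    uint64_t addr = map->next;
    map->next += size;
    return addr;
}

/* Reserve `size` bytes of the parent's range and describe them with
 * `child`. Only address space is reserved; nothing is populated yet.
 * Returns 0 on success, -1 if the parent has no room left. */
int range_suballoc(struct range_map *parent, struct range_map *child,
                   uint64_t size)
{
    uint64_t base = range_alloc(parent, size);

    if (base == 0)
        return -1;

    child->start = base;
    child->end = base + size;
    child->next = base;
    return 0;
}
```

With a parent covering 3 GiB to 4 GiB and a 32 MiB child, every address handed out by the child lies inside the child's reserved range, and a request larger than what is left in the child fails even though the parent could still satisfy it.
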
<mcsim> braunr: I see two problems: limited zones and absence of caching.
<mcsim> with caching absence of readahead paging will be not so significant
<braunr> please avoid readahead
<mcsim> ok
<braunr> and it's not about paging, it's about kernel memory, which is
wired
<braunr> (well most of it)
<braunr> what about limited zones ?
<braunr> the whole kernel space is limited, there has to be limits
<braunr> the problem is how to handle them
<antrik> braunr: almost all. I looked through all zones once, and IIRC I
found exactly one that actually allows paging...
<braunr> currently, when you reach the limit, you have an OOM error
<braunr> antrik: yes, there are
<braunr> i don't remember which implementation does that but, when
processes haven't been active for a minute or so, they are "swapped out"
<braunr> completely
<braunr> even the kernel stack
<braunr> and the page tables
<braunr> (most of the pmap structures are destroyed, some are retained)
<antrik> that might very well be true... at least inactive processes often
show up with 0 memory use in top on Hurd
<braunr> this is done by having a pageable kernel map, with wired entries
<braunr> when the swapper thread swaps tasks out, it unwires them
<braunr> but i think modern implementations don't do that any more
<antrik> well, I was talking about zalloc only :-)
<braunr> oh
<braunr> so the zalloc_map must be pageable
<braunr> or there are two submaps ?
<antrik> not sure whether "modern implementations" includes Linux ;-)
<braunr> no, i'm talking about the bsd family only
<antrik> but it's certainly true that on Linux even inactive processes
retain some memory
<braunr> linux doesn't make any difference between processor-bound and
I/O-bound processes
<antrik> braunr: I have no idea how it works. I just remember that when
creating zones, one of the optional flags decides whether the zone is
pagable. but as I said, IIRC there is exactly one that actually is...
<braunr> zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max,
zone_map_size, FALSE);
<braunr> kmem_suballoc(parent, min, max, size, pageable)
<braunr> so the zone_map isn't
<antrik> IIRC my conclusion was that pagable zones do not count in the
fixed zone map limit... but I'm not sure anymore
<braunr> zinit() has a memtype parameter
<braunr> with ZONE_PAGEABLE as a possible flag
<braunr> this is weird :)
<mcsim> There is no any zones which use ZONE_PAGEABLE flag
<antrik> mcsim: are you sure? I think I found one...
<braunr> if (zone->type & ZONE_PAGEABLE) {
<antrik> admittedly, it is several years ago that I looked into this, so my
memory is rather dim...
<braunr> if (kmem_alloc_pageable(zone_map, &addr, ...
<braunr> calling kmem_alloc_pageable() on an unpageable submap seems wrong
<mcsim> I've grepped gnumach code and there is no any zinit procedure call
with ZONE_PAGEABLE flag
<braunr> good
<antrik> hm... perhaps it was in some code that has been removed
altogether since ;-)
<antrik> actually I think it would be pretty neat to have pageable kernel
objects... but I guess it would require considerable effort to implement
this right
<braunr> mcsim: you also mentioned absence of caching
<braunr> mcsim: the zone allocator actually is a bare caching object
allocator
<braunr> antrik: no, it's easy
<braunr> antrik: i already had that in x15 0.1
<braunr> antrik: the problem is being sure the objects you allocate from a
pageable backing store are never used when resolving a page fault
<braunr> that's all
<antrik> I wouldn't expect that to be easy... but surely you know better
:-)
<mcsim> braunr: indeed. I was wrong.
<antrik> braunr: what is a caching object allocator?...
<braunr> antrik: ok, it's not easy
<braunr> antrik: but once you have vm_objects implemented, having pageable
kernel object is just a matter of using the right options, really
<braunr> antrik: an allocator that caches its buffers
<braunr> some years ago, the term "object" would also apply to
preconstructed buffers
<antrik> I have no idea what you mean by "caches its buffers" here :-)
<braunr> well, a memory allocator which doesn't immediately free its
buffers caches them
<mcsim> braunr: but can it return objects to system?
<braunr> mcsim: which one ?
<antrik> yeah, obviously the *implementation* of pageable kernel objects is
not hard. the tricky part is deciding which objects can be pageable, and
which need to be wired...
<mcsim> Can zone allocator return cached objects to system as in slab?
<mcsim> I mean reap()
<braunr> well yes, it does so, and it does that too often
<braunr> the caching in the zone allocator is actually limited to the
pagesize
<braunr> once page is completely free, it is returned to the vm
<mcsim> this is bad caching
<braunr> yes
<mcsim> if object takes all page then there is no caching at all
<braunr> caching by side effect
<braunr> true
<braunr> but the linux slab allocator does the same thing :p
<braunr> hm
<braunr> no, the solaris slab allocator does so
<mcsim> linux's slab returns objects only when system ask
<antrik> without preconstructed objects, is there actually any point in
caching empty slabs?...
<mcsim> Once I've changed my allocator to slab and it cached more than 1GB
of my memory)
<braunr> ok wait, need to fix a few mistakes first
<mcsim> s/ask/asks
<braunr> the zone allocator (in gnumach) actually has a garbage collector
<antrik> braunr: well, the Solaris allocator follows the slab/magazine
paper, right? so there is caching at the magazine layer... in that case
caching empty slabs too would be rather redundant I'd say...
<braunr> which is called when running low on memory, similar to the slab
allocator
<braunr> antrik: yes
<antrik> (or rather the paper follows the Solaris allocator ;-) )
<braunr> mcsim: the zone allocator reap() is zone_gc()
<antrik> braunr: hm, right, there is a "collectable" flag for zones... but
I never understood what it means
<antrik> braunr: BTW, I heard Linux has yet another allocator now called
"slob"... do you happen to know what that is?
<braunr> slob is a very simple allocator for embedded devices
<mcsim> AFAIR this is just heap allocator
<braunr> useful when you have a very low amount of memory
<braunr> like 1 MiB
<braunr> yes
<antrik> just googled it :-)
<braunr> zone and slab are very similar
<antrik> sounds like a simple heap allocator
<mcsim> there is another allocator that calls slub, and it better than slab
in many cases
<braunr> the main difference is the data structures used to store slabs
<braunr> mcsim: i disagree
<antrik> mcsim: ah, you already said that :-)
<braunr> mcsim: slub is better for systems with very large amounts of
memory and processors
<braunr> otherwise, slab is better
<braunr> in addition, there are accounting issues with slub
<braunr> because of cache merging
<mcsim> ok. This strange that slub is default allocator
<braunr> well both are very good
<braunr> iirc, linus stated that he really doesn't care as long as it
works fine
<braunr> he refused slqb because of that
<braunr> slub is nice because it requires less memory than slab, while
still being as fast for most cases
<braunr> it gets slower on the free path, when the cpu performing the free
is different from the one which allocated the object
<braunr> that's a reasonable cost
<mcsim> slub uses heap for large object. Are there any tests that compare
what is better for large objects?
<antrik> well, if slub requires less memory, why do you think slab is
better for smaller systems? :-)
<braunr> antrik: smaller is relative
<antrik> mcsim: for large objects slab allocation is rather pointless, as
you don't have multiple objects in a page anyways...
<braunr> antrik: when lameter wrote slub, it was intended for systems with
several hundreds processors
<antrik> BTW, was slqb really refused only because the other ones are "good
enough"?...
<braunr> yes
<antrik> wow, that's a strange argument...
<braunr> linus is already unhappy of having "so many" allocators
<antrik> well, if the new one is better, it could replace one of the others
:-)
<antrik> or is it useful only in certain cases?
<braunr> that's the problem
<braunr> nobody really knows
<antrik> hm, OK... I guess that should be tested *before* merging ;-)
<antrik> is anyone still working on it, or was it abandoned?
<antrik> mcsim: back to caching...
<antrik> what does caching in the kernel object allocator got to do with
readahead (i.e. clustered paging)?...
<mcsim> if we cached some physical pages we don't need to find new ones for
allocating new object. And that's why there will not be a page fault.
<mcsim> antrik: Regarding kam. Hasn't he finished his project?
<antrik> err... what?
<antrik> one of us must be seriously confused
<antrik> I totally fail to see what caching of physical pages (which isn't
even really a correct description of what slab does) has to do with page
faults
<antrik> right, KAM didn't finish his project
<mcsim> If we free the physical page and return it to system we need
another one for next allocation. But if we keep it, we don't need to find
new physical page.
<mcsim> And physical page is allocated only then when page fault
occurs. Probably, I'm wrong
<antrik> what does "return to system" mean? we are talking about the
kernel...
<antrik> zalloc/slab are about allocating kernel objects. this doesn't have
*anything* to do with paging of userspace processes
<antrik> only thing the have in common is that they need to get pages from
the physical page allocator. but that's yet another topic
<mcsim> Under "return to system" I mean ability to use this page for other
needs.
<braunr> mcsim: consider kernel memory to be wired
<braunr> here, return to system means releasing a page back to the vm
system
<braunr> the vm_kmem module then unmaps the physical page and free its
virtual address in the kernel map
<mcsim> ok
<braunr> antrik: the problem with new allocators like slqb is that it's
very difficult to really know if they're better, even with extensive
testing
<braunr> antrik: there are papers (like wilson95) about the difficulties in
making valuable results in this field
<braunr> see
http://www.sceen.net/~rbraun/dynamic_storage_allocation_a_survey_and_critical_review.pdf
<mcsim> how can be allocated physically continuous object now?
<braunr> mcsim: rephrase please
<mcsim> what is similar to kmalloc in Linux to gnumach?
<braunr> i know memory is reserved for dma in a direct virtual to physical
mapping
<braunr> so even if the allocation is done similarly to vmalloc()
<braunr> the selected region of virtual space maps physical memory, so
memory is physically contiguous too
<braunr> for other allocation types, a block large enough is allocated, so
it's contiguous too
<mcsim> I don't clearly understand. If we have fragmentation in physical
ram, so there aren't 2 free pages in a row, but there are able apart, we
can't to allocate these 2 pages along?
<braunr> no
<braunr> but every system has this problem
<mcsim> But since we have only 12 or 32 MB of memory the problem becomes
more significant
<braunr> you're confusing virtual and physical memory
<braunr> those 32 MiB are virtual
<braunr> the physical pages backing them don't have to be contiguous
<mcsim> Oh, indeed
<mcsim> So the only problem are limits?
<braunr> and performance
<braunr> and correctness
<braunr> i find the zone allocator badly written
<braunr> antrik: mcsim: here is the content of the kernel pmap on NetBSD
(which uses a virtual memory system close to the Mach VM)
<braunr> antrik: mcsim: http://www.sceen.net/~rbraun/pmap.out
[[pmap.out]]
<braunr> you can see the kmem_map (which is used for most general kernel
allocations) is 128 MiB large
<braunr> actually it's not the kernel pmap, it's the kernel_map
<antrik> braunr: why is it called pmap.out then? ;-)
<braunr> antrik: because the tool is named pmap
<braunr> for process map
<braunr> it also exists under Linux, although direct access to
/proc/xx/maps gives more info
<mcsim> braunr: I've said that this is kernel_map. Can I see kernel_map for
Linux?
<braunr> mcsim: I don't know how to do that
<mcsim> s/I've/You've
<braunr> but Linux doesn't have submaps, and uses a direct virtual to
physical mapping, so it's used differently
<antrik> how are things (such as zalloc zones) entered into kernel_map?
<braunr> in zone_init() you have
<braunr> zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max,
zone_map_size, FALSE);
<braunr> so here, kmem_map is named zone_map
<braunr> then, in zalloc()
<braunr> kmem_alloc_wired(zone_map, &addr, zone->alloc_size)
<antrik> so, kmem_alloc just deals out chunks of memory referenced directly
by the address, and without knowing anything about the use?
<braunr> kmem_alloc() gives virtual pages
<braunr> zalloc() carves them into buffers, as in the slab allocator
<braunr> the difference is essentially the lack of formal "slab" object
<braunr> which makes the zone code look like a mess
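
The carving step described above can be sketched in userspace as follows. This is only an illustration, not the gnumach code: all `toy_*` names and `TOY_PAGE_SIZE` are hypothetical, and `malloc()` stands in for the page-level allocation that the chat attributes to `kmem_alloc_wired()`.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096 /* assumed page size, for illustration only */

/* A free buffer stores the free-list link in its own first bytes. */
struct toy_buf {
    struct toy_buf *next;
};

struct toy_zone {
    size_t elem_size;          /* size of every buffer in this zone */
    struct toy_buf *free_list; /* buffers ready to be handed out */
    size_t nr_free;
};

static void toy_zone_init(struct toy_zone *z, size_t elem_size)
{
    /* A buffer must be able to hold the link and stay pointer-aligned. */
    if (elem_size < sizeof(struct toy_buf))
        elem_size = sizeof(struct toy_buf);
    elem_size = (elem_size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);

    z->elem_size = elem_size;
    z->free_list = NULL;
    z->nr_free = 0;
}

/* Get one page from the backend and carve it into elem_size buffers. */
static int toy_zone_grow(struct toy_zone *z)
{
    /* malloc() is a stand-in for the kernel's page-level allocation. */
    char *page = malloc(TOY_PAGE_SIZE);
    size_t off;

    if (page == NULL)
        return -1;

    for (off = 0; off + z->elem_size <= TOY_PAGE_SIZE; off += z->elem_size) {
        struct toy_buf *b = (struct toy_buf *)(page + off);

        b->next = z->free_list;
        z->free_list = b;
        z->nr_free++;
    }
    return 0;
}

static void *toy_zalloc(struct toy_zone *z)
{
    struct toy_buf *b;

    if (z->free_list == NULL && toy_zone_grow(z) != 0)
        return NULL;
    b = z->free_list;
    z->free_list = b->next;
    z->nr_free--;
    return b;
}

static void toy_zfree(struct toy_zone *z, void *p)
{
    struct toy_buf *b = p;

    b->next = z->free_list;
    z->free_list = b;
    z->nr_free++;
}
```

Note that this sketch never returns a page to its backend: nothing records which buffers belong to which page, which is roughly the bookkeeping a formal "slab" object would provide.
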
<antrik> so kmem_suballoc() essentially just takes a bunch of pages from
the main kernel_map, and uses these to back another map which then in
turn deals out pages just like the main kernel_map?
<braunr> no
<braunr> kmem_suballoc creates a vm_map_entry object, and sets its start
and end address
<braunr> and creates a vm_map object, which is then inserted in the new
entry
<braunr> maybe that's what you meant with "essentially just takes a bunch
of pages from the main kernel_map"
<braunr> but there really is no allocation at this point
<braunr> except the map entry and the new map objects
<antrik> well, I'm trying to understand how kmem_alloc() manages things. so
it has map_entry structures like the maps of userspace processes? do
these also reference actual memory objects?
<braunr> kmem_alloc just allocates virtual pages from a vm_map, and backs
those with physical pages (unless the user requested pageable memory)
<braunr> it's not "like the maps of userspace processes"
<braunr> these are actually the same structures
<braunr> a vm_map_entry can reference a memory object or a kernel submap
<braunr> in netbsd, it can also reference nothing (for pure wired kernel
memory like the vm_page array)
<braunr> maybe it's the same in mach, i don't remember exactly
<braunr> antrik: this is actually very clear in vm/vm_kern.c
<braunr> kmem_alloc() creates a new kernel object for the allocation
<braunr> allocates a new entry (or uses a previous existing one if it can
be extended) through vm_map_find_entry()
<braunr> then calls kmem_alloc_pages() to back it with wired memory
<antrik> "creates a new kernel object" -- what kind of kernel object?
<braunr> kmem_alloc_wired() does roughly the same thing, except it doesn't
need a new kernel object because it knows the new area won't be pageable
<braunr> a simple vm_object
<braunr> used as a container for anonymous memory in case the pages are
swapped out
<antrik> vm_object is the same as memory object/pager? or yet something
different?
<braunr> antrik: almost
<braunr> antrik: a memory_object is the user view of a vm_object
<braunr> as in the kernel/user interfaces used by external pagers
<braunr> vm_object is a more internal name
<mcsim> Is fragmentation a big problem in slab allocator?
<mcsim> I've tested it on my computer in Linux and for some caches it
reached 30-40%
<antrik> well, fragmentation is a major problem for any allocator...
<antrik> the original slab allocator was designed specifically with the goal
of reducing fragmentation
<antrik> the revised version with the addition of magazines takes a step
back on this though
<antrik> have you compared it to slub? would be pretty interesting...
<mcsim> I have an idea how it can be decreased, but it will hurt
performance...
<mcsim> antrik: no I haven't, but it might be the same there, I think
<mcsim> if each cache handles two types of object: those with sizes that
fit the cache size (or a bit smaller) and those with sizes which are much
smaller than the maximal cache size. For the first type of object the
standard slab allocator will be used, and for the latter type a (within
page) heap allocator.
<mcsim> I think that then fragmentation will be decreased
<antrik> not at all. heap allocator has much worse fragmentation. that's
why slab allocator was invented
<antrik> the problem is that in a long-running program (such as the
kernel), objects tend to have vastly varying lifespans
<mcsim> but we use heap only for objects of specified sizes
<antrik> so often a few old objects will keep a whole page hostage
<mcsim> for example for 32 byte cache it could be 20-28 byte objects
<antrik> that's particularly visible in programs such as firefox, which
will grow the heap during use even though actual needs don't change
<antrik> the slab allocator groups objects in a fashion that makes it more
likely adjacent objects will be freed at similar times
<antrik> well, that's pretty oversimplified, but I hope you get the
idea... it's about locality
<mcsim> I agree, but I speak not about general heap allocation. We have
many heaps for objects with different sizes.
<mcsim> Could it be better?
<antrik> note that this has been a topic of considerable research. you
shouldn't seek to improve the actual algorithms -- you would have to read
up on the existing research at least before you can contribute anything
to the field :-)
<antrik> how would that be different from the slab allocator?
<mcsim> slab will allocate 32 byte for both 20 and 32 byte requests
<mcsim> And if there was request for 20 bytes we get 12 unused
<antrik> oh, you mean the implementation of the generic allocator on top of
slabs? well, that might not be optimal... but it's not an often used case
anyways. mostly the kernel uses constant-sized objects, which get their
own caches with custom tailored size
<antrik> I don't think the waste here matters at all
<mcsim> affirmative. So my idea is useless.
<antrik> does the statistic you refer to show the fragmentation in absolute
sizes too?
<mcsim> Can you explain what is absolute size?
<mcsim> I've counted what were requested (as parameter of kmalloc) and what
was really allocated (according to best fit cache size).
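
The measurement mcsim describes (requested size versus the size of the best-fit cache) can be sketched as below. The power-of-two size classes starting at 8 bytes are an assumption made for the example, and all function and type names are hypothetical; this is not the hook that was actually used.

```c
#include <stddef.h>

/*
 * Hypothetical size classes: the smallest power of two that fits the
 * request, starting at 8 bytes.
 */
static size_t cache_size_for(size_t request)
{
    size_t size = 8;

    while (size < request)
        size <<= 1;
    return size;
}

struct waste_stats {
    size_t requested; /* sum of the sizes callers asked for */
    size_t allocated; /* sum of the cache sizes actually used */
};

/* Account for one allocation request. */
static void waste_account(struct waste_stats *s, size_t request)
{
    s->requested += request;
    s->allocated += cache_size_for(request);
}

/* Absolute waste in bytes. */
static size_t waste_bytes(const struct waste_stats *s)
{
    return s->allocated - s->requested;
}

/* Relative waste, as an integer percentage of what was allocated. */
static unsigned int waste_percent(const struct waste_stats *s)
{
    if (s->allocated == 0)
        return 0;
    return (unsigned int)(100 * waste_bytes(s) / s->allocated);
}
```

With one 20-byte and one 32-byte request, both land in the 32-byte class, so 12 of 64 allocated bytes are unused. Reporting both the absolute and the relative figure matters here, since a high percentage on a rarely used cache may amount to very little memory.
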
<antrik> how did you get that information?
<mcsim> I simply wrote a hook
<antrik> I mean total. i.e. how many KiB or MiB are wasted due to
fragmentation altogether
<antrik> ah, interesting. how does it work?
<antrik> BTW, did you read the slab papers?
<mcsim> Do you mean articles from lwn.net?
<antrik> no
<antrik> I mean the papers from the Sun hackers who invented the slab
allocator(s)
<antrik> Bonwick mostly IIRC
<mcsim> Yes
<antrik> hm... then you really should know the rationale behind it...
<mcsim> There he says about 11% of memory is wasted
<antrik> you didn't answer my other questions BTW :-)
<mcsim> I've corrupted the kernel tree with the patch, and tomorrow I'm
going to study for an exam (I have it on Thursday). But then I'll send
you a module which I've used for testing.
<antrik> OK
<mcsim> I can send you module now, but it will not work without patch.
<mcsim> It would be better to rewrite it using debugfs, but when I was
writing this test I didn't know about trace_* macros
# IRC, freenode, #hurd, 2011-04-15
<mcsim> There is a hack in zone_gc when it allocates and frees two
vm_map_kentry_zone elements to make sure the gc will be able to allocate
two in vm_map_delete. Isn't it better to allocate memory for these
entries statically?
<youpi> mcsim: that's not the point of the hack
<youpi> mcsim: the point of the hack is to make sure vm_map_delete will be
able to allocate stuff
<youpi> allocating them statically will just work once
<youpi> it may happen several times that vm_map_delete needs to allocate it
while it's empty (and thus zget_space has to get called, leading to a
hang)
<youpi> funnily enough, the bug is also in macos X
<youpi> it's still in my TODO list to manage to find how to submit the
issue to them
<braunr> really ?
<braunr> eh
<braunr> is that because of map entry splitting ?
<youpi> it's git commit efc3d9c47cd744c316a8521c9a29fa274b507d26
<youpi> braunr: iirc something like this, yes
<braunr> netbsd has this issue too
<youpi> possibly
<braunr> i think it's a fundamental problem with the design
<braunr> people think of munmap() as something similar to free()
<braunr> whereas it's really unmap
<braunr> with a BSD-like VM, unmap can easily end up splitting one entry in
two
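
A minimal sketch of why unmapping is not like `free()`: removing a range from the middle of one map entry leaves two entries, so the operation itself has to allocate. The `toy_entry` structure and function name are hypothetical and far simpler than a real `vm_map_entry`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified map entry covering [start, end). */
struct toy_entry {
    unsigned long start;
    unsigned long end;
    struct toy_entry *next;
};

/*
 * Remove [hole_start, hole_end) from the middle of entry e.
 *
 * The hole must lie strictly inside the entry. The left part reuses e,
 * but the right part needs a brand new entry, so unmapping memory can
 * require allocating memory. Returns the new right-hand entry, or NULL
 * if the arguments are invalid or the allocation failed.
 */
static struct toy_entry *toy_entry_punch_hole(struct toy_entry *e,
                                              unsigned long hole_start,
                                              unsigned long hole_end)
{
    struct toy_entry *right;

    if (!(e->start < hole_start && hole_start < hole_end
          && hole_end < e->end))
        return NULL;

    right = malloc(sizeof(*right));
    if (right == NULL)
        return NULL;

    right->start = hole_end;
    right->end = e->end;
    right->next = e->next;

    e->end = hole_start;
    e->next = right;
    return right;
}
```

If the allocator that provides map entries itself needs to unmap memory in order to make progress, this allocation is where the recursion problem discussed above can appear.
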
<braunr> but your issue is more about harmful recursion right ?
<youpi> I don't remember actually
<youpi> it's quite some time ago :)
<braunr> ok
<braunr> i think that's why i have "sources" in my slab allocator, the
default source (vm_kern) and a custom one for kernel map entries
# IRC, freenode, #hurd, 2011-04-18
<mcsim> braunr: you've said that once page is completely free, it is
returned to the vm.
<mcsim> who else, besides zone_gc, can return free pages to the vm?
<braunr> mcsim: i also said i was wrong about that
<braunr> zone_gc is the only one
# IRC, freenode, #hurd, 2011-04-19
<braunr> antrik: mcsim: i added back a new per-cpu layer as planned
<braunr>
http://git.sceen.net/rbraun/libbraunr.git/?a=blob;f=mem.c;h=c629b2b9b149f118a30f0129bd8b7526b0302c22;hb=HEAD
<braunr> mcsim: btw, in mem_cache_reap(), you can clearly see there are two
loops, just as in zone_gc, to reduce contention and avoid deadlocks
<braunr> this is really common in memory allocators
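
The two-loop pattern mentioned above can be sketched like this. It is not the code of `mem_cache_reap()` or `zone_gc`: the `toy_*` names are hypothetical, and a plain flag stands in for a real lock so that the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_slab {
    struct toy_slab *next;
    int nr_used; /* buffers of this slab still in use */
};

struct toy_cache {
    int locked; /* stand-in for a real lock */
    struct toy_slab *slabs;
};

static void toy_lock(struct toy_cache *c)
{
    assert(!c->locked);
    c->locked = 1;
}

static void toy_unlock(struct toy_cache *c)
{
    assert(c->locked);
    c->locked = 0;
}

/* Add a slab with nr_used buffers in use at the head of the cache list. */
static int toy_cache_add_slab(struct toy_cache *c, int nr_used)
{
    struct toy_slab *slab = malloc(sizeof(*slab));

    if (slab == NULL)
        return -1;
    slab->nr_used = nr_used;
    toy_lock(c);
    slab->next = c->slabs;
    c->slabs = slab;
    toy_unlock(c);
    return 0;
}

/* Release all completely free slabs; returns how many were released. */
static int toy_cache_reap(struct toy_cache *c)
{
    struct toy_slab *dead = NULL;
    struct toy_slab **link, *slab;
    int nr_released = 0;

    /* Loop 1: with the lock held, only unlink free slabs (cheap). */
    toy_lock(c);
    link = &c->slabs;
    while ((slab = *link) != NULL) {
        if (slab->nr_used == 0) {
            *link = slab->next;
            slab->next = dead;
            dead = slab;
        } else {
            link = &slab->next;
        }
    }
    toy_unlock(c);

    /* Loop 2: with the lock dropped, give the memory back (slow). */
    while (dead != NULL) {
        slab = dead;
        dead = slab->next;
        assert(!c->locked);
        free(slab);
        nr_released++;
    }
    return nr_released;
}
```

Keeping the release step outside the lock shortens the critical section, and it also means the backend can be called without holding the cache lock, which is one way such code avoids deadlocks.
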
# IRC, freenode, #hurd, 2011-04-23
<mcsim> I've looked through some allocators and all of them use different
per cpu cache policy. AFAIK gnuhurd doesn't support multiprocessing, but
still multiprocessing must be kept in mind. So, what do you think what
kind of cpu caches is better? As for me I like variant with only per-cpu
caches (like in slqb).
<antrik> mcsim: well, have you looked at the allocator braunr wrote
himself? :-)
<antrik> I'm not sure I suggested that explicitly to you; but probably it
makes most sense to use that in gnumach
# IRC, freenode, #hurd, 2011-04-24
<mcsim> antrik: Yes, I have. He uses both global and per cpu caches. But he
also suggested to look through slqb, where there are only per cpu
caches.
<braunr> i don't remember slqb in detail
<braunr> what do you mean by "only per-cpu caches" ?
<braunr> a whole slab system for each cpu ?
<mcsim> I mean that there are no global queues in caches, but there are
special queues for each cpu.
<mcsim> I've just started investigating slqb's code, but I've read an
article on lwn about it. And I've read that it is used for zen kernel.
<braunr> zen ?
<mcsim> Here is this article http://lwn.net/Articles/311502/
<mcsim> Yes, this is linux kernel with some patches which haven't been
approved to torvald's tree
<mcsim> http://zen-kernel.org/
<braunr> i see
<braunr> well it looks nice
<braunr> but as for slub, the problem i can see is cross-CPU freeing
<braunr> and I think nick piggins mentions it
<braunr> piggin*
<braunr> this means that sometimes, objects are "burst-free" from one cpu
cache to another
<braunr> which has the same bad effects as in most other allocators, mainly
fragmentation
<mcsim> There is a special list for freeing object allocated for another
CPU
<mcsim> And garbage collector frees such object on his own
<braunr> so what's your question ?
<mcsim> It is described in the end of article.
<mcsim> What cpu-cache policy do you think is better to implement?
<braunr> at this point, any
<braunr> and even if we had a kernel that perfectly supports
multiprocessor, I wouldn't care much now
<braunr> it's very hard to evaluate such allocators
<braunr> slqb looks nice, but if you have the same amount of fragmentation
per slab as other allocators do (which is likely), you have that amount of
fragmentation multiplied by the number of processors
<braunr> whereas having shared queues limits the problem somehow
<braunr> having shared queues means you have a bit more contention
<braunr> so, as is the case most of the time, it's a tradeoff
<braunr> by the way, does pigging say why he "doesn't like" slub ? :)
<braunr> piggin*
<mcsim> http://lwn.net/Articles/311093/
<mcsim> here he describes why slqb is better.
<braunr> well it doesn't describe why slub is worse
<mcsim> but not very particularly
<braunr> except for order-0 allocations
<braunr> and that's a form of fragmentation like i mentioned above
<braunr> in mach those problems have very different impacts
<braunr> the backend memory isn't physical, it's the kernel virtual space
<braunr> so the kernel allocator can request chunks of higher than order-0
pages
<braunr> physical pages are allocated one at a time, then mapped in the
kernel space
<mcsim> Doesn't order of page depend on buffer size?
<braunr> it does
<mcsim> And why does gnumach allocate higher than order-0 pages more?
<braunr> why more ?
<braunr> i didn't say more
<mcsim> And why in mach those problems have very different impact?
<braunr> ?
<braunr> i've just explained why :)
<braunr> 09:37 < braunr> physical pages are allocated one at a time, then
mapped in the kernel space
<braunr> "one at a time" means order-0 pages, even if you allocate higher
than order-0 chunks
<mcsim> And in Linux they allocated more than one at time because of
prefetching page reading?
<braunr> do you understand what virtual memory is ?
<braunr> linux allocators allocate "physical memory"
<braunr> mach kernel allocator allocates "virtual memory"
<braunr> so even if you allocate a big chunk of virtual memory, it's backed
by order-0 physical pages
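
The point can be illustrated with a toy model: a range of contiguous virtual pages is backed by whatever single physical frames happen to be free, so physical fragmentation does not make the request fail. Everything here (`toy_phys`, the 8-frame memory, the flat page table array) is a hypothetical simplification, not how gnumach represents memory.

```c
#define TOY_NR_FRAMES 8

struct toy_phys {
    unsigned char used[TOY_NR_FRAMES]; /* 1 if the physical frame is taken */
};

/* Allocate any single free frame (an "order-0" allocation). */
static int toy_frame_alloc(struct toy_phys *p)
{
    int i;

    for (i = 0; i < TOY_NR_FRAMES; i++) {
        if (!p->used[i]) {
            p->used[i] = 1;
            return i;
        }
    }
    return -1;
}

/*
 * Back nr_pages contiguous virtual pages with physical frames, one frame
 * at a time. page_table[i] receives the frame backing virtual page i.
 * Returns 0 on success, -1 (with nothing left allocated) on failure.
 */
static int toy_back_range(struct toy_phys *p, int *page_table, int nr_pages)
{
    int i, frame;

    for (i = 0; i < nr_pages; i++) {
        frame = toy_frame_alloc(p);
        if (frame < 0) {
            /* Undo the frames taken so far. */
            while (i-- > 0)
                p->used[page_table[i]] = 0;
            return -1;
        }
        page_table[i] = frame;
    }
    return 0;
}
```

In a physical memory where only frames 1, 3, 5 and 7 are free, no two free frames are adjacent, yet a 3-page virtual range can still be backed (by frames 1, 3 and 5). An allocator that hands out physical memory directly would have to refuse the same request.
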
<mcsim> yes, I understand this
<braunr> you don't seem to :/
<braunr> the problem of higher than order-0 page allocations is
fragmentation
<braunr> do you see why ?
<mcsim> yes
<braunr> so
<braunr> fragmentation in the kernel space is less likely to create issues
than it does in physical memory
<braunr> keep in mind physical memory is almost always full because of the
page cache
<braunr> and constantly under some pressure
<braunr> whereas the kernel space is mostly empty
<braunr> so allocating higher than order-0 pages in linux is more dangerous
than it is in Mach or BSD
<mcsim> ok
<braunr> on the other hand, linux focuses on pure performance, and not having
to map memory means less operations, less tlb misses, quicker allocations
<braunr> the Mach VM must map pages "one at a time", which can be expensive
<braunr> it should be adapted to handle multiple page sizes (e.g. 2 MiB) so
that many allocations can be made with few mappings
<braunr> but that's not easy
<braunr> as always: tradeoffs
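
As a rough back-of-the-envelope illustration of the mapping cost, the number of mappings needed for a region is the region size divided by the page size, rounded up. The 4 KiB base page size used in the note below is an assumption (typical for x86), and the helper name is hypothetical.

```c
#include <stddef.h>

/* Number of page mappings needed to cover `bytes` with pages of `page_size`. */
static size_t mappings_needed(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

For an 8 MiB allocation, this gives 2048 mappings with 4 KiB pages but only 4 with 2 MiB pages, which is the kind of saving the remark about multiple page sizes refers to.
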
<mcsim> There are other benefits of physical allocation. Big DMA
transfers can need a few contiguous physical pages. How does mach
handle such cases?
<braunr> gnumach does that awfully
<braunr> it just reserves the whole DMA-able memory and uses special
allocation functions on it, IIRC
<braunr> but kernels which have a Mach VM like memory system such as BSDs
have cleaner methods
<braunr> NetBSD provides a function to allocate contiguous physical memory
<braunr> with many constraints
<braunr> FreeBSD uses a binary buddy system like Linux
<braunr> the fact that the kernel allocator uses virtual memory doesn't
mean the kernel has no means to allocate contiguous physical memory ...
# IRC, freenode, #hurd, 2011-05-02
<braunr> hm nice, my allocator uses less memory than glibc (squeeze
version) on both 32 and 64 bits systems
<braunr> the new per-cpu layer is proving effective
<neal> braunr: Are you reimplementing malloc?
<braunr> no
<braunr> it's still the slab allocator for mach, but tested in userspace
<braunr> so i wrote malloc wrappers
<neal> Oh.
<braunr> i try to heavily test most of my code in userspace now
<neal> it's easier :-)
<neal> I agree
<braunr> even the physical memory allocator has been implemented this way
<neal> is this your mach version?
<braunr> virtual memory allocation will follow
<neal> or are you working on gnu mach?
<braunr> for now it's my version
<braunr> but i intend to spend the summer working on ipc port names
management
[[rework_gnumach_IPC_spaces]].
<braunr> and integrate the result in gnu mach
<neal> are you keeping the same user-space API?
<neal> Or are you experimenting with something new?
<antrik> braunr: to be fair, it's not terribly hard to use less memory than
glibc :-)
<braunr> yes
<braunr> antrik: well ptmalloc3 received some nice improvements
<braunr> neal: the goal is to rework some of the internals only
<braunr> neal: namely, i simply intend to replace the splay tree with a
radix tree
<antrik> braunr: the glibc allocator is emphasising performance, unlike some
other allocators that trade some performance for much better memory
utilisation...
<antrik> ptmalloc3?
<braunr> that's the allocator used in glibc
<braunr> http://www.malloc.de/en/
<antrik> OK. haven't seen any recent numbers... the comparison I have in
mind is many years old...
<braunr> i also made some additions to my avl and red-black trees this week
end, which finally make them suitable for almost all generic uses
<braunr> the red-black tree could be used in e.g. gnu mach to augment the
linked list used in vm maps
<braunr> which is what's done in most modern systems
<braunr> it could also be used to drop the overloaded (and probably over
imbalanced) page cache hash table
# IRC, freenode, #hurd, 2011-05-03
<mcsim> antrik: How should I start porting? Do I just include rbraun's
allocator in gnumach and make it compile?
<antrik> mcsim: well, basically yes I guess... but you will have to look at
the code in question first before we know anything more specific :-)
<antrik> I guess braunr might know better how to start, but he doesn't
appear to be here :-(
<braunr> mcsim: you can't just put my code into gnu mach and make it run,
it really requires a few careful changes
<braunr> mcsim: you will have to analyse how the current zone allocator
interacts with regard to locking
<braunr> if it is used in interrupt handlers
<braunr> what kind of locks it should use instead of the pthread stuff
available in userspace
<braunr> you will have to change the reclaiming policy, so that caches are
reaped on demand
<braunr> (this basically boils down to calling the new reclaiming function
instead of zone_gc())
<braunr> you must be careful about types too
<braunr> there is work to be done ;)
<braunr> (not to mention the obvious about replacing all the calls to the
zone allocator, and testing/debugging afterwards)
# IRC, freenode, #hurd, 2011-07-14
<braunr> can you make your patch available ?
<mcsim> it is available in gnumach repository at savannah
<mcsim> tree mplaneta/libbraunr/master
<braunr> mcsim: i'll test your branch
<mcsim> ok. I'll give you a link in a minute
<braunr> hm why balloc ?
<mcsim> Braun's allocator
<braunr> err
<braunr>
http://git.sceen.net/rbraun/x15mach.git/?a=blob;f=kern/kmem.c;h=37173fa0b48fc9d7e177bf93de531819210159ab;hb=HEAD
<braunr> mcsim: this is the interface i had in mind for a kernel version :)
<braunr> very similar to the original slab allocator interface actually
<braunr> well, you've been working
<mcsim> But I have a problem with this patch. When I apply it to gnumach
code from debian repository. I have to make a change in file ramdisk.c
with sed -i 's/kernel_map/\&kernel_map/' device/ramdisk.c
<mcsim> because in git repository there is no such file
<braunr> mcsim: how do you configure the kernel before building ?
<braunr> mcsim: you should keep in touch more often i think, so that you
get feedback from us and don't spend too much time "off course"
<mcsim> I didn't configure it. I just run dpkg-buildpackage -b.
<braunr> oh you build the debian package
<braunr> well my version was by configure --enable-kdb --enable-rtl8139
<braunr> and it seems stuck in an infinite loop during bootstrap
<mcsim> and printf doesn't work. The first function called by c_boot_entry
is printf(version).
<braunr> mcsim: also, you're invited to get the x15mach version of my
files, which are gplv2+ licensed
<braunr> be careful of my macros.h file, it can conflict with the
macros_help.h file from gnumach iirc
<mcsim> There were conflicts with MACRO_BEGIN and MACRO_END. But I solved
it
<braunr> ok
<braunr> it's tricky
<braunr> mcsim: try to find where the first use of the allocator is made
# IRC, freenode, #hurd, 2011-07-22
<mcsim> braunr, hello. Kernel with your allocator already compiles and
runs. There still some problems, but, certainly, I'm on the final stage
already. I hope I'll finish in a few days.
<tschwinge> mcsim: Oh, cool! Have you done some measurements already?
<mcsim> Not yet
<tschwinge> OK.
<tschwinge> But if it able to run a GNU/Hurd system, then that already is
something, a big milestone!
<braunr> nice
<braunr> although you'll probably need to tweak the garbage collecting
process
<mcsim> tschwinge: thanks
<mcsim> braunr: As back-end for allocating memory I use
kmem_alloc_wired. But in zalloc was an opportunity to use as back-end
kmem_alloc_pageable. Although there was no any zone that used
kmem_alloc_pageable. Do I need to implement this functionality?
<braunr> mcsim: do *not* use kmem_alloc_pageable()
<mcsim> braunr: Ok. This is even better)
<braunr> mcsim: in x15, i've taken this even further: there is *no* kernel
vm object, which means all kernel memory is wired and unmanaged
<braunr> making it fast and safe
<braunr> pageable kernel memory was useful back when RAM was really scarce
<braunr> 20 years ago
<braunr> but it's a source of deadlock
<mcsim> Indeed. I won't use kmem_alloc_pageable.
# IRC, freenode, #hurd, 2011-08-09
< braunr> mcsim: what's the "bug related to MEM_CF_VERIFY" you refer to in
one of your commits ?
< braunr> mcsim: don't use spin_lock_t as a member of another structure
< mcsim> braunr: I got confused with types in *_verify functions, so they
didn't work. Then I fixed it in the commit you mentioned.
< braunr> in gnumach, most types are actually structure pointers
< braunr> use simple_lock_data_t
< braunr> mcsim: ok
< mcsim> > use simple_lock_data_t
< mcsim> braunr: ok
< braunr> mcsim: don't make too many changes to the code base, and if
you're unsure, don't hesitate to ask
< braunr> also, i really insist you rename the allocator, as done in x15
for example
(http://git.sceen.net/rbraun/x15mach.git/?a=blob;f=vm/kmem.c), instead of
a name based on mine :/
< mcsim> braunr: Ok. It was just work name. When I finish I'll rename the
allocator.
< braunr> other than that, it's nice to see progress
< braunr> although again, it would be better with some reports along
< braunr> i won't be present at the meeting tomorrow unfortunately, but you
should use those to report the status of your work
< mcsim> braunr: You've said that I have to tweak gc process. Did you mean
to call mem_gc() when physical memory ends instead of calling it every x
seconds? Or something else?
< braunr> there are multiple topics, although only one that really matters
< braunr> study how zone_gc was called
< braunr> reclaiming memory should happen when there is pressure on the VM
subsystem
< braunr> but it shouldn't happen too often, otherwise there is thrashing
< braunr> and your caches become mostly useless
< braunr> the original slab allocator uses a 15-second period after a
reclaim during which reclaiming has no effect
< braunr> this allows having a somehow stable working set for this duration
< braunr> the linux slab allocator uses 5 seconds, but has a more
complicated reclaiming mechanism
< braunr> it releases memory gradually, and from reclaimable caches only
(dentry for example)
< braunr> for x15 i intend to implement the original 15 second interval and
then perform full reclaims
< mcsim> In zalloc mem_gc is called by vm_pageout_scan, but not more often
than once a second.
< mcsim> In balloc I've changed interval to once in 15 seconds.
< braunr> don't use the code as it is
< braunr> the version you've based your work on was meant for userspace
< braunr> where there isn't memory pressure
< braunr> so a timer is used to trigger reclaims at regular intervals
< braunr> it's different in a kernel
< braunr> mcsim: where did you see vm_pageout_scan call the zone gc once a
second ?
< mcsim> vm_pageout_scan calls consider_zone_gc and consider_zone_gc checks
if a second has passed.
< braunr> where ?
< mcsim> Then zone_gc can be called.
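
The "called under memory pressure, but not too often" policy can be sketched as a small rate limiter. This is an illustration of the idea only, not the body of `consider_zone_gc`: the structure, the function name and the tick-based clock are all hypothetical.

```c
/*
 * Hypothetical state for a rate-limited collector. Time is expressed in
 * abstract ticks (for example seconds since boot).
 */
struct toy_gc_state {
    unsigned long last_gc;  /* tick of the last collection */
    unsigned long interval; /* minimum number of ticks between collections */
    unsigned long nr_runs;  /* how many collections actually ran */
};

/*
 * Meant to be called by whoever detects memory pressure (the pageout
 * daemon in the discussion above). Returns 1 if a collection ran, 0 if
 * the call was ignored because the previous one was too recent.
 */
static int toy_consider_gc(struct toy_gc_state *s, unsigned long now)
{
    if (now - s->last_gc < s->interval)
        return 0;

    s->last_gc = now;
    s->nr_runs++; /* a real kernel would run the collector here */
    return 1;
}
```

With an interval of 15 ticks, calls at ticks 10, 15, 20 and 30 trigger a collection only at 15 and 30. The trigger remains memory pressure; the interval only bounds how often pressure is allowed to empty the caches.
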
< braunr> ah ok, it's in zaclloc.c then
< braunr> zalloc.c
< braunr> yes this function is fine
< mcsim> so old gc didn't consider vm pressure. Or I missed something.
< braunr> it did
< mcsim> how?
< braunr> well, it's called by the pageout daemon
< braunr> under memory pressure
< braunr> so it's fine
< mcsim> so if mem_gc is called by pageout daemon is it fine?
< braunr> it must be changed to do something similar to what
consider_zone_gc does
< mcsim> It does. mem_gc does the same work as consider_zone_gc and
zone_gc.
< braunr> good
< mcsim> so gc process is fine?
< braunr> should be
< braunr> i see mem.c only includes mem.h, which then includes other
headers
< braunr> don't do that
< braunr> always include all the headers you need where you need them
< braunr> if you need avltree.h in both mem.c and mem.h, include it in both
files
< braunr> and by the way, i recommend you use the red black tree instead of
the avl type
< braunr> (it's the same interface so it shouldn't take long)
< mcsim> As to report. If you won't be present at the meeting, I can tell
you what I have to do now.
< braunr> sure
< braunr> in addition, use GPLv2 as the license, the BSD one is meant for
the userspace version only
< braunr> GPLv2+ actually
< braunr> hm you don't need list.c
< braunr> it would only add dead code
< braunr> "Zone for dynamical allocator", don't mix terms
< braunr> this comment refers to a vm_map, so call it a map
< mcsim> 1. Change constructor for kentry_alloc_cache.
< mcsim> 2. Make measurements.
< mcsim> +
< mcsim> 3. Use simple_lock_data_t
< mcsim> 4. Replace license
< braunr> kentry_alloc_cache <= what is that ?
< braunr> cache for kernel map entries in vm_map ?
< braunr> the comment for mem_cpu_pool_get doesn't apply in gnumach, as
there is no kernel preemption
< braunr> "Don't attempt mem GC more frequently than hz/MEM_GC_INTERVAL
times a second.
< braunr> "
< mcsim> sorry. I meant vm_map_kentry_cache
< braunr> hm nothing actually about this comment
< braunr> mcsim: ok
< braunr> yes kernel map entries need special handling
< braunr> i don't know how it's done in gnumach though
< braunr> static preallocation ?
< mcsim> yes
< braunr> that's ugly :p
< mcsim> but it uses dynamic allocation further even for vm_map kernel
entries
< braunr> although such bootstrapping issues are generally difficult to
solve elegantly
< braunr> ah
< mcsim> now I use only static allocation, but I'll add dynamic allocation
too
< braunr> when you have time, mind the coding style (convert everything to
gnumach style, which mostly implies using tabs instead of 4-spaces
indentation)
< braunr> when you'll work on dynamic allocation for the kernel map
entries, you may want to review how it's done in x15
< braunr> the mem_source type was originally intended for that purpose, but
has slightly changed once the allocator was adapted to work in my kernel
< mcsim> ok
< braunr> vm_map_kentry_zone is the only zone created with ZONE_FIXED
< braunr> and it is zcram()'ed immediately after
< braunr> so you can consider it a statically allocated zone
< braunr> in x15 i use another strategy: there is a special kernel submap
named kentry_map which contains only one map entry (statically allocated)
< braunr> this map is the backend (mem_source) for the kentry_cache
< braunr> the kentry_cache is created with a special flag that tells it
memory can't be reclaimed
< braunr> when the cache needs to grow, the single map entry is extended to
cover the allocated memory
< braunr> it's similar to the way pmap_growkernel() works for kernel page
table pages
< braunr> (and is actually based on that idea)
< braunr> it's a compromise between full static and dynamic allocation
types
< braunr> the advantage is that the allocator code can be used (so there is
no need for a special allocator like in netbsd)
< braunr> the drawback is that some resources can never be returned to
their source (and under peaks, the amount of unfreeable resources could
become large, but this is unexpected)
< braunr> mcsim: for now you shouldn't waste your time with this
< braunr> i see the number of kernel map entries is fixed at 256
< braunr> and i've never seen the kernel use more than around 30 entries
< mcsim> Do you think that I have to leave this problem to the end?
< braunr> yes
# IRC, freenode, #hurd, 2011-08-11
< mcsim> braunr: Hello. Can you give me an advice how can I make
measurements better?
< braunr> mcsim: what kind of measurements
< mcsim> braunr: How much is your allocator better than zalloc.
< braunr> slightly :p
< braunr> that's why i never took the time to put it in gnumach
< mcsim> braunr: Just I thought that there are some rules or
recommendations of such measurements. Or I can do them any way I want?
< braunr> mcsim: i don't know
< braunr> mcsim: benchmarking is an art of its own, and i don't even know
how to use the bits of profiling code available in gnumach (if it still
works)
< antrik> mcsim: hm... are you saying you already have a running system
with slab allocator?... :-)
< braunr> mcsim: the main advantage i can see is the removal of many
arbitrary hard limits
< mcsim> antrik: yes
< antrik> \o/
< antrik> nice work!
< braunr> :)
< braunr> the cpu layer should also help a bit, but it's hard to measure
< braunr> i guess it could be seen on the ipc path for very small buffers
< mcsim> antrik: Thanks. But I still have to 1. Change constructor for
kentry_alloc_cache. and 2. Make measurements.
< braunr> and polish the whole thing :p
< antrik> mcsim: I'm not sure this can be measured... the performance
difference in any real life usage is probably just a few percent at most
-- it's hard to construct a benchmark giving enough precision so it's not
drowned in noise...
< antrik> perhaps it conserves some memory -- but that too would be hard to
measure I fear
< braunr> yes
< braunr> there *should* be better allocation times, less fragmentation,
better accounting ... :)
< braunr> and no arbitrary limits !
< antrik> :-)
< braunr> oh, and the self debugging features can be nice too
< mcsim> But I need to prove that my work wasn't useless
< braunr> well it wasn't, but that's hard to measure
< braunr> it's easy to prove though, since there are additional features
that weren't present in the zone allocator
< mcsim> Ok. If there are some profiling features in gnumach can you give
me a link with their description?
< braunr> mcsim: sorry, no
< braunr> mcsim: you could still write the basic loop test, which counts
the number of allocations performed in a fixed time interval
< braunr> but as it doesn't match many real life patterns, it won't be very
useful
< braunr> and i'm afraid that if you consider real life patterns, you'll
see how negligible the improvement can be compared to other operations
such as memory copies or I/O (ouch)
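
The "basic loop test" could look roughly like the userspace sketch below, which counts allocate/free pairs completed within a time budget. It uses `malloc()`/`free()` as placeholders for the allocator under test and `clock()` (CPU time) as the timer; the function name and the batch size of 1000 are arbitrary choices for the example.

```c
#include <stdlib.h>
#include <time.h>

/*
 * Count how many allocate/free pairs of `size` bytes complete within
 * roughly `seconds` of CPU time. malloc()/free() are placeholders for
 * whatever allocator is being measured.
 */
static unsigned long count_allocs(size_t size, double seconds)
{
    clock_t deadline = clock() + (clock_t)(seconds * CLOCKS_PER_SEC);
    unsigned long n = 0;
    int i;

    do {
        /* Work in batches so the clock is not read on every iteration. */
        for (i = 0; i < 1000; i++) {
            /* volatile keeps the compiler from removing the pair. */
            void *volatile p = malloc(size);

            if (p == NULL)
                return n;
            free(p);
            n++;
        }
    } while (clock() < deadline);

    return n;
}
```

As noted above, a loop like this always hits the fastest path of the allocator (the buffer just freed is immediately reused), so its result says little about real workloads.
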
< mcsim> Do network drivers use this allocator?
< mcsim> ok. I'll put together some tests and then I'll report results.
# IRC, freenode, #hurd, 2011-08-26
< mcsim> hello. Are there any analogs of copy_to_user and copy_from_user in
linux for gnumach?
< mcsim> Or how can I determine memory map if I know address? I need this
for vm_map_copyin
< guillem> mcsim: vm_map_lookup_entry?
< mcsim> guillem: but I need to transmit map to this function and it will
return an entry which contains specified address.
< mcsim> And I don't know what map have I transmit.
< mcsim> I need to transfer static array from kernel to user. What map
contains static data?
< antrik> mcsim: Mach doesn't have copy_{from,to}_user -- instead, large
chunks of data are transferred as out-of-line data in IPC messages
(i.e. using VM magic)
< mcsim> antrik: can you give me an example? I just found using
vm_map_copyin in host_zone_info.
< antrik> no idea what vm_map_copyin is to be honest...
# IRC, freenode, #hurd, 2011-08-27
< braunr> mcsim: the primitives are named copyin/copyout, and they are used
for messages with inline data
< braunr> or copyinmsg/copyoutmsg
< braunr> vm_map_copyin/out should be used for chunks larger than a page
(or roughly a page)
< braunr> also, when writing to a task space, see which is better suited:
vm_map_copyout or vm_map_copy_overwrite
< mcsim> braunr: and what will be src_map for vm_map_copyin/out?
< braunr> the caller map
< braunr> which you can get with current_map() iirc
< mcsim> braunr: thank you
< braunr> be careful not to leak anything in the transferred buffers
< braunr> memset() to 0 if in doubt
< mcsim> braunr:ok
< braunr> antrik: vm_map_copyin() is roughly vm_read()
< antrik> braunr: what is it used for?
< braunr> antrik: 01:11 < antrik> mcsim: Mach doesn't have
copy_{from,to}_user -- instead, large chunks of data are transferred as
out-of-line data in IPC messages (i.e. using VM magic)
< braunr> antrik: that "VM magic" is partly implemented using vm_map_copy*
functions
< antrik> braunr: oh, you mean it doesn't actually copy data, but only page
table entries? if so, that's *not* really comparable to
copy_{from,to}_user()...
# IRC, freenode, #hurd, 2011-08-28
< braunr> antrik: the equivalent of copy_{from,to}_user are
copy{in,out}{,msg}
< braunr> antrik: but when the data size is about a page or more, it's
better not to copy, of course
< antrik> braunr: it's actually not clear at all that it's really better to
do VM magic than to copy...
# IRC, freenode, #hurd, 2011-08-29
< braunr> antrik: at least, that used to be the general idea, and with a
simpler VM i suspect it's still true
< braunr> mcsim: did you progress on your host_zone_info replacement ?
< braunr> mcsim: i think you should stick to what the original
implementation did
< braunr> which is making an inline copy if caller provided enough space,
using kmem_alloc_pageable otherwise
< braunr> specify ipc_kernel_map if using kmem_alloc_pageable
< mcsim> braunr: yes. And it works. But I use kmem_alloc, not pageable. Is
it worse?
< mcsim> braunr: host_zone_info replacement is pushed to savannah
repository.
< braunr> mcsim: i'll have a look
< mcsim> braunr: I've pushed one more commit just now, which has attitude
to host_zone_info.
< braunr> mem_alloc_early_init should be renamed mem_bootstrap
< mcsim> ok
< braunr> mcsim: i don't understand your call to kmem_free
< mcsim> braunr: It shouldn't be there?
< braunr> why should it be there ?
< braunr> you're freeing what the copy object references
< braunr> it's strange that it even works
< braunr> also, you shouldn't pass infop directly as the copy object
< braunr> i guess you get a warning for that
< braunr> do what the original code does: use an intermediate copy object
and a cast
< mcsim> ok
< braunr> another error (without consequence but still, you should mind it)
< braunr> simple_lock(&mem_cache_list_lock);
< braunr> [...]
< braunr> kr = kmem_alloc(ipc_kernel_map, &info, info_size);
< braunr> you can't hold simple locks while allocating memory
< braunr> read how the original implementation works around this
< mcsim> ok
< braunr> i guess host_zone_info assumes the zone list doesn't change much
while unlocked
< braunr> or that it's rather unimportant since it's for debugging
< braunr> a strict snapshot isn't required
< braunr> list_for_each_entry(&mem_cache_list, cache, node) max_caches++;
< braunr> you should really use two separate lines for readability
< braunr> also, instead of counting each time, you could just maintain a
global counter
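
One common way to avoid allocating while a lock is held, combined with the global counter suggested above, is sketched below. It is a user-space model under stated assumptions: `lock_held` merely models a simple lock, `malloc()` stands in for `kmem_alloc`, and all names are hypothetical. It is not claimed to be what `host_zone_info` actually does; it only illustrates the "non-strict snapshot" idea from the discussion.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_CACHES 8            /* hypothetical upper bound for the sketch */

static int lock_held;           /* models the state of a simple lock */
static int nr_caches;           /* global counter, updated at cache creation */
static int cache_ids[MAX_CACHES];

static void list_lock(void)   { assert(!lock_held); lock_held = 1; }
static void list_unlock(void) { assert(lock_held);  lock_held = 0; }

void cache_register(int id)
{
    list_lock();
    assert(nr_caches < MAX_CACHES);
    cache_ids[nr_caches++] = id;
    list_unlock();
}

/* Return a malloc'ed snapshot of the list; *countp receives its length. */
int *cache_snapshot(int *countp)
{
    int wanted, i, *buf;

    list_lock();
    wanted = nr_caches;         /* read the size under the lock */
    list_unlock();

    assert(!lock_held);         /* never allocate with the lock held */
    buf = malloc((size_t)(wanted > 0 ? wanted : 1) * sizeof(*buf));

    if (buf == NULL) {
        *countp = 0;
        return NULL;
    }

    list_lock();
    /* The list may have grown meanwhile: copy no more than was allocated. */
    for (i = 0; i < wanted && i < nr_caches; i++)
        buf[i] = cache_ids[i];
    list_unlock();

    *countp = i;
    return buf;
}
```

The snapshot may miss caches created between the two locked sections, which is acceptable for a debugging interface where a strict snapshot isn't required.
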
< braunr> mcsim: use strncpy instead of strcpy for the cache names
< braunr> not to avoid overflow but rather to clear the unused bytes at the
end of the buffer
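
The point about `strncpy` clearing the unused bytes can be illustrated with a small sketch. The structure and the `NAME_SIZE` value are hypothetical; what matters is that `strncpy` pads the destination with null bytes up to the given length, so no stale kernel data is left in a buffer that is later handed to user space.

```c
#include <assert.h>
#include <string.h>

#define NAME_SIZE 16            /* hypothetical size of the name field */

struct cache_info {
    char name[NAME_SIZE];
};

void set_cache_name(struct cache_info *info, const char *name)
{
    assert(info != NULL && name != NULL);

    /*
     * strncpy() copies the name and fills the rest of the first
     * NAME_SIZE - 1 bytes with '\0'. The last byte is set explicitly
     * so the result is always terminated, even for overlong names.
     */
    strncpy(info->name, name, NAME_SIZE - 1);
    info->name[NAME_SIZE - 1] = '\0';
}
```

With `strcpy`, the bytes after the terminator would keep whatever the buffer contained before.
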
< braunr> mcsim: about kmem_alloc vs kmem_alloc_pageable, it's a minor
issue
< braunr> you're handing off debugging data to a userspace application
< braunr> a rather dull reporting tool in most cases, which doesn't require
wired down memory
< braunr> so in order to better use available memory, pageable memory
should be used
< braunr> in the future i guess it could become a not-so-minor issue though
< mcsim> ok. I'll fix it
< braunr> mcsim: have you tried to run the kernel with MC_VERIFY always on
?
< braunr> MEM_CF_VERIFY actually
< mcsim1> yes.
< braunr> oh
< braunr> nothing wrong
< braunr> ?
< mcsim1> it is always set
< braunr> ok
< braunr> ah, you set it in macros.h ..
< braunr> don't
< braunr> put it in mem.c if you want, or better, make it a compile-time
option
< braunr> macros.h is a tiny macro library, it shouldn't define such
unrelated options
< mcsim1> ok.
< braunr> mcsim1: did you try fault injection to make sure the checking
code actually works and how it behaves when an error occurs ?
< mcsim1> I think that when I finish I'll merge files cpu.h and macros.h
with mem.c
< braunr> yes that would simplify things
< mcsim1> Yes. When I confused with types mem_buf_fill worked wrong and
panic occurred.
< braunr> very good
< braunr> have you progressed concerning the measurements you wanted to do
?
< mcsim1> not much.
< braunr> ok
< mcsim1> I think they will be ready in a few days.
< antrik> what measurements are these?
< mcsim1> braunr: What maximal size for static data and stack in kernel?
< braunr> what do you mean ?
< braunr> kernel stacks are one page if i'm right
< braunr> static data (rodata+data+bss) are limited by grub bugs only :)
< mcsim1> braunr: probably they are present, because when I created too big
array I couldn't boot kernel
< braunr> local variable or static ?
< mcsim1> static
< braunr> how large ?
< mcsim1> 4Mb
< braunr> hm
< braunr> it's not a grub bug then
< braunr> i was able to embed as much as 32 MiB in x15 while doing this
kind of tests
< braunr> I guess it's the gnu mach boot code which only preallocates one
page for the initial kernel mapping
< braunr> one PTP (page table page) maps 4 MiB
< braunr> (x15 does this completely dynamically, unlike mach or even
current BSDs)
< mcsim1> antrik: First I want to measure time of each cache
creation/allocation/deallocation and then compile kernel.
< braunr> cache creation is irrelevant
< braunr> because of the cpu pools in the new allocator, you should test at
least two different allocation patterns
< braunr> one with quick allocs/frees
< braunr> the other with large numbers of allocs then their matching frees
< braunr> (larger being at least 100)
< braunr> i'd say the cpu pool layer is the real advantage over the
previous zone allocator
< braunr> (from a performance perspective)
< mcsim1> But there is only one cpu
< braunr> it doesn't matter
< braunr> it's still a very effective cache
< braunr> in addition to reducing contention
< braunr> compare mem_cpu_pool_pop() against mem_cache_alloc_from_slab()
< braunr> mcsim1: work is needed to polish the whole thing, but getting it
actually working is a nice achievement for someone new on the project
< braunr> i hope it helped you learn about memory allocation, virtual
memory, gnu mach and the hurd in general :)
< antrik> indeed :-)
# IRC, freenode, #hurd, 2011-09-06
[some performance testing]
<braunr> i'm not sure such long tests are relevant but let's assume balloc
is slower
<braunr> some tuning is needed here
<braunr> first, we can see that slab allocation occurs more often in balloc
than page allocation does in zalloc
<braunr> so yes, as slab allocation is slower (have you measured which part
actually is slow ? i guess it's the kmem_alloc call)
<braunr> the whole process gets a bit slower too
<mcsim> I used alloc_size = 4096 for zalloc
<braunr> i don't know what that is exactly
<braunr> but you can't hold 500 16 bytes buffers in a page so zalloc must
have had free pages around for that
<mcsim> I use kmem_alloc_wired
<braunr> if you have time, measure it, so that we know how much it accounts
for
<braunr> where are the results for dealloc ?
<mcsim> I can't give you result right now because internet works very
bad. But for first DEALLOC result are the same, except some cases when it
takes balloc for more than 1000 ticks
<braunr> must be the transfer from the cpu layer to the slab layer
<mcsim> as to kmem_alloc_wired. I think zalloc uses this function too for
allocating objects in zone I test.
<braunr> mcsim: yes, but less frequently, which is why it's faster
<braunr> mcsim: another very important aspect that should be measured is
memory consumption, have you looked into that ?
<mcsim> I think that I made too little iterations in test SMALL
<mcsim> If I increase constant SMALL_TESTS will it be good enough?
<braunr> mcsim: i don't know, try both :)
<braunr> if you increase the number of iterations, balloc average time will
be lower than zalloc, but this doesn't remove the first long
initialization step on the allocated slab
<mcsim> SMALL_TESTS to 500, I mean
<braunr> i wonder if maintaining the slabs sorted through insertion sort is
what makes it slow
<mcsim> braunr: where do you sort slabs? I don't see this.
<braunr> mcsim: mem_cache_alloc_from_slab and its free counterpart
<braunr> mcsim: the mem_source stuff is useless in gnumach, you can remove
it and directly call the kmem_alloc/free functions
<mcsim> But I have to make special allocator for kernel map entries.
<braunr> ah right
<mcsim> btw. It turned out that 256 entries are not enough.
<braunr> that's weird
<braunr> i'll make a patch so that the mem_source code looks more like what
i have in x15 then
<braunr> about the results, i don't think the slab layer is that slow
<braunr> it's the cpu_pool_fill/drain functions that take time
<braunr> they preallocate many objects (64 for your objects size if i'm
right) at once
<braunr> mcsim: look at the first result page: some times, a number around
8000 is printed
<braunr> the common time (ticks, whatever) for a single object is 120
<braunr> 8132/120 is 67, close enough to the 64 value
<mcsim> I forgot about SMALL tests here are they:
http://paste.debian.net/128533/ (balloc) http://paste.debian.net/128534/
(zalloc)
<mcsim> braunr: why do you divide 8132 by 120?
<braunr> mcsim: to see if it matches my assumption that the ~8000 number
matches the cpu_pool_fill call
<mcsim> braunr: I've got it
<braunr> mcsim: i'd be much interested in the dealloc results if you can
paste them too
<mcsim> dealloc: http://paste.debian.net/128589/
http://paste.debian.net/128590/
<braunr> mcsim: thanks
<mcsim> second dealloc: http://paste.debian.net/128591/
http://paste.debian.net/128592/
<braunr> mcsim: so the main conclusion i retain from your tests is that the
transfers from the cpu and the slab layers are what makes the new
allocator a bit slower
<mcsim> OPERATION_SMALL dealloc: http://paste.debian.net/128593/
http://paste.debian.net/128594/
<braunr> mcsim: what needs to be measured now is global memory usage
<mcsim> braunr: data from /proc/vmstat after kernel compilation will be
enough?
<braunr> mcsim: let me check
<braunr> mcsim: no it won't do, you need to measure kernel memory usage
<braunr> the best moment to measure it is right after zone_gc is called
<mcsim> Are there any facilities in gnumach for memory measurement?
<braunr> it's specific to the allocators
<braunr> just count the number of used pages
<braunr> after garbage collection, there should be no free page, so this
should be rather simple
<mcsim> ok
<mcsim> braunr: When I measure memory usage in balloc, what formula is
better cache->nr_slabs * cache->bufs_per_slab * cache->buf_size or
cache->nr_slabs * cache->slab_size?
<braunr> the latter
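
The difference between the two formulas can be sketched as follows. The field names come from mcsim's question; the structure itself is a hypothetical subset used only for illustration. The first function counts whole slabs, which is what is actually taken from the VM system, while the second counts only buffer payload and so ignores per-slab overhead and padding.

```c
#include <assert.h>
#include <stddef.h>

struct cache_stats {            /* hypothetical subset of per-cache fields */
    size_t nr_slabs;
    size_t bufs_per_slab;
    size_t buf_size;
    size_t slab_size;
};

/* Memory taken from the VM system: whole slabs ("the latter"). */
size_t cache_footprint(const struct cache_stats *c)
{
    assert(c != NULL);
    return c->nr_slabs * c->slab_size;
}

/* Bytes usable as buffers; excludes slab descriptors and padding. */
size_t cache_payload(const struct cache_stats *c)
{
    assert(c != NULL);
    return c->nr_slabs * c->bufs_per_slab * c->buf_size;
}
```

Summing `cache_footprint()` over all caches gives the page-level usage asked for in the discussion.
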
# IRC, freenode, #hurd, 2011-09-07
<mcsim> braunr: I've disabled calling of mem_cpu_pool_fill and allocator
became faster
<braunr> mcsim: sounds nice
<braunr> mcsim: i suspect the free path might not be as fast though
<mcsim> results for first calling: http://paste.debian.net/128639/ second:
http://paste.debian.net/128640/ and with many alloc/free:
http://paste.debian.net/128641/
<braunr> mcsim: thanks
<mcsim> best result are for second call: average time decreased from 159.56
to 118.756
<mcsim> First call slightly worse, but this is because I've added some
profiling code
<braunr> i still see some ~8k lines in 128639
<braunr> even some around ~12k
<mcsim> I think this is because of mem_cache_grow I'm investigating it now
<braunr> i guess so too
<mcsim> I've measured time for first call in cache and from about 22000
mem_cache_grow takes 20000
<braunr> how did you change the code so that it doesn't call
mem_cpu_pool_fill ?
<braunr> is the cpu layer still used ?
<mcsim> http://paste.debian.net/128644/
<braunr> don't forget the free path
<braunr> mcsim: anyway, even with the previous slightly slower behaviour we
could observe, the performance hit is negligible
<mcsim> Is free path a compilation? (I'm sorry for my english)
<braunr> mcsim: mem_cache_free
<braunr> mcsim: the last two measurements i'd advise are with big (>4k)
object sizes and, really, kernel allocator consumption
<mcsim> http://paste.debian.net/128648/ http://paste.debian.net/128646/
http://paste.debian.net/128649/ (first, second, small)
<braunr> mcsim: these numbers are closer to the zalloc ones, aren't they ?
<mcsim> deallocating slightly faster too
<braunr> it may not be the case with larger objects, because of the use of
a tree
<mcsim> yes, they are closer
<braunr> but then, i expect some space gains
<braunr> the whole thing is about compromise
<mcsim> ok. I'll try to measure them today. Anyway I'll post result and you
could read them in the morning
<braunr> at least, it shows that the zone allocator was actually quite good
<braunr> i don't like how the code looks, there are various hacks here and
there, it lacks self inspection features, but it's quite good
<braunr> and there was little room for true improvement in this area, like
i told you :)
<braunr> (my allocator, like the current x15 dev branch, focuses on mp
machines)
<braunr> mcsim: thanks again for these numbers
<braunr> i wouldn't have had the courage to make the tests myself before
some time eh
<mcsim> braunr: hello. Look at the small_4096 results
http://paste.debian.net/128692/ (balloc) http://paste.debian.net/128693/
(zalloc)
<braunr> mcsim: wow, what's that ? :)
<braunr> mcsim: you should really really include your test parameters in
the report
<braunr> like object size, purpose, and other similar details
<mcsim> for balloc I specified only object_size = 4096
<mcsim> for zalloc object_size = 4096, alloc_size = 4096, memtype = 0;
<braunr> the results are weird
<braunr> apart from the very strange numbers (e.g. 0 or 4429543648), none
is around 3k, which is the value matching a kmem_alloc call
<braunr> happy to see balloc behaves quite good for this size too
<braunr> s/good/well/
<mcsim> Oh
<mcsim> here is significant only first 101 lines
<mcsim> I'm sorry
<braunr> ok
<braunr> what does the test do again ? 10 loops of 10 allocs/frees ?
<mcsim> yes
<braunr> ok, so the only slowdown is at the beginning, when the slabs are
created
<braunr> the two big numbers (31844 and 19548) are strange
<mcsim> on the other hand time of compilation is
<mcsim> balloc zalloc
<mcsim> 38m28.290s 38m58.400s
<mcsim> 38m38.240s 38m42.140s
<mcsim> 38m30.410s 38m52.920s
<braunr> what are you compiling ?
<mcsim> gnumach kernel
<braunr> in 40 mins ?
<mcsim> yes
<braunr> you lack hvm i guess
<mcsim> is it long?
<mcsim> I use real PC
<braunr> very
<braunr> ok
<braunr> so it's normal
<mcsim> in vm it was about 2 hours)
<braunr> the difference really is negligible
<braunr> ok i can explain the big numbers
<braunr> the slab size depends on the object size, and for 4k, it is 32k
<braunr> you can store 8 4k buffers in a slab (lines 2 to 9)
<mcsim> so we need use kmem_alloc_* 8 times?
<braunr> on line 10, the ninth object is allocated, which adds another slab
to the cache, hence the big number
<braunr> no, once for a size of 32k
<braunr> and then the free list is initialized, which means accessing those
pages, which means tlb misses
<braunr> i guess the zone allocator already has free pages available
<mcsim> I see
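
The explanation of the big numbers can be checked with a little arithmetic. The sketch below uses the sizes quoted above (4k objects, 32k slabs) and assumes all previously allocated objects are still live and that the slab descriptor is stored outside the slab, so that exactly 8 buffers fit. The function names are hypothetical.

```c
#include <assert.h>

#define SLAB_SIZE     (32 * 1024)
#define OBJ_SIZE      (4 * 1024)
#define BUFS_PER_SLAB (SLAB_SIZE / OBJ_SIZE)    /* 8 */

/* Number of slabs needed to hold n live objects. */
unsigned int slabs_needed(unsigned int n)
{
    return (n + BUFS_PER_SLAB - 1) / BUFS_PER_SLAB;
}

/* Does the k-th allocation (1-based) have to add a slab to the cache? */
int alloc_grows_cache(unsigned int k)
{
    assert(k >= 1);
    return slabs_needed(k) > slabs_needed(k - 1);
}
```

The first and ninth allocations each grow the cache with a single 32k request, which matches the two slow entries in the results; the others are served from an existing slab.
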
<braunr> i think you can stop performance measurements, they show the
allocator is slightly slower, but so slightly we don't care about that
<braunr> we need numbers on memory usage now (at the page level)
<braunr> and this isn't easy
<mcsim> For balloc I can get numbers if I summarize nr_slabs*slab_size for
each cache, isn't it?
<braunr> yes
<braunr> you can have a look at the original implementation, function
mem_info
<mcsim> And for zalloc I have to summarize of cur_size and then add
zalloc_wasted_space?
<braunr> i don't know :/
<braunr> i think the best moment to obtain accurate values is after zone_gc
removes the collected pages
<braunr> for both allocators, you could fill a stats structure at that
moment, and have an rpc copy that structure when a client tool requests
it
<braunr> concerning your tests, there is another point to have in mind
<braunr> the very first loop in your code shows a result of 31844
<braunr> although you disabled the call to cpu_pool_fill
<braunr> but the reason why it's so long is that the cpu layer still exists
<braunr> and if you look carefully, the cpu pools are created as needed on
the free path
<mcsim> I removed cpu_pool_drain
<braunr> but not cpu_pool_push/pop i guess
<mcsim> http://paste.debian.net/128698/
<braunr> see, you still allocate the cpu pool array on the free path
<mcsim> but I don't fill it
<braunr> that's not the point
<braunr> it uses mem_cache_alloc
<braunr> so in a call to free, you can also have an allocation, that can
potentially create a new slab
<mcsim> I see, so I have to create cpu_pool at the initialization stage?
<braunr> no, you can't
<braunr> there is a reason why they're allocated on the free path
<braunr> but since you don't have the fill/drain functions, i wonder if you
should just comment out the whole cpu layer code
<braunr> but hmm
<braunr> no really, it's not worth the effort
<braunr> even with drains/fills, the results are really good enough
<braunr> it makes the allocator smp ready
<braunr> we should just keep it that way
<braunr> mcsim: fyi, the reason why cpu pool arrays are allocated on the
free path is to avoid recursion
<braunr> because cpu pool arrays are allocated from caches just as almost
everything else
<mcsim> ok
<mcsim> sum of cur_size and then adding zalloc_wasted_space gives 0x4e1954
<mcsim> but this value isn't even page aligned
<mcsim> For balloc I've got 0x4c6000 0x4aa000 0x48d000
<braunr> hm can you report them in decimal, >> 10 so that values are in KiB
?
<mcsim> 4888 4776 4660 for balloc
<mcsim> 4998 for zalloc
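
The conversion braunr asks for, and the page alignment check mcsim mentions, can be sketched like this. The 4 KiB page size is an assumption (it is the usual value on i386), and the helper names are hypothetical.

```c
#include <assert.h>

#define PAGE_SIZE 4096UL        /* assumed page size */

/* Convert a byte count to KiB, truncating (the ">> 10" from the chat). */
unsigned long bytes_to_kib(unsigned long bytes)
{
    return bytes >> 10;
}

/* A total made only of whole pages has its low 12 bits clear. */
int is_page_aligned(unsigned long bytes)
{
    assert(PAGE_SIZE != 0 && (PAGE_SIZE & (PAGE_SIZE - 1)) == 0);
    return (bytes & (PAGE_SIZE - 1)) == 0;
}
```

Applied to the values above, the three balloc totals are whole numbers of pages, whereas the zalloc total 0x4e1954 is not, which is why that figure looked suspicious.
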
<braunr> when ?
<braunr> after boot ?
<mcsim> boot, compile, zone_gc
<mcsim> and then measure
<braunr> ?
<mcsim> I call garbage collector before measuring
<mcsim> and I measure after kernel compilation
<braunr> i thought it took you 40 minutes
<mcsim> for balloc I got results at night
<braunr> oh so you already got them
<braunr> i can't believe the kernel only consumes 5 MiB
<mcsim> before gc it takes about 9052 Kib
<braunr> can i see the measurement code ?
<braunr> oh, and how much ram does your machine have ?
<mcsim> 758 mb
<mcsim> 768
<braunr> that's really weird
<braunr> i'd expect the kernel to consume much more space
<mcsim> http://paste.debian.net/128703/
<mcsim> it's only dynamically allocated data
<braunr> yes
<braunr> ipc ports, rights, vm map entries, vm objects, and lots of other
hanging buffers
<braunr> about how much is zalloc_wasted_space ?
<braunr> if it's small or constant, i guess you could ignore it
<mcsim> about 492
<mcsim> KiB
<braunr> well it's another good point, mach internal structures don't imply
much overhead
<braunr> or, the zone allocator is underused
<tschwinge> mcsim, braunr: The memory allocator project is coming along
good, as I get from your IRC messages?
<braunr> tschwinge: yes, but as expected, improvements are minor
<tschwinge> But at the very least it's now well-known, maintainable code.
<braunr> yes, it's readable, easier to understand, provides self inspection
and is smp ready
<braunr> there also are less hacks, but a few less features (there are no
way to avoid sleeping so it's unusable - and unused - in interrupt
handlers)
<braunr> is* no way
<braunr> tschwinge: mcsim did a good job porting and measuring it
# IRC, freenode, #hurd, 2011-09-08
<antrik> braunr: note that the zalloc map used to be limited to 8 MiB or
something like that a couple of years ago... so it doesn't seem
surprising that the kernel uses "only" 5 MiB :-)
<antrik> (yes, we had a *lot* of zalloc panics back then...)
# IRC, freenode, #hurd, 2011-09-14
<mcsim> braunr: hello. I've written a constructor for kernel map entries
and it can return resources to their source. Can you have a look at it?
http://paste.debian.net/130037/ If all be OK I'll push it tomorrow.
<braunr> mcsim: send the patch through mail please, i'll apply it on my
copy
<braunr> are you sure the cache is reapable ?
<mcsim> All slabs, except first I allocate with kmem_alloc_wired.
<braunr> how can you be sure ?
<mcsim> First slab I allocate during bootstrap and use pmap_steal_memory
and further I use only kmem_alloc_wired
<braunr> no, you use kmem_free
<braunr> in kentry_dealloc_cache()
<braunr> which probably creates a recursion
<braunr> using the constructor this way isn't a good idea
<braunr> constructors are good for preconstructed state (set counters to 0,
init lists and locks, that kind of things, not allocating memory)
<braunr> i don't think you should try to make this special cache reapable
<braunr> mcsim: keep in mind constructors are applied on buffers at *slab*
creation, not at object allocation
<braunr> so if you allocate a single slab with, say, 50 or 100 objects per
slab, kmem_alloc_wired would be called that number of times
<mcsim> why kentry_dealloc_cache can create recursion? kentry_dealloc_cache
is called only by mem_cache_reap.
<braunr> right
<braunr> but are you totally sure mem_cache_reap() can't be called by
kmem_free() ?
<braunr> i think you're right, it probably can't
# IRC, freenode, #hurd, 2011-09-25
<mcsim> braunr: hello. I rewrote constructor for kernel entries and seems
that it works fine. I think that this was last milestone. Only moving of
memory allocator sources to more appropriate place and merge with main
branch left.
<braunr> mcsim: it needs renaming and reindenting too
<mcsim> for reindenting C-x h Tab in emacs will be enough?
<braunr> mcsim: make sure which style must be used first
<mcsim> and what should I rename and where better to place allocator? For
example, there is no lib directory, like in x15. Should I create it and
move list.* and rbtree.* to lib/ or move these files to util/ or
something else?
<braunr> mcsim: i told you balloc isn't a good name before, use something
more meaningful (kmem is already used in gnumach unfortunately if i'm
right)
<braunr> you can put the support files in kern/
<mcsim> what about vm_alloc?
<braunr> you should prefix it with vm_
<braunr> shouldn't
<braunr> it's a top level allocator
<braunr> on top of the vm system
<braunr> maybe mcache
<braunr> hm no
<braunr> maybe just km_
<mcsim> kern/km_alloc.*?
<braunr> no
<braunr> just km
<mcsim> ok.
# IRC, freenode, #hurd, 2011-09-27
<mcsim> braunr: hello. When I've tried to speed of new allocator and bad
I've removed function mem_cpu_pool_fill. But you've said to undo this. I
don't understand why this function is necessary. Can you explain it,
please?
<mcsim> When I've tried to compare speed of new allocator and old*
<braunr> i'm not sure i said that
<braunr> i said the performance overhead is negligible
<braunr> so it's better to leave the cpu pool layer in place, as it almost
doesn't hurt
<braunr> you can implement the KMEM_CF_NO_CPU_POOL I added in the x15 mach
version
<braunr> so that cpu pools aren't used by default, but the code is present
in case smp is implemented
<mcsim> I didn't remove cpu pool layer. I've just removed filling of cpu
pool during creation of slab.
<braunr> how do you fill the cpu pools then ?
<mcsim> If object is freed then it is added to cpu pool
<braunr> so you don't fill/drain the pools ?
<braunr> you try to get/put an object and if it fails you directly fall
back to the slab layer ?
<mcsim> I drain them during garbage collection
<braunr> oh
<mcsim> yes
<braunr> you shouldn't touch the cpu layer during gc
<braunr> the number of objects should be small enough so that we don't care
much
<mcsim> ok. I can drain cpu pool at any other time if it is prohibited to
in mem_gc.
<mcsim> But why do we need to fill cpu pool during slab creation?
<mcsim> In this case allocation consist of: get object from slab -> put it
to cpu pool -> get it from cpu pool
<mcsim> I've just removed the last two stages
<braunr> hm cpu pools aren't filled at slab creation
<braunr> they're filled when they're empty, and drained when they're full
<braunr> so that the number of objects they contain is increased/reduced to
a value suitable for the next allocations/frees
<braunr> the idea is to fall back as little as possible to the slab layer
because it requires the acquisition of the cache lock
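
The fill-when-empty, drain-when-full behaviour described here can be modelled with a short user-space sketch. Everything in it is an assumption made for illustration: `POOL_SIZE` and `TRANSFER_SIZE` are arbitrary, `malloc()`/`free()` stand in for the slab layer, and `slab_lock_acquisitions` simply counts how often the (imaginary) cache lock would be taken. It is not the code from the allocator being discussed.

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_SIZE     8         /* hypothetical pool capacity */
#define TRANSFER_SIZE 4         /* hypothetical batch per fill/drain */
#define OBJ_SIZE      16        /* hypothetical object size */

struct cpu_pool {
    int nr_objs;
    void *objs[POOL_SIZE];
};

static int slab_lock_acquisitions;  /* times the cache lock would be taken */

/* Move a batch of objects from the slab layer into the pool. */
static void pool_fill(struct cpu_pool *pool)
{
    slab_lock_acquisitions++;       /* one lock round trip per batch */

    for (int i = 0; i < TRANSFER_SIZE && pool->nr_objs < POOL_SIZE; i++) {
        void *obj = malloc(OBJ_SIZE);   /* stand-in for the slab layer */

        if (obj == NULL)
            break;

        pool->objs[pool->nr_objs++] = obj;
    }
}

/* Return a batch of objects from the pool to the slab layer. */
static void pool_drain(struct cpu_pool *pool)
{
    slab_lock_acquisitions++;

    for (int i = 0; i < TRANSFER_SIZE && pool->nr_objs > 0; i++)
        free(pool->objs[--pool->nr_objs]);
}

void *pool_alloc(struct cpu_pool *pool)
{
    assert(pool != NULL);

    if (pool->nr_objs == 0)
        pool_fill(pool);            /* empty: refill before popping */

    if (pool->nr_objs == 0)
        return NULL;

    return pool->objs[--pool->nr_objs];
}

void pool_free(struct cpu_pool *pool, void *obj)
{
    assert(pool != NULL && obj != NULL);

    if (pool->nr_objs == POOL_SIZE)
        pool_drain(pool);           /* full: make room before pushing */

    pool->objs[pool->nr_objs++] = obj;
}
```

In this model, four consecutive allocations cost a single lock acquisition instead of four, which is the benefit on an SMP system. On a uniprocessor, where the lock is a no-op, the batching only adds work.
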
<mcsim> oh. You're right. I'm really sorry. The point is that if cpu pool
is empty we don't need to fill it first
<braunr> uh, yes we do :)
<mcsim> Why cache locking is so undesirable? If we have free objects in
slabs locking will not take a lot of time.
<braunr> mcsim: it's undesirable on a smp system
<mcsim> ok.
<braunr> mcsim: and spin locks are normally noops on a up system
<braunr> which is the case in gnumach, hence the slightly better
performances without the cpu layer
<braunr> but i designed this allocator for x15, which only supports mp
systems :)
<braunr> mcsim: sorry i couldn't look at your code, sick first, busy with
server migration now (new server almost ready for xen hurds :))
<mcsim> ok.
<mcsim> I ended with allocator if didn't miss anything important:)
<braunr> i'll have a look soon i hope :)
# IRC, freenode, #hurd, 2011-09-27
<antrik> braunr: would it be realistic/useful to check during GC whether
all "used" objects are actually in a CPU pool, and if so, destroy them so
the slab can be freed?...
<antrik> mcsim: BTW, did you ever do any measurements of memory
use/fragmentation?
<mcsim> antrik: I couldn't do this for zalloc
<antrik> oh... why not?
<antrik> (BTW, I would be interested in a comparison between using the CPU
layer, and bare slab allocation without CPU layer)
<mcsim> Result I've got were strange. It wasn't even aligned to page size.
<mcsim> Probably is it better to look into /proc/vmstat?
<mcsim> Because I put hooks in the code and probably I missed something
<antrik> mcsim: I doubt vmstat would give enough information to make any
useful comparison...
<braunr> antrik: isn't this draining cpu pools at gc time ?
<braunr> antrik: the cpu layer was found to add a slight overhead compared
to always falling back to the slab layer
<antrik> braunr: my idea is only to drop entries from the CPU cache if they
actually prevent slabs from being freed... if other objects in the slab
are really in use, there is no point in flushing them from the CPU cache
<antrik> braunr: I meant comparing the fragmentation with/without CPU
layer. the difference in CPU usage is probably negligible anyways...
<antrik> you might remember that I was (and still am) sceptical about CPU
layer, as I suspect it worsens the good fragmentation properties of the
pure slab allocator -- but it would be nice to actually check this :-)
<braunr> antrik: right
<braunr> antrik: the more i think about it, the more i consider slqb to be
a better solution ...... :>
<braunr> an idea for when there's time
<braunr> eh
<antrik> hehe :-)
# IRC, freenode, #hurd, 2011-10-13
<braunr> mcsim: what's the current state of your gnumach branch ?
<mcsim> I've merged it with master in September
<braunr> yes i've seen that, but does it build and run fine ?
<mcsim> I've tested it on gnumach from debian repository, but for building
I had to make additional change in device/ramdisk.c, as I mentioned.
<braunr> mcsim: why ?
<mcsim> And it runs fine for me.
<braunr> mcsim: why did you need to make other changes ?
<mcsim> because there is a patch which comes with from-debian-repository
kernel and it adds some code, where I have to make changes. Earlier
kernel_map was a pointer to structure, but I change that and now
kernel_map is structure. So handling to it should be by taking the
address (&kernel_map)
<braunr> why did you do that ?
<braunr> or put it another way: what made you do that type change on
kernel_map ?
<mcsim> Earlier memory for kernel_map was allocating with zalloc. But now
salloc can't allocate memory before its initialisation
<braunr> that's not a good reason
<braunr> a simple workaround for your problem is this :
<braunr> static struct vm_map kernel_map_store;
<braunr> vm_map_t kernel_map = &kernel_map_store;
<mcsim> braunr: Ok. I'll correct this.
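
The workaround keeps `kernel_map` a pointer, so existing callers need no change, while the storage behind it is static and therefore available before the allocator is initialised. A self-contained sketch, with a stand-in `struct vm_map` and a hypothetical `map_entry_count` caller:

```c
#include <assert.h>
#include <stddef.h>

struct vm_map {                 /* stand-in for the real structure */
    int nentries;
};
typedef struct vm_map *vm_map_t;

/* Static storage: exists from boot, no allocator needed. */
static struct vm_map kernel_map_store;

/* Same type as before, so callers keep passing kernel_map, not &kernel_map. */
vm_map_t kernel_map = &kernel_map_store;

/* Hypothetical caller taking a map pointer, as existing code does. */
int map_entry_count(vm_map_t map)
{
    assert(map != NULL);
    return map->nentries;
}
```

Code such as the Debian ramdisk patch mentioned above, which passes `kernel_map` directly, would then build without modification.
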
# IRC, freenode, #hurd, 2011-11-01
<braunr> etenil: but mcsim's work is, for one, useful because the allocator
code is much clearer, adds some debugging support, and is smp-ready
# IRC, freenode, #hurd, 2011-11-14
<braunr> i've just realized that replacing the zone allocator removes most
(if not all) static limit on allocated objects
<braunr> as we have nothing similar to rlimits, this means kernel resources
are actually exhaustible
<braunr> and i'm not sure every allocation is cleanly handled in case of
memory shortage
<braunr> youpi: antrik: tschwinge: is this acceptable anyway ?
<braunr> (although IMO, it's also a good thing to get rid of those limits
that made the kernel panic for no valid reason)
<youpi> there are actually not many static limits on allocated objects
<youpi> only a few have one
<braunr> those defined in kern/mach_param.h
<youpi> most of them are not actually enforced
<braunr> ah ?
<braunr> they are used at zinit() time
<braunr> i thought they were
<youpi> yes, but most zones are actually fine with overcoming the max
<braunr> ok
<youpi> see zone->max_size += (zone->max_size >> 1);
<youpi> you need both !EXHAUSTIBLE and FIXED
<braunr> ok
<pinotree> making having rlimits enforced would be nice...
<pinotree> s/making//
<braunr> pinotree: the kernel wouldn't handle many standard rlimits anyway
<braunr> i've just committed my final patch on mcsim's branch, which will
serve as the starting point for integration
<braunr> which means code in this branch won't change (or only last minute
changes)
<braunr> you're invited to test it
<braunr> there shouldn't be any noticeable difference with the master
branch
<braunr> a bit less fragmentation
<braunr> more memory can be reclaimed by the VM system
<braunr> there are debugging features
<braunr> it's SMP ready
<braunr> and overall cleaner than the zone allocator
<braunr> although a bit slower on the free path (because of what's
performed to reduce fragmentation)
<braunr> but even "slower" here is completely negligible
# IRC, freenode, #hurd, 2011-11-15
<mcsim> I enabled cpu_pool layer and kentry cache exhausted at "apt-get
source gnumach && (cd gnumach-* && dpkg-buildpackage)"
<mcsim> I mean kernel with your last commit
<mcsim> braunr: I'll make patch how I've done it in a few minutes, ok? It
will be more specific.
<braunr> mcsim: did you just remove the #if NCPUS > 1 directives ?
<mcsim> no. I replaced macro NCPUS > 1 with SLAB_LAYER, which equals NCPUS
> 1, then I redefined macro SLAB_LAYER
<braunr> ah, you want to make the layer optional, even on UP machines
<braunr> mcsim: can you give me the commands you used to trigger the
problem ?
<mcsim> apt-get source gnumach && (cd gnumach-* && dpkg-buildpackage)
<braunr> mcsim: how much ram & swap ?
<braunr> let's see if it can handle a quite large aptitude upgrade
<mcsim> how can I check swap size?
<braunr> free
<braunr> cat /proc/meminfo
<braunr> top
<braunr> whatever
<mcsim>              total       used       free     shared    buffers     cached
<mcsim> Mem:        786368     332296     454072          0          0          0
<mcsim> -/+ buffers/cache:     332296     454072
<mcsim> Swap:      1533948          0    1533948
<braunr> ok, i got the problem too
<mcsim> braunr: do you run hurd in qemu?
<braunr> yes
<braunr> i guess the cpu layer increases fragmentation a bit
<braunr> which means more map entries are needed
<braunr> hm, something's not right
<braunr> there are only 26 kernel map entries when i get the panic
<braunr> i wonder why the cache gets that stressed
<braunr> hm, reproducing the kentry exhaustion problem takes quite some
time
<mcsim> braunr: what do you mean?
<braunr> sometimes, dpkg-buildpackage finishes without triggering the
problem
<mcsim> the problem is in apt-get source gnumach
<braunr> i guess the problem happens because of drains/fills, which
allocate/free many more objects than actually preallocated at boot time
<braunr> ah ?
<braunr> ok
<braunr> i've never had it at that point, only later
<braunr> i'm unable to trigger it currently, eh
<mcsim> do you use *-dbg kernel?
<braunr> yes
<braunr> well, i use the compiled kernel, with the slab allocator, built
with the in kernel debugger
<mcsim> when you run apt-get source gnumach, you run it in clean directory?
Or there are already present downloaded archives?
<braunr> completely empty
<braunr> ah just got it
<braunr> ok the limit is reached, as expected
<braunr> i'll just bump it
<braunr> the cpu layer drains/fills allocate several objects at once (64 if
the size is small enough)
<braunr> the limit of 256 (actually 252 since the slab descriptor is
embedded in its slab) is then easily reached
<antrik> mcsim: most direct way to check swap usage is vmstat
<braunr> damn, i can't live without slabtop and the amount of
active/inactive cache memory any more
<braunr> hm, weird, we have active/inactive memory in procfs, but not
buffers/cached memory
<braunr> we could set buffers to 0 and everything as cached memory, since
we're currently unable to communicate the purpose of cached memory
(whether it's used by disk servers or file system servers)
<braunr> mcsim: looks like there are about 240 kernel map entries (i forgot
about the ones used in kernel submaps)
<braunr> so yes, adding the cpu layer is what makes the kernel reach the
limit more easily
<mcsim> braunr: so just increasing limit will solve the problem?
<braunr> mcsim: yes
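
The arithmetic behind this exchange can be sketched with a toy model. It is only an illustration built from the figures quoted in the log (252 usable kentry objects, fills of 64 objects, about 240 live kernel map entries). The function name and the single-CPU, allocate-only behaviour are assumptions, and this is not how gnumach actually accounts for its caches.

```python
import math
from typing import Optional

USABLE_KENTRIES = 252   # 256 minus the space taken by the embedded slab descriptor
TRANSFER_SIZE = 64      # objects moved per CPU pool fill (small objects)
LIVE_ENTRIES = 240      # approximate number of kernel map entries in use


def objects_taken(live_entries: int, transfer_size: Optional[int]) -> int:
    """Objects removed from the cache to satisfy `live_entries` allocations.

    Toy assumption: one CPU, allocations only. Without a CPU pool layer
    (transfer_size=None) each allocation takes exactly one object. With
    the layer, objects leave the cache in whole batches.
    """
    if transfer_size is None:
        return live_entries
    return math.ceil(live_entries / transfer_size) * transfer_size


for label, size in (("without cpu layer", None), ("with cpu layer", TRANSFER_SIZE)):
    taken = objects_taken(LIVE_ENTRIES, size)
    status = "exhausted" if taken > USABLE_KENTRIES else "ok"
    print(f"{label}: {taken} of {USABLE_KENTRIES} objects taken -> {status}")
```

In this model 240 single allocations fit within the 252 usable objects, while four batches of 64 ask for 256 and overrun the limit. That is consistent with the remark that the cpu layer makes the limit easier to reach and that raising the limit is enough.
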
<braunr> slab reclaiming looks very stable
<braunr> and unfrequent
<braunr> (which is surprising)
<pinotree> braunr: "unfrequent"?
<braunr> pinotree: there isn't much memory pressure
<braunr> slab_collect() gets called once a minute on my hurd
<braunr> or is it infrequent ?
<braunr> :)
<pinotree> i have no idea :)
<braunr> infrequent, yes
# IRC, freenode, #hurd, 2011-11-16
<braunr> for those who want to play with the slab branch of gnumach, the
slabinfo tool is available at http://git.sceen.net/rbraun/slabinfo.git/
<braunr> for those merely interested in numbers, here is the output of
slabinfo, for a hurd running in kvm with 512 MiB of RAM, an unused swap,
and a short usage history (gnumach debian packages built, aptitude
upgrade for a dozen packages, a few git commands)
<braunr> http://www.sceen.net/~rbraun/slabinfo.out
<antrik> braunr: numbers for a long usage history would be much more
interesting :-)
## IRC, freenode, #hurd, 2011-11-17
<braunr> antrik: they'll come :)
<etenil> is something going on on darnassus? it's mighty slow
<braunr> yes
<braunr> i've rebooted it to run a modified kernel (with the slab
allocator) and i'm building stuff on it to stress it
<braunr> (i don't have any other available machine with that amount of
available physical memory)
<etenil> ok
<antrik> braunr: probably would be actually more interesting to test under
memory pressure...
<antrik> guess that doesn't make much of a difference for the kernel object
allocator though
<braunr> antrik: if ram is larger, there can be more objects stored in
kernel space, then, by building something large such as eglibc, memory
pressure is created, causing caches to be reaped
<braunr> our page cache is useless because of vm_object_cached_max
<braunr> it's a stupid arbitrary limit masking the inability of the vm to
handle pressure correctly
<braunr> if removing it, the kernel freezes soon after ram is filled
<braunr> antrik: it may help trigger the "double swap" issue you mentioned
<antrik> what may help trigger it?
<braunr> not checking this limit
<antrik> hm... indeed I wonder whether the freezes I see might have the
same cause
## IRC, freenode, #hurd, 2011-11-19
<braunr> http://www.sceen.net/~rbraun/slabinfo.out <= state of the slab
allocator after building the debian libc packages and removing all files
once done
<braunr> it's mostly the same as on any other machine, because of the
various arbitrary limits in mach (most importantly, the max number of
objects in the page cache)
<braunr> fragmentation is still quite low
<antrik> braunr: actually fragmentation seems to be lower than on the other
run...
<braunr> antrik: what makes you think that ?
<antrik> the numbers of currently unused objects seem to be in a similar
range IIRC, but more of them are reclaimable I think
<antrik> maybe I'm misremembering the other numbers
<braunr> there had been more reclaims on the other run
# IRC, freenode, #hurd, 2011-11-25
<braunr> mcsim: i've just updated the slab branch, please review my last
commit when you have time
<mcsim> braunr: Do you mean compilation/tests?
<braunr> no, just a quick glance at the code, see if it matches what you
intended with your original patch
<mcsim> braunr: everything is ok
<braunr> good
<braunr> i think the branch is ready for integration