[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable id="license" text="Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled [[GNU Free Documentation License|/fdl]]."]]"""]]

[[!tag open_issue_gnumach]]

There is a [[!FF_project 266]][[!tag bounty]] on this task.

IRC, freenode, #hurd, 2011-04-12:

    braunr: do you think the allocator you wrote for x15 could be used for gnumach? and would you be willing to mentor this? :-)
    antrik: to be willing to isn't my current problem
    antrik: and yes, I think my allocator can be used
    it's a slab allocator after all, it only requires reap() and grow()
    or mmap()/munmap(), whatever you want to call it
    a backend
    antrik: although i've been having other ideas recently that would have more impact on our usage patterns I think
    mcsim: have you investigated how the zone allocator works and how it's hooked into the system yet?
    mcsim: now let me give you a link
    mcsim: http://git.sceen.net/rbraun/libbraunr.git/?a=blob;f=mem.c;h=330436e799f322949bfd9e2fedf0475660309946;hb=HEAD
    mcsim: this is an implementation of the slab allocator i've been working on recently
    mcsim: i haven't made it public because i reworked the per processor layer, and this part isn't complete yet
    mcsim: you could use it as a reference for your project
    braunr: ok
    it used to be close to the 2001 vmem paper
    but after many tests, fragmentation and accounting issues have been found
    so i rewrote it to be closer to the linux implementation (cache filling/draining in bukl transfers)
    bulk*
    they actually use the word draining in linux too :)
    antrik: not complete yet.
    braunr: oh, it's unfinished? that's unfortunate...
    antrik: only the per processor part
    antrik: so it doesn't matter much for gnumach
    and it's not difficult to set up
    mcsim: hm, OK... but do you think you will have a fairly good understanding in the next couple of days?... I'm asking because I'd really like to see a proposal a bit more specific than "I'll look into things..."
    i.e. you should have an idea which things you will actually have to change to hook up a new allocator etc.
    braunr: OK. will the interface remain unchanged, so it could be easily replaced with an improved implementation later?
    the zone allocator in gnumach is a badly written bare object allocator actually, there aren't many things to understand about it
    antrik: yes
    great :-)
    and the per processor part should be very close to the phys allocator sitting next to it
    (with the slight difference that, as per cpu caches have variable sizes, they are allocated on the free path rather than on the allocation path)
    this is a nice trick in the vmem paper i've kept in mind
    and the interface also allows to set a "source" for caches
    ah, good point... do you think we should replace the physmem allocator too? and if so, do it in one step, or one piece at a time?...
    no
    too many drivers currently depend on the physical allocator and the pmap module as they are
    remember linux 2.0 drivers need a direct virtual to physical mapping
    (especially true for dma mappings)
    OK
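
The "backend" braunr refers to can be pictured with a small sketch. The names below are made up for illustration and are not the actual libbraunr or x15 interface: the slab allocator itself only needs a pair of grow/reap hooks from its environment, which in a kernel would call into the VM system and in this user-space model simply wrap mmap()/munmap(). Being able to give a cache a different source is what makes the vm_map exception mentioned just below possible.

    #define _DEFAULT_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* Hypothetical "source" of backing memory for a cache: one hook to
     * grow (obtain pages) and one to reap (give pages back). */
    struct cache_source {
        void *(*grow)(size_t size);
        void  (*reap)(void *addr, size_t size);
    };

    /* Default source for this toy model: plain anonymous mappings. */
    static void *default_grow(size_t size)
    {
        void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (addr == MAP_FAILED) ? NULL : addr;
    }

    static void default_reap(void *addr, size_t size)
    {
        munmap(addr, size);
    }

    static const struct cache_source default_source = {
        .grow = default_grow,
        .reap = default_reap,
    };
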
    the nice thing about having a configurable memory source is that
    what do you mean by "allocated on the free path"?
    even if most caches will use the standard vm_kmem module as their backend
    there is one exception in the vm_map module, allowing us to get rid of either a static limit, or specific allocation code
    antrik: well, when you allocate a page, the allocator will lookup one in a per cpu cache
    if it's empty, it fills the cache (called pools in my implementations)
    it then retries
    the problem in the slab allocator is that per cpu caches have variable sizes
    so per cpu pools are allocated from their own pools (remember the magazine_xx caches in the output i showed you, this is the same thing)
    but if you allocate them at allocation time, you could end up in an infinite loop
    so, in the slab allocator, when a per cpu cache is empty, you just fall back to the slab layer
    on the free path, when a per cpu cache doesn't exist, you allocate it from its own cache
    this way you can't have an infinite loop
    antrik: I'll try, but I have exams now. As I understand, the number of elements which can be allocated is determined at zone initialization, and at this time memory for the zone is reserved. I'm going to change this, and make something similar to kmalloc and vmalloc (support for pages contiguous physically and virtually). And pages in zones are always physically contiguous. Am I right?
    mcsim: don't try to do that
    why?
    mcsim: we just need a slab allocator with an interface close to the zone allocator
    mcsim: IIRC the size of the complete zalloc map is fixed; but not the number of elements per zone
    we don't need two allocators like kmalloc and vmalloc
    actually we just need vmalloc
    IIRC the limits are only present because the original developers wanted to track leaks
    they assumed zones would be large enough, which isn't true any more today
    but i didn't see any true reservation
    antrik: i'm not sure i was clear enough about the "allocation of cpu caches on the free path"
    antrik: for a better explanation, read the vmem paper ;)
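
A rough sketch of what "allocating the CPU pools on the free path" means in practice, under the assumption that pool sizes vary per cache and pools therefore come from their own cache. This is simplified illustration code, not the real libbraunr/x15 allocator, and it only compiles as a fragment against the extern helpers it declares:

    #include <stddef.h>

    #define NR_CPUS   4
    #define POOL_SIZE 64

    struct cpu_pool {
        int nr_objs;
        void *objs[POOL_SIZE];
    };

    struct cache {
        struct cpu_pool *cpu_pools[NR_CPUS];   /* NULL until first used */
        struct cache *pool_cache;              /* where cpu_pool objects come from */
        /* ... slab lists, locks ... */
    };

    extern int cpu_id(void);
    extern void *cache_alloc_from_slab(struct cache *cache);
    extern void cache_free_to_slab(struct cache *cache, void *obj);

    void *cache_alloc(struct cache *cache)
    {
        struct cpu_pool *pool = cache->cpu_pools[cpu_id()];

        /* Allocation path: if there is no pool yet, or it is empty, fall
         * back to the slab layer.  The pool is never allocated here, since
         * doing so could recurse endlessly through the pool cache. */
        if (pool != NULL && pool->nr_objs > 0)
            return pool->objs[--pool->nr_objs];

        return cache_alloc_from_slab(cache);
    }

    void cache_free(struct cache *cache, void *obj)
    {
        int cpu = cpu_id();
        struct cpu_pool *pool = cache->cpu_pools[cpu];

        /* Free path: a missing pool is allocated here, from its own cache,
         * which cannot loop since the free path never requires a pool to
         * already exist. */
        if (pool == NULL)
            pool = cache->cpu_pools[cpu] = cache_alloc(cache->pool_cache);

        if (pool != NULL && pool->nr_objs < POOL_SIZE) {
            pool->objs[pool->nr_objs++] = obj;
            return;
        }

        cache_free_to_slab(cache, obj);
    }
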
    braunr: you mean there is no fundamental reason why the zone map has a limited maximal size; and it was only put in to catch cases where something eats up all memory with kernel object creation?...
    braunr: I think I got it now :-)
    antrik: i'm pretty certain of it yes
    I don't see though how it is related to what we were talking about...
    10:55 < braunr> and the per processor part should be very close to the phys allocator sitting next to it
    the phys allocator doesn't have to use this trick
    because pages have a fixed size, so per cpu caches all have the same size too
    and the number of "caches", that is, physical segments, is limited and known at compile time
    so having them statically allocated is possible
    I see
    it would actually be very difficult to have a phys allocator requiring dynamic allocation when the dynamic allocator isn't yet ready
    hehe :-)
    the total size of all zone allocations is limited to 12 MB. And is it true that it "was only put in to catch cases where something eats up all memory with kernel object creation"?
    mcsim: ah right, there could be a kernel submap backing all the zones
    but this can be increased too
    submaps are kind of evil :/
    mcsim: I think it's actually 32 MiB or something like that in the Debian version...
    braunr: I'm not sure I ever fully understood what the zalloc map is... I looked through the code once, and I think I got a rough understanding, but I was still pretty uncertain about some bits.
    and I don't remember the details anyways :-)
    antrik: IIRC, it's a kernel submap
    it's named kmem_map in x15
    don't know what a submap is
    submaps are vm_map objects
    in a top vm_map, there are vm_map_entries
    these entries usually point to vm_objects (for the page cache)
    but they can point to other maps too
    the goal is to reduce fragmentation by isolating allocations
    this also helps reducing contention
    for example, on BSD, there is a submap for mbufs, so that the network code doesn't interfere too much with other kernel allocations
    antrik: they are similar to spans in vmem, but vmem has an elegant importing mechanism which eliminates the static limit problem
    so memory is not directly allocated from the physical allocator, but instead from another map which in turn contains physical memory, or something like that?...
    no, this is entirely virtual
    submaps are almost exclusively used for the kernel_map
    you are using a lot of identifiers here, but I don't remember (or never knew) what most of them mean :-(
    sorry :)
    the kernel map is the vm_map used to represent the ~1 GiB of virtual memory the kernel has (on i386)
    vm_map objects are simple virtual space maps
    they contain what you see in linux when doing /proc/self/maps
    cat /proc/self/maps
    (linux uses entirely different names but it's roughly the same structure)
    each line is a vm_map_entry
    (well, there aren't submaps in linux though)
    the pmap tool on netbsd is able to show the kernel map with its submaps, but i don't have any image around
    braunr: is the limit for zones a feature that shouldn't be changed?
    mcsim: i think we shouldn't have fixed limits for zones
    mcsim: this should be part of the debugging facilities in the slab allocator
    is this fixed limit really a major problem ?
    i mean, don't focus on that too much, there are other issues requiring more attention
    braunr: at 12 MiB, it used to be, causing a lot of zalloc panics. after increasing, I don't think it's much of a problem anymore... but as memory sizes grow, it might become one again
    that's the problem with a fixed size...
    yes, that's the issue with submaps
    but gnumach is full of those, so let's fix them by order of priority
    well, I'm still trying to digest what you wrote about submaps :-)
    i'm downloading netbsd, so you can have a good view of all this
    so, when the kernel allocates virtual address space regions (mostly for itself), instead of grabbing chunks of the address space directly, it takes parts out of a pre-reserved region?
    not exactly
    both statements are true
    antrik: only virtual addresses are reserved
    it grabs chunks of the address space directly, but does so in a reserved region of the address space
    a submap is like a normal map, it has a start address, a size, and is empty, then it's populated with vm_map_entries
    so instead of allocating from 3-4 GiB, you allocate from, say, 3.1-3.2 GiB
    yeah, that's more or less what I meant...
    braunr: I see two problems: limited zones and absence of caching. With caching, the absence of readahead paging will not be so significant.
    please avoid readahead
    ok
    and it's not about paging, it's about kernel memory, which is wired
    (well most of it)
    what about limited zones ?
    the whole kernel space is limited, there has to be limits
    the problem is how to handle them
    braunr: almost all. I looked through all zones once, and IIRC I found exactly one that actually allows paging...
    currently, when you reach the limit, you have an OOM error
    antrik: yes, there are
    i don't remember which implementation does that
    but, when processes haven't been active for a minute or so, they are "swapped out" completely
    even the kernel stack
    and the page tables
    (most of the pmap structures are destroyed, some are retained)
    that might very well be true... at least inactive processes often show up with 0 memory use in top on Hurd
    this is done by having a pageable kernel map, with wired entries
    when the swapper thread swaps tasks out, it unwires them
    but i think modern implementations don't do that any more
    well, I was talking about zalloc only :-)
    oh
    so the zalloc_map must be pageable
    or there are two submaps ?
    not sure whether "modern implementations" includes Linux ;-)
    no, i'm talking about the bsd family only
    but it's certainly true that on Linux even inactive processes retain some memory
    linux doesn't make any difference between processor-bound and I/O-bound processes
    braunr: I have no idea how it works. I just remember that when creating zones, one of the optional flags decides whether the zone is pageable. but as I said, IIRC there is exactly one that actually is...
    zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max, zone_map_size, FALSE);
    kmem_suballoc(parent, min, max, size, pageable)
    so the zone_map isn't
    IIRC my conclusion was that pageable zones do not count in the fixed zone map limit... but I'm not sure anymore
    zinit() has a memtype parameter with ZONE_PAGEABLE as a possible flag
    this is weird :)
    There are no zones which use the ZONE_PAGEABLE flag
    mcsim: are you sure? I think I found one...
    if (zone->type & ZONE_PAGEABLE) {
    admittedly, it is several years ago that I looked into this, so my memory is rather dim...
    if (kmem_alloc_pageable(zone_map, &addr, ...
    calling kmem_alloc_pageable() on an unpageable submap seems wrong
    I've grepped the gnumach code and there is no zinit call with the ZONE_PAGEABLE flag
    good
    hm... perhaps it was in some code that has been removed altogether since ;-)
    actually I think it would be pretty neat to have pageable kernel objects... but I guess it would require considerable effort to implement this right
    mcsim: you also mentioned absence of caching
    mcsim: the zone allocator actually is a bare caching object allocator
    antrik: no, it's easy
    antrik: i already had that in x15 0.1
    antrik: the problem is being sure the objects you allocate from a pageable backing store are never used when resolving a page fault
    that's all
    I wouldn't expect that to be easy... but surely you know better :-)
    braunr: indeed. I was wrong.
    braunr: what is a caching object allocator?...
    antrik: ok, it's not easy
    antrik: but once you have vm_objects implemented, having pageable kernel objects is just a matter of using the right options, really
    antrik: an allocator that caches its buffers
    some years ago, the term "object" would also apply to preconstructed buffers
    I have no idea what you mean by "caches its buffers" here :-)
    well, a memory allocator which doesn't immediately free its buffers caches them
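
What braunr calls "an allocator that caches its buffers" can be reduced to a few lines. This toy model (not gnumach code; malloc() stands in for whatever backend hands out memory) keeps freed buffers on a free list instead of releasing them immediately, so the next allocation is just a pointer pop:

    #include <stdlib.h>

    /* Toy object cache: freed objects are kept for reuse rather than
     * released immediately.  obj_size must be at least sizeof(void *),
     * since a free object stores the free-list link inside itself. */
    struct obj_cache {
        size_t obj_size;
        void *free_list;
    };

    void *obj_cache_alloc(struct obj_cache *cache)
    {
        void *obj = cache->free_list;

        if (obj != NULL) {                    /* cache hit: no backend call */
            cache->free_list = *(void **)obj;
            return obj;
        }

        return malloc(cache->obj_size);       /* cache miss: ask the backend */
    }

    void obj_cache_free(struct obj_cache *cache, void *obj)
    {
        *(void **)obj = cache->free_list;     /* keep (cache) the buffer */
        cache->free_list = obj;
    }

A reap operation would simply walk free_list and hand the buffers back to the backend, which is what the conversation turns to next.
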
    braunr: but can it return objects to the system?
    mcsim: which one ?
    yeah, obviously the *implementation* of pageable kernel objects is not hard. the tricky part is deciding which objects can be pageable, and which need to be wired...
    Can the zone allocator return cached objects to the system, as in slab?
    I mean reap()
    well yes, it does so
    and it does that too often
    the caching in the zone allocator is actually limited to the pagesize
    once a page is completely free, it is returned to the vm
    this is bad caching
    yes
    if an object takes a whole page then there is no caching at all
    caching by side effect
    true
    but the linux slab allocator does the same thing :p
    hm no, the solaris slab allocator does so
    linux's slab returns objects only when system ask
    without preconstructed objects, is there actually any point in caching empty slabs?...
    Once I changed my allocator to slab and it cached more than 1GB of my memory)
    ok wait, need to fix a few mistakes first
    s/ask/asks
    the zone allocator (in gnumach) actually has a garbage collector
    braunr: well, the Solaris allocator follows the slab/magazine paper, right? so there is caching at the magazine layer... in that case caching empty slabs too would be rather redundant I'd say...
    which is called when running low on memory, similar to the slab allocator
    antrik: yes (or rather the paper follows the Solaris allocator ;-) )
    mcsim: the zone allocator reap() is zone_gc()
    braunr: hm, right, there is a "collectable" flag for zones... but I never understood what it means
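
The "caching limited to the page size" point can be made concrete with a sketch (toy types and names, not the real zalloc/zone_gc code): freeing an element only puts it back on its page's free list, and memory actually leaves the zone only when the garbage collector finds pages whose elements are all free. An object as large as a page therefore frees its page right away and is effectively not cached at all:

    #include <stddef.h>

    /* Toy model of page-grained caching in a zone-style allocator. */
    struct zpage {
        struct zpage *next;
        size_t in_use;          /* elements currently handed out from this page */
        void *free_elems;       /* free elements carved from this page */
    };

    struct zone {
        size_t elem_size;
        struct zpage *pages;
    };

    extern void vm_return_page(struct zpage *page);   /* stands in for returning the page to the VM */

    void zone_free_elem(struct zpage *page, void *elem)
    {
        *(void **)elem = page->free_elems;    /* the element is cached in its page */
        page->free_elems = elem;
        page->in_use--;
        /* Nothing goes back to the VM here; that only happens in the GC below. */
    }

    /* Modelled after the role of zone_gc(): return completely free pages. */
    void zone_collect(struct zone *zone)
    {
        struct zpage **prev = &zone->pages;
        struct zpage *page;

        while ((page = *prev) != NULL) {
            if (page->in_use == 0) {
                *prev = page->next;
                vm_return_page(page);         /* only now does memory leave the zone */
            } else {
                prev = &page->next;
            }
        }
    }
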
    braunr: BTW, I heard Linux has yet another allocator now called "slob"... do you happen to know what that is?
    slob is a very simple allocator for embedded devices
    AFAIR this is just a heap allocator
    useful when you have a very low amount of memory
    like 1 MiB
    yes
    just googled it :-)
    zone and slab are very similar
    sounds like a simple heap allocator
    there is another allocator called slub, and it is better than slab in many cases
    the main difference is the data structures used to store slabs
    mcsim: i disagree
    mcsim: ah, you already said that :-)
    mcsim: slub is better for systems with very large amounts of memory and processors
    otherwise, slab is better
    in addition, there are accounting issues with slub
    because of cache merging
    ok. It's strange that slub is the default allocator
    well both are very good
    iirc, linus stated that he really doesn't care as long as it works fine
    he refused slqb because of that
    slub is nice because it requires less memory than slab, while still being as fast for most cases
    it gets slower on the free path, when the cpu performing the free is different from the one which allocated the object
    that's a reasonable cost
    slub uses the heap for large objects. Are there any tests that compare which is better for large objects?
    well, if slub requires less memory, why do you think slab is better for smaller systems? :-)
    antrik: smaller is relative
    mcsim: for large objects slab allocation is rather pointless, as you don't have multiple objects in a page anyways...
    antrik: when lameter wrote slub, it was intended for systems with several hundred processors
    BTW, was slqb really refused only because the other ones are "good enough"?...
    yes
    wow, that's a strange argument...
    linus is already unhappy of having "so many" allocators
    well, if the new one is better, it could replace one of the others :-)
    or is it useful only in certain cases?
    that's the problem
    nobody really knows
    hm, OK... I guess that should be tested *before* merging ;-)
    is anyone still working on it, or was it abandoned?
    mcsim: back to caching... what does caching in the kernel object allocator have to do with readahead (i.e. clustered paging)?...
    if we cached some physical pages we wouldn't need to find new ones for allocating a new object. And that's why there would not be a page fault.
    antrik: Regarding KAM. Hasn't he finished his project?
    err... what? one of us must be seriously confused
    I totally fail to see what caching of physical pages (which isn't even really a correct description of what slab does) has to do with page faults
    right, KAM didn't finish his project
    If we free the physical page and return it to the system, we need another one for the next allocation. But if we keep it, we don't need to find a new physical page. And a physical page is allocated only when a page fault occurs. Probably, I'm wrong
    what does "return to system" mean? we are talking about the kernel...
    zalloc/slab are about allocating kernel objects. this doesn't have *anything* to do with paging of userspace processes
    the only thing they have in common is that they need to get pages from the physical page allocator. but that's yet another topic
    By "return to system" I mean the ability to use this page for other needs.
    mcsim: consider kernel memory to be wired
    here, return to system means releasing a page back to the vm system
    the vm_kmem module then unmaps the physical page and frees its virtual address in the kernel map
    ok
    antrik: the problem with new allocators like slqb is that it's very difficult to really know if they're better, even with extensive testing
    antrik: there are papers (like wilson95) about the difficulties in making valuable results in this field
    see http://www.sceen.net/~rbraun/dynamic_storage_allocation_a_survey_and_critical_review.pdf
    how can a physically contiguous object be allocated now?
    mcsim: rephrase please
    what in gnumach is similar to kmalloc in Linux?
    i know memory is reserved for dma in a direct virtual to physical mapping
    so even if the allocation is done similarly to vmalloc()
    the selected region of virtual space maps physical memory, so memory is physically contiguous too
    for other allocation types, a block large enough is allocated, so it's contiguous too
    I don't clearly understand. If we have fragmentation in physical ram, so that there aren't 2 free pages in a row but there are free pages apart, we can't allocate these 2 pages together?
    no
    but every system has this problem
    But since we have only 12 or 32 MB of memory the problem becomes more significant
    you're confusing virtual and physical memory
    those 32 MiB are virtual
    the physical pages backing them don't have to be contiguous
    Oh, indeed
    So the only problem is the limits?
    and performance
    and correctness
    i find the zone allocator badly written
    antrik: mcsim: here is the content of the kernel pmap on NetBSD (which uses a virtual memory system close to the Mach VM)
    antrik: mcsim: http://www.sceen.net/~rbraun/pmap.out

[[pmap.out]]

    you can see the kmem_map (which is used for most general kernel allocations) is 128 MiB large
    actually it's not the kernel pmap, it's the kernel_map
    braunr: why is it called pmap.out then? ;-)
    antrik: because the tool is named pmap
    for process map
    it also exists under Linux, although direct access to /proc/xx/maps gives more info
    braunr: I've said that this is kernel_map. Can I see kernel_map for Linux?
    mcsim: I don't know how to do that
    s/I've/You've
    but Linux doesn't have submaps, and uses a direct virtual to physical mapping, so it's used differently
    how are things (such as zalloc zones) entered into kernel_map?
    in zone_init() you have
    zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max, zone_map_size, FALSE);
    so here, kmem_map is named zone_map
    then, in zalloc()
    kmem_alloc_wired(zone_map, &addr, zone->alloc_size)
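
The shape of that path, kernel_map → kmem_suballoc() → zone_map → kmem_alloc_wired(), can be modelled in user space. The sketch below is an analogy only (mmap/mprotect instead of the Mach VM calls, and no actual wiring): a large address range is reserved up front (the kernel_map), a fixed sub-range is carved out of it for the zones (the submap created by kmem_suballoc), and only when the zone needs memory is a chunk inside that sub-range backed by usable pages (the kmem_alloc_wired step):

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define KERNEL_MAP_SIZE  (256UL << 20)   /* pretend kernel_map: 256 MiB of address space */
    #define ZONE_MAP_OFFSET  (64UL << 20)    /* where the pretend zone_map starts inside it */
    #define CHUNK_SIZE       4096UL

    int main(void)
    {
        /* "kernel_map": a pure address space reservation, no usable memory yet. */
        char *kernel_map = mmap(NULL, KERNEL_MAP_SIZE, PROT_NONE,
                                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (kernel_map == MAP_FAILED)
            return 1;

        /* "kmem_suballoc": pick a fixed sub-range; still only address space. */
        char *zone_map = kernel_map + ZONE_MAP_OFFSET;

        /* "kmem_alloc_wired" (roughly): back one chunk of the sub-range so
         * the zone can carve elements out of it. */
        if (mprotect(zone_map, CHUNK_SIZE, PROT_READ | PROT_WRITE) != 0)
            return 1;
        zone_map[0] = 42;

        printf("zone_map at %p inside kernel_map at %p\n",
               (void *)zone_map, (void *)kernel_map);
        munmap(kernel_map, KERNEL_MAP_SIZE);
        return 0;
    }
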
    so, kmem_alloc just deals out chunks of memory referenced directly by the address, and without knowing anything about the use?
    kmem_alloc() gives virtual pages
    zalloc() carves them into buffers, as in the slab allocator
    the difference is essentially the lack of formal "slab" object
    which makes the zone code look like a mess
    so kmem_suballoc() essentially just takes a bunch of pages from the main kernel_map, and uses these to back another map which then in turn deals out pages just like the main kernel_map?
    no
    kmem_suballoc creates a vm_map_entry object, and sets its start and end address
    and creates a vm_map object, which is then inserted in the new entry
    maybe that's what you meant with "essentially just takes a bunch of pages from the main kernel_map"
    but there really is no allocation at this point
    except the map entry and the new map objects
    well, I'm trying to understand how kmem_alloc() manages things. so it has map_entry structures like the maps of userspace processes? do these also reference actual memory objects?
    kmem_alloc just allocates virtual pages from a vm_map, and backs those with physical pages (unless the user requested pageable memory)
    it's not "like the maps of userspace processes"
    these are actually the same structures
    a vm_map_entry can reference a memory object or a kernel submap
    in netbsd, it can also reference nothing (for pure wired kernel memory like the vm_page array)
    maybe it's the same in mach, i don't remember exactly
    antrik: this is actually very clear in vm/vm_kern.c
    kmem_alloc() creates a new kernel object for the allocation
    allocates a new entry (or uses a previous existing one if it can be extended) through vm_map_find_entry()
    then calls kmem_alloc_pages() to back it with wired memory
    "creates a new kernel object" -- what kind of kernel object?
    kmem_alloc_wired() does roughly the same thing, except it doesn't need a new kernel object because it knows the new area won't be pageable
    a simple vm_object used as a container for anonymous memory in case the pages are swapped out
    vm_object is the same as memory object/pager? or yet something different?
    antrik: almost
    antrik: a memory_object is the user view of a vm_object
    as in the kernel/user interfaces used by external pagers
    vm_object is a more internal name
    Is fragmentation a big problem in the slab allocator?
    I've tested it on my computer in Linux and for some caches it reached 30-40%
    well, fragmentation is a major problem for any allocator...
    the original slab allocator was designed specifically with the goal of reducing fragmentation
    the revised version with the addition of magazines takes a step back on this though
    have you compared it to slub? would be pretty interesting...
    I have an idea how it can be decreased, but it will hurt performance...
    antrik: no I haven't, but it might be the same, I think
    if each cache handled two types of objects: those with sizes that fit the cache size (or a bit smaller), and those with sizes much smaller than the maximal cache size. For the first type of object the standard slab allocator would be used, and for the latter type a (within page) heap allocator. I think that then fragmentation will be decreased
    not at all. a heap allocator has much worse fragmentation.
    that's why the slab allocator was invented
    the problem is that in a long-running program (such as the kernel), objects tend to have vastly varying lifespans
    but we use the heap only for objects of specified sizes
    so often a few old objects will keep a whole page hostage
    for example, for the 32 byte cache it could be 20-28 byte objects
    that's particularly visible in programs such as firefox, which will grow the heap during use even though actual needs don't change
    the slab allocator groups objects in a fashion that makes it more likely adjacent objects will be freed at similar times
    well, that's pretty oversimplified, but I hope you get the idea...
    it's about locality
    I agree, but I'm not speaking about general heap allocation. We would have many heaps for objects with different sizes. Could it be better?
    note that this has been a topic of considerable research. you shouldn't seek to improve the actual algorithms -- you would have to read up on the existing research at least before you can contribute anything to the field :-)
    how would that be different from the slab allocator?
    slab will allocate 32 bytes for both 20 and 32 byte requests
    And if there was a request for 20 bytes we get 12 bytes unused
    oh, you mean the implementation of the generic allocator on top of slabs? well, that might not be optimal... but it's not an often used case anyways. mostly the kernel uses constant-sized objects, which get their own caches with custom tailored size
    I don't think the waste here matters at all
    affirmative. So my idea is useless.
    does the statistic you refer to show the fragmentation in absolute sizes too?
    Can you explain what absolute size is?
    I've counted what was requested (as the parameter of kmalloc) and what was really allocated (according to the best fit cache size).
    how did you get that information?
    I simply wrote a hook
    I mean total. i.e. how many KiB or MiB are wasted due to fragmentation altogether
    ah, interesting. how does it work?
    BTW, did you read the slab papers?
    Do you mean articles from lwn.net?
    no
    I mean the papers from the Sun hackers who invented the slab allocator(s)
    Bonwick mostly IIRC
    Yes
    hm... then you really should know the rationale behind it...
    There he says about 11% of memory waste
    you didn't answer my other questions BTW :-)
    I've corrupted my kernel tree with the patch, and tomorrow I'm going to read up for an exam (I have it on Thursday). But then I'll send you the module which I've used for testing.
    OK
    I can send you the module now, but it will not work without the patch. It would be better to rewrite it using debugfs, but when I was writing this test I didn't know about the trace_* macros
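
mcsim's measurement (comparing what callers request with what the best-fit cache actually hands out) boils down to a small piece of accounting. The stand-alone sketch below shows the idea with made-up request and cache sizes; it is not the hook module discussed above:

    #include <stdio.h>
    #include <stddef.h>

    /* Typical power-of-two cache sizes, as in kmalloc; illustrative only. */
    static const size_t cache_sizes[] = { 32, 64, 128, 256, 512, 1024, 2048, 4096 };

    static size_t total_requested;
    static size_t total_allocated;

    /* Account one allocation: what was asked for vs. the best-fit cache size. */
    static void account(size_t requested)
    {
        size_t i;

        for (i = 0; i < sizeof(cache_sizes) / sizeof(cache_sizes[0]); i++) {
            if (requested <= cache_sizes[i]) {
                total_requested += requested;
                total_allocated += cache_sizes[i];
                return;
            }
        }
    }

    int main(void)
    {
        /* A 20-byte request served from the 32-byte cache wastes 12 bytes,
         * which is the example discussed above. */
        size_t sample[] = { 20, 32, 100, 500, 3000 };
        size_t i;

        for (i = 0; i < sizeof(sample) / sizeof(sample[0]); i++)
            account(sample[i]);

        printf("internal fragmentation: %.1f%%\n",
               100.0 * (total_allocated - total_requested) / total_allocated);
        return 0;
    }
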
2011-04-15

    There is a hack in zone_gc where it allocates and frees two vm_map_kentry_zone elements to make sure the gc will be able to allocate two in vm_map_delete. Isn't it better to allocate memory for these entries statically?
    mcsim: that's not the point of the hack
    mcsim: the point of the hack is to make sure vm_map_delete will be able to allocate stuff
    allocating them statically will just work once
    it may happen several times that vm_map_delete needs to allocate it while it's empty (and thus zget_space has to get called, leading to a hang)
    funnily enough, the bug is also in macos X
    it's still in my TODO list to manage to find how to submit the issue to them
    really ?
    eh
    is that because of map entry splitting ?
    it's git commit efc3d9c47cd744c316a8521c9a29fa274b507d26
    braunr: iirc something like this, yes
    netbsd has this issue too
    possibly
    i think it's a fundamental problem with the design
    people think of munmap() as something similar to free()
    whereas it's really unmap
    with a BSD-like VM, unmap can easily end up splitting one entry in two
    but your issue is more about harmful recursion right ?
    I don't remember actually
    it's quite some time ago :)
    ok
    i think that's why i have "sources" in my slab allocator
    the default source (vm_kern) and a custom one for kernel map entries

2011-04-18

    braunr: you've said that once a page is completely free, it is returned to the vm. Who else, besides zone_gc, can return free pages to the vm?
    mcsim: i also said i was wrong about that
    zone_gc is the only one

2011-04-19

    antrik: mcsim: i added back a new per-cpu layer as planned
    http://git.sceen.net/rbraun/libbraunr.git/?a=blob;f=mem.c;h=c629b2b9b149f118a30f0129bd8b7526b0302c22;hb=HEAD
    mcsim: btw, in mem_cache_reap(), you can clearly see there are two loops, just as in zone_gc, to reduce contention and avoid deadlocks
    this is really common in memory allocators
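
The two-loop structure braunr points at in mem_cache_reap() and zone_gc() is a common pattern: gather everything reclaimable while holding the cache lock, then hand the memory back to the VM after dropping it, so the expensive release work neither serializes other allocations nor risks deadlocking against them. A schematic version, with toy types rather than the real code:

    #include <pthread.h>
    #include <stddef.h>

    struct slab {
        struct slab *next;
    };

    struct cache {
        pthread_mutex_t lock;
        struct slab *free_slabs;    /* completely free slabs, reclaimable */
    };

    extern void slab_release(struct slab *slab);   /* stands in for returning pages to the VM */

    void cache_reap(struct cache *cache)
    {
        struct slab *list, *slab;

        /* First pass: under the lock, detach the list of completely free
         * slabs (the real code walks and collects them in its first loop). */
        pthread_mutex_lock(&cache->lock);
        list = cache->free_slabs;
        cache->free_slabs = NULL;
        pthread_mutex_unlock(&cache->lock);

        /* Second loop: give the memory back without holding the cache lock,
         * so freeing pages cannot contend or deadlock with allocations. */
        while (list != NULL) {
            slab = list;
            list = list->next;
            slab_release(slab);
        }
    }
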
2011-04-23

    I've looked through some allocators and all of them use different per-cpu cache policies. AFAIK GNU Hurd doesn't support multiprocessing, but multiprocessing must still be kept in mind. So, what kind of cpu caches do you think is better? As for me, I like the variant with only per-cpu caches (like in slqb).
    mcsim: well, have you looked at the allocator braunr wrote himself? :-)
    I'm not sure I suggested that explicitly to you; but probably it makes most sense to use that in gnumach

2011-04-24

    antrik: Yes, I have. He uses both global and per-cpu caches. But he also suggested looking through slqb, where there are only per-cpu caches.
    i don't remember slqb in detail
    what do you mean by "only per-cpu caches" ? a whole slab system for each cpu ?
    I mean that there are no global queues in caches, but there are special queues for each cpu. I've just started investigating slqb's code, but I've read an article on lwn about it. And I've read that it is used for the zen kernel.
    zen ?
    Here is this article http://lwn.net/Articles/311502/
    Yes, this is a linux kernel with some patches which haven't been approved for Torvalds' tree http://zen-kernel.org/
    i see
    well it looks nice
    but as for slub, the problem i can see is cross-CPU freeing
    and I think nick piggins mentions it
    piggin*
    this means that sometimes, objects are "burst-free" from one cpu cache to another
    which has the same bad effects as in most other allocators, mainly fragmentation
    There is a special list for freeing objects allocated for another CPU
    And the garbage collector frees such objects on its own
    so what's your question ?
    It is described at the end of the article. What cpu-cache policy do you think is better to implement?
    at this point, any
    and even if we had a kernel that perfectly supports multiprocessors, I wouldn't care much now
    it's very hard to evaluate such allocators
    slqb looks nice, but if you have the same amount of fragmentation per slab as other allocators do (which is likely), you have that amount of fragmentation multiplied by the number of processors
    whereas having shared queues limits the problem somehow
    having shared queues means you have a bit more contention
    so, as is the case most of the time, it's a tradeoff
    by the way, does pigging say why he "doesn't like" slub ? :)
    piggin*
    http://lwn.net/Articles/311093/
    here he describes why slqb is better.
    well it doesn't describe why slub is worse
    but not very particularly, except for order-0 allocations
    and that's a form of fragmentation like i mentioned above
    in mach those problems have very different impacts
    the backend memory isn't physical, it's the kernel virtual space
    so the kernel allocator can request chunks of higher than order-0 pages
    physical pages are allocated one at a time, then mapped in the kernel space
    Doesn't the page order depend on the buffer size?
    it does
    And why does gnumach allocate higher than order-0 pages more?
    why more ?
    i didn't say more
    And why do those problems have a very different impact in mach?
    ?
    i've just explained why :)
    09:37 < braunr> physical pages are allocated one at a time, then mapped in the kernel space
    "one at a time" means order-0 pages, even if you allocate higher than order-0 chunks
    And in Linux they allocate more than one at a time because of page read prefetching?
    do you understand what virtual memory is ?
    linux allocators allocate "physical memory"
    mach kernel allocator allocates "virtual memory"
    so even if you allocate a big chunk of virtual memory, it's backed by order-0 physical pages
    yes, I understand this
    you don't seem to :/
    the problem of higher than order-0 page allocations is fragmentation
    do you see why ?
    yes
    so
    fragmentation in the kernel space is less likely to create issues than it does in physical memory
    keep in mind physical memory is almost always full because of the page cache
    and constantly under some pressure
    whereas the kernel space is mostly empty
    so allocating higher than order-0 pages in linux is more dangerous than it is in Mach or BSD
    ok
    on the other hand, linux focuses on pure performance, and not having to map memory means fewer operations, fewer tlb misses, quicker allocations
    the Mach VM must map pages "one at a time", which can be expensive
    it should be adapted to handle multiple page sizes (e.g. 2 MiB) so that many allocations can be made with few mappings
    but that's not easy
    as always: tradeoffs
    There are other benefits of physical allocation. Big DMA transfers can need a few contiguous physical pages. How does mach handle such cases?
    gnumach does that awfully
    it just reserves the whole DMA-able memory and uses special allocation functions on it, IIRC
    but kernels which have a Mach-like VM system, such as the BSDs, have cleaner methods
    NetBSD provides a function to allocate contiguous physical memory
    with many constraints
    FreeBSD uses a binary buddy system like Linux
    the fact that the kernel allocator uses virtual memory doesn't mean the kernel has no means to allocate contiguous physical memory
    ...
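
The contrast braunr draws here, a kernel virtual allocation backed by order-0 physical pages mapped one at a time versus a physically contiguous higher-order allocation served from a direct mapping, can be sketched as follows. All of the primitives are invented for illustration and do not correspond to real gnumach or Linux interfaces:

    #include <stddef.h>

    #define PAGE_SIZE 4096UL

    /* Hypothetical primitives standing in for the real allocators/pmap. */
    extern unsigned long vm_kernel_virt_alloc(size_t size);     /* reserve kernel virtual space */
    extern unsigned long phys_page_alloc(void);                 /* one order-0 physical page */
    extern void pmap_enter_kernel(unsigned long va, unsigned long pa);

    /* Mach/BSD style: the chunk is contiguous in *virtual* space only.
     * Each backing page is an order-0 physical page, mapped one at a time,
     * so physical fragmentation does not matter, but every page costs a
     * mapping operation. */
    unsigned long kmem_alloc_sketch(size_t size)
    {
        unsigned long va = vm_kernel_virt_alloc(size);
        size_t offset;

        for (offset = 0; offset < size; offset += PAGE_SIZE)
            pmap_enter_kernel(va + offset, phys_page_alloc());

        return va;
    }

    /* Linux kmalloc style (conceptually): grab a physically contiguous
     * higher-order block reachable through the direct mapping; no per-page
     * mapping work, but the physical allocator must find contiguous pages,
     * which gets harder as physical memory fragments. */
    extern unsigned long phys_block_alloc(unsigned int order);  /* 2^order contiguous pages */
    extern unsigned long direct_map_va(unsigned long pa);

    unsigned long kmalloc_sketch(unsigned int order)
    {
        return direct_map_va(phys_block_alloc(order));
    }
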