[[!meta copyright="Copyright © 2011, 2012, 2013, 2014, 2016 Free Software Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable id="license" text="Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled [[GNU Free Documentation License|/fdl]]."]]"""]]

[[!tag open_issue_gnumach]]

[[!toc]]


# IRC, freenode, #hurd, 2011-04-12

    <antrik> braunr: do you think the allocator you wrote for x15 could be used for gnumach? and would you be willing to mentor this? :-)
    <braunr> antrik: to be willing to isn't my current problem
    <braunr> antrik: and yes, I think my allocator can be used
    <braunr> it's a slab allocator after all, it only requires reap() and grow()
    <braunr> or mmap()/munmap() whatever you want to call it
    <braunr> a backend
    <braunr> antrik: although i've been having other ideas recently
    <braunr> that would have more impact on our usage patterns I think
    <antrik> mcsim: have you investigated how the zone allocator works and how it's hooked into the system yet?
    <braunr> mcsim: now let me give you a link
    <braunr> mcsim: http://git.sceen.net/rbraun/libbraunr.git/?a=blob;f=mem.c;h=330436e799f322949bfd9e2fedf0475660309946;hb=HEAD
    <braunr> mcsim: this is an implementation of the slab allocator i've been working on recently
    <braunr> mcsim: i haven't made it public because i reworked the per processor layer, and this part isn't complete yet
    <braunr> mcsim: you could use it as a reference for your project
    <mcsim> braunr: ok
    <braunr> it used to be close to the 2001 vmem paper
    <braunr> but after many tests, fragmentation and accounting issues have been found
    <braunr> so i rewrote it to be closer to the linux implementation (cache filling/draining in bukl transfers)
    <braunr> bulk*
    <braunr> they actually use the word draining in linux too :)
    <mcsim> antrik: not complete yet.
    <antrik> braunr: oh, it's unfinished? that's unfortunate...
    <braunr> antrik: only the per processor part
    <braunr> antrik: so it doesn't matter much for gnumach
    <braunr> and it's not difficult to set up
    <antrik> mcsim: hm, OK... but do you think you will have a fairly good understanding in the next couple of days?...
    <antrik> I'm asking because I'd really like to see a proposal a bit more specific than "I'll look into things..."
    <antrik> i.e. you should have an idea which things you will actually have to change to hook up a new allocator etc.
    <antrik> braunr: OK. will the interface remain unchanged, so it could be easily replaced with an improved implementation later?
    <braunr> the zone allocator in gnumach is a badly written bare object allocator actually, there aren't many things to understand about it
    <braunr> antrik: yes
    <antrik> great :-)
    <braunr> and the per processor part should be very close to the phys allocator sitting next to it
    <braunr> (with the slight difference that, as per cpu caches have variable sizes, they are allocated on the free path rather than on the allocation path)
    <braunr> this is a nice trick in the vmem paper i've kept in mind
    <braunr> and the interface also allows to set a "source" for caches
    <antrik> ah, good point... do you think we should replace the physmem allocator too? and if so, do it in one step, or one piece at a time?...
    <braunr> no
    <braunr> too many drivers currently depend on the physical allocator and the pmap module as they are
    <braunr> remember linux 2.0 drivers need a direct virtual to physical mapping
    <braunr> (especially true for dma mappings)
    <antrik> OK
    <braunr> the nice thing about having a configurable memory source is that
    <antrik> whot do you mean by "allocated on the free path"?
    <braunr> even if most caches will use the standard vm_kmem module as their backend
    <braunr> there is one exception in the vm_map module, allowing us to get rid of either a static limit, or specific allocation code
    <braunr> antrik: well, when you allocate a page, the allocator will lookup one in a per cpu cache
    <braunr> if it's empty, it fills the cache
    <braunr> (called pools in my implementations)
    <braunr> it then retries
    <braunr> the problem in the slab allocator is that per cpu caches have variable sizes
    <braunr> so per cpu pools are allocated from their own pools
    <braunr> (remember the magazine_xx caches in the output i showed you, this is the same thing)
    <braunr> but if you allocate them at allocation time, you could end up in an infinite loop
    <braunr> so, in the slab allocator, when a per cpu cache is empty, you just fall back to the slab layer
    <braunr> on the free path, when a per cpu cache doesn't exist, you allocate it from its own cache
    <braunr> this way you can't have an infinite loop
    <mcsim> antrik: I'll try, but I have exams now.
    <mcsim> As I understand amount of elements which could be allocated we determine by zone initialization. And at this time memory for zone is reserved. I'm going to change this. And make something similar to kmalloc and vmalloc (support for pages consecutive physically and virtually). And pages in zones consecutive always physically.
    <mcsim> Am I right?
    <braunr> mcsim: don't try to do that
    <mcsim> why?
    <braunr> mcsim: we just need a slab allocator with an interface close to the zone allocator
    <antrik> mcsim: IIRC the size of the complete zalloc map is fixed; but not the number of elements per zone
    <braunr> we don't need two allocators like kmalloc and vmalloc
    <braunr> actually we just need vmalloc
    <braunr> IIRC the limits are only present because the original developers wanted to track leaks
    <braunr> they assumed zones would be large enough, which isn't true any more today
    <braunr> but i didn't see any true reservation
    <braunr> antrik: i'm not sure i was clear enough about the "allocation of cpu caches on the free path"
    <braunr> antrik: for a better explanation, read the vmem paper ;)
    <antrik> braunr: you mean there is no fundamental reason why the zone map has a limited maximal size; and it was only put in to catch cases where something eats up all memory with kernel object creation?...
    <antrik> braunr: I think I got it now :-)
    <braunr> antrik: i'm pretty certin of it yes
    <antrik> I don't see though how it is related to what we were talking about...
    <braunr> 10:55 < braunr> and the per processor part should be very close to the phys allocator sitting next to it
    <braunr> the phys allocator doesn't have to use this trick
    <braunr> because pages have a fixed size, so per cpu caches all have the same size too
    <braunr> and the number of "caches", that is, physical segments, is limited and known at compile time
    <braunr> so having them statically allocated is possible
    <antrik> I see
    <braunr> it would actually be very difficult to have a phys allocator requiring dynamic allocation when the dynamic allocator isn't yet ready
    <antrik> hehe :-)
    <mcsim> total size of all zone allocations is limited to 12 MB. And is "was only put in to catch cases where something eats up all memory with kernel object creation?"
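
The "allocate per-CPU pools on the free path" trick described above can be sketched in a few lines of C. This is only an illustration under loud assumptions: it models a single CPU, uses a fixed `POOL_CAPACITY` (the discussion stresses that real pools have variable sizes), and uses `malloc()`/`calloc()` as stand-ins for the slab layer and for the pool's own cache. None of the names below come from gnumach or from braunr's `mem.c`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_CAPACITY 4 /* hypothetical; real pools have per-cache sizes */

/* Per-CPU pool of cached objects (one CPU only in this sketch). */
struct cpu_pool {
    size_t count;
    void *objs[POOL_CAPACITY];
};

struct cache {
    size_t obj_size;
    struct cpu_pool *pool;     /* NULL until the first free */
    unsigned long slab_allocs; /* how often the slab layer was used */
    unsigned long slab_frees;
};

/* Stand-ins for the slab layer (assumption: plain malloc/free). */
static void *slab_layer_alloc(struct cache *c)
{
    c->slab_allocs++;
    return malloc(c->obj_size);
}

static void slab_layer_free(struct cache *c, void *obj)
{
    c->slab_frees++;
    free(obj);
}

/* Allocation path: never creates a pool, so it cannot recurse. */
void *cache_alloc(struct cache *c)
{
    struct cpu_pool *p = c->pool;

    if (p != NULL && p->count > 0)
        return p->objs[--p->count];

    /* Pool missing or empty: fall back to the slab layer. */
    return slab_layer_alloc(c);
}

/* Free path: the only place where a missing pool gets created. */
void cache_free(struct cache *c, void *obj)
{
    assert(obj != NULL);

    if (c->pool == NULL) {
        /* In the design discussed above the pool would come from its
         * own cache; calloc() is a stand-in here. */
        c->pool = calloc(1, sizeof(*c->pool));
    }

    if (c->pool != NULL && c->pool->count < POOL_CAPACITY) {
        c->pool->objs[c->pool->count++] = obj;
        return;
    }

    /* Pool full or unavailable: hand the object back to the slab layer. */
    slab_layer_free(c, obj);
}
```

With a zero-initialised `struct cache` whose `obj_size` is set, the first `cache_alloc()` goes straight to the slab layer and leaves `pool` as `NULL`; the first `cache_free()` creates the pool, and the next allocation is served from it.
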
    <braunr> mcsim: ah right, there could be a kernel submap backing all the zones
    <braunr> but this can be increased too
    <braunr> submaps are kind of evil :/
    <antrik> mcsim: I think it's actually 32 MiB or something like that in the Debian version...
    <antrik> braunr: I'm not sure I ever fully understood what the zalloc map is... I looked through the code once, and I think I got a rough understading, but I was still pretty uncertain about some bits. and I don't remember the details anyways :-)
    <braunr> antrik: IIRC, it's a kernel submap
    <braunr> it's named kmem_map in x15
    <antrik> don't know what a submap is
    <braunr> submaps are vm_map objects
    <braunr> in a top vm_map, there are vm_map_entries
    <braunr> these entries usually point to vm_objects
    <braunr> (for the page cache)
    <braunr> but they can point to other maps too
    <braunr> the goal is to reduce fragmentation by isolating allocations
    <braunr> this also helps reducing contention
    <braunr> for exemple, on BSD, there is a submap for mbufs, so that the network code doesn't interfere too much with other kernel allocations
    <braunr> antrik: they are similar to spans in vmem, but vmem has an elegant importing mechanism which eliminates the static limit problem
    <antrik> so memory is not directly allocated from the physical allocator, but instead from another map which in turn contains physical memory, or something like that?...
    <braunr> no, this is entirely virtual
    <braunr> submaps are almost exclusively used for the kernel_map
    <antrik> you are using a lot of identifies here, but I don't remember (or never knew) what most of them mean :-(
    <braunr> sorry :)
    <braunr> the kernel map is the vm_map used to represent the ~1 GiB of virtual memory the kernel has (on i386)
    <braunr> vm_map objects are simple virtual space maps
    <braunr> they contain what you see in linux when doing /proc/self/maps
    <braunr> cat /proc/self/maps
    <braunr> (linux uses entirely different names but it's roughly the same structure)
    <braunr> each line is a vm_map_entry
    <braunr> (well, there aren't submaps in linux though)
    <braunr> the pmap tool on netbsd is able to show the kernel map with its submaps, but i don't have any image around
    <mcsim> braunr: is limit for zones is feature and shouldn't be changed?
    <braunr> mcsim: i think we shouldn't have fixed limits for zones
    <braunr> mcsim: this should be part of the debugging facilities in the slab allocator
    <braunr> is this fixed limit really a major problem ?
    <braunr> i mean, don't focus on that too much, there are other issues requiring more attention
    <antrik> braunr: at 12 MiB, it used to be, causing a lot of zalloc panics. after increasing, I don't think it's much of a problem anymore...
    <antrik> but as memory sizes grow, it might become one again
    <antrik> that's the problem with a fixed size...
    <braunr> yes, that's the issue with submaps
    <braunr> but gnumach is full of those, so let's fix them by order of priority
    <antrik> well, I'm still trying to digest what you wrote about submaps :-)
    <braunr> i'm downloading netbsd, so you can have a good view of all this
    <antrik> so, when the kernel allocates virtual address space regions (mostly for itself), instead of grabbing chunks of the address space directly, it takes parts out of a pre-reserved region?
    <braunr> not exactly
    <braunr> both statements are true
    <mcsim> antrik: only virtual addresses are reserved
    <braunr> it grabs chunks of the address space directly, but does so in a reserved region of the address space
    <braunr> a submap is like a normal map, it has a start address, a size, and is empty, then it's populated with vm_map_entries
    <braunr> so instead of allocating from 3-4 GiB, you allocate from, say, 3.1-3.2 GiB
    <antrik> yeah, that's more or less what I meant...
    <mcsim> braunr: I see two problems: limited zones and absence of caching.
    <mcsim> with caching absence of readahead paging will be not so significant
    <braunr> please avoid readahead
    <mcsim> ok
    <braunr> and it's not about paging, it's about kernel memory, which is wired
    <braunr> (well most of it)
    <braunr> what about limited zones ?
    <braunr> the whole kernel space is limited, there has to be limits
    <braunr> the problem is how to handle them
    <antrik> braunr: almost all. I looked through all zones once, and IIRC I found exactly one that actually allows paging...
    <braunr> currently, when you reach the limit, you have an OOM error
    <braunr> antrik: yes, there are
    <braunr> i don't remember which implementation does that but, when processes haven't been active for a minute or so, they are "swapedout"
    <braunr> completely
    <braunr> even the kernel stack
    <braunr> and the page tables
    <braunr> (most of the pmap structures are destroyed, some are retained)
    <antrik> that might very well be true... at least inactive processes often show up with 0 memory use in top on Hurd
    <braunr> this is done by having a pageable kernel map, with wired entries
    <braunr> when the swapper thread swaps tasks out, it unwires them
    <braunr> but i think modern implementations don't do that any more
    <antrik> well, I was talking about zalloc only :-)
    <braunr> oh
    <braunr> so the zalloc_map must be pageable
    <braunr> or there are two submaps ?
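
To make the submap idea above concrete, here is a minimal sketch in C of a map as a plain address range with a bump pointer, and of a submap as a region reserved out of its parent. It is an assumption-laden toy: real `vm_map` objects hold lists of `vm_map_entry` structures and support freeing, and the names `range_map`, `map_alloc` and `map_suballoc` are hypothetical, not Mach interfaces. Note that nothing here allocates physical memory; only addresses are handed out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A virtual address range with a simple bump allocator (hypothetical). */
struct range_map {
    uint64_t start;
    uint64_t end;  /* exclusive */
    uint64_t next; /* first address not yet handed out */
};

/* Reserve size bytes of address space; 0 on success, -1 if exhausted. */
int map_alloc(struct range_map *map, size_t size, uint64_t *addr)
{
    assert(map->next >= map->start && map->next <= map->end);

    if (size > map->end - map->next)
        return -1;

    *addr = map->next;
    map->next += size;
    return 0;
}

/* Create a submap: reserve a region in the parent and describe it as a
 * new, empty map. Only addresses are reserved, no memory is backed. */
int map_suballoc(struct range_map *parent, size_t size, struct range_map *sub)
{
    uint64_t base;

    if (map_alloc(parent, size, &base) != 0)
        return -1;

    sub->start = base;
    sub->end = base + size;
    sub->next = base;
    return 0;
}
```

The fixed-limit problem mentioned in the discussion falls out directly: once a 32 MiB submap carved from a 1 GiB parent is used up, `map_alloc()` on the submap fails even though the parent still has plenty of free address space.
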
    <antrik> not sure whether "morden implementations" includes Linux ;-)
    <braunr> no, i'm talking about the bsd family only
    <antrik> but it's certainly true that on Linux even inactive processes retain some memory
    <braunr> linux doesn't make any difference between processor-bound and I/O-bound processes
    <antrik> braunr: I have no idea how it works. I just remember that when creating zones, one of the optional flags decides whether the zone is pagable. but as I said, IIRC there is exactly one that actually is...
    <braunr> zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max, zone_map_size, FALSE);
    <braunr> kmem_suballoc(parent, min, max, size, pageable)
    <braunr> so the zone_map isn't
    <antrik> IIRC my conclusion was that pagable zones do not count in the fixed zone map limit... but I'm not sure anymore
    <braunr> zinit() has a memtype parameter
    <braunr> with ZONE_PAGEABLE as a possible flag
    <braunr> this is wierd :)
    <mcsim> There is no any zones which use ZONE_PAGEABLE flag
    <antrik> mcsim: are you sure? I think I found one...
    <braunr> if (zone->type & ZONE_PAGEABLE) {
    <antrik> admittedly, it is several years ago that I looked into this, so my memory is rather dim...
    <braunr> if (kmem_alloc_pageable(zone_map, &addr, ...
    <braunr> calling kmem_alloc_pageable() on an unpageable submap seems wrong
    <mcsim> I've greped gnumach code and there is no any zinit procedure call with ZONE_PAGEABLE flag
    <braunr> good
    <antrik> hm... perhaps it was in some code that has been removed alltogether since ;-)
    <antrik> actually I think it would be pretty neat to have pageable kernel objects... but I guess it would require considerable effort to implement this right
    <braunr> mcsim: you also mentioned absence of caching
    <braunr> mcsim: the zone allocator actually is a bare caching object allocator
    <braunr> antrik: no, it's easy
    <braunr> antrik: i already had that in x15 0.1
    <braunr> antrik: the problem is being sure the objects you allocate from a pageable backing store are never used when resolving a page fault
    <braunr> that's all
    <antrik> I wouldn't expect that to be easy... but surely you know better :-)
    <mcsim> braunr: indeed. I was wrong.
    <antrik> braunr: what is a caching object allocator?...
    <braunr> antrik: ok, it's not easy
    <braunr> antrik: but once you have vm_objects implemented, having pageable kernel object is just a matter of using the right options, really
    <braunr> antrik: an allocator that caches its buffers
    <braunr> some years ago, the term "object" would also apply to preconstructed buffers
    <antrik> I have no idea what you mean by "caches its buffers" here :-)
    <braunr> well, a memory allocator which doesn't immediately free its buffers caches them
    <mcsim> braunr: but can it return objects to system?
    <braunr> mcsim: which one ?
    <antrik> yeah, obviously the *implementation* of pageable kernel objects is not hard. the tricky part is deciding which objects can be pageable, and which need to be wired...
    <mcsim> Can zone allocator return cached objects to system as in slab?
    <mcsim> I mean reap()
    <braunr> well yes, it does so, and it does that too often
    <braunr> the caching in the zone allocator is actually limited to the pagesize
    <braunr> once page is completely free, it is returned to the vm
    <mcsim> this is bad caching
    <braunr> yes
    <mcsim> if object takes all page than there is now caching at all
    <braunr> caching by side effect
    <braunr> true
    <braunr> but the linux slab allocator does the same thing :p
    <braunr> hm
    <braunr> no, the solaris slab allocator does so
    <mcsim> linux's slab returns objects only when system ask
    <antrik> without preconstructed objects, is there actually any point in caching empty slabs?...
    <mcsim> Once I've changed my allocator to slab and it cached more than 1GB of my memory)
    <braunr> ok wait, need to fix a few mistakes first
    <mcsim> s/ask/asks
    <braunr> the zone allocator (in gnumach) actually has a garbage collector
    <antrik> braunr: well, the Solaris allocator follows the slab/magazine paper, right? so there is caching at the magazine layer... in that case caching empty slabs too would be rather redundant I'd say...
    <braunr> which is called when running low on memory, similar to the slab allocaotr
    <braunr> antrik: yes
    <antrik> (or rather the paper follows the Solaris allocator ;-) )
    <braunr> mcsim: the zone allocator reap() is zone_gc()
    <antrik> braunr: hm, right, there is a "collectable" flag for zones... but I never understood what it means
    <antrik> braunr: BTW, I heard Linux has yet another allocator now called "slob"... do you happen to know what that is?
    <braunr> slob is a very simple allocator for embedded devices
    <mcsim> AFAIR this is just heap allocator
    <braunr> useful when you have a very low amount of memory
    <braunr> like 1 MiB
    <braunr> yes
    <antrik> just googled it :-)
    <braunr> zone and slab are very similar
    <antrik> sounds like a simple heap allocator
    <mcsim> there is another allocator that calls slub, and it better than slab in many cases
    <braunr> the main difference is the data structures used to store slabs
    <braunr> mcsim: i disagree
    <antrik> mcsim: ah, you already said that :-)
    <braunr> mcsim: slub is better for systems with very large amounts of memory and processors
    <braunr> otherwise, slab is better
    <braunr> in addition, there are accounting issues with slub
    <braunr> because of cache merging
    <mcsim> ok. This strange that slub is default allocator
    <braunr> well both are very good
    <braunr> iirc, linus stated that he really doesn't care as long as its works fine
    <braunr> he refused slqb because of that
    <braunr> slub is nice because it requires less memory than slab, while still being as fast for most cases
    <braunr> it gets slower on the free path, when the cpu performing the free is different from the one which allocated the object
    <braunr> that's a reasonable cost
    <mcsim> slub uses heap for large object. Are there any tests that compare what is better for large objects?
    <antrik> well, if slub requires less memory, why do you think slab is better for smaller systems? :-)
    <braunr> antrik: smaller is relative
    <antrik> mcsim: for large objects slab allocation is rather pointless, as you don't have multiple objects in a page anyways...
    <braunr> antrik: when lameter wrote slub, it was intended for systems with several hundreds processors
    <antrik> BTW, was slqb really refused only because the other ones are "good enough"?...
    <braunr> yes
    <antrik> wow, that's a strange argument...
    <braunr> linus is already unhappy of having "so many" allocators
    <antrik> well, if the new one is better, it could replace one of the others :-)
    <antrik> or is it useful only in certain cases?
    <braunr> that's the problem
    <braunr> nobody really knows
    <antrik> hm, OK... I guess that should be tested *before* merging ;-)
    <antrik> is anyone still working on it, or was it abandonned?
    <antrik> mcsim: back to caching...
    <antrik> what does caching in the kernel object allocator got to do with readahead (i.e. clustered paging)?...
    <mcsim> if we cached some physical pages we don't need to find new ones for allocating new object. And that's why there will not be a page fault.
    <mcsim> antrik: Regarding kam. Hasn't he finished his project?
    <antrik> err... what?
    <antrik> one of us must be seriously confused
    <antrik> I totally fail to see what caching of physical pages (which isn't even really a correct description of what slab does) has to do with page faults
    <antrik> right, KAM didn't finish his project
    <mcsim> If we free the physical page and return it to system we need another one for next allocation. But if we keep it, we don't need to find new physical page.
    <mcsim> And physical page is allocated only then when page fault occurs. Probably, I'm wrong
    <antrik> what does "return to system" mean? we are talking about the kernel...
    <antrik> zalloc/slab are about allocating kernel objects. this doesn't have *anything* to do with paging of userspace processes
    <antrik> only thing the have in common is that they need to get pages from the physical page allocator. but that's yet another topic
    <mcsim> Under "return to system" I mean ability to use this page for other needs.
    <braunr> mcsim: consider kernel memory to be wired
    <braunr> here, return to system means releasing a page back to the vm system
    <braunr> the vm_kmem module then unmaps the physical page and free its virtual address in the kernel map
    <mcsim> ok
    <braunr> antrik: the problem with new allocators like slqb is that it's very difficult to really know if they're better, even with extensive testing
    <braunr> antrik: there are papers (like wilson95) about the difficulties in making valuable results in this field
    <braunr> see http://www.sceen.net/~rbraun/dynamic_storage_allocation_a_survey_and_critical_review.pdf
    <mcsim> how can be allocated physically continuous object now?
    <braunr> mcsim: rephrase please
    <mcsim> what is similar to kmalloc in Linux to gnumach?
    <braunr> i know memory is reserved for dma in a direct virtual to physical mapping
    <braunr> so even if the allocation is done similarly to vmalloc()
    <braunr> the selected region of virtual space maps physical memory, so memory is physically contiguous too
    <braunr> for other allocation types, a block large enough is allocated, so it's contiguous too
    <mcsim> I don't clearly understand. If we have fragmentation in physical ram, so there aren't 2 free pages in a row, but there are able apart, we can't to allocate these 2 pages along?
    <braunr> no
    <braunr> but every system has this problem
    <mcsim> But since we have only 12 or 32 MB of memory the problem becomes more significant
    <braunr> you're confusing virtual and physical memory
    <braunr> those 32 MiB are virtual
    <braunr> the physical pages backing them don't have to be contiguous
    <mcsim> Oh, indeed
    <mcsim> So the only problem are limits?
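
The distinction made above between physically contiguous allocations and virtually contiguous ones can be illustrated with a toy frame table in C. The sketch assumes an 8-frame machine and invents every name in it (`frame_free`, `phys_alloc_contig`, `virt_alloc`); it is not gnumach code. When free frames alternate with used ones, a request for two adjacent frames fails, while two virtual pages can still be backed by any two free frames taken one at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8 /* hypothetical number of physical frames */

static bool frame_free[NPAGES]; /* true when the frame is free */

/* Mark even frames free and odd frames used: maximal fragmentation. */
void frames_set_alternating(void)
{
    for (size_t i = 0; i < NPAGES; i++)
        frame_free[i] = (i % 2 == 0);
}

/* Find n physically adjacent free frames; returns the first index or -1. */
int phys_alloc_contig(size_t n)
{
    for (size_t i = 0; i + n <= NPAGES; i++) {
        size_t j = 0;

        while (j < n && frame_free[i + j])
            j++;

        if (j == n) {
            for (j = 0; j < n; j++)
                frame_free[i + j] = false;
            return (int)i;
        }
    }

    return -1;
}

/* Back n virtually contiguous pages with any free frames, one frame at
 * a time. frames[k] receives the frame chosen for virtual page k.
 * Returns 0 on success, -1 (after rolling back) if frames run out. */
int virt_alloc(size_t n, int frames[])
{
    size_t got = 0;

    for (size_t i = 0; i < NPAGES && got < n; i++) {
        if (frame_free[i]) {
            frame_free[i] = false;
            frames[got++] = (int)i;
        }
    }

    if (got == n)
        return 0;

    while (got > 0)
        frame_free[frames[--got]] = true;

    assert(got == 0);
    return -1;
}
```

After `frames_set_alternating()`, `phys_alloc_contig(2)` fails even though four frames are free, whereas `virt_alloc(2, frames)` succeeds by picking two non-adjacent frames.
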
    <braunr> and performance
    <braunr> and correctness
    <braunr> i find the zone allocator badly written
    <braunr> antrik: mcsim: here is the content of the kernel pmap on NetBSD (which uses a virtual memory system close to the Mach VM)
    <braunr> antrik: mcsim: http://www.sceen.net/~rbraun/pmap.out

[[pmap.out]]

    <braunr> you can see the kmem_map (which is used for most general kernel allocations) is 128 MiB large
    <braunr> actually it's not the kernel pmap, it's the kernel_map
    <antrik> braunr: why is it called pmap.out then? ;-)
    <braunr> antrik: because the tool is named pmap
    <braunr> for process map
    <braunr> it also exists under Linux, although direct access to /proc/xx/maps gives more info
    <mcsim> braunr: I've said that this is kernel_map. Can I see kernel_map for Linux?
    <braunr> mcsim: I don't know how to do that
    <mcsim> s/I've/You've
    <braunr> but Linux doesn't have submaps, and uses a direct virtual to physical mapping, so it's used differently
    <antrik> how are things (such as zalloc zones) entered into kernel_map?
    <braunr> in zone_init() you have
    <braunr> zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max, zone_map_size, FALSE);
    <braunr> so here, kmem_map is named zone_map
    <braunr> then, in zalloc()
    <braunr> kmem_alloc_wired(zone_map, &addr, zone->alloc_size)
    <antrik> so, kmem_alloc just deals out chunks of memory referenced directly by the address, and without knowing anything about the use?
    <braunr> kmem_alloc() gives virtual pages
    <braunr> zalloc() carves them into buffers, as in the slab allocator
    <braunr> the difference is essentially the lack of formal "slab" object
    <braunr> which makes the zone code look like a mess
    <antrik> so kmem_suballoc() essentially just takes a bunch of pages from the main kernel_map, and uses these to back another map which then in turn deals out pages just like the main kernel_map?
    <braunr> no
    <braunr> kmem_suballoc creates a vm_map_entry object, and sets its start and end address
    <braunr> and creates a vm_map object, which is then inserted in the new entry
    <braunr> maybe that's what you meant with "essentially just takes a bunch of pages from the main kernel_map"
    <braunr> but there really is no allocation at this point
    <braunr> except the map entry and the new map objects
    <antrik> well, I'm trying to understand how kmem_alloc() manages things. so it has map_entry structures like the maps of userspace processes? do these also reference actual memory objects?
    <braunr> kmem_alloc just allocates virtual pages from a vm_map, and backs those with physical pages (unless the user requested pageable memory)
    <braunr> it's not "like the maps of userspace processes"
    <braunr> these are actually the same structures
    <braunr> a vm_map_entry can reference a memory object or a kernel submap
    <braunr> in netbsd, it can also referernce nothing (for pure wired kernel memory like the vm_page array)
    <braunr> maybe it's the same in mach, i don't remember exactly
    <braunr> antrik: this is actually very clear in vm/vm_kern.c
    <braunr> kmem_alloc() creates a new kernel object for the allocation
    <braunr> allocates a new entry (or uses a previous existing one if it can be extended) through vm_map_find_entry()
    <braunr> then calls kmem_alloc_pages() to back it with wired memory
    <antrik> "creates a new kernel object" -- what kind of kernel object?
    <braunr> kmem_alloc_wired() does roughly the same thing, except it doesn't need a new kernel object because it knows the new area won't be pageable
    <braunr> a simple vm_object
    <braunr> used as a container for anonymous memory in case the pages are swapped out
    <antrik> vm_object is the same as memory object/pager? or yet something different?
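
The remark above that `zalloc()` carves pages into buffers but lacks a formal "slab" object can be pictured with a small C sketch of what such an object might track. Everything here is hypothetical: `malloc()` stands in for `kmem_alloc_wired()`, `page_slab` and its functions are invented names, and a real slab layer also deals with colouring, constructors and slab lists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

/* A free buffer stores the link to the next free buffer in place. */
struct buf {
    struct buf *next;
};

/* The "formal slab object": one page plus its bookkeeping. */
struct page_slab {
    void *page;
    struct buf *free_list;
    size_t nr_free;
    size_t nr_total;
};

/* Get one page and carve it into obj_size buffers; 0 on success. */
int page_slab_init(struct page_slab *s, size_t obj_size)
{
    if (obj_size < sizeof(struct buf) || obj_size > PAGE_SIZE)
        return -1;

    /* Round up so every buffer can hold an aligned pointer. */
    obj_size = (obj_size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);

    s->page = malloc(PAGE_SIZE); /* stand-in for kmem_alloc_wired() */
    if (s->page == NULL)
        return -1;

    s->free_list = NULL;
    s->nr_free = 0;
    s->nr_total = PAGE_SIZE / obj_size;

    for (size_t i = 0; i < s->nr_total; i++) {
        struct buf *b = (struct buf *)((char *)s->page + i * obj_size);

        b->next = s->free_list;
        s->free_list = b;
        s->nr_free++;
    }

    return 0;
}

/* Pop a buffer from the free list, or NULL when the slab is full. */
void *page_slab_alloc(struct page_slab *s)
{
    struct buf *b = s->free_list;

    if (b == NULL)
        return NULL;

    s->free_list = b->next;
    s->nr_free--;
    return b;
}

/* Push a buffer back; the page itself is kept (cached) by the slab. */
void page_slab_free(struct page_slab *s, void *obj)
{
    struct buf *b = obj;

    assert(s->nr_free < s->nr_total);
    b->next = s->free_list;
    s->free_list = b;
    s->nr_free++;
}

/* A reclaim pass could release the page once no buffer is in use. */
int page_slab_is_unused(const struct page_slab *s)
{
    return s->nr_free == s->nr_total;
}
```

Keeping the per-page counters in one structure is what lets a reclaim pass decide cheaply whether a page can be handed back, which is the decision `zone_gc()` has to reconstruct without such an object.
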
    <braunr> antrik: almost
    <braunr> antrik: a memory_object is the user view of a vm_object
    <braunr> as in the kernel/user interfaces used by external pagers
    <braunr> vm_object is a more internal name
    <mcsim> Is fragmentation a big problem in slab allocator?
    <mcsim> I've tested it on my computer in Linux and for some caches it reached 30-40%
    <antrik> well, fragmentation is a major problem for any allocator...
    <antrik> the original slab allocator was design specifically with the goal of reducing fragmentation
    <antrik> the revised version with the addition of magazines takes a step back on this though
    <antrik> have you compared it to slub? would be pretty interesting...
    <mcsim> I have an idea how can it be decreased, but it will hurt by performance...
    <mcsim> antrik: no I haven't, but there will be might the same, I think
    <mcsim> if each cache will handle two types of object: with sizes that will fit cache sizes (or I bit smaller) and with sizes which are much smaller than maximal cache size. For first type of object will be used standard slab allocator and for latter type will be used (within page) heap allocator.
    <mcsim> I think that than fragmentation will be decreased
    <antrik> not at all. heap allocator has much worse fragmentation. that's why slab allocator was invented
    <antrik> the problem is that in a long-running program (such an the kernel), objects tend to have vastly varying lifespans
    <mcsim> but we use heap only for objects of specified sizes
    <antrik> so often a few old objects will keep a whole page hostage
    <mcsim> for example for 32 byte cache it could be 20-28 byte objects
    <antrik> that's particularily visible in programs such as firefox, which will grow the heap during use even though actual needs don't change
    <antrik> the slab allocator groups objects in a fashion that makes it more likely adjacent objects will be freed at similar times
    <antrik> well, that's pretty oversimplyfied, but I hope you get the idea... it's about locality
    <mcsim> I agree, but I speak not about general heap allocation. We have many heaps for objects with different sizes.
    <mcsim> Could it be better?
    <antrik> note that this has been a topic of considerable research. you shouldn't seek to improve the actual algorithms -- you would have to read up on the existing research at least before you can contribute anything to the field :-)
    <antrik> how would that be different from the slab allocator?
    <mcsim> slab will allocate 32 byte for both 20 and 32 byte requests
    <mcsim> And if there was request for 20 bytes we get 12 unused
    <antrik> oh, you mean the implementation of the generic allocator on top of slabs? well, that might not be optimal... but it's not an often used case anyways. mostly the kernel uses constant-sized objects, which get their own caches with custom tailored size
    <antrik> I don't think the waste here matters at all
    <mcsim> affirmative. So my idea is useless.
    <antrik> does the statistic you refer to show the fragmentation in absolute sizes too?
    <mcsim> Can you explain what is absolute size?
    <mcsim> I've counted what were requested (as parameter of kmalloc) and what was really allocated (according to best fit cache size).
    <antrik> how did you get that information?
    <mcsim> I simply wrote a hook
    <antrik> I mean total. i.e. how many KiB or MiB are wasted due to fragmentation alltogether
    <antrik> ah, interesting. how does it work?
    <antrik> BTW, did you read the slab papers?
    <mcsim> Do you mean articles from lwn.net?
    <antrik> no
    <antrik> I mean the papers from the Sun hackers who invented the slab allocator(s)
    <antrik> Bonwick mostly IIRC
    <mcsim> Yes
    <antrik> hm... then you really should know the rationale behind it...
    <mcsim> There he says about 11% percent of memory waste
    <antrik> you didn't answer my other questions BTW :-)
    <mcsim> I've corrupted kernel tree with patch, and tomorrow I'm going to read myself up for exam (I have it on Thursday). But than I'll send you a module which I've used for testing.
    <antrik> OK
    <mcsim> I can send you module now, but it will not work without patch.
    <mcsim> It would be better to rewrite it using debugfs, but when I was writing this test I didn't know about trace_* macros


# IRC, freenode, #hurd, 2011-04-15

    <mcsim> There is a hack in zone_gc when it allocates and frees two vm_map_kentry_zone elements to make sure the gc will be able to allocate two in vm_map_delete. Isn't it better to allocate memory for these entries statically?
    <youpi> mcsim: that's not the point of the hack
    <youpi> mcsim: the point of the hack is to make sure vm_map_delete will be able to allocate stuff
    <youpi> allocating them statically will just work once
    <youpi> it may happen several times that vm_map_delete needs to allocate it while it's empty (and thus zget_space has to get called, leading to a hang)
    <youpi> funnily enough, the bug is also in macos X
    <youpi> it's still in my TODO list to manage to find how to submit the issue to them
    <braunr> really ?
    <braunr> eh
    <braunr> is that because of map entry splitting ?
    <youpi> it's git commit efc3d9c47cd744c316a8521c9a29fa274b507d26
    <youpi> braunr: iirc something like this, yes
    <braunr> netbsd has this issue too
    <youpi> possibly
    <braunr> i think it's a fundamental problem with the design
    <braunr> people think of munmap() as something similar to free()
    <braunr> whereas it's really unmap
    <braunr> with a BSD-like VM, unmap can easily end up splitting one entry in two
    <braunr> but your issue is more about harmful recursion right ?
    <youpi> I don't remember actually
    <youpi> it's quite some time ago :)
    <braunr> ok
    <braunr> i think that's why i have "sources" in my slab allocator, the default source (vm_kern) and a custom one for kernel map entries


# IRC, freenode, #hurd, 2011-04-18

    <mcsim> braunr: you've said that once page is completely free, it is returned to the vm.
    <mcsim> who else, besides zone_gc, can return free pages to the vm?
    <braunr> mcsim: i also said i was wrong about that
    <braunr> zone_gc is the only one


# IRC, freenode, #hurd, 2011-04-19

    <braunr> antrik: mcsim: i added back a new per-cpu layer as planned
    <braunr> http://git.sceen.net/rbraun/libbraunr.git/?a=blob;f=mem.c;h=c629b2b9b149f118a30f0129bd8b7526b0302c22;hb=HEAD
    <braunr> mcsim: btw, in mem_cache_reap(), you can clearly see there are two loops, just as in zone_gc, to reduce contention and avoid deadlocks
    <braunr> this is really common in memory allocators


# IRC, freenode, #hurd, 2011-04-23

    <mcsim> I've looked through some allocators and all of them use different per cpu cache policy. AFAIK gnuhurd doesn't support multiprocessing, but still multiprocessing must be kept in mind. So, what do you think what kind of cpu caches is better? As for me I like variant with only per-cpu caches (like in slqb).
    <antrik> mcsim: well, have you looked at the allocator braunr wrote himself? :-)
    <antrik> I'm not sure I suggested that explicitly to you; but probably it makes most sense to use that in gnumach


# IRC, freenode, #hurd, 2011-04-24

    <mcsim> antrik: Yes, I have. He uses both global and per cpu caches. But he also suggested to look through slqb, where there are only per cpu caches.\
    <braunr> i don't remember slqb in detail
    <braunr> what do you mean by "only per-cpu caches" ?
    <braunr> a whole slab sytem for each cpu ?
    <mcsim> I mean that there are no global queues in caches, but there are special queues for each cpu.
    <mcsim> I've just started investigating slqb's code, but I've read an article on lwn about it. And I've read that it is used for zen kernel.
    <braunr> zen ?
<mcsim> Here is this article http://lwn.net/Articles/311502/ <mcsim> Yes, this is linux kernel with some patches which haven't been approved to torvald's tree <mcsim> http://zen-kernel.org/ <braunr> i see <braunr> well it looks nice <braunr> but as for slub, the problem i can see is cross-CPU freeing <braunr> and I think nick piggins mentions it <braunr> piggin* <braunr> this means that sometimes, objects are "burst-free" from one cpu cache to another <braunr> which has the same bad effects as in most other allocators, mainly fragmentation <mcsim> There is a special list for freeing object allocated for another CPU <mcsim> And garbage collector frees such object on his own <braunr> so what's your question ? <mcsim> It is described in the end of article. <mcsim> What cpu-cache policy do you think is better to implement? <braunr> at this point, any <braunr> and even if we had a kernel that perfectly supports multiprocessor, I wouldn't care much now <braunr> it's very hard to evaluate such allocators <braunr> slqb looks nice, but if you have the same amount of fragmentation per slab as other allocators do (which is likely), you have tat amount of fragmentation multiplied by the number of processors <braunr> whereas having shared queues limit the problem somehow <braunr> having shared queues mean you have a bit more contention <braunr> so, as is the case most of the time, it's a tradeoff <braunr> by the way, does pigging say why he "doesn't like" slub ? :) <braunr> piggin* <mcsim> http://lwn.net/Articles/311093/ <mcsim> here he describes what slqb is better. 
<braunr> well it doesn't describe why slub is worse <mcsim> but not very particularly <braunr> except for order-0 allocations <braunr> and that's a form of fragmentation like i mentioned above <braunr> in mach those problems have very different impacts <braunr> the backend memory isn't physical, it's the kernel virtual space <braunr> so the kernel allocator can request chunks of higher than order-0 pages <braunr> physical pages are allocated one at a time, then mapped in the kernel space <mcsim> Doesn't order of page depend on buffer size? <braunr> it does <mcsim> And why does gnumach allocates higher than order-0 pages more? <braunr> why more ? <braunr> i didn't say more <mcsim> And why in mach those problems have very different impact? <braunr> ? <braunr> i've just explained why :) <braunr> 09:37 < braunr> physical pages are allocated one at a time, then mapped in the kernel space <braunr> "one at a time" means order-0 pages, even if you allocate higher than order-0 chunks <mcsim> And in Linux they allocated more than one at time because of prefetching page reading? <braunr> do you understand what virtual memory is ? <braunr> linux allocators allocate "physical memory" <braunr> mach kernel allocator allocates "virtual memory" <braunr> so even if you allocate a big chunk of virtual memory, it's backed by order-0 physical pages <mcsim> yes, I understand this <braunr> you don't seem to :/ <braunr> the problem of higher than order-0 page allocations is fragmentation <braunr> do you see why ? 
<mcsim> yes <braunr> so <braunr> fragmentation in the kernel space is less likely to create issues than it does in physical memory <braunr> keep in mind physical memory is almost always full because of the page cache <braunr> and constantly under some pressure <braunr> whereas the kernel space is mostly empty <braunr> so allocating higher then order-0 pages in linux is more dangerous than it is in Mach or BSD <mcsim> ok <braunr> on the other hand, linux focuses pure performance, and not having to map memory means less operations, less tlb misses, quicker allocations <braunr> the Mach VM must map pages "one at a time", which can be expensive <braunr> it should be adapted to handle multiple page sizes (e.g. 2 MiB) so that many allocations can be made with few mappings <braunr> but that's not easy <braunr> as always: tradeoffs <mcsim> There are other benefits of physical allocating. In big DMA transfers can be needed few continuous physical pages. How does mach handles such cases? <braunr> gnumach does that awfully <braunr> it just reserves the whole DMA-able memory and uses special allocation functions on it, IIRC <braunr> but kernels which have a MAch VM like memory sytem such as BSDs have cleaner methods <braunr> NetBSD provides a function to allocate contiguous physical memory <braunr> with many constraints <braunr> FreeBSD uses a binary buddy system like Linux <braunr> the fact that the kernel allocator uses virtual memory doesn't mean the kernel has no mean to allocate contiguous physical memory ... # IRC, freenode, #hurd, 2011-05-02 <braunr> hm nice, my allocator uses less memory than glibc (squeeze version) on both 32 and 64 bits systems <braunr> the new per-cpu layer is proving effective <neal> braunr: Are you reimplementation malloc? <braunr> no <braunr> it's still the slab allocator for mach, but tested in userspace <braunr> so i wrote malloc wrappers <neal> Oh. 
<braunr> i try to heavily test most of my code in userspace now <neal> it's easier :-) <neal> I agree <braunr> even the physical memory allocator has been implemented this way <neal> is this your mach version? <braunr> virtual memory allocation will follow <neal> or are you working on gnu mach? <braunr> for now it's my version <braunr> but i intend to spend the summer working on ipc port names management [[rework_gnumach_IPC_spaces]]. <braunr> and integrate the result in gnu mach <neal> are you keeping the same user-space API? <neal> Or are you experimenting with something new? <antrik> braunr: to be fair, it's not terribly hard to use less memory than glibc :-) <braunr> yes <braunr> antrik: well ptmalloc3 received some nice improvements <braunr> neal: the goal is to rework some of the internals only <braunr> neal: namely, i simply intend to replace the splay tree with a radix tree <antrik> braunr: the glibc allocator is emphasising performace, unlike some other allocators that trade some performance for much better memory utilisation... <antrik> ptmalloc3? <braunr> that's the allocator used in glibc <braunr> http://www.malloc.de/en/ <antrik> OK. haven't seen any recent numbers... the comparision I have in mind is many years old... <braunr> i also made some additions to my avl and red-black trees this week end, which finally make them suitable for almost all generic uses <braunr> the red-black tree could be used in e.g. gnu mach to augment the linked list used in vm maps <braunr> which is what's done in most modern systems <braunr> it could also be used to drop the overloaded (and probably over imbalanced) page cache hash table [[gnumach_vm_map_red-black_trees]]. # IRC, freenode, #hurd, 2011-05-03 <mcsim> antrik: How should I start porting? Have I just include rbraun's allocator to gnumach and make it compile? <antrik> mcsim: well, basically yes I guess... 
but you will have to look at the code in question first before we know anything more specific :-) <antrik> I guess braunr might know better how to start, but he doesn't appear to be here :-( <braunr> mcsim: you can't juste put my code into gnu mach and make it run, it really requires a few careful changes <braunr> mcsim: you will have to analyse how the current zone allocator interacts with regard to locking <braunr> if it is used in interrupt handlers <braunr> what kind of locks it should use instead of the pthread stuff available in userspace <braunr> you will have to change the reclamiing policy, so that caches are reaped on demand <braunr> (this basically boils down to calling the new reclaiming function instead of zone_gc()) <braunr> you must be careful about types too <braunr> there is work to be done ;) <braunr> (not to mention the obvious about replacing all the calls to the zone allocator, and testing/debugging afterwards) # IRC, freenode, #hurd, 2011-07-14 <braunr> can you make your patch available ? <mcsim> it is available in gnumach repository at savannah <mcsim> tree mplaneta/libbraunr/master <braunr> mcsim: i'll test your branch <mcsim> ok. I'll give you a link in a minute <braunr> hm why balloc ? <mcsim> Braun's allocator <braunr> err <braunr> http://git.sceen.net/rbraun/x15mach.git/?a=blob;f=kern/kmem.c;h=37173fa0b48fc9d7e177bf93de531819210159ab;hb=HEAD <braunr> mcsim: this is the interface i had in mind for a kernel version :) <braunr> very similar to the original slab allocator interface actually <braunr> well, you've been working <mcsim> But I have a problem with this patch. When I apply it to gnumach code from debian repository. I have to make a change in file ramdisk.c with sed -i 's/kernel_map/\&kernel_map/' device/ramdisk.c <mcsim> because in git repository there is no such file <braunr> mcsim: how do you configure the kernel before building ? 
<braunr> mcsim: you should keep in touch more often i think, so that you get feedback from us and don't spend too much time "off course" <mcsim> I didn't configure it. I just run dpkg-buildsource -b. <braunr> oh you build the debian package <braunr> well my version was by configure --enable-kdb --enable-rtl8139 <braunr> and it seems stuck in an infinite loop during bootstrap <mcsim> and printf doesn't work. The first function called by c_boot_entry is printf(version). <braunr> mcsim: also, you're invited to get the x15mach version of my files, which are gplv2+ licensed <braunr> be careful of my macros.h file, it can conflict with the macros_help.h file from gnumach iirc <mcsim> There were conflicts with MACRO_BEGIN and MACRO_END. But I solved it <braunr> ok <braunr> it's tricky <braunr> mcsim: try to find where the first use of the allocator is made # IRC, freenode, #hurd, 2011-07-22 <mcsim> braunr, hello. Kernel with your allocator already compiles and runs. There still some problems, but, certainly, I'm on the final stage already. I hope I'll finish in a few days. <tschwinge> mcsim: Oh, cool! Have you done some measurements already? <mcsim> Not yet <tschwinge> OK. <tschwinge> But if it able to run a GNU/Hurd system, then that already is something, a big milestone! <braunr> nice <braunr> although you'll probably need to tweak the garbage collecting process <mcsim> tschwinge: thanks <mcsim> braunr: As back-end for allocating memory I use kmem_alloc_wired. But in zalloc was an opportunity to use as back-end kmem_alloc_pageable. Although there was no any zone that used kmem_alloc_pageable. Do I need to implement this functionality? <braunr> mcsim: do *not* use kmem_alloc_pageable() <mcsim> braunr: Ok. 
This is even better) <braunr> mcsim: in x15, i've taken this even further: there is *no* kernel vm object, which means all kernel memory is wired and unmanaged <braunr> making it fast and safe <braunr> pageable kernel memory was useful back when RAM was really scarce <braunr> 20 years ago <braunr> but it's a source of deadlock <mcsim> Indeed. I'll won't use kmem_alloc_pageable. # IRC, freenode, #hurd, 2011-08-09 < braunr> mcsim: what's the "bug related to MEM_CF_VERIFY" you refer to in one of your commits ? < braunr> mcsim: don't use spin_lock_t as a member of another structure < mcsim> braunr: I confused with types in *_verify functions, so they didn't work. Than I fixed it in the commit you mentioned. < braunr> in gnumach, most types are actually structure pointers < braunr> use simple_lock_data_t < braunr> mcsim: ok < mcsim> > use simple_lock_data_t < mcsim> braunr: ok < braunr> mcsim: don't make too many changes to the code base, and if you're unsure, don't hesitate to ask < braunr> also, i really insist you rename the allocator, as done in x15 for example (http://git.sceen.net/rbraun/x15mach.git/?a=blob;f=vm/kmem.c), instead of a name based on mine :/ < mcsim> braunr: Ok. It was just work name. When I finish I'll rename the allocator. < braunr> other than that, it's nice to see progress < braunr> although again, it would be better with some reports along < braunr> i won't be present at the meeting tomorrow unfortunately, but you should use those to report the status of your work < mcsim> braunr: You've said that I have to tweak gc process. Did you mean to call mem_gc() when physical memory ends instead of calling it every x seconds? Or something else? 
< braunr> there are multiple topics, alhtough only one that really matters < braunr> study how zone_gc was called < braunr> reclaiming memory should happen when there is pressure on the VM subsystem < braunr> but it shouldn't happen too ofte, otherwise there is trashing < braunr> and your caches become mostly useless < braunr> the original slab allocator uses a 15-second period after a reclaim during which reclaiming has no effect < braunr> this allows having a somehow stable working set for this duration < braunr> the linux slab allocator uses 5 seconds, but has a more complicated reclaiming mechanism < braunr> it releases memory gradually, and from reclaimable caches only (dentry for example) < braunr> for x15 i intend to implement the original 15 second interval and then perform full reclaims < mcsim> In zalloc mem_gc is called by vm_pageout_scan, but not often than once a second. < mcsim> In balloc I've changed interval to once in 15 seconds. < braunr> don't use the code as it is < braunr> the version you've based your work on was meant for userspace < braunr> where there isn't memory pressure < braunr> so a timer is used to trigger reclaims at regular intervals < braunr> it's different in a kernel < braunr> mcsim: where did you see vm_pageout_scan call the zone gc once a second ? < mcsim> vm_pageout_scan calls consider_zone_gc and consider_zone_gc checks if second is passed. < braunr> where ? < mcsim> Than zone_gc can be called. < braunr> ah ok, it's in zaclloc.c then < braunr> zalloc.c < braunr> yes this function is fine < mcsim> so old gc didn't consider vm pressure. Or I missed something. < braunr> it did < mcsim> how? < braunr> well, it's called by the pageout daemon < braunr> under memory pressure < braunr> so it's fine < mcsim> so if mem_gc is called by pageout daemon is it fine? < braunr> it must be changed to do something similar to what consider_zone_gc does < mcsim> It does. mem_gc does the same work as consider_zone_gc and zone_gc. 
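Note: a minimal sketch of the reclaim throttling described above, assuming a tick counter and a fixed quiet period after each reclaim. The names `gc_consider`, `gc_reclaim`, `HZ` and `GC_INTERVAL_TICKS` are hypothetical and are not taken from gnumach or x15; the log only says that `consider_zone_gc` checks whether a second has passed, and that the original slab allocator uses a 15-second period.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values: tick rate and quiet period after a reclaim. */
#define HZ                 100UL
#define GC_INTERVAL_TICKS  (15UL * HZ)

static unsigned long gc_last_tick;  /* tick of the last reclaim */
static bool gc_has_run;             /* false until the first reclaim */
static unsigned long gc_runs;       /* number of reclaims performed */

/* Stand-in for the real reclaim work (e.g. reaping every cache). */
void gc_reclaim(void)
{
    gc_runs++;
}

/*
 * Meant to be called from the pageout path, i.e. under memory pressure.
 * Reclaims at most once per GC_INTERVAL_TICKS and returns true if it did.
 */
bool gc_consider(unsigned long now)
{
    /* Invariant: no reclaim can have been counted before the first run. */
    assert(gc_has_run || gc_runs == 0);

    /* Unsigned subtraction stays correct if the tick counter wraps. */
    if (gc_has_run && (now - gc_last_tick) < GC_INTERVAL_TICKS)
        return false;

    gc_reclaim();
    gc_last_tick = now;
    gc_has_run = true;
    return true;
}
```

The point of the quiet period is the one made in the discussion: the caller decides when there is pressure, while the interval keeps the caches from being emptied so often that they stop being useful.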
< braunr> good < mcsim> so gc process is fine? < braunr> should be < braunr> i see mem.c only includes mem.h, which then includes other headers < braunr> don't do that < braunr> always include all the headers you need where you need them < braunr> if you need avltree.h in both mem.c and mem.h, include it in both files < braunr> and by the way, i recommend you use the red black tree instead of the avl type < braunr> (it's the same interface so it shouldn't take long) < mcsim> As to report. If you won't be present at the meeting, I can tell you what I have to do now. < braunr> sure < braunr> in addition, use GPLv2 as the license, teh BSD one is meant for the userspace version only < braunr> GPLv2+ actually < braunr> hm you don't need list.c < braunr> it would only add dead code < braunr> "Zone for dynamical allocator", don't mix terms < braunr> this comment refers to a vm_map, so call it a map < mcsim> 1. Change constructor for kentry_alloc_cache. < mcsim> 2. Make measurements. < mcsim> + < mcsim> 3. Use simple_lock_data_t < mcsim> 4. Replace license < braunr> kentry_alloc_cache <= what is that ? < braunr> cache for kernel map entries in vm_map ? < braunr> the comment for mem_cpu_pool_get doesn't apply in gnumach, as there is no kernel preemption [[microkernel/mach/gnumach/preemption]]. < braunr> "Don't attempt mem GC more frequently than hz/MEM_GC_INTERVAL times a second. < braunr> " < mcsim> sorry. I meant vm_map_kentry_cache < braunr> hm nothing actually about this comment < braunr> mcsim: ok < braunr> yes kernel map entries need special handling < braunr> i don't know how it's done in gnumach though < braunr> static preallocation ? 
< mcsim> yes < braunr> that's ugly :p < mcsim> but it uses dynamic allocation further even for vm_map kernel entries < braunr> although such bootstrapping issues are generally difficult to solve elegantly < braunr> ah < mcsim> now I use only static allocation, but I'll add dynamic allocation too < braunr> when you have time, mind the coding style (convert everything to gnumach style, which mostly implies using tabs instead of 4-spaces indentation) < braunr> when you'll work on dynamic allocation for the kernel map entries, you may want to review how it's done in x15 < braunr> the mem_source type was originally intended for that purpose, but has slightly changed once the allocator was adapted to work in my kernel < mcsim> ok < braunr> vm_map_kentry_zone is the only zone created with ZONE_FIXED < braunr> and it is zcram()'ed immediately after < braunr> so you can consider it a statically allocated zone < braunr> in x15 i use another strategy: there is a special kernel submap named kentry_map which contains only one map entry (statically allocated) < braunr> this map is the backend (mem_source) for the kentry_cache < braunr> the kentry_cache is created with a special flag that tells it memory can't be reclaimed < braunr> when the cache needs to grow, the single map entry is extended to cover the allocated memory < braunr> it's similar to the way pmap_growkernel() works for kernel page table pages < braunr> (and is actually based on that idea) < braunr> it's a compromise between full static and dynamic allocation types < braunr> the advantage is that the allocator code can be used (so there is no need for a special allocator like in netbsd) < braunr> the drawback is that some resources can never be returned to their source (and under peaks, the amount of unfreeable resources could become large, but this is unexpected) < braunr> mcsim: for now you shouldn't waste your time with this < braunr> i see the number of kernel map entries is fixed at 256 < braunr> and i've 
never seen the kernel use more than around 30 entries < mcsim> Do you think that I have to left this problem to the end? < braunr> yes # IRC, freenode, #hurd, 2011-08-11 < mcsim> braunr: Hello. Can you give me an advice how can I make measurements better? < braunr> mcsim: what kind of measurements < mcsim> braunr: How much is your allocator better than zalloc. < braunr> slightly :p < braunr> that's why i never took the time to put it in gnumach < mcsim> braunr: Just I thought that there are some rules or recommendations of such measurements. Or I can do them any way I want? < braunr> mcsim: i don't know < braunr> mcsim: benchmarking is an art of its own, and i don't even know how to use the bits of profiling code available in gnumach (if it still works) < antrik> mcsim: hm... are you saying you already have a running system with slab allocator?... :-) < braunr> mcsim: the main advantage i can see is the removal of many arbitrary hard limits < mcsim> antrik: yes < antrik> \o/ < antrik> nice work! < braunr> :) < braunr> the cpu layer should also help a bit, but it's hard to measure < braunr> i guess it could be seen on the ipc path for very small buffers < mcsim> antrik: Thanks. But I still have to 1. Change constructor for kentry_alloc_cache. and 2. Make measurements. < braunr> and polish the whole thing :p < antrik> mcsim: I'm not sure this can be measured... the performance differente in any real live usage is probably just a few percent at most -- it's hard to construct a benchmark giving enough precision so it's not drowned in noise... < antrik> perhaps it conserves some memory -- but that too would be hard to measure I fear < braunr> yes < braunr> there *should* be better allocation times, less fragmentation, better accounting ... :) < braunr> and no arbitrary limits ! 
< antrik> :-) < braunr> oh, and the self debugging features can be nice too < mcsim> But I need to prove that my work wasn't useless < braunr> well it wasn't, but that's hard to measure < braunr> it's easy to prove though, since there are additional features that weren't present in the zone allocator < mcsim> Ok. If there are some profiling features in gnumach can you give me a link with their description? < braunr> mcsim: sorry, no < braunr> mcsim: you could still write the basic loop test, which counts the number of allocations performed in a fixed time interval < braunr> but as it doesn't match many real life patterns, it won't be very useful < braunr> and i'm afraid that if you consider real life patterns, you'll see how negligeable the improvement can be compared to other operations such as memory copies or I/O (ouch) < mcsim> Do network drivers use this allocator? < mcsim> ok. I'll scrape up some test and than I'll report results. # IRC, freenode, #hurd, 2011-08-26 < mcsim> hello. Are there any analogs of copy_to_user and copy_from_user in linux for gnumach? < mcsim> Or how can I determine memory map if I know address? I need this for vm_map_copyin < guillem> mcsim: vm_map_lookup_entry? < mcsim> guillem: but I need to transmit map to this function and it will return an entry which contains specified address. < mcsim> And I don't know what map have I transmit. < mcsim> I need to transfer static array from kernel to user. What map contains static data? < antrik> mcsim: Mach doesn't have copy_{from,to}_user -- instead, large chunks of data are transferred as out-of-line data in IPC messages (i.e. using VM magic) < mcsim> antrik: can you give me an example? I just found using vm_map_copyin in host_zone_info. < antrik> no idea what vm_map_copyin is to be honest... 
# IRC, freenode, #hurd, 2011-08-27 < braunr> mcsim: the primitives are named copyin/copyout, and they are used for messages with inline data < braunr> or copyinmsg/copyoutmsg < braunr> vm_map_copyin/out should be used for chunks larger than a page (or roughly a page) < braunr> also, when writing to a task space, see which is better suited: vm_map_copyout or vm_map_copy_overwrite < mcsim> braunr: and what will be src_map for vm_map_copyin/out? < braunr> the caller map < braunr> which you can get with current_map() iirc < mcsim> braunr: thank you < braunr> be careful not to leak anything in the transferred buffers < braunr> memset() to 0 if in doubt < mcsim> braunr:ok < braunr> antrik: vm_map_copyin() is roughly vm_read() < antrik> braunr: what is it used for? < braunr> antrik: 01:11 < antrik> mcsim: Mach doesn't have copy_{from,to}_user -- instead, large chunks of data are transferred as out-of-line data in IPC messages (i.e. using VM magic) < braunr> antrik: that "VM magic" is partly implemented using vm_map_copy* functions < antrik> braunr: oh, you mean it doesn't actually copy data, but only page table entries? if so, that's *not* really comparable to copy_{from,to}_user()... # IRC, freenode, #hurd, 2011-08-28 < braunr> antrik: the equivalent of copy_{from,to}_user are copy{in,out}{,msg} < braunr> antrik: but when the data size is about a page or more, it's better not to copy, of course < antrik> braunr: it's actually not clear at all that it's really better to do VM magic than to copy... # IRC, freenode, #hurd, 2011-08-29 < braunr> antrik: at least, that used to be the general idea, and with a simpler VM i suspect it's still true < braunr> mcsim: did you progress on your host_zone_info replacement ? 
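Note: a rough userspace sketch of the rule of thumb given above. Small transfers are copied, transfers of about a page or more are left to the VM, and a buffer is cleared before it is filled so that no stale bytes leak to the receiver. `PAGE_SIZE_ASSUMED`, `pick_transfer` and `fill_reply` are hypothetical names; `memcpy` merely stands in for a copy primitive such as `copyout`, and none of this is gnumach code.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: 4 KiB pages */

enum transfer {
    TRANSFER_COPY,  /* plain copy, in the spirit of copyin/copyout */
    TRANSFER_VM     /* hand the range to the VM, as vm_map_copyin does */
};

/* Pick a transfer method from the size alone ("roughly a page"). */
enum transfer pick_transfer(size_t size)
{
    return (size < PAGE_SIZE_ASSUMED) ? TRANSFER_COPY : TRANSFER_VM;
}

/*
 * Fill a reply buffer without leaking old contents: the whole buffer is
 * zeroed first, then the payload is copied in.
 * Returns 0 on success, -1 if the payload does not fit.
 */
int fill_reply(void *buf, size_t buf_size, const void *data, size_t len)
{
    if (len > buf_size)
        return -1;

    memset(buf, 0, buf_size);
    memcpy(buf, data, len);
    return 0;
}
```

Where the threshold really belongs is exactly what antrik questions in the log, so the one-page cutoff here is an illustration of the "general idea", not a measured value.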
< braunr> mcsim: i think you should stick to what the original implementation did < braunr> which is making an inline copy if caller provided enough space, using kmem_alloc_pageable otherwise < braunr> specify ipc_kernel_map if using kmem_alloc_pageable < mcsim> braunr: yes. And it works. But I use kmem_alloc, not pageable. Is it worse? < mcsim> braunr: host_zone_info replacement is pushed to savannah repository. < braunr> mcsim: i'll have a look < mcsim> braunr: I've pushed one more commit just now, which has attitude to host_zone_info. < braunr> mem_alloc_early_init should be renamed mem_bootstrap < mcsim> ok < braunr> mcsim: i don't understand your call to kmem_free < mcsim> braunr: It shouldn't be there? < braunr> why should it be there ? < braunr> you're freeing what the copy object references < braunr> it's strange that it even works < braunr> also, you shouldn't pass infop directly as the copy object < braunr> i guess you get a warning for that < braunr> do what the original code does: use an intermediate copy object and a cast < mcsim> ok < braunr> another error (without consequence but still, you should mind it) < braunr> simple_lock(&mem_cache_list_lock); < braunr> [...] 
< braunr> kr = kmem_alloc(ipc_kernel_map, &info, info_size); < braunr> you can't hold simple locks while allocating memory < braunr> read how the original implementation works around this < mcsim> ok < braunr> i guess host_zone_info assumes the zone list doesn't change much while unlocked < braunr> or that's it's rather unimportant since it's for debugging < braunr> a strict snapshot isn't required < braunr> list_for_each_entry(&mem_cache_list, cache, node) max_caches++; < braunr> you should really use two separate lines for readability < braunr> also, instead of counting each time, you could just maintain a global counter < braunr> mcsim: use strncpy instead of strcpy for the cache names < braunr> not to avoid overflow but rather to clear the unused bytes at the end of the buffer < braunr> mcsim: about kmem_alloc vs kmem_alloc_pageable, it's a minor issue < braunr> you're handing off debugging data to a userspace application < braunr> a rather dull reporting tool in most cases, which doesn't require wired down memory < braunr> so in order to better use available memory, pageable memory should be used < braunr> in the future i guess it could become a not-so-minor issue though < mcsim> ok. I'll fix it < braunr> mcsim: have you tried to run the kernel with MC_VERIFY always on ? < braunr> MEM_CF_VERIFY actually < mcsim1> yes. < braunr> oh < braunr> nothing wrong < braunr> ? < mcsim1> it is always set < braunr> ok < braunr> ah, you set it in macros.h .. < braunr> don't < braunr> put it in mem.c if you want, or better, make it a compile-time option < braunr> macros.h is a tiny macro library, it shouldn't define such unrelated options < mcsim1> ok. < braunr> mcsim1: did you try fault injection to make sure the checking code actually works and how it behaves when an error occurs ? < mcsim1> I think that when I finish I'll merge files cpu.h and macros.h with mem.c < braunr> yes that would simplify things < mcsim1> Yes. 
When I confused with types mem_buf_fill worked wrong and panic occurred. < braunr> very good < braunr> have you progressed concerning the measurements you wanted to do ? < mcsim1> not much. < braunr> ok < mcsim1> I think they will be ready in a few days. < antrik> what measurements are these? < mcsim1> braunr: What maximal size for static data and stack in kernel? < braunr> what do you mean ? < braunr> kernel stacks are one page if i'm right < braunr> static data (rodata+data+bss) are limited by grub bugs only :) < mcsim1> braunr: probably they are present, because when I created too big array I couldn't boot kernel < braunr> local variable or static ? < mcsim1> static < braunr> how large ? < mcsim1> 4Mb < braunr> hm < braunr> it's not a grub bug then < braunr> i was able to embed as much as 32 MiB in x15 while doing this kind of tests < braunr> I guess it's the gnu mach boot code which only preallocates one page for the initial kernel mapping < braunr> one PTP (page table page) maps 4 MiB < braunr> (x15 does this completely dynamically, unlike mach or even current BSDs) < mcsim1> antrik: First I want to measure time of each cache creation/allocation/deallocation and then compile kernel. 
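Note: the "one PTP maps 4 MiB" figure above can be checked with a few lines of arithmetic, assuming 32-bit x86 paging without PAE (4 KiB pages, 4-byte page table entries). The function names and the example image size are hypothetical. Whether the gnumach boot code really preallocates only one page table page is braunr's guess in the log and is not verified here.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_4K  UINT64_C(4096)  /* bytes mapped by one entry */
#define PTE_SIZE      UINT64_C(4)     /* bytes per page table entry */
#define PTES_PER_PTP  (PAGE_SIZE_4K / PTE_SIZE)  /* 1024 entries */

/* Bytes of address space covered by one page table page. */
uint64_t ptp_coverage(void)
{
    return PTES_PER_PTP * PAGE_SIZE_4K;
}

/* Would an image of this size plus a static array fit in one PTP? */
bool fits_in_one_ptp(uint64_t image_size, uint64_t static_array_size)
{
    return image_size + static_array_size <= ptp_coverage();
}
```

Under these assumptions a 4 MiB static array alone already fills the range one page table page can map, so any kernel image that contains it cannot fit, which would be consistent with the boot failure mcsim1 reports.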
< braunr> cache creation is irrelevant < braunr> because of the cpu pools in the new allocator, you should test at least two different allocation patterns < braunr> one with quick allocs/frees < braunr> the other with large numbers of allocs then their matching frees < braunr> (larger being at least 100) < braunr> i'd say the cpu pool layer is the real advantage over the previous zone allocator < braunr> (from a performance perspective) < mcsim1> But there is only one cpu < braunr> it doesn't matter < braunr> it's stil a very effective cache < braunr> in addition to reducing contention < braunr> compare mem_cpu_pool_pop() against mem_cache_alloc_from_slab() < braunr> mcsim1: work is needed to polish the whole thing, but getting it actually working is a nice achievement for someone new on the project < braunr> i hope it helped you learn about memory allocation, virtual memory, gnu mach and the hurd in general :) < antrik> indeed :-) # IRC, freenode, #hurd, 2011-09-06 [some performance testing] <braunr> i'm not sure such long tests are relevant but let's assume balloc is slower <braunr> some tuning is needed here <braunr> first, we can see that slab allocation occurs more often in balloc than page allocation does in zalloc <braunr> so yes, as slab allocation is slower (have you measured which part actually is slow ? i guess it's the kmem_alloc call) <braunr> the whole process gets a bit slower too <mcsim> I used alloc_size = 4096 for zalloc <braunr> i don't know what that is exactly <braunr> but you can't hold 500 16 bytes buffers in a page so zalloc must have had free pages around for that <mcsim> I use kmem_alloc_wired <braunr> if you have time, measure it, so that we know how much it accounts for <braunr> where are the results for dealloc ? <mcsim> I can't give you result right now because internet works very bad. 
But for first DEALLOC result are the same, exept some cases when it takes balloc for more than 1000 ticks <braunr> must be the transfer from the cpu layer to the slab layer <mcsim> as to kmem_alloc_wired. I think zalloc uses this function too for allocating objects in zone I test. <braunr> mcsim: yes, but less frequently, which is why it's faster <braunr> mcsim: another very important aspect that should be measured is memory consumption, have you looked into that ? <mcsim> I think that I made too little iterations in test SMALL <mcsim> If I increase constant SMALL_TESTS will it be good enough? <braunr> mcsim: i don't know, try both :) <braunr> if you increase the number of iterations, balloc average time will be lower than zalloc, but this doesn't remove the first long initialization step on the allocated slab <mcsim> SMALL_TESTS to 500, I mean <braunr> i wonder if maintaining the slabs sorted through insertion sort is what makes it slow <mcsim> braunr: where do you sort slabs? I don't see this. <braunr> mcsim: mem_cache_alloc_from_slab and its free counterpart <braunr> mcsim: the mem_source stuff is useless in gnumach, you can remove it and directly call the kmem_alloc/free functions <mcsim> But I have to make special allocator for kernel map entries. <braunr> ah right <mcsim> btw. It turned out that 256 entries are not enough. 
<braunr> that's weird <braunr> i'll make a patch so that the mem_source code looks more like what i have in x15 then <braunr> about the results, i don't think the slab layer is that slow <braunr> it's the cpu_pool_fill/drain functions that take time <braunr> they preallocate many objects (64 for your objects size if i'm right) at once <braunr> mcsim: look at the first result page: some times, a number around 8000 is printed <braunr> the common time (ticks, whatever) for a single object is 120 <braunr> 8132/120 is 67, close enough to the 64 value <mcsim> I forgot about SMALL tests here are they: http://paste.debian.net/128533/ (balloc) http://paste.debian.net/128534/ (zalloc) <mcsim> braunr: why do you divide 8132 by 120? <braunr> mcsim: to see if it matches my assumption that the ~8000 number matches the cpu_pool_fill call <mcsim> braunr: I've got it <braunr> mcsim: i'd be much interested in the dealloc results if you can paste them too <mcsim> dealloc: http://paste.debian.net/128589/ http://paste.debian.net/128590/ <braunr> mcsim: thanks <mcsim> second dealloc: http://paste.debian.net/128591/ http://paste.debian.net/128592/ <braunr> mcsim: so the main conclusion i retain from your tests is that the transfers from the cpu and the slab layers are what makes the new allocator a bit slower <mcsim> OPERATION_SMALL dealloc: http://paste.debian.net/128593/ http://paste.debian.net/128594/ <braunr> mcsim: what needs to be measured now is global memory usage <mcsim> braunr: data from /proc/vmstat after kernel compilation will be enough? <braunr> mcsim: let me check <braunr> mcsim: no it won't do, you need to measure kernel memory usage <braunr> the best moment to measure it is right after zone_gc is called <mcsim> Are there any facilities in gnumach for memory measurement? 
<braunr> it's specific to the allocators <braunr> just count the number of used pages <braunr> after garbage collection, there should be no free page, so this should be rather simple <mcsim> ok <mcsim> braunr: When I measure memory usage in balloc, what formula is better cache->nr_slabs * cache->bufs_per_slab * cache->buf_size or cache->nr_slabs * cache->slab_size? <braunr> the latter # IRC, freenode, #hurd, 2011-09-07 <mcsim> braunr: I've disabled calling of mem_cpu_pool_fill and allocator became faster <braunr> mcsim: sounds nice <braunr> mcsim: i suspect the free path might not be as fast though <mcsim> results for first calling: http://paste.debian.net/128639/ second: http://paste.debian.net/128640/ and with many alloc/free: http://paste.debian.net/128641/ <braunr> mcsim: thanks <mcsim> best result are for second call: average time decreased from 159.56 to 118.756 <mcsim> First call slightly worse, but this is because I've added some profiling code <braunr> i still see some ~8k lines in 128639 <braunr> even some around ~12k <mcsim> I think this is because of mem_cache_grow I'm investigating it now <braunr> i guess so too <mcsim> I've measured time for first call in cache and from about 22000 mem_cache_grow takes 20000 <braunr> how did you change the code so that it doesn't call mem_cpu_pool_fill ? <braunr> is the cpu layer still used ? <mcsim> http://paste.debian.net/128644/ <braunr> don't forget the free path <braunr> mcsim: anyway, even with the previous slightly slower behaviour we could observe, the performance hit is negligible <mcsim> Is free path a compilation? (I'm sorry for my english) <braunr> mcsim: mem_cache_free <braunr> mcsim: the last two measurements i'd advise are with big (>4k) object sizes and, really, kernel allocator consumption <mcsim> http://paste.debian.net/128648/ http://paste.debian.net/128646/ http://paste.debian.net/128649/ (first, second, small) <braunr> mcsim: these numbers are closer to the zalloc ones, aren't they ? 
<mcsim> deallocating slighty faster too <braunr> it may not be the case with larger objects, because of the use of a tree <mcsim> yes, they are closer <braunr> but then, i expect some space gains <braunr> the whole thing is about compromise <mcsim> ok. I'll try to measure them today. Anyway I'll post result and you could read them in the morning <braunr> at least, it shows that the zone allocator was actually quite good <braunr> i don't like how the code looks, there are various hacks here and there, it lacks self inspection features, but it's quite good <braunr> and there was little room for true improvement in this area, like i told you :) <braunr> (my allocator, like the current x15 dev branch, focuses on mp machines) <braunr> mcsim: thanks again for these numbers <braunr> i wouldn't have had the courage to make the tests myself before some time eh <mcsim> braunr: hello. Look at the small_4096 results http://paste.debian.net/128692/ (balloc) http://paste.debian.net/128693/ (zalloc) <braunr> mcsim: wow, what's that ? :) <braunr> mcsim: you should really really include your test parameters in the report <braunr> like object size, purpose, and other similar details <mcsim> for balloc I specified only object_size = 4096 <mcsim> for zalloc object_size = 4096, alloc_size = 4096, memtype = 0; <braunr> the results are weird <braunr> apart from the very strange numbers (e.g. 0 or 4429543648), none is around 3k, which is the value matching a kmem_alloc call <braunr> happy to see balloc behaves quite good for this size too <braunr> s/good/well/ <mcsim> Oh <mcsim> here is significant only first 101 lines <mcsim> I'm sorry <braunr> ok <braunr> what does the test do again ? 10 loops of 10 allocs/frees ? 
<mcsim> yes <braunr> ok, so the only slowdown is at the beginning, when the slabs are created <braunr> the two big numbers (31844 and 19548) are strange <mcsim> on the other hand time of compilation is <mcsim> balloc zalloc <mcsim> 38m28.290s 38m58.400s <mcsim> 38m38.240s 38m42.140s <mcsim> 38m30.410s 38m52.920s <braunr> what are you compiling ? <mcsim> gnumach kernel <braunr> in 40 mins ? <mcsim> yes <braunr> you lack hvm i guess <mcsim> is it long? <mcsim> I use real PC <braunr> very <braunr> ok <braunr> so it's normal <mcsim> in vm it was about 2 hours) <braunr> the difference really is negligible <braunr> ok i can explain the big numbers <braunr> the slab size depends on the object size, and for 4k, it is 32k <braunr> you can store 8 4k buffers in a slab (lines 2 to 9) <mcsim> so we need use kmem_alloc_* 8 times? <braunr> on line 10, the ninth object is allocated, which adds another slab to the cache, hence the big number <braunr> no, once for a size of 32k <braunr> and then the free list is initialized, which means accessing those pages, which means tlb misses <braunr> i guess the zone allocator already has free pages available <mcsim> I see <braunr> i think you can stop performance measurements, they show the allocator is slightly slower, but so slightly we don't care about that <braunr> we need numbers on memory usage now (at the page level) <braunr> and this isn't easy <mcsim> For balloc I can get numbers if I summarize nr_slabs*slab_size for each cache, isn't it? <braunr> yes <braunr> you can have a look at the original implementation, function mem_info <mcsim> And for zalloc I have to summarize of cur_size and then add zalloc_wasted_space? 
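The accounting discussed above (a 32 KiB slab holding eight 4 KiB buffers, so that the ninth allocation grows the cache, and per-cache usage counted as nr_slabs * slab_size) can be sketched in C. This is only an illustration: the struct is a hypothetical stand-in that borrows the field names quoted in the log (nr_slabs, slab_size, bufs_per_slab, buf_size), not the allocator's real cache structure, and the helper names are made up.

```c
#include <stddef.h>

/* Hypothetical, simplified cache descriptor; only the field names come
 * from the discussion, the layout is an assumption. */
struct cache_stats {
    size_t buf_size;       /* size of one object, in bytes */
    size_t slab_size;      /* size of one slab, in bytes */
    size_t bufs_per_slab;  /* objects that fit in one slab */
    size_t nr_slabs;       /* slabs currently owned by the cache */
};

/* Page-level memory held by one cache: whole slabs, used or not. */
static size_t cache_mem_usage(const struct cache_stats *c)
{
    return c->nr_slabs * c->slab_size;
}

/* Total over an array of caches. */
static size_t total_mem_usage(const struct cache_stats *caches, size_t n)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += cache_mem_usage(&caches[i]);

    return total;
}

/* Number of slabs needed to hold nr_objs live objects (rounded up). */
static size_t slabs_needed(const struct cache_stats *c, size_t nr_objs)
{
    return (nr_objs + c->bufs_per_slab - 1) / c->bufs_per_slab;
}
```

With buf_size = 4096, slab_size = 32768 and bufs_per_slab = 8, eight live objects need one slab and nine need two, which matches the jump seen on the tenth line of the test output.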
<braunr> i don't know :/ <braunr> i think the best moment to obtain accurate values is after zone_gc removes the collected pages <braunr> for both allocators, you could fill a stats structure at that moment, and have an rpc copy that structure when a client tool requests it <braunr> concerning your tests, there is another point to have in mind <braunr> the very first loop in your code shows a result of 31844 <braunr> although you disabled the call to cpu_pool_fill <braunr> but the reason why it's so long is that the cpu layer still exists <braunr> and if you look carefully, the cpu pools are created as needed on the free path <mcsim> I removed cpu_pool_drain <braunr> but not cpu_pool_push/pop i guess <mcsim> http://paste.debian.net/128698/ <braunr> see, you still allocate the cpu pool array on the free path <mcsim> but I don't fill it <braunr> that's not the point <braunr> it uses mem_cache_alloc <braunr> so in a call to free, you can also have an allocation, that can potentially create a new slab <mcsim> I see, so I have to create cpu_pool at the initialization stage? <braunr> no, you can't <braunr> there is a reason why they're allocated on the free path <braunr> but since you don't have the fill/drain functions, i wonder if you should just comment out the whole cpu layer code <braunr> but hmm <braunr> no really, it's not worth the effort <braunr> even with drains/fills, the results are really good enough <braunr> it makes the allocator smp ready <braunr> we should just keep it that way <braunr> mcsim: fyi, the reason why cpu pool arrays are allocated on the free path is to avoid recursion <braunr> because cpu pool arrays are allocated from caches just as almost everything else <mcsim> ok <mcsim> summ of cur_size and then adding zalloc_wasted_space gives 0x4e1954 <mcsim> but this value isn't even page aligned <mcsim> For balloc I've got 0x4c6000 0x4aa000 0x48d000 <braunr> hm can you report them in decimal, >> 10 so that values are in KiB ? 
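The conversion requested above (byte totals shifted right by 10 to get KiB) and the page alignment check mcsim alludes to can be written as two small C helpers. The function names are hypothetical, and a 4 KiB page size is assumed.

```c
#include <stdint.h>

/* Convert a byte count to KiB, rounding down, as "value >> 10" does. */
static uint32_t bytes_to_kib(uint32_t bytes)
{
    return bytes >> 10;
}

/* A size is page aligned when its low 12 bits are zero
 * (4 KiB pages assumed). */
static int is_page_aligned(uint32_t bytes)
{
    return (bytes & 0xfffu) == 0;
}
```

Applied to the hexadecimal totals quoted above, 0x4c6000, 0x4aa000 and 0x48d000 give 4888, 4776 and 4660 KiB, while 0x4e1954 gives 4998 KiB and fails the alignment check.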
<mcsim> 4888 4776 4660 for balloc <mcsim> 4998 for zalloc <braunr> when ? <braunr> after boot ? <mcsim> boot, compile, zone_gc <mcsim> and then measure <braunr> ? <mcsim> I call garbage collector before measuring <mcsim> and I measure after kernel compilation <braunr> i thought it took you 40 minutes <mcsim> for balloc I got results at night <braunr> oh so you already got them <braunr> i can't beleive the kernel only consumes 5 MiB <mcsim> before gc it takes about 9052 Kib <braunr> can i see the measurement code ? <braunr> oh, and how much ram does your machine have ? <mcsim> 758 mb <mcsim> 768 <braunr> that's really weird <braunr> i'd expect the kernel to consume much more space <mcsim> http://paste.debian.net/128703/ <mcsim> it's only dynamically allocated data <braunr> yes <braunr> ipc ports, rights, vm map entries, vm objects, and lots of other hanging buffers <braunr> about how much is zalloc_wasted_space ? <braunr> if it's small or constant, i guess you could ignore it <mcsim> about 492 <mcsim> KiB <braunr> well it's another good point, mach internal structures don't imply much overhead <braunr> or, the zone allocator is underused <tschwinge> mcsim, braunr: The memory allocator project is coming along good, as I get from your IRC messages? <braunr> tschwinge: yes, but as expected, improvements are minor <tschwinge> But at the very least it's now well-known, maintainable code. <braunr> yes, it's readable, easier to understand, provides self inspection and is smp ready <braunr> there also are less hacks, but a few less features (there are no way to avoid sleeping so it's unusable - and unused - in interrupt handlers) <braunr> is* no way <braunr> tschwinge: mcsim did a good job porting and measuring it # IRC, freenode, #hurd, 2011-09-08 <antrik> braunr: note that the zalloc map used to be limited to 8 MiB or something like that a couple of years ago... 
so it doesn't seem surprising that the kernel uses "only" 5 MiB :-) <antrik> (yes, we had a *lot* of zalloc panics back then...) # IRC, freenode, #hurd, 2011-09-14 <mcsim> braunr: hello. I've written a constructor for kernel map entries and it can return resources to their source. Can you have a look at it? http://paste.debian.net/130037/ If all is OK I'll push it tomorrow. <braunr> mcsim: send the patch through mail please, i'll apply it on my copy <braunr> are you sure the cache is reapable ? <mcsim> All slabs except the first I allocate with kmem_alloc_wired. <braunr> how can you be sure ? <mcsim> The first slab I allocate during bootstrap using pmap_steal_memory, and after that I use only kmem_alloc_wired <braunr> no, you use kmem_free <braunr> in kentry_dealloc_cache() <braunr> which probably creates a recursion <braunr> using the constructor this way isn't a good idea <braunr> constructors are good for preconstructed state (set counters to 0, init lists and locks, that kind of thing, not allocating memory) <braunr> i don't think you should try to make this special cache reapable <braunr> mcsim: keep in mind constructors are applied on buffers at *slab* creation, not at object allocation <braunr> so if you allocate a single slab with, say, 50 or 100 objects per slab, kmem_alloc_wired would be called that number of times <mcsim> why can kentry_dealloc_cache create recursion? kentry_dealloc_cache is called only by mem_cache_reap. <braunr> right <braunr> but are you totally sure mem_cache_reap() can't be called by kmem_free() ? <braunr> i think you're right, it probably can't # IRC, freenode, #hurd, 2011-09-25 <mcsim> braunr: hello. I rewrote the constructor for kernel entries and it seems that it works fine. I think that this was the last milestone. Only moving the memory allocator sources to a more appropriate place and merging with the main branch are left. <braunr> mcsim: it needs renaming and reindenting too <mcsim> for reindenting, will C-x h Tab in emacs be enough?
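On the constructor semantics described in the 2011-09-14 exchange (constructors run on every buffer when a *slab* is created, not when an object is handed out), here is a toy C sketch. Everything in it (toy_slab, toy_slab_create, toy_slab_alloc, counting_ctor) is hypothetical and uses malloc as a stand-in backend; it is not the allocator's code, only a way to see why an expensive constructor is paid once per buffer at slab creation.

```c
#include <stddef.h>
#include <stdlib.h>

typedef void (*ctor_fn)(void *buf);

/* Toy slab: one chunk of memory cut into equally sized buffers. */
struct toy_slab {
    char *mem;         /* backing memory for all buffers */
    size_t buf_size;   /* size of one buffer */
    size_t nr_bufs;    /* buffers in the slab */
    size_t nr_used;    /* buffers handed out so far */
};

static unsigned long ctor_calls;

/* Constructor that only counts how often it runs. */
static void counting_ctor(void *buf)
{
    (void)buf;
    ctor_calls++;
}

/* Create the slab and construct every buffer up front. */
static int toy_slab_create(struct toy_slab *slab, size_t buf_size,
                           size_t nr_bufs, ctor_fn ctor)
{
    slab->mem = malloc(buf_size * nr_bufs);
    if (slab->mem == NULL)
        return -1;

    slab->buf_size = buf_size;
    slab->nr_bufs = nr_bufs;
    slab->nr_used = 0;

    for (size_t i = 0; i < nr_bufs; i++)
        if (ctor != NULL)
            ctor(slab->mem + i * buf_size);

    return 0;
}

/* Hand out an already constructed buffer; the constructor is not called. */
static void *toy_slab_alloc(struct toy_slab *slab)
{
    if (slab->nr_used == slab->nr_bufs)
        return NULL;

    return slab->mem + slab->nr_used++ * slab->buf_size;
}
```

Creating a slab of 50 buffers bumps ctor_calls to 50 immediately, and later allocations leave it unchanged, which is why a constructor that allocates memory would do so 50 times for a single slab.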
<braunr> mcsim: make sure which style must be used first <mcsim> and what should I rename and where better to place allocator? For example, there is no lib directory, like in x15. Should I create it and move list.* and rbtree.* to lib/ or move these files to util/ or something else? <braunr> mcsim: i told you balloc isn't a good name before, use something more meaningful (kmem is already used in gnumach unfortunately if i'm right) <braunr> you can put the support files in kern/ <mcsim> what about vm_alloc? <braunr> you should prefix it with vm_ <braunr> shouldn't <braunr> it's a top level allocator <braunr> on top of the vm system <braunr> maybe mcache <braunr> hm no <braunr> maybe just km_ <mcsim> kern/km_alloc.*? <braunr> no <braunr> just km <mcsim> ok. # IRC, freenode, #hurd, 2011-09-27 <mcsim> braunr: hello. When I've tried to speed of new allocator and bad I've removed function mem_cpu_pool_fill. But you've said to undo this. I don't understand why this function is necessary. Can you explain it, please? <mcsim> When I've tried to compare speed of new allocator and old* <braunr> i'm not sure i said that <braunr> i said the performance overhead is negligible <braunr> so it's better to leave the cpu pool layer in place, as it almost doesn't hurt <braunr> you can implement the KMEM_CF_NO_CPU_POOL I added in the x15 mach version <braunr> so that cpu pools aren't used by default, but the code is present in case smp is implemented <mcsim> I didn't remove cpu pool layer. I've just removed filling of cpu pool during creation of slab. <braunr> how do you fill the cpu pools then ? <mcsim> If object is freed than it is added to cpu poll <braunr> so you don't fill/drain the pools ? <braunr> you try to get/put an object and if it fails you directly fall back to the slab layer ? 
<mcsim> I drain them during garbage collection <braunr> oh <mcsim> yes <braunr> you shouldn't touch the cpu layer during gc <braunr> the number of objects should be small enough so that we don't care much <mcsim> ok. I can drain cpu pool at any other time if it is prohibited to in mem_gc. <mcsim> But why do we need to fill cpu poll during slab creation? <mcsim> In this case allocation consist of: get object from slab -> put it to cpu pool -> get it from cpu pool <mcsim> I've just remove last to stages <braunr> hm cpu pools aren't filled at slab creation <braunr> they're filled when they're empty, and drained when they're full <braunr> so that the number of objects they contain is increased/reduced to a value suitable for the next allocations/frees <braunr> the idea is to fall back as little as possible to the slab layer because it requires the acquisition of the cache lock <mcsim> oh. You're right. I'm really sorry. The point is that if cpu pool is empty we don't need to fill it first <braunr> uh, yes we do :) <mcsim> Why cache locking is so undesirable? If we have free objects in slabs locking will not take a lot if time. <braunr> mcsim: it's undesirable on a smp system <mcsim> ok. <braunr> mcsim: and spin locks are normally noops on a up system <braunr> which is the case in gnumach, hence the slightly better performances without the cpu layer <braunr> but i designed this allocator for x15, which only supports mp systems :) <braunr> mcsim: sorry i couldn't look at your code, sick first, busy with server migration now (new server almost ready for xen hurds :)) <mcsim> ok. <mcsim> I ended with allocator if didn't miss anything important:) <braunr> i'll have a look soon i hope :) # IRC, freenode, #hurd, 2011-09-27 <antrik> braunr: would it be realistic/useful to check during GC whether all "used" objects are actually in a CPU pool, and if so, destroy them so the slab can be freed?... <antrik> mcsim: BTW, did you ever do any measurements of memory use/fragmentation? 
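The fill/drain behaviour braunr describes above (pools are filled when empty and drained when full, so that most operations avoid the slab layer and its cache lock) can be sketched as a toy per-CPU pool in C. This is an illustration under stated assumptions, not the allocator's code: the names, the pool capacity of 128 and the malloc/free "slab layer" are made up, and only the transfer size of 64 echoes the figure mentioned earlier in the log.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_SIZE      128  /* capacity of the per-CPU array (assumed) */
#define TRANSFER_SIZE  64   /* objects moved per fill/drain */

struct cpu_pool {
    size_t nr_objs;            /* objects currently cached */
    void *objs[POOL_SIZE];     /* stack of free objects */
    unsigned long slab_trips;  /* times we fell back to the "slab layer" */
};

/* Stand-ins for the slab layer; a real one would take the cache lock. */
static void *slab_alloc(void) { return malloc(64); }
static void slab_free(void *obj) { free(obj); }

/* Refill an empty pool with one bulk transfer. */
static void pool_fill(struct cpu_pool *p)
{
    p->slab_trips++;
    while (p->nr_objs < TRANSFER_SIZE) {
        void *obj = slab_alloc();
        if (obj == NULL)
            break;
        p->objs[p->nr_objs++] = obj;
    }
}

/* Release objects from a full pool with one bulk transfer. */
static void pool_drain(struct cpu_pool *p)
{
    p->slab_trips++;
    for (size_t i = 0; i < TRANSFER_SIZE && p->nr_objs > 0; i++)
        slab_free(p->objs[--p->nr_objs]);
}

/* Fast path: pop from the pool; fill first only if it is empty. */
static void *pool_alloc(struct cpu_pool *p)
{
    if (p->nr_objs == 0)
        pool_fill(p);
    return p->nr_objs ? p->objs[--p->nr_objs] : NULL;
}

/* Fast path: push to the pool; drain first only if it is full. */
static void pool_free(struct cpu_pool *p, void *obj)
{
    if (p->nr_objs == POOL_SIZE)
        pool_drain(p);
    p->objs[p->nr_objs++] = obj;
}
```

In this sketch, 100 allocations followed by 100 frees reach the slab layer only twice (two fills), which is the point of the layer on an SMP system. On a uniprocessor kernel, where the lock is a no-op, the same bulk transfers are pure overhead, consistent with the measurements above.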
<mcsim> antrik: I couldn't do this for zalloc <antrik> oh... why not? <antrik> (BTW, I would be interested in a comparison between using the CPU layer, and bare slab allocation without CPU layer) <mcsim> The results I got were strange. They weren't even aligned to page size. <mcsim> Is it probably better to look into /proc/vmstat? <mcsim> Because I put hooks in the code and probably I missed something <antrik> mcsim: I doubt vmstat would give enough information to make any useful comparison... <braunr> antrik: isn't this draining cpu pools at gc time ? <braunr> antrik: the cpu layer was found to add a slight overhead compared to always falling back to the slab layer <antrik> braunr: my idea is only to drop entries from the CPU cache if they actually prevent slabs from being freed... if other objects in the slab are really in use, there is no point in flushing them from the CPU cache <antrik> braunr: I meant comparing the fragmentation with/without CPU layer. the difference in CPU usage is probably negligible anyways... <antrik> you might remember that I was (and still am) sceptical about CPU layer, as I suspect it worsens the good fragmentation properties of the pure slab allocator -- but it would be nice to actually check this :-) <braunr> antrik: right <braunr> antrik: the more i think about it, the more i consider slqb to be a better solution ...... :> <braunr> an idea for when there's time <braunr> eh <antrik> hehe :-) # IRC, freenode, #hurd, 2011-10-13 <braunr> mcsim: what's the current state of your gnumach branch ? <mcsim> I've merged it with master in September <braunr> yes i've seen that, but does it build and run fine ? <mcsim> I've tested it on gnumach from the debian repository, but for building I had to make an additional change in device/ramdisk.c, as I mentioned. <braunr> mcsim: why ? <mcsim> And it runs fine for me. <braunr> mcsim: why did you need to make other changes ?
<mcsim> because there is a patch which comes with from-debian-repository kernel and it addes some code, where I have to make changes. Earlier kernel_map was a pointer to structure, but I change that and now kernel_map is structure. So handling to it should be by taking the address (&kernel_map) <braunr> why did you do that ? <braunr> or put it another way: what made you do that type change on kernel_map ? <mcsim> Earlier memory for kernel_map was allocating with zalloc. But now salloc can't allocate memory before it's initialisation <braunr> that's not a good reason <braunr> a simple workaround for your problem is this : <braunr> static struct vm_map kernel_map_store; <braunr> vm_map_t kernel_map = &kernel_map_store; <mcsim> braunr: Ok. I'll correct this. # IRC, freenode, #hurd, 2011-11-01 <braunr> etenil: but mcsim's work is, for one, useful because the allocator code is much clearer, adds some debugging support, and is smp-ready # IRC, freenode, #hurd, 2011-11-14 <braunr> i've just realized that replacing the zone allocator removes most (if not all) static limit on allocated objects <braunr> as we have nothing similar to rlimits, this means kernel resources are actually exhaustible <braunr> and i'm not sure every allocation is cleanly handled in case of memory shortage <braunr> youpi: antrik: tschwinge: is this acceptable anyway ? <braunr> (although IMO, it's also a good thing to get rid of those limits that made the kernel panic for no valid reason) <youpi> there are actually not many static limits on allocated objects <youpi> only a few have one <braunr> those defined in kern/mach_param.h <youpi> most of them are not actually enforced <braunr> ah ? <braunr> they are used at zinit() time <braunr> i thought they were <youpi> yes, but most zones are actually fine with overcoming the max <braunr> ok <youpi> see zone->max_size += (zone->max_size >> 1); <youpi> you need both !EXHAUSTIBLE and FIXED <braunr> ok <pinotree> making having rlimits enforced would be nice... 
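The workaround braunr gives above for kernel_map (static storage plus a pointer, so callers keep using kernel_map as a pointer even before the allocator is initialised) can be shown in a self-contained form. The two declarations are the ones quoted in the log; the struct contents and the example_map_init helper are hypothetical placeholders, since the real struct vm_map has many more fields.

```c
/* Hypothetical, heavily simplified map type. */
struct vm_map {
    unsigned long min_offset;
    unsigned long max_offset;
};
typedef struct vm_map *vm_map_t;

/* Storage lives in the kernel image, so no allocator is needed to
 * obtain it during early boot. */
static struct vm_map kernel_map_store;

/* Existing code keeps using a pointer, exactly as before. */
vm_map_t kernel_map = &kernel_map_store;

/* Hypothetical initializer standing in for the real map setup code. */
static void example_map_init(vm_map_t map, unsigned long min,
                             unsigned long max)
{
    map->min_offset = min;
    map->max_offset = max;
}
```

Because kernel_map keeps its pointer type, code such as the Debian ramdisk patch that passes kernel_map around would not need to be changed to take &kernel_map.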
<pinotree> s/making// <braunr> pinotree: the kernel wouldn't handle many standard rlimits anyway <braunr> i've just committed my final patch on mcsim's branch, which will serve as the starting point for integration <braunr> which means code in this branch won't change (or only last minute changes) <braunr> you're invited to test it <braunr> there shouldn't be any noticeable difference with the master branch <braunr> a bit less fragmentation <braunr> more memory can be reclaimed by the VM system <braunr> there are debugging features <braunr> it's SMP ready <braunr> and overall cleaner than the zone allocator <braunr> although a bit slower on the free path (because of what's performed to reduce fragmentation) <braunr> but even "slower" here is completely negligible # IRC, freenode, #hurd, 2011-11-15 <mcsim> I enabled cpu_pool layer and kentry cache exhausted at "apt-get source gnumach && (cd gnumach-* && dpkg-buildpackage)" <mcsim> I mean kernel with your last commit <mcsim> braunr: I'll make patch how I've done it in a few minutes, ok? It will be more specific. <braunr> mcsim: did you just remove the #if NCPUS > 1 directives ? <mcsim> no. I replaced macro NCPUS > 1 with SLAB_LAYER, which equals NCPUS > 1, than I redefined macro SLAB_LAYER <braunr> ah, you want to make the layer optional, even on UP machines <braunr> mcsim: can you give me the commands you used to trigger the problem ? <mcsim> apt-get source gnumach && (cd gnumach-* && dpkg-buildpackage) <braunr> mcsim: how much ram & swap ? <braunr> let's see if it can handle a quite large aptitude upgrade <mcsim> how can I check swap size? <braunr> free <braunr> cat /proc/meminfo <braunr> top <braunr> whatever <mcsim> total used free shared buffers cached <mcsim> Mem: 786368 332296 454072 0 0 0 <mcsim> -/+ buffers/cache: 332296 454072 <mcsim> Swap: 1533948 0 1533948 <braunr> ok, i got the problem too <mcsim> braunr: do you run hurd in qemu? 
<braunr> yes <braunr> i guess the cpu layer increases fragmentation a bit <braunr> which means more map entries are needed <braunr> hm, something's not right <braunr> there are only 26 kernel map entries when i get the panic <braunr> i wonder why the cache gets that stressed <braunr> hm, reproducing the kentry exhaustion problem takes quite some time <mcsim> braunr: what do you mean? <braunr> sometimes, dpkg-buildpackage finishes without triggering the problem <mcsim> the problem is in apt-get source gnumach <braunr> i guess the problem happens because of drains/fills, which allocate/free much more object than actually preallocated at boot time <braunr> ah ? <braunr> ok <braunr> i've never had it at that point, only later <braunr> i'm unable to trigger it currently, eh <mcsim> do you use *-dbg kernel? <braunr> yes <braunr> well, i use the compiled kernel, with the slab allocator, built with the in kernel debugger <mcsim> when you run apt-get source gnumach, you run it in clean directory? Or there are already present downloaded archives? 
<braunr> completely empty <braunr> ah just got it <braunr> ok the limit is reached, as expected <braunr> i'll just bump it <braunr> the cpu layer drains/fills allocate several objects at once (64 if the size is small enough) <braunr> the limit of 256 (actually 252 since the slab descriptor is embedded in its slab) is then easily reached <antrik> mcsim: most direct way to check swap usage is vmstat <braunr> damn, i can't live without slabtop and the amount of active/inactive cache memory any more <braunr> hm, weird, we have active/inactive memory in procfs, but not buffers/cached memory <braunr> we could set buffers to 0 and everything as cached memory, since we're currently unable to communicate the purpose of cached memory (whether it's used by disk servers or file system servers) <braunr> mcsim: looks like there are about 240 kernel map entries (i forgot about the ones used in kernel submaps) <braunr> so yes, addin the cpu layer is what makes the kernel reach the limit more easily <mcsim> braunr: so just increasing limit will solve the problem? <braunr> mcsim: yes <braunr> slab reclaiming looks very stable <braunr> and unfrequent <braunr> (which is surprising) <pinotree> braunr: "unfrequent"? <braunr> pinotree: there isn't much memory pressure <braunr> slab_collect() gets called once a minute on my hurd <braunr> or is it infrequent ? 
<braunr> :) <pinotree> i have no idea :) <braunr> infrequent, yes # IRC, freenode, #hurd, 2011-11-16 <braunr> for those who want to play with the slab branch of gnumach, the slabinfo tool is available at http://darnassus.sceen.net/cgit/rbraun/slabinfo.git/ <braunr> for those merely interested in numbers, here is the output of slabinfo, for a hurd running in kvm with 512 MiB of RAM, an unused swap, and a short usage history (gnumach debian packages built, aptitude upgrade for a dozen of packages, a few git commands) <braunr> http://www.sceen.net/~rbraun/slabinfo.out <antrik> braunr: numbers for a long usage history would be much more interesting :-) ## IRC, freenode, #hurd, 2011-11-17 <braunr> antrik: they'll come :) <etenil> is something going on on darnassus? it's mighty slow <braunr> yes <braunr> i've rebooted it to run a modified kernel (with the slab allocator) and i'm building stuff on it to stress it <braunr> (i don't have any other available machine with that amount of available physical memory) <etenil> ok <antrik> braunr: probably would be actually more interesting to test under memory pressure... <antrik> guess that doesn't make much of a difference for the kernel object allocator though <braunr> antrik: if ram is larger, there can be more objects stored in kernel space, then, by building something large such as eglibc, memory pressure is created, causing caches to be reaped <braunr> our page cache is useless because of vm_object_cached_max <braunr> it's a stupid arbitrary limit masking the inability of the vm to handle pressure correctly <braunr> if removing it, the kernel freezes soon after ram is filled <braunr> antrik: it may help trigger the "double swap" issue you mentioned <antrik> what may help trigger it? <braunr> not checking this limit <antrik> hm... 
indeed I wonder whether the freezes I see might have the same cause ## IRC, freenode, #hurd, 2011-11-19 <braunr> http://www.sceen.net/~rbraun/slabinfo.out <= state of the slab allocator after building the debian libc packages and removing all files once done <braunr> it's mostly the same as on any other machine, because of the various arbitrary limits in mach (most importantly, the max number of objects in the page cache) <braunr> fragmentation is still quite low <antrik> braunr: actually fragmentation seems to be lower than on the other run... <braunr> antrik: what makes you think that ? <antrik> the numbers of currently unused objects seem to be in a similar range IIRC, but more of them are reclaimable I think <antrik> maybe I'm misremembering the other numbers <braunr> there had been more reclaims on the other run # IRC, freenode, #hurd, 2011-11-25 <braunr> mcsim: i've just updated the slab branch, please review my last commit when you have time <mcsim> braunr: Do you mean compilation/tests? <braunr> no, just a quick glance at the code, see if it matches what you intended with your original patch <mcsim> braunr: everything is ok <braunr> good <braunr> i think the branch is ready for integration # IRC, freenode, #hurd, 2011-12-17 <braunr> in the slab branch, there now is no use for the defines in kern/mach_param.h <braunr> should the file be removed or left empty as a placeholder for future arbitrary limits ? <braunr> (i'd tend ro remove it as a way of indicating we don't want arbitrary limits but there may be a good reason to keep it around .. :)) <youpi> I'd just drop it <braunr> ok <braunr> hmm maybe we do want to keep that one : <braunr> #define IMAR_MAX (1 << 10) /* Max number of msg-accepted reqs */ <antrik> whatever that is... 
<braunr> it gets returned in ipc_marequest_info <braunr> but the mach_debug interface has never been used on the hurd <braunr> there now is a master-slab branch in the gnumach repo, feel free to test it # IRC, freenode, #hurd, 2011-12-22 <youpi> braunr: does the new gnumach allocator have profiling features? <youpi> e.g. to easily know where memory leaks reside <braunr> youpi: you mean tracking call traces to allocated blocks ? <youpi> not necessarily traces <youpi> but at least means to know what kind of objects are filling memory <braunr> it's very close to the zone allocator <braunr> but instead of zones, there are caches <braunr> each named after the type they store <braunr> see http://www.sceen.net/~rbraun/slabinfo.out <youpi> ok, so we can know, per-type, how much memory is used <braunr> yes <youpi> good <braunr> if backtraces can easily be forged, it wouldn't be hard to add that feature too <youpi> does it dump such info when memory goes short? <braunr> no but it can <braunr> i've done this during tests <youpi> it'd be good <youpi> because I don't know in advance when a buildd will crash due to that :) <braunr> each time slab_collect() is called for example <youpi> I mean not on collect, but when it's too late <youpi> and thus always enabled <braunr> ok <youpi> (because there's nothing better to do than at least give infos) <braunr> you just have to define "when it's too late", and i can add that <youpi> when there is no memory left <braunr> you mean when the number of free pages strictly reaches 0 ? <youpi> yes <braunr> ok <youpi> i.e. just before crashing the kernel <braunr> i see # IRC, freenode, #hurdfr, 2012-01-02 <youpi> braunr: was the slab allocator code written from scratch ?
<youpi> there is still a carnegie mellon copyright <youpi> (in slab_info.h at least) <youpi> ipc_hash_global_size = 256; <youpi> 256 should be made a constant in a header <youpi> otherwise it's yet another arbitrary value hidden in the code <youpi> same for ipc_marequest_size etc. <braunr> youpi: yes, from scratch <braunr> slab_info.h was originally zone_info.h <braunr> as for the fixed values, they were already there in that form, i thought it better to leave them that way to make the diffs easier to read <braunr> i'll make macros instead <braunr> so mach_param.h may have to be brought back <braunr> or else put them in the ipc .h files # IRC, freenode, #hurd, 2012-01-18 <braunr> does the slab branch need other reviews/reports before being integrated ? # IRC, freenode, #hurd, 2012-01-30 <braunr> youpi: do you have some idea about when you want to get the slab branch in master ? <youpi> I was considering as soon as mcsim gets his paper <braunr> right # IRC, freenode, #hurd, 2012-02-22 <mcsim> Do I understand correctly that a real memory page must necessarily be in one of the following lists: vm_page_queue_active, vm_page_queue_inactive, vm_page_queue_free? <braunr> cached pages are <braunr> some special pages used only by the kernel aren't <braunr> pages can be both wired and cached (i.e. managed by the page cache), so that they can be passed to external applications and then unwired (as is the case with your host_slab_info() function if you remember) <braunr> use "physical" instead of "real memory" <mcsim> braunr: thank you.
# IRC, freenode, #hurd, 2012-04-22 <braunr> youpi: tschwinge: when the slab code was added, a few new files made it into gnumach that come from my git repo and are used in other projects as well <braunr> they're licensed under BSD upstream and GPL in gnumach, and though it initially didn't disturb me, now it does <braunr> i think i should fix this by leaving the original copyright and adding the GPL on top <youpi> sure, submit a patch <braunr> hm i have direct commit access if i'm right <youpi> then fix it :) <braunr> do you want to review ? <youpi> I don't think there is any need to <braunr> ok # IRC, freenode, #hurd, 2012-12-08 <mcsim> braunr: hi. Do I understand correctly that nearly the same technique is used in linux to determine the slab where the object to be freed resides? <braunr> yes but it's faster on linux since it uses a direct mapping of physical memory <braunr> it just has to shift the virtual address to obtain the physical one, whereas x15 has to walk the page tables <braunr> of course it only works for kmalloc, vmalloc is entirely different <mcsim> btw, does it make sense to use some kind of B-tree instead of AVL to decrease the number of cache misses? AFAIK, in modern processors the size of an L1 cache line is at least 64 bytes, so in one node we can put at least 4 leaves (key + pointer to data), making search faster.
<braunr> that would be a b-tree <braunr> and yes, red-black trees were actually developed based on properties observed on b-trees <braunr> but increasing the size of the nodes also increases memory overhead <braunr> and code complexity <braunr> that's why i have a radix trees for cases where there are a large number of entries with keys close to each other :) <braunr> a radix-tree is basically a b-tree using the bits of the key as indexes in the various arrays it walks instead of comparing keys to each other <braunr> the original avl tree used in my slab allocator was intended to reduce the average height of the tree (avl is better for that) <braunr> avl trees are more suited for cases where there are more lookups than inserts/deletions <braunr> they make the tree "flatter" but the maximum complexity of operations that change the tree is 2log2(n), since rebalancing the tree can make the algorithm reach back to the tree root <braunr> red-black trees have slightly bigger heights but insertions are limited to 2 rotations and deletions to 3 <mcsim> there should be not much lookups in slab allocators <braunr> which explains why they're more generally found in generic containers <mcsim> or do I misunderstand something? 
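braunr's description of a radix tree above (the bits of the key index the arrays being walked, and no keys are compared) can be sketched in a few lines of C. This is a minimal fixed-height illustration for 32-bit keys, consuming 4 bits per level; the names and parameters are assumptions and it is not the x15 implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define RADIX_BITS   4                      /* key bits consumed per level */
#define RADIX_SLOTS  (1u << RADIX_BITS)     /* 16 slots per node */
#define RADIX_LEVELS (32 / RADIX_BITS)      /* 8 levels cover a 32-bit key */

struct radix_node {
    void *slots[RADIX_SLOTS];  /* child nodes, or values at the last level */
};

/* Slot index for 'key' at 'level' (level 0 uses the most significant bits). */
static unsigned radix_index(uint32_t key, unsigned level)
{
    unsigned shift = (RADIX_LEVELS - 1 - level) * RADIX_BITS;

    return (key >> shift) & (RADIX_SLOTS - 1);
}

/* Insert 'value' under 'key', creating intermediate nodes as needed. */
static int radix_insert(struct radix_node *root, uint32_t key, void *value)
{
    struct radix_node *node = root;

    for (unsigned level = 0; level < RADIX_LEVELS - 1; level++) {
        unsigned i = radix_index(key, level);

        if (node->slots[i] == NULL) {
            node->slots[i] = calloc(1, sizeof(struct radix_node));
            if (node->slots[i] == NULL)
                return -1;
        }
        node = node->slots[i];
    }

    node->slots[radix_index(key, RADIX_LEVELS - 1)] = value;
    return 0;
}

/* Walk the tree using key bits only; no key comparison takes place. */
static void *radix_lookup(const struct radix_node *root, uint32_t key)
{
    const struct radix_node *node = root;

    for (unsigned level = 0; level < RADIX_LEVELS - 1; level++) {
        node = node->slots[radix_index(key, level)];
        if (node == NULL)
            return NULL;
    }

    return node->slots[radix_index(key, RADIX_LEVELS - 1)];
}
```

Keys that are close to each other, such as 0x1000 and 0x1001, share every node except the last slot, which is the case braunr says the structure is suited for. The cost is the memory taken by sparsely filled nodes when keys are scattered.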
    <braunr> well, there is a lookup for each free()
    <braunr> whereas there are insertions/deletions when a slab becomes non-empty/empty
    <mcsim> I see
    <braunr> so it was very efficient for caches of small objects, where slabs have many of them
    <braunr> also, i wrote the implementation in userspace, without functionality pmap provides (although i could have emulated it afterwards)

# IRC, freenode, #hurd, 2013-01-06

    <youpi> braunr: panic: vm_map: kentry memory exhausted
    <braunr> youpi: ouch
    <youpi> that's what I usually get
    <braunr> ok
    <braunr> the kentry area is a preallocated memory area that is used to back the vm_map_kentry cache
    <braunr> objects from this cache are used to describe kernel virtual memory
    <braunr> so in this case, i simply assume the kentry area must be enlarged
    <braunr> (currently, both virtual and physical memory is preallocated, an improvement could be what is now done in x15, to preallocate virtual memory only
    <braunr> )
    <youpi> Mmm, why do we actually have this limit?
    <braunr> the kentry area must be described by one entry
    <youpi> ah, sorry, vm/vm_resident.c: kentry_data = pmap_steal_memory(kentry_data_size);
    <braunr> a statically allocated one
    <youpi> I had missed that one
    <braunr> previously, the zone allocator would do that
    <braunr> the kentry area is required to avoid recursion when allocating memory
    <braunr> another solution would be a custom allocator in vm_map, but i wanted to use a common cache for those objects too
    <braunr> youpi: you could simply try doubling KENTRY_DATA_SIZE
    <youpi> already doing that
    <braunr> we might even consider a much larger size until it's reworked
    <youpi> well, it's rare enough on buildds already
    <youpi> doubling should be enough
    <youpi> or else we have leaks
    <braunr> right
    <braunr> it may not be leaks though
    <braunr> it may be poor map entry merging
    <braunr> i'd expected the kernel map entries to be easier to merge, but it may simply not be the case
    <braunr> (i mean, when i made my tests, it looked like there were few kernel map entries, but i may have missed corner cases that could cause more of them to be needed)

## IRC, freenode, #hurd, 2014-02-11

    <braunr> youpi: what's the issue with kentry_data_size ?
    <youpi> I don't know
    <braunr> so back to 64 pages from 256 ?
    <youpi> in debian for now yes
    <braunr> :/
    <braunr> from what i recall with x15, grub is indeed allowed to put modules and command lines around as it likes
    <braunr> restricted to 4G
    <braunr> iirc, command lines were in the first 1M while modules could be loaded right after the kernel or at the end of memory, depending on the versions
    <youpi> braunr: possibly VM_KERNEL_MAP_SIZE is then not big enough
    <braunr> youpi: what's the size of the ramdisk ?
    <braunr> youpi: or kmem_map too big
    <braunr> we discussed this earlier with teythoon

[[user-space_device_drivers]], *Open Issues*, *System Boot*, *IRC, freenode, \#hurd, 2011-07-27*, *IRC, freenode, #hurd, 2014-02-10*

    <braunr> or maybe we want to remove kmem_map altogether and directly use kernel_map
    <youpi> it's 6.2MiB big
    <braunr> hm
    <youpi> err no
    <braunr> looks small
    <youpi> 70MiB
    <braunr> ok yes
    <youpi> (uncompressed)
    <braunr> well
    <braunr> kernel_map is supposed to have 64M on i386 ...
    <braunr> it's 192M large, with kmem_map taking 128M
    <braunr> so at most 64M, with possible fragmentation
    <teythoon> i believe the compressed initrd is stored in the ramdisk
    <youpi> ah, right it's ext2fs which uncompresses it
    <braunr> uncompresses it where
    <braunr> ?
    <teythoon> libstore does that
    <youpi> module --nounzip /boot/${gtk}initrd.gz
    <youpi> braunr: in userland memory
    <youpi> it's not grub which uncompresses it for sure
    <teythoon> braunr: so my ramdisk isn't 64 megs either
    <braunr> which explains why it sometimes works
    <teythoon> yes
    <teythoon> mine is like 15 megs
    <braunr> kentry_data_size calls pmap_steal_memory, an early allocation function which changes virtual_space_start, which is later used to create the first kernel map entry
    <braunr> err, pmap_steal_memory is called with kentry_data_size as its argument
    <braunr> this first kernel map entry is installed inside kernel_map and reduces the amount of available virtual memory there
    <braunr> so yes, it all points to a layout problem
    <braunr> i suggest reducing kmem_map down to 64M
    <youpi> that's enough to get d-i back to boot
    <youpi> what would be the downside?
    <youpi> (why did you raise it to 128 actually? :) )
    <braunr> i merged the map used by generic kalloc allocations into kmem_map
    <braunr> both were 64M
    <braunr> i don't see any downside for the moment
    <braunr> i rarely see more than 50M used by the slab allocator
    <braunr> and with the recent code i added to collect reclaimable memory on kernel allocation failures, it's unlikely the slab allocator will be starved
    <youpi> but then we need that patch too
    <braunr> no
    <braunr> it would be needed if kmem_map gets filled
    <braunr> this very rarely happens
    <youpi> is "very rarely" enough ? :)
    <braunr> actually i've never seen it happen
    <braunr> i added it because i had port leaks with fakeroot
    <braunr> port rights are a bit special because they're stored in a table in kernel space
    <braunr> this table is enlarged with kmem_realloc
    <braunr> when an ipc space gets very large, fragmentation makes it very difficult to successfully resize it
    <braunr> that should be the only possible issue
    <braunr> actually, there is another submap that steals memory from kernel_map: device_io_map is 16M large
    <braunr> so kernel_map gets down to 48M
    <braunr> if the initial entry (that is, kentry_data_size + the physical page table size) gets a bit large, kernel_map may have very little available room
    <braunr> the physical page table size obviously varies depending on the amount of physical memory loaded, which may explain why the installer worked on some machines
    <youpi> well, it works up to 1855M
    <youpi> at 1856 it doesn't work any more :)
    <braunr> heh :)
    <youpi> and that's about the max gnumach can handle anyway
    <braunr> then reducing kmem_map down to 96M should be enough
    <youpi> it works indeed
    <braunr> could you check the amount of available space in kernel_map ?
    <braunr> the value of kernel_map->size should do
    <youpi> printing it "multiboot modules" print should be fine I guess?

### IRC, freenode, #hurd, 2014-02-12

    <braunr> probably
    <teythoon> ?
    <braunr> i expect a bit more than 160M
    <braunr> (for the value of kernel_map->size)
    <braunr> teythoon: ?
    <youpi> well, it's 2110210048
    <teythoon> what is multiboot modules printing ?
    <youpi> almost last in gnumach bootup
    <braunr> humm
    <braunr> it must account directly mapped physical pages
    <braunr> considering the kernel has exactly 2G, this means there is 36M available in kernel_map
    <braunr> youpi: is the ramdisk loaded at that moment ?
    <youpi> what do you mean by "loaded" ? :)
    <braunr> created
    <youpi> where?
    <braunr> allocated in kernel memory
    <youpi> the script hasn't started yet
    <braunr> ok
    <braunr> its size was 6M+ right ?
    <braunr> so it leaves around 30M
    <youpi> something like this yes
    <braunr> and changing kmem_map from 128M to 96M gave us 32M
    <braunr> so that's it

# IRC, freenode, #hurd, 2013-04-18

    <braunr> oh nice, i've found a big scalability issue with my slab allocator
    <braunr> it shouldn't affect gnumach much though

## IRC, freenode, #hurd, 2013-04-19

    <ArneBab> braunr: is it fixable?
    <braunr> yes
    <braunr> well, i'll do it in x15 for a start
    <braunr> again, i don't think gnumach is much affected
    <braunr> it's a scalability issue
    <braunr> when millions of objects are in use
    <braunr> gnumach rarely has more than a few hundred thousands
    <braunr> it's also related to heavy multithreading/smp
    <braunr> and by multithreading, i also mean preemption
    <braunr> gnumach isn't preemptible and uniprocessor
    <braunr> if the resulting diff is clean enough, i'll push it to gnumach though :)

### IRC, freenode, #hurd, 2013-04-21

    <braunr> ArneBab_: i fixed the scalability problems btw

## IRC, freenode, #hurd, 2013-04-20

    <braunr> well, there is also a locking error in the slab allocator, although not a problem for a non preemptible kernel like gnumach
    <braunr> non preemptible / uniprocessor

## IRC, freenode, #hurd, 2016-12-29

    <braunr> i've identified a fundamental flaw with the default pager
    <braunr> and actually, with mach in general i suppose
    <braunr> i assumed that it was necessary to trust the server only
    <braunr> that a server didn't need to trust its client
    <braunr> but mach messages carry memory that is potentially mapped from unprivileged pagers
    <braunr> which means faulting on that memory effectively makes the faulting process a client to the unprivileged pager
    <braunr> and that's something that can happen to the default pager during heavy memory pressure
    <braunr> in which case it deadlocks on itself because the copyout hangs on a fault, waiting for the unprivileged pager to provide the data
    <braunr> (which it can't because of heavy memory pressure and because it's unprivileged, it's blocked, waiting until allocations resume)
    <braunr> the pageout daemon will keep paging out to the default pager in the hope those pages get freed
    <braunr> but sending to the default pager is now impossible because its map is locked on the never-ending fault
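
The chain of dependencies described in the log above can be read as a wait-for graph that contains a cycle. What follows is only a minimal sketch in Python of that reading. The node names (`default_pager`, `unprivileged_pager`, `free_pages`, `pageout_daemon`) are made-up labels for this illustration, not identifiers from gnumach or the Hurd, and the edges are an interpretation of the conversation rather than something taken from the source code.

```python
# Hypothetical wait-for graph: "A: [B]" means "A cannot make progress until B does".
# All names are labels invented for this sketch, not gnumach/Hurd identifiers.
WAITS_FOR = {
    # copyout in the default pager faults on memory backed by an unprivileged pager
    "default_pager": ["unprivileged_pager"],
    # the unprivileged pager is blocked until allocations resume
    "unprivileged_pager": ["free_pages"],
    # under memory pressure, pages are only freed by paging out
    "free_pages": ["pageout_daemon"],
    # the pageout daemon has to send its pages to the default pager
    "pageout_daemon": ["default_pager"],
}


def find_cycle(graph, start):
    """Return a wait-for cycle reachable from start as a list of nodes, or None."""
    path = []          # nodes on the current depth-first path, in order
    on_path = set()    # same nodes, for fast membership tests
    visited = set()    # nodes already fully or partially explored

    def visit(node):
        if node in on_path:
            # Found a back edge: the cycle is the path from node back to itself.
            return path[path.index(node):] + [node]
        if node in visited:
            return None
        visited.add(node)
        path.append(node)
        on_path.add(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle is not None:
                return cycle
        path.pop()
        on_path.remove(node)
        return None

    return visit(start)


if __name__ == "__main__":
    cycle = find_cycle(WAITS_FOR, "default_pager")
    print(" -> ".join(cycle) if cycle else "no cycle")
```

In this model, removing the first edge (the default pager never faulting on memory backed by an unprivileged pager) is enough to break the cycle, which matches the point made in the log that the problem comes from a server implicitly becoming a client of an untrusted pager.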