[[!meta copyright="Copyright © 2011, 2012, 2013, 2014, 2016 Free Software
Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
id="license" text="Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
is included in the section entitled [[GNU Free Documentation
License|/fdl]]."]]"""]]

[[!tag open_issue_gnumach]]

[[!toc]]


# IRC, freenode, #hurd, 2011-04-12

    <antrik> braunr: do you think the allocator you wrote for x15 could be used
      for gnumach? and would you be willing to mentor this? :-)
    <braunr> antrik: to be willing to isn't my current problem
    <braunr> antrik: and yes, I think my allocator can be used
    <braunr> it's a slab allocator after all, it only requires reap() and
      grow()
    <braunr> or mmap()/munmap() whatever you want to call it
    <braunr> a backend
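
As an illustration of the "backend" remark above, here is a minimal sketch in C of a cache that obtains and returns memory only through a pluggable source. All names (`struct source`, `cache_grow`, `cache_reap`, `heap_source`) are hypothetical and are not taken from x15 or gnumach; the heap stands in for whatever the kernel would really use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLAB_SIZE 4096          /* assumed slab size for the sketch */
#define MAX_SLABS 8             /* arbitrary bound to keep the sketch simple */

/* The backend: the only two operations the cache needs from the outside. */
struct source {
    void *(*alloc)(size_t size);
    void (*release)(void *addr, size_t size);
};

struct cache {
    const struct source *source;
    void *slabs[MAX_SLABS];     /* slabs currently held by the cache */
    size_t nr_slabs;
};

/* grow(): ask the backend for one more slab. */
static int cache_grow(struct cache *cache)
{
    void *slab;

    assert(cache->source != NULL);

    if (cache->nr_slabs == MAX_SLABS)
        return -1;

    slab = cache->source->alloc(SLAB_SIZE);

    if (slab == NULL)
        return -1;

    cache->slabs[cache->nr_slabs++] = slab;
    return 0;
}

/*
 * reap(): hand slabs back to the backend. A real allocator would only
 * release slabs with no live objects; this sketch treats all as unused.
 */
static size_t cache_reap(struct cache *cache)
{
    size_t reclaimed = 0;

    while (cache->nr_slabs > 0) {
        cache->source->release(cache->slabs[--cache->nr_slabs], SLAB_SIZE);
        reclaimed += SLAB_SIZE;
    }

    return reclaimed;
}

/* A heap-backed source, standing in for a kernel memory module. */
static void *heap_alloc(size_t size) { return malloc(size); }
static void heap_release(void *addr, size_t size) { (void)size; free(addr); }
static const struct source heap_source = { heap_alloc, heap_release };
```

Swapping `heap_source` for another `struct source` changes where memory comes from without touching the cache code, which is roughly the property being described.
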
    <braunr> antrik: although i've been having other ideas recently
    <braunr> that would have more impact on our usage patterns I think
    <antrik> mcsim: have you investigated how the zone allocator works and how
      it's hooked into the system yet?
    <braunr> mcsim: now let me give you a link
    <braunr> mcsim:
      http://git.sceen.net/rbraun/libbraunr.git/?a=blob;f=mem.c;h=330436e799f322949bfd9e2fedf0475660309946;hb=HEAD
    <braunr> mcsim: this is an implementation of the slab allocator i've been
      working on recently
    <braunr> mcsim: i haven't made it public because i reworked the per
      processor layer, and this part isn't complete yet
    <braunr> mcsim: you could use it as a reference for your project
    <mcsim> braunr: ok
    <braunr> it used to be close to the 2001 vmem paper
    <braunr> but after many tests, fragmentation and accounting issues have
      been found
    <braunr> so i rewrote it to be closer to the linux implementation (cache
      filling/draining in bukl transfers)
    <braunr> bulk*
    <braunr> they actually use the word draining in linux too :)
    <mcsim> antrik: not complete yet.
    <antrik> braunr: oh, it's unfinished? that's unfortunate...
    <braunr> antrik: only the per processor part
    <braunr> antrik: so it doesn't matter much for gnumach
    <braunr> and it's not difficult to set up
    <antrik> mcsim: hm, OK... but do you think you will have a fairly good
      understanding in the next couple of days?...
    <antrik> I'm asking because I'd really like to see a proposal a bit more
      specific than "I'll look into things..."
    <antrik> i.e. you should have an idea which things you will actually have
      to change to hook up a new allocator etc.
    <antrik> braunr: OK. will the interface remain unchanged, so it could be
      easily replaced with an improved implementation later?
    <braunr> the zone allocator in gnumach is a badly written bare object
      allocator actually, there aren't many things to understand about it
    <braunr> antrik: yes
    <antrik> great :-)
    <braunr> and the per processor part should be very close to the phys
      allocator sitting next to it
    <braunr> (with the slight difference that, as per cpu caches have variable
      sizes, they are allocated on the free path rather than on the allocation
      path)
    <braunr> this is a nice trick in the vmem paper i've kept in mind
    <braunr> and the interface also allows to set a "source" for caches
    <antrik> ah, good point... do you think we should replace the physmem
      allocator too? and if so, do it in one step, or one piece at a time?...
    <braunr> no
    <braunr> too many drivers currently depend on the physical allocator and
      the pmap module as they are
    <braunr> remember linux 2.0 drivers need a direct virtual to physical
      mapping
    <braunr> (especially true for dma mappings)
    <antrik> OK
    <braunr> the nice thing about having a configurable memory source is that
    <antrik> what do you mean by "allocated on the free path"?
    <braunr> even if most caches will use the standard vm_kmem module as their
      backend
    <braunr> there is one exception in the vm_map module, allowing us to get
      rid of either a static limit, or specific allocation code
    <braunr> antrik: well, when you allocate a page, the allocator will lookup
      one in a per cpu cache
    <braunr> if it's empty, it fills the cache
    <braunr> (called pools in my implementations)
    <braunr> it then retries
    <braunr> the problem in the slab allocator is that per cpu caches have
      variable sizes
    <braunr> so per cpu pools are allocated from their own pools
    <braunr> (remember the magazine_xx caches in the output i showed you, this
      is the same thing)
    <braunr> but if you allocate them at allocation time, you could end up in
      an infinite loop
    <braunr> so, in the slab allocator, when a per cpu cache is empty, you just
      fall back to the slab layer
    <braunr> on the free path, when a per cpu cache doesn't exist, you allocate
      it from its own cache
    <braunr> this way you can't have an infinite loop
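
The rule described above (never create a per-CPU pool while allocating, only while freeing) can be sketched as follows. This is a simplified, single-CPU illustration under assumed names (`struct pool`, `cache_alloc`, `cache_free`, `POOL_CAPACITY`), with `malloc()` standing in for the slab layer; it is not the code from the linked `mem.c`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_CAPACITY 4          /* hypothetical; real pool sizes vary per cache */

struct pool {                    /* per-CPU stack of free objects */
    size_t nr_objs;
    void *objs[POOL_CAPACITY];
};

struct cache {
    size_t obj_size;
    struct pool *cpu_pool;       /* one CPU only, to keep the sketch short */
    struct cache *pool_cache;    /* cache the pool itself is allocated from */
    unsigned long slab_allocs;   /* counts trips to the slab layer */
};

/* Stand-in for the slab layer; a real allocator would carve slabs here. */
static void *slab_alloc(struct cache *cache)
{
    cache->slab_allocs++;
    return malloc(cache->obj_size);
}

static void slab_free(struct cache *cache, void *obj)
{
    (void)cache;
    free(obj);
}

/* Allocation path: never creates a pool, so it cannot recurse. */
static void *cache_alloc(struct cache *cache)
{
    struct pool *pool = cache->cpu_pool;

    if (pool != NULL && pool->nr_objs > 0)
        return pool->objs[--pool->nr_objs];

    /* Empty or missing pool: fall back to the slab layer. */
    return slab_alloc(cache);
}

/* Free path: the only place where a missing pool gets allocated. */
static void cache_free(struct cache *cache, void *obj)
{
    assert(obj != NULL);

    if (cache->cpu_pool == NULL && cache->pool_cache != NULL) {
        struct pool *pool = cache_alloc(cache->pool_cache);

        if (pool != NULL) {
            pool->nr_objs = 0;
            cache->cpu_pool = pool;
        }
    }

    if (cache->cpu_pool != NULL && cache->cpu_pool->nr_objs < POOL_CAPACITY) {
        cache->cpu_pool->objs[cache->cpu_pool->nr_objs++] = obj;
        return;
    }

    slab_free(cache, obj);
}
```

Because `cache_alloc()` only ever reads an existing pool or goes to the slab layer, the call from `cache_free()` into the pool's own cache terminates after one step, whichever cache it is applied to.
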
    <mcsim> antrik: I'll try, but I have exams now.
    <mcsim> As I understand amount of elements which could be allocated we
      determine by zone initialization. And at this time memory for zone is
      reserved. I'm going to change this. And make something similar to kmalloc
      and vmalloc (support for pages consecutive physically and virtually). And
      pages in zones consecutive always physically.
    <mcsim> Am I right?
    <braunr> mcsim: don't try to do that
    <mcsim> why?
    <braunr> mcsim: we just need a slab allocator with an interface close to
      the zone allocator
    <antrik> mcsim: IIRC the size of the complete zalloc map is fixed; but not
      the number of elements per zone
    <braunr> we don't need two allocators like kmalloc and vmalloc
    <braunr> actually we just need vmalloc
    <braunr> IIRC the limits are only present because the original developers
      wanted to track leaks
    <braunr> they assumed zones would be large enough, which isn't true any
      more today
    <braunr> but i didn't see any true reservation
    <braunr> antrik: i'm not sure i was clear enough about the "allocation of
      cpu caches on the free path"
    <braunr> antrik: for a better explanation, read the vmem paper ;)
    <antrik> braunr: you mean there is no fundamental reason why the zone map
      has a limited maximal size; and it was only put in to catch cases where
      something eats up all memory with kernel object creation?...
    <antrik> braunr: I think I got it now :-)
    <braunr> antrik: i'm pretty certain of it yes
    <antrik> I don't see though how it is related to what we were talking
      about...
    <braunr> 10:55 < braunr> and the per processor part should be very close to
      the phys allocator sitting next to it
    <braunr> the phys allocator doesn't have to use this trick
    <braunr> because pages have a fixed size, so per cpu caches all have the
      same size too
    <braunr> and the number of "caches", that is, physical segments, is limited
      and known at compile time
    <braunr> so having them statically allocated is possible
    <antrik> I see
    <braunr> it would actually be very difficult to have a phys allocator
      requiring dynamic allocation when the dynamic allocator isn't yet ready
    <antrik> hehe :-)
    <mcsim> total size of all zone allocations is limited to 12 MB. And is "was
      only put in to catch cases where something eats up all memory with kernel
      object creation?"
    <braunr> mcsim: ah right, there could be a kernel submap backing all the
      zones
    <braunr> but this can be increased too
    <braunr> submaps are kind of evil :/
    <antrik> mcsim: I think it's actually 32 MiB or something like that in the
      Debian version...
    <antrik> braunr: I'm not sure I ever fully understood what the zalloc map
      is... I looked through the code once, and I think I got a rough
      understanding, but I was still pretty uncertain about some bits. and I
      don't remember the details anyways :-)
    <braunr> antrik: IIRC, it's a kernel submap
    <braunr> it's named kmem_map in x15
    <antrik> don't know what a submap is
    <braunr> submaps are vm_map objects
    <braunr> in a top vm_map, there are vm_map_entries
    <braunr> these entries usually point to vm_objects
    <braunr> (for the page cache)
    <braunr> but they can point to other maps too
    <braunr> the goal is to reduce fragmentation by isolating allocations
    <braunr> this also helps reducing contention
    <braunr> for example, on BSD, there is a submap for mbufs, so that the
      network code doesn't interfere too much with other kernel allocations
    <braunr> antrik: they are similar to spans in vmem, but vmem has an elegant
      importing mechanism which eliminates the static limit problem
    <antrik> so memory is not directly allocated from the physical allocator,
      but instead from another map which in turn contains physical memory, or
      something like that?...
    <braunr> no, this is entirely virtual
    <braunr> submaps are almost exclusively used for the kernel_map
    <antrik> you are using a lot of identifiers here, but I don't remember (or
      never knew) what most of them mean :-(
    <braunr> sorry :)
    <braunr> the kernel map is the vm_map used to represent the ~1 GiB of
      virtual memory the kernel has (on i386)
    <braunr> vm_map objects are simple virtual space maps
    <braunr> they contain what you see in linux when doing /proc/self/maps
    <braunr> cat /proc/self/maps
    <braunr> (linux uses entirely different names but it's roughly the same
      structure)
    <braunr> each line is a vm_map_entry
    <braunr> (well, there aren't submaps in linux though)
    <braunr> the pmap tool on netbsd is able to show the kernel map with its
      submaps, but i don't have any image around
    <mcsim> braunr: is limit for zones is feature and shouldn't be changed?
    <braunr> mcsim: i think we shouldn't have fixed limits for zones
    <braunr> mcsim: this should be part of the debugging facilities in the slab
      allocator
    <braunr> is this fixed limit really a major problem ?
    <braunr> i mean, don't focus on that too much, there are other issues
      requiring more attention
    <antrik> braunr: at 12 MiB, it used to be, causing a lot of zalloc
      panics. after increasing, I don't think it's much of a problem anymore...
    <antrik> but as memory sizes grow, it might become one again
    <antrik> that's the problem with a fixed size...
    <braunr> yes, that's the issue with submaps
    <braunr> but gnumach is full of those, so let's fix them by order of
      priority
    <antrik> well, I'm still trying to digest what you wrote about submaps :-)
    <braunr> i'm downloading netbsd, so you can have a good view of all this
    <antrik> so, when the kernel allocates virtual address space regions
      (mostly for itself), instead of grabbing chunks of the address space
      directly, it takes parts out of a pre-reserved region?
    <braunr> not exactly
    <braunr> both statements are true
    <mcsim> antrik: only virtual addresses are reserved
    <braunr> it grabs chunks of the address space directly, but does so in a
      reserved region of the address space
    <braunr> a submap is like a normal map, it has a start address, a size, and
      is empty, then it's populated with vm_map_entries
    <braunr> so instead of allocating from 3-4 GiB, you allocate from, say,
      3.1-3.2 GiB
    <antrik> yeah, that's more or less what I meant...
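
To make the submap idea concrete, here is a toy model in C: a map is just an address range with a trivial bump allocator, and a submap is a range reserved inside its parent. The names (`struct map`, `map_alloc`, `map_suballoc`) are invented for this sketch and are much simpler than Mach's `vm_map`; only virtual addresses are handed out, and nothing is backed by physical memory.

```c
#include <assert.h>
#include <stdint.h>

struct map {
    uint64_t start;   /* first address covered */
    uint64_t end;     /* one past the last address covered */
    uint64_t next;    /* next unallocated address (bump pointer) */
};

static void map_init(struct map *map, uint64_t start, uint64_t end)
{
    assert(start <= end);
    map->start = start;
    map->end = end;
    map->next = start;
}

/* Reserve size bytes of address space; returns 0 on success, -1 if full. */
static int map_alloc(struct map *map, uint64_t size, uint64_t *addrp)
{
    if (size > map->end - map->next)
        return -1;

    *addrp = map->next;
    map->next += size;
    return 0;
}

/*
 * Create a submap: reserve a range in the parent, then treat that range
 * as an independent, initially empty map.
 */
static int map_suballoc(struct map *parent, uint64_t size, struct map *submap)
{
    uint64_t start;

    if (map_alloc(parent, size, &start) != 0)
        return -1;

    map_init(submap, start, start + size);
    return 0;
}
```

With a kernel map covering 3-4 GiB and a 32 MiB submap, allocations from the submap stay inside its range and start failing once those 32 MiB are used up, even though the parent still has free address space. That is the fixed-limit problem discussed above.
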
    <mcsim> braunr: I see two problems: limited zones and absence of caching. 
    <mcsim> with caching absence of readahead paging will be not so significant
    <braunr> please avoid readahead
    <mcsim> ok
    <braunr> and it's not about paging, it's about kernel memory, which is
      wired
    <braunr> (well most of it)
    <braunr> what about limited zones ?
    <braunr> the whole kernel space is limited, there has to be limits
    <braunr> the problem is how to handle them
    <antrik> braunr: almost all. I looked through all zones once, and IIRC I
      found exactly one that actually allows paging...
    <braunr> currently, when you reach the limit, you have an OOM error
    <braunr> antrik: yes, there are
    <braunr> i don't remember which implementation does that but, when
      processes haven't been active for a minute or so, they are "swapped out"
    <braunr> completely
    <braunr> even the kernel stack
    <braunr> and the page tables
    <braunr> (most of the pmap structures are destroyed, some are retained)
    <antrik> that might very well be true... at least inactive processes often
      show up with 0 memory use in top on Hurd
    <braunr> this is done by having a pageable kernel map, with wired entries
    <braunr> when the swapper thread swaps tasks out, it unwires them
    <braunr> but i think modern implementations don't do that any more
    <antrik> well, I was talking about zalloc only :-)
    <braunr> oh
    <braunr> so the zalloc_map must be pageable
    <braunr> or there are two submaps ?
    <antrik> not sure whether "modern implementations" includes Linux ;-)
    <braunr> no, i'm talking about the bsd family only
    <antrik> but it's certainly true that on Linux even inactive processes
      retain some memory
    <braunr> linux doesn't make any difference between processor-bound and
      I/O-bound processes
    <antrik> braunr: I have no idea how it works. I just remember that when
      creating zones, one of the optional flags decides whether the zone is
      pagable. but as I said, IIRC there is exactly one that actually is...
    <braunr> zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max,
      zone_map_size, FALSE);
    <braunr> kmem_suballoc(parent, min, max, size, pageable)
    <braunr> so the zone_map isn't
    <antrik> IIRC my conclusion was that pagable zones do not count in the
      fixed zone map limit... but I'm not sure anymore
    <braunr> zinit() has a memtype parameter
    <braunr> with ZONE_PAGEABLE as a possible flag
    <braunr> this is weird :)
    <mcsim> There is no any zones which use ZONE_PAGEABLE flag
    <antrik> mcsim: are you sure? I think I found one...
    <braunr> if (zone->type & ZONE_PAGEABLE) {
    <antrik> admittedly, it is several years ago that I looked into this, so my
      memory is rather dim...
    <braunr> if (kmem_alloc_pageable(zone_map, &addr, ...
    <braunr> calling kmem_alloc_pageable() on an unpageable submap seems wrong
    <mcsim> I've greped gnumach code and there is no any zinit procedure call
      with ZONE_PAGEABLE flag
    <braunr> good
    <antrik> hm... perhaps it was in some code that has been removed
      altogether since ;-)
    <antrik> actually I think it would be pretty neat to have pageable kernel
      objects... but I guess it would require considerable effort to implement
      this right
    <braunr> mcsim: you also mentioned absence of caching
    <braunr> mcsim: the zone allocator actually is a bare caching object
      allocator
    <braunr> antrik: no, it's easy
    <braunr> antrik: i already had that in x15 0.1
    <braunr> antrik: the problem is being sure the objects you allocate from a
      pageable backing store are never used when resolving a page fault
    <braunr> that's all
    <antrik> I wouldn't expect that to be easy... but surely you know better
      :-)
    <mcsim> braunr: indeed. I was wrong.
    <antrik> braunr: what is a caching object allocator?...
    <braunr> antrik: ok, it's not easy
    <braunr> antrik: but once you have vm_objects implemented, having pageable
      kernel object is just a matter of using the right options, really
    <braunr> antrik: an allocator that caches its buffers
    <braunr> some years ago, the term "object" would also apply to
      preconstructed buffers
    <antrik> I have no idea what you mean by "caches its buffers" here :-)
    <braunr> well, a memory allocator which doesn't immediately free its
      buffers caches them
    <mcsim> braunr: but can it return objects to system?
    <braunr> mcsim: which one ?
    <antrik> yeah, obviously the *implementation* of pageable kernel objects is
      not hard. the tricky part is deciding which objects can be pageable, and
      which need to be wired...
    <mcsim> Can zone allocator return cached objects to system as in slab?
    <mcsim> I mean reap()
    <braunr> well yes, it does so, and it does that too often
    <braunr> the caching in the zone allocator is actually limited to the
      pagesize
    <braunr> once page is completely free, it is returned to the vm
    <mcsim> this is bad caching
    <braunr> yes
    <mcsim> if object takes all page then there is no caching at all
    <braunr> caching by side effect
    <braunr> true
    <braunr> but the linux slab allocator does the same thing :p
    <braunr> hm
    <braunr> no, the solaris slab allocator does so
    <mcsim> linux's slab returns objects only when system ask
    <antrik> without preconstructed objects, is there actually any point in
      caching empty slabs?...
    <mcsim> Once I've changed my allocator to slab and it cached more than 1GB
      of my memory)
    <braunr> ok wait, need to fix a few mistakes first
    <mcsim> s/ask/asks
    <braunr> the zone allocator (in gnumach) actually has a garbage collector
    <antrik> braunr: well, the Solaris allocator follows the slab/magazine
      paper, right? so there is caching at the magazine layer... in that case
      caching empty slabs too would be rather redundant I'd say...
    <braunr> which is called when running low on memory, similar to the slab
      allocator
    <braunr> antrik: yes
    <antrik> (or rather the paper follows the Solaris allocator ;-) )
    <braunr> mcsim: the zone allocator reap() is zone_gc()
    <antrik> braunr: hm, right, there is a "collectable" flag for zones... but
      I never understood what it means
    <antrik> braunr: BTW, I heard Linux has yet another allocator now called
      "slob"... do you happen to know what that is?
    <braunr> slob is a very simple allocator for embedded devices
    <mcsim> AFAIR this is just heap allocator
    <braunr> useful when you have a very low amount of memory
    <braunr> like 1 MiB
    <braunr> yes
    <antrik> just googled it :-)
    <braunr> zone and slab are very similar
    <antrik> sounds like a simple heap allocator
    <mcsim> there is another allocator that calls slub, and it better than slab
      in many cases
    <braunr> the main difference is the data structures used to store slabs
    <braunr> mcsim: i disagree
    <antrik> mcsim: ah, you already said that :-)
    <braunr> mcsim: slub is better for systems with very large amounts of
      memory and processors
    <braunr> otherwise, slab is better
    <braunr> in addition, there are accounting issues with slub
    <braunr> because of cache merging
    <mcsim> ok. This strange that slub is default allocator
    <braunr> well both are very good
    <braunr> iirc, linus stated that he really doesn't care as long as it
      works fine
    <braunr> he refused slqb because of that
    <braunr> slub is nice because it requires less memory than slab, while
      still being as fast for most cases
    <braunr> it gets slower on the free path, when the cpu performing the free
      is different from the one which allocated the object
    <braunr> that's a reasonable cost
    <mcsim> slub uses heap for large object. Are there any tests that compare
      what is better for large objects?
    <antrik> well, if slub requires less memory, why do you think slab is
      better for smaller systems? :-)
    <braunr> antrik: smaller is relative
    <antrik> mcsim: for large objects slab allocation is rather pointless, as
      you don't have multiple objects in a page anyways...
    <braunr> antrik: when lameter wrote slub, it was intended for systems with
      several hundreds processors
    <antrik> BTW, was slqb really refused only because the other ones are "good
      enough"?...
    <braunr> yes
    <antrik> wow, that's a strange argument...
    <braunr> linus is already unhappy of having "so many" allocators
    <antrik> well, if the new one is better, it could replace one of the others
      :-)
    <antrik> or is it useful only in certain cases?
    <braunr> that's the problem
    <braunr> nobody really knows
    <antrik> hm, OK... I guess that should be tested *before* merging ;-)
    <antrik> is anyone still working on it, or was it abandoned?
    <antrik> mcsim: back to caching...
    <antrik> what does caching in the kernel object allocator got to do with
      readahead (i.e. clustered paging)?...
    <mcsim> if we cached some physical pages we don't need to find new ones for
      allocating new object. And that's why there will not be a page fault.
    <mcsim> antrik: Regarding kam. Hasn't he finished his project?
    <antrik> err... what?
    <antrik> one of us must be seriously confused
    <antrik> I totally fail to see what caching of physical pages (which isn't
      even really a correct description of what slab does) has to do with page
      faults
    <antrik> right, KAM didn't finish his project
    <mcsim> If we free the physical page and return it to system we need
      another one for next allocation. But if we keep it, we don't need to find
      new physical page. 
    <mcsim> And physical page is allocated only then when page fault
      occurs. Probably, I'm wrong
    <antrik> what does "return to system" mean? we are talking about the
      kernel...
    <antrik> zalloc/slab are about allocating kernel objects. this doesn't have
      *anything* to do with paging of userspace processes
    <antrik> only thing the have in common is that they need to get pages from
      the physical page allocator. but that's yet another topic
    <mcsim> Under "return to system" I mean ability to use this page for other
      needs.
    <braunr> mcsim: consider kernel memory to be wired
    <braunr> here, return to system means releasing a page back to the vm
      system
    <braunr> the vm_kmem module then unmaps the physical page and frees its
      virtual address in the kernel map
    <mcsim> ok
    <braunr> antrik: the problem with new allocators like slqb is that it's
      very difficult to really know if they're better, even with extensive
      testing
    <braunr> antrik: there are papers (like wilson95) about the difficulties in
      making valuable results in this field
    <braunr> see
      http://www.sceen.net/~rbraun/dynamic_storage_allocation_a_survey_and_critical_review.pdf
    <mcsim> how can be allocated physically continuous object now?
    <braunr> mcsim: rephrase please
    <mcsim> what is similar to kmalloc in Linux to gnumach?
    <braunr> i know memory is reserved for dma in a direct virtual to physical
      mapping
    <braunr> so even if the allocation is done similarly to vmalloc()
    <braunr> the selected region of virtual space maps physical memory, so
      memory is physically contiguous too
    <braunr> for other allocation types, a block large enough is allocated, so
      it's contiguous too
    <mcsim> I don't clearly understand. If we have fragmentation in physical
      ram, so there aren't 2 free pages in a row, but there are able apart, we
      can't to allocate these 2 pages along?
    <braunr> no
    <braunr> but every system has this problem
    <mcsim> But since we have only 12 or 32 MB of memory the problem becomes
      more significant
    <braunr> you're confusing virtual and physical memory
    <braunr> those 32 MiB are virtual
    <braunr> the physical pages backing them don't have to be contiguous
    <mcsim> Oh, indeed 
    <mcsim> So the only problem are limits?
    <braunr> and performance
    <braunr> and correctness
    <braunr> i find the zone allocator badly written
    <braunr> antrik: mcsim: here is the content of the kernel pmap on NetBSD
      (which uses a virtual memory system close to the Mach VM)
    <braunr> antrik: mcsim: http://www.sceen.net/~rbraun/pmap.out

[[pmap.out]]

    <braunr> you can see the kmem_map (which is used for most general kernel
      allocations) is 128 MiB large
    <braunr> actually it's not the kernel pmap, it's the kernel_map
    <antrik> braunr: why is it called pmap.out then? ;-)
    <braunr> antrik: because the tool is named pmap
    <braunr> for process map
    <braunr> it also exists under Linux, although direct access to
      /proc/xx/maps gives more info
    <mcsim> braunr: I've said that this is kernel_map. Can I see kernel_map for
      Linux?
    <braunr> mcsim: I don't know how to do that
    <mcsim> s/I've/You've
    <braunr> but Linux doesn't have submaps, and uses a direct virtual to
      physical mapping, so it's used differently
    <antrik> how are things (such as zalloc zones) entered into kernel_map?
    <braunr> in zone_init() you have
    <braunr> zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max,
      zone_map_size, FALSE);
    <braunr> so here, kmem_map is named zone_map
    <braunr> then, in zalloc()
    <braunr> kmem_alloc_wired(zone_map, &addr, zone->alloc_size)
    <antrik> so, kmem_alloc just deals out chunks of memory referenced directly
      by the address, and without knowing anything about the use?
    <braunr> kmem_alloc() gives virtual pages
    <braunr> zalloc() carves them into buffers, as in the slab allocator
    <braunr> the difference is essentially the lack of formal "slab" object
    <braunr> which makes the zone code look like a mess
    <antrik> so kmem_suballoc() essentially just takes a bunch of pages from
      the main kernel_map, and uses these to back another map which then in
      turn deals out pages just like the main kernel_map?
    <braunr> no
    <braunr> kmem_suballoc creates a vm_map_entry object, and sets its start
      and end address
    <braunr> and creates a vm_map object, which is then inserted in the new
      entry
    <braunr> maybe that's what you meant with "essentially just takes a bunch
      of pages from the main kernel_map"
    <braunr> but there really is no allocation at this point
    <braunr> except the map entry and the new map objects
    <antrik> well, I'm trying to understand how kmem_alloc() manages things. so
      it has map_entry structures like the maps of userspace processes? do
      these also reference actual memory objects?
    <braunr> kmem_alloc just allocates virtual pages from a vm_map, and backs
      those with physical pages (unless the user requested pageable memory)
    <braunr> it's not "like the maps of userspace processes"
    <braunr> these are actually the same structures
    <braunr> a vm_map_entry can reference a memory object or a kernel submap
    <braunr> in netbsd, it can also reference nothing (for pure wired kernel
      memory like the vm_page array)
    <braunr> maybe it's the same in mach, i don't remember exactly
    <braunr> antrik: this is actually very clear in vm/vm_kern.c
    <braunr> kmem_alloc() creates a new kernel object for the allocation
    <braunr> allocates a new entry (or uses a previous existing one if it can
      be extended) through vm_map_find_entry()
    <braunr> then calls kmem_alloc_pages() to back it with wired memory
    <antrik> "creates a new kernel object" -- what kind of kernel object?
    <braunr> kmem_alloc_wired() does roughly the same thing, except it doesn't
      need a new kernel object because it knows the new area won't be pageable
    <braunr> a simple vm_object
    <braunr> used as a container for anonymous memory in case the pages are
      swapped out
    <antrik> vm_object is the same as memory object/pager? or yet something
      different?
    <braunr> antrik: almost
    <braunr> antrik: a memory_object is the user view of a vm_object
    <braunr> as in the kernel/user interfaces used by external pagers
    <braunr> vm_object is a more internal name
    <mcsim> Is fragmentation a big problem in slab allocator?
    <mcsim> I've tested it on my computer in Linux and for some caches it
      reached 30-40%
    <antrik> well, fragmentation is a major problem for any allocator...
    <antrik> the original slab allocator was designed specifically with the goal
      of reducing fragmentation
    <antrik> the revised version with the addition of magazines takes a step
      back on this though
    <antrik> have you compared it to slub? would be pretty interesting...
    <mcsim> I have an idea how can it be decreased, but it will hurt by
      performance...
    <mcsim> antrik: no I haven't, but there will be might the same, I think
    <mcsim> if each cache will handle two types of object: with sizes that will
      fit cache sizes (or I bit smaller) and with sizes which are much smaller
      than maximal cache size. For first type of object will be used standard
      slab allocator and for latter type will be used (within page) heap
      allocator.
    <mcsim> I think that than fragmentation will be decreased
    <antrik> not at all. heap allocator has much worse fragmentation. that's
      why slab allocator was invented
    <antrik> the problem is that in a long-running program (such as the
      kernel), objects tend to have vastly varying lifespans
    <mcsim> but we use heap only for objects of specified sizes
    <antrik> so often a few old objects will keep a whole page hostage
    <mcsim> for example for 32 byte cache it could be 20-28 byte objects
    <antrik> that's particularly visible in programs such as firefox, which
      will grow the heap during use even though actual needs don't change
    <antrik> the slab allocator groups objects in a fashion that makes it more
      likely adjacent objects will be freed at similar times
    <antrik> well, that's pretty oversimplified, but I hope you get the
      idea... it's about locality
    <mcsim> I agree, but I speak not about general heap allocation. We have
      many heaps for objects with different sizes.
    <mcsim> Could it be better?
    <antrik> note that this has been a topic of considerable research. you
      shouldn't seek to improve the actual algorithms -- you would have to read
      up on the existing research at least before you can contribute anything
      to the field :-)
    <antrik> how would that be different from the slab allocator?
    <mcsim> slab will allocate 32 byte for both 20 and 32 byte requests
    <mcsim> And if there was request for 20 bytes we get 12 unused
    <antrik> oh, you mean the implementation of the generic allocator on top of
      slabs? well, that might not be optimal... but it's not an often used case
      anyways. mostly the kernel uses constant-sized objects, which get their
      own caches with custom tailored size
    <antrik> I don't think the waste here matters at all
    <mcsim> affirmative. So my idea is useless. 
    <antrik> does the statistic you refer to show the fragmentation in absolute
      sizes too?
    <mcsim> Can you explain what is absolute size?
    <mcsim> I've counted what were requested (as parameter of kmalloc) and what
      was really allocated (according to best fit cache size).
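
The kind of measurement described here can be approximated in a few lines of C. The sketch below assumes general-purpose caches with power-of-two sizes starting at 32 bytes, which is an assumption for illustration and not necessarily the exact set of sizes Linux uses; `cache_size_for`, `record_request` and `waste_percent` are hypothetical helpers, not part of mcsim's module.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_CACHE_SIZE 32        /* assumed smallest general-purpose cache */

/* Size of the smallest (power-of-two) cache that fits the request. */
static size_t cache_size_for(size_t request)
{
    size_t size = MIN_CACHE_SIZE;

    assert(request > 0);

    while (size < request)
        size <<= 1;

    return size;
}

struct waste_stats {
    size_t requested;            /* bytes callers asked for */
    size_t allocated;            /* bytes actually handed out */
};

/* What a hook on the allocation path could accumulate. */
static void record_request(struct waste_stats *stats, size_t request)
{
    stats->requested += request;
    stats->allocated += cache_size_for(request);
}

/* Share of allocated bytes that nobody asked for, rounded down. */
static unsigned int waste_percent(const struct waste_stats *stats)
{
    if (stats->allocated == 0)
        return 0;

    return (unsigned int)(100 * (stats->allocated - stats->requested)
                          / stats->allocated);
}
```

Under these assumptions a 20-byte request is served from the 32-byte cache and leaves 12 bytes unused, matching the example given in the discussion. Note that a percentage alone says nothing about the absolute amount wasted, which is what antrik asks about next.
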
    <antrik> how did you get that information?
    <mcsim> I simply wrote a hook
    <antrik> I mean total. i.e. how many KiB or MiB are wasted due to
      fragmentation altogether
    <antrik> ah, interesting. how does it work?
    <antrik> BTW, did you read the slab papers?
    <mcsim> Do you mean articles from lwn.net?
    <antrik> no 
    <antrik> I mean the papers from the Sun hackers who invented the slab
      allocator(s)
    <antrik> Bonwick mostly IIRC
    <mcsim> Yes
    <antrik> hm... then you really should know the rationale behind it...
    <mcsim> There he reports about 11% memory waste
    <antrik> you didn't answer my other questions BTW :-)
    <mcsim> I've corrupted kernel tree with patch, and tomorrow I'm going to
      read myself up for exam (I have it on Thursday). But then I'll send you a
      module which I've used for testing.
    <antrik> OK
    <mcsim> I can send you module now, but it will not work without patch.
    <mcsim> It would be better to rewrite it using debugfs, but when I was
      writing this test I didn't know about trace_* macros
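
The measurement described above (bytes requested from the generic allocator versus bytes handed out by the best-fit cache) can be sketched in a few lines of C. This is only an illustration: the power-of-two size classes from 32 to 4096 bytes and the names `class_size`, `waste_stats` and `account` are assumptions made for the example, not code from Linux or gnumach.

```c
#include <stddef.h>

/* Hypothetical size classes: powers of two from 32 to 4096 bytes. */
#define MIN_CLASS 32u
#define MAX_CLASS 4096u

/*
 * Round a request up to the smallest size class that fits it (best fit).
 * Requests above MAX_CLASS are outside the scope of this sketch and are
 * simply clamped to MAX_CLASS.
 */
size_t class_size(size_t request)
{
    size_t size = MIN_CLASS;

    while (size < request && size < MAX_CLASS)
        size <<= 1;

    return size;
}

struct waste_stats {
    size_t requested;   /* total bytes asked for by callers */
    size_t allocated;   /* total bytes actually handed out */
};

/* Record one allocation, as a hook in the allocation path might do. */
void account(struct waste_stats *stats, size_t request)
{
    stats->requested += request;
    stats->allocated += class_size(request);
}
```

Accounting a 20-byte and a 32-byte request gives 52 bytes requested and 64 bytes allocated, i.e. the 12 unused bytes mentioned in the discussion. The absolute waste antrik asks about is `allocated - requested` summed over all live allocations.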


# IRC, freenode, #hurd, 2011-04-15

    <mcsim> There is a hack in zone_gc when it allocates and frees two
      vm_map_kentry_zone elements to make sure the gc will be able to allocate
      two in vm_map_delete. Isn't it better to allocate memory for these
      entries statically?
    <youpi> mcsim: that's not the point of the hack
    <youpi> mcsim: the point of the hack is to make sure vm_map_delete will be
      able to allocate stuff
    <youpi> allocating them statically will just work once
    <youpi> it may happen several times that vm_map_delete needs to allocate it
      while it's empty (and thus zget_space has to get called, leading to a
      hang)
    <youpi> funnily enough, the bug is also in macos X
    <youpi> it's still in my TODO list to manage to find how to submit the
      issue to them
    <braunr> really ?
    <braunr> eh
    <braunr> is that because of map entry splitting ?
    <youpi> it's git commit efc3d9c47cd744c316a8521c9a29fa274b507d26
    <youpi> braunr: iirc something like this, yes
    <braunr> netbsd has this issue too
    <youpi> possibly
    <braunr> i think it's a fundamental problem with the design
    <braunr> people think of munmap() as something similar to free()
    <braunr> whereas it's really unmap
    <braunr> with a BSD-like VM, unmap can easily end up splitting one entry in
      two
    <braunr> but your issue is more about harmful recursion right ?
    <youpi> I don't remember actually
    <youpi> it's quite some time ago :)
    <braunr> ok
    <braunr> i think that's why i have "sources" in my slab allocator, the
      default source (vm_kern) and a custom one for kernel map entries


# IRC, freenode, #hurd, 2011-04-18

    <mcsim> braunr: you've said that once page is completely free, it is
      returned to the vm.
    <mcsim> who else, besides zone_gc, can return free pages to the vm?
    <braunr> mcsim: i also said i was wrong about that
    <braunr> zone_gc is the only one


# IRC, freenode, #hurd, 2011-04-19

    <braunr> antrik: mcsim: i added back a new per-cpu layer as planned
    <braunr>
      http://git.sceen.net/rbraun/libbraunr.git/?a=blob;f=mem.c;h=c629b2b9b149f118a30f0129bd8b7526b0302c22;hb=HEAD
    <braunr> mcsim: btw, in mem_cache_reap(), you can clearly see there are two
      loops, just as in zone_gc, to reduce contention and avoid deadlocks
    <braunr> this is really common in memory allocators


# IRC, freenode, #hurd, 2011-04-23

    <mcsim> I've looked through some allocators and all of them use different
      per cpu cache policy. AFAIK gnuhurd doesn't support multiprocessing, but
      still multiprocessing must be kept in mind. So, what do you think what
      kind of cpu caches is better? As for me I like variant with only per-cpu
      caches (like in slqb).
    <antrik> mcsim: well, have you looked at the allocator braunr wrote
      himself? :-)
    <antrik> I'm not sure I suggested that explicitly to you; but probably it
      makes most sense to use that in gnumach


# IRC, freenode, #hurd, 2011-04-24

    <mcsim> antrik: Yes, I have. He uses both global and per cpu caches. But he
      also suggested to look through slqb, where there are only per cpu
      caches.
    <braunr> i don't remember slqb in detail
    <braunr> what do you mean by "only per-cpu caches" ?
    <braunr> a whole slab system for each cpu ?
    <mcsim> I mean that there are no global queues in caches, but there are
      special queues for each cpu.
    <mcsim> I've just started investigating slqb's code, but I've read an
      article on lwn about it. And I've read that it is used for zen kernel.
    <braunr> zen ?
    <mcsim> Here is this article http://lwn.net/Articles/311502/
    <mcsim> Yes, this is linux kernel with some patches which haven't been
      approved to Torvalds' tree
    <mcsim> http://zen-kernel.org/
    <braunr> i see
    <braunr> well it looks nice
    <braunr> but as for slub, the problem i can see is cross-CPU freeing
    <braunr> and I think nick piggins mentions it
    <braunr> piggin*
    <braunr> this means that sometimes, objects are "burst-free" from one cpu
      cache to another
    <braunr> which has the same bad effects as in most other allocators, mainly
      fragmentation
    <mcsim> There is a special list for freeing object allocated for another
      CPU
    <mcsim> And garbage collector frees such object on his own
    <braunr> so what's your question ?
    <mcsim> It is described in the end of article.
    <mcsim> What cpu-cache policy do you think is better to implement?
    <braunr> at this point, any
    <braunr> and even if we had a kernel that perfectly supports
      multiprocessor, I wouldn't care much now
    <braunr> it's very hard to evaluate such allocators
    <braunr> slqb looks nice, but if you have the same amount of fragmentation
      per slab as other allocators do (which is likely), you have that amount of
      fragmentation multiplied by the number of processors
    <braunr> whereas having shared queues limits the problem somewhat
    <braunr> having shared queues means you have a bit more contention
    <braunr> so, as is the case most of the time, it's a tradeoff
    <braunr> by the way, does pigging say why he "doesn't like" slub ? :)
    <braunr> piggin*
    <mcsim> http://lwn.net/Articles/311093/
    <mcsim> here he describes why slqb is better.
    <braunr> well it doesn't describe why slub is worse
    <mcsim> but not very particularly 
    <braunr> except for order-0 allocations
    <braunr> and that's a form of fragmentation like i mentioned above
    <braunr> in mach those problems have very different impacts
    <braunr> the backend memory isn't physical, it's the kernel virtual space
    <braunr> so the kernel allocator can request chunks of higher than order-0
      pages
    <braunr> physical pages are allocated one at a time, then mapped in the
      kernel space
    <mcsim> Doesn't order of page depend on buffer size?
    <braunr> it does
    <mcsim> And why does gnumach allocate higher than order-0 pages more?
    <braunr> why more ?
    <braunr> i didn't say more
    <mcsim> And why in mach those problems have very different impact?
    <braunr> ?
    <braunr> i've just explained why :)
    <braunr> 09:37 < braunr> physical pages are allocated one at a time, then
      mapped in the kernel space
    <braunr> "one at a time" means order-0 pages, even if you allocate higher
      than order-0 chunks
    <mcsim> And in Linux they allocated more than one at time because of
      prefetching page reading?
    <braunr> do you understand what virtual memory is ?
    <braunr> linux allocators allocate "physical memory"
    <braunr> mach kernel allocator allocates "virtual memory"
    <braunr> so even if you allocate a big chunk of virtual memory, it's backed
      by order-0 physical pages
    <mcsim> yes, I understand this
    <braunr> you don't seem to :/
    <braunr> the problem of higher than order-0 page allocations is
      fragmentation
    <braunr> do you see why ?
    <mcsim> yes
    <braunr> so
    <braunr> fragmentation in the kernel space is less likely to create issues
      than it does in physical memory
    <braunr> keep in mind physical memory is almost always full because of the
      page cache
    <braunr> and constantly under some pressure
    <braunr> whereas the kernel space is mostly empty
    <braunr> so allocating higher than order-0 pages in linux is more dangerous
      than it is in Mach or BSD
    <mcsim> ok
    <braunr> on the other hand, linux focuses on pure performance, and not
      having to map memory means less operations, less tlb misses, quicker
      allocations
    <braunr> the Mach VM must map pages "one at a time", which can be expensive
    <braunr> it should be adapted to handle multiple page sizes (e.g. 2 MiB) so
      that many allocations can be made with few mappings
    <braunr> but that's not easy
    <braunr> as always: tradeoffs
    <mcsim> There are other benefits of physical allocating. Big DMA transfers
      can need several contiguous physical pages. How does mach handle such
      cases?
    <braunr> gnumach does that awfully
    <braunr> it just reserves the whole DMA-able memory and uses special
      allocation functions on it, IIRC
    <braunr> but kernels which have a Mach VM-like memory system such as BSDs
      have cleaner methods
    <braunr> NetBSD provides a function to allocate contiguous physical memory
    <braunr> with many constraints
    <braunr> FreeBSD uses a binary buddy system like Linux
    <braunr> the fact that the kernel allocator uses virtual memory doesn't
      mean the kernel has no means to allocate contiguous physical memory ...


# IRC, freenode, #hurd, 2011-05-02

    <braunr> hm nice, my allocator uses less memory than glibc (squeeze
      version) on both 32 and 64 bits systems
    <braunr> the new per-cpu layer is proving effective
    <neal> braunr: Are you reimplementing malloc?
    <braunr> no
    <braunr> it's still the slab allocator for mach, but tested in userspace
    <braunr> so i wrote malloc wrappers
    <neal> Oh.
    <braunr> i try to heavily test most of my code in userspace now
    <neal> it's easier :-)
    <neal> I agree
    <braunr> even the physical memory allocator has been implemented this way
    <neal> is this your mach version?
    <braunr> virtual memory allocation will follow
    <neal> or are you working on gnu mach?
    <braunr> for now it's my version
    <braunr> but i intend to spend the summer working on ipc port names
      management

[[rework_gnumach_IPC_spaces]].

    <braunr> and integrate the result in gnu mach
    <neal> are you keeping the same user-space API?
    <neal> Or are you experimenting with something new?
    <antrik> braunr: to be fair, it's not terribly hard to use less memory than
      glibc :-)
    <braunr> yes
    <braunr> antrik: well ptmalloc3 received some nice improvements
    <braunr> neal: the goal is to rework some of the internals only
    <braunr> neal: namely, i simply intend to replace the splay tree with a
      radix tree
    <antrik> braunr: the glibc allocator is emphasising performance, unlike some
      other allocators that trade some performance for much better memory
      utilisation...
    <antrik> ptmalloc3?
    <braunr> that's the allocator used in glibc
    <braunr> http://www.malloc.de/en/
    <antrik> OK. haven't seen any recent numbers... the comparison I have in
      mind is many years old...
    <braunr> i also made some additions to my avl and red-black trees this week
      end, which finally make them suitable for almost all generic uses
    <braunr> the red-black tree could be used in e.g. gnu mach to augment the
      linked list used in vm maps
    <braunr> which is what's done in most modern systems
    <braunr> it could also be used to drop the overloaded (and probably over
      imbalanced) page cache hash table

[[gnumach_vm_map_red-black_trees]].


# IRC, freenode, #hurd, 2011-05-03

    <mcsim> antrik: How should I start porting? Do I just include rbraun's
      allocator in gnumach and make it compile?
    <antrik> mcsim: well, basically yes I guess... but you will have to look at
      the code in question first before we know anything more specific :-)
    <antrik> I guess braunr might know better how to start, but he doesn't
      appear to be here :-(
    <braunr> mcsim: you can't just put my code into gnu mach and make it run,
      it really requires a few careful changes
    <braunr> mcsim: you will have to analyse how the current zone allocator
      interacts with regard to locking
    <braunr> if it is used in interrupt handlers
    <braunr> what kind of locks it should use instead of the pthread stuff
      available in userspace
    <braunr> you will have to change the reclaiming policy, so that caches are
      reaped on demand
    <braunr> (this basically boils down to calling the new reclaiming function
      instead of zone_gc())
    <braunr> you must be careful about types too
    <braunr> there is work to be done ;)
    <braunr> (not to mention the obvious about replacing all the calls to the
      zone allocator, and testing/debugging afterwards)


# IRC, freenode, #hurd, 2011-07-14

    <braunr> can you make your patch available ?
    <mcsim> it is available in gnumach repository at savannah 
    <mcsim> tree mplaneta/libbraunr/master
    <braunr> mcsim: i'll test your branch
    <mcsim> ok. I'll give you a link in a minute
    <braunr> hm why balloc ?
    <mcsim> Braun's allocator
    <braunr> err
    <braunr>
      http://git.sceen.net/rbraun/x15mach.git/?a=blob;f=kern/kmem.c;h=37173fa0b48fc9d7e177bf93de531819210159ab;hb=HEAD
    <braunr> mcsim: this is the interface i had in mind for a kernel version :)
    <braunr> very similar to the original slab allocator interface actually
    <braunr> well, you've been working
    <mcsim> But I have a problem with this patch. When I apply it to gnumach
      code from debian repository. I have to make a change in file ramdisk.c
      with sed -i 's/kernel_map/\&kernel_map/' device/ramdisk.c
    <mcsim> because in git repository there is no such file
    <braunr> mcsim: how do you configure the kernel before building ?
    <braunr> mcsim: you should keep in touch more often i think, so that you
      get feedback from us and don't spend too much time "off course"
    <mcsim> I didn't configure it. I just run dpkg-buildsource -b.
    <braunr> oh you build the debian package
    <braunr> well my version was by configure --enable-kdb --enable-rtl8139
    <braunr> and it seems stuck in an infinite loop during bootstrap
    <mcsim> and printf doesn't work. The first function called by c_boot_entry
      is printf(version).
    <braunr> mcsim: also, you're invited to get the x15mach version of my
      files, which are gplv2+ licensed
    <braunr> be careful of my macros.h file, it can conflict with the
      macros_help.h file from gnumach iirc
    <mcsim> There were conflicts with MACRO_BEGIN and MACRO_END. But I solved
      it
    <braunr> ok
    <braunr> it's tricky
    <braunr> mcsim: try to find where the first use of the allocator is made


# IRC, freenode, #hurd, 2011-07-22

    <mcsim> braunr, hello. Kernel with your allocator already compiles and
      runs. There are still some problems, but, certainly, I'm on the final stage
      already. I hope I'll finish in a few days.
    <tschwinge> mcsim: Oh, cool!  Have you done some measurements already?
    <mcsim> Not yet
    <tschwinge> OK.
    <tschwinge> But if it is able to run a GNU/Hurd system, then that already is
      something, a big milestone!
    <braunr> nice
    <braunr> although you'll probably need to tweak the garbage collecting
      process
    <mcsim> tschwinge: thanks
    <mcsim> braunr: As back-end for allocating memory I use
      kmem_alloc_wired. But in zalloc was an opportunity to use as back-end
      kmem_alloc_pageable. Although there was no any zone that used
      kmem_alloc_pageable. Do I need to implement this functionality?
    <braunr> mcsim: do *not* use kmem_alloc_pageable()
    <mcsim> braunr: Ok. This is even better)
    <braunr> mcsim: in x15, i've taken this even further: there is *no* kernel
      vm object, which means all kernel memory is wired and unmanaged
    <braunr> making it fast and safe
    <braunr> pageable kernel memory was useful back when RAM was really scarce
    <braunr> 20 years ago
    <braunr> but it's a source of deadlock
    <mcsim> Indeed. I won't use kmem_alloc_pageable.


# IRC, freenode, #hurd, 2011-08-09

    < braunr> mcsim: what's the "bug related to MEM_CF_VERIFY" you refer to in
      one of your commits ?
    < braunr> mcsim: don't use spin_lock_t as a member of another structure
    < mcsim> braunr: I got confused with types in *_verify functions, so they
      didn't work. Then I fixed it in the commit you mentioned.
    < braunr> in gnumach, most types are actually structure pointers
    < braunr> use simple_lock_data_t
    < braunr> mcsim: ok
    < mcsim> > use simple_lock_data_t
    < mcsim> braunr: ok
    < braunr> mcsim: don't make too many changes to the code base, and if
      you're unsure, don't hesitate to ask
    < braunr> also, i really insist you rename the allocator, as done in x15
      for example
      (http://git.sceen.net/rbraun/x15mach.git/?a=blob;f=vm/kmem.c), instead of
      a name based on mine :/
    < mcsim> braunr: Ok. It was just work name. When I finish I'll rename the
      allocator.
    < braunr> other than that, it's nice to see progress
    < braunr> although again, it would be better with some reports along
    < braunr> i won't be present at the meeting tomorrow unfortunately, but you
      should use those to report the status of your work
    < mcsim> braunr: You've said that I have to tweak gc process. Did you mean
      to call mem_gc() when physical memory ends instead of calling it every x
      seconds? Or something else?
    < braunr> there are multiple topics, although only one that really matters
    < braunr> study how zone_gc was called
    < braunr> reclaiming memory should happen when there is pressure on the VM
      subsystem
    < braunr> but it shouldn't happen too often, otherwise there is thrashing
    < braunr> and your caches become mostly useless
    < braunr> the original slab allocator uses a 15-second period after a
      reclaim during which reclaiming has no effect
    < braunr> this allows having a somehow stable working set for this duration
    < braunr> the linux slab allocator uses 5 seconds, but has a more
      complicated reclaiming mechanism
    < braunr> it releases memory gradually, and from reclaimable caches only
      (dentry for example)
    < braunr> for x15 i intend to implement the original 15 second interval and
      then perform full reclaims
    < mcsim> In zalloc mem_gc is called by vm_pageout_scan, but not more often
      than once a second.
    < mcsim> In balloc I've changed interval to once in 15 seconds.
    < braunr> don't use the code as it is
    < braunr> the version you've based your work on was meant for userspace
    < braunr> where there isn't memory pressure
    < braunr> so a timer is used to trigger reclaims at regular intervals
    < braunr> it's different in a kernel
    < braunr> mcsim: where did you see vm_pageout_scan call the zone gc once a
      second ?
    < mcsim> vm_pageout_scan calls consider_zone_gc and consider_zone_gc checks
      if a second has passed.
    < braunr> where ?
    < mcsim> Then zone_gc can be called.
    < braunr> ah ok, it's in zaclloc.c then
    < braunr> zalloc.c
    < braunr> yes this function is fine
    < mcsim> so old gc didn't consider vm pressure. Or I missed something.
    < braunr> it did
    < mcsim> how?
    < braunr> well, it's called by the pageout daemon
    < braunr> under memory pressure
    < braunr> so it's fine
    < mcsim> so if mem_gc is called by pageout daemon is it fine?
    < braunr> it must be changed to do something similar to what
      consider_zone_gc does
    < mcsim> It does. mem_gc does the same work as consider_zone_gc and
      zone_gc.
    < braunr> good
    < mcsim> so gc process is fine?
    < braunr> should be
    < braunr> i see mem.c only includes mem.h, which then includes other
      headers
    < braunr> don't do that
    < braunr> always include all the headers you need where you need them
    < braunr> if you need avltree.h in both mem.c and mem.h, include it in both
      files
    < braunr> and by the way, i recommend you use the red black tree instead of
      the avl type
    < braunr> (it's the same interface so it shouldn't take long)
    < mcsim> As to report. If you won't be present at the meeting, I can tell
      you what I have to do now.
    < braunr> sure
    < braunr> in addition, use GPLv2 as the license, the BSD one is meant for
      the userspace version only
    < braunr> GPLv2+ actually
    < braunr> hm you don't need list.c
    < braunr> it would only add dead code
    < braunr> "Zone for dynamical allocator", don't mix terms
    < braunr> this comment refers to a vm_map, so call it a map
    < mcsim> 1. Change constructor for kentry_alloc_cache.
    < mcsim> 2. Make measurements.
    < mcsim> +
    < mcsim> 3. Use simple_lock_data_t
    < mcsim> 4. Replace license
    < braunr> kentry_alloc_cache <= what is that ?
    < braunr> cache for kernel map entries in vm_map ?
    < braunr> the comment for mem_cpu_pool_get doesn't apply in gnumach, as
      there is no kernel preemption

[[microkernel/mach/gnumach/preemption]].
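
The reclaiming policy discussed above (reclaim only when the pageout path asks for it, and ignore requests that come too soon after the previous reclaim) can be sketched as below. This is a hedged illustration, not the code of `consider_zone_gc` or `mem_gc`: the `gc_state` structure, the `gc_consider` function and the `GC_MIN_INTERVAL` constant are hypothetical, and the 15-second value is simply the period quoted for the original slab allocator.

```c
#include <stdbool.h>

/* Assumed minimum delay between two reclaims, in seconds. */
#define GC_MIN_INTERVAL 15UL

struct gc_state {
    bool ever_reclaimed;            /* false until the first reclaim */
    unsigned long last_reclaim;     /* time of the last reclaim, in seconds */
    unsigned long nr_reclaims;      /* number of reclaims performed */
};

/*
 * Meant to be called by the pageout path when memory is under pressure.
 * Returns true if a reclaim was performed, false if the request came too
 * soon after the previous one and was ignored.
 */
bool gc_consider(struct gc_state *gc, unsigned long now)
{
    if (gc->ever_reclaimed && (now - gc->last_reclaim) < GC_MIN_INTERVAL)
        return false;

    gc->ever_reclaimed = true;
    gc->last_reclaim = now;
    gc->nr_reclaims++;      /* a real kernel would reap its caches here */
    return true;
}
```

Because the function is only invoked under memory pressure, no timer is involved: an idle system never reclaims, and a system under constant pressure keeps a stable working set for at least `GC_MIN_INTERVAL` seconds.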

    < braunr> "Don't attempt mem GC more frequently than hz/MEM_GC_INTERVAL
      times a second.
    < braunr> "
    < mcsim> sorry. I meant vm_map_kentry_cache
    < braunr> hm nothing actually about this comment
    < braunr> mcsim: ok
    < braunr> yes kernel map entries need special handling
    < braunr> i don't know how it's done in gnumach though
    < braunr> static preallocation ?
    < mcsim> yes
    < braunr> that's ugly :p
    < mcsim> but it uses dynamic allocation further even for vm_map kernel
      entries
    < braunr> although such bootstrapping issues are generally difficult to
      solve elegantly
    < braunr> ah
    < mcsim> now I use only static allocation, but I'll add dynamic allocation
      too
    < braunr> when you have time, mind the coding style (convert everything to
      gnumach style, which mostly implies using tabs instead of 4-spaces
      indentation)
    < braunr> when you'll work on dynamic allocation for the kernel map
      entries, you may want to review how it's done in x15
    < braunr> the mem_source type was originally intended for that purpose, but
      has slightly changed once the allocator was adapted to work in my kernel
    < mcsim> ok
    < braunr> vm_map_kentry_zone is the only zone created with ZONE_FIXED
    < braunr> and it is zcram()'ed immediately after
    < braunr> so you can consider it a statically allocated zone
    < braunr> in x15 i use another strategy: there is a special kernel submap
      named kentry_map which contains only one map entry (statically allocated)
    < braunr> this map is the backend (mem_source) for the kentry_cache
    < braunr> the kentry_cache is created with a special flag that tells it
      memory can't be reclaimed
    < braunr> when the cache needs to grow, the single map entry is extended to
      cover the allocated memory
    < braunr> it's similar to the way pmap_growkernel() works for kernel page
      table pages
    < braunr> (and is actually based on that idea)
    < braunr> it's a compromise between full static and dynamic allocation
      types
    < braunr> the advantage is that the allocator code can be used (so there is
      no need for a special allocator like in netbsd)
    < braunr> the drawback is that some resources can never be returned to
      their source (and under peaks, the amount of unfreeable resources could
      become large, but this is unexpected)
    < braunr> mcsim: for now you shouldn't waste your time with this
    < braunr> i see the number of kernel map entries is fixed at 256
    < braunr> and i've never seen the kernel use more than around 30 entries
    < mcsim> Do you think that I have to leave this problem for the end?
    < braunr> yes
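
The compromise described above for kernel map entries (a dedicated submap with a single statically allocated entry that is extended whenever the cache grows, and whose memory is never given back) can be sketched as follows. This is not x15 code: `kentry_source`, `kentry_source_grow`, the 4 KiB page size and the address values are assumptions, and the allocation and mapping of the physical pages is left out.

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL    /* assumed page size */

/* The single, statically allocated entry describing the submap contents. */
struct map_entry {
    unsigned long start;
    unsigned long end;      /* exclusive, moves up as the cache grows */
};

struct kentry_source {
    struct map_entry entry;
    unsigned long limit;    /* end of the range reserved for the submap */
};

/*
 * Extend the single entry by nr_pages pages. Returns the start address of
 * the new range, or 0 if the reserved range is exhausted. No map entry is
 * allocated here, which is what breaks the recursion: growing the cache of
 * map entries never requires a new map entry.
 */
unsigned long kentry_source_grow(struct kentry_source *src,
                                 unsigned long nr_pages)
{
    unsigned long size = nr_pages * PAGE_SIZE;
    unsigned long addr = src->entry.end;

    if (size > src->limit - addr)
        return 0;

    src->entry.end = addr + size;
    return addr;
}

/*
 * There is deliberately no shrink operation: memory obtained this way is
 * never returned, matching the "memory can't be reclaimed" cache flag.
 */
```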


# IRC, freenode, #hurd, 2011-08-11

    < mcsim> braunr: Hello. Can you give me an advice how can I make
      measurements better?
    < braunr> mcsim: what kind of measurements
    < mcsim> braunr: How much is your allocator better than zalloc.
    < braunr> slightly :p
    < braunr> that's why i never took the time to put it in gnumach
    < mcsim> braunr: Just I thought that there are some rules or
      recommendations of such measurements. Or I can do them any way I want?
    < braunr> mcsim: i don't know
    < braunr> mcsim: benchmarking is an art of its own, and i don't even know
      how to use the bits of profiling code available in gnumach (if it still
      works)
    < antrik> mcsim: hm... are you saying you already have a running system
      with slab allocator?... :-)
    < braunr> mcsim: the main advantage i can see is the removal of many
      arbitrary hard limits
    < mcsim> antrik: yes
    < antrik> \o/
    < antrik> nice work!
    < braunr> :)
    < braunr> the cpu layer should also help a bit, but it's hard to measure
    < braunr> i guess it could be seen on the ipc path for very small buffers
    < mcsim> antrik: Thanks. But I still have to 1. Change constructor for
      kentry_alloc_cache. and 2. Make measurements.
    < braunr> and polish the whole thing :p
    < antrik> mcsim: I'm not sure this can be measured... the performance
      difference in any real life usage is probably just a few percent at most
      -- it's hard to construct a benchmark giving enough precision so it's not
      drowned in noise...
    < antrik> perhaps it conserves some memory -- but that too would be hard to
      measure I fear
    < braunr> yes
    < braunr> there *should* be better allocation times, less fragmentation,
      better accounting ... :)
    < braunr> and no arbitrary limits !
    < antrik> :-)
    < braunr> oh, and the self debugging features can be nice too
    < mcsim> But I need to prove that my work wasn't useless
    < braunr> well it wasn't, but that's hard to measure
    < braunr> it's easy to prove though, since there are additional features
      that weren't present in the zone allocator
    < mcsim> Ok. If there are some profiling features in gnumach can you give
      me a link with their description?
    < braunr> mcsim: sorry, no
    < braunr> mcsim: you could still write the basic loop test, which counts
      the number of allocations performed in a fixed time interval
    < braunr> but as it doesn't match many real life patterns, it won't be very
      useful
    < braunr> and i'm afraid that if you consider real life patterns, you'll
      see how negligible the improvement can be compared to other operations
      such as memory copies or I/O (ouch)
    < mcsim> Do network drivers use this allocator?
    < mcsim> ok. I'll scrape up some test and then I'll report results.
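
The "basic loop test" suggested above, counting the allocations performed in a fixed time interval, could look like the following userspace sketch. It uses `malloc()`/`free()` as a stand-in for the allocator under test, and the function name `count_allocations` and the batch size are assumptions. As noted in the discussion, such a loop does not match real-life allocation patterns, so its numbers should be read with care.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/*
 * Count how many allocate/free pairs of the given size complete within
 * roughly the given number of seconds. clock() measures processor time,
 * which is close to elapsed time for a busy loop like this one.
 */
unsigned long count_allocations(size_t size, double seconds)
{
    unsigned long count = 0;
    clock_t start = clock();
    double elapsed;
    void *ptr;
    int i;

    do {
        /* Work in batches so that reading the clock stays cheap. */
        for (i = 0; i < 1000; i++) {
            ptr = malloc(size);

            if (ptr == NULL)
                return count;

            /* Touch the buffer so the pair cannot be optimized away. */
            *(volatile char *)ptr = 0;
            free(ptr);
            count++;
        }

        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    } while (elapsed < seconds);

    return count;
}
```

Running it for several object sizes, once against each allocator, gives a rough throughput comparison.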


# IRC, freenode, #hurd, 2011-08-26

    < mcsim> hello. Are there any analogs of copy_to_user and copy_from_user in
      linux for gnumach?
    < mcsim> Or how can I determine memory map if I know address? I need this
      for vm_map_copyin
    < guillem> mcsim: vm_map_lookup_entry?
    < mcsim> guillem: but I need to transmit map to this function and it will
      return an entry which contains specified address.
    < mcsim> And I don't know what map have I transmit.
    < mcsim> I need to transfer static array from kernel to user. What map
      contains static data?
    < antrik> mcsim: Mach doesn't have copy_{from,to}_user -- instead, large
      chunks of data are transferred as out-of-line data in IPC messages
      (i.e. using VM magic)
    < mcsim> antrik: can you give me an example? I just found using
      vm_map_copyin in host_zone_info.
    < antrik> no idea what vm_map_copyin is to be honest...


# IRC, freenode, #hurd, 2011-08-27

    < braunr> mcsim: the primitives are named copyin/copyout, and they are used
      for messages with inline data
    < braunr> or copyinmsg/copyoutmsg
    < braunr> vm_map_copyin/out should be used for chunks larger than a page
      (or roughly a page)
    < braunr> also, when writing to a task space, see which is better suited:
      vm_map_copyout or vm_map_copy_overwrite
    < mcsim> braunr: and what will be src_map for vm_map_copyin/out?
    < braunr> the caller map
    < braunr> which you can get with current_map() iirc
    < mcsim> braunr: thank you
    < braunr> be careful not to leak anything in the transferred buffers
    < braunr> memset() to 0 if in doubt
    < mcsim> braunr:ok
    < braunr> antrik: vm_map_copyin() is roughly vm_read()
    < antrik> braunr: what is it used for?
    < braunr> antrik: 01:11 < antrik> mcsim: Mach doesn't have
      copy_{from,to}_user -- instead, large chunks of data are transferred as
      out-of-line data in IPC messages (i.e. using VM magic)
    < braunr> antrik: that "VM magic" is partly implemented using vm_map_copy*
      functions
    < antrik> braunr: oh, you mean it doesn't actually copy data, but only page
      table entries? if so, that's *not* really comparable to
      copy_{from,to}_user()...


# IRC, freenode, #hurd, 2011-08-28

    < braunr> antrik: the equivalent of copy_{from,to}_user are
      copy{in,out}{,msg}
    < braunr> antrik: but when the data size is about a page or more, it's
      better not to copy, of course
    < antrik> braunr: it's actually not clear at all that it's really better to
      do VM magic than to copy...


# IRC, freenode, #hurd, 2011-08-29

    < braunr> antrik: at least, that used to be the general idea, and with a
      simpler VM i suspect it's still true
    < braunr> mcsim: did you progress on your host_zone_info replacement ?
    < braunr> mcsim: i think you should stick to what the original
      implementation did
    < braunr> which is making an inline copy if caller provided enough space,
      using kmem_alloc_pageable otherwise
    < braunr> specify ipc_kernel_map if using kmem_alloc_pageable
    < mcsim> braunr: yes. And it works. But I use kmem_alloc, not pageable. Is
      it worse?
    < mcsim> braunr: host_zone_info replacement is pushed to savannah
      repository. 
    < braunr> mcsim: i'll have a look
    < mcsim> braunr: I've pushed one more commit just now, which relates to
      host_zone_info.
    < braunr> mem_alloc_early_init should be renamed mem_bootstrap
    < mcsim> ok
    < braunr> mcsim: i don't understand your call to kmem_free
    < mcsim> braunr: It shouldn't be there?
    < braunr> why should it be there ?
    < braunr> you're freeing what the copy object references
    < braunr> it's strange that it even works
    < braunr> also, you shouldn't pass infop directly as the copy object
    < braunr> i guess you get a warning for that
    < braunr> do what the original code does: use an intermediate copy object
      and a cast
    < mcsim> ok
    < braunr> another error (without consequence but still, you should mind it)
    < braunr> simple_lock(&mem_cache_list_lock);
    < braunr> [...]
    < braunr> kr = kmem_alloc(ipc_kernel_map, &info, info_size);
    < braunr> you can't hold simple locks while allocating memory
    < braunr> read how the original implementation works around this
    < mcsim> ok
    < braunr> i guess host_zone_info assumes the zone list doesn't change much
      while unlocked
    < braunr> or that it's rather unimportant since it's for debugging
    < braunr> a strict snapshot isn't required
    < braunr> list_for_each_entry(&mem_cache_list, cache, node) max_caches++;
    < braunr> you should really use two separate lines for readability
    < braunr> also, instead of counting each time, you could just maintain a
      global counter
    < braunr> mcsim: use strncpy instead of strcpy for the cache names
    < braunr> not to avoid overflow but rather to clear the unused bytes at the
      end of the buffer
    < braunr> mcsim: about kmem_alloc vs kmem_alloc_pageable, it's a minor
      issue
    < braunr> you're handing off debugging data to a userspace application
    < braunr> a rather dull reporting tool in most cases, which doesn't require
      wired down memory
    < braunr> so in order to better use available memory, pageable memory
      should be used
    < braunr> in the future i guess it could become a not-so-minor issue though
    < mcsim> ok. I'll fix it
    < braunr> mcsim: have you tried to run the kernel with MC_VERIFY always on
      ?
    < braunr> MEM_CF_VERIFY actually
    < mcsim1> yes.
    < braunr> oh
    < braunr> nothing wrong 
    < braunr> ?
    < mcsim1> it is always set
    < braunr> ok
    < braunr> ah, you set it in macros.h ..
    < braunr> don't
    < braunr> put it in mem.c if you want, or better, make it a compile-time
      option
    < braunr> macros.h is a tiny macro library, it shouldn't define such
      unrelated options
    < mcsim1> ok.
    < braunr> mcsim1: did you try fault injection to make sure the checking
      code actually works and how it behaves when an error occurs ?
    < mcsim1> I think that when I finish I'll merge files cpu.h and macros.h
      with mem.c
    < braunr> yes that would simplify things
    < mcsim1> Yes. When I confused with types mem_buf_fill worked wrong and
      panic occurred.
    < braunr> very good
    < braunr> have you progressed concerning the measurements you wanted to do
      ?
    < mcsim1> not much.
    < braunr> ok
    < mcsim1> I think they will be ready in a few days.
    < antrik> what measurements are these?
    < mcsim1> braunr: What maximal size for static data and stack in kernel?
    < braunr> what do you mean ?
    < braunr> kernel stacks are one page if i'm right
    < braunr> static data (rodata+data+bss) are limited by grub bugs only :)
    < mcsim1> braunr: probably they are present, because when I created too big
      array I couldn't boot kernel
    < braunr> local variable or static ?
    < mcsim1> static
    < braunr> how large ?
    < mcsim1> 4Mb
    < braunr> hm
    < braunr> it's not a grub bug then
    < braunr> i was able to embed as much as 32 MiB in x15 while doing this
      kind of tests
    < braunr> I guess it's the gnu mach boot code which only preallocates one
      page for the initial kernel mapping
    < braunr> one PTP (page table page) maps 4 MiB
    < braunr> (x15 does this completely dynamically, unlike mach or even
      current BSDs)
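
The "one PTP maps 4 MiB" figure follows from simple arithmetic, assuming classic 32-bit x86 paging without PAE (4 KiB pages, 4-byte page table entries). The names below are hypothetical; the sketch only shows why a kernel image that grows by a 4 MiB static array no longer fits in the range covered by a single page table page, which is what braunr guesses to be the cause here.

```c
#define X86_PAGE_SIZE 4096ul   /* assumed: 4 KiB pages */
#define X86_PTE_SIZE  4ul      /* assumed: 4-byte entries, no PAE */

/* Number of entries held by one page table page. */
static unsigned long ptes_per_ptp(void)
{
    return X86_PAGE_SIZE / X86_PTE_SIZE;
}

/* Bytes of virtual memory mapped by one page table page. */
static unsigned long bytes_mapped_by_ptp(void)
{
    return ptes_per_ptp() * X86_PAGE_SIZE;
}

/* Page table pages needed to map 'bytes' bytes, rounded up. */
static unsigned long ptps_needed(unsigned long bytes)
{
    unsigned long per_ptp = bytes_mapped_by_ptp();

    return (bytes + per_ptp - 1) / per_ptp;
}
```
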
    < mcsim1> antrik: First I want to measure time of each cache
      creation/allocation/deallocation and then compile kernel.
    < braunr> cache creation is irrelevant
    < braunr> because of the cpu pools in the new allocator, you should test at
      least two different allocation patterns
    < braunr> one with quick allocs/frees
    < braunr> the other with large numbers of allocs then their matching frees
    < braunr> (larger being at least 100)
    < braunr> i'd say the cpu pool layer is the real advantage over the
      previous zone allocator
    < braunr> (from a performance perspective)
    < mcsim1> But there is only one cpu
    < braunr> it doesn't matter
    < braunr> it's still a very effective cache
    < braunr> in addition to reducing contention
    < braunr> compare mem_cpu_pool_pop() against mem_cache_alloc_from_slab()
    < braunr> mcsim1: work is needed to polish the whole thing, but getting it
      actually working is a nice achievement for someone new on the project
    < braunr> i hope it helped you learn about memory allocation, virtual
      memory, gnu mach and the hurd in general :)
    < antrik> indeed :-)


# IRC, freenode, #hurd, 2011-09-06

    [some performance testing]
    <braunr> i'm not sure such long tests are relevant but let's assume balloc
      is slower
    <braunr> some tuning is needed here
    <braunr> first, we can see that slab allocation occurs more often in balloc
      than page allocation does in zalloc
    <braunr> so yes, as slab allocation is slower (have you measured which part
      actually is slow ? i guess it's the kmem_alloc call)
    <braunr> the whole process gets a bit slower too
    <mcsim> I used alloc_size = 4096 for zalloc
    <braunr> i don't know what that is exactly
    <braunr> but you can't hold 500 16 bytes buffers in a page so zalloc must
      have had free pages around for that
    <mcsim> I use kmem_alloc_wired
    <braunr> if you have time, measure it, so that we know how much it accounts
      for
    <braunr> where are the results for dealloc ?
    <mcsim> I can't give you result right now because internet works very
      bad. But for first DEALLOC result are the same, except some cases when it
      takes balloc for more than 1000 ticks
    <braunr> must be the transfer from the cpu layer to the slab layer
    <mcsim> as to kmem_alloc_wired. I think zalloc uses this function too for
      allocating objects in zone I test.
    <braunr> mcsim: yes, but less frequently, which is why it's faster
    <braunr> mcsim: another very important aspect that should be measured is
      memory consumption, have you looked into that ?
    <mcsim> I think that I made too little iterations in test SMALL
    <mcsim> If I increase constant SMALL_TESTS will it be good enough?
    <braunr> mcsim: i don't know, try both :)
    <braunr> if you increase the number of iterations, balloc average time will
      be lower than zalloc, but this doesn't remove the first long
      initialization step on the allocated slab
    <mcsim> SMALL_TESTS to 500, I mean
    <braunr> i wonder if maintaining the slabs sorted through insertion sort is
      what makes it slow
    <mcsim> braunr: where do you sort slabs? I don't see this.
    <braunr> mcsim: mem_cache_alloc_from_slab and its free counterpart
    <braunr> mcsim: the mem_source stuff is useless in gnumach, you can remove
      it and directly call the kmem_alloc/free functions
    <mcsim> But I have to make special allocator for kernel map entries.
    <braunr> ah right
    <mcsim> btw. It turned out that 256 entries are not enough.
    <braunr> that's weird
    <braunr> i'll make a patch so that the mem_source code looks more like what
      i have in x15 then
    <braunr> about the results, i don't think the slab layer is that slow
    <braunr> it's the cpu_pool_fill/drain functions that take time
    <braunr> they preallocate many objects (64 for your objects size if i'm
      right) at once
    <braunr> mcsim: look at the first result page: some times, a number around
      8000 is printed
    <braunr> the common time (ticks, whatever) for a single object is 120
    <braunr> 8132/120 is 67, close enough to the 64 value
    <mcsim> I forgot about SMALL tests here are they:
      http://paste.debian.net/128533/ (balloc) http://paste.debian.net/128534/
      (zalloc)
    <mcsim> braunr: why do you divide 8132 by 120?
    <braunr> mcsim: to see if it matches my assumption that the ~8000 number
      matches the cpu_pool_fill call
    <mcsim> braunr: I've got it
    <braunr> mcsim: i'd be much interested in the dealloc results if you can
      paste them too
    <mcsim> dealloc: http://paste.debian.net/128589/
      http://paste.debian.net/128590/
    <braunr> mcsim: thanks
    <mcsim> second dealloc: http://paste.debian.net/128591/
      http://paste.debian.net/128592/
    <braunr> mcsim: so the main conclusion i retain from your tests is that the
      transfers from the cpu and the slab layers are what makes the new
      allocator a bit slower
    <mcsim> OPERATION_SMALL dealloc: http://paste.debian.net/128593/
      http://paste.debian.net/128594/
    <braunr> mcsim: what needs to be measured now is global memory usage
    <mcsim> braunr: data from /proc/vmstat after kernel compilation will be
      enough?
    <braunr> mcsim: let me check
    <braunr> mcsim: no it won't do, you need to measure kernel memory usage
    <braunr> the best moment to measure it is right after zone_gc is called
    <mcsim> Are there any facilities in gnumach for memory measurement?
    <braunr> it's specific to the allocators
    <braunr> just count the number of used pages
    <braunr> after garbage collection, there should be no free page, so this
      should be rather simple
    <mcsim> ok
    <mcsim> braunr: When I measure memory usage in balloc, what formula is
      better cache->nr_slabs * cache->bufs_per_slab * cache->buf_size or
      cache->nr_slabs * cache->slab_size?
    <braunr> the latter
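
The two formulas mcsim asks about differ by the space lost inside each slab. A minimal sketch: the field names come from the log, but `struct toy_cache` and the helper functions are hypothetical and not the allocator's actual definitions. Counting `nr_slabs * slab_size` measures what the cache really takes from the VM system, which is why it is the better figure for page-level usage.

```c
#include <stddef.h>

/* Hypothetical subset of a cache descriptor; field names taken from the log. */
struct toy_cache {
    size_t nr_slabs;
    size_t slab_size;
    size_t bufs_per_slab;
    size_t buf_size;
};

/* Bytes covered by buffers only (ignores per-slab overhead and padding). */
static size_t cache_buffer_bytes(const struct toy_cache *cache)
{
    return cache->nr_slabs * cache->bufs_per_slab * cache->buf_size;
}

/* Bytes actually obtained from the VM system for this cache. */
static size_t cache_slab_bytes(const struct toy_cache *cache)
{
    return cache->nr_slabs * cache->slab_size;
}

/* Page-level usage summed over an array of caches. */
static size_t total_slab_bytes(const struct toy_cache *caches, size_t nr_caches)
{
    size_t i, total = 0;

    for (i = 0; i < nr_caches; i++)
        total += cache_slab_bytes(&caches[i]);

    return total;
}
```

With made-up numbers (3 slabs of 4096 bytes, 42 buffers of 96 bytes per slab), the first formula gives 12096 bytes and the second 12288 bytes.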


# IRC, freenode, #hurd, 2011-09-07

    <mcsim> braunr: I've disabled calling of mem_cpu_pool_fill and allocator
      became faster
    <braunr> mcsim: sounds nice
    <braunr> mcsim: i suspect the free path might not be as fast though
    <mcsim> results for first calling: http://paste.debian.net/128639/ second:
      http://paste.debian.net/128640/ and with many alloc/free:
      http://paste.debian.net/128641/
    <braunr> mcsim: thanks
    <mcsim> best result are for second call: average time decreased from 159.56
      to 118.756
    <mcsim> First call slightly worse, but this is because I've added some
      profiling code
    <braunr> i still see some ~8k lines in 128639
    <braunr> even some around ~12k
    <mcsim> I think this is because of mem_cache_grow I'm investigating it now
    <braunr> i guess so too
    <mcsim> I've measured time for first call in cache and from about 22000
      mem_cache_grow takes 20000
    <braunr> how did you change the code so that it doesn't call
      mem_cpu_pool_fill ?
    <braunr> is the cpu layer still used ?
    <mcsim> http://paste.debian.net/128644/
    <braunr> don't forget the free path
    <braunr> mcsim: anyway, even with the previous slightly slower behaviour we
      could observe, the performance hit is negligible
    <mcsim> Is free path a compilation? (I'm sorry for my english)
    <braunr> mcsim: mem_cache_free
    <braunr> mcsim: the last two measurements i'd advise are with big (>4k)
      object sizes and, really, kernel allocator consumption
    <mcsim> http://paste.debian.net/128648/ http://paste.debian.net/128646/
      http://paste.debian.net/128649/ (first, second, small)
    <braunr> mcsim: these numbers are closer to the zalloc ones, aren't they ?
    <mcsim> deallocating slightly faster too
    <braunr> it may not be the case with larger objects, because of the use of
      a tree
    <mcsim> yes, they are closer
    <braunr> but then, i expect some space gains
    <braunr> the whole thing is about compromise
    <mcsim> ok. I'll try to measure them today. Anyway I'll post result and you
      could read them in the morning
    <braunr> at least, it shows that the zone allocator was actually quite good
    <braunr> i don't like how the code looks, there are various hacks here and
      there, it lacks self inspection features, but it's quite good
    <braunr> and there was little room for true improvement in this area, like
      i told you :)
    <braunr> (my allocator, like the current x15 dev branch, focuses on mp
      machines)
    <braunr> mcsim: thanks again for these numbers
    <braunr> i wouldn't have had the courage to make the tests myself before
      some time eh
    <mcsim> braunr: hello. Look at the small_4096 results
      http://paste.debian.net/128692/ (balloc) http://paste.debian.net/128693/
      (zalloc)
    <braunr> mcsim: wow, what's that ? :)
    <braunr> mcsim: you should really really include your test parameters in
      the report
    <braunr> like object size, purpose, and other similar details
    <mcsim> for balloc I specified only object_size = 4096
    <mcsim> for zalloc object_size = 4096, alloc_size = 4096, memtype = 0;
    <braunr> the results are weird
    <braunr> apart from the very strange numbers (e.g. 0 or 4429543648), none
      is around 3k, which is the value matching a kmem_alloc call
    <braunr> happy to see balloc behaves quite good for this size too
    <braunr> s/good/well/
    <mcsim> Oh
    <mcsim> here is significant only first 101 lines
    <mcsim> I'm sorry
    <braunr> ok
    <braunr> what does the test do again ? 10 loops of 10 allocs/frees ?
    <mcsim> yes
    <braunr> ok, so the only slowdown is at the beginning, when the slabs are
      created
    <braunr> the two big numbers (31844 and 19548) are strange
    <mcsim> on the other hand time of compilation is
    <mcsim> balloc      zalloc
    <mcsim> 38m28.290s  38m58.400s
    <mcsim> 38m38.240s  38m42.140s
    <mcsim> 38m30.410s  38m52.920s
    <braunr> what are you compiling ?
    <mcsim> gnumach kernel
    <braunr> in 40 mins ?
    <mcsim> yes
    <braunr> you lack hvm i guess
    <mcsim> is it long?
    <mcsim> I use real PC
    <braunr> very
    <braunr> ok
    <braunr> so it's normal
    <mcsim> in vm it was about 2 hours)
    <braunr> the difference really is negligible
    <braunr> ok i can explain the big numbers
    <braunr> the slab size depends on the object size, and for 4k, it is 32k
    <braunr> you can store 8 4k buffers in a slab (lines 2 to 9)
    <mcsim> so we need use kmem_alloc_* 8 times?
    <braunr> on line 10, the ninth object is allocated, which adds another slab
      to the cache, hence the big number
    <braunr> no, once for a size of 32k
    <braunr> and then the free list is initialized, which means accessing those
      pages, which means tlb misses
    <braunr> i guess the zone allocator already has free pages available
    <mcsim> I see
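
The explanation of the two big numbers can be checked with a few lines of arithmetic. The sketch below uses the sizes quoted above (32 KiB slabs, 4 KiB objects); it assumes the whole slab holds buffers, which matches the "8 buffers per slab" figure, and that no object is freed between allocations. The function names are hypothetical. Under those assumptions the first and ninth allocations are the ones that add a slab, each with a single 32 KiB `kmem_alloc`-style call.

```c
#include <stddef.h>

#define TOY_SLAB_SIZE (32u * 1024u)  /* slab size quoted for 4k objects */
#define TOY_BUF_SIZE  (4u * 1024u)   /* object size used in the test */

static size_t toy_bufs_per_slab(void)
{
    return TOY_SLAB_SIZE / TOY_BUF_SIZE;
}

/*
 * Returns 1 if the k-th allocation (k >= 1) from an initially empty cache
 * has to create a new slab, assuming nothing is freed in between.
 */
static int toy_alloc_grows_cache(size_t k)
{
    return ((k - 1) % toy_bufs_per_slab()) == 0;
}

/* Slabs needed to hold nr_objs live objects, rounded up. */
static size_t toy_slabs_needed(size_t nr_objs)
{
    size_t per_slab = toy_bufs_per_slab();

    return (nr_objs + per_slab - 1) / per_slab;
}
```
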
    <braunr> i think you can stop performance measurements, they show the
      allocator is slightly slower, but so slightly we don't care about that
    <braunr> we need numbers on memory usage now (at the page level)
    <braunr> and this isn't easy
    <mcsim> For balloc I can get numbers if I summarize nr_slabs*slab_size for
      each cache, isn't it?
    <braunr> yes
    <braunr> you can have a look at the original implementation, function
      mem_info
    <mcsim> And for zalloc I have to summarize of cur_size and then add
      zalloc_wasted_space?
    <braunr> i don't know :/
    <braunr> i think the best moment to obtain accurate values is after zone_gc
      removes the collected pages
    <braunr> for both allocators, you could fill a stats structure at that
      moment, and have an rpc copy that structure when a client tool requests
      it
    <braunr> concerning your tests, there is another point to have in mind
    <braunr> the very first loop in your code shows a result of 31844
    <braunr> although you disabled the call to cpu_pool_fill
    <braunr> but the reason why it's so long is that the cpu layer still exists
    <braunr> and if you look carefully, the cpu pools are created as needed on
      the free path
    <mcsim> I removed cpu_pool_drain
    <braunr> but not cpu_pool_push/pop i guess
    <mcsim> http://paste.debian.net/128698/
    <braunr> see, you still allocate the cpu pool array on the free path
    <mcsim> but I don't fill it
    <braunr> that's not the point
    <braunr> it uses mem_cache_alloc
    <braunr> so in a call to free, you can also have an allocation, that can
      potentially create a new slab
    <mcsim> I see, so I have to create cpu_pool at the initialization stage?
    <braunr> no, you can't
    <braunr> there is a reason why they're allocated on the free path
    <braunr> but since you don't have the fill/drain functions, i wonder if you
      should just comment out the whole cpu layer code
    <braunr> but hmm
    <braunr> no really, it's not worth the effort
    <braunr> even with drains/fills, the results are really good enough
    <braunr> it makes the allocator smp ready
    <braunr> we should just keep it that way
    <braunr> mcsim: fyi, the reason why cpu pool arrays are allocated on the
      free path is to avoid recursion
    <braunr> because cpu pool arrays are allocated from caches just as almost
      everything else
    <mcsim> ok
    <mcsim> sum of cur_size and then adding zalloc_wasted_space gives 0x4e1954
    <mcsim> but this value isn't even page aligned
    <mcsim> For balloc I've got 0x4c6000 0x4aa000 0x48d000
    <braunr> hm can you report them in decimal, >> 10 so that values are in KiB
      ?
    <mcsim> 4888 4776 4660 for balloc
    <mcsim> 4998 for zalloc
    <braunr> when ?
    <braunr> after boot ?
    <mcsim> boot, compile, zone_gc
    <mcsim> and then measure
    <braunr> ?
    <mcsim> I call garbage collector before measuring
    <mcsim> and I measure after kernel compilation
    <braunr> i thought it took you 40 minutes
    <mcsim> for balloc I got results at night
    <braunr> oh so you already got them
    <braunr> i can't believe the kernel only consumes 5 MiB
    <mcsim> before gc it takes about 9052 KiB
    <braunr> can i see the measurement code ?
    <braunr> oh, and how much ram does your machine have ?
    <mcsim> 758 mb
    <mcsim> 768
    <braunr> that's really weird
    <braunr> i'd expect the kernel to consume much more space
    <mcsim> http://paste.debian.net/128703/
    <mcsim> it's only dynamically allocated data
    <braunr> yes
    <braunr> ipc ports, rights, vm map entries, vm objects, and lots of other
      hanging buffers
    <braunr> about how much is zalloc_wasted_space ?
    <braunr> if it's small or constant, i guess you could ignore it
    <mcsim> about 492
    <mcsim> KiB
    <braunr> well it's another good point, mach internal structures don't imply
      much overhead
    <braunr> or, the zone allocator is underused

    <tschwinge> mcsim, braunr: The memory allocator project is coming along
      good, as I get from your IRC messages?
    <braunr> tschwinge: yes, but as expected, improvements are minor
    <tschwinge> But at the very least it's now well-known, maintainable code.
    <braunr> yes, it's readable, easier to understand, provides self inspection
      and is smp ready
    <braunr> there also are less hacks, but a few less features (there are no
      way to avoid sleeping so it's unusable - and unused - in interrupt
      handlers)
    <braunr> is* no way
    <braunr> tschwinge: mcsim did a good job porting and measuring it
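
The hexadecimal figures quoted in this session convert to the KiB values given afterwards with a plain right shift by 10, and the zalloc total can be seen not to be page aligned by masking its low bits. A small sketch with hypothetical helper names, assuming 4 KiB pages:

```c
#define TOY_PAGE_SIZE 4096ul   /* assumed page size */

/* Convert a byte count to KiB, truncating as ">> 10" does. */
static unsigned long bytes_to_kib(unsigned long bytes)
{
    return bytes >> 10;
}

/* Returns 1 if the byte count is a multiple of the page size. */
static int is_page_aligned(unsigned long bytes)
{
    return (bytes & (TOY_PAGE_SIZE - 1)) == 0;
}
```

Applied to the log: `0x4c6000`, `0x4aa000` and `0x48d000` give 4888, 4776 and 4660 KiB and are page aligned, while `0x4e1954` gives 4998 KiB and is not.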


# IRC, freenode, #hurd, 2011-09-08

    <antrik> braunr: note that the zalloc map used to be limited to 8 MiB or
      something like that a couple of years ago... so it doesn't seem
      surprising that the kernel uses "only" 5 MiB :-)
    <antrik> (yes, we had a *lot* of zalloc panics back then...)


# IRC, freenode, #hurd, 2011-09-14

    <mcsim> braunr: hello. I've written a constructor for kernel map entries
      and it can return resources to their source. Can you have a look at it?
      http://paste.debian.net/130037/ If all be OK I'll push it tomorrow.
    <braunr> mcsim: send the patch through mail please, i'll apply it on my
      copy
    <braunr> are you sure the cache is reapable ?
    <mcsim> All slabs, except first I allocate with kmem_alloc_wired.
    <braunr> how can you be sure ?
    <mcsim> First slab I allocate during bootstrap and use pmap_steal_memory
      and further I use only kmem_alloc_wired
    <braunr> no, you use kmem_free
    <braunr> in kentry_dealloc_cache()
    <braunr> which probably creates a recursion
    <braunr> using the constructor this way isn't a good idea
    <braunr> constructors are good for preconstructed state (set counters to 0,
      init lists and locks, that kind of things, not allocating memory)
    <braunr> i don't think you should try to make this special cache reapable
    <braunr> mcsim: keep in mind constructors are applied on buffers at *slab*
      creation, not at object allocation
    <braunr> so if you allocate a single slab with, say, 50 or 100 objects per
      slab, kmem_alloc_wired would be called that number of times
    <mcsim> why kentry_dealloc_cache can create recursion? kentry_dealloc_cache
      is called only by mem_cache_reap.
    <braunr> right
    <braunr> but are you totally sure mem_cache_reap() can't be called by
      kmem_free() ?
    <braunr> i think you're right, it probably can't
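
The point about constructors running at *slab* creation rather than at object allocation can be shown with a toy cache. Everything here is hypothetical (`toy_cache`, `toy_ctor`, `TOY_OBJS_PER_SLAB`, `malloc` as the backend) and much simpler than the real allocator; it only counts constructor calls to show that they happen once per buffer when a slab is added, and never again on later allocations or frees.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_OBJS_PER_SLAB 4   /* hypothetical; the log mentions 50 or 100 */

struct toy_obj {
    int refcount;              /* preconstructed state */
    struct toy_obj *next_free; /* free list linkage */
};

struct toy_cache {
    struct toy_obj *free_list;
    unsigned int nr_slabs;
};

static unsigned int toy_ctor_calls;

/* Constructor: only sets up preconstructed state, allocates nothing. */
static void toy_ctor(struct toy_obj *obj)
{
    obj->refcount = 0;
    toy_ctor_calls++;
}

/* Add one slab; the constructor runs on every buffer of the new slab. */
static int toy_cache_grow(struct toy_cache *cache)
{
    struct toy_obj *slab;
    size_t i;

    /* Slabs are never released in this sketch. */
    slab = malloc(TOY_OBJS_PER_SLAB * sizeof(*slab));
    if (slab == NULL)
        return -1;

    for (i = 0; i < TOY_OBJS_PER_SLAB; i++) {
        toy_ctor(&slab[i]);
        slab[i].next_free = cache->free_list;
        cache->free_list = &slab[i];
    }

    cache->nr_slabs++;
    return 0;
}

/* Allocation only pops a preconstructed buffer; no constructor call here. */
static struct toy_obj *toy_cache_alloc(struct toy_cache *cache)
{
    struct toy_obj *obj;

    if (cache->free_list == NULL && toy_cache_grow(cache) != 0)
        return NULL;

    obj = cache->free_list;
    cache->free_list = obj->next_free;
    return obj;
}

static void toy_cache_free(struct toy_cache *cache, struct toy_obj *obj)
{
    obj->next_free = cache->free_list;
    cache->free_list = obj;
}
```

A constructor that itself allocated memory would therefore do so `TOY_OBJS_PER_SLAB` times per slab, which is the cost braunr warns about.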


# IRC, freenode, #hurd, 2011-09-25

    <mcsim> braunr: hello. I rewrote constructor for kernel entries and seems
      that it works fine. I think that this was last milestone. Only moving of
      memory allocator sources to more appropriate place and merge with main
      branch left.
    <braunr> mcsim: it needs renaming and reindenting too
    <mcsim> for reindenting C-x h Tab in emacs will be enough?
    <braunr> mcsim: make sure which style must be used first
    <mcsim> and what should I rename and where better to place allocator? For
      example, there is no lib directory, like in x15. Should I create it and
      move list.* and rbtree.* to lib/ or move these files to util/ or
      something else?
    <braunr> mcsim: i told you balloc isn't a good name before, use something
      more meaningful (kmem is already used in gnumach unfortunately if i'm
      right)
    <braunr> you can put the support files in kern/
    <mcsim> what about vm_alloc?
    <braunr> you should prefix it with vm_
    <braunr> shouldn't
    <braunr> it's a top level allocator
    <braunr> on top of the vm system
    <braunr> maybe mcache
    <braunr> hm no
    <braunr> maybe just km_
    <mcsim> kern/km_alloc.*?
    <braunr> no
    <braunr> just km
    <mcsim> ok.


# IRC, freenode, #hurd, 2011-09-27

    <mcsim> braunr: hello. When I've tried to speed of new allocator and bad
      I've removed function mem_cpu_pool_fill. But you've said to undo this. I
      don't understand why this function is necessary. Can you explain it,
      please?
    <mcsim> When I've tried to compare speed of new allocator and old*
    <braunr> i'm not sure i said that
    <braunr> i said the performance overhead is negligible
    <braunr> so it's better to leave the cpu pool layer in place, as it almost
      doesn't hurt
    <braunr> you can implement the KMEM_CF_NO_CPU_POOL I added in the x15 mach
      version
    <braunr> so that cpu pools aren't used by default, but the code is present
      in case smp is implemented
    <mcsim> I didn't remove cpu pool layer. I've just removed filling of cpu
      pool during creation of slab.
    <braunr> how do you fill the cpu pools then ?
    <mcsim> If object is freed then it is added to cpu pool
    <braunr> so you don't fill/drain the pools ?
    <braunr> you try to get/put an object and if it fails you directly fall
      back to the slab layer ?
    <mcsim> I drain them during garbage collection
    <braunr> oh
    <mcsim> yes
    <braunr> you shouldn't touch the cpu layer during gc
    <braunr> the number of objects should be small enough so that we don't care
      much
    <mcsim> ok. I can drain cpu pool at any other time if it is prohibited to
      in mem_gc.
    <mcsim> But why do we need to fill cpu pool during slab creation?
    <mcsim> In this case allocation consist of: get object from slab -> put it
      to cpu pool -> get it from cpu pool
    <mcsim> I've just removed the last two stages
    <braunr> hm cpu pools aren't filled at slab creation
    <braunr> they're filled when they're empty, and drained when they're full
    <braunr> so that the number of objects they contain is increased/reduced to
      a value suitable for the next allocations/frees
    <braunr> the idea is to fall back as little as possible to the slab layer
      because it requires the acquisition of the cache lock
    <mcsim> oh. You're right. I'm really sorry. The point is that if cpu pool
      is empty we don't need to fill it first
    <braunr> uh, yes we do :)
    <mcsim> Why cache locking is so undesirable? If we have free objects in
      slabs locking will not take a lot of time.
    <braunr> mcsim: it's undesirable on a smp system
    <mcsim> ok.
    <braunr> mcsim: and spin locks are normally noops on a up system
    <braunr> which is the case in gnumach, hence the slightly better
      performances without the cpu layer
    <braunr> but i designed this allocator for x15, which only supports mp
      systems :)
    <braunr> mcsim: sorry i couldn't look at your code, sick first, busy with
      server migration now (new server almost ready for xen hurds :))
    <mcsim> ok.
    <mcsim> I ended with allocator if didn't miss anything important:)
    <braunr> i'll have a look soon i hope :)
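
The fill/drain behaviour braunr describes (pools are filled when empty and drained when full, so that the slab layer and its cache lock are touched as rarely as possible) can be modelled with counters. This is only a sketch under simplifying assumptions: objects are represented by counts, `TOY_POOL_SIZE` and `TOY_TRANSFER_SIZE` are made-up values, and all names are hypothetical rather than taken from the allocator.

```c
#define TOY_POOL_SIZE     8   /* hypothetical capacity of one cpu pool */
#define TOY_TRANSFER_SIZE 4   /* hypothetical objects moved per fill/drain */

struct toy_slab_layer {
    unsigned int free_objs;          /* free objects in slabs */
    unsigned int lock_acquisitions;  /* times the cache lock was taken */
};

struct toy_cpu_pool {
    unsigned int nr_objs;            /* objects cached for this cpu */
};

/* Move a batch of objects from the slab layer to the pool (takes the lock). */
static void toy_pool_fill(struct toy_cpu_pool *pool, struct toy_slab_layer *slabs)
{
    unsigned int n = TOY_TRANSFER_SIZE;

    slabs->lock_acquisitions++;

    if (n > slabs->free_objs)
        n = slabs->free_objs;

    slabs->free_objs -= n;
    pool->nr_objs += n;
}

/* Move a batch of objects from the pool back to the slab layer (takes the lock). */
static void toy_pool_drain(struct toy_cpu_pool *pool, struct toy_slab_layer *slabs)
{
    slabs->lock_acquisitions++;
    pool->nr_objs -= TOY_TRANSFER_SIZE;
    slabs->free_objs += TOY_TRANSFER_SIZE;
}

/* Returns 0 on success, -1 if no object is available anywhere. */
static int toy_alloc(struct toy_cpu_pool *pool, struct toy_slab_layer *slabs)
{
    if (pool->nr_objs == 0)
        toy_pool_fill(pool, slabs);

    if (pool->nr_objs == 0)
        return -1;

    pool->nr_objs--;
    return 0;
}

static void toy_free(struct toy_cpu_pool *pool, struct toy_slab_layer *slabs)
{
    if (pool->nr_objs == TOY_POOL_SIZE)
        toy_pool_drain(pool, slabs);

    pool->nr_objs++;
}
```

In this model, a burst of quick alloc/free pairs never reaches the slab layer once the pool holds an object, while long runs of allocations or frees pay one lock acquisition per batch.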


# IRC, freenode, #hurd, 2011-09-27

    <antrik> braunr: would it be realistic/useful to check during GC whether
      all "used" objects are actually in a CPU pool, and if so, destroy them so
      the slab can be freed?...
    <antrik> mcsim: BTW, did you ever do any measurements of memory
      use/fragmentation?
    <mcsim> antrik: I couldn't do this for zalloc
    <antrik> oh... why not?
    <antrik> (BTW, I would be interested in a comparison between using the CPU
      layer, and bare slab allocation without CPU layer)
    <mcsim> Result I've got were strange. It wasn't even aligned to page size.
    <mcsim> Probably is it better to look into /proc/vmstat?
    <mcsim> Because I put hooks in the code and probably I missed something
    <antrik> mcsim: I doubt vmstat would give enough information to make any
      useful comparison...
    <braunr> antrik: isn't this draining cpu pools at gc time ?
    <braunr> antrik: the cpu layer was found to add a slight overhead compared
      to always falling back to the slab layer
    <antrik> braunr: my idea is only to drop entries from the CPU cache if they
      actually prevent slabs from being freed... if other objects in the slab
      are really in use, there is no point in flushing them from the CPU cache
    <antrik> braunr: I meant comparing the fragmentation with/without CPU
      layer. the difference in CPU usage is probably negligible anyways...
    <antrik> you might remember that I was (and still am) sceptical about CPU
      layer, as I suspect it worsens the good fragmentation properties of the
      pure slab allocator -- but it would be nice to actually check this :-)
    <braunr> antrik: right
    <braunr> antrik: the more i think about it, the more i consider slqb to be
      a better solution ...... :>
    <braunr> an idea for when there's time
    <braunr> eh
    <antrik> hehe :-)


# IRC, freenode, #hurd, 2011-10-13

    <braunr> mcsim: what's the current state of your gnumach branch ?
    <mcsim> I've merged it with master in September
    <braunr> yes i've seen that, but does it build and run fine ?
    <mcsim> I've tested it on gnumach from debian repository, but for building
      I had to make additional change in device/ramdisk.c, as I mentioned.
    <braunr> mcsim: why ?
    <mcsim> And it runs fine for me.
    <braunr> mcsim: why did you need to make other changes ?
    <mcsim> because there is a patch which comes with from-debian-repository
      kernel and it adds some code, where I have to make changes. Earlier
      kernel_map was a pointer to structure, but I change that and now
      kernel_map is structure. So handling to it should be by taking the
      address (&kernel_map)
    <braunr> why did you do that ?
    <braunr> or put it another way: what made you do that type change on
      kernel_map ?
    <mcsim> Earlier memory for kernel_map was allocating with zalloc. But now
      salloc can't allocate memory before its initialisation
    <braunr> that's not a good reason
    <braunr> a simple workaround for your problem is this :
    <braunr> static struct vm_map kernel_map_store;
    <braunr> vm_map_t kernel_map = &kernel_map_store;
    <mcsim> braunr: Ok. I'll correct this.


# IRC, freenode, #hurd, 2011-11-01

    <braunr> etenil: but mcsim's work is, for one, useful because the allocator
      code is much clearer, adds some debugging support, and is smp-ready


# IRC, freenode, #hurd, 2011-11-14

    <braunr> i've just realized that replacing the zone allocator removes most
      (if not all) static limit on allocated objects
    <braunr> as we have nothing similar to rlimits, this means kernel resources
      are actually exhaustible
    <braunr> and i'm not sure every allocation is cleanly handled in case of
      memory shortage
    <braunr> youpi: antrik: tschwinge: is this acceptable anyway ?
    <braunr> (although IMO, it's also a good thing to get rid of those limits
      that made the kernel panic for no valid reason)
    <youpi> there are actually not many static limits on allocated objects
    <youpi> only a few have one
    <braunr> those defined in kern/mach_param.h
    <youpi> most of them are not actually enforced
    <braunr> ah ?
    <braunr> they are used at zinit() time
    <braunr> i thought they were
    <youpi> yes,  but most zones are actually fine with overcoming the max
    <braunr> ok
    <youpi> see zone->max_size += (zone->max_size >> 1);
    <youpi> you need both !EXHAUSTIBLE and FIXED
    <braunr> ok
    <pinotree> making having rlimits enforced would be nice...
    <pinotree> s/making//
    <braunr> pinotree: the kernel wouldn't handle many standard rlimits anyway
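
youpi's remark can be read as a small decision table: when a zone hits its maximum, the limit is only a hard one if the zone is both non-exhaustible and fixed; otherwise the maximum simply grows by half. The sketch below encodes that reading. The flag names, the result enum and `toy_zone_hit_limit` are hypothetical and are not the actual zalloc code, only the `max_size += (max_size >> 1)` statement is taken from the log.

```c
#define TOY_ZONE_EXHAUSTIBLE 0x1u   /* hypothetical flag values */
#define TOY_ZONE_FIXED       0x2u

enum toy_limit_result {
    TOY_LIMIT_GROWN,        /* max_size was raised, allocation proceeds */
    TOY_LIMIT_ALLOC_FAILS,  /* exhaustible zone: the allocation fails */
    TOY_LIMIT_PANIC         /* fixed, non-exhaustible zone: hard limit */
};

struct toy_zone {
    unsigned long max_size;
    unsigned int type;
};

/* What happens when an allocation would exceed zone->max_size. */
static enum toy_limit_result toy_zone_hit_limit(struct toy_zone *zone)
{
    if (zone->type & TOY_ZONE_EXHAUSTIBLE)
        return TOY_LIMIT_ALLOC_FAILS;

    if (zone->type & TOY_ZONE_FIXED)
        return TOY_LIMIT_PANIC;

    /* Statement quoted in the log: grow the limit by 50%. */
    zone->max_size += (zone->max_size >> 1);
    return TOY_LIMIT_GROWN;
}
```

For a zone with neither flag set and a maximum of 1000, the call raises the maximum to 1500.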

    <braunr> i've just committed my final patch on mcsim's branch, which will
      serve as the starting point for integration
    <braunr> which means code in this branch won't change (or only last minute
      changes)
    <braunr> you're invited to test it
    <braunr> there shouldn't be any noticeable difference with the master
      branch
    <braunr> a bit less fragmentation
    <braunr> more memory can be reclaimed by the VM system
    <braunr> there are debugging features
    <braunr> it's SMP ready
    <braunr> and overall cleaner than the zone allocator
    <braunr> although a bit slower on the free path (because of what's
      performed to reduce fragmentation)
    <braunr> but even "slower" here is completely negligible


# IRC, freenode, #hurd, 2011-11-15

    <mcsim> I enabled cpu_pool layer and kentry cache exhausted at "apt-get
      source gnumach && (cd gnumach-* && dpkg-buildpackage)"
    <mcsim> I mean kernel with your last commit
    <mcsim> braunr: I'll make patch how I've done it in a few minutes, ok? It
      will be more specific.
    <braunr> mcsim: did you just remove the #if NCPUS > 1 directives ?
    <mcsim> no. I replaced macro NCPUS > 1 with SLAB_LAYER, which equals NCPUS
      > 1, then I redefined macro SLAB_LAYER
    <braunr> ah, you want to make the layer optional, even on UP machines
    <braunr> mcsim: can you give me the commands you used to trigger the
      problem ?
    <mcsim> apt-get source gnumach && (cd gnumach-* && dpkg-buildpackage)
    <braunr> mcsim: how much ram & swap ?
    <braunr> let's see if it can handle a quite large aptitude upgrade
    <mcsim> how can I check swap size?
    <braunr> free
    <braunr> cat /proc/meminfo
    <braunr> top
    <braunr> whatever
    <mcsim>              total       used       free     shared    buffers
      cached
    <mcsim> Mem:        786368     332296     454072          0          0
      0
    <mcsim> -/+ buffers/cache:     332296     454072
    <mcsim> Swap:      1533948          0    1533948
    <braunr> ok, i got the problem too
    <mcsim> braunr: do you run hurd in qemu?
    <braunr> yes
    <braunr> i guess the cpu layer increases fragmentation a bit
    <braunr> which means more map entries are needed
    <braunr> hm, something's not right
    <braunr> there are only 26 kernel map entries when i get the panic
    <braunr> i wonder why the cache gets that stressed
    <braunr> hm, reproducing the kentry exhaustion problem takes quite some
      time
    <mcsim> braunr: what do you mean?
    <braunr> sometimes, dpkg-buildpackage finishes without triggering the
      problem
    <mcsim> the problem is in apt-get source gnumach
    <braunr> i guess the problem happens because of drains/fills, which
      allocate/free many more objects than actually preallocated at boot time
    <braunr> ah ?
    <braunr> ok
    <braunr> i've never had it at that point, only later
    <braunr> i'm unable to trigger it currently, eh
    <mcsim> do you use *-dbg kernel?
    <braunr> yes
    <braunr> well, i use the compiled kernel, with the slab allocator, built
      with the in kernel debugger
    <mcsim> when you run apt-get source gnumach, you run it in clean directory?
      Or there are already present downloaded archives?
    <braunr> completely empty
    <braunr> ah just got it
    <braunr> ok the limit is reached, as expected
    <braunr> i'll just bump it
    <braunr> the cpu layer drains/fills allocate several objects at once (64 if
      the size is small enough)
    <braunr> the limit of 256 (actually 252 since the slab descriptor is
      embedded in its slab) is then easily reached
    <antrik> mcsim: most direct way to check swap usage is vmstat
    <braunr> damn, i can't live without slabtop and the amount of
      active/inactive cache memory any more
    <braunr> hm, weird, we have active/inactive memory in procfs, but not
      buffers/cached memory
    <braunr> we could set buffers to 0 and everything as cached memory, since
      we're currently unable to communicate the purpose of cached memory
      (whether it's used by disk servers or file system servers)
    <braunr> mcsim: looks like there are about 240 kernel map entries (i forgot
      about the ones used in kernel submaps)
    <braunr> so yes, adding the cpu layer is what makes the kernel reach the
      limit more easily
    <mcsim> braunr: so just increasing limit will solve the problem?
    <braunr> mcsim: yes
    <braunr> slab reclaiming looks very stable
    <braunr> and unfrequent
    <braunr> (which is surprising)
    <pinotree> braunr: "unfrequent"?
    <braunr> pinotree: there isn't much memory pressure
    <braunr> slab_collect() gets called once a minute on my hurd
    <braunr> or is it infrequent ?
    <braunr> :)
    <pinotree> i have no idea :)
    <braunr> infrequent, yes
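
The arithmetic behind the exhaustion can be sketched as follows. This is only a rough model built from the figures quoted above (252 usable preallocated objects, bulk fills of 64 objects); the macro and function names are hypothetical and do not come from the gnumach sources.

```c
#include <assert.h>
#include <stddef.h>

/* Figures quoted in the discussion; the names are hypothetical. */
#define KENTRY_PREALLOCATED 252 /* 256 minus the embedded slab descriptor */
#define CPU_POOL_FILL_COUNT 64  /* objects moved per fill for small sizes */

/*
 * Return how many complete bulk fills of fill_count objects a fixed pool
 * of pool_size objects can satisfy before it runs dry.
 */
static size_t
fills_before_exhaustion(size_t pool_size, size_t fill_count)
{
    assert(fill_count != 0);
    return pool_size / fill_count;
}

/* Return the objects left once no further complete fill is possible. */
static size_t
objects_left_after_fills(size_t pool_size, size_t fill_count)
{
    assert(fill_count != 0);
    return pool_size % fill_count;
}
```

With these figures only three complete fills fit, which is consistent with the limit being reached more easily once the cpu layer is added.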


# IRC, freenode, #hurd, 2011-11-16

    <braunr> for those who want to play with the slab branch of gnumach, the
      slabinfo tool is available at http://darnassus.sceen.net/cgit/rbraun/slabinfo.git/
    <braunr> for those merely interested in numbers, here is the output of
      slabinfo, for a hurd running in kvm with 512 MiB of RAM, an unused swap,
      and a short usage history (gnumach debian packages built, aptitude
      upgrade for a dozen packages, a few git commands)
    <braunr> http://www.sceen.net/~rbraun/slabinfo.out
    <antrik> braunr: numbers for a long usage history would be much more
      interesting :-)


## IRC, freenode, #hurd, 2011-11-17

    <braunr> antrik: they'll come :)
    <etenil> is something going on on darnassus? it's mighty slow
    <braunr> yes
    <braunr> i've rebooted it to run a modified kernel (with the slab
      allocator) and i'm building stuff on it to stress it
    <braunr> (i don't have any other available machine with that amount of
      available physical memory)
    <etenil> ok
    <antrik> braunr: probably would be actually more interesting to test under
      memory pressure...
    <antrik> guess that doesn't make much of a difference for the kernel object
      allocator though
    <braunr> antrik: if ram is larger, there can be more objects stored in
      kernel space, then, by building something large such as eglibc, memory
      pressure is created, causing caches to be reaped
    <braunr> our page cache is useless because of vm_object_cached_max
    <braunr> it's a stupid arbitrary limit masking the inability of the vm to
      handle pressure correctly 
    <braunr> if removing it, the kernel freezes soon after ram is filled
    <braunr> antrik: it may help trigger the "double swap" issue you mentioned
    <antrik> what may help trigger it?
    <braunr> not checking this limit
    <antrik> hm... indeed I wonder whether the freezes I see might have the
      same cause


## IRC, freenode, #hurd, 2011-11-19

    <braunr> http://www.sceen.net/~rbraun/slabinfo.out <= state of the slab
      allocator after building the debian libc packages and removing all files
      once done
    <braunr> it's mostly the same as on any other machine, because of the
      various arbitrary limits in mach (most importantly, the max number of
      objects in the page cache)
    <braunr> fragmentation is still quite low
    <antrik> braunr: actually fragmentation seems to be lower than on the other
      run...
    <braunr> antrik: what makes you think that ?
    <antrik> the numbers of currently unused objects seem to be in a similar
      range IIRC, but more of them are reclaimable I think
    <antrik> maybe I'm misremembering the other numbers
    <braunr> there had been more reclaims on the other run


# IRC, freenode, #hurd, 2011-11-25

    <braunr> mcsim: i've just updated the slab branch, please review my last
      commit when you have time
    <mcsim> braunr: Do you mean compilation/tests?
    <braunr> no, just a quick glance at the code, see if it matches what you
      intended with your original patch
    <mcsim> braunr: everything is ok
    <braunr> good
    <braunr> i think the branch is ready for integration


# IRC, freenode, #hurd, 2011-12-17

    <braunr> in the slab branch, there now is no use for the defines in
      kern/mach_param.h
    <braunr> should the file be removed or left empty as a placeholder for
      future arbitrary limits ?
    <braunr> (i'd tend to remove it as a way of indicating we don't want
      arbitrary limits but there may be a good reason to keep it around .. :))
    <youpi> I'd just drop it
    <braunr> ok
    <braunr> hmm maybe we do want to keep that one :
    <braunr> #define IMAR_MAX        (1 << 10)       /* Max number of
      msg-accepted reqs */
    <antrik> whatever that is...
    <braunr> it gets returned in ipc_marequest_info
    <braunr> but the mach_debug interface has never been used on the hurd
    <braunr> there now is a master-slab branch in the gnumach repo, feel free
      to test it


# IRC, freenode, #hurd, 2011-12-22

    <youpi> braunr: does the new gnumach allocator have profiling features?
    <youpi> e.g. to easily know where memory leaks reside
    <braunr> youpi: you mean tracking call traces to allocated blocks ?
    <youpi> not necessarily traces
    <youpi> but at least means to know what kind of objects is filling memory
    <braunr> it's very close to the zone allocator
    <braunr> but instead of zones, there are caches
    <braunr> each named after the type they store
    <braunr> see http://www.sceen.net/~rbraun/slabinfo.out
    <youpi> ok, so we can know, per-type, how much memory is used
    <braunr> yes
    <youpi> good
    <braunr> if backtraces can easily be forged, it wouldn't be hard to add
      that feature too
    <youpi> does it dump such info when memory goes short?
    <braunr> no but it can
    <braunr> i've done this during tests
    <youpi> it'd be good
    <youpi> because I don't know in advance when a buildd will crash due to
      that :)
    <braunr> each time slab_collect() is called for example
    <youpi> I mean not on collect, but when it's too late
    <youpi> and thus always enabled
    <braunr> ok
    <youpi> (because there's nothing better to do than at least give infos)
    <braunr> you just have to define "when it's too late", and i can add that
    <youpi> when there is no memory left
    <braunr> you mean when the number of free pages strictly reaches 0 ?
    <youpi> yes
    <braunr> ok
    <youpi> i.e. just before crashing the kernel
    <braunr> i see


# IRC, freenode, #hurdfr, 2012-01-02

    <youpi> braunr: was the slab allocator code written from scratch?
    <youpi> there is still some carnegie mellon copyright in there
    <youpi> (in slab_info.h at least)
    <youpi> ipc_hash_global_size = 256;
    <youpi> 256 should be defined as a constant in a header
    <youpi> otherwise it's yet another arbitrary value hidden in the code
    <youpi> same for ipc_marequest_size etc.
    <braunr> youpi: yes, from scratch
    <braunr> slab_info.h was originally zone_info.h
    <braunr> as for the fixed values, they were already there in that form, i
      thought it was better to leave them that way to make the diffs easier
      to read
    <braunr> i'll turn them into macros instead
    <braunr> so mach_param.h may have to be brought back
    <braunr> or else they could go in the ipc .h files


# IRC, freenode, #hurd, 2012-01-18

    <braunr> does the slab branch need other reviews/reports before being
      integrated ?


# IRC, freenode, #hurd, 2012-01-30

    <braunr> youpi: do you have some idea about when you want to get the slab
      branch in master ?
    <youpi> I was considering as soon as mcsim gets his paper
    <braunr> right


# IRC, freenode, #hurd, 2012-02-22

    <mcsim> Do I understand correct, that real memory page should be
      necessarily in one of following lists: vm_page_queue_active,
      vm_page_queue_inactive, vm_page_queue_free?
    <braunr> cached pages are
    <braunr> some special pages used only by the kernel aren't
    <braunr> pages can be both wired and cached (i.e. managed by the page
      cache), so that they can be passed to external applications and then
      unwired (as is the case with your host_slab_info() function if you
      remember)
    <braunr> use "physical" instead of "real memory"
    <mcsim> braunr: thank you.


# IRC, freenode, #hurd, 2012-04-22

    <braunr> youpi: tschwinge: when the slab code was added, a few new files
      made it into gnumach that come from my git repo and are used in other
      projects as well
    <braunr> they're licensed under BSD upstream and GPL in gnumach, and though
      it initially didn't disturb me, now it does
    <braunr> i think i should fix this by leaving the original copyright and
      adding the GPL on top
    <youpi> sure, submit a patch
    <braunr> hm i have direct commit access if i'm right
    <youpi> then fix it :)
    <braunr> do you want to review ?
    <youpi> I don't think there is any need to
    <braunr> ok


# IRC, freenode, #hurd, 2012-12-08

    <mcsim> braunr: hi. Do I understand correct that merely the same technique
      is used in linux to determine the slab where, the object to be freed,
      resides?
    <braunr> yes but it's faster on linux since it uses a direct mapping of
      physical memory
    <braunr> it just has to shift the virtual address to obtain the physical
      one, whereas x15 has to walk the page tables
    <braunr> of course it only works for kmalloc, vmalloc is entirely different
    <mcsim> btw, is there sense to use some kind of B-tree instead of AVL to
      decrease number of cache misses? AFAIK, in modern processors size of L1
      cache line is at least 64 bytes, so in one node we can put at least 4
      leafs (key + pointer to data) making search faster.
    <braunr> that would be a b-tree
    <braunr> and yes, red-black trees were actually developed based on
      properties observed on b-trees
    <braunr> but increasing the size of the nodes also increases memory
      overhead
    <braunr> and code complexity
    <braunr> that's why i have radix trees for cases where there are a large
      number of entries with keys close to each other :)
    <braunr> a radix-tree is basically a b-tree using the bits of the key as
      indexes in the various arrays it walks instead of comparing keys to each
      other
    <braunr> the original avl tree used in my slab allocator was intended to
      reduce the average height of the tree (avl is better for that)
    <braunr> avl trees are more suited for cases where there are more lookups
      than inserts/deletions
    <braunr> they make the tree "flatter" but the maximum complexity of
      operations that change the tree is 2log2(n), since rebalancing the tree
      can make the algorithm reach back to the tree root
    <braunr> red-black trees have slightly bigger heights but insertions are
      limited to 2 rotations and deletions to 3
    <mcsim> there should be not much lookups in slab allocators
    <braunr> which explains why they're more generally found in generic
      containers
    <mcsim> or do I misunderstand something?
    <braunr> well, there is a lookup for each free()
    <braunr> whereas there are insertions/deletions when a slab becomes
      non-empty/empty
    <mcsim> I see
    <braunr> so it was very efficient for caches of small objects, where slabs
      have many of them
    <braunr> also, i wrote the implementation in userspace, without
      functionality pmap provides (although i could have emulated it
      afterwards)


# IRC, freenode, #hurd, 2013-01-06

    <youpi> braunr: panic: vm_map: kentry memory exhausted
    <braunr> youpi: ouch
    <youpi> that's what I usually get
    <braunr> ok
    <braunr> the kentry area is a preallocated memory area that is used to back
      the vm_map_kentry cache
    <braunr> objects from this cache are used to describe kernel virtual memory
    <braunr> so in this case, i simply assume the kentry area must be enlarged
    <braunr> (currently, both virtual and physical memory is preallocated, an
      improvement could be what is now done in x15, to preallocate virtual
      memory only
    <braunr> )
    <youpi> Mmm, why do we actually have this limit?
    <braunr> the kentry area must be described by one entry
    <youpi> ah, sorry, vm/vm_resident.c:       kentry_data =
      pmap_steal_memory(kentry_data_size);
    <braunr> a statically allocated one
    <youpi> I had missed that one
    <braunr> previously, the zone allocator would do that
    <braunr> the kentry area is required to avoid recursion when allocating
      memory
    <braunr> another solution would be a custom allocator in vm_map, but i
      wanted to use a common cache for those objects too
    <braunr> youpi: you could simply try doubling KENTRY_DATA_SIZE
    <youpi> already doing that
    <braunr> we might even consider a much larger size until it's reworked
    <youpi> well, it's rare enough on buildds already
    <youpi> doubling should be enough
    <youpi> or else we have leaks
    <braunr> right
    <braunr> it may not be leaks though
    <braunr> it may be poor map entry merging
    <braunr> i'd expected the kernel map entries to be easier to merge, but it
      may simply not be the case
    <braunr> (i mean, when i made my tests, it looked like there were few
      kernel map entries, but i may have missed corner cases that could cause
      more of them to be needed)
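
The idea of backing map entries with a preallocated area, so that allocating an entry never has to allocate memory itself, can be sketched as below. The pool size, structure layout, and function names are assumptions made for illustration; gnumach actually backs the vm_map_kentry cache with memory obtained from pmap_steal_memory(), as quoted above.

```c
#include <assert.h>
#include <stddef.h>

#define ENTRY_POOL_SIZE 8 /* hypothetical; kept tiny for illustration */

struct map_entry {
    struct map_entry *next_free; /* links entries that were freed */
    unsigned long start, end;
};

/* Statically reserved storage: no allocator is needed to obtain it. */
static struct map_entry entry_pool[ENTRY_POOL_SIZE];
static struct map_entry *entry_free_list;
static size_t entry_pool_used; /* entries handed out from the array so far */

/* Return an entry from the reserved area, or NULL once it is exhausted. */
static struct map_entry *
entry_alloc(void)
{
    struct map_entry *entry;

    if (entry_free_list != NULL) {
        entry = entry_free_list;
        entry_free_list = entry->next_free;
        return entry;
    }

    if (entry_pool_used == ENTRY_POOL_SIZE)
        return NULL; /* the point at which the kernel would panic */

    return &entry_pool[entry_pool_used++];
}

/* Give an entry back to the reserved area. */
static void
entry_free(struct map_entry *entry)
{
    assert(entry >= entry_pool && entry < entry_pool + ENTRY_POOL_SIZE);
    entry->next_free = entry_free_list;
    entry_free_list = entry;
}
```

Because the storage is fixed, the only remedies when it runs out are to enlarge it or to need fewer entries, which matches the options discussed above (a larger KENTRY_DATA_SIZE, or better map entry merging).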


## IRC, freenode, #hurd, 2014-02-11

    <braunr> youpi: what's the issue with kentry_data_size ?
    <youpi> I don't know
    <braunr> so back to 64pages from 256 ?
    <youpi> in debian for now yes
    <braunr> :/
    <braunr> from what i recall with x15, grub is indeed allowed to put modules
      and command lines around as it likes
    <braunr> restricted to 4G
    <braunr> iirc, command lines were in the first 1M while modules could be
      loaded right after the kernel or at the end of memory, depending on the
      versions
    <youpi> braunr: possibly VM_KERNEL_MAP_SIZE is then not big enough
    <braunr> youpi: what's the size of the ramdisk ?
    <braunr> youpi: or kmem_map too big
    <braunr> we discussed this earlier with teythoon 

[[user-space_device_drivers]], *Open Issues*, *System Boot*, *IRC, freenode,
\#hurd, 2011-07-27*, *IRC, freenode, #hurd, 2014-02-10*

    <braunr> or maybe we want to remove kmem_map altogether and directly use
      kernel_map
    <youpi> it's 6.2MiB big
    <braunr> hm
    <youpi> err no
    <braunr> looks small
    <youpi> 70MiB
    <braunr> ok yes
    <youpi> (uncompressed)
    <braunr> well
    <braunr> kernel_map is supposed to have 64M on i386 ...
    <braunr> it's 192M large, with kmem_map taking 128M
    <braunr> so at most 64M, with possible fragmentation
    <teythoon> i believe the compressed initrd is stored in the ramdisk
    <youpi> ah, right it's ext2fs which uncompresses it
    <braunr> uncompresses it where 
    <braunr> ?
    <teythoon> libstore does that
    <youpi> module --nounzip /boot/${gtk}initrd.gz 
    <youpi> braunr: in userland memory
    <youpi> it's not grub which uncompresses it for sure
    <teythoon> braunr: so my ramdisk isn't 64 megs either
    <braunr> which explains why it sometimes works
    <teythoon> yes
    <teythoon> mine is like 15 megs
    <braunr> kentry_data_size calls pmap_steal_memory, an early allocation
      function which changes virtual_space_start, which is later used to create
      the first kernel map entry
    <braunr> err, pmap_steal_memory is called with kentry_data_size as its
      argument
    <braunr> this first kernel map entry is installed inside kernel_map and
      reduces the amount of available virtual memory there
    <braunr> so yes, it all points to a layout problem
    <braunr> i suggest reducing kmem_map down to 64M
    <youpi> that's enough to get d-i back to boot
    <youpi> what would be the downside?
    <youpi> (why did you raise it to 128 actually? :) )
    <braunr> i merged the map used by generic kalloc allocations into kmem_map
    <braunr> both were 64M
    <braunr> i don't see any downside for the moment
    <braunr> i rarely see more than 50M used by the slab allocator
    <braunr> and with the recent code i added to collect reclaimable memory on
      kernel allocation failures, it's unlikely the slab allocator will be
      starved
    <youpi> but then we need that patch too
    <braunr> no
    <braunr> it would be needed if kmem_map gets filled
    <braunr> this very rarely happens
    <youpi> is "very rarely" enough ? :)
    <braunr> actually i've never seen it happen
    <braunr> i added it because i had port leaks with fakeroot
    <braunr> port rights are a bit special because they're stored in a table in
      kernel space
    <braunr> this table is enlarged with kmem_realloc
    <braunr> when an ipc space gets very large, fragmentation makes it very
      difficult to successfully resize it
    <braunr> that should be the only possible issue
    <braunr> actually, there is another submap that steals memory from
      kernel_map: device_io_map is 16M large
    <braunr> so kernel_map gets down to 48M
    <braunr> if the initial entry (that is, kentry_data_size + the physical
      page table size) gets a bit large, kernel_map may have very little
      available room
    <braunr> the physical page table size obviously varies depending on the
      amount of physical memory loaded, which may explain why the installer
      worked on some machines
    <youpi> well, it works up to 1855M
    <youpi> at 1856 it doesn't work any more :)
    <braunr> heh :)
    <youpi> and that's about the max gnumach can handle anyway
    <braunr> then reducing kmem_map down to 96M should be enough
    <youpi> it works indeed
    <braunr> could you check the amount of available space in kernel_map ?
    <braunr> the value of kernel_map->size should do
    <youpi> printing it "multiboot modules" print should be fine I guess?


### IRC, freenode, #hurd, 2014-02-12

    <braunr> probably
    <teythoon> ?
    <braunr> i expect a bit more than 160M
    <braunr> (for the value of kernel_map->size)
    <braunr> teythoon: ?
    <youpi> well, it's 2110210048
    <teythoon> what is multiboot modules printing ?
    <youpi> almost last in gnumach bootup
    <braunr> humm
    <braunr> it must account directly mapped physical pages
    <braunr> considering the kernel has exactly 2G, this means there is 36M
      available in kernel_map
    <braunr> youpi: is the ramdisk loaded at that moment ?
    <youpi> what do you mean by "loaded" ? :)
    <braunr> created
    <youpi> where?
    <braunr> allocated in kernel memory
    <youpi> the script hasn't started yet
    <braunr> ok
    <braunr> its size was 6M+ right ?
    <braunr> so it leaves around 30M
    <youpi> something like this yes
    <braunr> and changing kmem_map from 128M to 96M gave us 32M
    <braunr> so that's it
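
The figures in this exchange can be checked with a few lines of arithmetic. The sketch below only restates the numbers quoted in the discussion (192M of kernel virtual memory, a 128M or 96M kmem_map, a 16M device_io_map, a 2G kernel space, and the printed kernel_map->size); the macro and function names are hypothetical.

```c
#include <assert.h>

#define MiB (1024ULL * 1024ULL)

/* Sizes quoted in the discussion; the names are hypothetical. */
#define KERNEL_VM_TOTAL (192 * MiB)   /* kernel_map including submaps */
#define KMEM_MAP_OLD    (128 * MiB)
#define KMEM_MAP_NEW    (96 * MiB)
#define DEVICE_IO_MAP   (16 * MiB)
#define KERNEL_SPACE    (2048 * MiB)  /* "the kernel has exactly 2G" */
#define KERNEL_MAP_USED 2110210048ULL /* printed kernel_map->size */

/* Room left in kernel_map once both submaps are carved out of it. */
static unsigned long long
room_outside_submaps(unsigned long long kmem_map_size)
{
    assert(kmem_map_size + DEVICE_IO_MAP <= KERNEL_VM_TOTAL);
    return KERNEL_VM_TOTAL - kmem_map_size - DEVICE_IO_MAP;
}

/* Space still available, taking the printed size as the amount in use. */
static unsigned long long
kernel_map_free(void)
{
    return KERNEL_SPACE - KERNEL_MAP_USED;
}
```

Under these assumptions, a 128M kmem_map leaves 48M outside the submaps, shrinking it to 96M gains 32M, and the printed size leaves 37273600 bytes (about 35.5 MiB) free, which the discussion rounds to 36M.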


# IRC, freenode, #hurd, 2013-04-18

    <braunr> oh nice, i've found a big scalability issue with my slab allocator
    <braunr> it shouldn't affect gnumach much though


## IRC, freenode, #hurd, 2013-04-19

    <ArneBab> braunr: is it fixable?
    <braunr> yes
    <braunr> well, i'll do it in x15 for a start
    <braunr> again, i don't think gnumach is much affected
    <braunr> it's a scalability issue
    <braunr> when millions of objects are in use
    <braunr> gnumach rarely has more than a few hundred thousand
    <braunr> it's also related to heavy multithreading/smp
    <braunr> and by multithreading, i also mean preemption
    <braunr> gnumach isn't preemptible and uniprocessor
    <braunr> if the resulting diff is clean enough, i'll push it to gnumach
      though :)


### IRC, freenode, #hurd, 2013-04-21

    <braunr> ArneBab_: i fixed the scalability problems btw


## IRC, freenode, #hurd, 2013-04-20

    <braunr> well, there is also a locking error in the slab allocator,
      although not a problem for a non preemptible kernel like gnumach
    <braunr> non preemptible / uniprocessor

## IRC, freenode, #hurd, 2016-12-29

    <braunr> i've identified a fundamental flaw with the default pager
    <braunr> and actually, with mach in general i suppose
    <braunr> i assumed that it was necessary to trust the server only
    <braunr> that a server didn't need to trust its client
    <braunr> but mach messages carry memory that is potentially mapped from
      unprivileged pagers
    <braunr> which means faulting on that memory effectively makes the
      faulting process a client to the unprivileged pager
    <braunr> and that's something that can happen to the default pager during
      heavy memory pressure
    <braunr> in which case it deadlocks on itself because the copyout hangs on
      a fault, waiting for the unprivileged pager to provide the data
    <braunr> (which it can't because of heavy memory pressure and because it's
      unprivileged, it's blocked, waiting until allocations resume)
    <braunr> the pageout daemon will keep paging out to the default pager in
      the hope those pages get freed
    <braunr> but sending to the default pager is now impossible because its
      map is locked on the never-ending fault