author    Samuel Thibault <samuel.thibault@ens-lyon.org>    2015-02-18 00:58:35 +0100
committer Samuel Thibault <samuel.thibault@ens-lyon.org>    2015-02-18 00:58:35 +0100
commit    49a086299e047b18280457b654790ef4a2e5abfa (patch)
tree      c2b29e0734d560ce4f58c6945390650b5cac8a1b /open_issues/resource_management_problems
parent    e2b3602ea241cd0f6bc3db88bf055bee459028b6 (diff)
Revert "rename open_issues.mdwn to service_solahart_jakarta_selatan__082122541663.mdwn"
This reverts commit 95878586ec7611791f4001a4ee17abf943fae3c1.
Diffstat (limited to 'open_issues/resource_management_problems')
-rw-r--r--  open_issues/resource_management_problems/configure_max_command_line_length.mdwn |  17
-rw-r--r--  open_issues/resource_management_problems/io_accounting.mdwn                     |  49
-rw-r--r--  open_issues/resource_management_problems/pagers.mdwn                            | 322
-rw-r--r--  open_issues/resource_management_problems/zalloc_panics.mdwn                     |  99
4 files changed, 487 insertions, 0 deletions
diff --git a/open_issues/resource_management_problems/configure_max_command_line_length.mdwn b/open_issues/resource_management_problems/configure_max_command_line_length.mdwn
new file mode 100644
index 00000000..6c0a0d99
--- /dev/null
+++ b/open_issues/resource_management_problems/configure_max_command_line_length.mdwn
@@ -0,0 +1,17 @@
+[[!meta copyright="Copyright © 2009 Free Software Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+[[!tag open_issue_porting]]
+
+ <terpstra> do the buildds also crash?
+ <youpi> sometimes
+ <youpi> usually when a configure script tries to find out how large a
+ command line can be
+ <youpi> (thus eating all memory)
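+
+What such configure scripts do is, roughly, keep doubling the length of a
+test command line until exec rejects it; on a system with no fixed limit
+the probe just keeps allocating.  A minimal sketch of the idea (not the
+actual autoconf code; all names here are invented for illustration):
+
+    /* argmax-probe.c -- grow an argument until exec rejects it. */
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <string.h>
+    #include <unistd.h>
+    #include <sys/wait.h>
+
+    int
+    main (void)
+    {
+      size_t len = 1024;
+      for (;;)
+        {
+          char *arg = malloc (len + 1);
+          if (arg == NULL)
+            break;                      /* out of memory: the crash case */
+          memset (arg, 'x', len);
+          arg[len] = '\0';
+          pid_t pid = fork ();
+          if (pid == 0)
+            {
+              execl ("/bin/true", "true", arg, (char *) NULL);
+              _exit (126);              /* exec refused (E2BIG etc.) */
+            }
+          int status;
+          waitpid (pid, &status, 0);
+          free (arg);
+          if (WIFEXITED (status) && WEXITSTATUS (status) != 0)
+            break;                      /* found the limit */
+          printf ("%zu bytes OK\n", len);
+          len *= 2;
+        }
+      return 0;
+    }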
diff --git a/open_issues/resource_management_problems/io_accounting.mdwn b/open_issues/resource_management_problems/io_accounting.mdwn
new file mode 100644
index 00000000..113b965a
--- /dev/null
+++ b/open_issues/resource_management_problems/io_accounting.mdwn
@@ -0,0 +1,49 @@
+[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+IRC, freenode, #hurd, 2011-07-22
+
+ <braunr> an interesting question i've had in mind for a few weeks now is
+ I/O accounting
+ <braunr> what *is* I/O on a microkernel based system ?
+ <braunr> can any cross address space transfer be classified as I/O ?
+
+IRC, freenode, #hurd, 2011-07-29
+
+ < braunr> how does the hurd account I/O ?
+ < youpi> I don't think it does
+ < youpi> not an easy task, actually
+ < youpi> since gnumach has no idea about it
+ < braunr> yes
+ < braunr> another centralization issue
+ < braunr> does network access count as I/O on linux ?
+ < youpi> no
+ < braunr> not even nfs ?
+ < youpi> else you'd get 100% for servers :)
+ < braunr> right
+ < youpi> nfs goes through vfs first
+ < braunr> i'll rephrase my question
+ < youpi> I'd need to check but I believe it can check nfs
+ < braunr> does I/O accounting occur at the vfs level or block layer ?
+ < youpi> I don't know, but I believe vfs
+ < youpi> (at least that's how I'd do it)
+ < braunr> i don't have any more nfs box to test that :/
+ < braunr> personally i'd do it at the block layer :)
+ < youpi> well, both
+ < youpi> so e2fsck can show up too
+ < braunr> yes
+ < youpi> it's just a matter of ref counting
+ < youpi> apparently nfs doesn't account
+ < youpi> find . -printf "" doesn't show up in waitio
+ < braunr> good
+ < youpi> well, depends on the point of view
+ < youpi> as a user, you'd like to know whether your processes are stuck on
+ i/o (be it disk or net)
+ < braunr> this implies clearly defining what io is
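+
+On Linux, the accounting braunr and youpi discuss is done with per-task
+byte counters charged near the VFS boundary (exposed as rchar/wchar in
+/proc/<pid>/io).  A toy userspace model of the idea, with invented names:
+
+    /* io-account.c -- charge read() bytes to a per-process counter, the
+       way VFS-level accounting does, regardless of whether the data came
+       from disk, NFS or a pipe (the "what is I/O" question above). */
+    #include <stdio.h>
+    #include <unistd.h>
+
+    static unsigned long long rchar;    /* bytes read so far */
+
+    static ssize_t
+    accounted_read (int fd, void *buf, size_t n)
+    {
+      ssize_t ret = read (fd, buf, n);
+      if (ret > 0)
+        rchar += ret;
+      return ret;
+    }
+
+    int
+    main (void)
+    {
+      char buf[4096];
+      while (accounted_read (0, buf, sizeof buf) > 0)
+        continue;
+      fprintf (stderr, "rchar: %llu\n", rchar);
+      return 0;
+    }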
diff --git a/open_issues/resource_management_problems/pagers.mdwn b/open_issues/resource_management_problems/pagers.mdwn
new file mode 100644
index 00000000..4c36703c
--- /dev/null
+++ b/open_issues/resource_management_problems/pagers.mdwn
@@ -0,0 +1,322 @@
+[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+[[!tag open_issue_gnumach]]
+
+[[!toc]]
+
+
+# IRC, freenode, #hurd, 2011-09-14
+
+Coming from [[translators_set_up_by_untrusted_users]], 2011-09-14 discussion:
+
+ <slpz> antrik: I think a tunable option for preventing non-root users from
+ creating pagers and attaching translators could also be desirable
+ <antrik> slpz: why would you want to prevent creating pagers and attaching
+ translators?
+ <tschwinge> Preventing resource exhaustion, I guess.
+ <slpz> antrik: security and (as tschwinge says) to prevent a rogue pager
+ from exhausting the system.
+ <slpz> antrik: without the ability to use translators for non-root users,
+ Hurd can provide (almost) the same level of resource protection as
+ other *nixes
+
+See also: [[translators_set_up_by_untrusted_users]],
+[[hurd/translator/tmpfs/tmpfs_vs_defpager]].
+
+ <braunr> the hurd is about that though
+ <slpz> there should also be a limit on the number of outstanding requests
+ that a task can have, and some other easily traceable values
+ <braunr> port messages queues have limits
+ <antrik> slpz: anything can exhaust the system. there are much more basic
+ limits that are missing... and I don't see how translators or pagers are
+ special in that regard
+ <slpz> braunr: that's why I said tunable. If I don't share my computer
+ with untrusted users, I want full functionality. Otherwise, I can enable
+ that limitation
+ <slpz> braunr: but I think those limits are on reception
+ <braunr> that's a wrong solution
+ <slpz> antrik: because pagers are external memory objects, and those are
+ treated differently
+ <braunr> compared to what ?
+ <braunr> and yes, the limit is on the message queue, on reception
+ <braunr> why is that a problem ?
+ <slpz> antrik: forbidding the use of translator was for security, to avoid
+ the problem of traversing an untrusted FS
+ <slpz> braunr: compared to anonymous memory
+ <slpz> braunr: because if the limit is on reception, a task can easily do a
+ DoS against a server
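+
+The reception-side limit discussed here is the per-port receive queue
+limit, which the receiver can raise or lower.  A sketch of how a server
+might do that (interface as in GNU Mach, where the default limit is
+MACH_PORT_QLIMIT_DEFAULT; the wrapper name is invented):
+
+    /* Allow more requests to queue up on PORT before senders block. */
+    #include <mach.h>
+
+    kern_return_t
+    raise_qlimit (mach_port_t port, mach_port_msgcount_t qlimit)
+    {
+      return mach_port_set_qlimit (mach_task_self (), port, qlimit);
+    }
+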
+ <braunr> hm actually, the problem we have with swap handling is that
+ anonymous memory is handled in a very similar way to other objects
+ <slpz> braunr: I want to limit the number of outstanding (unprocessed
+ messages in queues) requests
+ <braunr> slpz: the solution isn't about forbidding the use of translators,
+ but changing common code (libc i guess) not to use them, they can still
+ run beside
+ <slpz> braunr: that's because, currently, the external page limit is not
+ enforced
+ <braunr> i'm also not sure about DoS attacks
+ <braunr> if i'm right, there is often one port for each managed object,
+ which usually exist per client
+ <slpz> braunr: yes, that could be an option too (for translators, not for
+ pagers)
+ <braunr> i don't see how pagers wouldn't be translators on the hurd
+ <slpz> braunr: all pagers are translators, but not all translators are
+ pagers ;-)
+ <braunr> so if it works for translators, it also works for pagers
+ <slpz> braunr: it would fix the security issue, but not the resource
+ exhaustion problem, which only affects pagers
+ <braunr> i just don't see a point in implementing resource limits before
+ even fixing other fundamental issues
+ <braunr> the only way to avoid resource exhaustion is resource limits
+ <antrik> slpz: just not following untrusted translators is much more useful
+ than forbidding them altogether
+ <braunr> and the main problem of mach is resource accounting
+ <braunr> so first, fix that, using the critique as a starting point
+
+[[hurd/critique]].
+
+ <slpz> braunr: i'm not saying that this should be implemented right now,
+ i'm just pointing out this possibility
+ <braunr> i think we're all mostly aware of it
+ <slpz> braunr: resource accounting, as it's expressed in the critique,
+ would be wonderful, but it's just too complex IMHO
+ <braunr> it requires carefully designed changes to the interface yes
+ <slpz> to the interface, to the internals, to user space tasks...
+ <braunr> the internals wouldn't be impacted that much
+ <braunr> user space tasks would mostly include hurd servers
+ <braunr> if the changes are centralized in libraries, it should be easy to
+ provide to the servers
+
+
+# IRC, freenode, #hurd, 2011-09-22
+
+ <slpz> antrik: I've also implemented a simple resource control on dirty
+ pages and changed pageout_scan to free external pages, and only touch
+ anonymous memory if it's really needed
+ <slpz> antrik: those combined make the system work better under heavy load
+ <slpz> antrik: 1.5 GB of RAM and another 1.5 GB of swap helps a lot, too
+ :-)
+ <antrik> hm... I'm not sure what these things mean exactly TBH... but I
+ wonder whether some of these could fix the performance degradation (and
+ ultimate crash) I described recently...
+
+[[/open_issues/default_pager]], [[system performance degradation
+(?)|performance/degradation]].
+
+ <antrik> care to explain them to a noob like me?
+ <slpz> probably not. During my tests, I've noticed that, at some points,
+ the system performance starts to degrade, and this doesn't change until
+ it's restarted
+ <slpz> but I wasn't able to create a test case to reproduce the bug...
+ <slpz> antrik: Sure. First, I've changed GNU Mach to:
+ <slpz> - Classify all pages from data_supply as external, and count them
+ in vm_page_external_count (previously, this variable was always zero)
+
+[[/open_issues/mach_vm_pageout]]
+
+ <slpz> - Count all pages for which a data_unlock has been requested as
+ potentially dirty pages
+ <antrik> there is one important bit I forgot to mention in my recent
+ report: one "reliable" way to cause growing swap usage is simply
+ installing a lot of debian packages (e.g. running an apt-get upgrade)
+ <antrik> some other kinds of I/O also seem to have such an effect, but I
+ wasn't able to pinpoint specific situations
+ <slpz> - Establish a limit on how many potentially dirty pages are
+ allowed. If it's reached, a notification (right now it's just a bogus
+ m_o_data_unlock, to avoid implementing a new RPC) is sent to the pager
+ which has generated the page fault
+ <slpz> - Establish a hard limit on those dirty pages. If it's reached,
+ threads asking for a data_unlock are blocked until someone cleans some
+ pages. This should be improved with a forced pageout, if needed.
+ <slpz> - And finally, in vm_pageout_scan, run over the inactive queue
+ searching for clean, external pages, freeing them. If it's not possible
+ to free enough pages, or if vm_page_external_count is less than 10% of
+ system's memory, the "normal" pageout is used.
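+
+In outline, the modified scan makes a first pass that reclaims only clean
+external pages, and falls back to the normal pageout when that pass does
+not free enough, or when external pages are already scarce.  A runnable
+toy model of the policy (data structures invented for illustration; the
+real code lives in gnumach's vm/vm_pageout.c):
+
+    /* pageout-model.c -- "free clean external pages first" policy. */
+    #include <stdbool.h>
+    #include <stdio.h>
+
+    struct page { bool external; bool dirty; bool free; };
+
+    /* First pass over the inactive queue Q: free clean external pages.
+       Returns how many of TARGET pages are still missing; a positive
+       result means falling back to the normal pageout. */
+    static int
+    scan_inactive (struct page *q, int n, int target)
+    {
+      for (int i = 0; i < n && target > 0; i++)
+        if (q[i].external && !q[i].dirty && !q[i].free)
+          {
+            q[i].free = true;           /* cheap: no writeback needed */
+            target--;
+          }
+      return target;
+    }
+
+    int
+    main (void)
+    {
+      struct page q[] = { { true, false, false }, { false, false, false },
+                          { true, true,  false }, { true, false, false } };
+      int missing = scan_inactive (q, 4, 3);
+      printf ("left for normal pageout: %d\n", missing);
+      return 0;
+    }
+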
+ <slpz> I need to clean up things a little, but I want to send a preliminary
+ patch to bug-hurd ASAP, to have more people testing it.
+ <slpz> antrik: Do you think that performance degradation can be related
+ to the number of threads of your ext2fs translators?
+ <antrik> slpz: hm... I didn't watch that recently; but in the past, I
+ observe that the thread count is pretty constant after it reaches
+ something like 14000 on heavy load...
+ <antrik> err... wait, 14000 was ports :-)
+ <antrik> I doubt my system would survive 14000 threads ;-)
+ <antrik> don't remember thread count... I guess I should start watching
+ this again
+ <slpz> antrik: I was thinking that 14000 threads sound like a lot :-)
+ <slpz> what I know for sure, is that when operating with large files, the
+ deactivation of all pages of the memory object which is done after every
+ operation really hurts performance
+ <antrik> right now my root FS has 5100 ports and a mere 71 threads... but
+ then, it's almost freshly booted :-)
+ <slpz> that's why I've just commented out that operation in my code,
+ since it's not really needed anymore :-)
+ <slpz> anyway, after submitting all my pending mails to bug-hurd, I'll try
+ to hunt that bug. Sounds funny.
+ <antrik> regarding your explanation, I'm still trying to wrap my head
+ around some of the details. I must admit that I don't remember what
+ data_unlock does... or maybe I never fully understood it
+ <antrik> the limit on dirty pages is global?
+ <slpz> yes, right now it's global
+ <marcusb> I try to find the old discussion of the thread storm stuff
+ <marcusb> there was some concern about deadlocks
+ <slpz> marcusb: yes, because we were talking about putting a static limit
+ on the server threads of a translator
+ <slpz> marcusb: and that was wrong (my fault, I was even dumber back then
+ :-P)
+ <marcusb> oh boy digging in old mail is no fun. first I see mistakes in my
+ english. then I see quite complicated pager stuff I don't ever remember
+ touching. but there is a patch, and it has my name on it
+ <marcusb> I think I lost a couple of the early years of my hurd hacking :)
+ <antrik> hm... I reread the chapter on locking, and it's still above me :-(
+ <marcusb> not sure what you are talking about, but if there are any
+ specific questions...
+ <antrik> marcusb: external pager interface
+
+[[microkernel/mach/external_pager_mechanism]].
+
+ <marcusb> uuuuh ;)
+ <antrik> memory_object_lock_request(), memory_object_lock_completed(),
+ memory_object_data_unlock()
+ <marcusb> is that from the mach manual?
+ <antrik> yes
+ <antrik> I didn't really understand that part when I first read it a couple
+ of years ago, and I still don't understand it now :-(
+ <marcusb> I am sure I didn't understand it either
+ <marcusb> and maybe I missed my window :)
+ <marcusb> let's see
+ <antrik> hehe
+ <antrik> slpz: what exactly do you mean by "the pager which has generated
+ the page fault"?
+ <antrik> marcusb: essentially I'm trying to understand the explanation of
+ the changes slpz did, but there are several bits totally obscure to me
+ :-(
+ <slpz> antrik: when an I/O operation is requested to ext2fs, it maps the
+ object in question to its own space, and then memcpy's from/to there
+ <slpz> antrik: so the translator (which is also a pager) is the one who
+ generates the page fault
+ <marcusb> yeah
+ <marcusb> antrik: it's important to understand which messages are sent by
+ the kernel to the manager and which are sent the other way
+ <marcusb> if the dest port is memory_object_t, that indicates a msg from
+ kernel to manager. if it is memory_object_control_t, it's a msg from
+ manager to kernel
+ <slpz> antrik: m_o_lock_request is used by the pager to "settle" the
+ status of a memory object, m_o_lock_completed is the answer from the
+ kernel when the lock has been completed (only if the client has requested
+ to be notified), and m_o_data_unlock is a request from the kernel to
+ change the level of protection for a page (it's called from vm_fault.c)
+ <marcusb> slpz: but it's not pagers generating page faults, but users of
+ the memory object on the other side
+ <antrik> marcusb: well, I think the direction is clear to me... but the
+ purpose not really :-)
+ <marcusb> ie a client that mapped a file
+ <slpz> antrik: in ext2fs, all pages are initially provided to the kernel
+ (via data_supply) write protected. When a write operation is done over
+ one of those pages, a page fault is generated, which sends a
+ m_o_data_unlock to the pager, which answers (if convenient) with a
+ page_lock decreasing the protection level
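+
+That cycle, seen from the pager side (signatures as in Mach's external
+memory object interface; a sketch, with error handling omitted):
+
+    /* 1. Supply the page read-only: the lock value VM_PROT_WRITE means
+       "write permission withheld". */
+    memory_object_data_supply (memory_control, offset, data, vm_page_size,
+                               VM_PROT_WRITE /* lock */,
+                               FALSE /* precious */, MACH_PORT_NULL);
+
+    /* 2. On the first write, the kernel faults (vm_fault.c) and asks the
+       pager to lower the protection by sending it
+       memory_object_data_unlock.  The pager grants the request by
+       resetting the lock on the page: */
+    memory_object_lock_request (memory_control, offset, vm_page_size,
+                                MEMORY_OBJECT_RETURN_NONE,
+                                FALSE /* don't flush */,
+                                VM_PROT_NONE /* new lock */,
+                                MACH_PORT_NULL /* no reply needed */);
+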
+ <marcusb> antrik: one use of lock_request is when you want to shut down
+ cleanly and want to get the dirty pages written back to you from the
+ kernel.
+ <marcusb> antrik: the other thing may be COW strategies
+ <slpz> marcusb: well, pagers and clients are in the same task for most
+ translators, like ext2fs
+ <marcusb> slpz: oh.
+ <slpz> marcusb: but yes, a read operation in a mmap'ed file would trigger
+ the fault in a client user task
+ <marcusb> slpz: I think I forgot everything about pagers :)
+ <slpz> marcusb: pager-memcpy.c is the key :-)
+ <marcusb> slpz: what becomes of the fault then? the kernel sees it's a
+ mapped memory object. will it then talk to the manager or to a pager?
+ <antrik> slpz: the translator causes the faults itself when it handles
+ io_read()/io_write() requests I suppose, as opposed to clients accessing
+ mmap()ed objects which then generate the faults?...
+ <antrik> ah, that's actually what you already said above :-)
+ <slpz> marcusb: I'm not sure what you mean by "manager"...
+ <marcusb> manager == memory object
+ <marcusb> mh
+ <slpz> marcusb: for all external objects, it will ask their current
+ pager
+ <marcusb> slpz: I think I am missing a couple of details, so nevermind.
+ It's starting to come back to me, but I am a bit afraid of that ;)
+ <marcusb> what I love about the Hurd is how damn readable the code is
+ <marcusb> considering it's an object system, it's so much nicer to read
+ than gtk stuff
+ <slpz> when you get the big picture, it's actually somewhat fun to see how
+ data moves around just to fulfill a simple read()
+ <marcusb> you should make a diagram!
+ <marcusb> bonus point for animated video ;)
+
+[[hurd/IO_path]].
+
+ <slpz> marcusb: heh, take a look at the hurd specific parts of glibc... I
+ cry in pain every time I do that...
+ <marcusb> slpz: oh yeah, rdwr-internal.
+ <marcusb> oh man
+ <marcusb> slpz: funny thing, I just looked at them the other day because of
+ the security issue
+ <slpz> marcusb: I think there was one, maybe a slide from someone's
+ presentation...
+ <marcusb> I think I was always confused about the pager/memobj/kernel
+ interactions
+ <slpz> marcusb: I'm barely able to read Roland's glibc code. I think it's
+ out of my reach.
+ <antrik> marcusb: I think part of the problem is confusing terminology
+ <marcusb> it's good that you are instrumenting the mach kernel to see
+ what's actually going on in there. it was a black box for me, but neal
+ took a peek and got a much better understanding of the performance issues
+ than I ever did
+ <antrik> when talking about "pager", we usually mean the process doing the
+ paging; but in mach terminology this actually seems to be the "manager",
+ while a "pager" is an individual object in the manager process... or
+ something like that ;-)
+ <marcusb> antrik: I just never took a look at the big picture. I look at
+ the parts
+ <marcusb> I knew the tail, ears, and legs of the elephant.
+ <marcusb> it's a lot of code for a beginner
+ <antrik> I never understood the distinction between "pager" and "memory
+ object" though...
+ <antrik> maybe "pager" refers to the object in the external pager, while
+ "memory object" is the part managed in Mach itself?...
+ <marcusb> memory object is a real object, to which you can send messages.
+ it's implemented in the server
+ <antrik> hm... maybe it's the other way around then ;-)
+ <marcusb> there is also the default pager
+ <marcusb> I think the pager is just another name for the process that
+ serves the memory object (default pager == memory object for anonymous
+ memory == swap)
+ <marcusb> but!
+ <marcusb> there is also libpager
+
+[[hurd/libpager]]
+
+ <marcusb> and that's a more complicated beast
+ <antrik> actually, the correct term seems to be "default memory manager"...
+ <marcusb> yeah
+ <marcusb> from mach's pov
+ <marcusb> we always called it default pager in the Hurd
+ <antrik> marcusb: problem is that "pager" is sometimes used in the Mach
+ documentation to refer to memory object ports IIRC
+ <marcusb> isn't it defpager executable?
+ <marcusb> could be
+ <marcusb> it's the same thing, really
+ <antrik> indeed, the program implementing the default memory manager is
+ called "default pager"... so the terminology is really inconsistent
+ <marcusb> the hurd's pager library is a high level abstraction for mach's
+ external memory object interface.
+ <marcusb> i wouldn't worry about it too much
+ <antrik> I never looked at libpager
+ <marcusb> you should!
+ <marcusb> it's an important beast
+ <antrik> never seemed relevant to anything I did so far...
+ <antrik> though maybe it would help understanding
+ <marcusb> it's related to what you are looking now :)
diff --git a/open_issues/resource_management_problems/zalloc_panics.mdwn b/open_issues/resource_management_problems/zalloc_panics.mdwn
new file mode 100644
index 00000000..9c29b07c
--- /dev/null
+++ b/open_issues/resource_management_problems/zalloc_panics.mdwn
@@ -0,0 +1,99 @@
+[[!meta copyright="Copyright © 2005, 2007, 2008, 2010, 2012 Free Software
+Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+[[!tag open_issue_gnumach open_issue_hurd]]
+
+Written by antrik / Olaf Buddenhagen, last updated: 12 Apr 2007.
+
+The Hurd sometimes crashes with a kernel panic saying something like: "Panic: zalloc failed: zone map exhausted".
+
+These panics are generally caused by some kind of kernel resource exhaustion, but there are several different reasons for that.
+
+It used to happen very often under heavy disk load (like large compile jobs), or in a reproducible test case by opening a large number of ports to /dev/null and then closing them all very quickly. The reason for this particular problem has been identified a while back: The multithreaded Hurd servers create a new worker thread whenever a new request (RPC) comes in while all existing threads are busy. When the server is hammered with lots of requests -- which happens both under heavy disk load, and when quickly closing many ports to one server -- it will create an absurd number of threads, causing the resource exhaustion.
+
+The Debian hurd package contains a patch by k0ro (Sergio Lopez), which fixes this by limiting the amount of created threads in a rather simplistic but very effective manner. This patch however hasn't been included in upstream CVS so far. A more elegant solution, suitable for upstream inclusion, would be desirable.
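+
+A toy model of that kind of throttling (names invented here; the actual
+patch modifies the Hurd's server-thread creation logic):
+
+    /* throttle.c -- spawn a worker thread per incoming request, but
+       never more than MAX_THREADS; excess requests wait for a slot. */
+    #include <pthread.h>
+
+    #define MAX_THREADS 64
+
+    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+    static pthread_cond_t slot = PTHREAD_COND_INITIALIZER;
+    static int nthreads;
+
+    static void *
+    worker (void *req)
+    {
+      /* ... handle the RPC in REQ ... */
+      pthread_mutex_lock (&lock);
+      nthreads--;
+      pthread_cond_signal (&slot);
+      pthread_mutex_unlock (&lock);
+      return NULL;
+    }
+
+    void
+    dispatch (void *req)
+    {
+      pthread_t t;
+      pthread_mutex_lock (&lock);
+      while (nthreads >= MAX_THREADS)       /* the cap: block instead   */
+        pthread_cond_wait (&slot, &lock);   /* of spawning unboundedly  */
+      nthreads++;
+      pthread_mutex_unlock (&lock);
+      pthread_create (&t, NULL, worker, req);
+      pthread_detach (t);
+    }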
+
+Some panics still seem to happen in very specific situations, like the one described at <https://savannah.gnu.org/bugs/?19426> . These are probably the result of bugs that cause port leaks, accidental fork bombs, or similar problems.
+
+In principle, resource exhaustion can also happen by normal use, though this is rather unlikely in the absence of bugs or malicious programs. Nevertheless, all these problems could be avoided (or limited in effect) by introducing some limits on the number of processes per user, the number of threads and ports per process/user, etc.
+
+Trying to track down causes for the panics, I got some interesting results. (UPDATE: Many of my original observations were clearly related to the server thread explosion problem. To avoid confusion, I now removed these, as this is no longer an open issue.)
+
+* It all started with someone (probably azeem) mentioning that building some package always crashes Hurd at the same stage of the Debian packaging process (UPDATE: Almost all of these panics when building packages were a result of the thread explosion and don't happen anymore.)
+* Someone (maybe he himself) pointed out that this stage is characterized by many processes being quickly created and destroyed
+* Someone else (probably hde) started some experimenting, to get a reproducible test case
+* He realized that just starting and killing five child processes in quick succession suffices to kill some Hurd systems
+* I tried to confirm this, but it turned out my system is more robust
+
+As I could never reproduce the problem with a small number of quickly killed processes, I can't say whether this problem still exists. While I could reproduce such an effect with first opening and then very quickly closing many ports (which is more or less what happens when quickly killing many processes), I needed really large numbers of processes/ports for that. The thread throttling patch fixed my test case; but it seems unlikely that killing only five processes could have caused a thread explosion, so maybe hde's observation was a different problem really...
+
+I started various other experiments with creating child processes (fork bombs), resulting in a number of interesting observations (a minimal test program of this kind is sketched after the list):
+
+* Just forking a large number of processes crashes the Hurd reliably (not surprising)
+* The number of processes at which the panic occurs is very constant (typically +-2) under stable conditions, as long as forking doesn't happen too fast
+* The exact number depends on various conditions:
+ * Run directly from the Mach console, it's around 1040 on my machine (given enough RAM); however, it drops to 940 when started through a raw ssh session, and to 990 when run under screen through ssh (TODO: check number of ports open per process depending on how it is started) UPDATE: In a later test, I got somewhat larger numbers (don't remember exactly, but well above 1000), but still very constant between successive runs. Not sure what effected this change.
+ * It doesn't depend on whether normal user or root
+ * With only 128 MiB of RAM, the numbers drop slightly (like 100 less or so); no further change between 256 and 384 MiB
+ * Lowering zone\_map\_size in mach/kern/zalloc.c reduces the numbers (quite exactly halving them when going from 8 MiB to 4 MiB)
+ * There seems to be some saturation near 16 MiB however: The difference between 8 MiB and 16 MiB is significantly smaller
+ * Also, with 8 MiB or 4 MiB, the difference between console/ssh/screen becomes much more apparent (500 vs. 800, 250 vs. 400)
+ * With more than 16 MiB, Mach doesn't even boot
+* Creating the processes very fast results in a sooner and less predictable crash (TODO: Check whether this is still the case with thread throttling?)
+* Creating processes recursively (fork only one child which forks the next one etc.) results in faster crash
+* rpcinfo shows that child processes have more ports open by default, which is very likely the reason for the above observation
+* Opening many ports from a few processes doesn't usually cause a system crash; there are only lots of open() failures and translator faults once some limit is reached... Seems the zalloc-full condition is better caught on open() than on fork() (TODO: investigate this further, with different memory sizes, different zone\_map\_size, different kinds of resources using zalloc etc.)
+* After opening/leaking lots of ports to /dev/null (32768 it seems), the NULL translator somehow becomes dysfunctional, and a new instance is started
+
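+The kind of test program behind these numbers is trivial; a sketch (the
+scripts actually used back then are not preserved here):
+
+    /* forktest.c -- fork children that just sleep, and report how many
+       could be created before fork() fails (or Mach panics). */
+    #include <stdio.h>
+    #include <unistd.h>
+
+    int
+    main (void)
+    {
+      unsigned n = 0;
+      for (;;)
+        {
+          pid_t pid = fork ();
+          if (pid < 0)
+            break;              /* out of resources -- if we get this far */
+          if (pid == 0)
+            {
+              pause ();         /* child: just hold its task and ports */
+              _exit (0);
+            }
+          if (++n % 100 == 0)
+            fprintf (stderr, "%u processes\n", n);
+        }
+      fprintf (stderr, "fork failed after %u processes\n", n);
+      return 0;
+    }
+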
+While most of these observations clearly show an exhaustion of kernel memory, which is not surprising, some of the oddities seem to indicate problems that might deserve further investigation.
+
+
+# IRC, freenode, #hurd, 2012-04-01
+
+ <mel__> antrik: i just found
+ http://www.gnu.org/software/hurd/open_issues/resource_management_problems/zalloc_panics.html
+ -- that is from 2007. is this still the current status?
+ <youpi> mel__: probably
+ <mcsim> mel__: gnumach has no more zalloc allocator, so I doubt it could
+ be a problem.
+
+[[gnumach_memory_management]].
+
+ <youpi> mcsim: but it still has an allocator
+ <youpi> which can run out of resources
+ <mcsim> AFAIR, now there is no such limit.
+ <youpi> err, there is
+ <youpi> the size of your RAM :)
+ <mcsim> In zalloc, the appearance of this message didn't depend on the
+ available amount of free ram.
+ <youpi> then update the description, but I'm still getting allocation
+ errors when userland does crazy things like creating millions of tasks
+ <mcsim> At least it could appear when there still was free memory
+ <youpi> and that's not surprising
+ <youpi> sure, I know that *some* limits have been removed, but there
+ weren't so many, and I have seen cases where it's simply mach running out
+ of memory
+ <youpi> also, we have a limited amount of virtual addressing space
+ <antrik> mel__: this writeup is outdated in several regards. *some* of the
+ observations might still be relevant, but nothing that seems
+ particularly important
+ <antrik> the zalloc panics have pretty much disappeared after the default
+ zalloc zone size has been considerably extended (which was not possible
+ before because of some bug)
+ <mel__> i see
+ <antrik> but as mcsim pointed out, with the new allocator not relying on a
+ fixed-sized zalloc zone at all, they are even less likely, and should
+ happen only if all memory is exhausted
+ <antrik> I guess this outdated report can just be dropped
+ <mcsim> I think that now it is rather a problem of the absence of an
+ OOM-killer or resource manager
+ <antrik> mcsim: right :-)
+ <antrik> (and we have separate articles about that)