From 817df620bedae9c1daa0497f64a901d51e5bd2dd Mon Sep 17 00:00:00 2001
From: Thomas Schwinge
Date: Sat, 26 Mar 2011 00:52:08 +0100
Subject: Some more IRC discussions.

---
 open_issues/anatomy_of_a_hurd_system.mdwn          |  73 +++++++++
 open_issues/ext2fs_page_cache_swapping_leak.mdwn   |  23 +++
 open_issues/pfinet_vs_system_time_changes.mdwn     |  42 ++++++
 ...dez-vous_leading_to_duplicate_port_destroy.mdwn | 163 +++++++++++++++++++++
 open_issues/sudo_date_crash.mdwn                   |  16 --
 open_issues/unit_testing.mdwn                      |  20 +++
 6 files changed, 321 insertions(+), 16 deletions(-)
 create mode 100644 open_issues/anatomy_of_a_hurd_system.mdwn
 create mode 100644 open_issues/ext2fs_page_cache_swapping_leak.mdwn
 create mode 100644 open_issues/pfinet_vs_system_time_changes.mdwn
 create mode 100644 open_issues/rpc_to_self_with_rendez-vous_leading_to_duplicate_port_destroy.mdwn
 delete mode 100644 open_issues/sudo_date_crash.mdwn

diff --git a/open_issues/anatomy_of_a_hurd_system.mdwn b/open_issues/anatomy_of_a_hurd_system.mdwn
new file mode 100644
index 00000000..e1d5c9d8
--- /dev/null
+++ b/open_issues/anatomy_of_a_hurd_system.mdwn
@@ -0,0 +1,73 @@
+[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+[[!taglink open_issue_documentation]]
+
+A bunch of this should also be covered in other (introductory) material,
+like Bushnell's Hurd paper. All this should be unified and streamlined.
+
+IRC, freenode, #hurd, 2011-03-08
+
+    I've a question on what are the "units" in the hurd project, if
+    you were to divide them into units if they aren't, and what are the
+    dependency relations between those units (roughly, nothing too pedantic
+    for now)
+    there is GNU Mach (the microkernel); there are the server
+    libraries in the Hurd package; there are the actual servers in the same;
+    and there is the POSIX implementation layer in glibc
+    relations are a bit tricky
+    Mach is the base layer which implements IPC and memory management
+    hmm I'll probably allocate time for dependency graph generation,
+    in the worst case
+    on top of this, the Hurd servers, using the server libraries,
+    implement various aspects of the system functionality
+    client programs use libc calls to use the servers
+    (servers also use libc to communicate with other servers and/or
+    Mach though)
+    so every server depends solely on mach, and no other server?
+    s/mach/mach and/or libc/
+    I think these things should be pretty clear once you are somewhat
+    familiar with the Hurd architecture... nothing really tricky there
+    no
+    servers often depend on other servers for certain functionality
+
+---
+
+IRC, freenode, #hurd, 2011-03-12
+
+    when mach first starts up, does it have some basic i/o or fs
+    functionality built into it to start up the initial hurd translators?
+    I/O is presently completely in Mach
+    filesystems are in userspace
+    the root filesystem and exec server are loaded by grub
+    o I see
+    so in order to start hurd, you would have to start mach and
+    simultaneously start the root filesystem and exec server?
+    not exactly
+    GRUB loads all three, and then starts Mach. Mach in turn starts
+    the servers according to the multiboot information passed from GRUB
+    ok, so does GRUB load them into ram?
+    I'm trying to figure out in my mind how hurd is initially started
+    up from a low-level pov
+    yes, as I said, GRUB loads them
+    ok, thanks antrik...I'm new to the idea of microkernels, but a
+    veteran of monolithic kernels
+    although I just learned that windows nt is a hybrid kernel which I
+    never knew!
+    note there's a /hurd/ext2fs.static
+    I believe that's what is used initially... right?
+    yes
+    loading the shared libraries in addition to the actual server
+    would be unwieldy
+    so the root FS server is linked statically instead
+    what does the root FS server do?
+    well, it serves the root FS ;-)
+    it also does some bootstrapping work during startup, to bring the
+    rest of the system up
diff --git a/open_issues/ext2fs_page_cache_swapping_leak.mdwn b/open_issues/ext2fs_page_cache_swapping_leak.mdwn
new file mode 100644
index 00000000..0ace5cd3
--- /dev/null
+++ b/open_issues/ext2fs_page_cache_swapping_leak.mdwn
@@ -0,0 +1,23 @@
+[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+[[!tag open_issue_hurd]]
+
+IRC, OFTC, #debian-hurd, 2011-03-24
+
+    I still believe we have an ext2fs page cache swapping leak, however
+    as the 1.8GiB swap was full, yet the ld process was only 1.5GiB big
+    a leak at swapping time, you mean?
+    I mean the ext2fs page cache being swapped out instead of simply
+    dropped
+    ah
+    so the swap tends to accumulate unuseful stuff, i see
+    yes
+    the disk content, basically :)
diff --git a/open_issues/pfinet_vs_system_time_changes.mdwn b/open_issues/pfinet_vs_system_time_changes.mdwn
new file mode 100644
index 00000000..a9e1e242
--- /dev/null
+++ b/open_issues/pfinet_vs_system_time_changes.mdwn
@@ -0,0 +1,42 @@
+[[!meta copyright="Copyright © 2010, 2011 Free Software Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+[[!tag open_issue_hurd]]
+
+IRC, unknown channel, unknown date.
+
+    I did a sudo date...
+    and the machine hangs
+
+This was very likely a misdiagnosis:
+
+IRC, freenode, #hurd, 2011-03-25
+
+    antrik: I suspect it's some timing stuff in pfinet that perhaps
+    uses absolute time, and somehow wildly gets confused?
+    tschwinge: BTW, pfinet doesn't actually die I think -- it just
+    drops open connections...
+    perhaps it thinks they timed out
+    antrik: Isn't the translator restarted instead?
+    don't think so
+    when pfinet actually dies, I also lose the NFS mounts, which
+    doesn't happen in this case
+    hehe "... and the machine hangs"
+    he didn't bother to check that the machine is perfectly fine, only
+    the SSH connection got dropped
+    Ah, I see. So it perhaps indeed simply closes TCP
+    connections that have been without data for ``too long''?
+    yeah, that's my guess
+    my clock is speeding, so ntpdate sets it in the past
+    perhaps there is some math that concludes the connection has been
+    inactive for -200 seconds, which (unsigned) is more than any timeout :-)
+    (The other way round, you might likely get some integer
+    wrap-around, and thus the same result.)
+    Yes.
diff --git a/open_issues/rpc_to_self_with_rendez-vous_leading_to_duplicate_port_destroy.mdwn b/open_issues/rpc_to_self_with_rendez-vous_leading_to_duplicate_port_destroy.mdwn
new file mode 100644
index 00000000..9db92250
--- /dev/null
+++ b/open_issues/rpc_to_self_with_rendez-vous_leading_to_duplicate_port_destroy.mdwn
@@ -0,0 +1,163 @@
+[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+[[!tag open_issue_hurd]]
+
+[RPC to self with rendez-vous leading to duplicate port
+destroy](http://lists.gnu.org/archive/html/bug-hurd/2011-03/msg00045.html)
+
+IRC, freenode, #hurd, 2011-03-14
+
+    youpi: I wonder, why does the root FS call diskfs_S_dir_lookup()
+    at all?...
+    errr, because a client asked for it?
+    (problem with RPCs is you can't easily know where they come from :)
+    )
+    (especially when it's the root fs...)
+    ah, it's about a client request... didn't see that
+    well, I just said "is called", yes
+    I do not really understand though why it tries to reauthenticate
+    against itself...
+    I fear my memory of the lookup mechanism grew a bit dim
+    see the source
+    it's about a translated entry
+    (and I never fully understood some aspects anyways...)
+    it needs to start the translated entry as another user, possibly
+    yes, but a translated entry normally would be served by *another*
+    process?...
+    sure, but ext2fs has to prepare it
+    thus reauthenticate to prepare the correct set of rights
+    prepare what?
+    rights
+    so the process is not root, doesn't have / opened as root, etc.
+    rights for what?
+    err, about everything
+    IIRC the reauthentication is done by the parent FS on the port to
+    the *translated* node
+    and the translated node should be a different process?...
+    that's not what I read in the source
+    fshelp_fetch_root
+    ports[INIT_PORT_CRDIR] = reauth (getcrdir ());
+    here, getcrdir() returns ext2fs itself
+    well, perhaps the issue is that I have no idea what
+    fshelp_fetch_root() does, nor why it is called here...
+    it notably starts the translator that dir_lookup is looking at, if
+    needed
+    possibly as a different user, thus reauthentication of CRDIR
+    so this is about a port that is passed to the translator being
+    started?
+    no
+    well, depends on what you mean by "port"
+    it's about reauthenticating a port to be passed to the translator
+    being started
+    and for that a rendez-vous port is needed for the reauthentication
+    and that's the one at stake
+    yeah, I meant the port that is reauthenticated
+    what is CRDIR?
+    current root dir ...
+    so the parent translator passes its own root dir to the child
+    translator; and the issue is that for the root FS the root dir points to
+    the root FS itself...
+    yes
+    OK, that makes sense
+    (but that's only one example, rgrep mach_port_destroy hurd/ shows
+    other potential issues)
+    well, that's actually what I wanted to mention next... why is the
+    rendez-vous port destroyed, instead of just deallocating the port right
+    and letting reference counting to it's thing?...
+    do its thing
+    "just to make sure" I guess
+    it's pretty obvious that this will cause trouble for any RPC
+    referencing itself...
+    well, follow-up with that on the list
+    with roland/tb in CC
+    only they would know any real reason for destroy
+    btw, if you knew how we could make _hurd_select()'s raw __mach_msg
+    call be interruptible by signals, that'll permit to fix sudo
+    (damn, I need sleep, my tenses are all wrong)
+    BTW, does this cause any actual trouble?...
+    I don't know much about interruption... cfhammer might have a
+    better idea, he look into that stuff quite a bit AIUI
+    looked
+    (hehe, it's not only your tenses... guess there's something in the
+    ether ;-) )
+    it makes sudo, mailq, etc. fail sometimes
+    I mean the rendez-vous thing
+    that's it, yes
+    sudo etc. fail at least due to this
+    so these are two different problems that both affect sudo?
+    (rendez-vous and interruption I mean)
+    yes
+    with my patch the buildds have much fewer issues, but still some
+    (my interrupt-related patch)
+    I'm installing a s/destroy/deallocate/ version of ext2fs on the
+    buildds, we'll see how it behaves
+    (it fixes my testcase at least)
+    interrupt-related patch?
+    only thing interrupt-related I remember was the reauthentication
+    race...
+    that's what I mean
+    well, cfhammer investigated this in quite some depth, explaining
+    quite well why the race is only mitigated but still exists... problem is
+    that we didn't know how to fix it properly
+    because nobody seems to understand the cancellation code, except
+    perhaps for Roland and Thomas
+    (and I'm not even entirely sure about them :-) )
+    I think his findings and our conclusions are documented on the
+    ML...
+    by "much fewer issues", I mean that some of the symptoms have
+    disappeared, others haven't
+    BTW, couldn't the rendez-vous thing be worked around by simply
+    ignoring the errors from the failing deallocate?...
+    no, failing deallocates are actually dangerous
+    why?
+    since the name might have been reused for something else in the
+    meanwhile
+    that's the whole point of the warning I had added in the kernel
+    itself
+    I see
+    such things really deserve tracking, since they can have any kind
+    of consequence
+    does Mach try to reuse names quickly, rather than only after
+    wrapping around?...
+    it seems to
+    OK, then this is a serious problem indeed
+    (note: I rarely divine issues when there aren't actual frequent
+    symptoms :) )
+    well, the problem with the warning is that it only shows in the
+    cases that do *not* cause a problem... so it's hard to associate them
+    with any specific issues
+    well, most of the time the port is not reused quickly enough
+    so in most cases it shows up more often than it causes problems
+
+IRC, freenode, #hurd, 2011-03-14
+
+    ok, mach_port_deallocate actually can't be used
+    since mach_reply_port() returns a receive right, not a send right
+    * youpi guesses he will really have to manage to understand all that port
+    stuff completely
+    oh, right
+    youpi: hm... now I'm confused though. if one client holds a
+    receive right, the other client (or in this case the same process) should
+    have a send or send-once right -- these should *not* share the same name
+    in my understanding
+    destroying the receive right should turn the send right into a
+    dead name
+    so unless I'm missing something, the destroy shouldn't be a
+    problem, and there must be something else going wrong
+    hm... actually I'm probably wrong
+    yeah, definitely wrong. receive rights and "ordinary" send rights
+    share the name. only send-once rights are special
+    I wonder whether the problem could be worked around by using a
+    send-once right...
+    mach_port_mod_refs(mach_task_self(), name,
+    MACH_PORT_RIGHT_RECEIVE, -1) can be used to deallocate only the receive
+    right
+    oh, you already figured that out :-)
diff --git a/open_issues/sudo_date_crash.mdwn b/open_issues/sudo_date_crash.mdwn
deleted file mode 100644
index 53303abc..00000000
--- a/open_issues/sudo_date_crash.mdwn
+++ /dev/null
@@ -1,16 +0,0 @@
-[[!meta copyright="Copyright © 2010 Free Software Foundation, Inc."]]
-
-[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
-id="license" text="Permission is granted to copy, distribute and/or modify this
-document under the terms of the GNU Free Documentation License, Version 1.2 or
-any later version published by the Free Software Foundation; with no Invariant
-Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
-is included in the section entitled [[GNU Free Documentation
-License|/fdl]]."]]"""]]
-
-[[!tag open_issue_gnumach]]
-
-IRC, unknown channel, unknown date.
-
-    I did a sudo date...
-    and the machine hangs
diff --git a/open_issues/unit_testing.mdwn b/open_issues/unit_testing.mdwn
index a5ffe19d..feda3be4 100644
--- a/open_issues/unit_testing.mdwn
+++ b/open_issues/unit_testing.mdwn
@@ -320,3 +320,23 @@ freenode, #hurd channel, 2011-03-07:
     this, and just generally though that some sort of automated testing is
     needed, and thus started collecting ideas.
     antrik: You're of course invited to fix that.
+
+IRC, freenode, #hurd, 2011-03-08
+
+(After discussing the [[anatomy_of_a_hurd_system]].)
+
+    so that's what your question is actually about?
+    so what I would imagine is a set of only-this-server tests for
+    each server, and then we can have fun adding composite tests
+    thus making debugging the composite scenarios a bit less tricky
+    indeed
+    and if you were trying to pass a composite test, it would also
+    help knowing that you still didn't break the server-only test
+    there are so many different things that can be tested... the
+    summer will only suffice to dip into this really :-)
+    yeah, I'm designing my proposal to focus on 1) make/use a
+    testing framework that fits the Hurd case very well 2) write some tests
+    and docs on how to write good tests
+    well, doesn't have to be *one* framework... unit testing and
+    regression testing are quite different things, which can be covered by
+    different frameworks
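
Note on the pfinet discussion in this patch: the guess that an idle time of -200 seconds, read as unsigned, exceeds any timeout can be illustrated with a small C sketch. This is not pfinet code. The function names, the 32-bit seconds representation, and the guard are all assumptions made for illustration; whether pfinet's timers actually compute idle time this way is not established by the log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timeout checks, NOT taken from pfinet: every name here is
   illustrative.  Times are seconds since some epoch, stored unsigned. */

/* Naive version: if the clock was stepped backwards, now < last_activity,
   so the subtraction wraps around modulo 2^32 and the connection appears
   to have been idle for far longer than any timeout. */
static bool idle_expired_naive(uint32_t now, uint32_t last_activity,
                               uint32_t timeout)
{
    uint32_t idle = now - last_activity;   /* wraps when now < last_activity */
    return idle > timeout;
}

/* Guarded version: a clock that went backwards is treated as "no idle
   time has elapsed" instead of letting the difference wrap. */
static bool idle_expired_guarded(uint32_t now, uint32_t last_activity,
                                 uint32_t timeout)
{
    if (now < last_activity)
        return false;
    return now - last_activity > timeout;
}
```

With a last activity at second 1200, a clock set back to second 1000, and a 300 second timeout, the naive check reports the connection as expired while the guarded one does not. Basing such timers on a monotonic clock rather than wall-clock time would be another way to avoid the problem; which approach suits pfinet is left open here.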