IRC.

author: Thomas Schwinge <tschwinge@gnu.org> 2013-01-08 21:31:31 +0100
committer: Thomas Schwinge <tschwinge@gnu.org> 2013-01-08 21:31:31 +0100
commit: 51c95fc11727532e3b0d98c8470a6b60907a0680 (patch)
tree: 6a8dd54654398bb2b05ce3f7cbcdc211d1dfbfb7 /open_issues/select.mdwn
parent: 160d0597cb94d58f8ab273226b1f3830589a500b (diff)
1 files changed, 407 insertions, 1 deletions
diff --git a/open_issues/select.mdwn b/open_issues/select.mdwn
index 778af530..391509a9 100644
--- a/open_issues/select.mdwn
+++ b/open_issues/select.mdwn
@@ -1,4 +1,4 @@
-[[!meta copyright="Copyright © 2010, 2011, 2012 Free Software Foundation,
+[[!meta copyright="Copyright © 2010, 2011, 2012, 2013 Free Software Foundation,
 Inc."]]
 
 [[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
@@ -1631,6 +1631,412 @@ IRC, unknown channel, unknown date:
     <braunr> i'll try the non intrusive mode
 
 
+## IRC, freenode, #hurd, 2012-12-11
+
+    <gnu_srs1> braunr: What is the technical difference of having the delay at
+      io_select compared to mach_msg for one FD?
+    <braunr> gnu_srs1: it's a slight optimization
+    <braunr> instead of doing a send and a receive, the same mach_msg call is
+      used for both
+    <braunr> (for L4 guys it wouldn't be considered a slight optimization :))
+
+
+## IRC, freenode, #hurd, 2012-12-17
+
+    <braunr> tschwinge:
+      http://git.savannah.gnu.org/cgit/hurd/glibc.git/log/?h=rbraun/select_timeout_for_one_fd
+    <braunr> gnu_srs: talking about that, can you explain :
+    <braunr> "- The pure delay case is much faster now, making it necessary to                                                                                                                                                    
+    <braunr> introduce a delay of 1 msec when the timeout parameter is set to
+      zero.                                                                                                                                              
+    <braunr> "
+    <gnu_srs> I meant poll with zero delay needs a delay to make sure the file
+      descriptors are ready. Testing it now.
+    <braunr> for me, the "pure delay" case is the case where there is no file
+      descriptor
+    <braunr> when the timeout is 0 is the non-blocking case
+    <braunr> and yes, you need 1ms for the non-blocking case when there are
+      file descriptors
+    <gnu_srs> sorry bad wording (again)
+    <braunr> (note however that this last "requirement" is very hurd specific,
+      and due to a design issue)
+    <braunr> the work i did six months ago fixes it, but depends on pthreads
+      for correct performances (or rather, a thread library change, but
+      changing cthreads was both difficult and pointless)
+    <braunr> also, i intend to work on io_poll as a replacement for io_select,
+      that fixes the "message storm" (i love these names) caused by dead-name
+      notifications resulting from the way io_select works
+
+
+## IRC, freenode, #hurd, 2012-12-19
+
+    <braunr> tschwinge: i've tested the glibc rbraun/select_timeout_for_one_fd
+      branch for a few days on darnassus now, and nothing wrong to report
+
+
+## IRC, freenode, #hurd, 2012-12-20
+
+    <youpi> braunr: so, shall I commit the single hurd select timeout fix to
+      the debian package?
+    <braunr> youpi: i'd say so yes
+
+
+## IRC, freenode, #hurd, 2013-01-03
+
+    <braunr> gnu_srs: sorry, i don't understand your poll_timeout patch
+    <braunr> it basically reverts mine for poll only
+    <braunr> but why ?
+    <gnu_srs> braunr: It does not revert your select patch, if there is one FD
+      the timeout is at io_select, if not one the timeout is at mach_msg
+    <braunr> but why does it only concern poll ?
+    <braunr> (and why didn't i do it this way in the first place ?)
+    <braunr> (or maybe i did ?)
+    <gnu_srs> there are problems with a timeout of zero for poll, depending on
+      the implementation the FDs can result in not being ready.
+    <braunr> but that's also true with select
+    <gnu_srs> the cases I've tested only have problems for poll, not select
+    <braunr> we'll have to create test cases for both
+    <braunr> but your solution doesn't hold anyway
+    <braunr> our current workaround for this class of problems is to set a
+      lower bound on the timeout to 1
+    <braunr> (which comes from a debian specific patch)
+    <gnu_srs> see the test code i sent,
+      http://lists.gnu.org/archive/html/bug-hurd/2012-12/msg00043.html,
+      test_poll+select.c
+    <braunr> the patch might be incomplete though
+    <braunr> i know, but your solution is still wrong
+    <braunr> see debian/patches/hurd-i386/local-select.diff in the debian
+      eglibc package
+    <gnu_srs> and in that message I have introduced a minimum timeout for poll
+      of 1ms.
+    <braunr> yes but you shouldn't
+    <braunr> this is a *known* bug, and for now we have a distribution-specific
+      patch
+    <braunr> in other words, we can't commit that crap upstream
+    <gnu_srs> well, according to youpi there is a need for a communication to
+      flag when the FDs are ready, not yet implemented.
+    <braunr> i'm not sure what you mean by that
+    <youpi> I don't understand what you refer to
+    <braunr> there is a need for a full round-trip even in the non blocking
+      case
+    <braunr> which is implemented in one of my hurd branches, but awaits
+      pthreads integration for decent scalability
+    <youpi> the only difference between poll and select is that select can stop
+      the loop on error, while poll needs to continue
+    <braunr> youpi: don't you think the glibc select patch is incomplete ?
+    <youpi> incomplete in what direction?
+    <youpi> the minimum 1ms delay is a completely bogus workaround
+    <gnu_srs> youpi:
+      http://lists.gnu.org/archive/html/bug-hurd/2012-11/msg00001.html
+    <youpi> so I wouldn't say it's even completing anything :)
+    <braunr> hm no never mind, it's not
+    <braunr> i thought it missed some cases where the delay made sense, but no
+    <braunr> the timeout can only be 0 if the timeout parameter is non NULL
+    <braunr> gnu_srs: during your tests, do you run with the debian eglibc
+      package (including your changes), or from the git glibc ?
+    <gnu_srs> I run with -37, -38, with my minimum poll changes, my 3 cases,
+      and 3 case-poll updates.
+    <braunr> so you do have the debian patches
+    <braunr> so you normally have this 1ms hack
+    <braunr> which means you shouldn't need to make the poll case special
+    <gnu_srs> A admit the 1ms patch is not possible to submit upstream, but it
+      makes things work (and youpi use it for vim)
+    <braunr> i'll try to reproduce your ntpdate problem with -38 when i have
+      some time
+    <braunr> uh, no, vim actually doesn't use the hack :p
+    <youpi> gnu_srs: it's the contrary: we have to avoid it for vim
+    <gnu_srs>  if (strcmp(program_invocation_short_name, "vi") &&
+      strcmp(program_invocation_short_name, "vim") &&
+      strcmp(program_invocation_short_name, "vimdiff") && !to)
+    <gnu_srs>         to = 1;
+    <youpi> that does what we are saying
+    <braunr> strcmp returns 0 on equality
+    <gnu_srs> aha, OK then
+    <gnu_srs> I don't have that hack in my code. I have tested vim a little,
+      but cannot judge, since I'm not a vi user.
+    <braunr> you don't ?
+    <braunr> you should have it if the package patches were correctly applied
+    <gnu_srs> Maybe somebody else could compile a libc with the 3-split code to
+      test it out?
+    <braunr> that's another issue
+    <gnu_srs> I mean the patch I sent to the list, there the vi{m} hack is not
+      present.
+    <braunr> well obviously
+    <braunr> but i'm asking about the poll_timeout one
+    <gnu_srs> A, OK, it's very easy to test that version too but currently -38
+      maybe has a regression due to some other patch.
+    <braunr> that's another thing we're interested in
+    <gnu_srs> Unfortunately it takes a _long_ time to build a new version of
+      libc (several hours...)
+    <braunr> -38 is already built
+    <gnu_srs> yes, but removing patches one by one and rebuilding.
+    <braunr> but then, the "regression" you mention concerns a package that
+      wasn't really working before
+    <braunr> removing ?
+    <braunr> ah, to identify the trouble carrying one
+    <gnu_srs> ntpdate works with -37, no problem...
+    <gnu_srs> but not with -38
+    <braunr> again, trace it with -38
+    <braunr> to see on what it blocks
+    <gnu_srs> as I wrote yesterday gdb hangs the box hard and rpctrace bugs
+      out, any ideas?
+    <youpi> printf
+    <braunr> gdb from a subhurd
+    <gnu_srs> I'm suspecting the setitimer patch: Without it gdb ntpdate does
+      not freeze hard any longer, bt full: http://paste.debian.net/221491/
+    Program received signal SIGINT, Interrupt.
+    0x010477cc in mach_msg_trap ()
+        at /usr/src/kernels/eglibc/eglibc-2.13/build-tree/hurd-i386-libc/mach/mach_msg_trap.S:2
+    2       kernel_trap (__mach_msg_trap,-25,7)
+    (gdb) thread apply all bt full
+    
+    Thread 6 (Thread 3158.6):
+    #0  0x010477cc in mach_msg_trap ()
+        at /usr/src/kernels/eglibc/eglibc-2.13/build-tree/hurd-i386-libc/mach/mach_msg_trap.S:2
+    No locals.
+    #1  0x01047fc9 in __mach_msg (msg=0x52fd4, option=1282, send_size=0, 
+        rcv_size=0, rcv_name=132, timeout=100, notify=0) at msg.c:110
+            ret = <optimized out>
+    #2  0x010ec3a8 in timer_thread () at ../sysdeps/mach/hurd/setitimer.c:90
+            err = <optimized out>
+            msg = {header = {msgh_bits = 4608, msgh_size = 32, 
+                msgh_remote_port = 0, msgh_local_port = 132, msgh_seqno = 78, 
+                msgh_id = 23100}, return_code = 17744699}
+    
+    setitimer.c:90
+          err = __mach_msg (&msg.header,
+                            MACH_RCV_MSG|MACH_RCV_TIMEOUT|MACH_RCV_INTERRUPT,
+                            0, 0, _hurd_itimer_port,
+                            _hurd_itimerval.it_value.tv_sec * 1000 +
+                            _hurd_itimerval.it_value.tv_usec / 1000,
+                            MACH_PORT_NULL);
+
+[[alarm_setitimer]].
+
+    <braunr> gdb ?
+    <braunr> i thought ntpdate was the program freezing
+    <gnu_srs> the freeze is due to -38
+    <braunr> yes we know that
+    <braunr> but why do you say "gdb ntpdate" instead of "ntpdate" ?
+    <gnu_srs> yes, ntpdate freezes: without gdb kill -9 is OK, with gdb it
+      freezes hard (with setitimer pacth).
+    <braunr> we don't care much about the kill behaviour
+    <braunr> ntpdate does indeed makes direct calls to setitimer
+    <gnu_srs> without the setitimer patch: without gdb ntpdate freezes (C-c is
+      OK), with gdb C-c gives the paste above
+    <braunr> sorry i don't understand
+    <braunr> *what* is the problem ?
+    <youpi> there are two of them
+    <youpi> ntpdate freezing
+    <youpi> gdb freezing
+    <braunr> ok
+    <youpi> he's saying gdb freezing is due to the setitimer patch
+    <braunr> yes that's what i understand now
+    <braunr> what he said earlier made me think ntpdate was freezing with -38
+    <gnu_srs> better: ntpdate hangs, gdb ntpdate freezes (with the setitimer
+      patch)
+    <braunr> what's the behaviour in -37, and then what is the behaviour with
+      -38 ?
+    <braunr> (of both actions, so your answer should give us four behaviours)
+    <youpi> gnu_srs: what is the difference between "hangs" and "freezes" ?
+    <gnu_srs> -37 no problem, both with and without gdb.
+    <braunr> you mean ntpdate doesn't freeze with -37, and does with -38 ?
+    <gnu_srs> hangs: kill -9 is sufficient, freezes: reboot, checking file
+      system etc
+    <braunr> and i really mean ntpdate, not gdb whatever
+    <gnu_srs> the ntpdate hang (without the setitimer patch) in -38 can be due
+      to the poll stuff: Have to check further with my poll patch...
+
+
+## IRC, freenode, #hurd, 2013-01-04
+
+    <gnu_srs> Summary of the eglibc-2.13-38 issues: without the
+      unsubmitted-setitimer_fix.diff patch and with
+    <gnu_srs> my poll_timeout.patch fix in
+      http://lists.gnu.org/archive/html/bug-hurd/2012-12/msg00042.html
+    <gnu_srs> ntpdate works again :)
+    <gnu_srs> please consider reworking the setitimer patch and add a poll case
+      in hurdselect.c:-D 
+    <gnu_srs> Additional info: vim prefers to use select before poll,. With the
+      proposed changes (small,3-split),
+    <gnu_srs> only poll is affected by the 1ms default timeout, i.e. the
+      current vi hack is no longer needed.
+    <braunr> gnu_srs: the setitimer patch looks fine, and has real grounds
+    <braunr> gnu_srs: your poll_timeout doesn't
+    <braunr> so unless you can explain where the problem comes from, we
+      shouldn't remove the setitimer patch and add yours
+    <braunr> in addition
+    <braunr> 09:30 < gnu_srs> only poll is affected by the 1ms default timeout,
+      i.e. the current vi hack is no longer needed.
+    <braunr> that sentence is complete nonsense
+    <braunr> poll and select are implemented using the same rpc, which means
+      they're both broken
+    <braunr> if the vi hack isn't needed, it means you broke every poll user
+    <braunr> btw, i think your ntpdate issue is very similar to the gitk one
+    <braunr> gitk currently doesn't work because of select/poll
+    <braunr> it does work fine with my hurd select branch though
+    <braunr> which clearly shows a more thorough change is required, and your
+      hacks won't do any good (you may "fix" ntpdate, and break many other
+      things)
+    <gnu_srs> braunr: Why don't you try ntpdate yourself on -38 (none of my
+      patches applied)
+    <braunr> you're missing the point
+    <braunr> the real problem is the way select/poll is implemented, both at
+      client *and* server sides
+    <gnu_srs> 09:30 etc: The current implementation is slower than the 3-way
+      patch. Therefore it in not needed in the current implementation (I didn't
+      propose that either)
+    <braunr> hacks at the client side only are pointless, whatever you do
+    <braunr> slower ?
+    <braunr> it's not about performance but correctness
+    <braunr> your hack *can't* solve the select/poll correctness issue
+    <gnu_srs> yes, slower on my kvm boxes...
+    <braunr> so it's normal that ntpdate and other applications such as gitk
+      are broken
+    <braunr> if you try to fix it by playing with the timeout, you'll just
+      break the applications that were fixed in the past by playing with the
+      timeout another way
+    <braunr> can you understand that ?
+    <gnu_srs> forget the timeout default, it's a side issue.
+    <braunr> the *real* select/poll issue is that non blocking calls
+      (timeout=0) don't have the time to make a full round trip at the server
+    <braunr> no it's not, it's the whole problem
+    <braunr> some applications work with a higher timeout, some like gitk don't
+    <braunr> ntpdate might act just the same
+    <gnu_srs> yes of course, and I have not addressed this problem either, I'm
+      mostly interested in the 3-way split.
+    <braunr> well, it looks like you're trying to ..
+    <gnu_srs> to be able to submit my poll patches (not yet published)
+    <braunr> i suggest you postpone these changes until the underlying
+      implementation works
+    <braunr> i strongly suspect the split to be useless
+    <braunr> we shouldn't need completely different paths just for this
+      conformance issue
+    <braunr> so wait until select/poll is fixed, then test again
+    <gnu_srs> Read the POSIX stuff: poll and select are different.
+    <braunr> i know
+    <braunr> their expected behaviour is
+    <braunr> that's what needs to be addressed
+    <braunr> but you can't do that now, because there are other bugs in the way
+    <braunr> so you'll have a hard time making sure your changes do fix your
+      issues, because the problems might be cause by the other problems
+    <gnu_srs> since you are the one who knows best, why don't you implement
+      everything yourself.
+    <braunr> well, i did
+    <braunr> and i'm just waiting for the pthreads hurd to spread before
+      adapting my select branch
+
+[[libpthread]].
+
+    <braunr> it won't fix the conformance issue, but it will fix the underlying
+      implementation (the rpc)
+    <braunr> and then you'll have reliable results for the tests you're
+      currently doing
+    <gnu_srs> why not even trying out the cases I found to have problems??
+    <braunr> because i now know why you're observing what you're observing
+    <braunr> i don't need my eyes to see it to actually imagine it clerly
+    <braunr> when i start gitk and it's hanging, i'm not thinking 'oh my, i
+      need to hack glibc select/poll !!!'
+    <braunr> because i know the problem
+    <braunr> i know what needs to be done, i know how to do it, it will be done
+      in time
+    <braunr> please try to think the same way ..
+    <braunr> you're fixing problems by pure guessing, without understanding
+      what's really happenning
+    <gnu_srs> (10:59:17 AM) braunr: your hack *can't* solve the select/poll
+      correctness issue: which hack?
+    <braunr> "please consider removing setitimer because it blocks my ntpdate"
+    <braunr> gnu_srs: all your select related patches
+    <braunr> the poll_timeout, the 3-way split, they just can't
+    <braunr> changes need to be made at the server side too
+    <braunr> you *may* have fixed the conformance issue related to what is
+      returned, but since it get mixed with the underlying implementation
+      problems, your tests aren't reliable
+    <gnu_srs> well some of the test code is from gnulib, their code is not
+      reliable?
+    <braunr> their results aren't
+    <braunr> why is that so hard to understand for you ?
+    <gnu_srs> (11:08:05 AM) braunr: "please consider removing setitimer because
+      it blocks my ntpdate": It's not my ntpdate, it's a program that fails to
+      run on -38, but does on -37!
+    <braunr> it doesn't mean glibc -37 is right
+    <braunr> it just means the ntpdate case seems to be handled correctly
+    <braunr> a correct implementation is *ALWAYS* correct
+    <braunr> if there is one wrong case, it's not, and we know our select/poll
+      implementation is wrong
+    <gnu_srs> no of course not, and the ntpdate implementation is not correct?
+      file a bug upstream then.
+    <braunr> you're starting to say stupid things again
+    <braunr> ntpdate and gnulib tests can't be right if they use code that
+      isn't right
+    <braunr> it doesn't mean they'll always fail either, the programs that fail
+      are those for which select/poll behaves wrong
+    <braunr> same thing for setitimer btw
+    <braunr> we know it was wrong, and i think it was never working actually
+    <gnu_srs> where are the missing test cases then? maybe you should publish
+      correct code so we can try it out?
+    <braunr> i have, but there are dependencies that prevent using it right now
+    <braunr> which is why i'm waiting for pthreads hurd to spread
+    <braunr> pthreads provide the necessary requirements for select to be
+      correctly implemented at server side
+    <gnu_srs> well conformance with your code could be tested on Linux,
+      kFreeBSD, etc
+    <braunr> ?
+    <braunr> i'm not writing test units
+    <gnu_srs> /code/test code/
+    <braunr> the problem is *NOT* the test code
+    <braunr> the problem is some of our system calls
+    <braunr> it's the same for ntpdate and gitk and all other users
+    <gnu_srs> then the gnulib, ntpdate, gitk code is _not_ wrong
+    <braunr> no, but their execution is, and thus their results
+    <braunr> which is ok, they're tests
+    <braunr> they're here precisely to tell us if one case works
+    <braunr> they must all pass to hope their subject is right
+    <braunr> so, again, since there are several problems with our low level
+      calls, you may have fixed one, but still suffer from others
+    <braunr> so even if you did fix something, you may not consider the test
+      failure as an indication that your fix is wrong
+    <braunr> but if you try to make your changes fix everything just to have
+      results that look valid, it's certain to fail, since you only fix the
+      client side, and it's *known* the server side must be changed too
+    <gnu_srs> do you consider unsubmitted-single-hurdselect-timeout.diff and
+      local-select.diff a hack or not?
+    <braunr> the first isn't, since it fixes the correctness of the call for
+      one case, at the cost of some performance
+    <braunr> the second clearly is
+    <braunr> which is the difference between unsubmitted-* and local-*
+    <gnu_srs> and my proposal to modify the first is a hack?
+    <braunr> yes
+    <braunr> it reverts a valid change to "make things work" whereas we know
+      the previous behaviour was wrong
+    <braunr> that's close to the definition of a hack
+    <gnu_srs> "make things work" meaning breaking some applications?
+    <braunr> yes
+    <braunr> in this case, those using poll with one file descriptor and
+      expecting a timeout, not twice the timeout
+    <braunr> well, your change isn't really a revert
+    <braunr> hum yes actually it is
+    <gnu_srs> the timeout is correct
+    <braunr> no, it looks correct
+    <braunr> how did you test it ?
+    <braunr> and same question as yesterday: why only poll ?
+    <gnu_srs> see the code I mentioned before
+    <braunr> no i won't look
+    <braunr> it doesn't explain anything
+    <gnu_srs> I have not found any problems with select, only poll (yes, this
+      is a user perspective)
+    <braunr> that's what i call "pure guessing"
+    <braunr> you just can't explain why it fixes things
+    <braunr> because you don't know
+    <braunr> you don't understand what's really happening in the system
+    <braunr> there is a good idea in your change
+    <braunr> but merely for performance, not correctness
+    <braunr> (your change makes it possible to save the mach_msg receive if the
+      io_select blocked, which is good, but not required)
+
+See also [[alarm_setitimer]].
+
+
 # See Also
 
 See also [[select_bogus_fd]] and [[select_vs_signals]].
author	Thomas Schwinge <tschwinge@gnu.org>	2013-01-08 21:31:31 +0100
committer	Thomas Schwinge <tschwinge@gnu.org>	2013-01-08 21:31:31 +0100
commit	51c95fc11727532e3b0d98c8470a6b60907a0680 (patch)
tree	6a8dd54654398bb2b05ce3f7cbcdc211d1dfbfb7 /open_issues/select.mdwn
parent	160d0597cb94d58f8ab273226b1f3830589a500b (diff)