From 95878586ec7611791f4001a4ee17abf943fae3c1 Mon Sep 17 00:00:00 2001
From: "https://me.yahoo.com/a/g3Ccalpj0NhN566pHbUl6i9QF0QEkrhlfPM-#b1c14"
Date: Mon, 16 Feb 2015 20:08:03 +0100
Subject: rename open_issues.mdwn to service_solahart_jakarta_selatan__082122541663.mdwn

---
 open_issues/sigpipe.mdwn | 345 -----------------------------------------------
 1 file changed, 345 deletions(-)
 delete mode 100644 open_issues/sigpipe.mdwn

(limited to 'open_issues/sigpipe.mdwn')

diff --git a/open_issues/sigpipe.mdwn b/open_issues/sigpipe.mdwn
deleted file mode 100644
index 0df3560e..00000000
--- a/open_issues/sigpipe.mdwn
+++ /dev/null
@@ -1,345 +0,0 @@
-[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
-
-[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
-id="license" text="Permission is granted to copy, distribute and/or modify this
-document under the terms of the GNU Free Documentation License, Version 1.2 or
-any later version published by the Free Software Foundation; with no Invariant
-Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
-is included in the section entitled [[GNU Free Documentation
-License|/fdl]]."]]"""]]
-
-[[!tag open_issue_glibc open_issue_hurd]]
-
-[[!GNU_Savannah_bug 461]]
-
-IRC, freenode, #hurd, 2011-04-20
-
-    I found a problem from 2002 by Marcus Brinkmann that I think is
-    related to my problems: http://savannah.gnu.org/bugs/?461. He has a test
-    file called pipetest.c that shows that SIGPIPE is not triggered reliably.
-    Cited from the bug report: The attached test program does not
-    trigger SIGPIPE reliably, because closing the read end of the pipe
-    happens asynchronously. The write can succeed because the read end might
-    not have been closed yet.
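The race quoted from the bug report can be illustrated with a minimal sketch. This is not Marcus Brinkmann's pipetest.c: it is a hypothetical Python stand-in (the name `write_after_reader_closed` is made up for illustration) that closes the read end of a pipe and then writes to the write end. It assumes a system where close() takes effect synchronously, in which case the write should fail with EPIPE. A write that succeeds would correspond to the asynchronous behaviour the report describes.

```python
import errno
import os


def write_after_reader_closed() -> int:
    """Write to a pipe whose read end is already closed.

    Returns the errno of the failed write, or 0 if the write succeeded.
    """
    read_fd, write_fd = os.pipe()
    try:
        os.close(read_fd)  # the reader goes away first
        try:
            # CPython ignores SIGPIPE at startup, so a broken pipe shows up
            # as an OSError carrying EPIPE instead of killing the process.
            os.write(write_fd, b"x")
        except OSError as exc:
            return exc.errno
        return 0  # the write went through: the race from the bug report
    finally:
        os.close(write_fd)


result = write_after_reader_closed()
print(errno.errorcode.get(result, "write succeeded"))
```

A C program that leaves SIGPIPE at its default disposition would be terminated by the signal at the same point instead of seeing EPIPE.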
-    I have debugged this program on both Hurd and Linux, and the
-    problem in Hurd remains:-(
-    Anybody looked into the almost ten year old
-    bug: http://savannah.gnu.org/bugs/?461 this one is definitely related to
-    the build problems of e.g. ghc6 and ruby1.9.1. Should I mention this on
-    the ML?
-    that could be it indeed
-    does the bug still happen?
-    depends on: new interface io_close
-    which depends on: POSIX record locking
-    youpi: Yes it does, I've tested the pipetest.c file submitted by
-    Marcus B on both Linux and Hurd
-    that would've maybe been a nice GSOC task
-    azeem: err, the contrary for posix record locking, non ?
-    argh
-    why would POSIX record locking depend on this?
-    well anyway, then have POSIX record locking be a GSOC task :)
-    I wasn't aware that would also fix ruby and ghc building :)
-    http://permalink.gmane.org/gmane.os.hurd.devel.readers/265
-    (for io_close stuff)
-    http://comments.gmane.org/gmane.os.hurd.devel.readers/63 actually
-    I guess if they didn't implement it/agreed on something back then
-    it'd be quite hard to do it now
-    azeem: marcus recently showed up here. Maybe he can help out/has
-    ideas?
-    well yeah
-    but marcus was the junior guy back then
-    but it's a very hurdish solution (ie, complex, buggy, and
-    not implemented)
-    maybe we can go for something simpler
-    azeem: what is this quote about?
-    don't remember
-    not io_close I'd say
-
-2011-04-21
-
-    svante_: why do you think the problem you see in ruby and ghc is
-    related to async close() ?
-
-2011-04-22
-
-    Well: the test case I'm running on ruby is giving me an EBADF
-    after 8 successful loops, and tracing within eglibc points towards
-    __mutex_lock_solid or __spin_lock, __spin_lock_solid from
-    mach/lock-intern.h from cthreads.
-
-2011-04-23
-
-    srs1: yeah, I saw it... but I still wonder what makes you think
-    this is related to async FD closing?
-    antrik: Every test case showing the problems is related to fd.h and
-    the functions there, especially the ones used in the function:
-    _HURD_FD_H_EXTERN_INLINE struct hurd_fd *_hurd_fd_get (int fd) and so is
-    the pipetest from Marcus too.
-    I have not yet been able to trace further with gdb since most
-    variables are optimized out and adding print statements does not work, at
-    least not yet. Now I'm trying to build eglibc with -O1 to see if the
-    optimized out variables are there or not.
-    srs1: he means the ghc6 issue
-    (and the ruby issue)
-    youpi: Yes, the ghc6 and ruby issues end at the functions I mentioned in
-    fd.h
-    Both ghc6 and ruby programs are writing to a file when the error
-    happens. If they are using a pipeline or not I don't know yet, I think it
-    is a regular file write.
-    I can send you the ruby program if you like: It is a c-file so
-    debugging is possible. ghc6 is worse, since that program cannot be
-    debugged directly with gdb.
-    pipetest also results in the program hanging in locking stuff?...
-    pipetest does not hang, but gives no output, though it should. Running it
-    in gdb with single stepping shows the correct behavior, but then gdb
-    hangs if I try to single-step further, continue at the right place
-    works!
-    I haven't looked at the pipetest program. do you have the link
-    handy?
-    never mind, got it
-    srs1: that sounds like a GDB problem...
-    most probably, yes
-    (and I've always seen issues like this in gdb on hurd)
-    actually I think it's expected... the RPC handling code has some
-    explicit GDB hooks AIUI; trying to single-step into this code is probably
-    expected to wreak havoc...
-    well, it should have some sane behavior
-    even if it's "skip to next point where it's debuggable"
-    srs1: note that there is no BADF involved in the pipetest AIUI...
-
-2011-04-28
-
-    what is the actual problem you are seeing BTW?
-    antrik: in ruby the problem is: Exception `IOError' at
-    tool/mkconfig.rb:12 - closed stream
-    Triggered by ruby:io.c:internal_read_func() calling
-    sysdeps/mach/hurd/read.c returning a negative number of bytes read.
-    gnu_srs1: why do you think that error is locking related?
-    This happens after 8 iterations of the read loop with 8192 bytes
-    read each time.
-    but that doesn't involve locking at all, does it?
-    I think it is, if there is a pipeline set up??
-    Also the ghc6 hang ends up in hangs in sysdeps/mach/hurd/read.c
-    traced into fd.h where all things happen (including setting locks and
-    mutexes)
-    what locking ?
-    stdio locking is different from file locking
-    and a pipe doesn't imply file locking at all
-    read may block on pipes, but it's unrelated to flock
-    Look into the file fd.h, maybe you can describe things
-    better. I'm not fluent in this stuff.
-    Does a pipe have a file descriptor associated with it? What about a
-    file read/write?
-    a pipe provides 2 file descriptors, one for reading and another
-    one for writing
-    i may give a look at that if i manage to build glibc
-    successfully...
-    Take a look at the relevant code from fd.h:
-    http://pastebin.com/kesBpjy4
-    the ruby error happens just trying to build ruby1.9?
-    gnu_srs1: from what you said, the error occurs while reading,
-    so i don't see how it can be related to that code
-    you already got a descriptor if you're reading from it
-    I have not tried anything else than ruby1.9.1. I can send you
-    the ruby debug setup and files if you are interested.
-    gnu_srs1: ok, i'll try to build ruby1.9.1 later... let's see if
-    i can build glibc first
-    abeaumont: well, the read suddenly returns -1 bytes read,
-    resulting in a file descriptor of -1 (instead of +3).
-    gnu_srs1: i see
-    gnu_srs1: are you sure the hang really happens in _hurd_fd_get()?
-    could you give us a backtrace?
-    gnu_srs1: there are many reasons why read() can return -1; errno
-    should indicate the reason.
-    unfortunately, I can't make much out of
-    ruby's "translation" of the error :-)
-    antrik: In the ruby case there is no hang: The stream is closed
-    by read() giving an error code != 0. This triggers things in the ruby
-    code: A negative number of bytes read and a negative fd results, and an
-    error is triggered in the ruby code.
-    antrik: See http://pastebin.com/eZmaZQJr
-    gnu_srs1: yes, this all sounds perfectly right. the question is
-    *why* read() returns an error code. we'd need to know what error it is
-    exactly, and in what situation it occurs. tracing the libc code is not at
-    all useful here
-    uhm... 1073741833 is errno?...
-    BTW: I think the error code is EBADF (bad file descriptor?). The
-    integer version of it is 1073741833, see the pastebin i linked to.
-    you could use perror() to get something more readable :-)
-    or error() with the right arguments
-    I used integer when printing, but looking into fd.h I think it
-    is EBADF (I did get this result once in gdb)
-    fd.h won't tell you anything. most error codes are generated by
-    the server, not by libc
-    BADF might be generated in libc when ruby tries to read on FD -1
-    (no idea why it tries to do that... perhaps there is actually
-    something wrong/stupid in ruby's error handling)
-    Well I single-stepped in fd.h using gdb and printing err gave
-    EBADF. err is declared as: error_t err in read.c
-    at which point did you single-step? while fd was still 3?
-    I don't think the problem is in ruby, it is in mach/hurd!
-    Similar problems with ghc, python-apt, etc
-    Yes, fd=3 was not changed. I cannot trace into fd.h from
-    read.c. That is the problem with all cases! Need to leave for a while
-    now.
-    sorry, I don't see *anything* similar in the ghc failure.
-    I don't know about python-apt
-    for the ghc case, I'd like to see a GDB backtrace from the point
-    where it is hanging
-    just to be clear: anything I/O-related will involve fd.h
-    somewhere. that doesn't in any way indicate the problems are related.
-    in fact the symptoms you described are very different, and I'm pretty
-    certain these are completely different issues
-    antrik: Here is a backtrace,
-    http://pastebin.com/wvCiXFjB. Numbers 6,7,8 are from the calling Haskell
-    functions. They cannot be debugged by gdb. Nice to see that somebody is
-    showing interest at last:-/
-    hm... I wonder whether the _hurd_intr_rpc_msg_in_trap is a result
-    of the ^C?
-    if so, it seems to be a "normal" blocking read() operation. so
-    again probably not related to libc code at all
-    Where is this blocking read() code located in mach/hurd?
-    io_read() is implemented by whatever server handles the FD in
-    question
-    I guess rpctrace will be more helpful here than GDB... to see what
-    the program is trying to do here
-    Why don't I get there with gdb?
-    err... the server is a different process
-    you are only tracing the client code
-    OK, here is a rpctrace for ruby:
-    http://pastebin.com/sdPiKGBW.Nice programs you have, no manual pages, and
-    the program hangs
-    s/http://pastebin.com/sdPiKGBW.Nice
-    /http://pastebin.com/sdPiKGBW. BTW: Nice/
-    antrik: Do you want the rpctrace of the ghc hang too? If that is
-    the case, do you need the whole file? From the ruby case the last part
-    looked most interesting:
-    libpthread/sysdeps/generic/pt-mutex-timedlock.c: assert (mutex->owner !=
-    self);
-    gnu_srs1: hm... you get that assertion only with rpctrace? guess
-    it doesn't work properly then :-(
-    Is it visible on the client side?
-    gnu_srs1: that assertion *is* from the client side. I'm just
-    surprised that apparently it's only triggered when you run it in rpctrace
-    how did you invoke rpctrace?
-    rpctrace "command with options" > rpctrace.out 2>&1
-    well, I'd like to know the "command with options" part :-)
-    OK: for ruby: ./miniruby ./ tool/mkconfig.rb as before.
-    OK, so it just runs the ruby interpreter and no other processes
-    No other processes involved!
-    gnu_srs1: i can reproduce the ruby error, now let's dig into it :D
-    gnu_srs1: rpctrace for ghc could be useful too... but if it's too
-    long, pasting only the last bit might suffice
-    antrik: OK, will do that. Do you find anything interesting?
-    abeaumont: Using gdb: gdb ./miniruby; (gdb) break io.c:569; c 8;
-    break fd.h:72 or break read.c:27 and you are there. Beware of gdb
-    hanging, so you need another terminal to kill -9 gdb (sometimes a reboot
-    is needed :-(
-    gnu_srs1: no, the ruby rpctrace is useless; apparently rpctrace
-    makes it break before reaching the relevant part :-(
-    thanks gnu_srs1
-    antrik: Hope for better luck with ghc:
-    http://pastebin.com/dgcrA05t
-    hm... it hangs at proc_dostop() again... whatever that means
-
-2011-05-07
-
-    One question about ruby: I know where the problems occur in ruby
-    code. Can I switch to the kernel thread just before in gdb to single step
-    from there?
-    you can put a breakpoint, can't you?
-    gnu_srs: kernel thread?
-    Yes, but will single stepping from there lead me to the Hurd
-    code? I have not succeeded in doing that yet!
-    you mean the translator code?
-    Well, Roland did call it the signal thread, there are at least
-    two threads per process, a signal thread and a main (user) thread.
-    then it's a thread in gdb
-    just use the thread gdb commands to access it
-    I do find two threads in gdb, yes. But following only the user
-    thread does not lead me to the cause of the problems.
-    And following the other (signal thread) has not been successful
-    so far.
-    multithreading debugging in gdb is painful yes
-    single-step isn't really an option in it
-    gnu_srs: well, as I said before, the cause is probably not in the
-    libc code anyways. it would be much more relevant to find out what the FD
-    in question is, and what "special" thing Ruby does to it to trigger the
-    problematic behaviour...
-    it's simpler to put printfs etc.
-    youpi: well, printf doesn't work in the FD code :-)
-    you can make it work
-    open /dev/mem, write to 0xb8000
-    I'm not even joking
-    I have printfs in the ruby code. And at some parts in eglibc (but
-    it is not possible to put them at all places I want, as mentioned before)
-    sure, there are ways to debug this code too... but I don't think
-    it's useful. so far there is no indication that this will help finding
-    the actual issue
-    The problem is not file descriptors. It is that an ongoing read
-    suddenly returns -1 bytes read. And then the ruby code assigns a negative
-    file descriptor in the exception handling.
-    a *read* ?
-    with errno == 0 ?
-    Yes, a read!
-    how does ruby come to assign a negative fd from that?
-    does it somehow close the fd?
-    The errno reported from the read is EBADF!
-    did you try to rpctrace it?
-    I don't bother too much about ruby exception handling. The error
-    has already happened in the read operation. And that led me to eglibc
-    code.... and so on...
-    do you know what kind of file this fd was supposed to be on?
-    sure, that's debugging
-    Yes I did rpctrace, but that was not successful. rpctrace just
-    hung! Buggy code?
-    youpi: I assume that's Ruby's way to indicate that the FD is not
-    valid anymore, after the previous error
-    does the program fork?
-    antrik: possibly
-    rpctrace has known issues, yes
-    gnu_srs: did you trace close()s by hand with printfs?
-    How to find out if it forks?
-    what does rpctrace stop on ?
-    Well, I don't remember. Antrik?
-    proc_dostop() IIRC
-    or something like that
-    I did not find any close() statements in the code I debugged.
-    ok, proc_dostop() is typically a sign of fork()
-    gnu_srs: that doesn't necessarily mean it's not called
-    gnu_srs: I think his point is that something else might close the
-    FD, causing the error you see
-    anything can happen in the wild :)
-    gnu_srs: as I said before, the next step is to find out what this
-    FD is, and what happens to it...
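The hypothesis raised above, that something else closes the FD and the next read then fails with EBADF, can be sketched in a few lines. This is a hypothetical illustration, not the Ruby or glibc code under discussion: `read_after_close` and `hurd_errno_code` are made-up names. The decoding of 1073741833 assumes the Hurd errno layout, in which the value is `0x10 << 26` combined with a 14-bit code; treat that layout as an assumption to check against the Hurd errno header.

```python
import errno
import os

# Assumed Hurd errno layout: system bits (0x10 << 26) OR a 14-bit code.
HURD_ERRNO_BASE = 0x10 << 26


def read_after_close() -> int:
    """Read from a descriptor that has already been closed.

    Returns the errno of the failed read, or 0 if the read succeeded.
    """
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, b"data")
        os.close(read_fd)  # stands in for "something else closed the FD"
        try:
            os.read(read_fd, 8192)  # same chunk size as the loop in the log
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        os.close(write_fd)


def hurd_errno_code(value: int) -> int:
    """Strip the assumed Hurd system bits, leaving the low 14-bit code."""
    return value & 0x3FFF


failure = read_after_close()
print(errno.errorcode.get(failure, "read succeeded"), os.strerror(failure))
print(hurd_errno_code(1073741833))
```

Under that assumption the integer from the pastebin reduces to 9, which is the conventional value of EBADF, consistent with what was seen in gdb.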
-    antrik: Any ideas how to find out?
-    what is the backtrace?
-    Well I know the fd number, it is either 3 or 5 in my tests. Does
-    the number matter?
-    yes, it's not std{in,out,err}
-    How to get a backtrace of a program that does not hang?
-    make it hang at the point of failure
-    when read returns -1
-    so you know who did the read
-    I have to run the loop several times before the number of bytes
-    read is -1.
-    you mean running the program several times ?
-    or just let the loop continue for some time?
-    if it's the latter, you can add breakpoints with conditions
-    No the read loop runs for 7 iterations, and fails the 8th time!
-    then make it hang when read() returns -1
-    could you paste your code somewhere?
-    when debugging, you're allowed to do all kinds of ugly things, you
-    know ;)
-    OK, I'll try that.
-    MR_Spock: The easiest way would be to try to build
-    ruby1.9.1. Then I can help you from where it fails.
-    pinotree: How to give a breakpoint with a condition?
-    break where if condition
-    see help break
-    oh, there's even a thread condition nowadays, good
-    Thanks for the discussion. I have to get into the real world for
-    a while now. To be continued.
-    gnu_srs: well, if you already know that the loop runs several
-    times before the error occurs, you apparently already looked at the
-    higher-level code that is relevant here...
-    but it may be generic code, and not tell what calls it
--
cgit v1.2.3