summaryrefslogtreecommitdiff
path: root/service_solahart_jakarta_selatan__082122541663/term_blocking.mdwn
blob: fd2035d9653b11d785dfbcbb90e8746e8be19016 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
[[!meta copyright="Copyright © 2009, 2011, 2012, 2013 Free Software Foundation,
Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
id="license" text="Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
is included in the section entitled [[GNU Free Documentation
License|/fdl]]."]]"""]]

[[!tag open_issue_hurd]]

There must be some blocking / dead-locking (?) problem in `term`.

[[!toc]]


# Original Findings

    # w | grep [t]sch
    tschwing  p1 192.168.10.60: Tue 8PM  0:03  2172 /bin/bash
    tschwing  p2 192.168.10.60: Tue 4PM 40hrs   689 emacs
    tschwing  p3 192.168.10.60:  8:52PM 11:37 15307 /bin/bash
    tschwing  p0 192.168.10.60:  6:42PM 11:47  8104 /bin/bash
    tschwing  p8 192.168.10.60:  8:27AM  0:02 16510 /bin/bash

Now open a new screen window, or login shell, or...

    # ps -Af | tail
    [...]
    tschwinge 16538  676  p6  0:00.08 /bin/bash
        root 16554   128  co  0:00.09 ps -Af
        root 16555   128  co  0:00.01 tail

`bash` is started (on `p6`), but newer makes it to the shell promt; doesn't
even start to execute `.bash_profile` / `.bashrc`.  The next shell started, on
the next available pseudoterminal, will work without problems.

The `term` on `p6` has already been running before:

    # ps -Af | grep [t]typ6
        root  6871     3   -  5:45.86 /hurd/term /dev/ptyp6 pty-master /dev/ttyp6

In this situation, `w` will sometimes report erroneous values for *IDLE*
for the process using that terminal.

Killed that `term` instance, and things were fine again.


All this reproducible happens while running the [[GDB testsuite|gdb]].

---

Have a freshly started shell blocking on such a `term` instance.

    $ ps -F hurd-long -p 1766 -T -Q
      PID TH#  UID  PPID  PGrp  Sess TH  Vmem   RSS %CPU     User   System Args
     1766        0     3     1     1  6  131M 1.14M  0.0  0:28.85  5:40.91 /hurd/term /dev/ptyp3 pty-master /dev/ttyp3
            0                                        0.0  0:05.76  1:08.48
            1                                        0.0  0:00.00  0:00.01
            2                                        0.0  0:06.40  1:11.52
            3                                        0.0  0:05.76  1:09.89
            4                                        0.0  0:05.42  1:06.74
            5                                        0.0  0:05.50  1:04.25

... and after 5:45 h:

    $ ps -F hurd-long -p 21987 -T -Q
      PID TH#  UID  PPID  PGrp  Sess TH  Vmem   RSS %CPU     User   System Args
    21987     1001   676 21987 21987  2  148M 2.03M  0.0  0:00.02  0:00.07 /bin/bash
            0                                        0.0  0:00.02  0:00.07 
            1                                        0.0  0:00.00  0:00.00 

    $ ps -F hurd-long -p 1766 -T -Q
      PID TH#  UID  PPID  PGrp  Sess TH  Vmem   RSS %CPU     User   System Args
     1766        0     3     1     1  6  131M 1.14M  0.0  0:29.04  5:42.38 /hurd/term /dev/ptyp3 pty-master /dev/ttyp3
            0                                        0.0  0:05.76  1:08.48
            1                                        0.0  0:00.00  0:00.01
            2                                        0.0  0:06.41  1:11.90
            3                                        0.0  0:05.82  1:10.28
            4                                        0.0  0:05.52  1:07.06
            5                                        0.0  0:05.52  1:04.63

    $ sudo gdb /hurd/term 1766
    [sudo] password for tschwinge: 
    GNU gdb (GDB) 7.0-debian
    Copyright (C) 2009 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "i486-gnu".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /hurd/term...Reading symbols from /usr/lib/debug/hurd/term...done.
    (no debugging symbols found)...done.
    Attaching to program `/hurd/term', pid 1766
    [New Thread 1766.1]
    [New Thread 1766.2]
    [New Thread 1766.3]
    [New Thread 1766.4]
    [New Thread 1766.5]
    [New Thread 1766.6]
    Reading symbols from /lib/libhurdbugaddr.so.0.3...Reading symbols from /usr/lib/debug/lib/libhurdbugaddr.so.0.3...
    [System doesn't respond anymore, but no kernel crash.]

---

The very same behavior is still observable as of 2011-03-24.

Next: rebooted; on console started root shell, screen, a few spare windows; as
user started GDB test suite, noticed the PTY it's using; in a root shell
started GDB (the system one, for `.debug` stuff) on `/hurd/term`, `set
noninvasive on`, attach to the *term* that GDB is using.

---

[[service_solahart_jakarta_selatan__082122541663/term_blocking/2011-07-04]].

---

2012-11-05

Log file from a 2011-09-07 run:

    [...]
    Running ../../../master/gdb/testsuite/gdb.base/readline.exp ...
    spawn [...]/gdb/testsuite/../../gdb/gdb -nw -nx -data-directory [...]/gdb/testsuite/../data-directory
    GNU gdb (GDB) 7.3.50.20110906-cvs
    Copyright (C) 2011 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "i686-unknown-gnu0.3".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>.
    (gdb) set height 0
    (gdb) set width 0
    (gdb) dir
    Reinitialize source path to empty? (y or n) y
    Source directories searched: $cdir:$cwd
    (gdb) dir ../../../master/gdb/testsuite/gdb.base
    Source directories searched: [...]/gdb/testsuite/../../../master/gdb/testsuite/gdb.base:$cdir:$cwd
    (gdb) p 1
    $1 = 1
    PASS: gdb.base/readline.exp: Simple operate-and-get-next - send p 1
    (gdb) p 2
    $2 = 2
    PASS: gdb.base/readline.exp: Simple operate-and-get-next - send p 2
    (gdb) p 3
    $3 = 3
    PASS: gdb.base/readline.exp: Simple operate-and-get-next - send p 3
    (gdb) p 3(gdb) p 3PASS: gdb.base/readline.exp: Simple operate-and-get-next - C-p to p 3
    ^H2(gdb) p 2PASS: gdb.base/readline.exp: Simple operate-and-get-next - C-p to p 2
    ^H1(gdb) p 1PASS: gdb.base/readline.exp: Simple operate-and-get-next - C-p to p 1
    ^OFAIL: gdb.base/readline.exp: Simple operate-and-get-next - C-o for p 1
    FAIL: gdb.base/readline.exp: operate-and-get-next with secondary prompt - send if 1 > 0
    FAIL: gdb.base/readline.exp: print 42 (timeout)
    FAIL: gdb.base/readline.exp: arrow keys with secondary prompt (timeout)
    spawn [...]/gdb/testsuite/../../gdb/gdb -nw -nx -data-directory [...]/gdb/testsuite/../data-directory
    ERROR: (timeout) GDB never initialized after 10 seconds.
    ERROR: no fileid for coulomb
    ERROR: no fileid for coulomb
    UNRESOLVED: gdb.base/readline.exp: Simple operate-and-get-next - send p 7
    testcase ../../../master/gdb/testsuite/gdb.base/readline.exp completed in 646 seconds
    Running ../../../master/gdb/testsuite/gdb.base/wchar.exp ...
    Executing on host: gcc  -c -g  -o [...]/gdb/testsuite/gdb.base/wchar0.o ../../../master/gdb/testsuite/gdb.base/wchar.c    (timeout = 300)
    spawn gcc -c -g -o [...]/gdb/testsuite/gdb.base/wchar0.o ../../../master/gdb/testsuite/gdb.base/wchar.c
    Executing on host: gcc [...]/gdb/testsuite/gdb.base/wchar0.o  -g  -lm   -o [...]/gdb/testsuite/gdb.base/wchar    (timeout = 300)
    spawn gcc [...]/gdb/testsuite/gdb.base/wchar0.o -g -lm -o [...]/gdb/testsuite/gdb.base/wchar
    get_compiler_info: gcc-4-6-1
    spawn [...]/gdb/testsuite/../../gdb/gdb -nw -nx -data-directory [...]/gdb/testsuite/../data-directory
    ERROR: (timeout) GDB never initialized after 10 seconds.
    ERROR: no fileid for coulomb
    ERROR: no fileid for coulomb
    ERROR: no fileid for coulomb
    ERROR: couldn't load [...]/gdb/testsuite/gdb.base/wchar into [...]/gdb/testsuite/../../gdb/gdb (timed out).
    ERROR: no fileid for coulomb
    ERROR: Delete all breakpoints in delete_breakpoints (timeout)
    ERROR: no fileid for coulomb
    UNRESOLVED: gdb.base/wchar.exp: setting breakpoint at wchar.c:34 (timeout)
    testcase ../../../master/gdb/testsuite/gdb.base/wchar.exp completed in 797 seconds
    [...]


# IRC, freenode, #hurd, 2012-08-09

In context of the [[select]] issue.

    <braunr> i wonder where the tty allocation is made
    <braunr> it could simply be that current applications don't handle old BSD
      ptys correctly
    <braunr> hm no, allocation is fine
    <braunr> does someone know why there is no term instance for /dev/ttypX ?
    <braunr> showtrans says "/hurd/term /dev/ttyp0 pty-slave /dev/ptyp0" though
    <youpi> braunr: /dev/ttypX share the same translator with /dev/ptypX
    <braunr> youpi: but how ?
    <youpi> see the main function of term
    <youpi> it attaches itself to the other node
    <youpi> with file_set_translator
    <youpi> just  like pfinet can attach itself to /servers/socket/26 too
    <braunr> youpi: isn't there a possible race when the same translator tries
      to sets itself on several nodes ?
    <youpi> I don't know
    <tschwinge> There is.
    <braunr> i guess it would just faikl
    <braunr> fail
    <tschwinge> I remember some discussion about this, possibly in context of
      the IPv6 project.
    <braunr> gdb shows weird traces in term
    <braunr> i got this earlier today: http://www.sceen.net/~rbraun/gdb.txt
    <braunr> 0x805e008 is the ptyctl, the trivs control for the pty
    <tschwinge> braunr: How do you mean »weird«?
    <braunr> tschwinge: some peropen (po) are never destroyed
    <tschwinge> Well, can't they possibly still be open?
    <braunr> they shouldn't
    <braunr> that's why term doesn't close cleany, why select still reports
      readiness, and why screen loops on it
    <braunr> (and why each ssh session uses a different pty)
    <tschwinge> ... but only on darnassus, I think?  (I think I haven't seen
      this anywhere else.)
    <braunr> really ?
    <braunr> i had it on my virtual machines too
    <tschwinge> But perhaps I've always been rebooting systems quickly enough
      to not notice.
    <tschwinge> OK, I'll have a look next time I boot mine.
    <braunr> i suppose it's why you can't login anymore quickly when syslog is
      running

[[service_solahart_jakarta_selatan__082122541663/syslog]]?

    <braunr> i've traced the problem to ptyio.c, where pty_open_hook returns
      EBUSY because ptyopen is still true
    <braunr> ptyopen remains true because pty_po_create_hook doesn't get called
    <youpi> tschwinge: I've seen the pty issue on exodar too, and on my qemu
      image too
    <braunr> err, pty_po_destroy_hook
    <tschwinge> OK.
    <braunr> and pty_po_destroy_hook doesn't get called from users.c because
      po->cntl != ptyctl
    <braunr> which means, somehow, the pty never gets closed
    <youpi> oddly enough it seems to happen on all qemu systems I have, and no
      xen system I have
    <braunr> Oo
    <braunr> are they all (xen and qemu) up to date ?
    <braunr> (so we can remove versions as a factor)
    <tschwinge> Aha.  I only hve Xen and real hardware.
    <youpi> braunr: no
    <braunr> youpi: do you know any obscur site about ptys ? :)
    <youpi> no
    <youpi> well, actually yes
    <youpi> http://dept-info.labri.fr/~thibault/a (in french)
    <braunr> :D
    <braunr> http://www.linusakesson.net/programming/tty/index.php looks
      interesting
    <youpi> indeed


## IRC, freenode, #hurdfr, 2012-08-09

    <braunr> youpi: ce que j'ai le plus de mal à comprendre, c'est ce qu'est un
      "controlling tty"
    <youpi> c'est le plus obscur d'obscur :)
    <braunr> s'il est exclusif à une appli, comment ça doit se comporter sur un
      fork, etc..
    <youpi> de manière simple, c'est ce qui permet de faire ^C
    <braunr> eh oui, et c'est sûrement là que ça explose
    <youpi> c'est pas exclusif, c'est hérité
    <braunr>
      http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/bernstein-on-ttys/cttys.html


## IRC, freenode, #hurd, 2012-08-10

    <braunr> youpi: and just to be sure about the test procedure, i log on a
      system, type tty, see e.g. ttyp0, log out, and in again, then tty returns
      ttyp1, etc..
    <youpi> yes
    <braunr> youpi: and an open (e.g. cat) on /dev/ptyp0 returns EBUSY
    <youpi> indeed
    <braunr> so on xen it doesn't
    <braunr> grmbl
    <youpi> I've never seen it, more precisely
    <braunr> i also have the problem with a non-accelerated qemu
    <braunr> antrik: do you have the term problems we've seen on your bare
      hardware ?
    <antrik> I'm not sure what problem you are seeing exactly :-)
    <braunr> antrik: when logging through ssh, tty first returns ttyp0, and the
      second time (after logging out from the first session) ttyp1
    <braunr> antrik: and term servers that have been used are then stuck in a
      busy state
    <antrik> braunr: my ptys seem to be reused just fine
    <braunr> or perhaps they didn't have the bug
    <braunr> antrik: that's so weird
    <antrik> (I do *sometimes* get hanging ptys, but that's a different issue
      -- these are *not* busy; they just hang when reused...)
    <braunr> antrik: yes i saw that too
    <antrik> braunr: note though that my hurd package is many months old...
    <antrik> (in fact everything on this system)
    <braunr> antrik: i didn't see anything relevant about the term server in
      years
    <braunr> antrik: what shell do you use ?
    <antrik> yeah, but such errors could be caused by all kinds of changes in
      other parts of the Hurd, glibc, whatever...
    <antrik> bash


## IRC, freenode, #hurd, 2012-12-27

    <youpi> we however have a similar symptom with screen
    <youpi> shells don't terminate
    <braunr> yes
    <youpi> or at least the window doesn't close
    <braunr> the screen problem is the same as the term servers not being properly closed
    <youpi> k
    <braunr> that one is still on my todo list
    <braunr> and not easy
    <youpi> like so many small items on the TODO lists :)
    <braunr> that one is an important one :)
    <braunr> because we're still using legacy pty, the number of terms is
      limited
    <braunr> which means at some point we can't log in any more using them
    <braunr> (i regularly kill pty terms on darnassus to avoid that)
    <braunr> it prevents screen and rsyslogd iirc from working correctly, which
      is very annoying
    <braunr> there may be other issues


# Formal Verification

This issue may be a simple programming error, or it may be more complicated.

Methods of [[service_solahart_jakarta_selatan__082122541663/formal_verification]] should be applied to confirm that there is
no error in `/hurd/term`'s logic itself.  There are tools for formal
verification/[[code_analysis]] that can likely help here.

There is a [[!FF_project 277]][[!tag bounty]] on this task.