[[!meta copyright="Copyright © 2010, 2011, 2013, 2014 Free Software Foundation,
Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
id="license" text="Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
is included in the section entitled [[GNU Free Documentation
License|/fdl]]."]]"""]]

[[!meta title="Profiling, Tracing"]]

*Profiling* ([[!wikipedia Profiling_(computer_programming) desc="Wikipedia
article"]]) is a technique for measuring where a program spends its CPU time.
It is usually done for [[performance analysis|performance]] reasons.

  * [[hurd/debugging/rpctrace]]

  * [[gprof]]

    Should be working, but some issues regarding GCC spec files have been
    reported.  These should be easy to fix, if that has not been done already.

  * [[glibc]]'s sotruss

  * [[ltrace]]

  * [[latrace]]

  * [[community/gsoc/project_ideas/dtrace]]

    Have a look at this, integrate it into the main trees.

  * [[LTTng]]

  * [[SystemTap]]

  * ... or some other Linux thing.


# IRC, freenode, #hurd, 2013-06-17

    <congzhang> is that possible we develop rpc msg analyse tool? make it clear
      view system at different level?
    <congzhang> hurd was dynamic system, how can we just read log line by line
    <kilobug> congzhang: well, you can use rpctrace and then analyze the logs,
      but rpctrace is quite intrusive and will slow down things (like strace or
      similar)
    <kilobug> congzhang: I don't know if a low-overhead solution could be made
      or not
    <congzhang> that's the problem
    <congzhang> when real system run, the msg cross different server, and then
      the debug action should not intrusive the process itself
    <congzhang> we observe the system and analyse os
    <congzhang> when rms choose microkernel, it's expect to accelerate the
      progress, but not
    <congzhang> microkernel make debug a little hard
    <kilobug> well, it's not limited to microkernels, debugging/tracing is
      intrusive and slow things down, it's an universal law of compsci
    <kilobug> no, it makes debugging easier
    <congzhang> I don't think so
    <kilobug> you can gdb the various services (like ext2fs or pfinet) more
      easily
    <kilobug> and rpctrace isn't any worse than strace
    <congzhang> how easy when debug lpc
    <kilobug> lpc ?
    <congzhang> because cross context
    <congzhang> classic function call
    <congzhang> when find the bug source, I don't care performance, I want to
      know it's right or wrong by design, If it work as I expect
    <congzhang> I optimize it later
    <congzhang> I have an idea, but don't know whether it's useful or not
    <braunr> rpctrace is a lot less intrusive than ptrace based tools
    <braunr> congzhang: debugging is not made hard by the design choice, but by
      implementation details
    <braunr> as a simple counter example, someone often cited usb development
      on l3 being made a lot easier than on a monolithic kernel
    <congzhang> Collect the trace information first, and then layout the msg by
      graph, when something wrong, I focus the trouble rpc, and found what
      happen around
    <braunr> "by graph" ?
    <congzhang> yes
    <congzhang> braunr: directed graph or something similar
    <braunr> and not caring about performance when debugging is actually stupid
    <braunr> i've seen it on many occasions, people not being able to use
      debugging tools because they were far too inefficient and slow
    <braunr> why a graph ?
    <braunr> what you want is the complete trace, taking into account cross
      address space boundaries
    <congzhang> yes
    <braunr> well it's linear
    <braunr> switching server
    <congzhang> by independent process view it's linear
    <congzhang> it's linear on cpu's view too
    <congzhang> yes, I need complete trace, and dynamic control at microkernel
      level
    <congzhang> os, if server crash, and then I know what's other doing, from
      the graph
    <congzhang> graph needn't to be one, if the are not connect together, time
      sort them
    <congzhang> when hurd was complete ok, some tools may be help too
    <braunr> i don't get what you want on that graph
    <congzhang> sorry, I need a context
    <congzhang> like uml sequence diagram, I need what happen one by one
    <congzhang> from server's view and from the function's view
    <braunr> that's still linear
    <braunr> so please stop using the word graph
    <braunr> you want a trace
    <braunr> a simple call trace
    <congzhang> yes, and a tool
    <braunr> with some work gdb could do it
    <congzhang> you mean under  some microkernel infrastructure help 
    <congzhang> ?
    <braunr> if needed
    <congzhang> braunr: will that be easy?
    <braunr> not too hard
    <braunr> i've had this idea for a long time actually
    <braunr> another reason i insist on migrating threads (or rather, binding
      server and client threads)
    <congzhang> braunr: that's  great
    <braunr> the current problem we have when using gdb is that we don't know
      which server thread is handling the request of which client
    <braunr> we can guess it
    <braunr> but it's not always obvious
    <congzhang> I read the talk, know some of your idea
    <congzhang> make things happen like classic kernel, just from function
      ,sure:)
    <braunr> that's it
    <congzhang> I think you and other do a lot of work to improve the mach and
      hurd, buT we lack the design document and the diagram, one diagram was
      great than one thousand words
    <braunr> diagrams are made after the prototypes that prove they're doable
    <braunr> i'm not a researcher
    <braunr> and we have little time
    <braunr> the prototype is the true spec
    <congzhang> that's why i want to collect the trace info and show, you can
      know what happen and how happen, maybe just suitable for newbie, hope
      more young hack like it
    <braunr> once it's done, everything else is just sugar candy around it


# IRC, freenode, #hurd, 2014-01-05

    <teythoon> braunr: do you speak ocaml ?
    <teythoon> i had this awesome idea for a universal profiling framework for
      c
    <teythoon> universal as in not os dependent, so it can be easily used on
      hurd or in gnu mach
    <teythoon> it does a source transformation, instrumenting what you are
      interested in
    <teythoon> for this transformation, coccinelle is used
    <teythoon> i have a prototype to measure how often a field in a struct is
      accessed
    <teythoon> unfortunately, coccinelle hangs while processing kern/slab.c :/
    <youpi> teythoon:  I do speak ocaml
    <teythoon> awesome :)
    <teythoon> unfortunately, i do not :/
    <teythoon> i should probably get in touch with the coccinelle devs, most
      likely the problem is that coccinelle runs in circles somewhere
    <youpi> it's not so complex actually
    <youpi> possibly,  yes
    <teythoon> do you know coccinelle ?
    <youpi> the only really peculiar thing in ocaml is lambda calculus
    <youpi> +c
    <youpi> I know a bit, although I've never really written a semantic patch
      myself
    <teythoon> i'm okay with that
    <youpi> but I can understand them
    <youpi> then ocaml should be fine for you :)
    <youpi> just ask the few bits that you don't understand :)
    <teythoon> yeah, i haven't really made an effort yet
    <youpi> writing ocaml is a bit more difficult because you need to
      understand the syntax, but for putting printfs it should be easy enough
    <youpi> if you get a backtrace with ocamldebug (it basically works like
      gdb), I can probably explain you what might be happening


## IRC, freenode, #hurd, 2014-01-06

    <teythoon> braunr: i'm not doing microoptimizations, i'm developing a
      profiler :p
    <braunr> teythoon: nice :)
    <teythoon> i thought you might like it
    <braunr> teythoon: you may want to look at
      http://pdos.csail.mit.edu/multicore/dprof/
    <braunr> from the same people who brought radixvm
    <teythoon> which data structure should i test it with next ?
    <braunr> uh, no idea :)
    <braunr> the ipc ones i suppose
    <teythoon> yeah, or the task related ones
    <braunr> but be careful, there many "inline" versions of many ipc functions
      in the fast paths
    <braunr> and when they say inline, they really mean they copied it
    <braunr> +are
    <teythoon> but i have a microbenchmark for ipc performance
    <braunr> you sure have been busy ;p
    <braunr> it's funny you're working on a profiler at the same time a
      colleague of mine said he was interested in writing one in x15 :)
    <teythoon> i don't think inlining is a problem for my tool
    <teythoon> well, you can use my tool for x15
    <braunr> i told him he could look at what you did
    <braunr> so i expect he'll ask soon
    <teythoon> cool :)
    <teythoon> my tool uses coccinelle to instrument c code, so this works in
      any environment
    <teythoon> one just needs a little glue and a method to get the data
    <braunr> seems reasonable
    <teythoon> for gnumach, i just stuff a tiny bit of code into the kdb

    <teythoon> hm debians bigmem patch with my code transformation makes
      gnumach hang early on
    <teythoon> i don't even get a single message from gnumach
    <braunr> ouch
    <teythoon> or it is something else entirely
    <teythoon> it didn't even work without my patches o_O
    <teythoon> weird
    <teythoon> uh oh, the kmem_cache array is not properly aligned
    <teythoon> braunr: http://paste.debian.net/74588/
    <braunr> teythoon: do you mean, with your patch ?
    <braunr> i'm not sure i understand
    <braunr> are you saying gnumach doesn't start because of an alignment issue
      ?
    <teythoon> no, that's unrelated
    <teythoon> i skipped the bigmem patch, have a running gnumach with
      instrumentation
    <braunr> hum, what is that aliased column ?
    <teythoon> but, despite my efforts with __attribute__((align(64))), i see
      lots of accesses to kmem_cache objects which are not properly aligned
    <braunr> is that reported by the performance counters ?
    <teythoon> no
    <teythoon> http://paste.debian.net/74593/
    <braunr> are those the previous lines accessed by other unrelated code ?
    <braunr> previous bytes in the same line*
    <teythoon> this is a patch generated to instrument the code
    <teythoon> so i instrument field access of the form i->a
    <teythoon> but if one does &i->a, my approach will no longer keep track of
      any access through that pointer
    <teythoon> so i do not count that as an access but as creating an alias for
      that field
    <braunr> ok
    <teythoon> so if that aliased count is not zero, the tool might
      underestimate the access count
    <teythoon> hm
    <teythoon> static struct kmem_cache kalloc_caches[KALLOC_NR_CACHES]
      __attribute__((align(64)));
    <teythoon> but
    <teythoon> nm gnumach|grep kalloc_caches
    <teythoon> c0226e20 b kalloc_caches
    <teythoon> ah, that's fine
    <braunr> yes
    <teythoon> never mind
    <braunr> don't we have a macro for the cache line size ?
    <teythoon> ah, there are a great many more kmem_caches around and no one
      told me ...
    <braunr> teythoon: eh :)
    <braunr> aren't you familiar with type-specific caches ?
    <teythoon> no, i'm not familiar with anything in gnumach-land
    <braunr> well, it's the regular slab allocator, carrying the same ideas
      since 1994
    <braunr> it's pretty much the same in linux and other modern unices
    <teythoon> ok
    <braunr> the main difference is likely that we allocate our caches
      statically because we have no kernel modules and know we'll never destroy
      them, only reap them
    <teythoon> is there a macro for the cache line size ?
    <teythoon> there is one buried in the linux source
    <teythoon> L1_CACHE_BYTES from linux/src/include/asm-i386/cache.h
    <braunr> there is one in kern/slab.h
    <teythoon> but it is out of date
    <teythoon> there is ?
    <braunr> but it's commented out
    <braunr> only used when SLAB_USE_CPU_POOLS is defined
    <braunr> but the build system should give you CPU_L1_SHIFT
    <teythoon> hm
    <braunr> and we probably should define CPU_L1_SIZE from that
      unconditionally in config.h or a general param.h file if there is one
    <braunr> the architecture-specific one perhaps
    <braunr> although it's exported to userland so maybe not
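
The distinction drawn above between counting `i->a` and treating `&i->a` as
an alias can be sketched in plain C.  This is only an illustration of the
idea, not the Coccinelle-based tool itself: `struct item`, the `ACCESS` and
`ALIAS` macros and the counter table are all hypothetical names.

```c
#include <assert.h>

/* Hypothetical structure; stands in for something like struct kmem_cache. */
struct item {
    int a;
    int b;
};

enum { FIELD_A, FIELD_B, NR_FIELDS };

/* Per-field counters (hypothetical layout). */
struct field_stats {
    unsigned long accesses; /* direct uses of the form i->field */
    unsigned long aliases;  /* times &i->field was taken */
};

static struct field_stats item_stats[NR_FIELDS];

/* What a source transformation might rewrite `i->field` into:
 * bump the counter, then yield the field as an lvalue. */
#define ACCESS(ptr, field, idx) \
    (*(item_stats[(idx)].accesses++, &(ptr)->field))

/* What `&i->field` might become: only record that an alias was created. */
#define ALIAS(ptr, field, idx) \
    (item_stats[(idx)].aliases++, &(ptr)->field)

/* Two direct field accesses, both counted. */
static int sum_item(struct item *i)
{
    return ACCESS(i, a, FIELD_A) + ACCESS(i, b, FIELD_B);
}

/* One alias is recorded; the two writes through it are invisible. */
static void bump_twice(struct item *i)
{
    int *p = ALIAS(i, a, FIELD_A);
    *p += 1;
    *p += 1;
}

static void demo(void)
{
    struct item it = { 1, 2 };

    assert(sum_item(&it) == 3);
    bump_twice(&it);
    assert(it.a == 3);
    /* Field a was touched three times, but only one access was counted. */
    assert(item_stats[FIELD_A].accesses == 1);
    assert(item_stats[FIELD_A].aliases == 1);
}
```

A non-zero alias count therefore signals that the access count for that field
may be an underestimate, which is how the "aliased" column is explained in the
log.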


## IRC, freenode, #hurd, 2014-01-07

    <teythoon> braunr: linux defines ____cacheline_aligned :
      http://lxr.free-electrons.com/source/include/linux/cache.h#L20
    <teythoon> where would i put a similar definition in gnumach ?
    <taylanub> .oO( four underscores ?!? )
    <teythoon> heh
    <teythoon> yes, four
    <braunr> teythoon: yes :)

    <teythoon> are kmem_cache objects ever allocated dynamically in gnumach ?
    <braunr> no
    <teythoon> hm
    <braunr> i figured that, since there are no kernel modules, there is no
      need to allocate them dynamically, since they're never destroyed
    <teythoon> so i aligned all static declarations with
      __attribute__((align(1 << CPU_L1_SHIFT)))
    <teythoon> but i still see 77% of all accesses being to objects that are
      not properly aligned o_O
    <teythoon> ah
    <teythoon> >,<
    <braunr> you could add an assertion in kmem_cache_init to find out what's
      wrong
    <teythoon> *aligned
    <braunr> eh :)
    <braunr> right
    <teythoon> grr
    <teythoon> sweet, the kmem_caches are now all properly aligned :)
    <braunr> :)

    <braunr> hm
    <braunr> i guess i should change what vmstat reports as "cache" from the
      cached objects to the external ones (which map files and not anonymous
      memory)
    <teythoon> braunr: http://paste.debian.net/74869/
    <teythoon> turned out that struct kmem_cache was actually an easy target
    <teythoon> no bitfields, no embedded structs that were addressed as such
      (and not aliased)
    <braunr> :)
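
The `*aligned` correction above refers to the attribute name: GCC spells it
`aligned`, and an unrecognised name such as `align` is normally only warned
about and otherwise ignored.  Below is a minimal sketch of a corrected
declaration together with the kind of assertion suggested for
`kmem_cache_init`.  `struct cache`, `cache_init`, `example_cache` and the
value chosen for `CPU_L1_SHIFT` are assumptions made for illustration, not
GNU Mach code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value; in GNU Mach the build system is said to provide it. */
#define CPU_L1_SHIFT 6
#define CPU_L1_SIZE  (1 << CPU_L1_SHIFT)

/* Hypothetical stand-in for struct kmem_cache. */
struct cache {
    char name[24];
    unsigned long nr_objs;
};

/* Note the spelling: `aligned`, not `align`. */
static struct cache example_cache __attribute__((aligned(CPU_L1_SIZE)));

/* True if p starts on a cache line boundary. */
static int is_cache_line_aligned(const void *p)
{
    return ((uintptr_t)p & (CPU_L1_SIZE - 1)) == 0;
}

/* Hypothetical init function that catches misaligned caches early. */
static void cache_init(struct cache *c)
{
    assert(is_cache_line_aligned(c));
    c->nr_objs = 0;
}
```

Calling `cache_init(&example_cache)` passes the assertion, whereas a cache
object whose declaration silently lost its alignment request could trip it.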


## IRC, freenode, #hurd, 2014-01-09

    <teythoon> braunr: i didn't quite get what you and youpi were talking about
      wrt to the alignment attribute
    <teythoon> define a type for struct kmem_cache with the alignment attribute
      ? is that possible ?
    <teythoon> ah, like it's done for kmem_cpu_pool
    <braunr> teythoon: that's it :)
    <braunr> note that aligning a struct doesn't change what sizeof returns
    <teythoon> heh, that saves one a whole lot of trouble indeed
    <braunr> you have to align a member inside for that
    <teythoon> why would it change the size ?
    <braunr> imagine an array of such structs
    <teythoon> ah
    <teythoon> right
    <teythoon> but it fits into two cachelines exactly
    <braunr> that wouldn't be a problem with an array either
    <teythoon> so an array of those will still be aligned element-wise
    <teythoon> yes
    <braunr> and it's often used like that, just as i did for the cpu pools
    <braunr> but then one is tempted to think the size of each element has
      changed too
    <braunr> and then use that technique for, say, reserving a whole cache line
      for one variable
    <teythoon> ah, now i get that remark ;)
    <braunr> :)

    <teythoon> braunr: i annotated struct kmem_cache in slab.h with
      __cacheline_aligned and it did not have the desired effect
    <braunr> can you show the diff please ?
    <teythoon> http://paste.debian.net/75192/
    <braunr> i don't know why :/
    <teythoon> that's how it's done for kmem_cpu_pool
    <braunr> i'll try it here
    <teythoon> wait
    <teythoon> i made a typo
    <teythoon> >,<
    <teythoon> __cachline_aligned
    <teythoon> bad one
    <braunr> uh :)
    <braunr> i don't see it
    <braunr> ah yes
    <braunr> missing e
    <teythoon> yep, works like a charm :)
    <teythoon> nice, good to know :)
    <braunr> :)
    <teythoon> given the previous discussion, shall i send it to the list or
      commit it right away ?
    <braunr> i'd say go ahead and commit
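
Whether alignment affects `sizeof` depends, with GCC, on where the attribute
is attached.  The sketch below compares three placements.  The struct names
and the 64-byte line size are assumptions for illustration.  It suggests that
the remark about `sizeof` holds for an attribute on a variable, while an
attribute on the struct definition or on a member rounds the size up to a
multiple of the alignment.

```c
#include <assert.h>
#include <stddef.h>

#define LINE 64 /* assumed cache line size */

struct plain {
    int value;
};

/* Attribute on a variable: the address is aligned, the size is unchanged. */
static struct plain aligned_var __attribute__((aligned(LINE)));

/* Attribute on the struct definition: the type itself becomes aligned. */
struct padded {
    int value;
} __attribute__((aligned(LINE)));

/* Attribute on a member: the struct inherits the member's alignment. */
struct member_aligned {
    int value __attribute__((aligned(LINE)));
};

static void check_sizes(void)
{
    struct padded arr[2];

    /* The variable is still exactly as large as its type. */
    assert(sizeof(aligned_var) == sizeof(struct plain));

    /* Both aligned types are padded to a multiple of the line size. */
    assert(sizeof(struct padded) % LINE == 0);
    assert(sizeof(struct member_aligned) % LINE == 0);

    /* Array elements are spaced by sizeof, so each one stays aligned. */
    assert((size_t)((char *)&arr[1] - (char *)&arr[0])
           == sizeof(struct padded));
}
```

A struct that already fills whole cache lines, as described above for
`struct kmem_cache`, gains no padding either way.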