
[[!meta copyright="Copyright © 2011, 2012 Free Software Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
id="license" text="Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
is included in the section entitled [[GNU Free Documentation
License|/fdl]]."]]"""]]

[[!tag open_issue_gnumach open_issue_hurd]]

[[!toc]]


# [[community/gsoc/project_ideas/disk_io_performance]]


# 2011-02

[[Etenil]] has been working in this area.


## IRC, freenode, #hurd, 2011-02-13

    <etenil> youpi: Would libdiskfs/diskfs.h be in the right place to make
      readahead functions?
    <youpi> etenil: no, it'd rather be at the memory management layer,
      i.e. mach, unfortunately
    <youpi> because that's where you see the page faults
    <etenil> youpi: Linux also provides a readahead() function for higher level
      applications. I'll probably have to add the same thing in a place that's
      higher level than mach
    <youpi> well, that should just be hooked to the same common implementation
    <etenil> the man page for readahead() also states that portable
      applications should avoid it, but it could be beneficial to have it for
      portability
    <youpi> it's not in posix indeed


## IRC, freenode, #hurd, 2011-02-14

    <etenil> youpi: I've investigated prefetching (readahead) techniques. One
      called DiskSeen seems really efficient. I can't tell yet if it's patented
      etc. but I'll keep you informed
    <youpi> don't bother with complicated techniques, even the most simple ones
      will be plenty :)
    <etenil> it's not complicated really
    <youpi> the matter is more about how to plug it into mach
    <etenil> ok
    <youpi> then don't bother with potential patents
    <antrik> etenil: please take a look at the work KAM did for last year's
      GSoC
    <youpi> just use a trivial technique :)
    <etenil> ok, i'll just go the easy way then

    <braunr> antrik: what was etenil referring to when talking about
      prefetching ?
    <braunr> oh, madvise() stuff
    <braunr> i could help him with that


## IRC, freenode, #hurd, 2011-02-15

    <etenil> oh, I'm looking into prefetching/readahead to improve I/O
      performance
    <braunr> etenil: ok
    <braunr> etenil: that's actually a VM improvement, like samuel told you
    <etenil> yes
    <braunr> a true I/O improvement would be I/O scheduling
    <braunr> and how to implement it in a hurdish way
    <braunr> (or if it makes sense to have it in the kernel)
    <etenil> that's what I've been wondering too lately
    <braunr> concerning the VM, you should look at madvise()
    <etenil> my understanding is that Mach considers devices without really
      knowing what they are
    <braunr> that's roughly the interface used both at the syscall() and the
      kernel levels in BSD, which made it in many other unix systems
    <etenil> whereas I/O optimisations are often hard disk drives specific
    <braunr> that's true for almost any kernel
    <braunr> the device knowledge is at the driver level
    <etenil> yes
    <braunr> (here, I separate kernels from their drivers ofc)
    <etenil> but Mach also contains some drivers, so I'm going through the code
      to find the appropriate place for these improvements
    <braunr> you shouldn't tough the drivers at all
    <braunr> touch
    <etenil> true, but I need to understand how it works before fiddling around
    <braunr> hm
    <braunr> not at all
    <braunr> the VM improvement is about pagein clustering
    <braunr> you don't need to know how pages are fetched
    <braunr> well, not at the device level
    <braunr> you need to know about the protocol between the kernel and
      external pagers
    <etenil> ok
    <braunr> you could also implement pageout clustering
    <etenil> if I understand you well, you say that what I'd need to do is a
      queuing system for the paging in the VM?
    <braunr> no
    <braunr> i'm saying that, when a page fault occurs, the kernel should
      (depending on what was configured through madvise()) transfer pages in
      multiple blocks rather than one at a time
    <braunr> communication with external pagers is already async, made through
      regular ports
    <braunr> which already implement message queuing
    <braunr> you would just need to make the mapped regions larger
    <braunr> and maybe change the interface so that this size is passed
    <etenil> mmh
    <braunr> (also don't forget that page clustering can include pages *before*
      the page which caused the fault, so you may have to pass the start of
      that region too)
    <etenil> I'm not sure I understand the page fault thing
    <etenil> is it like a segmentation error?
    <etenil> I can't find a clear definition in Mach's manual
    <braunr> ah
    <braunr> it's a fundamental operating system concept
    <braunr> http://en.wikipedia.org/wiki/Page_fault
    <etenil> ah ok
    <etenil> I understand now
    <etenil> so what's currently happening is that when a page fault occurs,
      Mach is transferring pages one at a time and wastes time
    <braunr> sometimes, transferring just one page is what you want
    <braunr> it depends on the application, which is why there is madvise()
    <braunr> our rootfs, on the other hand, would benefit much from such an
      improvement
    <braunr> in UVM, this optimization is account for around 10% global
      performance improvement
    <braunr> accounted*
    <etenil> not bad
    <braunr> well, with an improved page cache, I'm sure I/O would matter less
      on systems with more RAM
    <braunr> (and another improvement would make mach support more RAM in the
      first place !)
    <braunr> an I/O scheduler outside the kernel would be a very good project
      IMO
    <braunr> in e.g. libstore/storeio
    <etenil> yes
    <braunr> but as i stated in my thesis, a resource scheduler should be as
      close to its resource as it can
    <braunr> and since mach can host several operating systems, I/O schedulers
      should reside near device drivers
    <braunr> and since current drivers are in the kernel, it makes sense to have
      it in the kernel too
    <braunr> so there must be some discussion about this
    <etenil> doesn't this mean that we'll have to get some optimizations in
      Mach and have the same outside of Mach for translators that access the
      hardware directly?
    <braunr> etenil: why ?
    <etenil> well as you said Mach contains some drivers, but in principle, it
      shouldn't, translators should do disk access etc, yes?
    <braunr> etenil: ok
    <braunr> etenil: so ?
    <etenil> well, let's say if one were to introduce SATA support in Hurd,
      nothing would stop him/her to do so with a translator rather than in Mach
    <braunr> you should avoid the term translator here
    <braunr> it's really hurd specific
    <braunr> let's just say a user space task would be responsible for that
      job, maybe multiple instances of it, yes
    <etenil> ok, so in this case, let's say we have some I/O optimization
      techniques like readahead and I/O scheduling within Mach, would these
      also apply to the user-space task, or would they need to be
      reimplemented?
    <braunr> if you have user space drivers, there is no point having I/O
      scheduling in the kernel
    <etenil> but we also have drivers within the kernel
    <braunr> what you call readahead, and I call pagein/out clustering, is
      really tied to the VM, so it must be in Mach in any case
    <braunr> well
    <braunr> you either have one or the other
    <braunr> currently we have them in the kernel
    <braunr> if we switch to DDE, we should have all of them outside
    <braunr> that's why such things must be discussed
    <etenil> ok so if I follow you, then future I/O device drivers will need to
      be implemented for Mach
    <braunr> currently, yes
    <braunr> but preferably, someone should continue the work that has been
      done on DDE so that drivers are outside the kernel
    <etenil> so for the time being, I will try and improve I/O in Mach, and if
      drivers ever get out, then some of the I/O optimizations will need to be
      moved out of Mach
    <braunr> let me remind you one of the things i said
    <braunr> i said I/O scheduling should be close to their resource, because
      we can host several operating systems
    <braunr> now, the Hurd is the only system running on top of Mach
    <braunr> so we could just have I/O scheduling outside too
    <braunr> then you should consider neighbor hurds
    <braunr> which can use different partitions, but on the same device
    <braunr> currently, partitions are managed in the kernel, so file systems
      (and storeio) can't make good scheduling decisions if it remains that way
    <braunr> but that can change too
    <braunr> a single storeio representing a whole disk could be shared by
      several hurd instances, just as if it were a high level driver
    <braunr> then you could implement I/O scheduling in storeio, which would be
      an improvement for the current implementation, and reusable for future
      work
    <etenil> yes, that was my first instinct
    <braunr> and you would be mostly free of the kernel internals that make it
      a nightmare
    <etenil> but youpi said that it would be better to modify Mach instead
    <braunr> he mentioned the page clustering thing
    <braunr> not I/O scheduling
    <braunr> these are really two different things
    <etenil> ok
    <braunr> you *can't* implement page clustering outside Mach because Mach
      implements virtual memory
    <braunr> both policies and mechanisms
    <etenil> well, I'd rather think of one thing at a time if that's alright
    <etenil> so what I'm busy with right now is setting up clustered page-in
    <etenil> which needs to be done within Mach
    <braunr> keep clustered page-outs in mind too
    <braunr> although there are more constraints on those
    <etenil> yes
    <etenil> I've looked up madvise(). There's a lot of documentation about it
      in Linux but I couldn't find references to it in Mach (nor Hurd), does it
      exist?
    <braunr> well, if it did, you wouldn't be caring about clustered page
      transfers, would you ?
    <braunr> be careful about linux specific stuff
    <etenil> I suppose not
    <braunr> you should implement at least posix options, and if there are
      more, consider the bsd variants
    <braunr> (the Mach VM is the ancestor of all modern BSD VMs)
    <etenil> madvise() seems to be posix
    <braunr> there are system specific extensions
    <braunr> be careful
    <braunr> CONFORMING TO POSIX.1b. POSIX.1-2001 describes posix_madvise(3)
      with constants POSIX_MADV_NORMAL, etc., with a behavior close to that
      described here. There is a similar posix_fadvise(2) for file access.
    <braunr> MADV_REMOVE, MADV_DONTFORK, MADV_DOFORK, MADV_HWPOISON,
      MADV_MERGEABLE, and MADV_UNMERGEABLE are Linux-specific.
    <etenil> I was about to post these
    <etenil> ok, so basically madvise() allows tasks etc. to specify a usage
      type for a chunk of memory, then I could apply the relevant I/O
      optimization based on this
    <braunr> that's it
    <etenil> cool, then I don't need to worry about knowing what the I/O is
      operating on, I just need to apply the optimizations as advised
    <etenil> that's convenient
    <etenil> ok I'll start working on this tonight
    <etenil> making a basic readahead shouldn't be too hard
    <braunr> readahead is a misleading name
    <etenil> is pagein better?
    <braunr> applies to too many things, doesn't include the case where
      previous elements could be prefetched
    <braunr> clustered page transfers is what i would use
    <braunr> page prefetching maybe
    <etenil> ok
    <braunr> you should stick to something that's already used in the
      literature since you're not inventing something new
    <etenil> yes I've read a paper about prefetching
    <etenil> ok
    <etenil> thanks for your help braunr
    <braunr> sure
    <braunr> you're welcome
    <antrik> braunr: madvise() is really the least important part of the
      picture...
    <antrik> very few applications actually use it. but pretty much all
      applications will profit from clustered paging
    <antrik> I would consider madvise() an optional goody, not an integral part
      of the implementation
    <antrik> etenil: you can find some stuff about KAM's work on
      http://www.gnu.org/software/hurd/user/kam.html
    <antrik> not much specific though
    <etenil> thanks
    <antrik> I don't remember exactly, but I guess there is also some
      information on the mailing list. check the archives for last summer
    <antrik> look for Karim Allah Ahmed
    <etenil> antrik: I disagree, madvise gives me a good starting point, even
      if eventually the optimisations should run even without it
    <antrik> the code he wrote should be available from Google's summer of code
      page somewhere...
    <braunr> antrik: right, i was mentioning madvise() because the kernel (VM)
      interface is pretty similar to the syscall
    <braunr> but even a default policy would be nice
    <antrik> etenil: I fear that many bits were discussed only on IRC... so
      you'd better look through the IRC logs from last April onwards...
    <etenil> ok
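
The clustered page-in idea from this discussion (fetch several pages per fault, possibly including pages *before* the faulting one) can be sketched in C as follows. This is only an illustration: `cluster_around_fault`, `struct cluster`, and the 4096-byte `SKETCH_PAGE_SIZE` are hypothetical names and values, not part of GNU Mach, and the sketch says nothing about how the resulting range would be passed to an external pager.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for this sketch; a real kernel takes it from the
   machine-dependent layer.  */
#define SKETCH_PAGE_SIZE ((size_t) 4096)

/* Byte range of a memory object to request from the pager in one go.  */
struct cluster
{
  size_t start;   /* page-aligned offset of the first page */
  size_t length;  /* number of bytes, a multiple of the page size */
};

/* Hypothetical helper: compute the cluster around a faulting offset.
   The range covers up to PAGES_BEFORE pages before the faulting page,
   the faulting page itself, and up to PAGES_AFTER pages after it,
   clipped to the bounds of the object.  */
static struct cluster
cluster_around_fault (size_t fault_offset, size_t object_size,
                      size_t pages_before, size_t pages_after)
{
  assert (fault_offset < object_size);

  /* Round the faulting offset down to its page boundary.  */
  size_t fault_page = fault_offset - fault_offset % SKETCH_PAGE_SIZE;

  /* Extend backwards, but never below the start of the object.  */
  size_t before = pages_before * SKETCH_PAGE_SIZE;
  size_t start = fault_page > before ? fault_page - before : 0;

  /* Round the object size up to a page boundary to get the upper limit.  */
  size_t limit = (object_size + SKETCH_PAGE_SIZE - 1)
                 / SKETCH_PAGE_SIZE * SKETCH_PAGE_SIZE;

  /* Extend forwards, but never past the end of the object.  */
  size_t end = fault_page + (pages_after + 1) * SKETCH_PAGE_SIZE;
  if (end > limit)
    end = limit;

  struct cluster c = { start, end - start };
  return c;
}
```

Returning both the start and the length reflects braunr's remark that the start of the region may have to be passed along with its size once pages before the fault are included.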

    <etenil> at the beginning I thought I could put that into libstore
    <etenil> which would have been fine

    <antrik> BTW, I remembered now that KAM's GSoC application should have a
      pretty good description of the necessary changes... unfortunately, these
      are not publicly visible IIRC :-(
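
For reference, this is roughly what the application side of the `madvise()` discussion looks like with the POSIX variant quoted above. It is a minimal sketch assuming a system that provides `mmap()` and `posix_madvise()`; the helper name `map_for_sequential_read` is made up for illustration, and the advice is only a hint that the VM is free to ignore.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical helper: map the first LENGTH bytes of FD read-only and
   tell the VM that they will be accessed sequentially.  Returns the
   mapping, or NULL if mmap() failed.  */
static void *
map_for_sequential_read (int fd, size_t length)
{
  assert (length > 0);

  void *addr = mmap (NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);
  if (addr == MAP_FAILED)
    return NULL;

  /* The advice is purely a hint, so a failure here is not fatal:
     the mapping stays usable, just without the hint.  */
  (void) posix_madvise (addr, length, POSIX_MADV_SEQUENTIAL);

  return addr;
}
```

As antrik points out, few applications issue such hints, which is why a sensible default policy in the kernel matters more than the advice interface itself.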


## IRC, freenode, #hurd, 2011-02-16

    <etenil> braunr: I've looked in the kernel to see where prefetching would
      fit best. We talked of the VM yesterday, but I'm not sure about it. It
      seems to me that the device part of the kernel makes more sense since
      it's logically what manages devices, am I wrong?
    <braunr> etenil: you are
    <braunr> etenil: well
    <braunr> etenil: drivers should already support clustered sector
      read/writes
    <etenil> ah
    <braunr> but yes, there must be support in the drivers too
    <braunr> what would really benefit the Hurd mostly concerns page faults, so
      the right place is the VM subsystem

[[clustered_page_faults]]


# 2012-03


## IRC, freenode, #hurd, 2012-03-21

    <mcsim> I thought that readahead should have some heuristics, like
      accounting size of object and last access time, but i didn't find any in
      kam's patch. Are heuristics needed or it will be overhead for
      microkernel? 
    <youpi> size of object and last access time are not necessarily useful to
      take into account
    <youpi> what would typically be kept is the amount of contiguous
      data that has been read lately
    <youpi> to know whether it's random or sequential, and how much is read
    <youpi> (the whole size of the object does not necessarily give any
      indication of how much of it will be read)
    <mcsim> if big object is accessed often, performance could be increased if
      frame that will be read ahead will be increased too.
    <youpi> yes, but the size of the object really does not matter
    <youpi> you can just observe how much data is read and realize that it's
      read a lot
    <youpi> all the more so with userland fs translators
    <youpi> it's not because you mount a CD image that you need to read it all
    <mcsim> youpi: indeed. this will be better. But on other hand there is
      principle about policy and mechanism. And kernel should implement
      mechanism, but heuristics seems to be policy. Or in this case moving
      readahead policy to user level would be overhead?
    <antrik> mcsim: paging policy is all in kernel anyways; so it makes perfect
      sense to put the readahead policy there as well
    <antrik> (of course it can be argued -- probably rightly -- that all of
      this should go into userspace instead...)
    <mcsim> antrik: probably defpager partly could do that. AFAIR, it is
      possible for defpager to return more memory than was asked.
    <mcsim> antrik: I want to outline what should be done during gsoc. First,
      kernel should support simple readahead for specified number of pages
      (regarding direction of access) + simple heuristic for changing frame
      size. Also default pager could make some analysis, for instance if it has
      much data located consecutively it could return more data than was
      asked. For other pagers I won't do anything. Is it suitable?
    <antrik> mcsim: I think we actually had the same discussion already with
      KAM ;-)
    <antrik> for clustered pageout, the kernel *has* to make the decision. I'm
      really not convinced it makes sense to leave the decision for clustered
      pagein to the individual pagers
    <antrik> especially as this will actually complicate matters because a) it
      will require work in *every* pager, and b) it will probably make handling
      of MADVISE & friends more complex
    <antrik> implementing readahead only for the default pager would actually
      be rather unrewarding. I'm pretty sure it's the one giving the *least*
      benefit
    <antrik> it's much, much more important for ext2
    <youpi> mcsim: maybe try to dig in the irc logs, we discussed about it with
      neal. the current natural place would be the kernel, because it's the
      piece that gets the traps and thus knows what happens with each
      projection, while the backend just provides the pages without knowing
      which projection wants it. Moving to userland would not only be overhead,
      but quite difficult
    <mcsim> antrik: OK, but I'm not sure that I could do it for ext2. 
    <mcsim> OK, I'll dig.
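
The heuristic youpi describes (keep track of how much contiguous data has been read lately, rather than the object's size) could look roughly like the C sketch below. Everything in it is an assumption made for illustration: the names `ra_state` and `ra_on_fault`, the cap of 8 pages, and the choice to prefetch as many pages as have already been read contiguously. It is not taken from KAM's patch or from GNU Mach.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed upper bound on the number of pages prefetched per fault.  */
#define SKETCH_MAX_WINDOW ((size_t) 8)

/* Per-object (or per-mapping) readahead state.  */
struct ra_state
{
  size_t expected;  /* page index that would continue the current run */
  size_t run;       /* pages read contiguously so far in this run */
};

/* Hypothetical hook called on a fault for page index PAGE.  Returns how
   many additional pages to fetch after PAGE.  A fault that continues
   the previous run grows the window; any other fault is treated as
   random access and resets it to zero.  */
static size_t
ra_on_fault (struct ra_state *s, size_t page)
{
  assert (s != NULL);

  /* A fault anywhere but right after the last transfer ends the run.  */
  if (page != s->expected)
    s->run = 0;

  /* Prefetch as much as was already read contiguously, up to the cap.  */
  size_t window = s->run < SKETCH_MAX_WINDOW ? s->run : SKETCH_MAX_WINDOW;

  /* Account for the faulting page plus the prefetched ones.  */
  s->run += 1 + window;
  s->expected = page + 1 + window;

  return window;
}
```

Starting from a zeroed state, sequential faults at pages 0, 1, 3, 7 and 15 yield windows of 0, 1, 3, 7 and 8 pages, and a jump to an unrelated page drops the window back to 0, all without looking at the size of the object.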


## IRC, freenode, #hurd, 2012-04-01

    <mcsim> as part of implementing the readahead project I have to add an
      interface for setting appropriate behaviour for a memory range.  This
      interface should then be compatible with the madvise call, which has a
      lot of possible advice values, but most of them are specific to Linux
      (according to man page). Should mach also support these Linux-specific
      values?
    <mcsim> p.s. these Linux-specific values shouldn't affect readahead
      algorithm.
    <youpi> the interface shouldn't prevent from adding them some day
    <youpi> so that we don't have to add them yet
    <mcsim> ok. And what should the behaviour with value MADV_NORMAL look like?
      Seems that it should be a synonym for MADV_SEQUENTIAL, shouldn't it?
    <youpi> no, it just means "no idea what it is"
    <youpi> in the linux implementation, that means some given readahead value
    <youpi> while SEQUENTIAL means twice as much
    <youpi> and RANDOM means zero
    <mcsim> youpi: thank you.
    <mcsim> youpi: Then, it seems to be better that kernel interface for
      setting behaviour will accept readahead value, without hiding it behind
      such constants, like VM_BEHAVIOR_DEFAULT (like it was in kam's
      patch). And then implementation of madvise will call vm_behaviour_set
      with appropriate frame size. Is that right?
    <youpi> question of taste, better ask on the list
    <mcsim> ok
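
The split mcsim proposes (the kernel interface takes a plain readahead size, and the `madvise` implementation translates advice values into sizes) might be sketched in C like this. The mapping follows youpi's description of the Linux behaviour: some default for NORMAL, twice that for SEQUENTIAL, zero for RANDOM. The function name and the default of 8 pages are assumptions made for this example, and the call that would hand the result to the kernel is deliberately left out because its signature was still under discussion.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <sys/mman.h>

/* Assumed default readahead size, in pages; the real value would be a
   tuning decision.  */
#define SKETCH_DEFAULT_READAHEAD_PAGES 8

/* Hypothetical translation step inside an madvise() implementation:
   turn a POSIX advice value into the number of pages to read ahead.
   Returns -1 for advice values this sketch does not handle, so that
   further values can be added later without changing the interface.  */
static int
readahead_pages_for_advice (int advice)
{
  int pages;

  switch (advice)
    {
    case POSIX_MADV_NORMAL:
      /* "No idea what it is": use the default.  */
      pages = SKETCH_DEFAULT_READAHEAD_PAGES;
      break;
    case POSIX_MADV_SEQUENTIAL:
      /* Sequential access: twice the default.  */
      pages = 2 * SKETCH_DEFAULT_READAHEAD_PAGES;
      break;
    case POSIX_MADV_RANDOM:
      /* Random access: reading ahead would only waste I/O.  */
      pages = 0;
      break;
    default:
      return -1;
    }

  assert (pages >= 0);
  return pages;
}
```

Keeping the translation outside the kernel interface means the kernel only ever sees a size, which is the design mcsim asks about; whether that is preferable to named constants is, as youpi says, a question of taste.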