open_issues/performance/io_system/read-ahead.mdwn


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301

[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
id="license" text="Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
is included in the section entitled [[GNU Free Documentation
License|/fdl]]."]]"""]]

[[!tag open_issue_gnumach open_issue_hurd]]

[[community/gsoc/project_ideas/disk_io_performance]]

IRC, #hurd, freenode, 2011-02-13:

    <etenil> youpi: Would libdiskfs/diskfs.h be in the right place to make
      readahead functions?
    <youpi> etenil: no, it'd rather be at the memory management layer,
      i.e. mach, unfortunately
    <youpi> because that's where you see the page faults
    <etenil> youpi: Linux also provides a readahead() function for higher level
      applications. I'll probably have to add the same thing in a place that's
      higher level than mach
    <youpi> well, that should just be hooked to the same common implementation
    <etenil> the man page for readahead() also states that portable
      applications should avoid it, but it could be benefic to have it for
      portability
    <youpi> it's not in posix indeed

---

IRC, #hurd, freenode, 2011-02-14:

    <etenil> youpi: I've investigated prefetching (readahead) techniques. One
      called DiskSeen seems really efficient. I can't tell yet if it's patented
      etc. but I'll keep you informed
    <youpi> don't bother with complicated techniques, even the most simple ones
      will be plenty :)
    <etenil> it's not complicated really
    <youpi> the matter is more about how to plug it into mach
    <etenil> ok
    <youpi> then don't bother with potential pattents
    <antrik> etenil: please take a look at the work KAM did for last year's
      GSoC
    <youpi> just use a trivial technique :)
    <etenil> ok, i'll just go the easy way then

    <braunr> antrik: what was etenil referring to when talking about
      prefetching ?
    <braunr> oh, madvise() stuff
    <braunr> i could help him with that

---

[[Etenil]] is now working in this area.

---

IRC, freenode, #hurd, 2011-02-15

    <etenil> oh, I'm looking into prefetching/readahead to improve I/O
      performance
    <braunr> etenil: ok
    <braunr> etenil: that's actually a VM improvement, like samuel told you
    <etenil> yes
    <braunr> a true I/O improvement would be I/O scheduling
    <braunr> and how to implement it in a hurdish way
    <braunr> (or if it makes sense to have it in the kernel)
    <etenil> that's what I've been wondering too lately
    <braunr> concerning the VM, you should look at madvise()
    <etenil> my understanding is that Mach considers devices without really
      knowing what they are
    <braunr> that's roughly the interface used both at the syscall() and the
      kernel levels in BSD, which made it in many other unix systems
    <etenil> whereas I/O optimisations are often hard disk drives specific
    <braunr> that's true for almost any kernel
    <braunr> the device knowledge is at the driver level
    <etenil> yes
    <braunr> (here, I separate kernels from their drivers ofc)
    <etenil> but Mach also contains some drivers, so I'm going through the code
      to find the apropriate place for these improvements
    <braunr> you shouldn't tough the drivers at all
    <braunr> touch
    <etenil> true, but I need to understand how it works before fiddling around
    <braunr> hm
    <braunr> not at all
    <braunr> the VM improvement is about pagein clustering
    <braunr> you don't need to know how pages are fetched
    <braunr> well, not at the device level
    <braunr> you need to know about the protocol between the kernel and
      external pagers
    <etenil> ok
    <braunr> you could also implement pageout clustering
    <etenil> if I understand you well, you say that what I'd need to do is a
      queuing system for the paging in the VM?
    <braunr> no
    <braunr> i'm saying that, when a page fault occurs, the kernel should
      (depending on what was configured through madvise()) transfer pages in
      multiple blocks rather than one at a time
    <braunr> communication with external pagers is already async, made through
      regular ports
    <braunr> which already implement message queuing
    <braunr> you would just need to make the mapped regions larger
    <braunr> and maybe change the interface so that this size is passed
    <etenil> mmh
    <braunr> (also don't forget that page clustering can include pages *before*
      the page which caused the fault, so you may have to pass the start of
      that region too)
    <etenil> I'm not sure I understand the page fault thing
    <etenil> is it like a segmentation error?
    <etenil> I can't find a clear definition in Mach's manual
    <braunr> ah
    <braunr> it's a fundamental operating system concept
    <braunr> http://en.wikipedia.org/wiki/Page_fault
    <etenil> ah ok
    <etenil> I understand now
    <etenil> so what's currently happening is that when a page fault occurs,
      Mach is transfering pages one at a time and wastes time 
    <braunr> sometimes, transferring just one page is what you want
    <braunr> it depends on the application, which is why there is madvise()
    <braunr> our rootfs, on the other hand, would benefit much from such an
      improvement
    <braunr> in UVM, this optimization is account for around 10% global
      performance improvement
    <braunr> accounted*
    <etenil> not bad
    <braunr> well, with an improved page cache, I'm sure I/O would matter less
      on systems with more RAM
    <braunr> (and another improvement would make mach support more RAM in the
      first place !)
    <braunr> an I/O scheduler outside the kernel would be a very good project
      IMO
    <braunr> in e.g. libstore/storeio
    <etenil> yes
    <braunr> but as i stated in my thesis, a resource scheduler should be as
      close to its resource as it can
    <braunr> and since mach can host several operating systems, I/O schedulers
      should reside near device drivers
    <braunr> and since current drivers are in the kernel, it makes sens to have
      it in the kernel too
    <braunr> so there must be some discussion about this
    <etenil> doesn't this mean that we'll have to get some optimizations in
      Mach and have the same outside of Mach for translators that access the
      hardware directly?
    <braunr> etenil: why ?
    <etenil> well as you said Mach contains some drivers, but in principle, it
      shouldn't, translators should do disk access etc, yes?
    <braunr> etenil: ok
    <braunr> etenil: so ?
    <etenil> well, let's say if one were to introduce SATA support in Hurd,
      nothing would stop him/her to do so with a translator rather than in Mach
    <braunr> you should avoid the term translator here
    <braunr> it's really hurd specific
    <braunr> let's just say a user space task would be responsible for that
      job, maybe multiple instances of it, yes
    <etenil> ok, so in this case, let's say we have some I/O optimization
      techniques like readahead and I/O scheduling within Mach, would these
      also apply to the user-space task, or would they need to be
      reimplemented?
    <braunr> if you have user space drivers, there is no point having I/O
      scheduling in the kernel
    <etenil> but we also have drivers within the kernel
    <braunr> what you call readahead, and I call pagein/out clustering, is
      really tied to the VM, so it must be in Mach in any case
    <braunr> well
    <braunr> you either have one or the other
    <braunr> currently we have them in the kernel
    <braunr> if we switch to DDE, we should have all of them outside
    <braunr> that's why such things must be discussed
    <etenil> ok so if I follow you, then future I/O device drivers will need to
      be implemented for Mach
    <braunr> currently, yes
    <braunr> but preferrably, someone should continue the work that has been
      done on DDe so that drivers are outside the kernel
    <etenil> so for the time being, I will try and improve I/O in Mach, and if
      drivers ever get out, then some of the I/O optimizations will need to be
      moved out of Mach
    <braunr> let me remind you one of the things i said
    <braunr> i said I/O scheduling should be close to their resource, because
      we can host several operating systems
    <braunr> now, the Hurd is the only system running on top of Mach
    <braunr> so we could just have I/O scheduling outside too
    <braunr> then you should consider neighbor hurds
    <braunr> which can use different partitions, but on the same device
    <braunr> currently, partitions are managed in the kernel, so file systems
      (and storeio) can't make good scheduling decisions if it remains that way
    <braunr> but that can change too
    <braunr> a single storeio representing a whole disk could be shared by
      several hurd instances, just as if it were a high level driver
    <braunr> then you could implement I/O scheduling in storeio, which would be
      an improvement for the current implementation, and reusable for future
      work
    <etenil> yes, that was my first instinct
    <braunr> and you would be mostly free of the kernel internals that make it
      a nightmare
    <etenil> but youpi said that it would be better to modify Mach instead
    <braunr> he mentioned the page clustering thing
    <braunr> not I/O scheduling
    <braunr> theseare really two different things
    <etenil> ok
    <braunr> you *can't* implement page clustering outside Mach because Mach
      implements virtual memory
    <braunr> both policies and mechanisms
    <etenil> well, I'd rather think of one thing at a time if that's alright
    <etenil> so what I'm busy with right now is setting up clustered page-in
    <etenil> which need to be done within Mach
    <braunr> keep clustered page-outs in mind too
    <braunr> although there are more constraints on those
    <etenil> yes
    <etenil> I've looked up madvise(). There's a lot of documentation about it
      in Linux but I couldn't find references to it in Mach (nor Hurd), does it
      exist?
    <braunr> well, if it did, you wouldn't be caring about clustered page
      transfers, would you ?
    <braunr> be careful about linux specific stuff
    <etenil> I suppose not
    <braunr> you should implement at least posix options, and if there are
      more, consider the bsd variants
    <braunr> (the Mach VM is the ancestor of all modern BSD VMs)
    <etenil> madvise() seems to be posix
    <braunr> there are system specific extensions
    <braunr> be careful
    <braunr> CONFORMING TO POSIX.1b.   POSIX.1-2001 describes posix_madvise(3)
      with constants POSIX_MADV_NORMAL, etc., with a behav‐ ior close to that
      described here.  There is a similar posix_fadvise(2) for file access.
    <braunr> MADV_REMOVE, MADV_DONTFORK, MADV_DOFORK, MADV_HWPOISON,
      MADV_MERGEABLE, and MADV_UNMERGEABLE  are  Linux- specific.
    <etenil> I was about to post these
    <etenil> ok, so basically madvise() allows tasks etc. to specify a usage
      type for a chunk of memory, then I could apply the relevant I/O
      optimization based on this
    <braunr> that's it
    <etenil> cool, then I don't need to worry about knowing what the I/O is
      operating on, I just need to apply the optimizations as advised
    <etenil> that's convenient
    <etenil> ok I'll start working on this tonight
    <etenil> making a basic readahead shouldn't be too hard
    <braunr> readahead is a misleading name
    <etenil> is pagein better?
    <braunr> applies to too many things, doesn't include the case where
      previous elements could be prefetched
    <braunr> clustered page transfers is what i would use
    <braunr> page prefetching maybe
    <etenil> ok
    <braunr> you should stick to something that's already used in the
      literature since you're not inventing something new
    <etenil> yes I've read a paper about prefetching
    <etenil> ok
    <etenil> thanks for your help braunr
    <braunr> sure
    <braunr> you're welcome
    <antrik> braunr: madvise() is really the least important part of the
      picture...
    <antrik> very few applications actually use it. but pretty much all
      applications will profit from clustered paging
    <antrik> I would consider madvise() an optional goody, not an integral part
      of the implementation
    <antrik> etenil: you can find some stuff about KAM's work on
      http://www.gnu.org/software/hurd/user/kam.html
    <antrik> not much specific though
    <etenil> thanks
    <antrik> I don't remember exactly, but I guess there is also some
      information on the mailing list. check the archives for last summer
    <antrik> look for Karim Allah Ahmed
    <etenil> antrik: I disagree, madvise gives me a good starting point, even
      if eventually the optimisations should run even without it
    <antrik> the code he wrote should be available from Google's summer of code
      page somewhere...
    <braunr> antrik: right, i was mentioning madvise() because the kernel (VM)
      interface is pretty similar to the syscall
    <braunr> but even a default policy would be nice
    <antrik> etenil: I fear that many bits were discussed only on IRC... so
      you'd better look through the IRC logs from last April onwards...
    <etenil> ok

    <etenil> at the beginning I thought I could put that into libstore
    <etenil> which would have been fine

    <antrik> BTW, I remembered now that KAM's GSoC application should have a
      pretty good description of the necessary changes... unfortunately, these
      are not publicly visible IIRC :-(

---

IRC, freenode, #hurd, 2011-02-16

    <etenil> braunr: I've looked in the kernel to see where prefetching would
      fit best. We talked of the VM yesterday, but I'm not sure about it. It
      seems to me that the device part of the kernel makes more sense since
      it's logically what manages devices, am I wrong?
    <braunr> etenil: you are
    <braunr> etenil: well
    <braunr> etenil: drivers should already support clustered sector
      read/writes
    <etenil> ah
    <braunr> but yes, there must be support in the drivers too
    <braunr> what would really benefit the Hurd mostly concerns page faults, so
      the right place is the VM subsystem

[[clustered_page_faults]]