[[!meta copyright="Copyright © 2011, 2012 Free Software Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable id="license" text="Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled [[GNU Free Documentation License|/fdl]]."]]"""]]

[[!tag open_issue_gnumach open_issue_hurd]]

[[!toc]]

# [[community/gsoc/project_ideas/disk_io_performance]]

# 2011-02

[[Etenil]] has been working in this area.

## IRC, freenode, #hurd, 2011-02-13

youpi: Would libdiskfs/diskfs.h be in the right place to make readahead functions?
etenil: no, it'd rather be at the memory management layer, i.e. mach, unfortunately
because that's where you see the page faults
youpi: Linux also provides a readahead() function for higher level applications. I'll probably have to add the same thing in a place that's higher level than mach
well, that should just be hooked to the same common implementation
the man page for readahead() also states that portable applications should avoid it, but it could be beneficial to have it for portability
it's not in posix indeed

## IRC, freenode, #hurd, 2011-02-14

youpi: I've investigated prefetching (readahead) techniques. One called DiskSeen seems really efficient. I can't tell yet if it's patented etc. but I'll keep you informed
don't bother with complicated techniques, even the most simple ones will be plenty :)
it's not complicated really
the matter is more about how to plug it into mach
ok
then don't bother with potential patents
etenil: please take a look at the work KAM did for last year's GSoC
just use a trivial technique :)
ok, i'll just go the easy way then
antrik: what was etenil referring to when talking about prefetching ?
oh, madvise() stuff
i could help him with that

## IRC, freenode, #hurd, 2011-02-15

oh, I'm looking into prefetching/readahead to improve I/O performance
etenil: ok
etenil: that's actually a VM improvement, like samuel told you
yes
a true I/O improvement would be I/O scheduling
and how to implement it in a hurdish way
(or if it makes sense to have it in the kernel)
that's what I've been wondering too lately
concerning the VM, you should look at madvise()
my understanding is that Mach considers devices without really knowing what they are
that's roughly the interface used both at the syscall() and the kernel levels in BSD, which made it into many other unix systems
whereas I/O optimisations are often hard disk drive specific
that's true for almost any kernel
the device knowledge is at the driver level
yes
(here, I separate kernels from their drivers ofc)
but Mach also contains some drivers, so I'm going through the code to find the appropriate place for these improvements
you shouldn't tough the drivers at all
touch
true, but I need to understand how it works before fiddling around
hm
not at all
the VM improvement is about pagein clustering
you don't need to know how pages are fetched
well, not at the device level
you need to know about the protocol between the kernel and external pagers
ok
you could also implement pageout clustering
if I understand you well, you say that what I'd need to do is a queuing system for the paging in the VM?
no
i'm saying that, when a page fault occurs, the kernel should (depending on what was configured through madvise()) transfer pages in multiple blocks rather than one at a time
communication with external pagers is already async, made through regular ports
which already implement message queuing
you would just need to make the mapped regions larger
and maybe change the interface so that this size is passed
mmh
(also don't forget that page clustering can include pages *before* the page which caused the fault, so you may have to pass the start of that region too)
I'm not sure I understand the page fault thing
is it like a segmentation error?
I can't find a clear definition in Mach's manual
ah
it's a fundamental operating system concept
http://en.wikipedia.org/wiki/Page_fault
ah ok
I understand now
so what's currently happening is that when a page fault occurs, Mach is transferring pages one at a time and wastes time
sometimes, transferring just one page is what you want
it depends on the application, which is why there is madvise()
our rootfs, on the other hand, would benefit much from such an improvement
in UVM, this optimization is account for around 10% global performance improvement
accounted*
not bad
well, with an improved page cache, I'm sure I/O would matter less on systems with more RAM
(and another improvement would make mach support more RAM in the first place !)
an I/O scheduler outside the kernel would be a very good project IMO
in e.g. libstore/storeio
yes
but as i stated in my thesis, a resource scheduler should be as close to its resource as it can
and since mach can host several operating systems, I/O schedulers should reside near device drivers
and since current drivers are in the kernel, it makes sense to have it in the kernel too
so there must be some discussion about this
doesn't this mean that we'll have to get some optimizations in Mach and have the same outside of Mach for translators that access the hardware directly?
etenil: why ?
well as you said Mach contains some drivers, but in principle, it shouldn't, translators should do disk access etc, yes?
etenil: ok
etenil: so ?
well, let's say if one were to introduce SATA support in Hurd, nothing would stop him/her from doing so with a translator rather than in Mach
you should avoid the term translator here
it's really hurd specific
let's just say a user space task would be responsible for that job, maybe multiple instances of it, yes
ok, so in this case, let's say we have some I/O optimization techniques like readahead and I/O scheduling within Mach, would these also apply to the user-space task, or would they need to be reimplemented?
if you have user space drivers, there is no point having I/O scheduling in the kernel
but we also have drivers within the kernel
what you call readahead, and I call pagein/out clustering, is really tied to the VM, so it must be in Mach in any case
well
you either have one or the other
currently we have them in the kernel
if we switch to DDE, we should have all of them outside
that's why such things must be discussed
ok so if I follow you, then future I/O device drivers will need to be implemented for Mach
currently, yes
but preferably, someone should continue the work that has been done on DDE so that drivers are outside the kernel
so for the time being, I will try and improve I/O in Mach, and if drivers ever get out, then some of the I/O optimizations will need to be moved out of Mach
let me remind you one of the things i said
i said I/O scheduling should be close to their resource, because we can host several operating systems
now, the Hurd is the only system running on top of Mach
so we could just have I/O scheduling outside too
then you should consider neighbor hurds
which can use different partitions, but on the same device
currently, partitions are managed in the kernel, so file systems (and storeio) can't make good scheduling decisions if it remains that way
but that can change too
a single storeio
representing a whole disk could be shared by several hurd instances, just as if it were a high level driver
then you could implement I/O scheduling in storeio, which would be an improvement for the current implementation, and reusable for future work
yes, that was my first instinct
and you would be mostly free of the kernel internals that make it a nightmare
but youpi said that it would be better to modify Mach instead
he mentioned the page clustering thing
not I/O scheduling
these are really two different things
ok
you *can't* implement page clustering outside Mach because Mach implements virtual memory
both policies and mechanisms
well, I'd rather think of one thing at a time if that's alright
so what I'm busy with right now is setting up clustered page-in
which needs to be done within Mach
keep clustered page-outs in mind too
although there are more constraints on those
yes
I've looked up madvise(). There's a lot of documentation about it in Linux but I couldn't find references to it in Mach (nor Hurd), does it exist?
well, if it did, you wouldn't be caring about clustered page transfers, would you ?
be careful about linux specific stuff
I suppose not
you should implement at least posix options, and if there are more, consider the bsd variants
(the Mach VM is the ancestor of all modern BSD VMs)
madvise() seems to be posix
there are system specific extensions
be careful
CONFORMING TO POSIX.1b. POSIX.1-2001 describes posix_madvise(3) with constants POSIX_MADV_NORMAL, etc., with a behavior close to that described here. There is a similar posix_fadvise(2) for file access.
MADV_REMOVE, MADV_DONTFORK, MADV_DOFORK, MADV_HWPOISON, MADV_MERGEABLE, and MADV_UNMERGEABLE are Linux-specific.
I was about to post these
ok, so basically madvise() allows tasks etc.
to specify a usage type for a chunk of memory, then I could apply the relevant I/O optimization based on this
that's it
cool, then I don't need to worry about knowing what the I/O is operating on, I just need to apply the optimizations as advised
that's convenient
ok I'll start working on this tonight
making a basic readahead shouldn't be too hard
readahead is a misleading name
is pagein better?
applies to too many things, doesn't include the case where previous elements could be prefetched
clustered page transfers is what i would use
page prefetching maybe
ok
you should stick to something that's already used in the literature since you're not inventing something new
yes I've read a paper about prefetching
ok
thanks for your help braunr
sure
you're welcome
braunr: madvise() is really the least important part of the picture...
very few applications actually use it. but pretty much all applications will profit from clustered paging
I would consider madvise() an optional goody, not an integral part of the implementation
etenil: you can find some stuff about KAM's work on http://www.gnu.org/software/hurd/user/kam.html
not much specific though
thanks
I don't remember exactly, but I guess there is also some information on the mailing list. check the archives for last summer
look for Karim Allah Ahmed
antrik: I disagree, madvise gives me a good starting point, even if eventually the optimisations should run even without it
the code he wrote should be available from Google's summer of code page somewhere...
antrik: right, i was mentioning madvise() because the kernel (VM) interface is pretty similar to the syscall
but even a default policy would be nice
etenil: I fear that many bits were discussed only on IRC... so you'd better look through the IRC logs from last April onwards...
ok
at the beginning I thought I could put that into libstore
which would have been fine
BTW, I remembered now that KAM's GSoC application should have a pretty good description of the necessary changes... unfortunately, these are not publicly visible IIRC :-(

## IRC, freenode, #hurd, 2011-02-16

braunr: I've looked in the kernel to see where prefetching would fit best. We talked of the VM yesterday, but I'm not sure about it. It seems to me that the device part of the kernel makes more sense since it's logically what manages devices, am I wrong?
etenil: you are
etenil: well
etenil: drivers should already support clustered sector read/writes
ah
but yes, there must be support in the drivers too
what would really benefit the Hurd mostly concerns page faults, so the right place is the VM subsystem

[[clustered_page_faults]]

# 2012-03

## IRC, freenode, #hurd, 2012-03-21

I thought that readahead should have some heuristics, like accounting for size of object and last access time, but i didn't find any in kam's patch. Are heuristics needed, or would they be overhead for a microkernel?
size of object and last access time are not necessarily useful to take into account
what would typically be kept is the amount of contiguous data that has been read lately
to know whether it's random or sequential, and how much is read
(the whole size of the object does not necessarily give any indication of how much of it will be read)
if a big object is accessed often, performance could be improved if the frame that is read ahead is increased too.
yes, but the size of the object really does not matter
you can just observe how much data is read and realize that it's read a lot
all the more so with userland fs translators
it's not because you mount a CD image that you need to read it all
youpi: indeed. this will be better. But on the other hand there is the principle about policy and mechanism. And the kernel should implement mechanism, but heuristics seem to be policy.
Or in this case would moving readahead policy to user level be overhead?
mcsim: paging policy is all in kernel anyways; so it makes perfect sense to put the readahead policy there as well
(of course it can be argued -- probably rightly -- that all of this should go into userspace instead...)
antrik: probably defpager partly could do that. AFAIR, it is possible for defpager to return more memory than was asked.
antrik: I want to outline what should be done during gsoc. First, kernel should support simple readahead for specified number of pages (regarding direction of access) + simple heuristic for changing frame size. Also default pager could make some analysis, for instance if it has much data located consecutively it could return more data than was asked. For other pagers I won't do anything. Is it suitable?
mcsim: I think we actually had the same discussion already with KAM ;-)
for clustered pageout, the kernel *has* to make the decision. I'm really not convinced it makes sense to leave the decision for clustered pagein to the individual pagers
especially as this will actually complicate matters because a) it will require work in *every* pager, and b) it will probably make handling of MADVISE & friends more complex
implementing readahead only for the default pager would actually be rather unrewarding. I'm pretty sure it's the one giving the *least* benefit
it's much, much more important for ext2
mcsim: maybe try to dig in the irc logs, we discussed it with neal. the current natural place would be the kernel, because it's the piece that gets the traps and thus knows what happens with each projection, while the backend just provides the pages without knowing which projection wants it. Moving to userland would not only be overhead, but quite difficult
antrik: OK, but I'm not sure that I could do it for ext2.
OK, I'll dig.
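
The heuristic suggested in this discussion (keep track of how much contiguous data has been read lately, and ignore object size) could be sketched roughly as follows. This is only an illustrative model in Python, not GNU Mach code: the names `FaultHistory`, `on_fault`, `MIN_WINDOW` and `MAX_WINDOW`, the doubling policy and the 32-page cap are all assumptions made for the example, and only forward access is handled.

```python
from dataclasses import dataclass
from typing import Optional

MIN_WINDOW = 1    # pages; hypothetical: fetch only the faulting page
MAX_WINDOW = 32   # pages; hypothetical upper bound on one cluster


@dataclass
class FaultHistory:
    """Per-object record of recent page faults (hypothetical structure)."""
    next_expected: Optional[int] = None  # page just past the last cluster
    window: int = MIN_WINDOW             # current cluster size in pages

    def on_fault(self, page: int) -> range:
        """Update the history and return the pages to request."""
        if page == self.next_expected:
            # The fault continues the previous cluster: access looks
            # sequential, so grow the window (assumed policy: double it).
            self.window = min(self.window * 2, MAX_WINDOW)
        else:
            # Anything else is treated as random access: start over.
            self.window = MIN_WINDOW
        self.next_expected = page + self.window
        return range(page, self.next_expected)
```

With this model, faults at pages 10, 11 and 13 request 1, 2 and 4 pages respectively, and an unrelated fault at page 100 falls back to a single page.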

## IRC, freenode, #hurd, 2012-04-01

as part of implementing the readahead project I have to add an interface for setting appropriate behaviour for a memory range. This interface then should be compatible with the madvise call, which has a lot of possible advice values, but most of them are specific to Linux (according to the man page). Should mach also support these Linux-specific values?
p.s. these Linux-specific values shouldn't affect the readahead algorithm.
the interface shouldn't prevent adding them some day
so that we don't have to add them yet
ok. And what should the behaviour with value MADV_NORMAL look like? Seems that it should be a synonym for MADV_SEQUENTIAL, shouldn't it?
no, it just means "no idea what it is"
in the linux implementation, that means some given readahead value
while SEQUENTIAL means twice as much
and RANDOM means zero
youpi: thank you.
youpi: Then, it seems to be better that the kernel interface for setting behaviour accepts a readahead value, without hiding it behind such constants, like VM_BEHAVIOR_DEFAULT (like it was in kam's patch). And then the implementation of madvise will call vm_behaviour_set with the appropriate frame size. Is that right?
question of taste, better ask on the list
ok
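
The mapping described above (NORMAL gets some given readahead value, SEQUENTIAL twice as much, RANDOM zero) might look like the following sketch, which also illustrates the proposal of passing a plain page count to the kernel instead of a behaviour constant. It is written in Python for illustration only; `Advice`, `readahead_pages` and the default of 8 pages are hypothetical names and values, not part of Mach, the Hurd or kam's patch.

```python
from enum import Enum


class Advice(Enum):
    """Hypothetical stand-ins for the three portable madvise() hints."""
    NORMAL = "normal"
    SEQUENTIAL = "sequential"
    RANDOM = "random"


DEFAULT_READAHEAD_PAGES = 8  # hypothetical baseline, not taken from any kernel


def readahead_pages(advice: Advice,
                    default: int = DEFAULT_READAHEAD_PAGES) -> int:
    """Translate an advice value into a readahead size in pages."""
    if advice is Advice.SEQUENTIAL:
        return 2 * default   # sequential: twice the default
    if advice is Advice.RANDOM:
        return 0             # random: no readahead at all
    return default           # normal: "no idea what it is"
```

Under this design a madvise() implementation in userland would do the translation and hand only the resulting number to the kernel, so Linux-specific advice values could be added later without changing the kernel interface.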