[[!meta copyright="Copyright © 2011, 2012, 2013, 2014 Free Software Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable id="license" text="Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled [[GNU Free Documentation License|/fdl]]."]]"""]]

[[!tag open_issue_gnumach open_issue_hurd]]

[[!toc]]


# [[community/gsoc/project_ideas/disk_io_performance]]


# [[gnumach_page_cache_policy]]


# 2011-02

[[Etenil]] has been working in this area.


## IRC, freenode, #hurd, 2011-02-13

    youpi: Would libdiskfs/diskfs.h be in the right place to make readahead functions?
    etenil: no, it'd rather be at the memory management layer, i.e. Mach, unfortunately, because that's where you see the page faults
    youpi: Linux also provides a readahead() function for higher-level applications. I'll probably have to add the same thing in a place that's higher level than Mach
    well, that should just be hooked to the same common implementation
    the man page for readahead() also states that portable applications should avoid it, but it could be beneficial to have it for portability
    it's not in POSIX indeed


## IRC, freenode, #hurd, 2011-02-14

    youpi: I've investigated prefetching (readahead) techniques. One called DiskSeen seems really efficient. I can't tell yet if it's patented etc.
    but I'll keep you informed
    don't bother with complicated techniques, even the most simple ones will be plenty :)
    it's not complicated really
    the matter is more about how to plug it into Mach
    ok then
    don't bother with potential patents
    etenil: please take a look at the work KAM did for last year's GSoC
    just use a trivial technique :)
    ok, i'll just go the easy way then
    antrik: what was etenil referring to when talking about prefetching ?
    oh, madvise() stuff
    i could help him with that


## IRC, freenode, #hurd, 2011-02-15

    oh, I'm looking into prefetching/readahead to improve I/O performance
    etenil: ok
    etenil: that's actually a VM improvement, like samuel told you
    yes
    a true I/O improvement would be I/O scheduling
    and how to implement it in a hurdish way
    (or if it makes sense to have it in the kernel)
    that's what I've been wondering too lately
    concerning the VM, you should look at madvise()
    my understanding is that Mach considers devices without really knowing what they are
    that's roughly the interface used both at the syscall() and the kernel levels in BSD, which made it into many other unix systems
    whereas I/O optimisations are often hard disk drive specific
    that's true for almost any kernel
    the device knowledge is at the driver level
    yes
    (here, I separate kernels from their drivers ofc)
    but Mach also contains some drivers, so I'm going through the code to find the appropriate place for these improvements
    you shouldn't touch the drivers at all
    true, but I need to understand how it works before fiddling around
    hm not at all
    the VM improvement is about pagein clustering
    you don't need to know how pages are fetched
    well, not at the device level
    you need to know about the protocol between the kernel and external pagers
    ok
    you could also implement pageout clustering
    if I understand you well, you say that what I'd need to do is a queuing system for the paging in the VM?
    no
    i'm saying that, when a page fault occurs, the kernel should (depending on what was configured through madvise()) transfer pages in multiple blocks rather than one at a time
    communication with external pagers is already async, made through regular ports, which already implement message queuing
    you would just need to make the mapped regions larger
    and maybe change the interface so that this size is passed
    mmh
    (also don't forget that page clustering can include pages *before* the page which caused the fault, so you may have to pass the start of that region too)
    I'm not sure I understand the page fault thing
    is it like a segmentation error?
    I can't find a clear definition in Mach's manual
    ah
    it's a fundamental operating system concept
    http://en.wikipedia.org/wiki/Page_fault
    ah ok
    I understand now
    so what's currently happening is that when a page fault occurs, Mach is transferring pages one at a time and wastes time
    sometimes, transferring just one page is what you want
    it depends on the application, which is why there is madvise()
    our rootfs, on the other hand, would benefit much from such an improvement
    in UVM, this optimization accounted for around a 10% global performance improvement
    not bad
    well, with an improved page cache, I'm sure I/O would matter less on systems with more RAM
    (and another improvement would make Mach support more RAM in the first place !)
    an I/O scheduler outside the kernel would be a very good project IMO
    in e.g. libstore/storeio
    yes
    but as i stated in my thesis, a resource scheduler should be as close to its resource as it can
    and since Mach can host several operating systems, I/O schedulers should reside near device drivers
    and since current drivers are in the kernel, it makes sense to have it in the kernel too
    so there must be some discussion about this
    doesn't this mean that we'll have to get some optimizations in Mach and have the same outside of Mach for translators that access the hardware directly?
    etenil: why ?
    well as you said Mach contains some drivers, but in principle, it shouldn't; translators should do disk access etc, yes?
    etenil: ok
    etenil: so ?
    well, let's say if one were to introduce SATA support in Hurd, nothing would stop him/her from doing so with a translator rather than in Mach
    you should avoid the term translator here
    it's really hurd specific
    let's just say a user space task would be responsible for that job, maybe multiple instances of it
    yes
    ok, so in this case, let's say we have some I/O optimization techniques like readahead and I/O scheduling within Mach, would these also apply to the user-space task, or would they need to be reimplemented?
    if you have user space drivers, there is no point having I/O scheduling in the kernel
    but we also have drivers within the kernel
    what you call readahead, and I call pagein/out clustering, is really tied to the VM, so it must be in Mach in any case
    well you either have one or the other
    currently we have them in the kernel
    if we switch to DDE, we should have all of them outside
    that's why such things must be discussed
    ok so if I follow you, then future I/O device drivers will need to be implemented for Mach
    currently, yes
    but preferably, someone should continue the work that has been done on DDE so that drivers are outside the kernel
    so for the time being, I will try and improve I/O in Mach, and if drivers ever get out, then some of the I/O optimizations will need to be moved out of Mach
    let me remind you of one of the things i said
    i said I/O schedulers should be close to their resource, because we can host several operating systems
    now, the Hurd is the only system running on top of Mach
    so we could just have I/O scheduling outside too
    then you should consider neighbor hurds
    which can use different partitions, but on the same device
    currently, partitions are managed in the kernel, so file systems (and storeio) can't make good scheduling decisions if it remains that way
    but that can change too
    a single storeio
    representing a whole disk could be shared by several hurd instances, just as if it were a high level driver
    then you could implement I/O scheduling in storeio, which would be an improvement for the current implementation, and reusable for future work
    yes, that was my first instinct
    and you would be mostly free of the kernel internals that make it a nightmare
    but youpi said that it would be better to modify Mach instead
    he mentioned the page clustering thing
    not I/O scheduling
    these are really two different things
    ok
    you *can't* implement page clustering outside Mach because Mach implements virtual memory
    both policies and mechanisms
    well, I'd rather think of one thing at a time if that's alright
    so what I'm busy with right now is setting up clustered page-in
    which needs to be done within Mach
    keep clustered page-outs in mind too
    although there are more constraints on those
    yes
    I've looked up madvise(). There's a lot of documentation about it in Linux but I couldn't find references to it in Mach (nor Hurd), does it exist?
    well, if it did, you wouldn't be caring about clustered page transfers, would you ?
    be careful about linux specific stuff
    I suppose not
    you should implement at least the posix options, and if there are more, consider the bsd variants
    (the Mach VM is the ancestor of all modern BSD VMs)
    madvise() seems to be posix
    there are system specific extensions
    be careful
    CONFORMING TO POSIX.1b. POSIX.1-2001 describes posix_madvise(3) with constants POSIX_MADV_NORMAL, etc., with a behavior close to that described here. There is a similar posix_fadvise(2) for file access.
    MADV_REMOVE, MADV_DONTFORK, MADV_DOFORK, MADV_HWPOISON, MADV_MERGEABLE, and MADV_UNMERGEABLE are Linux-specific.
    I was about to post these
    ok, so basically madvise() allows tasks etc.
    to specify a usage type for a chunk of memory, then I could apply the relevant I/O optimization based on this
    that's it
    cool, then I don't need to worry about knowing what the I/O is operating on, I just need to apply the optimizations as advised
    that's convenient
    ok I'll start working on this tonight
    making a basic readahead shouldn't be too hard
    readahead is a misleading name
    is pagein better?
    applies to too many things, doesn't include the case where previous elements could be prefetched
    clustered page transfers is what i would use
    page prefetching maybe
    ok
    you should stick to something that's already used in the literature
    since you're not inventing something new
    yes I've read a paper about prefetching
    ok
    thanks for your help braunr
    sure
    you're welcome
    braunr: madvise() is really the least important part of the picture... very few applications actually use it. but pretty much all applications will profit from clustered paging
    I would consider madvise() an optional goody, not an integral part of the implementation
    etenil: you can find some stuff about KAM's work on http://www.gnu.org/software/hurd/user/kam.html
    not much specific though
    thanks
    I don't remember exactly, but I guess there is also some information on the mailing list. check the archives for last summer
    look for Karim Allah Ahmed
    antrik: I disagree, madvise gives me a good starting point, even if eventually the optimisations should run even without it
    the code he wrote should be available from Google's summer of code page somewhere...
    antrik: right, i was mentioning madvise() because the kernel (VM) interface is pretty similar to the syscall
    but even a default policy would be nice
    etenil: I fear that many bits were discussed only on IRC... so you'd better look through the IRC logs from last April onwards...
    ok
    at the beginning I thought I could put that into libstore
    which would have been fine
    BTW, I remembered now that KAM's GSoC application should have a pretty good description of the necessary changes... unfortunately, these are not publicly visible IIRC :-(


## IRC, freenode, #hurd, 2011-02-16

    braunr: I've looked in the kernel to see where prefetching would fit best. We talked of the VM yesterday, but I'm not sure about it. It seems to me that the device part of the kernel makes more sense since it's logically what manages devices, am I wrong?
    etenil: you are
    etenil: well
    etenil: drivers should already support clustered sector read/writes
    ah
    but yes, there must be support in the drivers too
    what would really benefit the Hurd mostly concerns page faults, so the right place is the VM subsystem

[[clustered_page_faults]]


# 2012-03

## IRC, freenode, #hurd, 2012-03-21

    I thought that readahead should have some heuristics, like accounting for the size of the object and the last access time, but i didn't find any in kam's patch. Are heuristics needed or would they be overhead for a microkernel?
    size of object and last access time are not necessarily useful to take into account
    what would usually typically be kept is the amount of contiguous data that has been read lately
    to know whether it's random or sequential, and how much is read
    (the whole size of the object does not necessarily give any indication of how much of it will be read)
    if a big object is accessed often, performance could be increased if the frame that will be read ahead is increased too.
    yes, but the size of the object really does not matter
    you can just observe how much data is read and realize that it's read a lot
    all the more so with userland fs translators
    it's not because you mount a CD image that you need to read it all
    youpi: indeed. this will be better. But on the other hand there is the principle about policy and mechanism. And the kernel should implement mechanism, but heuristics seem to be policy.
    Or in this case would moving readahead policy to user level be overhead?
    mcsim: paging policy is all in kernel anyways; so it makes perfect sense to put the readahead policy there as well
    (of course it can be argued -- probably rightly -- that all of this should go into userspace instead...)
    antrik: probably defpager partly could do that. AFAIR, it is possible for defpager to return more memory than was asked.
    antrik: I want to outline what should be done during gsoc. First, the kernel should support simple readahead for a specified number of pages (regarding direction of access) + a simple heuristic for changing the frame size. Also the default pager could make some analysis, for instance if it has a lot of data located consecutively it could return more data than was asked. For other pagers I won't do anything. Is it suitable?
    mcsim: I think we actually had the same discussion already with KAM ;-)
    for clustered pageout, the kernel *has* to make the decision. I'm really not convinced it makes sense to leave the decision for clustered pagein to the individual pagers
    especially as this will actually complicate matters because a) it will require work in *every* pager, and b) it will probably make handling of MADVISE & friends more complex
    implementing readahead only for the default pager would actually be rather unrewarding. I'm pretty sure it's the one giving the *least* benefit
    it's much, much more important for ext2
    mcsim: maybe try to dig in the irc logs, we discussed it with neal. the current natural place would be the kernel, because it's the piece that gets the traps and thus knows what happens with each projection, while the backend just provides the pages without knowing which projection wants them. Moving to userland would not only be overhead, but quite difficult
    antrik: OK, but I'm not sure that I could do it for ext2.
    OK, I'll dig.
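The heuristic youpi describes above (track the amount of contiguous data read lately, rather than object size or access time) could be outlined as follows. This is an editor's sketch with invented names, not gnumach code:

```c
/* Hypothetical sketch of a sequential-access detector: remember where
 * the last fault ended, grow the readahead window while faults stay
 * contiguous, and reset it on a random access. */
#include <assert.h>
#include <stddef.h>

struct stream_state {
    size_t next_offset;  /* page offset one past the last fault */
    size_t run_pages;    /* length of the current contiguous run */
};

/* Record a fault at 'offset' (in pages) and return how many extra
 * pages to read ahead: zero for random access, and a window that
 * grows with the length of the sequential run, up to a maximum. */
size_t record_fault(struct stream_state *s, size_t offset)
{
    const size_t max_readahead = 32;    /* 128 KiB with 4 KiB pages */

    if (offset == s->next_offset)
        s->run_pages++;                 /* still sequential */
    else
        s->run_pages = 0;               /* random: reset the window */
    s->next_offset = offset + 1;

    return s->run_pages < max_readahead ? s->run_pages : max_readahead;
}
```

Note that nothing here consults the object's total size: a handful of contiguous faults is enough evidence of sequential access, which is why mounting a large CD image does not by itself trigger any readahead.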
## IRC, freenode, #hurd, 2012-04-01

    as part of implementing the readahead project I have to add an interface for setting the appropriate behaviour for a memory range. This interface then should be compatible with the madvise call, which has a lot of possible advices, but most of them are specific to Linux (according to the man page). Should Mach also support these Linux-specific values? p.s. these Linux-specific values shouldn't affect the readahead algorithm.
    the interface shouldn't prevent adding them some day
    so that we don't have to add them yet
    ok. And what should behaviour with value MADV_NORMAL look like? Seems that it should be a synonym of MADV_SEQUENTIAL, isn't it?
    no, it just means "no idea what it is"
    in the linux implementation, that means some given readahead value
    while SEQUENTIAL means twice as much
    and RANDOM means zero
    youpi: thank you.
    youpi: Then, it seems to be better that the kernel interface for setting behaviour accepts a readahead value, without hiding it behind constants such as VM_BEHAVIOR_DEFAULT (like it was in kam's patch). And then the implementation of madvise will call vm_behaviour_set with an appropriate frame size. Is that right?
    question of taste, better ask on the list
    ok


## IRC, freenode, #hurd, 2012-06-09

    hello. What are fictitious pages in gnumach needed for? I mean why couldn't a real page be grabbed straight away, but sometimes a fictitious page is grabbed first and then converted to a real one?
    mcsim: iirc, fictitious pages are needed by device pagers which must comply with the vm pager interface
    mcsim: specifically, they must return a vm_page structure, but this vm_page describes device memory
    mcsim: and then, it must not be treated like a normal vm_page, which can be added to page queues (e.g. page cache)


## IRC, freenode, #hurd, 2012-06-22

    braunr: Ah. The patch for large storages introduced a new callback, pager_notify_evict. The user had to define this callback on his own, as pager_dropweak for instance. But neal's patch changes this.
    Now all callbacks can have any name, but the user defines a structure with pager ops and supplies it in pager_create.
    So, I just changed notify_evict to conform to the new style.
    braunr: I want to change the interface of mo_change_attributes and test my changes with real partitions. For both of these I have to update the ext2fs translator, but both partitions I have are bigger than 2Gb, that's why I need to apply this patch. But what to do with mo_change_attributes? I need somehow to inform the kernel about page fault policy.
    When I change the mo_ interface in the kernel I have to update all programs that use this interface, and ext2fs is one of them.
    braunr: How do you think it is better to inform the kernel about fault policy? At the moment I've added a fault_strategy parameter that accepts the following strategies: random, sequential with single page cluster, sequential with double page cluster and sequential with quad page cluster. OSF/mach has a completely different interface for mo_change_attributes. In OSF/mach mo_change_attributes accepts a structure of parameters. This structure could have different formats depending on
    This rpc could be useful because it is not very handy to update mo_change_attributes for the kernel, for hurd libs and for glibc. Instead of this the kernel will accept just one more structure format.
    well, like i wrote on the mailing list several weeks ago, i don't think the policy selection is of concern currently
    you should focus on the implementation of page clustering and readahead
    concerning the interface, i don't think it's very important
    also, i really don't like the fact that the policy is per object
    it should be per map entry
    i think i mentioned that in my mail too
    i really think you're wasting time on this
    http://lists.gnu.org/archive/html/bug-hurd/2012-04/msg00064.html
    http://lists.gnu.org/archive/html/bug-hurd/2012-04/msg00029.html
    mcsim: any reason you completely ignored those ?
    braunr: Ok. I'll do clustering for map entries.
    no it's not about that either :/
    clustering is grouping several pages in the same transfer between kernel and pager
    the *policy* is held in map entries
    mcsim: I'm not sure I properly understand your question about the policy interface... but if I do, it's IMHO usually better to expose individual parameters as RPC arguments explicitly, rather than hiding them in an opaque structure... (there was quite some discussion about that with the libburn guy)
    antrik: Will the following be ok? kern_return_t vm_advice(map, address, length, advice, cluster_size)
    Where advice will be either random or sequential
    looks fine to me... but then, I'm not an expert on this stuff :-)
    perhaps "policy" would be clearer than "advice"?
    madvise has the following prototype: int madvise(void *addr, size_t len, int advice);
    hmm... looks like I made a typo. Or is advi_c_e ok too?
    advise is a verb; advice a noun... there is a reason why both forms show up in the madvise prototype :-)
    so the final variant should be kern_return_t vm_advise(map, address, length, policy, cluster_size)?
    mcsim: nah, you are probably right that it's better to keep consistency with madvise, even if the name of the "advice" parameter there might not be ideal...
    BTW, where does cluster_size come from? from the filesystem?
    I see merits both to naming the parameter "policy" (clearer) or "advice" (more consistent) -- you decide :-)
    antrik: also there is the variant "strategy", like with inheritance :)
    I'll choose advice for now.
    What do you mean by "where does cluster_size come from"?
    well, madvise doesn't have this parameter; so the value must come from a different source?
    in the madvise implementation it could be a fixed value or somehow calculated based on the size of the memory range. In OSF/mach the cluster size is supplied too (via mo_change_attributes).
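To make the proposal above concrete, here is a user-level mock of the vm_advise() interface being discussed, with the policy stored per map entry as braunr suggests. The names (vm_advise, VM_ADVICE_*, cluster_size_for) follow the IRC discussion but are otherwise the editor's invention; this is an illustrative sketch, not the Mach implementation.

```c
/* Mock of the discussed vm_advise() call: a per-map-entry policy from
 * which the fault handler derives how many pages to transfer at once. */
#include <assert.h>
#include <stddef.h>

enum vm_advice { VM_ADVICE_NORMAL, VM_ADVICE_RANDOM, VM_ADVICE_SEQUENTIAL };

struct map_entry {
    size_t start;            /* first page of the mapped range */
    size_t length;           /* length of the range, in pages */
    enum vm_advice advice;   /* policy set by vm_advise() */
};

/* Set the paging policy for [start, start + length); fails if the
 * range is not covered by this entry. */
int vm_advise(struct map_entry *e, size_t start, size_t length,
              enum vm_advice advice)
{
    if (start < e->start || start + length > e->start + e->length)
        return -1;
    e->advice = advice;
    return 0;
}

/* On a fault, pick a transfer size from the policy alone: random
 * means just the faulted page; sequential means a larger window that
 * may also cover pages *before* the fault. The numbers are arbitrary. */
size_t cluster_size_for(const struct map_entry *e)
{
    switch (e->advice) {
    case VM_ADVICE_RANDOM:     return 1;
    case VM_ADVICE_NORMAL:     return 4;
    case VM_ADVICE_SEQUENTIAL: return 16;
    }
    return 1;
}
```

Deriving the transfer size inside the kernel like this is one answer to antrik's question about where cluster_size comes from: the caller only supplies the advice, and no cluster_size parameter is needed in the RPC.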
    ah, so you don't really know either :-)
    well, my guess is that it is derived from the cluster size used by the filesystem in question
    so for us it would always be 4k for now
    (and thus you can probably leave it out altogether...)
    well, fatfs can use larger clusters
    I would say, implement it only if it's very easy to do... if it's extra effort, it's probably not worth it
    There is sense in making the cluster size bigger for ext2 too, since most likely consecutive clusters will be within the same group. But anyway I'll handle this later.
    well, I don't know what cluster_size does exactly; but by the sound of it, I'd guess it makes the assumption that it's *always* better to read in this cluster size, even for random access -- which would be simply wrong for 4k filesystem clusters...
    BTW, I agree with braunr that madvise() is optional -- it is way way more important to get readahead working as a default policy first


## IRC, freenode, #hurd, 2012-07-01

    youpi: Do you think you could review my code?
    sure, just post it to the list
    make sure to break it down into logical pieces
    youpi: I pushed it to my branch at the gnumach repository
    youpi: or is it still better to post changes to the list?
    posting to the list would permit feedback from other people too
    mcsim: posix distinguishes normal, sequential and random
    we should probably too
    the system call should probably be named "vm_advise", to be a verb like allocate etc.
    youpi: ok. I had a talk with antrik regarding naming; I'll change this later because compiling glibc takes a lot of time.
    mcsim: I find it odd that vm_for_every_page allocates non-existing pages
    there should probably be at least a flag to request it or not
    youpi: normal policy is a synonym for default. And this could be treated as either random or sequential, isn't it?
    mcsim: normally, no
    yes, the normal policy would be the default
    it doesn't mean random or sequential
    it's just to be a compromise between both
    random is meant to make no read-ahead, since that'd be spurious anyway
    while by default we should make readahead
    and sequential makes even more aggressive readahead, which usually implies a greater number of pages to fetch
    that's all
    yes
    well, that part is handled by the cluster_size parameter actually
    what about reading pages preceding the faulted page ?
    Shouldn't sequential clean some pages (if they, for example, are not precious) that are placed before the fault page?
    ?
    that could make sense, yes
    you lost me
    and something that you wouldn't do with the normal policy
    braunr: clear what has been read previously ?
    since the access is supposed to be sequential
    oh
    the application will probably not re-read what was already read
    you mean to avoid caching it ?
    yes
    inactive memory is there for that
    while with the normal policy you'd assume that the application might want to go back etc.
    yes, but you can help it
    yes
    instead of making other pages compete with it
    but then, it's for precious pages
    I have to say I don't know what a precious page is
    does it mean dirty pages?
    no
    precious means cached pages
    "If precious is FALSE, the kernel treats the data as a temporary and may throw it away if it hasn't been changed. If the precious value is TRUE, the kernel treats its copy as a data repository and promises to return it to the manager; the manager may tell the kernel to throw it away instead by flushing and not cleaning the data"
    hm no
    precious means the kernel must keep it
    youpi: Regarding vm_for_every_page: what kind of flag do you suppose? If the object is internal, I suppose not to cross the bounds of the object, setting in_end appropriately in vm_calculate_clusters. If the object is external we don't know its actual size, so we should make an mo request first. And for this we should create fictitious pages.
    mcsim: but how would you implement this "cleaning" with sequential ?
    mcsim: ah, ok, I thought you were allocating memory, but it's just fictitious pages
    the comment "Allocate a new page" should be fixed :)
    braunr: I don't know how I will implement this specifically (haven't tried yet), but I don't think that this is impossible
    braunr: anyway it's useful as an example where normal and sequential would be different
    if it can be done simply
    because i can see more trouble than gains in there :)
    braunr: ok :)
    mcsim: hm also, why fictitious pages ?
    fictitious pages should normally be used only when dealing with memory mapped physically which is not real physical memory, e.g. device memory
    but vm_fault could occur when the object represents some device memory.
    that's exactly why there are fictitious pages
    at the moment of allocating the fictitious page it is not known what the backing store of the object is.
    really ?
    damn, i've got used to UVM too much :/
    braunr: did I say something wrong?
    no no
    it's just that sometimes, i'm confusing details about the various BSD implementations i've studied
    out-of-gsoc-topic question: besides network drivers, do you think we'll have other drivers that will run in userspace and have to implement memory mapping ? like framebuffers ? or will there be a translation layer such as storeio that will handle mapping ?
    framebuffers typically will, yes
    that'd be antrik's work on drm
    hmm ok
    mcsim: so does the implementation work, and do you see performance improvement?
    youpi: I haven't tested it yet with large ext2 :/
    youpi: I'm going to finish now moving ext2 to the new interface, then the other translators in the hurd repository, and then finish the memory policies in gnumach. Is it ok?
    which new interface?
    The one written by neal. I wrote some temporary code to make ext2 work with it, but I'm going to change this now.
    you mean the old unapplied patch?
    yes
    did you have a look at Karim's work?
    (I have to say I never found the time to check how it related with neal's patch)
    I found only his work in the kernel. I didn't see his work on applying neal's patch.
    ok
    how do they relate with each other?
    (I have never actually looked at either of them :/)
    his work in the kernel and neal's patch?
    yes
    They do not correlate with each other.
    ah, I must be misremembering what each of them does
    in kam's patch there were changes to support sequential reading in reverse order (as in OSF/Mach), but posix does not support such behavior, so I didn't implement this either.
    I can't find the pointer to neal's patch, do you have it off-hand?
    http://comments.gmane.org/gmane.os.hurd.bugs/351
    thx
    I think we are not talking about the same patch from Karim
    I mean lists.gnu.org/archive/html/bug-hurd/2010-06/msg00023.html
    I mean this patch: http://lists.gnu.org/archive/html/bug-hurd/2010-06/msg00024.html
    Oh.
    ok
    it seems this is just the same
    yes
    from a non-expert view, I would have thought these patches play hand in hand, do they really?
    this patch is completely for the kernel and neal's one is completely for libpager.
    i.e. neal's fixes libpager, and karim's fixes the kernel
    yes
    ending up with fixing the whole path?
    AIUI, karim's patch will be needed so that your increased readahead will end up with clustered page requests?
    I will not use kam's patch
    is it not needed to actually get pages in together?
    how do you tell libpager to fetch pages together?
    about the cluster size, I'd say it shouldn't be specified at the vm_advise() level
    in other OSes, it is usually automatically tuned by ramping it up to a maximum readahead size (which, however, could be specified)
    that's important for the normal policy, where there are typically successive periods of sequential reads, but you don't know in advance for how long
    braunr said that there are legal issues with his code, so I cannot use it.
    did i ?
    mcsim: can you give me a link to the code again please ?
    see above :)
    which one ?
    both
    they only differ by a typo
    mcsim: i don't remember saying that, do you have any link ? or log ?
    sorry, can you rephrase "ending up with fixing the whole path"?
    cluster_size in vm_advise could also be considered as advice
    no
    it must be the third time we're talking about this
    mcsim: I mean both parts would be needed to actually achieve clustered i/o
    again, why make cluster_size a per object attribute ? :(
    wouldn't some objects benefit from bigger cluster sizes, while others wouldn't?
    but again, I believe it should rather be autotuned
    (for each object)
    if we merely want posix compatibility (and for a first attempt, it's quite enough), vm_advise is good, and the kernel selects the implementation (and thus the cluster sizes)
    if we want finer grained control, perhaps a per pager cluster_size would be good, although its efficiency depends on several parameters (e.g. where the page is in this cluster)
    but a per object cluster size is a large waste of memory considering very few applications (if not none) would use the "feature" ..
    (if any*)
    there must be a misunderstanding
    why would it be a waste of memory?
    "per object"
    so?
    there can be many memory objects in the kernel
    so?
    so such an overhead must be useful to accept it
    in my understanding, a cluster size per object is just a mere integer for each object
    what overhead?
    yes
    don't we have just thousands of objects?
    for now
    remember we're trying to remove the page cache limit :)
    that still won't be more than tens of thousands of objects
    times an integer
    that's completely negligible
    braunr: Strange, I can't find it in the logs. Weird things are happening in my memory :/ Sorry.
    mcsim: i'm almost sure i never said that :/
    but i don't trust my memory too much either
    youpi: depends
    mcsim: I mean both parts would be needed to actually achieve clustered i/o
    braunr: I made a call vm_advise that applies a policy to a memory range (vm_map_entry to be specific)
    mcsim: good
    actually the cluster size should even be per memory range
    youpi: In this sense, yes
    k
    sorry, Internet connection lags
    when changing a structure used to create many objects, keep in mind one thing
    if its size gets larger than a threshold (currently, powers of two), the cache used by the slab allocator will allocate twice the necessary amount
    sure
    this is the case with most object caching allocators, although some can have specific caches for common sizes such as 96k which aren't powers of two
    anyway, an integer is negligible, but the final structure size must be checked
    (for both 32 and 64 bits)
    braunr: ok. But I didn't understand what should be done with cluster size in vm_advise? Should I delete it?
    to me, the cluster size is a pager property
    to me, the cluster size is a map property
    whereas vm_advise indicates what applications want
    you could have several processes accessing the same file in different ways
    youpi: that's why there is a policy
    isn't cluster_size part of the policy?
    but if the pager abilities are limited, it won't change much
    i'm not sure
    cluster_size is the amount of readahead, isn't it?
    no, it's the amount of data in a single transfer
    Yes, it is.
    ok, i'll have to check your code
    shouldn't transfers permit unbound amounts of data?
    braunr: then I misunderstand what readahead is
    well then cluster size is per policy :)
    e.g. random => 0, normal => 3, sequential => 15
    why make it per map entry ?
    because it depends on what the application does
    let me check the code
    if it's accessing randomly, no need for big transfers
    just page transfers will be fine
    if accessing sequentially, rather use whole MiB transfers
    and these behaviors can be for the same file
    mcsim: the call is vm_advi*s*e
    not advice
    yes, he agreed earlier
    ok
    cluster_size is the amount of data that I try to read at one time.
    at a single mo_data_request
    which, to me, will depend on the actual map
    ok so it is the transfer size
    and should be autotuned, especially for normal behavior
    youpi: it makes no sense to have both the advice and the actual size per map entry
    to get big readahead with all apps
    braunr: the size is not only dependent on the advice, but also on the application behavior
    youpi: how does the application tell this ?
    even for sequential, you shouldn't necessarily use very big amounts of transfers
    there is no need for the advice if there is a cluster size
    there can be, in the case of sequential, as we said, to clear previous pages
    but otherwise, indeed
    but for me it's the converse
    the cluster size should be tuned anyway
    and i'm against giving the cluster size in the advise call, as we may want to prefetch previous data as well
    I don't see how that collides
    well, if you consider it's the transfer size, it doesn't
    to me cluster size is just the size of a window
    if you consider it's the amount of pages following a faulted page, it will
    also, if your policy says e.g. "3 pages before, 10 after", and your cluster size is 2, what happens ?
    i would find it much simpler to do what other VM variants do: compute the I/O sizes directly from the policy
    don't they autotune, and use the policy as a maximum ?
depends on the implementations ok, but yes I agree although casting the size into stone in the policy looks bogus to me but making cluster_size part of the kernel interface looks way too messy it is that's why i would have thought it as part of the pager properties the pager is the true component besides the kernel that is actually involved in paging ... well, for me the flexibility should still be per application by pager you mean the whole pager, not each file, right? if a pager can page more because e.g. it's a file system with big block sizes, why not fetch more ? yes it could be each file but only if we have use for it and i don't see that currently well, posix currently doesn't provide a way to set it so it would be useless atm i was thinking about our hurd pagers could we perhaps say that the policy maximum could be a fraction of available memory? why would we want that ? (total memory, I mean) to make it not completely cast into stone as have been in the past in gnumach i fail to understand :/ there must be a misunderstanding then (pun not intended) why do you want to limit the policy maximum ? how to decide it? the pager sets it actually I don't see how a pager could decide it on what ground does it make the decision? readahead should ideally be as much as 1MiB 02:02 < braunr> if a pager can page more because e.g. it's a file system with big block sizes, why not fetch more ? is the example i have in mind otherwise some default values that's way smaller than 1MiB, isn't it? yes and 1 MiB seems a lot to me :) for readahead, not really maybe for sequential that's what we care about! ah, i thought we cared about normal "as much as 1MiB", I said I don't mean normal :) right but again, why limit ? we could have 2 or more ? at some point you don't get more efficiency but eat more memory having the pager set the amount allows us to easily adjust it over time braunr: Do you think that readahead should be implemented in libpager? 
than needed mcsim: no mcsim: err mcsim: can't answer mcsim: do you read the log of what you have missed during disconnection? i'm not sure about what libpager does actually yes for me it's just mutualisation of code used by pagers i don't know the details youpi: yes youpi: that's why we want these values not hardcoded in the kernel youpi: so that they can be adjusted by our shiny user space OS (btw apparently linux uses minimum 16k, maximum 128 or 256k) that's more reasonable that's just 4 times less :) braunr: You say that the pager should decide how much data should be read ahead, but each pager can't implement it on its own as there will be too much overhead. So the only way is to implement this in libpager. mcsim: gni ? why couldn't they ? mcsim: he means the size, not the actual implementation the maximum size, actually actually, i would imagine it as the pager giving per policy parameters right like how many before and after I agree, then the kernel could limit, sure, to avoid letting pagers use completely insane values (and that's just a max, the kernel autotunes below that) why not that kernel limit could be a fraction of memory, then? it could, yes i see what you mean now mcsim: did you understand our discussion? don't hesitate to ask for clarification I supposed cluster_size to be such parameter.
And advice will help to interpret this parameter (whether data should be read after the faulted page or some data should be cleaned before) mcsim: we however believe that it's rather the pager than the application that would tell that at least for the default values posix doesn't have a way to specify it, and I don't think it will in the future and i don't think our own hurd-specific programs will need more than that if they do, we can slightly change the interface to make it a per object property i've checked the slab properties, and it seems we can safely add it per object cf http://www.sceen.net/~rbraun/slabinfo.out so it would still be set by the pager, but, depending on the object, the pager could set different values youpi: do you think the pager should just provide one maximum size ? or per policy sizes ? I'd say per policy size so people can increase sequential size like crazy when they know their sequential applications need it, without disturbing the normal behavior right so the last decision is per pager or per object mcsim: i'd say whatever makes your implementation simpler :) braunr: how does the kernel know that an object was created by a specific pager? that's the kind of things i'm referring to with "whatever makes your implementation simpler" but usually, vm_objects have an ipc port and some properties related to their pagers the problem i had in mind was the locking protocol but our spin locks are noops, so it will be difficult to detect deadlocks braunr: and for every policy there should be a variable in the vm_object structure with the appropriate cluster_size?
if you want it per object, yes although i really don't think we want it better keep it per pager for now let's imagine youpi finishes his 64-bits support, and i can successfully remove the page cache limit we'd jump from 1.8 GiB at most to potentially dozens of GiB of RAM and 1.8, mostly unused to dozens almost completely used, almost all the time for the most interesting use cases we may have lots and lots of objects to keep around so if no one really uses the feature ... there is no point but also lots and lots of memory to spend on it :) a lot of objects are just one page, but a lot of them are not sure we wouldn't be doing that otherwise :) i'm just saying there is no reason to add the overhead of several integers for each object if they're simply not used at all hmm, 64-bits, better page cache, clustered paging I/O :> (and readahead included in the last ofc) good night ! then, probably, make a system-global max-cluster_size? This will save some memory. Also there is usually no sense in reading really huge chunks at once. but that'd be tedious to set there are only a few pagers, that's no wasted memory the user being able to set it for his own pager is however a very nice feature, which can be very useful for databases, image processing, etc. In conclusion I have to implement the following: 3 memory policies per object and per vm_map_entry. Max cluster size for every policy should be set per pager. So, there should be 2 system calls for setting memory policy and one for setting cluster sizes. Also the amount of data to transfer should be tuned automatically on every page fault. youpi: Correct me, please, if I'm wrong. I believe that's what we ended up deciding, yes ## IRC, freenode, #hurd, 2012-07-02 is it safe to say that all memory objects implemented by external pagers have "file" semantics ? i wonder if the current memory manager interface is suitable for device pagers braunr: What does "file" semantics mean?
mcsim: anonymous memory doesn't have the same semantics as a file for example anonymous memory that is discontiguous in physical memory can be contiguous in swap and its location can change with time whereas with a memory object, the data exchanged with pagers is identified with its offset in (probably) all other systems, this way of specifying data is common to all files, whatever the file system linux uses the struct vm_file name, while in BSD/Solaris they are called vnodes (the link between a file system inode and virtual memory) my question is : can we implement external device pagers with the current interface, or is this interface really meant for files ? also mcsim: something about what you said yesterday 02:39 < mcsim> In conclusion I have to implement following: 3 memory policies per object and per vm_map_entry. Max cluster size for every policy should be set per pager. not per object one policy per map entry transfer parameters (pages before and after the faulted page) per policy, defined by pagers 02:39 < mcsim> So, there should be 2 system calls for setting memory policy and one for setting cluster sizes. adding one call for vm_advise is good because it mirrors the posix call but for the parameters, i'd suggest changing an already existing call not sure which one though braunr: do you know how mo_change_attributes is implemented in OSF/Mach? after a quick reading of the reference manual, i think i understand why they made it per object mcsim: no did they change the call to include those paging parameters ? it accepts two parameters: a flavor and a pointer to a structure with parameters. The flavor determines the semantics of the structure with parameters. http://www.darwin-development.org/cgi-bin/cvsweb/osfmk/src/mach_kernel/vm/memory_object.c?rev=1.1 the structure can have 3 different views and what the exact view will be is determined by the value of flavor So, I thought about implementing a similar call that could be used for various purposes.
like ioctl "pointer to structure with parameters" <= which one ? mcsim: don't model anything anywhere like ioctl please memory_object_info_t attributes ioctl is the very thing we want NOT to have on the hurd ok attributes and what are the possible values of flavour, and what kinds of attributes ? and then appears something like this on each case: behave = (old_memory_object_behave_info_t) attributes; ok i see flavor could be OLD_MEMORY_OBJECT_BEHAVIOR_INFO, MEMORY_OBJECT_BEHAVIOR_INFO, MEMORY_OBJECT_PERFORMANCE_INFO etc i don't really see the point of flavour here, other than compatibility having attributes is nice, but you should probably add it as a call parameter, not inside a structure as a general rule, we don't like passing structures too much to/from the kernel, because handling them with mig isn't very clean ok What policy parameters should be defined by the pager? i'd say number of pages to page-in before and after the faulted page Only pages before and after the faulted page? for me yes youpi might have different things in mind the page cleaning in sequential mode is something i wouldn't do 1/ applications might want data read sequentially to remain in the cache, for other sequential accesses 2/ applications that really don't want to cache anything should use O_DIRECT 3/ it's complicated, and we're in july i'd rather have a correct and stable result than too many unused features braunr: MADV_SEQUENTIAL Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.) this is from the linux man braunr: Can I at least keep in mind that it could be implemented? I mean in a future rpc interface braunr: From the kernel's point of view, a pager is just a port.
That's why it is not clear for me how I can make a per-pager policy in the kernel mcsim: you can't 15:19 < braunr> after a quick reading of the reference manual, i think i understand why they made it per object http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_madvise.html POSIX_MADV_SEQUENTIAL Specifies that the application expects to access the specified range sequentially from lower addresses to higher addresses. linux might free pages after their access, why not, but this is entirely up to the implementation I know, but when applications might want data read sequentially to remain in the cache, for other sequential accesses this kind of access could rather be treated as normal or random we can do differently mcsim: no sequential means the access will be sequential so aggressive readahead (e.g. 0 pages before, many after), should be used for better performance from my pov, it has nothing to do with caching i actually sometimes expect data to remain in cache e.g. before playing a movie from sshfs, i sometimes prefetch it using dd then i use mplayer i'd be very disappointed if my data didn't remain in the cache :) At least these pages could be placed into the inactive list to be the first candidates for pageout. that's what will happen by default mcsim: if we need more properties for memory objects, we'll adjust the call later, when we actually implement them so, first call is vm_advise and second is a changed mo_change_attributes? yes there will appear 3 new parameters in mo_c_a: policy, pages before and pages after? braunr: With vm_advise I didn't understand one thing. This call is defined in a defs file, so that should mean that vm_advise is an ordinary rpc call. But at the same time it is defined as a syscall in mach internals (in mach_trap_table). mcsim: what ? where is it "defined" ? (it doesn't exist in gnumach currently) Ok, let's consider vm_map I define it both in mach_trap_table and in a defs file. But why? uh ? let me see Why is defining in the defs file not enough?
and previous question: there will appear 3 new parameters in mo_c_a: policy, pages before and pages after? mcsim: give me the exact file paths please mcsim: we'll discuss the new parameters after kern/syscall_sw.c right i see here mach_trap_table is defined i think they're not used they were probably introduced for performance and ./include/mach/mach.defs don't bother adding vm_advise as a syscall about the parameters, it's a bit more complicated you should add 6 parameters before and after, for the 3 policies but as seen in the posix page, there could be more policies .. ok forget what i said, it's stupid yes, the 3 parameters you had in mind are correct don't forget a "don't change" value for the policy though, so the kernel ignores the before/after values if we don't want to change that ok mcsim: another reason i asked about "file semantics" is the way we handle the cache mcsim: file semantics imply data is cached, whereas anonymous and device memory usually isn't (although having the cache at the vm layer instead of the pager layer allows nice things like the swap cache) But this shouldn't affect the possibility of implementing a device pager.
yes it may. consider how a fault is actually handled by a device: mach must use weird fictitious pages for that whereas it would be better to simply let the pager handle the fault as it sees fit setting may_cache to false should resolve the issue for the caching problem, yes which is why i still think it's better to handle the cache at the vm layer, unlike UVM which lets the vnode pager handle its own cache, and removes the vm cache completely The only issue with the pager interface I see is implementing scatter-gather DMA (as the current interface does not support non-consecutive access) right but that's a performance issue my problem with device pagers is correctness currently, i think the kernel just asks pagers for "data" whereas a device pager should really map its device memory where the fault happens braunr: You mean that every access to memory should cause a page fault? I mean mapping of device memory no i mean a fault on device mapped memory should directly access a shared region whereas file pagers only implement backing store let me explain a bit more here is what happens with file mapped memory you map it, access it (some I/O is done to get the page content in physical memory), then later it's flushed back whereas with device memory, there shouldn't be any I/O, the device memory should directly be mapped (well, some devices need the same caching behaviour, while others provide direct access) one of the obvious consequences is that, when you map device memory (e.g. a framebuffer), you expect changes in your mapped memory to be effective right away while with file mapped memory, you need to msync() it (some framebuffers also need to be synced, which suggests greater control is needed for external pagers) Seems that I understand you. But how is it implemented in other OS'es? Do they set something in the mmu?
mcsim: in netbsd, pagers have a fault operation in addition to get and put the device pager sets get and put to null and implements fault only the fault callback then calls the d_mmap callback of the specific driver which usually results in the mmu being programmed directly (e.g. pmap_enter or similar) in linux, i think raw device drivers, being implemented as character device files, must provide raw read/write/mmap/etc.. functions so it looks pretty much similar i'd say our current external pager interface is insufficient for device pagers but antrik may know more since he worked on ggi antrik: ^ braunr: Seems he used io_map mcsim: where are you looking at ? the incubator ? his master's thesis ah the thesis but where ? :) I'll give you a link http://dl.dropbox.com/u/36519904/kgi_on_hurd.pdf thanks see p 158 arg, more than 200 pages, and he says he's lazy :/ mcsim: btw, have a look at m_o_ready braunr: This is the old form of mo_change_attributes I'm not going to change it mcsim: these are actually the default object parameters right ? mcsim: if you don't change it, it means the kernel must set default values until the pager changes them, if it does yes. mcsim: madvise() on Linux has a separate flag to indicate that pages won't be reused. thus I think it would *not* be a good idea to imply it in SEQUENTIAL braunr: yes, my KMS code relies on mapping memory objects for the framebuffer (it should be noted though that on "modern" hardware, mapping graphics memory directly usually gives very poor performance, and drivers tend to avoid it...) mcsim: BTW, it was most likely me who warned about legal issues with KAM's work. AFAIK he never managed to get the copyright assignment done :-( (that's not really mandatory for the gnumach work though...
only for the Hurd userspace parts) also I'd like to point out again that the cluster_size argument from OSF Mach was probably *not* meant for advice from application programs, but rather was supposed to reflect the cluster size of the filesystem in question. at least that sounds much more plausible to me... braunr: I have no idea what you mean by "device pager". device memory is mapped once when the VM mapping is established; there is no need for any fault handling... mcsim: to be clear, I think the cluster_size parameter is mostly orthogonal to policy... and probably not very useful at all, as ext2 almost always uses page-sized clusters. I strongly advise against bothering with it in the initial implementation mcsim: to avoid confusion, better use a completely different name for the policy-decided readahead size antrik: ok braunr: well, yes, the thesis report turned out HUGE; but the actual work I did on the KGI port is fairly tiny (not more than a few weeks of actual hacking... everything else was just brooding) braunr: more importantly, it's pretty much the last (and only non-trivial) work I did on the Hurd :-( (also, I don't think I used the word "lazy"... my problem is not laziness per se; but rather inability to motivate myself to do anything not providing near-instant gratification...) antrik: right antrik: i shouldn't consider myself lazy either mcsim: i agree with antrik, as i told you weeks ago about 21:45 < antrik> mcsim: to be clear, I think the cluster_size parameter is mostly orthogonal to policy... and probably not very useful at all, as ext2 almost always uses page-sized clusters. I strongly advise against bothering with it in the initial implementation antrik: but how do you actually map device memory ? also, strangely enough, here is the comment in dragonfly's madvise(2) 21:45 < antrik> mcsim: to be clear, I think the cluster_size parameter is mostly orthogonal to policy...
and probably not very useful at all, as ext2 almost always uses page-sized clusters. I strongly advise against bothering with it in the initial implementation arg MADV_SEQUENTIAL Causes the VM system to depress the priority of pages immediately preceding a given page when it is faulted in. braunr: interesting... (about SEQUENTIAL on dragonfly) as for mapping device memory, I just use device_map() on the mem device to map the physical address space into a memory object, and then through vm_map into the driver (and sometimes application) address space formally, there *is* a pager involved of course (implemented in-kernel by the mem device), but it doesn't really do anything interesting thinking about it, there *might* actually be page faults involved when the address ranges are first accessed... but even then, the handling is really trivial and not terribly interesting antrik: it does the most interesting part, creating the physical mapping and as trivial as it is, it requires a special interface i'll read about device_map again but yes, the fact that it's in-kernel is what solves the problem here what i'm interested in is to do it outside the kernel :) why would you want to do that? there is no policy involved in doing an MMIO mapping you ask for the physical memory region you are interested in, and that's it whether the kernel adds the page table entries immediately or on faults is really an implementation detail braunr: ^ yes it's a detail but do we currently have the interface to make such mappings from userspace ? and i want to do that because i'd like as many drivers as possible outside the kernel of course again, the userspace driver asks the kernel to establish the mapping (through device_map() and then vm_map() on the resulting memory object) hm i'm missing something http://www.gnu.org/software/hurd/gnumach-doc/Device-Map.html#Device-Map <= this one ?
yes, this one but this implies the device is implemented by the kernel the mem device is, yes but that's not a driver ah it's just the interface for doing MMIO (well, any physical mapping... but MMIO is probably the only real use case for that) ok i was thinking about completely removing the device interface from the kernel actually but it makes sense to have such devices there well, in theory, specific kernel drivers can expose their own device_map() -- but IIRC the only one that does (besides mem of course) is maptime -- which is not a real driver either... [[Mapped-time_interface|microkernel/mach/gnumach/interface/device/time]]. oh btw, i didn't know you had a blog :) well, it would be possible to replace the device interface by specific interfaces for the generic pseudo devices... I'm not sure how useful that would be there is lots of interesting stuff there hehe... another failure ;-) failure ? well, when I realized that I'm spending a lot of time pondering things, and never can get myself to actually implement any of them, I had the idea that if I write them down, there might at least be *some* good from it... unfortunately it turned out that I need so much effort to write things down, that most of the time I can't get myself to do that either :-( i see well it's still nice to have it (notice that the latest entry is two years old... and I haven't even started describing most of my central ideas :-( ) antrik: i tried to create a blog once, and found what i wrote so stupid i immediately removed it hehe actually some of my entries seem silly in retrospect as well... but I guess that's just the way it is ;-) :) i'm almost sure other people would be interested in what i had to say BTW, I'm actually not sure whether the Mach interfaces are sufficient to implement GEM/TTM... we would certainly need kernel support for GART (as for any other kind of IOMMU in fact); but beyond that it's not clear to me GEM ? TTM ? GART ? GEM = Graphics Execution Manager.
part of the "new" DRM interface, closely tied with KMS TTM = Translation Table Manager. does part of the background work for most of the GEM drivers "The Graphics Execution Manager (GEM) is a computer software system developed by Intel to do memory management for device drivers for graphics chipsets." hmm (in fact it was originally meant to provide the actual interface; but the Intel folks decided that it's not useful for their UMA graphics) GART = Graphics Aperture kind of an IOMMU for graphics cards allowing the graphics card to work with virtual mappings of main memory (i.e. allowing safe DMA) ok all this graphics stuff looks so complex :/ it is I have a whole big chapter on that in my thesis... and I'm not even sure I got everything right what is nvidia using/doing (except for getting the finger) ? flushing out all the details for KMS, GEM etc. took the developers like two years (even longer if counting the history of TTM) Nvidia's proprietary stuff uses a completely custom kernel interface, which is of course not exposed or documented in any way... but I guess it's actually similar in what it does) ok (you could ask the nouveau guys if you are truly interested... they are doing most of their reverse engineering at the kernel interface level) it seems graphics have very special needs, and a lot of them and the interfaces are changing often so it's not that much interesting currently it just means we'll probably have to change the mach interface too like you said so the answer to my question, which was something like "do mach external pagers only implement files ?", is likely yes well, KMS/GEM had reached some stability; but now there are further changes ahead with the embedded folks coming in with all their dedicated hardware, calling for unified buffer management across the whole pipeline (from capture to output) and yes: graphics hardware tends to be much more complex regarding the interface than any other hardware.
that's because it's a combination of actual I/O (like most other devices) with a very powerful coprocessor and the coprocessor part is pretty much unique amongst peripheral devices (actually, the I/O part is also much more complex than most other hardware... but that alone would only require a more complex driver, not special interfaces) embedded hardware makes it more interesting in that the I/O part(s) are separate from the coprocessor ones; and that there are often several separate specialised ones of each... the DRM/KMS stuff is not prepared to deal with this v4l over time has evolved to cover such things; but it's not really the right place to implement graphics drivers... which is why there are now efforts to unify these frameworks. funny times... ## IRC, freenode, #hurd, 2012-07-03 mcsim: vm_for_every_page should be static braunr: ok mcsim: see http://gcc.gnu.org/onlinedocs/gcc/Inline.html and it looks big enough that you shouldn't make it inline let the compiler decide for you (which is possible only if the function is static) (otherwise a global symbol needs to exist) mcsim: i don't know where you copied that comment from, but you should review the description of the vm_advise call in mach.defs braunr: I see braunr: It was vm_inherit :) mcsim: why isn't NORMAL defined in vm_advise.h ? mcsim: i figured actually ;) braunr: I was going to do it later. mcsim: for more info on inline, see http://www.kernel.org/doc/Documentation/CodingStyle arg that's an old one braunr: I know that I do not follow coding style mcsim: this one is about linux :p mcsim: http://lxr.linux.no/linux/Documentation/CodingStyle should have it mcsim: "Chapter 15: The inline disease" I was going to fix it later during refactoring when I merge mplaneta/gsoc12/working to mplaneta/gsoc12/master be sure not to forget :p and the best way not to forget is to do it asap As to inline. I thought that even if I specify a function as inline gcc makes the final decision about it.
There was a specifier that made a function always inline, AFAIR. gcc can force a function not to be inline, yes but inline is still considered as a strong hint ## IRC, freenode, #hurd, 2012-07-05 braunr: hello. You've said that the pager has to supply 2 values to the kernel to give it advice on how to execute a page fault. These two values should be the number of pages before and after the page where the fault occurred. But for the sequential policy the number of pages before makes no sense. For the random policy too. For the normal policy it would be sane to make readahead symmetric. Probably it would be sane to make the pager supply cluster_size (if it is necessary to supply any) that will be advice for the kernel of the least sane value? And the maximal value will be f(free_memory, map_entry_size)? mcsim1: I doubt symmetric readahead would be a good default policy... while it's hard to estimate an optimum over all typical use cases, I'm pretty sure most situations will benefit almost exclusively from reading following pages, not preceding ones I'm not even sure it's useful to read preceding pages at all in the default policy -- the use cases are probably so rare that the penalty in all other use cases is not justified. I might be wrong on that though... I wonder how other systems handle that antrik: if there is a mismatch between pages and the underlying store, like why changing small bits of data on an ssd is slow? mcsim1: i don't see why not antrik: netbsd reads a few pages before too actually, what netbsd does varies with the version, some only mapped in resident pages, later versions started asynchronous transfers in the hope those pages would be there LarstiQ: not sure what you are trying to say in linux : 321 * MADV_NORMAL - the default behavior is to read clusters. This 322 * results in some read-ahead and read-behind. not sure if it's actually what the implementation does well, right -- it's probably always useful to read whole clusters at a time, especially if they are the same size as pages...
that doesn't mean it always reads preceding pages; only if the read is in the middle of the cluster AIUI antrik: basically what braunr just pasted and in most cases, we will want to read some *following* clusters as well, but probably not preceding ones * LarstiQ nods antrik: the default policy is usually rather sequential here are the numbers for netbsd 166 static struct uvm_advice uvmadvice[] = { 167 { MADV_NORMAL, 3, 4 }, 168 { MADV_RANDOM, 0, 0 }, 169 { MADV_SEQUENTIAL, 8, 7}, 170 }; struct uvm_advice { int advice; int nback; int nforw; }; surprising isn't it ? they may suggest sequential may be backwards too makes sense braunr: what are these numbers? pages? yes braunr: I suspect the idea behind SEQUENTIAL is that with typical sequential access patterns, you will start at one end of the file, and then go towards the other end -- so the extra clusters in the "wrong" direction do not actually come into play the only situation where some extra clusters are actually read is when you start in the middle of a file, and thus do not know yet in which direction the sequential read will go... yes, there are similar comments in the linux code mcsim1: so having before and after numbers seems both straightforward and on par with other implementations I'm still surprised about the almost symmetrical policy for NORMAL though BTW, is it common to use heuristics for automatically recognizing random and sequential patterns in the absence of explicit madvise? i don't know netbsd doesn't use any, linux seems to have different behaviours for anonymous and file memory when KAM was working on this stuff, someone suggested that... there is a file_ra_state struct in linux, for per file read-ahead policy now the structure is of course per file system, since they all use the same address (which is why i wanted it to be per pager in the first place) mcsim1: as I said before, it might be useful for the pager to supply cluster size, if it's different than page size.
but right now I don't think this is something worth bothering with... I seriously doubt it would be useful for the pager to supply any other kind of policy braunr: I don't understand your remark about using the same address... braunr: pre-mapping seems the obvious way to implement readahead policy err... per-mapping the ra_state (read ahead state) isn't the policy the policy is per mapping, parts of the implementation of the policy are per file system braunr: How do you look at the following implementation of the NORMAL policy: We have the faulted page, which is current. Then we have a maximal size of the readahead block. First we find the first absent pages before and after current. Then we try to fit the block that will be read ahead into this range. The following situations are possible: if in the range RBS/2 (RBS -- size of readahead block) there are no pages, readahead will be symmetric; if the current page is the first absent page, the whole RBS block will consist of pages that are after current; on the contrary, if the current page is the last absent one, readahead will go backwards. Additionally, if the current page is approximately in the middle of the range, we can decrease RBS, supposing that access is random. mcsim1: i think your gsoc project is about readahead, we're in july, and you need to get the job done mcsim1: grab one policy that works, pages before and after are good enough use sane default values, let the pagers decide if they want something else and concentrate on the real work now braunr: I still don't see why pagers should mess with that... only complicates matters IMHO antrik: probably, since they almost all use the default implementation mcsim1: just use sane values inside the kernel :p this simplifies things by only adding the new vm_advise call and not changing the existing external pager interface ## IRC, freenode, #hurd, 2012-07-12 mcsim: so, to begin with, tell us what state you've reached please braunr: I'm writing code for hurd and gnumach. For gnumach I'm implementing memory policies now.
RANDOM and NORMAL seem to work, but in hurd I found an error that I made while editing ext2fs. So for now ext2fs does not work policies ? what about mechanism ? also I moved some translators to the new interface. It works too well that's impressive braunr: I'm not sure yet that everything works right, but that's already a very good step i thought you were still working on the interfaces to be honest And as for the mechanism, I didn't implement moving pages to the inactive queue what do you mean ? ah you mean with the sequential policy ? yes you can consider this a secondary goal sequential I was going to implement like you've said, but I still want to support moving pages to the inactive queue i think you shouldn't first get to a state where clustered transfers do work fine policies are implemented in function calculate_clusters then, you can try, and measure the difference ok. I'm now working on fixing ext2fs so, apart from bug squashing, what's left to do ? finish policies and ext2fs; move fatfs, ufs, isofs to the new interface; test this all; edit patches from the debian repository that conflict with my changes; rearrange commits and fix code indentation; update documentation; think about measurements too mcsim: Please don't spend a lot of time on ufs. No testing required for that one. and keep us informed about your progress on bug fixing, so we can test soon Forgot about moving the system to the new interfaces (I mean determine the form of vm_advise and memory_object_change_attributes) s/determine/final/ braunr: ok. what do you mean "moving system to new interfaces" ? braunr: I also pushed code changes to the gnumach and hurd git repositories I met an issue with memory_object_change_attributes when I tried to use it, as I have to update all applications that use it. This includes libc and translators that are not in the hurd repository or use debian patches.
So I will not be able to run the system with the new memory_object_change_attributes interface until I update all software that uses this rpc this is a bit like the problem i had with my change the solution is : don't do it i mean, don't change the interface in an incompatible way if you can't change an existing call, add a new one Temporarily I changed memory_object_set_attributes as it isn't used any more. braunr: ok. Adding a new call is a good idea :)

## IRC, freenode, #hurd, 2012-07-16

mcsim: how did you deal with multiple page transfers towards the default pager ? braunr: hello. Didn't handle this yet, but AFAIR the default pager supports multiple page transfers. mcsim: i'm almost sure it doesn't braunr: indeed braunr: So, I'll update it just other translators. like other translators you mean ? *just as braunr: yes ok be aware also that it may need some support in vm_pageout.c in gnumach braunr: thank you if you see anything strange in the default pager, don't hesitate to talk about it braunr: ok. I didn't finish with ext2fs yet. so it's a good thing you're aware of it now, before you begin working on it :) braunr: I'm working on ext2 now. yes i understand i meant "before beginning work on the default pager" ok mcsim: BTW, we were mostly talking about readahead (pagein) over the past weeks, so I wonder what the status on clustered page*out* is?... antrik: I don't work on this, but the following, I think, is an example of *clustered* pageout: _pager_seqnos_memory_object_data_return: object = 113, seqno = 4, control = 120, start_address = 0, length = 8192, dirty = 1. This is an example of a debugging printout that shows that pageout manipulates chunks bigger than page size. antrik: Another one with bigger length _pager_seqnos_memory_object_data_return: object = 125, seqno = 124, control = 132, start_address = 131072, length = 126976, dirty = 1, kcopy mcsim: that's odd -- I didn't know the functionality for that even exists in our codebase...
my understanding was that Mach always sends individual pageout requests for every single page it wants cleaned... (and this being the reason for the dreadful thread storms we are facing...) antrik: ok antrik: yes that's what is happening the thread storms aren't that much of a problem now (by carefully throttling pageouts, which is a task i intend to work on during the following months, this won't be an issue any more)

## IRC, freenode, #hurd, 2012-07-19

I moved fatfs, ufs, isofs to the new interface, corrected some errors in others that I already moved, moved the kernel to the new interface (renamed vm_advice to vm_advise and added rpcs memory_object_set_advice and memory_object_get_advice). Made some changes in the mechanism and tried to finish the ext2 translator. braunr: I've got an issue with fictitious pages... When I determine the bounds of a cluster in an external object I never know its actual size. So, a mo_data_request call could ask for data beyond the object bounds. The problem is that the pager returns the data that it has, and because of this the fictitious pages that were allocated are not freed. why don't you know the size ? I see 2 solutions. The first one is to not allocate fictitious pages at all (but I think that there could be issues). The other lies in allocating fictitious pages, but then freeing them with mo_data_lock. braunr: Because pagers do not inform the kernel about the object size. i don't understand what you mean I think that the second way is better. so how does it happen ? you get a page fault Don't you understand the problem or the solutions? then a lookup in the map finds the map entry and the map entry gives you the link to the underlying object from vm_object.h: vm_size_t size; /* Object size (only valid if internal) */ mcsim: ugh For external they are either 0x8000 or 0x20000... and for internal ? i'm very surprised to learn that braunr: for internal the size is actual right sorry, wrong question did you find what 0x8000 and 0x20000 are ?
for external I met only these 2 magic numbers when printing out the arguments of the _pager_seqno_memory_object_... functions when they were called. yes but did you try to find out where they come from ? braunr: no. I think that 0x2000(many zeros) is the maximal possible object size. what's the exact value ? can't tell exactly :/ My hurd box has broken again. mcsim: how does the vm find the backing content then ? braunr: Do you know if it is guaranteed that the map_entry size will not be bigger than the external object size? mcsim: i know it's not but you can use the map entry boundaries though braunr: vm asks pager but if the page is already present how does it know ? it must be inside a vm_object .. If I can use these boundaries then the problem I described does not arise. good it makes sense to use these boundaries, as the application can't use data outside the mapping I ask for the page with vm_page_lookup it would matter for shared objects, but then they have their own faults :p ok so the size is actually completely ignored if it is present then I stop expansion of the cluster. which makes sense braunr: yes, for external. all right use the mapping boundaries, it will do mcsim: i have only one comment about what i could see mcsim: there are 'advice' fields in both vm_map_entry and vm_object there should be something else in vm_object i told you about pages before and after mcsim: how are you using this per object "advice" currently ? (in addition, using the same name twice for both mechanism and policy is very confusing) braunr: I try to expand the cluster as much as possible, but not more than the limit they both determine policy, but advice for the entry has higher priority that's wrong mapping and content shouldn't compete for policy the mapping tells the policy (=the advice) while the content tells how to implement (e.g.
how much content) IMO, you could simply get rid of the per object "advice" field and use default values for now braunr: What meaning should these values for the number of pages before and after have? or use something well known, easy, and effective like preceding and following pages they give the vm the amount of content to ask the backing pager braunr: maximal amount, minimal amount or exact amount? neither that's why i recommend you forget it for now but imagine you implement the three standard policies (normal, random, sequential) then the pager assigns preceding and following numbers for each of them, say [5;5], [0;0], [15;15] respectively these numbers would tell the vm how many pages to ask the pagers in a single request and from where braunr: but in fact there could be many more policies. yes also in kernel context there is no such unit as pager. so there should be a call like memory_object_set_advice(int advice, int preceding, int following); for example what ? the pager is the memory manager it does exist in kernel context (or i don't understand what you mean) there is only a port, but the port could be either a pager or something else no, it's a pager it's a port whose receive right is held by a task implementing the pager interface either the default pager or an untrusted task (or null if the object is anonymous memory not yet sent to the default pager) port is always pager? the object port is, yes struct ipc_port *pager; /* Where to get data */ So, you suggest to keep a set of advice values for each object? i suggest you don't change anything in objects for now keep the advice in the mappings only, and implement default behaviour for the known policies mcsim: if you understand this point, then i have nothing more to say, and we should let nowhere_man present his work braunr: ok. I'll implement only default behaviors for known policies for now. (actually, using the mapping boundaries is slightly suboptimal, as we could have several mappings for the same content, e.g.
a program with a read-only executable mapping, then rw only) mcsim: another way to know the "size" is to actually look up pages in objects hm no, that's not true braunr: But if there is no page we have to ask for it and I don't understand why using the mapping boundaries is suboptimal here is bash 0000000000400000 868K r-x-- /bin/bash 00000000006d9000 36K rw--- /bin/bash two entries, same file (there is the anonymous memory layer for the second, but it would matter for the first cow faults)

## IRC, freenode, #hurd, 2012-08-02

braunr: You said that I probably need some support in vm_pageout.c to make defpager work with clustered page transfers, but TBH I thought that I had to implement only pagein. Do you expect me to implement pageout as well? Or do I misunderstand the role of vm_pageout.c? no you're expected to implement only pagins for now pageins well, I'm finishing merging of the ext2fs patch for large stores and working on defpager in parallel. braunr: Also I didn't get your idea about configuring the paging mechanism on behalf of pagers. which one ? braunr: You said that the pager has to somehow pass the size of desired clusters for different paging policies. mcsim: i said not to care about that and the wording isn't correct, it's not "on behalf of pagers" servers? pagers could tell the kernel what size (before and after a faulted page) they prefer for each existing policy but that's one way to do it defaults work well too as shown in other implementations

## IRC, freenode, #hurd, 2012-08-09

braunr: I'm still debugging ext2 with the large storage patch mcsim: tough problems ? braunr: The same issues as I always meet when doing debugging, but it takes time. mcsim: so nothing blocking so far ? braunr: I can't tell you for sure that I will finish by the 13th of August, and this is the unofficial pencils down date. all right, but are you blocked ? braunr: If you mean issues that I can not even imagine how to solve, then there are none.
good mcsim: i'll try to review your code again this week end mcsim: make sure to commit everything even if it's messy braunr: ok braunr: I made changes to defpager, but I haven't tried them. Commit them too? mcsim: sure mcsim: does it work fine without the large storage patch ? braunr: looks fine, but TBH I can't even run such things as fsx, because even without my changes it failed mightily at once. [[file_system_exerciser]]. mcsim: right, well, that will be part of another task :)

## IRC, freenode, #hurd, 2012-08-13

braunr: hello. It seems ext2fs with the large store patch works.

## IRC, freenode, #hurd, 2012-08-19

hello. Consider such a situation. There is a page fault and the kernel decided to request several pages from the pager, but at the moment the pager is able to provide only the first pages, the rest are not known yet. Is it possible to supply only one page and for the rest tell the kernel something like: "For these pages, try again later"? I tried pager_data_unavailable && pager_flush_some, but this does not seem to work. Or do I have to supply something anyway? mcsim: better not provide them the kernel only really needs one page don't try to implement "try again later", the kernel will do that if other page faults occur for those pages braunr: No, the translator just hangs ? braunr: And I can't even detach it without a reboot hangs when what ? i mean, what happens when it hangs ? If the kernel requests 2 pages and I provide one, then when a page fault occurs in the second page the translator hangs. well that's a bug clustered pager transfer is a mere optimization, you shouldn't transfer more than you can just to satisfy some requested size I think that it's because I create fictitious pages before calling mo_data_request as placeholders ? Yes. Is it correct if I don't grab fictitious pages?
no i don't know the details well enough about fictitious pages unfortunately, but it really feels wrong to use them where real physical pages should be used instead normally, an in-transfer page is simply marked busy But if a page is already marked busy the kernel will not ask for it another time. when the pager replies, you unbusy them your bug may be that you incorrectly use pmap you shouldn't create mmu mappings for pages you didn't receive from the pagers I don't create them ok so you correctly get the second page fault If the pager supplies only the first page, when two were asked for, then the second page will not become un-busy. that's a bug your code shouldn't assume the pager will provide all the pages it was asked for only the main one Will it be ok if I provide a special attribute that will keep the information that a page has been advised? what for ? i don't understand "page has been advised" An advised page is a page that is asked for in a cluster, but there wasn't a page fault in it. I need this attribute because if I don't inform the kernel about this page somehow, then the kernel will not change the attributes of this page. why would it change its attributes ? But if a page fault occurs in a page that was asked for, then the page will already be busy by that moment. and what attribute ? advised i'm lost 08:53 < mcsim> I need this attribute because if I don't inform the kernel about this page somehow, then the kernel will not change the attributes of this page. you need the advised attribute because if you don't inform the kernel about this page, the kernel will not change the advised attribute of this page ? Not only advised, but busy as well. And if a page fault occurs in this page, the kernel will not ask for it a second time. The kernel will just block. well that's normal But if the kernel blocks and the pager is not going to report about this page somehow, then the translator will hang.
but the pager is going to report and in this report, there can be fewer pages than requested braunr: You told me not to report the kernel can deduce it didn't receive all the pages, and mark them unbusy anyway i told you not to transfer more than requested but not sending data can be a form of communication i mean, sending a message in which data is missing it simply means it's not there, but this info is sufficient for the kernel hmmm... It seems I understood you. Let me try something. braunr: I informed the kernel about a missing page as follows: pager_data_supply (pager, precious, writelock, i, 1, NULL, 0); Am I right? i don't know the interface well what does it mean ? are you passing NULL as the data for a missing page ? yes i see you shouldn't need a request for that though, avoiding useless ipc is a good thing i is the page number, 1 is the quantity but if you can't find a better way for now, it will do But this does not work :( that's a bug in your code probably braunr: supplying NULL as data returns MACH_SEND_INVALID_MEMORY but why would it work ? mach expects something you have to change that It's mig who refuses the data. Mach does not even get the call. hum That's why I propose to provide a new attribute, that will keep the information regarding whether the page was asked for as advice or not. i still don't understand why why don't you fix mig so you can send your null message instead ? braunr: because usually this is an error the kernel will decide if it's an error what kind of reply do you intend to send the kernel for these "advised" pages ? no reply. But when a page fault occurs in a busy page that is also advised, the kernel will not block, but ask for this page another time. And how will the kernel know whether this is an error or not? why ask another time ?!
you really don't want to flood pagers with useless messages here is how it should be 1/ the kernel requests pages from the pager it knows the range 2/ the pager replies what it can, full range, subset of it, even only one page 3/ the kernel uses what the pager replied, and unbusies the other pages The first time the page was asked for because a page fault occurred in the neighborhood. And the second time because a PF occurred in the page itself. well it shouldn't or it should, but then you have a segfault But the kernel does not keep the bounds of the range that it asked for. if the kernel can't find the main page, the one it needs to make progress, it's a segfault And this range could be supplied in several messages. absolutely not you defeat the purpose of clustered pageins if you use several messages But the interface supports it the interface supported single page transfers, doesn't mean it's good well, you could use several messages as what we really want is less I/O No one keeps the bounds of the requested range, so it can't be checked whether the range was split but it would be so much better to do it all with as few messages as possible does the kernel know the main page ? Splitting the range is not optimal, but it's not an error. i assume it does doesn't it ? no, that's why I want to provide a new attribute. i'm sorry i'm lost again how does the kernel know a page fault has been serviced ? It receives an interrupt ? let's not mix terms oh.. I read it as received. Sorry It gets the mo_data_supply message. Then it replaces fictitious pages with real ones. so you get a message and you kept track of the range using fictitious pages use the busy flag instead, and another way to retain the range I allocate fictitious pages to reserve a place. Then if a page fault occurs in this fictitious page the kernel will not send another mo_data_request call; it will wait until the fictitious page unblocks.
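braunr's three-step request/reply flow above can be modeled in a few lines: the kernel busies a requested range of placeholders, the pager supplies whatever subset it actually has, and the kernel releases the placeholders for everything the reply left out (so a later fault on those pages simply starts over). This is a toy simulation; none of these names exist in gnumach:

```c
#include <assert.h>
#include <stdbool.h>

enum state { ABSENT, BUSY, RESIDENT };

#define NPAGES 8
static enum state page[NPAGES];

/* Step 1: the kernel busies the whole requested range with placeholders
 * (fictitious pages in Mach terms). */
static void kernel_request(int start, int len)
{
    for (int i = start; i < start + len; i++)
        page[i] = BUSY;
}

/* Steps 2 and 3: the pager replies with the subset it has (have[]); the
 * kernel installs those pages and drops the placeholders for the rest,
 * so a future fault on them will just issue a fresh request. */
static void pager_supply(int start, int len, const bool *have)
{
    for (int i = start; i < start + len; i++)
        page[i] = have[i - start] ? RESIDENT : ABSENT;
}
```

The key property is that no page is left BUSY after the reply, so a partial supply can never wedge a later fault, which is exactly the hang mcsim ran into.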
i'll have to check the code but it looks suboptimal to me we really don't want to allocate useless objects when a simple busy flag would do busy flag for what? There is no page yet we're talking about mo_data_supply actually we're talking about the whole page fault process We can't mark nothing as busy, that's why the kernel allocates a fictitious page and marks it as busy until the real page is supplied. what do you mean "nothing" ? VM_PAGE_NULL uh ? when are physical pages allocated ? on request or on reply from the pager ? i'm reading mo_data_supply, and it looks like the page is already busy at that time they are allocated by the pager and then supplied in the reply Yes, but these pages are fictitious show me please in the master branch, not yours that page is fictitious? yes i'm referring to the way mach currently does things vm/vm_fault.c:582 that's memory_object_lock_page hm wait my bad ah that damn object chaining :/ ok the original code is stupid enough to use fictitious pages all the time, you probably have to do the same hm... Attributes will be useless, the pager should tell something about the pages that it is not going to supply. yes that's what null is for Not null, null is an error. one problem i can think of is making sure the kernel doesn't interpret missing as error right I think it's better to have a special value for mo_data_error probably

### IRC, freenode, #hurd, 2012-08-20

braunr: I think it's useful to allow supplying the data in several batches. the kernel should *not* assume that any data missing in the first batch won't be supplied later. antrik: it really depends i personally prefer synchronous approaches demanding that all data is supplied at once could actually turn readahead into a performance killer antrik: Why? The only drawback I see is higher response time for the page fault, but it also leads to reduced overhead.
that's why "it depends" mcsim: it brings benefit only if enough preloaded pages are actually used to compensate for the time it took the pager to provide them which is the case for many workloads (including sequential access, which is the common case we want to optimize here) mcsim: the overhead of an extra RPC is negligible compared to increased latencies when dealing with slow backing stores (such as disk or network) antrik: also many replies lead to fragmentation, while in one reply all data is gathered in one bunch. If all data is placed consecutively, then it may be transferred faster next time. mcsim: what kind of fragmentation ? I really really don't think it's a good idea for the page to hold back the first page (which is usually the one actually blocking) while it's still loading some other pages (which will probably be needed only in the future anyways, if at all) err... for the pager to hold back antrik: then all pagers should be changed to handle asynchronous data supply it's a bit late to change that now there could be two cases of data placement in the backing store: 1/ all asked data is placed consecutively; 2/ it is spread across the backing store. If the pager gets the data in one message it will more likely place it consecutively. So to have the data consecutive in each pager, each pager has to try to send the data in one message. Having data placed consecutively is important, since reading such data is much faster. mcsim: you're confusing things .. or you're not telling them properly Ok. Let me try one more time since you're working *only* on pagein, not pageout, how do you expect spread pages being sent in a single message to be better than multiple messages ?
braunr: I think about the future :) ok but antrik is right, paging in too much can reduce performance so the default policy should be adjusted for both the worst case (one page) and the average/best (some/many contiguous pages) through measurement ideally mcsim: BTW, I still think implementing clustered pageout has higher priority than implementing madvise()... but if the latter is less work, it might still make sense to do it first of course :-) there aren't many users of madvise, true antrik: Implementing madvise I expect to be very simple. It should just translate the call to vm_advise well, that part is easy of course :-) so you already implemented vm_advise itself I take it? antrik: Yes, that was also quite easy. great :-) in that case it would be silly of course to postpone implementing the madvise() wrapper. in other words: never mind my remark about priorities :-)

## IRC, freenode, #hurd, 2012-09-03

I try a test with ext2fs. It works, then I just recompile ext2fs and it stops working, then I recompile it again several times and each time the result is unpredictable. sounds like a concurrency issue I can run the same test several times and ext2 works until I recompile it. That's the problem. Could that be concurrency too? mcsim: without bad luck, yes, unless "several times" is a lot like several dozens of tries

## IRC, freenode, #hurd, 2012-09-04

hello. I want to tell that the ext2fs translator, that I work on, replaced for my system the old variant that processed only single page requests. And it works with partitions bigger than 2 Gb. Probably I'm not far from the end. But it's worth mentioning that I didn't fix that nasty bug that I told you about yesterday. braunr: That bug sometimes appears after recompilation of ext2fs and always disappears after sync or reboot. Now I'm going to finish defpager and test other translators.

## IRC, freenode, #hurd, 2012-09-17

braunr: hello.
Do you remember that you said that the pager has to inform the kernel about an appropriate cluster size for readahead? I don't understand how the kernel stores this information, because it does not know about such a unit as "pager". Can you give me advice about how this could be implemented? mcsim: it can store it in the object youpi: It's too big an overhead youpi: at least from my pov mcsim: we discussed this already mcsim: there is no "pager" entity in the kernel, which is a defect from my PoV mcsim: the best you can do is follow what the kernel already does that is, store this property per object we don't care much about the overhead for now my guess is there is already some padding, so the overhead is likely to be amortized by this like youpi said I remember that discussion, but I didn't get then whether there should be only one or two values for all policies. Or should each policy have its own values? braunr: ^ each policy should have its own values, which means it can be implemented with a simple static array somewhere the information in each object is a policy selector, such as an index in this static array ok mcsim: if you want to minimize the overhead, you can make this selector a char, and place it near another char member, so that you use space that was previously used as padding by the compiler mcsim: do you see what i mean ? yes good

## IRC, freenode, #hurd, 2012-09-17

hello. May I add a function krealloc to slab.c? mcsim: what for ? braunr: It is quite useful for creating dynamic arrays you don't want dynamic arrays why? they're expensive try other data structures more expensive than linked lists? depends but linked lists aren't the only other alternative that's why btrees and radix trees (basically trees of arrays) exist the best general purpose data structure we have in mach is the red black tree currently but always think about what you want to do with it I want to store sets of sizes for different memory policies there. I don't expect this array to be big.
But for sure I can use an rbtree for it. why not a static array ? arrays are perfect for known data sizes I expect the pager to supply its own sizes. So at the beginning this array contains only the default policy. When a pager wants to supply its own policy the kernel looks up the table of advice. If this policy is a new set of sizes then the kernel creates a new entry in the table of advice. that would mean one set of sizes for each object why don't you make things simple first ? The object stores only a pointer to an entry in this table. but there is no pager object shared by memory objects in the kernel I mean struct vm_object so that's what i'm saying, one set per object it's useless overhead i would really suggest using a global set of policies for now Probably, I don't understand you. Where do you want to store this static array? it's a global one "for now"? It is not a problem to implement a table for local advice, using either an rbtree or a dynamic array. it's useless overhead and it's not a single integer, you want a whole container per object don't do anything fancy unless you know you really want it i'll link the netbsd code again as a very good example of how to implement global policies that work more than decently for every file system in this OS http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/uvm/uvm_fault.c?rev=1.194&content-type=text/x-cvsweb-markup&only_with_tag=MAIN look for uvmadvice But different translators have different demands. Thus changing the global policy for one translator would impact the behavior of another one. i understand this isn't l4, or anything experimental we want something that works well for us And this is acceptable? until you're able to demonstrate we need different policies, i'd recommend not making things more complicated than they already are and need to be why wouldn't it ?
we've been discussing this a long time :/ because every process runs in an isolated environment, and the fact that there is something outside this environment that it has no rights over surprises me. ? ok. let me dip into the uvm code. Probably my questions will disappear i don't think it will you're asking about the system design here, not implementation details with l4, there are as you'd expect well defined components handling policies for address space allocation, or paging, or whatever but this is mach mach has a big shared global vm server with in kernel policies for it so it's ok to implement a global policy for this and let's be pragmatic, if we don't need complicated stuff, why would we waste time on this ? It is not complicated. retaining a whole container for each object, whereas they're all going to contain exactly the same stuff for years to come seems overly complicated for me I'm not going to create a separate container for each object. i'm not following you then how can pagers upload their sizes into the kernel ? I'm going to create a new container only for a combination of cluster sizes that is not present in the table of advice. that's equivalent you're ruling out the default set, but that's just an optimization whenever a file system decides to use other sizes, the problem will arise Before creating a container I'm going to look up the table. And only then create one ? But there will be the same container for a huge bunch of objects how do you select it ? if it's a per pager container, remember there is no shared pager object in the kernel, only ports to external programs I'll give an example Suppose there are only two policies. At the beginning we have the table {{random = 4096, sequential = 8192}}. Then pager 1 wants to add a new policy where the random cluster size is 8192. It asks the kernel to create it and after this the table will be the following: {{random = 4096, sequential = 8192}, {random = 8192, sequential = 8192}}.
If pager 2 wants to create the same policy as pager 1, the kernel will look up the table and will not create a new entry. So the table will stay the same. And each object has a link to the appropriate table entry i'm not sure how this can work how can pagers 1 and 2 know the sizes are the same for the same policy ? (and actually they shouldn't) For faster lookup hash keys will be created for each entry what's the lookup key ? They do not know The kernel knows then i really don't understand and how do you select sizes based on the policy ? and how do you remove unused entries ? (ok this can be implemented with a simple ref counter) "and how do you select sizes based on the policy ?" you mean at page fault? yes the entry or object keeps a pointer to the appropriate entry in the table ok your per object data is a pointer to the table entry and the policy is the index inside so you really need a ref counter there yes and you need to maintain this table for me it's uselessly complicated but this keeps the design clear not for me i don't see how this is clearer it's just more powerful a power we clearly don't need now and in the following years in addition, i'm very worried about the potential problems this can introduce In fact I don't feel comfortable with the thought that one translator can impact the behavior of another. simple example: the table is shared, it needs a lock, other data structures you may have added in your patch may also need a lock but our locks are noop for now, so you just can't be sure there is no deadlock or other issues and adding smp is a *lot* more important than being able to select precisely policy sizes that we're very likely not to change a lot what do you mean by "one translator can impact another" ? As I understand your idea (I haven't read the uvm code yet) there is a global table of cluster sizes for different policies. And every translator can change values in this table. That is what I mean by one translator having an impact on another one.
absolutely not translators *can't* change sizes the sizes are completely static, assumed to fit all it's not optimal but it's very simple and effective in practice and it's not a table of cluster sizes it's a table of pages before/after the faulted one this reflects the fact that in mach, virtual memory (implementation and policy) is in the kernel translators must not be able to change that let's talk about pagers here, not translators Finally I got you. This is an acceptable tradeoff. it took some time :) just to clear something 20:12 < mcsim> For faster lookup there will be create hash keys for each entry i'm not sure i understand you here To find out whether such a policy (set of sizes) is in the table we can look at every entry and compare each value. But it is better to create a hash value for the set and thus find equal policies. first, i'm really not comfortable with hash tables they really need careful configuration next, as we don't expect many entries in this table, there is probably no need for this overhead remember that one property of tables is locality of reference you access the first entry, the processor automatically fills a whole cache line so if your table fits on just a few, it's probably faster to compare entries completely than to jump around in memory But we can sort hash keys, and in this way find policies quickly. cache misses are way slower than computation so unless you have massive amounts of data, don't use an optimized container (20:38:53) braunr: that's why btrees and radix trees (basically trees of arrays) exist and what will be the key? i'm not saying to use a tree instead of a hash table i'm saying, unless you have many entries, just use a simple table and since pagers don't add and remove entries from this table often, it's one case where reallocation is ok So here dynamic arrays fit best?
probably
it really depends on the number of entries and the write ratio
keep in mind current processors have 32-bit or (more commonly) 64-bit cache line sizes
bytes, probably?
yes, bytes
but i'm not willing to add a realloc-like call to our general purpose kernel allocator
i don't want to make it easy for people to rely on it, and i hope the lack of it will make them think about other solutions instead :)
and if they really want to, they can just use alloc/free
By "other solutions" do you mean trees?
i mean anything else :)
lists are simple, trees are elegant (but add non negligible overhead)
i like trees because they truly "gracefully" scale
but they're still O(log n)
a good hash table is O(1), but must be carefully measured and adjusted
there are many other data structures, many of them you can find in linux
but in mach we don't need a lot of them
Your favorite data structures are lists and trees. Next you should claim that lisp is your favorite language :)
functional programming should eventually rule the world, yes
i wouldn't count lists as my favorite, they are really trees
there is a reason why red black trees back higher level data structures like vectors or maps in many common libraries ;)
mcsim: hum, but just to make it clear, i asked this question about hashing because i was curious about what you had in mind; i still think it's best to use static predetermined values for policies
braunr: I understand this. :)
braunr: Yeah. You should be cautious with me :)

## IRC, freenode, #hurd, 2012-09-21

mcsim: there is only one cluster size per object -- it depends on the properties of the backing store, nothing else.
(while the readahead policies depend on the use pattern of the application, and thus should be selected per mapping)
but I'm still not convinced it's worthwhile to bother with cluster size at all. do other systems even do that?...
## IRC, freenode, #hurd, 2012-09-23

mcsim: how long do you think it will take you to polish your gsoc work ?
(and when you can begin that part actually, because we'll have to review the whole stuff prior to polishing it)
braunr: I think about 2 weeks
But you may already start reviewing it, if you intend to do that before I rearrange the commits. Gnumach, ext2fs and defpager are ready. I just have to polish the code.
mcsim: i don't know when i'll be able to do that, so expect a few weeks on my (our) side too
ok
sorry for being slow, that's how hurd development is :)
What should I do with the libc patch that adds madvise support? Post it to bug-hurd?
hm probably the same i did for pthreads, create a topic branch in glibc.git
there is only one commit
yes
(mine was a one liner :p)
ok
it will probably be a debian patch before going into glibc anyway, just for making sure it works
But regarding the schedule: my studies begin in a week and I'll have other things to do then, so I'll probably need one more week.
don't worry, that's expected
and that's the reason why we're slow
And what should I do with the large store patch?
hm good question
what did you do for now ? include it in your work ? that's what i saw iirc
Yes. It consists of two parts.
the original part and the modifications ?
i think youpi would know better about that
The first (small) one adds a notification to the libpager interface and the second one adds support for large stores.
i suppose we'll probably merge the large store patch at some point anyway
Yes, both original and modifications
good
I'll split these parts into different commits and I'll try to make the support for large stores independent from the other work.
that would be best
if you can make it so that, by omitting (or including) one patch, we can add your patches to the debian package, it would be great
(only with regard to the large store change, not other potential smaller conflicts)
braunr: I also found several bugs in defpager that I haven't fixed since winter.
oh
it seems nobody has run into them yet
i'm very interested in those actually
(not too soon because it concerns my work on pageout, which is postponed after pthreads and select)
ok. then I'll do that first.

## IRC, freenode, #hurd, 2012-09-24

mcsim: what is vm_get_advice_info ?
braunr: hello. It should supply some machine specific parameters regarding clustered reading. At the moment it supplies only the maximal possible cluster size.
mcsim: why such a need ?
It is used by defpager, as it can't allocate memory dynamically and every thread has to allocate the maximal size beforehand
mcsim: i see

## IRC, freenode, #hurd, 2012-10-05

braunr: I think it's not worth separating the large store patch for ext2 from the patch moving it to the new libpager interface. Am I right?
mcsim: it's worth separating, but not creating two versions
i'm not sure what you mean here
First, I applied the large store patch, and then I changed the patched code to make it work with the new libpager interface. So the changes that make ext2 work with the new interface depend on the large store patch.
braunr: ^
mcsim: you're not forced to make each version resulting from a new commit work
but don't make big commits
so if changing an interface requires its users to be updated twice, it doesn't make sense to do that
just update the interface cleanly; you'll have one or more commits that produce intermediate versions that don't build, that's ok
then in another, separate commit, adjust the users
braunr: The only user now is ext2. And the problem with ext2 is that I updated not the version from the git repository, but the version that I got after applying the large store patch.
So in other words my question is as follows: should I make a commit that moves the version of ext2fs without the large store patch to the new interface?
you're asking if you can include the large store patch in your work, and by extension, in the main branch
i would say yes, but this must be discussed with others

## IRC, freenode, #hurd, 2013-02-18

mcsim: so, currently reviewing gnumach
braunr: hello
mcsim: the review branch, right ?
braunr: yes
braunr: What do you start with?
memory refreshing
i see you added the advice twice, to vm_object and vm_map_entry
iirc, we agreed to only add it to map entries
am i wrong ?
let me see
the real question being: what do you use the object advice for ?
> iirc, we agreed to only add it to map entries
braunr: TBH, I do not remember that. At some point we came to the conclusion that there should be only one advice. But I'm not sure if that was the final decision.
maybe it wasn't, yes
that's why i've just reformulated the question
if (map_entry && (map_entry->advice != VM_ADVICE_DEFAULT)) advice = map_entry->advice; else advice = object->advice;
ok
It just participates in determining the actual advice
ok, that's not a bad thing
let's keep it
please document VM_ADVICE_KEEP
and rephrase "How to handle page faults" in vm_object.h to something like "How to tune page fault handling"
mcsim: what's the point of VM_ADVICE_KEEP btw ?
braunr: Probably it is better to remove it?
well, if it doesn't do anything, probably
braunr: advising was part of mo_set_attributes before
now it is redundant
i see
so yes, remove it
(don't waste time on a gcs-like changelog format for now)
i also suggest creating _vX branches so we can compare the changes between each of your review branches
hm, minor coding style issues like switch(...) instead of switch (...)
why does syscall_vm_advise return MACH_SEND_INTERRUPTED if the target map is NULL ?
is it modelled after an existing behaviour ?
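The fallback quoted in the exchange above, map entry advice taking precedence over the object's advice, can be isolated in a small helper. This is a sketch with simplified types, not the actual vm_fault.c code:

```c
#include <stddef.h>

enum { VM_ADVICE_DEFAULT, VM_ADVICE_RANDOM, VM_ADVICE_SEQUENTIAL };

/* Simplified stand-ins for the real Mach structures. */
struct vm_map_entry { int advice; };
struct vm_object    { int advice; };

/* The map entry advice, set per mapping (e.g. via madvise), takes
   precedence; otherwise fall back to the advice stored in the
   object, which the pager chose for its backing store. */
static int
vm_effective_advice(const struct vm_map_entry *entry,
                    const struct vm_object *object)
{
    if (entry != NULL && entry->advice != VM_ADVICE_DEFAULT)
        return entry->advice;
    return object->advice;
}
```

This mirrors the conclusion of the review: the object advice is kept, but only as a fallback when the mapping says nothing specific.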
ah, it's the syscall version
braunr: every syscall does so
and the error is supposed to be used by user stubs to switch to the rpc version
ok
hm
you've replaced the obsolete port_set_select and port_set_backup calls with your own
don't do that
instead, add your calls to the new gnumach interface
mcsim: out of curiosity, have you actually tried the syscall version ?
braunr: Isn't it called by default?
i don't think so, no
then no
ok
you could name vm_get_advice_info vm_advice_info
regarding the obsolete calls, did you mean that only in regard to port_set_* or all the other calls too?
all of them
i missed one, yes
the idea is: don't change the existing interface
> you could name vm_get_advice_info vm_advice_info
could or should? i.e. rename?
i'd say should, to remain consistent with the existing similar calls
ok
can you explain KERN_NO_DATA a bit more ?
i suppose it's what servers should answer for neighbour pages that don't exist in the backend, right ?
the kernel can ask the server for some data to read ahead, but the server can be in a situation where it does not know what data should be prefetched
yes
ok
it is used by the ext2 server with the large store patch
so its purpose is to allow the kernel to free the preallocated pages that won't be used
do i get it right ?
no. the ext2 server has a buffer for pages, and when the kernel asks it to read pages ahead it specifies a region of that buffer
ah ok
but consecutive pages in the buffer do not correspond to consecutive pages on disk
so, the kernel can only prefetch pages that were already read by the server ?
no, it can ask the server to prefetch pages that were not read by the server
hum ok
but in the case with the buffer, if a buffer page is empty, the server does not know what to prefetch
i'm not sure i'm following
well, i'm sure i'm not following
what happens when the kernel requests data from a server, right after a page fault ?
what does the message ask for ?
the kernel is unaware of the actual size of the file where the page fault occurred
because of the buffer indirection, right?
i don't know what "buffer" refers to here
this is the buffer in memory where the ext2 server reads pages, with the large store patch
the ext2 server does not map the whole disk, only some of its pages, and it maps these pages in a special buffer
that means that consecutiveness of pages in memory does not mean that they are consecutive on disk or logically (belong to the same file)
ok, so it's a page pool with unordered pages
but what do you mean when you say "the server does not know what to prefetch"?
it normally has everything needed to determine that
For instance, a page fault occurs that leads to reading a 4k file. But the kernel does not know the actual size of the file and asks to prefetch 16K bytes.
yes
There is no sense in prefetching something that does not belong to this file
yes, but the server *knows* that
and the server answers with KERN_NO_DATA
the server should always say something about every page that was asked for
then, again, isn't the purpose of KERN_NO_DATA to notify the kernel that it can release the preallocated pages meant for the non existing data ?
(non existing, or more generally non prefetchable)
yes
then why did you answer no to
15:46 < braunr> so its purpose is to allow the kernel to free the preallocated pages that won't be used
is there something missing ?
(well obviously, notify the kernel it can go on with page fault handling)
braunr: sorry, misunderstood/misread
ok, so good, i got this right :)
i wonder if KERN_NO_DATA may be a bit too vague
people might confuse it with ENODATA
Actually, this is a transformation of ENODATA
I was looking among the POSIX error codes and thought that this was the most appropriate
i'm not sure it is
first, it's about STREAMS, a commonly unused feature
and second, the code is obsolete
braunr: AFAIR the purpose of KERN_NO_DATA is not only to free pages.
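The behaviour agreed on in the exchange above, the server must answer for every requested page, supplying data for pages that belong to the file and KERN_NO_DATA for the rest, can be modelled with a tiny decision function. The types and names are simplified stand-ins; the real logic lives in the ext2fs pager and libpager:

```c
/* Stand-in for the KERN_NO_DATA answer discussed above. */
enum page_answer { PAGE_DATA, PAGE_NO_DATA };

/* Decide, for one page of a prefetch request, whether the pager can
   supply data.  The kernel may ask for more than the file contains
   (it does not know the file size); the pager must still answer for
   every page, saying "no data" for pages past end of file, so the
   kernel can release the preallocated pages and finish the fault. */
static enum page_answer
pager_answer(unsigned long file_size, unsigned long page_offset)
{
    return page_offset < file_size ? PAGE_DATA : PAGE_NO_DATA;
}
```

For the 4k-file example above, with a 16K prefetch request, only the first 4K page gets data and the remaining three pages are answered with "no data".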
Without this call something would hang
15:59 < braunr> (well obviously, notify the kernel it can go on with page fault handling)
yes
hm sorry again
i don't see anything better for the error name for now
and it's really minor, so let's keep it as it is
actually, ENODATA being obsolete helps here
ok, done for now, work calling
we'll continue later or tomorrow
braunr: ok
other than that, this looks ok on the kernel side for now
the next change is a bit larger, so i'd like to take the time to read it
braunr: ok
regarding moving calls in mach.defs, can I put them elsewhere?
gnumach.defs
you'll probably need to rebase your changes to get it
braunr: I'll rebase this later, when we finish the review
ok
keep the comments in a list then, so as not to forget
(logging irc is also useful)

## IRC, freenode, #hurd, 2013-02-20

mcsim: why does VM_ADVICE_DEFAULT have its own entry ?
braunr: it's a kind of fallback mode
i suppose that even the random strategy could read several pages at once
yes
but then, why did you name it "default" ?
because it is assigned by default
ah
so you expect pagers to set something else for all the objects they create
yes
ok, why not, but add a comment please
at least until all pagers support clustered reading
ok
even after that, it's ok
just say it's there to keep the previous behaviour by default, so people don't get the idea of changing it too easily
a comment in vm_advice.h?
no, in vm_fault.c, right above the array
why does vm_calculate_clusters return two ranges ?
also, "Function PAGE_IS_NOT_ELIGIBLE is used to determine if"; PAGE_IS_NOT_ELIGIBLE doesn't look like a function
I thought to make it possible not only to prefetch a range, but also to free some memory that is not used any more
braunr: ^
but I didn't implement it :/
don't overengineer it
reduce it to what's needed
braunr: ok
braunr: do you think it's worth implementing?
no
braunr: it could be useful for the sequential policy
describe what you have in mind a bit more please, i think i don't have the complete picture
with the sequential policy the user is supposed to read strictly in sequential order, so pages that the user is not supposed to read could be put in the unused list
which pages isn't the user supposed to read ?
if the user reads pages in increasing order, then they are not supposed to read the pages right before the page where the page fault occurred
right ?
do you mean higher ?
the ones before
before would be lower then
oh, "right before"
yes :)
why not ?
the initial assumption, that MADV_SEQUENTIAL expects *strict* sequential access, looks wrong
remember it's just a hint
a user could just access pages that are close to one another and still use MADV_SEQUENTIAL, expecting a speedup because the pages are close
well ok, this wouldn't be wise
MADV_SEQUENTIAL should be optimized for true sequential access, agreed
but i'm not sure i'm following you
but I'm not going to page these pages out. Just put them in the unused list, and if they are used later they will be moved to the active list
your optimization seems to be about freeing pages that were prefetched and not actually accessed
what's the unused list ?
the inactive list
ok
so that they're freed sooner
yes
well, i guess all neighbour pages should first be put in the inactive list
iirc, pages in the inactive list aren't mapped
this would force another page fault, with a quick resolution, to tell the vm system the page was actually used, and must become active, and be paged out later than other inactive pages
but i really think it's not worth doing now
clustered pageins are about improving I/O
page faults without I/O are orders of magnitude faster than I/O
it wouldn't bring much right now
ok, I'll remove this, but put it in the TODO
I'm not sure that the right list is the inactive list, but rather the list that is scanned to page out pages to the swap partition.
There should be such a list
both the active and inactive lists are
the active one is scanned when the inactive one isn't large enough
(the current ratio of active pages is limited to 1/3)
(btw, we could try increasing it to 1/2)
iirc, linux uses 1/2
your comment about unlock_request isn't obvious, i'll have to reread again
i mean, the problem isn't obvious
ew, functions with so many indentation levels :/
i forgot how ugly some parts of the mach vm were
mcsim: basically it's ok, i'll wait for the simplified version for another pass
simplified?
22:11 < braunr> reduce to what's needed
ok
and what comment?
your XXX in vm_fault.c, when calling vm_calculate_clusters
is m->unlock_request the same for the whole cluster or should I recalculate it for every page?
that's what i said, i'll have to come back to that later, after i have reviewed the userspace code
i think so
so that i understand the interactions better
braunr: pushed the v1 branch
braunr: "Move new calls to gnumach.defs file" and "Implement putting pages in inactive list with sequential policy" are in my TODO
mcsim: ok

## IRC, freenode, #hurd, 2013-02-24

mcsim: where does the commit from neal (reworking libpager) come from ?
(ok, the question looks a little weird semantically, but i think you get my point)
braunr: do you want me to give you a link to the mail with this commit?
why not, yes
http://permalink.gmane.org/gmane.os.hurd.bugs/446
ok so http://lists.gnu.org/archive/html/bug-hurd/2012-06/msg00001.html
ok
so, we actually have three things to review here
that libpager patch, the ext2fs large store one, and your work
mcsim: i suppose something in your work depends on neal's patch, right ?
i mean, why did you work on top of it ?
Yes. All the user level code
i see it adds some notifications
no, the notifications are for the large store
ok
but the rest is for my work
but what does it do that you require ?
braunr: this patch adds support for multipage requests. There were just stubs that returned errors for chunks longer than one page before.
ok
for now, i'll just consider that it's ok, as well as the large store patch
ok
i've skipped all patches up to "Make mach-defpager process multipage requests in m_o_data_request." since they're obvious
but this one isn't
mcsim: why is the offset member a vm_size_t in struct block ?
(these things matter for large file support on 32-bit systems)
braunr: It should be vm_offset_t, right?
yes
well, it seems so, but i'm not sure what the offset is here
vm_offset_t is normally an offset inside a vm_object
and if we want large file support, it could become a 64-bit integer
while vm_size_t is a size inside an address space, so it's either 32 or 64-bit, depending on the address space size
but here, if offset is an offset inside an address space, vm_size_t is fine
same question for send_range_parameters
braunr: TBH, I do not distinguish vm_size_t and vm_offset_t
well, they can be easily confused, yes
they're both offsets and sizes actually
they're integers
so here I used vm_offset_t because the field name is offset
but vm_size_t is an offset/size inside an address space (a vm_map), while vm_offset_t is an offset/size inside an object
braunr: I didn't know that
it's not clear at all
and it may not have been that clear in mach either
but i think it's best to consider them this way from now on
well, it's not that important anyway, since we don't have large file support, but we should some day :/
i'm afraid we'll have it as a side effect of the 64-bit port
mcsim: just name them vm_offset_t when they're offsets, for consistency
but it seems that I guessed right, because I use vm_offset_t variables in the mo_ functions
well ok, but my question was about struct block, where you use vm_size_t
braunr: I consider that a mistake
ok, moving on
in upload_range, there are two XXX comments i'm not sure to understand
I put the second XXX there because at the moment I wrote this, not all hurd libraries and servers supported a size different from vm_page_size
But then I fixed this and replaced vm_page_size with size in
page_read_file_direct
ok, then update the comment accordingly
When I was adding the third XXX, I tried to check everything. But I still had a feeling that I had forgotten something.
No, it is better to remove the second and third XXX, since I didn't find what I missed
well, that's what i mean by "update" :)
ok
and the first XXX is just an optimisation. The idea is that there is no case where the whole structure is used in one function.
ok
But I was not sure if it was worth doing, because if some bug appears in the future it could be hard to find. I mean that maintainability decreases because of using a union
So, I'd rather keep it like it is
how is struct send_range_parameters used ?
it doesn't look to be something stored for long
also, you're allowed to use GNU extensions
It is used to pass parameters from one function to another
which of them?
see http://gcc.gnu.org/onlinedocs/gcc-4.4.7/gcc/Unnamed-Fields.html#Unnamed-Fields
mcsim: if it's used to pass parameters, it's likely always on the stack
braunr: I use it when necessary
we really don't care much about a few extra words on the stack
the difference in size would matter if a lot of those were stored in memory for long durations
that's not the case, so the size isn't a problem, and you should remove the comment
ok
mcsim: if i get it right, the libpager rework patch changes some parameters from byte offsets to page frame numbers
braunr: yes
why don't you check errors in send_range ?
braunr: it was absent in the original code, but you're right, I should do this
i'm not sure how to handle any error there, but at least an assert
I found a place where the pager just panics
for now it's ok
your work isn't about avoiding panics, but there must be a check, so that if we can debug it and reach that point, we'll know what went wrong
i don't understand the prototype change of default_read :/
it looks like it doesn't return anything any more
has it become asynchronous ?
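The byte-offset/page-frame-number conversion mentioned above rests on a handful of rounding helpers. The sketch below is modelled on Mach's atop/ptoa and trunc_page/round_page macros (simplified, assuming PAGE_SIZE is a power of two; the real definitions live in the gnumach headers):

```c
typedef unsigned long vm_offset_t;  /* offset/size inside an object */

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ul << PAGE_SHIFT)

/* bytes -> page frame number, and back */
#define atop(x)       ((x) >> PAGE_SHIFT)
#define ptoa(x)       ((vm_offset_t)(x) << PAGE_SHIFT)

/* round a byte offset down/up to a page boundary */
#define trunc_page(x) ((x) & ~(PAGE_SIZE - 1))
#define round_page(x) (trunc_page((x) + PAGE_SIZE - 1))
```

Passing page frame numbers instead of byte offsets makes it impossible for a caller to hand the pager an unaligned offset, which is one reason an interface rework might prefer them.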
It was returning some status before, but now it handles this status on its own
hum, how ?
how do you deal with errors ?
in the old code, default_read returned kr, and this kr was used to determine which m_o_ function would be used
now default_read calls m_o_ on its own
ok

## IRC, freenode, #hurd, 2013-03-06

braunr: hi, regarding memory policies. Should I create a separate policy that will do pageout, or is VM_ADVICE_SEQUENTIAL good enough?
braunr: at the moment it is exactly like NORMAL
mcsim: i thought you only did pageins
braunr: yes, but I'm doing pageouts now
oh
i'd prefer you didn't :/
if you want to improve paging, i have a suggestion i believe is a lot better
and we have 3 patches concerning libpager that we need to review, polish, and merge in
braunr: That's not hard, and I think I know what to do
yes, i understand that
but it may change the interface and conflict with the pending changes
braunr: What changes?
the large store patch, neal's libpager rework patch on top of which you made your changes, and your changes
the idea i have in mind was writeback throttling
[[hurd/translator/ext2fs]], [[hurd/libpager]].
i was planning on doing it myself but if you want to work on it, feel free to
it would be a much better improvement at this time than clustered pageouts (which can then immediately follow)
braunr: ok
braunr: but this looks like a much bigger task for me
we'll talk about the strategy i had in mind tomorrow
i hope you find it simple enough
on the other hand, clustered pageouts are very similar to pageins
and we have enough paging related changes to review that adding another wouldn't be such a problem actually
so, add?
if that's what you want to do, ok
i'll think about your initial question tomorrow

## IRC, freenode, #hurd, 2013-09-30

talking about which... did the clustered I/O work ever get concluded?
antrik: yes, mcsim was able to finish clustered pageins, and it's still on my TODO list
it will get merged eventually, now that the large store patch has also been applied

## IRC, freenode, #hurd, 2013-12-31

mcsim: do you think you'll have time during january to work on your clustered pagein work again ? :)
braunr: hello. yes, I think so. Depends how much time :)
shouldn't be much i guess
what exactly should be done there?
probably a rebase, and once the review and tests have been completed, writing the full changelogs
ok
the libpager notification-on-eviction patch has been pushed in as part of the merge of the ext2fs large store patch
i have to review neal's rework patch again, and merge it
and then i'll test your work and make debian packages for darnassus
play with it a bit, see how it goes
mcsim: i guess you could start with 62004794b01e9e712af4943e02d889157ea9163f (Fix bugs and warnings in mach-defpager)
rebase it, send it as a patch on bug-hurd; it should be straightforward and short

## IRC, freenode, #hurd, 2014-03-04

btw, has mcsim worked on vectorized i/o ?
there was something you wanted to integrate, not sure what
clustered pageins
but he seems busy
oh, pageins