From 5bd36fdff16871eb7d06fc26cac07e7f2703432b Mon Sep 17 00:00:00 2001
From: Thomas Schwinge
Date: Thu, 29 Nov 2012 01:33:22 +0100
Subject: IRC.

---
 open_issues/performance/io_system/read-ahead.mdwn | 711 ++++++++++++++++++++++
 1 file changed, 711 insertions(+)

diff --git a/open_issues/performance/io_system/read-ahead.mdwn b/open_issues/performance/io_system/read-ahead.mdwn
index 657318cd..706e1632 100644
--- a/open_issues/performance/io_system/read-ahead.mdwn
+++ b/open_issues/performance/io_system/read-ahead.mdwn
@@ -1845,3 +1845,714 @@ License|/fdl]]."]]"""]]
     but that's one way to do it
     defaults work well too as shown in other implementations


## IRC, freenode, #hurd, 2012-08-09

    <mcsim> braunr: I'm still debugging ext2 with the large storage patch
    <braunr> mcsim: tough problems ?
    <mcsim> braunr: The same issues as I always meet when doing debugging, but it takes time.
    <braunr> mcsim: so nothing blocking so far ?
    <mcsim> braunr: I can't tell you for sure that I will finish up to the 13th of August, and this is the unofficial pencils-down date.
    <braunr> all right, but are you blocked ?
    <mcsim> braunr: If you mean issues that I can not even imagine how to solve, then there are none.
    <braunr> good
    <braunr> mcsim: i'll try to review your code again this week end
    <braunr> mcsim: make sure to commit everything even if it's messy
    <mcsim> braunr: ok
    <mcsim> braunr: I made changes to defpager, but I haven't tried them. Commit them too?
    <braunr> mcsim: sure
    <braunr> mcsim: does it work fine without the large storage patch ?
    <mcsim> braunr: looks fine, but TBH I can't even run things like fsx, because even without my changes it failed mightily at once.
    <braunr> mcsim: right, well, that will be part of another task :)


## IRC, freenode, #hurd, 2012-08-13

    <mcsim> braunr: hello. Seems ext2fs with the large store patch works.


## IRC, freenode, #hurd, 2012-08-19

    <mcsim> hello. Consider this situation. There is a page fault and the kernel decides to request several pages from the pager, but at the moment the pager is able to provide only the first pages; the rest are not known yet. Is it possible to supply only one page and, regarding the rest, tell the kernel something like: "For the rest of the pages, try again later"?
    <mcsim> I tried pager_data_unavailable && pager_flush_some, but this seems not to work.
    <mcsim> Or do I have to supply something anyway?
    <braunr> mcsim: better not provide them
    <braunr> the kernel only really needs one page
    <braunr> don't try to implement "try again later", the kernel will do that if other page faults occur for those pages
    <mcsim> braunr: No, the translator just hangs
    <braunr> ?
    <mcsim> braunr: And I can't even detach it without a reboot
    <braunr> hangs when what ?
    <braunr> i mean, what happens when it hangs ?
    <mcsim> If the kernel requests 2 pages and I provide one, then when a page fault occurs in the second page the translator hangs.
    <braunr> well that's a bug
    <braunr> clustered pager transfer is a mere optimization, you shouldn't transfer more than you can just to satisfy some requested size
    <mcsim> I think that is because I create fictitious pages before calling mo_data_request
    <braunr> as placeholders ?
    <mcsim> Yes. Is it correct if I don't grab fictitious pages?
    <braunr> no
    <braunr> i don't know the details well enough about fictitious pages unfortunately, but it really feels wrong to use them where real physical pages should be used instead
    <braunr> normally, an in-transfer page is simply marked busy
    <mcsim> But if a page is already marked busy, the kernel will not ask for it another time.
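
braunr answers this just below: when the pager replies, the kernel unbusies the leftover pages. A minimal sketch of what that kernel-side bookkeeping could look like, assuming a reply that covers only the first `supplied` bytes of the request (hypothetical helper, locking elided, not actual gnumach code):

    /* Sketch (not actual gnumach code): when a pager's reply covers only
       the first `supplied' bytes of a clustered request, release the
       placeholder pages for the rest of the range so that a later fault
       on them triggers a fresh memory_object_data_request instead of
       blocking forever.  Locking is elided. */
    static void
    release_unclaimed_placeholders (vm_object_t object,
                                    vm_offset_t req_offset,
                                    vm_size_t req_length,
                                    vm_size_t supplied)
    {
      vm_offset_t offset;

      for (offset = req_offset + supplied;
           offset < req_offset + req_length;
           offset += PAGE_SIZE)
        {
          vm_page_t m = vm_page_lookup (object, offset);

          if (m != VM_PAGE_NULL && m->fictitious && m->busy)
            {
              /* Wake any thread that faulted on the page while it was
                 busy; after waking it re-looks the page up and retries
                 the fault, issuing its own pager request.  */
              PAGE_WAKEUP_DONE (m);
              vm_page_free (m);
            }
        }
    }
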
    <braunr> when the pager replies, you unbusy them
    <braunr> your bug may be that you incorrectly use pmap
    <braunr> you shouldn't create mmu mappings for pages you didn't receive from the pagers
    <mcsim> I don't create them
    <braunr> ok so you correctly get the second page fault
    <mcsim> If the pager supplies only the first page, when two were asked for, then the second page will not become un-busy.
    <braunr> that's a bug
    <braunr> your code shouldn't assume the pager will provide all the pages it was asked for
    <braunr> only the main one
    <mcsim> Will it be ok if I provide a special attribute that will keep the information that a page has been advised?
    <braunr> what for ?
    <braunr> i don't understand "page has been advised"
    <mcsim> An advised page is a page that is asked for as part of the cluster, but there wasn't a page fault in it.
    <mcsim> I need this attribute because if I don't inform the kernel about this page somehow, then the kernel will not change the attributes of this page.
    <braunr> why would it change its attributes ?
    <mcsim> But if a page fault occurs in a page that was asked for, then the page will already be busy by that moment.
    <braunr> and what attribute ?
    <mcsim> advised
    <braunr> i'm lost
    <braunr> 08:53 < mcsim> I need this attribute because if I don't inform the kernel about this page somehow, then the kernel will not change the attributes of this page.
    <braunr> you need the advised attribute because if you don't inform the kernel about this page, the kernel will not change the advised attribute of this page ?
    <mcsim> Not only advised, but busy as well.
    <mcsim> And if a page fault occurs in this page, the kernel will not ask for it a second time. The kernel will just block.
    <braunr> well that's normal
    <mcsim> But if the kernel blocks and the pager is not going to report somehow about this page, then the translator hangs.
    <braunr> but the pager is going to report
    <braunr> and in this report, there can be fewer pages than requested
    <mcsim> braunr: You told me not to report
    <braunr> the kernel can deduce it didn't receive all the pages, and mark them unbusy anyway
    <braunr> i told you not to transfer more than requested
    <braunr> but not sending data can be a form of communication
    <braunr> i mean, sending a message in which data is missing
    <braunr> it simply means it's not there, but this info is sufficient for the kernel
    <mcsim> hmmm... Seems I understood you. Let me try something.
    <mcsim> braunr: I informed the kernel about a missing page as follows: pager_data_supply (pager, precious, writelock, i, 1, NULL, 0); Am I right?
    <braunr> i don't know the interface well
    <braunr> what does it mean ?
    <braunr> are you passing NULL as the data for a missing page ?
    <mcsim> yes
    <braunr> i see
    <braunr> you shouldn't need a request for that though, avoiding useless ipc is a good thing
    <mcsim> i is the number of the page, 1 is the quantity
    <braunr> but if you can't find a better way for now, it will do
    <mcsim> But this does not work :(
    <braunr> that's a bug
    <braunr> in your code probably
    <mcsim> braunr: supplying NULL as data returns MACH_SEND_INVALID_MEMORY
    <braunr> but why would it work ?
    <braunr> mach expects something
    <braunr> you have to change that
    <mcsim> It's mig who refuses the data. Mach does not even get the call.
    <braunr> hum
    <mcsim> That's why I propose to provide a new attribute, that will keep the information regarding whether the page was asked for as advice or not.
    <braunr> i still don't understand why
    <braunr> why don't you fix mig so you can send your null message instead ?
    <mcsim> braunr: because usually this is an error
    <braunr> the kernel will decide if it's an error
    <braunr> what kind of reply do you intend to send the kernel for these "advised" pages ?
    <mcsim> no reply. But when a page fault occurs in a busy page that is also advised, the kernel will not block, but ask for this page another time.
    <mcsim> And how will the kernel know whether this is an error or not?
    <braunr> why ask another time ?!
    <braunr> you really don't want to flood pagers with useless messages
    <braunr> here is how it should be
    <braunr> 1/ the kernel requests pages from the pager
    <braunr> it knows the range
    <braunr> 2/ the pager replies what it can, full range, subset of it, even only one page
    <braunr> 3/ the kernel uses what the pager replied, and unbusies the other pages
    <mcsim> The first time the page was asked for because a page fault occurred in its neighborhood. And the second time because a PF occurred in the page itself.
    <braunr> well it shouldn't
    <braunr> or it should, but then you have a segfault
    <mcsim> But the kernel does not keep the bounds of the range that it asked for.
    <braunr> if the kernel can't find the main page, the one it needs to make progress, it's a segfault
    <mcsim> And this range could be supplied in several messages.
    <braunr> absolutely not
    <braunr> you defeat the purpose of clustered pageins if you use several messages
    <mcsim> But the interface supports it
    <braunr> the interface supported single page transfers, doesn't mean it's good
    <braunr> well, you could use several messages
    <braunr> as what we really want is less I/O
    <mcsim> No one keeps the bounds of the requested range, so it couldn't be checked that the range was split
    <braunr> but it would be so much better to do it all with as few messages as possible
    <braunr> does the kernel know the main page ?
    <mcsim> Splitting the range is not optimal, but it's not an error.
    <braunr> i assume it does
    <braunr> doesn't it ?
    <mcsim> no, that's why I want to provide a new attribute.
    <braunr> i'm sorry i'm lost again
    <braunr> how does the kernel know a page fault has been serviced ?
    <mcsim> It receives an interrupt
    <braunr> ?
    <braunr> let's not mix terms
    <mcsim> oh.. I read it as "received". Sorry
    <mcsim> It gets the mo_data_supply message. Then it replaces the fictitious pages with real ones.
    <braunr> so you get a message
    <braunr> and you kept track of the range using fictitious pages
    <braunr> use the busy flag instead, and another way to retain the range
    <mcsim> I allocate fictitious pages to reserve a place. Then if a page fault occurs in such a fictitious page, the kernel will not send another mo_data_request call, it will wait until the fictitious page unblocks.
    <braunr> i'll have to check the code but it looks unoptimal to me
    <braunr> we really don't want to allocate useless objects when a simple busy flag would do
    <mcsim> busy flag for what? There is no page yet
    <braunr> we're talking about mo_data_supply
    <braunr> actually we're talking about the whole page fault process
    <mcsim> We can't mark nothing as busy, that's why the kernel allocates a fictitious page and marks it as busy until the real page is supplied.
    <braunr> what do you mean "nothing" ?
    <mcsim> VM_PAGE_NULL
    <braunr> uh ?
    <braunr> when are physical pages allocated ?
    <braunr> on request or on reply from the pager ?
    <braunr> i'm reading mo_data_supply, and it looks like the page is already busy at that time
    <mcsim> they are allocated by the pager and then supplied in the reply
    <mcsim> Yes, but these pages are fictitious
    <braunr> show me please
    <braunr> in the master branch, not yours
    <mcsim> that page is fictitious?
    <braunr> yes
    <braunr> i'm referring to the way mach currently does things
    <mcsim> vm/vm_fault.c:582
    <braunr> that's memory_object_lock_page
    <braunr> hm wait
    <braunr> my bad
    <braunr> ah that damn object chaining :/
    <braunr> ok
    <braunr> the original code is stupid enough to use fictitious pages all the time, you probably have to do the same
    <mcsim> hm... Attributes will be useless then; the pager should tell something about the pages that it is not going to supply.
    <braunr> yes
    <braunr> that's what null is for
    <mcsim> Not null, null is an error.
    <braunr> one problem i can think of is making sure the kernel doesn't interpret missing as error
    <mcsim> right
    <mcsim> I think it's better to have a special value for mo_data_error
    <braunr> probably
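
The protocol braunr sketches above, seen from the pager's side and expressed with the classic Mach external-pager calls (the libpager internals under discussion wrap these; the `backing_store_*` helpers are hypothetical):

    /* Sketch: servicing a clustered memory_object_data_request by
       supplying only the prefix of the range that is actually ready.
       backing_store_pages_ready() and backing_store_read() are
       hypothetical helpers, not part of any real interface. */
    kern_return_t
    my_data_request (memory_object_t obj,
                     memory_object_control_t ctl,
                     vm_offset_t offset,
                     vm_size_t length,
                     vm_prot_t desired_access)
    {
      vm_size_t ready = backing_store_pages_ready (offset, length);
      vm_offset_t data;

      if (ready == 0)
        /* Not even the faulting page: report an error so the kernel
           does not leave the page busy forever.  */
        return memory_object_data_error (ctl, offset, vm_page_size,
                                         KERN_MEMORY_ERROR);

      data = backing_store_read (offset, ready);

      /* Supply what we have.  The kernel is expected to unbusy the
         placeholder pages for the [offset + ready, offset + length)
         remainder and to request them again later, if and when they
         are really touched.  */
      return memory_object_data_supply (ctl, offset, data, ready,
                                        VM_PROT_NONE, FALSE,
                                        MACH_PORT_NULL);
    }

The question left open in the log, distinguishing "missing" from "error", is exactly why mcsim proposes a dedicated mo_data_error value rather than overloading a null supply.
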

### IRC, freenode, #hurd, 2012-08-20

    <antrik> braunr: I think it's useful to allow supplying the data in several batches. the kernel should *not* assume that any data missing in the first batch won't be supplied later.
    <braunr> antrik: it really depends
    <braunr> i personally prefer synchronous approaches
    <antrik> demanding that all data is supplied at once could actually turn readahead into a performance killer
    <mcsim> antrik: Why? The only drawback I see is higher response time for the page fault, but it also leads to reduced overhead.
    <braunr> that's why "it depends"
    <braunr> mcsim: it brings benefit only if enough preloaded pages are actually used to compensate for the time it took the pager to provide them
    <antrik> which is the case for many workloads (including sequential access, which is the common case we want to optimize here)
    <antrik> mcsim: the overhead of an extra RPC is negligible compared to increased latencies when dealing with slow backing stores (such as disk or network)
    <mcsim> antrik: also many replies lead to fragmentation, while in one reply all data is gathered in one bunch. If all data is placed consecutively, then it may be transferred faster next time.
    <braunr> mcsim: what kind of fragmentation ?
    <antrik> I really really don't think it's a good idea for the pager to hold back the first page (which is usually the one actually blocking) while it's still loading some other pages (which will probably be needed only in the future anyways, if at all)
    <mcsim> antrik: then all pagers should be changed to handle asynchronous data supply
    <mcsim> it's a bit late to change that now
    <mcsim> there could be two cases of data placement in the backing store: 1/ all asked data is placed consecutively; 2/ it is spread over the backing store. If the pager gets the request in one message it will more likely place the data consecutively. So to have data consecutive in each pager, each pager has to try to send the data in one message. Having data placed consecutively is important, since reading such data is much faster.
    <braunr> mcsim: you're confusing things ..
    <braunr> or you're not telling them properly
    <mcsim> Ok. Let me try one more time
    <braunr> since you're working *only* on pagein, not pageout, how do you expect spread pages being sent in a single message to be better than multiple messages ?
    <mcsim> braunr: I think about the future :)
    <braunr> ok
    <braunr> but antrik is right, paging in too much can reduce performance
    <braunr> so the default policy should be adjusted for both the worst case (one page) and the average/best (some/many contiguous pages)
    <braunr> through measurement ideally
    <antrik> mcsim: BTW, I still think implementing clustered pageout has higher priority than implementing madvise()... but if the latter is less work, it might still make sense to do it first of course :-)
    <braunr> there aren't many users of madvise, true
    <mcsim> antrik: Implementing madvise I expect to be very simple. It should just translate the call to vm_advise
    <antrik> well, that part is easy of course :-) so you already implemented vm_advise itself I take it?
    <mcsim> antrik: Yes, that was also quite easy.
    <antrik> great :-)
    <antrik> in that case it would be silly of course to postpone implementing the madvise() wrapper. in other words: never mind my remark about priorities :-)
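
The madvise() wrapper mcsim describes is indeed a thin translation layer. A sketch of its shape; the vm_advise signature and the VM_ADVICE_* constants here are assumptions modeled on other Mach VM calls, not the actual interface from mcsim's branch:

    #include <errno.h>
    #include <mach.h>
    #include <sys/mman.h>

    /* Assumed RPC from the readahead branch; not in stock GNU Mach. */
    extern kern_return_t vm_advise (task_t task, vm_address_t addr,
                                    vm_size_t len, int advice);

    /* Assumed policy selectors matching the kernel's policy table. */
    #define VM_ADVICE_NORMAL     0
    #define VM_ADVICE_RANDOM     1
    #define VM_ADVICE_SEQUENTIAL 2

    int
    madvise (void *addr, size_t len, int advice)
    {
      int adv;

      /* Map the POSIX advice values onto the kernel's selectors.  */
      switch (advice)
        {
        case MADV_NORMAL:     adv = VM_ADVICE_NORMAL; break;
        case MADV_RANDOM:     adv = VM_ADVICE_RANDOM; break;
        case MADV_SEQUENTIAL: adv = VM_ADVICE_SEQUENTIAL; break;
        default:
          errno = EINVAL;
          return -1;
        }

      if (vm_advise (mach_task_self (), (vm_address_t) addr, len, adv)
          != KERN_SUCCESS)
        {
          errno = EINVAL;
          return -1;
        }
      return 0;
    }
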

## IRC, freenode, #hurd, 2012-09-03

    <mcsim> I try a test with ext2fs. It works, then I just recompile ext2fs and it stops working, then I recompile it again several times and each time the result is unpredictable.
    <braunr> sounds like a concurrency issue
    <mcsim> I can run the same test several times and ext2 works until I recompile it. That's the problem. Could that be concurrency too?
    <braunr> mcsim: without bad luck, yes, unless "several times" is a lot
    <braunr> like several dozens of tries


## IRC, freenode, #hurd, 2012-09-04

    <mcsim> hello. I want to tell that the ext2fs translator that I work on has replaced, on my system, the old variant that processed only single page requests. And it works with partitions bigger than 2 GB.
    <mcsim> Probably I'm not far from the end.
    <mcsim> But it's worth mentioning that I didn't fix that nasty bug that I told about yesterday.
    <mcsim> braunr: That bug sometimes appears after recompilation of ext2fs and always disappears after sync or reboot. Now I'm going to finish defpager and test other translators.


## IRC, freenode, #hurd, 2012-09-17

    <mcsim> braunr: hello. Do you remember that you said that the pager has to inform the kernel about the appropriate cluster size for readahead?
    <mcsim> I don't understand how the kernel would store this information, because it does not know about such a unit as a "pager".
    <mcsim> Can you give me advice about how this could be implemented?
    <youpi> mcsim: it can store it in the object
    <mcsim> youpi: That's too big an overhead
    <mcsim> youpi: at least from my pov
    <braunr> mcsim: we discussed this already
    <braunr> mcsim: there is no "pager" entity in the kernel, which is a defect from my PoV
    <braunr> mcsim: the best you can do is follow what the kernel already does
    <braunr> that is, store this property per object
    <braunr> we don't care much about the overhead for now
    <braunr> my guess is there is already some padding, so the overhead is likely to be amortized by this
    <braunr> like youpi said
    <mcsim> I remember that discussion, but I didn't get whether there should be only one or two values for all policies, or whether each policy should have its own values.
    <mcsim> braunr: ^
    <braunr> each policy should have its own values, which means it can be implemented with a simple static array somewhere
    <braunr> the information in each object is a policy selector, such as an index in this static array
    <mcsim> ok
    <braunr> mcsim: if you want to minimize the overhead, you can make this selector a char, and place it near another char member, so that you use space that was previously used as padding by the compiler
    <braunr> mcsim: do you see what i mean ?
    <mcsim> yes
    <braunr> good
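
The static scheme braunr describes, and motivates below with NetBSD's uvmadvice, would look roughly like this. The numbers are NetBSD's defaults; the type and constant names are illustrative:

    /* Sketch modeled on NetBSD's uvmadvice[] (uvm/uvm_fault.c): one
       static row per policy, giving how many pages to page in before
       and after the faulting one.  Values are NetBSD's defaults. */
    enum { VM_ADVICE_NORMAL, VM_ADVICE_RANDOM, VM_ADVICE_SEQUENTIAL };

    struct vm_advice_info
    {
      unsigned char nback;   /* pages before the faulting page */
      unsigned char nforw;   /* pages after the faulting page */
    };

    static const struct vm_advice_info vm_advice_table[] =
    {
      [VM_ADVICE_NORMAL]     = { 3, 4 },
      [VM_ADVICE_RANDOM]     = { 0, 0 },
      [VM_ADVICE_SEQUENTIAL] = { 8, 7 },
    };

    /* Per-object cost is then a single selector, placed where the
       compiler was already inserting padding:

       struct vm_object {
         ...
         unsigned char advice;   // index into vm_advice_table
         ...
       };
    */

Since the table is static and read-only, no locking or reference counting is needed, which is the core of braunr's objection to the dynamic table discussed next.
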

## IRC, freenode, #hurd, 2012-09-17

    <mcsim> hello. May I add a function krealloc to slab.c?
    <braunr> mcsim: what for ?
    <mcsim> braunr: It is quite useful for creating dynamic arrays
    <braunr> you don't want dynamic arrays
    <mcsim> why?
    <braunr> they're expensive
    <braunr> try other data structures
    <mcsim> more expensive than linked lists?
    <braunr> depends
    <braunr> but linked lists aren't the only other alternative
    <braunr> that's why btrees and radix trees (basically trees of arrays) exist
    <braunr> the best general purpose data structure we have in mach is the red black tree currently
    <braunr> but always think about what you want to do with it
    <mcsim> I want to store there sets of sizes for different memory policies. I don't expect this array to be big. But for sure I can use an rbtree for it.
    <braunr> why not a static array ?
    <braunr> arrays are perfect for known data sizes
    <mcsim> I expect the pager to supply its own sizes. So at the beginning this array holds only the default policy. When a pager wants to supply its own policy, the kernel looks it up in the table of advice. If this policy is a new set of sizes, then the kernel creates a new entry in the table of advice.
    <braunr> that would mean one set of sizes for each object
    <braunr> why don't you make things simple first ?
    <mcsim> An object stores only a pointer to an entry in this table.
    <braunr> but there is no pager object shared by memory objects in the kernel
    <mcsim> I mean struct vm_object
    <braunr> so that's what i'm saying, one set per object
    <braunr> it's useless overhead
    <braunr> i would really suggest using a global set of policies for now
    <mcsim> Probably, I don't understand you. Where do you want to store this static array?
    <braunr> it's a global one
    <mcsim> "for now"? It is not a problem to implement a table for local advice, using either an rbtree or a dynamic array.
    <braunr> it's useless overhead
    <braunr> and it's not a single integer, you want a whole container per object
    <braunr> don't do anything fancy unless you know you really want it
    <braunr> i'll link the netbsd code again as a very good example of how to implement global policies that work more than decently for every file system in this OS
    <braunr> http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/uvm/uvm_fault.c?rev=1.194&content-type=text/x-cvsweb-markup&only_with_tag=MAIN
    <braunr> look for uvmadvice
    <mcsim> But different translators have different demands. Thus changing the global policy for one translator would have an impact on the behavior of another one.
    <braunr> i understand
    <braunr> this isn't l4, or anything experimental
    <braunr> we want something that works well for us
    <mcsim> And this is acceptable?
    <braunr> until you're able to demonstrate we need different policies, i'd recommend not making things more complicated than they already are and need to be
    <braunr> why wouldn't it ?
    <braunr> we've been discussing this a long time :/
    <mcsim> because every process runs in an isolated environment, and the fact that there is something outside this environment that it has no rights over surprises me.
    <braunr> ?
    <mcsim> ok. let me dig into the uvm code. Probably my questions will disappear
    <braunr> i don't think they will
    <braunr> you're asking about the system design here, not implementation details
    <braunr> with l4, there are as you'd expect well defined components handling policies for address space allocation, or paging, or whatever
    <braunr> but this is mach
    <braunr> mach has a big shared global vm server with in-kernel policies for it
    <braunr> so it's ok to implement a global policy for this
    <braunr> and let's be pragmatic, if we don't need complicated stuff, why would we waste time on this ?
    <mcsim> It is not complicated.
    <braunr> retaining a whole container for each object, whereas they're all going to contain exactly the same stuff for years to come, seems overly complicated to me
    <mcsim> I'm not going to create a separate container for each object.
    <braunr> i'm not following you then
    <braunr> how can pagers upload their sizes into the kernel ?
    <mcsim> I'm going to create a new container only for a combination of cluster sizes that is not present in the table of advice.
    <braunr> that's equivalent
    <braunr> you're ruling out the default set, but that's just an optimization
    <braunr> whenever a file system decides to use other sizes, the problem will arise
    <mcsim> Before creating a container I'm going to look it up in the table. And only then create one
    <braunr> a table ?
    <mcsim> But there will be the same container for a huge bunch of objects
    <braunr> how do you select it ?
    <braunr> if it's a per pager container, remember there is no shared pager object in the kernel, only ports to external programs
    <mcsim> I'll give an example
    <mcsim> Suppose there are only two policies. At the beginning we have the table {{random = 4096, sequential = 8192}}. Then pager 1 wants to add a new policy where the random cluster size is 8192. It asks the kernel to create it, and after this the table will be the following: {{random = 4096, sequential = 8192}, {random = 8192, sequential = 8192}}. If pager 2 wants to create the same policy as pager 1, the kernel will look it up in the table and will not create a new entry. So the table will be the same.
    <mcsim> And each object has a link to the appropriate table entry
    <braunr> i'm not sure how this can work
    <braunr> how can pagers 1 and 2 know the sizes are the same for the same policy ?
    <braunr> (and actually they shouldn't)
    <mcsim> For faster lookup there will be created hash keys for each entry
    <braunr> what's the lookup key ?
    <mcsim> They do not know
    <mcsim> The kernel knows
    <braunr> then i really don't understand
    <braunr> and how do you select sizes based on the policy ?
    <braunr> and how do you remove unused entries ?
    <braunr> (ok this can be implemented with a simple ref counter)
    <mcsim> "and how do you select sizes based on the policy ?" you mean at page fault?
    <braunr> yes
    <mcsim> the object keeps a pointer to the appropriate entry in the table
    <braunr> ok your per object data is a pointer to the table entry and the policy is the index inside
    <braunr> so you really need a ref counter there
    <mcsim> yes
    <braunr> and you need to maintain this table
    <braunr> for me it's uselessly complicated
    <mcsim> but this keeps the design clear
    <braunr> not for me
    <braunr> i don't see how this is clearer
    <braunr> it's just more powerful
    <braunr> a power we clearly don't need now
    <braunr> nor in the following years
    <braunr> in addition, i'm very worried about the potential problems this can introduce
    <mcsim> In fact I don't feel comfortable with the thought that one translator can have an impact on the behavior of another.
    <braunr> simple example: the table is shared, it needs a lock, other data structures you may have added in your patch may also need a lock
    <braunr> but our locks are noops for now, so you just can't be sure there is no deadlock or other issues
    <braunr> and adding smp is a *lot* more important than being able to select precisely policy sizes that we're very likely not to change a lot
    <braunr> what do you mean by "one translator can impact another" ?
    <mcsim> As I understand your idea (I haven't read the uvm code yet), there is a global table of cluster sizes for different policies. And every translator can change values in this table. That is what I mean by one translator having an impact on another one.
    <braunr> absolutely not
    <braunr> translators *can't* change sizes
    <braunr> the sizes are completely static, assumed to fit all
    <braunr> it's not optimal but it's very simple and effective in practice
    <braunr> and it's not a table of cluster sizes
    <braunr> it's a table of pages before/after the faulted one
    <braunr> this reflects the fact that in mach, virtual memory (implementation and policy) is in the kernel
    <braunr> translators must not be able to change that
    <braunr> let's talk about pagers here, not translators
    <mcsim> Finally I got you. This is an acceptable tradeoff.
    <braunr> it took some time :)
    <braunr> just to clear something up
    <braunr> 20:12 < mcsim> For faster lookup there will be created hash keys for each entry
    <braunr> i'm not sure i understand you here
    <mcsim> To find out if such a policy (set of sizes) is in the table, we could look at every entry and compare each value. But it is better to create a hash value for the set and thus find equal policies.
    <braunr> first, i'm really not comfortable with hash tables
    <braunr> they really need careful configuration
    <braunr> next, as we don't expect many entries in this table, there is probably no need for this overhead
    <braunr> remember that one property of tables is locality of reference
    <braunr> you access the first entry, the processor automatically fills a whole cache line
    <braunr> so if your table fits on just a few, it's probably faster to compare entries completely than to jump around in memory
    <mcsim> But we can sort the hash keys, and in this way find policies quickly.
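
For scale, this is what the simple table braunr recommends in the next exchange looks like: with only a few entries, a full comparison scan touches one or two cache lines and needs no hashing machinery. Types and names here are hypothetical:

    /* Sketch: looking up a set of per-policy cluster sizes by direct
       comparison.  With only a few entries the whole table fits in a
       couple of cache lines, so no hash keys are needed.  Hypothetical
       types, following the table-of-advice idea discussed above. */
    #include <stddef.h>
    #include <string.h>

    #define VM_NPOLICIES 2   /* e.g. random, sequential */

    struct advice_entry
    {
      unsigned int sizes[VM_NPOLICIES];  /* cluster size per policy */
      unsigned int refcount;             /* for removing unused entries */
    };

    static struct advice_entry *
    advice_lookup (struct advice_entry *table, unsigned int n,
                   const unsigned int *sizes)
    {
      for (unsigned int i = 0; i < n; i++)
        if (memcmp (table[i].sizes, sizes, sizeof table[i].sizes) == 0)
          return &table[i];
      return NULL;  /* caller appends a new entry */
    }
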
    <braunr> cache misses are way slower than computation
    <braunr> so unless you have massive amounts of data, don't use an optimized container
    <mcsim> (20:38:53) braunr: that's why btrees and radix trees (basically trees of arrays) exist
    <mcsim> and what will be the key?
    <braunr> i'm not saying to use a tree instead of a hash table
    <braunr> i'm saying, unless you have many entries, just use a simple table
    <braunr> and since pagers don't add and remove entries from this table often, it's one case where reallocation is ok
    <mcsim> So here dynamic arrays fit the best?
    <braunr> probably
    <braunr> it really depends on the number of entries and the write ratio
    <braunr> keep in mind current processors have 32-bit or (more commonly) 64-bit cache line sizes
    <mcsim> bytes probably?
    <braunr> yes bytes
    <braunr> but i'm not willing to add a realloc like call to our general purpose kernel allocator
    <braunr> i don't want to make it easy for people to rely on it, and i hope the lack of it will make them think about other solutions instead :)
    <braunr> and if they really want to, they can just use alloc/free
    <mcsim> By "other solutions" you mean trees?
    <braunr> i mean anything else :)
    <braunr> lists are simple, trees are elegant (but add non negligible overhead)
    <braunr> i like trees because they truly "gracefully" scale
    <braunr> but they're still O(log n)
    <braunr> a good hash table is O(1), but must be carefully measured and adjusted
    <braunr> there are many other data structures, many of them you can find in linux
    <braunr> but in mach we don't need a lot of them
    <mcsim> Your favorite data structures are lists and trees. Next you'll claim that lisp is your favorite language :)
    <braunr> functional programming should eventually rule the world, yes
    <braunr> i wouldn't count lists as my favorite, which are really trees
    <braunr> there is a reason why red black trees back higher level data structures like vectors or maps in many common libraries ;)
    <braunr> mcsim: hum but just to make it clear, i asked this question about hashing because i was curious about what you had in mind, i still think it's best to use static predetermined values for policies
    <mcsim> braunr: I understand this.
    <braunr> :)
    <mcsim> braunr: Yeah. You should be cautious with me :)


## IRC, freenode, #hurd, 2012-09-21

    <antrik> mcsim: there is only one cluster size per object -- it depends on the properties of the backing store, nothing else.
    <antrik> (while the readahead policies depend on the use pattern of the application, and thus should be selected per mapping)
    <antrik> but I'm still not convinced it's worthwhile to bother with cluster size at all. do other systems even do that?...


## IRC, freenode, #hurd, 2012-09-23

    <braunr> mcsim: how long do you think it will take you to polish your gsoc work ?
    <braunr> (and when do you begin that part actually, because we'll have to review the whole stuff prior to polishing it)
    <mcsim> braunr: I think about 2 weeks
    <mcsim> But you may already start reviewing it, if you intend to do it before I rearrange the commits.
    <mcsim> Gnumach, ext2fs and defpager are ready. I just have to polish the code.
    <braunr> mcsim: i don't know when i'll be able to do that
    <braunr> so expect a few weeks on my (our) side too
    <mcsim> ok
    <braunr> sorry for being slow, that's how hurd development is :)
    <mcsim> What should I do with the libc patch that adds madvise support? Post it to bug-hurd?
    <braunr> hm probably the same thing i did for pthreads, create a topic branch in glibc.git
    <mcsim> there is only one commit
    <braunr> yes
    <braunr> (mine was a one liner :p)
    <mcsim> ok
    <braunr> it will probably be a debian patch before going into glibc anyway, just for making sure it works
    <mcsim> But according to the term schedule, I expect that my studies begin in a week and I'll have to do some other stuff then, so actually I'll probably need a week more.
    <braunr> don't worry, that's expected
    <braunr> and that's the reason why we're slow
    <mcsim> And what should I do with the large store patch?
    <braunr> hm good question
    <braunr> what did you do for now ?
    <braunr> include it in your work ?
    <braunr> that's what i saw iirc
    <mcsim> Yes. It consists of two parts.
    <braunr> the original part and the modifications ?
    <braunr> i think youpi would know better about that
    <mcsim> The first (small) one adds a notification to the libpager interface and the second one adds support for large stores.
    <braunr> i suppose we'll probably merge the large store patch at some point anyway
    <mcsim> Yes, both original and modifications
    <braunr> good
    <mcsim> I'll split these parts into different commits and I'll try to make support for large stores independent from other work.
    <braunr> that would be best
    <braunr> if you can make it so that, by omitting (or including) one patch, we can add your patches to the debian package, it would be great
    <braunr> (only with regard to the large store change, not other potential smaller conflicts)
    <mcsim> braunr: I also found several bugs in defpager, that I haven't fixed since winter.
    <braunr> oh
    <mcsim> seems nobody has noticed them.
    <braunr> i'm very interested in those actually (not too soon because it concerns my work on pageout, which is postponed after pthreads and select)
    <mcsim> ok. then I'll do that first.


## IRC, freenode, #hurd, 2012-09-24

    <braunr> mcsim: what is vm_get_advice_info ?
    <mcsim> braunr: hello. It should supply some machine specific parameters regarding clustered reading. At the moment it supplies only the maximal possible size of a cluster.
    <braunr> mcsim: why such a need ?
    <mcsim> It is used by defpager, as it can't allocate memory dynamically and every thread has to allocate the maximal size beforehand
    <braunr> mcsim: i see


## IRC, freenode, #hurd, 2012-10-05

    <mcsim> braunr: I think it's not worth separating the large store patch for ext2 and the patch for moving it to the new libpager interface. Am I right?
    <braunr> mcsim: it's worth separating, but not creating two versions
    <braunr> i'm not sure what you mean here
    <mcsim> First, I applied the large store patch, and then I was changing the patched code, to make it work with the new libpager interface. So the changes to make ext2 work with the new interface depend on the large store patch.
    <mcsim> braunr: ^
    <braunr> mcsim: you're not forced to make each version resulting from a new commit work
    <braunr> but don't make big commits
    <braunr> so if changing an interface requires its users to be updated twice, it doesn't make sense to do that
    <braunr> just update the interface cleanly, you'll have one or more commits that produce intermediate versions that don't build, that's ok
    <braunr> then in another, separate commit, adjust the users
    <mcsim> braunr: The only user now is ext2. And the problem with ext2 is that I updated not the version from the git repository, but the version that I've got after applying the large store patch. So in other words my question is as follows: should I make a commit that moves ext2fs to the new interface version without the large store patch?
    <braunr> you're asking if you can include the large store patch in your work, and by extension, in the main branch
    <braunr> i would say yes, but this must be discussed with others
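
To illustrate the 2012-09-24 exchange: the default pager cannot allocate memory dynamically, so each thread preallocates a buffer for the largest cluster the kernel may request, as reported by vm_get_advice_info. The signature and field names below are guesses about the branch-specific RPC; only the idea comes from the log:

    #include <mach.h>

    /* Assumed shape of the branch-specific RPC; not stock GNU Mach. */
    struct vm_advice_info_data
    {
      vm_size_t max_cluster_size;  /* largest request the kernel makes */
    };

    extern kern_return_t vm_get_advice_info
      (host_t host, struct vm_advice_info_data *info);

    extern char *alloc_wired_buffer (vm_size_t size);  /* hypothetical */

    static vm_size_t max_cluster;
    static char *thread_buffer;

    void
    pager_thread_init (void)
    {
      struct vm_advice_info_data info;

      if (vm_get_advice_info (mach_host_self (), &info) == KERN_SUCCESS)
        max_cluster = info.max_cluster_size;
      else
        max_cluster = vm_page_size;  /* fall back to single pages */

      /* Reserve the transfer buffer up front, since the default pager
         must never allocate while servicing a request.  */
      thread_buffer = alloc_wired_buffer (max_cluster);
    }
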