GSoC 2012 - Disk I/O Performance Tuning


Explored gnumach code. First I was reimplementing vm_fault_page as coroutine that returns before executing of mo_data_{unlock,request} calls to vm_fault. vm_fault had to analyse state of vm_fault_page for every page in loop and make a decision regarding further behavior (call mo_data_*, go to next page, etc.). But than I've got that this way is much worse, than doing everything in vm_fault_page (like in OSF mach), so I made a back-off and started working on clustered paging from the beginning (at least now I see clearer how things should be). At the moment I review kam's patch one more time and looked through mklinux code attentively.


Applied Neal's patch that reworks libpager, changed libdiskfs, tmpfs and ext2fs according to new interface. ext2fs isn't finished yet and should be reworked, but looks like I brought some bug to existing implementation and i want first to fix it and than finish rest of ext2fs. Also I pushed some code changes to hurd git repository into my branch mplaneta/gsoc12/working. Now I start working on gnumach implementation of clustered page reading. After this I'm going to implement madvise, than finish ext2fs and start porting of other translators.


First of all I'm going to do 2 programs. First will work as server, it will create an object and share it with second. Second will try to access to this object. This will cause page fault and kernel will refer to first program (server). This way I will be able to track how page faults are resolved and this will help me in debugging of readahead. IFR: server probably can use some of hurd's libraries, but it has to handle m_o_* RPC's on it's own. TODO: Find out how supply second program (client) with new object. NB: be sure that client will cause page fault, so that server always will be called (probably any caching should be disabled).

Notes on tmpfs

Current state



Infinite fsx on ext2fs:

Test 1. After about hour of work ext2fs breaks with such message: ext2fs.static: /home/buildd/build/chroot-sid/home/buildd/byhand/hurd/./ext2fs/pager.c:399: file_pager_write_page: Assertion 'block' failed. Fsx ended with Resource lost. There is no 'vmstat 1' log.

Test 2. Left for night. Finally got this message: ext2fs.static: /home/buildd/build/chroot-sid/home/buildd/byhand/hurd/./ext2fs/pager.c:399: file_pager_write_page: Assertion 'block' failed. vm_fault: memory_object_data_unlock failed. Fsx ended with Bus error.

Test 3. Finally got this message: ext2fs.static: /home/buildd/build/chroot-sid/home/buildd/byhand/hurd/./ext2fs/pager.c:399: file_pager_write_page: Assertion 'block' failed. memory_object_data_request(0x0, 0x0, 0x38000, 0x1000, 0x3) failed, 10000003. Fsx ended with Bus error. vmstat didn't show any leek.

The same test for tmpfs:

Test 1. The test continued about 5 hours and made 18314766 operations. Then this error message appeared:

READ BAD DATA: offset = 0x3317, size = 0x7012
0x 8a93 0x80f8  0x82f8  0x    1
operation# (mod 256) for the bad data may be 248
LOG DUMP (18314753 total operations):
18314754(2 mod 256): WRITE      0x38c1a thru 0x3ffff    (0x73e6 bytes)
18314755(3 mod 256): READ       0x92c3 thru 0x181b3     (0xeef1 bytes)
18314756(4 mod 256): WRITE      0x138c1 thru 0x1a06b    (0x67ab bytes)
18314757(5 mod 256): MAPWRITE 0x2564 thru 0xbc75        (0x9712 bytes)  ******WWWW
18314758(6 mod 256): MAPREAD    0x3f4e8 thru 0x3ffff    (0xb18 bytes)
18314759(7 mod 256): READ       0x2946a thru 0x30213    (0x6daa bytes)
18314760(8 mod 256): MAPWRITE 0x31fe6 thru 0x3ffff      (0xe01a bytes)
18314761(9 mod 256): TRUNCATE DOWN      from 0x40000 to 0x145e8
18314762(10 mod 256): MAPWRITE 0xca89 thru 0x1ba74      (0xefec bytes)
18314763(11 mod 256): MAPWRITE 0xb421 thru 0x11a37      (0x6617 bytes)
18314764(12 mod 256): READ      0x9495 thru 0xaa45      (0x15b1 bytes)
18314765(13 mod 256): TRUNCATE DOWN     from 0x1ba75 to 0x66cb  ******WWWW
18314766(14 mod 256): READ      0x5fa5 thru 0x66ca      (0x726 bytes)

vmstat has broken after about 2 hours, so no leak has been detected.

Test 2. fsx worked about 8 hours, than computer hanged. Log for last 2 hours of vmstat's work has been spoiled. So no vmstat information.

Test 3. fsx made 7904126 and than data error was occurred. It took about 2 hours, no vmstat was running.


UPD 26.01: All these bugs are fixed.

There left 2 bugs, I found:

  1. Passive translator doesn't work. UPD: Seems I confused something, but now it works: $ showtrans foo /usr/bin/env LD_LIBRARY_PATH=/home/mcsim/git/hurd-build/lib tmpfs/tmpfs 10M

  2. Writing by freed address somewhere. As workaround I set kalloc_max in mach-defpager/kalloc.c to 3, so vm_allocate is always used. UPD: Now I use another workaround (see below).

  3. Fsx still breaks. With 11 from 12 seeds, I tested the problem was in first 4 Kb. In 12th case problem was in range between 4 Kb and 8 Kb. I find this quite odd. Usual amount of operations before fsx breaks passes 100 000. UPD: The problem seems to be fixed.



My patch that fixes tmpfs and defpager

My vision of the problem in short: when user tries to access to the memory backed by tmpfs first time, kernel asks defpager to initialize memory object and sets it's request port to some value. But sometimes user directly accesses to defpager and supplies it own request port. So defpager should handle different request port, what causes errors.

TODO: pager_extend always frees old pagemap and allocates new. Consider if it is necessary.


Find out what causes crashes in tmpfs with defpager

TODO: Consider deleting of parameter "port" in function mach-defpager/default_pager.c:pager_port_list_insert since this parameter is unused

Probably pager_request shouldn't be stored because request may arrive from different kernels (or from kernel and translator), so this parameter doesn't have any sense.

22.11.11 Reading/writing for any size works, this works, but fsx test fails (see).

24.11.11 The problem with fsx.

Here are follow operations:

  1. Write some data at address 0x100 with size 0x20
  2. Truncate file to size 0x80. First written data should be lost.
  3. Write some data at address 0x200 size of 0x20. By this operation file size is increased up to 0x220.
  4. Read data at address 0x110. Fsx expects here zeros, but in fact here is data, that was written at step 1.

When fsx tries to read data kernel calls pager with seqno_memory_object_data_request, and pager returns on step 4 zeros either with memory_object_data_provided or memory_object_data_unavailable. Before this, in default_pager_set_size memory_object_lock_request called to flush any kernel caches, that could hold data to be truncated. When I set offset to 0 and size to limit in memory_object_lock_request it appeared another error (see). Both these behaviors appear to be quite strange for me. It is quite late now, so i put these notes to not forget this and went sleep. Continue tomorrow.

5.12.11 Here is a problem with writing by address, which was freed already. It happens in function dealloc_direct in macros invalidate_block. This function is called from pager_truncate in branch when condition "if (!INDIRECT_PAGEMAP(old_size))" is true. But I didn't find why reference to freed object is kept. As workaround we can reduce kalloc_max in hurd/mach-defpager/kalloc.c to 3 to make allocator use vm_allocate always. The drawback is that allocator will allocate only multiple of vm_page_size, but this is temporary tradeoff. Till now fsx reaches operation number 14277.

6.12.11 fsx works quite long and doesn't interrupt. I've stopped at 124784. Continued. It broke at 181091.

4.01.12 I've localized the problem. It is in these lines (starting approx. at hurd/mach-defpager/default_pager.c:1177):

  /* Now reduce the size of the direct map itself.  We don't bother
 with kalloc/kfree if it's not shrinking enough that kalloc.c
 would actually use less.  */
  if (PAGEMAP_SIZE (new_size) <= PAGEMAP_SIZE (old_size) / 2)
  const dp_map_t old_mapptr = pager->map;
  pager->map = (dp_map_t) kalloc (PAGEMAP_SIZE (new_size));
  memcpy (pager->map, old_mapptr, PAGEMAP_SIZE (old_size));
  my_tag |= 20;
  kfree ((char *) old_mapptr, PAGEMAP_SIZE (old_size));

I didn't find out yet what is wrong here exactly, but when I exclude this code memory spoiling disappears. Still breaks at 181091.

Write own pager

6.11.11 Reading/writing for files that fit in vm_page_size works

7.11.11 Works for any size.

TODO: During execution tmpfs hangs in random places. The most possible is variant is deadlocks,
because nothing was undertaken for thread safety.

TODO: Make tmpfs use not more space than it was allowed.

Make links work

Symlinks behavior: links

8.11.11 Symlinks work.

Patch by Ben Asselstine.

After sometime of inactivity tmpfs exits.

TODO: Find out why and correct this.

This may perhaps be the standard translator shutdown-after-a-period-of-inactivity functionality? This is of course not valid in the tmpfs case, as all its state (file system content) is not stored on any non-volatile backend. Can this auto-shutdown functionality be disabled (in libdiskfs, perhaps? See libdiskfs/init-first.c, the call to ports_manage_port_operations_multithread, the global_timeout paramenter (see the reference manual) (here: server_timeout). Probably this should be set to 0 in tmpfs to disable timing out? --tschwinge

Passive translator doesn't work

Must be some bug: diskfs_set_translator is implemented in node.c, so this is supposed to work. --tschwinge


Translators vs FUSE:

What can a translator do that FUSE can't?

Re: Hurd translators on FUSE

Example of sane utilization of filesystem stored in RAM (Russian). Author of this article copied some resources of game "World of Tanks" to RAM-drive and game started load much faster. Although he used Windows in this article, this could be good example of benefits, which filesystem, stored in RAM, could give.


To debug tmpfs, using libraries from "$PWD"/lib and trace rpc:

settrans -ca foo /usr/bin/env LD_LIBRARY_PATH="$PWD"/lib utils/rpctrace -I /usr/share/msgids/ tmpfs/tmpfs 1M
LD_LIBRARY_PATH="$PWD"/lib gdb tmpfs/tmpfs `pidof tmpfs` 

For debugging ext2fs:

settrans --create --active ramdisk0 /hurd/storeio -T copy zero:32M  && \
/sbin/mkfs.ext2 -F -b 4096 ramdisk0 && \
settrans --active --orphan ramdisk0 /usr/bin/env LD_LIBRARY_PATH="$PWD"/lib utils/rpctrace -I /usr/share/msgids/ \
ext2fs/ext2fs.static ramdisk0

How to install fsx. Get fsx sources using this link: Than add following line somewhere in code:

#define msync(...) 0

To compile fsx you may use following line:

gcc fsx-hurd.c -o fsx -Dlinux -g3 -O0


  1. What are sequence numbers? What are they used for?

See sequence numbering. --tschwinge

  1. Is there any way to debug mach-defpager? When I set breakpoint to any function in it, pager never breaks.

Is that still an unresolved problem? If not, what was the problem? --tschwinge

  1. Is it normal that defpager panics and breaks on any wrong data?

No server must ever panic (or reach any other invalid state), whichever input data it receives from untrusted sources (which may include every possible source). All input data must be checked for validity before use; anything else is a bug. --tschwinge


  1. Cthreads manuals




(10:29:11) braunr: mcsim: ln -s foo/bar foo/baz means the link name is baz in the foo directory,
and its target (relative to its directory) is foo/bar (which would mean /tmp/foo/foo/bar in canonical form)
(10:29:42) braunr: youpi: tschwinge: what did ludovic achieve ?
(10:30:06) tschwinge: mcsim: As Richard says, symlink targets are always relative to the directory they're contained in.
(10:31:26) braunr: oh ok
(10:31:27) mcsim: so, if I want to create link in cd, first I need to cd there?
(10:31:36) mcsim: in foo*
(10:31:36) braunr: mcsim: just provide the right paths
(10:32:11) braunr: $ touch foo/bar
(10:32:14) braunr: $ ln -s bar foo/baz
(10:32:32) braunr: bar
(10:32:35) braunr: baz -> bar


<tschwinge>: 1. On every system there is a ``default pager'' (mach-defpager).  That one is responsible 
for all ``anonymous memory''.  For example, when you do malloc(10 MiB), and then there is memory pressure, 
this 10 MiB memory region is backed by the default pager, whose job then is it to provide the backing store for this.
<tschwinge>: This is what commonly would be known as a swap partition.
<tschwinge>: And this is also the way tmpfs works (as I understand it).
<tschwinge>: malloc(10 MiB) can also be mmap(MAP_ANONYMOUS, 10 MIB); that's the same, essentially.
<tschwinge>: Now, for ext2fs or any other disk-based file system, this is different:
<tschwinge>: The ext2fs translator implements its own backing store, namely it accesses the disk for storing 
changed file content, or to read in data from disk if a new file is opened.


(10:36:14) mcsim: who else uses defpager besides tmpfs and kernel?
(10:36:27) braunr: normally, nothing directly
(10:37:04) mcsim: than why tmpfs should use defpager?
(10:37:22) braunr: it's its backend
(10:37:28) braunr: backign store rather
(10:37:38) braunr: the backing store of most file systems are partitions
(10:37:44) braunr: tmpfs has none, it uses the swap space
(10:39:31) mcsim: if we allocate memory for tmpfs using vm_allocate, will it be able to use swap partition?
(10:39:56) braunr: it should
(10:40:27) braunr: vm_allocate just maps anonymous memory
(10:41:27) braunr: anonymous memory uses swap space as its backing store too
(10:43:47) braunr: but be aware that this part of the vm system is known to have deficiencies
(10:44:14) braunr: which is why all mach based implementations have rewritten their default pager
(10:45:11) mcsim: what kind of deficiencies?
(10:45:16) braunr: bugs
(10:45:39) braunr: and design issues, making anonymous memory fragmentation horrible


(15:23:33) antrik: mcsim: vm_allocate doesn't return a memory object; so it can't be passed to clients for mmap()
(15:50:37) mcsim: antrik: I use vm_allocate in pager_read_page
(15:54:43) antrik: mcsim: well, that means that you have to actually implement a pager yourself
(15:56:10) antrik: also, when the kernel asks the pager to write back some pages, it expects the memory to become free.
if you are "paging" to ordinary anonymous memory, this doesn't happen; so I expect it to have a very bad effect
on system performance
(15:56:54) antrik: both can be avoided by just passing a real anonymous memory object, i.e. one provided by the defpager
(15:57:07) antrik: only problem is that the current defpager implementation can't really handle that...


fsx test on 22.11.11

$ ~/src/fsx/fsx -W -R -t 4096 -w 4096 -r 4096 bar
mapped writes DISABLED
truncating to largest ever: 0x32000
truncating to largest ever: 0x39000
READ BAD DATA: offset = 0x16000, size = 0xd9a0
0x1f000 0x0000  0x01a0  0x 2d14
operation# (mod 256) for the bad data may be 1
LOG DUMP (16 total operations):
1(1 mod 256): WRITE     0x1f000 thru 0x21d28    (0x2d29 bytes) HOLE     ***WWWW
2(2 mod 256): WRITE     0x10000 thru 0x15bfe    (0x5bff bytes)
3(3 mod 256): READ      0x1d000 thru 0x21668    (0x4669 bytes)
4(4 mod 256): READ      0xa000 thru 0x16a59     (0xca5a bytes)
5(5 mod 256): READ      0x8000 thru 0x8f2c      (0xf2d bytes)
6(6 mod 256): READ      0xa000 thru 0x17fe8     (0xdfe9 bytes)
7(7 mod 256): READ      0x1b000 thru 0x20f33    (0x5f34 bytes)
8(8 mod 256): READ      0x15000 thru 0x1c05b    (0x705c bytes)
9(9 mod 256): TRUNCATE UP       from 0x21d29 to 0x32000
10(10 mod 256): READ    0x3000 thru 0x5431      (0x2432 bytes)
11(11 mod 256): WRITE   0x29000 thru 0x34745    (0xb746 bytes) EXTEND
12(12 mod 256): TRUNCATE DOWN   from 0x34746 to 0x19000 ******WWWW
13(13 mod 256): READ    0x14000 thru 0x186d8    (0x46d9 bytes)
14(14 mod 256): TRUNCATE UP     from 0x19000 to 0x39000 ******WWWW
15(15 mod 256): WRITE   0x28000 thru 0x3548c    (0xd48d bytes)
16(16 mod 256): READ    0x16000 thru 0x2399f    (0xd9a0 bytes)  ***RRRR***
Correct content saved for comparison
(maybe hexdump "bar" vs "bar.fsxgood")

fsx test on 24.11.11

$ ~/src/fsx/fsx bar 
truncating to largest ever: 0x13e76
READ BAD DATA: offset = 0x1f62e, size = 0x2152
0x1f62e 0x0206  0x0000  0x 213e
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 
LOG DUMP (6 total operations):
1(1 mod 256): TRUNCATE UP       from 0x0 to 0x13e76
2(2 mod 256): WRITE     0x17098 thru 0x26857    (0xf7c0 bytes) HOLE     ***WWWW
3(3 mod 256): READ      0xc73e thru 0x1b801     (0xf0c4 bytes)
4(4 mod 256): MAPWRITE 0x32e00 thru 0x331fc     (0x3fd bytes)
5(5 mod 256): MAPWRITE 0x7ac1 thru 0x11029      (0x9569 bytes)
6(6 mod 256): READ      0x1f62e thru 0x2177f    (0x2152 bytes)  ***RRRR***
Correct content saved for comparison
(maybe hexdump "bar" vs "bar.fsxgood")