httpfs is a virtual filesystem allowing you to access web pages and files.

While the httpfs translator works, it is only suitable for very simple use cases: it just provides the actual file contents downloaded from the URL, but no additional status information that are necessary for interactive use. (Progress indication, error codes, HTTP redirects etc.)

Introduction

Here we describe the structure of the /http filesystem for the Hurd. Under the Hurd, we provide a translator called 'httpfs' which is intended to provide the filesystem structure.

The httpfs translator accepts an "http:// URL" as an argument. The underlying node of the translator can be a file or directory. This is guided by the --mode command lineoption. Default is a directory.

If its a file, only file system read requests are supported on that node. If its a directory, we can cd into that directory and ls would list the files in the web server. A web server may provide a directory listing or it may not provide, whatever it be the case the web server always returns an HTML stream for an user request (GET command). So to get the files residing in the web server, we have to parse the incoming HTML stream to find out the anchor tags. These anchor tags point to different pages or files in the web server. These file name are extracted and filled into the node of the translator. An anchor tag can also be a pointer to an external URL, in such a case we just show that URL as a regular file so that the user can make file system read requests on that URL. In case the file is a URL, we change the name of URL by converting all the /'s with .'s so that it can be displayed in the file system.

Only the root node is filled when the translator is set, subdirectories inside that are filled as on demand, i.e. when a cd or ls occurs on that particular sub directory.

The File size is now displayed as 0. One way of getting individual file sizes is sending a GET request for each file and cull the file size from Content-Length field of an HTTP response. But this may put a very heavy burden on the network, So as of now we have not incorporated this method with this http translator.

The translator uses the libxml2 library for doing the parsing of HTML stream. The libxml2 provides SAX interfaces for the parser which are used for finding the begining of anchor tags <A href="i.html">. So the translator has dependency on the libxml2 library.

If the connection to the Internet through a proxy, then the user must explicitly give the IP address and port of the proxy server by using the command line options --proxy and --port.

How to Use httpfs

# settrans -a tmp/ /hurd/httpfs http://www.gnu.org/software/hurd/index.html

# cd tmp/

# ls -l

# settrans -a tmp/ /hurd/httpfs http://www.gnu.org/software/hurd/index.html --proxy=192.168.1.103
            --port=3126

The above command should be used in case if the access to the Internet is through a proxy server, substitute your proxies IP and port no.s

TODO

  • https:// support
  • scheme-relative URL support (eg. "//example.com/")
  • query-string and fragment support
  • HTTP/1.1 support
  • HTTP/2 support
  • HTTP/3 support (there may exist a C library that provides HTTP/[123] support).
  • Teach httpfs to understand HTTP status codes like re-directs, 404 not found, etc.
  • Teach httpfs to look for "sitemaps". Many sites offer a sitemap, and this would be a nifty way for httpfs to allow grep-ing the entire site's contents. sitemaps.org is a great resource for this.
  • Teach httpfs to check if the computer has an internet connection at startup and during operation. The translator causes 30 second pauses on commands like "ls", when the internet is down.

Source

http://www.nongnu.org/hurdextras/#httpfs

Documentation

IRC, freenode, #hurd, 2013-09-03

<congzhang> hi, why I can't cd to /http:/richtlijn.be/~larstiq/hurd/ to do
<congzhang>         grep?
<pinotree> this is not ftp
<congzhang> it works for other file
<pinotree> ?
<congzhang> I can't cd to ~larstiq, I don't know why
<pinotree> http is not a filesystem protocol
<pinotree> while httpfs could try in representing it as such, it is not
  something reliable
<congzhang> ok, it's not reliable
<congzhang> I expect it can expose dir like browser
<congzhang> so, the translator just know href from home page, and one by
  one
<pinotree> uh?
<congzhang> if ...:80/a/b/c.png exits, but not has a href in homepage, so I
  can't cd to a, right?
<pinotree> you are looking things from the wrong point of view
<pinotree> a web server can do anything with URLs, including redirecting,
  URL rewriting and whatever else
<congzhang> so, how to understand httpfs's idea?
<congzhang> how httpfs list dir?
<pinotree> check its code
<congzhang> en, no need it's not reliable
<congzhang> it's not work, it's enough
<congzhang> I have an idea, for the file system, we explore dir level by
  level, but for http, we change full path one 
<congzhang> once time
<congzhang> maybe can allow user to cd any directory, and just mark as some
  special color to make user know the translator was not sure, file exist
  or not
<congzhang> once the file exits, mark all the parent directory as normal
  color?
<rekado> congzhang: you can find more info about httpfs here:
  http://nongnu.org/hurdextras/
<pinotree> congzhang: you're still looking at http from the wrong point of
  view
<pinotree> there are no directories nor files
<pinotree> you start a request for a URL, and you get some content back (in
  case there's no error)
<congzhang> you mean httpfs just for kidding?
<pinotree> that the content is a web page listing a directory on the
  filesystem of the web server machine, or a file sent back via the
  connection, or a complex web page, it's the same
<rekado> congzhang: you can only get a list of files if the web server
  responds with an index of files
<pinotree> "files"
<rekado> The readme explains how httpfs does its thing:
  http://cvs.savannah.gnu.org/viewvc/*checkout*/httpfs/README?root=hurdextras
<congzhang> if I can't cd to /http:/host/a/b how to get
  /http:/host/a/b/c.html, even the file exist?
<pinotree> you don't cd in http
<pinotree> cd is for changing directory, which does not apply to a protocol
  like http which is not fs-oriented
<congzhang> yes, I agree with you, http was not fs-oriented
<congzhang> so httpfs was not so useful
<rekado> You can access the document directly, though, can't you?
<congzhang> rekado: I try once more
<congzhang> I can't
<congzhang> so, the httpfs need some extend, http protocol was not fs
  oriented, so need some extend to make it work with file system
<pinotree> http is not designed for file system usage, so extending it is
  useless
<congzhang> or, httpfs was not so useful
<pinotree> there are many other protocols for file systems
<congzhang> I don't think so
<pinotree> i do
<congzhang> if we can't make it more useful, remove it from hurd rep, or
  extend it more useful
<congzhang> add some more rule, to make it work with file system
<pinotree> no
<congzhang> some paradox in it
<pinotree> which paradox?
<congzhang> for http vs file system
<pinotree> ???
<congzhang> tree oriented and  star topology oriented?
<pinotree> you don't make any sense