[[!meta copyright="Copyright © 2013 Free Software Foundation, Inc."]]
[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
id="license" text="Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
is included in the section entitled [[GNU Free Documentation
License|/fdl]]."]]"""]]
`httpfs` is a virtual filesystem allowing you to access web pages and files.

While the httpfs translator works, it is only suitable for very simple use cases: it just provides the actual file contents downloaded from the URL, but none of the additional status information that is necessary for interactive use (progress indication, error codes, HTTP redirects, etc.).
# Introduction
Here we describe the structure of the /http filesystem for the Hurd. Under the Hurd, this is provided by a translator called `httpfs`.
The httpfs translator accepts an `http://` URL as an argument. The underlying node of the translator can be a file or a directory; this is controlled by the `--mode` command-line option. The default is a directory.
If it is a file, only filesystem read requests are supported on that node. If it is a directory, we can cd into that directory, and ls will list the files on the web server.

A web server may or may not provide a directory listing; in either case, it returns an HTML stream in response to a user request (a GET command). So to find the files residing on the web server, we have to parse the incoming HTML stream for anchor tags. These anchor tags point to different pages or files on the web server; the file names are extracted and filled into the nodes of the translator. An anchor tag can also point to an external URL, in which case we just show that URL as a regular file so that the user can make filesystem read requests on it. When a file is such a URL, we change its name by converting all the /'s to .'s so that it can be displayed in the filesystem.
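
The renaming rule itself is simple; a minimal sketch of it (illustrative only, not the actual httpfs source) would be:

    /* Sketch: rewrite a URL into a displayable node name by
       replacing every '/' with '.', as described above.  */
    #include <stdio.h>

    static void
    url_to_node_name (char *name)
    {
      for (; *name != '\0'; name++)
        if (*name == '/')
          *name = '.';
    }

    int
    main (void)
    {
      char name[] = "http://example.com/a/b.html";
      url_to_node_name (name);
      printf ("%s\n", name);   /* prints "http:..example.com.a.b.html" */
      return 0;
    }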
Only the root node is filled in when the translator is set; subdirectories below it are filled in on demand, i.e. when a cd or ls occurs in that particular subdirectory.
The file size is currently displayed as 0. One way of getting individual file sizes would be to send a request for each file and cull the size from the Content-Length field of the HTTP response. But this may put a very heavy burden on the network, so as of now this method has not been incorporated into the translator.
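
For illustration, here is what that idea would look like as a standalone program (hypothetical code, not part of the translator). Using a HEAD request instead of a full GET would at least avoid transferring the file bodies themselves:

    /* Hypothetical sketch, not httpfs code: ask a server for a file's
       size with an HTTP/1.0 HEAD request and read the Content-Length
       header, so the file body itself never travels over the network.  */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    /* Return the Content-Length reported for PATH on HOST, or -1.  */
    static long
    http_content_length (const char *host, const char *path)
    {
      struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
      if (getaddrinfo (host, "80", &hints, &res) != 0)
        return -1;

      int fd = socket (res->ai_family, res->ai_socktype, res->ai_protocol);
      if (fd < 0 || connect (fd, res->ai_addr, res->ai_addrlen) < 0)
        {
          freeaddrinfo (res);
          if (fd >= 0)
            close (fd);
          return -1;
        }
      freeaddrinfo (res);

      char req[512];
      snprintf (req, sizeof req, "HEAD %s HTTP/1.0\r\nHost: %s\r\n\r\n",
                path, host);
      write (fd, req, strlen (req));

      /* The headers of a HEAD response comfortably fit in one buffer.  */
      char buf[4096];
      ssize_t n = read (fd, buf, sizeof buf - 1);
      close (fd);
      if (n <= 0)
        return -1;
      buf[n] = '\0';

      /* A real client would match header names case-insensitively and
         handle responses split across several reads.  */
      char *p = strstr (buf, "Content-Length:");
      return p ? atol (p + strlen ("Content-Length:")) : -1;
    }

    int
    main (void)
    {
      printf ("%ld\n", http_content_length ("www.gnu.org", "/"));
      return 0;
    }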
The translator uses the libxml2 library to parse the HTML stream. libxml2 provides SAX interfaces for the parser, which are used to find the beginning of anchor tags such as `<A href="i.html">`. The translator therefore has a dependency on the libxml2 library.
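
A minimal sketch of this SAX approach (illustrative only, not the httpfs source itself) registers a start-element callback and picks the href attribute out of every anchor tag:

    /* Sketch: use the libxml2 HTML SAX interface to report the href
       of every anchor tag in an HTML stream.  */
    #include <stdio.h>
    #include <libxml/HTMLparser.h>
    #include <libxml/xmlstring.h>

    /* Called by the parser at the start of every element; ATTRS is a
       NULL-terminated list of attribute name/value pairs.  */
    static void
    start_element (void *ctx, const xmlChar *name, const xmlChar **attrs)
    {
      (void) ctx;
      if (xmlStrcasecmp (name, (const xmlChar *) "a") != 0 || attrs == NULL)
        return;
      for (int i = 0; attrs[i] != NULL; i += 2)
        if (xmlStrcasecmp (attrs[i], (const xmlChar *) "href") == 0
            && attrs[i + 1] != NULL)
          printf ("link: %s\n", attrs[i + 1]);
    }

    int
    main (void)
    {
      /* Leave every other callback unset; the parser skips NULL ones.  */
      htmlSAXHandler sax = { 0 };
      sax.startElement = start_element;

      const char *html = "<html><body>"
        "<a href=\"i.html\">local</a>"
        "<a HREF=\"http://example.com/x.html\">external</a>"
        "</body></html>";

      htmlSAXParseDoc ((xmlChar *) html, "UTF-8", &sax, NULL);
      return 0;
    }

Built against libxml2 (e.g. `gcc anchors.c $(xml2-config --cflags --libs)`), this prints every link in the stream; httpfs hooks such a callback up to its node-filling code instead of printing.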
If the connection to the Internet goes through a proxy, the user must explicitly give the IP address and port of the proxy server using the command-line options `--proxy` and `--port`.
# How to Use httpfs

    # settrans -a tmp/ /hurd/httpfs http://www.gnu.org/software/hurd/index.html

(Remember to give the `/` at the end of the URL, unless you are specifying a specific file like `www.hurd-project.com/httpfs.html`.)

    # cd tmp/
    # ls -l

If access to the Internet goes through a proxy server, add the `--proxy` and `--port` options, substituting your proxy's IP address and port number:

    # settrans -a tmp/ /hurd/httpfs http://www.gnu.org/software/hurd/index.html --proxy=192.168.1.103 --port=3126
# TODO
- https:// support
- scheme-relative URL support (e.g. "//example.com/")
- query-string and fragment support
- HTTP/1.1 support
- HTTP/2 support
- HTTP/3 support
- Teach httpfs to understand HTTP status codes like redirects, 404 Not Found,
  etc.
- Teach httpfs to look for "sitemaps". Many sites offer a sitemap, and this
would be a nifty way for httpfs to allow grep-ing the entire site's contents.
# Source
<http://www.nongnu.org/hurdextras/#httpfs>
# Documentation
## IRC, freenode, #hurd, 2013-09-03
[[!tag open_issue_documentation]]

    <congzhang> hi, why I can't cd to /http:/richtlijn.be/~larstiq/hurd/ to do
    <congzhang> grep?
    <pinotree> this is not ftp
    <congzhang> it works for other file
    <pinotree> ?
    <congzhang> I can't cd to ~larstiq, I don't know why
    <pinotree> http is not a filesystem protocol
    <pinotree> while httpfs could try in representing it as such, it is not
      something reliable
    <congzhang> ok, it's not reliable
    <congzhang> I expect it can expose dir like browser
    <congzhang> so, the translator just know href from home page, and one by
      one
    <pinotree> uh?
    <congzhang> if ...:80/a/b/c.png exits, but not has a href in homepage, so I
      can't cd to a, right?
    <pinotree> you are looking things from the wrong point of view
    <pinotree> a web server can do anything with URLs, including redirecting,
      URL rewriting and whatever else
    <congzhang> so, how to understand httpfs's idea?
    <congzhang> how httpfs list dir?
    <pinotree> check its code
    <congzhang> en, no need it's not reliable
    <congzhang> it's not work, it's enough
    <congzhang> I have an idea, for the file system, we explore dir level by
      level, but for http, we change full path one
    <congzhang> once time
    <congzhang> maybe can allow user to cd any directory, and just mark as some
      special color to make user know the translator was not sure, file exist
      or not
    <congzhang> once the file exits, mark all the parent directory as normal
      color?
    <rekado> congzhang: you can find more info about httpfs here:
      http://nongnu.org/hurdextras/
    <pinotree> congzhang: you're still looking at http from the wrong point of
      view
    <pinotree> there are no directories nor files
    <pinotree> you start a request for a URL, and you get some content back (in
      case there's no error)
    <congzhang> you mean httpfs just for kidding?
    <pinotree> that the content is a web page listing a directory on the
      filesystem of the web server machine, or a file sent back via the
      connection, or a complex web page, it's the same
    <rekado> congzhang: you can only get a list of files if the web server
      responds with an index of files
    <pinotree> "files"
    <rekado> The readme explains how httpfs does its thing:
      http://cvs.savannah.gnu.org/viewvc/*checkout*/httpfs/README?root=hurdextras
    <congzhang> if I can't cd to /http:/host/a/b how to get
      /http:/host/a/b/c.html, even the file exist?
    <pinotree> you don't cd in http
    <pinotree> cd is for changing directory, which does not apply to a protocol
      like http which is not fs-oriented
    <congzhang> yes, I agree with you, http was not fs-oriented
    <congzhang> so httpfs was not so useful
    <rekado> You can access the document directly, though, can't you?
    <congzhang> rekado: I try once more
    <congzhang> I can't
    <congzhang> so, the httpfs need some extend, http protocol was not fs
      oriented, so need some extend to make it work with file system
    <pinotree> http is not designed for file system usage, so extending it is
      useless
    <congzhang> or, httpfs was not so useful
    <pinotree> there are many other protocols for file systems
    <congzhang> I don't think so
    <pinotree> i do
    <congzhang> if we can't make it more useful, remove it from hurd rep, or
      extend it more useful
    <congzhang> add some more rule, to make it work with file system
    <pinotree> no
    <congzhang> some paradox in it
    <pinotree> which paradox?
    <congzhang> for http vs file system
    <pinotree> ???
    <congzhang> tree oriented and star topology oriented?
    <pinotree> you don't make any sense