I have to write a Perl script that parses URIs from HTML. The real problem is how to resolve relative URIs.
I have a base URI (the base href in the HTML), for example http://a/b/c/d;p?q (following the examples in RFC 3986), and various relative URIs:
/g, //g, ///g, ////g, h//g, g////h, h///g:f
Section 5.4.1 of this RFC only gives an example for //g:
"//g" = "http://g"
What about all the other cases? As far as I understand from RFC 3986, section 3.3, multiple slashes are allowed. So, is the following resolution correct?
"///g" = "http://a/b/c///g"
Or what should it be? Can anyone explain this and back it up with a non-obsolete RFC or documentation?
Update #1: Take a look at this working URL: https:///stackoverflow.com////////a/////10161264/////6618577
What's going on here?
I'll start by confirming that all the URIs you provided are valid, and by providing the outcome of the URI resolutions you mentioned (and the outcome of a couple of my own):
$ perl -MURI -e'
   for my $rel (qw( /g //g ///g ////g h//g g////h h///g:f )) {
      my $uri = URI->new($rel)->abs("http://a/b/c/d;p?q");
      printf "%-20s + %-7s = %-20s host: %-4s path: %s\n",
         "http://a/b/c/d;p?q", $rel, $uri, $uri->host, $uri->path;
   }

   for my $base (qw( http://host/a/b/c/d http://host/a/b/c//d )) {
      my $uri = URI->new("../../e")->abs($base);
      printf "%-20s + %-7s = %-20s host: %-4s path: %s\n",
         $base, "../../e", $uri, $uri->host, $uri->path;
   }
'
http://a/b/c/d;p?q   + /g      = http://a/g           host: a    path: /g
http://a/b/c/d;p?q   + //g     = http://g             host: g    path:
http://a/b/c/d;p?q   + ///g    = http:///g            host:      path: /g
http://a/b/c/d;p?q   + ////g   = http:////g           host:      path: //g
http://a/b/c/d;p?q   + h//g    = http://a/b/c/h//g    host: a    path: /b/c/h//g
http://a/b/c/d;p?q   + g////h  = http://a/b/c/g////h  host: a    path: /b/c/g////h
http://a/b/c/d;p?q   + h///g:f = http://a/b/c/h///g:f host: a    path: /b/c/h///g:f
http://host/a/b/c/d  + ../../e = http://host/a/e      host: host path: /a/e
http://host/a/b/c//d + ../../e = http://host/a/b/e    host: host path: /a/b/e
Next, we'll look at the syntax of relative URIs, since that's what your question circles around.
relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

relative-part = "//" authority path-abempty
              / path-absolute
              / path-noscheme
              / path-empty

path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )

segment       = *pchar         ; 0 or more <pchar>
segment-nz    = 1*pchar        ; 1 or more <pchar>   nz = non-zero
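To make the grammar concrete, here is a minimal Python sketch (Python rather than Perl only to keep it dependency-free) that names which relative-part alternative a reference falls under, judging purely by its leading slashes. The function name and the "authority" label are my own, and it assumes the input is already known to be a relative reference with no scheme; it does not validate the characters.

```python
def classify_relative_part(ref):
    """Name the relative-part alternative a relative reference falls under.

    Assumption: ref is a relative reference (no scheme). Only the leading
    slashes are inspected; nothing is validated.
    """
    # Drop the fragment and query; only the hierarchical part matters here.
    hier = ref.split("#", 1)[0].split("?", 1)[0]
    if hier.startswith("//"):
        return "authority"        # "//" authority path-abempty
    if hier.startswith("/"):
        return "path-absolute"    # one slash, then a non-empty first segment
    if hier:
        return "path-noscheme"    # a relative path
    return "path-empty"


for ref in "/g //g ///g ////g h//g g////h h///g:f".split():
    print(f"{ref:8} {classify_relative_part(ref)}")
```

The three references that begin with two or more slashes all land in the first alternative, which is why none of them is treated as a path relative to the base.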
The key things from these rules for answering your question:
An absolute path (path-absolute) can't start with //. The first segment, if provided, must be non-zero in length. If the relative URI starts with //, what follows must be an authority.
Elsewhere, // can occur in a path because segments can have zero length.
Now, let's look at each of the resolutions you provided in turn.
/g is an absolute path (path-absolute), and thus a valid relative URI (relative-ref), and thus a valid URI (URI-reference).
Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Base.scheme:    "http"       R.scheme:    undef
Base.authority: "a"          R.authority: undef
Base.path:      "/b/c/d;p"   R.path:      "/g"
Base.query:     "q"          R.query:     undef
Base.fragment:  undef        R.fragment:  undef
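If you want to reproduce that table yourself, the Appendix B regular expression can be used more or less verbatim. Below is a small Python sketch; the `parse` name is mine, and Python's `None` plays the role of `undef` above. The distinction between an undefined component and an empty one matters for the cases that follow.

```python
import re

# The regular expression given in RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def parse(ref):
    """Split a URI reference into (scheme, authority, path, query, fragment).

    A component that is absent comes back as None; a component that is
    present but empty comes back as "".
    """
    # Every part of the pattern is optional, so the match always succeeds.
    return URI_RE.match(ref).group(2, 4, 5, 7, 9)


print(parse("http://a/b/c/d;p?q"))  # the base URI
print(parse("/g"))                  # authority is None (undefined)
print(parse("///g"))                # authority is "" (defined but empty)
```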
Following the algorithm in §5.2.2, we get:
T.path:      "/g"     ; remove_dot_segments(R.path)
T.query:     undef    ; R.query
T.authority: "a"      ; Base.authority
T.scheme:    "http"   ; Base.scheme
T.fragment:  undef    ; R.fragment
Following the algorithm in §5.3, we get:
http://a/g
//g is different. //g isn't an absolute path (path-absolute) because an absolute path can't start with an empty segment ("/" [ segment-nz *( "/" segment ) ]).
Instead, it follows this pattern:
"//" authority path-abempty
Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Base.scheme:    "http"       R.scheme:    undef
Base.authority: "a"          R.authority: "g"
Base.path:      "/b/c/d;p"   R.path:      ""
Base.query:     "q"          R.query:     undef
Base.fragment:  undef        R.fragment:  undef
Following the algorithm in §5.2.2, we get the following:
T.authority: "g"      ; R.authority
T.path:      ""       ; remove_dot_segments(R.path)
T.query:     undef    ; R.query
T.scheme:    "http"   ; Base.scheme
T.fragment:  undef    ; R.fragment
Following the algorithm in §5.3, we get the following:
http://g
Note: This contacts server g!
///g is similar to //g, except the authority is blank! This is surprisingly valid.
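The reason an empty authority passes the generic syntax is that the host may be a reg-name, and RFC 3986 §3.2.2 defines reg-name as zero or more characters. A rough Python check of just that rule might look like this; the names are mine, and it deliberately ignores the other host alternatives (IP-literal and IPv4address) as well as userinfo and port.

```python
import re

# reg-name = *( unreserved / pct-encoded / sub-delims )      (RFC 3986 §3.2.2)
REG_NAME = re.compile(r"(?:[A-Za-z0-9._~!$&'()*+,;=-]|%[0-9A-Fa-f]{2})*")


def is_reg_name(host):
    """True if host matches the reg-name rule, which allows zero characters."""
    return REG_NAME.fullmatch(host) is not None


print(is_reg_name("g"))    # an ordinary name
print(is_reg_name(""))     # the empty host from ///g
print(is_reg_name("a/b"))  # "/" is not allowed inside a reg-name
```

Passing the generic grammar only means the reference parses; whether a particular scheme accepts an empty host is a separate question.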
Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Base.scheme:    "http"       R.scheme:    undef
Base.authority: "a"          R.authority: ""
Base.path:      "/b/c/d;p"   R.path:      "/g"
Base.query:     "q"          R.query:     undef
Base.fragment:  undef        R.fragment:  undef
Following the algorithm in §5.2.2, we get the following:
T.authority: ""       ; R.authority
T.path:      "/g"     ; remove_dot_segments(R.path)
T.query:     undef    ; R.query
T.scheme:    "http"   ; Base.scheme
T.fragment:  undef    ; R.fragment
Following the algorithm in §5.3, we get the following:
http:///g
Note: While valid, this URI is useless because the server name (T.authority) is blank!
////g is the same as ///g except the R.path is //g, so we get
http:////g
Note: While valid, this URI is useless because the server name (T.authority) is blank!
The final three (h//g, g////h, h///g:f) are all relative paths (path-noscheme).
Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Base.scheme:    "http"       R.scheme:    undef
Base.authority: "a"          R.authority: undef
Base.path:      "/b/c/d;p"   R.path:      "h//g"
Base.query:     "q"          R.query:     undef
Base.fragment:  undef        R.fragment:  undef
Following the algorithm in §5.2.2, we get the following:
T.path:      "/b/c/h//g"   ; remove_dot_segments(merge(Base.path, R.path))
T.query:     undef         ; R.query
T.authority: "a"           ; Base.authority
T.scheme:    "http"        ; Base.scheme
T.fragment:  undef         ; R.fragment
Following the algorithm in §5.3, we get the following:
http://a/b/c/h//g      # For h//g
http://a/b/c/g////h    # For g////h
http://a/b/c/h///g:f   # For h///g:f
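Putting the three steps together (Appendix B parsing, the §5.2.2 transformation with the §5.2.3 merge and §5.2.4 dot-segment removal, and §5.3 recomposition), here is a sketch of a resolver in Python. It is my own transcription of the RFC's pseudocode, not the code of the URI module, and it leaves out the RFC's optional non-strict mode. All function names are mine.

```python
import re

# RFC 3986, Appendix B: split a reference into its five components.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def parse(ref):
    return URI_RE.match(ref).group(2, 4, 5, 7, 9)


def remove_dot_segments(path):
    """RFC 3986 §5.2.4, working on the string as the RFC describes."""
    out = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]               # "/./x" becomes "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../") or path == "/..":
            path = "/" + path[4:]         # "/../x" becomes "/x", "/.." becomes "/"
            if out:
                out.pop()                 # drop the last output segment
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            out.append(path[:end])
            path = path[end:]
    return "".join(out)


def merge(base_authority, base_path, rel_path):
    """RFC 3986 §5.2.3."""
    if base_authority is not None and base_path == "":
        return "/" + rel_path
    return base_path[: base_path.rfind("/") + 1] + rel_path


def resolve(base, ref):
    """RFC 3986 §5.2.2 (strict mode) followed by §5.3 recomposition."""
    b_scheme, b_auth, b_path, b_query, _ = parse(base)
    scheme, auth, path, query, fragment = parse(ref)
    if scheme is not None:
        path = remove_dot_segments(path)
    else:
        scheme = b_scheme
        if auth is not None:              # defined, even if empty ("///g")
            path = remove_dot_segments(path)
        else:
            auth = b_auth
            if path == "":
                path = b_path
                if query is None:
                    query = b_query
            elif path.startswith("/"):
                path = remove_dot_segments(path)
            else:
                path = remove_dot_segments(merge(b_auth, b_path, path))
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if auth is not None:
        result += "//" + auth             # an empty authority still emits "//"
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result


for ref in "/g //g ///g ////g h//g g////h h///g:f".split():
    print(ref, "=", resolve("http://a/b/c/d;p?q", ref))
```

Run against the seven references from the question, it produces the same results as the walk-through. The `auth is not None` test is the detail that separates ///g (empty authority, kept) from /g (undefined authority, inherited from the base).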
I don't think the examples are suitable for answering what I think you really want to know, though.
Take a look at the following two URIs. They aren't equivalent.
http://host/a/b/c/d # Path has 4 segments: "a", "b", "c", "d"
and
http://host/a/b/c//d # Path has 5 segments: "a", "b", "c", "", "d"
Most servers will treat them the same, which is fine since servers are free to interpret paths in any way they wish, but it makes a difference when applying relative paths. For example, if these were the base URI for ../../e, you'd get
http://host/a/b/c/d + ../../e = http://host/a/e
and
http://host/a/b/c//d + ../../e = http://host/a/b/e
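One way to see why the results differ is to work on the list of segments instead of the string. The Python sketch below is a deliberate simplification of the RFC's merge and remove_dot_segments steps, not a replacement for them: the helper names are mine, it assumes the base path starts with "/", and it only handles relative paths like the one in this example.

```python
def segments(path):
    """Segments of an absolute path: the pieces between the slashes."""
    return path.split("/")[1:]


def apply_relative(base_path, rel_path):
    """Apply a relative path to a base path, one segment at a time."""
    segs = segments(base_path)
    segs.pop()                  # the base's last segment is replaced
    for seg in rel_path.split("/"):
        if seg == "..":
            if segs:
                segs.pop()      # ".." removes one segment, empty or not
        elif seg != ".":
            segs.append(seg)
    return "/" + "/".join(segs)


print(segments("/a/b/c/d"))     # four segments
print(segments("/a/b/c//d"))    # five segments, one of them empty
print(apply_relative("/a/b/c/d", "../../e"))
print(apply_relative("/a/b/c//d", "../../e"))
```

In the second base path, the first ".." is spent on the empty segment, so only "c" is removed by the second one and "b" survives.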