I’m experimenting with boost::spirit to write a URL parser. My objective is to parse the input URL (valid or invalid) and break it down into prefix, host and suffix as below:
Input ipv6 URL: https://[::ffff:192.168.1.1]:8080/path/to/resource
Break this into below parts:
Prefix: https://
Host: ::ffff:192.168.1.1
Suffix: :8080/path/to/resource
Input ipv6 URL: https://::ffff:192.168.1.1/path/to/resource
Break this into below parts:
Prefix: https://
Host: ::ffff:192.168.1.1
Suffix: /path/to/resource
Input ipv4 URL: https://192.168.1.1:8080/path/to/resource
Break this into below parts:
Prefix: https://
Host: 192.168.1.1
Suffix: :8080/path/to/resource
The colon character ':' is used as a delimiter inside an ipv6 address and also as the delimiter for the port after an ipv4 address. Due to this ambiguity, I'm having a hard time defining a boost::spirit grammar that works for both ipv4 and ipv6 URLs. Please refer to the code below:
struct UrlParts
{
std::string scheme;
std::string host;
std::string port;
std::string path;
};
BOOST_FUSION_ADAPT_STRUCT(
UrlParts,
(std::string, scheme)
(std::string, host)
(std::string, port)
(std::string, path)
)
void parseUrl_BoostSpirit(const std::string &input, std::string &prefix, std::string &suffix, std::string &host)
{
namespace qi = boost::spirit::qi;
// Define the grammar
qi::rule<std::string::const_iterator, UrlParts()> url = -(+qi::char_("a-zA-Z0-9+-.") >> "://") >> -qi::lit('[') >> +qi::char_("a-fA-F0-9:.") >> -qi::lit(']') >> -(qi::lit(':') >> +qi::digit) >> *qi::char_;
// Parse the input
UrlParts parts;
auto iter = input.begin();
if (qi::parse(iter, input.end(), url, parts))
{
prefix = parts.scheme.empty() ? "" : parts.scheme + "://";
host = parts.host;
suffix = (parts.port.empty() ? "" : ":" + parts.port) + parts.path;
}
else
{
host = input;
}
}
The above code produces incorrect output for the ipv4 URL, as shown below:
Input URL ipv4: https://192.168.1.1:8080/path/to/resource
Broken parts:
Prefix: https://
Host: 192.168.1.1:8080
Suffix: /path/to/resource
i.e. Host contains :8080 instead of it being part of Suffix.
If I change the URL grammar, I can fix the ipv4 but then ipv6 breaks.
Of course this can be done using trivial if-else parsing logic, but I'm trying to do it more elegantly using boost::spirit. Any suggestions on how to update the grammar to support both ipv4 and ipv6 URLs?
PS: I'm aware that URLs with ipv6 address w/o [ ] are invalid as per RFC, but the application I'm working on requires processing these invalid URLs as well.
Thanks in advance!
First off, your expression char_("+-.") accidentally allows for ',' inside the scheme: https://coliru.stacked-crooked.com/a/14c00775d9f3d99e
To inoculate against that, always put - first or last in character sets so it can't be misinterpreted as a range: char_("+.-"). Yeah, that's subtle.
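To see where the ',' comes from, here is a minimal sketch (plain C++, no Spirit) that enumerates what a range from '+' to '.' covers. It assumes an ASCII-compatible execution character set, and expand_range is a hypothetical helper name used only for illustration:

```cpp
#include <string>

// Hypothetical helper: lists every character a set range "lo-hi" would
// cover, i.e. all code points from lo to hi inclusive.
// Assumes an ASCII-compatible character set.
std::string expand_range(char lo, char hi) {
    std::string out;
    for (int c = static_cast<unsigned char>(lo);
         c <= static_cast<unsigned char>(hi); ++c) {
        out.push_back(static_cast<char>(c));
    }
    return out;
}
```

expand_range('+', '.') yields the four characters + , - . which is why the comma sneaks in when "+-." is read as a range.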
-'[' >> p >> -']' allows for unmatched brackets. Instead say ('[' >> p >> ']' | p).
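The intent of that alternative, spelled out in plain C++ as a rough sketch: either both brackets are present or neither is. The name unbracket is hypothetical, and unlike the Spirit expression this simplified version does not check what p itself contains:

```cpp
#include <optional>
#include <string_view>

// Hypothetical helper mirroring ('[' >> p >> ']' | p):
// accepts "[host]" with both brackets, or a bare host with none.
// Returns the host without brackets, or nullopt for unmatched brackets.
std::optional<std::string_view> unbracket(std::string_view s) {
    bool const open  = !s.empty() && s.front() == '[';
    bool const close = !s.empty() && s.back() == ']';
    if (open != close)
        return std::nullopt; // e.g. "[::1" or "::1]"
    if (open) {              // both present, so size() >= 2
        s.remove_prefix(1);
        s.remove_suffix(1);
    }
    return s;
}
```

With the original -'[' >> p >> -']' form, inputs like "[::1" would be accepted silently; here they are rejected.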
With those applied, let's rewrite the parser expression so we see what's happening:
// Define the grammar
auto scheme_ = qi::copy(+qi::char_("a-zA-Z0-9+.-") >> "://");
auto host_ = qi::copy(+qi::char_("a-fA-F0-9:."));
auto port_ = qi::copy(':' >> +qi::digit);
qi::rule<std::string::const_iterator, UrlParts()> const url =
-scheme_ >> ('[' >> host_ >> ']' | host_) >> -port_ >> *qi::char_;
So I went on to create a test-bed to demonstrate your question examples:
Note I simplified the handling by adding raw[] to include :// and just returning and printing UrlParts, because it is more insightful to see what the parser does.
// #define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/pfr/io.hpp>
#include <iomanip>
#include <iostream>
#include <string>
#include <string_view>
struct UrlParts { std::string scheme, host, port, path; };
BOOST_FUSION_ADAPT_STRUCT(UrlParts, scheme, host, port, path)
UrlParts parseUrl_BoostSpirit(std::string_view input) {
namespace qi = boost::spirit::qi;
using It = std::string_view::const_iterator;
qi::rule<It, UrlParts()> url;
//using R = qi::rule<It, std::string()>;
//R scheme_, host_, port_;
auto scheme_ = qi::copy(qi::raw[+qi::char_("a-zA-Z0-9+.-") >> "://"]);
auto host_ = qi::copy(+qi::char_("a-fA-F0-9:."));
auto port_ = qi::copy(':' >> +qi::digit);
url = -scheme_ >> ('[' >> host_ >> ']' | host_) >> -port_ >> *qi::char_;
// BOOST_SPIRIT_DEBUG_NODES((scheme_)(host_)(port_)(url));
BOOST_SPIRIT_DEBUG_NODES((url));
// Parse the input
UrlParts parts;
parse(input.begin(), input.end(), qi::eps > url > qi::eoi, parts);
return parts;
}
int main() {
using It = std::string_view::const_iterator;
using Exception = boost::spirit::qi::expectation_failure<It>;
for (std::string_view input : {
"https://[::ffff:192.168.1.1]:8080/path/to/resource",
"https://::ffff:192.168.1.1/path/to/resource",
"https://192.168.1.1:8080/path/to/resource",
}) {
try {
auto parsed = parseUrl_BoostSpirit(input);
// using boost::fusion::operator<<; // less clear output, without PFR
// std::cout << std::quoted(input) << " -> " << parsed << std::endl;
std::cout << std::quoted(input) << " -> " << boost::pfr::io(parsed) << std::endl;
} catch (Exception const& e) {
std::cout << std::quoted(input) << " EXPECTED " << e.what_ << " at "
<< std::quoted(std::string_view(e.first, e.last)) << std::endl;
}
}
}
Prints:
"https://[::ffff:192.168.1.1]:8080/path/to/resource" -> {"https://", "::ffff:192.168.1.1", "8080", "/path/to/resource"}
"https://::ffff:192.168.1.1/path/to/resource" -> {"https://", "::ffff:192.168.1.1", "", "/path/to/resource"}
"https://192.168.1.1:8080/path/to/resource" -> {"https://", "192.168.1.1:8080", "", "/path/to/resource"}
You already assessed the problem: :8080 matches the production for host_. I'd reason that the port specification is the odd one out because it must be the last thing before '/' or the end of input. In other words:
auto port_ = qi::copy(':' >> +qi::digit >> &('/' || qi::eoi));
Now you can do a negative look-ahead assertion in your host_ production to avoid eating port specifications:
auto host_ = qi::copy(+(qi::char_("a-fA-F0-9:.") - port_));
Now the output becomes
"https://[::ffff:192.168.1.1]:8080/path/to/resource" -> {"https://", "::ffff:192.168.1.1", "8080", "/path/to/resource"}
"https://::ffff:192.168.1.1/path/to/resource" -> {"https://", "::ffff:192.168.1.1", "", "/path/to/resource"}
"https://192.168.1.1:8080/path/to/resource" -> {"https://", "192.168.1.1", "8080", "/path/to/resource"}
Note that there are some inefficiencies and probably RFC violations in this implementation. Consider a static instance of the grammar. Also consider using X3.
I have a related answer here: "What is the nicest way to parse this in C++?". It shows an X3 approach with validation using Asio's networking primitives.
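To make the look-ahead in host_ and port_ explicit, here is a hand-written sketch of the same logic in plain C++. It is not the Spirit parser, just an illustration of the rule "stop consuming host characters as soon as the rest looks like a port". The names Parts, looks_like_port and split_url are hypothetical, and the scheme handling is simplified (it splits at the first "://" without validating the scheme characters):

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

struct Parts { std::string scheme, host, port, path; };

// Same condition as port_: ':' then one or more digits,
// followed by '/' or the end of input.
bool looks_like_port(std::string_view s) {
    if (s.empty() || s.front() != ':') return false;
    std::size_t i = 1;
    while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) ++i;
    return i > 1 && (i == s.size() || s[i] == '/');
}

// Same set as char_("a-fA-F0-9:.")
bool is_host_char(char c) {
    return std::isxdigit(static_cast<unsigned char>(c)) || c == ':' || c == '.';
}

std::optional<Parts> split_url(std::string_view in) {
    Parts p;
    // Optional scheme, kept together with "://" (simplified, unvalidated).
    if (auto pos = in.find("://"); pos != std::string_view::npos) {
        p.scheme = std::string(in.substr(0, pos + 3));
        in.remove_prefix(pos + 3);
    }
    bool const bracketed = !in.empty() && in.front() == '[';
    if (bracketed) in.remove_prefix(1);

    // host_: consume host characters, but never start eating a port.
    std::size_t i = 0;
    while (i < in.size() && is_host_char(in[i]) && !looks_like_port(in.substr(i)))
        ++i;
    if (i == 0) return std::nullopt; // host needs at least one character
    p.host = std::string(in.substr(0, i));
    in.remove_prefix(i);

    if (bracketed) { // an opening bracket requires the closing one
        if (in.empty() || in.front() != ']') return std::nullopt;
        in.remove_prefix(1);
    }
    if (looks_like_port(in)) { // optional port
        std::size_t j = 1;
        while (j < in.size() && std::isdigit(static_cast<unsigned char>(in[j]))) ++j;
        p.port = std::string(in.substr(1, j - 1));
        in.remove_prefix(j);
    }
    p.path = std::string(in); // everything else
    return p;
}
```

One caveat that applies to this rule in general: an unbracketed ipv6 address whose last group is all decimal digits stays ambiguous. For example, with this sketch "https://::1/x" comes out as host ":" and port "1", because ":1" followed by '/' satisfies the port condition.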
Why roll your own?
UrlParts parseUrl(std::string_view input) {
auto parsed = boost::urls::parse_uri(input).value();
return {parsed.scheme(), parsed.host(), parsed.port(), std::string(parsed.encoded_resource())};
}
To be really pedantic and get the :// as well:
UrlParts parseUrl(std::string_view input) {
auto parsed = boost::urls::parse_uri(input).value();
assert(parsed.has_authority());
return {
parsed.buffer().substr(0, parsed.authority().data() - input.data()),
parsed.host(),
parsed.port(),
std::string(parsed.encoded_resource()),
};
}
This parses what you have and much more (fragment from the Reference Help Card):
The notable difference in the output below is the [] that Boost.URL keeps around the ipv6 host:
==== "https://[::ffff:192.168.1.1]:8080/path/to/resource" ====
Spirit -> {"https://", "::ffff:192.168.1.1", "8080", "/path/to/resource"}
URL -> {"https://", "[::ffff:192.168.1.1]", "8080", "/path/to/resource"}
==== "https://::ffff:192.168.1.1/path/to/resource" ====
Spirit -> {"https://", "::ffff:192.168.1.1", "", "/path/to/resource"}
URL -> leftover [boost.url.grammar:4]
==== "https://192.168.1.1:8080/path/to/resource" ====
Spirit -> {"https://", "192.168.1.1", "8080", "/path/to/resource"}
URL -> {"https://", "192.168.1.1", "8080", "/path/to/resource"}
==== "https://192.168.1.1:8080/s?quey=param&other=more%3Dcomplicated#bookmark" ====
Spirit -> {"https://", "192.168.1.1", "8080", "/s?quey=param&other=more%3Dcomplicated#bookmark"}
URL -> {"https://", "192.168.1.1", "8080", "/s?quey=param&other=more%3Dcomplicated#bookmark"}