
Erlang gen_tcp request to erlang.org returns a 404


Context: Joe Armstrong's "Programming Erlang", 2nd edition, chapter 16 on files, page 256, the example on extracting URLs from a binary.

The steps suggested (after writing code for the scavenge_urls module) are these:

B = socket_examples:nano_get_url("www.erlang.org"),
L = scavenge_urls:bin2urls(B),
scavenge_urls:urls2htmlFile(L,"gathered.html").

That fails subtly: the list L ends up empty. Running the first step on its own shows something strange. It does return a binary, but not the binary I was looking for:

9> B.
<<"HTTP/1.1 404 Not Found\r\nServer: nginx\r\nDate: Sun, 19 Nov 2017 01:57:07 GMT\r\nContent-Type: text/html; charset=UTF-8\r\n"...>>
The 404 in the status line shows that this is where the problem lies.

Yet in the browser all's good with the mothership! I was able to complete the exercise by replacing the call to socket_examples:nano_get_url/1: I first fetched the same URL with curl, dumped the output into a file, and then read it back with file:read_file/1. The next steps all ran fine.

Peeking inside the socket_examples module, I see this:

nano_get_url(Host) ->
    {ok,Socket} = gen_tcp:connect(Host,80,[binary, {packet, 0}]), %% (1)
    ok = gen_tcp:send(Socket, "GET / HTTP/1.0\r\n\r\n"),  %% (2)
    receive_data(Socket, []).

receive_data(Socket, SoFar) ->
    receive
        {tcp,Socket,Bin} ->    %% (3)
            receive_data(Socket, [Bin|SoFar]);
        {tcp_closed,Socket} -> %% (4)
            list_to_binary(reverse(SoFar)) %% (5)
    end.

Nothing looks suspicious. First it establishes the connection, next it fires a GET, and then it receives the response. I've never before had to explicitly connect first and fire a GET second; my HTTP client libraries hid that from me. So maybe I don't know what to look for... and I sure trust Joe's code doesn't have any glaring mistakes! =) Yet the lines with comments (3), (4) and (5) aren't something I fully understand.

So, any ideas, fellow Erlangers? Thanks a bunch!


Solution

  • The problem is not Erlang. It looks like the server running erlang.org requires a Host header as well:

    $ nc www.erlang.org 80
    GET / HTTP/1.0
    
    HTTP/1.1 404 Not Found
    Server: nginx
    Date: Sun, 19 Nov 2017 05:51:39 GMT
    Content-Type: text/html; charset=UTF-8
    Content-Length: 162
    Connection: close
    Vary: Accept-Encoding
    
    <html>
    <head><title>404 Not Found</title></head>
    <body bgcolor="white">
    <center><h1>404 Not Found</h1></center>
    <hr><center>nginx</center>
    </body>
    </html>
    $ nc www.erlang.org 80
    GET / HTTP/1.0
    Host: www.erlang.org
    
    HTTP/1.1 200 OK
    Server: nginx
    Date: Sun, 19 Nov 2017 05:51:50 GMT
    Content-Type: text/html; charset=UTF-8
    Content-Length: 12728
    Connection: close
    Vary: Accept-Encoding
    
    <!DOCTYPE html>
    <html>
    ...
    

    Your Erlang code also works once the Host header is added after GET / HTTP/1.0\r\n:

    1> Host = "www.erlang.org".
    "www.erlang.org"
    2> {ok, Socket} = gen_tcp:connect(Host, 80, [binary, {packet, 0}]).
    {ok,#Port<0.469>}
    3> ok = gen_tcp:send(Socket, "GET / HTTP/1.0\r\nHost: www.erlang.org\r\n\r\n").
    ok
    4> flush().
    Shell got {tcp,#Port<0.469>,
                   <<"HTTP/1.1 200 OK\r\nServer: nginx\r\n"...>>
    Shell got {tcp_closed,#Port<0.469>}
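
Putting the fix back into the shape of the book's function might look like the sketch below. The module name nano_get and the helpers build_request/1 and status_code/1 are made up for illustration and are not from the book; the receive loop is the one from the question, with comments on what lines (3), (4) and (5) do. It assumes the socket is in active mode (the gen_tcp default) and that the server closes the connection once the response has been sent, as the Connection: close header above suggests.

```erlang
-module(nano_get).
-export([build_request/1, nano_get_url/1, status_code/1]).

%% Hypothetical helper: build a minimal HTTP/1.0 request for "/"
%% that also names the host. The result is an iolist, which
%% gen_tcp:send/2 accepts as-is.
build_request(Host) ->
    ["GET / HTTP/1.0\r\n",
     "Host: ", Host, "\r\n",
     "\r\n"].

nano_get_url(Host) ->
    %% {active, true} is the default; it is spelled out here because
    %% the receive loop below depends on it.
    {ok, Socket} = gen_tcp:connect(Host, 80,
                                   [binary, {packet, 0}, {active, true}]),
    ok = gen_tcp:send(Socket, build_request(Host)),
    receive_data(Socket, []).

receive_data(Socket, SoFar) ->
    receive
        %% (3) In active mode each chunk read from the socket arrives
        %%     as a {tcp, Socket, Data} message in the process that
        %%     owns the socket. Chunks are prepended, so SoFar holds
        %%     them newest first.
        {tcp, Socket, Bin} ->
            receive_data(Socket, [Bin | SoFar]);
        %% (4) The server closed the connection, so the response is
        %%     complete.
        {tcp_closed, Socket} ->
            %% (5) Put the chunks back in arrival order and join them
            %%     into a single binary.
            list_to_binary(lists:reverse(SoFar))
    end.

%% Hypothetical helper: pull the numeric status code out of a raw
%% response, so a 404 can be noticed before the body is parsed.
status_code(Response) ->
    [StatusLine | _] = binary:split(Response, <<"\r\n">>),
    [_Version, Code | _] = binary:split(StatusLine, <<" ">>, [global]),
    binary_to_integer(Code).
```

With this in place, calling nano_get:status_code(B) on the binary from the question would have returned 404 right away, rather than leaving an empty list of URLs as the only hint.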