perl lwp lwp-useragent

Perl LWP gets blocked by server: 404 Not Found


I am using LWP::UserAgent to download a CSV file on a Unix server:

my $ua1 = LWP::UserAgent->new();
my $res = $ua1->get($equity_history_url, @netscape_like_headers);

The server keeps returning a 404 (Not Found) error, even though I can download the file from the browser:

http://www.nseindia.com/content/equities/scripvol/datafiles/18-08-2013-TO-17-08-2015ADANIPOWERALLN.csv

The same code works with other pages.

I think the issue is one of the below

I tried passing headers similar to my browser's, which I captured using Wireshark:

my @netscape_like_headers = (
    'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36',
    'Accept-Language' => 'en-US,en;q=0.8',
    'Accept-Charset' => 'iso-8859-1,*,utf-8',
    'Accept-Encoding' => 'gzip, deflate, sdch',
    'Upgrade-Insecure-Requests' => '1',
    'Accept' =>     'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Connection' => 'keep-alive'
);

Still no luck. Any suggestions?
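
To compare the supplied headers against a Wireshark capture, the request can be built and printed without sending it. This is a minimal sketch: the URL and the shortened header list are placeholders, not the real values. It shows only the headers passed in; LWP adds a few of its own (such as Host) when it actually sends the request.

```perl
use strict;
use warnings;
use HTTP::Request;

# Placeholder URL (an assumption for illustration); substitute the real one.
my $url = 'http://www.example.com/data.csv';

# Shortened, illustrative header list in the same key/value form
# that $ua->get($url, @headers) accepts.
my @headers = (
    'User-Agent'      => 'Mozilla/5.0 (Windows NT 10.0; WOW64)',
    'Accept-Language' => 'en-US,en;q=0.8',
    'Accept'          => 'text/html,*/*;q=0.8',
);

# Build the request object; nothing is sent over the network.
my $req = HTTP::Request->new( GET => $url, \@headers );

# as_string renders the request line followed by the headers.
print $req->as_string;

# Individual headers can be read back by name (case-insensitive).
print "UA: ", $req->header('user-agent'), "\n";
```

If the printed User-Agent is not the one intended, the headers are not reaching the request in the form expected.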


Solution

  • When I use LWP::UserAgent on that URL without your headers I get a 403 Forbidden error back. Same with curl.

    use strict;
    use warnings;
    use LWP::UserAgent;
    
    my $ua = LWP::UserAgent->new;
    
    my $res = $ua->get(
        'http://www.nseindia.com/content/equities/scripvol/datafiles/18-08-2013-TO-17-08-2015ADANIPOWERALLN.csv');
    print $res->as_string;
    
    __END__
    HTTP/1.1 403 Forbidden
    Connection: close
    Date: Tue, 18 Aug 2015 12:19:13 GMT
    Server: AkamaiGHost
    Content-Length: 388
    Content-Type: text/html
    Expires: Tue, 18 Aug 2015 12:19:13 GMT
    Client-Date: Tue, 18 Aug 2015 12:19:13 GMT
    Client-Peer: 104.85.166.76:80
    Client-Response-Num: 1
    Mime-Version: 1.0
    Title: Access Denied
    
    <HTML><HEAD>
    <TITLE>Access Denied</TITLE>
    </HEAD><BODY>
    <H1>Access Denied</H1>
    
    You don't have permission to access "http&#58;&#47;&#47;www&#46;nseindia&#46;com&#47;content&#47;equities&#47;scripvol&#47;datafiles&#47;18&#45;08&#45;2013&#45;TO&#45;17&#45;08&#45;2015ADANIPOWERALLN&#46;csv" on this server.<P>
    Reference&#32;&#35;18&#46;48a65568&#46;1439900353&#46;1006096e
    </BODY>
    </HTML>
    

    When I added your headers, it worked the first time I tried. When I re-ran it, it also gave 404 Not Found. Now when I click the link in the browser, it gives a 404 as well.
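
    Since the server answers with 403 in one case and 404 in another, it can help to report the status in code instead of assuming the download worked. A minimal sketch, assuming a hypothetical helper named describe_response and hand-built HTTP::Response objects so that it runs without contacting the server:

    ```perl
    use strict;
    use warnings;
    use HTTP::Response;

    # Hypothetical helper (not part of LWP): summarise a response in one line.
    sub describe_response {
        my ($res) = @_;
        return 'OK: ' . $res->status_line if $res->is_success;
        return 'Blocked or missing: ' . $res->status_line
            if $res->code == 403 || $res->code == 404;
        return 'Other failure: ' . $res->status_line;
    }

    # Responses are constructed by hand here; in a real script $res would
    # be the return value of $ua->get(...).
    for my $res (
        HTTP::Response->new( 200, 'OK' ),
        HTTP::Response->new( 403, 'Forbidden' ),
        HTTP::Response->new( 404, 'Not Found' ),
    ) {
        print describe_response($res), "\n";
    }
    ```

    In a real run, the same helper would be called on the object returned by $ua->get, before writing the body to disk.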

    I believe they are preventing you from downloading the file multiple times. If you are on a dial-up connection or broadband with a non-static IP address, try to reconnect to get a fresh one, or use a proxy.


    They may also have terms of service that forbid using automation tools to access their resources, because those resources are not meant to be APIs.

    Update

    In fact, they do not allow what you are trying to do. Point 12 of their terms of service clearly states:

    You may not conduct any systematic or automated data collection activities (including scraping, data mining, data extraction and data harvesting) on or in relation to our website without our express written consent.