python mechanicalsoup

Download file with mechanicalsoup


I want to download the Excel file on this ONS webpage using the MechanicalSoup package in Python. I have read the MechanicalSoup documentation. I have searched extensively for an example to follow, on StackOverflow and elsewhere, without luck.

My attempt is:

# Install dependencies
# pip install requests
# pip install BeautifulSoup4
# pip install MechanicalSoup

# Import libraries
import mechanicalsoup
import urllib.request
import requests
from bs4 import BeautifulSoup

# Create a browser object that can collect cookies
browser = mechanicalsoup.StatefulBrowser()

browser.open("https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

browser.download_link("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

In that last line, I have also tried:

browser.download_link(link="https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna",file="c:/test/filename.xls")

Update 25 Jan 2019: And thanks to AKX's comment below, I've tried

browser.download_link(re.escape("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"))

In each case, I get the error:

mechanicalsoup.utils.LinkNotFoundError

Yet the link does exist. Try pasting this into your address bar to confirm:

https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna

What am I doing wrong?

Update 2, 25 Jan 2019: Thanks to AKX's answers below, this is the full MWE that answers my question (posting for anyone who encounters the same difficulty later):

# Install dependencies
# pip install requests
# pip install BeautifulSoup4
# pip install MechanicalSoup

# Import libraries
import mechanicalsoup

# Create a browser object that can collect cookies
browser = mechanicalsoup.StatefulBrowser()

# Open the page that contains the download link
browser.open("https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

# Find the link whose text is ".xls" and save its target to disk
browser.download_link(link_text=".xls", file="c:/py/ONS_Data.xls")

Solution

  • I haven't used MechanicalSoup, but according to the docs for download_link:

    This function behaves similarly to follow_link()

    and the follow_link documentation says:

    • If link is a bs4.element.Tag (i.e. from a previous call to links() or find_link()), then follow the link.
    • If link doesn’t have a href-attribute or is None, treat link as a url_regex and look it up with find_link(). Any additional arguments specified are forwarded to this function.

    Question marks (among other things) are regular expression (regex) metacharacters, so you'll want to escape them if you want to use them for follow_link/download_link:

    import re
    # ...
    browser.download_link(re.escape("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"))
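
To see why the escaping matters, here is a minimal standard-library sketch. It uses the URL from the question only as a test string (nothing is downloaded) and checks whether the URL, used as a regex pattern, matches itself.

```python
import re

# The download URL from the question, used here only as a test string.
url = ("https://www.ons.gov.uk/generator?format=xls"
       "&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

# Unescaped, "r?" means "an optional r", so the pattern never matches
# the literal "?" that follows "generator" in the URL.
unescaped_match = re.search(url, url)

# re.escape() backslash-escapes the metacharacters, so the pattern
# matches the URL literally.
escaped_match = re.search(re.escape(url), url)

print(unescaped_match is None)    # the raw URL fails to match itself
print(escaped_match is not None)  # the escaped URL matches
```

Escaping only fixes the pattern; the link still has to be present on the page in a form the pattern can match.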
    

    However, if the first page you visit doesn't contain that direct link, I'm not sure it'll help anyway. (Do try first though.)
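
The caveat above can be illustrated with a simplified stand-in for the link lookup. This is a sketch, not MechanicalSoup's code: it assumes the lookup is a regex search against each anchor's href, and the HTML snippet, the shortened URLs, and the names HrefCollector and find_hrefs are hypothetical. If the page stores the link as a relative href, an escaped absolute URL still finds nothing, while the escaped path does.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def find_hrefs(html, url_regex):
    """Return hrefs matching url_regex (simplified stand-in for a link lookup)."""
    collector = HrefCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if re.search(url_regex, h)]


# Hypothetical markup: the real ONS page may differ.
page = '<a href="/generator?format=xls&amp;uri=/economy/example">.xls</a>'

full_url = "https://www.ons.gov.uk/generator?format=xls&uri=/economy/example"
path_only = "/generator?format=xls&uri=/economy/example"

print(find_hrefs(page, re.escape(full_url)))   # no match: the href is relative
print(find_hrefs(page, re.escape(path_only)))  # the relative href matches
```

Listing the hrefs actually present on the page is a quick way to check which form a pattern needs to match.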

    Alternatively, you might be able to download the file directly through the browser's underlying requests session, which probably holds the cookie jar (in case the download requires cookies):

    resp = browser.session.get("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")
    resp.raise_for_status()  # raise an exception for 404, etc.
    with open('filename.xls', 'wb') as outf:
        outf.write(resp.content)
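
For larger files, the same idea can be sketched as a small helper that streams the response to disk instead of holding it all in memory. The function name download_to_file and its parameters are hypothetical; it assumes a requests-style session such as browser.session, whose get() accepts stream=True and whose responses provide raise_for_status() and iter_content().

```python
def download_to_file(session, url, path, chunk_size=8192):
    """Stream url to path via a requests-style session; return bytes written."""
    written = 0
    # stream=True defers fetching the body until iter_content() is consumed.
    with session.get(url, stream=True) as resp:
        resp.raise_for_status()  # raise an exception for 4xx/5xx responses
        with open(path, "wb") as outf:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                outf.write(chunk)
                written += len(chunk)
    return written
```

Usage would look like download_to_file(browser.session, "https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna", "filename.xls"), reusing whatever cookies the browser collected when it opened the page.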