html, web-scraping, beautifulsoup, google-colaboratory, flysystem-google-drive

Downloading files to Google Drive using BeautifulSoup


I need to download files with BeautifulSoup to my Google Drive using Colaboratory.

I'm using the code below:

import urllib.request
from bs4 import BeautifulSoup

u = urllib.request.urlopen("https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32290_turnstile/turnstile.html")
html = u.read()

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a')

I need only the links whose name contains '1706', so I'm trying:

for link in links:
  files = link.get('href')
  if '1706' in files: 
    urllib.request.urlretrieve(filelink, filename)

and it didn't work: "TypeError: argument of type 'NoneType' is not iterable". OK, I know why I get this error, but I don't know how to fix it; what is missing?

Using this:

urllib.request.urlretrieve("https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32142_turnstile-170624/turnstile-170624.txt", 'turnstile-170624.txt')

I can get individual files. But I want some way to download all the files that contain '1706' and save them to my Google Drive.

How can I do this?


Solution

  • You can use an [attribute*=value] CSS selector, where the * (contains) operator specifies that the href attribute value contains 1706:

    links = [item['href'] for item in soup.select("[href*='1706']")]
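
To run the whole thing in Colab and save every matching file to Google Drive, you could mount your Drive and retrieve each link into a Drive folder. Here is a minimal sketch, assuming the usual Colab mount point /content/drive/MyDrive and an example destination folder named turnstile (urljoin is used defensively in case any href is relative rather than a full URL):

    import os
    import urllib.request
    from urllib.parse import urljoin

    from bs4 import BeautifulSoup
    from google.colab import drive

    # Mount Google Drive; anything written under /content/drive persists there.
    drive.mount('/content/drive')

    page_url = "https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32290_turnstile/turnstile.html"
    soup = BeautifulSoup(urllib.request.urlopen(page_url).read(), "html.parser")

    # Example destination folder inside Drive -- adjust to taste.
    dest = '/content/drive/MyDrive/turnstile'
    os.makedirs(dest, exist_ok=True)

    for item in soup.select("[href*='1706']"):
        file_url = urljoin(page_url, item['href'])  # handles relative hrefs too
        filename = file_url.rsplit('/', 1)[-1]      # e.g. turnstile-170624.txt
        urllib.request.urlretrieve(file_url, os.path.join(dest, filename))

Note that this also sidesteps the TypeError from the question: soup.select("[href*='1706']") only matches tags that actually have an href attribute, whereas link.get('href') returns None for anchors without one, and using 'in' on None raises exactly that error.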