python pandas python-newspaper

How to iterate over CSV rows to extract text from URLs using pandas


I have a csv of a bunch of news articles, and I'm hoping to use the newspaper3k package to extract the body text from those articles and save them as txt files. I want to create a script that iterates over every row in the csv, extracts the URL, extracts the text from the URL, and then saves that as a uniquely named txt file. Does anyone know how I might do this? I'm a journalist who is new to Python, sorry if this is straightforward.

I only have the code below. Before figuring out how to save each body text as a txt file, I figured I should try and just get the script to print the text from each row in the csv.

import newspaper as newspaper
from newspaper import Article
import sys as sys
import pandas as pd

data = pd.read_csv('/Users/alexfrandsen14/Desktop/Projects/newspaper3k-scraper/candidate_coverage.csv')

data.head()

for index,row in data.iterrows():
    article_name = Article(url=['link'], language='en')
    article_name.download()
    article_name.parse()
    print(article_name.text)

Solution

  • Since all the URLs are in the same column, it is easier to loop over that column directly. The code comes first, followed by some explanation:

    # loop over your URL column and save each article's text to a file
    from newspaper import Article
    import pandas as pd
    
    data = pd.read_csv('/Users/alexfrandsen14/Desktop/Projects/newspaper3k-scraper/candidate_coverage.csv')
    
    for x in data['url_column_name']:  # replace 'url_column_name' with the actual name in your df
        article = Article(x, language='en')  # x is the URL in each row of the column
        article.download()  # fetch the page
        article.parse()  # extract the title and body text
        # open a file named after the title of the article (could be long);
        # the with block closes the file once the text is written
        with open(article.title, 'w') as f:
            f.write(article.text)
    

    I have not tried this package before, but judging by its tutorial this should work. The line for x in data['url_column_name']: loops over the URL column of your dataframe. Replace 'url_column_name' with the actual name of the column (going by your code, it may be 'link').

    On the first pass, x is the URL in the first row, and that gets passed to Article (judging by the tutorial, you don't need brackets around x). The script downloads and parses that article, opens a file named after the article's title, writes the text to that file, then closes the file.

    It then does the same thing for the second x, the third x, and so on until you run out of URLs.

    I hope this helps!
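
One caveat with naming each file after the article title: a title containing a slash will break open(), and two articles with the same title will overwrite each other. Below is a minimal sketch of one way around that. It is not tested against your data. The helper names (safe_filename, fetch_article, save_articles) and the numbered-prefix naming scheme are made up for illustration; only Article, download(), parse(), title and text come from newspaper3k. It also skips URLs that fail to download instead of stopping the whole run.

```python
import re
from pathlib import Path


def safe_filename(index, title, max_len=80):
    """Build a unique, filesystem-safe .txt name from a row index and a title."""
    # Collapse every run of characters that are not word characters or hyphens
    slug = re.sub(r"[^\w-]+", "_", title).strip("_")[:max_len]
    # The zero-padded index keeps names unique even when titles repeat
    return f"{index:04d}_{slug or 'untitled'}.txt"


def fetch_article(url):
    """Download and parse one article with newspaper3k; return (title, text)."""
    from newspaper import Article  # imported here so the helpers work without it

    article = Article(url, language="en")
    article.download()
    article.parse()
    return article.title, article.text


def save_articles(urls, out_dir, fetch=fetch_article):
    """Save each article's text to its own file; return the URLs that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for index, url in enumerate(urls):
        try:
            title, text = fetch(url)
        except Exception as exc:  # a dead link should not stop the whole run
            print(f"Skipping {url}: {exc}")
            failed.append(url)
            continue
        path = out_dir / safe_filename(index, title)
        path.write_text(text, encoding="utf-8")
    return failed
```

With the dataframe from the code above, this could be called as save_articles(data['url_column_name'].dropna(), 'articles'), where 'articles' is just an example output folder. The returned list tells you which URLs to retry or check by hand.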