python beautifulsoup

dataframe from BeautifulSoup object


I want to create a dataframe from a BeautifulSoup object:

import pandas as pd
from requests import get
from bs4 import BeautifulSoup
import re

# Fetch the web page
url = 'https://carbondale.craigslist.org/search/apa#search=1~gallery~0~0'
response = get(url)  # link excludes posts with no pictures
page = response.text

# Parse the HTML content
soup = BeautifulSoup(page, 'html.parser')

# Information I need 
list_url = []
title = []
location = []
price = []

# I run the following 
list_url = [a['href'] for a in soup.select('a[href^="https"]')]
title = [x.text for x in soup.find_all(class_="title")]
location = [x.text for x in soup.find_all(class_="location")]
price = [x.text for x in soup.find_all(class_="price")]

The problem is that for some classes (e.g., title or location) some elements are missing, so the lists end up with different lengths (you can check this with the len() function). When I then try to create a data frame, it raises an error because the lists are not all the same size. What I actually want is for the word "None" to appear in the corresponding column of the dataframe wherever an element is missing.
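
To illustrate the problem without hitting the site, here is a minimal sketch using made-up values (the sample lists below are hypothetical, not scraped data). It compares the list sizes with len() and shows that pandas refuses to build a DataFrame from columns of different lengths.

```python
import pandas as pd

# Hypothetical sample values standing in for the scraped lists
title = ["Listing A", "Listing B", "Listing C"]
location = ["Marion", "Herrin"]  # one listing had no location element
price = ["$900", "$945", "$848"]

# Compare the sizes of the lists
lengths = {"title": len(title), "location": len(location), "price": len(price)}
print(lengths)

# pandas raises ValueError when the columns differ in length
error = None
try:
    pd.DataFrame({"Title": title, "Location": location, "Price": price})
except ValueError as exc:
    error = exc
    print("DataFrame creation failed:", exc)
```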


Solution

  • You need to iterate over each listing on the page and append its values one by one to list_url, title, location and price. If any of these values is missing for a listing, append None to the corresponding list instead. All four lists then have the same length, and you can create the DataFrame from them.

    To iterate over the listings, I looked at how the rows are structured and noticed that each one is an li element with class="cl-static-search-result". You can iterate over these elements and look up the required values inside each one, instead of calling find_all on the whole page, which does not take into account which values belong to the same listing.
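
The difference between the two approaches can be seen offline on a small hand-written HTML fragment. This is only a sketch: the markup below is an assumption that imitates the class names mentioned above, not the real page, and the second listing deliberately has no location element.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not the real page);
# the second listing has no location element
html = """
<ol>
  <li class="cl-static-search-result">
    <a href="https://example.org/a">
      <div class="title">First listing</div>
      <div class="price">$900</div>
      <div class="location"> Marion </div>
    </a>
  </li>
  <li class="cl-static-search-result">
    <a href="https://example.org/b">
      <div class="title">Second listing</div>
      <div class="price">$500</div>
    </a>
  </li>
  <li class="cl-static-search-result">
    <a href="https://example.org/c">
      <div class="title">Third listing</div>
      <div class="price">$848</div>
      <div class="location"> Herrin </div>
    </a>
  </li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# Page-wide search: the gap left by the missing location is lost,
# so the two lists end up with different lengths
titles_flat = [x.text for x in soup.find_all(class_="title")]
locations_flat = [x.text for x in soup.find_all(class_="location")]
print(len(titles_flat), len(locations_flat))

# Per-listing search: exactly one entry per listing, None where absent
locations = []
for listing in soup.find_all("li", class_="cl-static-search-result"):
    node = listing.find("div", class_="location")
    locations.append(node.text.strip() if node else None)
print(locations)
```

With the per-listing loop, the position of each value always matches the listing it came from, which is what keeps the DataFrame columns aligned.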

    Try this:

    import pandas as pd
    from requests import get
    from bs4 import BeautifulSoup
    
    
    # Fetch the web page
    url = 'https://carbondale.craigslist.org/search/apa#search=1~gallery~0~0'
    response = get(url)  # link excludes posts with no pictures
    page = response.text
    
    # Parse the HTML content
    soup = BeautifulSoup(page, 'html.parser')
    
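    # Extract listings from the page
    listings = soup.find_all('li', class_='cl-static-search-result')
    
    # One list per column; every listing appends exactly one value to each
    list_url, title, location, price = [], [], [], []
    
    for listing in listings:
        # Extract URL (named link so the page url above is not overwritten)
        link = listing.find('a')
        list_url.append(link['href'] if link else None)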
    
        # Extract Title
        title_text = listing.find('div', class_='title')
        title.append(title_text.text if title_text else None)
        
        # Extract Location
        location_text = listing.find('div', class_='location')
        location.append(location_text.text.strip() if location_text else None)
        
        # Extract Price
        price_text = listing.find('div', class_='price')
        price.append(price_text.text if price_text else None)
    
    # Create DataFrame
    df = pd.DataFrame({
        'URL': list_url,
        'Title': title,
        'Location': location,
        'Price': price
    })
    
    

    Printing the first 5 rows with df.head() gives:

                                                     URL                                              Title        Location Price
    0  https://carbondale.craigslist.org/apa/d/johnst...                Almost New, 2 BR APT for Rent in JC   Johnston City  $900
    1  https://carbondale.craigslist.org/apa/d/northb...  Enjoy 2 Bed/2 Bath/2 Car Ranch Home With Great...  Northbrook, IL  $945
    2  https://carbondale.craigslist.org/apa/d/mount-...  Love where you live! Beautiful senior communit...    Mount Vernon  $848
    3  https://carbondale.craigslist.org/apa/d/mount-...  Your dream 1 bed, 1 bath is closer than you th...    Mount Vernon  $848
    4  https://carbondale.craigslist.org/apa/d/mount-...  Be at the center of it all: 1 BR, 1 BA, 553 Sq...    Mount Vernon  $848
    

    For any listing where a value is not available or cannot be found, None is appended instead. The rows of the dataframe with missing values look like this:

                                                      URL                                              Title Location   Price
    16  https://carbondale.craigslist.org/apa/d/marion...  Cfd housing Houses for rent or purchase southe...     None    $500
    17  https://carbondale.craigslist.org/apa/d/herrin...                         4 Bedroom/2 bathroom House     None  $1,561
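
If the literal word "None" is wanted as a string rather than a missing value, one option is to fill the gaps after the DataFrame has been built. The sketch below uses a small hypothetical frame in place of the scraped one and relies on pandas' DataFrame.fillna.

```python
import pandas as pd

# Hypothetical rows mirroring the output above; None marks a missing location
df = pd.DataFrame({
    "Title": ["4 Bedroom/2 bathroom House", "Almost New, 2 BR APT for Rent in JC"],
    "Location": [None, "Johnston City"],
    "Price": ["$1,561", "$900"],
})

# Replace every missing value with the literal string "None"
df_filled = df.fillna("None")
print(df_filled)
```

Appending the string "None" inside the loop instead of the None object would give the same result, but keeping real missing values until the end lets you still use methods such as isna() or dropna() on the original frame.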