python web-scraping beautifulsoup urllib

How to scrape specific IDs from a Webpage


I need to do some real estate market research, and for this I need the prices and other values for new houses.

My idea was to go to the website that has the information, open the main search page, and scrape all the real estate ids. Those ids would take me directly to the individual page for each house, where I can then extract the information I need.

My problem is: how do I get all the real estate ids from the main page and store them in a list, so that I can use them in the next step to build the URLs of the actual pages?

I tried it with BeautifulSoup but failed, because I don't understand how to search for a specific word and extract what comes after it.

The html code looks like this:

    ""realEstateId":110356727,"newHomeBuilder":"false","disabledGrouping":"false","resultlist.realEstate":{"@xsi.type":"search:ApartmentBuy","@id":"110356727","title":"

Since the key "realEstateId" appears around 60 times, I want to extract the number that follows it each time (here: 110356727) and store the numbers in a list, so that I can use them later.

Edit:

    import time
    import urllib.request
    from urllib.request import urlopen
    import bs4 as bs
    import datetime as dt
    import matplotlib.pyplot as plt
    from matplotlib import style
    import numpy as np
    import os
    import pandas as pd
    import pandas_datareader.data as web
    import pickle
    import requests
    from requests import get

    url = 'https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list'
    response = get(url)
    from bs4 import BeautifulSoup
    html_soup = BeautifulSoup(response.text, 'html.parser')
    type(html_soup)

    def expose_IDs():
        resp = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
        soup = bs.BeautifulSoup(resp.text, 'lxml')
        table = soup.find('resultListModel')
        tickers = []
        for row in table.findAll('realestateID')[1:]:
            ticker = row.findAll(',')[0].text
            tickers.append(ticker)
        with open("exposeID.pickle", "wb") as f:
            pickle.dump(tickers, f)
        return tickers

    expose_IDs()

Solution

  • Something like this? The ids are the 68 keys of a dictionary inside the page's data. I use a regex to grab the same script you are after and trim off an unwanted trailing character, then load the result with json.loads and read the ids from the parsed JSON object.

    import requests
    import json
    from bs4 import BeautifulSoup as bs
    import re
    
    res = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
    soup = bs(res.content, 'lxml')
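    # the listing data sits in an inline <script>, on a line starting with "resultListModel:"
    r = re.compile(r'resultListModel:(.*)')
    data = soup.find('script', string=r).string
    # keep the rest of that line and drop the trailing comma so it parses as JSON
    script = r.findall(data)[0].rstrip(',')
    results = json.loads(script)
    # the keys of entryInformation are the real estate ids
    ids = list(results['searchResponseModel']['entryInformation'].keys())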
    print(ids)
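The same idea (find the `resultListModel:` marker, then parse the JSON object that follows it) can be tried offline. The following is only a minimal sketch: `SAMPLE_PAGE` is a made-up fragment shaped like the page described above, and `extract_result_list_model` is a hypothetical helper name. It uses `json.JSONDecoder.raw_decode` instead of `rstrip(',')`, so whatever follows the object is simply ignored.

```python
import json
import re

# Made-up page fragment for illustration; the real page is much larger.
SAMPLE_PAGE = '''
<script>
var IS24 = {
  resultListModel: {"searchResponseModel": {"entryInformation": {"110356727": {}, "110356728": {}}}},
  other: 1
};
</script>
'''

def extract_result_list_model(page_text):
    """Parse the JSON object that directly follows 'resultListModel:'."""
    match = re.search(r'resultListModel:\s*', page_text)
    if match is None:
        raise ValueError('resultListModel not found in page text')
    # raw_decode parses one JSON value starting at the given index
    # and returns it together with the index where parsing stopped
    obj, _end = json.JSONDecoder().raw_decode(page_text, match.end())
    return obj

model = extract_result_list_model(SAMPLE_PAGE)
ids = list(model['searchResponseModel']['entryInformation'].keys())
print(ids)
```

With a real response, `page_text` would be `res.text`, and the keys to walk through depend on the structure the site currently serves.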
    

    Update, since the website changed its data structure:

    import requests
    import json
    from bs4 import BeautifulSoup as bs
    import re
    
    res = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
    soup = bs(res.content, 'lxml')
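    # the listing data sits in an inline <script>, on a line starting with "resultListModel:"
    r = re.compile(r'resultListModel:(.*)')
    data = soup.find('script', string=r).string
    # keep the rest of that line and drop the trailing comma so it parses as JSON
    script = r.findall(data)[0].rstrip(',')
    results = json.loads(script)
    # each result entry now carries its id under the '@id' key
    ids = [item['@id'] for item in results['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']]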
    print(ids)
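If only the numbers after `"realEstateId"` are needed, a plain regex over the raw page text may already be enough, and the resulting list can then be turned into URLs for the next step. This is a rough sketch, not a tested scraper: the sample text is the snippet from the question, the helper names are made up, and the `/expose/<id>` URL pattern is an assumption that should be checked against a real listing link.

```python
import re

# Sample input taken from the question; in practice this would be response.text
SAMPLE_TEXT = '"realEstateId":110356727,"newHomeBuilder":"false","disabledGrouping":"false","resultlist.realEstate":{"@xsi.type":"search:ApartmentBuy","@id":"110356727","title":"'

# Assumed URL pattern for a single listing page (not confirmed by the source)
EXPOSE_URL = 'https://www.immobilienscout24.de/expose/{}'

def find_real_estate_ids(text):
    """Return every number that follows "realEstateId": in page order, without duplicates."""
    found = re.findall(r'"realEstateId"\s*:\s*"?(\d+)', text)
    # dict.fromkeys drops repeats while keeping the first-seen order
    return list(dict.fromkeys(found))

def build_expose_urls(ids):
    """Build one listing URL per id from the assumed pattern."""
    return [EXPOSE_URL.format(real_estate_id) for real_estate_id in ids]

ids = find_real_estate_ids(SAMPLE_TEXT)
urls = build_expose_urls(ids)
print(ids)
print(urls)
```

The regex route is more tolerant of changes in the surrounding JSON structure, while the `json.loads` route gives access to the other fields of each entry as well.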