python web-scraping beautifulsoup python-requests web-inspector

Scrape page links


I want to scrape the agent links from this page: https://kw.com/agent/search/IL/Chicago. When I inspect the page, I can't find any div class or a href link for the agents, and I don't understand which function I need to call to scrape these 652 agent links.

My code:

import requests
from bs4 import BeautifulSoup

url = 'https://kw.com/agent/search/IL/Chicago'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

urls = []
# Collect the href of every <a> tag present in the downloaded HTML
for link in soup.find_all('a'):
    urls.append(link.get('href'))
    print(link.get('href'))

This code works for other pages, but this site looks complicated to me. How can I collect the links from this site?


Solution

  • The agent data is not in the page's HTML; it is loaded from a GraphQL API. Each agent has a unique id, and appending that id to https://kw.com/agent/ gives the link to the agent's details page.

    Example:

    import requests
    
    api_url = "https://api-endpoint.cons-prod-us-central1.kw.com/graphql"
    data={"operationName":"searchAgentsQuery","variables":{"searchCriteria":{"searchTerms":{"param1":"IL","param2":"Chicago"}},"first":50,"after":"99","queryId":"0.8691595723322416"},"query":"query searchAgentsQuery($searchCriteria: AgentSearchCriteriaInput, $first: Int, $after: String) {\n  SearchAgentQuery(searchCriteria: $searchCriteria) {\n    result {\n      agents(first: $first, after: $after) {\n        edges {\n          node {\n            ...AgentProfileFragment\n            __typename\n          }\n          __typename\n        }\n        pageInfo {\n          ...PageInfoFragment\n          __typename\n        }\n        totalCount\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n}\n\nfragment PageInfoFragment on PageInfo {\n  endCursor\n  hasNextPage\n  __typename\n}\n\nfragment AgentProfileFragment on AgentProfileType {\n  id\n  name {\n    full\n    given\n    initials\n    __typename\n  }\n  image\n  location {\n    address {\n      state\n      city\n      __typename\n    }\n    __typename\n  }\n  realEstateEntity {\n    name\n    __typename\n  }\n  specialties\n  languages\n  isAgentLuxuryEnabled\n  phone {\n    entries {\n      ... on ContactSetEntryMobile {\n        number\n        __typename\n      }\n      ... on ContactSetEntryEmail {\n        email\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n  agentLicenses {\n    licenseNumber\n    state\n    __typename\n  }\n  marketCenter {\n    market_center_name\n    market_center_address1\n    market_center_address2\n    __typename\n  }\n  __typename\n}\n"}
    headers={
            'content-type': 'application/json',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
            'x-datadog-origin': 'rum',
            'x-datadog-parent-id': '5420198475190660541',
            'x-datadog-sampled': '1',
            'x-datadog-sampling-priority': '1',
            'x-datadog-trace-id': '1837163169752685118',
            'x-shared-secret': 'MjFydHQ0dndjM3ZAI0ZHQCQkI0BHIyM='
    }
    
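    res = requests.post(api_url, headers=headers, json=data)
    res.raise_for_status()  # fail early if the API rejects the request
    # Each edge wraps one agent node; the node id is the last part of the profile URL
    edges = res.json()['data']['SearchAgentQuery']['result']['agents']['edges']
    
    for item in edges:
        link = 'https://kw.com/agent/' + item['node']['id']
        print(link)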
    

    Output:

    https://kw.com/agent/UPA-6587385404419399681-8
    https://kw.com/agent/UPA-6587385313789222917-3
    https://kw.com/agent/UPA-6704789234247561216-6
    https://kw.com/agent/UPA-6587385427490459656-4
    https://kw.com/agent/UPA-6587385454284918792-0
    https://kw.com/agent/UPA-6882009464351350784-8
    https://kw.com/agent/UPA-6937439716674322432-5
    https://kw.com/agent/UPA-6587385379476373510-1
    https://kw.com/agent/UPA-6853411032351416320-2
    https://kw.com/agent/UPA-6587385065789456390-4
    https://kw.com/agent/UPA-6587385175436890114-3
    https://kw.com/agent/UPA-6942951019140222976-1
    https://kw.com/agent/UPA-6808491123018551296-7
    https://kw.com/agent/UPA-6587385273946116100-8
    https://kw.com/agent/UPA-6587385281007677447-9
    https://kw.com/agent/UPA-6592268954554945544-5
    https://kw.com/agent/UPA-6587385270364864517-7
    https://kw.com/agent/UPA-6856325267405185024-3
    https://kw.com/agent/UPA-6804158392167718912-3
    https://kw.com/agent/UPA-6638843865929490435-1
    https://kw.com/agent/UPA-6587384999272361984-6
    https://kw.com/agent/UPA-6592267095708119045-4
    https://kw.com/agent/UPA-6587385271389274119-4
    https://kw.com/agent/UPA-6587385271385079815-8
    https://kw.com/agent/UPA-6587385288161681409-1
    https://kw.com/agent/UPA-6587385375965011973-7
    https://kw.com/agent/UPA-6587385274994008066-1
    https://kw.com/agent/UPA-6913250263682408448-6
    https://kw.com/agent/UPA-6587385272597565443-9
    https://kw.com/agent/UPA-6859526404702093312-9
    https://kw.com/agent/UPA-6587385390518407175-2
    https://kw.com/agent/UPA-6587385436077776899-8
    https://kw.com/agent/UPA-6587384956740640770-9
    https://kw.com/agent/UPA-6587385297339674632-1
    https://kw.com/agent/UPA-6587385390593904641-1
    https://kw.com/agent/UPA-6811013526642786304-3
    https://kw.com/agent/UPA-6932834317516042240-9
    https://kw.com/agent/UPA-6587385437068947458-5
    https://kw.com/agent/UPA-6587385380989808647-6
    https://kw.com/agent/UPA-6892926376478015488-5
    https://kw.com/agent/UPA-6905262704995926016-2
    https://kw.com/agent/UPA-6592947303925784578-6
    https://kw.com/agent/UPA-6587385393920495624-5
    https://kw.com/agent/UPA-6783788552269369344-7
    https://kw.com/agent/UPA-6710285049427382272-8
    https://kw.com/agent/UPA-6844700377378430976-0
    https://kw.com/agent/UPA-6934540598372548608-6
    https://kw.com/agent/UPA-6711387287014834176-1
    https://kw.com/agent/UPA-6587385367301132290-0
    https://kw.com/agent/UPA-6714648183099023360-3
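
The request above returns a single page of 50 agents ("first":50), so collecting all 652 links means repeating it with a new "after" value each time. A minimal sketch of that loop follows, under some assumptions: `iter_agent_links` and `fetch_page` are hypothetical names, `fetch_page(after)` is expected to return the `agents` object of the response (the part holding `edges` and `pageInfo`), the API is assumed to accept the `endCursor` from `pageInfo` as the next `after` value, and the first page is assumed to work without a cursor. None of this is confirmed by the site, so check it against the real responses.

```python
from typing import Callable, Dict, Iterator, Optional

BASE_URL = "https://kw.com/agent/"


def iter_agent_links(
    fetch_page: Callable[[Optional[str]], Dict],
    max_pages: int = 100,
) -> Iterator[str]:
    """Yield agent profile links one page at a time.

    fetch_page is a hypothetical helper: given an `after` cursor, it must
    return the `agents` object of the GraphQL response.
    """
    after = None  # assumption: the first page needs no cursor
    for _ in range(max_pages):  # hard cap so a bad cursor cannot loop forever
        agents = fetch_page(after)
        for edge in agents["edges"]:
            yield BASE_URL + edge["node"]["id"]
        page_info = agents["pageInfo"]
        if not page_info.get("hasNextPage"):
            break
        # assumption: endCursor is accepted as the next `after` value
        after = page_info["endCursor"]
```

To use it with the example above, `fetch_page` could set `data['variables']['after']` to the cursor, send the same `requests.post` call, and return `res.json()['data']['SearchAgentQuery']['result']['agents']`. Keeping the HTTP call outside the loop makes the paging logic easy to check with fake pages before hitting the real API.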