python beautifulsoup python-requests tripadvisor

Scrape an email address from a Tripadvisor webpage


I am trying to scrape the email address from the following webpage using Python with BeautifulSoup (bs4) and requests, but the email address does not appear in the page source.

https://www.tripadvisor.in/Attraction_Review-g189400-d2020955-Reviews-Chat_Tours-Athens_Attica.html

Clicking the email link opens the address in my mail app, but I could not find the link in the page source. I understand this could be done by watching the network tab and replicating the POST request that the website makes, but I could not make it work.


Thanks in advance!!


Solution

  • The email is Base64-encoded inside the JSON variable (window.__WEB_CONTEXT__) embedded in the page.

    You can use this example to get all emails found on the page:

    import re
    import json
    import base64
    import requests


    url = 'https://www.tripadvisor.in/Attraction_Review-g189400-d2020955-Reviews-Chat_Tours-Athens_Attica.html'

    # Fetch the page and extract the object assigned to window.__WEB_CONTEXT__
    html_data = requests.get(url).text
    data = re.search(r'window\.__WEB_CONTEXT__=(\{.*?\});', html_data).group(1)
    # pageManifest is written as an unquoted key, so quote it to make the text valid JSON
    data = json.loads(data.replace('pageManifest', '"pageManifest"'))

    def get_emails(val):
        """Recursively yield every non-empty value stored under an 'email' key."""
        if isinstance(val, dict):
            for k, v in val.items():
                if k == 'email':
                    if v:
                        yield v
                else:
                    yield from get_emails(v)
        elif isinstance(val, list):
            for v in val:
                yield from get_emails(v)

    for email in get_emails(data):
        # Each value is Base64; the decoded text contains 'mailto:<address>_'
        email = base64.b64decode(email).decode('utf-8')
        email = re.search(r'mailto:(.*)_', email).group(1)

        print(email)
    

    Prints:

    chat@chatours.gr
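
    To check the decoding logic without hitting the site, here is a minimal offline sketch of the same two steps: walking nested JSON for a key, then Base64-decoding and pulling out the address. The sample payload, the "abc_"/"_xyz" padding around the mailto: part, and the helper names find_values and decode_email are made up for illustration; the real page data may be shaped differently.

```python
import base64
import re


def find_values(node, key):
    """Recursively yield every non-empty value stored under `key` in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                if v:
                    yield v
            else:
                yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def decode_email(encoded):
    """Base64-decode a value and return the text between 'mailto:' and the last '_', or None."""
    text = base64.b64decode(encoded).decode("utf-8")
    match = re.search(r"mailto:(.*)_", text)
    return match.group(1) if match else None


# Hypothetical payload: the padding and the address are invented for this sketch.
sample_plain = "abc_mailto:info@example.com_xyz"
sample = {
    "page": {
        "contact": {"email": base64.b64encode(sample_plain.encode("utf-8")).decode("ascii")},
        "items": [{"email": ""}, {"name": "no email here"}],
    }
}

emails = [decode_email(value) for value in find_values(sample, "email")]
print(emails)
```

    Returning None when the pattern is missing, instead of calling .group(1) directly, avoids an AttributeError if some 'email' value decodes to text without a mailto: part.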