python web-scraping beautifulsoup incapsula

Is there any way to get past Incapsula protection when scraping with Python 3?


I'm new to scraping, and I'm already being blocked by Incapsula protection.

from urllib.request import urlopen
from bs4 import BeautifulSoup

my_url = 'https://www.immoweb.be/fr/recherche/immeuble-de-rapport/a-vendre'

# open the connection and download the page (closed automatically on exit)
with urlopen(my_url) as response:
    page_html = response.read()

# parse the HTML
page_soup = BeautifulSoup(page_html, "html.parser")

# show the first <h1> element (None if the page has no <h1>)
print(page_soup.h1)

I can't access any data from the website because Incapsula blocks the request.
When I run:

print(page_soup)

I get this message:

<html style="height:100%"><head><meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/><meta content="telephone=no" name="format-detection"/>
[...]
Request unsuccessful. Incapsula incident ID: 936002200207012991-

Solution

  • I tried the approaches described in "Getting ‘wrong’ page source when calling url from python", and only @Karl Anka's workaround, loading the page in a real browser through Selenium, worked.

    See the example below:

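    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    url = 'https://www.immoweb.be/fr/recherche/immeuble-de-rapport/a-vendre'

    # Selenium 4 takes the chromedriver path through Service
    # (the executable_path argument was deprecated in 4.0 and later removed)
    driver = webdriver.Chrome(service=Service('./chromedriver'))
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, features='html.parser')
    finally:
        # always close the browser, even if the page load fails
        driver.quit()

    print(soup.prettify())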
    

    Output:

    <html class="js flexbox rgba borderradius boxshadow opacity cssgradients csstransitions generatedcontent localstorage sessionstorage" style="" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
     <head>
      <script async="" src="https://c.pebblemedia.be/js/data/david/_david_publishers_master_produpress.js" type="text/javascript">
      </script>
      <script async="" src="https://scdn.cxense.com/cx.js" type="text/javascript">
      </script>
      <script async="" src="https://connect.facebook.net/signals/plugins/inferredEvents.js?v=2.8.47">
      </script>
      <script async="" src="https://connect.facebook.net/signals/config/1554445828209863?v=2.8.47&amp;r=stable">
      </script>
    [...]
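
The block page quoted in the question contains the text "Incapsula incident ID", so a scraper could check for that text before parsing and only fall back to the slower browser route when the plain request is rejected. The following is a minimal sketch of that idea, not a definitive implementation: `looks_blocked`, `fetch_with_fallback` and `BLOCK_MARKERS` are hypothetical names, and the marker string is taken from the error page above rather than from any documented Incapsula format, so it may differ on other sites.

```python
from typing import Callable

# Marker text copied from the block page quoted in the question.
# This is an assumption, not a documented format; adjust it per site.
BLOCK_MARKERS = ("Incapsula incident ID",)


def looks_blocked(html: str) -> bool:
    """Return True if the HTML looks like the Incapsula block page."""
    lowered = html.lower()
    return any(marker.lower() in lowered for marker in BLOCK_MARKERS)


def fetch_with_fallback(
    url: str,
    plain_fetch: Callable[[str], str],
    browser_fetch: Callable[[str], str],
) -> str:
    """Try the cheap fetcher first; use the browser fetcher only if blocked."""
    html = plain_fetch(url)
    if looks_blocked(html):
        # The plain request was rejected, so retry through a real browser.
        html = browser_fetch(url)
    return html
```

Here `plain_fetch` could wrap the `urlopen` call from the question (decoding the bytes to `str`), and `browser_fetch` could wrap the Selenium snippet and return `driver.page_source`. Passing the fetchers in as callables keeps the check itself free of network and browser dependencies.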