Tags: html, parsing, python-requests, get-request, web-application-firewall

Can't access site programmatically


I'm trying to get the list of shutdowns from dtek-kem.com.ua/ua/shutdowns. But when I send a GET request from Python, I get the response "Request unsuccessful. Incapsula incident ID: ...". I also know that this site uses Imperva security.

I send the request with Python's aiohttp:

method='GET'
Host: www.dtek-kem.com.ua
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en,ru;q=0.9,uk;q=0.8,en-US;q=0.7
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36
cache-control: max-age=0
sec-ch-ua: "Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: same-origin
sec-fetch-user: ?1
upgrade-insecure-requests: 1

I get the following response:

https://www.dtek-kem.com.ua/ua/shutdowns [200 OK]
Content-Type: text/html
Cache-Control: no-cache, no-store
Connection: close
Content-Length: 899
X-Iinfo: 4-43048402-0 0NNN RT(1670585645218 54) q(0 -1 -1 -1) r(0 -1) B12(4,316,0) U2
Strict-Transport-Security: max-age=31536000; includeSubDomains
Set-Cookie: incap_ses_287_2224657=4b9AWuO2/2fTOuVPWqH7Ay0dk2MAAAAAtnXLv3+84L80QP1nTKP8Fg==; Domain=dtek-kem.com.ua; Path=/; SameSite=None; Secure
Set-Cookie: visid_incap_2224657=OOVTSrqKRCeH0QB7kzrgIC0dk2MAAAAAQUIPAAAAAAB47Nowjvq7LxL76cUkJG0a; Domain=dtek-kem.com.ua; expires=Fri, 08 Dec 2023 22:17:56 GMT; HttpOnly; Path=/; SameSite=None; Secure

and html content:

<html style="height:100%">
 <head>
  <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="initial-scale=1.0" name="viewport"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <script async="" src="/Physicken-Like-my-Hath-I-haue-ster-Banq-All-bids">
  </script>
 </head>
 <body style="margin:0px;height:100%">
  <iframe frameborder="0" height="100%" id="main-iframe" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?SWUDNSAI=31&amp;xinfo=4-43048402-0%200NNN%20RT%281670585645218%2054%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c316%2c0%29%20U2&amp;incident_id=287000410527500428-206407667178998340&amp;edet=12&amp;cinfo=04000000&amp;rpinfo=0&amp;cts=swfgpEczXy9hSsxHaaLf43gsGYhnGBhKA1jABnA0Ljuov3FUOG0mGjfE6li1tAg6&amp;mth=GET" width="100%">
   Request unsuccessful. Incapsula incident ID: 287000410527500428-206407667178998340
  </iframe>
 </body>
</html>

I copied the request headers exactly from the browser's Network tab, taking the first request sent to the server when opening the site. Even so, I get a different response from the server. Doesn't the server receive absolutely identical requests? Here is the response to the browser's request:

access-control-allow-credentials: true
access-control-allow-credentials: true
access-control-allow-headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type
access-control-allow-headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type
access-control-allow-methods: GET, POST, OPTIONS
access-control-allow-methods: GET, POST, OPTIONS
access-control-allow-origin: https://admin.dtek-kem.com.ua
cache-control: no-store, no-cache, must-revalidate
cache-control: max-age=900
cache-control: public, max-age=900
cache-control: no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0
content-encoding: gzip
content-type: text/html; charset=UTF-8
date: Fri, 09 Dec 2022 12:02:38 GMT
expect-ct: enforce; max-age=3600
expect-ct: enforce; max-age=3600
expires: Thu, 19 Nov 1981 08:52:00 GMT
pragma: no-cache
referrer-policy: strict-origin-when-cross-origin
server: nginx
path=/; secure; secure; HttpOnly
status: 200
httpVersion: http/2.0
cookies: [{'name': 'dtek-kem', 'value': '0mspqled433d6pq7t9q9ttcjos'}, {'name': '_csrf-dtek-kem', 'value': '0957f055f621ade8b7c6a5136201e0081a1579972aa33443a65646c44afeb161a%3A2%3A%7Bi%3A0%3Bs%3A14%3A%22_csrf-dtek-kem%22%3Bi%3A1%3Bs%3A32%3A%22aJodoGWonH3u7fdI7jVzex4n6yBPZ9qX%22%3B%7D'}, {'name': 'Domain', 'value': 'dtek-kem.com.ua'}, {'name': 'incap_wrt_356', 'value': '3iOTYwAAAAA3Gkt0FwAI5AIQxJuq1AEYicrMnAYgAijdx8ycBknxuwb65PIpngUwOmGF+xE='}]
content: {'size': 635168, 'mimeType': 'text/html'}

Am I getting into a big topic like "bypassing a firewall", or am I missing something?


Solution

  • Requests

    Requests works fine if you pass the "incap_ses_1612_2224657" cookie to the session:

    import requests
    import urllib.parse
    from bs4 import BeautifulSoup as bs
    
    url = r'https://www.dtek-kem.com.ua'
    s = requests.Session()
    s.cookies['incap_ses_1612_2224657'] = 'oRiXXtkFuiaomXJJnfleFu98mGMAAAAACfnEff2NJ+ZJhjCB4Sr2Zw=='
    r = s.get(urllib.parse.urljoin(url, 'ua/shutdowns'))
    soup = bs(r.content, 'lxml')
    

    So it's not a big topic like "bypassing a firewall"; the site works fine. Furthermore, the reCAPTCHA is bypassed in the browser by simply refreshing the page with F5. The cookie can be taken from there and used for as long as the session is active.
    I don't yet know how to obtain it with requests alone. Sometimes requests gets the full set of cookies on its own; the headers don't really matter.
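Because the cookie only works while the session is active, it may help to detect when a response is the Incapsula block page rather than the real one. This is a minimal sketch, assuming the block page keeps the two markers visible in the question's HTML; `looks_blocked` and the sample strings are hypothetical and not part of requests:

```python
# Hypothetical helper: the markers are taken from the block page shown in the question.
BLOCK_MARKERS = ("_Incapsula_Resource", "Incapsula incident ID")


def looks_blocked(html: str) -> bool:
    """Return True if the body looks like the Incapsula block page."""
    return any(marker in html for marker in BLOCK_MARKERS)


# Shortened stand-ins for a blocked and a normal response body.
blocked_sample = (
    '<iframe src="/_Incapsula_Resource?SWUDNSAI=31">'
    "Request unsuccessful. Incapsula incident ID: ...</iframe>"
)
normal_sample = '<html><head><meta name="csrf-token" content="abc"></head></html>'

print(looks_blocked(blocked_sample))
print(looks_blocked(normal_sample))
```

With the session above, `looks_blocked(r.text)` returning True would suggest that the cookie has to be refreshed from the browser.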

    Make a table

    Now, how would we prepare a table without rendering and without things like Scrapy, dryscrape, requests_html and other cool but resource-heavy libraries?
    In certain cases those would be helpful, but here the data can be acquired with BeautifulSoup or even with re alone. We need just a single <script> element from the web page; it contains all the needed information.

    Get the table data

    import re
    import json
    
    d = soup.find_all(lambda tag: tag.name == 'script' and not tag.attrs)[-1].decode_contents()
    d_parsed = {}
    for i in re.findall(r'(?<=DisconSchedule\.)(\w+)(?:\s=\s)(.+)',d):
        d_parsed[i[0]] = json.loads(i[1])
    d = d_parsed
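To see what the regular expression does on its own, here is a sketch that runs it on a made-up script body. The `DisconSchedule.<name> = <JSON>` layout is inferred from the pattern above, and `sample_script` is invented for illustration; stripping a trailing semicolon is an extra precaution that the code above does not take:

```python
import json
import re

# Made-up script body that mimics the assumed "DisconSchedule.<name> = <JSON>" layout.
sample_script = '''
DisconSchedule.streets = ["вул. A", "вул. B"]
DisconSchedule.currentWeekDayIndex = 3
'''

parsed = {}
for name, raw in re.findall(r'(?<=DisconSchedule\.)(\w+)(?:\s=\s)(.+)', sample_script):
    # Drop a trailing semicolon, if any, so the value is valid JSON.
    parsed[name] = json.loads(raw.rstrip().rstrip(';'))

print(parsed)
```

Each match yields the property name and the rest of the line, so every value has to sit on a single line for this approach to work.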
    

    Now the variable d contains a dictionary with street names, the current day of the week, and the table values, which form a kind of 3-dimensional table that will need some further parsing.
    But first we need to get the house information with a POST request:

    csrf = soup.find('meta', {'name': 'csrf-token'})['content']
    headers = {
        'X-CSRF-Token': csrf,
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
    }
    body = 'method=getHomeNum&data[0][name]=street&data[0][value]='+d['streets'][193]
    r = s.post(urllib.parse.urljoin(url, '/ua/ajax'), body.encode('utf-8'), headers=headers)
    house = json.loads(r.content)['data']['20']
    house
    
    Output:
    {'sub_type': 'Застосування стабілізаційних графіків',
     'start_date': '1670926920',
     'end_date': '16:00 13.12.2022',
     'type': '2',
     'sub_type_reason': ['1']}
    

    Here we do need some headers: specify the content type and pass the CSRF token. The cookies are already in the session. The body of the query contains a street name; d['streets'][193] is 'вул. Газопровідна'.
    The response has some useful information, which the site renders in a div with a yellow background above the table, so it's worth keeping.
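The body above is concatenated by hand and sent unescaped. As an alternative sketch, the same fields can be built with `urllib.parse.urlencode`, which percent-encodes the brackets, spaces and Cyrillic characters. Whether the server accepts the encoded bracket names is an assumption here (form parsers usually decode them), and `fields` is a name introduced for illustration:

```python
from urllib.parse import parse_qsl, urlencode

street = 'вул. Газопровідна'  # the value of d['streets'][193] in the answer

fields = [
    ('method', 'getHomeNum'),
    ('data[0][name]', 'street'),
    ('data[0][value]', street),
]
body = urlencode(fields)  # percent-encodes brackets, spaces and Cyrillic as UTF-8

print(body)
# Decoding gives back the original pairs, which is what a form parser would see.
print(parse_qsl(body))
```

Passing the same pairs as `data=` to `s.post` should let requests do this encoding itself, and it also sets the form content type.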

    But what we are looking for is "sub_type_reason". This is the 3rd dimension I mentioned. It is shown to the right of the house number and stands for 'Група' (group) 1 / 2 / 3. There might be more groups at some point.

    For this particular address "вул. Газопровідна 20" we'll be using group 1.
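The fields of the house record can be turned into typed values before use. This is a sketch under two assumptions that the source does not confirm: that `start_date` is a Unix timestamp in seconds, and that `end_date` is local time in `HH:MM DD.MM.YYYY` form. The record itself is copied from the output above:

```python
from datetime import datetime, timezone

# The record printed above, copied as-is.
house = {
    'sub_type': 'Застосування стабілізаційних графіків',
    'start_date': '1670926920',
    'end_date': '16:00 13.12.2022',
    'type': '2',
    'sub_type_reason': ['1'],
}

# Assumption: start_date is a Unix timestamp in seconds.
start = datetime.fromtimestamp(int(house['start_date']), tz=timezone.utc)
# Assumption: end_date is text in "HH:MM DD.MM.YYYY" form (no timezone attached).
end = datetime.strptime(house['end_date'], '%H:%M %d.%m.%Y')
# The group is the first (and here only) entry of sub_type_reason.
group = house['sub_type_reason'][0]

print(start.isoformat(), end.isoformat(), group)
```

Under these assumptions the start falls on the same day as the end time, which is at least consistent with both fields describing one outage window.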

    Build a table

    I'll be using pandas for this. We'll be making some further modifications, so pandas is a good fit here.

    import pandas as pd
    
    gr = house['sub_type_reason'][0]
    # one column per day of the week, one row per hour
    df = pd.DataFrame({int(k):d['preset']['data'][gr][k].values() for k in d['preset']['days'].keys()})
    df
    
    Output:
    
        1       2       3       4       5       6       7
    0   no      maybe   no      no      maybe   no      no
    1   no      maybe   yes     no      maybe   yes     no
    2   no      maybe   yes     no      maybe   yes     no
    3   no      no      maybe   no      no      maybe   no
    4   yes     no      maybe   yes     no      maybe   yes
    5   yes     no      maybe   yes     no      maybe   yes
    6   maybe   no      no      maybe   no      no      maybe
    7   maybe   yes     no      maybe   yes     no      maybe
    8   maybe   yes     no      maybe   yes     no      maybe
    9   no      maybe   no      no      maybe   no      no
    10  no      maybe   yes     no      maybe   yes     no
    11  no      maybe   yes     no      maybe   yes     no
    12  no      no      maybe   no      no      maybe   no
    13  yes     no      maybe   yes     no      maybe   yes
    14  yes     no      maybe   yes     no      maybe   yes
    15  maybe   no      no      maybe   no      no      maybe
    16  maybe   yes     no      maybe   yes     no      maybe
    17  maybe   yes     no      maybe   yes     no      maybe
    18  no      maybe   no      no      maybe   no      no
    19  no      maybe   yes     no      maybe   yes     no
    20  no      maybe   yes     no      maybe   yes     no
    21  no      no      maybe   no      no      maybe   no
    22  yes     no      maybe   yes     no      maybe   yes
    23  yes     no      maybe   yes     no      maybe   yes
    

    Okay, great!
    Basically, this is the same table you see on the website, but without the electricity icons and transposed, as in the mobile version.
    The legend is in d['preset']['time_type'] ('yes': power is on, 'maybe': outage possible, 'no': no power):

    {'yes': 'Світло є', 'maybe': 'Можливо відключення', 'no': 'Світла немає'}
    

    Modify a table

    As per your screenshot, this is what you want to get. As far as I understand it, the idea is to collapse consecutive 'yes' and 'maybe' values into one row that spans their combined time period.
    That's challenging, but it can be done.

    from operator import itemgetter
    from itertools import groupby
    
    row = ['']*len(df.columns)
    # 'no' becomes an empty cell; 'yes' and 'maybe' both become True
    df = df.replace(['no'],'').replace(['yes','maybe'],True)
    collapsed_df = pd.DataFrame(columns=df.columns)
    for col in df.columns:
        # hours (row labels) that are marked True for this day
        hours = df[df[col].eq(True)].index
        # within a run of consecutive hours, position minus hour is constant
        for k,g in groupby(enumerate(hours), lambda x: x[0]-x[1]):
            intervals = list(map(itemgetter(1), g))
            interval = pd.Interval(intervals[0], intervals[-1]+1, closed='both')
            # create the row with empty cells the first time this interval appears
            if interval not in collapsed_df.index:
                collapsed_df.loc[interval] = list(row)
            # assign via .loc directly; chained .loc[...].iloc[...] may write to a copy
            collapsed_df.loc[interval, col] = True
    df = collapsed_df.sort_index()
    df
    
    Output:
                1       2       3       4       5       6       7
    [0, 3]              True                    True        
    [1, 6]                      True                    True    
    [4, 9]      True                    True                    True
    [7, 12]             True                    True        
    [10, 15]                    True                    True    
    [13, 18]    True                    True                    True
    [16, 21]            True                    True        
    [19, 24]                    True                    True    
    [22, 24]    True                    True                    True
    

    I'm not going to describe in detail the magic behind collapsing the columns, as the answer would get too long. I'm also quite sure that this piece of code can be improved.
    In a few words: I iterate through each column to find groups of consecutive values and collapse their indices. The collapsed indices are cast to intervals, and a True value is added to the row with the corresponding interval. A row is created with empty values the first time its interval appears.
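The grouping step can be shown without pandas. The sketch below uses the same `enumerate` plus `groupby` trick on a plain list of hours; `collapse_hours` is a hypothetical helper name, and the end of each pair is the last hour plus one, as in the intervals above:

```python
from itertools import groupby


def collapse_hours(hours):
    """Collapse sorted hour indices into (start, end) pairs, end = last hour + 1."""
    runs = []
    # Within a run of consecutive hours, position minus hour stays constant.
    for _, group in groupby(enumerate(hours), key=lambda pair: pair[0] - pair[1]):
        members = [hour for _, hour in group]
        runs.append((members[0], members[-1] + 1))
    return runs


# Hours marked 'yes' or 'maybe' in column 1 of the table above.
column_1 = [4, 5, 6, 7, 8, 13, 14, 15, 16, 17, 22, 23]
print(collapse_hours(column_1))
```

Each pair corresponds to one interval row for that column; the pandas version additionally merges the columns by using the interval as a shared row label.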

    Anyway, done.
    The layout matches your screenshot, but the data differs because we're on a different day and the schedule has changed since then.
    What is left is to convert the index values, which stand for hour intervals, to hour strings, rename the columns, and style the table to match your screenshot.

    Final touch

    from base64 import b64encode
    
    img = {
        'maybe': b64encode(s.get(urllib.parse.urljoin(url,'media/page/maybe-electricity.png')).content),
        'no': b64encode(s.get(urllib.parse.urljoin(url,'media/page/no-electricity.png')).content)
    }
    # embed the "no electricity" icon as a data URI in every marked cell
    df = df.replace(True, '<img src="data:image/webp;base64,'+img['no'].decode()+'"></img>')
    
    df.index = ['{:02d}:00 – {:02d}:00'.format(i.left, i.right) for i in df.index]
    df.columns = ['Пн','Вт','Ср','Чт','Пт','Сб','Нд']
    df.columns.name = 'Години'
    
    styled_df = df.style.set_table_styles([
        {'selector': '',
        'props': [
            ('border-collapse', 'collapse'),
            ('border', '1px solid #cfcfcf'),
            ('font-size', '20px')
        ]},
        {'selector': 'thead tr',
        'props': [
            ('background-color', '#ffe500'),
            ('color', 'black'),
            ('height', '70px')
        ]},
        {'selector': 'thead tr th:first-child',
        'props': [
            ('border', '1px solid #cfcfcf'),
            ('width', '240px')
        ]},
        {'selector': 'td',
        'props': [
            ('border-left', '1px solid #cfcfcf'),
            ('text-align', 'center'),
            ('width', '95px'),
            ('height', '56px')
        ]},
        {'selector': 'td, th',
        'props': [
            ('font-weight', 'lighter')
        ]},
        {'selector': 'thead tr th:nth-child({})'.format(d['currentWeekDayIndex']+1),
        'props': [
            ('font-weight', 'bold')
        ]},
        {'selector': 'img',
        'props': [
            ('height', '23px'),
            ('width', '21px')
        ]},
        {'selector': 'td:has(> img)',
        'props': [
            ('background-color', '#f4f4f4')
        ]}
    ])
    
    styled_df.to_html(escape=False, border=0, encoding='utf-8')
    
    Output:

    const image_bin = ""
    var images = document.getElementsByTagName("img")
    for (var i = 0; i < images.length; i++) {
        images[i].src = image_bin;
    }
    #T_b04e1  {
      border-collapse: collapse;
      border: 1px solid #cfcfcf;
      font-size: 20px;
    }
    #T_b04e1 thead tr {
      background-color: #ffe500;
      color: black;
      height: 70px;
    }
    #T_b04e1 thead tr th:first-child {
      border: 1px solid #cfcfcf;
      width: 240px;
    }
    #T_b04e1 td {
      border-left: 1px solid #cfcfcf;
      text-align: center;
      width: 95px;
      height: 56px;
    }
    #T_b04e1 td {
      font-weight: lighter;
    }
    #T_b04e1  th {
      font-weight: lighter;
    }
    #T_b04e1 thead tr th:nth-child(3) {
      font-weight: bold;
    }
    #T_b04e1 img {
      height: 23px;
      width: 21px;
    }
    #T_b04e1 td:has(> img) {
      background-color: #f4f4f4;
    }
    <table id="T_b04e1">
      <thead>
        <tr>
          <th class="index_name level0" >Години</th>
          <th id="T_b04e1_level0_col0" class="col_heading level0 col0" >Пн</th>
          <th id="T_b04e1_level0_col1" class="col_heading level0 col1" >Вт</th>
          <th id="T_b04e1_level0_col2" class="col_heading level0 col2" >Ср</th>
          <th id="T_b04e1_level0_col3" class="col_heading level0 col3" >Чт</th>
          <th id="T_b04e1_level0_col4" class="col_heading level0 col4" >Пт</th>
          <th id="T_b04e1_level0_col5" class="col_heading level0 col5" >Сб</th>
          <th id="T_b04e1_level0_col6" class="col_heading level0 col6" >Нд</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th id="T_b04e1_level0_row0" class="row_heading level0 row0" >00:00 – 03:00</th>
          <td id="T_b04e1_row0_col0" class="data row0 col0" ></td>
          <td id="T_b04e1_row0_col1" class="data row0 col1" ><img></img></td>
          <td id="T_b04e1_row0_col2" class="data row0 col2" ></td>
          <td id="T_b04e1_row0_col3" class="data row0 col3" ></td>
          <td id="T_b04e1_row0_col4" class="data row0 col4" ><img></img></td>
          <td id="T_b04e1_row0_col5" class="data row0 col5" ></td>
          <td id="T_b04e1_row0_col6" class="data row0 col6" ></td>
        </tr>
        <tr>
          <th id="T_b04e1_level0_row1" class="row_heading level0 row1" >01:00 – 06:00</th>
          <td id="T_b04e1_row1_col0" class="data row1 col0" ></td>
          <td id="T_b04e1_row1_col1" class="data row1 col1" ></td>
          <td id="T_b04e1_row1_col2" class="data row1 col2" ><img></img></td>
          <td id="T_b04e1_row1_col3" class="data row1 col3" ></td>
          <td id="T_b04e1_row1_col4" class="data row1 col4" ></td>
          <td id="T_b04e1_row1_col5" class="data row1 col5" ><img></img></td>
          <td id="T_b04e1_row1_col6" class="data row1 col6" ></td>
        </tr>
        <tr>
          <th id="T_b04e1_level0_row2" class="row_heading level0 row2" >04:00 – 09:00</th>
          <td id="T_b04e1_row2_col0" class="data row2 col0" ><img></img></td>
          <td id="T_b04e1_row2_col1" class="data row2 col1" ></td>
          <td id="T_b04e1_row2_col2" class="data row2 col2" ></td>
          <td id="T_b04e1_row2_col3" class="data row2 col3" ><img></img></td>
          <td id="T_b04e1_row2_col4" class="data row2 col4" ></td>
          <td id="T_b04e1_row2_col5" class="data row2 col5" ></td>
          <td id="T_b04e1_row2_col6" class="data row2 col6" ><img></img></td>
        </tr>
        <tr>
          <th id="T_b04e1_level0_row3" class="row_heading level0 row3" >07:00 – 12:00</th>
          <td id="T_b04e1_row3_col0" class="data row3 col0" ></td>
          <td id="T_b04e1_row3_col1" class="data row3 col1" ><img></img></td>
          <td id="T_b04e1_row3_col2" class="data row3 col2" ></td>
          <td id="T_b04e1_row3_col3" class="data row3 col3" ></td>
          <td id="T_b04e1_row3_col4" class="data row3 col4" ><img></img></td>
          <td id="T_b04e1_row3_col5" class="data row3 col5" ></td>
          <td id="T_b04e1_row3_col6" class="data row3 col6" ></td>
        </tr>
        <tr>
          <th id="T_b04e1_level0_row4" class="row_heading level0 row4" >10:00 – 15:00</th>
          <td id="T_b04e1_row4_col0" class="data row4 col0" ></td>
          <td id="T_b04e1_row4_col1" class="data row4 col1" ></td>
          <td id="T_b04e1_row4_col2" class="data row4 col2" ><img></img></td>
          <td id="T_b04e1_row4_col3" class="data row4 col3" ></td>
          <td id="T_b04e1_row4_col4" class="data row4 col4" ></td>
          <td id="T_b04e1_row4_col5" class="data row4 col5" ><img></img></td>
          <td id="T_b04e1_row4_col6" class="data row4 col6" ></td>
        </tr>
        <tr>
          <th id="T_b04e1_level0_row5" class="row_heading level0 row5" >13:00 – 18:00</th>
          <td id="T_b04e1_row5_col0" class="data row5 col0" ><img></img></td>
          <td id="T_b04e1_row5_col1" class="data row5 col1" ></td>
          <td id="T_b04e1_row5_col2" class="data row5 col2" ></td>
          <td id="T_b04e1_row5_col3" class="data row5 col3" ><img></img></td>
          <td id="T_b04e1_row5_col4" class="data row5 col4" ></td>
          <td id="T_b04e1_row5_col5" class="data row5 col5" ></td>
          <td id="T_b04e1_row5_col6" class="data row5 col6" ><img></img></td>
        </tr>
        <tr>
          <th id="T_b04e1_level0_row6" class="row_heading level0 row6" >16:00 – 21:00</th>
          <td id="T_b04e1_row6_col0" class="data row6 col0" ></td>
          <td id="T_b04e1_row6_col1" class="data row6 col1" ><img></img></td>
          <td id="T_b04e1_row6_col2" class="data row6 col2" ></td>
          <td id="T_b04e1_row6_col3" class="data row6 col3" ></td>
          <td id="T_b04e1_row6_col4" class="data row6 col4" ><img></img></td>
          <td id="T_b04e1_row6_col5" class="data row6 col5" ></td>
          <td id="T_b04e1_row6_col6" class="data row6 col6" ></td>
        </tr>
        <tr>
          <th id="T_b04e1_level0_row7" class="row_heading level0 row7" >19:00 – 24:00</th>
          <td id="T_b04e1_row7_col0" class="data row7 col0" ></td>
          <td id="T_b04e1_row7_col1" class="data row7 col1" ></td>
          <td id="T_b04e1_row7_col2" class="data row7 col2" ><img></img></td>
          <td id="T_b04e1_row7_col3" class="data row7 col3" ></td>
          <td id="T_b04e1_row7_col4" class="data row7 col4" ></td>
          <td id="T_b04e1_row7_col5" class="data row7 col5" ><img></img></td>
          <td id="T_b04e1_row7_col6" class="data row7 col6" ></td>
        </tr>
        <tr>
          <th id="T_b04e1_level0_row8" class="row_heading level0 row8" >22:00 – 24:00</th>
          <td id="T_b04e1_row8_col0" class="data row8 col0" ><img></img></td>
          <td id="T_b04e1_row8_col1" class="data row8 col1" ></td>
          <td id="T_b04e1_row8_col2" class="data row8 col2" ></td>
          <td id="T_b04e1_row8_col3" class="data row8 col3" ><img></img></td>
          <td id="T_b04e1_row8_col4" class="data row8 col4" ></td>
          <td id="T_b04e1_row8_col5" class="data row8 col5" ></td>
          <td id="T_b04e1_row8_col6" class="data row8 col6" ><img></img></td>
        </tr>
      </tbody>
    </table>

    The output is a copy-paste of what styled_df.to_html() returns, so it is fully generated.
    I only added a small piece of JS that sets the repeated image data on each <img src=""> to save characters in this answer. This is the only manual step in making the snippet; you can automate it with a regex or other means if you need to.
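One way to automate that manual step is a regular expression that drops the inline image data from every `<img>` tag. This is a rough sketch that assumes the tags look exactly like the ones generated above (a single `src` attribute holding a data URI); `strip_inline_images` is a hypothetical helper name:

```python
import re


def strip_inline_images(html: str) -> str:
    """Replace every <img src="data:..."> opening tag with a bare <img> tag."""
    return re.sub(r'<img src="data:[^"]*">', '<img>', html)


# Shortened stand-in for one row of the generated table.
sample = '<td><img src="data:image/webp;base64,AAAA"></img></td><td></td>'
print(strip_inline_images(sample))
```

The image data then has to be supplied once elsewhere, as the small JS snippet in the output does.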

    The output can be saved to a file by adding buf:

    styled_df.to_html(buf='lovely_table.html', escape=False, border=0, encoding='utf-8')
    

    You can now experiment with the column collapsing, for example doing it separately for 'yes' and 'maybe', to get results that suit your needs.