pythonpython-requestsweb-crawlersearch-enginebing

How do I convert Bing web page content crawled with Python into human-readable text?


I'm playing with crawling the Bing web search page using Python. The raw content received looks like bytes, but my attempt to decompress it failed. Does anyone have a clue what kind of data this is, and how I can extract something readable from this raw content? Thanks!

My code prints the raw content and then tries to gunzip it, so you can see the raw content as well as the error from the decompression. Since the raw content is too long, I paste only the first few lines below.

Code:

import urllib.request as Request
import gzip

req = Request.Request('http://www.bing.com')  # urlopen needs a full URL with a scheme
req.add_header('upgrade-insecure-requests', 1)
ResPage = Request.urlopen(req).read()
print("RAW Content: %s" % ResPage)  # show raw content of the page
print("Try decompression:")
print(gzip.decompress(ResPage))     # try decompression

Result:

RAW Content: b'+p\xe70\x0bi{)\xee!\xea\x88\x9c\xd4z\x00Tgb\x8c\x1b\xfa\xe3\xd7\x9f\x7f\x7f\x1d8\xb8\xfeaZ\xb6\xe3z\xbe\'\x7fj\xfd\xff+\x1f\xff\x1a\xbc\xc5N\x00\xab\x00\xa6l\xb2\xc5N\xb2\xdek\xb9V5\x02\t\xd0D \x1d\x92m%\x0c#\xb9>\xfbN\xd7\xa7\x9d\xa5\xa8\x926\xf0\xcc\'\x13\x97\x01/-\x03... ...

Try decompression:
Traceback (most recent call last):
OSError: Not a gzipped file (b'+p')


Process finished with exit code 1
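For reference, the reason `gzip.decompress` rejects the data: every gzip stream starts with the magic bytes `b'\x1f\x8b'`, and the traceback echoes back what it found instead (`b'+p'` here), so whatever encoding the server used, it is not gzip. A minimal round-trip shows what real gzip data looks like:

```python
import gzip

# A gzip stream always begins with the two magic bytes 0x1f 0x8b.
payload = gzip.compress(b"hello bing")
print(payload[:2])               # b'\x1f\x8b'
print(gzip.decompress(payload))  # b'hello bing'
```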

Solution

  • It's much easier to get started with the requests library, which is also the most commonly used Python library for HTTP requests nowadays. Unlike raw urllib, it transparently decompresses gzip/deflate response bodies for you, so you never see the compressed bytes.

    Install requests in your python environment:

    pip install requests
    

    In your .py file:

    import requests
    
    r = requests.get("http://www.bing.com")
    
    print(r.text)
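
    `r.text` is the response body already decoded to a string: requests picks a character set from the `Content-Type` header (falling back to a guess) and decodes the raw bytes for you, which is exactly the step missing from the urllib version. A minimal sketch of that decoding, using a stand-in byte string instead of a live response:

    ```python
    # Stand-in for the raw bytes urlopen().read() would return,
    # assuming the page is UTF-8 encoded.
    raw = "Bing – résultats".encode("utf-8")

    # requests does the equivalent of this for you and exposes it as r.text;
    # with urllib you must call .decode() yourself.
    text = raw.decode("utf-8")
    print(text)  # Bing – résultats
    ```

    If you do need the untouched bytes with requests (e.g. to save a binary file), use `r.content` instead of `r.text`.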