python python-3.x byte urllib urlopen

Extracting numbers from bytes data


I am new to web-scraping. After scraping some websites using these lines:

import urllib.request

x1 = urllib.request.urlopen('somewebsite1').read()
x2 = urllib.request.urlopen('somewebsite2').read()
x3 = urllib.request.urlopen('somewebsite3').read()

I have the following data:

In [14]:print(x1)
b'<li><span class="Price down2">0.071&nbsp;</span></li>'

In [15]:print(x2)
b'<li><span class="Price up2">0.059&nbsp;</span></li>'

In [16]:print(x3)
b'<li><span class="Price down2">0.079&nbsp;</span></li>'

The datatypes of x1, x2 and x3 are bytes. I want to extract 0.071, 0.059, 0.079 as floats from x1, x2 and x3. What's the pythonic way to do so?

Thank you in advance

EDIT: for better presentation


Solution

  • You could use regular expressions:

    import re
    x1_extracted = re.findall(r'(?<=>)\d+\.?\d*', x1.decode('utf-8'))
    x1_extracted = float(x1_extracted[0])

    First, decode the bytes sequence, i.e. convert it from bytes to a string (I'm assuming the encoding is utf-8). Then use the re module to find the values. Step by step, the expression means: find one or more digits (\d+), optionally followed by a dot (\.?) and optionally followed by more digits (\d*). All of that must be preceded by a > ((?<=>)). The pattern is written as a raw string (r'...') so that Python passes the backslashes to the regex engine unchanged.
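To apply the same idea to x1, x2 and x3 at once, the decode-and-search steps can be wrapped in a small helper. This is only a sketch: the name extract_price is made up for illustration, the byte strings are the sample values from the question, and utf-8 is still an assumption about the pages' encoding. The pattern here is slightly stricter than the one above, since it requires digits after the dot.

```python
import re

# A number such as 0.071 that directly follows a '>' character.
PRICE_RE = re.compile(r'(?<=>)\d+(?:\.\d+)?')


def extract_price(raw: bytes, encoding: str = 'utf-8') -> float:
    """Return the first number found right after a '>' in the raw HTML.

    Hypothetical helper; the default encoding is an assumption.
    """
    text = raw.decode(encoding)
    match = PRICE_RE.search(text)
    if match is None:
        raise ValueError('no number found in input')
    return float(match.group())


# Sample values copied from the question.
samples = [
    b'<li><span class="Price down2">0.071&nbsp;</span></li>',
    b'<li><span class="Price up2">0.059&nbsp;</span></li>',
    b'<li><span class="Price down2">0.079&nbsp;</span></li>',
]

prices = [extract_price(s) for s in samples]
print(prices)
```

Raising ValueError when nothing matches makes a changed page layout fail loudly, whereas indexing an empty findall result would raise a less descriptive IndexError.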