I am new to web-scraping. After scraping some websites using these lines:
import urllib.request
x1 = urllib.request.urlopen('somewebsite1').read()
x2 = urllib.request.urlopen('somewebsite2').read()
x3 = urllib.request.urlopen('somewebsite3').read()
I have the following data:
In [14]: print(x1)
b'<li><span class="Price down2">0.071 </span></li>'
In [15]: print(x2)
b'<li><span class="Price up2">0.059 </span></li>'
In [16]: print(x3)
b'<li><span class="Price down2">0.079 </span></li>'
x1, x2 and x3 are all bytes objects. I want to extract 0.071, 0.059 and 0.079 as floats from x1, x2 and x3. What's the Pythonic way to do so?
Thank you in advance
EDIT: for better presentation
You could use regular expressions:
import re
x1_extracted = re.findall(r'(?<=>)\d+\.?\d*', x1.decode('utf-8'))
x1_extracted = float(x1_extracted[0])
First, you need to decode your bytes sequence (convert it from bytes to a string; I'm assuming the encoding is UTF-8). Then you can use the re module to find the values. Step by step, the expression means: find one or more digits (\d+), optionally followed by a single dot (\.?) and optionally followed by more digits (\d*). All of that must be preceded by a > ((?<=>)). The pattern is written as a raw string (r'...') so that Python passes the backslashes through to the regex engine unchanged.
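To apply the same idea to all three responses, the decode-and-match steps could be wrapped in a small helper. This is only a sketch: the name extract_price and the PRICE_PATTERN constant are made up for illustration, the pattern allows at most one dot, and it assumes the bytes are UTF-8 and that the first number directly after a > is the price.

```python
import re

# Hypothetical names, not from the original post.
# Lookbehind for '>', then digits, an optional single dot, then optional digits.
PRICE_PATTERN = re.compile(r'(?<=>)\d+\.?\d*')

def extract_price(raw: bytes, encoding: str = 'utf-8') -> float:
    """Decode the scraped bytes and return the first number that follows a '>'."""
    match = PRICE_PATTERN.search(raw.decode(encoding))
    if match is None:
        # Fail loudly instead of raising an IndexError on an empty result list.
        raise ValueError('no price found in input')
    return float(match.group())

# Sample inputs copied from the question.
pages = [
    b'<li><span class="Price down2">0.071 </span></li>',
    b'<li><span class="Price up2">0.059 </span></li>',
    b'<li><span class="Price down2">0.079 </span></li>',
]

prices = [extract_price(page) for page in pages]
print(prices)
```

Using re.search instead of re.findall stops at the first match, and the explicit None check gives a clearer error if a page comes back without a price. For anything more complex than these one-line snippets, an HTML parser is usually more robust than a regular expression.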