Environment: Python 3.
There are many files, some of them encoded with gbk, others with utf-8.
I want to extract all the jpg URLs with a regular expression.
For example, s.html is encoded with gbk:
tree = open("/tmp/s.html","r").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 135: invalid start byte
import re
tree = open("/tmp/s.html","r",encoding="gbk").read()
pat = r"http://.+\.jpg"
result = re.findall(pat,tree)
print(result)
['http://somesite/2017/06/0_56.jpg']
It is a huge job to open every file with a specified encoding; I want a smart way to extract the jpg URLs from all the files.
If they have mixed encoding, you could try one encoding and fall back to another:
# first open as binary
with open(..., 'rb') as f:
    f_contents = f.read()
try:
    contents = f_contents.decode('UTF-8')
except UnicodeDecodeError:
    contents = f_contents.decode('gbk')
...
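To apply that fallback across a whole directory, something like the following might work. It is only a sketch: `read_text_with_fallback`, `find_jpg_urls`, and the `*.html` glob are made-up names and defaults, and the pattern is a tighter variant of the one in the question (it stops at whitespace, quotes, and angle brackets), not the original regex.

```python
import re
from pathlib import Path

# Assumption: URLs never contain whitespace, quotes or angle brackets
JPG_URL = re.compile(r"http://[^\s\"'<>]+?\.jpg")

def read_text_with_fallback(path, encodings=("utf-8", "gbk")):
    """Return the file's text, decoded with the first encoding that succeeds."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep going and replace the undecodable bytes
    return raw.decode(encodings[0], errors="replace")

def find_jpg_urls(directory, glob="*.html"):
    """Map each matching file name to the list of jpg URLs found in it."""
    return {
        p.name: JPG_URL.findall(read_text_with_fallback(p))
        for p in sorted(Path(directory).glob(glob))
    }
```

The order of `encodings` matters: UTF-8 is strict and tends to fail quickly on gbk bytes, whereas gbk will often decode UTF-8 bytes into garbage without raising, so UTF-8 should be tried first.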
If they are html files, you may also be able to find the encoding tag, or search them as binary with a binary regex:
import re
contents = open(..., 'rb').read()
regex = re.compile(rb'http://.+\.jpg')
result = regex.findall(contents)
# now you'll probably want to `.decode()` each of the urls, but you should be able to do that pretty trivially with even the `ASCII` codec
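For the "find the encoding tag" route, a rough sketch could look like this. It assumes the page declares its charset in a `<meta>` tag near the top of the file; `sniff_charset`, `decode_html`, and the 4096-byte scan window are hypothetical choices, and the regex is a simplification rather than a full HTML encoding-sniffing algorithm.

```python
import codecs
import re

# Covers <meta charset="gbk"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=gbk"> form
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)

def sniff_charset(raw, default="utf-8", scan_bytes=4096):
    """Return the charset declared in the first scan_bytes of raw, else default."""
    match = META_CHARSET.search(raw[:scan_bytes])
    if not match:
        return default
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return default
    return name

def decode_html(raw):
    """Decode raw HTML bytes using the declared charset (or the default)."""
    return raw.decode(sniff_charset(raw), errors="replace")
```

A declared charset can still be wrong or missing, so this probably works best combined with the try/except fallback rather than as a replacement for it.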
Though now that I think of it, you probably don't really want to use regex to capture the links, as you'll then have to deal with HTML entities (&amp;), and may do better with something like pyquery.
Here's a quick example using pyquery:
import pyquery

contents = open(..., 'rb').read()
pq = pyquery.PyQuery(contents)
images = pq.find('img')
for img in images:
    img = pyquery.PyQuery(img)
    src = img.attr('src')
    # skip <img> tags that have no src attribute
    if src and src.endswith('.jpg'):
        print(src)
I'm not at a computer with these installed, so your mileage with these code samples may vary.
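If installing pyquery isn't an option, roughly the same idea can be sketched with the standard library's `html.parser` instead (a swapped-in technique, not pyquery). `JpgImageCollector` and `jpg_image_urls` are made-up names, and the input is assumed to be already-decoded text:

```python
from html.parser import HTMLParser

class JpgImageCollector(HTMLParser):
    """Collect the src of every <img> whose URL ends in .jpg."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src and src.lower().endswith(".jpg"):
            self.urls.append(src)

def jpg_image_urls(html_text):
    """Return the .jpg image URLs found in an HTML string."""
    parser = JpgImageCollector()
    parser.feed(html_text)
    parser.close()
    return parser.urls
```

Attribute values reach `handle_starttag` with entities such as `&amp;` already decoded, which is the main thing the regex approach leaves you to handle yourself.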