python python-re

How to get a list of file URLs using urllib.request?


from urllib.request import urlopen
import re

urlpath = urlopen("http://blablabla.com/file")
string = urlpath.read().decode('utf-8')

pattern = re.compile('*.docx"')
onlyfiles = pattern.findall(string)

print(onlyfiles)

Target output

['http://blablabla.com/file/1.docx','http://blablabla.com/file/2.docx']

But I got this

[]

I also get this error message when I run it:

re.error: nothing to repeat at position 0

Solution

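  • The problem is the leading star in this line:

    pattern = re.compile('*.docx"')

    This is not a Python bug. In a regular expression, * is a quantifier meaning "zero or more of the preceding item", not a glob-style wildcard. At position 0 there is no preceding item, so re.compile raises "nothing to repeat".

    See this related answer: regex error - nothing to repeat

    Give the star something to repeat, such as \w or an explicit character class. Use a raw string and escape the dot so that it matches a literal "." rather than any character:

    pattern = re.compile(r'\w*\.docx"')
    # or
    pattern = re.compile(r'[a-zA-Z0-9]*\.docx"')

    Note that these patterns match only the file name followed by the closing quote (for example 1.docx"), not the full URL.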
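To get full URLs like the target output, one option is to capture the href values and join them with the base URL. The following is only a sketch: it assumes the page lists files as plain `<a href="...">` links with double-quoted attributes, and `BASE_URL`, `SAMPLE_HTML`, and `docx_urls` are hypothetical names. `SAMPLE_HTML` stands in for the string returned by `urlopen(...).read().decode('utf-8')`.

```python
import re
from urllib.parse import urljoin

# Hypothetical base URL and page content (assumptions, not from the question).
BASE_URL = "http://blablabla.com/file/"
SAMPLE_HTML = (
    '<a href="1.docx">1.docx</a> '
    '<a href="2.docx">2.docx</a> '
    '<a href="notes.txt">notes.txt</a>'
)

# Capture the value of every double-quoted href that ends in ".docx".
# [^"]* repeats "any character except a quote", so the star has something to repeat.
DOCX_LINK = re.compile(r'href="([^"]*\.docx)"')


def docx_urls(html, base_url):
    """Return an absolute URL for every .docx link found in html."""
    return [urljoin(base_url, href) for href in DOCX_LINK.findall(html)]


print(docx_urls(SAMPLE_HTML, BASE_URL))
# ['http://blablabla.com/file/1.docx', 'http://blablabla.com/file/2.docx']
```

`urljoin` leaves links that are already absolute unchanged and resolves relative ones against the base URL. For anything beyond a simple directory listing, an HTML parser such as the standard library's `html.parser` is usually more robust than a regular expression.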