I am using linkchecker to crawl the UK government website, map the relations between hyperlinks, and output to a GML file.
I do not want to include URLs of images, for example any URL that contains a jpeg or png file format reference (e.g. "www.gov.uk/somefile.jpeg").
I have tried for hours to achieve this using the --ignore-url command line parameter and various regular expressions. Here is my final attempt before giving up:
linkchecker --ignore-url='(png|jpg|jpeg|gif|tiff|bmp|svg|js)$' -r1 --verbose --no-warnings -ogml/utf_8 --file-output=gml/utf_8/www.gov.uk_RECURSION_1_LEVEL_NO_IMAGES.gml https://www.gov.uk
Could anyone please advise if this is possible, and if so suggest a solution?
According to the docs:
--ignore-url=REGEX
URLs matching the given regular expression will be ignored and not checked.
This option can be given multiple times.
LinkChecker accepts Python regular expressions. See http://docs.python.org/howto/regex.html for an introduction. An addition is that a leading exclamation mark negates the regular expression.
So we can easily check your regex in Python to see why it doesn't work:
import re
our_pattern = re.compile(r'(png|jpg|jpeg|gif|tiff|bmp|svg|js)$')
input_data = '''
www.gov.uk/
www.gov.uk/index.html
www.gov.uk/admin.html
www.gov.uk/somefile.jpeg
www.gov.uk/anotherone.png
'''
input_data = input_data.strip().split('\n')
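for address in input_data:
    # re.match only tries the pattern at the start of the string
    print('Address: %s\t Matched as Image: %s' % (address, bool(our_pattern.match(address))))
    # our_pattern.fullmatch(address) gives the same result for this pattern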
Output:
Address: www.gov.uk/ Matched as Image: False
Address: www.gov.uk/index.html Matched as Image: False
Address: www.gov.uk/admin.html Matched as Image: False
Address: www.gov.uk/somefile.jpeg Matched as Image: False
Address: www.gov.uk/anotherone.png Matched as Image: False
I think the problem is that the pattern only matches part of the URL (the extension), so let's try a pattern that matches the whole URL:
...
our_pattern = re.compile(r'.*(?:png|jpg|jpeg|gif|tiff|bmp|svg|js)$')
# ^ Note this (matches any character unlimited times)
...
...and output is:
Address: www.gov.uk/ Matched as Image: False
Address: www.gov.uk/index.html Matched as Image: False
Address: www.gov.uk/admin.html Matched as Image: False
Address: www.gov.uk/somefile.jpeg Matched as Image: True
Address: www.gov.uk/anotherone.png Matched as Image: True
As you can see, in your attempt the URLs do not match the given regular expression, so they are not ignored. The only strings that match that regex from the start are the listed extensions themselves (png, jpg, ...).
To overcome this problem, match all characters before the extension with .*
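To make the difference concrete, here is a minimal sketch comparing Python's match, search and fullmatch on one of the URLs above. It only illustrates how the re module behaves; which of these methods LinkChecker calls internally is not stated in the docs quoted here, so the reasoning in this answer assumes match-like behaviour.

```python
import re

# Same alternation as in the question, with and without the leading .*
suffix_only = re.compile(r'(png|jpg|jpeg|gif|tiff|bmp|svg|js)$')
with_prefix = re.compile(r'.*(?:png|jpg|jpeg|gif|tiff|bmp|svg|js)$')

url = 'www.gov.uk/somefile.jpeg'

# match() only tries the pattern at position 0, so the bare suffix pattern fails.
anchored_at_start = bool(suffix_only.match(url))
# search() scans the whole string and finds the trailing "jpeg".
found_anywhere = bool(suffix_only.search(url))
# With a leading .* the pattern can consume the prefix, so match() succeeds.
prefixed_match = bool(with_prefix.match(url))
# fullmatch() must cover the entire string; the .* prefix makes that possible too.
prefixed_fullmatch = bool(with_prefix.fullmatch(url))

print(anchored_at_start, found_anywhere, prefixed_match, prefixed_fullmatch)
# prints: False True True True
```

If the tool used search-like behaviour, the original pattern would already work, so the leading .* is harmless either way.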
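Another possible problem is the enclosing quotes. If your shell does not strip them (cmd.exe on Windows, for example, does not treat single quotes as quoting characters), they reach LinkChecker as part of the pattern.
The examples in the docs pass the pattern unquoted: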
Don't check mailto: URLs. All other links are checked as usual:
linkchecker --ignore-url=^mailto: mysite.example.org
So your final option is (in a POSIX shell, keep this pattern quoted, because (, | and $ are special to the shell):
--ignore-url=.*(?:png|jpg|jpeg|gif|tiff|bmp|svg|js)$
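One caveat worth checking before a long crawl: the pattern above has no literal dot before the extension and does not allow a query string after it. The sketch below compares it with a stricter variant. The stricter pattern and the sample URLs are my own illustration (assumptions, not taken from the LinkChecker docs), so test it against your real URLs first.

```python
import re

# Pattern from the answer above.
loose = re.compile(r'.*(?:png|jpg|jpeg|gif|tiff|bmp|svg|js)$')
# Hypothetical stricter variant: requires a literal dot before the extension
# and tolerates an optional ?query or #fragment after it.
strict = re.compile(r'.*\.(?:png|jpg|jpeg|gif|tiff|bmp|svg|js)(?:[?#].*)?$')

# Illustrative URLs only; they are not known to exist on www.gov.uk.
samples = [
    'https://www.gov.uk/somefile.jpeg',
    'https://www.gov.uk/logo.png?v=2',
    'https://www.gov.uk/guidance/nodejs',
    'https://www.gov.uk/index.html',
]

# Map each URL to (matched by loose pattern, matched by strict pattern).
results = {url: (bool(loose.match(url)), bool(strict.match(url))) for url in samples}

for url, (is_loose, is_strict) in results.items():
    print('%s\tloose=%s\tstrict=%s' % (url, is_loose, is_strict))
```

Here the loose pattern would ignore the page ending in "nodejs" and miss the image with a query string, while the stricter variant handles both as intended.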
Hope it helps!