python linkchecker

How to ignore URLs containing image formats using linkchecker


I am using linkchecker to crawl the UK government website, map the relations between hyperlinks, and output to a GML file.

I do not want to include URLs of images, for example any URL that contains a jpeg or png file format reference (e.g. "www.gov.uk/somefile.jpeg").

I have tried for hours to achieve this using the --ignore-url command line parameter and various regular expressions. Here is my final attempt before giving up:

linkchecker --ignore-url='(png|jpg|jpeg|gif|tiff|bmp|svg|js)$' -r1 --verbose --no-warnings -ogml/utf_8 --file-output=gml/utf_8/www.gov.uk_RECURSION_1_LEVEL_NO_IMAGES.gml https://www.gov.uk

Could anyone please advise if this is possible, and if so suggest a solution?


Solution

  • Trivia:

    According to the docs:

    --ignore-url=REGEX

    URLs matching the given regular expression will be ignored and not checked.

    This option can be given multiple times.

    LinkChecker accepts Python regular expressions. See http://docs.python.org/howto/regex.html for an introduction. An addition is that a leading exclamation mark negates the regular expression.

    So we can check your regex in Python to see why it doesn't work:

    import re
    
    our_pattern = re.compile(r'(png|jpg|jpeg|gif|tiff|bmp|svg|js)$')
    input_data = '''
    www.gov.uk/
    www.gov.uk/index.html
    www.gov.uk/admin.html
    www.gov.uk/somefile.jpeg
    www.gov.uk/anotherone.png
    '''
    
    input_data = input_data.strip().split('\n')
    
    for address in input_data:
        print('Address: %s\t Matched as Image: %s' % (address, bool(our_pattern.match(address))))
        #                                                           ^ or our_pattern.fullmatch
    

    Output:

    Address: www.gov.uk/     Matched as Image: False
    Address: www.gov.uk/index.html   Matched as Image: False
    Address: www.gov.uk/admin.html   Matched as Image: False
    Address: www.gov.uk/somefile.jpeg    Matched as Image: False
    Address: www.gov.uk/anotherone.png   Matched as Image: False
    

    I think the problem is that the pattern only matches part of the URL (its ending), while match() starts at the beginning of the string. So let's try a pattern that covers the whole URL:

    ...
    our_pattern = re.compile(r'.*(?:png|jpg|jpeg|gif|tiff|bmp|svg|js)$')
    #                          ^ Note this (matches any character unlimited times)
    ...
    

    ...and output is:

    Address: www.gov.uk/     Matched as Image: False
    Address: www.gov.uk/index.html   Matched as Image: False
    Address: www.gov.uk/admin.html   Matched as Image: False
    Address: www.gov.uk/somefile.jpeg    Matched as Image: True
    Address: www.gov.uk/anotherone.png   Matched as Image: True
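
    The difference between the two test runs comes down to how Python anchors a pattern. The following is a minimal sketch comparing match(), search() and fullmatch() on the pattern from the question. The quoted docs do not say which of these LinkChecker applies internally, so this only explains the behaviour of the test script above; the helper name matching_modes is made up for illustration.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'(png|jpg|jpeg|gif|tiff|bmp|svg|js)$')

urls = [
    'www.gov.uk/index.html',
    'www.gov.uk/somefile.jpeg',
    'www.gov.uk/anotherone.png',
]


def matching_modes(url):
    """Report which of match/search/fullmatch accept the URL."""
    return {
        'match': bool(pattern.match(url)),          # must match at position 0
        'search': bool(pattern.search(url)),        # may match anywhere
        'fullmatch': bool(pattern.fullmatch(url)),  # must cover the whole string
    }


results = {url: matching_modes(url) for url in urls}
for url, modes in results.items():
    print(url, modes)
```

    With match() or fullmatch() the unprefixed pattern never accepts a full URL, which is why prefixing it with .* changes the result; with search() the original pattern would already find the extension at the end.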
    

    Solution:

    As you can see, in your attempt the URLs do not match the given regular expression, so they are not ignored. The only strings that match that regex from the start are the listed extensions themselves (png, jpg, ...).

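    To overcome this, match all characters before the extension with .*. Another possible problem is the enclosing quotes: a shell that does not strip single quotes (Windows cmd.exe, for example) passes them to LinkChecker as part of the pattern.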
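
    To see what a POSIX shell does with the quotes, here is a small sketch using Python's shlex module, which splits a command line using POSIX-like quoting rules. The command string is a shortened version of the one in the question; it is only split here, never executed.

```python
import shlex

# Shortened version of the command from the question.
command = "linkchecker --ignore-url='(png|jpg|jpeg)$' https://www.gov.uk"

# POSIX-style splitting removes the single quotes and keeps the
# characters ( | ) $ as literal text inside a single argument.
posix_args = shlex.split(command)

for arg in posix_args:
    print(repr(arg))
```

    Under these rules the program receives the pattern without the quotes, so in bash and similar shells the quotes are not part of the regex. On a shell with different quoting rules the result can differ.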

    From the docs' examples:

    Don't check mailto: URLs. All other links are checked as usual:

    linkchecker --ignore-url=^mailto: mysite.example.org

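    So your final option is the following (in a POSIX shell such as bash, keep the pattern wrapped in quotes, because (, | and $ are special to the shell):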

    --ignore-url=.*(?:png|jpg|jpeg|gif|tiff|bmp|svg|js)$
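
    One caveat with the pattern above: it also accepts any URL that merely ends in those letters (a path ending in "nodejs", say) and misses image URLs followed by a query string. Below is a sketch of a stricter variant, tested with re.match as in the script earlier. The strict pattern and the sample URLs are assumptions for illustration, not something taken from the LinkChecker docs, so check it against a real crawl before relying on it.

```python
import re

# Pattern from the answer above.
loose = re.compile(r'.*(?:png|jpg|jpeg|gif|tiff|bmp|svg|js)$')

# Hypothetical stricter variant: require a literal dot before the
# extension and allow an optional query string or fragment after it.
strict = re.compile(r'.*\.(?:png|jpe?g|gif|tiff|bmp|svg|js)(?:[?#].*)?$')

# Made-up sample URLs for illustration.
samples = [
    'www.gov.uk/somefile.jpeg',
    'www.gov.uk/logo.png?v=2',
    'www.gov.uk/guidance/nodejs',
    'www.gov.uk/index.html',
]

# Map each URL to (matched by loose, matched by strict).
report = {
    url: (bool(loose.match(url)), bool(strict.match(url)))
    for url in samples
}
for url, (is_loose, is_strict) in report.items():
    print('%-28s loose=%-5s strict=%s' % (url, is_loose, is_strict))
```

    If this variant suits your site, it would be passed the same way, as the value of --ignore-url.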
    

    Hope it helps!