python-2.7 search-engine xapian

How to index a web page in Xapian so that a search returns its URL


I am using Ubuntu 12.04 and Python 2.7.

My code for getting the contents from a given URL:

import urllib

def get_page(url):
    '''Gets the contents of a page from a given URL'''
    try:
        f = urllib.urlopen(url)
        page = f.read()
        f.close()
        return page
    except IOError:
        # Network or HTTP failure: treat the page as empty
        return ""

To filter the content of a page provided by get_page(url):

import re

def filterContents(content):
    '''Filters the content from a page'''
    filteredContent = ''
    # Raw string so the backslashes reach the regex engine unchanged
    regex = re.compile(r'(?<!script)[>](?![\s\#\'-<]).+?[<]')
    for words in regex.findall(content):
        word_list = split_string(words, """ ,"!-.()<>[]{};:?!-=/_`&""")
        for word in word_list:
            filteredContent = filteredContent + word
    return filteredContent

def split_string(source, splitlist):
    return ''.join([ w if w not in splitlist else ' ' for w in source])

How do I index filteredContent in Xapian so that a query returns the URLs of the pages it matches?


Solution

  • I'm not completely clear what your filterContents() and split_string() are actually trying to do (throwing away some HTML tag contents and then word splitting), so let me talk through a similar problem that doesn't have that complexity folded into it.

    Let's assume we have a function strip_tags() which returns just the textual content of an HTML document, and your get_page() function. We want to build up a Xapian database where each document corresponds to one URL, so that a search returns the URLs of the matching pages.
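
    strip_tags() is only assumed here, not provided by Xapian. As a rough sketch of what it might look like, the following uses the standard library's HTML parser to collect text nodes while skipping script and style contents. The class name _TextExtractor is made up for illustration, and on Python 2 character entities such as &amp; are dropped rather than decoded.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping <script>/<style>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a script or style element
        if self.skip_depth == 0:
            self.chunks.append(data)


def strip_tags(html):
    """Return the text of an HTML document with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(' '.join(parser.chunks).split())
```

    Real-world pages are often malformed, so a dedicated HTML library may cope better; this is only meant to make the indexing example self-contained.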

    So you could index as follows:

    import xapian
    def index_url(database, url):
        text = strip_tags(get_page(url))
        doc = xapian.Document()
    
        # TermGenerator will split text into words
        # and then (because we set a stemmer) stem them
        # into terms and add them to the document
        termgenerator = xapian.TermGenerator()
        termgenerator.set_stemmer(xapian.Stem("en"))
        termgenerator.set_document(doc)
        termgenerator.index_text(text)
    
        # We want to be able to get at the URL easily
        doc.set_data(url)
        # And we want to ensure each URL only ends up in
        # the database once. Note that if your URLs are long
        # then this won't work; consult the FAQ on unique IDs
        # for more: http://trac.xapian.org/wiki/FAQ/UniqueIds
        idterm = 'Q' + url
        doc.add_boolean_term(idterm)
        database.replace_document(idterm, doc)
    
    # then index an example URL
    db = xapian.WritableDatabase("exampledb", xapian.DB_CREATE_OR_OPEN)
    
    index_url(db, "https://stackoverflow.com/")
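
    The comment about long URLs refers to Xapian's limit on term length. One possible workaround, in the spirit of the unique IDs FAQ, is to keep the start of the URL and replace the remainder with a hash. The sketch below is an illustration, not Xapian API: make_id_term and the MAX_TERM_LENGTH value are assumptions (check the FAQ for the actual limit of your backend), and it assumes ASCII URLs so that character and byte lengths agree.

```python
import hashlib

# Assumed conservative limit; see the Xapian FAQ for the real maximum.
MAX_TERM_LENGTH = 240


def make_id_term(url, prefix='Q'):
    """Return a unique ID term for url, hashing the tail if it is too long."""
    term = prefix + url
    if len(term) <= MAX_TERM_LENGTH:
        return term
    # hashlib needs bytes: Python 2 str is already bytes, unicode/str3 is not
    data = url if isinstance(url, bytes) else url.encode('utf-8')
    digest = hashlib.sha1(data).hexdigest()  # 40 hex characters
    # Keep a readable head of the URL and append the hash of the whole URL
    keep = MAX_TERM_LENGTH - len(prefix) - len(digest)
    return prefix + url[:keep] + digest
```

    In index_url() the line building idterm would then become idterm = make_id_term(url), with the rest unchanged.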
    

    Searching is then simple, although it can obviously get more sophisticated if needed:

    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem("en"))
    qp.set_stemming_strategy(qp.STEM_SOME)
    query = qp.parse_query('question and answer')
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    for match in enquire.get_mset(0, 10):
        print match.document.get_data()
    

    which will display 'https://stackoverflow.com/', since 'question and answer' is on the homepage when you aren't logged in.

    I'd recommend checking out the Xapian getting started guide both for concepts and code.