I am using Ubuntu 12.04 and Python 2.7.
My code for getting the contents from a given URL:
import urllib

def get_page(url):
    '''Gets the contents of a page from a given URL'''
    try:
        f = urllib.urlopen(url)
        page = f.read()
        f.close()
        return page
    except IOError:
        # urllib.urlopen raises IOError when the URL cannot be opened
        return ""
To filter the content of a page provided by get_page(url):
import re

def filterContents(content):
    '''Filters the content from a page'''
    filteredContent = ''
    regex = re.compile('(?<!script)[>](?![\s\#\'-<]).+?[<]')
    for words in regex.findall(content):
        word_list = split_string(words, """ ,"!-.()<>[]{};:?!-=/_`&""")
        for word in word_list:
            filteredContent = filteredContent + word
    return filteredContent

def split_string(source, splitlist):
    return ''.join([ w if w not in splitlist else ' ' for w in source])
How do I index filteredContent in Xapian so that when I run a query, I get back the URLs of the pages that contain it?
I'm not completely clear what your filterContents() and split_string() are actually trying to do (throwing away some HTML tag contents and then word splitting), so let me talk through a similar problem that doesn't have that complexity folded into it.
Let's assume we have a function strip_tags() which returns just the textual content of an HTML document, and your get_page() function. We want to build up a Xapian database where each document represents the page fetched from one URL, and the words in that page's text (as returned by strip_tags()) become searchable terms that index those documents. So you could index as follows:
import xapian

def index_url(database, url):
    text = strip_tags(get_page(url))
    doc = xapian.Document()

    # TermGenerator will split text into words
    # and then (because we set a stemmer) stem them
    # into terms and add them to the document
    termgenerator = xapian.TermGenerator()
    termgenerator.set_stemmer(xapian.Stem("en"))
    termgenerator.set_document(doc)
    termgenerator.index_text(text)

    # We want to be able to get at the URL easily
    doc.set_data(url)
    # And we want to ensure each URL only ends up in
    # the database once. Note that if your URLs are long
    # then this won't work; consult the FAQ on unique IDs
    # for more: http://trac.xapian.org/wiki/FAQ/UniqueIds
    idterm = 'Q' + url
    doc.add_boolean_term(idterm)
    # Use the database passed in, not a global
    database.replace_document(idterm, doc)

# then index an example URL
db = xapian.WritableDatabase("exampledb", xapian.DB_CREATE_OR_OPEN)
index_url(db, "https://stackoverflow.com/")
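The indexing code relies on a strip_tags() helper that isn't shown. As a rough sketch only (the class name _TextExtractor and the choice to skip script and style contents are my own assumptions, not part of Xapian or of your code), it could be built on the standard library's HTML parser. The import fallback is there so it runs on Python 2.7 as well as Python 3:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping script/style."""

    SKIPPED = ("script", "style")

    def __init__(self):
        HTMLParser.__init__(self)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return only the visible text of an HTML string (a sketch)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Real-world pages are messy, so a dedicated HTML library may well do a better job; this is just enough to make index_url() runnable.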
Searching is then simple, although it can obviously get more sophisticated if needed:
qp = xapian.QueryParser()
qp.set_stemmer(xapian.Stem("en"))
qp.set_stemming_strategy(qp.STEM_SOME)
query = qp.parse_query('question')
# or, for several words (this replaces the query above):
query = qp.parse_query('question and answer')
enquire = xapian.Enquire(db)
enquire.set_query(query)
# Each match's document data is the URL stored by set_data()
for match in enquire.get_mset(0, 10):
    print match.document.get_data()
which will display 'https://stackoverflow.com/', since 'question and answer' is on the homepage when you aren't logged in.
I'd recommend checking out the Xapian getting started guide both for concepts and code.
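One footnote on the long-URL caveat in the indexing code: Xapian limits how long a term can be, so 'Q' + url stops working as a unique ID once URLs get long. A common workaround is to truncate the term and append a hash of the full value. The sketch below is only an illustration: make_id_term() is a hypothetical helper, and the 240-byte cap is an assumed safe figure, so check the documented limit for your backend. Under Python 2.7 it also assumes plain ASCII URLs.

```python
import hashlib


def make_id_term(url, prefix="Q", max_bytes=240):
    """Build a unique-ID term that stays within max_bytes (assumed limit)."""
    term = prefix + url
    encoded = term.encode("utf-8")
    if len(encoded) <= max_bytes:
        return term
    # Too long: keep a readable head and append a hash of the whole term
    digest = hashlib.sha1(encoded).hexdigest()  # 40 hex characters
    keep = max_bytes - len(digest)
    # Truncate at byte level, dropping any partial UTF-8 character
    head = encoded[:keep].decode("utf-8", "ignore")
    return head + digest
```

In index_url() you would then use idterm = make_id_term(url) in place of 'Q' + url; short URLs produce exactly the same term as before.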