python xapian

How to use different ids in Xapian?


I'm trying to implement a search with Xapian. My documents have their own ids, which are strings. I already do as the tutorial says:

db.replace_document(doc.docno, doc_x)

where doc.docno is the string that identifies the document. But when I search:

for match in enquire.get_mset(0, 10):
    print(match.document.get_docid())

The docid I get back is just a plain number. Does anyone know whether I need to do something else?


Solution

  • Xapian document ids are always numbers. However, Xapian provides a mechanism for addressing documents by term as well as by id. replace_document() and delete_document() can be given a string, as you have done; they treat it as a term, find all existing documents indexed by that term, and remove them from the database. replace_document() then creates the new document, re-using the lowest matching (numeric) document id, or allocating a new id if no documents match.

    The documentation for this variant of replace_document() says:

    One common use is to allow UIDs from another system to easily be mapped to terms in Xapian. Note that this method doesn't automatically add unique_term as a term, so you'll need to call document.add_term(unique_term) first when using replace_document() in this way.

    If you're using QueryParser, or are otherwise following the term prefixing convention that a lot of Xapian systems follow, then it's common to use Q as the prefix for a unique id. The term you add has to be the same one you pass to replace_document(), so pass 'Q' + doc.docno there too. This means you probably want to do the following before you call replace_document():

    doc_x.add_term('Q' + doc.docno)
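Putting the two steps together, here is a minimal sketch of an "upsert" helper. The names ID_PREFIX, unique_term and upsert are made up for illustration and are not part of Xapian; db is assumed to behave like a xapian.WritableDatabase and doc like a xapian.Document, and only add_term() and replace_document() are called on them.

```python
ID_PREFIX = "Q"  # conventional prefix for a unique-id term


def unique_term(external_id):
    """Build the term that identifies one external id, e.g. 'Qdoc-42'."""
    return ID_PREFIX + external_id


def upsert(db, doc, external_id):
    """Index doc under its external id, replacing any earlier version.

    db: an object like xapian.WritableDatabase (assumption).
    doc: an object like xapian.Document (assumption).
    Returns whatever replace_document() returns, which for Xapian is the
    numeric docid that was re-used or newly allocated.
    """
    term = unique_term(external_id)
    # replace_document() does not add the unique term itself, so add it first.
    doc.add_term(term)
    # Pass the very same term so that existing copies are found and replaced.
    return db.replace_document(term, doc)
```

With the question's variables this would be called as upsert(db, doc_x, doc.docno). Keeping the prefixing in one function means the term that is added and the term used for the lookup cannot drift apart.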

    Then, when you query the database, you'll need to get your own id back out of each match. You can do this by reading from the document's termlist, but that's a bit fiddly, so it's more common to store your "external" id (external to Xapian) in the Document data. (I often store JSON in there, to provide some scope for growth of what I need; for instance it can sometimes be useful to include all the information needed to render a search result in the document data.)
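Both ways of recovering the id can be sketched as follows. The helper names (pack_data, external_id_from_data, external_id_from_terms) and the "id" key are assumptions made for this example; the only Xapian calls relied on are Document.get_data() and Document.termlist(), and the term-based version assumes Q is used for nothing but the unique id.

```python
import json


def pack_data(external_id, **fields):
    """Serialise the external id, plus optional display fields, as JSON."""
    payload = dict(fields)
    payload["id"] = external_id
    return json.dumps(payload)


def external_id_from_data(document):
    """Read the external id back from the document data."""
    # get_data() returns bytes under the Python 3 bindings;
    # json.loads accepts both bytes and str.
    return json.loads(document.get_data())["id"]


def external_id_from_terms(document, prefix="Q"):
    """The fiddlier alternative: scan the termlist for the prefixed term."""
    for item in document.termlist():
        term = item.term
        if isinstance(term, bytes):
            term = term.decode("utf-8")
        if term.startswith(prefix):
            return term[len(prefix):]
    return None  # no unique-id term on this document
```

At indexing time you would call doc_x.set_data(pack_data(doc.docno)) before replace_document(), and in the search loop use external_id_from_data(match.document) where the question calls get_docid().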

    This approach is covered in Xapian's FAQ on working with existing unique ids, particularly the second section, "Using a term for the external unique id".