Tags: data-modeling, ravendb, ravendb4

RavenDB modelling/indexing: address verification with reference data


I'm using RavenDB 4.2.3 stable on Windows x64, and I'm trying to load an artificial document by id from within another index.

I'm working on an ETL system, where a significant part of the work is address verification. I have reference data for valid addresses, consisting of country, state, zipcode, city/community, street and house number range (where applicable), plus additional info like LAT/LON, type of community and so on. The typical input to be verified would consist of tens or hundreds of thousands of address records (usually zipcode, community, street, house number), of which about 30% would not match a record in the reference data (after normalization, of course) because of misspellings, transpositions, etc.
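For concreteness, the documents might look roughly like this (a sketch; all class and property names here are illustrative, not an actual schema):

```csharp
// Illustrative shape of one reference-data document.
public class ReferenceAddress
{
    public string Id { get; set; }              // deterministic, e.g. built from the normalized parts
    public string Country { get; set; }
    public string State { get; set; }
    public string Zip { get; set; }
    public string Community { get; set; }
    public string Street { get; set; }
    public int? HouseNumberFrom { get; set; }   // house number range, where applicable
    public int? HouseNumberTo { get; set; }
    public double Lat { get; set; }
    public double Lon { get; set; }
    public string CommunityType { get; set; }
}

// Illustrative shape of one input record to be verified.
public class InputAddress
{
    public string Id { get; set; }
    public string Zip { get; set; }
    public string Community { get; set; }
    public string Street { get; set; }
    public int HouseNumber { get; set; }
}
```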

Those, I assume, I would have to query for one by one using fuzzy queries, then either correct automatically when within a predefined Damerau-Levenshtein distance or filter out for manual correction (supplying the best matches along with them).
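Per record, such a fuzzy lookup could look roughly like this with the RavenDB 4.x C# client (a sketch using the illustrative ReferenceAddress from above, with session and normalizedStreet assumed in scope; note that Fuzzy() takes a Lucene similarity factor between 0 and 1, not an edit distance, so the Damerau-Levenshtein threshold would be applied client-side to the candidates):

```csharp
// Sketch: fetch fuzzy candidates for one unmatched input address.
var candidates = session.Advanced
    .DocumentQuery<ReferenceAddress>()
    .WhereEquals(x => x.Street, normalizedStreet)
    .Fuzzy(0.7m)   // Lucene similarity factor (0..1); higher requires a closer match
    .Take(5)
    .ToList();

// Rank candidates by Damerau-Levenshtein distance client-side: auto-correct
// below the predefined threshold, otherwise queue the record for manual
// correction together with these best matches.
```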

But what about the other 70% or so that do have a direct match in the reference data? Those I don't want to query for one by one, because of network latency and per-request overhead, even if exact queries are faster than fuzzy ones. I need an index for that.

Since I cannot run queries inside an index definition, my idea is to build a collection from this reference data, using a hash of the (normalized) address parts (minus the house number) as part of the entity Id (or even using the address parts directly), and storing the normalized and canonical addresses in it. I could then use LoadDocument() on this reference collection in an index over the addresses to be verified: if LoadDocument() returns a non-null entity and the house number is within the range of the returned reference address, the address is 100% verified; everything else gets filtered out for the slower queries.
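In code, the idea would look roughly like this (a sketch assuming the variant with Ids built directly from the address parts, e.g. "ReferenceAddresses/{zip}-{community}-{street}", since a custom hash function is not available inside an index definition; InputAddress and ReferenceAddress are the illustrative shapes from above):

```csharp
using System.Linq;
using Raven.Client.Documents.Indexes;

// Sketch: marks an input address as verified when its reference document
// exists and the house number falls inside the stored range.
public class InputAddresses_Verified
    : AbstractIndexCreationTask<InputAddress, InputAddresses_Verified.Result>
{
    public class Result
    {
        public bool Verified { get; set; }
    }

    public InputAddresses_Verified()
    {
        Map = addresses =>
            from a in addresses
            // Deterministic reference Id built from the normalized parts:
            let reference = LoadDocument<ReferenceAddress>(
                "ReferenceAddresses/" + a.Zip + "-" + a.Community + "-" + a.Street)
            select new Result
            {
                Verified = reference != null
                           && a.HouseNumber >= reference.HouseNumberFrom
                           && a.HouseNumber <= reference.HouseNumberTo
            };
    }
}
```

Everything with Verified == false would then be exactly the set that needs the slower fuzzy path.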

Does that sound like a good solution given the requirements, or am I missing something and there is a better, more RavenDB-ish way?

One thing I don't think is possible is to build said reference collection as the output of a map/reduce index, since I can't control the Id generation: the Id of an artificial document is a hash of the group-by fields, but I have no access to this hashing function in the index definition for the addresses to be verified, so I can't reproduce it as the document Id parameter for LoadDocument().

I found an issue for a feature to control artificial document Id generation, but it is not implemented yet: https://issues.hibernatingrhinos.com/issue/RavenDB-12932


Solution

  • The easiest way to handle that is to do something like: var code = sha256( (u.Zip + u.Community + u.Street + u.HouseNumber).ToLower() ); You get a bunch of records, send their codes to the database, and see what you already have. It is better to do that in batches (calling Load(IEnumerable<string>)) to avoid a call per item. Because you have a 70% / 30% existing / new rate, you can also use a bloom filter on the hashes and likely save a lot of database queries, because you'll know when a value isn't there (although if the filter says it might be, you'll still need to check whether it is the right one). See the sketch below.

    I'm not sure how the index / LoadDocument is relevant here. I would only go that route if you need to do more complex searches. Unfortunately, you probably do need to handle things like "First Av" vs "1st Avenue", etc., no?
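A sketch of that flow, reusing the illustrative types from the question (Sha256Hex and the bloom-filter predicate are placeholder helpers, not RavenDB APIs; Load(IEnumerable<string>) does one round-trip per batch):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using Raven.Client.Documents.Session;

public static class AddressMatcher
{
    // Deterministic code over the normalized address parts.
    public static string Sha256Hex(string zip, string community, string street, int houseNumber)
    {
        var normalized = (zip + community + street + houseNumber).ToLower();
        using (var sha = SHA256.Create())
        {
            var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(normalized));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }

    // One batch, one round-trip. Ids the bloom filter rules out never reach
    // the database; ids it lets through must still be verified by the Load(),
    // since bloom filters can return false positives.
    public static Dictionary<string, ReferenceAddress> MatchBatch(
        IDocumentSession session,
        IEnumerable<InputAddress> batch,
        Func<string, bool> bloomMightContain) // placeholder for any bloom-filter implementation
    {
        var ids = batch
            .Select(a => "ReferenceAddresses/" +
                         Sha256Hex(a.Zip, a.Community, a.Street, a.HouseNumber))
            .Where(bloomMightContain)
            .ToList();

        // Missing ids come back as null values in the returned dictionary.
        return session.Load<ReferenceAddress>(ids);
    }
}
```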