pythonmysqlpostgresqlrelational-databaseaudio-fingerprinting

How to more efficiently search through an acoustid database with over 30 million rows?


I'm currently playing around with an open source music recognition project called acoustid. I've imported a table with over 30 million rows(300gb of data) but it takes A TON of time to simply SELECT these rows. Currently, selecting 200,000 rows can take 30 seconds.

The project offers acoustid-index to index the rows by only looking up the first 15 seconds of a fingerprint and storing this on the hdd... which is then loaded into ram. https://bitbucket.org/acoustid/acoustid-index/overview

Only, I have no idea how to use this.The directions are confusing. It seems this was created for PostgreSQL. I'm using MySQL and Python on the server I'm working on. Can I still use this to index my db?

Any suggestions as to how I use this to index the rows in the database? Are there any other ways I can make the search through this database more efficient?


Solution

  • When dealing with a lot of data, like in this case, you need to understand and exploit the structure to work with it effectively. You can't have a blob in your database and expect to magically index it and have fast search.

    If you had text documents, the normal approach would be to use a search engine that parses the text, extracts words from it, possibly does some post-processing on them and then creates an index on those words. This is a common use case and the MySQL fulltext indexes do this, for example.

    In your case, you have acoustic fingerprints produced by Chromaprint, which are far less common use case. There is no built-in solution that will make the search fast. It's up to you how you index the data and how you search it is. You need to understand that the fingerprints consist of 32-bit hashes (equivalent of words in a text document) and you need to understand how inverted indexes work. If you index the fingerprints by the hashes, you avoid the need to scan the entire database, you will be only looking for the specific hashes in your inverted index.

    You could build a very crude inverted index in MySQL using a table like this:

    CREATE TABLE fingerprint_hash (
      hash INT NOT NULL,
      fingerprint_id INT NOT NULL,
    );
    

    Then you load your data and create a physical index:

    CREATE INDEX fingerprint_hash_idx_hash ON fingerprint_hash(hash);
    

    Once you have this, you can query the index like this:

    SELECT fingerprint_id, COUNT(*) AS num_matching_hashes
    FROM fingerprint_hash
    WHERE hash IN (627833118,627767582,627697982,627624254,627956095,...)
    GROUP BY fingerprint_id
    

    That will give you fingerprint IDs that have some common hashes.

    Note that the above will most likely still be slow. The custom AcoustID index uses a very compact format that fits as much data as possible in memory, it only indexes certain parts of the fingerprints and it doesn't even store the entire hashes, it truncates some of the bits. All that is done to make the search fast. And it still wouldn't be fast enough on an average server that's typically used for hosting a web site.