I'm trying to do some n-gram counting in Python and thought I could use MySQL (the MySQLdb module) to organize my text data.
I have a pretty big table, around 10 million records, representing documents indexed by a unique numeric id (auto-increment) and by a language varchar field (e.g. "en", "de", "es", etc.).
select * from table
is too slow and uses far too much memory.
I ended up splitting the whole id range into smaller ranges (say 2,000 records each) and processing each of those smaller record sets one by one with queries like:
select * from table where id >= 1 and id <= 2000
select * from table where id >= 2001 and id <= 4000
and so on...
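In Python the loop looks roughly like this (a minimal sketch; the connection settings and the table/column names "documents", "id", "lang", "text" are just placeholders):

import MySQLdb

CHUNK = 2000

# placeholder connection settings
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="corpus", charset="utf8")
cur = conn.cursor()

cur.execute("SELECT MAX(id) FROM documents")
max_id = cur.fetchone()[0]

for start in range(1, max_id + 1, CHUNK):
    cur.execute(
        "SELECT id, lang, text FROM documents WHERE id >= %s AND id <= %s",
        (start, start + CHUNK - 1),
    )
    for doc_id, lang, text in cur.fetchall():
        pass  # n-gram counting happens here

cur.close()
conn.close()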
Is there any way to do this more efficiently with MySQL and achieve performance similar to reading a big corpus text file serially?
I don't care about the ordering of the records; I just want to be able to process all the documents that belong to a certain language in my big table.
You can use the HANDLER statement to traverse a table (or an index) in chunks. It is not very portable, and it interacts in an "interesting" way with transactions if rows appear and disappear while you're looking at the table (hint: you're not going to get consistency), but it makes the code simpler for some applications.
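From Python it might look something like this (a minimal sketch, assuming MySQLdb and a table called documents with an auto-increment primary key; HANDLER ... READ `PRIMARY` NEXT LIMIT n keeps returning the next chunk in primary-key order until the result comes back empty):

import MySQLdb

# placeholder connection settings and table name
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="corpus")
cur = conn.cursor()

cur.execute("HANDLER documents OPEN")
cur.execute("HANDLER documents READ `PRIMARY` FIRST LIMIT 2000")
rows = cur.fetchall()
while rows:
    for row in rows:
        pass  # process the document / count n-grams here
    # continue from where the previous read stopped
    cur.execute("HANDLER documents READ `PRIMARY` NEXT LIMIT 2000")
    rows = cur.fetchall()
cur.execute("HANDLER documents CLOSE")

cur.close()
conn.close()

Note that an open handler is tied to the connection that opened it, so keep all the reads on the same connection.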
In general you are going to take a performance hit: even if your database server is local to the machine, several copies of the data will have to exist in memory, along with some other processing. This is unavoidable, and if it really bothers you, you shouldn't use MySQL for this purpose.