pythonperformancehypertable

for loop performance in huge list or dictionary


I am making a research project on search engines, and i am having problems with a performance of the for loop. I have the following problem:

for value in hash_array.keys():  
        cell= db_conn.use_client().hql_query(db_conn.use_namespace(),'SELECT doc_text FROM SE_doc_text WHERE ROW=\"'+
                             value+"\" SCAN_AND_FILTER_ROWS LIMIT 1").cells  
        list_result[value].append(cell[0].value)

Using hypertable i want to have the best performance, but when i am using it on 20.000 or 30.000 elements of the hash_array.keys() i have a result like 5-6 min, so that's slow really slow. Any other way of performing a fast for? or should i use my plan b for the whole search engine.

Performance of the query:

  Elapsed time:  0.10 s
Avg value size:  1428.00 bytes
  Avg key size:  57.00 bytes
    Throughput:  595190.38 bytes/s
   Total cells:  1
    Throughput:  400.80 cells/s

Tested on hardware:

8gb 1600mhz
1055t amd

Additional facts:

 Without SCAN_AND_FILTER_ROWS: 1.78163504601ms
 With SCAN_AND_FILTER_ROWS: 3.27163504601ms
 Only for loop: 0.0630ms

With the solution (from the answer) i have the following:

stringRow = ' AND '.join(["ROW = \""+value + "\"" for value in hash_array.keys()])
cell = db_conn.use_client().hql_query(db_conn.use_namespace(),stringBatch+stringRow).cells  

Results:  0.355320930481ms for 2097 documents
          1.0214779377 for 3565 documents

It is good but it's not best solution.


Solution

  • I've never used Hypertable but simply reading the documentation suggest the SCAN_AND_FILTER_ROWS clause may be a problem:

    This is an explicit optimization for the case where you're querying for a very large number of row intervals (e.g. 10,000+). Instead of fetching each row interval independently, this option will cause the system to do a full table scan and filter the results to find the rows that are desired. Use this option with caution, it can be extremely inefficient for smaller number of row intervals.

    This clause may become more appropriate if you send batches of keys at once.

    Reading more documentation suggests batching would be possible with the following construct:

    | '(' [row_key relop] ROW relop row_key
          (OR [row_key relop] ROW relop row_key)* ')'
    

    So, construct a query with a disjunction of, say, 100 separate keys: ROW=x OR ROW=y OR....