I have 185 files of data, containing a total of 30 million rows. Each row has two columns: a single int which I want to use as an index, and a list of 512 ints.
So it looks something like this:
IndexID Ids
1899317 [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131 [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410 [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716 [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098 [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...
The data is too large to load into memory, but I would like to retrieve, say, a couple hundred rows at a time using a list of indices.
I got advice from a comment on this question to use Parquet: Most efficient way of saving a pandas dataframe or 2d numpy array into h5py, with each row a seperate key, using a column
I've been looking at the official Parquet Python guide
https://arrow.apache.org/docs/python/parquet.html
and the fastparquet guide
https://fastparquet.readthedocs.io/en/latest/api.html
but I can't seem to find any way to retrieve a row using an index, or to tell whether the table stays on disk or is loaded entirely into memory.
Is this possible? If so, how would I do something like this?
For example
ParquetTable[22861131, 15760716]
[0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1... [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
Parquet is a columnar data store, which does not fit your use case. If your goal is to store data too large to fit in memory and still be able to retrieve a few rows at a time to work on, I would suggest you use a database.
The simplest database to start with is sqlite3, which is built into Python itself. SQLite databases are stored as single files instead of requiring you to set up a database server.
Before we continue, you'll need to break your column of lists down into 512 separate columns, because (1) that makes the values easier to work with, and (2) lists are not natively supported as a database column type.
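For instance, here is a rough sketch of how you might expand the list column into individual columns before writing to the database. The column names 'IndexID', 'Ids' and 'col_0', 'col_1', ... are assumptions based on your sample, so adjust them to your data:
# Hypothetical sketch: expand the 'Ids' list column into one column per element.
# Assumes each list has the same length (512 in your case).
import pandas as pd

raw = pd.DataFrame({
    'IndexID': [1899317, 22861131],
    'Ids': [[0, 47715, 1757], [0, 48156, 154]],  # truncated to 3 ints for illustration
})

# One row per IndexID, one column per list element
wide = pd.DataFrame(raw['Ids'].tolist(), index=raw['IndexID'])
wide.columns = ['col_%d' % i for i in wide.columns]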
Here's a minimal example on how you can dump your data and retrieve the rows you need:
# Creating sample data to work on
import pandas as pd
import numpy as np
df = pd.DataFrame(
    data=np.random.randint(low=1, high=1000, size=(1000, 3)),
    columns=['a', 'b', 'c'],
    index=pd.Series(range(1000), name='IndexID')
)
import sqlite3
# Write dataframes to database
with sqlite3.connect('sqlite.db') as conn:
    df.to_sql('data', con=conn, if_exists='append')
The code above dumps the contents of df into a SQLite database named sqlite.db in your current working directory. Notice the if_exists option in the to_sql call; change it to 'replace' if you want to overwrite an existing table instead of appending to it.
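Because if_exists='append' adds rows to the existing table, you can load all 185 files one at a time in a loop. This is only a sketch; the glob pattern and pd.read_parquet call are placeholders for however your files are actually stored, so swap in whatever reader matches your format:
# Hypothetical sketch: append each of the 185 files to the same table.
import glob
import sqlite3
import pandas as pd

with sqlite3.connect('sqlite.db') as conn:
    for path in sorted(glob.glob('data/part_*.parquet')):  # assumed naming scheme
        chunk = pd.read_parquet(path)        # or pd.read_csv, etc.
        chunk = chunk.set_index('IndexID')   # assumes the id column is named IndexID
        # expand the list column into 512 columns here, as described above,
        # before writing
        chunk.to_sql('data', con=conn, if_exists='append')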
When you want to retrieve specific rows, you can run the following from the same working directory. This example retrieves rows with rowid 200 through 210 (since rowid is 1-based, these correspond to IndexID 199 through 209 in the sample data):
# How to read from database
with sqlite3.connect('sqlite.db') as conn:
    # `rowid` is SQLite's implicit 1-based row number,
    # not the IndexID column we stored
    query = "SELECT * FROM data WHERE rowid BETWEEN %d AND %d" % (200, 210)
    subset = pd.read_sql(query, con=conn)

print(subset)
# This prints the following
# IndexID a b c
# 0 199 704 3 423
# 1 200 590 299 767
# 2 201 45 953 560
# 3 202 237 662 746
# 4 203 123 920 275
# 5 204 453 10 370
# 6 205 35 628 602
# 7 206 957 465 735
# 8 207 602 810 154
# 9 208 927 796 352
# 10 209 969 130 217
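Since your goal is to pull rows by a list of IndexID values rather than by position, you can also filter on the IndexID column directly. Here is a sketch using parameter placeholders (the ids list is just an example):
# Hypothetical sketch: fetch rows for a specific list of IndexID values,
# using placeholders to build the IN (...) clause safely.
ids = [22861131, 15760716]
placeholders = ','.join('?' * len(ids))
query = "SELECT * FROM data WHERE IndexID IN (%s)" % placeholders

with sqlite3.connect('sqlite.db') as conn:
    subset = pd.read_sql(query, con=conn, params=ids)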
For more info on moving data between sqlite3 and pandas, I recommend reading this guide: https://www.dataquest.io/blog/python-pandas-databases/ and the question How to open and convert sqlite database to pandas dataframe.