I have a (syntactically-valid, not-huge) CSV file. I want to run an arbitrary SQL SELECT query on it; let's assume it can be a complex query, with subqueries, self-joins, maybe window functions, etc.
Now, obviously, I could install a DBMS, run the server process, create a new DB, maybe create an appropriate table, load that CSV data into a table, then finally use a client for the DB to send the query and get the result.
But - that seems like quite a hassle, and I don't need the loaded DB table again for another query.
So, is there an easier way to do this, from a command-line/shell session, hopefully without so much overhead and waste of space?
You can use sqlite for this:
sqlite3 :memory: -cmd '.import -csv taxi.csv taxi' \
  'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count'
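Since the question mentions CTEs and window functions, the same one-liner pattern extends to those as well; here is a sketch (reusing the placeholder taxi.csv columns from above):

sqlite3 :memory: -cmd '.import -csv taxi.csv taxi' \
  'WITH per_count AS (
     SELECT passenger_count, AVG(total_amount) AS avg_amount
     FROM taxi GROUP BY passenger_count
   )
   SELECT passenger_count, avg_amount,
          RANK() OVER (ORDER BY avg_amount DESC) AS rnk
   FROM per_count'

Nothing is persisted: the table lives only in the in-memory database for the duration of the command.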
There are many wrappers that try to simplify this (e.g. csvsql --query, dsq, or q), but you may find they have limitations of their own (e.g. q doesn't support CTEs, though sqlite obviously does).
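For reference, the wrapper invocations look roughly like this (a sketch; check each tool's docs for the exact flags, and note the differing table-name conventions):

# csvkit's csvsql: the table name is the file's basename
csvsql --query 'SELECT passenger_count, COUNT(*) FROM taxi GROUP BY passenger_count' taxi.csv

# q: reference the CSV file directly in FROM; -H uses the header row for column names, -d sets the delimiter
q -H -d ',' 'SELECT passenger_count, COUNT(*) FROM taxi.csv GROUP BY passenger_count'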
duckdb, as an analytics-focused alternative to sqlite, makes this use case very simple and fast out of the box, while still providing some flexibility:
duckdb -c "SELECT * FROM 'data_source.csv.gz'"