I am using the MonetDBLite R package to create a MonetDB. I can create database tables just fine using the instructions from here, with the following code:
library(DBI)
library(MonetDBLite)
# Write tsv file of mtcars
write.table(mtcars, "mtcars.tsv", row.names=FALSE, sep= "\t")
# Initialize MonetDB
dbdir <- "/Users/admin/my_directory"
con <- dbConnect(MonetDBLite::MonetDBLite(), dbdir)
# Write table
dbWriteTable(con, "test4", "mtcars.tsv", delim="\t")
and the following query gives
> dbGetQuery(con, "SELECT * FROM test4 LIMIT 3")
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
So far so good. But, say I have another file mtcars2 with different mpg values:
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg + 5
write.table(mtcars2, "mtcars2.tsv", row.names= FALSE, sep = "\t")
I can load it to another table:
dbWriteTable(con, "test5", "mtcars2.tsv", delim = "\t")
> dbGetQuery(con, "SELECT * FROM test5 LIMIT 3")
mpg cyl disp hp drat wt qsec vs am gear carb
1 26.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 26.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 27.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Also fine. But my problem is this: I want to later on look up the mpg
for all cars with 6 cyl
, and know which dataset it came from (mtcars or mtcars2). From what I understand of SQL indexing (which is not a lot and basically what I've read here), I should have all my data in one table to have the most efficient searches. I tried loading the first tsv file, then added another column using ALTER TABLE test4 ADD dataset TEXT
and UPDATE test4 SET dataset = dataset1
sql commands-
dbSendQuery(con, "UPDATE test4 SET dataset = dataset1")
dbSendQuery(con, "UPDATE test4 SET dataset = 1")
> dbGetQuery(con, "SELECT * FROM test4 LIMIT 3")
mpg cyl disp hp drat wt qsec vs am gear carb dataset
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
but then when I tried to append mtcars2 to the table, it had a different number of columns (as I should have expected, duh). What's the best way to concatenate data from many tsv files with identical columns into a single table, while keeping track of the data's source?
EDIT- as you might have guessed, the real data is not mtcars- it's flat tsv files millions of lines long, meaning I want to avoid reading the whole file into memory and manipulating with R.
Following xQbert's suggestion, I solved using SQL commands only (necessary and faster than bash commands, considering my data is 10s of files, each millions of lines long).
library(DBI)
library(MonetDBLite)
# Write tsv file of mtcars
write.table(mtcars, "mtcars.tsv", row.names=FALSE, sep= "\t")
# Write tsv of second mtcars
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg + 5
write.table(mtcars2, "mtcars2.tsv", row.names= FALSE, sep = "\t")
# Initialize MonetDB
dbdir <- "/Users/admin/"
con <- dbConnect(MonetDBLite::MonetDBLite(), dbdir)
# Write table
dbWriteTable(con, "test4", "mtcars.tsv", delim="\t")
# Add data source information
dbSendQuery(con, "ALTER TABLE test4 ADD source TEXT")
dbSendQuery(con, "UPDATE test4 SET source = 'dataset1'")
# Write second dataset to a temporary table
dbWriteTable(con, "temptable", "mtcars2.tsv", delim="\t")
# Add data source information
dbSendQuery(con, "ALTER TABLE temptable ADD source TEXT")
dbSendQuery(con, "UPDATE temptable SET source = 'dataset2'")
# Insert temp table into main table
dbSendQuery(con, "INSERT INTO test4 SELECT * FROM temptable")
# Drop temp table
dbSendQuery(con, "DROP TABLE temptable")
# Checking the data, truncated for clarity
> dbGetQuery(con, "SELECT * FROM test4")
mpg cyl disp hp drat wt qsec vs am gear carb source
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 dataset1
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 dataset1
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 dataset1
...
33 26.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 dataset2
34 26.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 dataset2
35 27.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 dataset2
...
64 26.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 dataset2
Sorry if I didn't make it clear enough in the question that my data is much larger than mtcars- if you have medium sized data, the data.tables
package is probably a better solution than a database.