Tags: sql, r, monetdb, monetdblite

SQL concatenate many tsv files into single table in a database, while keeping track of file source (MonetDBLite)


I am using the MonetDBLite R package to create a MonetDB database. I can create database tables just fine following the instructions from here, with the following code:

library(DBI)
library(MonetDBLite)

# Write tsv file of mtcars
write.table(mtcars, "mtcars.tsv", row.names=FALSE, sep= "\t")

# Initialize MonetDB
dbdir <- "/Users/admin/my_directory"
con <- dbConnect(MonetDBLite::MonetDBLite(), dbdir)

# Write table
dbWriteTable(con, "test4", "mtcars.tsv", delim="\t")

and the following query gives

> dbGetQuery(con, "SELECT * FROM test4 LIMIT 3")
mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1

So far so good. But, say I have another file mtcars2 with different mpg values:

mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg + 5
write.table(mtcars2, "mtcars2.tsv", row.names= FALSE, sep = "\t")

I can load it to another table:

dbWriteTable(con, "test5", "mtcars2.tsv", delim = "\t")
> dbGetQuery(con, "SELECT * FROM test5 LIMIT 3")
mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 26.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 26.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 27.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Also fine. But my problem is this: I want to later look up the mpg for all cars with 6 cyl, and know which dataset each row came from (mtcars or mtcars2). From what I understand of SQL indexing (which is not a lot, and basically what I've read here), I should keep all my data in one table for the most efficient searches. So I tried loading the first tsv file, then adding another column with ALTER TABLE test4 ADD dataset TEXT and filling it with UPDATE SQL commands:

dbSendQuery(con, "ALTER TABLE test4 ADD dataset TEXT")
# This form fails: unquoted dataset1 is parsed as a column name, not a string
# dbSendQuery(con, "UPDATE test4 SET dataset = dataset1")
dbSendQuery(con, "UPDATE test4 SET dataset = 1")
> dbGetQuery(con, "SELECT * FROM test4 LIMIT 3")
mpg cyl disp  hp drat    wt  qsec vs am gear carb dataset
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4       1
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4       1
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1       1

but then when I tried to append mtcars2 to the table, it had a different number of columns (as I should have expected, duh). What's the best way to concatenate data from many tsv files with identical columns into a single table, while keeping track of the data's source?

EDIT: as you might have guessed, the real data is not mtcars; it's flat tsv files millions of lines long, which means I want to avoid reading whole files into memory and manipulating them in R.


Solution

  • Following xQbert's suggestion, I solved this using SQL commands only (necessary, and faster than bash commands, considering my data is tens of files, each millions of lines long).

    library(DBI)
    library(MonetDBLite)
    
    # Write tsv file of mtcars
    write.table(mtcars, "mtcars.tsv", row.names=FALSE, sep= "\t")
    
    # Write tsv of second mtcars
    mtcars2 <- mtcars
    mtcars2$mpg <- mtcars2$mpg + 5
    write.table(mtcars2, "mtcars2.tsv", row.names= FALSE, sep = "\t")
    
    # Initialize MonetDB
    dbdir <- "/Users/admin/"
    con <- dbConnect(MonetDBLite::MonetDBLite(), dbdir)
    
    # Write table
    dbWriteTable(con, "test4", "mtcars.tsv", delim="\t")
    
    # Add data source information
    dbSendQuery(con, "ALTER TABLE test4 ADD source TEXT")
    dbSendQuery(con, "UPDATE test4 SET source = 'dataset1'")
    
    # Write second dataset to a temporary table
    dbWriteTable(con, "temptable", "mtcars2.tsv", delim="\t")
    
    # Add data source information
    dbSendQuery(con, "ALTER TABLE temptable ADD source TEXT")
    dbSendQuery(con, "UPDATE temptable SET source = 'dataset2'")
    
    # Insert temp table into main table (INSERT ... SELECT * relies on the
    # two tables having identical column order and types)
    dbSendQuery(con, "INSERT INTO test4 SELECT * FROM temptable")
    
    # Drop temp table
    dbSendQuery(con, "DROP TABLE temptable")
    
    # Checking the data, truncated for clarity
    > dbGetQuery(con, "SELECT * FROM test4")
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb   source
    1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 dataset1
    2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 dataset1
    3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 dataset1
    ...
    33 26.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 dataset2
    34 26.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 dataset2
    35 27.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 dataset2
    ...
    64 26.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2 dataset2
    

    Sorry if I didn't make it clear enough in the question that my data is much larger than mtcars; if you have medium-sized data, the data.table package is probably a better solution than a database.
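    For tens of files, the same temp-table pattern can be wrapped in a loop. This is only a sketch under a few assumptions: the directory path, the table name alldata, and tagging each row with its file name are all illustrative, and every tsv file is assumed to have identical columns. Nothing here reads a full file into R; the loading is still done by dbWriteTable's bulk path and the tagging by SQL.

    ```r
    library(DBI)
    library(MonetDBLite)

    con <- dbConnect(MonetDBLite::MonetDBLite(), "/Users/admin/my_directory")

    # All tsv files to load (directory path is illustrative)
    files <- list.files("data_dir", pattern = "\\.tsv$", full.names = TRUE)

    for (f in files) {
      # Bulk-load each file into a temporary table
      dbWriteTable(con, "temptable", f, delim = "\t")

      # Tag every row with the file it came from
      dbSendQuery(con, "ALTER TABLE temptable ADD source TEXT")
      dbSendQuery(con, sprintf("UPDATE temptable SET source = '%s'", basename(f)))

      if (!dbExistsTable(con, "alldata")) {
        # First file: the tagged temp table becomes the main table
        dbSendQuery(con, "CREATE TABLE alldata AS SELECT * FROM temptable WITH DATA")
      } else {
        # Later files: append (column order must match the main table)
        dbSendQuery(con, "INSERT INTO alldata SELECT * FROM temptable")
      }
      dbSendQuery(con, "DROP TABLE temptable")
    }

    # The lookup from the question: mpg of all 6-cylinder cars, with source
    dbGetQuery(con, "SELECT mpg, source FROM alldata WHERE cyl = 6")
    ```

    Using basename(f) as the source value means the tag is derived automatically from each file name instead of being hard-coded per file, which matters once there are dozens of inputs.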