How can I speed up my joins against CSV files?
I've got a query that is joining 8 files:
//please note this is the simplified query
DECLARE ... //summarizing here
FROM ...
USING Extractors.Text(delimiter : '|');
//8 more statements like the above omitted
SELECT one.an_episode_id,
one.id2_enc_kgb_id,
one.suffix,
two.suffixa,
three.suffixThree,
four.suffixFour,
five.suffixFive,
six.suffixSix,
seven.suffixSeven,
eight.suffixEight,
nine.suffixNine,
ten.suffixTen
FROM @one_files AS one
JOIN @two_files AS two
ON one.id3_enc_kgb_id == two.id3_enc_kgb_id
JOIN @three_files AS three
ON three.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @four_files AS four
ON four.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @five_files AS five
ON five.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @six_files AS six
ON six.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @seven_files AS seven
ON seven.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @eight_files AS eight
ON eight.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @nine_files AS nine
ON nine.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @ten_files AS ten
ON ten.id2_enc_kgb_id == one.id2_enc_kgb_id;
I submitted the job to Azure and had to cancel it after a few hours and about $80 in cost!
It was my understanding that Data Lake is meant exactly for this type of job?! I've got perhaps 100 files total, totaling maybe 20 MB of data.
How can I speed up my joins?
It is important to note that small files are suboptimal in every scenario. The suggested solution for small files, from Michal Rys, is to consider alternative ways to concatenate them into larger files.
Note: a fast file set allows you to consume hundreds of thousands of such files in bulk in a single EXTRACT.
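As a minimal sketch (the folder path and the {filename} virtual column name below are assumptions, not from the original question), a file-set pattern lets one EXTRACT read every matching file at once, assuming they all share the same schema:
// Sketch only: assumes all the small CSV files share the same schema and sit under /input/one/.
// {filename} is a virtual column populated from the file-set pattern; it must be declared in the EXTRACT schema.
@one_files =
    EXTRACT an_episode_id string,
            id2_enc_kgb_id string,
            suffix string,
            filename string
    FROM "/input/one/{filename}.csv"
    USING Extractors.Text(delimiter : '|');
This avoids issuing one EXTRACT per file and lets the job scale out over the whole set in a single statement.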
I would use INNER JOIN instead of JOIN to be explicit about which join you are really using.
It is also rather important to look at how you EXTRACT the information from the CSV files. The JOINed result should then be OUTPUT to a TSV (tab-separated values) file. (Note: TVF is something different, a table-valued function, used for U-SQL code reuse.)
The TSV format should be very efficient for U-SQL (I have not measured it myself).
For completeness, there are three different built-in outputter types: .Text(), .Csv(), and .Tsv().
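A minimal sketch of all three (the @result rowset and the output paths are placeholders, not from the original question):
// Sketch only: @result and the output paths are assumptions.
OUTPUT @result TO "/outputs/result.txt" USING Outputters.Text(delimiter : '|');
OUTPUT @result TO "/outputs/result.csv" USING Outputters.Csv();
OUTPUT @result TO "/outputs/result.tsv" USING Outputters.Tsv();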
Your example is missing the variable definitions, so I'll try to guess them:
USE DATABASE <your_database>;
USE SCHEMA <your_schema>;
DECLARE @FirstCsvFile string = "/<path>/first.csv";
@firstFile = EXTRACT an_episode_id string, id2_enc_kgb_id string, suffix string
FROM @FirstCsvFile USING Extractors.Text(delimiter : '|');
// probably 8 more statements that were omitted in the OP
@encode = SELECT one.an_episode_id,
one.id2_enc_kgb_id,
one.suffix,
two.suffixa,
three.suffixThree,
four.suffixFour,
five.suffixFive,
six.suffixSix,
seven.suffixSeven,
eight.suffixEight,
nine.suffixNine,
ten.suffixTen
FROM @firstFile AS one
INNER JOIN @two_files AS two
ON one.id3_enc_kgb_id == two.id3_enc_kgb_id
INNER JOIN @three_files AS three
ON three.id3_enc_kgb_id == one.id3_enc_kgb_id
INNER JOIN @four_files AS four
ON four.id3_enc_kgb_id == one.id3_enc_kgb_id
INNER JOIN @five_files AS five
ON five.id3_enc_kgb_id == one.id3_enc_kgb_id
INNER JOIN @six_files AS six
ON six.id2_enc_kgb_id == one.id2_enc_kgb_id
INNER JOIN @seven_files AS seven
ON seven.id2_enc_kgb_id == one.id2_enc_kgb_id
INNER JOIN @eight_files AS eight
ON eight.id2_enc_kgb_id == one.id2_enc_kgb_id
INNER JOIN @nine_files AS nine
ON nine.id2_enc_kgb_id == one.id2_enc_kgb_id
INNER JOIN @ten_files AS ten
ON ten.id2_enc_kgb_id == one.id2_enc_kgb_id;
OUTPUT @encode TO "/outputs/encode_joins.tsv" USING Outputters.Tsv();