azure-data-lakeu-sql

how to optimize multi join job


How can I speed up my joins against CSV files?

I've got a query that is joining 8 files:

//please note this is the simplified query
DECLARE ... //summarizing here
FROM ...
USING Extractors.Text(delimiter : '|');

//8 more statements like the above ommitted

SELECT  one.an_episode_id, 
        one.id2_enc_kgb_id, 
        one.suffix, 
        two.suffixa, 
        three.suffixThree, 
        four.suffixFour, 
        five.suffixFive,
        six.suffixSix,
        seven.suffixSeven,
        eight.suffixEight,
        nine.suffixNine,
        ten.suffixTen  
FROM @one_files AS one
JOIN @two_files AS two 
  ON one.id3_enc_kgb_id == two.id3_enc_kgb_id
JOIN @three_files AS three  
  ON three.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @four_files AS four
  ON four.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @five_files AS five
  ON five.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @six_files AS six
  ON six.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @seven_files AS seven
  ON seven.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @eight_files AS eight
  ON eight.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @nine_files AS nine
  ON nine.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @ten_files AS ten
  ON ten.id2_enc_kgb_id == one.id2_enc_kgb_id;

I submitted the job to Azure, and had to cancel it after a few hours and $80 in cost!

enter image description here

It was my understanding that Data Lake is meant exactly for this type of job?! I've got perhaps 100 files total, totaling maybe 20mb of data.

How can I speed up my joins?


Solution

  • What is important for you to note that small files are suboptimal in every scenario. The suggested solution for smaller files, by Michal Rys is to consider these alternative to concat them into large files:

    Note: fast file set allows you to consume hundreds of thousands of such files in bulk in a single EXTRACT.

    I would use INNER JOIN instead of JOIN to be sure you know which join you are really using.

    It is rather important to see how you have EXTRACTed the information from the CSV files. The JOINed result should be OUTPUTed into a tsv (Tab-Separated-Value - Note: TVF is Table-Valued Functions for u-sql code reuse) file.

    The TSV structure:

    This format should be very efficient for u-sql (I did not yet measure it myself).

    To have complete information you can have three different build-in outputter types .Text(), .Csv(), Tsv().

    Your example is missing the variables so I'll try to guess them

    USE DATABASE <your_database>;
    USE SCHEMA <your_schema>;
    
    DECLARE @FirstCsvFile string = "/<path>/first.csv";
    @firstFile = EXTRACT an_episode_id string, id2_enc_kgb_id string, suffix string
    FROM @FirstCsvFile USING Extractors.Text(delimiter : '|');
    
    // probably 8 more statements which where omitted in the OP
    
    
    @encode = SELECT  one.an_episode_id, 
                      one.id2_enc_kgb_id, 
                      one.suffix, 
                      two.suffixa, 
                      three.suffixThree, 
                      four.suffixFour, 
                      five.suffixFive,
                      six.suffixSix,
                      seven.suffixSeven,
                      eight.suffixEight,
                      nine.suffixNine,
                      ten.suffixTen  
              FROM @firstFile AS one
              INNER JOIN @two_files AS two 
                ON one.id3_enc_kgb_id == two.id3_enc_kgb_id
              INNER JOIN @three_files AS three  
                ON three.id3_enc_kgb_id == one.id3_enc_kgb_id
              INNER JOIN @four_files AS four
                ON four.id3_enc_kgb_id == one.id3_enc_kgb_id
              INNER JOIN @five_files AS five
                ON five.id3_enc_kgb_id == one.id3_enc_kgb_id
              INNER JOIN @six_files AS six
                ON six.id2_enc_kgb_id == one.id2_enc_kgb_id
              INNER JOIN @seven_files AS seven
                ON seven.id2_enc_kgb_id == one.id2_enc_kgb_id
              INNER JOIN @eight_files AS eight
                ON eight.id2_enc_kgb_id == one.id2_enc_kgb_id
              INNER JOIN @nine_files AS nine
                ON nine.id2_enc_kgb_id == one.id2_enc_kgb_id
              INNER JOIN @ten_files AS ten
                ON ten.id2_enc_kgb_id == one.id2_enc_kgb_id;
     OUTPUT @encode TO "/outputs/encode_joins.tsv" USING Outputters.Tsv();