performancesasleft-joinproc

Time efficient approach to do SAS left join on large data along with case statement used in select condition


I have 2 data which are in .sas7bdat format and I need to perform a left join on them based on some condition. To give an overview of the PROC SQL it is as follows:

PROC SQL;
  CREATE TABLE XYZ 
     SELECT T1.A,
     case statement
       case
           when max(T1.X) <= 0 then T1.X           
           when max(T1.X) > 0 then T2.Y
           AS FINAL_XY
     FROM T1 left join T2 ON
     T1.A= T2.A
     T1.B=T2.B AND
     UPCASE(T1.C)="CLASS1"
    quit;

Both the SAS datasets are around 2GB each. Any efficient way I can get this done. The above statement takes almost an hour to complete and generate the output. I am fine using data step operation as well if that will provide better performance than above. I do not have memory limitation if that is a question: RAM: 64 GB. Incase, this information is helpful.

EDIT: Doing a SAS SQL statement as in some scenarios I need to assign value from the table2 or in from table1.So, essentially trying to join two tables and create a new table which consists some of the columns from table1 while some from table2 based on condition specified in the case statement.


Solution

  • Make sure the datasets are sorted by the key variables and just merge them. If there are key combinations in the second dataset that do not appear in the first that want to ignore you can use the IN= dataset option to create flag variables to indicate if the dataset is contributing to the observation.

    So when A and B uniquely identify the observations in T2 then a left join on the values of A and B would look like this:

    data XYZ;
      merge T1 (in=in1) T2 ;
      by A B ;
      if in1 ;
    run;
    

    Say the reason for the merge is to attach the values of Y from T2 , and no other variables from T2, to the observations from T1 then you can use the KEEP= dataset option to pick which variables from T2 to use. So then the MERGE statement might look like this instead:

    merge T1(in=in1) T2(keep=A B Y);
    

    To mimic the impact of including that extra test of variable C in your join criteria you could add an IF statement that clears the new variables from T2 when that condition is NOT true.

     if not (UPCASE(C)="CLASS1") then call missing(Y);
    

    To add other variables that do not already come from T1 or T2 you can use normal programming language statements like assignment statements and IF/THEN logic (instead of SQL's convoluted CASE syntax).

    But if you really do need to also generate aggregate statistics, like the MAX() in your example, then it will probably take more time since you will need at least a second pass through the data to calculate those. (SQL code needs that extra pass also, but it does it for you.)

    So for your example you really just need to first calculate the maximum value of X. For example you can use PROC SUMMARY to do that and then include that value into your resulting dataset so you can reference it in your code.

    proc summary data=t1 ;
      var X;
      output out=max_x max=max_x ;
    run;
    data XYZ;
      if _n_=1 then set max_x(keep=max_x);
      merge T1 (in=in1) T2 (keep=A B Y) ;
      by A B ;
      if in1 ;
      if not (UPCASE(C)="CLASS1") then call missing(Y);
      if max_x <= 0 then FINAL_XY = X;
      else FINAL_XY=Y;         
    run;
    

    But since the logic of what value to assign to FINAL_XY does not seem to depend on the values of any of the actual variables in the current observations you might just use macro code to conditionally generate either of the two assignment statements instead. So you might do something like this instead to first put the name of the variable to assign to FINAL_XY into a macro variable. And then use the macro variable to help generate the assignment statement in the data step.

    proc sql noprint;
       select case when (max(x)<=0) then 'X' else 'Y' end
       into :final_xy 
       from T1 ;
    quit;
    data XYZ;
      merge T1 (in=in1) T2 (keep=A B Y) ;
      by A B ;
      if in1 ;
      if not (UPCASE(T1.C)="CLASS1") then call missing(Y);
      FINAL_XY = &final_xy ;
    run;