sasproc-sql

Union different columns from the same table using a data step


I would like to turn the following MWE into a data step (possibly using a macro) so I can run it with six years worth of data, and five or six columns of data. Here the years are pre-defined variables.

PROC SQL;
   CREATE TABLE SASUSER.OUT_DATA AS 
       SELECT  &YEAR1 AS YEAR,
               ID,
               ROUND(ANN_TURN&YEAR1) AS ANN_TURN,
               ROUND(ANN_PROF&YEAR1) AS ANN_PROF
 FROM WORK.IN_DATA
    WHERE ANN_TURN&YEAR1 IS NOT MISSING OR
          ANN_PROF&YEAR1 IS NOT MISSING
 UNION
       SELECT &YEAR2 AS YEAR,
              ID,
              ROUND(ANN_TURN&YEAR2) AS ANN_TURN,
              ROUND(ANN_PROF&YEAR2) AS ANN_PROF
 FROM WORK.IN_DATA
    WHERE ANN_TURN&YEAR2 IS NOT MISSING OR
          ANN_PROF&YEAR2 IS NOT MISSING
 UNION 
    SELECT  &YEAR3 AS YEAR,
            ID,
            ROUND(ANN_TURN&YEAR3) AS ANN_TURN,
            ROUND(ANN_PROF&YEAR3) AS ANN_PROF
    FROM WORK.IN_DATA
        WHERE ANN_TURN&YEAR3 IS NOT MISSING OR
              ANN_PROF&YEAR3 IS NOT MISSING;
   QUIT;

Example of IN_DATA (apologies I'm not sure how to add a table in correct format):

ID ANN_TURN_Y1819 ANN_TURN_Y1920 ANN_TURN_Y2021  ANN_PROF_Y1819 ANN_PROF_Y1920 ANN_PROF_Y2021 

1  100 110 120 10 12 15

Solution

  • The DATA Step SET statement is used to stack data sets.

    However, your SQL is performing a UNION of selections from the same data set, essentially reshaping the data through a transposition.

    If you are ROUNDing for display purposes, use a FORMAT instead. The WHERE clause might be suspect. OR'ing double negatives always made this old head wobble.

    Example:

    You can do this in a non-macro way using VVALUEX to extract the value of a dynamically named variable.

    data have;
      do companyid = 1 to 1000;
        array ann_turn ann_turn1920-ann_turn2025;
        array ann_prof ann_prof1920-ann_prof2025;
    
        do over ann_turn ;
          x + 1;  ann_turn = x;
          y + 10; ann_prof = y;
        end;
        output;
      end;
      keep companyid ann_turn: ann_prof: ;
    run;
    
    %let years_list = 1920, 1940, 1961, 1962, 1991 to 1995, 2009;
    
    data want(keep=companyid year ann_turn ann_prof);
      set have;
      do year = &years_list;
        ann_turn = input(vvaluex(cats('ANN_TURN', year)),best16.);
        ann_prof = input(vvaluex(cats('ANN_PROF', year)),best16.);
        output;
      end;
      format ann_: dollar 15.;
    run;