I have a large text file with fixed columns that I need to import to SAS. The same file contains information at three levels: firms, associates and economic variables.
The first character of every observation tells which level the values in the row are about (1 is about firms, 2 about associates...). According to the layout, each level has different variables of different lengths.
This is a small example of the data I have to import. The first 4 variable are the same but afterwards each level has its own variables with different lengths. The variable "type" identifies which level the observation is about. And all the other variables are bunched up in "other":
DATA WORK.test;
INFILE DATALINES;
INPUT
type $ 1-1
full_registry $ 2-2
type_update $ 3-3
cnpj $ 4-17
other $ 18-1200
;
DATALINES;
1F 000000000001911BANCO DO BRASIL SA DIRECAO GERAL 022005110300
2F 000000000001912MARCIO HAMILTON FERREIRA 000***923641**100000020101117
2F 000000000001912NILSON MARTINIANO MOREIRA 000***491386**100000020101117
2F 000000000001912WALTER MALIENI JUNIOR 000***718468**100000020101117
2F 000000000001912PAULO ROGERIO CAFFARELLI 000***415907**100000020120327
2F 000000000001912ADRIANO MEIRA RICCI 000***295868**100000020180412
2F 000000000001912MARCOS RENATO COLTRI 000***890627**100000020180418
2F 000000000001912CARLA NESI 000***370266**100000020170614
6F 000000000001916499999
1F 000000000002722BANCO DO BRASIL SA MANAUS (AM) 022005110300
1F 000000000003532BANCO DO BRASIL SA SANTOS - SANTOS (SP) 022005110300
1F 000000000004342BANCO DO BRASIL SA CAMPOS EST.UNIF. 022005110300
1F 000000000005152BANCO DO BRASIL SA MACAPA-EST.UNIF. 022005110300
;;;;
run;
What's the best way to read this file?
I thought of two possibilities:
I. Read the txt three times with the three layouts for each level. Throw away the the observations that were read wrong. I'm not sure if this will work.
II. Read the first 4 variables and put the rest of the line on a string variable, like I did in this example. Then use a bunch of substr() to separate the several variables. I'm not sure if this is the smartest way to do this.
Am I'm missing a much simpler way to do this?
I'm using Enterprise version 7.13.
You can use the single trailing @
at the end of an input statement to hold the current line for further input statements in the same data step. E.g.
DATA want;
INFILE DATALINES;
/*Always input the first 4 variables*/
INPUT type $ 1-1 full_registry $ 2-2 type_update $ 3-3 cnpj $ 4-17 @;
select(type);
when('1') input var1; /*fill this in with input rules for type 1 records*/
when('2') input var2; /*fill this in with input rules for type 2 records*/
otherwise input; /*do not input any other variables - just ignore the row*/
end;
DATALINES;
1F 000000000001911BANCO DO BRASIL SA DIRECAO GERAL 022005110300
2F 000000000001912MARCIO HAMILTON FERREIRA 000***923641**100000020101117
2F 000000000001912NILSON MARTINIANO MOREIRA 000***491386**100000020101117
2F 000000000001912WALTER MALIENI JUNIOR 000***718468**100000020101117
2F 000000000001912PAULO ROGERIO CAFFARELLI 000***415907**100000020120327
2F 000000000001912ADRIANO MEIRA RICCI 000***295868**100000020180412
2F 000000000001912MARCOS RENATO COLTRI 000***890627**100000020180418
2F 000000000001912CARLA NESI 000***370266**100000020170614
6F 000000000001916499999
1F 000000000002722BANCO DO BRASIL SA MANAUS (AM) 022005110300
1F 000000000003532BANCO DO BRASIL SA SANTOS - SANTOS (SP) 022005110300
1F 000000000004342BANCO DO BRASIL SA CAMPOS EST.UNIF. 022005110300
1F 000000000005152BANCO DO BRASIL SA MACAPA-EST.UNIF. 022005110300
;;;;
run;