I would like to create a numeric indicator variable that captures the levels of a complex string id
variable in SAS. Note that I cannot use if-then logic, as the id
values will change constantly.
I have searched multiple threads and previous questions, and I am unable to find an approach that works.
Below is some sample code that captures the essence of the problem.
data value_id;
length id $54 ;
informat id $54. ;
input id $ ;
cards ;
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
;
proc sort data = value_id
out = value_id_s ascii ;
by id ;
run;
Ideally, I would like to have a result where I can create a numeric indicator variable id_n
, like the following:
id id_n
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
... ...
Any advice / guidance on the approach would be very much appreciated.
Aside
If I were using Stata, I would use encode
and move on with my day...
. encode id, gen(id_n)
.
.
. tab1 id id_n , nolabel
-> tabulation of id
id | Freq. Percent Cum.
----------------------------------------+-----------------------------------
98llcibqon-u86ww0mzgo-ut58htcfv1-lybg.. | 4 10.00 10.00
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20.. | 4 10.00 20.00
byxl352kpd-se5godm0gv-jukpzv1u7x-8kff.. | 4 10.00 30.00
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8q.. | 4 10.00 40.00
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubj.. | 4 10.00 50.00
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4l.. | 4 10.00 60.00
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73o.. | 4 10.00 70.00
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tr.. | 4 10.00 80.00
tsknifwb29-818zgpj2be-vq7558xhqa-1lgq.. | 4 10.00 90.00
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9.. | 4 10.00 100.00
----------------------------------------+-----------------------------------
Total | 40 100.00
-> tabulation of id_n
id_n | Freq. Percent Cum.
------------+-----------------------------------
1 | 4 10.00 10.00
2 | 4 10.00 20.00
3 | 4 10.00 30.00
4 | 4 10.00 40.00
5 | 4 10.00 50.00
6 | 4 10.00 60.00
7 | 4 10.00 70.00
8 | 4 10.00 80.00
9 | 4 10.00 90.00
10 | 4 10.00 100.00
------------+-----------------------------------
Total | 40 100.00
Try this
proc sort data = value_id out = value_id_s ascii;
by id;
run;
data want;
set value_id_s;
by id;
if first.id then id_n + 1;
run;