sasencodelevels

Create a numeric encoding of levels for a string variable in SAS


I would like to create a numeric indicator variable that captures the levels of a complex string id variable in SAS. Note that I cannot use if-then logic, as the id values will change constantly.

I have searched multiple threads and previous questions, and I am unable to find an approach that works.

Below is some sample code that captures the essence of the problem.


data value_id;
length id $54 ;
informat id $54. ; 
input id $ ;
cards ;
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
;

proc sort data = value_id
out = value_id_s  ascii  ;
by id ;
run; 

Ideally, I would like to have a result where I can create a numeric indicator variable id_n, like the following:

                                                    id      id_n
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym         1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym         1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym         1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym         1
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8         2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8         2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8         2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8         2
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf         3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf         3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf         3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf         3
...                                                          ...

Any advice / guidance on the approach would be very much appreciated.


Aside

If I were using Stata, I would use encode and move on with my day...

. encode id, gen(id_n)

. 
. 
. tab1 id id_n , nolabel

-> tabulation of id  

                                     id |      Freq.     Percent        Cum.
----------------------------------------+-----------------------------------
98llcibqon-u86ww0mzgo-ut58htcfv1-lybg.. |          4       10.00       10.00
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20.. |          4       10.00       20.00
byxl352kpd-se5godm0gv-jukpzv1u7x-8kff.. |          4       10.00       30.00
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8q.. |          4       10.00       40.00
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubj.. |          4       10.00       50.00
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4l.. |          4       10.00       60.00
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73o.. |          4       10.00       70.00
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tr.. |          4       10.00       80.00
tsknifwb29-818zgpj2be-vq7558xhqa-1lgq.. |          4       10.00       90.00
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9.. |          4       10.00      100.00
----------------------------------------+-----------------------------------
                                  Total |         40      100.00

-> tabulation of id_n  

       id_n |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          4       10.00       10.00
          2 |          4       10.00       20.00
          3 |          4       10.00       30.00
          4 |          4       10.00       40.00
          5 |          4       10.00       50.00
          6 |          4       10.00       60.00
          7 |          4       10.00       70.00
          8 |          4       10.00       80.00
          9 |          4       10.00       90.00
         10 |          4       10.00      100.00
------------+-----------------------------------
      Total |         40      100.00

Solution

  • Try this

    proc sort data = value_id out = value_id_s ascii;
       by id;
    run; 
    
    data want;
       set value_id_s;
       by id;
       if first.id then id_n + 1;
    run;