python, list, dictionary, vocabulary

Build vocabulary representations in Python


I have a list of strings of this form:

['1---d--e--g--gh','1---c---e--gh--', '1---ghj--h--h--', '1---g--gkk--h--', '1---d--dfe---fg', '1---c--d--dh--j', '1---f--gh--h--h', '1---fg-hg-hh-fg', '1---d--cd7--d--', '1---gghG--g77--', '1---hkj--kl--l-', '1---gged--ghjg-', '1---kk--k--k---', '1---gjklk--khgl', '1---c---d---dh-', '1---g---ghkk--k', '1---fH---h--g--', '1---f--gij---hj', '1---g--ghg---g-', '1---c---dc--cf-', '1---d---e--gh--', '1---l--lmnmlk-l', '1---d77---c--d-', '1---kj--k--lk-l', '1---g---gd--e--', '1---hhgh--d---h', '1---f--f---h---', '1---g--gkh-jkhg', '1---fg-hgh-fhfg', '1---k-k--klkj--', '1---g--l--kjhg-', 'gh--g---gh--g--', '1---f--df--fhij', '1---g--g--g---g', '1---g---gh-kh--', '1---g---gk--h--']

I want to create vocabulary representations of three types: a, b and c.

Tokens of type a are separated by at least one dash (-), tokens of type b by at least two dashes (--), and tokens of type c by at least three dashes (---).

For example, 1--d--d--dfd-dc---f---g--ghgf-ghg-hj--h should give:

a: {d, d, dfd, dc, f, g, ghgf, ghg, hj, h}
b: {d, d, dfd-dc, f, g, ghgf-ghg-hj, h}
c: {d--d--dfd-dc, f, g--ghgf-ghg-hj--h}

These are the vocabulary representations (the 1 at the beginning is skipped). Does anyone know a way to do that in Python?


Solution

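  • You can use a list comprehension for each string. Drop the leading 1 before splitting (otherwise it stays attached to the first c token), and filter on the stripped piece so that leftover dashes do not produce empty tokens:

    string = '1--d--d--dfd-dc---f---g--ghgf-ghg-hj--h'
    s = string.lstrip("1")  # skip the 1 at the beginning
    # split on the separator, strip leftover dashes, drop empty pieces
    a = [i.strip("-") for i in s.split("-") if i.strip("-")]
    b = [i.strip("-") for i in s.split("--") if i.strip("-")]
    c = [i.strip("-") for i in s.split("---") if i.strip("-")]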
    

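    If you have a list vps containing those strings, you can apply the same comprehension to every string (shown here for type a; use "--" or "---" as the separator for b and c):

    l = [[i.strip("-") for i in string.lstrip("1").split("-") if i.strip("-")] for string in vps]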
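
An alternative that may be easier to generalize is to split on a regular expression that matches a run of at least n dashes. The sketch below is only one way to do it: the helper names split_tokens and build_vocab are made up for illustration, and using a Counter as the "vocabulary" is an assumption, since the question shows repeated tokens but does not say whether counts matter.

```python
import re
from collections import Counter

def split_tokens(s, min_dashes):
    """Split s on runs of at least min_dashes dashes, skipping a leading '1'."""
    body = s[1:] if s.startswith("1") else s  # skip the marker at the beginning
    body = body.strip("-")                     # outer dashes never belong to a token
    if not body:
        return []
    return re.split("-{%d,}" % min_dashes, body)

def build_vocab(strings, min_dashes):
    """Count every token of the given type across all strings (assumed meaning of 'vocabulary')."""
    return Counter(tok for s in strings for tok in split_tokens(s, min_dashes))

example = "1--d--d--dfd-dc---f---g--ghgf-ghg-hj--h"
a = split_tokens(example, 1)  # separated by at least one dash
b = split_tokens(example, 2)  # separated by at least two dashes
c = split_tokens(example, 3)  # separated by at least three dashes
```

For the whole list, build_vocab(vps, 2) would give the type b counts; wrap the result in set(...) if only the distinct tokens are needed.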