regex awk sed grep

Split the word using bigram, trigram


I have this text file:

worked
working
works
tested
tests
find
found

It contains a million words, one per line, with no spaces inside a word. Words may contain Unicode characters.

The longest word is "working":

awk '{print length, $0}' test.txt | sort -nr | head -1
7 working

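I need to split each word into its leading unigram, bigram, trigram, and so on (that is, every prefix of the word), padded to a maximum of 7 columns: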

w,wo,wor,work,worke,worked,
w,wo,wor,work,worki,workin,working
w,wo,wor,work,works,,
t,te,tes,test,teste,tested,
t,te,tes,test,tests,,
f,fi,fin,find,,,
f,fo,fou,foun,found,,

Preferably in awk, because it's fast.


Solution

  • A straightforward approach is to assign each prefix to its own field, then touch field n so that awk pads the record out to n columns:

    awk -v n=7 -v OFS=, \
      '{s=$0; len=length(s); for (i=1;i<=len;i++) $i=substr(s,1,i); $n=$n}1' test.txt
    
    w,wo,wor,work,worke,worked,
    w,wo,wor,work,worki,workin,working
    w,wo,wor,work,works,,
    t,te,tes,test,teste,tested,
    t,te,tes,test,tests,,
    f,fi,fin,find,,,
    f,fo,fou,foun,found,,
    

    Tested on GNU Awk 5.3.0, mawk 1.3.4 20240819, and The One True Awk version 20240728.
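
If the longest word is not known in advance, the column count does not have to be hard-coded. The following is a minimal sketch, not part of the answer above, that reads the file twice: the first pass finds the longest line and the second pass builds the prefixes. The function name `prefixes` is a hypothetical helper introduced for illustration, and the sketch assumes the input is a regular file (it cannot read twice from a pipe) with one word per line and no spaces.

```shell
# prefixes FILE: print every prefix of each line as comma-separated columns,
# padded to the length of the longest line in FILE.
# (Hypothetical helper name; assumes one word per line, no spaces.)
prefixes() {
  awk -v OFS=, '
    # Pass 1 (first read of the file): remember the longest line length in n.
    NR == FNR { if (length($0) > n) n = length($0); next }

    # Pass 2 (second read): build the prefixes, then pad to n columns.
    {
      s = $0
      len = length(s)
      for (i = 1; i <= len; i++) $i = substr(s, 1, i)
      $n = $n        # touching field n makes awk create the empty trailing fields
      print
    }
  ' "$1" "$1"
}
```

Usage would be `prefixes test.txt`. On the Unicode point raised in the question: to my knowledge, `length` and `substr` count characters in GNU Awk under a UTF-8 locale, whereas mawk operates on bytes, so with mawk a multibyte character may be split across columns. It is worth checking this on a sample of the real data before processing the full file.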