I have this text file:
worked
working
works
tested
tests
find
found
The full file contains a million words, none of which contain spaces. It may contain Unicode characters.
The longest word is "working":
awk '{print length, $0}' test.txt | sort -nr | head -1
7 working
For each word I need to create the bigram, trigram, and so on (every leading prefix), with a maximum of 7 columns:
w,wo,wor,work,worke,worked,
w,wo,wor,work,worki,workin,working
w,wo,wor,work,works,,
t,te,tes,test,teste,tested,
t,te,tes,test,tests,,
f,fi,fin,find,,,
f,fo,fou,foun,found,,
Preferably in awk (because it's fast).
A straightforward approach would be:
awk -v n=7 -v OFS=, \
'{s=$0; len=length(s); for (i=1;i<=len;i++) $i=substr(s,1,i); $n=$n}1' test.txt
w,wo,wor,work,worke,worked,
w,wo,wor,work,worki,workin,working
w,wo,wor,work,works,,
t,te,tes,test,teste,tested,
t,te,tes,test,tests,,
f,fi,fin,find,,,
f,fo,fou,foun,found,,
Tested on GNU Awk 5.3.0, mawk 1.3.4 20240819, and The One True Awk version 20240728.
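If some words might be longer than n, the field-assignment approach above would emit more than n columns for them. As a rough sketch of one alternative (not tested on the awk versions listed above), the line can be built by plain string concatenation and capped at n columns. The function name `prefixes` is made up for illustration, and truncating long words to n prefixes is an assumption about what is wanted:

```shell
# prefixes N: hypothetical helper; reads words on stdin, one per line, and
# prints the first N leading prefixes of each, comma-separated.
# Words shorter than N are padded with empty columns; longer words are
# cut off after N prefixes (an assumption, not something the question states).
prefixes() {
    awk -v n="${1:-7}" '
    {
        line = ""
        len = length($0)
        for (i = 1; i <= n; i++) {
            # columns past the end of the word stay empty
            cell = (i <= len) ? substr($0, 1, i) : ""
            line = line (i > 1 ? "," : "") cell
        }
        print line
    }'
}

printf '%s\n' worked working works find | prefixes 7
```

This never assigns to fields, so it does not depend on OFS or on how the line was split. Whether length and substr count characters or bytes depends on the awk implementation and the locale, so it is worth checking with a non-ASCII sample before running either version on the full file.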