Tags: awk, sed, grep, hunspell

Stemming a word using Linux command line


I have a corpus file that I need to compare with another file "vertically" and list unique left-over strings. For example:

exclude.txt:

ed
s
ing

And the second file is:

corpus.txt:

worked
working
works
tested
tests
find
found

Expected output:

work/ed,ing,s
test/ed,s

The other words (find and found) can also be returned optionally.

I tried like this:

import re

with open('/home/corpus.txt') as corpus:
    for i in corpus:
        i = i.strip('\n')
        with open('/home/exclude.txt') as exclude:
            for x in exclude:
                x = x.strip('\n')
                if i.endswith(x):
                    print(x, i, re.sub(r'(.*)'+x, r'\1', i)+'/'+x)

The output is something like this...

ed worked work/ed
ing working work/ing
s works work/s
ed tested test/ed
s tests test/s

As you can see, I am not getting the expected output (the suffixes aren't grouped per stem). This Python code also won't scale if the corpus file is large, since it re-reads exclude.txt for every word in the corpus.


Here's some slightly less trivial input to test with:

$ head -100 exclude.txt corpus.txt
==> exclude.txt <==
works
ed
s
ing
ings

==> corpus.txt <==
worked
working
works
tested
tests
find
found
workings

The expected output given the above would be (printing the unmatched words is optional, as is printing matched/unmatched at the start of each line):

unmatched find
unmatched found
matched working/s
matched work/ed,ing,s,ings
matched test/ed,s

Solution

  • Using any awk:

    $ cat tst.awk
    { lineLgth = length($0) }
    NR == FNR {
        suffixes[$0]
        sfxLgths[lineLgth]
        next
    }
    {
        base = ""
        for ( sfxLgth in sfxLgths ) {
            baseLgth = lineLgth - sfxLgth
            if ( baseLgth > 0 ) {
                sfx = substr($0,baseLgth+1)
                if ( sfx in suffixes ) {
                    base = substr($0,1,baseLgth)
                    bases2sfxs[base] = bases2sfxs[base] "," sfx
                }
            }
        }
        if ( base == "" ) {
            print "unmatched", $0
        }
    }
    END {
        for ( base in bases2sfxs ) {
            sub(/,/,"/",bases2sfxs[base])
            print "matched", base bases2sfxs[base]
        }
    }
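A side note on the END block: the script accumulates each base's suffixes with a leading comma and then converts only the first comma to "/" via sub(/,/,"/",...). The same trick, sketched in Python purely for illustration:

```python
# Accumulate with a leading separator, then turn the first "," into "/".
# This mirrors the awk END block's sub(/,/,"/",...) trick.
acc = ""
for sfx in ["ed", "ing", "s"]:
    acc += "," + sfx          # acc becomes ",ed,ing,s"
result = "work" + acc.replace(",", "/", 1)
print(result)                 # work/ed,ing,s
```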
    

    The above uses literal string functionality, so it will work even if the strings in either input file contain regexp metacharacters, backreference characters/strings, typical string/regexp delimiters, escape sequences, or other potentially problematic characters. The OP may not have all of those cases, but others with similar problems reading this in the future might.
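To illustrate the hazard in Python (hypothetical values; this is not part of the awk solution): when a suffix such as "." is interpolated into a regexp, as in the question's code, it matches any character, and some suffixes aren't even valid regexps:

```python
import re

suffix = "."      # a literal "." suffix, as might appear in exclude.txt
word = "xyz"      # does NOT end in a literal dot

# Literal string test: correct, reports no match.
literal_match = word.endswith(suffix)            # False

# Regexp interpretation, as in the question's code: "." matches any
# character, so it wrongly matches and "strips" the final "z".
m = re.search('(.*)' + suffix + '$', word)
regex_base = m.group(1)                          # "xy" (wrong)

# And some suffixes aren't even valid regexps:
try:
    re.compile('(.*)' + '\\')                    # trailing lone backslash
    bad_pattern_ok = True
except re.error:
    bad_pattern_ok = False                       # raises re.error

print(literal_match, regex_base, bad_pattern_ok)
```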

    For example, given this input which tests all of the above occurring in both files:

    $ cat exclude.txt
    works
    ed
    s
    ing
    ings
    .
    *
    .*
    /
    &
    '
    "
    \
    \1
    \t
    

    $ cat corpus.txt
    worked
    working
    works
    tested
    tests
    find
    found
    workings
    abacus
    formless
    impress
    stories
    amusing
    linings
    toppings
    underlings
    x.
    x*
    x.*
    x/
    x&
    x'
    x"
    x\
    x\1
    x\t
    

    the above script would produce this output (output sorted just for ease of seeing which words got which suffixes):

    $ awk -f tst.awk exclude.txt corpus.txt | sort
    matched abacu/s
    matched amus/ing
    matched formles/s
    matched impres/s
    matched lin/ings
    matched lining/s
    matched storie/s
    matched test/ed,s
    matched topp/ings
    matched topping/s
    matched underl/ings
    matched underling/s
    matched work/ed,ing,s,ings
    matched working/s
    matched x./*
    matched x/.,*,.*,/,&,',",\,\1,\t
    unmatched find
    unmatched found
    

    For efficiency, I'm also looping on the lengths of the strings in exclude.txt (the contents of suffixes[]) rather than on the actual strings.
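For comparison, the same length-bucketing idea can be sketched in Python (the question's sample data is inlined; the function and variable names are illustrative, not prescriptive):

```python
from collections import defaultdict

def stem_corpus(corpus_words, suffixes):
    """Group each word's matching suffixes under its base (stem)."""
    lengths = {len(s) for s in suffixes}   # distinct suffix lengths
    bases = defaultdict(list)              # base -> list of matched suffixes
    unmatched = []
    for word in corpus_words:
        hit = False
        for n in lengths:                  # one hash lookup per length
            if len(word) > n and word[-n:] in suffixes:
                bases[word[:-n]].append(word[-n:])
                hit = True
        if not hit:
            unmatched.append(word)
    return bases, unmatched

bases, unmatched = stem_corpus(
    ["worked", "working", "works", "tested", "tests", "find", "found"],
    {"ed", "s", "ing"},
)
for base, sfxs in bases.items():
    print(f"matched {base}/{','.join(sfxs)}")
for word in unmatched:
    print(f"unmatched {word}")
```

The `len(word) > n` guard plays the same role as the awk script's `baseLgth > 0` test.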

    Suppose we have, say, 10 3-letter suffixes that we want to match against the last 3 letters of each line from corpus.txt. Either way we populate the suffixes array from exclude.txt and loop through corpus.txt; our options for implementing "check for a match" are then (pseudo-code):

    a) Compare each suffix from exclude.txt to the last 3 letters on each line from corpus.txt (this will loop 10 times as we have 10 suffixes):

    NR == FNR {
        suffixes[$0]
        next
    }
    {
        for ( sfx in suffixes ) {     # loops 10 times
        wordEnd = substr($0,length($0)-length(sfx)+1)
            if ( wordEnd == sfx ) {
                it's a match
            }
        }
    }
    

    b) Compare the last 3 letters of each line from corpus.txt to all suffixes from exclude.txt at once via a hash lookup (this will loop 1 time as all suffixes are 3 chars long in this example):

    NR == FNR {
        suffixes[$0]
        sfxLgths[length($0)]
        next
    }
    {
        for ( sfxLgth in sfxLgths ) {    # loops 1 time
        wordEnd = substr($0,length($0)-sfxLgth+1)
            if ( wordEnd in suffixes ) {
                it's a match
            }
        }
    }
    

    So I implemented option "b", which loops once per distinct suffix length instead of once per suffix of that length.

    Regarding how much more efficient looping by length is than looping by suffix: according to https://en.wiktionary.org/wiki/Appendix:English_suffixes there are 392 English-language suffixes, and 368 of those fall in a bell curve between 2 and 7 characters long, averaging about 61 suffixes per length in that range. So looping once on a length in that range saves, on average, about 60 iterations compared to looping on each of the suffixes of that length.
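A quick back-of-the-envelope check of that claim, using the figures above:

```python
# ~61 suffixes share a given length in the 2..7 character range
# (figure from the Wiktionary appendix cited above).
suffixes_per_length = 61

iterations_a = suffixes_per_length   # option (a): compare every suffix
iterations_b = 1                     # option (b): one hash lookup per length

saved = iterations_a - iterations_b  # iterations saved per word, per length
print(saved)
```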

    The if ( baseLgth > 0 ) test is necessary so that we don't output <null>/<suffix> when, say, a word from corpus.txt also appears verbatim in exclude.txt.

    Tweak it to modify the output format, and/or add a variable to control whether the unmatched words are printed, as you see fit; your question isn't clear on how those should be handled.