java nlp extract keyword stemming

Java library for keyword extraction from input text


I'm looking for a Java library to extract keywords from a block of text.

The process should be as follows:

stop word cleaning -> stemming -> searching for keywords based on statistical information about the English language: if a word appears more often in the text than its probability in English would predict, then it's a keyword candidate.

Is there a library that performs this task?


Solution

  • Here is a possible solution using Apache Lucene. I didn't use the latest version but 3.6.2, since it is the one I know best. Besides /lucene-core-x.x.x.jar, don't forget to add /contrib/analyzers/common/lucene-analyzers-x.x.x.jar from the downloaded archive to your project: it contains the language-specific analyzers (in your case, the English one).

    Note that this only computes the frequencies of the input text's words, grouped by stem. Comparing these frequencies with English language statistics has to be done afterwards (this answer may help by the way).


    The data model

    One keyword per stem. Different words may share the same stem, hence the set of terms. The keyword frequency is incremented every time a term is added, even if that term has already been seen: the set removes the duplicate, but the occurrence still counts.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    
    public class Keyword implements Comparable<Keyword> {
    
      private final String stem;
      private final Set<String> terms = new HashSet<String>();
      private int frequency = 0;
    
      public Keyword(String stem) {
        this.stem = stem;
      }
    
      public void add(String term) {
        terms.add(term);
        frequency++;
      }
    
      @Override
      public int compareTo(Keyword o) {
        // descending order: the most frequent keyword sorts first
        return Integer.compare(o.frequency, frequency);
      }
    
      @Override
      public boolean equals(Object obj) {
        if (this == obj) {
          return true;
        } else if (!(obj instanceof Keyword)) {
          return false;
        } else {
          return stem.equals(((Keyword) obj).stem);
        }
      }
    
      @Override
      public int hashCode() {
        return Arrays.hashCode(new Object[] { stem });
      }
    
      public String getStem() {
        return stem;
      }
    
      public Set<String> getTerms() {
        return terms;
      }
    
      public int getFrequency() {
        return frequency;
      }
    
    }
    
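    To make the difference between the frequency and the terms set concrete, here is a minimal sketch. It uses a trimmed stand-in for the Keyword class (same add and compareTo logic, no equals/hashCode); the KeywordDemo wrapper and the sample terms are only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class KeywordDemo {

  // trimmed stand-in for the Keyword class above (same add/compareTo logic)
  static class Keyword implements Comparable<Keyword> {
    final String stem;
    final Set<String> terms = new HashSet<String>();
    int frequency = 0;

    Keyword(String stem) {
      this.stem = stem;
    }

    void add(String term) {
      terms.add(term); // duplicates are silently dropped by the set
      frequency++;     // but every occurrence is counted
    }

    @Override
    public int compareTo(Keyword o) {
      // descending order
      return Integer.compare(o.frequency, frequency);
    }
  }

  public static void main(String[] args) {
    Keyword compil = new Keyword("compil");
    compil.add("compiled");
    compil.add("compiler");
    compil.add("compiled"); // same term seen a second time

    Keyword sun = new Keyword("sun");
    sun.add("sun");

    List<Keyword> keywords = new ArrayList<Keyword>();
    keywords.add(sun);
    keywords.add(compil);
    Collections.sort(keywords); // most frequent first

    for (Keyword k : keywords) {
      System.out.println(k.stem + " x" + k.frequency + " (" + k.terms.size() + " distinct terms)");
    }
    // prints:
    // compil x3 (2 distinct terms)
    // sun x1 (1 distinct terms)
  }
}
```

    The stem "compil" ends up with a frequency of 3 but only two distinct terms, which is the behaviour described above.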

    Utilities

    To stem a word:

    public static String stem(String term) throws IOException {
    
      TokenStream tokenStream = null;
      try {
    
        // tokenize
        tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
        // stem
        tokenStream = new PorterStemFilter(tokenStream);
    
        // add each token in a set, so that duplicates are removed
        Set<String> stems = new HashSet<String>();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
          stems.add(token.toString());
        }
    
        // if no stem or 2+ stems have been found, return null
        if (stems.size() != 1) {
          return null;
        }
        String stem = stems.iterator().next();
        // if the stem contains anything other than letters, digits or dashes, return null
        if (!stem.matches("[a-zA-Z0-9-]+")) {
          return null;
        }
    
        return stem;
    
      } finally {
        if (tokenStream != null) {
          tokenStream.close();
        }
      }
    
    }
    

    To look up an element in a collection, adding it when it is not there yet (this will be used on the list of potential keywords):

    public static <T> T find(Collection<T> collection, T example) {
      for (T element : collection) {
        if (element.equals(example)) {
          return element;
        }
      }
      collection.add(example);
      return example;
    }
    
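    As a quick illustration of this "find or add" behaviour, here is a small sketch using plain strings instead of keywords. The FindDemo class name is only for this example; the find method is copied from above.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class FindDemo {

  // same method as above: returns the equal element already stored,
  // or stores the example and returns it
  static <T> T find(Collection<T> collection, T example) {
    for (T element : collection) {
      if (element.equals(example)) {
        return element;
      }
    }
    collection.add(example);
    return example;
  }

  public static void main(String[] args) {
    List<String> seen = new ArrayList<String>();

    // two distinct but equal String instances
    String first = find(seen, new String("java"));
    String second = find(seen, new String("java"));

    System.out.println(first == second); // true: the stored instance is returned
    System.out.println(seen.size());     // 1: the second example was not added
  }
}
```

    Each call scans the whole collection, so the cost grows with the number of distinct stems. For large inputs, a Map keyed by stem would probably be a better fit, but that is a design change, not something the code above does.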

    Core

    Here is the main input method:

    public static List<Keyword> guessFromString(String input) throws IOException {
    
      TokenStream tokenStream = null;
      try {
    
        // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
        input = input.replaceAll("-+", "-0");
        // replace any punctuation char but apostrophes and dashes by a space
        input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
        // strip the most common English contraction suffixes ('t, 'd, 's, 'm, 've, 're, 'll)
        input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");
    
        // tokenize input
        tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
        // to lowercase
        tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
        // remove dots from acronyms (and "'s" but already done manually above)
        tokenStream = new ClassicFilter(tokenStream);
        // convert any char to ASCII
        tokenStream = new ASCIIFoldingFilter(tokenStream);
        // remove english stop words
        tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());
    
        List<Keyword> keywords = new LinkedList<Keyword>();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
          String term = token.toString();
          // stem each term
          String stem = stem(term);
          if (stem != null) {
            // create the keyword or get the existing one if any
            Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
            // add its corresponding initial token
            keyword.add(term.replaceAll("-0", "-"));
          }
        }
    
        // sort by descending frequency (see Keyword.compareTo)
        Collections.sort(keywords);
    
        return keywords;
    
      } finally {
        if (tokenStream != null) {
          tokenStream.close();
        }
      }
    
    }
    
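    The three replaceAll calls at the top of guessFromString are easier to follow on a concrete string. The sketch below isolates them in a hypothetical preprocess helper (the PreprocessDemo class and the helper name are not part of the code above) and needs only the JDK.

```java
class PreprocessDemo {

  // the same three regex steps as in guessFromString, in the same order
  static String preprocess(String input) {
    // mark dashes so that dashed words survive tokenization as one token
    input = input.replaceAll("-+", "-0");
    // replace any punctuation char but apostrophes and dashes by a space
    input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
    // strip common English contraction suffixes
    input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");
    return input;
  }

  public static void main(String[] args) {
    String cleaned = preprocess("Sun's non-specific compilers.");
    System.out.println("[" + cleaned + "]");
    // prints: [Sun non-0specific compilers ]

    // the marker is removed again once a token has been stemmed
    System.out.println("non-0specific".replaceAll("-0", "-"));
    // prints: non-specific
  }
}
```

    The "-0" marker is what lets the tokenizer keep "non-specific" as a single token, which is why guessFromString undoes it with replaceAll("-0", "-") on both the stem and the term.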

    Example

    Running the guessFromString method on the introduction of the Java Wikipedia article gives the following 10 most frequent keywords (i.e. stems):

    java         x12    [java]
    compil       x5     [compiled, compiler, compilers]
    sun          x5     [sun]
    develop      x4     [developed, developers]
    languag      x3     [languages, language]
    implement    x3     [implementation, implementations]
    applic       x3     [application, applications]
    run          x3     [run]
    origin       x3     [originally, original]
    gnu          x3     [gnu]
    

    To see which original words were found for each stem, iterate over the returned list and read each keyword's terms set (displayed between brackets [...] in the above example).


    What's next

    Compare each stem's relative frequency (its frequency divided by the sum of all frequencies) with the corresponding figure from English language statistics, and keep me in the loop if you manage it: I could be quite interested too :)
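
    One possible shape for that comparison, as a rough sketch only: divide each stem's share of the text by its share of a reference English corpus and keep the stems whose ratio is well above 1. Everything here is an assumption for illustration: the KeywordScoring class, the overRepresentation helper, and above all the reference frequencies, which are placeholder numbers and not real corpus statistics.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeywordScoring {

  /**
   * Ratio between a stem's share of the text and its share of a reference
   * English corpus. Values well above 1 suggest a keyword candidate.
   */
  static double overRepresentation(int count, int totalCount, double referenceFrequency) {
    double textFrequency = (double) count / totalCount;
    return textFrequency / referenceFrequency;
  }

  public static void main(String[] args) {
    // stem -> occurrences in the text (three counts taken from the example output)
    Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
    counts.put("java", 12);
    counts.put("compil", 5);
    counts.put("run", 3);

    // ASSUMPTION: placeholder reference frequencies, not real corpus statistics
    Map<String, Double> reference = new LinkedHashMap<String, Double>();
    reference.put("java", 0.00001);
    reference.put("compil", 0.00002);
    reference.put("run", 0.001);

    // the "frequencies sum": total over the stems being compared
    int total = 0;
    for (int c : counts.values()) {
      total += c;
    }

    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      double ratio = overRepresentation(e.getValue(), total, reference.get(e.getKey()));
      System.out.printf("%-8s ratio=%.1f%n", e.getKey(), ratio);
    }
  }
}
```

    A real version would also have to decide what to do with stems that are missing from the reference data, and the reference frequencies would need to be computed on stems produced by the same stemmer.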