java pdf icepdf

Search for sentences and get the line number using icepdf


I tried searching for sentences with icepdf and got the right results most of the time. But the problems I am facing now are


Solution

  • Loop through all the lines in the document and create a list of sentences. Each sentence can be a list of WordText objects. Then search through the list of lists you have created to find your sentence.

    Here is some example code (untested) that builds the list of lists of WordText objects.

    import java.util.ArrayList;
    import java.util.List;
    import org.icepdf.core.pobjects.Document;
    import org.icepdf.core.pobjects.graphics.text.LineText;
    import org.icepdf.core.pobjects.graphics.text.PageText;
    import org.icepdf.core.pobjects.graphics.text.WordText;

    // Checked exceptions (from setFile and getPageText) are not handled here.
    List<List<WordText>> sentences = new ArrayList<>();
    List<WordText> currentSentence = new ArrayList<>();
    Document document = new Document();
    document.setFile(pdfPath); // pdfPath: placeholder for the path to your PDF

    // Build sentences
    for (int pageNumber = 0, max = document.getNumberOfPages();
         pageNumber < max; pageNumber++) {
      PageText pageText = document.getPageText(pageNumber);
      List<LineText> pageLines = pageText.getPageLines();
      for (LineText pageLine : pageLines) {
        List<WordText> words = pageLine.getWords();
        for (WordText word : words) {
          // If this is a word, and the last word was not a space,
          // start a new sentence
          if (!word.getText().equals(" ") && currentSentence.size() > 0
              && !currentSentence.get(currentSentence.size() - 1).getText().equals(" ")) {
            sentences.add(currentSentence);
            currentSentence = new ArrayList<>();
          }
          // Add word to current sentence
          currentSentence.add(word);
        }
      }
    }
    // Add the last sentence in
    if (!currentSentence.isEmpty()) {
      sentences.add(currentSentence);
    }
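
Once the sentences are built, the search step could look roughly like the sketch below. It is only an illustration and makes some assumptions: each sentence has already been reduced to the text of its words (for example by calling getText() on every WordText), spaces appear as their own " " entries as in the loop above, and the SentenceSearch class with its find and normalize methods is a made-up helper, not part of ICEpdf.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an ICEpdf class): searches sentences that are
// represented as lists of word texts.
class SentenceSearch {

    // Joins the word texts of one sentence and collapses runs of whitespace.
    // Words are joined without a separator because spaces are assumed to be
    // separate " " entries in the list.
    static String normalize(List<String> words) {
        return String.join("", words).replaceAll("\\s+", " ").trim();
    }

    // Returns the indexes of all sentences whose text contains the query,
    // ignoring case and differences in whitespace.
    static List<Integer> find(List<List<String>> sentences, String query) {
        String needle = query.replaceAll("\\s+", " ").trim().toLowerCase();
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            String haystack = normalize(sentences.get(i)).toLowerCase();
            if (haystack.contains(needle)) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

Calling SentenceSearch.find(sentenceTexts, "your sentence") returns the positions of the matching sentences in the list. With those positions you can go back to the corresponding WordText lists, for example to highlight the words or read their coordinates.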
    

    If you need to sort your WordText lists, you can compare the WordText objects by their y values and then by their x values.
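
A sort by y and then x might be sketched as follows. To keep the example self-contained, it uses a stand-in Word class with plain x and y fields instead of WordText; with ICEpdf you would take the coordinates from each word's bounds (I believe via getBounds(), but check this against your version). It also assumes PDF-style coordinates where y grows upward, so a larger y means higher on the page. The WordSort and Word names are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class WordSort {

    // Stand-in for WordText (hypothetical): only the text and the position.
    static final class Word {
        final String text;
        final double x;
        final double y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Top of the page first (larger y), then left to right (smaller x).
    // Assumes y increases upward; drop reversed() if your y grows downward.
    static final Comparator<Word> READING_ORDER =
        Comparator.comparingDouble((Word w) -> w.y).reversed()
                  .thenComparingDouble(w -> w.x);

    // Returns a sorted copy and leaves the input list untouched.
    static List<Word> sorted(List<Word> words) {
        List<Word> copy = new ArrayList<>(words);
        copy.sort(READING_ORDER);
        return copy;
    }
}
```

Note that this compares y values exactly. If words on the same visual line have slightly different y values, you may need to round y or compare with a small tolerance before falling back to x.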