java xpath tag-soup

Troubles with XPath and Links


My first time posting!

I'm using XPath and Tag-Soup to parse a webpage and read in its data. Because the pages are news articles, they sometimes have links embedded in the content, and these links are what break my program.

The XPath I'm using is storyPath = "//html:article//html:p//text()"; and the page has this structure:

<article ...>
   <p>Some text from the story.</p>
   <p>More of the story, which proves <a href="">what a great story this is</a>!</p>
   <p>More of the story without links!</p>
</article>

The code that evaluates the XPath is:

NodeList nL = XPathAPI.selectNodeList(doc,storyPath);

LinkedList<String> story = new LinkedList<String>();
for (int i=0; i<nL.getLength(); i++) {
    Node n = nL.item(i);

    String tmp = n.toString();
    tmp = tmp.replace("[#text:", "");
    tmp = tmp.replace("]", "");
    tmp = tmp.replaceAll("’", "'");
    tmp = tmp.replaceAll("‘", "'");
    tmp = tmp.replaceAll("–", "-");
    tmp = tmp.replaceAll("¬", "");
    tmp = tmp.trim();

    story.add(tmp);
}

this.setStory(story);
...

private void setStory(LinkedList<String> story) {
    String tmp = "";
    for (String p : story) {
        tmp = tmp + p + "\n\n";
    }

    this.story = tmp.trim();
}

The output this gives me is:

Some text from the story.

More of the story, which proves 

what a great story this is

!

More of the story without links!

Does anyone know how I can eliminate this error? Am I taking the wrong approach somewhere? (I understand the setStory code could well be at fault, but I don't see another way.)

Without the tmp.replace() calls, every result appears in the form [#text: what a great story this is].

EDIT:

I am still having trouble, though possibly of a different kind. The cause is again a link: in the BBC's markup the link sits on its own line, so the text is still read in split across lines, much as described above (note that the original problem was fixed with the example given). The relevant section of the BBC page is:

    <p>    Former Queens Park Rangers trainee Sterling, who 

    <a  href="http://news.bbc.co.uk/sport1/hi/football/teams/l/liverpool/8541174.stm" >moved to the Merseyside club in February 2010 aged 15,</a> 

    had not started a senior match for the Reds before this season.
    </p>

which appears in my output as:

    Former Queens Park Rangers trainee Sterling, who 

    moved to the Merseyside club in February 2010 aged 15, 

         had not started a senior match for the Reds before this season.

Solution

  • For the problem in your edit, where newlines in the HTML source carry over into your output text, collapse the whitespace before printing. Instead of System.out.print(text.trim()); use System.out.println(text.trim().replaceAll("[ \t\r\n]+", " "));
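
As a rough illustration of that whitespace fix, here is a minimal sketch using only the JDK's built-in DOM and XPath classes. It makes some assumptions that are not in the question: the class name ParagraphText and its helper methods are hypothetical, the markup is a small well-formed stand-in parsed without Tag-Soup or the html: namespace prefix, and each paragraph's text is read with Node.getTextContent() on the whole <p> element rather than from separate text() nodes. Adapt it to your own parser setup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ParagraphText {

    // Collapse every run of spaces, tabs and line breaks into a single space.
    public static String normalize(String text) {
        return text.trim().replaceAll("[ \t\r\n]+", " ");
    }

    // Select whole <p> elements (not their text() children), so link text
    // stays inside the paragraph it belongs to.
    public static List<String> paragraphs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//article//p", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() concatenates the text of all descendants.
            result.add(normalize(nodes.item(i).getTextContent()));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in markup with the link on its own line, as on the BBC page.
        String xml = "<article><p>  Former trainee Sterling, who\n\n"
                + "  <a href=\"#\">moved to the club in February 2010 aged 15,</a>\n\n"
                + "  had not started a senior match.\n</p></article>";
        for (String p : paragraphs(xml)) {
            System.out.println(p);
        }
    }
}
```

With this shape, each list entry is one complete paragraph on a single line, so joining the entries with "\n\n" as setStory does should no longer break a paragraph around its link.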