Hi, I want to scrape some text from a website using the jsoup library. The following code gives me the whole webpage, but I want to extract just one specific line. Here is the code I am using:
Document doc = null;
try {
    doc = Jsoup.connect("http://www.example.com").get();
} catch (IOException e) {
    e.printStackTrace();
}
String text = doc.html();
System.out.println(text);
That prints the following:
<html>
<head></head>
<body>
Martin,James,28,London,20k
<br /> Sarah,Jackson,43,Glasgow,32k
<br /> Alex,Cook,22,Liverpool,18k
<br /> Jessica,Adams,34,London,27k
<br />
</body>
</html>
How can I extract just the 6th line, which reads Alex,Cook,22,Liverpool,18k, and put it into an array where each element is one of the comma-separated values (e.g. [0] = Alex, [1] = Cook, etc.)?
Something like this should work, although you may need to clean up the result a bit (for example, trim whitespace):
Document doc = Jsoup.connect("http://www.example.com").get();
int count = 0; // index of the current TextNode
for( Node n : doc.body().childNodes() )
{
    if( n instanceof TextNode )
    {
        if( count == 2 ) // third TextNode: the 'Alex' line
        {
            // text() returns the node's text; trim() drops surrounding whitespace
            String[] t = ((TextNode) n).text().trim().split(","); // array with each value as a string
            System.out.println(Arrays.toString(t)); // e.g. output
        }
        count++;
    }
}
Output:
[Alex, Cook, 22, Liverpool, 18k]
Since you can't select TextNodes by their content (that is only possible with Elements), you need a small workaround:
for( Node n : doc.body().childNodes() )
{
    if( n instanceof TextNode )
    {
        String str = ((TextNode) n).text().trim();
        if( str.toLowerCase().startsWith("alex") ) // Node 'Alex'
        {
            String[] t = str.split(","); // array with each value as a string
            System.out.println(Arrays.toString(t)); // e.g. output
        }
    }
}
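If you want to try the matching and splitting step on its own, without fetching a page, here is a minimal sketch in plain Java. It assumes the text of each TextNode has already been collected into a list of strings; the class name LineFields and the method findFields are made up for this example and are not part of jsoup.

```java
import java.util.Arrays;
import java.util.List;

class LineFields {

    // Returns the comma-separated fields of the first line that starts with
    // the given prefix (case-insensitive), or an empty array if none matches.
    static String[] findFields(List<String> lines, String prefix) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase().startsWith(prefix.toLowerCase())) {
                // "\\s*,\\s*" also drops any spaces around the commas
                return trimmed.split("\\s*,\\s*");
            }
        }
        return new String[0];
    }

    public static void main(String[] args) {
        // Stand-in for the text of each TextNode in the page body (assumed input)
        List<String> lines = Arrays.asList(
                "Martin,James,28,London,20k",
                " Sarah,Jackson,43,Glasgow,32k",
                " Alex,Cook,22,Liverpool,18k",
                " Jessica,Adams,34,London,27k");

        String[] fields = findFields(lines, "alex");
        System.out.println(Arrays.toString(fields)); // [Alex, Cook, 22, Liverpool, 18k]
    }
}
```

Returning an empty array when nothing matches is just one option; throwing an exception or returning an Optional would work as well, depending on how you want to handle a missing line.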