I am trying to write a Java program that automatically fetches the source code of a particular web page, but the source my program retrieves differs from what I see when I right-click the page in a browser. Below is my current attempt, based on code I found online; it does not work. I need the text of the reviews, and this code does not return it.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class PageSource {
    public static void main(String[] args) throws IOException {
        URL url = new URL(
                "http://www.tripadvisor.com/ShowUserReviews-g60745-d481776-r184086024-Prudential_Center-Boston_Massachusetts.html#REVIEWS");
        // Send a browser-like User-Agent with the request
        URLConnection spoof = url.openConnection();
        spoof.setRequestProperty("User-Agent",
                "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
        StringBuilder finalHTML = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                spoof.getInputStream()))) {
            String strLine;
            // Loop through every line in the source
            while ((strLine = in.readLine()) != null) {
                finalHTML.append(strLine).append("\n");
            }
        }
        System.out.println(finalHTML);
    }
}
You generally cannot retrieve "the source code" of a page unless it is a 1990s-style, purely static HTML page. A modern page consists of HTML (or XML+XSLT) plus CSS, along with JavaScript that modifies the DOM after the page has loaded, so the HTML your program downloads is only the starting point, not what the browser ends up displaying.
In addition, after the page has loaded, the DOM can continue to be modified in response to events, and scripts can continue to fetch data from one or more servers via Ajax or WebSockets. So there is no such thing as "the source code" unless you mean just the originally transmitted HTML, CSS, JavaScript and images.
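One rough way to see this for a given page is to scan the originally transmitted HTML for script elements and check whether the text you want is in it at all. The following is only a minimal sketch: the class name `RawHtmlScriptScan`, its helper methods, and the sample markup (including `reviews.js` and `loadReviews()`) are hypothetical, and a regular expression stands in for a real HTML parser. In practice the `rawHtml` string would be whatever `URLConnection` returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawHtmlScriptScan {

    // Matches an opening <script ...> tag and captures its attribute text.
    // A regex is enough for a rough illustration; it is not a real HTML parser.
    private static final Pattern SCRIPT_TAG =
            Pattern.compile("<script\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Matches src="..." or src='...' inside the captured attribute text.
    private static final Pattern SRC_ATTR =
            Pattern.compile("\\bsrc\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the src value of every external script referenced in the raw HTML. */
    static List<String> externalScripts(String rawHtml) {
        List<String> sources = new ArrayList<>();
        Matcher tag = SCRIPT_TAG.matcher(rawHtml);
        while (tag.find()) {
            Matcher src = SRC_ATTR.matcher(tag.group(1));
            if (src.find()) {
                sources.add(src.group(1));
            }
        }
        return sources;
    }

    /** Counts script tags that have no src attribute, i.e. inline scripts. */
    static int inlineScriptCount(String rawHtml) {
        int count = 0;
        Matcher tag = SCRIPT_TAG.matcher(rawHtml);
        while (tag.find()) {
            if (!SRC_ATTR.matcher(tag.group(1)).find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the HTML that URLConnection would return:
        // an empty container plus scripts that would fill it in a browser.
        String rawHtml = "<html><body>"
                + "<div id=\"reviews\"></div>"
                + "<script src=\"/js/reviews.js\"></script>"
                + "<script>loadReviews();</script>"
                + "</body></html>";

        System.out.println("External scripts: " + externalScripts(rawHtml));
        System.out.println("Inline scripts: " + inlineScriptCount(rawHtml));
        // The review text is absent because only a script would insert it.
        System.out.println("Contains review text: " + rawHtml.contains("Great arena"));
    }
}
```

If the text you are after is missing from the raw HTML while scripts like these are present, the content is most likely being added to the DOM after loading, and reading the response stream alone will not produce it.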