As an example, https://download.bls.gov/pub/time.series/ shows date / timestamp / file size information that doesn't appear to be enclosed by HTML tags. If we'd like to capture the date and timestamp information related to each link, what are ideal techniques for doing so with JSoup?
<br> 9/14/2021 8:31 AM 2114 <A HREF="/pub/time.series/ap/ap.area">ap.area</A><br> 4/14/2005 2:53 PM 987 <A HREF="/pub/time.series/ap/ap.contacts">ap.contacts</A><br>
There is some debate about whether this sort of information can be parsed efficiently (see "Getting directory listing over http").
But if we examine your concrete example, we observe the following. First, the metadata (date, time, file size) is stored in TextNodes inside the pre element. Second, each link (a element) has a direct sibling br that precedes it. Well, except for the root directory: https://download.bls.gov/. You have to treat that case separately.
This constitutes enough information for efficient queries:
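import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import java.util.List;

Document doc = Jsoup.connect("https://download.bls.gov/pub/time.series/").get();
// Links directly follow a <br>; their metadata are the text nodes directly inside <pre>.
Elements links = doc.select("pre br + a");
List<TextNode> metaData = doc.select("pre").textNodes();
for (int i = 0; i < links.size(); i++) {
    String metaDataRow = metaData.get(i).text();
    System.out.println(metaDataRow + " | " + links.get(i));
}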
You can further split up the metaDataRow to extract timestamps like so:
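// Needs java.time.LocalDateTime, java.time.format.DateTimeFormatter and java.util.Locale.
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("M/d/yyyy h:mm a", Locale.ENGLISH);
// ...
// Columns after splitting on whitespace: date, time, AM/PM marker, file size.
String[] metaColumns = metaDataRow.strip().split("\\s+");
String timestamp = metaColumns[0] + " " + metaColumns[1] + " " + metaColumns[2];
LocalDateTime lastUpdated = LocalDateTime.parse(timestamp, formatter);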
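If you want the whole row rather than just the timestamp, the splitting step can be wrapped in a small helper. The following is only a sketch: the class name ListingRow and its fields are made up for illustration, and it assumes every row has exactly the four columns seen in your sample (date, time, AM/PM, file size). Rows that look different, such as those in the root directory, are rejected rather than guessed at.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of jsoup: holds one parsed metadata row.
public class ListingRow {
    // Assumed row format, taken from the sample: "9/14/2021 8:31 AM 2114".
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("M/d/yyyy h:mm a", Locale.ENGLISH);

    final LocalDateTime lastUpdated;
    final long fileSize;

    ListingRow(LocalDateTime lastUpdated, long fileSize) {
        this.lastUpdated = lastUpdated;
        this.fileSize = fileSize;
    }

    // Parses the text of one TextNode, e.g. the metaDataRow from the loop above.
    static ListingRow parse(String metaDataRow) {
        // Split on runs of whitespace so column padding does not matter.
        String[] columns = metaDataRow.strip().split("\\s+");
        if (columns.length != 4) {
            throw new IllegalArgumentException("Unexpected row: " + metaDataRow);
        }
        String timestamp = columns[0] + " " + columns[1] + " " + columns[2];
        return new ListingRow(LocalDateTime.parse(timestamp, FORMAT),
                Long.parseLong(columns[3]));
    }

    public static void main(String[] args) {
        ListingRow row = ListingRow.parse(" 9/14/2021  8:31 AM         2114 ");
        System.out.println(row.lastUpdated + " | " + row.fileSize);
    }
}
```

Throwing on an unexpected column count is a deliberate choice here: it makes a misaligned row fail loudly instead of silently pairing a link with the wrong timestamp.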