[SOLVED] How to parse unstructured data (e.g. from an HTML directory listing) using jsoup?

As an example, https://download.bls.gov/pub/time.series/ shows date, timestamp, and file size information that doesn't appear to be enclosed by HTML tags. If we'd like to associate the date and timestamp with each link, what are good techniques for capturing this information using jsoup?

<br> 9/14/2021  8:31 AM         2114 <A HREF="/pub/time.series/ap/ap.area">ap.area</A><br> 4/14/2005  2:53 PM          987 <A HREF="/pub/time.series/ap/ap.contacts">ap.contacts</A><br>

Solution

There is some debate about whether this sort of information can be parsed efficiently; see "Getting directory listing over http".

But if we examine your concrete example, we observe the following:

- The file/folder metadata is stored as TextNodes directly inside the pre element.
- Every relevant file/folder link (an a element) is immediately preceded by a sibling br element. The one exception is the root directory, https://download.bls.gov/, which you have to handle separately.

This constitutes enough information for efficient queries:

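import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
Document doc = Jsoup.connect("https://download.bls.gov/pub/time.series/").get();
// Listing links are the <a> elements that directly follow a <br> inside <pre>.
Elements links = doc.select("pre br + a");
List<TextNode> metaData = doc.select("pre").textNodes();
for (int i = 0; i < links.size(); i++) {
    // getWholeText() returns the raw text, preserving the column spacing.
    String metaDataRow = metaData.get(i).getWholeText();
    System.out.println(metaDataRow + " | " + links.get(i));
}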

You can further split up the metaDataRow to extract timestamps like so:

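import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
// "pph" accepts a space-padded hour such as " 8" as well as "12".
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("M/d/yyyy pph:m a", Locale.ENGLISH);
// ...
String[] metaColumns = metaDataRow.split("        ");
LocalDateTime lastUpdated = LocalDateTime.parse(metaColumns[0].strip(), formatter);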
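
Splitting on a fixed run of eight spaces depends on how wide the size column happens to be. In the sample rows the size looks right-aligned, so a larger file may leave fewer spaces between the time and the size. As a rough alternative, the row can be matched with a regular expression instead. The sketch below is only an illustration: the class ListingRowParser and its methods are hypothetical names, the row layout is inferred from the two sample rows in the question, and the "<dir>" marker for folders is an assumption about how the server labels directories.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of jsoup): parses one metadata row of the listing.
class ListingRowParser {
    // Columns: date, 12-hour time, AM/PM marker, then a byte count or (assumed) "<dir>".
    private static final Pattern ROW = Pattern.compile(
            "\\s*(\\d{1,2}/\\d{1,2}/\\d{4})\\s+(\\d{1,2}:\\d{2})\\s+([AP]M)\\s+(\\d+|(?i:<dir>))\\s*");
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("M/d/yyyy h:mm a", Locale.ENGLISH);

    private static Matcher match(String row) {
        Matcher m = ROW.matcher(row);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized listing row: " + row);
        }
        return m;
    }

    /** Last-modified timestamp of the row, without a time zone. */
    static LocalDateTime lastModified(String row) {
        Matcher m = match(row);
        // Rebuild the timestamp with single spaces so the formatter needs no padding rules.
        return LocalDateTime.parse(m.group(1) + " " + m.group(2) + " " + m.group(3), STAMP);
    }

    /** File size in bytes, or -1 when the size column is not a number. */
    static long sizeInBytes(String row) {
        String size = match(row).group(4);
        return Character.isDigit(size.charAt(0)) ? Long.parseLong(size) : -1L;
    }

    public static void main(String[] args) {
        String row = " 9/14/2021  8:31 AM         2114 ";
        System.out.println(lastModified(row) + " | " + sizeInBytes(row) + " bytes");
    }
}
```

Inside the loop over the links you could then call ListingRowParser.lastModified(metaDataRow) and ListingRowParser.sizeInBytes(metaDataRow). A row that does not match the assumed layout throws an IllegalArgumentException, so a mismatch surfaces immediately rather than yielding a wrong date.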