java jsoup directory-listing

How to parse unstructured data (e.g. from an HTML directory listing) using jsoup?


As an example, https://download.bls.gov/pub/time.series/ shows date, time, and file size information that doesn't appear to be enclosed by HTML tags. If we'd like to associate the date and time with each link, what are good techniques for capturing this information using jsoup?

<br> 9/14/2021  8:31 AM         2114 <A HREF="/pub/time.series/ap/ap.area">ap.area</A><br> 4/14/2005  2:53 PM          987 <A HREF="/pub/time.series/ap/ap.contacts">ap.contacts</A><br>

Solution

  • There is some debate about whether this sort of information can be parsed efficiently in general (see "Getting directory listing over http").

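    But if we examine your concrete example, we observe the following: entries are separated by <br>, each entry's metadata is a plain text node, and that text node is immediately followed by the link it describes. The selectors below also assume that the whole listing sits inside a <pre> element.

    This constitutes enough information for efficient queries: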

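    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.TextNode;
    import org.jsoup.select.Elements;

    Document doc = Jsoup.connect("https://download.bls.gov/pub/time.series/").get();
    // Links directly after a <br>, and the text nodes of <pre> that hold date/time/size.
    Elements links = doc.select("pre br + a");
    List<TextNode> metaData = doc.select("pre").textNodes();
    for (int i = 0; i < Math.min(links.size(), metaData.size()); i++) {
        String metaDataRow = metaData.get(i).getWholeText(); // raw text, spacing preserved
        System.out.println(metaDataRow + " | " + links.get(i));
    }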
    

    You can further split up the metaDataRow to extract timestamps like so:

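    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;
    import java.util.Locale;
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern("M/d/yyyy pph:m a", Locale.ENGLISH);
    // ...
    String[] metaColumns = metaDataRow.split("        ");
    // LocalDateTime keeps the time of day; LocalDate would silently drop it.
    LocalDateTime lastUpdated = LocalDateTime.parse(metaColumns[0].strip(), formatter);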
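
The fixed-width split above relies on at least eight spaces separating the time from the size column, which may not hold for larger files. As a rough sketch of an alternative (not part of the approach shown above), a regular expression can pull out the columns regardless of padding. The names `ListingRowParser` and `Row` are hypothetical, and the `<dir>` marker for directory entries is an assumption about how such listings label folders; the pattern was shaped only from the two sample rows in the question.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses one metadata row such as " 9/14/2021  8:31 AM         2114 ".
final class ListingRowParser {

    // Hypothetical value holder; sizeBytes is -1 when the row describes a directory.
    static final class Row {
        final LocalDateTime lastModified;
        final long sizeBytes;

        Row(LocalDateTime lastModified, long sizeBytes) {
            this.lastModified = lastModified;
            this.sizeBytes = sizeBytes;
        }
    }

    // Groups: date, time, AM/PM marker, then a byte count or "<dir>" (the latter is an assumption).
    private static final Pattern ROW = Pattern.compile(
            "\\s*(\\d{1,2}/\\d{1,2}/\\d{4})\\s+(\\d{1,2}:\\d{2})\\s+([AP]M)\\s+(<dir>|\\d{1,18})\\s*");

    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("M/d/yyyy h:mm a", Locale.ENGLISH);

    // Returns an empty Optional when the text is not a well-formed listing row.
    static Optional<Row> parse(String metaDataRow) {
        Matcher m = ROW.matcher(metaDataRow);
        if (!m.matches()) {
            return Optional.empty();
        }
        // Rejoin the pieces with single spaces so the formatter needs no padding rules.
        String stamp = m.group(1) + " " + m.group(2) + " " + m.group(3);
        try {
            LocalDateTime lastModified = LocalDateTime.parse(stamp, FORMAT);
            long size = m.group(4).equals("<dir>") ? -1L : Long.parseLong(m.group(4));
            return Optional.of(new Row(lastModified, size));
        } catch (DateTimeParseException e) {
            // Digits matched the pattern but do not form a valid date or time.
            return Optional.empty();
        }
    }
}
```

Inside the loop over the text nodes, `ListingRowParser.parse(metaDataRow)` would then give the timestamp and size for each link, and rows that do not match come back empty and can be skipped.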