java html inputstreamreader

Fetch part of an HTML page in Java


I'm having trouble understanding how to download only part of an HTML page. I tried the traditional way, using the URL::openStream method and a BufferedReader, but I'm not sure whether this approach forces me to download the whole page. The problem is this: I have a fairly large HTML page and I need to parse two numbers from it, which update at least once a second. The approach above only detects changes every 2-3 seconds, and I wonder whether there is a way to make it faster. So I thought fetching only part of the page might help.


Solution

  • I think you should check how the page itself fetches the data (SSE or WebSocket) and try to subscribe to that service directly. If that is impossible, try a more efficient XML parser. I recommend https://vtd-xml.sourceforge.io/; it can be ~10x faster than the DOM parser that comes with the JDK.

    Also be careful with BufferedReader.readLine(): it has a hidden allocation cost, because it creates a String for every line, including the lines you don't need. (This is fairly advanced territory, where you have to think about CPU memory bandwidth, L1 cache misses, etc.)

    Example using the library I mentioned:

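    import com.ximpleware.AutoPilot;
    import com.ximpleware.VTDGen;
    import com.ximpleware.VTDNav;
    import java.nio.charset.StandardCharsets;

    byte[] pageInBytes = readAllBytesFromTheURL(); // placeholder for your own download code
    VTDGen vg = new VTDGen();
    vg.setDoc(pageInBytes);
    vg.parse(false); // false = not namespace-aware; the page must be well-formed XML
    VTDNav vn = vg.getNav();

    AutoPilot ap = new AutoPilot(vn);

    // Jump to the section that we want to process
    ap.selectXPath("/html/body/div");
    if (ap.evalXPath() != -1) { // moves the cursor to the first matching element
        // getElementFragment() packs the length (upper 32 bits) and the offset (lower 32 bits)
        long fragment = vn.getElementFragment();
        int offset = (int) fragment;
        int length = (int) (fragment >> 32);
        String divFragment = new String(pageInBytes, offset, length, StandardCharsets.UTF_8); // assumes UTF-8
    }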
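
A rough sketch of the readLine() point above, in case it helps: instead of building a String per line, you can scan the raw bytes for a marker that precedes the value and stop reading as soon as the value is complete. Everything named here is an assumption for illustration: the `PartialFetch` class, the `readAfterMarker` helper, and the `id="price">` marker are hypothetical, and the real page's markup will differ. Stopping early only avoids reading the rest of the response, so how much it helps depends on how far into the page the numbers sit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class PartialFetch {

    // Knuth-Morris-Pratt failure table, so the scan never has to re-read bytes
    // from the stream after a partial match of the marker fails.
    static int[] failureTable(byte[] pattern) {
        int[] fail = new int[pattern.length];
        int k = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
            if (pattern[i] == pattern[k]) k++;
            fail[i] = k;
        }
        return fail;
    }

    // Returns the text between the first occurrence of `marker` and the next
    // `terminator` byte, or null if the marker never appears.
    // Assumes the terminator is an ASCII character and the page is UTF-8.
    static String readAfterMarker(InputStream in, String marker, char terminator)
            throws IOException {
        byte[] pattern = marker.getBytes(StandardCharsets.UTF_8);
        if (pattern.length == 0) throw new IllegalArgumentException("empty marker");
        int[] fail = failureTable(pattern);
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            while (matched > 0 && (byte) b != pattern[matched]) matched = fail[matched - 1];
            if ((byte) b == pattern[matched]) matched++;
            if (matched == pattern.length) {
                // Marker found: collect bytes up to the terminator, then stop reading.
                ByteArrayOutputStream value = new ByteArrayOutputStream();
                while ((b = in.read()) != -1 && b != terminator) value.write(b);
                return new String(value.toByteArray(), StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the real page, so the sketch runs without a network.
        byte[] page = "<html><body><span id=\"price\">42.17</span><p>rest of a long page</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(page)) {
            System.out.println(readAfterMarker(in, "id=\"price\">", '<')); // prints 42.17
        }
    }
}
```

With a real page you would pass something like `new BufferedInputStream(url.openStream())`, so that the single-byte reads hit a buffer rather than the socket, and close the stream right after the call.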