java, jericho-html-parser

How do I look for a custom start tag using Jericho in Java?


As the title says, I'm trying to match a non-standard start tag of the form <foo:bar ...>.

How would I do this with Jericho?

Edit:

I have created the following custom StartTagType:

PrimoResultStartTagType primoSTT = new PrimoResultStartTagType("search", "<sear:DOC", ">", EndTagType.NORMAL, false, true, true);

...and:

class PrimoResultStartTagType extends StartTagType {

    protected PrimoResultStartTagType(String arg0, String arg1, String arg2, EndTagType arg3, boolean arg4, boolean arg5, boolean arg6) {
        super(arg0, arg1, arg2, arg3, arg4, arg5, arg6);
    }

    @Override
    protected Tag constructTagAt(Source arg0, int arg1) {
        return null;
    }

}

However, when I call source.getAllElements(...), I get no matches.


Solution

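  • Maybe this will help. In the example below, Jericho finds a namespaced tag by name with getAllElements(String), without a custom StartTagType.

    Example HTML: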

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
    <html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
        <title>StartTagType (Jericho HTML Parser 3.1)</title>
    </head>
    
    <body>
    
    <span>simple tag</span>
    
    <test:name>custom tag</test:name>
    
    </body>
    
    </html>
    

    And sample code:

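    import java.io.IOException;
    import java.net.URL;
    import java.util.List;

    import net.htmlparser.jericho.Element;
    import net.htmlparser.jericho.Source;

    public class Main {

        public static void main(String[] args) throws IOException {
            // Load test.html from the classpath
            URL url = Main.class.getClassLoader().getResource("test.html");
            Source source = new Source(url);

            // Look up the namespaced tag by its full name, prefix included
            List<Element> elementList = source.getAllElements("test:name");
            for (Element element : elementList) {
                System.out.println("Custom tag content: " + element.getContent().toString());
            }
        }
    }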

    Output:

    Custom tag content: custom tag
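
To try the same lookup without a file on the classpath, it can be run against an in-memory string shaped like the <foo:bar ...> tag from the question. This is only a minimal sketch, assuming the Jericho 3.x jar is available; the class name NamespacedTagDemo, the helper contentsOf, and the sample markup are hypothetical and not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

public class NamespacedTagDemo {

    // Hypothetical helper: returns the content of every element named tagName.
    static List<String> contentsOf(String markup, String tagName) {
        // Source accepts any CharSequence, so no file or URL is needed here
        Source source = new Source(markup);
        List<String> contents = new ArrayList<>();
        for (Element element : source.getAllElements(tagName)) {
            contents.add(element.getContent().toString());
        }
        return contents;
    }

    public static void main(String[] args) {
        // Sample markup (an assumption for illustration): two namespaced tags and one ordinary tag
        String markup = "<html><body>"
                + "<foo:bar id=\"1\">first</foo:bar>"
                + "<span>ignored</span>"
                + "<foo:bar id=\"2\">second</foo:bar>"
                + "</body></html>";

        for (String content : contentsOf(markup, "foo:bar")) {
            System.out.println("Custom tag content: " + content);
        }
    }
}
```

The tag name is kept lower case here, matching the answer's example; whether a mixed-case name such as sear:DOC needs lower-casing before the lookup is worth checking against the Jericho documentation.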