
Get the first lines of a Wikipedia article


I have a Wikipedia article and I want to fetch the first z lines (or the first x characters, or the first y words; it doesn't matter) from the article.

The problem: I can get either the source wikitext (via the API) or the parsed HTML (via a direct HTTP request, possibly on the print version), but how can I find the first lines that are actually displayed? Normally, the source (both HTML and wikitext) starts with the info-boxes and images, and the first real text to display is somewhere further down in the code.

For example:

Albert Einstein on Wikipedia (print version). Look at the code. The first real line of text, "Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n]; 14 March 1879–18 April 1955) was a theoretical physicist.", is not at the start. The same applies to the wiki source; it starts with the same info-box and so on.

So how would you accomplish this task? The programming language is Java, but this shouldn't matter.

One solution that came to mind was to use an XPath query, but I expected the query to become rather complicated in order to handle all the edge cases.


It wasn't that complicated; see my solution below!


Solution

  • I worked out the following solution:

    Run an XPath query on the XHTML source code (I used the print version because it is shorter, but it also works on the normal version):

    //html/body//div[@id='bodyContent']/p[1]
    

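    The query selects the first p element that is a direct child of the bodyContent div, so paragraphs nested inside info-boxes are skipped. This works on both the German and the English Wikipedia, and I haven't found an article where it doesn't output the first paragraph. The solution is also quite fast. I also thought of simply taking the first x characters of the XHTML, but that would render the XHTML invalid.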
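To illustrate what the expression selects, here is a minimal sketch that runs the same query against a small inline fragment instead of a live page. The class name `FirstParagraphDemo`, the method `firstParagraph`, and the `SAMPLE` markup are made up for illustration; real Wikipedia pages are far larger and their layout may have changed since this was written.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class FirstParagraphDemo {
    // Hypothetical page skeleton: an info-box table (containing its own <p>)
    // comes before the first real paragraph of the article.
    static final String SAMPLE =
        "<html><body><div id='content'><div id='bodyContent'>"
        + "<table class='infobox'><tr><td><p>Born: 14 March 1879</p></td></tr></table>"
        + "<p>Albert Einstein was a theoretical physicist.</p>"
        + "<p>Second paragraph.</p>"
        + "</div></div></body></html>";

    // Returns the text of the first <p> that is a direct child of div#bodyContent.
    static String firstParagraph(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(String, Object) returns the string value of the first matching node
            return xpath.evaluate("//html/body//div[@id='bodyContent']/p[1]", doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // prints: Albert Einstein was a theoretical physicist.
        System.out.println(firstParagraph(SAMPLE));
    }
}
```

The paragraph inside the info-box table is a descendant of the div but not a direct child, which is why `/p[1]` skips it.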

    In case anyone is looking for the Java code, here it is:

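    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathExpressionException;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    
    // Note: `logger` is assumed to be a logger field declared elsewhere in the
    // class, with an error(String, Throwable) method (as in log4j or SLF4J).
    
    private static DocumentBuilderFactory dbf;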
    
    static {
        dbf = DocumentBuilderFactory.newInstance();
        dbf.setAttribute("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    }
    
    private static XPathFactory xpathf = XPathFactory.newInstance();
    private static String xexpr = "//html/body//div[@id='bodyContent']/p[1]";
    
    
    private static String getPlainSummary(String url) {
        try {
            // Open Wikipage
            URL u = new URL(url);
            URLConnection uc = u.openConnection();
            uc.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1) Gecko/20090616 Firefox/3.5");
            // Construct Builder and parse; try-with-resources closes the stream
            DocumentBuilder builder = dbf.newDocumentBuilder();
            Document docXML;
            try (InputStream uio = uc.getInputStream()) {
                docXML = builder.parse(new InputSource(uio));
            }
    
            // Apply XPath
            XPath xpath = xpathf.newXPath();
            XPathExpression xpathe = xpath.compile(xexpr);
            String s = xpathe.evaluate(docXML);
    
            // Return the paragraph text, or null if nothing matched
            if (s.length() == 0) {
                return null;
            } else {
                return s;
            }
        }
        catch (IOException ioe) {
            logger.error("Can't get XML", ioe);
            return null;
        }
        catch (ParserConfigurationException pce) {
            logger.error("Can't get DocumentBuilder", pce);
            return null;
        }
        catch (SAXException se) {
            logger.error("Can't parse XML", se);
            return null;
        }
        catch (XPathExpressionException xpee) {
            logger.error("Can't evaluate XPath", xpee);
            return null;
        }
    }
    

    Use it by calling getPlainSummary("http://de.wikipedia.org/wiki/Uma_Thurman");
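The question also asked for the first x characters or the first y words. Because the extracted summary is plain text, it can be cut without the invalid-XHTML problem mentioned above. The following is only a sketch: `SummaryTrimmer`, `firstWords`, and `firstChars` are hypothetical helper names, and appending "..." to a shortened result is an arbitrary choice.

```java
import java.util.Arrays;

final class SummaryTrimmer {
    // Returns at most maxWords whitespace-separated words, adding "..." if text was cut.
    static String firstWords(String text, int maxWords) {
        if (text == null || maxWords <= 0) {
            return "";
        }
        String trimmed = text.trim();
        String[] words = trimmed.split("\\s+");
        if (words.length <= maxWords) {
            return trimmed;
        }
        return String.join(" ", Arrays.copyOfRange(words, 0, maxWords)) + "...";
    }

    // Returns at most maxChars UTF-16 chars (plus "..."), cutting at the last
    // space at or before the limit when there is one, so words are not split.
    static String firstChars(String text, int maxChars) {
        if (text == null || maxChars <= 0) {
            return "";
        }
        String trimmed = text.trim();
        if (trimmed.length() <= maxChars) {
            return trimmed;
        }
        int cut = trimmed.lastIndexOf(' ', maxChars);
        if (cut <= 0) {
            cut = maxChars; // no usable space found: hard cut at the limit
        }
        return trimmed.substring(0, cut) + "...";
    }

    public static void main(String[] args) {
        String summary = "Albert Einstein was a theoretical physicist.";
        System.out.println(firstWords(summary, 3));  // prints: Albert Einstein was...
        System.out.println(firstChars(summary, 20)); // prints: Albert Einstein was...
    }
}
```

Either helper could be applied to the (non-null) return value of getPlainSummary.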