
Grabbing <title> tag with lxml's iterparse


I'm having trouble using lxml's iterparse on my HTML. I'm trying to get the text of the <title> element, but this simple function doesn't work on complete web pages:

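from StringIO import StringIO  # Python 2

from lxml import etree

def get_title(str):
    # Stream the document, asking iterparse to report only "title" elements
    titleIter = etree.iterparse(StringIO(str), tag="title")
    try:
        for event, element in titleIter:
            return element.text
        # print "Script goes here when it doesn't work"
    except etree.XMLSyntaxError:
        return None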

This function works fine on simple input like "<title>test</title>", but when I give it a complete page it's unable to extract the title.

UPDATE: Here's the HTML I'm working with:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="ltr" lang="it" xmlns="http://www.w3.org/1999/xhtml">
<head>
<link rel="icon" href="http://www.tricommerce.it/tricommerce.ico" />
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Tricommerce - Informazioni sulla privacy</title>
<meta name="description" content="Info sulla privacy" />
<meta name="keywords" content="Accessori notebook Alimentatori Case Cavi e replicatori Controllo ventole Lettori e masterizzatori Modding Pannelli &amp; display Dissipatori Tastiere e mouse Ventole Griglie e filtri Hardware Accessori vari Box esterni Casse e cuffie Sistemi a liquido Paste termiche vendita modding thermaltake vantec vantecusa sunmbeam sunbeamtech overclock thermalright xmod aerocool arctic cooling arctic silver zalman colorsit colors-it sharkoon mitron acmecom Info sulla privacy" />
<meta name="robots" content="index, follow" />
<link rel="stylesheet" href="http://www.tricommerce.it/css/tricommerce.css" />
<link rel="stylesheet" href="css/static.css" />
<script type="text/javascript" src="http://www.tricommerce.it/javascript/vertical_scroll.js"></script>

<script type="text/javascript">
//<![CDATA[
function MM_preloadImages() { //v3.0
 var d=document; if(d.images){ if(!d.MM_p) d.MM_p=new Array();
   var i,j=d.MM_p.length,a=MM_preloadImages.arguments; for(i=0; i<a.length; i++)
   if (a[i].indexOf("#")!=0){ d.MM_p[j]=new Image; d.MM_p[j++].src=a[i];}}
}
//]]>
</script>

<link rel="stylesheet" type="text/css" href="http://www.tricommerce.it/css/chromestyle.css" />

<script type="text/javascript" src="http://www.tricommerce.it/javascript/chrome.js">
/***********************************************
* AnyLink CSS Menu script- ? Dynamic Drive DHTML code library (www.dynamicdrive.com)
* This notice MUST stay intact for legal use
* Visit Dynamic Drive at http://www.dynamicdrive.com/ for full source code
***********************************************/
</script>

</head>
</html>

Also, a quick note on why I'm using iterparse: I don't want to load the entire DOM just to get a single tag early in the document.


Solution

  • You might want to post at least a portion of the data you're actually trying to parse. Absent that information, here's a guess. If the <html> element declares a default XML namespace, you'll need to use that namespace when looking for elements. For example, look at this simple document:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd"
      xml:lang="en">
      <head>            
        <title>Document Title</title>
      </head>
      <body>
      </body>
    </html>
    

    Given this input, the following will return no results:

    >>> doc = etree.parse(open('foo.html'))
    >>> doc.xpath('//title')
    []
    

    This fails because we're looking for a <title> element without specifying a namespace. The document's <title> lives in the XHTML namespace, so an unqualified name doesn't match it (just as foo:title is different from bar:title when foo: and bar: are bound to different XML namespaces).

    You can explicitly use a namespace in the XPath query like this:

    >>> doc.xpath('//html:title',
    ...   namespaces={'html': 'http://www.w3.org/1999/xhtml'})
    [<Element {http://www.w3.org/1999/xhtml}title at 0x1087910>]
    

    And there's our match.

    You can pass a namespace-qualified name to the tag argument of iterparse, too, written as {namespace-uri}localname:

    >>> titleIter = etree.iterparse(StringIO(str), 
    ...   tag='{http://www.w3.org/1999/xhtml}title')
    >>> list(titleIter)
    [(u'end', <Element {http://www.w3.org/1999/xhtml}title at 0x7fddb7c4b8c0>)]
    

    If this doesn't solve your problem, post some sample input and we'll work from there.
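
Putting the namespace fix together with the original function, a minimal sketch might look like the following. It makes a few assumptions that are not part of the answer above: the input is a byte string, the loop filters on `element.tag` instead of using lxml's `tag=` argument (so a page without the XHTML namespace still matches), and the import falls back to the standard library's `xml.etree.ElementTree` if lxml is not installed. The names `XHTML_NS` and `TITLE_TAGS` are illustrative.

```python
from io import BytesIO

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption for portability: the stdlib offers a compatible iterparse
    import xml.etree.ElementTree as etree

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Accept <title> in the XHTML namespace and <title> with no namespace at all.
TITLE_TAGS = {"{%s}title" % XHTML_NS, "title"}


def get_title(markup):
    """Return the text of the first <title> element in `markup` (bytes), or None."""
    try:
        for _event, element in etree.iterparse(BytesIO(markup), events=("end",)):
            if element.tag in TITLE_TAGS:
                # Returning here stops parsing right after </title>
                return element.text
    except etree.ParseError:
        # Raised by both parsers when the input is not well-formed XML
        return None
    return None
```

Either parser still requires well-formed XML, so this should work for XHTML pages like the one in the question, while tag-soup HTML is likely to end up in the `except` branch and return None.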