webharvest

Webharvest crawler script not creating XML file


I'm hoping someone can point out my (probably stupid) mistake in this script. I'm trying to crawl a website, collect the posts on it, and write them to an XML document. I combined two of the Web-Harvest example scripts: the crawler example and the nytimes example.

The script runs without error, but only the <edublogs date="02.10.2015"></edublogs> tags are written to the output file.

Thanks in advance for your help.

<?xml version="1.0" encoding="UTF-8"?>

<config charset="UTF-8">

<!-- set initial page -->
<var-def name="home"><<SNIPPED>></var-def>

<!-- define script functions and variables -->
<script><![CDATA[
    /* checks if specified URL is valid for download */
    boolean isValidUrl(String url) {
        String urlSmall = url.toLowerCase();
        return urlSmall.startsWith("http://<<SNIPPED>>/") || urlSmall.startsWith("https://<<SNIPPED>>/");
    }

    /* set of unvisited URLs */
    Set unvisited = new HashSet();
    unvisited.add(home);

    /* pushes to web-harvest context initial set of unvisited pages */
    SetContextVar("unvisitedVar", unvisited);

    /* set of visited URLs */
    Set visited = new HashSet();
]]></script>

<file action="write" path="posts${sys.date()}.xml" charset="UTF-8">
    <template>
        <![CDATA[ <allposts date="${sys.datetime("dd.MM.yyyy")}"> ]]>
    </template>

    <!-- loop while there are any unvisited links -->
    <while condition="${unvisitedVar.toList().size() != 0}">
        <loop item="currUrl">
            <list>
                <var name="unvisitedVar"/>
            </list>
            <body>
                <empty>
                    <!-- Get page content -->
                    <var-def name="content">
                        <html-to-xml>
                            <http url="${currUrl}"/>
                        </html-to-xml>
                    </var-def>
                    <!-- Get variables -->
                    <xquery>
                    <xq-param name="doc">
                            <var name="content"/>
                    </xq-param>
                    <xq-expression><![CDATA[
                        declare variable $doc as node() external;

                        let $title := data($doc//h1)
                        let $text := data($doc//div[@class="post-entry"])
                        let $categories := data($doc//div[@class="post-data"])
                            return 
                            <post>
                                <title>{data($title)}</title>
                                <url>$(currUrl)</url>
                                <text>{data($text)}</text>
                                <categories>{data($categories)}</categories>
                            </post>
                        ]]></xq-expression>
                    </xquery>

                    <!-- adds current URL to the list of visited -->
                    <script><![CDATA[
                        visited.add(sys.fullUrl(home, currUrl));
                        Set newLinks = new HashSet();
                    ]]></script>

                    <!-- loop through all collected links on the downloaded page -->
                    <loop item="currLink">
                        <list>
                            <xpath expression="//a/@href">
                                <var name="content"/>
                            </xpath>
                        </list>
                        <body>
                            <script><![CDATA[
                                String fullLink = sys.fullUrl(home, currLink);
                                fullLink = fullLink.replaceAll("#.*","");
                                if ( isValidUrl(fullLink.toString()) && !visited.contains(fullLink) && !unvisitedVar.toList().contains(fullLink) && !fullLink.endsWith(".png") ) {
                                    newLinks.add(fullLink);
                                }
                            ]]></script>
                        </body>
                    </loop>
                </empty>
            </body>
        </loop>

        <!-- unvisited link are now all the collected new links from downloaded pages  -->
        <script><![CDATA[
             SetContextVar("unvisitedVar", newLinks);
        ]]></script>
    </while>
    <![CDATA[ </allposts> ]]>
</file>


Solution

  • It's because your <while> doesn't return anything, most likely because you've wrapped the loop body in <empty>, which forces an empty result no matter what runs inside it (see the manual). The processors inside still execute and set variables, but nothing is returned to the enclosing <file> processor, so it has nothing to write between the opening and closing tags.
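
    A minimal sketch of that change, untested and based only on the script in the question: keep <empty> around the steps that should produce no output (downloading the page, collecting links) and move the <xquery> outside it, so that each iteration of the loop returns a <post> element for <file> to write. The <url> element is left out here because passing currUrl into the XQuery is a separate issue from the one this answer covers. Only the <body> of the outer <loop> is shown; everything else is assumed to stay as posted.

```xml
<body>
    <!-- Download the page. <empty> discards the value that var-def
         would otherwise return, so the raw page is not written out. -->
    <empty>
        <var-def name="content">
            <html-to-xml>
                <http url="${currUrl}"/>
            </html-to-xml>
        </var-def>
    </empty>

    <!-- NOT wrapped in <empty>: this result is returned by the loop
         body and therefore reaches the enclosing <file> processor. -->
    <xquery>
        <xq-param name="doc">
            <var name="content"/>
        </xq-param>
        <xq-expression><![CDATA[
            declare variable $doc as node() external;

            <post>
                <title>{data($doc//h1)}</title>
                <text>{data($doc//div[@class="post-entry"])}</text>
                <categories>{data($doc//div[@class="post-data"])}</categories>
            </post>
        ]]></xq-expression>
    </xquery>

    <!-- Bookkeeping only, so no output is wanted from here on. -->
    <empty>
        <script><![CDATA[
            visited.add(sys.fullUrl(home, currUrl));
            Set newLinks = new HashSet();
        ]]></script>

        <!-- the <loop item="currLink"> block from the question goes here, unchanged -->
    </empty>
</body>
```

    If the output file still contains only the wrapper tags after this change, it would be worth checking that the XPath expressions actually match something on the downloaded pages.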