I'm hoping someone can point out my (probably stupid) problem with this script. I'm trying to crawl a website to get the posts on the site and to load this into an XML document. I have tried to combine a couple of example scripts - the crawler and nytimes examples.
The script runs without error; however, only the <edublogs date="02.10.2015"></edublogs> tags are exported.
Thanks in advance for your help.
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <!-- set initial page -->
    <var-def name="home"><<SNIPPED>></var-def>

    <!-- define script functions and variables -->
    <script><![CDATA[
        /* checks if specified URL is valid for download */
        boolean isValidUrl(String url) {
            String urlSmall = url.toLowerCase();
            return urlSmall.startsWith("http://<<SNIPPED>>/") || urlSmall.startsWith("https://<<SNIPPED>>/");
        }

        /* set of unvisited URLs */
        Set unvisited = new HashSet();
        unvisited.add(home);

        /* pushes to web-harvest context initial set of unvisited pages */
        SetContextVar("unvisitedVar", unvisited);

        /* set of visited URLs */
        Set visited = new HashSet();
    ]]></script>
    <file action="write" path="posts${sys.date()}.xml" charset="UTF-8">
        <template>
            <![CDATA[ <allposts date="${sys.datetime("dd.MM.yyyy")}"> ]]>
        </template>

        <!-- loop while there are any unvisited links -->
        <while condition="${unvisitedVar.toList().size() != 0}">
            <loop item="currUrl">
                <list>
                    <var name="unvisitedVar"/>
                </list>
                <body>
                    <empty>
                        <!-- Get page content -->
                        <var-def name="content">
                            <html-to-xml>
                                <http url="${currUrl}"/>
                            </html-to-xml>
                        </var-def>

                        <!-- Get variables -->
                        <xquery>
                            <xq-param name="doc">
                                <var name="content"/>
                            </xq-param>
                            <xq-expression><![CDATA[
                                declare variable $doc as node() external;
                                let $title := data($doc//h1)
                                let $text := data($doc//div[@class="post-entry"])
                                let $categories := data($doc//div[@class="post-data"])
                                return
                                    <post>
                                        <title>{data($title)}</title>
                                        <url>$(currUrl)</url>
                                        <text>{data($text)}</text>
                                        <categories>{data($categories)}</categories>
                                    </post>
                            ]]></xq-expression>
                        </xquery>

                        <!-- adds current URL to the list of visited -->
                        <script><![CDATA[
                            visited.add(sys.fullUrl(home, currUrl));
                            Set newLinks = new HashSet();
                        ]]></script>

                        <!-- loop through all collected links on the downloaded page -->
                        <loop item="currLink">
                            <list>
                                <xpath expression="//a/@href">
                                    <var name="content"/>
                                </xpath>
                            </list>
                            <body>
                                <script><![CDATA[
                                    String fullLink = sys.fullUrl(home, currLink);
                                    fullLink = fullLink.replaceAll("#.*","");
                                    if ( isValidUrl(fullLink.toString()) && !visited.contains(fullLink) && !unvisitedVar.toList().contains(fullLink) && !fullLink.endsWith(".png") ) {
                                        newLinks.add(fullLink);
                                    }
                                ]]></script>
                            </body>
                        </loop>
                    </empty>
                </body>
            </loop>

            <!-- unvisited links are now all the collected new links from downloaded pages -->
            <script><![CDATA[
                SetContextVar("unvisitedVar", newLinks);
            ]]></script>
        </while>

        <![CDATA[ </allposts> ]]>
    </file>
</config>
It's because your <while> doesn't return anything, most likely because you've wrapped the loop body in <empty>, which forces no results to be returned (see the manual). The body sets variables and so on, but it returns nothing for <file> to write.
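As a rough sketch of one way to restructure the inner loop (untested, and assuming the rest of the script stays as posted): keep <empty> only around the steps that exist for their side effects, and leave the <xquery> outside it so its <post> result is returned from the body. The XQuery is shortened to a single <title> field here purely for illustration.

```xml
<loop item="currUrl">
    <list>
        <var name="unvisitedVar"/>
    </list>
    <body>
        <!-- Side effect only: download the page and store it; <empty> discards the value -->
        <empty>
            <var-def name="content">
                <html-to-xml>
                    <http url="${currUrl}"/>
                </html-to-xml>
            </var-def>
        </empty>

        <!-- Not wrapped in <empty>, so the <post> element is returned from the body -->
        <xquery>
            <xq-param name="doc">
                <var name="content"/>
            </xq-param>
            <xq-expression><![CDATA[
                declare variable $doc as node() external;
                <post>
                    <title>{data($doc//h1)}</title>
                </post>
            ]]></xq-expression>
        </xquery>

        <!-- Side effects only: the visited/newLinks script and the link loop
             from the question would go here, unchanged -->
        <empty>
            <script><![CDATA[
                visited.add(sys.fullUrl(home, currUrl));
            ]]></script>
        </empty>
    </body>
</loop>
```

With this layout each iteration should contribute one <post> element to the result of <while>, which is what <file> then writes between the opening and closing wrapper tags.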