marklogic, date-comparison, flwor, cts-search

MarkLogic: count old documents using cts:search instead of FLWOR


A question about using cts:search instead of FLWOR. I have an XQuery module (.xqy) that runs over all documents in the database and checks an element that holds a timestamp. We created that timestamp when the document was inserted. The goal is to delete documents older than x days.

Now I need to know how many documents are older than x days, so I can try a CORB job to delete them.

My query so far works:

xquery version "1.0-ml";
declare namespace j = "http://marklogic.com/xdmp/json/basic";
declare namespace dikw = 'http://www.example.com/dikw_functions.xqy';

(:let $foo := cts:uris((),(), cts:not-query(cts:element-query(xs:QName("j:dikwmetadata"), cts:element-query(xs:QName("j:data"), cts:and-query(()))))):)
let $uris := cts:uri-match("/twitter/*")[1 to 10]
let $today := fn:current-date()
let $days := xs:dayTimeDuration("P30D")

let $today_minus_x := xs:dateTime($today - $days)

for $uri in $uris (:cts:search(doc(), $random-query):)
    let $doc_dikw_date := xdmp:parse-dateTime("[Y0001]-[M01]-[D01] [h01]",xs:string(fn:doc($uri)//j:dikwmetadata//j:timestamp))
    (: true() when the document is older than the cutoff and should be deleted :)
    let $to_old := $today_minus_x >= $doc_dikw_date

return ($uri,$to_old) 

This works OK, but I need to know how many there are, to see whether I can run it from the Query Console or need to set up a scheduled CORB job that runs every day.

I was looking into cts:search something like:

(:
let $uris2 := cts:search($uris,cts:query(xdmp:parse-dateTime("[Y0001]-[M01]-[D01] [h01]",xs:string(fn:doc($uris)//j:dikwmetadata//j:timestamp))) < $today_minus_x)
:)

But this seems to need elements ... now I am stuck.

Question: is there a more straightforward way to find and count all documents older than x days?


Solution

  • One of the problems with your current code is that you are parsing dates at run time. That is always going to be slow, because each document has to be fetched so its XML can be read, instead of the query being resolved from the indexes.

    This would work best if your j:timestamp element contained a string that is a valid xs:date or xs:dateTime. Then you can declare a (path) range index of type date or dateTime (whichever suits you best) on that element.

    An alternative is to add something like an iso-date(Time) attribute to that element, containing a pre-parsed value of type xs:date(Time), so you can index that one instead.

    Once you have a range index, you can run a (path) range query on your element. You could then also use cts:uris to get the URIs of the documents that need to be deleted.

    HTH!
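
To make the range-index suggestion concrete, here is a minimal sketch of what the count and the URI lookup could look like. It rests on assumptions that are not in the question: that j:timestamp has been changed to hold a valid xs:dateTime string, that an element range index of type dateTime has been configured on j:timestamp, that the URI lexicon is enabled (needed for cts:uris), and that the Twitter documents live under the /twitter/ directory. The variable names are illustrative only.

```xquery
xquery version "1.0-ml";
declare namespace j = "http://marklogic.com/xdmp/json/basic";

(: ASSUMPTION: j:timestamp contains a valid xs:dateTime and a dateTime
   element range index exists on it. Without that index the range query
   below raises an error. :)
let $cutoff := fn:current-dateTime() - xs:dayTimeDuration("P30D")

(: ASSUMPTION: the documents of interest sit under the /twitter/ directory. :)
let $old-docs-query :=
  cts:and-query((
    cts:directory-query("/twitter/", "infinity"),
    cts:element-range-query(xs:QName("j:timestamp"), "<=", $cutoff)
  ))

return (
  (: index-only count of matching documents; no documents are loaded :)
  xdmp:estimate(cts:search(fn:doc(), $old-docs-query)),
  (: first 10 matching URIs from the URI lexicon, e.g. as input for a delete job :)
  cts:uris((), ("limit=10"), $old-docs-query)
)
```

If other elements named j:timestamp exist outside j:dikwmetadata, a path range index combined with cts:path-range-query would probably be the safer choice, since an element range index covers every element with that name. Dropping the "limit=10" option returns all matching URIs, which is roughly the shape of query a CORB URIs module would use.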