oracle-database | xpath | solr | dataimporthandler | dih

Solr DIH slows when importing XML data from Oracle database


I'm working on a Solr DIH (DataImportHandler) task to import roughly 20 million documents stored in an Oracle database. Initially the import ramps up to over 500 docs a second, but within the first 150,000 documents the speed crashes to under 200 and eventually degrades to around 50-60/s, at which point my patience reaches its end and I kill the process. No process should take 30+ hours to import 5 million docs.

These documents are stored as an XMLType, so they have to be 'decoded' or extracted in the query.

Some of the team members think the use of getCLOBVal() is causing memory bloat and resource consumption in the JDBC pipeline (possibly on the server side), but my tests comparing getCLOBVal() to XMLSerialize don't seem to bear that out.

Maybe there are some connection options I've not tried yet in the JDBC connector that could reduce the overhead and keep the throughput high.

In the past I've used simple HTTP POSTs (from CasperJS and PHP) to submit 150 million+ documents in less than a day, so I'm convinced this is a problem with Solr DIH and/or the way we're connecting to Oracle.

Here is what the connection looks like:

<dataSource
    name="OracDB"
    type="JdbcDataSource"
    driver="oracle.jdbc.driver.OracleDriver"
    url="jdbc:oracle:thin:@vape.blah-blah-blah.PUB"
    user="some-user-name"
    password="some-user-name"
    convertType="true"
    />
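
One attribute I'm considering adding is batchSize, which DIH passes to the JDBC driver as the fetch size; here's a sketch of what that would look like (the value 5000 is just a test guess, not a verified fix):

    <dataSource
        name="OracDB"
        type="JdbcDataSource"
        driver="oracle.jdbc.driver.OracleDriver"
        url="jdbc:oracle:thin:@vape.blah-blah-blah.PUB"
        user="some-user-name"
        password="some-user-name"
        convertType="true"
        batchSize="5000"  <!-- passed to the driver as the JDBC fetch size -->
        />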

The query looks like this:

    SELECT PRODUCT_ID, PRODUCT_TYPE, STATUS,
           XMLSerialize(DOCUMENT DOCXML) AS xml
      FROM CDBXML.PRODUCTS
     WHERE PRODUCT_TYPE = 'MegaAwesome' AND STATUS <> 'C'

XPath is in play here to get data from the XML in this database; that looks like this:

        <entity
            name="DB/Import"
            dataSource="OracDB"
            onError="skip"
            query="<already referenced above>"
            >
        <field column="ID" name="id" />
        <entity
            name="productxml"
            rootEntity="false"
            dataSource="db"
            dataField="DB/Import.XML"
            processor="XPathEntityProcessor"
            transformer="TemplateTransformer,com.megacorp.solr.dih.Transform"
            forEach="/document">

            <!--
                XPATH PARSING OF FIELDS
             -->

            <field column="buildversion" xpath="/document/attribs/attrib[@id='build']" />
            <field column="lastPublishedTime" setValue="n/a" />
            <field column="SearchStatus" xpath="/document/attribs/attrib[@id='searchStatus']" />
            <field column="ProdcutType" xpath="/document/attribs/attrib[@id='productType']" commonField="true" />
            [... and a bunch more stuff ...]
        </entity>
     </entity>

I've run a lot of tests to see whether changes to the managed-schema or the import .xml config improve or degrade the intake, to no avail.

A few weeks back the process was importing 7 million documents in 11 hours; then the source dataset was increased to almost 20 million docs, and that's when things seemed to go off the rails.

Do I have the wrong JDBC connector settings? Is there something I can set to tell Oracle not to cache this query? This problem is baffling us. Looking for some hints before I have to punt and go old-school, brute-force data stuffing over HTTP.


Solution

  • SOLVED: The problem that caused the slowdowns was an interaction between Oracle and its JDBC driver, which caused the entire SELECT to be cached in some sort of internal cursor; as that cursor iterated through the list, resources diminished, speed diminished, and we ended up with an absurdly long process. Using Ruby, we confirmed this problem occurred completely outside the context of Solr... it's an "Oracle thing". At that point, it was decided to cut bait and work around it, which is what I did.

    To do this, we had to create a helper (lookup) table so that it would be possible to quickly fast-forward to any point in the stack of data with SQL. Sorting on the product_id was not possible due to its compound nature, and a change to that table was not approved. So another table was created with just a product ID and a sequence number. This allowed very fast dissection of the 17,000,000 records into smaller, sequential chunks of 10,000 documents each (past roughly 10,000 documents, a query would start to lose speed). A rough SQL sketch of the idea appears at the end of this answer.

    That was not the complete solution; we'd still be hamstrung by the connect/teardown costs and serial processing.

    The final fix was to create a multitude of identical, sequentially numbered data import handlers, and then run a small import against each one individually. I ended up with 30 handlers, and then wrote an iterator tool that submitted a separate import request to each handler, effectively working around the sequential import issue. A sketch of that handler setup also appears below.

    The end result was that indexing time was cut from nearly 100 hours to 1 hour 15 minutes. Problem solved, despite Oracle 11's best efforts to thwart our data extracts.
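
    For illustration, the helper table and the windowed query looked roughly like this (table and column names here are made up, not our actual schema):

        -- One-time helper table: map each matching product to a dense sequence number.
        -- ROWNUM just needs to yield a stable 1..N numbering over the candidate rows.
        CREATE TABLE CDBXML.PRODUCT_SEQ AS
          SELECT PRODUCT_ID, ROWNUM AS SEQ_NO
            FROM (SELECT PRODUCT_ID
                    FROM CDBXML.PRODUCTS
                   WHERE PRODUCT_TYPE = 'MegaAwesome' AND STATUS <> 'C');

        -- Each small import then pulls one 10,000-row window, so no single cursor
        -- ever has to walk millions of rows before it is closed.
        SELECT p.PRODUCT_ID, p.PRODUCT_TYPE, p.STATUS,
               XMLSerialize(DOCUMENT p.DOCXML) AS xml
          FROM CDBXML.PRODUCTS p
          JOIN CDBXML.PRODUCT_SEQ s ON s.PRODUCT_ID = p.PRODUCT_ID
         WHERE s.SEQ_NO BETWEEN 1 AND 10000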
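
    And a sketch of the multiple-handler setup (handler names, core name, and config file names are illustrative):

        <!-- solrconfig.xml: one DIH handler per slice, each pointing at its own
             copy of the import config that selects a different SEQ_NO window. -->
        <requestHandler name="/dataimport-01"
                        class="org.apache.solr.handler.dataimport.DataImportHandler">
          <lst name="defaults">
            <str name="config">db-config-01.xml</str>
          </lst>
        </requestHandler>
        <!-- ... repeated through /dataimport-30 ... -->

    The iterator tool then just fires a full-import at each handler, with clean=false so the slices don't wipe each other out:

        http://solr-host:8983/solr/mycore/dataimport-01?command=full-import&clean=false
        http://solr-host:8983/solr/mycore/dataimport-02?command=full-import&clean=false
        ...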