Tags: php, mongodb, symfony, doctrine-orm, doctrine

Best practice for MongoDB bulk inserts in Symfony2


In my Symfony2 command, I run a script that inserts hundreds of thousands of URLs, each stored as a string in its own document.

Here are the basic structures of the 2 documents I'm using. Before the program runs, there are already thousands of ParentDocuments in MongoDB, but zero ChildDocuments:

ParentDocument:
    $id:id
    $subDocument:OneToManyReference(ChildDocument)
    $etc:everythingelse

ChildDocument:
    $id:id
    $url:string
    $parentDocument:ManyToOneReference(ParentDocument)

And my Command code:

$dm = $this->getContainer()->get('doctrine_mongodb.odm.document_manager');
$parentDocuments = $dm->getRepository('My:Bundle:ParentDocument')->findAll();
while ($parentDocument = $parentDocuments->getNext()) {
    // Returns an array of hundreds of thousands of URLs
    $urls = $this->somehowFetchUrlsRelatedToTheParentDocument($parentDocument);
    foreach ($urls as $url) {
        $subDocument = new ChildDocument();
        $subDocument->setUrl($url);
        $subDocument->setParentDocument($parentDocument);
        $dm->persist($subDocument);
    }
    // One flush per parent; every persisted child stays managed afterwards
    $dm->flush();
}

When I run this simple command, the write speed is incredibly fast at first. However, when inserting millions of rows, the writes become significantly slower, dropping to as little as 1 write per second after the command has been running for 10 minutes, which makes the code extremely inefficient.

My first attempt at fixing this problem was to clear the document manager right after each flush using $dm->clear(). But this meant that the document manager would lose track of the current ParentDocument. So my solution was this:

$dm = $this->getContainer()->get('doctrine_mongodb.odm.document_manager');
$parentDocumentCursors = $dm->getRepository('My:Bundle:ParentDocument')->findAll();
// Load every parent into a plain array before clearing the document manager
$parentDocuments = array();
while ($parentDocument = $parentDocumentCursors->getNext()) {
    array_push($parentDocuments, $parentDocument);
}
$dm->clear();
unset($dm);
$dm = $this->getContainer()->get('doctrine_mongodb.odm.document_manager');
foreach ($parentDocuments as $parentDocument) {
    $urls = $this->somehowFetchUrlsRelatedToTheParentDocument($parentDocument);
    foreach ($urls as $url) {
        $subDocument = new ChildDocument();
        $subDocument->setUrl($url);
        $subDocument->setParentDocument($parentDocument);
        $dm->persist($subDocument);
    }
    // Flush, then clear so the children just written are no longer tracked
    $dm->flush();
    $dm->clear();
}

This solved the problem. The write speed stayed consistently fast throughout the whole execution of the program, and millions of rows could be inserted without the gradual slowdown.

However, this feels like bad practice and a quick-fix hack. What is the best practice for inserting millions of rows in Symfony2 using the document manager without read/write speeds becoming slow?


Solution

  • I would avoid using the document manager for this and call the Mongo driver's batchInsert() function directly. It is described in the documentation at http://php.net/manual/en/mongocollection.batchinsert.php. It feels to me like Doctrine's ODM is actually hurting you here.
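
As a rough illustration of that suggestion, here is a minimal sketch of batching raw documents before handing them to the driver. The helper insertUrlsInBatches(), the field names url and parentDocument, the batch size of 1000, and the database name my_db are all assumptions made for the example, not anything from the question or from Doctrine. The value passed as $parentRef has to match however your mapping stores the reference (for example an id or a DBRef), so treat that part as a placeholder.

```php
<?php
// Hypothetical helper (not part of Doctrine or the Mongo driver): groups raw
// documents into fixed-size batches and hands each batch to $insertBatch.
function insertUrlsInBatches($urls, $parentRef, $insertBatch, $batchSize = 1000)
{
    $batch = array();
    $inserted = 0;

    foreach ($urls as $url) {
        // Plain arrays instead of ChildDocument objects, so nothing is
        // tracked by the document manager. The field names are assumptions.
        $batch[] = array('url' => $url, 'parentDocument' => $parentRef);

        if (count($batch) >= $batchSize) {
            call_user_func($insertBatch, $batch);
            $inserted += count($batch);
            $batch = array();
        }
    }

    // Send whatever is left over after the last full batch.
    if (count($batch) > 0) {
        call_user_func($insertBatch, $batch);
        $inserted += count($batch);
    }

    return $inserted;
}

// With the legacy Mongo driver, the callback could wrap batchInsert():
//
//   $collection = $mongoClient->selectCollection('my_db', 'ChildDocument');
//   insertUrlsInBatches($urls, $parentRef, function ($batch) use ($collection) {
//       $collection->batchInsert($batch);
//   });
```

Keeping the insert call behind a callback means the batching logic can be tried without a database, and memory use stays bounded by the batch size rather than by the total number of URLs.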