let's suppose that we have a huge mongodb collection (around 60,000,000 records) and we want to scroll over it without fetching of course a huge amount in memory. For Hibernate orm, it's possible as there is the scrolling api and criteria queries but what could be a solution for pagination in OGM(already spend a lot of time but I couldn't find something)? I'm indexing in batches of 50,000 the objects using the setFirstResult-setMaxResult Api. These are the time in seconds of getting the batches from a query like
em.createNativeQuery(query, Entity.class).setFirstResult(i).setMaxResults(batchSize).getResultList()
results.stream().forEach(res -> fullTextEntityManager.index(res));
by increasing i in each iteration with like i+=batchSize;
I already tried using OgmMassIndexer but I need to be able to start-stop, index specific ranges so I prefer to do it manually.
As it's obvious and logical the time to find the firstResult in each iteration is increasing. Here I have time in secs to find the next batch of 50000 from the beginning from 4 million(setFirstResult(4000000).setMaxResult(50000)):
eg. to go to 4000000 it took 17 seconds etc. to go to 4050000 it took 15 seconds etc. to go to 4100000 it took 12 seconds etc. but later this number increases a lot:
Found: 17 Found: 15 Found: 12 Found: 13 Found: 13 Found: 13 Found: 15 Found: 16 Found: 16 Found: 17 Found: 18 Found: 18 Found: 19 Found: 19 Found: 20 Found: 20 Found: 21 Found: 21 Found: 22 Found: 21 Found: 22 Found: 23 Found: 23 Found: 23 Found: 24 Found: 24 Found: 25 Found: 25 Found: 26 Found: 26 Found: 27 Found: 28 Found: 27 Found: 29 Found: 29 Found: 30 Found: 31 Found: 32 Found: 33 Found: 30 Found: 33 Found: 32 Found: 34 Found: 34 Found: 35 Found: 35 Found: 38 Found: 36 Found: 38 Found: 36 Found: 41 Found: 41 Found: 39 Found: 41 Found: 41 Found: 40 Found: 42 Found: 43 Found: 42 Found: 44 Found: 44 Found: 45 Found: 47 Found: 45 Found: 44 Found: 44 Found: 47 Found: 44 Found: 47 Found: 47 Found: 50 Found: 52 Found: 93
Any options for scrolling mongodb using ogm cursor or something to fetch the objects in session and index them efficiently? I mean this is not viable even for an app that want to paginate that big amount of data using OGM without Hibernate Search, so I suppose there is a solution I don't see.
Thanks a lot.
Hibernate OGM 5.3.1, Hibernate Search 5.9.0 using ElasticSearch
Scrolling is not supported in OGM yet, because it requires the Criteria API which isn't supported.
That being said, you could implement your process differently.
I am going to assume that the process getting slower and slower comes from the query part (MongoDB having a harder and harder time finding the Nth result) and not from the indexing part (Elasticsearch having a harder and harder time adding documents to the index).
If this is the case, you could try "chunk" queries instead of pagination. The idea would be to first retrieve the first and last ID for the entity type you want to index, then instead of using pagination, run queries with a condition similar to where ID between <last ID in the previous query + 1> AND <last ID in the previous query + page size>
.
Provided the ID field has an ascending index in MongoDB, this should get rid of performance getting worse over time.