I wouldn't exactly say it is limited, but as far as I can see the recommendations given are of the sort "If you need to go beyond that you can change the backend store...". Why? Why is Sesame not as efficient as, let's say, OWLIM or AllegroGraph when it goes beyond 150-200m triples? What optimizations are implemented in order to go that big? Are the underlying data structures different?
Answered here by @Jeen Broekstra: http://answers.semanticweb.com/questions/21881/why-is-sesame-limited-to-lets-say-150m-triples
- the actual values that make up an RDF statement (that is, the subjects, predicates, and objects) are indexed in a relatively simple hash, mapping integer ids to actual data values. This index does a lot of in-memory caching to speed up lookups, but as the size of the store increases, so does the probability (during insertion or lookup) that a value is not present in the cache and needs to be retrieved from disk. In addition, the on-disk lookup itself becomes more expensive as the size of the hash increases.
- data retrieval in the native store has been balanced to make optimal use of the file system page size, for maximizing retrieval speed of B-tree nodes. This optimization relies on consecutive lookups reusing the same data block so that the OS-level page cache can be reused. This heuristic starts failing more often as transaction sizes (and therefore B-trees) grow, however.
- as B-trees grow in size, the chances of large cascading splits increase.
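
To make the first point concrete, here is a rough sketch of how a fixed-size value cache misses more often as the number of distinct values grows. This is not Sesame's actual code: the `LruValueCache` class, the `miss_rate` helper, the cache capacity, and the uniformly random access pattern are all assumptions chosen for illustration. Real workloads are usually skewed toward a small set of hot values, so the absolute numbers would differ, but the trend is the same.

```python
import random
from collections import OrderedDict


class LruValueCache:
    """Hypothetical fixed-size LRU cache in front of an on-disk id -> value index."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, value_id):
        if value_id in self.entries:
            # Cache hit: mark the entry as most recently used.
            self.entries.move_to_end(value_id)
            self.hits += 1
        else:
            # Cache miss: in a real store this would be a disk read.
            self.misses += 1
            self.entries[value_id] = True
            if len(self.entries) > self.capacity:
                # Evict the least recently used entry.
                self.entries.popitem(last=False)


def miss_rate(num_values, cache_capacity=1000, num_lookups=20000, seed=42):
    """Fraction of lookups that miss the cache, assuming uniform random access."""
    rng = random.Random(seed)
    cache = LruValueCache(cache_capacity)
    for _ in range(num_lookups):
        cache.lookup(rng.randrange(num_values))
    return cache.misses / num_lookups


if __name__ == "__main__":
    for n in (500, 1_000, 10_000, 100_000):
        print(f"{n:>7} distinct values: miss rate {miss_rate(n):.2%}")
```

While the distinct values fit in the cache, the only misses are the first access to each value. Once the store outgrows the cache, most lookups fall through to disk, which is the effect described in the answer above.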