To build a large dataset of images and videos, I would like to use Apache Xindice. There are very few tutorials and guides on the web for Apache Xindice. How do I store image and video files in Apache Xindice? Is Apache Xindice suitable for storing a large set of data? Is there a more recent repository that can store a large set of data in XML format (not an SQL-type database; it should hold terabytes of data)? Can I use MongoDB for storing a large dataset?
I suggest storing the external documents (images/videos, XML files) in MongoDB, using GridFS. A GridFS store consists of two parts: the chunks collection, where the binary data is stored, and the files collection, which holds information about each file, including user-defined metadata. From the FAQ:
In some situations, storing large files may be more efficient in a MongoDB database than on a system-level filesystem.
If your filesystem limits the number of files in a directory, you can use GridFS to store as many files as needed.
When you want to keep your files and metadata automatically synced and deployed across a number of systems and facilities: when using geographically distributed replica sets, MongoDB can distribute files and their metadata automatically to a number of mongod instances and facilities.
When you want to access information from portions of large files without having to load whole files into memory, you can use GridFS to recall sections of files without reading the entire file into memory.
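To make the two-collection layout and the partial-read point concrete, here is a minimal pure-Python sketch. It is not the MongoDB driver API: `put`, `read_range`, and the two plain dictionaries standing in for the files and chunks collections are hypothetical names made up for illustration. The field names (`files_id`, `n`, `data`, `length`, `chunkSize`, `metadata`) and the 255 KiB default chunk size follow what current GridFS drivers use, as far as I know; a real application would use the driver's GridFS support (for example the `gridfs` module shipped with PyMongo) instead.

```python
from datetime import datetime, timezone

CHUNK_SIZE = 255 * 1024  # assumed default chunk size of current GridFS drivers


def put(files, chunks, file_id, filename, data, metadata=None,
        chunk_size=CHUNK_SIZE):
    """Store `data` as one files document plus numbered chunk documents."""
    files[file_id] = {
        "_id": file_id,
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
        "uploadDate": datetime.now(timezone.utc),
        "metadata": metadata or {},  # user-defined metadata lives here
    }
    for n, start in enumerate(range(0, len(data), chunk_size)):
        chunks[(file_id, n)] = {
            "files_id": file_id,
            "n": n,
            "data": data[start:start + chunk_size],
        }


def read_range(files, chunks, file_id, offset, size):
    """Return bytes [offset, offset + size) and the chunk numbers touched.

    Only the chunks overlapping the requested range are looked up, which is
    why a section of a large file can be read without loading all of it.
    """
    info = files[file_id]
    end = min(offset + size, info["length"])
    if offset >= end:
        return b"", []
    cs = info["chunkSize"]
    first, last = offset // cs, (end - 1) // cs
    touched = list(range(first, last + 1))
    blob = b"".join(chunks[(file_id, n)]["data"] for n in touched)
    start = offset - first * cs
    return blob[start:start + (end - offset)], touched
```

Calling `put(files, chunks, 1, "clip.bin", data, {"kind": "video"})` on two empty dictionaries and then `read_range(files, chunks, 1, offset, size)` returns the requested bytes together with the list of chunk numbers that had to be fetched.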
For large data sets, GridFS can be sharded (see http://docs.mongodb.org/manual/core/sharded-cluster-internals/#sharding-gridfs-stores).
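As a rough illustration of what sharding a GridFS store involves, the sketch below builds the `shardCollection` admin command for the chunks collection, keyed on `files_id` and `n` so that the chunks of one file stay in order. The helper name `shard_chunks_command`, the database name passed to it, and the default bucket name `fs` are assumptions for this example; check the linked documentation for the shard key that suits your workload.

```python
def shard_chunks_command(database, bucket="fs"):
    """Build the admin command that shards a GridFS chunks collection.

    The command name must be the first key; Python dicts keep insertion order.
    """
    return {
        "shardCollection": f"{database}.{bucket}.chunks",
        "key": {"files_id": 1, "n": 1},
    }


# With PyMongo and a running sharded cluster this could be sent as, e.g.:
#   client.admin.command(shard_chunks_command("media"))
# after sharding has been enabled for the database.
```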
For fast delivery of GridFS data, there are modules for nginx (nginx-gridfs) and Apache (mod_gridfs). See also http://nosql.mypopescu.com/post/28085493064/mongodb-gridfs-over-http-with-mod-gridfs for a quick comparison.