I have a Docker Compose stack running instances of Accumulo 1.9.3. On another machine I have configured an identical stack, but with updated versions of the various applications, including Accumulo 2.0.1.
In the first stack, Accumulo stores its data in the /data/hdfs directory. I copied the contents of that directory to the same path on the new stack, and I would like to import the data into Accumulo to see whether version 2.0.1 can interpret it correctly. At the moment the data does not seem to be picked up, because the Accumulo Monitor does not show any tables. Is there any way to make that data visible to Accumulo?
Apache Accumulo uses metadata to track information about the data stored in its files. Merely copying the underlying files in HDFS is usually not enough. You also need to migrate the information about the files, or copy the data in a way that makes the metadata largely irrelevant.
I'll talk about the metadata-irrelevant situation first. Accumulo stores its files in a format called RFile (extension .rf). The client code has a bulk import API to import a directory of such RFiles, and there is a command in the Accumulo shell that does the same thing. If you already have a directory full of these files, you can create the table in the new instance and use the import command to bulk add the files to your new table.
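As a rough sketch of that bulk import in the Accumulo shell on the new instance: the table name mytable and the directory /tmp/rfiles/mytable are placeholders rather than names from your setup, and the exact importdirectory arguments differ between versions (1.x also expects a failure directory), so check "help importdirectory" before running it.

```
# Assumption: mytable and /tmp/rfiles/mytable are hypothetical names.
# Create the target table in the new instance and switch to it.
createtable mytable
table mytable

# Bulk import every RFile in the HDFS directory into the current table.
# The last argument is the setTime flag; false keeps the timestamps
# already stored in the files.
importdirectory /tmp/rfiles/mytable false

# Spot-check that the entries are now visible.
scan
```

Bulk import moves the files out of the source directory, so it is safer to point it at a copy of the RFiles than at your only copy.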
There are several pitfalls to watch out for with the bulk import method of migrating:
Accumulo also provides a convenient export table feature that avoids a lot of this complexity. That feature creates a listing of files for you to copy from the original table, and also creates a dump of the table's metadata. There is a corresponding import table feature that helps you import the files and the metadata on the receiving end. Both are also available as Accumulo shell commands (exporttable and importtable). Using this feature lets you avoid doing all the compactions and checking HDFS for orphaned files, because it gives you the list of files to copy over and recreates the metadata, including the tablet boundaries, for you. You will still need to flush the table and probably take it offline (which ensures that ingest and splitting are halted), but the command itself should check for those prerequisites to help you through the process.
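A sketch of what that export/import round trip can look like, following the pattern in the Accumulo documentation. The table name, the export and import directories, and the namenode addresses are all placeholders, and whether an export taken from 1.9.3 imports cleanly into 2.x is worth testing on a scratch table first.

```
# Assumption: mytable, the /tmp paths and the namenode hosts are hypothetical.

# 1. In the Accumulo shell on the old instance: take the table offline
#    and write the export (metadata dump plus a file listing).
offline mytable
exporttable -t mytable /tmp/mytable_export

# 2. In an OS shell: copy the files named in distcp.txt to the new cluster.
hadoop distcp -f hdfs://old-namenode:8020/tmp/mytable_export/distcp.txt hdfs://new-namenode:8020/tmp/mytable_import

# 3. In the Accumulo shell on the new instance: recreate the table
#    from the copied files and metadata.
importtable mytable /tmp/mytable_import
```

If the original table has to stay online, the documentation's example clones it first (clonetable) and exports the offline clone instead.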
Also, please note that the export/import table feature may not work fully with multiple volumes yet in released versions of Accumulo, so you'll need to take that into consideration if that situation applies to you.
Also, please be aware that the latest version of Accumulo as of the time of this writing is 2.1, which is a long-term maintenance (LTM) release. The 2.0 versions are non-LTM and are not expected to receive any updates (or rather, their updates have been rolled into 2.1 instead). So, if you're setting up a new cluster, I would strongly advise against using 2.0, which is the version in your initial question, and recommend choosing the latest 2.1 instead.
If you have any follow-up questions or need help, the best places to get answers are the documentation on the Accumulo website and the mailing lists (especially the user mailing list), which you can find on the project website.