I have a huge amount of data in one table (~7 billion rows) in an AWS Oracle RDS instance. The end result I want is that table stored in S3 as pipe-separated values so that I can read it into EMR. This is basically a one-time job, so I need it to work accurately and without having to re-run the whole upload because something timed out; I don't really care how it works or how difficult/annoying it is to set up. I have root access on the Oracle box.

I looked at Data Pipeline, but it appears to only support MySQL, and I need this to work with Oracle. Also, I do not have enough hard drive space to dump the whole table to a CSV on the Oracle instance and then upload it.

How can I get this done?
You can use Sqoop (http://sqoop.apache.org/) to do this. You can write a Sqoop script and schedule it as an 'EMR Activity' under Data Pipeline.

Sqoop runs on Hadoop, so it can open multiple parallel connections to Oracle and write the exported data straight to S3 (or HDFS), without ever staging the whole table on any one disk.
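A minimal sketch of the import command, run from the EMR master node. The connection string, credentials, table name, split column, and bucket below are all placeholders you would fill in yourself, and you need the Oracle JDBC driver (ojdbc JAR) available to Sqoop:

    # Each mapper pulls a slice of the table (partitioned on --split-by)
    # and writes its own pipe-delimited part file directly to S3.
    sqoop import \
      --connect jdbc:oracle:thin:@//your-rds-endpoint:1521/YOURSERVICE \
      --username YOUR_USER \
      --password YOUR_PASSWORD \
      --table YOUR_SCHEMA.YOUR_TABLE \
      --split-by ID \
      --num-mappers 32 \
      --fields-terminated-by '|' \
      --target-dir s3://your-bucket/oracle-export/

Because the work is split across mapper tasks, a failed task gets retried by Hadoop on its own instead of forcing you to restart the whole export, and nothing ever has to fit on the RDS instance's local disk.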
You can keep your raw data on S3 and read it directly from EMR, or you can copy it onto the cluster using an 'S3DistCp' activity (again scheduled on Data Pipeline, if you want).
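If you do decide to pull the files onto the cluster's HDFS rather than reading them straight from S3, the s3-dist-cp tool that ships with EMR handles the copy; the bucket and paths here are placeholders:

    # Copy the exported part files from S3 into HDFS on the cluster
    s3-dist-cp --src s3://your-bucket/oracle-export/ --dest hdfs:///data/oracle-export/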
If you don't have a scheduling need, you can simply spin up an EMR cluster from the EMR console and run Sqoop on it.
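For a one-off export like this the console works fine; a roughly equivalent AWS CLI call is sketched below (release label, instance sizes/counts, and key name are assumptions you should adjust; Sqoop is selectable as an EMR application on the 5.x release labels). Once the cluster is up, SSH to the master node, drop the ojdbc JAR into Sqoop's lib directory, and run the import command shown above.

    # Launch a small EMR cluster with Hadoop and Sqoop installed
    aws emr create-cluster \
      --name "oracle-to-s3-export" \
      --release-label emr-5.36.0 \
      --applications Name=Hadoop Name=Sqoop \
      --instance-type m5.xlarge \
      --instance-count 5 \
      --use-default-roles \
      --ec2-attributes KeyName=your-key-pair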