hadoop, apache-spark, hive, hcatalog, qubole

Adding results of Hadoop job to Hive Table


I have a map-only job that processes a large text file. Each line is analyzed and categorized, and MultipleOutputs is used to write each category to a separate file. Eventually all the data gets added to a Hive table dedicated to each category. My current workflow does the job but is a bit cumbersome. I am about to add a couple of categories and thought I might be able to streamline the process. I have a couple of ideas and was looking for some input.
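Roughly, the mapper looks like the sketch below (the class name and the categorization rule are simplified placeholders, not the real analysis logic):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class CategorizeMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

        private MultipleOutputs<NullWritable, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            // Placeholder categorization; the real job analyzes each line here.
            String category = line.toString().startsWith("ERROR") ? "errors" : "events";

            // Write the line under a category-specific file name prefix,
            // producing files like errors-m-00000, events-m-00000.
            mos.write(NullWritable.get(), line, category);
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();  // flush the per-category writers
        }
    }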

Current Workflow:

  1. A map-only job divides the large file into categories, writing one set of output files per category.
  2. A separate (non-Hadoop) process copies the output files into a separate directory for each category.
  3. An external table is created for each category, and the data is then inserted into the permanent Hive table for that category (see the sketch after this list).
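Step 3 amounts to something like the following, shown here through Hive's JDBC interface (the connection URL, directory, column, and table names are placeholders for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class LoadCategory {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-server:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {

                // Expose the directory that this category's files were copied into
                // as an external staging table.
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS errors_staging (raw_line STRING) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
                        + "LOCATION '/data/categories/errors'");

                // Append the staged rows to the permanent table for that category.
                stmt.execute("INSERT INTO TABLE errors SELECT * FROM errors_staging");
            }
        }
    }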

Possible new workflows


Solution

  • For MultipleOutputs, set the output path to the base folder where your Hive external tables are located, then write the data to "<table_name>/<filename_prefix>". Your data will then already be located in your target tables.
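A minimal sketch of that setup (the warehouse path, table names, and class names are only examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class CategorizeDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "categorize");
            job.setJarByClass(CategorizeDriver.class);
            job.setMapperClass(CategorizeMapper.class);
            job.setNumReduceTasks(0);  // map-only
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));

            // Base folder that contains one sub-directory per Hive external table,
            // e.g. /warehouse/categories/errors and /warehouse/categories/events.
            FileOutputFormat.setOutputPath(job, new Path("/warehouse/categories"));

            // Avoid writing empty part-m-* files at the top level of the base folder.
            LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

In the mapper, the write call then includes the table sub-directory, e.g. mos.write(NullWritable.get(), line, category + "/part"), so a line in the "errors" category ends up as /warehouse/categories/errors/part-m-00000 and is immediately visible to the external table. Note that FileOutputFormat normally refuses to run against an output directory that already exists, so in practice the base folder may need to be created fresh per run, or the job written to a staging base folder with the same layout and the files moved afterwards.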