hiveamazon-emrparquetprestohadoop-partitioning

How do you add partitions to a partitioned table in Presto running in Amazon EMR?


I'm running Presto 0.212 in EMR 5.19.0, because AWS Athena doesn't support the user defined functions that Presto supports. I'm using EMR configured to use the glue schema. I have pre-existing Parquet files that already exist in the correct partitioned format in S3.

It appears that recent Presto versions have removed the ability to create and view partitions. This raises the question: How do you add individual partitions? I can use the Athena console in AWS and run MSCK REPAIR mytable; and that creates the partitions correctly, which I can then query successfully using the Presto CLI or HUE. However, How do I do this in Presto?

If I try this in presto-cli on the EMR master node:

use hive.default;
INSERT INTO "mytable$partitions" VALUES (2018, 9, 20)

I get an error saying

java.sql.SQLException: Query failed (#20181113_172115_00004_yywie): com.facebook.presto.connector.system.SystemTableHandle cannot be cast to com.facebook.presto.hive.HiveTableHandle

(Note that I'm using the database default in Glue to store the schema. That's where "default" comes from.)

The old ways of doing this in Presto have all been removed relatively recently (alter table mytable add partition (p1=value, p2=value, p3=value) or INSERT INTO TABLE mytable PARTITION (p1=value, p2=value, p3=value), for example), although still found in the tests it appears. They don't work. If I try to execute such queries in HUE or in the Presto CLI, I get errors.

However, in the Presto CLI I can view the partitions that exist, entering this query on the EMR master node:

use hive.default;
select * from "mytable$partitions";

Initially that query result is empty, because no partitions exist, of course. If I manually run MSCK REPAIR in Athena to create the partitions, then that query will show me all the partitions that have been created.

If I try using the HIVE CLI on the EMR master node, it doesn't work.

use default;
ALTER TABLE mytable
    ADD PARTITION (p1=2018, p2=9, p3=20)
    location 's3://bucketname/rootfolder/p1=2018/p2=9/p3=20/';

FAILED: SemanticException [Error 10001]: Table not found mytable

So... how, using the Presto-CLI, or using HUE, or even using the Hive CLI, can I add partitions to a partitioned table stored in S3? Now that Presto has removed the ability to do this, what is the way it is supposed to be done? Trying to follow earlier examples such as this one doesn't work.

While "MSCK REPAIR"works, it's an expensive way of doing this and causes a full S3 scan. I would prefer to add partitions individually rather than scan the entire S3 bucket to find existing partitions, especially when adding one new partition to a large table that already exists.

I also note this quote at page Using the AWS Glue Data Catalog as the Metastore for Hive:

We recommend creating tables using applications through Amazon EMR rather than creating them directly using AWS Glue. Creating a table through AWS Glue may cause required fields to be missing and cause query exceptions.

There must be a way of doing this within EMR. What is it?


Solution

  • It turns out that Hive and Presto, in EMR, require separate configuration to be able to use the Glue catalog. Thus, my AWS CLI script needed to be modified to contain configuration for each one to be able to do that. The configuration ended up looking like this:

    --configurations '[
                        {
                          "Classification": "presto-connector-hive",
                          "Properties": { "hive.metastore.glue.datacatalog.enabled": "true" },
                          "Configurations":[]
                        },
                        {
                          "Classification": "hive-site",
                          "Properties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" }
                        }
                      ]'
    

    It looks like the current Presto versions cannot create or view partitions directly, but Hive can. My problem was that Hive wasn't configured to see the Glue catalog. Once I fixed that, Hive was able to create partitions with statements like

    ALTER TABLE mytable ADD IF NOT EXISTS
        PARTITION (p1=2018, p2=9, p3=18)
        PARTITION (p1=2018, p2=9, p3=19)
        PARTITION (p1=2018, p2=9, p3=20);