apache-spark apache-spark-sql spark-hive

How to create external Hive table without location?


I have a Spark SQL 2.1.1 job on a YARN cluster in cluster mode, in which I want to create an empty external Hive table (partitions with locations will be added in a later step).

CREATE EXTERNAL TABLE IF NOT EXISTS new_table (id BIGINT, StartTime TIMESTAMP, EndTime TIMESTAMP) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

When I run the job, I get the error:

CREATE EXTERNAL TABLE must be accompanied by LOCATION

But when I run the same query in the Hive Editor in Hue, it runs just fine. I tried to find an answer in the Spark SQL 2.1.1 documentation but came up empty.

Does anyone know why Spark SQL is stricter about these queries than Hive?


Solution

  • TL;DR EXTERNAL with no LOCATION is not allowed.

    The definitive answer is in Spark SQL's grammar definition file SqlBase.g4.

    You can find the definition of CREATE EXTERNAL TABLE as createTableHeader:

    CREATE TEMPORARY? EXTERNAL? TABLE (IF NOT EXISTS)? tableIdentifier
    

    This definition is used in the supported SQL statements.

    Unless I'm mistaken, locationSpec is optional according to the ANTLR grammar. The code, however, may decide otherwise, and it seems it does:

    scala> spark.version
    res4: String = 2.3.0-SNAPSHOT
    
    val q = "CREATE EXTERNAL TABLE IF NOT EXISTS new_table (id BIGINT, StartTime TIMESTAMP, EndTime TIMESTAMP) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'"
    scala> sql(q)
    org.apache.spark.sql.catalyst.parser.ParseException:
    Operation not allowed: CREATE EXTERNAL TABLE must be accompanied by LOCATION(line 1, pos 0)
    
    == SQL ==
    CREATE EXTERNAL TABLE IF NOT EXISTS new_table (id BIGINT, StartTime TIMESTAMP, EndTime TIMESTAMP) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    ^^^
    
      at org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:39)
      at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitCreateHiveTable$1.apply(SparkSqlParser.scala:1096)
      at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitCreateHiveTable$1.apply(SparkSqlParser.scala:1064)
      at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
      at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitCreateHiveTable(SparkSqlParser.scala:1064)
      at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitCreateHiveTable(SparkSqlParser.scala:55)
      at org.apache.spark.sql.catalyst.parser.SqlBaseParser$CreateHiveTableContext.accept(SqlBaseParser.java:1124)
      at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)
      at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
      at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
      at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
      at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:70)
      at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:69)
      at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:68)
      at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:97)
      at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
      at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
      at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
      ... 48 elided
    

    The default SparkSqlParser (with astBuilder as SparkSqlAstBuilder) has the following check that leads to the exception:

    if (external && location.isEmpty) {
      operationNotAllowed("CREATE EXTERNAL TABLE must be accompanied by LOCATION", ctx)
    }
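
    If you need the table to exist before the real data paths are known, one pragmatic way around this check (a sketch, not the only option) is to give the table a placeholder LOCATION up front; the HDFS path below is purely hypothetical:

    // Sketch only: the table-level path is a hypothetical placeholder,
    // since the data will live under per-partition locations added later.
    val createWithLocation =
      """CREATE EXTERNAL TABLE IF NOT EXISTS new_table (
        |  id BIGINT, StartTime TIMESTAMP, EndTime TIMESTAMP)
        |PARTITIONED BY (year INT, month INT, day INT)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        |LOCATION 'hdfs:///tmp/new_table'""".stripMargin
    spark.sql(createWithLocation)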
    

    I'd consider reporting an issue in Spark's JIRA if you think the case should be allowed. See SPARK-2825, which makes a strong argument for supporting it:

    CREATE EXTERNAL TABLE already works as far as I know and should have the same semantics as Hive.
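
    As a follow-up to the original plan of adding partitions with locations in a later step: the per-partition locations can be registered afterwards regardless of which table-level path was used. A minimal sketch, with hypothetical partition values and paths:

    // Sketch: attach an explicit, hypothetical location to one partition.
    spark.sql(
      """ALTER TABLE new_table ADD IF NOT EXISTS
        |PARTITION (year = 2017, month = 6, day = 1)
        |LOCATION 'hdfs:///data/new_table/year=2017/month=6/day=1'""".stripMargin)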