python, pyspark, google-bigquery, google-cloud-dataproc, spark-bigquery-connector

Spark Read BigQuery External Table


Trying to read an external table from BigQuery, but getting an error.

    SCALA_VERSION="2.12"
    SPARK_VERSION="3.1.2"
    com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0
    com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.24.2

    table = 'data-lake.dataset.member'
    df = spark.read.format('bigquery').load(table)
    df.printSchema()

Result:

    root
     |-- createdAtmetadata: date (nullable = true)
     |-- eventName: string (nullable = true)
     |-- producerName: string (nullable = true)

So the schema prints fine, but when I run

    df.createOrReplaceTempView("member")
    spark.sql("select * from member limit 100").show()

I get this error message:

    INVALID_ARGUMENT: request failed: Only external tables with connections can be read with the Storage API.
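
To see why the Storage Read API rejects the table, its metadata can be inspected with the BigQuery client library (a hedged sketch, not part of the original setup; it assumes google-cloud-bigquery is installed and reuses the project and table names from above):

    from google.cloud import bigquery

    # Fetch table metadata; table_type is "EXTERNAL" for external tables,
    # which the Storage Read API only supports when they have a connection.
    client = bigquery.Client(project="data-lake")
    table = client.get_table("data-lake.dataset.member")
    print(table.table_type)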


Solution

  • As external tables are not supported for direct reads by the Spark connector, I tried the query-based approach instead, and it worked:

        def read_query_bigquery(project, query):
            # Run the SQL in BigQuery and let the connector read the
            # materialized result, instead of reading the table directly.
            df = spark.read.format('bigquery') \
                .option('parentProject', project) \
                .option('query', query) \
                .option('viewsEnabled', 'true') \
                .load()
            return df

        project = 'data-lake'
        # Project IDs with dashes must be backtick-quoted in BigQuery SQL.
        query = 'select * from `data-lake.dataset.member`'
        # Dataset where BigQuery materializes the query result before Spark reads it.
        spark.conf.set('materializationDataset', 'dataset')
        df = read_query_bigquery(project, query)
        df.show()
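
  • The same settings also work as per-read options rather than a global Spark conf; `viewsEnabled` and `materializationDataset` are both connector read options (a sketch against the same connector version as above):

        df = spark.read.format('bigquery') \
            .option('parentProject', 'data-lake') \
            .option('viewsEnabled', 'true') \
            .option('materializationDataset', 'dataset') \
            .option('query', 'select * from `data-lake.dataset.member`') \
            .load()
        df.show()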