Tags: apache-spark, avro, spark-avro

AVRO file not read fully by Spark


I am reading an AVRO file stored on ADLS Gen2 using Spark as follows:

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

file="abfss://eventhub_user@mydatalake.dfs.core.windows.net/xyz-event-collection/my-events/27/2021/11/01/01/01/01.avro"
key="..........."
appName = "MyEventsReadTest"
master = "local[*]"
sparkConf=SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("fs.azure.account.key.mydatalake.dfs.core.windows.net", key)

spark=SparkSession.builder.config(conf=sparkConf).getOrCreate()
df=spark.read.format("avro").load(file)
df.show()

I submit this readEventsFromADLS2.py file as follows:

spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.8 --jars hadoop-azure-3.3.1.jar  ./readEventsFromADLS2.py

However, the output I get is truncated:

21/11/15 13:21:03 INFO CodeGenerator: Code generated in 13.582867 ms
+--------------+--------+--------------------+--------------------+----------+--------------------+
|SequenceNumber|  Offset|     EnqueuedTimeUtc|    SystemProperties|Properties|                Body|
+--------------+--------+--------------------+--------------------+----------+--------------------+
|         31411|21976208|11/10/2021 12:11:...|{x-opt-enqueued-t...|        {}|[7B 22 70 61 79 6...|
|         31412|21977032|11/10/2021 12:11:...|{x-opt-enqueued-t...|        {}|[7B 22 70 61 79 6...|
|         31413|21977736|11/10/2021 12:12:...|{x-opt-enqueued-t...|        {}|[7B 22 70 61 79 6...|
|         31414|21977800|11/10/2021 12:12:...|{x-opt-enqueued-t...|        {}|[7B 22 70 61 79 6...|
|         31415|21978336|11/10/2021 12:12:...|{x-opt-enqueued-t...|        {}|[7B 22 70 61 79 6...|
|         31416|21978872|11/10/2021 12:12:...|{x-opt-enqueued-t...|        {}|[7B 22 70 61 79 6...|
|         31417|21979632|11/10/2021 12:12:...|{x-opt-enqueued-t...|        {}|[7B 22 70 61 79 6...|
+--------------+--------+--------------------+--------------------+----------+--------------------+

21/11/15 13:21:03 INFO SparkContext: Invoking stop() from shutdown hook

Questions:

  1. How do I print the fully expanded columns in the above output?
  2. How do I see the Body field (the last column in the above output) as text? Body is actually JSON, but it comes out as a byte array here.

When I change `df.show()` to `df.show(10, False)`, I still get the binary byte-array representation for the Body field:

|31411 |21976208|11/10/2021 12:11:46 PM|{x-opt-enqueued-time -> {1636546306366, null, null, null}}|{} |[7B 22 70 61 79 6C 6F 61 64 22 3A....]


Solution

  • To display the full, untruncated contents of the column, use the PySpark form (`show(false)` is the Scala API):

    df.select("Body").show(truncate=False)
    

    If the data really is JSON and you want it read as JSON, consider specifying the schema explicitly instead of letting Spark infer it for you.
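    To answer the second question: the Body column in Event Hubs capture files is a binary column, so `show(truncate=False)` still prints raw bytes. Casting it to a string decodes the UTF-8 JSON text, and `from_json` can then parse it into columns. Below is a minimal, self-contained sketch using a fabricated one-row DataFrame in place of the actual AVRO file; the `payload` field in the schema is an assumption, so adjust it to whatever your real JSON contains:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.master("local[*]").appName("BodyDecode").getOrCreate()

    # Fabricated row mimicking the capture format: Body is binary-encoded JSON.
    # In the real job this DataFrame comes from spark.read.format("avro").load(file).
    df = spark.createDataFrame(
        [(31411, bytearray(b'{"payload": "hello"}'))],
        ["SequenceNumber", "Body"],
    )

    # Cast the binary Body column to a string to see the JSON text.
    decoded = df.withColumn("BodyText", col("Body").cast("string"))
    decoded.select("BodyText").show(truncate=False)

    # Optionally parse the JSON into a struct column with an assumed schema.
    schema = StructType([StructField("payload", StringType())])
    parsed = decoded.withColumn("BodyJson", from_json(col("BodyText"), schema))
    parsed.select("BodyJson.payload").show(truncate=False)
    ```

    The cast alone is enough to eyeball the payloads; `from_json` is only worthwhile once you know the payload schema and want to query individual fields.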