scala, apache-spark, dataframe

Read a PDF file into an Apache Spark DataFrame


We can read an Avro file using the code below:

val df = spark.read.format("com.databricks.spark.avro").load(path)

Is it possible to read PDF files using Spark DataFrames?


Solution

  • Spark has no built-in reader that loads a PDF into a DataFrame, because it cannot interpret the file's contents as columns (a PDF has no standard schema). If you want to get data out of a PDF, first convert it to a format with a defined schema, such as CSV or Parquet, and then read that file to create the DataFrame.

    See this GitBook for the read formats you can use to load data as a DataFrame:

    DataFrameReader — Loading Data From External Data Sources
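
    The convert-then-read approach described above could look roughly like the sketch below. It assumes the PDF has already been converted to CSV by some external tool (that step is not shown), and it writes a tiny stand-in CSV to a temporary file so the example runs on its own. The object name PdfViaCsvSketch, the helper readExtractedCsv, and the column names are hypothetical, chosen only for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object PdfViaCsvSketch {
  // Hypothetical helper: load a CSV that was produced from the PDF beforehand.
  // The header row supplies the column names; inferSchema guesses the types.
  def readExtractedCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pdf-via-csv-sketch")
      .master("local[*]") // assumption: local run, adjust for a cluster
      .getOrCreate()

    // Stand-in for the output of the external PDF-to-CSV conversion step.
    val csvPath = Files.createTempFile("extracted", ".csv")
    Files.write(
      csvPath,
      "invoice_id,amount\n1,10.5\n2,20.0\n".getBytes(StandardCharsets.UTF_8)
    )

    val df = readExtractedCsv(spark, csvPath.toString)
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```

    Once the data is in CSV (or Parquet), the DataFrame has named, typed columns, which is exactly what the raw PDF lacks.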