I'm trying to read a Parquet file from S3 using Akka Streams, following the official docs, but I am getting this error:
java.io.IOException: No FileSystem for scheme: s3a
This is the code that triggered the exception. I would highly appreciate any clue or example of how to do this correctly.
val path = s"s3a://bucketName/path/to/foo/part-00000-656418ee-7cc0-42ee-93e-aaa69ee6f916.c000.snappy.parquet"
val conf: Configuration = new Configuration()
conf.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, true)
val file = HadoopInputFile.fromPath(new Path(path), conf)
val reader: ParquetReader[GenericRecord] =
  AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
//should read the file lines here but not there yet ...
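You are most likely missing the hadoop-aws library on your classpath. The s3a scheme is implemented by org.apache.hadoop.fs.s3a.S3AFileSystem, which lives in that module, so without it Hadoop cannot find a FileSystem for s3a:// paths.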
Have a look here: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
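As a rough sketch, assuming you build with sbt, adding the module could look like the fragment below. The version number is only an assumption for illustration; use whichever Hadoop version is already on your classpath, since hadoop-aws has to match hadoop-common.

```scala
// build.sbt (fragment)
// Assumption: 3.3.6 is just an example; hadoop-aws must be the same
// version as the hadoop-common your project already uses.
val hadoopVersion = "3.3.6"

libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-common" % hadoopVersion,
  "org.apache.hadoop" % "hadoop-aws"    % hadoopVersion
)
```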
This SO question also gives more details on how to set up credentials for access to S3: How do I configure S3 access for org.apache.parquet.avro.AvroParquetReader?
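For illustration, one way to pass credentials is to set the S3A properties directly on the Hadoop Configuration. This is a minimal sketch: the S3aConf helper and its parameters are hypothetical names, and in a real setup you would probably rely on environment variables or an instance profile instead of keys in code.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical helper, not part of Hadoop or Alpakka.
object S3aConf {
  // accessKey and secretKey are placeholders supplied by the caller.
  def apply(accessKey: String, secretKey: String): Configuration = {
    val conf = new Configuration()
    // Map the s3a scheme to the implementation shipped in hadoop-aws.
    conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    conf.set("fs.s3a.access.key", accessKey)
    conf.set("fs.s3a.secret.key", secretKey)
    conf
  }
}
```

You could then pass the resulting Configuration to HadoopInputFile.fromPath and to withConf in place of the empty one from the question.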
Once you have AvroParquetReader correctly initialized, you can create an Akka Streams Source from it as per the Alpakka Avro Parquet doc (https://doc.akka.io/docs/alpakka/current/avroparquet.html#source-initiation).
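Putting it together, a minimal sketch might look like this. It assumes Akka 2.6+ (so the implicit ActorSystem provides the materializer), the Alpakka Avro Parquet connector and hadoop-aws on the classpath, and credentials that S3A can pick up from the environment. The object name is made up for the example.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.alpakka.avroparquet.scaladsl.AvroParquetSource
import akka.stream.scaladsl.{Sink, Source}
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Hypothetical entry point for illustration only.
object ReadParquetFromS3 extends App {
  implicit val system: ActorSystem = ActorSystem("parquet-s3")
  import system.dispatcher

  // Same path and reader setup as in the question.
  val path = "s3a://bucketName/path/to/foo/part-00000-656418ee-7cc0-42ee-93e-aaa69ee6f916.c000.snappy.parquet"
  val conf = new Configuration()
  conf.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, true)

  val file = HadoopInputFile.fromPath(new Path(path), conf)
  val reader: ParquetReader[GenericRecord] =
    AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()

  // Wrap the reader in an Akka Streams Source of Avro records.
  val source: Source[GenericRecord, NotUsed] = AvroParquetSource(reader)

  // Print every record, then shut the actor system down.
  source
    .runWith(Sink.foreach(record => println(record)))
    .onComplete(_ => system.terminate())
}
```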