We've set up an HDInsight cluster on Azure with Azure Blob storage as the storage layer for Hadoop. We tried uploading files to Hadoop using the hadoop CLI, and the files were written to the Azure Blob storage as expected.
Command used to upload:
hadoop fs -put somefile /testlocation
However, when we tried using Spark to write files to Hadoop, they were not written to Azure Blob storage but to the local disks of the VMs, in the datanode directory specified in hdfs-site.xml.
Code used:
df1mparquet = spark.read.parquet("hdfs://hostname:8020/dataSet/parquet/")
df1mparquet.write.parquet("hdfs://hostname:8020/dataSet/newlocation/")
Strange behavior:
When we run:
hadoop fs -ls / => It lists the files from Azure Blob storage
hadoop fs -ls hdfs://hostname:8020/ => It lists the files from local storage
Is this expected behavior?
You need to look at the value of fs.defaultFS in core-site.xml.
Sounds like the default filesystem is blob storage.
https://hadoop.apache.org/docs/current/hadoop-azure/index.html
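As a quick check, you can print the value Spark itself has resolved from the loaded Hadoop configuration. This is a minimal sketch, assuming an active SparkSession named spark; _jsc is PySpark's internal bridge to the JVM, so treat it as an inspection trick rather than a public API. On the cluster itself, hdfs getconf -confKey fs.defaultFS should print the same value.

# Print the default filesystem Spark picked up from core-site.xml
# (on an HDInsight cluster this is typically a wasb:// or wasbs:// URI)
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.defaultFS"))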
Regarding Spark: if it's loading the same Hadoop configs as the CLI, you shouldn't need to specify the namenode host/port. Just use plain file paths, and it will also default to blob storage.
If you specify a full URI to a different filesystem, then it'll use that one instead, but hdfs:// points at the cluster's own HDFS (which keeps its blocks on the VM disks), and that should still be different from the actual local file:// filesystem.
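For example, something along these lines should behave as described. This is a rough sketch, not your exact setup: the container name, storage account, and paths are placeholders you'd replace with your own values.

# Scheme-less paths resolve against fs.defaultFS, i.e. blob storage on HDInsight
df = spark.read.parquet("/dataSet/parquet/")
df.write.parquet("/dataSet/newlocation/")

# An explicit wasbs:// URI also targets blob storage
# (replace <container> and <account> with your own values)
df.write.parquet("wasbs://<container>@<account>.blob.core.windows.net/dataSet/newlocation/")

# An explicit hdfs:// URI targets the cluster-local HDFS instead,
# which is why your output ended up on the VM disks
df.write.parquet("hdfs://hostname:8020/dataSet/newlocation/")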