I am trying to upload my on-premises data to Azure Data Lake Storage. The data is about 10 GB in total and is divided into multiple folders. I have tried multiple ways to upload the files; each file ranges from a few KB to 56 MB, and all of them are binary data files.
First, I tried to upload them using the Python SDK for Azure Data Lake with the following function:
def upload_file_to_directory_bulk(filesystem_name, directory_name, fname_local, fname_uploaded):
    try:
        # service_client is an already-authenticated DataLakeServiceClient
        file_system_client = service_client.get_file_system_client(file_system=filesystem_name)
        directory_client = file_system_client.get_directory_client(directory_name)
        file_client = directory_client.get_file_client(fname_uploaded)

        # read the local file (the files are binary, but they are read here in text mode with latin-1)
        local_file = open(fname_local, 'r', encoding='latin-1')
        file_contents = local_file.read()

        file_client.upload_data(file_contents, length=len(file_contents), overwrite=True, validate_content=True)
    except Exception as e:
        print(e)
The problem with this function is that it either skips some of the files in the local folder, or some of the uploaded files end up with a different size than the corresponding local file.
The second method I tried was uploading the whole folder with Azure Storage Explorer, but Storage Explorer would crash/fail after uploading about 90 to 100 files. Is there any way I can see the logs to find out why it stopped?
Thirdly, I uploaded the files manually through the Azure Portal, but that was a complete mess, as it also failed on some files.
Can anyone guide me on how to upload bulk data to Azure Data Lake? And what could be causing the problems in these three methods?
Uploading files using the Azure Portal is the easiest and most reliable option. I'm not sure what exactly is going wrong on your side, assuming you have a reliable internet connection.
I have uploaded around 2.67 GB of data containing 691 files, and it went through easily without any issue. Many of the files are 75+ MB in size. See the image shared below.
If you split your data into 4 groups and upload each group separately, you should be able to upload all the files without any issue.
Another Approach
You can use AzCopy to upload the data. AzCopy is a command-line utility that you can use to copy blobs or files to or from a storage account, and it can easily upload large files with a few simple command-line commands.
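As a rough sketch, an AzCopy upload to a Data Lake Storage Gen2 account looks something like the commands below; the storage account name, filesystem (container), target directory, and local folder path are placeholders you would replace with your own values:

# sign in once, then copy the local folder recursively into the filesystem
azcopy login
azcopy copy "C:\local\data-folder" "https://<storage-account>.dfs.core.windows.net/<filesystem>/<directory>" --recursive

AzCopy also retries failed transfers and writes a log file for each job, which can help diagnose failures like the ones you ran into.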
Refer: Get started with AzCopy, Upload files to Azure Blob storage by using AzCopy