I use Databricks (Python) to download data from a public source directly into a Windows shared drive (a NetApp folder) over the SMB protocol, but it downloads at about 700 kbps on average, whereas Azure Data Factory (on the same Azure VNet) does the same job at about 5.5 Mbps on average.
Some additional details for context:
This is the code used in Databricks:
import os
import smbclient
import urllib.request
from tqdm import tqdm
import shutil

# Set up the SMB client with my credentials
smbclient.ClientConfig(username='user', password='password')

# Constants for file transfer
CHUNK_SIZE = 1024 * 1024 * 1  # 1 MB chunk size for large files

# Function to download a file from a URL directly to an SMB share (with progress bar)
def download_to_smb(url, smb_file_path):
    with urllib.request.urlopen(url) as response:
        file_size = int(response.getheader('Content-Length'))
        with smbclient.open_file(smb_file_path, mode='wb') as fdst:
            progress = tqdm(total=file_size, unit='B', unit_scale=True, desc=f"Downloading to {smb_file_path}")
            while True:
                chunk = response.read(CHUNK_SIZE)
                if not chunk:
                    break
                fdst.write(chunk)
                progress.update(len(chunk))
            progress.close()
    print(f"File downloaded from {url} to {smb_file_path}")

# Download file directly to SMB
file_url = 'https://xxx/file.gz'
smb_file_path_download = r'\\server\folder/'
download_to_smb(file_url, smb_file_path_download)
My goal is to download the data via Databricks at a minimum of 4 Mbps. How can I reach that? Is there a way other than SMB to connect to the share from Databricks? Or is this a technical limitation of the driver in Databricks itself?
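If Databricks is the only option available for this work, try the following configuration for the SMB client: smbclient.ClientConfig(username='user', password='password', min_protocol="SMB3", socket_options="TCP_NODELAY IPTOS_LOWDELAY"). Note that min_protocol and socket_options are named after Samba smb.conf settings, so check that your version of the smbclient library actually honors them.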
Also, sometimes the source URL itself is the bottleneck for download speed. Try different download URLs to check how the environment performs.
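To tell whether the source URL or the SMB write is the slow side, it may help to time the two halves of the copy loop separately. The following is only a minimal sketch: `copy_with_timing` and `megabits_per_second` are hypothetical helper names, not part of smbclient or Databricks, and the self-check at the bottom uses in-memory streams instead of a real URL or share.

```python
import io
import time


def copy_with_timing(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in chunks, timing reads and writes separately.

    src and dst are any file-like objects with read() and write().
    Returns (total_bytes, read_seconds, write_seconds).
    """
    total = 0
    read_seconds = 0.0
    write_seconds = 0.0
    while True:
        t0 = time.perf_counter()
        chunk = src.read(chunk_size)
        t1 = time.perf_counter()
        read_seconds += t1 - t0
        if not chunk:
            break
        dst.write(chunk)
        write_seconds += time.perf_counter() - t1
        total += len(chunk)
    return total, read_seconds, write_seconds


def megabits_per_second(num_bytes, seconds):
    """Convert a byte count and a duration into megabits per second."""
    if seconds <= 0:
        return float("inf")
    return (num_bytes * 8 / 1_000_000) / seconds


# Local self-check with in-memory streams (no network or SMB share needed).
payload = b"x" * (3 * 1024 * 1024)
sink = io.BytesIO()
total, read_s, write_s = copy_with_timing(io.BytesIO(payload), sink)
print(total, sink.getvalue() == payload)
```

For a real run, the response object from urllib.request.urlopen(url) could be passed as src and the file object from smbclient.open_file(path, mode='wb') as dst. If read_seconds dominates, the source URL is the likely limit; if write_seconds dominates, look at the SMB side. Because the SMB file object may buffer writes, some of the write cost can show up only when the file is flushed or closed, so treat the split as approximate.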