We've used Azure ADL2 (ADLS Gen2) for some time, and the amount of memory the SDK consumes is problematic, forcing us to use awkward and lower-performing techniques. In a discussion in this GitHub issue it was confirmed that the API does a great deal of buffering, and it was suggested that we try the "sync" interface when it becomes available; this Azure page suggests that v12 of the Java SDK supports the sync API. But I can't find anything more about it, and the last response I got on the GitHub issue was "thanks, but we're closing this".
I'm trying to get two questions answered:
Thanks, any help appreciated.
UPDATE: Actually, we already use what @Venkatesan is calling the "sync API". Note that we made our own MarkableFileInputStream, which does not buffer everything in memory and yet still satisfies the markable criteria (a sketch of the idea follows the log below). Even then the SDK buffers nearly everything in memory. When I run the example using 8 threads and 200MB parts, you can see that the memory consumption is 1.3GB:
Creating temp file
Uploading 8 parts of size 209715200 bytes
2024-12-26T10:56:39.132 [Thread-0] MemoryLogger.logMemory:35 INFO - JVM MEMORY: used=38MB, total=1,024MB, free=985MB
2024-12-26T10:56:40.146 [Thread-0] MemoryLogger.logMemory:35 INFO - JVM MEMORY: used=16MB, total=80MB, free=63MB
2024-12-26T10:56:41.160 [Thread-0] MemoryLogger.logMemory:35 INFO - JVM MEMORY: used=1,292MB, total=4,584MB, free=3,291MB
2024-12-26T10:56:42.163 [Thread-0] MemoryLogger.logMemory:35 INFO - JVM MEMORY: used=1,304MB, total=4,584MB, free=3,279MB
...
You can download the full example from my Google Drive.
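For reference, here is a minimal sketch of the idea behind our MarkableFileInputStream; our production class differs in the details, so treat this as illustrative only. It satisfies mark/reset by seeking the underlying file channel instead of buffering the data in memory.

import java.io.FileInputStream;
import java.io.IOException;

// Illustrative sketch only: mark() records the current channel position and
// reset() seeks back to it, so no read data needs to be retained on the heap.
public class MarkableFileInputStream extends FileInputStream {
    private long markPosition = 0;

    public MarkableFileInputStream(String path) throws IOException {
        super(path);
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public synchronized void mark(int readLimit) {
        try {
            // Remember where we are; readLimit is irrelevant since nothing is buffered
            markPosition = getChannel().position();
        } catch (IOException e) {
            throw new IllegalStateException("Cannot record mark position", e);
        }
    }

    @Override
    public synchronized void reset() throws IOException {
        // Seek back to the marked position
        getChannel().position(markPosition);
    }
}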
The exact amount of memory used seems entirely up to the SDK, but the amount is "a lot": if I use 16 x 400MB parts, the SDK uses 2.9GB. The SDK behaves as though nobody would ever do more in their application than upload a single file, but we do much more, moving data between files and databases and often performing multiple transfers concurrently. The SDK is simply very memory-hungry and unpredictable. Is there any way to control the SDK's memory consumption?
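For context, the shape of the buffered upload call we make looks roughly like the sketch below (connection string, paths, and sizes are placeholders, not our production code). ParallelTransferOptions is where the part size and concurrency are set; whether tuning it actually bounds the SDK's memory is exactly what I'm asking.

import com.azure.storage.common.ParallelTransferOptions;
import com.azure.storage.file.datalake.DataLakeFileClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;

public class BufferedUploadSketch {
    public static void main(String[] args) {
        // Placeholder credentials and paths
        DataLakeFileClient fileClient = new DataLakeServiceClientBuilder()
                .connectionString("<connection-string>")
                .buildClient()
                .getFileSystemClient("test")
                .getFileClient("big-file.bin");

        // 8 concurrent transfers of 200MB parts, matching the run logged above
        ParallelTransferOptions options = new ParallelTransferOptions()
                .setBlockSizeLong(200L * 1024 * 1024)
                .setMaxConcurrency(8);

        // uploadFromFile stages the file in parts; in our runs memory climbs
        // with part size and thread count (about 1.3GB for 8 x 200MB parts)
        fileClient.uploadFromFile("/path/to/local/file", options, null, null, null, null);
    }
}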
"sync" API allow us to reduce memory consumption for multi-part uploads to ADL2?.
Yes, the sync API is designed to help with memory consumption by allowing more direct control over the file upload process.
You can use the code below: the append and flush methods reduce memory usage compared with loading the entire file into memory at once, and the sync API in Azure SDK v12 for Java offers a more straightforward approach with less overhead and buffering.
Code:
import com.azure.storage.file.datalake.*;
import com.azure.storage.file.datalake.models.DataLakeStorageException;
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class App {
    public static void main(String[] args) {
        String connectionString = "DefaultEndpointsProtocol=https;AccountName=xxx;AccountKey=xxxx;EndpointSuffix=core.windows.net";
        String fileSystemName = "test";
        String filename = "sample.csv";
        String localFilePath = "xxxx";

        DataLakeServiceClient serviceClient = new DataLakeServiceClientBuilder()
                .connectionString(connectionString)
                .buildClient();

        // Get the DataLakeFileSystemClient
        DataLakeFileSystemClient fileSystemClient = serviceClient.getFileSystemClient(fileSystemName);

        // Get the DataLakeFileClient for the target file path
        DataLakeFileClient fileClient = fileSystemClient.getFileClient(filename);

        // Record the start time
        long startTime = System.nanoTime();

        try {
            // Prepare the input stream for the file
            Path path = Paths.get(localFilePath);
            long fileSize = Files.size(path);

            // Use try-with-resources to ensure 'data' is closed automatically
            try (InputStream data = new BufferedInputStream(new FileInputStream(path.toFile()))) {
                byte[] buffer = new byte[32 * 1024 * 1024]; // 32MB buffer size
                boolean overwrite = true;
                long bytesUploaded = 0;

                // Initialize the file in ADLS if it does not already exist
                if (!fileClient.exists()) {
                    fileClient.create(overwrite);
                }

                // Upload the file in parts using append, then commit with flush
                while (bytesUploaded < fileSize) {
                    int bytesRead = data.read(buffer);
                    if (bytesRead == -1) {
                        break; // End of file
                    }
                    InputStream inputStream = new ByteArrayInputStream(buffer, 0, bytesRead);
                    // Append data to the file at the specified offset
                    fileClient.append(inputStream, bytesUploaded, bytesRead);
                    bytesUploaded += bytesRead;
                }

                // Commit the uploaded chunks to the file
                fileClient.flush(bytesUploaded, overwrite);

                // Record the end time (duration is in nanoseconds)
                long endTime = System.nanoTime();
                long duration = endTime - startTime;

                System.out.println("File uploaded successfully.");
                System.out.println("Time taken to upload the file: " + duration / 1_000_000_000.0 / 60 + " minutes");
            } catch (IOException e) {
                e.printStackTrace(); // Handle IO exceptions while reading the local file
            }
        } catch (DataLakeStorageException e) {
            System.err.println("Azure Data Lake error: " + e.getMessage());
        } catch (IOException e) {
            System.err.println("IO Exception error: " + e.getMessage()); // e.g. Files.size failing
        }
    }
}
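Because each loop iteration reads at most 32MB into the reusable buffer, appends that chunk, and then overwrites the buffer on the next read, the memory footprint of this upload stays near the buffer size rather than the size of the whole file.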
Output:
File uploaded successfully.
Time taken to upload the file: 4.898536025 minutes
Portal: (screenshot of the uploaded file in the Azure portal)
Version:
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-storage-file-datalake</artifactId>
    <version>12.22.0</version>
</dependency>
Reference:
Use Java to manage data in Azure Data Lake Storage - Azure Storage | Microsoft Learn