azure-blob-storage  azure-data-lake-gen2  azure-java-sdk

Using sync I/O with ADL2 storage to reduce memory use in Java


We've used Azure ADL2 for some time, and the amount of memory it consumes is problematic, forcing us to use awkward and lower-performing techniques. In a discussion in this GitHub issue, it was confirmed that the API does a lot of buffering, and it was suggested that we try the "sync" interface when it became available; this Azure page suggests that v12 of the Java SDK supports the sync API. But I can't find anything more about it, and the last response I got on the GitHub issue was "thanks, but we're closing this".

I'm trying to get two questions answered:

Thanks, any help appreciated.

UPDATE: Actually, we already use what @Venkatesan is calling the "sync API". Note that we made our own "MarkableFileInputStream" (a sketch of this kind of class is shown below), which does not buffer everything in memory and yet still satisfies the markable contract. Even then, the SDK buffers nearly everything in memory. When I run the example using 8 threads and 200MB parts, you can see that the memory consumption is 1.3GB:

Creating temp file
Uploading 8 parts of size 209715200 bytes
2024-12-26T10:56:39.132 [Thread-0] MemoryLogger.logMemory:35 INFO - JVM MEMORY: used=38MB, total=1,024MB, free=985MB
2024-12-26T10:56:40.146 [Thread-0] MemoryLogger.logMemory:35 INFO - JVM MEMORY: used=16MB, total=80MB, free=63MB
2024-12-26T10:56:41.160 [Thread-0] MemoryLogger.logMemory:35 INFO - JVM MEMORY: used=1,292MB, total=4,584MB, free=3,291MB
2024-12-26T10:56:42.163 [Thread-0] MemoryLogger.logMemory:35 INFO - JVM MEMORY: used=1,304MB, total=4,584MB, free=3,279MB
...

You can download the full example from my Google Drive.

The exact amount of memory used seems to be completely up to the SDK, and the amount is "a lot". If I use 16x400MB parts, the SDK uses 2.9GB. The SDK behaves as if nobody would ever do anything in their application beyond uploading a single file, but we do a lot more: moving data between files and databases, often with multiple transfers running concurrently. The SDK is just really hungry and not predictable. Is there any way to control the SDK's memory use?
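
For context, here is a minimal sketch of the kind of markable-but-unbuffered stream mentioned in the update above. It is illustrative only (a FileChannel-based implementation; our actual class may differ in detail): mark()/reset() are satisfied by remembering and restoring the file position rather than buffering the bytes that were read.

import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;

// Illustrative sketch of a mark/reset-capable stream that does not buffer.
// The mark is just the current FileChannel position; reset() seeks back to it.
public class MarkableFileInputStream extends FilterInputStream {

    private final FileChannel channel;
    private long markPosition = -1;

    public MarkableFileInputStream(FileInputStream in) {
        super(in);
        this.channel = in.getChannel();
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public synchronized void mark(int readLimit) {
        try {
            markPosition = channel.position();  // remember where we are; nothing is copied
        } catch (IOException e) {
            throw new UncheckedIOException(e);  // mark() is not declared to throw IOException
        }
    }

    @Override
    public synchronized void reset() throws IOException {
        if (markPosition < 0) {
            throw new IOException("mark() was never called");
        }
        channel.position(markPosition);  // rewind the underlying file instead of replaying a buffer
    }
}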


Solution

  • "sync" API allow us to reduce memory consumption for multi-part uploads to ADL2?.

    Yes, the sync API is designed to help with memory consumption by allowing more direct control over the file upload process.

    You can use the code below: the append and flush methods upload the file in chunks instead of loading it entirely into memory at once, and the sync API in Azure SDK v12 offers a more direct approach that reduces the overhead and buffering of the async path in the Azure Java SDK.

    Code:

    import com.azure.storage.file.datalake.*;
    import com.azure.storage.file.datalake.models.DataLakeStorageException;
    
    import java.io.*;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    
    public class App {
    
        public static void main(String[] args) {
            String connectionString = "DefaultEndpointsProtocol=https;AccountName=xxx;AccountKey=xxxx;EndpointSuffix=core.windows.net";
            String fileSystemName = "test";
            String filename = "sample.csv";
            String localFilePath = "xxxx";
    
            DataLakeServiceClient serviceClient = new DataLakeServiceClientBuilder()
                    .connectionString(connectionString)
                    .buildClient();
    
            // Get the DataLakeFileSystemClient
            DataLakeFileSystemClient fileSystemClient = serviceClient.getFileSystemClient(fileSystemName);
    
            // Get the DataLakeFileClient for the target file path
            DataLakeFileClient fileClient = fileSystemClient.getFileClient(filename);
    
            // Record the start time
            long startTime = System.nanoTime();
    
            try {
                // Prepare the input stream for the file
                Path path = Paths.get(localFilePath);
                long fileSize = Files.size(path);
    
                // Use try-with-resources to ensure 'data' is closed automatically
                try (InputStream data = new BufferedInputStream(new FileInputStream(path.toFile()))) {
                    byte[] buffer = new byte[32 * 1024 * 1024];  // 32 MB chunk buffer, reused for each part
                    boolean overwrite = true;
                    long bytesUploaded = 0;
    
                    // Check if the file already exists
                    if (!fileClient.exists()) {
                        // Initialize the file in ADLS (create if not exists)
                        fileClient.create(overwrite);
                    }
    
                    // Upload file in parts using append and commit
                    while (bytesUploaded < fileSize) {
                        int bytesRead = data.read(buffer);
                        if (bytesRead == -1) {
                            break;  // End of file
                        }
    
                        // Wrap the chunk we just read; ByteArrayInputStream does not copy the buffer
                        InputStream inputStream = new ByteArrayInputStream(buffer, 0, bytesRead);
    
                        // Append data to the file at the specified offset
                        fileClient.append(inputStream, bytesUploaded, bytesRead);
    
                        bytesUploaded += bytesRead;
                    }
    
                    // Commit the uploaded chunks to the file
                    fileClient.flush(bytesUploaded, overwrite);
    
                    // Record the end time
                    long endTime = System.nanoTime();
                    long duration = endTime - startTime;  // elapsed time in nanoseconds
                    System.out.println("File uploaded successfully.");
                    System.out.println("Time taken to upload the file: " + duration / 1_000_000_000.0 / 60 + " minutes");
                } catch (IOException e) {
                    e.printStackTrace();  // Handle IO exception here
                }
            } catch (DataLakeStorageException e) {
                System.err.println("Azure Data Lake error: " + e.getMessage());
            } catch (IOException e) {
                System.err.println("IO Exception error: " + e.getMessage());  // Catch IOException at the outer level
            }
        }
    }
    

    Output:

    File uploaded successfully.
    Time taken to upload the file: 4.898536025 minutes
    


    Portal: (screenshot of the uploaded file in the Azure portal)

    Version:

    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-storage-file-datalake</artifactId>
        <version>12.22.0</version>
    </dependency>
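
    As a side note on controlling the SDK's memory use: if you use the single-call upload path (uploadFromFile) instead of manual append/flush, ParallelTransferOptions lets you cap the block size and concurrency the client uses, which roughly bounds how much it stages in memory at once. The snippet below is only a sketch (the values are illustrative, not tuned recommendations) and reuses the fileClient and localFilePath from the code above:

    import com.azure.storage.common.ParallelTransferOptions;

    // Client-side buffering is roughly bounded by blockSize * maxConcurrency.
    ParallelTransferOptions transferOptions = new ParallelTransferOptions()
            .setBlockSizeLong(32L * 1024 * 1024)            // stage 32 MB blocks
            .setMaxConcurrency(2)                           // at most 2 blocks in flight
            .setMaxSingleUploadSizeLong(32L * 1024 * 1024); // threshold for single-shot uploads

    // Streams the local file in block-sized chunks; headers, metadata,
    // request conditions, and timeout are left null for brevity.
    fileClient.uploadFromFile(localFilePath, transferOptions, null, null, null, null);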
    

    Reference:

    Use Java to manage data in Azure Data Lake Storage - Azure Storage | Microsoft Learn