java apache-spark hadoop2

What is the difference between FileInputStream/FileOutputStream and FSDataInputStream/FSDataOutputStream, and where should each be used?


I am trying to understand the difference between FileInputStream and FSDataInputStream, and between FileOutputStream and FSDataOutputStream.

In a Spark Java application, I am trying to read a file from an S3 bucket, apply some formatting changes, and then write it to another S3 bucket.

I am confused about whether I need FileInputStream or FSDataInputStream to read the file, and whether to write it to the S3 bucket with FileOutputStream or FSDataOutputStream.

Could someone explain how and where to use each of them appropriately, with an example?


Solution

  • You can use either; it just depends on what you need.

    Both are just stream implementations. Ultimately, what you will be doing is taking an InputStream from one bucket and writing it to the OutputStream of another.

    FileInputStream and FileOutputStream (java.io) are concrete components that read and write the bytes of a file on the local file system. On their own they cannot open an object in S3; the data has to reach a local file first.
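
    As a minimal sketch of the local-file case, the helper below copies one file to another with FileInputStream and FileOutputStream. The class and method names (LocalFileCopy, copy) are hypothetical and only illustrate the pattern; both arguments are assumed to be paths on the local disk.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper; the names are illustrative and not from the original answer.
final class LocalFileCopy {

    // Copies a local file byte for byte and returns the number of bytes copied.
    // Both arguments must be local file system paths.
    static long copy(String sourcePath, String targetPath) throws IOException {
        long total = 0;
        // try-with-resources closes both streams, even if the copy fails.
        try (InputStream in = new FileInputStream(sourcePath);
             OutputStream out = new FileOutputStream(targetPath)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
        }
        return total;
    }
}
```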

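    FSDataInputStream and FSDataOutputStream (org.apache.hadoop.fs) are concrete decorators: they wrap another stream and add functionality to it, such as reading and writing Java primitives, and FSDataInputStream also supports seeking. You normally obtain them from a Hadoop FileSystem (open returns an FSDataInputStream, create returns an FSDataOutputStream), which is how Hadoop and Spark code usually reaches S3, for example through an s3a:// path.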
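
    To illustrate what "decorating" a stream means, here is a small sketch that uses only the JDK classes the Hadoop streams extend: FSDataInputStream extends java.io.DataInputStream and FSDataOutputStream extends java.io.DataOutputStream. The in-memory byte array stands in for a real file or S3 object, and the class name DecoratorRoundTrip is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical example class; an in-memory buffer replaces a real file or S3 object.
final class DecoratorRoundTrip {

    // Writes an int and a string through a decorated output stream,
    // then reads them back through a decorated input stream.
    static String roundTrip(int id, String name) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // DataOutputStream decorates the plain byte sink with writeInt/writeUTF.
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeInt(id);
            out.writeUTF(name);
        }
        // DataInputStream decorates the plain byte source with readInt/readUTF.
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(sink.toByteArray()))) {
            int readId = in.readInt();
            String readName = in.readUTF();
            return readId + ":" + readName;
        }
    }
}
```

    The values must be read back in the same order and with the same types they were written in, since the stream itself carries no field names.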

    Which one to choose? Do you need the decorations provided by FSDataInputStream and FSDataOutputStream, or are FileInputStream and FileOutputStream sufficient?

    Personally, I would use Readers and Writers, as demonstrated in the answer to "How can I read an AWS S3 File with Java?":

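    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.S3Object;
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.stream.Collectors;

    // BUCKET_NAME, FILE_NAME and log are assumed to be defined in the enclosing class.
    private final AmazonS3 amazonS3Client = AmazonS3ClientBuilder.standard().build();

    private Collection<String> loadFileFromS3() {
        // try-with-resources closes the S3 object and both readers, even on failure.
        try (final S3Object s3Object = amazonS3Client.getObject(BUCKET_NAME, FILE_NAME);
             final InputStreamReader streamReader = new InputStreamReader(s3Object.getObjectContent(), StandardCharsets.UTF_8);
             final BufferedReader reader = new BufferedReader(streamReader)) {
            return reader.lines().collect(Collectors.toSet());
        } catch (final IOException e) {
            log.error(e.getMessage(), e);
            return Collections.emptySet();
        }
    }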
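
    The same Reader/Writer approach can cover the "apply some formatting changes, then write" part of the question. The sketch below works against plain InputStream and OutputStream, so in principle it accepts any stream source or sink, whether that is an S3 object's content stream, an FSDataInputStream, or a local file. The class name LineFormatter and the formatting step (trimming each line) are hypothetical placeholders for whatever formatting you actually need.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; trimming stands in for the real formatting logic.
final class LineFormatter {

    // Reads UTF-8 text line by line, trims each line, and writes it out
    // followed by a single '\n'. Closes both streams when done.
    static void reformat(InputStream source, OutputStream target) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(source, StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(target, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line.trim());
                writer.write('\n');
            }
        }
    }
}
```

    Processing line by line keeps memory use flat for large files, unlike collecting every line into a collection first.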