java utf-8 character-encoding byte ascii

Java: Does file encoding affect file comparison at the level of pure bytes?


I'm using the code below to compare the content of two supposedly identical files. I've read that, at least for text files such as TXT or HTML, a file's encoding determines how its byte sequence is translated into characters: the same byte sequence will display different content when decoded as UTF-8 than when decoded as ASCII. Does file encoding affect my code below at all? Or does it not matter, because I am comparing the files' contents at the basic level of bytes, where no translation into characters takes place?

Edit: I'm using this code to compare two supposedly identical files of any file type and of any file size.

bin_1 = new BufferedInputStream(file_input_stream_1); 
bin_2 = new BufferedInputStream(file_input_stream_2);

byte[] barr_1 = new byte[8192];
byte[] barr_2 = new byte[8192]; 

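boolean identical = true;

while(identical){

    // readNBytes (Java 9+) fills the buffer unless the stream ends first;
    // read(byte[]) may return fewer bytes, and available() is not an end-of-file test.
    int n_1 = bin_1.readNBytes(barr_1, 0, barr_1.length);
    int n_2 = bin_2.readNBytes(barr_2, 0, barr_2.length);

    // Compare only the bytes actually read in this round (ranged overload, Java 9+).
    identical = Arrays.equals(barr_1, 0, n_1, barr_2, 0, n_2);

    // A short read means the end of the first stream has been reached.
    if(n_1 < barr_1.length){
        break;
    }

}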

Solution

  • Short answer: NO!

    No, file encoding does not come into play when you compare files at the byte level.

    Why? Because you read the files byte by byte and compare them byte by byte. For performance reasons you want to read larger chunks rather than single bytes, but BufferedInputStream already does that under the hood for you, so the code can simply work on individual bytes.

    InputStream::read does not interpret the byte it reads in any way.

    var isEqual = true;
    try( final var inputStream1 = new BufferedInputStream( fileInputStream1 );
        final var inputStream2 = new BufferedInputStream( fileInputStream2 ) )
    {
      ReadLoop: while( isEqual )
      {
        // read() returns the next byte as an int in 0..255, or -1 at end of stream
        final var v1 = inputStream1.read();
        final var v2 = inputStream2.read();
        isEqual = v1 == v2;
        if( v1 == -1 ) break ReadLoop;
      }  // ReadLoop:
    }
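
As a self-contained illustration of the same point, the sketch below compares files purely by their bytes. It assumes Java 12 or later, because it relies on Files.mismatch; the class name ByteLevelCompare and the helper sameBytes are invented for this example. It writes the same text once as UTF-8 and once as ISO-8859-1: the byte-level comparison reports those two files as different, and reports an exact byte copy as identical, without ever needing to know which encoding was used.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ByteLevelCompare {

    /** Returns true if both files hold exactly the same byte sequence. */
    static boolean sameBytes(Path first, Path second) throws IOException {
        // Files.mismatch (Java 12+) returns -1 when no byte differs.
        return Files.mismatch(first, second) == -1L;
    }

    public static void main(String[] args) throws IOException {
        Path utf8 = Files.createTempFile("utf8", ".txt");
        Path latin1 = Files.createTempFile("latin1", ".txt");
        Path copy = Files.createTempFile("copy", ".txt");
        try {
            String text = "caf\u00E9"; // "café"
            Files.write(utf8, text.getBytes(StandardCharsets.UTF_8));        // 5 bytes
            Files.write(latin1, text.getBytes(StandardCharsets.ISO_8859_1)); // 4 bytes
            Files.write(copy, Files.readAllBytes(utf8));                     // exact byte copy

            System.out.println(sameBytes(utf8, copy));   // prints "true"
            System.out.println(sameBytes(utf8, latin1)); // prints "false"
        } finally {
            Files.deleteIfExists(utf8);
            Files.deleteIfExists(latin1);
            Files.deleteIfExists(copy);
        }
    }
}
```

The comparison only answers "are the bytes identical?". Two files holding the same text in different encodings are, correctly, reported as different files.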
    

    It would be different if you used an instance of Reader instead of an InputStream. A Reader assumes text and decodes the bytes into characters according to an encoding.
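
To make the Reader behaviour concrete, here is a minimal sketch; the class name DecodingDemo and the helper decode are made up for illustration. It pushes the same two bytes through an InputStreamReader twice, once as UTF-8 and once as ISO-8859-1, and the resulting character sequences differ even though the underlying bytes are identical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodingDemo {

    /** Decodes the bytes with the given charset by reading them through a Reader. */
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            // Reader.read() returns one decoded char (as an int), or -1 at end of input.
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9}; // the UTF-8 encoding of U+00E9

        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);        // one character
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1); // two characters

        System.out.println(asUtf8.length());         // prints 1
        System.out.println(asLatin1.length());       // prints 2
        System.out.println(asUtf8.equals(asLatin1)); // prints false
    }
}
```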

    From the Javadoc for FileInputStream:

    FileInputStream is meant for reading streams of raw bytes such as image data. For reading streams of characters, consider using FileReader.

    "File encoding" is a concept that only matters when you explicitly deal with textual data, that is, with files read as "streams of characters" (text files).