java, binary, byte, dataoutputstream

Confusion around how byte array writing works in Java


Let’s say I have a huge set of strings that I want to write into a file as efficiently as possible. I don’t care if it’s not human readable.

The first thing that came to my mind was to write the string as raw bytes to a binary file. I tried using DataOutputStream and writing the byte array. However, when I open my file, it is readable.

How does this work? Does it actually write binary under the hood and only my text editor is making it readable?

Is this the most efficient way to do this? I’d use this for a project where performance is key so I’m looking for the fastest way to write to a file (no need to be human readable).

Thanks in advance.


Solution

  • Files are just a sack of bytes. They always are, even a txt file.

    Characters don't really exist, in the sense that computers don't know what they are, not really. They just know numbers.

    So, how does that work?

    Welcome to the wonderful world of text encoding.

    The job of storing, say, the string "Hello!" in a file requires converting the notion of H, e, l, l, o, and ! into bytes first, and then writing those bytes into a file.

    In order to do that, you first need a lookup table; one that translates characters into numbers. Then, we have to convert those numbers to bytes, and then we can save them to a file.

    A common encoding is US-ASCII. US-ASCII maps only 95 printable characters: the 26 letters of the English alphabet in both lower- and uppercase form, all digits, a few useful symbols such as !@#$%^&*(, and space. That's it. US-ASCII simply has no 'mapping' for e.g. é or ☃ or even 😊.

    All these characters are mapped to a number between 32 and 126, so to put them in a text file, you just write those numbers; a byte can represent anything between 0 and 255, so it 'just fits' (in fact, the high bit is always 0).
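
    You can see that lookup happen in Java itself; a minimal sketch (String.getBytes does the table translation for you):

    ```java
    import java.nio.charset.StandardCharsets;

    public class AsciiDemo {
        public static void main(String[] args) {
            // A char in Java is, under the hood, just a number:
            System.out.println((int) 'H'); // prints 72

            // getBytes runs the whole string through the US-ASCII table:
            byte[] bytes = "Hello!".getBytes(StandardCharsets.US_ASCII);
            for (byte b : bytes) {
                System.out.print(b + " "); // prints 72 101 108 108 111 33
            }
        }
    }
    ```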

    But, it's 2021, and we have emoji and we figured out a while ago that as it turns out, there are languages out there that aren't english, amazing, that.

    So, the commonly used table is the unicode table. This table represents a liiiiitle more than that. Nono, this table has a whopping, at time of writing, 143,859 characters in it. Holy moly batman, that's a ton.

    Clearly, the numbers that these 143,859 glyphs are mapped to must, at the very least, fall between 0 and 143,859 (it's actually a larger number range; there are gaps for convenience and to leave room for future updates).

    You could just state that each number is an int (ints are 4 bytes and cover 0 to 2^31 - 1, which is plenty), and store each character as an int (so Hello!, at 6 characters, would turn into a file on disk that is 24 bytes large).
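
    You could even sketch that naive scheme with the DataOutputStream you already tried; the hello.bin file name here is just an example:

    ```java
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class IntPerChar {
        public static void main(String[] args) throws IOException {
            Path p = Paths.get("hello.bin"); // example file name
            try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(p))) {
                for (int cp : "Hello!".codePoints().toArray()) {
                    out.writeInt(cp); // 4 bytes per character, high bytes mostly zero
                }
            }
            System.out.println(Files.size(p)); // prints 24: 6 characters * 4 bytes
        }
    }
    ```

    That works, but it wastes 3 nearly-always-zero bytes per character for English text, which is exactly the problem UTF-8 solves.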

    But, a more common encoding is UTF-8. UTF-8 has the property that it stores ASCII-compatible characters as, well, ASCII: those characters have the same 'number translation' in unicode as they do in ASCII, and UTF-8 stores those numbers as just that single byte. UTF-8 stores each character in 1, 2, 3, or 4 bytes, depending on which character it is. It's a 'variable length encoding scheme'.

    You can look up UTF-8 on e.g. wikipedia if you want to know the full deal.
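
    The variable-length part is easy to verify for yourself; here each string is a single character (the last one is the 😊 emoji, written as a surrogate pair in Java source):

    ```java
    import java.nio.charset.StandardCharsets;

    public class Utf8Lengths {
        public static void main(String[] args) {
            // The further down the unicode table a character lives,
            // the more bytes UTF-8 spends on it:
            System.out.println("A".getBytes(StandardCharsets.UTF_8).length);            // 1 byte (ASCII range)
            System.out.println("\u00E9".getBytes(StandardCharsets.UTF_8).length);       // 2 bytes (é)
            System.out.println("\u2603".getBytes(StandardCharsets.UTF_8).length);       // 3 bytes (☃)
            System.out.println("\uD83D\uDE0A".getBytes(StandardCharsets.UTF_8).length); // 4 bytes (😊)
        }
    }
    ```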

    For english-esque text, UTF-8 is extremely efficient, and no worse than ASCII (so there is no reason to use ASCII). You can do this in java quite easily:

    // Path, Files etc are from java.nio.file
    
    Path p = Paths.get("mytextfile.txt");
    Files.writeString(p, "Hello!");
    

    That's all you need; the Files API defaults to UTF-8 (be aware that the old and mostly obsolete APIs such as FileWriter don't, and you should ALWAYS specify the charset encoding for those! Or better yet, just don't use 'em and use java.nio.file instead).

    Note that you can shove a unicode snowman or even emoji in there; it'll save fine.
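
    A quick round-trip sketch to prove it (both writeString and readString default to UTF-8; snowman.txt is just an example name):

    ```java
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class RoundTrip {
        public static void main(String[] args) throws IOException {
            Path p = Paths.get("snowman.txt"); // example file name
            Files.writeString(p, "Hello, \u2603!"); // UTF-8 by default
            String back = Files.readString(p);      // also UTF-8 by default
            System.out.println(back.equals("Hello, \u2603!")); // prints true
        }
    }
    ```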

    There is no 'binary' variant. The file is bytes. If you open it in a text editor or run cat thatfile.txt, guess what? cat or your editor reads in the bytes, takes a wild stab in the dark as to what encoding it might be, looks every decoded value up in the character table, and then farms out the work to the font rendering engine to show you the characters again. It's just the editor doing you the courtesy of showing off the file with bytes:

    72, 101, 108, 108, 111, 33

    as Hello!, because that's a lot easier to read. Open that 'text file' with a hex editor and you'll see that it contains exactly that sequence of numbers I showed (well, in hex, but that too is just a rendering convenience).
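
    You don't even need a hex editor; you can write the 'text' file and peek at its raw bytes from Java itself:

    ```java
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Arrays;

    public class PeekBytes {
        public static void main(String[] args) throws IOException {
            Path p = Paths.get("mytextfile.txt");
            Files.writeString(p, "Hello!");

            // Read it back as what it really is: a sack of bytes.
            byte[] raw = Files.readAllBytes(p);
            System.out.println(Arrays.toString(raw)); // prints [72, 101, 108, 108, 111, 33]
        }
    }
    ```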

    Still, if you want to store it 'efficiently', the answer is trivial: use a compression algorithm. You can throw that data through e.g. java.util.zip's GZIPOutputStream, or use fancier compressors:

    // GZIPOutputStream (from java.util.zip) writes the gzip format that .gz files use;
    // ZipOutputStream would produce a .zip archive and needs a ZipEntry first.
    Path p = Paths.get("file.txt.gz");
    try (OutputStream out = Files.newOutputStream(p);
         GZIPOutputStream zip = new GZIPOutputStream(out)) {
        String shakespeare = "type the complete works of shakespeare here";
        zip.write(shakespeare.getBytes(StandardCharsets.UTF_8));
    }
    

    You'll find that file.txt.gz will be considerably fewer bytes than the total character count of the combined works of shakespeare. Voila. Efficiency.
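
    Reading it back is the mirror image; here's a sketch of the full round trip, done in memory so it's self-contained:

    ```java
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class GzipRoundTrip {
        public static void main(String[] args) throws IOException {
            byte[] original = "some text worth compressing".getBytes(StandardCharsets.UTF_8);

            // Compress into a byte buffer (a file's OutputStream works the same way).
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream zip = new GZIPOutputStream(buf)) {
                zip.write(original);
            }

            // Decompress and check we got our bytes back intact.
            try (GZIPInputStream unzip = new GZIPInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
                byte[] back = unzip.readAllBytes();
                System.out.println(Arrays.equals(back, original)); // prints true
            }
        }
    }
    ```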

    You can futz with your compression algorithm; there are many. Some are optimized for specific purposes, most fall on a tradeoff line between 'speed of compression' and 'efficiency of compression'. Many are configurable (compress better at the cost of running longer vs compress quickly, but it won't be quite as efficient). A few basic compression algorithms are baked into java, and for the fancier ones, well, a few have pure java implementations you can use.
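
    As a taste of that configurability: java.util.zip's Deflater lets you pick a level anywhere from BEST_SPEED (1) to BEST_COMPRESSION (9). A sketch, with made-up sample text; actual sizes depend on your input:

    ```java
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    public class Levels {
        // Compress data at the given level and report how many bytes came out.
        static int compressedSize(byte[] data, int level) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DeflaterOutputStream out = new DeflaterOutputStream(bytes, new Deflater(level))) {
                out.write(data);
            }
            return bytes.size();
        }

        public static void main(String[] args) throws IOException {
            byte[] data = "to be or not to be, that is the question. ".repeat(1000)
                .getBytes(StandardCharsets.UTF_8);
            System.out.println("original:         " + data.length);
            System.out.println("BEST_SPEED:       " + compressedSize(data, Deflater.BEST_SPEED));
            System.out.println("BEST_COMPRESSION: " + compressedSize(data, Deflater.BEST_COMPRESSION));
        }
    }
    ```

    Note that GZIPOutputStream itself doesn't expose the level directly; DeflaterOutputStream with your own Deflater is the straightforward way to control it.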