java, json, avro, avro-tools

java.io.IOException: "Not a data file" after converting JSON to Avro with Avro Tools


I have a JSON file and an Avro schema file that correctly describes its structure. I convert the JSON file into an Avro file with Avro Tools, without getting an error, like this:

java -jar .\avro-tools-1.7.7.jar fromjson --schema-file .\data.avsc .\data.json > .\data.avro

To verify that the generated Avro file is valid, I then convert it back to JSON like this:

java -jar .\avro-tools-1.7.7.jar tojson .\data.avro > .\data.json

This throws the error:

Exception in thread "main" java.io.IOException: Not a data file.
    at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
    at org.apache.avro.tool.DataFileGetMetaTool.run(DataFileGetMetaTool.java:64)
    at org.apache.avro.tool.Main.run(Main.java:84)
    at org.apache.avro.tool.Main.main(Main.java:73)

I get the same exception with 'getschema' or 'getmeta', and also with avro-tools-1.8.2 and avro-tools-1.7.4. I also tried several different pairs of JSON and schema files, all of which I checked for validity.

The error is thrown here (in the Avro tools):

if (!Arrays.equals(DataFileConstants.MAGIC, magic)) {
    throw new IOException("Not a data file.");
}

It seems that the generated (binary) Avro file does not start with the magic bytes that the reader expects: a few bytes at the beginning of the file differ.
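To see which bytes the file actually starts with, a small standalone check can help. This is only a sketch: the class name AvroMagicCheck and the default file name are mine, and the magic value ('O', 'b', 'j', followed by the version byte 1) is what the Avro container format defines, copied here so that the example does not need the Avro jar.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of avro-tools.
class AvroMagicCheck {
    // Avro object container files start with "Obj" followed by the version byte 1.
    static final byte[] MAGIC = {(byte) 'O', (byte) 'b', (byte) 'j', 1};

    /** Returns true if the given bytes begin with the Avro magic. */
    static boolean startsWithMagic(byte[] header) {
        return header.length >= MAGIC.length
                && Arrays.equals(MAGIC, Arrays.copyOf(header, MAGIC.length));
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // File name is an assumption; pass the real path as the first argument.
        String file = args.length > 0 ? args[0] : "data.avro";
        byte[] header = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(Paths.get(file))) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (n < header.length) {
                int r = in.read(header, n, header.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        }
        header = Arrays.copyOf(header, n);
        System.out.println("First bytes: " + toHex(header));
        System.out.println("Avro magic present: " + startsWithMagic(header));
    }
}
```

A valid container file should print a header beginning with 4F 62 6A 01. Anything in front of those bytes means the file was altered after avro-tools wrote it.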

I have checked all of the other Stack Overflow questions regarding this error, but none of them helped. I ran the commands in PowerShell on Windows 10.

See https://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/#json-to-binary-avro

Anyone got an idea what the heck is going on here?

UPDATE: The conversion works if I do it on a Cloudera VM instead of on Windows. Only a few bytes at the beginning of the generated Avro files are different.
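To pin down where the two generated files start to diverge, the byte offset of the first difference can be computed. This is a rough sketch; the class name FirstDifference is made up, and the two paths are passed as arguments (for example the file from the VM and the file from Windows).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for comparing two generated files byte by byte.
class FirstDifference {
    /** Returns the index of the first differing byte, or -1 if both arrays are identical. */
    static int firstMismatch(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        // Same prefix: identical only if the lengths match too.
        return a.length == b.length ? -1 : common;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java FirstDifference <fileA> <fileB>");
            return;
        }
        // Reads both files fully, which is fine for small test files.
        byte[] a = Files.readAllBytes(Paths.get(args[0]));
        byte[] b = Files.readAllBytes(Paths.get(args[1]));
        int at = firstMismatch(a, b);
        System.out.println(at < 0
                ? "Files are identical"
                : "First difference at byte offset " + at);
    }
}
```

If the reported offset is 0, the difference is in the header itself, which is where the reader looks for the magic bytes.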


Solution

  • Found the cause:

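    PowerShell on Windows 10 does not pass the binary stream through unchanged. When > is used, it treats the output of the Java process as text and re-encodes it before writing the file (UTF-16 by default in Windows PowerShell 5.1, UTF-8 in newer PowerShell versions). The re-encoding changes the magic bytes at the start of the file, which (correctly) causes the exception to be thrown.

    The same commands work in other shells, for example cmd.exe or a Linux terminal.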

    Side note: PowerShell can be kept out of the redirection by letting cmd.exe write the file, since cmd.exe redirects the raw bytes without re-encoding them:

    cmd /c "java -jar .\avro-tools-1.7.7.jar fromjson --schema-file .\data.avsc .\data.json > .\data.avro"
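To illustrate why a text re-encoding breaks the header, here is a small simulation. It is a sketch under assumptions, not what PowerShell does internally: the class name ReencodingDemo is made up, decoding the raw bytes as ISO-8859-1 is chosen only because it maps every byte to a character, and the byte order marks are the standard ones for UTF-16LE and UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical simulation of a shell that treats binary output as text.
class ReencodingDemo {
    // Avro container magic: "Obj" followed by the version byte 1.
    static final byte[] MAGIC = {(byte) 'O', (byte) 'b', (byte) 'j', 1};

    /** Decodes raw bytes as text, then writes them back in the target charset after a BOM. */
    static byte[] reencode(byte[] raw, Charset target, byte[] bom) {
        // Assumption: ISO-8859-1 stands in for whatever code page the shell decodes with.
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        byte[] encoded = text.getBytes(target);
        byte[] out = new byte[bom.length + encoded.length];
        System.arraycopy(bom, 0, out, 0, bom.length);
        System.arraycopy(encoded, 0, out, bom.length, encoded.length);
        return out;
    }

    /** Returns true if the data begins with the Avro magic. */
    static boolean startsWithMagic(byte[] data) {
        return data.length >= MAGIC.length
                && Arrays.equals(MAGIC, Arrays.copyOf(data, MAGIC.length));
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf16Bom = {(byte) 0xFF, (byte) 0xFE};
        byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

        byte[] asUtf16 = reencode(MAGIC, StandardCharsets.UTF_16LE, utf16Bom);
        byte[] asUtf8 = reencode(MAGIC, StandardCharsets.UTF_8, utf8Bom);

        System.out.println("original : " + toHex(MAGIC));
        System.out.println("UTF-16LE : " + toHex(asUtf16) + "  magic ok: " + startsWithMagic(asUtf16));
        System.out.println("UTF-8    : " + toHex(asUtf8) + "  magic ok: " + startsWithMagic(asUtf8));
    }
}
```

In both simulated cases the file no longer begins with 4F 62 6A 01: the UTF-16LE variant becomes FF FE 4F 00 62 00 6A 00 01 00 and the UTF-8 variant with a BOM becomes EF BB BF 4F 62 6A 01. That matches the symptom of only a few bytes at the beginning being different, and it is why the magic check fails even though the conversion itself reported no error.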