I have a JSON file and an Avro schema file that correctly describes its structure. I then convert the JSON file into an Avro file with the Avro tools, without getting an error, like this:
java -jar .\avro-tools-1.7.7.jar fromjson --schema-file .\data.avsc .\data.json > .\data.avro
I then convert the generated Avro file back to JSON to verify that I got a valid Avro file, like this:
java -jar .\avro-tools-1.7.7.jar tojson .\data.avro > .\data.json
This throws the error:
Exception in thread "main" java.io.IOException: Not a data file.
at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
at org.apache.avro.tool.DataFileGetMetaTool.run(DataFileGetMetaTool.java:64)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
I get the same exception when running 'getschema' or 'getmeta', and also when I use avro-tools-1.8.2 or avro-tools-1.7.4. I also tried multiple different pairs of JSON and schema files, all of which I checked for validity.
The error is thrown here (in the Avro tools):
if (!Arrays.equals(DataFileConstants.MAGIC, magic)) {
    throw new IOException("Not a data file.");
}
It seems that the (binary) Avro file does not start with the magic bytes the tools expect: a few bytes at the beginning of the file differ.
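To see what the check above is comparing against, the first bytes of the generated file can be inspected directly. The following is a minimal sketch, not the Avro tools' own code: it assumes the documented Avro container-file magic (the ASCII bytes 'O', 'b', 'j' followed by the byte 1), and the class name AvroMagicCheck and the default file name data.avro are just placeholders for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class AvroMagicCheck {
    // Avro object container files start with 'O', 'b', 'j' followed by the byte 1.
    static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // Returns true if the given leading bytes begin with the Avro magic.
    static boolean startsWithMagic(byte[] head) {
        return head.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }

    // Reads up to MAGIC.length bytes from the start of the file.
    static byte[] readHead(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[MAGIC.length];
            int read = 0;
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break; // file is shorter than the magic
                }
                read += n;
            }
            return Arrays.copyOf(head, read);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder default; pass the real path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "data.avro");
        byte[] head = readHead(file);
        StringBuilder hex = new StringBuilder();
        for (byte b : head) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println("first bytes: " + hex.toString().trim());
        System.out.println("starts with Avro magic: " + startsWithMagic(head));
    }
}
```

If the printed bytes are anything other than 4F 62 6A 01, the file was altered somewhere between the tool's output stream and the disk.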
I have checked all of the other Stack Overflow questions regarding this error, but none of them helped. I ran the commands in PowerShell on Windows 10.
Anyone got an idea what the heck is going on here?
UPDATE: The conversion works if I do it on a Cloudera VM instead of in Windows. Only a few bytes at the beginning are different in the generated Avro files.
Found the cause:
The Windows 10 PowerShell treats the program's binary output as text and re-encodes it (as a UTF8 stream, in my case) when it is redirected to a file. Changing the encoding changes the magic bytes, which (correctly) causes the exception to be thrown.
It works perfectly in another shell, such as a regular terminal.
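The effect of such a re-encoding on the first bytes can be reproduced in isolation. This is only a sketch of the mechanism, under assumptions that are mine and not confirmed by PowerShell itself: ISO-8859-1 stands in for whatever code page the shell uses to decode the output, and the two target encodings (UTF-16LE and UTF-8, each with a byte order mark) are examples, since the actual encoding depends on the PowerShell version. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReencodeDemo {
    // Decodes raw bytes as text, then re-encodes the text with an optional BOM in front.
    // This mimics a text-oriented redirection; it is not PowerShell's actual implementation.
    static byte[] reencode(byte[] raw, Charset decodeAs, Charset encodeAs, byte[] bom) {
        String text = new String(raw, decodeAs);
        byte[] encoded = text.getBytes(encodeAs);
        byte[] out = new byte[bom.length + encoded.length];
        System.arraycopy(bom, 0, out, 0, bom.length);
        System.arraycopy(encoded, 0, out, bom.length, encoded.length);
        return out;
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Avro magic followed by one arbitrary non-ASCII byte.
        byte[] raw = {'O', 'b', 'j', 1, (byte) 0x9C};
        byte[] utf16Bom = {(byte) 0xFF, (byte) 0xFE};
        byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

        byte[] asUtf16 = reencode(raw, StandardCharsets.ISO_8859_1,
                StandardCharsets.UTF_16LE, utf16Bom);
        byte[] asUtf8 = reencode(raw, StandardCharsets.ISO_8859_1,
                StandardCharsets.UTF_8, utf8Bom);

        System.out.println("original:        " + hex(raw));
        System.out.println("UTF-16LE + BOM:  " + hex(asUtf16));
        System.out.println("UTF-8 + BOM:     " + hex(asUtf8));
    }
}
```

In both cases the file no longer begins with 4F 62 6A 01, so the magic comparison fails. Even without a byte order mark, a UTF-8 re-encoding would leave the four ASCII magic bytes intact but still expand every byte above 0x7F into a multi-byte sequence, corrupting the rest of the file.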
Side note: PowerShell can be forced not to change the encoding by using a pipe ('|') instead of the greater-than redirection ('>'), like so:
java -jar .\avro-tools-1.7.7.jar fromjson --schema-file .\data.avsc .\data.json | .\data.avro
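If the shell's redirection cannot be trusted to pass bytes through unchanged, another option is to take the shell out of the picture and let a small launcher write the tool's standard output to the file. The sketch below uses the standard java.lang.ProcessBuilder API. The class name AvroFromJson and the relative file names are placeholders, and it assumes that java is on the PATH and that the jar sits in the working directory.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class AvroFromJson {
    // Builds the avro-tools 'fromjson' command line from the question.
    static List<String> buildCommand(String jar, String schema, String json) {
        return Arrays.asList("java", "-jar", jar, "fromjson", "--schema-file", schema, json);
    }

    // Starts the command and redirects the child's stdout straight to the file,
    // so no shell decodes or re-encodes the bytes on the way.
    static int runToFile(List<String> command, File out)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectOutput(out);
        // Keep error messages visible on the console.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder file names; adjust to the actual paths.
        List<String> command = buildCommand("avro-tools-1.7.7.jar", "data.avsc", "data.json");
        int exitCode = runToFile(command, new File("data.avro"));
        System.out.println("avro-tools exited with code " + exitCode);
    }
}
```

The resulting data.avro should then start with the expected magic bytes regardless of which shell launched the program.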