In the book Spark: The Definitive Guide (Chapter 4, first paragraph), the authors mention that Parquet files are highly structured. That got me wondering: how so? Shouldn't they be semi-structured, like CSV files?
Parquet is a columnar binary format. That means all the records in a file must follow the same schema (the same columns, with the same data types). The schema is stored in the files themselves. Thus it is highly structured.
Semi-structured files include, for example:
CSV, which has no data type other than string.
JSON, which has types but no schema (your objects can have different attributes, different data types for the same attribute, and so on).
These are, by definition, unsafe (little or no strictness).
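To make the CSV and JSON points concrete, here is a minimal sketch using only Python's standard library. The sample data and the variable names (`csv_text`, `json_lines`, and so on) are made up for illustration; the sketch does not touch Parquet itself. It only shows what is missing from the two text formats.

```python
import csv
import io
import json

# CSV: the format carries no type information, so every field is read as str,
# even when it looks like a number.
csv_text = "id,price\n1,9.99\n2,abc\n"  # hypothetical sample data
rows = list(csv.DictReader(io.StringIO(csv_text)))
csv_types = {key: type(value).__name__ for key, value in rows[0].items()}
print(csv_types)  # {'id': 'str', 'price': 'str'}

# JSON: values are typed, but nothing forces the records to share
# the same attributes or the same type for a given attribute.
json_lines = [  # hypothetical sample data
    '{"id": 1, "price": 9.99}',
    '{"id": "2", "discount": true}',
]
records = [json.loads(line) for line in json_lines]
json_types = [
    {key: type(value).__name__ for key, value in record.items()}
    for record in records
]
print(json_types)
# [{'id': 'int', 'price': 'float'}, {'id': 'str', 'discount': 'bool'}]
```

Both inputs parse without complaint, even though `price` holds `abc` in the second CSV row and `id` changes type between the two JSON records. A Parquet writer would have to reconcile such records against a single schema before writing them.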