How do you link an ORC file's ColumnStatistics with the column names defined in the schema (TypeDescription) using Java?
Reader reader = OrcFile.createReader(ignored);
TypeDescription schema = reader.getSchema();
ColumnStatistics[] stats = reader.getStatistics();
The column statistics contain stats for all column types in a flat array. The schema, however, is a tree of schemas. Are the column stats ordered by a tree traversal (depth-first?) of the schema?
I tried using orc-statistics, but that only outputs the column ID.
Turns out the file statistics match up to a pre-order DFS traversal of the schema. The traversal includes intermediate schemas that don't hold data themselves, like Struct and List. Additionally, the traversal includes the overall schema as the first node. This is explained in the docs for ORC Specification v1:
The type tree is flattened in to a list via a pre-order traversal where each type is assigned the next id. Clearly the root of the type tree is always type id 0. Compound types have a field named subtypes that contains the list of their children’s type ids.
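To make the quoted rule concrete, here is a minimal sketch of pre-order id assignment on a toy type tree. Note that `PreOrderIds`, `Node`, `example`, and `flatten` are hypothetical names made up for illustration and are not part of the ORC API; the sketch only mimics the numbering the spec describes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a type tree; not part of the ORC API. */
final class PreOrderIds {
  /** A type node: a category label plus named children, kept in insertion order. */
  static final class Node {
    final String category;
    final Map<String, Node> children = new LinkedHashMap<>();

    Node(String category) {
      this.category = category;
    }

    Node child(String name, Node node) {
      children.put(name, node);
      return this;
    }
  }

  /** A struct with fields id (int), tags (list of string), and name (string). */
  static Node example() {
    return new Node("struct")
        .child("id", new Node("int"))
        .child("tags", new Node("list").child("<element>", new Node("string")))
        .child("name", new Node("string"));
  }

  /** Returns node paths in pre-order; each path's list index is its type id. */
  static List<String> flatten(Node root) {
    List<String> out = new ArrayList<>();
    visit("<ROOT>", root, out);
    return out;
  }

  private static void visit(String path, Node node, List<String> out) {
    // A parent takes the next id before any of its children do.
    out.add(path + " (" + node.category + ")");
    for (Map.Entry<String, Node> e : node.children.entrySet()) {
      visit(path + "::" + e.getKey(), e.getValue(), out);
    }
  }

  public static void main(String[] args) {
    List<String> flat = flatten(example());
    for (int id = 0; id < flat.size(); id++) {
      System.out.println(id + " -> " + flat.get(id));
    }
  }
}
```

The thing to notice is that the list's element type takes id 3, so the top-level field `name` ends up at id 4 rather than 3. That is why the statistics array has more entries than the schema has top-level fields. As far as I can tell, ORC's own `TypeDescription.getId()` reports the same id for each node, which is handy for cross-checking.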
The complete code to get a flattened list of schema names from an ORC TypeDescription:
import static com.google.common.collect.ImmutableList.toImmutableList;

import com.google.common.collect.ImmutableList;
import java.util.ArrayList;
import java.util.List;
import org.apache.orc.TypeDescription;

final class OrcSchemas {
  private OrcSchemas() {}

  /**
   * Returns all schema names in a depth-first (pre-order) traversal of schema.
   *
   * <p>The given schema is represented as '<ROOT>'. Unnamed children, such as the element type of
   * a list, are represented using their category, like: 'parent::<STRUCT>::field'.
   *
   * <p>This method is useful because some Orc file methods like statistics return all column stats
   * in a single flat array. The single flat array is a depth-first traversal of all columns in a
   * schema, including intermediate columns like structs and lists.
   */
  static ImmutableList<String> flattenNames(TypeDescription schema) {
    List<String> names = new ArrayList<>();
    // The root is always type id 0, so include it even when it has no children.
    names.add("<ROOT>");
    mutateAddNamesDfs("", schema, names);
    return ImmutableList.copyOf(names);
  }

  private static void mutateAddNamesDfs(
      String parentName, TypeDescription schema, List<String> dfsNames) {
    String separator = "::";
    ImmutableList<String> schemaNames = getFieldNames(parentName, schema);
    ImmutableList<TypeDescription> children = getChildren(schema);
    for (int i = 0; i < children.size(); i++) {
      String name = schemaNames.get(i);
      // Add the child before recursing so that parents precede their descendants.
      dfsNames.add(name);
      mutateAddNamesDfs(name + separator, children.get(i), dfsNames);
    }
  }

  /** Returns the children of schema, or an empty list when getChildren() is null. */
  private static ImmutableList<TypeDescription> getChildren(TypeDescription schema) {
    List<TypeDescription> children = schema.getChildren();
    return children == null ? ImmutableList.of() : ImmutableList.copyOf(children);
  }

  private static ImmutableList<String> getFieldNames(String parentName, TypeDescription schema) {
    // Only structs carry field names. Calling getFieldNames() on other categories threw a
    // NullPointerException for me, so check the category instead of catching the exception.
    if (schema.getCategory() == TypeDescription.Category.STRUCT) {
      return schema.getFieldNames().stream().map(n -> parentName + n).collect(toImmutableList());
    }
    // Children of lists, maps, and unions have no names, so use each child's category.
    return getChildren(schema).stream()
        .map(child -> parentName + "<" + child.getCategory() + ">")
        .collect(toImmutableList());
  }
}