I have been given a Parquet file, and I dumped out its schema using a simple Python script:
#! /usr/bin/env python3
import pyarrow as pa
import pyarrow.parquet as pq
import sys
schema = pq.read_schema(sys.argv[1])
print(schema)
My question is: what C++ type is date32[day]? I've looked all over and it SEEMS like the underlying data type is int32_t, but if I try to read it using that type, I get:
terminate called after throwing an instance of 'parquet::ParquetException'
what(): Column converted type mismatch. Column 'Date' has converted type 'DATE' not 'INT_32'
I've also tried using a string but I get this error:
terminate called after throwing an instance of 'parquet::ParquetException'
what(): Column physical type mismatch. Column 'Date' has physical type 'INT32' not 'BYTE_ARRAY'
Here's my code:
#include <iostream>
#include "arrow/io/file.h"
#include "parquet/stream_reader.h"
#include "readfile.h"
int main(int argc, char **argv)
{
    if (argc < 2)
    {
        std::cout << "Usage: " << argv[0] << " <input filename>" << std::endl;
        exit(2);
    }

    char *input_filename = argv[1];
    std::cout << "Input filename is " << input_filename << std::endl;

    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(
        infile,
        arrow::io::ReadableFile::Open(input_filename));

    parquet::StreamReader os{parquet::ParquetFileReader::Open(infile)};

    std::string name;
    std::string ident;
    int32_t date;
    std::string ts_event_utc;

    while (!os.eof())
    {
        os >> name >> ident >> date >> ts_event_utc >> parquet::EndRow;
    }

    return 0;
}
It looks like the current StreamReader doesn't have a mapping for date32[day] that you can use with the >> operator (see here for the code where these mappings are checked and added). In theory, you could either add an override there or add the DATE type to the exceptions at the top of the file; either one would allow the stream reader to decode a date32 value.
In the meantime, you likely have to use one of the other reader APIs. The lower-level route is the column reader interface: a TypedColumnReader<Int32Type> gives you the raw values, which for a DATE column are int32_t counts of days since the UNIX epoch. So your hunch about int32_t was right; the StreamReader is just rejecting the column because its converted type is DATE rather than INT_32.
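Here's a minimal sketch of that column-reader route. I'm assuming the column order from your StreamReader loop (name, ident, date, ts_event_utc), which would put 'Date' at index 2, so adjust that if your schema is laid out differently:

#include <iostream>
#include <memory>

#include "arrow/io/file.h"
#include "parquet/api/reader.h"
#include "parquet/exception.h"

int main(int argc, char **argv)
{
    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(argv[1]));

    std::unique_ptr<parquet::ParquetFileReader> reader =
        parquet::ParquetFileReader::Open(infile);

    // Assumed: 'Date' is the third column (index 2), matching the order in
    // your StreamReader loop.
    const int date_column = 2;

    for (int rg = 0; rg < reader->metadata()->num_row_groups(); ++rg)
    {
        auto column = std::static_pointer_cast<parquet::Int32Reader>(
            reader->RowGroup(rg)->Column(date_column));

        while (column->HasNext())
        {
            int32_t days_since_epoch = 0;
            int16_t def_level = 0;
            int16_t rep_level = 0;
            int64_t values_read = 0;
            column->ReadBatch(1, &def_level, &rep_level, &days_since_epoch,
                              &values_read);
            if (values_read == 1)
            {
                std::cout << "days since epoch: " << days_since_epoch << std::endl;
            }
        }
    }
    return 0;
}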
Or (the easier route) you can use the parquet::arrow::FileReader located in parquet/arrow/reader.h to just read the data directly into Arrow Arrays, RecordBatches, or even an arrow::Table.
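And here's a rough sketch of the arrow::Table route, again assuming the column is named 'Date' as in your error messages; the date32[day] column comes back as an arrow::Date32Array, whose values are plain int32_t day counts:

#include <iostream>
#include <memory>

#include "arrow/api.h"
#include "arrow/io/file.h"
#include "parquet/arrow/reader.h"
#include "parquet/exception.h"

int main(int argc, char **argv)
{
    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(argv[1]));

    // Open with the Arrow-aware reader instead of the StreamReader.
    std::unique_ptr<parquet::arrow::FileReader> reader;
    PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
        infile, arrow::default_memory_pool(), &reader));

    // Read the whole file into an arrow::Table.
    std::shared_ptr<arrow::Table> table;
    PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

    // The date32[day] column arrives as an arrow::Date32Array; each value is
    // an int32_t count of days since the UNIX epoch (1970-01-01).
    auto chunked = table->GetColumnByName("Date");
    if (chunked == nullptr)
    {
        std::cerr << "No column named 'Date'" << std::endl;
        return 1;
    }

    for (const auto &chunk : chunked->chunks())
    {
        auto dates = std::static_pointer_cast<arrow::Date32Array>(chunk);
        for (int64_t i = 0; i < dates->length(); ++i)
        {
            if (dates->IsValid(i))
            {
                std::cout << "days since epoch: " << dates->Value(i) << std::endl;
            }
        }
    }
    return 0;
}

Either way, what you get back for each row is the number of days since 1970-01-01; converting that to a calendar date is up to you.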
That all said, please file an issue on github.com/apache/arrow to track the StreamReader not supporting Date values! And if you feel like filing a PR to fix it, I'd be happy to review it for you!