
Write Apache Arrow table to string C++


I'm trying to write an Apache Arrow table to a string. My larger example has problems, and I can't get this small example to work either. This one segfaults inside Arrow in the WriteTable call, while the larger one doesn't appear to serialize correctly.

#include <arrow/api.h>
#include <arrow/io/memory.h>
#include <arrow/ipc/api.h>

#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

std::shared_ptr<arrow::Table> makeSimpleFakeArrowTable() {
    std::vector<std::shared_ptr<arrow::Field>> arrowFields;
    arrowFields.emplace_back(std::make_shared<arrow::Field>("Field1", arrow::int64()));
    arrowFields.emplace_back(std::make_shared<arrow::Field>("Field2", arrow::float64()));

    auto schema = std::make_shared<arrow::Schema>(arrowFields);

    std::vector<std::shared_ptr<arrow::Array>> columns(schema->num_fields());

    arrow::Int64Builder longBuilder;
    longBuilder.Append(20);
    longBuilder.Finish(&(columns.at(0)));
    arrow::DoubleBuilder doubleBuilder;
    doubleBuilder.Append(10.0);
    longBuilder.Finish(&(columns.at(1)));

    return arrow::Table::Make(schema, columns);
}

std::shared_ptr<arrow::RecordBatch>
getArrowBatchFromBytes(const std::string& bytes) {
    arrow::io::BufferReader arrowBufferReader{bytes};
    auto streamReader =
        arrow::ipc::RecordBatchStreamReader::Open(&arrowBufferReader).ValueOrDie();

    auto batch = streamReader->Next().ValueOrDie();

    return batch;
}


std::string arrowTableToByteString(const std::shared_ptr<arrow::Table>& table) {
    auto stream = arrow::io::BufferOutputStream::Create().ValueOrDie();
    auto batchWriter = arrow::ipc::MakeStreamWriter(stream, table->schema()).ValueOrDie();

    auto status = batchWriter->WriteTable(*table);
    if (not status.ok()) {
        throw std::runtime_error(
            "Couldn't write Arrow Table to byte string. Arrow status was: '" +
            status.ToString() + "'.");
    }

    std::shared_ptr<arrow::Buffer> buffer = stream->Finish().ValueOrDie();
    return buffer->ToHexString();
}

int main(int argc, char** argv) {
    auto simpleFakeArrowTable = makeSimpleFakeArrowTable();
    std::string tableAsByteString = arrowTableToByteString(simpleFakeArrowTable);

    auto batch = getArrowBatchFromBytes(tableAsByteString);
    assert(batch != nullptr);
}

Solution

  • Two things jump to mind. First, I think this is a typo:

        longBuilder.Finish(&(columns.at(0)));
        arrow::DoubleBuilder doubleBuilder;
        doubleBuilder.Append(10.0);
        longBuilder.Finish(&(columns.at(1))); // Shouldn't this be doubleBuilder?
    

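    Because the second Finish call is made on longBuilder, column 1 never receives the float64 data that the schema promises. Whenever you build an Arrow table yourself, it is a good idea to call ValidateFull() on it (arrow::Table::ValidateFull). This helps catch mistakes like this one (in this case the returned status would have reported that the input arrays did not match the schema).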

    Second, if we fix that, we get an error because you return buffer->ToHexString(), which turns your array of bytes into a hex string with two characters per byte (e.g. the bytes [10, 20, 30] become the bytes [48, 65, 49, 52, 49, 69], more commonly written as the string 0A141E).

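    You then turn around and try to read these hex characters as if they were the serialized table, in arrow::io::BufferReader arrowBufferReader{bytes};. The reader expects the raw bytes, not their hex representation. If I change that ToHexString to ToString, then your example runs and returns 0.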
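
To make the hex-versus-raw-bytes point concrete, here is a minimal standard-library sketch that does not use Arrow at all. The helpers toHex and fromHex are hypothetical names introduced only for illustration, and the sketch assumes an encoding of two uppercase hex digits per byte. It shows that the hex text is twice as long as the original bytes and has different contents, so it cannot be handed to a reader that expects the raw bytes unless it is decoded first.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not an Arrow API): encode every byte as two
// uppercase hexadecimal digits.
std::string toHex(const std::string& bytes) {
    static const char kDigits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char b : bytes) {
        out.push_back(kDigits[b >> 4]);    // high nibble
        out.push_back(kDigits[b & 0x0F]);  // low nibble
    }
    return out;
}

// Hypothetical helper: inverse of toHex. Assumes well-formed input made of
// the characters 0-9 and A-F only.
std::string fromHex(const std::string& hex) {
    assert(hex.size() % 2 == 0 && "hex input needs two digits per byte");
    auto nibble = [](char c) { return c <= '9' ? c - '0' : c - 'A' + 10; };
    std::string out;
    out.reserve(hex.size() / 2);
    for (std::size_t i = 0; i < hex.size(); i += 2) {
        out.push_back(static_cast<char>((nibble(hex[i]) << 4) | nibble(hex[i + 1])));
    }
    return out;
}
```

With these helpers, the three bytes 10, 20, 30 encode to the six-character string 0A141E, and only fromHex of that string gives back the original three bytes. In the question's code the simpler route is to avoid the hex step entirely and return the raw bytes.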