Tags: python, parquet, pyarrow, apache-arrow, vaex

How to read tsv file from vaex and output a pyarrow parquet file?


With these vaex and pyarrow versions:

>>> vaex.__version__
{'vaex': '4.12.0',
 'vaex-core': '4.12.0',
 'vaex-viz': '0.5.3',
 'vaex-hdf5': '0.12.3',
 'vaex-server': '0.8.1',
 'vaex-astro': '0.9.1',
 'vaex-jupyter': '0.8.0',
 'vaex-ml': '0.18.0'}

>>> pyarrow.__version__
8.0.0

When reading a TSV file and exporting it to arrow, the resulting file couldn't be properly loaded by pyarrow.parquet.read_table(). For example, given a file s2t.tsv:

$ printf "test-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\n" > s
$ printf "1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n" > t
$ paste s t > s2t.tsv

The file looks like this:

test-1  1-best
foobar  poo bear
test-1  1-best
foobar  poo bear
test-1  1-best
foobar  poo bear
test-1  1-best
foobar  poo bear

When I tried exporting the TSV to arrow like this, then reading it back:

import vaex
import pyarrow as pa

df = vaex.from_csv('s2t.tsv', sep='\t', header=None)
df.export_arrow('s2t.parquet')

pa.parquet.read_table('s2t.parquet')

It throws the following error:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
/tmp/ipykernel_17/3649263967.py in <module>
      1 import pyarrow as pa
      2 
----> 3 pa.parquet.read_table('s2t.parquet')

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties)
   2746                 ignore_prefixes=ignore_prefixes,
   2747                 pre_buffer=pre_buffer,
-> 2748                 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
   2749             )
   2750         except ImportError:

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, **kwargs)
   2338 
   2339             self._dataset = ds.FileSystemDataset(
-> 2340                 [fragment], schema=schema or fragment.physical_schema,
   2341                 format=parquet_format,
   2342                 filesystem=fragment.filesystem

/opt/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Could not open Parquet input source 's2t.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Are there additional args/kwargs that should be passed when exporting or reading the parquet file?

Or is the export to arrow bugged/broken somehow?


Solution

  • According to https://github.com/vaexio/vaex/issues/2228, export_arrow writes the Arrow IPC format regardless of the output filename, so the file it produces is not valid Parquet. Use export_parquet instead, or export, which infers the format from the file extension:

    df.export_parquet("file.parquet")
    # or 
    df.export("file.parquet") 
    

    Either will export a proper Parquet file that can be read by

    pa.parquet.read_table("file.parquet")
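
    As a sketch of why the original error occurred (using only pyarrow, with hypothetical filenames demo.parquet and demo.arrow): Parquet files begin and end with the magic bytes PAR1, while Arrow IPC files begin with ARROW1. A file written in Arrow IPC format but named .parquet therefore fails the footer check that read_table performs.

    ```python
    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.feather as feather

    # A small table mimicking the s2t.tsv data.
    table = pa.table({"s": ["test-1", "foobar"], "t": ["1-best", "poo bear"]})

    # Write the same table in both formats.
    pq.write_table(table, "demo.parquet")
    feather.write_feather(table, "demo.arrow")  # Feather v2 == Arrow IPC file format

    # Parquet files start and end with the magic bytes b'PAR1'.
    with open("demo.parquet", "rb") as f:
        data = f.read()
    print(data[:4], data[-4:])

    # Arrow IPC files start with b'ARROW1' instead, so pyarrow's
    # Parquet reader reports "Parquet magic bytes not found in footer".
    with open("demo.arrow", "rb") as f:
        print(f.read(6))
    ```

    This also means a file mistakenly written with export_arrow is not corrupted; it can still be read back as an Arrow IPC file (e.g. with pyarrow.feather.read_table) rather than as Parquet.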