On these vaex and pyarrow versions:
>>> vaex.__version__
{'vaex': '4.12.0',
'vaex-core': '4.12.0',
'vaex-viz': '0.5.3',
'vaex-hdf5': '0.12.3',
'vaex-server': '0.8.1',
'vaex-astro': '0.9.1',
'vaex-jupyter': '0.8.0',
'vaex-ml': '0.18.0'}
>>> pyarrow.__version__
8.0.0
When reading a TSV file and exporting it with export_arrow, the resulting file couldn't be loaded by pa.parquet.read_table(). For example, given a file s2t.tsv built like this:
$ printf "test-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\n" > s
$ printf "1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n" > t
$ paste s t > s2t.tsv
The file looks like this:
test-1 1-best
foobar poo bear
test-1 1-best
foobar poo bear
test-1 1-best
foobar poo bear
test-1 1-best
foobar poo bear
And when I export the TSV like this, then try reading it back:
import vaex
import pyarrow as pa
df = vaex.from_csv('s2t.tsv', sep='\t', header=None)
df.export_arrow('s2t.parquet')
pa.parquet.read_table('s2t.parquet')
It throws the following error:
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
/tmp/ipykernel_17/3649263967.py in <module>
1 import pyarrow as pa
2
----> 3 pa.parquet.read_table('s2t.parquet')
/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties)
2746 ignore_prefixes=ignore_prefixes,
2747 pre_buffer=pre_buffer,
-> 2748 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
2749 )
2750 except ImportError:
/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, **kwargs)
2338
2339 self._dataset = ds.FileSystemDataset(
-> 2340 [fragment], schema=schema or fragment.physical_schema,
2341 format=parquet_format,
2342 filesystem=fragment.filesystem
/opt/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()
/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Could not open Parquet input source 's2t.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
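The "magic bytes not found in footer" message is pyarrow saying the file on disk is simply not a Parquet file: a Parquet file starts and ends with the 4-byte magic PAR1, while an Arrow IPC file (the "file" variant) starts with ARROW1. A quick stdlib-only sniffer illustrates the check pyarrow is doing; the helper name sniff_format is mine, not a vaex or pyarrow API:

```python
def sniff_format(path):
    """Tiny format sniffer: Parquet vs Arrow IPC file vs unknown.

    Note: Arrow IPC *stream* files carry no magic bytes at all, so
    they (and anything else) fall through to 'unknown'.
    """
    with open(path, 'rb') as f:
        head = f.read(8)
        f.seek(-4, 2)          # last 4 bytes of the file
        tail = f.read(4)
    if head[:4] == b'PAR1' and tail == b'PAR1':
        return 'parquet'       # Parquet magic at both ends
    if head[:6] == b'ARROW1':
        return 'arrow-ipc-file'
    return 'unknown'
```

Running this on the file written by export_arrow above would show it is not Parquet, which is exactly what the traceback is complaining about.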
Is there some additional arg/kwarg that should be passed when exporting or reading the parquet file?
Or is exporting to arrow bugged/broken somehow?
According to https://github.com/vaexio/vaex/issues/2228
df.export_parquet("file.parquet")
# or
df.export("file.parquet")
will export to the right format that can be read by
pa.parquet.read_table("file.parquet")