apache-spark pyspark parquet pyarrow

Reading / Fixing a corrupt parquet file


I have a Parquet file that was being written to in a loop, as follows:

    def process_data(self):
        #... other code ...

        with pq.ParquetWriter(self.destination_file, schema) as writer:
            with tqdm(total=total_rows, desc="Processing nodes") as pbar:
                for i in range(0, total_rows, self.batch_size):
                    # ... processing code ...

                    # Create a table from the batched data
                    batch_table = pa.Table.from_arrays(
                        [
                            pa.array(node_ids),
                            pa.array(mut_positions),
                            pa.array(new_6mers),
                            pa.array(context_embeddings),
                            pa.array(nonmutation_contexts),
                        ],
                        schema=schema
                    )

                    # Write the batch table
                    writer.write_table(batch_table)

                    # ...

                    pbar.update(len(batch_indices))

This loop was abruptly cut off due to the computer shutting off in the middle of the process.

Now when I try to read the file with pq.read_table, I (as expected) get an error:

pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from 'data/processed/data_with_embeddings.parquet'. Is this a 'parquet' file?: Could not open Parquet input source 'data/processed/data_with_embeddings.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

I desperately hope there's a way out of this: some workaround that loses a few (or more) rows but saves most of the data. I have searched the web, but there doesn't seem to be much about this, or what exists is above my expertise (which might be obvious from my tag use; I apologize in advance).

Is there hope?


Solution

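  • Programmatic solution.

    According to the documentation, to recover any part of the file you need the metadata that describes that part.

    Since you definitely don't have intact data for the whole table, try reading it one row group at a time instead of with pq.read_table.

    The method for this is in the pyarrow docs: pq.ParquetFile has read_row_group. Also note the ParquetFile constructor: in recent pyarrow versions, page checksum verification is optional (the page_checksum_verification argument), so you can leave it off. Be aware that the row-group metadata is stored in the file footer, so this only works if pyarrow can still open the file; a "magic bytes not found in footer" error suggests the footer was never written.

  • Brute force.

    Open the corrupted file in a hex viewer (e.g. the Notepad++ hex editor plugin). Enjoy the puzzle; a description of the Parquet file layout might be helpful here. :)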