
pyreadstat: read and write SPSS files without data loss


To read an SPSS .sav file with pandas/pyreadstat, you use:

df, meta = pyreadstat.read_sav(path)

To write a dataframe, you use:

pyreadstat.write_sav(df, path)

How can I read, edit and write a .sav file without losing any metadata, such as labels and the other properties that can be changed in SPSS?

If this is not entirely possible, what would come closest to not losing data this way?


Solution

  • Talk is cheap, here's the code. :-)

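    # using pyreadstat
    import logging
    import os
    import pathlib
    import tempfile
    from io import BytesIO
    from typing import Optional
    from uuid import uuid4

    from pandas import DataFrame
    from pyreadstat import metadata_container, write_sav

    logger = logging.getLogger(__name__)


    class TempFile(type(pathlib.Path())):  # type: ignore
        """A path whose file is deleted when the with-block exits."""

        def __enter__(self):
            # Defined explicitly: pathlib.Path is no longer a context manager in Python 3.13+.
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            filepath = str(self.absolute())
            try:
                os.remove(filepath)
            except OSError:
                logger.exception('remove temporary file: %s failed!', filepath)


    class SpssTool:
        @classmethod
        def to_spss(cls, df: DataFrame, io: BytesIO, metadata: Optional[metadata_container], *, compress: bool = False):
            """Writes a pandas dataframe to a BytesIO object.

            Parameters
            ----------
            df : pandas.DataFrame
                pandas data frame to write to sav or zsav
            io : BytesIO
                the buffer that receives the SPSS file
            metadata : metadata_container or None
                SPSS file metadata container
            compress : bool
                whether to compress to zsav.
            """

            # get_legal_column_names is a helper of this class that is not shown here.
            # Note: this renames the columns of the caller's dataframe in place.
            df.columns = cls.get_legal_column_names(df.columns.to_list())

            suffix = 'zsav' if compress else 'sav'
            with TempFile(os.path.join(tempfile.gettempdir(), f'{uuid4().hex}.{suffix}')) as fp:
                write_sav(
                    df=df,
                    dst_path=str(fp),
                    column_labels=metadata.column_labels if metadata else None,
                    variable_value_labels=dict(metadata.variable_value_labels) if metadata else {},
                    variable_measure=metadata.variable_measure if metadata else None,
                    compress=compress,  # otherwise a plain sav is written under the .zsav name
                )
                io.write(fp.read_bytes())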
    
    

    Some explanations:

    SpssTool.get_legal_column_names (not shown here) is needed because SPSS restricts what a column (variable) name may look like; see the official documentation for details: https://www.ibm.com/docs/en/spss-statistics/27.0.0?topic=view-variable-names
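
Since the helper itself is not part of the answer, here is a minimal sketch of what a function like get_legal_column_names could look like. It is a hypothetical stand-in, not the author's code, and it only approximates the rules on the IBM page: it assumes ASCII-only names, a limit of 64 characters, a first character that is a letter or @, no trailing period, no reserved keywords, and uniqueness regardless of case.

```python
import re
from typing import Iterable, List

# Reserved keywords listed in the SPSS variable-name rules.
RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT", "NE", "NOT", "OR", "TO", "WITH"}
MAX_LEN = 64  # the limit is in bytes; equal to characters for ASCII-only names


def get_legal_column_names(names: Iterable[object]) -> List[str]:
    """Return SPSS-friendly versions of `names`, in the same order (hypothetical helper)."""
    seen = set()  # upper-cased names already handed out
    result = []
    for raw in names:
        # Replace everything outside a conservative ASCII whitelist with "_".
        name = re.sub(r"[^A-Za-z0-9_.@#$]", "_", str(raw))
        # Force a letter or "@" in first position ("#" and "$" have special meaning there).
        if not re.match(r"[A-Za-z@]", name):
            name = "v_" + name
        # Enforce the length limit and avoid a trailing period.
        name = name[:MAX_LEN].rstrip(".")
        if name.upper() in RESERVED:
            name = "v_" + name
        # Make the name unique, ignoring case, by appending _1, _2, ...
        candidate, n = name, 1
        while candidate.upper() in seen:
            suffix = f"_{n}"
            candidate = name[: MAX_LEN - len(suffix)] + suffix
            n += 1
        seen.add(candidate.upper())
        result.append(candidate)
    return result
```

Note that renaming columns means any metadata keyed by the old names (for example variable_value_labels) has to be re-keyed to the new names before writing.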

    metadata_container is imported with: from pyreadstat import metadata_container. It is the container that holds information about the dataset; you can find more detail at: https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html#metadata-object-description

    These may be what you need.
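
For the read, edit, write cycle asked about in the question, a sketch along the following lines may come close. The function names write_sav_kwargs and roundtrip are made up for illustration; the metadata attributes and write_sav parameters are the ones described in the pyreadstat documentation. It assumes that user_missing=True is needed for read_sav to fill meta.missing_ranges, and it passes on only the fields listed in the code, so anything else in the file is not guaranteed to survive.

```python
from typing import Any, Callable, Dict, Iterable

# Metadata attributes that share their name with a write_sav parameter.
DICT_FIELDS = ("variable_value_labels", "missing_ranges",
               "variable_display_width", "variable_measure")


def write_sav_kwargs(columns: Iterable[str], meta: Any) -> Dict[str, Any]:
    """Build write_sav keyword arguments from a pyreadstat metadata object.

    Hypothetical helper: entries for columns that no longer exist are dropped,
    and new columns simply get no label.
    """
    cols = list(columns)
    labels = getattr(meta, "column_names_to_labels", None) or {}
    kwargs: Dict[str, Any] = {
        "file_label": getattr(meta, "file_label", None) or "",
        # One label per column, None where the column has no label.
        "column_labels": [labels.get(c) for c in cols],
    }
    for field in DICT_FIELDS:
        value = getattr(meta, field, None) or {}
        kept = {c: v for c, v in value.items() if c in cols}
        if kept:  # leave the parameter at its default when there is nothing to pass
            kwargs[field] = kept
    return kwargs


def roundtrip(src: str, dst: str, edit: Callable) -> None:
    """Read src, apply edit(df) -> df, and write dst with the metadata carried over."""
    # Imported here so write_sav_kwargs can be used without pyreadstat installed.
    import pyreadstat

    # Assumption: user_missing=True keeps user-defined missing values in the data
    # and records their definitions in meta.missing_ranges.
    df, meta = pyreadstat.read_sav(src, user_missing=True)
    df = edit(df)
    pyreadstat.write_sav(df, dst, **write_sav_kwargs(df.columns, meta))


# Example call (file names are placeholders):
# roundtrip("in.sav", "out.sav", lambda df: df.assign(age=df["age"] + 1))
```

write_sav also has a variable_format parameter; whether meta.original_variable_types can be passed to it unchanged for every column type is not verified here, so it is left out of the sketch.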