Using unittest with pandas


I am new to unit tests in general and Python's unittest in particular.

When trying to validate a pandas dataframe df, I typically check that it is not empty and that it contains the required columns.

I would like to standardize the way I am running these tests.

The pandas documentation lists the available assert functions (assert_frame_equal, assert_series_equal, assert_index_equal and assert_extension_array_equal), but as far as I understand I cannot use those to run the aforementioned tests.

I came up with the following class:

import pandas as pd
import unittest
class DataFrameTestCase(unittest.TestCase):

    def test_if_dataframe_is_empty(self,df):
        self.assertTrue(len(df) > 0)

    def test_if_dataframe_contains_required_columns(self,df,columns):
        self.assertTrue(set(df.columns.to_list()) == set(columns))

The following snippet...

data = [[412256, 142193, 4], [644402, 5208768, 25]]
columns = ['easting', 'northing', 'elevation']
df = pd.DataFrame(data=data, columns=columns)
dataframetestcase = DataFrameTestCase()
dataframetestcase.test_if_dataframe_is_empty(df)
dataframetestcase.test_if_dataframe_contains_required_columns(df, columns)

...does not raise any error.

On the other hand, passing an empty dataframe df or a different columns list raises an AssertionError: False is not true error.

Is this the way to proceed or is there a built-in set of pandas or unittest assert functions that handle this in a better way?


Solution

  • I'll try to show you what is, at least in my opinion, a standard use of unittest for your goal.

    The following is your code with some changes, saved as pandas_test_routine.py:

    import pandas as pd
    import unittest
    
    data = [[412256, 142193, 4], [644402, 5208768, 25]]
    columns = ['easting', 'northing', 'elevation']
    
    # undesired input: an empty data set
    data_empty = []
    
    class DataFrameTestCase(unittest.TestCase):
    
        # setUp() is executed before each test method
        def setUp(self):
            self.data = data
            self.columns = columns
            self.sut = pd.DataFrame(data=self.data, columns=self.columns)
    
        def test_if_dataframe_IS_NOT_empty(self):
            self.assertFalse(self.sut.empty)
    
        def test_if_dataframe_CONTAINS_required_columns(self):
            self.assertTrue(set(self.sut.columns.to_list()) == set(self.columns))
    
        def test_if_dataframe_IS_empty(self):
            self.data = data_empty
            self.sut = pd.DataFrame(data=self.data, columns=self.columns)
            # assertFalse() takes an optional custom failure message
            # (the old alias failIf() was removed in Python 3.12)
            self.assertFalse(self.sut.empty, "data frame is empty")
    
    if __name__ == '__main__':
        unittest.main()
    

    To execute the tests, run the following in a terminal:

    /path/to/interpreter/python /path/to/script/pandas_test_routine.py
    

    The first and second tests pass, while the third test fails with the following error:

    AssertionError: True is not false : data frame is empty
    

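    Note that failIf() is a deprecated alias of assertFalse() and was removed in Python 3.12, so on current interpreters use assertFalse(), which accepts the same custom message as its second argument.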
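
    As a minimal sketch of the same two checks written only with non-deprecated assertions, each carrying a custom message: the class name ModernAssertionsExample and the test method names are hypothetical, chosen for illustration, while assertFalse() and assertSetEqual() are standard unittest methods.

```python
import unittest

import pandas as pd


class ModernAssertionsExample(unittest.TestCase):  # hypothetical name
    def setUp(self):
        self.columns = ['easting', 'northing', 'elevation']
        self.sut = pd.DataFrame(
            data=[[412256, 142193, 4], [644402, 5208768, 25]],
            columns=self.columns,
        )

    def test_dataframe_is_not_empty(self):
        # The second argument is shown only when the assertion fails.
        self.assertFalse(self.sut.empty, "data frame is empty")

    def test_dataframe_has_required_columns(self):
        # assertSetEqual reports which elements differ on failure,
        # which is more informative than assertTrue(a == b).
        self.assertSetEqual(
            set(self.sut.columns),
            set(self.columns),
            "data frame columns differ from the required ones",
        )
```

    Compared with assertTrue(set(...) == set(...)), the failure report of assertSetEqual() lists the missing or unexpected column names, which is usually easier to act on.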

    Method setUp()

    The setUp() method of the class TestCase is useful here: it is executed before every test.
    In your case setUp() creates the object sut with the correct data.
    Note: sut stands for System Under Test (in your case, an instance of the class DataFrame).
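
    To illustrate that setUp() runs before each test, here is a small sketch in which one test modifies sut and another still sees the original object. The class name SetUpIsolationDemo and the test names are hypothetical; the data reuses values from the example above.

```python
import unittest

import pandas as pd


class SetUpIsolationDemo(unittest.TestCase):  # hypothetical name
    def setUp(self):
        # Rebuilt before every test method, so tests do not share state.
        self.sut = pd.DataFrame({'easting': [412256, 644402]})

    def test_a_adds_a_column(self):
        self.sut['northing'] = [142193, 5208768]
        self.assertIn('northing', self.sut.columns)

    def test_b_sees_a_fresh_frame(self):
        # The column added in test_a is absent: setUp() ran again.
        self.assertNotIn('northing', self.sut.columns)
```

    Both tests pass regardless of the order in which they run, because each one receives its own freshly built sut.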

    Method unittest.main()

    The snippet of code:

    if __name__ == '__main__':
        unittest.main()
    

    executes every method of the class DataFrameTestCase whose name starts with test when the script is run directly.
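
    The name-based selection can be checked with the default test loader. This is only a sketch: the class NamingDemo and its two methods are hypothetical, while getTestCaseNames() is part of unittest.TestLoader.

```python
import unittest


class NamingDemo(unittest.TestCase):  # hypothetical name
    def test_is_collected(self):
        self.assertTrue(True)

    def helper_is_not_collected(self):
        # Not prefixed with "test", so the loader ignores it.
        raise AssertionError("never called by the runner")


# Names of the methods the default loader would run for this class.
names = list(unittest.defaultTestLoader.getTestCaseNames(NamingDemo))
print(names)
```

    Only test_is_collected appears in the printed list, which is why helper methods in a TestCase should not start with test.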