pythonpandasdataframedrop

Why is my dataframe unable to drop negative values appropriately?


I have a dataset and want to drop rows which contain negative values for a specific column. I have tried three different ways and all yield the same very odd outcome.

I will lay out my interaction with the dataframe, show the dataframe and the 'wrong' result that I am getting vs. what I would like to get. Lastly I'll show what I tried so far. Using Python3.

After import data = pd.read_csv('data.csv'),

At this point I want to drop all rows where Col2 has negative values. However all three approaches that I have tried so far have delievered the same 'wrong' result. Instead of dropping negatives, they drop everything except 0.


# my input dataframe 'sorted_data'
               Col1        Col 2
249182         0.3         -8.82
132350         0.3         -7.08
56440         -0.2         -4.66
265662         2.8         -1.17
143497        -0.3          0.00
...            ...           ...
35475         98.5        906.23
35526         98.5        906.23
35338         98.6        906.25
35834         98.5        906.28
35793         98.5        906.39

[286996 rows x 2 columns]


# my wrong output dataframe
              Col1          Col2
143497        -0.3           0.0
172367         0.1           0.0
172366         0.2           0.0
172365         0.2           0.0
172364         0.2           0.0
...        ...         ...           ...
201443         0.2           0.0
201460         0.1           0.0
201445         0.2           0.0
201444         0.2           0.0
201446         0.2           0.0

[137023 rows x 2 columns]

I got to this odd result via three different ways:


# dropping negatives
positive_data = sorted_data.drop(sorted_data.loc[sorted_data["Col2"] < 0.0].index, inplace=False)

#keeping positives
positive_data = sorted_data[sorted_data['Col2'] >= 0.0]

#convert negative to NaN, then drop Nans
sorted_data.loc[sorted_data["Col2"] < 0.0, "Col2"] = 'NaN'
positive_data = sorted_data.dropna(subset = 'Col2')

For that third option it replaces the negative values correctly, but when then dropping NaN it gives the same odd result as the other two.

Is my dataframe broken somehow? What could cause this behaviour?


Solution

  • You might not be calling the correct input and output variables for these testings. Why I suggest this:

    Options 1 and 2: Your lines work perfectly. Using the few rows at hand and your lines exactly ("dropping negatives" and "keeping positives"), I get the 6 zero or positives lines, not "Instead of dropping negatives, they drop everything except 0." So far, I cannot reproduce your issue

    Option 3: Now that one should cause a problem. "NaN" has nothing to see with an actual Not a Number, so dropna will not recognize it. You should use numpy.nan. A correct syntax for it is

    import numpy as np
    
    #convert negative to NaN, then drop Nans
    cleared_data = sorted_data.copy()
    cleared_data.loc[sorted_data["Col2"] < 0.0, "Col2"] = np.nan
    cleared_data.dropna(subset = 'Col2')
    

    Then it returns the same, correctly filtered dataframe, on my side.

    Hence when you say "it gives the same odd result as the other two", seems indicative that your root problem does not come from either of those 3 attempts.