I have a dataframe of 10899 rows × 32 columns where many cells contain values that start with -9 (e.g. -99.0, -9, -9.678).
How can I create a new dataframe from the original where all values beginning with '-9' are converted to NaN?
The code I used returned a new dataframe with no changes.
Let's call the original dataframe weatherData_original.
Here's a piece of weatherData_original:
             Date  Solar  MaxRH  AvgAirTemp
0        3/1/1983  -9.00   -9.0      -99.00
1        3/2/1983  -9.00   -9.0        0.31
2        3/3/1983  -9.00   -9.0      -99.00
3        3/4/1983  -9.00   -9.0        8.62
4        3/5/1983  19.97   64.6        8.91
...           ...    ...    ...         ...
10894  12/27/2012   9.67   53.9        5.99
10895  12/28/2012  10.21   89.7        0.96
10896  12/29/2012  10.25   57.9        1.89
10897  12/30/2012   3.72   86.3        3.85
10898  12/31/2012   9.68   92.9        3.53
Code:
new_df = weatherData_original.replace(regex=r'^-9\d+', value=pd.NA)
new_df = weatherData_original.replace(regex='^[-9].*', value=pd.NA)
I think your issue here is with data types. You're writing a regular expression to replace substrings, but pandas has most likely read the columns you're interested in as floats or integers, and the regex form of replace only looks at string values. You can check this by running:
weatherData_original.dtypes
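To make the failure mode concrete, here is a minimal sketch (a made-up two-row version of the Solar column) showing that a regex replacement silently leaves a float column untouched:

import pandas as pd

# Hypothetical miniature of the Solar column, parsed as float64 the way read_csv would
demo = pd.DataFrame({'Solar': [-9.00, 19.97]})
print(demo.dtypes)                                # Solar    float64
print(demo.replace(regex=r'^-9.*', value=pd.NA))  # unchanged: the regex only matches string cells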
One way around this is to set the dtype to str when loading the data. This will read in every column as a string value.
import pandas as pd

data = {'Date': ['3/1/1983', '3/2/1983'], 'Solar': [-9.00, -9.00]}
df = pd.DataFrame(data, dtype=str)  # dtype=str stores both columns as strings
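As a quick check, both columns should now report object dtype, which is what the regex replacement needs:

print(df.dtypes)  # both Date and Solar show as object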
Then you can run a regular-expression replacement. One caution about your second pattern: [-9] is a character class, so ^[-9].* matches anything that starts with - or 9 and would also blank out legitimate values like 9.68. Anchoring on the literal -9 prefix does what you described:
new_df = df.replace(regex=r'^-9', value=pd.NA)
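For the full dataset, the same idea end to end might look like the rough sketch below, assuming the data is loaded from a CSV (the filename weatherData.csv is made up). Note that the result still holds strings, so you may want to convert the numeric columns back afterwards, for example with pd.to_numeric:

import pandas as pd

# dtype=str keeps every column as text so the regex can see the raw values (filename is hypothetical)
weatherData_original = pd.read_csv('weatherData.csv', dtype=str)

# Any cell whose value starts with -9 becomes missing
new_df = weatherData_original.replace(regex=r'^-9', value=pd.NA)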