I'm getting a strange error while working through this exercise on the pandas corr() method in Python:
https://www.geeksforgeeks.org/python-pandas-dataframe-corr/
Specifically, the error appears when I run: df.corr(method='pearson')
The error message offers no clue. I thought corr() was supposed to ignore strings, empty values, etc. automatically.
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    df.corr(method='pearson')
  File "C:\Users\d.o\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py", line 10059, in corr
    mat = data.to_numpy(dtype=float, na_value=np.nan, copy=False)
  File "C:\Users\d.o\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py", line 1838, in to_numpy
    result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value)
  File "C:\Users\d.o\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\managers.py", line 1732, in as_array
    arr = self._interleave(dtype=dtype, na_value=na_value)
  File "C:\Users\d.o\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\managers.py", line 1794, in _interleave
    result[rl.indexer] = arr
ValueError: could not convert string to float: 'Avery Bradley'
When I try to replicate this behavior, the corr() method works OK but emits a warning (shown below) saying that the automatic skipping of non-numeric columns will be removed in a future version. Perhaps the future has arrived? I've got pandas version 1.5.3; in pandas 2.0 and later, numeric_only defaults to False, which would explain the error you are seeing.
You may need to specify which columns to use, which is a better approach anyway than relying on pandas to pick them for you. You can do that by indexing the DataFrame with a list of the columns of interest (shown below).
In [1]: import pandas as pd
In [2]: data = {'name': ['bob', 'cindy', 'tom'],
...: 'x' : [ 1, 2, 3 ],
...: 'y' : [ 6.5, 8.9, 12.0]}
In [3]: df = pd.DataFrame(data)
In [4]: df
Out[4]:
    name  x     y
0    bob  1   6.5
1  cindy  2   8.9
2    tom  3  12.0
In [5]: df.describe()
Out[5]:
         x          y
count  3.0   3.000000
mean   2.0   9.133333
std    1.0   2.757414
min    1.0   6.500000
25%    1.5   7.700000
50%    2.0   8.900000
75%    2.5  10.450000
max    3.0  12.000000
In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    3 non-null      object
 1   x       3 non-null      int64
 2   y       3 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
In [7]: df.corr(method='pearson')
<ipython-input-7-432dd9d4238b>:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
df.corr(method='pearson')
Out[7]:
          x         y
x  1.000000  0.997311
y  0.997311  1.000000
In [8]: df[['x', 'y']].corr(method='pearson')
Out[8]:
          x         y
x  1.000000  0.997311
y  0.997311  1.000000
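If you would rather not list the columns by hand, here are two other ways to get the same result. This is only a sketch using the same toy DataFrame as above; it assumes a pandas version that accepts the numeric_only argument to corr() (the warning text above suggests 1.5.x does), and the variable names corr_flag and corr_dtypes are just illustrative.

```python
import pandas as pd

# Same toy data as in the session above.
df = pd.DataFrame({
    "name": ["bob", "cindy", "tom"],
    "x": [1, 2, 3],
    "y": [6.5, 8.9, 12.0],
})

# Option 1: tell corr() explicitly to skip non-numeric columns.
corr_flag = df.corr(method="pearson", numeric_only=True)

# Option 2: keep only numeric columns by dtype, then correlate.
corr_dtypes = df.select_dtypes(include="number").corr(method="pearson")

print(corr_flag)
print(corr_dtypes)
```

Either option should leave out the 'name' column, so the string values never reach the float conversion that raised the ValueError in the question. Listing the columns explicitly, as in In [8], is still the most transparent choice when you know which ones you care about.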