I have applied PCA to an array of around 1000 observations but only want to keep the observation in the new array IF one of the features from the original array = something.
I have a numpy array df2
and a dataframe df
. I want to find all rows in df2
where df.Position
is CDM
.
My actual data:
df2
[[ -6.00987823e+00 4.46585005e+00]
[ -7.09055159e+00 1.89437600e+00]
[ -5.91044431e+00 -1.97888707e+00]
[ -4.85698965e+00 -1.09936724e+00]
[ -4.01780368e-01 -2.57178392e+00]
[ -2.97351215e+00 -3.15940358e+00]
[ -4.27973589e+00 2.82707326e+00]
[ 3.95086576e+00 1.08281922e+00]
[ -2.94075361e+00 -1.95544661e+00]
[ -4.83788056e+00 2.32369496e+00]
[ -5.00473716e+00 -3.37680552e-01]
[ -4.88905829e+00 -1.55527476e+00]
[ -3.38202709e+00 -1.04402867e+00]
[ -2.14261510e+00 -5.30757477e-01]
[ 3.00813803e-01 -2.11010985e+00]
[ -2.67824986e+00 -1.83303905e+00]
[ -1.64547049e+00 -2.48056250e+00]
[ -2.92550543e+00 -3.02363170e+00]
[ -4.01116933e+00 2.90363840e+00]
[ -1.04571206e+00 7.58064433e-01]
[ 2.34068739e-01 -2.33981296e+00]
[ 3.15597517e+00 1.09429188e+00]
[ -3.83828970e+00 1.14195305e-01]
[ -7.33794066e-01 -3.70152816e+00]
[ 8.21789967e-01 -4.77818413e-01]
[ -3.29257688e+00 -1.61887349e+00]
[ -4.24297171e+00 2.27187714e+00]
[ 1.45714199e+00 -3.56024788e+00]
[ 1.79855738e+00 -3.71818328e-01]
[ 3.68171085e-01 -3.52961707e+00]
[ 3.77585412e+00 -3.01627595e-01]
[ -4.21740128e+00 -1.30913719e+00]
[ -3.85041585e+00 -1.05515969e+00]
[ -5.01752378e+00 4.67348167e-01]
[ 3.65943448e+00 9.21016483e-01]
[ 3.12159896e+00 -1.25707872e-01]
[ -4.50219722e+00 -4.06752784e+00]
[ -3.92172250e+00 -2.88567430e+00]
[ -2.68908475e-01 -2.17506629e+00]
[ -1.13728112e+00 -2.66843007e+00]
[ -8.73467957e-01 -1.24389494e+00]
[ 3.21966300e+00 -1.35271239e-01]
[ -4.31060796e+00 -1.90505910e+00]
[ 3.73904981e+00 7.70228802e-01]
[ 1.02646986e+00 -5.91828676e-01]
[ 8.43840480e-01 -1.49636218e+00]
[ 1.54065978e+00 -1.65086030e+00]
[ 2.96602068e+00 -7.41024474e-01]
[ 6.53636345e-01 3.04647288e-01]
[ 2.59236989e+00 -6.70435261e-02]
[ 2.00184665e-01 -1.55230314e+00]
[ -7.29533092e-01 -2.73390749e+00]
[ -2.93578745e+00 -2.18118257e+00]
[ -4.37481195e+00 1.02701222e+00]
[ 1.00713302e+00 -1.39943282e+00]
...]
df
(simply playing position in football/soccer - FB, CB, CDM, CM, AM, FW)
Position
FW
FW
FW
FW
FB
AM
FW
CB
AM
FW
AM
FW
AM
CM
FB
AM
CM
CM
FW
CM
CDM
CB
AM
FB
CDM
FW
FW
CDM
FB
CDM
CB
AM
...
AM
When filtering, I get this output (along with a FutureWarning
):
Where am I going wrong and how can I filter the data appropriately?
The FutureWarning
is probably a result of your numpy
and pandas
versions being out of date. You can upgrade them using:
pip install --upgrade numpy pandas
As for the filtering, there are quite a few options. Here I mention each one with some dummy data.
Setup
df
name colour a b c d e f
0 john red 1 2 3 4 5 6
1 james red 2 3 4 5 6 7
2 jane blue 1 2 3 5 7 8
df2
0 1
0 0.122 0.222
1 0.343 0.345
2 0.345 0.563
Option 1
boolean indexing
df2[df.colour == 'red']
Out[726]:
0 1
0 0.122 0.222
1 0.343 0.345
Option 2
df.eval
df2[df.eval('colour == "red"')]
Out[732]:
0 1
0 0.122 0.222
1 0.343 0.345
Note that both these options work even if df2
is a numpy
array of the form:
array([[ 0.122, 0.222],
[ 0.343, 0.345],
[ 0.345, 0.563]])
For your actual data, you'll need to do something along the same lines:
df2
array([[-6.01 , 4.466],
[-7.091, 1.894],
[-5.91 , -1.979],
[-4.857, -1.099],
[-0.402, -2.572],
[-2.974, -3.159],
[-4.28 , 2.827],
[ 3.951, 1.083],
[-2.941, -1.955],
[-4.838, 2.324],
[-5.005, -0.338],
[-4.889, -1.555],
[-3.382, -1.044],
[-2.143, -0.531],
[ 0.301, -2.11 ],
[-2.678, -1.833],
[-1.645, -2.481],
[-2.926, -3.024],
[-4.011, 2.904],
[-1.046, 0.758],
[ 0.234, -2.34 ],
[ 3.156, 1.094],
[-3.838, 0.114],
[-0.734, -3.702],
[ 0.822, -0.478],
[-3.293, -1.619],
[-4.243, 2.272],
[ 1.457, -3.56 ],
[ 1.799, -0.372],
[ 0.368, -3.53 ],
[ 3.776, -0.302],
[-4.217, -1.309]])
df
Position
0 FW
1 FW
2 FW
3 FW
4 FB
5 AM
6 FW
7 CB
8 AM
9 FW
10 AM
11 FW
12 AM
13 CM
14 FB
15 AM
16 CM
17 CM
18 FW
19 CM
20 CDM
21 CB
22 AM
23 FB
24 CDM
25 FW
26 FW
27 CDM
28 FB
29 CDM
30 CB
31 AM
df2[df.Position == 'CDM']
array([[ 0.234, -2.34 ],
[ 0.822, -0.478],
[ 1.457, -3.56 ],
[ 0.368, -3.53 ]])