Tags: python, arrays, pandas, numpy, future-warning

FutureWarning when applying a condition on a pandas dataframe to filter an array


I have applied PCA to an array of around 1000 observations, but I only want to keep an observation in the transformed array if one of the features in the original data equals a particular value.

I have a numpy array df2 and a dataframe df. I want to select all rows of df2 for which df.Position is CDM.

My actual data:

df2

[[ -6.00987823e+00   4.46585005e+00]
 [ -7.09055159e+00   1.89437600e+00]
 [ -5.91044431e+00  -1.97888707e+00]
 [ -4.85698965e+00  -1.09936724e+00]
 [ -4.01780368e-01  -2.57178392e+00]
 [ -2.97351215e+00  -3.15940358e+00]
 [ -4.27973589e+00   2.82707326e+00]
 [  3.95086576e+00   1.08281922e+00]
 [ -2.94075361e+00  -1.95544661e+00]
 [ -4.83788056e+00   2.32369496e+00]
 [ -5.00473716e+00  -3.37680552e-01]
 [ -4.88905829e+00  -1.55527476e+00]
 [ -3.38202709e+00  -1.04402867e+00]
 [ -2.14261510e+00  -5.30757477e-01]
 [  3.00813803e-01  -2.11010985e+00]
 [ -2.67824986e+00  -1.83303905e+00]
 [ -1.64547049e+00  -2.48056250e+00]
 [ -2.92550543e+00  -3.02363170e+00]
 [ -4.01116933e+00   2.90363840e+00]
 [ -1.04571206e+00   7.58064433e-01]
 [  2.34068739e-01  -2.33981296e+00]
 [  3.15597517e+00   1.09429188e+00]
 [ -3.83828970e+00   1.14195305e-01]
 [ -7.33794066e-01  -3.70152816e+00]
 [  8.21789967e-01  -4.77818413e-01]
 [ -3.29257688e+00  -1.61887349e+00]
 [ -4.24297171e+00   2.27187714e+00]
 [  1.45714199e+00  -3.56024788e+00]
 [  1.79855738e+00  -3.71818328e-01]
 [  3.68171085e-01  -3.52961707e+00]
 [  3.77585412e+00  -3.01627595e-01]
 [ -4.21740128e+00  -1.30913719e+00]
 [ -3.85041585e+00  -1.05515969e+00]
 [ -5.01752378e+00   4.67348167e-01]
 [  3.65943448e+00   9.21016483e-01]
 [  3.12159896e+00  -1.25707872e-01]
 [ -4.50219722e+00  -4.06752784e+00]
 [ -3.92172250e+00  -2.88567430e+00]
 [ -2.68908475e-01  -2.17506629e+00]
 [ -1.13728112e+00  -2.66843007e+00]
 [ -8.73467957e-01  -1.24389494e+00]
 [  3.21966300e+00  -1.35271239e-01]
 [ -4.31060796e+00  -1.90505910e+00]
 [  3.73904981e+00   7.70228802e-01]
 [  1.02646986e+00  -5.91828676e-01]
 [  8.43840480e-01  -1.49636218e+00]
 [  1.54065978e+00  -1.65086030e+00]
 [  2.96602068e+00  -7.41024474e-01]
 [  6.53636345e-01   3.04647288e-01]
 [  2.59236989e+00  -6.70435261e-02]
 [  2.00184665e-01  -1.55230314e+00]
 [ -7.29533092e-01  -2.73390749e+00]
 [ -2.93578745e+00  -2.18118257e+00]
 [ -4.37481195e+00   1.02701222e+00]
 [  1.00713302e+00  -1.39943282e+00]
...]


df

(simply the player's position in football/soccer: FB, CB, CDM, CM, AM, FW)

Position
FW
FW
FW
FW
FB
AM
FW
CB
AM
FW
AM
FW
AM
CM
FB
AM
CM
CM
FW
CM
CDM
CB
AM
FB
CDM
FW
FW
CDM
FB
CDM
CB
AM
...
AM

When filtering, I get output accompanied by a FutureWarning (shown as a screenshot in the original post).

Where am I going wrong and how can I filter the data appropriately?


Solution

  • The FutureWarning is probably caused by outdated numpy and pandas versions. You can upgrade both using:

    pip install --upgrade numpy pandas 
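
Before and after running the upgrade command above, it can help to confirm which versions are actually installed. This is a minimal sketch that only reads the standard `__version__` attributes. The `versions` dictionary is just an illustrative name.

```python
import numpy as np
import pandas as pd

# Collect the installed versions so they can be compared before and after upgrading.
versions = {"numpy": np.__version__, "pandas": pd.__version__}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If the numbers do not change after the upgrade, the interpreter running the script is probably not the one that pip installed into.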
    

    As for the filtering, there are a couple of options. Each one is shown here with some dummy data.


    Setup

    df
        name colour  a  b  c  d  e  f
    0   john    red  1  2  3  4  5  6
    1  james    red  2  3  4  5  6  7
    2   jane   blue  1  2  3  5  7  8
    
    df2
           0      1
    0  0.122  0.222
    1  0.343  0.345
    2  0.345  0.563
    

    Option 1
    boolean indexing

    df2[df.colour == 'red']
    Out[726]: 
           0       1
    0  0.122   0.222
    1  0.343   0.345
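
For anyone who wants to run Option 1 directly, here is a self-contained sketch. It rebuilds only the `name` and `colour` columns of the dummy `df` (the numeric columns are not needed for the filter) and assumes `df2` is a DataFrame with the default integer index.

```python
import pandas as pd

# Dummy data mirroring the setup above (only the columns the filter needs).
df = pd.DataFrame({"name": ["john", "james", "jane"],
                   "colour": ["red", "red", "blue"]})
df2 = pd.DataFrame([[0.122, 0.222],
                    [0.343, 0.345],
                    [0.345, 0.563]])

# Comparing a column to a value gives a boolean Series, one entry per row.
mask = df["colour"] == "red"

# Indexing df2 with that Series keeps the rows where the mask is True.
red_rows = df2[mask]
print(red_rows)
```

Because both frames share the same index here, the mask lines up row for row with `df2`.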
    

    Option 2
    df.eval

    df2[df.eval('colour == "red"')]
    Out[732]: 
           0       1
    0  0.122   0.222
    1  0.343   0.345
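
Option 2 can be reproduced the same way. The sketch below uses the same dummy data and shows that `df.eval` with a comparison expression yields the same boolean mask as the explicit comparison in Option 1. The variable names are illustrative only.

```python
import pandas as pd

# Same dummy data as in the setup above.
df = pd.DataFrame({"name": ["john", "james", "jane"],
                   "colour": ["red", "red", "blue"]})
df2 = pd.DataFrame([[0.122, 0.222],
                    [0.343, 0.345],
                    [0.345, 0.563]])

# eval parses the expression string and evaluates it against df's columns.
mask_eval = df.eval('colour == "red"')

# The result matches the mask from the plain boolean comparison.
same_as_option_1 = list(mask_eval) == list(df["colour"] == "red")

red_rows = df2[mask_eval]
print(same_as_option_1)
print(red_rows)
```

The string form is mostly a matter of taste for a single condition like this one.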
    

    Note that both options also work if df2 is a numpy array, provided it has as many rows as df, for example:

    array([[ 0.122,  0.222],
           [ 0.343,  0.345],
           [ 0.345,  0.563]])
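
When `df2` is a plain numpy array, there is no index to align on, so the mask is applied purely by position. A hedged sketch of that case: it converts the mask with `to_numpy()` and checks the lengths explicitly. The name `arr` and the explicit check are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["john", "james", "jane"],
                   "colour": ["red", "red", "blue"]})
arr = np.array([[0.122, 0.222],
                [0.343, 0.345],
                [0.345, 0.563]])

# Convert the boolean Series to a plain numpy boolean array.
mask = (df["colour"] == "red").to_numpy()

# A boolean mask must have exactly one entry per row of the array.
if mask.shape[0] != arr.shape[0]:
    raise ValueError("mask and array must have the same number of rows")

red_rows = arr[mask]
print(red_rows)
```

This only gives the right rows if row i of the array corresponds to row i of the dataframe, which holds when the array was produced from the dataframe without reordering or dropping rows.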
    

    For your actual data, you'll need to do something along the same lines:

    df2
    
    array([[-6.01 ,  4.466],
           [-7.091,  1.894],
           [-5.91 , -1.979],
           [-4.857, -1.099],
           [-0.402, -2.572],
           [-2.974, -3.159],
           [-4.28 ,  2.827],
           [ 3.951,  1.083],
           [-2.941, -1.955],
           [-4.838,  2.324],
           [-5.005, -0.338],
           [-4.889, -1.555],
           [-3.382, -1.044],
           [-2.143, -0.531],
           [ 0.301, -2.11 ],
           [-2.678, -1.833],
           [-1.645, -2.481],
           [-2.926, -3.024],
           [-4.011,  2.904],
           [-1.046,  0.758],
           [ 0.234, -2.34 ],
           [ 3.156,  1.094],
           [-3.838,  0.114],
           [-0.734, -3.702],
           [ 0.822, -0.478],
           [-3.293, -1.619],
           [-4.243,  2.272],
           [ 1.457, -3.56 ],
           [ 1.799, -0.372],
           [ 0.368, -3.53 ],
           [ 3.776, -0.302],
           [-4.217, -1.309]])
    
    df
    
       Position
    0        FW
    1        FW
    2        FW
    3        FW
    4        FB
    5        AM
    6        FW
    7        CB
    8        AM
    9        FW
    10       AM
    11       FW
    12       AM
    13       CM
    14       FB
    15       AM
    16       CM
    17       CM
    18       FW
    19       CM
    20      CDM
    21       CB
    22       AM
    23       FB
    24      CDM
    25       FW
    26       FW
    27      CDM
    28       FB
    29      CDM
    30       CB
    31       AM
    
    df2[df.Position == 'CDM']
    
    array([[ 0.234, -2.34 ],
           [ 0.822, -0.478],
           [ 1.457, -3.56 ],
           [ 0.368, -3.53 ]])
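
To see which rows the filter above picks, the mask can be turned into row numbers with `np.flatnonzero`. The sketch below uses the 32 positions listed above. To keep it short, it uses a hypothetical stand-in array built with `np.arange` in place of the PCA output. It also shows `isin` for keeping more than one position, which is an extension of the original filter rather than something the question asked for.

```python
import numpy as np
import pandas as pd

# The 32 positions from the dataframe shown above.
positions = ["FW", "FW", "FW", "FW", "FB", "AM", "FW", "CB",
             "AM", "FW", "AM", "FW", "AM", "CM", "FB", "AM",
             "CM", "CM", "FW", "CM", "CDM", "CB", "AM", "FB",
             "CDM", "FW", "FW", "CDM", "FB", "CDM", "CB", "AM"]
df = pd.DataFrame({"Position": positions})

# Hypothetical stand-in for the PCA output: row i holds [2*i, 2*i + 1].
stand_in = np.arange(len(df) * 2).reshape(len(df), 2)

# Boolean mask for CDM rows, and the row numbers where it is True.
mask = (df["Position"] == "CDM").to_numpy()
cdm_idx = np.flatnonzero(mask)
cdm_rows = stand_in[mask]
print(cdm_idx)  # [20 24 27 29]

# isin builds a mask that matches any of several positions.
midfield_mask = df["Position"].isin(["CDM", "CM"]).to_numpy()
midfield_rows = stand_in[midfield_mask]
```

Rows 20, 24, 27 and 29 are the same rows that appear in the filtered output above.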