python patsy

Stop patsy dmatrix from dropping NaN rows


I would like to use patsy's dmatrix function to generate a design matrix in which rows with NaN values are preserved. For example, the following code returns a design matrix with four rows, which is what one would normally want. In this case, however, I would like dmatrix to return a matrix with five rows, where the first row contains a NaN value.

import numpy as np
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({'x1': np.arange(5), 'x2': np.arange(5)})
dmatrix("~x1+x2.diff()", df)

Alternatively, I would settle for an answer that allows me to retrieve the row numbers that were dropped / retained. In the example above, row 1 is the row that was dropped, while rows 2-5 were retained.


Solution

  • Try:

    import patsy  # the question only imports dmatrix, so NAAction needs this
    dmatrix("~x1+x2.diff()", df, NA_action=patsy.NAAction(NA_types=[]))
    

    This tells patsy not to consider NaN as indicating a missing value, so it will be passed through instead. Docs are here: https://patsy.readthedocs.io/en/latest/API-reference.html#missing-values

    "Alternatively, I would settle for an answer that allows me to retrieve the row numbers that were dropped / retained."

    If you use return_type="dataframe", then patsy will return a pandas DataFrame containing your design matrix, and the index on that DataFrame will correspond to the rows in your original input, so you can see which rows were kept or dropped.
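
Both suggestions can be tried side by side on the data from the question. The following is only a minimal sketch, assuming reasonably recent versions of patsy, pandas, and numpy. The variable names (`formula`, `kept`, `default`, `retained_rows`, `dropped_rows`) are made up for illustration and are not part of patsy.

```python
import numpy as np
import pandas as pd
from patsy import NAAction, dmatrix

# Same data and formula as in the question.
df = pd.DataFrame({"x1": np.arange(5), "x2": np.arange(5)})
formula = "~x1+x2.diff()"

# Option 1: with an empty NA_types list, NaN is not treated as missing,
# so the row whose difference is NaN is passed through instead of dropped.
kept = dmatrix(formula, df, NA_action=NAAction(NA_types=[]),
               return_type="dataframe")

# Option 2: keep the default NA handling, but ask for a DataFrame so that
# the index shows which input rows survived.
default = dmatrix(formula, df, return_type="dataframe")
retained_rows = default.index
dropped_rows = df.index.difference(default.index)

print(kept.shape, default.shape)
print(list(retained_rows), list(dropped_rows))
```

Note that the index labels come from the original DataFrame, so with the default 0-based index the "row 1" from the question shows up as label 0.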