Tags: pandas, multivariate-time-series, u8darts

Darts Time Series thinks my static covariate needs to be a float


I'm learning how to use machine learning algorithms for forecasting with darts, following a video from Kish Manani (https://www.youtube.com/watch?v=9QtL7m3YS9I).

I'm trying to use TimeSeries.from_group_dataframe() to create a couple of different graphs for a linear regression model. My data looks like this:

date country volume
2020-01-01 UK 2121
2020-01-01 DE 300
2020-01-02 UK 2150
2020-01-02 DE 243

The issue is that I am getting a ValueError whose cause I cannot work out:

Traceback (most recent call last):
  File "C:\Users\[redacted]\Desktop\Scripts\git\scikitlearn-models\More advanced models.py", line 41, in <module>
    model.fit(y)
  File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\darts\models\forecasting\regression_model.py", line 722, in fit
    self._fit_model(
  File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\darts\models\forecasting\regression_model.py", line 544, in _fit_model
    self.model.fit(training_samples, training_labels, **kwargs)
  File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py", line 678, in fit
    X, y = self._validate_data(
  File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 621, in _validate_data
    X, y = check_X_y(X, y, **check_params)
    X = check_array(
  File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\validation.py", line 917, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
  File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\_array_api.py", line 380, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
ValueError: could not convert string to float: 'DE'

Code attached:

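import pandas as pd
from darts import TimeSeries
from darts.models import RegressionModel
from sklearn.linear_model import LinearRegression

# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('data/multivariate_test.csv', index_col=False)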


# Convert the 'date' column to timestamps, parsing day-first (English) dates
df['date'] = pd.to_datetime(df['date'],dayfirst=True)
print("Data Converted to: ")
print(df.dtypes)
df.sort_values(by='date',inplace=True)
df.reset_index(drop=True,inplace=True)
print(df)

# Create a TimeSeries, specifying the time and value columns
y = TimeSeries.from_group_dataframe(df,
                                    group_cols= 'country',
                                    static_cols= 'country',
                                    time_col= 'date',
                                    value_cols=['y'],
                                    fill_missing_dates=False, freq='MS') # stands for Month Start

### REGRESSION MODEL ###
model = RegressionModel(lags=[-1,-2,-12],model = LinearRegression())
model.fit(y)
y_pred = model.predict(n=12,series=y)

I wouldn't expect the darts library to need my grouping covariates converted into floats, especially when the purpose of this parameter is to split the time series by category or type.

Anyone who knows the library well, or can see an obvious mistake, please let me know.


Solution

  • After playing around with your code and data, I ended up at this error:

    "RegressionModel can only interpret numeric static covariate data. Consider encoding/transforming categorical static covariates with darts.dataprocessing.transformers.static_covariates_transformer.StaticCovariatesTransformer or set use_static_covariates=False at model creation to ignore static covariates."

    This means that if you are going to use RegressionModel, your "country" column has to be converted to a numerical type before it is passed as a static covariate. In this case, you would assign each country a numerical value (e.g. UK = 1, DE = 2) and replace the country letter codes with the assigned values.

    To do this, you can use scikit-learn's ordinal encoder (sklearn.preprocessing.OrdinalEncoder) or, even easier, darts' built-in StaticCovariatesTransformer:

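    from darts.dataprocessing.transformers import StaticCovariatesTransformer

    # group_cols is turned into a static covariate automatically
    y = TimeSeries.from_group_dataframe(df,
                                        group_cols='country',
                                        time_col='date',
                                        value_cols=['y'],
                                        fill_missing_dates=False, freq='MS') # stands for Month Start

    # Encode the categorical static covariate ('country') as numbers
    transformer = StaticCovariatesTransformer()
    y_transformed = transformer.fit_transform(y)

    model = RegressionModel(lags=[-1, -2, -12], model=LinearRegression())
    model.fit(y_transformed)
    y_pred = model.predict(n=12, series=y_transformed)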
    

    Note that I also removed your static_cols parameter: the columns given in group_cols are automatically added as static covariates, so you do not need both.
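
As a rough illustration of the encoding step described above, here is a minimal sketch that applies scikit-learn's OrdinalEncoder directly to the sample rows from the question. The `country_code` column name is made up for this example, and the sketch only shows what ordinal encoding does to the values; it is not a description of how darts wires the encoder in internally.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample rows copied from the question
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]),
    "country": ["UK", "DE", "UK", "DE"],
    "volume": [2121, 300, 2150, 243],
})

# OrdinalEncoder expects a 2D input, hence the double brackets.
# By default it sorts the categories and numbers them from 0.
encoder = OrdinalEncoder()
# "country_code" is a hypothetical column name chosen for this sketch
df["country_code"] = encoder.fit_transform(df[["country"]]).ravel()

print(encoder.categories_[0])  # the learned categories, in code order
print(df[["country", "country_code"]])
```

With the default settings the codes start at 0 and follow the sorted category order, so here DE maps to 0.0 and UK to 1.0 rather than the 1 and 2 used in the prose above. The exact numbers do not matter for getting past the error, but keep in mind that a linear model will treat them as ordered magnitudes.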