I'm working on a project that uses the MIMIC-IV dataset as a source. I found a preprocessing pipeline that is widely used in many projects. Running it works fine until I reach the step that generates the time-series data representation (I haven't modified the data or the pipeline code in any way). The following error occurs:
TypeError Traceback (most recent call last)
.../Downloads/MIMIC-IV-Data-Pipeline-main/mainPipeline.ipynb Cell 27 in <cell line: 20>()
18 impute=False
20 if data_icu:
---> 21 gen=data_generation_icu.Generator(cohort_output,data_mort,data_admn,data_los,diag_flag,proc_flag,out_flag,chart_flag,med_flag,impute,include,bucket,predW)
22 #gen=data_generation_icu.Generator(cohort_output,data_mort,diag_flag,False,False,chart_flag,False,impute,include,bucket,predW)
23 #if chart_flag:
24 # gen=data_generation_icu.Generator(cohort_output,data_mort,False,False,False,chart_flag,False,impute,include,bucket,predW)
25 else:
26 gen=data_generation.Generator(cohort_output,data_mort,data_admn,data_los,diag_flag,lab_flag,proc_flag,med_flag,impute,include,bucket,predW)
File ~/Downloads/MIMIC-IV-Data-Pipeline-main/model/data_generation_icu.py:22, in Generator.__init__(self, cohort_output, if_mort, if_admn, if_los, feat_cond, feat_proc, feat_out, feat_chart, feat_med, impute, include_time, bucket, predW)
20 self.cohort_output=cohort_output
21 self.impute=impute
---> 22 self.data = self.generate_adm()
23 print("[ READ COHORT ]")
25 self.generate_feat()
File ~/Downloads/MIMIC-IV-Data-Pipeline-main/model/data_generation_icu.py:64, in Generator.generate_adm(self)
62 data['los']=pd.to_timedelta(data['outtime']-data['intime'],unit='h')
63 data['los']=data['los'].astype(str)
---> 64 data[['days', 'dummy','hours']] = data['los'].str.split(' ', -1, expand=True)
65 data[['hours','min','sec']] = data['hours'].str.split(':', -1, expand=True)
66 data['los']=pd.to_numeric(data['days'])*24+pd.to_numeric(data['hours'])
...
127 )
128 raise TypeError(msg)
--> 129 return func(self, *args, **kwargs)
TypeError: split() takes from 1 to 2 positional arguments but 3 positional arguments (and 1 keyword-only argument) were given.
I'm assuming the problem lies in the use of the pandas.str.split() function (I'm using pandas version 2.0.3) but when I check the documentation it should accept 3 keyword arguments as far as I can tell.
Since it isn't my code I'm having a hard time debugging what is going wrong here but maybe I'm missing something. Does anyone know or did anyone run into the same problem when trying to use this pipeline and have any clue how to fix this?
In recent pandas versions, many functions switched to keyword-only arguments. You can see this in the str.split documentation:
                 # positional # keyword-only
Series.str.split(pat=None, *, n=-1, expand=False, regex=None)
The * means that only pat can be passed positionally; n, expand, and regex must be provided as keywords.
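To see what the bare * does independently of pandas, here is a minimal sketch in plain Python. The function split_like is a hypothetical stand-in that only mimics the shape of the signature above; it is not pandas code.

```python
def split_like(pat=None, *, n=-1, expand=False):
    """Hypothetical stand-in mimicking a keyword-only signature."""
    return (pat, n, expand)

# Arguments after the bare * are accepted when passed by name.
ok = split_like(' ', n=-1, expand=True)

# Passing them positionally raises TypeError, much like the traceback.
try:
    split_like(' ', -1, True)
    raised = False
except TypeError:
    raised = True

print(ok, raised)
```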
You need to use the named parameter:
data[['days', 'dummy','hours']] = data['los'].str.split(' ', n=-1, expand=True)
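As a rough check, the corrected call can be tried on a small toy frame. This is only a sketch: the two los strings below are made-up values in the "N days HH:MM:SS" form that the pipeline code appears to expect, not real MIMIC-IV data.

```python
import pandas as pd

# Toy stand-in for the pipeline's frame (values are invented for illustration).
data = pd.DataFrame({'los': ['1 days 02:30:00', '0 days 05:15:00']})

# n and expand are passed by keyword, as required by recent pandas.
data[['days', 'dummy', 'hours']] = data['los'].str.split(' ', n=-1, expand=True)
data[['hours', 'min', 'sec']] = data['hours'].str.split(':', n=-1, expand=True)

# Same arithmetic as the pipeline: whole days converted to hours, plus the hour field.
los_hours = pd.to_numeric(data['days']) * 24 + pd.to_numeric(data['hours'])
print(los_hours.tolist())
```

Since n=-1 is already the default, writing str.split(' ', expand=True) should behave the same. Note that the next line of the pipeline (line 65 in the traceback) uses the same positional pattern and will likely need the same change.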
There actually used to be a FutureWarning about this in previous versions:
In a future version of pandas all arguments of StringMethods.split except for the argument 'pat' will be keyword-only.