I am attempting to build a naive Bayes model for text classification.
Here is a sample of the data I'm working with:
df_some_observations = filtered_training.sample(frac=0.0001)
df_some_observations.to_dict()
The output looks like this:
{'Intitulé (Ce champ doit respecter la nomenclature suivante : Code action – Libellé)_x': {40219: 'aegua00268 format oper scad htbhta fonction avance',
16820: 'aeedf50490 sort conflit facon construct',
24771: '4022mps192 prepar a lhabilit electr boho indic v personnel non elec',
34482: '3095mceg73 affirmezvous relat professionnel bas ref 7114'},
'Nœud parent au niveau N y compris moi-même.1': {40219: 'distribu electricit rel reseau electricit ecr exploit conduit reseau electricit',
16820: 'ct competent transvers rhu ressourc humain for pilotag gestion format',
24771: 'ss sant securit prevent prf prevent risqu professionnel hcp habilit certif perm prevent risqu meti',
34482: 'nan'},
'Thème de formation (Chemin complet)': {40219: 'distribu electricit rel reseau electricit ecr exploit conduit reseau electricit',
16820: 'ct competent transvers rhu ressourc humain for pilotag gestion format',
24771: 'ss sant securit prevent prf prevent risqu professionnel hcp habilit certif perm prevent risqu meti',
34482: 'in ingenier esp equip sous pression'},
'Description du champ supplémentaire : Objectifs de la formation': {40219: 'nan',
16820: 'nan',
24771: 'prepar a lhabilit electr boho indic v autoris special lissu cet format stagiair doit connaitr risqu electr savoir sen proteg doit etre capabl deffectu oper simpl dexploit suiv certain methodolog',
34482: 'nan'},
'Objectifs': {40219: 'nan', 16820: 'nan', 24771: 'nan', 34482: 'nan'},
'Programme de formation': {40219: 'nan',
16820: 'nan',
24771: 'notion elementair delectricit sensibilis risqu electr prevent risqu electr publiqu utec 18 510 definit oper lenviron intervent tbt b appareillag electr bt materiel protect individuel collect manoeuvr mesurag essais verif outillag electr portat a main mis situat coffret didact',
34482: 'nan'},
'Populations concernées': {40219: 'nan',
16820: 'nan',
24771: 'personnel electricien effectu oper dordr electr',
34482: 'nan'},
'Prérequis': {40219: 'nan',
16820: 'nan',
24771: 'personnel non electricien effectu oper simpl remplac fusibl rearm disjoncteur rel thermiqu',
34482: 'nan'},
"Description du champ supplémentaire : Commanditaire de l'action": {40219: 'nan',
16820: 'nan',
24771: 'nan',
34482: 'nan'},
"Organisme dispensant l'action": {40219: 'local sei',
16820: 'intern edf',
24771: 'intern edf',
34482: 'intern edf'},
'Durée théorique (h)': {40219: 14.0, 24771: 11.0, 34482: 14.0},
'Coût de la catégorie Coût pédagogique': {40219: 0.0,
16820: 0.0,
24771: 0.0,
34482: 0.0},
'Coût de la catégorie Coût logistique': {40219: 0.0,
16820: 0.0,
24771: 0.0,
34482: 0.0},
I started by splitting the data after removing some unnecessary columns (my target variable is in column 15):
df_training = filtered_training.sample(frac=0.8, random_state=42)
df_test = filtered_training.drop(df_training.index)
X_train = df_training.iloc[:,:14]
y_train = df_training.iloc[:,15]
X_test = df_test.iloc[:,:14]
y_test = df_test.iloc[:,15]
When building the model with:
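from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
predicted_categories = model.predict(X_test)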
I receive the following error when executing model.fit(X_train, y_train):
ValueError: Found input variables with inconsistent numbers of samples: [14, 35478]
Additional information that may be helpful:
np.shape(X_train) #(35478, 14)
np.shape(y_train) #(35478,)
np.shape(X_test) #(8870, 14)
np.shape(y_test) #(8870,)
I think the main problem is that TfidfVectorizer can only work with one-dimensional text data (as I see it from here). When it is given a DataFrame with several text columns, it iterates over the column names instead of the rows, which is why the error reports 14 samples (the number of columns) against 35478 labels.
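A minimal sketch of where the count of 14 comes from, using a made-up two-column frame (the column names and values are hypothetical). Iterating over a DataFrame yields its column labels, so the vectorizer ends up with one "document" per column:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: 3 rows, 2 text columns.
toy = pd.DataFrame({
    "title": ["format oper scad", "sort conflit", "prepar habilit electr"],
    "theme": ["reseau electricit", "ressourc humain", "sant securit"],
})

# Iterating over a DataFrame yields the column labels, not the rows.
docs_seen = list(toy)

# TfidfVectorizer simply iterates over its input, so it tokenizes the labels.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(toy)

n_docs = tfidf_matrix.shape[0]            # one row per column label, not per sample
vocabulary = sorted(vectorizer.vocabulary_)  # the "words" are the column names
```

With 14 columns in X_train, the same mechanism gives 14 "samples", which no longer matches the 35478 entries in y_train.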
In your case I see two ways to solve this problem:
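# Option 1: vectorize each column separately and stack the resulting features.
from sklearn.compose import ColumnTransformer
# Make sure that every column passed here contains text data.
column_transformer = ColumnTransformer([(x, TfidfVectorizer(), x) for x in X_train.columns])
model = make_pipeline(column_transformer, MultinomialNB())
model.fit(X_train, y_train)
predicted_categories = model.predict(X_test)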
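To check that the per-column approach above keeps one row per sample, here is a small self-contained sketch. The frame X_toy, the labels y_toy and their values are invented for illustration and only stand in for X_train and y_train:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for X_train / y_train.
X_toy = pd.DataFrame({
    "title": ["format oper scad", "sort conflit facon",
              "prepar habilit electr", "format reseau electr"],
    "theme": ["reseau electricit", "ressourc humain",
              "sant securit", "reseau electricit"],
})
y_toy = pd.Series(["tech", "hr", "safety", "tech"])

# Passing the column name as a plain string (not a list) hands each
# vectorizer a 1-D Series, which is what TfidfVectorizer expects.
per_column_tfidf = ColumnTransformer(
    [(col, TfidfVectorizer(), col) for col in X_toy.columns]
)

# The transformed matrix has one row per sample again.
features = per_column_tfidf.fit_transform(X_toy)
n_rows = features.shape[0]

model = make_pipeline(per_column_tfidf, MultinomialNB())
model.fit(X_toy, y_toy)
predictions = model.predict(X_toy)
```

Each column gets its own vocabulary here, so the same word in two columns becomes two separate features. Whether that helps depends on the data.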
# Option 2: concatenate all text columns into a single text column.
new_X_train = X_train.iloc[:, 0]
for x in X_train.columns[1:]:
    new_X_train = new_X_train + ' ' + X_train[x]
new_X_test = X_test.iloc[:, 0]
for x in X_test.columns[1:]:
    new_X_test = new_X_test + ' ' + X_test[x]
# The vectorizer now receives one string per sample.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(new_X_train, y_train)
predicted_categories = model.predict(new_X_test)
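One caveat with the concatenation approach above: adding strings with + turns a whole row into NaN if any column holds a real missing value (as opposed to the string 'nan' seen in the sample). The sketch below is one possible way around that, assuming missing cells should simply be skipped. The helper name join_text_columns and the toy data are made up for illustration:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def join_text_columns(frame: pd.DataFrame) -> pd.Series:
    """Join the non-missing cells of each row into one space-separated string."""
    return frame.apply(
        lambda row: " ".join(str(value) for value in row if pd.notna(value)),
        axis=1,
    )


# Hypothetical toy data with one genuinely missing cell.
X_toy = pd.DataFrame({
    "title": ["format oper scad", "sort conflit facon",
              "prepar habilit electr", "format reseau electr"],
    "theme": ["reseau electricit", None, "sant securit", "reseau electricit"],
})
y_toy = pd.Series(["tech", "hr", "safety", "tech"])

docs = join_text_columns(X_toy)  # one string per sample

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, y_toy)
predictions = model.predict(docs)
```

Unlike the per-column variant, all columns share a single vocabulary here, so a word counts the same no matter which column it came from.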