python machine-learning nlp valueerror naivebayes

ValueError when using model.fit even with the vectors being aligned

I am attempting to build a naive Bayes model for text classification.

Here is a sample of the data I'm working with:

df_some_observations = filtered_training.sample(frac=0.0001)
df_some_observations.to_dict()

The output looks like this:

{'Intitulé (Ce champ doit respecter la nomenclature suivante : Code action – Libellé)_x': {40219: 'aegua00268 format oper scad htbhta fonction avance',
  16820: 'aeedf50490 sort conflit facon construct',
  24771: '4022mps192 prepar a lhabilit electr boho indic v personnel non elec',
  34482: '3095mceg73 affirmezvous relat professionnel bas ref 7114'},
 'Nœud parent au niveau N y compris moi-même.1': {40219: 'distribu electricit rel reseau electricit ecr exploit conduit reseau electricit',
  16820: 'ct competent transvers rhu ressourc humain for pilotag gestion format',
  24771: 'ss sant securit prevent prf prevent risqu professionnel hcp habilit certif perm prevent risqu meti',
  34482: 'nan'},
 'Thème de formation (Chemin complet)': {40219: 'distribu electricit rel reseau electricit ecr exploit conduit reseau electricit',
  16820: 'ct competent transvers rhu ressourc humain for pilotag gestion format',
  24771: 'ss sant securit prevent prf prevent risqu professionnel hcp habilit certif perm prevent risqu meti',
  34482: 'in ingenier esp equip sous pression'},
 'Description du champ supplémentaire : Objectifs de la formation': {40219: 'nan',
  16820: 'nan',
  24771: 'prepar a lhabilit electr boho indic v autoris special lissu cet format stagiair doit connaitr risqu electr savoir sen proteg doit etre capabl deffectu oper simpl dexploit suiv certain methodolog',
  34482: 'nan'},
 'Objectifs': {40219: 'nan', 16820: 'nan', 24771: 'nan', 34482: 'nan'},
 'Programme de formation': {40219: 'nan',
  16820: 'nan',
  24771: 'notion elementair delectricit sensibilis risqu electr prevent risqu electr publiqu utec 18 510 definit oper lenviron intervent tbt b appareillag electr bt materiel protect individuel collect manoeuvr mesurag essais verif outillag electr portat a main mis situat coffret didact',
  34482: 'nan'},
 'Populations concernées': {40219: 'nan',
  16820: 'nan',
  24771: 'personnel electricien effectu oper dordr electr',
  34482: 'nan'},
 'Prérequis': {40219: 'nan',
  16820: 'nan',
  24771: 'personnel non electricien effectu oper simpl remplac fusibl rearm disjoncteur rel thermiqu',
  34482: 'nan'},
 "Description du champ supplémentaire : Commanditaire de l'action": {40219: 'nan',
  16820: 'nan',
  24771: 'nan',
  34482: 'nan'},
 "Organisme dispensant l'action": {40219: 'local sei',
  16820: 'intern edf',
  24771: 'intern edf',
  34482: 'intern edf'},
 'Durée théorique (h)': {40219: 14.0, 24771: 11.0, 34482: 14.0},
 'Coût de la catégorie Coût pédagogique': {40219: 0.0,
  16820: 0.0,
  24771: 0.0,
  34482: 0.0},
 'Coût de la catégorie Coût logistique': {40219: 0.0,
  16820: 0.0,
  24771: 0.0,
  34482: 0.0},

I started by splitting the data after removing some unnecessary columns:

(my target variable is in column 15)

df_training = filtered_training.sample(frac=0.8, random_state=42) 
df_test = filtered_training.drop(df_training.index)
X_train = df_training.iloc[:,:14]
y_train = df_training.iloc[:,15]
X_test = df_test.iloc[:,:14]
y_test = df_test.iloc[:,15]

When building the model with:

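from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
predicted_categories = model.predict(X_test)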

I receive the following error when executing model.fit(X_train, y_train):

ValueError: Found input variables with inconsistent numbers of samples: [14, 35478]

Additional information that may be helpful:

np.shape(X_train) #(35478, 14)
np.shape(y_train) #(35478,)
np.shape(X_test) #(8870, 14)
np.shape(y_test) #(8870,)

Solution

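I think the main problem is that TfidfVectorizer only works on one-dimensional text data, i.e. an iterable of strings (as far as I can tell). When it is handed a DataFrame with several columns, it iterates over the DataFrame, and iterating over a DataFrame yields its column names rather than its rows. So it ends up vectorizing the 14 column names, and that 14 is then compared with the 35478 labels in the error message.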
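To see this behaviour in isolation, here is a minimal sketch on a made-up frame (the `toy` data and its column names are assumptions for illustration, not the asker's data). It only shows that the vectorizer produces one row per column label:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy frame: 3 rows, 2 text columns.
toy = pd.DataFrame({
    "title": ["format oper scad", "sort conflit", "prepar habilit electr"],
    "theme": ["reseau electricit", "ressourc humain", "prevent risqu"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
print(list(toy))

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(toy)

# One "document" per column label, so the matrix has 2 rows, not 3.
print(matrix.shape)
# The learned vocabulary consists of the column labels themselves.
print(sorted(vectorizer.vocabulary_))
```

With 14 columns and 35478 rows, the same mechanism would give 14 "samples" on the X side, which matches the numbers in the error.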

In your case I see two ways to solve this problem:

If you want to apply a separate TfidfVectorizer to each column, you could do it like this, for example:

from sklearn.compose import ColumnTransformer

# one TfidfVectorizer per column; make sure that all columns contain text data
column_transformer = ColumnTransformer([(x, TfidfVectorizer(), x) for x in X_train.columns])
model = make_pipeline(column_transformer, MultinomialNB())
model.fit(X_train, y_train)
predicted_categories = model.predict(X_test)
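A self-contained sketch of the per-column approach on toy data may help to check the idea before running it on the full frame. The frame `toy_X`, the labels `toy_y` and the `tfidf_` name prefix are made up for illustration. Note that each column is selected with a plain string rather than a list, so the vectorizer receives a one-dimensional column:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: two text columns and one label per row.
toy_X = pd.DataFrame({
    "title": ["format oper scad", "sort conflit facon",
              "prepar habilit electr", "format reseau electr"],
    "theme": ["reseau electricit", "ressourc humain",
              "prevent risqu electr", "reseau electricit"],
})
toy_y = pd.Series(["tech", "hr", "safety", "tech"])

# One vectorizer (and one vocabulary) per column, selected by name.
per_column = ColumnTransformer(
    [(f"tfidf_{name}", TfidfVectorizer(), name) for name in toy_X.columns]
)
toy_model = make_pipeline(per_column, MultinomialNB())
toy_model.fit(toy_X, toy_y)

# The feature matrix now has one row per sample, as MultinomialNB expects.
features = toy_model[:-1].transform(toy_X)
print(features.shape[0])

toy_predictions = toy_model.predict(toy_X)
print(toy_predictions)
```

The per-column vocabularies are stacked side by side, so the same word in two different columns becomes two different features.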

But if you want a single vocabulary shared across all of your columns, then I would recommend concatenating the columns into one text column first, like this:

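# cast to str so that numeric columns and NaN values can be concatenated
nex_X_train = X_train.iloc[:, 0].astype(str)
for x in X_train.columns[1:]:
    nex_X_train = nex_X_train + ' ' + X_train[x].astype(str)

nex_X_test = X_test.iloc[:, 0].astype(str)
for x in X_test.columns[1:]:
    nex_X_test = nex_X_test + ' ' + X_test[x].astype(str)
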
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(nex_X_train, y_train)
predicted_categories = model.predict(nex_X_test)
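The concatenation can also be wrapped in a small helper so that train and test go through exactly the same step. This is only a sketch: `join_text_columns` is a hypothetical name, the toy frame is made up, and casting with `astype(str)` is an assumption that turns missing values into the literal token 'nan', as in the sample shown in the question:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def join_text_columns(frame):
    """Return one string per row: all columns cast to str and space-joined."""
    return frame.astype(str).agg(" ".join, axis=1)


# Hypothetical toy data with one text column and one numeric column.
toy_frame = pd.DataFrame({
    "title": ["format oper scad", "sort conflit facon", "prepar habilit electr"],
    "hours": [14.0, 7.0, 11.0],
})
toy_labels = ["tech", "hr", "safety"]

joined = join_text_columns(toy_frame)
print(joined.iloc[0])  # format oper scad 14.0

# A single vocabulary is learned over the joined text.
shared_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
shared_model.fit(joined, toy_labels)
print(shared_model.predict(join_text_columns(toy_frame.head(1))))
```

Whether numeric columns should be part of the text at all is a separate modelling choice; dropping them before joining is equally valid.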