I’m encountering a ValueError while attempting to combine numerical, categorical, and image features into a single feature set for a machine learning model. I have followed the steps for feature extraction and preprocessing but am still facing issues.
Here’s a summary of what I’m trying to do:
1. Load and preprocess numerical and categorical features.
2. Extract and preprocess image features using a pre-trained CNN model.
3. Combine these features into a single dataset.
Code and Error:
# Features and target variable
X = data[["ID",'Thinckness', 'Weight', 'Surface', 'Color', 'Transparence']]
y = data['Material']
import numpy as np
from PIL import Image

# Image loading function
def load_image(image_id, base_path='Camera2/front'):
    # Replace base_path with the path to your images directory
    image_path = f"{base_path}/{image_id}.jpeg"
    try:
        with Image.open(image_path) as img:
            img = img.resize((128, 128))  # Resize image
            return np.array(img)
    except FileNotFoundError:
        # Return a placeholder image filled with NaN values
        print(f"Image file not found: {image_path}")
        return np.full((128, 128, 3), np.nan)
# Load images
images = np.array([load_image(image_id) for image_id in data['ID']])
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
# Load a pre-trained CNN model for feature extraction
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(128, 128, 3))
model = Model(inputs=base_model.input, outputs=base_model.output)
def extract_features_from_images(images):
    features = []
    for img in images:
        if np.isnan(img).any():  # Check if the image contains NaN values
            features.append(np.zeros((4 * 4 * 512,)))  # Zero-filled vector as placeholder
        else:
            img = preprocess_input(img)
            img = np.expand_dims(img, axis=0)
            feature = model.predict(img)
            features.append(feature.flatten())
    return np.array(features)
# Extract image features
image_features = extract_features_from_images(images)
if image_features.ndim == 3:
    # Flatten the image features to 2D: [n_samples, height * width * channels]
    image_features = image_features.reshape(image_features.shape[0], -1)
# Define numerical and categorical features
numeric_features = ['Thinckness', 'Weight', 'Surface']
categorical_features = ['Color', 'Transparence']
# Preprocessing for numerical data
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Handle missing values
    ('scaler', StandardScaler())  # Normalize numerical data
])
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing values
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical data
])
# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
numeric_categorical_features = preprocessor.fit_transform(data[numeric_features + categorical_features])
# Combine numerical, categorical, and image features
combined_features = np.hstack([
    numeric_categorical_features,
    image_features
])
Shape:
Numeric/Categorical Features Shape: (1099, 19)
Image Features Shape: (1099, 8192)
Error:
ValueError Traceback (most recent call last)
Cell In[103], line 2
1 # Combine numerical, categorical, and image features
----> 2 combined_features = np.hstack([
3 numeric_categorical_features,
4 image_features
5 ])
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 2 dimension(s)
I ran your pipeline with a debugger and found that the OneHotEncoder produces a scipy.sparse.csr_matrix by default. The ColumnTransformer has a parameter sparse_threshold (default: 0.3), which lets it also output a sparse matrix if the overall density of the stacked output is lower than that value. This led numeric_categorical_features to be a sparse matrix. Apparently, np.hstack can't stack a SciPy sparse matrix together with a NumPy array: it seems to treat the sparse matrix as a 1-dimensional object array, which matches the "1 dimension(s)" in the error message. To fix this, you have at least 2 options.
1. Set the OneHotEncoder output directly to non-sparse (before scikit-learn v1.2 the parameter is called sparse, not sparse_output):

    OneHotEncoder(handle_unknown='ignore', sparse_output=False)

2. Set the ColumnTransformer to always return dense output (i.e. NumPy arrays), regardless of whether the transformer outputs are sparse:

    ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ], sparse_threshold=0.0)
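To check the diagnosis on something smaller than the full pipeline, here is a minimal sketch on made-up data. The toy DataFrame, its columns `weight` and `color`, the helper `make_preprocessor`, and the random `image_features` stand-in are all assumptions for illustration, not your actual data or code. It compares the default `sparse_threshold` with `sparse_threshold=0.0`, and also shows a further alternative not listed above: converting the sparse result explicitly with `.toarray()` before stacking.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the tabular data (hypothetical columns, not the real dataset).
n_samples = 20
toy = pd.DataFrame({
    "weight": np.arange(n_samples, dtype=float),
    # 10 distinct categories, so the one-hot block is mostly zeros
    "color": [f"c{i % 10}" for i in range(n_samples)],
})

# Random stand-in for the CNN features; the width of 8 is arbitrary.
image_features = np.random.default_rng(0).random((n_samples, 8))


def make_preprocessor(sparse_threshold):
    """Build a small ColumnTransformer with the given sparse_threshold."""
    return ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), ["weight"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
        ],
        sparse_threshold=sparse_threshold,
    )


# Default threshold (0.3): the low-density output stays a SciPy sparse matrix.
sparse_out = make_preprocessor(0.3).fit_transform(toy)
print(type(sparse_out), sparse.issparse(sparse_out))

# sparse_threshold=0.0: the output is a dense NumPy array that np.hstack accepts.
dense_out = make_preprocessor(0.0).fit_transform(toy)
combined = np.hstack([dense_out, image_features])
print(type(dense_out), combined.shape)

# Alternative: keep the default threshold and densify explicitly before stacking.
combined_alt = np.hstack([sparse_out.toarray(), image_features])
```

In your own code, printing `type(numeric_categorical_features)` or `scipy.sparse.issparse(numeric_categorical_features)` right before the `np.hstack` call should confirm whether this is what you are hitting. With 8192 dense image columns, a dense result is probably what you want anyway, so either listed option or `.toarray()` should do.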