I split the data using train_test_split after preprocessing:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.2,random_state=42)
Then I robust-scaled the numerical columns of the train and test sets separately:
from sklearn.preprocessing import RobustScaler
robust = RobustScaler()
X_train_ = robust.fit_transform(X_train[numeric_columns])
X_test_ = robust.transform(X_test[numeric_columns])
X_train_sc_num=pd.DataFrame(X_train_,columns=[numeric_columns])
X_test_sc_num=pd.DataFrame(X_test_,columns=[numeric_columns])
Then I concatenated the scaled numerical columns back with the categorical columns:
X_train_scaled=pd.concat([X_train_sc_num,X_train[categoric_columns]],axis=1)
X_test_scaled=pd.concat([X_test_sc_num,X_test[categoric_columns]],axis=1)
but the shape got broken and many NaN values were added in the categorical columns of the output data. The shape was (466,17)+(466,11), so it should be (466,28), but it became (560,28).
How can I solve this issue? I want to robust-scale my data after train_test_split, but without touching my OHE (one-hot encoded) columns.
Your issue might be arising from a couple of things:
- You are passing columns=[numeric_columns], which wraps the column names in an extra list, so pandas builds a MultiIndex of column labels instead of plain column names. It should be just columns=numeric_columns.
- RobustScaler returns plain NumPy arrays, so the DataFrames you rebuild from them get a fresh 0..n-1 RangeIndex, while X_train and X_test keep the shuffled row labels from train_test_split. pd.concat(axis=1) aligns on the index, so the mismatched labels produce extra rows filled with NaN. The fix is to pass index=X_train.index (or X_test.index, depending on the case) to the pd.DataFrame() initialization, as shown in the short sketch after this list.
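To see why that happens, here is a minimal, self-contained sketch (the column names and values are made up purely for illustration) of how pd.concat behaves when the row indices don't line up:
import pandas as pd
# After a shuffled split, X_train keeps non-consecutive row labels, e.g. 450, 12, 301
X_train_demo = pd.DataFrame(
    {"temp": [1.0, 2.0, 3.0], "color_green": [1, 0, 1]},
    index=[450, 12, 301],
)
# Rebuilding the scaled numeric part without index= gives it a fresh RangeIndex 0, 1, 2
scaled_no_index = pd.DataFrame(X_train_demo[["temp"]].to_numpy(), columns=["temp"])
# concat aligns on the index: {450, 12, 301} and {0, 1, 2} share no labels,
# so the result has 6 rows with NaNs on both sides
broken = pd.concat([scaled_no_index, X_train_demo[["color_green"]]], axis=1)
print(broken.shape)  # (6, 2)
# Passing index= keeps the rows aligned and no NaNs appear
scaled_with_index = pd.DataFrame(
    X_train_demo[["temp"]].to_numpy(), columns=["temp"], index=X_train_demo.index
)
fixed = pd.concat([scaled_with_index, X_train_demo[["color_green"]]], axis=1)
print(fixed.shape)  # (3, 2)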
Here is a reproducible example, using synthetic data, illustrating the steps you'd need to follow with your own data:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
# Create example data
np.random.seed(42)
n_samples = 466
numerical_data = {"temperature": np.random.normal(14, 3, n_samples), "moisture": np.random.normal(96, 2, n_samples)}
categorical_data = {"color": np.random.choice(["green", "yellow", "purple"], size=n_samples, p=[0.8, 0.1, 0.1])}
# Create DataFrame
df = pd.DataFrame({**numerical_data, **categorical_data})
# Define numeric and categorical columns
numerical_columns = list(numerical_data.keys())
categorical_columns = list(categorical_data.keys())
# One-hot encode categorical columns
df_encoded = pd.get_dummies(df, columns=categorical_columns)
# Split features and target (creating dummy target for example)
y = np.random.randint(0, 2, n_samples)
X = df_encoded
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Get the one-hot encoded column names
# In general, these will be different than categorical_columns, since you're doing OHE
categorical_columns_encoded = [col for col in X_train.columns if col not in numerical_columns]
# Initialize RobustScaler
robust = RobustScaler()
# Scale only numeric columns
X_train_scaled_numeric = robust.fit_transform(X_train[numerical_columns])
X_test_scaled_numeric = robust.transform(X_test[numerical_columns])
# Create DataFrames with correct column names for scaled numeric data
X_train_scaled_numeric_df = pd.DataFrame(
X_train_scaled_numeric,
columns=numerical_columns,
index=X_train.index, # Preserve the index
)
X_test_scaled_numeric_df = pd.DataFrame(
X_test_scaled_numeric,
columns=numerical_columns,
index=X_test.index, # Preserve the index
)
# Concatenate with categorical columns
X_train_scaled = pd.concat([X_train_scaled_numeric_df, X_train[categorical_columns_encoded]], axis=1)
X_test_scaled = pd.concat([X_test_scaled_numeric_df, X_test[categorical_columns_encoded]], axis=1)
# Verify the shapes
print("Original shapes:")
print(f"X_train: {X_train.shape}")
print(f"X_test: {X_test.shape}")
print("\nScaled shapes:")
print("X_train_scaled: {X_train_scaled.shape} = {X_train_scaled_numeric.shape} + {X_train[categorical_columns_encoded].shape}")
print(f"X_test_scaled: {X_test_scaled.shape} = {X_test_scaled_numeric.shape} + {X_test[categorical_columns_encoded].shape}")
# Verify no NaN values
print("\nNaN check:")
print("NaN in X_train_scaled:", X_train_scaled.isna().sum().sum())
print("NaN in X_test_scaled:", X_test_scaled.isna().sum().sum())
That would print:
Original shapes:
X_train: (372, 5)
X_test: (94, 5)
Scaled shapes:
X_train_scaled: (372, 5) = (372, 2) + (372, 3)
X_test_scaled: (94, 5) = (94, 2) + (94, 3)
NaN check:
NaN in X_train_scaled: 0
NaN in X_test_scaled: 0
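As a side note, if you'd rather not manage the indices and the concatenation yourself, scikit-learn's ColumnTransformer can scale only the numeric columns and pass the one-hot encoded ones through untouched. This is just a sketch of that alternative (the pandas output option assumes scikit-learn >= 1.2):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler
# Scale only the numeric columns; leave every other column (the OHE ones) as-is
preprocessor = ColumnTransformer(
    transformers=[("num", RobustScaler(), list(numerical_columns))],
    remainder="passthrough",
    verbose_feature_names_out=False,  # keep the original column names
)
preprocessor.set_output(transform="pandas")  # return DataFrames instead of arrays
X_train_scaled = preprocessor.fit_transform(X_train)  # fit the scaler on train only
X_test_scaled = preprocessor.transform(X_test)        # reuse the same medians/IQRs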