I read a CSV file into a DataFrame, passed it to train_test_split, and then used MinMaxScaler to scale the resulting data. Now I want to get the number of rows and columns, but I can't.
df=pd.read_csv("cancer_classification.csv")
from sklearn.model_selection import train_test_split
X = df.drop("benign_0__mal_1",axis=1).values
y = df["benign_0__mal_1"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit(X_train)
X_test = scaler.fit(X_test)
X_train.shape
This throws the following error:
AttributeError Traceback (most recent call last)
in ()
----> 1 X_train.shape
AttributeError: 'MinMaxScaler' object has no attribute 'shape'
I read the documentation and was able to find the number of rows using scale_, but not the number of columns. This is what the answer should look like, but I was not able to find an attribute that helps.
MinMaxScaler is an object that can fit itself to certain data and also transform that data. There are three methods:
- The fit method fits the scaler's parameters to the data. It then returns the MinMaxScaler object.
- The transform method transforms data based on the scaler's fitted parameters. It then returns the transformed data.
- The fit_transform method first fits the scaler to the data, then transforms it and returns the transformed version of the data.
In your example, you are treating the MinMaxScaler object itself as the data! (see 1st bullet point)
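To make the difference in return values concrete, here is a minimal sketch on a small made-up array (the toy data and variable names are assumptions for illustration, not taken from the question's CSV):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for X_train: 4 rows, 2 columns (made up for illustration).
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

scaler = MinMaxScaler()

fitted = scaler.fit(X)           # fit returns the scaler itself, not data
print(type(fitted).__name__)     # MinMaxScaler
print(fitted is scaler)          # True

X_scaled = scaler.transform(X)   # transform returns a NumPy array
print(X_scaled.shape)            # (4, 2)
```

Assigning the result of fit back to X_train therefore replaces the array with the scaler object, which is why .shape is no longer available.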
The same MinMaxScaler shouldn't be fitted twice on different datasets, since each fit overwrites its internal values. You should also never fit a MinMaxScaler on the test dataset, since that leaks test data into your model. What you should be doing is calling fit_transform() on the training data and transform() on the test data.
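A minimal sketch of that workflow is below. It uses synthetic data in place of cancer_classification.csv (the random arrays are an assumption so the snippet runs on its own); the scaling steps are the part that matters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the CSV data: 100 rows, 5 feature columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # learn min/max from the training rows only
X_test = scaler.transform(X_test)        # reuse the training min/max, no refit

# Both are still NumPy arrays, so .shape gives (rows, columns).
print(X_train.shape)
print(X_test.shape)
```

With this change X_train and X_test stay arrays, so X_train.shape works as expected.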
The answer here may also help with this explanation: fit-transform on training data and transform on test data
When you call StandardScaler.fit(X_train), it calculates the mean and variance of each feature from the values in X_train. Calling .transform() then transforms each feature by subtracting its mean and dividing by its standard deviation (the square root of the variance). For convenience, these two calls can be done in one step using fit_transform().
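As a quick check of what fit() stores and what transform() computes, here is a small sketch on made-up single-feature data (the array is an assumption for illustration). The fitted attributes mean_, var_, and scale_ are part of StandardScaler; scale_ holds the per-feature standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature training data (made up for illustration).
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaler = StandardScaler().fit(X_train)
print(scaler.mean_)   # per-feature mean
print(scaler.var_)    # per-feature variance
print(scaler.scale_)  # per-feature standard deviation, i.e. sqrt(var_)

# transform() is equivalent to (x - mean) / standard deviation.
manual = (X_train - scaler.mean_) / scaler.scale_
print(np.allclose(manual, scaler.transform(X_train)))  # True
```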
The reason you want to fit the scaler using only the training data is that you don't want to bias your model with information from the test data.
If you fit() to your test data, you'd compute a new mean and variance for each feature. In theory these values may be very similar if your test and train sets have the same distribution, but in practice this is typically not the case.
Instead, you want to only transform the test data by using the parameters computed on the training data.
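One consequence worth seeing once: when the test data is transformed with the training parameters, its scaled values are not guaranteed to stay inside the training range. The sketch below uses MinMaxScaler on made-up numbers (both arrays are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: the test set contains values outside the training range.
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[-2.0], [4.0], [12.0]])

scaler = MinMaxScaler().fit(X_train)       # min=0, max=10 learned from training data
X_test_scaled = scaler.transform(X_test)   # scaled with the training min/max

print(scaler.data_min_, scaler.data_max_)
print(X_test_scaled.ravel())               # roughly -0.2, 0.4, 1.2
```

Values below 0 or above 1 in the scaled test set are expected here; they simply reflect test points that lie outside what the scaler saw during training.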