#ResidueNoInEachProtein,Residue,TrueLabel,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9
0 GLN C 0.000 0.000 0.000 1 1 1 1 1 0
1 THR E 7.057 10.394 0.000 1 1 1 1 1 0
2 VAL E 6.710 9.449 13.140 0 0 0 0 1 0
3 PRO E 6.552 9.752 12.974 0 0 0 0 0 0
4 SER C 6.544 7.584 11.239 0 0 0 0 0 0
5 SER C 5.407 5.140 5.159 0 0 0 0 0 0
6 ASP C 5.485 7.378 5.152 0 0 0 0 0 0
7 GLY C 5.723 9.048 9.571 0 0 0 1 1 0
8 THR C 6.347 9.102 10.812 0 0 0 2 2 0
9 PRO E 6.219 9.620 12.486 0 1 1 3 4 0
10 ILE E 6.412 9.721 12.781 0 0 0 3 4 0
11 ALA E 6.603 10.294 13.140 0 1 1 2 3 0
12 PHE E 7.219 10.586 13.126 0 0 0 2 2 0
13 GLU E 6.939 10.295 13.972 0 0 0 0 1 0
14 ARG E 6.814 10.472 13.764 0 0 0 0 0 0
15 SER E 7.061 9.189 12.947 0 0 0 0 0 0
16 GLY E 6.872 9.856 11.521 0 0 0 0 0 0
17 SER C 6.988 9.388 11.337 0 0 0 0 0 0
18 GLY C 6.903 7.889 9.055 0 0 0 0 0 0
import pandas as pd
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyClassifier
# Load the data from full.regular.txt
data = pd.read_csv('full.regular.txt', delim_whitespace=True)
# Verify the data structure
print("Data preview:")
print(data.head())
print("\nData columns:")
print(data.columns)
# Assuming columns are correctly read, let’s inspect their index positions
print("\nColumn positions and data types:")
print(data.dtypes)
# Extract the true label and features based on given specifications
# Column 3 is index 2 and columns 4 to 12 are index 3 to 11
try:
    y = data.iloc[:, 2]
    X = data.iloc[:, 3:11]
except IndexError as e:
    print(f"IndexError: {e}")
    print("The dataset does not have the expected number of columns.")
    print("Please check the dataset and ensure it matches the expected format.")
# Verify the shapes of X and y if the extraction was successful
if 'X' in locals() and 'y' in locals():
    print("\nFeatures shape:", X.shape)
    print("Labels shape:", y.shape)
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Initialize and run LazyClassifier
    clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
    train, test = clf.fit(X_train, X_test, y_train, y_test)
    # Display the results
    print("\nTraining set evaluation:")
    print(train)
    print("\nTest set evaluation:")
    print(test)
Output:
Data preview:
#ResidueNoInEachProtein,Residue,TrueLabel,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9
0 GLN C 0.00 0.00 0.00 1 1 1 1 1 0
1 THR E 7.06 10.39 0.00 1 1 1 1 1 0
2 VAL E 6.71 9.45 13.14 0 0 0 0 1 0
3 PRO E 6.55 9.75 12.97 0 0 0 0 0 0
4 SER C 6.54 7.58 11.24 0 0 0 0 0 0
Data columns:
Index(['#ResidueNoInEachProtein,Residue,TrueLabel,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9'], dtype='object')
Column positions and data types:
#ResidueNoInEachProtein,Residue,TrueLabel,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9 int64
dtype: object
IndexError: single positional indexer is out-of-bounds
The dataset does not have the expected number of columns.
Please check the dataset and ensure it matches the expected format.
Features shape: (1079134, 8)
Labels shape: (1079134,)
100%|██████████| 29/29 [00:00<00:00, 3085.30it/s]
Training set evaluation:
Empty DataFrame
Columns: [Accuracy, Balanced Accuracy, ROC AUC, F1 Score, Time Taken]
Index: []
Test set evaluation:
Empty DataFrame
Columns: [Accuracy, Balanced Accuracy, ROC AUC, F1 Score, Time Taken]
Index: []
I don't understand what I am doing wrong.
Why am I seeing an IndexError in this Python script?
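The header line of your file is comma-separated, while the data rows are whitespace-separated. With delim_whitespace=True, pandas therefore reads the whole header as a single column name (see your "Data columns" output), so data has only one column and data.iloc[:, 2] is out of bounds. You can skip the header line and define the column names explicitly: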
column_names = [
    'ResidueNoInEachProtein', 'Residue', 'TrueLabel', 'Feature1', 'Feature2',
    'Feature3', 'Feature4', 'Feature5', 'Feature6', 'Feature7', 'Feature8', 'Feature9'
]
# Load the data from full.regular.txt with the specified delimiter and column names
data = pd.read_csv('full.regular.txt', delim_whitespace=True, names=column_names, skiprows=1)
This should help.
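As a quick check, here is a minimal sketch that reproduces the problem and the fix on the first few rows from the question. It rests on some assumptions: the rows are loaded from an in-memory string via io.StringIO instead of full.regular.txt, sep=r"\s+" is used as the equivalent of delim_whitespace=True (which is deprecated in recent pandas releases), and the feature slice is widened to 3:12 on the assumption that all nine feature columns are wanted, as the comment "columns 4 to 12" suggests.

```python
import io

import pandas as pd

# First rows of the file from the question: the header is comma-separated,
# the data rows are whitespace-separated.
sample = """\
#ResidueNoInEachProtein,Residue,TrueLabel,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9
0 GLN C 0.000 0.000 0.000 1 1 1 1 1 0
1 THR E 7.057 10.394 0.000 1 1 1 1 1 0
2 VAL E 6.710 9.449 13.140 0 0 0 0 1 0
3 PRO E 6.552 9.752 12.974 0 0 0 0 0 0
"""

column_names = [
    'ResidueNoInEachProtein', 'Residue', 'TrueLabel', 'Feature1', 'Feature2',
    'Feature3', 'Feature4', 'Feature5', 'Feature6', 'Feature7', 'Feature8', 'Feature9'
]

# Original approach: the header contains no whitespace, so it is parsed as
# one column name and the frame does not have the expected 12 columns.
broken = pd.read_csv(io.StringIO(sample), sep=r"\s+")
print("Columns without explicit names:", broken.shape[1])

# Fix from the answer: skip the header line and supply the names directly.
data = pd.read_csv(io.StringIO(sample), sep=r"\s+", names=column_names, skiprows=1)
print("Columns with explicit names:", data.shape[1])

# Label is the third column (index 2); features are indices 3 to 11 inclusive.
# Note that iloc[:, 3:11] would stop at index 10 and drop Feature9.
y = data.iloc[:, 2]
X = data.iloc[:, 3:12]
print("Features shape:", X.shape)
print("Labels shape:", y.shape)
```

Once data.columns lists all twelve names, the try/except around the iloc calls is no longer needed, and X and y can be passed on to train_test_split as before.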