python classification lazypredict

Why am I seeing an IndexError in this Python script?


Input file (full.regular.txt):

#ResidueNoInEachProtein,Residue,TrueLabel,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9
   0 GLN C   0.000   0.000   0.000 1 1 1 1  1 0
   1 THR E   7.057  10.394   0.000 1 1 1 1  1 0
   2 VAL E   6.710   9.449  13.140 0 0 0 0  1 0
   3 PRO E   6.552   9.752  12.974 0 0 0 0  0 0
   4 SER C   6.544   7.584  11.239 0 0 0 0  0 0
   5 SER C   5.407   5.140   5.159 0 0 0 0  0 0
   6 ASP C   5.485   7.378   5.152 0 0 0 0  0 0
   7 GLY C   5.723   9.048   9.571 0 0 0 1  1 0
   8 THR C   6.347   9.102  10.812 0 0 0 2  2 0
   9 PRO E   6.219   9.620  12.486 0 1 1 3  4 0
  10 ILE E   6.412   9.721  12.781 0 0 0 3  4 0
  11 ALA E   6.603  10.294  13.140 0 1 1 2  3 0
  12 PHE E   7.219  10.586  13.126 0 0 0 2  2 0
  13 GLU E   6.939  10.295  13.972 0 0 0 0  1 0
  14 ARG E   6.814  10.472  13.764 0 0 0 0  0 0
  15 SER E   7.061   9.189  12.947 0 0 0 0  0 0
  16 GLY E   6.872   9.856  11.521 0 0 0 0  0 0
  17 SER C   6.988   9.388  11.337 0 0 0 0  0 0
  18 GLY C   6.903   7.889   9.055 0 0 0 0  0 0

Script:

import pandas as pd
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyClassifier

# Load the data from full.regular.txt
data = pd.read_csv('full.regular.txt', delim_whitespace=True)

# Verify the data structure
print("Data preview:")
print(data.head())
print("\nData columns:")
print(data.columns)

# Assuming columns are correctly read, let’s inspect their index positions
print("\nColumn positions and data types:")
print(data.dtypes)

# Extract the true label and features based on given specifications
# Column 3 is index 2 and columns 4 to 12 are index 3 to 11
try:
    y = data.iloc[:, 2]
    X = data.iloc[:, 3:11]
except IndexError as e:
    print(f"IndexError: {e}")
    print("The dataset does not have the expected number of columns.")
    print("Please check the dataset and ensure it matches the expected format.")

# Verify the shapes of X and y if the extraction was successful
if 'X' in locals() and 'y' in locals():
    print("\nFeatures shape:", X.shape)
    print("Labels shape:", y.shape)

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initialize and run LazyClassifier
    clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
    train, test = clf.fit(X_train, X_test, y_train, y_test)

    # Display the results
    print("\nTraining set evaluation:")
    print(train)
    print("\nTest set evaluation:")
    print(test)

Output:

Data preview:
                                    #ResidueNoInEachProtein,Residue,TrueLabel,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9
0 GLN C 0.00 0.00  0.00  1 1 1 1 1                                                  0                                                                         
1 THR E 7.06 10.39 0.00  1 1 1 1 1                                                  0                                                                         
2 VAL E 6.71 9.45  13.14 0 0 0 0 1                                                  0                                                                         
3 PRO E 6.55 9.75  12.97 0 0 0 0 0                                                  0                                                                         
4 SER C 6.54 7.58  11.24 0 0 0 0 0                                                  0                                                                         

Data columns:
Index(['#ResidueNoInEachProtein,Residue,TrueLabel,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9'], dtype='object')

Column positions and data types:
#ResidueNoInEachProtein,Residue,TrueLabel,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9    int64
dtype: object
IndexError: single positional indexer is out-of-bounds
The dataset does not have the expected number of columns.
Please check the dataset and ensure it matches the expected format.

Features shape: (1079134, 8)
Labels shape: (1079134,)
100%|██████████| 29/29 [00:00<00:00, 3085.30it/s]

Training set evaluation:
Empty DataFrame
Columns: [Accuracy, Balanced Accuracy, ROC AUC, F1 Score, Time Taken]
Index: []

Test set evaluation:
Empty DataFrame
Columns: [Accuracy, Balanced Accuracy, ROC AUC, F1 Score, Time Taken]
Index: []

I don't understand what I am doing wrong.



Solution

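  • The header line is comma-separated, but the data rows are whitespace-separated. With delim_whitespace=True, pandas reads the whole header as a single column name and uses the extra leading fields of each row as the index. The resulting frame has only one column, so data.iloc[:, 2] is out of bounds. Skip the header line and define the column names explicitly: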

    column_names = [
        'ResidueNoInEachProtein', 'Residue', 'TrueLabel', 'Feature1', 'Feature2',
        'Feature3', 'Feature4', 'Feature5', 'Feature6', 'Feature7', 'Feature8', 'Feature9'
    ]
    
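    # Skip the comma-separated header line and supply the column names directly.
    # sep=r'\s+' splits on whitespace like delim_whitespace=True, which is
    # deprecated in recent pandas releases.
    data = pd.read_csv('full.regular.txt', sep=r'\s+', names=column_names, skiprows=1)

    With the columns parsed correctly, data.iloc[:, 2] selects TrueLabel. Note that data.iloc[:, 3:11] selects only eight of the nine feature columns; use data.iloc[:, 3:12] to include Feature9. The shapes printed after the error probably come from X and y left over from an earlier run in the same session, since neither is assigned when the try block fails.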
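The behaviour described above can be reproduced on a small in-memory sample. This is a minimal sketch, not the asker's actual pipeline: the three-row `sample` string is a stand-in for full.regular.txt, and deriving `column_names` from the header line is an illustrative choice rather than part of the original answer.

```python
import io

import pandas as pd

# Stand-in for full.regular.txt (assumption): a comma-separated header
# followed by whitespace-separated rows, as in the question.
sample = """#ResidueNoInEachProtein,Residue,TrueLabel,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9
   0 GLN C   0.000   0.000   0.000 1 1 1 1  1 0
   1 THR E   7.057  10.394   0.000 1 1 1 1  1 0
   2 VAL E   6.710   9.449  13.140 0 0 0 0  1 0
"""

# Original approach: the header contains no whitespace, so it is parsed as
# one column name and the surplus leading fields of each row become the index.
broken = pd.read_csv(io.StringIO(sample), sep=r"\s+")
print("Columns seen by the original call:", broken.shape[1])

# Fix: take the names from the header line (dropping the leading '#'),
# skip that line when parsing, and pass the names explicitly.
column_names = sample.splitlines()[0].lstrip("#").split(",")
fixed = pd.read_csv(
    io.StringIO(sample), sep=r"\s+", names=column_names, skiprows=1
)

# Column 3 (index 2) is the label; columns 4 to 12 (index 3 to 11) are
# the features, so the slice must end at 12 to include Feature9.
y = fixed.iloc[:, 2]
X = fixed.iloc[:, 3:12]

print("Features shape:", X.shape)
print("Labels shape:", y.shape)
```

For the real file, replace `io.StringIO(sample)` with the path 'full.regular.txt'; the rest of the script (train_test_split and LazyClassifier) can then run unchanged on `X` and `y`.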