pythonpandas

Pandas - only select rows containing a substring in a column


I'm using something similar to this as input.txt

header
 040525 $$$$$   9999         12345
   random stuff
 040525 $$$$$   8888         12345
 040525 $$$$$   7777         12345
           random stuff
 040525 $$$$$   6666         12345
footer

Due to the way this input is being pre-processed, I cannot correctly use pd.read_csv. I must first create a list from the input; Then, create a DataFrame from the list.

data_list = []
with open('input.txt', 'r') as data:
    for line in data:
        data_list.append(line.strip().split()) 
df = pd.DataFrame(data_list)

I only want to append lines that contain '$$$' in the second column. Desired output would be:

       0      1     2      3
0  40525  $$$$$  9999  12345
1  40525  $$$$$  8888  12345
2  40525  $$$$$  7777  12345
3  40525  $$$$$  6666  12345

Solution

  • data_list = []
    with open('input.txt', 'r') as data:
        for line in data:
             #create split_row to check
             split_row = line.strip().split()
             #check if the second substring in split_row starts with "*****"
             if split_row[1].startswith("*****"): 
                 data_list.append(split_row) 
    df = pd.DataFrame(data_list)