Tags: python, arrays, performance, matrix, processing-efficiency

Multiple keywords return multiple separate index arrays


I have a very large matrix (70k x 700k) of numeric values; names holds the column names for the matrix.

I want to be able to calculate the row average for each keyword.

For example, the code below calculates the row averages for the keyword 'Heart' only.

names = ["heart", "braininjured", "heartarmleg"]
matrix = [[1, 2, 3], 
          [4, 5, 6], 
          [7, 8, 9]]

indices = [idx for idx, string in enumerate(names) if ('Heart'.lower()) in string.lower()]
print(indices)

avgs_list = []
for row in matrix:
    # Average only the columns whose name matched the keyword.
    row_sum = sum(row[col] for col in indices)
    avgs_list.append(row_sum / len(indices))
print(row_sum)  # sum for the last row only
print(avgs_list)

Result with keyword: 'Heart' is [2.0, 5.0, 8.0]

The desired end result is [[2.0, 5.0, 8.0], [brainAvgRow1, brainAvgRow2, brainAvgRow3], [armAvgRow1, armAvgRow2, armAvgRow3]]

To handle multiple keywords, both the index lookup and the matrix loop need another loop wrapped around them, i.e.:

keywords = ["Heart", "Brain", "Arm"]
key_idx_lists = []
for keyword in keywords: 
   indices = [idx for idx, string in enumerate(names) if (keyword.lower()) in string.lower()]
   key_idx_lists.append(indices)

The concerns are:

  1. Looping over the names array once for each keyword

  2. Looping through the matrix to get the row average for each keyword (after we have the indexes of the matching columns). The runtime becomes quite long.

I was thinking of some way to avoid looping over the matrix again and again, i.e. for every element in the matrix, check whether its column appears in key_idx_lists and keep a running sum per keyword (to eventually compute an average).

I wasn't able to work it out, though, so I would appreciate it if you could point me in the right direction. I have tried searching on Stack Overflow but might not have used the right search terms, as it usually comes up with "multiple substrings return a single index array," which isn't quite what I want.

Thank you for your help in advance.

Update: Edited with Swifty's suggestions


Solution

  • Here's my code:

    names = ["heart", "braininjured", "heartarmleg"]
    names_dict = {name: i for i, name in enumerate(names)}
    
    print(names_dict) # for testing
    
    matrix = [[1, 2, 3], 
              [4, 5, 6], 
              [7, 8, 9]]
    
    keywords = ["Heart", "Brain", "Arm"]
    
    # Column indices per keyword; both sides are lowercased so the match is case-insensitive.
    indices = {kw: {names_dict[name] for name in names_dict if kw.lower() in name.lower()} for kw in keywords}
    
    print(indices)  # for testing
    
    averages = {kw:[] for kw in keywords}
    
    for row in matrix:
        for kw in keywords:
            # Mean of this row over the columns matched by kw.
            averages[kw].append(sum(row[col] for col in indices[kw]) / len(indices[kw]))

    print(averages)
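
As a further sketch (not part of the code above, and not benchmarked), the running-sum idea from the question can be written so that the names list and each matrix row are traversed only once. The names col_to_kw, counts and sums are made up for illustration, and returning None for a keyword that matches no column is an assumption about the desired behaviour. The number of substring tests is the same as before; only the number of passes over the data changes.

```python
names = ["heart", "braininjured", "heartarmleg"]
matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
keywords = ["Heart", "Brain", "Arm"]
lowered = [kw.lower() for kw in keywords]

# One pass over names: for each column, the positions (in `keywords`)
# of every keyword that occurs in that column's name.
col_to_kw = [[k for k, kw in enumerate(lowered) if kw in name.lower()]
             for name in names]

# Number of matching columns per keyword (the divisor of each average).
counts = [0] * len(keywords)
for kw_positions in col_to_kw:
    for k in kw_positions:
        counts[k] += 1

# One pass over the matrix: each element is added to the running sum
# of every keyword whose columns include it.
averages = [[] for _ in keywords]
for row in matrix:
    sums = [0] * len(keywords)
    for value, kw_positions in zip(row, col_to_kw):
        for k in kw_positions:
            sums[k] += value
    for k in range(len(keywords)):
        # Assumption: None marks a keyword that matched no column.
        averages[k].append(sums[k] / counts[k] if counts[k] else None)

print(averages)  # [[2.0, 5.0, 8.0], [2.0, 5.0, 8.0], [3.0, 6.0, 9.0]]
```

The result holds one inner list per keyword, in the same order as keywords, which is the nested-list shape asked for in the question. For a matrix of the size mentioned there, pure Python loops are likely to remain slow, so this is meant to show the bookkeeping rather than to be a tuned solution.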