pythonnumpynumpy-ufunc

Creating a matrix mapping of Excel cells in Python


i am trying to create a matrix mapping of excel cells in Python starting from A1 to ZZ1500000. I tried using numpy matmul but it does not work for char vectors. I had to resort using python lists which are super slow. Is there a more elegant solution to this problem? Here is my code:

row_nums = np.arange(1,200000,1)
row_nums = list(map(str, row_nums))
col_nums = []
set_char_l = []
alphabets = list(string.ascii_uppercase)
#set_char = 'A'
for i in range(1, 703, 26):
    if i <= 26:
        for item in alphabets:
            col_nums.append(item)
    elif (i // 26) < 27:
        set_char = round(i // 26) - 1
        new_char = alphabets[set_char]
        set_char_l.append(new_char)
        for item in alphabets:
            col_nums.append(new_char + item)

cell_nums = []
for col in col_nums:
    for row in row_nums:
        cell_no = col + row
        cell_nums.append(cell_no)
cell_nums = np.array(cell_nums)
cell_nums = cell_nums.reshape(len(row_nums), len(col_nums))
cell_nums = np.matmul(row_nums.reshape(-1,1), col_nums)

UFuncTypeError: ufunc 'matmul' did not contain a loop with signature matching types (dtype('<U6'), dtype('<U2')) -> None


Solution

  • With a reduced row range, your lists look like:

    In [118]: row_nums[:10]
    Out[118]: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
    
    In [119]: col_nums[:10]
    Out[119]: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
    

    And cell_nums produces an array of strings, joining elements from the rows and columns:

    In [120]: cell_nums
    Out[120]: 
    array([['A1', 'A2', 'A3', ..., 'AK16', 'AK17', 'AK18'],
           ['AK19', 'AL1', 'AL2', ..., 'BV15', 'BV16', 'BV17'],
           ['BV18', 'BV19', 'BW1', ..., 'DG14', 'DG15', 'DG16'],
           ...,
           ['VT4', 'VT5', 'VT6', ..., 'XD19', 'XE1', 'XE2'],
           ['XE3', 'XE4', 'XE5', ..., 'YO18', 'YO19', 'YP1'],
           ['YP2', 'YP3', 'YP4', ..., 'ZZ17', 'ZZ18', 'ZZ19']], dtype='<U4')
    

    And the evaluation time:

    In [121]: %%timeit
         ...: cell_nums = []
         ...: for col in col_nums:
         ...:     for row in row_nums:
         ...:         cell_no = col + row
         ...:         cell_nums.append(cell_no)
         ...: cell_nums = np.array(cell_nums)
         ...: cell_nums = cell_nums.reshape(len(row_nums), len(col_nums))
         ...: 
         ...: 
    8.79 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [122]: cell_nums.shape
    Out[122]: (19, 702)
    

    matmul is a numeric matrix multiplication, not simply an "outer product". It's elementwise multiplication followed by a sum reduction. 'multliplication' isn't defined for strings. Strings can sum, by concatenatenation. np.array(col_nums) makes a U2 dtype array, which doesn't have that kind of sum.

    But if we make object dtype arrays, the elements are python strings, and they do have this sum/join:

    In [123]: x=np.array(col_nums,object)[:,None]+np.array(row_nums,object); x.shape
    Out[123]: (702, 19)
    
    In [124]: x
    Out[124]: 
    array([['A1', 'A2', 'A3', ..., 'A17', 'A18', 'A19'],
           ['B1', 'B2', 'B3', ..., 'B17', 'B18', 'B19'],
           ['C1', 'C2', 'C3', ..., 'C17', 'C18', 'C19'],
           ...,
           ['ZX1', 'ZX2', 'ZX3', ..., 'ZX17', 'ZX18', 'ZX19'],
           ['ZY1', 'ZY2', 'ZY3', ..., 'ZY17', 'ZY18', 'ZY19'],
           ['ZZ1', 'ZZ2', 'ZZ3', ..., 'ZZ17', 'ZZ18', 'ZZ19']], dtype=object)
    

    While I can easily match the cell_nums shape, I haven't taken the time to duplicate the ordering.

    edit

    They do match if I correct your cell_nums reshape.

    But times are modestly faster than your list version (5x):

    In [125]: timeit x=np.array(col_nums,object)[:,None]+np.array(row_nums,object)
    1.69 ms ± 4.74 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    

    This is nothing like the speed up of a numeric matrix multiply. It still has to iterate and apply the string 'add' to each pair.