i am trying to create a matrix mapping of excel cells in Python starting from A1 to ZZ1500000. I tried using numpy matmul but it does not work for char vectors. I had to resort using python lists which are super slow. Is there a more elegant solution to this problem? Here is my code:
row_nums = np.arange(1,200000,1)
row_nums = list(map(str, row_nums))
col_nums = []
set_char_l = []
alphabets = list(string.ascii_uppercase)
#set_char = 'A'
for i in range(1, 703, 26):
if i <= 26:
for item in alphabets:
col_nums.append(item)
elif (i // 26) < 27:
set_char = round(i // 26) - 1
new_char = alphabets[set_char]
set_char_l.append(new_char)
for item in alphabets:
col_nums.append(new_char + item)
cell_nums = []
for col in col_nums:
for row in row_nums:
cell_no = col + row
cell_nums.append(cell_no)
cell_nums = np.array(cell_nums)
cell_nums = cell_nums.reshape(len(row_nums), len(col_nums))
cell_nums = np.matmul(row_nums.reshape(-1,1), col_nums)
UFuncTypeError: ufunc 'matmul' did not contain a loop with signature matching types (dtype('<U6'), dtype('<U2')) -> None
With a reduced row
range, your lists look like:
In [118]: row_nums[:10]
Out[118]: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
In [119]: col_nums[:10]
Out[119]: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
And cell_nums
produces an array of strings, joining elements from the rows and columns:
In [120]: cell_nums
Out[120]:
array([['A1', 'A2', 'A3', ..., 'AK16', 'AK17', 'AK18'],
['AK19', 'AL1', 'AL2', ..., 'BV15', 'BV16', 'BV17'],
['BV18', 'BV19', 'BW1', ..., 'DG14', 'DG15', 'DG16'],
...,
['VT4', 'VT5', 'VT6', ..., 'XD19', 'XE1', 'XE2'],
['XE3', 'XE4', 'XE5', ..., 'YO18', 'YO19', 'YP1'],
['YP2', 'YP3', 'YP4', ..., 'ZZ17', 'ZZ18', 'ZZ19']], dtype='<U4')
And the evaluation time:
In [121]: %%timeit
...: cell_nums = []
...: for col in col_nums:
...: for row in row_nums:
...: cell_no = col + row
...: cell_nums.append(cell_no)
...: cell_nums = np.array(cell_nums)
...: cell_nums = cell_nums.reshape(len(row_nums), len(col_nums))
...:
...:
8.79 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [122]: cell_nums.shape
Out[122]: (19, 702)
matmul
is a numeric matrix multiplication, not simply an "outer product". It's elementwise multiplication followed by a sum reduction. 'multliplication' isn't defined for strings. Strings can sum
, by concatenatenation. np.array(col_nums)
makes a U2
dtype array, which doesn't have that kind of sum.
But if we make object
dtype arrays, the elements are python strings, and they do have this sum/join:
In [123]: x=np.array(col_nums,object)[:,None]+np.array(row_nums,object); x.shape
Out[123]: (702, 19)
In [124]: x
Out[124]:
array([['A1', 'A2', 'A3', ..., 'A17', 'A18', 'A19'],
['B1', 'B2', 'B3', ..., 'B17', 'B18', 'B19'],
['C1', 'C2', 'C3', ..., 'C17', 'C18', 'C19'],
...,
['ZX1', 'ZX2', 'ZX3', ..., 'ZX17', 'ZX18', 'ZX19'],
['ZY1', 'ZY2', 'ZY3', ..., 'ZY17', 'ZY18', 'ZY19'],
['ZZ1', 'ZZ2', 'ZZ3', ..., 'ZZ17', 'ZZ18', 'ZZ19']], dtype=object)
While I can easily match the cell_nums
shape, I haven't taken the time to duplicate the ordering.
They do match if I correct your cell_nums
reshape.
But times are modestly faster than your list version (5x):
In [125]: timeit x=np.array(col_nums,object)[:,None]+np.array(row_nums,object)
1.69 ms ± 4.74 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
This is nothing like the speed up of a numeric matrix multiply. It still has to iterate and apply the string 'add' to each pair.