Hi, I am trying to take the transpose of a DenseMatrix and multiply it by a RowMatrix.
I have a DenseMatrix V and a RowMatrix U.
I have tried to implement the function below to create the transpose:
def dense_T(dense_mat):
    if str(type(dense_mat)) == "<class 'pyspark.mllib.linalg.DenseMatrix'>":
        t_mat = DenseMatrix(dense_mat.numRows, dense_mat.numCols, dense_mat.values, isTransposed=True)
    else:
        print("input is not a dense matrix")
    return t_mat
But when I do:
V_trans = dense_T(V)
U.multiply(V_trans)
I still get dimension issues, and V and V_trans have the same dimensions. (From the documentation, isTransposed=True is apparently not supposed to change the dimensions anyway, but the matrix should then be treated as transposed in calculations, which multiply() is not doing...) It seems like there is a way of converting the matrix to numpy, or of looping to build a new value list and indexing it back into a transposed matrix, like below:
transposed_values = [values[j*num_rows + i] for i in range(num_rows) for j in range(num_cols)]
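The numpy route I mean would be something like this (toArray() materializes the whole matrix on the driver as a numpy array, and ravel(order="F") flattens it back to the column-major values DenseMatrix expects):
arr = V.toArray().T  # full local copy of V, transposed
V_trans = DenseMatrix(arr.shape[0], arr.shape[1], arr.ravel(order="F"))  # back to column-major values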
But for scalability reasons I would like to avoid numpy (from what I have read, pulling the data into numpy defeats the purpose of distributed computing) and also avoid looping through each value. What are my options? Also, why does Spark not provide such a common method out of the box? What is the reason?
This is one option to perform a DenseMatrix transpose:
from pyspark.mllib.linalg import DenseMatrix
from pyspark.mllib.linalg.distributed import RowMatrix
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
def _transpose_a(dmat):
    # values is stored column-major: entry (i, j) lives at values[j * numRows + i].
    # Reading the entries row by row of the original yields the column-major
    # layout of the transpose.
    res = [dmat.values[j * dmat.numRows + i]
           for i in range(dmat.numRows) for j in range(dmat.numCols)]
    return DenseMatrix(dmat.numCols, dmat.numRows, res)
V = DenseMatrix(3, 2, [1, 2, 3, 4, 5, 6])
U = RowMatrix(spark.sparkContext.parallelize([[1, 2], [3, 4], [5, 6]]))
V_T1 = _transpose_a(V)
print(U.multiply(V_T1).rows.collect())
# [DenseVector([9.0, 12.0, 15.0]), DenseVector([19.0, 26.0, 33.0]), DenseVector([29.0, 40.0, 51.0])]
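Alternatively, isTransposed=True does give you a transpose without copying any values, but only if you also swap the dimensions, which is what the attempt in the question misses: the column-major values of V are exactly the row-major values of V's transpose. A minimal sketch (assuming your Spark version carries the isTransposed flag through multiply(); you can always verify locally with toArray()):
def _transpose_b(dmat):
    # Swap the dimensions and mark the same value buffer as row-major; no copy is made.
    return DenseMatrix(dmat.numCols, dmat.numRows, dmat.values, isTransposed=True)

V_T2 = _transpose_b(V)
print((V_T2.toArray() == V_T1.toArray()).all())  # True: matches the loop-based transpose
And if V itself is too large to keep on the driver, the fully distributed route is BlockMatrix, which does have a transpose() method. A sketch with the default block sizes (V is small here, so building it from a driver-side copy is fine; a genuinely large V would be built from its distributed source instead):
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Attach indices to U's rows so they survive the conversion to blocks.
U_block = IndexedRowMatrix(
    U.rows.zipWithIndex().map(lambda x: IndexedRow(x[1], x[0]))
).toBlockMatrix()
# Distribute V's rows the same way.
V_block = IndexedRowMatrix(
    spark.sparkContext.parallelize(list(enumerate(V.toArray().tolist())))
    .map(lambda x: IndexedRow(x[0], x[1]))
).toBlockMatrix()
# U (3x2) times V^T (2x3) = 3x3, computed block-wise on the cluster.
print(U_block.multiply(V_block.transpose()).toLocalMatrix())
As for why there is no RowMatrix.transpose(): a RowMatrix carries no row indices, so Spark cannot reassemble the columns of the original as rows of the transpose without the indexing that IndexedRowMatrix and BlockMatrix provide; that is why the distributed transpose lives on BlockMatrix.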