pyspark transpose matrix-multiplication

PySpark DenseMatrix (from mllib.linalg) transpose


Hi, I am trying to take the transpose of a DenseMatrix and multiply a RowMatrix by it.

I have a DenseMatrix V and a RowMatrix U.

I tried to implement the function below to create the transpose:

def dense_T(dense_mat):
    if str(type(dense_mat)) == "<class 'pyspark.mllib.linalg.DenseMatrix'>":
        t_mat = DenseMatrix(dense_mat.numRows, dense_mat.numCols, dense_mat.values, isTransposed=True)
    else:
        print("input is not a dense matrix")
    return t_mat

But when I do

V_trans = dense_T(V)
U.multiply(V_trans)

I still get a dimension mismatch, and V and V_trans have the same dimensions. (From the documentation, isTransposed=True is not supposed to change the dimensions anyway, only make the matrix be treated as transposed in calculations, which multiply() does not seem to do.) One workaround seems to be converting the matrix to numpy, or using a loop to build a new value list and indexing it back into a transposed matrix, like below:

transposed_values = [values[j*num_rows + i] for i in range(num_rows) for j in range(num_cols)]

But for scalability, I would rather not use numpy (from what I have read, numpy defeats the purpose of distributed computing) or loop through each value. What are my options? Also, why does Spark not provide such a common operation as an easy-to-use built-in method?


Solution

  • This is one option to perform a DenseMatrix() transpose:

    from pyspark.mllib.linalg import DenseMatrix
    from pyspark.mllib.linalg.distributed import RowMatrix
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
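    def _transpose_a(dmat):
        # DenseMatrix stores its values in column-major order, so entry (i, j)
        # sits at index j * numRows + i. Reading the entries row by row yields
        # the column-major values of the transpose.
        res = [dmat.values[j * dmat.numRows + i]
               for i in range(dmat.numRows) for j in range(dmat.numCols)]
        # The transpose has the row and column counts swapped.
        return DenseMatrix(dmat.numCols, dmat.numRows, res)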
    
    
    V = DenseMatrix(3, 2, [1, 2, 3, 4, 5, 6])
    U = RowMatrix(spark.sparkContext.parallelize([[1, 2], [3, 4], [5, 6]]))
    
    
    V_T1 = _transpose_a(V)
    print(U.multiply(V_T1).rows.collect())
    
    

    Prints

    [DenseVector([9.0, 12.0, 15.0]), DenseVector([19.0, 26.0, 33.0]), DenseVector([29.0, 40.0, 51.0])]
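
To see why the dense_T attempt in the question keeps the original shape, it may help to look at the index arithmetic without Spark. The sketch below is plain Python. The helper names (entry_col_major, entry_row_major, to_rows) are made up for illustration and are not part of PySpark. It assumes that isTransposed=True means "read the same values row by row", which is how the DenseMatrix documentation describes it.

```python
# Pure-Python sketch of how one flat value list maps to matrix entries.
# The helpers below are hypothetical illustrations, not PySpark API.

def entry_col_major(values, num_rows, i, j):
    """Entry (i, j) when values are stored column by column."""
    return values[j * num_rows + i]


def entry_row_major(values, num_cols, i, j):
    """Entry (i, j) when values are stored row by row."""
    return values[i * num_cols + j]


def to_rows(entry, n_rows, n_cols):
    """Materialize an n_rows x n_cols matrix as a list of rows."""
    return [[entry(i, j) for j in range(n_cols)] for i in range(n_rows)]


values = [1, 2, 3, 4, 5, 6]

# V from the answer: 3 x 2, column-major.
V = to_rows(lambda i, j: entry_col_major(values, 3, i, j), 3, 2)

# Same 3 x 2 shape, but values read row by row. This mirrors what the
# question's dense_T describes: a different 3 x 2 matrix, not the transpose.
same_shape = to_rows(lambda i, j: entry_row_major(values, 2, i, j), 3, 2)

# Swapped 2 x 3 shape, values read row by row: this is the transpose of V.
swapped = to_rows(lambda i, j: entry_row_major(values, 3, i, j), 2, 3)

transpose_of_V = [list(col) for col in zip(*V)]

print(V)                          # [[1, 4], [2, 5], [3, 6]]
print(same_shape)                 # [[1, 2], [3, 4], [5, 6]]
print(swapped == transpose_of_V)  # True
```

By this arithmetic, swapping the dimensions as well, i.e. DenseMatrix(dmat.numCols, dmat.numRows, dmat.values, isTransposed=True), should describe the same matrix as _transpose_a without copying the values. Whether RowMatrix.multiply honors that flag is an assumption I have not verified here, so check the result against _transpose_a on your Spark version before relying on it.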