pyspark, statistics, p-value, chi-squared

PySpark p-values and ChiSquareTest correlations


+----------+---------------+--------------------+--------------+-------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------+
|      date|  serial_number|               model|capacity_bytes|failure|smart_1_raw|smart_3_raw|smart_4_raw|smart_5_raw|smart_7_raw|smart_9_raw|smart_10_raw|s
+----------+---------------+--------------------+--------------+-------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------+
|2018-09-23|       ZJV01VV0|       ST12000NM0007|12000138625024|      0|   32985096|          0|  
|2018-09-23|       ZJV01VV5|       ST12000NM0007|12000138625024|      0|   77197496|          0|  
|2018-09-23| PL2331LAH3XLZJ|HGST HMS5C4040BLE640| 4000787030016|      0|          0|          0| 
|2018-09-23|       ZCH0ATJY|       ST12000NM0007|12000138625024|      0|   51954552|          0|  
|2018-09-23|       ZA1816EB|        ST8000NM0055| 8001563222016|      0|  129696704|          0| 
|2018-09-23|       ZA13ZKX8|         ST8000DM002| 8001563222016|      0|   89446512|          0| 
|2018-09-23| PL2331LAHDB5PJ|HGST HMS5C4040BLE640| 4000787030016|      0|          0|        442| 
|2018-09-23|       ZA1816E1|        ST8000NM0055| 8001563222016|      0|    8437320|          0| 
|2018-09-23| PL2331LAH3WM1J|HGST HMS5C4040BLE640| 4000787030016|      0|          0|          0| 
|2018-09-23|       S30108NT|         ST4000DM000| 4000787030016|      0|   11197576|          0| 
|2018-09-23|       ZJV01VVG|       ST12000NM0007|12000138625024|      0|  172268856|          0|  
|2018-09-23|       ZJV01VVM|       ST12000NM0007|12000138625024|      0|  101040904|          0|  
|2018-09-23|       ZA174KPY|        ST8000NM0055| 8001563222016|      0|   50287344|          0| 
|2018-09-23| PL2331LAH3W4XJ|HGST HMS5C4040BLE640| 4000787030016|      0|          0|        530| 
|2018-09-23|       Z4D068HF|         ST6000DX000| 6001175126016|      0|  23293443

I'm supposed to calculate the p-value of the correlation between the smart_194_raw column and the "failure" column. I'm not sure how to go about creating the LabeledPoint, Vectors, etc.


Solution

  • Here's a short step-by-step guide to running a chi-square test and getting the basic statistics for your question.

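    >>> from pyspark.sql import SparkSession
    >>> from pyspark.ml.feature import VectorAssembler
    >>> from pyspark.ml.stat import ChiSquareTest
    
    >>> # Create (or reuse) a session; the pyspark shell already provides `spark`.
    >>> spark = SparkSession.builder.getOrCreate()
    
    >>> # Toy data: a binary label plus three numeric feature columns.
    >>> df = spark.createDataFrame([
        [0, 1.0, 0.71, 0.143],
        [1, 0.0, 0.97, 0.943],
        [0, 0.123, 0.27, 0.443],
        [1, 0.67, 0.3457, 0.243],
        [1, 0.39, 0.7777, 0.143]
    ], ['label', 'col2', 'col3', 'col4'])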
    
    >>> df.show()
    +-----+-----+------+-----+
    |label| col2|  col3| col4|
    +-----+-----+------+-----+
    |    0|  1.0|  0.71|0.143|
    |    1|  0.0|  0.97|0.943|
    |    0|0.123|  0.27|0.443|
    |    1| 0.67|0.3457|0.243|
    |    1| 0.39|0.7777|0.143|
    +-----+-----+------+-----+
    
    
    >>> assembler = VectorAssembler(
        inputCols=['col2', 'col3', 'col4'],
        outputCol="vector_features")
    
    >>> vectorized_df = assembler.transform(df).select('label', 'vector_features')
    
    >>> vectorized_df.show()
    +-----+-------------------+
    |label|    vector_features|
    +-----+-------------------+
    |    0|   [1.0,0.71,0.143]|
    |    1|   [0.0,0.97,0.943]|
    |    0| [0.123,0.27,0.443]|
    |    1|[0.67,0.3457,0.243]|
    |    1|[0.39,0.7777,0.143]|
    +-----+-------------------+
    
    
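    >>> # Pearson's independence test of each feature against the label; every distinct feature value is treated as a category.
    >>> r = ChiSquareTest.test(vectorized_df, "vector_features", "label").head()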
    >>> print("pValues: " + str(r.pValues))
    >>> print("degreesOfFreedom: " + str(r.degreesOfFreedom))
    >>> print("statistics: " + str(r.statistics))
    
    pValues: [0.2872974951836462,0.2872974951836462,0.40465279495160544]
    degreesOfFreedom: [4, 4, 3]
    statistics: [5.0,5.0,2.916666666666667]
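
The degreesOfFreedom values above ([4, 4, 3]) follow from ChiSquareTest treating every distinct feature value as its own category: col2 and col3 each have five distinct values, while col4 has four. As a rough illustration, the pure-Python sketch below recomputes the statistic, degrees of freedom, and p-value for col2 and col4 from the toy data. The helper names `chi_square_independence` and `chi2_sf` are made up for this example and are not part of PySpark; the p-value comes from the standard recurrence for the chi-squared survival function with integer degrees of freedom.

```python
import math
from collections import Counter


def chi_square_independence(feature, label):
    """Pearson chi-square statistic and degrees of freedom for two
    categorical sequences (hypothetical helper, not a PySpark API)."""
    n = len(feature)
    observed = Counter(zip(feature, label))   # contingency-table cells
    feature_totals = Counter(feature)         # row totals
    label_totals = Counter(label)             # column totals
    statistic = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            expected = f_total * l_total / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected
    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


def chi2_sf(x, dof):
    """P(X >= x) for a chi-squared variable with integer dof >= 1
    (hypothetical helper, not a PySpark API)."""
    y = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-y)              # base case: dof = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # base case: dof = 1
    while a < dof / 2.0:
        # Each step raises the degrees of freedom by 2.
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Same toy columns as in the DataFrame above.
label = [0, 1, 0, 1, 1]
col2 = [1.0, 0.0, 0.123, 0.67, 0.39]
col4 = [0.143, 0.943, 0.443, 0.243, 0.143]

stat2, dof2 = chi_square_independence(col2, label)
stat4, dof4 = chi_square_independence(col4, label)
p2 = chi2_sf(stat2, dof2)
p4 = chi2_sf(stat4, dof4)
print(stat2, dof2, p2)
print(stat4, dof4, p4)
```

For a continuous column such as smart_194_raw, this means nearly every distinct reading may count as a separate category, so binning the values first (for example with `pyspark.ml.feature.Bucketizer` or `QuantileDiscretizer`) is probably worth considering before reading much into the p-value.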