Tags: python, apache-spark, pyspark

Row object in RDD


For each RDD, I am trying to remove the header row and parse each comma-delimited line into a Row object, with each column cast to the data type given in the Jupyter notebook cell. Some columns need to be converted: 'YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', and 'FLIGHT_NUMBER' should be integers; 'DEPARTURE_DELAY', 'ARRIVAL_DELAY', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'TAXI_IN', and 'TAXI_OUT' should be floats; the rest stay as strings.

Currently, this is my code:

def import_csv_rdd(data):
    rdd = sc.textFile(data)
    # 1. Split each comma-separated line into a list
    bank_rdd1 = rdd.map(lambda line: line.split(','))
    # 2. Remove the header
    header = bank_rdd1.first()
    bank_rdd1 = bank_rdd1.filter(lambda row: row != header)   # filter out the header

^ The above code is not finished and needs some adaptation; however, I would like some clarification on how to "parse each comma-delimited line into a Row object".


Solution

  • Let's do this with a simple file with fewer columns.

    col1,col2,col3
    1,1.2,value1
    2,2.2,value2
    

    Read this file as an RDD, then convert each line into a Row object, casting col1 to int and col2 to float.

    >>> from pyspark.sql import Row
    >>> rdd = spark.sparkContext.textFile('/file/path/')
    
    >>> header = rdd.first()
    >>> rdd = rdd.filter(lambda row: row != header)
    >>> split_rdd = rdd.map(lambda line: line.split(','))
    
    >>> row_rdd = split_rdd.map(lambda line: Row(col1 = int(line[0]), col2=float(line[1]), col3=line[2]))
    
    >>> row_rdd.collect()
    [Row(col1=1, col2=1.2, col3='value1'), Row(col1=2, col2=2.2, col3='value2')]
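
  • To apply the same pattern to your flight data, build the Row from the header names so each value is cast according to your column lists. This is a minimal sketch rather than a final implementation: it reuses the sc and import_csv_rdd names from your question, and it assumes the first line of the file is the header and that no fields are empty (int('') or float('') would raise a ValueError).

    from pyspark.sql import Row

    INT_COLS = {'YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'FLIGHT_NUMBER'}
    FLOAT_COLS = {'DEPARTURE_DELAY', 'ARRIVAL_DELAY', 'ELAPSED_TIME',
                  'AIR_TIME', 'DISTANCE', 'TAXI_IN', 'TAXI_OUT'}

    def import_csv_rdd(data):
        rdd = sc.textFile(data)
        # Split each comma-separated line into a list of field values
        split_rdd = rdd.map(lambda line: line.split(','))
        # The first row holds the column names; use it to drop the header
        header = split_rdd.first()
        data_rdd = split_rdd.filter(lambda row: row != header)

        def to_row(fields):
            # Pair each value with its column name and cast per the rules above
            converted = {}
            for name, value in zip(header, fields):
                if name in INT_COLS:
                    converted[name] = int(value)
                elif name in FLOAT_COLS:
                    converted[name] = float(value)
                else:
                    converted[name] = value  # keep everything else as a string
            return Row(**converted)

        return data_rdd.map(to_row)

    Note that on Spark versions before 3.0, Row(**kwargs) sorts the field names alphabetically; from 3.0 onward the keyword order is preserved. You can also verify the result by turning the Row RDD into a DataFrame, e.g. spark.createDataFrame(import_csv_rdd('/file/path/')).printSchema().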