scalaapache-sparkpysparkdatabricks

Remove first and last row from the text file in pyspark


I have file like below

H201908127477474
123|sample|customer|3433
786|ex|data|7474737
987|solve|data|6364
T3 637373

I want to remove the first row and last row from the file. Please give me some solution in pyspark

I am using this to load the file

df=spark.read.format('csv').load('sample.txt')

Solution

  • There is no easy way to drop rows by line number because Spark DataFrames do not by default have a concept of order1. There is no "first" or "last" row - each row is treated as an independent block of structured data. This is fundamental to spark and is what what allows it to distribute/parallelize computing - each executor can pick up an arbitrary chunk of the data and process is.

    Although your question asks how to drop the first and last rows, I assume what you actually want is to keep the data which follow the correct schema.

    If you know the correct schema ahead of time, you can pass it into spark.read.csv and use mode="DROPMALFORMED":

    from pyspark.sql.types import IntegerType, StringType, StructField, StructType
    
    schema = StructType(
        [
            StructField('a', IntegerType()),
            StructField('b', StringType()),
            StructField('c', StringType()),
            StructField('d', IntegerType())
        ]
    )
    df = spark.read.csv('sample.txt', sep="|", mode="DROPMALFORMED", schema=schema)
    #+---+------+--------+-------+
    #|  a|     b|       c|      d|
    #+---+------+--------+-------+
    #|123|sample|customer|   3433|
    #|786|    ex|    data|7474737|
    #|987| solve|    data|   6364|
    #+---+------+--------+-------+
    

    Notes:

    1. You can introduce order via a sort or with a Window function. See: Pyspark add sequential and deterministic index to dataframe (and check out the posts linked in the question).

    2. If you truly wanted to drop the first and last rows, you can add line numbers to rdd with zipWithIndex(), and use this to filter out the smallest and largest line numbers.