dataframe pyspark jupyter-notebook

Why is my PySpark DataFrame not displaying properly in a table format?


I'm trying to read a CSV file using PySpark in a Jupyter notebook, but when I display the DataFrame using df.show(), the data appears scattered and is not properly formatted as a table. Here's an example of how the output looks:

+---------+-------------+--------------------+--------------------+--------------+-----------------+------------------+-------+----+-----+--------------------+-----------+--------+----------+--------------+-------------+---------+------+--------------------------+------------------------+-------------------------------------------------------------------------------------------+--------------------+-------+--------+----------+-----------+------------+
|SALE TYPE| SOLD DATE| PROPERTY TYPE| ADDRESS| CITY|STATE OR PROVINCE|ZIP OR POSTAL CODE| PRICE|BEDS|BATHS| LOCATION|SQUARE FEET|LOT SIZE|YEAR BUILT|DAYS ON MARKET|$/SQUARE FEET|HOA/MONTH|STATUS|NEXT OPEN HOUSE START TIME|NEXT OPEN HOUSE END TIME|URL (SEE https://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)| SOURCE| MLS#|FAVORITE|INTERESTED| LATITUDE| LONGITUDE|
+---------+-------------+--------------------+--------------------+--------------+-----------------+------------------+-------+----+-----+--------------------+-----------+--------+----------+--------------+-------------+---------+------+--------------------------+------------------------+-------------------------------------------------------------------------------------------+--------------------+-------+--------+----------+-----------+------------+
|PAST SALE|April-10-2024|Single Family Res...|1016 Wyndham Hill Ln| Franklin| TN| 37069| 950000| 5| 3.0| Fieldstone Farms| 3500| 21780| 1993| NULL| 271| 75| Sold| NULL| NULL| https://www.redfi...|REALTRACS as Dist...|2641189| N| Y| 35.9697949| -86.8849545|

Here's the code I'm using to load the CSV:
df = spark.read.csv(file_path, header=True, inferSchema=True)
df.show()

Solution

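  • That is the expected output. df.show() prints each row on a single line, and with this many columns that line is wider than the notebook cell, so it wraps and looks scattered. I would suggest passing the truncate and vertical arguments to show() to better align the output with your expectations; see the documentation for the show() method for details. Personally, I am more comfortable with vertical=True when examining specific rows of a pyspark.sql.DataFrame.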

    df = spark.read.csv(file_path, header=True, inferSchema=True)
    # Show the first 5 rows with one field per line and without truncating long values.
    df.show(n=5, truncate=False, vertical=True)
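    If you would rather keep the horizontal table layout, another option is to print the columns a few at a time so that each table fits within the cell width. The sketch below is only an illustration: show_in_column_chunks and its chunk_size parameter are hypothetical names, not part of PySpark, and the helper relies only on DataFrame.columns, select() and show(). Column names are wrapped in backticks because some headers in this file (the URL column, for example) contain dots, which select() would otherwise try to read as nested field access.

```python
def show_in_column_chunks(df, chunk_size=6, n=5, truncate=True):
    """Print a wide DataFrame as several narrower tables.

    Hypothetical helper (not a PySpark API). It assumes `df` exposes
    `columns`, `select()` and `show()` like pyspark.sql.DataFrame.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    columns = list(df.columns)
    for start in range(0, len(columns), chunk_size):
        chunk = columns[start:start + chunk_size]
        # Backtick-quote each name (doubling any embedded backtick) so that
        # names containing dots or other special characters resolve as-is.
        quoted = ["`" + name.replace("`", "``") + "`" for name in chunk]
        df.select(*quoted).show(n=n, truncate=truncate)
```

    With the DataFrame from the question, show_in_column_chunks(df, chunk_size=6) would print the 27 columns as five separate tables of at most six columns each. The chunk size is a guess; adjust it to whatever fits your notebook width.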