dataframe pyspark jupyter-notebook

Why is my PySpark DataFrame not displaying properly in a table format?


I'm trying to read a CSV file using PySpark in Jupyter Notebook, but when I display the DataFrame using df.show(), the data appears scattered and not properly formatted in a table. Here's an example of how the output looks:

+---------+-------------+--------------------+--------------------+--------------+-----------------+------------------+-------+----+-----+--------------------+-----------+--------+----------+--------------+-------------+---------+------+--------------------------+------------------------+-------------------------------------------------------------------------------------------+--------------------+-------+--------+----------+-----------+------------+
|SALE TYPE|    SOLD DATE|       PROPERTY TYPE|             ADDRESS|          CITY|STATE OR PROVINCE|ZIP OR POSTAL CODE|  PRICE|BEDS|BATHS|            LOCATION|SQUARE FEET|LOT SIZE|YEAR BUILT|DAYS ON MARKET|$/SQUARE FEET|HOA/MONTH|STATUS|NEXT OPEN HOUSE START TIME|NEXT OPEN HOUSE END TIME|URL (SEE https://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)|              SOURCE|   MLS#|FAVORITE|INTERESTED|   LATITUDE|   LONGITUDE|
+---------+-------------+--------------------+--------------------+--------------+-----------------+------------------+-------+----+-----+--------------------+-----------+--------+----------+--------------+-------------+---------+------+--------------------------+------------------------+-------------------------------------------------------------------------------------------+--------------------+-------+--------+----------+-----------+------------+
|PAST SALE|April-10-2024|Single Family Res...|1016 Wyndham Hill Ln|      Franklin|               TN|             37069| 950000|   5|  3.0|    Fieldstone Farms|       3500|   21780|      1993|          NULL|          271|       75|  Sold|                      NULL|                    NULL|                                                                       https://www.redfi...|REALTRACS as Dist...|2641189|       N|         Y| 35.9697949| -86.8849545|
+---------+-------------+--------------------+--------------------+--------------+-----------------+------------------+-------+----+-----+--------------------+-----------+--------+----------+--------------+-------------+---------+------+--------------------------+------------------------+-------------------------------------------------------------------------------------------+--------------------+-------+--------+----------+-----------+------------+

Here's the code I'm using to load the CSV:

    df = spark.read.csv(file_path, header=True, inferSchema=True)
    df.show()

Solution

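  • That is the expected output: show() prints a plain-text table, so a row that is wider than the notebook's output area can wrap onto the following lines and look scattered. I would suggest passing the truncate and vertical arguments to show() to bring the output closer to what you expect; both are described in the documentation for the show() method. I personally am more comfortable with vertical=True when examining specific rows in my pyspark.sql.DataFrame.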

    df = spark.read.csv(file_path, header=True, inferSchema=True)
    # Print the first 5 rows vertically (one line per column) without truncating long values
    df.show(n=5, truncate=False, vertical=True)
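
If you would rather keep the horizontal table, another option is to print fewer columns and cap the cell width so each row fits on one line. The sketch below is only an illustration: show_narrow is a hypothetical helper name (not part of PySpark), and the defaults of 5 rows and 25 characters are arbitrary assumptions. It relies on DataFrame.select() and on show() accepting an integer for truncate, which shortens long strings to that length.

```python
def show_narrow(df, columns, n=5, width=25, vertical=False):
    """Print only `columns` of a PySpark DataFrame.

    Hypothetical helper, not part of PySpark. Strings longer than
    `width` characters are shortened by show() itself.
    """
    # Selecting a few columns keeps each printed row short enough
    # to fit in the notebook's output area without wrapping.
    narrowed = df.select(*columns)
    # An integer `truncate` caps the cell width; `vertical=True`
    # switches to one line per column for each row.
    narrowed.show(n=n, truncate=width, vertical=vertical)
```

With the DataFrame from the question, a call such as show_narrow(df, ["ADDRESS", "CITY", "PRICE"]) would print just those three columns for the first five rows.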