I'm trying to read a CSV file using PySpark in a Jupyter Notebook, but when I display the DataFrame using df.show(), the data appears scattered and not properly formatted as a table. Here's an example of how the output looks:
+---------+-------------+--------------------+--------------------+--------------+-----------------+------------------+-------+----+-----+--------------------+-----------+--------+----------+--------------+-------------+---------+------+--------------------------+------------------------+-------------------------------------------------------------------------------------------+--------------------+-------+--------+----------+-----------+------------+
|SALE TYPE| SOLD DATE| PROPERTY TYPE| ADDRESS| CITY|STATE OR PROVINCE|ZIP OR POSTAL CODE| PRICE|BEDS|BATHS| LOCATION|SQUARE FEET|LOT SIZE|YEAR BUILT|DAYS ON MARKET|$/SQUARE FEET|HOA/MONTH|STATUS|NEXT OPEN HOUSE START TIME|NEXT OPEN HOUSE END TIME|URL (SEE https://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)| SOURCE| MLS#|FAVORITE|INTERESTED| LATITUDE| LONGITUDE|
+---------+-------------+--------------------+--------------------+--------------+-----------------+------------------+-------+----+-----+--------------------+-----------+--------+----------+--------------+-------------+---------+------+--------------------------+------------------------+-------------------------------------------------------------------------------------------+--------------------+-------+--------+----------+-----------+------------+
|PAST SALE|April-10-2024|Single Family Res...|1016 Wyndham Hill Ln| Franklin| TN| 37069| 950000| 5| 3.0| Fieldstone Farms| 3500| 21780| 1993| NULL| 271| 75| Sold| NULL| NULL| https://www.redfi...|REALTRACS as Dist...|2641189| N| Y| 35.9697949| -86.8849545|
+---------+-------------+--------------------+--------------------+--------------+-----------------+------------------+-------+----+-----+--------------------+-----------+--------+----------+--------------+-------------+---------+------+--------------------------+------------------------+-------------------------------------------------------------------------------------------+--------------------+-------+--------+----------+-----------+------------+
df = spark.read.csv(file_path, header=True, inferSchema=True)
df.show()
That is the expected output. I would suggest passing the truncate and vertical arguments to show() to better align the output with your expectations.
Here's the documentation for the show() method. I personally am more comfortable with vertical=True when examining specific rows in my pyspark.sql.DataFrame.
df = spark.read.csv(file_path, header=True, inferSchema=True)
# Print the first 5 rows, one field per line, without cutting off long values
df.show(n=5, truncate=False, vertical=True)
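If the vertical output is still too long with this many columns, it may help to narrow the DataFrame down to the columns you care about before showing it. Below is a minimal sketch of that idea; the helper name preview and its parameters are my own invention for illustration and not part of PySpark. It relies only on the DataFrame's select() and show() methods.

```python
def preview(df, columns=None, n=5):
    """Print the first n rows of a PySpark DataFrame, one field per line.

    `preview` is a hypothetical helper, not a PySpark API.
    `columns` is an optional list of column names to keep.
    """
    # Restrict to the requested columns, or keep everything if none are given
    subset = df.select(*columns) if columns else df
    # vertical=True prints each row as a block of "column | value" lines;
    # truncate=False keeps long strings (such as URLs) intact
    subset.show(n=n, truncate=False, vertical=True)


# Example call, assuming `df` was loaded with spark.read.csv as above:
# preview(df, ["ADDRESS", "CITY", "PRICE"], n=3)
```

The column names in the example call are taken from the header shown in the question; adjust them to whatever you actually want to inspect.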