python pyspark pyspark-schema

Handling commas within a field when reading a file with PySpark


My data file contains column values that include commas:

example.com', 'example Technologies is a leading provider of sophisticated electronic components, instruments & communications products, including defense electronics, data acquisition & communications equipment for airlines and business aircraft, monitoring and control instruments for industrial and environmental applications and components, and subsystems for wireless and satellite communications. The example Solution No matter what challenge you face, example has a solution. The diverse segments of example Technologies Incorporated bring decades of experience to bear on every project, working in cooperation to develop leading edge technologies. Our Markets We serve niche market segments where performance, precision and reliability are critical. Our customers include major industrial and communications companies, government agencies, aerospace prime contractors and general aviation companies.', 'http://www.example.com

I am trying to keep the commas inside each field as part of the value, so that they are not treated as column separators, and to view the result in the Spark shell. I tried the code below, but it gave me different output.

df = spark.read.format("csv").option("sep", ',').option("quote", '"').option("escape", '"').option("inferSchema", "true").option("header", "true").load('1k.csv').rdd.toDF()

The output I got is:

df.show()                                                                   
+--------------------+------------------------------------------------------------------------------------+--------------------------------------+------------------------------+-------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+--------------------+-------------------------------------------------------------+-------------------------+
|       example.com'| 'example Technologies is a leading provider of sophisticated electronic components| instruments & communications products| including defense electronics| data acquisition & communications equipment for airlines and business aircraft| monitoring and control instruments for industrial and environmental applications and components| and subsystems for wireless and satellite communications. The example Solution No matter what challenge you face| example has a solution. The diverse segments of example Technologies Incorporated bring decades of experience to bear on every project| working in cooperation to develop leading edge technologies. Our Markets We serve niche market segments where performance| precision and reliability are critical. Our customers include major industrial and communications companies| government agencies| aerospace prime contractors and general aviation companies.'| 'http://www.example.com|
+--------------------+------------------------------------------------------------------------------------+--------------------------------------+------------------------------+-------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+--------------------+-------------------------------------------------------------+-------------------------+
df.printSchema()

root
 |-- example.com': string (nullable = true)
 |--  'example Technologies is a leading provider of sophisticated electronic components: string (nullable = true)
 |--  instruments & communications products: string (nullable = true)
 |--  including defense electronics: string (nullable = true)
 |--  data acquisition & communications equipment for airlines and business aircraft: string (nullable = true)
 |--  monitoring and control instruments for industrial and environmental applications and components: string (nullable = true)
 |--  and subsystems for wireless and satellite communications. The example Solution No matter what challenge you face: string (nullable = true)
 |--  example has a solution. The diverse segments of example Technologies Incorporated bring decades of experience to bear on every project: string (nullable = true)
 |--  working in cooperation to develop leading edge technologies. Our Markets We serve niche market segments where performance: string (nullable = true)
 |--  precision and reliability are critical. Our customers include major industrial and communications companies: string (nullable = true)
 |--  government agencies: string (nullable = true)
 |--  aerospace prime contractors and general aviation companies.': string (nullable = true)
 |--  'http://www.example.com: string (nullable = true)

But I am expecting the output to have three columns, as below. What am I missing here? Can anyone help me?

domain, description, url


Solution

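  • Why not just set the separator to the full sequence ', ' (single quote, comma, space, single quote)? The fields in the file are delimited by that sequence rather than by a bare comma, so the commas inside the description are left alone. Multi-character separators require Spark 3.0 or later. The file also appears to have no header row, so the header option is left off here; in the question it turned the first record into column names.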

    df = spark.read.csv('test.csv', sep="', '")
    
    df.printSchema()
    df.show()
    
    root
     |-- _c0: string (nullable = true)
     |-- _c1: string (nullable = true)
     |-- _c2: string (nullable = true)
    
    +------------+--------------------+--------------------+
    |         _c0|                 _c1|                 _c2|
    +------------+--------------------+--------------------+
    |teledyne.com|Teledyne Technolo...|http://www.teledy...|
    +------------+--------------------+--------------------+
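
To get the column names the question asks for (domain, description, url), or to handle Spark versions that reject multi-character separators, one option is to split each line yourself. The following is only a minimal sketch, assuming every line holds exactly three fields separated by ', '. The helper name split_record and the shortened sample line are illustrative and not from the original post.

```python
# Assumption: fields are separated by quote-comma-space-quote, as in the question's data.
SEP = "', '"
COLUMNS = ["domain", "description", "url"]


def split_record(line, sep=SEP, n_fields=len(COLUMNS)):
    """Split one raw line into exactly n_fields values.

    Commas inside a field are left untouched because only the full
    quote-comma-space-quote sequence counts as a separator.
    """
    parts = line.rstrip("\n").split(sep)
    if len(parts) != n_fields:
        raise ValueError(f"expected {n_fields} fields, got {len(parts)}")
    # Drop any stray quote left at the outer edges of the record.
    parts[0] = parts[0].lstrip("'")
    parts[-1] = parts[-1].rstrip("'")
    return parts


# Shortened, hypothetical sample line in the same shape as the question's data.
sample = (
    "example.com', 'components, instruments & communications products.', "
    "'http://www.example.com"
)
record = dict(zip(COLUMNS, split_record(sample)))
print(record)
```

With a SparkSession available, the same helper could be applied to the raw text, for example spark.read.text('1k.csv').rdd.map(lambda r: split_record(r.value)).toDF(COLUMNS). On Spark 3.0 or later, renaming the columns of the DataFrame read above with df.toDF('domain', 'description', 'url') should be enough.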