Tags: apache-spark, hive, databricks, external-tables

Why does Spark not create a new file after inserting data into an external table?


I have a .csv file, emp_data.csv, stored at: dbfs:/raw/data/externalTables/emp_data_folder/emp_data.csv

Here is a sample of the data in the file:

Alice,25,50000,North
Bob,30,60000,South
Charlie,35,70000,East
David,40,80000,West
Eve,29,58000,North
Frank,50,90000,South
Grace,28,54000,East
Hannah,32,62000,West
Ian,45,72000,North
Jack,27,56000,South

Using this .csv file, I created an external table in Spark using the following SQL command:

%sql
CREATE TABLE IF NOT EXISTS tablesDbDef.emp_data_f (
    Name STRING,
    Age INTEGER,
    Salary INT,
    Region STRING
)
USING CSV
LOCATION '/raw/data/externalTables/emp_data_folder/'

The table is created successfully, and I can query it without any issues.

Next, I inserted a new record into the table using the following command:

%sql

INSERT INTO tablesDbDef.emp_data_f VALUES ('Mark', 20, 50000, 'South')

The record is inserted successfully, and I can see it when I query the table. My understanding is that when new data is inserted, Spark creates new files (.csv files in this case) for it. However, when I check the emp_data_folder directory, I don't see any new file for the inserted record. The only files present are the original emp_data.csv and a newly generated _SUCCESS file.

My question is: where is the newly inserted data stored, if not in files? I can see the new row in SQL queries, but no file seems to have been created for it.


Solution

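  • When you create a table with USING CSV LOCATION '/path', Spark registers an external (unmanaged) table. The metastore (e.g., Hive Metastore) records only the schema, format, and location; the rows themselves live in the files under that path. Dropping the table removes the metadata but leaves the files in place.

    INSERT INTO on such a table does not store the new rows in the metastore, which holds metadata only. Spark runs a write job against the table's location and, for an append, adds new files alongside the existing ones. The _SUCCESS marker you found is written when such a job completes, so it is evidence that a write to emp_data_folder took place.

    Spark never appends to or rewrites an existing file such as emp_data.csv. The inserted row should instead be in a separate file in the same folder, typically named part-00000-<id>.csv, possibly next to Databricks commit markers whose names begin with _started_ and _committed_. If your listing does not show it, re-list the directory (for example with dbutils.fs.ls), since a file browser may be showing a stale view.

    You do not need to convert the table to a managed table to get the data into files. Managed versus external only determines who owns the files when the table is dropped. If you want all rows in one consolidated output, read the table and write the result to a new location.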
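One way to confirm where the inserted row lives is to ask Spark which file each row was read from. The queries below are a minimal sketch to run in a %sql cell. They assume the table from the question and a runtime where the built-in input_file_name() function is available (some newer Databricks configurations expose the same information through the _metadata.file_path column instead). The alias source_file is only an illustrative name.

```sql
-- Show the table's type (EXTERNAL) and the location registered in the metastore.
DESCRIBE TABLE EXTENDED tablesDbDef.emp_data_f;

-- List every row together with the path of the file it was read from.
-- The row for 'Mark' should point at a file other than emp_data.csv.
SELECT Name, Age, Salary, Region, input_file_name() AS source_file
FROM tablesDbDef.emp_data_f
ORDER BY source_file;
```

If the second query reports a path inside emp_data_folder for the new row, the data is in that file and not in the metastore.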