apache-sparkpysparkdelta-lakedremiodata-lakehouse

Is there a way to access a delta table old column name?


I have a table with customer IDs, creditcard_number and creditcard_type (amex, mastercard, etc.) I am trying to find a way to:

  1. Modify the column name creditcard_type to type and creditcard_number to number by using a delta lake function
  2. Have the old name of the column logged somewhere so that i can access it and use it to modify a .ttl-file (for a virtual knowledge graph mapping) where i search for the old name and replace it with the new column name.

I tried with change data feed (but it only tracks row-level changes) I tried with column mapping (but column mapping doesn't modify the data files too). And if it doesn't modify the files too, i can't see them in Dremio (which i'm using as a federator to load the data further to Ontopic - the virtual knowledge graph tool).

How can i change the name of a column? Then access it's old name. And everything has to be done with Delta. Is Delta event capable of this? Or am I missing something?


Solution

  • Yes, you can rename columns starting with Delta Lake 1.2.0, at least in Apache Spark (not sure about Dremio) - you need to enable column mapping by name (my table is called test):

    ALTER TABLE test SET TBLPROPERTIES (
       'delta.columnMapping.mode' = 'name',
       'delta.minReaderVersion' = '2',
       'delta.minWriterVersion' = '5')
    

    Then rename columns:

    ALTER TABLE test rename column creditcard_type to type;
    ALTER TABLE test rename column creditcard_number to number
    

    And you can pull the old and new names from Delta table history with something like this (pull history, leave only RENAME COLUMN operations and select old and new names from the operationParameters column):

    # fetch history for the table
    tdf = spark.sql("describe history test") \
      .filter("operation='RENAME COLUMN'") \
      .selectExpr("operationParameters['oldColumnPath'] as old_name", 
                  "operationParameters['newColumnPath'] as new_name")
    # convert dataframe into a dict old_name -> new_name
    names_mapping = dict([[r[0], r[1]] for r in tdf.collect()])
    

    will give you:

    >>> names_mapping
    {'creditcard_number': 'number', 'creditcard_type': 'type'}