pythonjsonreplaceazure-synapsespark-notebook

Assistance Refactoring Pyspark Replace Code for Synapse Notebook (working with JSON)


Could anyone help with advice on how to refactor this code to make it faster? We are ingesting JSON and transforming in preparation for a later step but this particular section of the Notebook is taking a lot longer than expected (and we're new to python). Code is below, thanks!

    import pandas as pd

    df_main = pd.DataFrame(table_data)
    df_parameters = pd.DataFrame(parameter_data)

    # Convert df_parameters to dictionary for faster lookups
    parameter_dict = dict(zip(df_parameters['abc'], df_parameters['def']))

    def replace_parameters(row):
        if isinstance(row['expression'], str) and row['xyz'] == row['xyz']:
            for parameter_name, value in parameter_dict.items():
                if row['expression']:
                    row['expression'] = row['expression'].replace(f'|{parameter_name}|', str(value))
                if row['sp']:
                    row['sp'] = row['sp'].replace(f'|{parameter_name}|', str(value))
        return row

    # Apply the replacement function only when xyz is equal to xyz
    df_main = df_main.apply(replace_parameters, axis=1)

Solution

  • Instead of using a for loop, use pattern matching. I believe this kind of expression you are using; if not, adjust the pattern accordingly.

    Expressions I used:

    1. |param1| + |param2|
    2. |param2| - |param3|
    3. |param1| * |param3|
    4. |param2| / |param3|

    You can use the code below.

    Data I used:

    import pandas as pd
    import re
    
    table_data = {
        'expression': ['|param1| + |param2|', '|param2| - |param3|'],
        'sp': ['|param1| * |param3|', '|param2| / |param3|'],
        'xyz': [1, 2]  # Sample values for 'xyz'
    }
    
    parameter_data = {
        'abc': ['param1', 'param2', 'param3'],
        'def': [10, 20, 30]  # Sample values for parameters
    }
    
    df_main = pd.DataFrame(table_data)
    df_parameters = pd.DataFrame(parameter_data)
    

    Output:

    enter image description here

    Modify your function like below:

    import re
    parameter_dict = dict(zip(parameter_data['abc'], parameter_data['def']))
    
    tmp = [ "\|" +i+ "\|" for i in parameter_data['abc']]
    param_pattern = '|'.join(tmp)
    
    # Define a function to perform replacements
    def replace_parameters(row):
        if isinstance(row['expression'], str) and row['xyz'] == row['xyz']:
            row['expression'] = re.sub(param_pattern,lambda match: str(parameter_dict[match.group(0).strip('|')]), row['expression'])
            row['sp'] = re.sub(param_pattern, lambda match: str(parameter_dict[match.group(0).strip('|')]), row['sp'])
        return row
    
    df_main = df_main.apply(replace_parameters, axis=1)
    
    df_main
    

    Output:

    enter image description here

    You can see the parameters are replaced. Once again, I am saying, according to your expressions, you create the pattern. For the above kind of expressions, below is the pattern I created: '\\|param1\\||\\|param2\\||\\|param3\\|'.