pythonsnowflake-cloud-data-platformjinja2dbt

How to replace strings in a column if exist in a list with dbt jinja?


I'm new to writing sql with jinja in dbt. I have created this variable for example:

{% set html_tags = [
    '<td>',
    '</td>',
    '&nbsp;',
    '<td style="background-color: #d8d8d8;">',
    '<span style="text-decoration: underline; font-family: arial; font-size: 16px;">',
    '<span style="font-family: arial;">', 
    '</span>',  
    '<span style="font-family: arial; font-size: 16px;">', 
    '<span class="Apple-tab-span" style="white-space: pre;">'
%}

I have a column called question and the string contain these html tags. I'd like to iterate my list and replace these tags in question column for each record with '' to clean up the text. Is this possible?

Example string from question column: <tdstyle="background-color:#d8d8d8;">Hemoglobin</td><tdstyle="background-color:#d8d8d8;">&nbsp;14.6g/dL&nbsp;</td>

I'd like to clean it up to Hemoglobin 14.6g/dL

I think I need a for loop or macro? I understand this might be not a good approach to clean up these html tags, but this is the only tool available to me at the moment.


Solution

  • Probably the most straightforward is to create a macro where you define the strings (HTML tags) that you want to get rid of, and then apply it on a column basis. If these tags are not prone to change, you can just hard code them in the macro:

    -- macros/cleanse_html_tags.sql
    
    {% macro cleanse_html_tags(field_with_html_tags) %}
    
    {% set html_tags = [
        '<td>',
        '</td>',
        '&nbsp;',
        '<td style="background-color: #d8d8d8;">',
        '<span style="text-decoration: underline; font-family: arial; font-size: 16px;">',
        '<span style="font-family: arial;">',
        '</span>',
        '<span style="font-family: arial; font-size: 16px;">',
        '<span class="Apple-tab-span" style="white-space: pre;">'
      ]
    %}
    
    regexp_replace({{ field_with_html_tags }}, '{{ html_tags|join('|') }}')
    
    {% endmacro %}
    

    Then, you would just need to call the macro from your model in the following way, and that's it:

    -- your_model.sql
    
    select {{ cleanse_html_tags('my_html_string') }} as my_string_without_html_tags
    from {{ ref('your_other_model') }}