python python-3.x airflow airflow-2.x

Unexpected String Concatenation Issue in Airflow 2.7.2 DAG


I am encountering an unexpected issue with string concatenation in Airflow 2.7.2 using Python 3.11.5. This problem occurs only in Airflow; the same string concatenation works correctly in local unit tests.

Here is the code snippet from the DAG that demonstrates the issue:

values = ['1234', '5678', 'ABC_123', 'xyz-calc', '2024-01-01',
          'NULL', '9876', 'NULL', 'example', 42, '2024-07-28T01:23:45.678',
          '2024-07-28T02:34:56.789', '2024-07-28T03:45:67.890', 'user_test',
          'complete', '2024-07-28T04:56:78.901',
          '2024-07-28T05:67:89.012', 'NULL',
          'spark-calc-1234-driver', 'NULL', 'NULL', 'XYZ']

values_str_list = []
for value in values:
    if isinstance(value, int):
        values_str_list.append(str(value))
    elif value == 'NULL':
        values_str_list.append('NULL')
    else:
        values_str_list.append(f"'{value}'")

values_str_3 = ',\n    '.join(values_str_list)  # The logged output of this string appears to be missing an element
print("Concatenated string values_str_3:")
print(values_str_3)

The logs show:

[2024-07-29, 22:12:39 UTC] {logging_mixin.py:151} INFO - Concatenated string values_str_3:
[2024-07-29, 22:12:39 UTC] {logging_mixin.py:151} INFO - 1234,
    5678,
    'ABC_123',
    'xyz-calc',
    '2024-01-01',
    NULL,
    '9876',
    NULL,
    'example',
    42,
    '2024-07-28T01:23:45.678',
    '2024-07-28T02:34:56.789',
    '2024-07-28T03:45:67.890',
    'user_test',
    'complete',
    '2024-07-28T04:56:78.901',
    '2024-07-28T05:67:89.012',
    NULL,
    'spark-calc-1234-driver',
    NULL,
    'XYZ'
    

The problem is that the logged values_str_3 string is missing one element: one of the two consecutive NULL entries that follow 'spark-calc-1234-driver'.

Compare the expected end of the list:

'spark-calc-1234-driver', NULL, NULL, 'XYZ'

with what the log shows:

'spark-calc-1234-driver',
NULL,
'XYZ'

It is unclear why values_str_3 produces this result.

Interestingly, when I split the concatenated string, the result is correct:

print("Original string values_str_3:", values_str_3.split(',\n    '))

The logs show:

[2024-07-29, 22:12:39 UTC] {logging_mixin.py:151} INFO - Original string values_str_3: ['1234', '5678',
 "'ABC_123'", "'xyz-calc'", "'2024-01-01'",
'NULL', "'9876'", 'NULL', "'example'", '42', "'2024-07-28T01:23:45.678'",
"'2024-07-28T02:34:56.789'", "'2024-07-28T03:45:67.890'", "'user_test'",
"'complete'", "'2024-07-28T04:56:78.901'", "'2024-07-28T05:67:89.012'", 'NULL',
"'spark-calc-1234-driver'", 'NULL', 'NULL', "'XYZ'"]

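Everything appears normal here: all 22 elements are present, including both consecutive NULL entries, so the string itself seems to be intact.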
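To check the string itself rather than how it is displayed, something like the following sketch may help. It rebuilds the joined string for the last four values and reports the element count, the line count, and a single-line repr(). The helper names build_values_str and describe are hypothetical and exist only for this illustration; they are not part of Airflow or of the original DAG.

```python
def build_values_str(values):
    """Render values the same way as the DAG snippet (hypothetical helper)."""
    parts = []
    for value in values:
        if isinstance(value, int):
            parts.append(str(value))
        elif value == 'NULL':
            parts.append('NULL')
        else:
            parts.append(f"'{value}'")
    return ',\n    '.join(parts)


def describe(joined, sep=',\n    '):
    """Collect facts about the string that do not rely on multi-line output."""
    return {
        'elements': len(joined.split(sep)),
        'lines': joined.count('\n') + 1,
        # repr() escapes newlines, so the whole string fits on one line
        'repr': repr(joined),
    }


sample = ['spark-calc-1234-driver', 'NULL', 'NULL', 'XYZ']
info = describe(build_values_str(sample))
print(info['elements'], info['lines'])
print(info['repr'])
```

If the counts match the input list and the repr() contains both NULL entries, the join is behaving as expected and the discrepancy is most likely in how the multi-line message is shown.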

Thank you for the comments. It does seem to be some kind of deduplication of strings in the logs. Here is a minimal reproducible example.

values_str_3 = ',\n    '.join(['values_str_3'] * 10)
print(f"values_str_3:\n    {values_str_3}")

[2024-08-04, 08:06:08 UTC] {logging_mixin.py:151} INFO - values_str_3:
    values_str_3,
    values_str_3

But I can't find a setting in Airflow that controls this.
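Both outputs above are consistent with identical consecutive lines being collapsed into one when the log is shown. That is an inference from the observed output, not a confirmed description of what Airflow does internally. Under that assumption, the sketch below counts how many lines of a message repeat the line before them, and shows one possible workaround: prefixing each line with its number so that no two adjacent lines are identical. The function names consecutive_duplicate_lines and number_lines are hypothetical.

```python
def consecutive_duplicate_lines(text):
    """Return indices of lines that are identical to the line just before them."""
    lines = text.split('\n')
    return [i for i in range(1, len(lines)) if lines[i] == lines[i - 1]]


def number_lines(text):
    """Prefix every line with its 1-based number so adjacent lines always differ."""
    return '\n'.join(
        f"{i:>3}: {line}" for i, line in enumerate(text.split('\n'), start=1)
    )


values_str_3 = ',\n    '.join(['values_str_3'] * 10)
message = f"values_str_3:\n    {values_str_3}"

dupes = consecutive_duplicate_lines(message)
total_lines = len(message.split('\n'))
# Lines that would remain if each repeated line were dropped (assumed behaviour)
print(total_lines - len(dupes))

# Workaround sketch: numbered lines contain no adjacent duplicates
print(number_lines(message))
```

For the minimal example, the message has 11 lines, 8 of which repeat the previous line, which leaves the 3 lines seen in the log. This does not change any Airflow setting; it only makes the printed lines distinct.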


Solution

  • In the end, I realized this was a bug in Airflow and opened a PR to fix it. I hope it gets approved, because I spent a lot of time trying to figure out what was wrong with my code before realizing the issue was in Airflow. If any contributors are reading this, help with getting it approved would be appreciated.