I am encountering an unexpected issue with string concatenation in Airflow 2.7.2 using Python 3.11.5. This problem occurs only in Airflow; the same string concatenation works correctly in local unit tests.
Here is the code snippet from the DAG that demonstrates the issue:
values = [1234, 5678, 'ABC_123', 'xyz-calc', '2024-01-01',
          'NULL', '9876', 'NULL', 'example', 42, '2024-07-28T01:23:45.678',
          '2024-07-28T02:34:56.789', '2024-07-28T03:45:67.890', 'user_test',
          'complete', '2024-07-28T04:56:78.901',
          '2024-07-28T05:67:89.012', 'NULL',
          'spark-calc-1234-driver', 'NULL', 'NULL', 'XYZ']
values_str_list = []
for value in values:
    if isinstance(value, int):
        values_str_list.append(str(value))
    elif value == 'NULL':
        values_str_list.append('NULL')
    else:
        values_str_list.append(f"'{value}'")
values_str_3 = ',\n '.join(values_str_list)  # This concatenation does NOT work correctly
print("Concatenated string values_str_3:")
print(values_str_3)
The logs show:
[2024-07-29, 22:12:39 UTC] {logging_mixin.py:151} INFO - Concatenated string values_str_3:
[2024-07-29, 22:12:39 UTC] {logging_mixin.py:151} INFO - 1234,
5678,
'ABC_123',
'xyz-calc',
'2024-01-01',
NULL,
'9876',
NULL,
'example',
42,
'2024-07-28T01:23:45.678',
'2024-07-28T02:34:56.789',
'2024-07-28T03:45:67.890',
'user_test',
'complete',
'2024-07-28T04:56:78.901',
'2024-07-28T05:67:89.012',
NULL,
'spark-calc-1234-driver',
NULL,
'XYZ'
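The problem is that the logged values_str_3 is missing one element: the two consecutive NULL entries before 'XYZ' appear only once.
Compare the expected end of the string:
'spark-calc-1234-driver', NULL, NULL, 'XYZ'
with the end of the logged output: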
'spark-calc-1234-driver',
NULL,
'XYZ'
It is unclear why values_str_3 produces this result.
Interestingly, when I split the concatenated string, the result is correct:
print("Original string values_str_3:", values_str_3.split(',\n '))
The logs show:
[2024-07-29, 22:12:39 UTC] {logging_mixin.py:151} INFO - Original string values_str_3: ['1234', '5678',
"'ABC_123'", "'xyz-calc'", "'2024-01-01'",
'NULL', "'9876'", 'NULL', "'example'", '42', "'2024-07-28T01:23:45.678'",
"'2024-07-28T02:34:56.789'", "'2024-07-28T03:45:67.890'", "'user_test'",
"'complete'", "'2024-07-28T04:56:78.901'", "'2024-07-28T05:67:89.012'", 'NULL',
"'spark-calc-1234-driver'", 'NULL', 'NULL', "'XYZ'"]
Everything appears normal here: all 22 elements are present, including both consecutive NULL entries at the end.
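Since the split result is complete, the string itself is probably intact and only its multi-line rendering in the log is affected. A minimal sketch of how to check this from inside the task is to log a one-line summary instead of the raw multi-line string. The helper name describe_joined is hypothetical and not part of Airflow; the sample list is just the tail of the values above.

```python
# Hypothetical helper (not an Airflow API): summarise a joined string on a
# single line, so the result does not depend on how multi-line output is shown.
def describe_joined(joined: str, sep: str = ',\n ') -> str:
    parts = joined.split(sep)
    # repr() escapes the newlines, so the whole summary stays on one line.
    return f"{len(parts)} elements, repr={joined!r}"


# Tail of the list from the question, already quoted as in the DAG code.
values_str_list = ["'spark-calc-1234-driver'", "NULL", "NULL", "'XYZ'"]
values_str_3 = ',\n '.join(values_str_list)

summary = describe_joined(values_str_3)
print(summary)
```

If the element count in this one-line summary matches the input list, the concatenation is working and the missing line is a display problem.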
Thank you for the comments. It does seem that the logs apply some kind of deduplication to identical lines. Here is a minimal reproducible example:
values_str_3 = ',\n '.join(['values_str_3'] * 10)
print(f"values_str_3:\n {values_str_3}")
[2024-08-04, 08:06:08 UTC] {logging_mixin.py:151} INFO - values_str_3:
values_str_3,
values_str_3
But I can't find a setting in Airflow that controls this.
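The log output above is consistent with a step that drops any line identical to the line immediately before it. The sketch below only models that observed behaviour so it can be reproduced locally. The function drop_consecutive_duplicates is a hypothetical name of my own and is not Airflow's code.

```python
# Model of the observed behaviour (an assumption, not Airflow's implementation):
# a line is dropped when it is identical to the line directly before it.
def drop_consecutive_duplicates(text: str) -> str:
    kept = []
    previous = None
    for line in text.split('\n'):
        if line != previous:
            kept.append(line)
        previous = line
    return '\n'.join(kept)


# Same input as the minimal reproducible example.
values_str_3 = ',\n '.join(['values_str_3'] * 10)
rendered = drop_consecutive_duplicates(f"values_str_3:\n {values_str_3}")
print(rendered)
```

Under this model the nine identical " values_str_3," lines collapse into one, which matches the log. It also explains the first example: only the two adjacent NULL lines are identical neighbours, while the earlier NULL entries are separated by '9876' and survive.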
In the end, I realized it was a bug and created a PR. I hope it gets approved because I spent a lot of time trying to figure out what was wrong with my code before I realized the issue was in Airflow. Maybe if there are any contributors here, they could help with the approval.