We're using Python’s difflib.SequenceMatcher to compare strings in a production system. Here's the simplified relevant code:
from difflib import SequenceMatcher

similarity = SequenceMatcher(
    None,
    normalized_transcript,
    normalized_expected
).ratio()
Until 4:10 PM UTC today, the above code was returning a similarity ratio above our internal threshold for a specific string comparison.
After that time, without any change in our code, server configuration, or environment, the same comparison started returning a lower similarity score, failing the threshold check.
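For reference, the ratio feeds a threshold gate along these lines; the 0.9 cutoff and the function name are illustrative placeholders, not our real values:

```python
from difflib import SequenceMatcher

# Illustrative cutoff -- the real threshold is an internal setting.
SIMILARITY_THRESHOLD = 0.9

def passes_check(normalized_transcript: str, normalized_expected: str) -> bool:
    """Return True when the two normalized strings are similar enough."""
    ratio = SequenceMatcher(None, normalized_transcript, normalized_expected).ratio()
    return ratio >= SIMILARITY_THRESHOLD
```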
Some key facts:
- The behavior changed consistently across all environments, both development and production.
- Servers run a mix of Windows (dev) and Unix (dev + prod), so this is unlikely to be an OS-specific issue.
- There were no code deployments, no dependency changes, and no environment variable alterations.
- We are aware that SequenceMatcher runs entirely locally; there are no third-party requests or models involved.
- We've validated that the inputs to SequenceMatcher are identical to the previous values (confirmed via logs; see the verification sketch below).
Important detail: the string normalized_transcript comes from OpenAI API completions. That’s the only potentially "variable" external component in the system. However, the strings in question are very short, and we’ve historically seen consistent outputs from OpenAI for this prompt setup.
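To make "identical" concrete: the check below is a simplified sketch of what we run against the logged values (the helper names are illustrative, not our production code). It renders each string through repr() plus a codepoint dump, so invisible differences such as a non-breaking space or a curly quote would show up immediately.

```python
def describe(s: str) -> str:
    """Render a string so that invisible differences become visible."""
    # repr() exposes whitespace and escape differences; the codepoint dump
    # catches lookalikes such as U+00A0 (no-break space) vs. U+0020 (space).
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in s)
    return f"{s!r}  [{codepoints}]"

def assert_identical(current: str, previous: str) -> None:
    """Fail loudly if the two logged strings differ in any way."""
    if current != previous:
        raise AssertionError(
            "inputs differ:\n"
            f"  current : {describe(current)}\n"
            f"  previous: {describe(previous)}"
        )
```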
This behavior is baffling. Is there any known edge case, time-sensitive internal optimization, or anything else that could explain this sudden change in SequenceMatcher's behavior?
As I mentioned in a comment, I wrote the difflib code in question, and "it's entirely self-contained and purely functional (the results depend solely on the sequences passed to it)." It knows nothing about time, which platform it's running on, anything else in its environment, or how or when the sequences passed to it were obtained.
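If you want to see that for yourself, hammer .ratio() with any fixed pair of strings (the ones below are made up) and note that exactly one distinct float ever comes back:

```python
from difflib import SequenceMatcher

# Made-up inputs purely for demonstration; substitute your real strings.
a = "turn the living room lights off"
b = "turn off the living room lights"

# The same inputs always produce the exact same float, so the set
# collapses to a single element.
results = {SequenceMatcher(None, a, b).ratio() for _ in range(1_000)}
print(len(results))  # 1
```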
So more information is needed. Not about your environment, but about the symptom itself: what result did you get? what result did you expect? which inputs were passed? We haven't yet been told anything relevant.
> Important detail: the string normalized_transcript comes from OpenAI API completions. That's the only potentially "variable" external component in the system.
Then that's the only guess I have.
> However, the strings in question are very short,
Why would their lengths be relevant?
> and we've historically seen consistent outputs from OpenAI for this prompt setup.
Past performance is no guarantee of future results ;-)
At a bare minimum, show us the precise strings that were passed, and what .ratio() returned on your box(es). Then we can at least see whether people can reproduce your results. And as the algorithm's creator, I may be able to guess non-obvious (to others) things from the precise floating-point result .ratio() returned.
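Something along these lines, with the placeholders replaced by your real values, would give us everything needed to try reproducing it:

```python
from difflib import SequenceMatcher

# Placeholders -- paste the exact normalized strings from your logs here.
normalized_transcript = "...actual transcript string..."
normalized_expected = "...actual expected string..."

# repr() exposes invisible differences (whitespace, quote characters,
# Unicode escapes) that a plain print() would hide.
print(repr(normalized_transcript))
print(repr(normalized_expected))

ratio = SequenceMatcher(None, normalized_transcript, normalized_expected).ratio()
# In Python 3, repr() of a float preserves enough digits to round-trip
# the exact value, so post this output verbatim.
print(repr(ratio))
```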
But, as is, we're all flying blind here.