I've been playing around with cryptocat, which is an interesting online chat service that allows you to encrypt your messages with a key, so that only people with the same key can read your message. An interesting aspect of the service (in my opinion) is the fact that text encrypted using a key other than the one that you're using is displayed simply as "[encrypted]", rather than a bunch of garbage cipher text. My question is, in Python, is there a good way to determine whether or not a given piece of text is cipher text? I'm using RC4 for this example, because it was the fastest thing I could implement (based on the pseudo-code on Wikipedia). Thanks.
there is no guaranteed way to tell, but in practice you can do two things:
check for many non-ascii characters (if you're expecting people to be sending english text).
check the distribution of values. in normal text, some letters are much more common than others. but in encrypted text, all characters are about equally likely.
a simple way of doing the latter is to see if any character occurs more than (N/256) + 5 * sqrt(N/256) times (where N is the total number of characters), in which case it's likely natural language (unencrypted).
in python (reversing the logic above, to give "true" when encrypted):
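    from collections import defaultdict
    from math import sqrt

    def encrypted(text):
        # count how often each character occurs
        scores = defaultdict(int)
        for letter in text:
            scores[letter] += 1
        largest = max(scores.values())
        # expected count per value if all 256 were equally likely
        average = len(text) / 256.0
        return largest < average + 5 * sqrt(average)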
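as a rough check of the test above, here is a self-contained sketch that runs it on repeated english text and on random bytes. the use of `os.urandom` as a stand-in for cipher text is an assumption (a good cipher's output should look close to uniform random bytes), and the guard for empty input is an addition that is not in the function above.

```python
import os
from collections import Counter
from math import sqrt

def encrypted(data):
    # same frequency test as above, written with Counter;
    # accepts str (items are characters) or bytes (items are ints)
    counts = Counter(data)
    if not counts:
        return False  # assumption: treat empty input as "not encrypted"
    largest = max(counts.values())
    # expected count per value if all 256 were equally likely
    average = len(data) / 256.0
    return largest < average + 5 * sqrt(average)

# repeated english text: the space character alone occurs 180 times in 880 bytes
plain = b"the quick brown fox jumps over the lazy dog " * 20

# stand-in for cipher text (assumption: cipher output is close to uniform)
random_like = os.urandom(4096)

print(encrypted(plain))        # the most common character is far above the threshold
print(encrypted(random_like))  # expected to be True in nearly all runs
```

the random input is not guaranteed to pass on every run, since the threshold is statistical, but with a few thousand bytes a failure should be rare.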
the maths comes from treating the count for each character as roughly gaussian around the average, with a variance equal to the average (so a standard deviation of sqrt(average)) - it's not perfect, but it's probably close enough. by default (with very small amounts of text, when it is unreliable) this will return false (sorry; earlier i had an incorrect version with "max()" which had the logic for small numbers the wrong way round).
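the first check mentioned above (many non-ascii characters) can be sketched in a similar way. the helper names `non_ascii_fraction` and `looks_encrypted` and the 0.3 cut-off are hypothetical choices for illustration, not part of the answer above; uniformly random bytes should have about half their values above 127, while plain english text should have almost none.

```python
def non_ascii_fraction(data):
    # fraction of items with a value above 127;
    # accepts bytes (items are ints) or str (items are characters)
    if not data:
        return 0.0
    if isinstance(data, (bytes, bytearray)):
        values = data
    else:
        values = map(ord, data)
    return sum(1 for v in values if v > 127) / len(data)

def looks_encrypted(data, threshold=0.3):
    # threshold is an assumed cut-off: roughly 0.5 is expected for
    # uniform random bytes, and close to 0 for english text
    return non_ascii_fraction(data) > threshold

print(non_ascii_fraction(b"hello world"))
print(looks_encrypted(bytes(range(256))))
```

this only makes sense if you expect mostly ascii input; text in other languages or encodings would need a different cut-off, and it could be combined with the frequency test rather than used alone.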