Python: Is there a good way to check if text is encrypted?

3

6

I've been playing around with Cryptocat, an online chat service that lets you encrypt your messages with a key, so that only people with the same key can read them. An interesting aspect of the service (in my opinion) is that text encrypted with a key other than the one you're using is displayed simply as "[encrypted]", rather than as a bunch of garbage ciphertext. My question is: in Python, is there a good way to determine whether or not a given piece of text is ciphertext? I'm using RC4 for this example, because it was the fastest thing I could implement (based on the pseudo-code on Wikipedia). Thanks.

Connell answered 9/8, 2011 at 18:19 Comment(0)
G
17

there is no guaranteed way to tell, but in practice you can do two things:

  1. check for many non-ascii characters (if you're expecting people to be sending english text).

  2. check the distribution of values. in normal text, some letters are much more common than others. but in encrypted text, all characters are about equally likely.

a simple way of doing the latter is to see if any character occurs more than (N/256) + 5 * sqrt(N/256) times (where you have a total of N characters), in which case it's likely a natural language (unencrypted).

in python (reversing the logic above, to give "true" when encrypted):

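from collections import Counter
from math import sqrt

def encrypted(text):
    # count how often each character (or byte value) occurs
    scores = Counter(text)
    largest = max(scores.values(), default=0)  # 0 for empty input
    # expected count per value if all 256 values were equally likely
    average = len(text) / 256.0
    # "encrypted" if no count is far (5 standard deviations) above the average
    return largest < average + 5 * sqrt(average)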

the maths comes from treating each character's count as approximately gaussian around the average, with a variance equal to the average (so the threshold sits five standard deviations above the mean) - it's not perfect, but it's probably close enough. with very small amounts of text, when the test is unreliable anyway, this will generally return false (sorry; earlier i had an incorrect version with "max()" which had the logic for small numbers the wrong way round).
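the first check in the list above (many non-ascii characters) can be sketched in a similar way. this is only a rough illustration, assuming the input is a bytes object; the function names and the 0.3 threshold are made up for the example and not tuned on real data:

```python
def non_ascii_fraction(data: bytes) -> float:
    """Return the fraction of bytes outside the ASCII range (value >= 0x80)."""
    if not data:
        return 0.0
    return sum(1 for b in data if b >= 0x80) / len(data)


def looks_non_english(data: bytes, threshold: float = 0.3) -> bool:
    # threshold is an illustrative assumption: uniformly random bytes
    # have about half of their values >= 0x80, plain English has almost none
    return non_ascii_fraction(data) > threshold
```

like the distribution test, this only says something if you really do expect plain english (ascii) text on the unencrypted side.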

Galloot answered 9/8, 2011 at 18:26 Comment(1)
FYI - some on SO have suggested pre-pending encrypted strings with a known 'magic' str to identify encrypted vs. not encrypted strings: #25027381 - Stent
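A minimal sketch of the 'magic' prefix idea from the comment above, assuming you control both the sending and the receiving side. The marker value `ENC1:` and both function names are hypothetical, chosen only for illustration:

```python
MAGIC = b"ENC1:"  # hypothetical marker; any agreed-upon byte string would do


def mark_encrypted(ciphertext: bytes) -> bytes:
    """Prepend the marker so receivers can recognise encrypted payloads."""
    return MAGIC + ciphertext


def is_marked_encrypted(data: bytes) -> bool:
    """True if the payload starts with the agreed marker."""
    return data.startswith(MAGIC)
```

This identifies encrypted messages by convention rather than by statistics, so it only works for messages produced by software that adds the marker.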
S
5

Every cipher worth its name will produce output that appears to be completely random. You can exploit this fact for a quick test of whether you are dealing with encrypted text or with data that follows some unknown protocol. To decide, you could check the distribution of byte values in the byte stream you are eavesdropping on: if all values are roughly uniformly distributed, then there's a good chance you're dealing with encrypted text.
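As a rough sketch of that uniformity check, assuming the captured data is available as a Python bytes object, one option is Pearson's chi-squared statistic against a flat distribution over the 256 byte values. The function name is my own, and this is one possible way to measure uniformity, not the only one:

```python
from collections import Counter


def chi_squared_uniform(data: bytes) -> float:
    """Pearson chi-squared statistic of byte counts against a uniform distribution."""
    if not data:
        raise ValueError("need at least one byte")
    expected = len(data) / 256.0  # expected count per byte value if uniform
    counts = Counter(data)
    return sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))


# Usage: natural-language text scores far higher than evenly spread bytes.
english = b"the quick brown fox jumps over the lazy dog " * 50
print(chi_squared_uniform(english))
```

For uniformly random bytes the statistic should typically land near 255 (the number of degrees of freedom), while plain text gives values that are orders of magnitude larger. The test needs a reasonable amount of data, ideally well over 256 bytes, to mean anything.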

To gain more confidence in the decision, you could extend the tests to something more sophisticated, such as analyzing the distribution of pairs or triplets of bytes.

On the other hand, you could also compare statistical data on digrams and trigrams of your particular language of interest with their occurrences in the data you observe (see also here). If your data behaves similarly, then it's more likely that you are observing plain text.
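A small sketch of the digram comparison, assuming English is the language of interest. The set of digrams below is a short hand-picked list of pairs that are well known to be frequent in English, not a measured frequency table, and the function name is made up for this example:

```python
# Illustrative assumption: a few digrams that are very common in English.
COMMON_DIGRAMS = {"th", "he", "in", "er", "an", "re", "on", "at", "en", "nd"}


def common_digram_fraction(text: str) -> float:
    """Fraction of adjacent character pairs that are common English digrams."""
    lowered = text.lower()
    pairs = [lowered[i:i + 2] for i in range(len(lowered) - 1)]
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if p in COMMON_DIGRAMS) / len(pairs)
```

English prose should give a noticeably higher fraction than random-looking data, where any particular pair is rare. A real implementation would more likely compare full digram and trigram frequency tables for the target language.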

Sightless answered 9/8, 2011 at 18:26 Comment(0)
G
0

One way to tell is padding. Add standard padding to the end of the message before encrypting it. If the decrypted message does not end with the standard padding, then it was decrypted with the wrong key. The converse is not guaranteed (a wrong key can occasionally produce valid-looking padding), but it is often true.
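To illustrate the padding idea, here is a sketch that assumes PKCS#7-style padding with a 16-byte block size. The answer does not name a particular padding scheme, so both the scheme and the function names are assumptions made for the example:

```python
def pad(message: bytes, block_size: int = 16) -> bytes:
    """Append n bytes of value n so the length is a multiple of block_size."""
    n = block_size - len(message) % block_size
    return message + bytes([n]) * n


def has_valid_padding(data: bytes, block_size: int = 16) -> bool:
    """Check whether decrypted data ends with well-formed padding."""
    if not data or len(data) % block_size != 0:
        return False
    n = data[-1]  # the last byte says how many padding bytes to expect
    if n < 1 or n > block_size:
        return False
    return data.endswith(bytes([n]) * n)
```

You would call `pad()` before encrypting and `has_valid_padding()` on the result of decryption. Garbage produced by a wrong key still passes roughly 1 time in 256 (for instance, whenever the last byte happens to be 0x01), which is why the converse is not guaranteed.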

Giveaway answered 9/8, 2011 at 21:45 Comment(0)
