Homoglyph attack detection in email phishing

Main Question

I am working on an API in Java that needs to detect the use of brands (e.g. PayPal, Mastercard etc.) in phishing emails.

Obviously there are different strategies that the attackers use to target these brands so that they are harder to detect. For instance "rnastercard" looks very similar to "mastercard" and can fool an unsuspecting user.

At this time I can easily detect misspellings of these brands using a form of fuzzy string search. However, the problem I am facing is when the attacker uses homoglyphs to change the name of a particular brand while keeping the same visual appearance.

A homoglyph attack substitutes a character from the [a-zA-Z] pattern with a character that looks similar but is outside this range. For example, an attacker using a particular character set can use the Greek capital letter Rho, which looks like P, to target PayPal. The PayPal brand name in this sort of attack would become:

[Greek character RHO][a][y][Greek character RHO][a][l]
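
To make the example concrete, here is a minimal sketch (assuming Java 8 or later) of what such a string looks like to a program. The class name `ScriptInspector` and the helper `hasNonLatinLetter` are hypothetical names chosen for illustration; `Character.UnicodeScript` and `Character.getName` are part of the JDK. It only flags letters that are not in the Latin script, it does not map them back to [a-zA-Z].

```java
import java.lang.Character.UnicodeScript;

public class ScriptInspector {

    /** Returns true if any letter in the text belongs to a script other than Latin. */
    static boolean hasNonLatinLetter(String text) {
        return text.codePoints()
                .filter(Character::isLetter)
                .anyMatch(cp -> UnicodeScript.of(cp) != UnicodeScript.LATIN);
    }

    public static void main(String[] args) {
        String genuine = "PayPal";
        // U+03A1 GREEK CAPITAL LETTER RHO used in place of the Latin 'P'
        String spoofed = "\u03A1ay\u03A1al";

        // The two strings render alike but are not equal
        System.out.println(genuine.equals(spoofed));
        System.out.println(hasNonLatinLetter(genuine));
        System.out.println(hasNonLatinLetter(spoofed));

        // List code point, script and Unicode name for each character
        spoofed.codePoints().forEach(cp ->
                System.out.printf("U+%04X %s %s%n",
                        cp, UnicodeScript.of(cp), Character.getName(cp)));
    }
}
```

A check like this will also flag legitimate non-Latin text, so it is probably only useful as one signal among others.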

Since I have little to no experience with different standards like Unicode or ISO standards and their encodings I am calling upon your advice. Is there a way to programmatically determine the visual equivalent of a character outside the [a-zA-Z] set so that the result would be a character inside the [a-zA-Z] set?

Some of your answers might be based on a particular character set, but I am looking for a solution that would help me determine such representations for every character set usable inside an email.

I have not read the RFC standards for mail exchange yet, but they are on my list. I am asking this question now to save time.

Possible but unworkable solutions

I have thought of some solutions but they are not workable for my particular case since they are very CPU intensive and of a hack-like nature (read "may be easily broken").

The first solution would be to render the character that is outside [a-zA-Z], as it is, into an image and feed that image to an OCR API to get its closest [a-zA-Z] representation.

The second solution would be to create a map for each character set, where the key is the character itself and the value is its [a-zA-Z] equivalent. This map would have to be built either by hand or by using the first solution described above.
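
The map idea could look roughly like the sketch below (assuming Java 8 or later). The class name `HomoglyphMap`, the method `toLatin`, and the four entries are hypothetical and hand-picked for illustration. A usable table would need far more entries, which is exactly the part that is tedious to build by hand.

```java
import java.util.HashMap;
import java.util.Map;

public class HomoglyphMap {

    // Hand-picked sample entries only: code point -> visually similar Latin letter
    private static final Map<Integer, Character> LOOKALIKES = new HashMap<>();
    static {
        LOOKALIKES.put(0x03A1, 'P'); // GREEK CAPITAL LETTER RHO
        LOOKALIKES.put(0x039C, 'M'); // GREEK CAPITAL LETTER MU
        LOOKALIKES.put(0x0430, 'a'); // CYRILLIC SMALL LETTER A
        LOOKALIKES.put(0x0440, 'p'); // CYRILLIC SMALL LETTER ER
    }

    /** Replaces every mapped code point with its Latin look-alike; others pass through unchanged. */
    static String toLatin(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            Character mapped = LOOKALIKES.get(cp);
            if (mapped != null) {
                out.append(mapped.charValue());
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLatin("\u03A1ay\u03A1al"));   // PayPal
        System.out.println(toLatin("\u039Castercard"));    // Mastercard
    }
}
```

The lookup itself is a constant-time hash access per character, so the cost lies in building and shipping the table rather than in using it.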

Additional details

I have already asked this question here. However, the question remained closed despite my editing efforts, probably because I didn't express myself well and did not tag the question properly.

In that particular question I also addressed some concerns I had regarding the character sets used by Java, which clouded the actual question. However, if you feel the need to include such information in your answer I would be grateful, since it would save me some time researching those questions. The question of homoglyph attacks and the question of character sets in Java or javax.mail.* are separate but linked.

A particular example of a homoglyph attack as described in the main question is this email. BEWARE! That is the actual content of a phishing email using this particular attack method, so do not follow any link contained in that email.

I've tagged this question with what I thought would be the appropriate tags. If you disagree, please provide an edit to this question rather than vote to close it.

Ilocano answered 17/3, 2014 at 6:46 Comment(9)
I'd go with the second solution. But first, I recommend you check out what plagiarism detection software uses. - Decry
Also see https://mcmap.net/q/1469634/-efficient-algorithm-to-find-all-quot-character-equal-quot-strings/632951 - Meliorate
@Meliorate That question only refers to ASCII homoglyphs, which are easier to handle than Unicode homoglyph attack detection. I was asking for an efficient way to convert Unicode characters like Greek or Russian letters to their visual equivalent in this type of detection. - Parabolize
For ASCII homoglyphs fuzzy string matching seems to be enough. For example, I can detect "mastercard" in "rnastercard" with fuzzy string search (Levenshtein distance). I won't detect it if the letter 'M' in "Mastercard" is replaced with a Greek Capital Letter Mu (U+039C). - Parabolize
Rather than looking for an attack on a specific brand name, you could use the appearance of characters that are outside of [a-z] and atypical for the locale as evidence of a homoglyph attack. - Bine
Why is the OCR approach unfeasible? Once you've created a mapping of characters to their [a-z] equivalent you can put the results in a lookup table and be done with it. It is a one-time process. - Bine
See Unicode Technical Standard #39: Unicode Security Mechanisms - Colitis
Thanks @Colitis, I'll read it ASAP. - Parabolize
@mpkorstanje The look-up table might work once I build it. My concern is building a solution that is fairly complete. The look-up speed might not be a problem, since the operation can be O(1) if the symbol to be replaced is the key. However, the size of the table might be problematic, since it will possibly run on client machines. For now I can only hope third-party libraries like ICU do a good job of detecting the locale of the character. I've tried OCR APIs in the meantime but they don't give the best results. - Parabolize

As part of TR-39 (Unicode Technical Standard #39, Unicode Security Mechanisms), the Unicode Consortium maintains a list of confusables that you can use to help build your mapping. I can't vouch for its completeness.

TR-39 also describes a skeleton algorithm that uses the list of confusables to compare confusable strings. There is a Go implementation of the algorithm, and I've written a quick Java port.

Aside from this, removing diacritics and converting to lower case will also help. These are not normalized by the skeleton algorithm. So the full process should be something like skeleton --> remove diacritics --> to lower case.

import java.text.Normalizer;
import java.util.regex.Pattern;

/*
 * Special regular expression character classes relevant for simplification
 * -> see http://docstore.mik.ua/orelly/perl/prog3/ch05_04.htm
 *
 * InCombiningDiacriticalMarks: combining marks that are part of "normal"
 *   ä, ö, î etc. once the string has been decomposed (NFD).
 * IsSk: Symbol, Modifier
 *   see http://www.fileformat.info/info/unicode/category/Sk/list.htm
 * IsLm: Letter, Modifier
 *   see http://www.fileformat.info/info/unicode/category/Lm/list.htm
 */
private static final Pattern DIACRITICS_AND_FRIENDS = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

// Decompose accented characters (NFD), then drop the combining marks and modifiers.
private static String stripDiacritics(String str) {
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    return DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
}
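
Putting the three steps together could look roughly like the sketch below (assuming Java 8 or later). Note that `skeletonSketch` is not the TR-39 skeleton algorithm: it is a simplified stand-in that uses a tiny hand-picked `PROTOTYPES` map instead of the real confusables data, and the class and method names are hypothetical. In practice you would swap that step for an implementation backed by the full list.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

public class BrandNormalizer {

    private static final Pattern DIACRITICS_AND_FRIENDS =
            Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    // Hypothetical stand-in for the confusables table: only three sample entries
    private static final Map<Integer, String> PROTOTYPES = new HashMap<>();
    static {
        PROTOTYPES.put(0x03A1, "P"); // GREEK CAPITAL LETTER RHO
        PROTOTYPES.put(0x039C, "M"); // GREEK CAPITAL LETTER MU
        PROTOTYPES.put(0x0430, "a"); // CYRILLIC SMALL LETTER A
    }

    /** Simplified skeleton step: NFD, replace mapped code points, NFD again. */
    static String skeletonSketch(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(nfd.length());
        nfd.codePoints().forEach(cp -> {
            String prototype = PROTOTYPES.get(cp);
            if (prototype != null) {
                out.append(prototype);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return Normalizer.normalize(out, Normalizer.Form.NFD);
    }

    /** Decomposes the string and removes combining marks and modifiers. */
    static String stripDiacritics(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return DIACRITICS_AND_FRIENDS.matcher(nfd).replaceAll("");
    }

    /** skeleton --> remove diacritics --> to lower case */
    static String normalize(String s) {
        return stripDiacritics(skeletonSketch(s)).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Greek Rho for 'P' plus an accented 'a' (U+00E3)
        System.out.println(normalize("\u03A1\u00E3y\u03A1al")); // paypal
    }
}
```

The normalized output can then be fed to the existing fuzzy string search, so that brand names only need to be compared in lower-case ASCII form.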
Bine answered 3/2, 2015 at 16:50 Comment(8)
This is something I didn't know about. I'll try it and let you know how it goes in a couple of days. Thanks for the help. - Parabolize
I've done some testing. This is a very good answer for getting the Latin equivalent of characters like Â, Ș, Ț, Î, ... but it doesn't work for getting the Latin equivalent of the Greek letter Rho, for example. The encodings of Rho and P have a significant numerical difference between them, and a fuzzy string search will yield no result because of it. This is a useful answer because it will significantly reduce the number of comparisons for a possible implementation of the solution described in the question, but it is not the answer I am looking for. - Parabolize
However, I will look into this in combination with the list of confusables you mentioned, which seems to have these equivalences mapped. It will take me some time, but I hope I will arrive at a workable solution. If it works for all my tests within a reasonable time for each phishing email, I will accept your answer. Thank you for your help. - Parabolize
I have just run into a library that implements a solution. It parses the lists of confusables and substitutes to a normal form in which it compares the strings. It isn't Java, but it looks portable to Java: github.com/FiloSottile/tr39-confusables unicode.org/reports/tr39 - Bine
This is actually great. Thank you! Please include this in your answer if you can :) - Parabolize
Cheers! Going to write a Java implementation of the skeleton algorithm. It's ridiculous that it doesn't exist yet. - Bine
@Sebastian-LaurenţiuPlesciuc I've written a quick Java implementation of the skeleton algorithm. It works, but I'll probably be adding the other confusables tables too. github.com/mpkorstanje/tr39-confusables - Bine
Great job! This is going to help a lot. - Parabolize

Here is a GitHub repo with a large list of homoglyphs and some Java and JavaScript to help detect words that have been disguised by using them (disclaimer - I wrote it).

The list is based on the Unicode list of confusables mentioned by @mpkorstanje, but has some additional homoglyphs not on that list. The search code also accounts for variation in case (e.g. it will find the word 'mastercard' when disguised as 'ᗰas⟙eᖇcᴀrd').

Gath answered 13/11, 2015 at 18:27 Comment(1)
Would you consider mailing the Unicode folks the missing characters? - Bine
