Main Question
I am working on an API in Java that needs to detect the use of brands (e.g. PayPal, Mastercard etc.) in phishing emails.
Attackers use different strategies to disguise these brands so that they are harder to detect. For instance, "rnastercard" looks very similar to "mastercard" and can fool an unsuspecting user.
At this time I can easily detect misspellings of these brands using a form of fuzzy string search. However, the problem I am facing is when the attacker uses homoglyphs to change the name of a particular brand while keeping the same visual appearance.
A homoglyph attack substitutes a character from the [a-zA-Z] range with a character that looks similar but is outside this range. For example, an attacker using a particular character set can use the Greek capital letter rho (Ρ), which looks like the Latin P, to target PayPal. The PayPal brand name in this sort of attack would become:
[Greek character RHO][a][y][Greek character RHO][a][l]
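To make the example concrete, here is a minimal Java sketch of that spoofed string. The class and method names are made up for illustration, and it assumes the attacker used GREEK CAPITAL LETTER RHO (U+03A1). It only shows that the string is not equal to "PayPal" and which code points fall outside [a-zA-Z]; it does not solve the mapping problem.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class HomoglyphExample {
    // "PayPal" with both Latin 'P' characters replaced by U+03A1 (assumed code point).
    static final String SPOOFED = "\u03A1ay\u03A1al";

    /** Returns the code points of the input that fall outside [a-zA-Z]. */
    static List<Integer> nonAsciiLetters(String input) {
        List<Integer> result = new ArrayList<>();
        input.codePoints().forEach(cp -> {
            boolean asciiLetter = (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
            if (!asciiLetter) {
                result.add(cp);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        // Looks like "PayPal" on screen, but a plain comparison fails.
        System.out.println(SPOOFED.equals("PayPal"));
        // List the suspicious code points with their Unicode names.
        for (int cp : nonAsciiLetters(SPOOFED)) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```

Knowing which characters are suspicious is the easy part; what I am missing is a way to turn each of them into the Latin letter it imitates.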
Since I have little to no experience with standards such as Unicode or the ISO character sets and their encodings, I am asking for your advice. Is there a way to programmatically determine the visual equivalent of a character outside the [a-zA-Z] set so that the result is a character inside the [a-zA-Z] set?
Some of your answers might be based on a particular character set, but I am looking for a solution that would help me determine such representations for every character set usable inside an email.
I have not read the RFC standards for mail exchange yet, but they are on my list; I am asking this question now to save time.
Possible but unworkable solutions
I have thought of some solutions but they are not workable for my particular case since they are very CPU intensive and of a hack-like nature (read "may be easily broken").
The first solution would be to render the character that is outside [a-zA-Z] into an image and feed that image to an OCR API to get its closest [a-zA-Z] representation.
The second solution would be to create a map for each character set, where the key is the character itself and the value is its [a-zA-Z] equivalent. This map would have to be built either by hand or by using the first solution described above.
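A minimal sketch of what I mean by the second solution is below. The class name, the method name and the four table entries are hypothetical and chosen by hand; a real table would need far more entries, which is exactly the maintenance problem that worries me.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
class HomoglyphMap {
    // Hand-made, deliberately tiny table: code point -> visually similar ASCII letter.
    static final Map<Integer, Character> LOOKALIKES = new HashMap<>();
    static {
        LOOKALIKES.put(0x03A1, 'P'); // GREEK CAPITAL LETTER RHO
        LOOKALIKES.put(0x0430, 'a'); // CYRILLIC SMALL LETTER A
        LOOKALIKES.put(0x0435, 'e'); // CYRILLIC SMALL LETTER IE
        LOOKALIKES.put(0x043E, 'o'); // CYRILLIC SMALL LETTER O
    }

    /** Replaces every code point found in the table; leaves everything else untouched. */
    static String toAsciiLookalike(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            Character mapped = LOOKALIKES.get(cp);
            if (mapped != null) {
                out.append(mapped.charValue());
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The spoofed brand name from the main question is mapped back to plain ASCII.
        System.out.println(toAsciiLookalike("\u03A1ay\u03A1al"));
    }
}
```

The mapped string could then be passed to the fuzzy string search I already have, but any character missing from the table slips through unchanged.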
Additional details
I have already asked this question here. However, the question remained closed despite my editing efforts, probably because I didn't express myself well and did not tag the question properly.
In that particular question I also raised some concerns I had regarding the character sets used by Java, which clouded the actual question. However, if you feel the need to include such information in your answer, I would be grateful, since it would save me some research time. The question of homoglyph attacks and the question of character sets in Java or javax.mail.* are separate but linked.
A particular example of a homoglyph attack as described in the main question is this email. BEWARE! That is the actual content of a phishing email using this particular attack method, so do not follow any link contained in that email.
I've tagged this question with what I thought would be the appropriate tags. If you disagree, please edit the question rather than voting to close it.