How to check if a string looks randomized, or human-generated and pronounceable?

Asked 22/7, 2009 at 9:48 Answered 2/12, 2022 at 5:43

For the purpose of identifying [possible] bot-generated usernames.

Suppose you have a username like "bilbomoothof". It may be nonsense, but it still contains pronounceable sounds and so appears human-generated.

I accept that it could have been randomly generated from a dictionary of syllables, or word parts, but let's assume for a moment that the bot in question is a bit rubbish.

Suppose you have a username like "sdfgbhm342r3f". To a human this is clearly a random string, but can this be identified programmatically?
Are there any algorithms available (similar to Soundex, etc.) that can identify pronounceable sounds within a string like this?

Solutions applicable in PHP/MySQL most appreciated.

Dioxide answered 22/7, 2009 at 9:48 Comment(4)

I like this question and am looking forward to the answers. :) – Jeaniejeanine 22/7, 2009 at 9:49

The name of this concept in linguistics appears to be 'Pseudoword' en.wikipedia.org/wiki/Pseudoword which may help in your search for material. – Barnebas 22/7, 2009 at 10:5

I think you will find that this is an amazingly complicated algorithm, and perhaps not best suited for PHP. – Ostrowski 22/7, 2009 at 12:0

Is it not possible to use something like image verification? Where letters are drawn instead? (If you can't solve the problem, change the problem). – Capuchin 24/7, 2009 at 8:8

I guess you could think of something like that if you could restrict yourself to pronounceable sounds in English. For me (I am French), words like szczepan or wawrzyniec are unpronounceable and certainly look somewhat random.

But they are actually Polish first names (meaning Steven and Lawrence)...

Churr answered 22/7, 2009 at 9:59 Comment(0)

I agree with Mac. But more than that, people sometimes have usernames that aren't pronounceable, like qwerty or rtfmorleave.

Why bother with that?

<obsolete and false, but I won't delete it because of the comments>

But more than that, no bots use 'zetztzgsd' as a username; they have dictionaries of real names, possible nicknames, etc., so I think this would be a waste of time for you.

</obsolete and false, but I won't delete it because of the comments>

Millikan answered 22/7, 2009 at 10:3 Comment(7)

@clement Not true. A lot of bot usernames on Twitter have very poor auto-generated names, just as poor as "zetztzgsd". Regarding people with unpronounceable usernames: this is fine, as the test is only an indicator. It won't be relied upon 100%; other tests on behaviour will be performed. – Dioxide 22/7, 2009 at 10:16

It's just another thing that can be added to an overall weighting as to whether a user is genuine - it wouldn't be the only indicator used. – Basin 22/7, 2009 at 10:26

@Tim Really? I thought bot designers would be more imaginative. You are both right: it can't be 100% accurate, but it can help. – Millikan 22/7, 2009 at 10:43

I have a page ranked high on Google which collects data from a form that is not CAPTCHA protected. Here are some sample names from bots: asdfsdaff, Rihanna nude (and lots of other artist names), kvsdpeqoqby, ygwyss, tbjoezlonzu. The majority of them are of the "x nude" variety, though. The e-mails are always garbled however, e.g., [email protected], [email protected]. – Untoward 22/7, 2009 at 10:57

Thank you all for these clarifications; I'll be more careful in future. – Millikan 22/7, 2009 at 11:1

I'd like to add that while the first name is definitely from a bot (based on other content it submitted), the name is also definitely entered by a human. Note how it only makes use of the first four characters on the second row of letters on a QWERTY keyboard. You could make an algorithm for detecting human typed random names that makes the name more likely to belong to a human (although in this case it was a bot after all, so it might work against you as well.) – Untoward 22/7, 2009 at 11:3

The question states to "assume for a moment that the bot in question is a bit rubbish" – Mangosteen 22/7, 2009 at 20:45

Look up n-gram analysis. It is successfully used to automatically detect text language and works surprisingly well even on very short texts.

The online demo (no longer online) recognized 'bilbomoothof' as English and 'sdfgbhm342r3f' as Nepali. It probably always returns the best match, even if it's a very poor one. I think you could train it to discern between 'pronounceable' and 'random'.
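As a rough illustration of the n-gram idea (not the dead demo's method), the sketch below records which character bigrams occur in a list of known-good words and then reports what fraction of a username's bigrams were seen. The function names `train_ngrams` and `known_ngram_ratio` and the tiny training list are assumptions made for this example; a real detector would train on a large word list or on genuine usernames.

```php
<?php
// Build a table of character n-grams seen in a list of known-good words.
function train_ngrams(array $words, int $n = 2): array
{
    $seen = [];
    foreach ($words as $word) {
        $w = strtolower($word);
        for ($i = 0; $i + $n <= strlen($w); $i++) {
            $gram = substr($w, $i, $n);
            $seen[$gram] = ($seen[$gram] ?? 0) + 1;
        }
    }
    return $seen;
}

// Fraction of the name's n-grams that occur in the trained table (0.0 to 1.0).
// N-grams containing digits or symbols simply count as unknown.
function known_ngram_ratio(string $name, array $seen, int $n = 2): float
{
    $s = strtolower($name);
    $total = strlen($s) - $n + 1;
    if ($total < 1) {
        return 0.0;
    }
    $known = 0;
    for ($i = 0; $i < $total; $i++) {
        if (isset($seen[substr($s, $i, $n)])) {
            $known++;
        }
    }
    return $known / $total;
}

// Toy training list (an assumption made for this sketch only).
$seen = train_ngrams(['bill', 'hobbit', 'moon', 'tooth', 'thorn',
                      'often', 'albino', 'tomato', 'home', 'smooth']);

$human  = known_ngram_ratio('bilbomoothof', $seen);  // most bigrams are known
$random = known_ngram_ratio('sdfgbhm342r3f', $seen); // no bigram is known
```

A low ratio would then be one more input to the overall weighting, not a verdict on its own. Passing 3 as `$n` to both functions switches the same sketch to trigrams.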

Hoagy answered 22/7, 2009 at 11:20 Comment(1)

the demo link is dead – Tritheism 7/3, 2017 at 7:24

Just use CAPTCHA as a part of the registration process.

You can never distinguish real usernames from bot-created usernames without severely annoying your users.

You will block users with bizarre or non-English names, which will irritate them, and the bots will just keep trying until they find an acceptable username (from a dictionary or other sources - this is a very nice one, by the way!).

EDIT : Looking for prevention rather than after-the-fact analysis?

The solution is to let somebody else manage users' identities for you. For instance, you can use a small list of OpenID providers (like SO), or Facebook Connect, or both. You'll know for sure that the users are real, and that they have solved at least one CAPTCHA.

EDIT: Another Idea

Search for the string in Google, and check the number of matches found. It shouldn't be your only tool, but it is a good indicator, too. Randomized strings, of course, should have few or no matches.

Carib answered 22/7, 2009 at 10:51 Comment(2)

Thanks for the response, but this is after-the-fact analysis, not prevention. – Dioxide 22/7, 2009 at 10:58

Neither Google nor any other search engine I know of lets you programmatically look up "hits" unfortunately. You may get away with scraping for a while before being automatically blocked. – Heins 15/5, 2013 at 2:19

Reply for question #1:

Unfortunately this cannot be done in general: the Kolmogorov complexity function is not computable, so no algorithm can decide whether an arbitrary string is random. If you apply some rules to the domain of possible usernames, you can perform heuristic analysis and make a decision, but even then it is really hard to do.

PS: After posting this answer, I came across a service that suggested one way of restricting the username domain: have users use a mailbox at a well-known public email domain as their username.

Maclay answered 22/7, 2009 at 9:55 Comment(1)

It's not just "hard" - it's impossible. See also: Turing test – Darsie 11/3, 2023 at 7:17

Off the top of my head, you could look for syllables, making use of soundex. That's the direction I would explore, based on the assumption that a pronounceable word has at least one syllable.

EDIT: Here's a function for counting syllables:

    // Estimate the number of syllables in an English word.
    // Based on Greg Fast's Perl module Lingua::EN::Syllables
    function count_syllables($word) {
        // Patterns that each subtract one syllable from the vowel-group count
        $subsyl = array(
            'cial', 'tia', 'cius', 'cious', 'giu', 'ion', 'iou', 'sia$', '.ely$',
        );

        // Patterns that each add one syllable to the vowel-group count
        $addsyl = array(
            'ia', 'riet', 'dien', 'iu', 'io', 'ii',
            '[aeiouym]bl$', '[aeiou]{3}', '^mc', 'ism$',
            '([^aeiouy])\1l$', '[^l]lien', '^coa[dglx].',
            '[^gq]ua[^auieo]', 'dnt$',
        );

        // Keep letters only, lower-cased
        $word = preg_replace('/[^a-z]/', '', strtolower($word));

        // Split on consonant runs; each non-empty part is one vowel group
        $word_parts = preg_split('/[^aeiouy]+/', $word);
        $valid_word_parts = array(); // initialised so count() is safe when there are no vowels
        foreach ($word_parts as $value) {
            if ($value !== '') {
                $valid_word_parts[] = $value;
            }
        }

        $syllables = 0;
        // Thanks to Joe Kovar for correcting a bug in the following lines
        foreach ($subsyl as $syl) {
            $syllables -= preg_match('~' . $syl . '~', $word);
        }
        foreach ($addsyl as $syl) {
            $syllables += preg_match('~' . $syl . '~', $word);
        }
        if (strlen($word) == 1) {
            $syllables++;
        }
        $syllables += count($valid_word_parts);
        $syllables = ($syllables == 0) ? 1 : $syllables;
        return $syllables;
    }

From this very interesting link:

http://www.addedbytes.com/php/flesch-kincaid-function/

Turgent answered 22/7, 2009 at 9:56 Comment(4)

Nice, but then you need to produce a dictionary in order to be able to use it. And even after that you still can miss some cases. – Maclay 22/7, 2009 at 9:59

@Artem - Nothing is going to be a 100% effective solution for this problem – Turgent 22/7, 2009 at 10:2

@artem @karim a 100% solution is not expected. This test would be just one indicator of spam, other behaviour analysis will be performed. – Dioxide 22/7, 2009 at 10:18

@Tim, your question is whether it's possible to determine programmatically that a given string was generated. So the clear answer is no. If you are looking for approximations and heuristics, you need to specify that in your question. – Maclay 22/7, 2009 at 10:32

You could use a neural network to evaluate whether the nickname looks like a natural-language nickname.

Assemble two data sets: one of valid nicknames, and one of bogus generated ones. Train a simple back-propagating neural network with a single hidden layer, using the character values as inputs. The neural network will learn to discriminate between strings like "zrgssgbt" and "zargbyt", since the latter has consonants and vowels intermingled.

It is important to use real-world examples to get a good discriminator.
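One possible reading of "character values as inputs" is a one-hot encoding: each character position gets 26 input slots, and exactly one slot is set for the letter found there. The sketch below is an assumption about how such inputs could be prepared, not the answer's own method; `encode_name` and the fixed length of 12 are hypothetical choices, and the network itself is left out.

```php
<?php
// One-hot encode a nickname into a fixed-length input vector for a network.
// Assumption: only a-z is encoded; other characters and padding stay all-zero.
function encode_name(string $name, int $maxLen = 12): array
{
    $vector = array_fill(0, $maxLen * 26, 0);
    $s = strtolower($name);
    $limit = min(strlen($s), $maxLen);
    for ($i = 0; $i < $limit; $i++) {
        $code = ord($s[$i]) - ord('a');
        if ($code >= 0 && $code < 26) {
            // Slot for letter $code at position $i
            $vector[$i * 26 + $code] = 1;
        }
    }
    return $vector;
}

$input = encode_name('zargbyt'); // 12 * 26 = 312 inputs, seven of them set to 1
```

Feeding raw code points as single numbers would impose an artificial ordering on the letters, which is why a one-hot layout is the more common choice for this kind of input.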

Bibliogony answered 22/7, 2009 at 11:2 Comment(1)

What do you mean by "with the character values as inputs"? The codepoint of each letter?? – Heins 15/5, 2013 at 2:22

-1

I don't know of existing algorithms for this problem, but I think it can be attacked in any one of the following ways:

- Your bot may be rubbish, but you can keep a list of syllables, or more specifically, phonemes, that you can try to find in your given string. This sounds a bit difficult, though, because you would need to segment the string in different places, etc.
- There are 5 vowels in the English alphabet, and 21 other letters. If the letters were randomly generated, you would expect approximately 5/26 × W of them to be vowels (where W is the word length). A name whose vowel count sits near that random baseline, rather than near the higher share seen in real words, could be suspicious. (If digits are also allowed, the fraction becomes 5/36, and so on.) You can try building on this idea by looking at letter pairs and checking whether each pair occurs with about the same probability, etc.
- Further, you can try to segment your input string around vowels, for example three letters before a vowel and three letters after a vowel, and try to find out if it makes a recognizable sound by comparing with phonemes.
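The vowel-count idea can be sketched in a few lines. The helpers `vowel_ratio` and `max_consonant_run` below are hypothetical names introduced for illustration. Both strip non-letters first, and both treat "y" as a consonant, which is a simplifying assumption.

```php
<?php
// Share of letters that are vowels (a, e, i, o, u); non-letters are ignored.
function vowel_ratio(string $name): float
{
    $letters = preg_replace('/[^a-z]/', '', strtolower($name));
    if ($letters === '') {
        return 0.0;
    }
    return preg_match_all('/[aeiou]/', $letters) / strlen($letters);
}

// Length of the longest run of consecutive consonants after stripping
// non-letters (so runs on either side of a digit are joined together).
function max_consonant_run(string $name): int
{
    $letters = preg_replace('/[^a-z]/', '', strtolower($name));
    if (!preg_match_all('/[^aeiou]+/', $letters, $matches)) {
        return 0;
    }
    return max(array_map('strlen', $matches[0]));
}

$ratioHuman  = vowel_ratio('bilbomoothof');   // 5 vowels out of 12 letters
$ratioRandom = vowel_ratio('sdfgbhm342r3f');  // no vowels at all
```

Any thresholds on these two numbers would have to be tuned against real usernames; names like szczepan show why a long consonant run alone should not be treated as proof.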

Front answered 22/7, 2009 at 10:0 Comment(2)

This is true for words, but not user name, that can mean nothing, or be acronyms, etc. – Millikan 22/7, 2009 at 10:3

Re bullet #1. This is similar to my thinking, except that some letters are more common. ( "e" vs "x" ) So a more sophisticated formula would be required. It is true that usernames could mean nothing, but this is a somewhat academic exercise – Dioxide 22/7, 2009 at 10:10

-1

In Russian, we have forbidden syllables, like ГЙ, or Ъ or Ь after a vowel, and so on.

However, spam bots just use a names database; that's why my spam inbox is full of strange names you would only meet in history books.

I expect English to have syllable distribution histograms too (like ETAOIN SHRDLU, but for two-letter or even three-letter syllables), and a critical density of low-frequency syllables in one name is certainly a sign.

Mucin answered 22/7, 2009 at 10:1 Comment(2)

There are several hundred common trigrams in the English language. The length of the average nickname is just a few letters. There is not enough data there to get a reliable measure of normality using this model. – Bibliogony 22/7, 2009 at 11:26

@Markus: if we have a name like gfwx, we have two trigrams: gfw and fwx, which I think never occur in an English corpus. That is, we have 2 zero-probability trigrams in one name, which certainly rings a bell. – Mucin 22/7, 2009 at 15:18

-1

Note that many large sites suggest usernames like [first init][middle init][last name][number]. The users then carry these usernames over to other sites, and the first three letters are definitely not pronounceable.

Scampi answered 28/7, 2009 at 1:52 Comment(0)

-1

I've seen bot registrations where both the username and full name are strings of random upper- and lowercase letters. They tend to be at least 10 letters long. In this case it's not possible to be 100% accurate, but you can get pretty close by first passing any that have a non-[a-zA-Z] character (e.g., space, number, or special character).

Then, for the few that haven't passed the test above, if there are both upper- and lowercase letters, fail those with too many uppercase letters in the full name, which normally wouldn't have more than three or four. You'll make an error with names like JoHnDoE for both the username and full name, or JohnSmithIII, but those are pretty rare cases.

You can refine the algorithm by running it against a group of known valid registrations.
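A minimal sketch of the two-step check described above might look like this. The function name `looks_like_random_case` and the default limit of four uppercase letters are assumptions taken from the "three or four" figure in the text, so adjust the limit after testing against known valid registrations.

```php
<?php
// Returns true when a full name is flagged as likely bot-generated.
function looks_like_random_case(string $fullName, int $maxUpper = 4): bool
{
    // Step 1: anything containing a non-letter (space, digit, symbol) passes.
    if (preg_match('/[^a-zA-Z]/', $fullName)) {
        return false;
    }
    // Step 2: for mixed-case names, flag those with too many uppercase letters.
    $upper = preg_match_all('/[A-Z]/', $fullName);
    $lower = preg_match_all('/[a-z]/', $fullName);
    return $upper > 0 && $lower > 0 && $upper > $maxUpper;
}

$flagged = looks_like_random_case('kQzRtYpLmW');  // five uppercase letters, mixed case
$passed  = looks_like_random_case('John Smith');  // contains a space, so it passes
```

With this default limit, JohnSmithIII is flagged (five uppercase letters), which is exactly the kind of rare false positive the answer accepts.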

Breakaway answered 2/12, 2022 at 5:43 Comment(0)
