I'll give a bayesian hierarchical model solution. It has a few parameters that must be set by hand, but it is quite robust regarding these parameters, as the simulation below shows. And it can handle not only the scoring system for the word list, but also a probable classification of the user who entered the words. The treatment may be a little technical, but in the end we'll have a routine to calculate the scores as a function of 3 numbers: the number of words in the list, the number of those with an exact match in the database, and the number of those with a partial matching (as in yyyy
). The routine is implemented in R, ,but if you never used it, just download the interpreter, copy and paste the code in it's console, and you'll see the results shown here.
BTW english is not my first language, so bear with me... :-)
1. Model Specification:
There are 3 classes of users, named I, II, III. We assume that each word list is generated by a single user, and that the user is drawn randomly from a universe of users. We say that this universe is 70% class I, 25% class II and 5% class III. These numbers can be changed, of course. We have so far
Prob[User=I] = 70%
Prob[User=II] = 25%
Prob[User=III] = 5%
Given the user, we assume conditional independence, i.e., the user will not look to previous words to decide if he'll type in a valid or invalid word.
User I tends to give only valid words, User II only invalid words, and user III is mixed. So we set
Prob[Word=OK | User=I] = 99%
Prob[Word=OK | User=II] = 0.001%
Prob[Word=OK | User=III] = 50%
The probabilities of the word being invalid, given the class of the user, are complimentary. Note that we give a very small, but non-zero probability of a class-II user entering valid words, since even a monkey in front of a typewriter will, eventually type a valid word.
The final step of the model specification regards the database. We assume that, for each word, the query may have 3 outcomes: a total match, a partial match (as in yyyy
) or no match. In probability terms, we assume that
Prob[match | valid] = 98% (not all valid words will be found)
Prob[partial | valid] = 0.2% (a rare event)
Prob[match | INvalid] = 0 (the database may be incomplete, but it has no invalid words)
Prob[partial | INvalid] = 0.1% (a rare event)
The probabilities of not finding the word don't have to be set, as they are complimentary. That's it, our model is set.
2. Notation and Objective
We have a discrete random variable U, taking values in {1, 2, 3} and two discrete random vectors W and F, each of size n (= the number of words), where W_i is 1 if the word is valid and 2 if the word is invalid, and F_i is 1 if the word is found in the database, 2 if it's a partial match and 3 if it's not found.
Only vector F is observable, the others are latent. Using Bayes theorem and the distributions we set up in the model specification, we can calculate
(a) Prob[User=I | F],
i. e., the posterior probability of the user being in class I, given the observed matchings; and
(b) Prob[W=all valid | F],
i. e., the posterior probability that all words are valid, given the observed matchings.
Depending on your objective, you can use one or another as a scoring solution. If you are interested in distinguishing a real user from a computer program, for instance, you can use (a). If you only care about the word list being valid, you should use (b).
I'll try to explain shortly the theory in the next section, but this is the usual setup in the context of bayesian hierarchical models. The reference is Gelman (2004), "Bayesian Data Analysis".
If you want, you can jump to section 4, with the code.
3. The Math
I'll use a slight abuse of notation, as usual in this context, writing
p(x|y) for Prob[X=x|Y=y] and p(x,y) for Prob[X=x,Y=y].
The goal (a) is to calculate p(u|f), for u=1. Using Bayes theorem:
p(u|f) = p(u,f)/p(f) = p(f|u)p(u)/p(f).
p(u) is given. p(f|u) is obtained from:
p(f|u) = \prod_{i=1}^{n} \sum_{w_i=1}^{2} (p(f_i|w_i)p(w_i|u))
p(f|u) = \prod_{i=1}^{n} p(f_i|u)
= p(f_i=1|u)^(m) p(f_i=2|u)^(p) p(f_i=3)^(n-m-p)
where m = number of matchings and p = number of partial matchings.
p(f) is calculated as:
\sum_{u=1}^{3} p(f|u)p(u)
All these can be calculated directly.
Goal (b) is given by
p(w|f) = p(f|w)*p(w)/p(f)
where
p(f|w) = \prod_{i=1}^{n} p(f_i|w_i)
and p(f_i|w_i) is given in the model specification.
p(f) was calculated above, so we need only
p(w) = \sum_{u=1}^{3} p(w|u)p(u)
where
p(w|u) = \prod_{i=1}^{n} p(w_i|u)
So everything is set for implementation.
4. The Code
The code is written as a R script, the constants are set at the beginning, in accordance to what was discussed above, and the output is given by the functions
(a) p.u_f(u, n, m, p)
and
(b) p.wOK_f(n, m, p)
that calculate the probabilities for options (a) and (b), given inputs:
u = desired user class (set to u=1)
n = number of words
m = number of matchings
p = number of partial matchings
The code itself:
### Constants:
# User:
# Prob[U=1], Prob[U=2], Prob[U=3]
Prob_user = c(0.70, 0.25, 0.05)
# Words:
# Prob[Wi=OK|U=1,2,3]
Prob_OK = c(0.99, 0.001, 0.5)
Prob_NotOK = 1 - Prob_OK
# Database:
# Prob[Fi=match|Wi=OK], Prob[Fi=match|Wi=NotOK]:
Prob_match = c(0.98, 0)
# Prob[Fi=partial|Wi=OK], Prob[Fi=partial|Wi=NotOK]:
Prob_partial = c(0.002, 0.001)
# Prob[Fi=NOmatch|Wi=OK], Prob[Fi=NOmatch|Wi=NotOK]:
Prob_NOmatch = 1 - Prob_match - Prob_partial
###### First Goal: Probability of being a user type I, given the numbers of matchings (m) and partial matchings (p).
# Prob[Fi=fi|U=u]
#
p.fi_u <- function(fi, u)
{
unname(rbind(Prob_match, Prob_partial, Prob_NOmatch) %*% rbind(Prob_OK, Prob_NotOK))[fi,u]
}
# Prob[F=f|U=u]
#
p.f_u <- function(n, m, p, u)
{
exp( log(p.fi_u(1, u))*m + log(p.fi_u(2, u))*p + log(p.fi_u(3, u))*(n-m-p) )
}
# Prob[F=f]
#
p.f <- function(n, m, p)
{
p.f_u(n, m, p, 1)*Prob_user[1] + p.f_u(n, m, p, 2)*Prob_user[2] + p.f_u(n, m, p, 3)*Prob_user[3]
}
# Prob[U=u|F=f]
#
p.u_f <- function(u, n, m, p)
{
p.f_u(n, m, p, u) * Prob_user[u] / p.f(n, m, p)
}
# Probability user type I for n=1,...,5:
for(n in 1:5) for(m in 0:n) for(p in 0:(n-m))
{
cat("n =", n, "| m =", m, "| p =", p, "| Prob type I =", p.u_f(1, n, m, p), "\n")
}
##################################################################################################
# Second Goal: Probability all words OK given matchings/partial matchings.
p.f_wOK <- function(n, m, p)
{
exp( log(Prob_match[1])*m + log(Prob_partial[1])*p + log(Prob_NOmatch[1])*(n-m-p) )
}
p.wOK <- function(n)
{
sum(exp( log(Prob_OK)*n + log(Prob_user) ))
}
p.wOK_f <- function(n, m, p)
{
p.f_wOK(n, m, p)*p.wOK(n)/p.f(n, m, p)
}
# Probability all words ok for n=1,...,5:
for(n in 1:5) for(m in 0:n) for(p in 0:(n-m))
{
cat("n =", n, "| m =", m, "| p =", p, "| Prob all OK =", p.wOK_f(n, m, p), "\n")
}
5. Results
This are the results for n=1,...,5, and all possibilities for m and p. For instance, if you have 3 words, one match, one partial match, and one not found, you can be 66,5% sure it's a class-I user. In the same situation, you can attribute a score of 42,8% that all words are valid.
Note that option (a) does not give 100% score to the case of all matches, but option (b) does. This is expected, since we assumed that the database has no invalid words, hence if they are all found, then they are all valid. OTOH, there is a small chance that a user in class II or III can enter all valid words, but this chance decreases rapidly as n increases.
(a)
n = 1 | m = 0 | p = 0 | Prob type I = 0.06612505
n = 1 | m = 0 | p = 1 | Prob type I = 0.8107086
n = 1 | m = 1 | p = 0 | Prob type I = 0.9648451
n = 2 | m = 0 | p = 0 | Prob type I = 0.002062543
n = 2 | m = 0 | p = 1 | Prob type I = 0.1186027
n = 2 | m = 0 | p = 2 | Prob type I = 0.884213
n = 2 | m = 1 | p = 0 | Prob type I = 0.597882
n = 2 | m = 1 | p = 1 | Prob type I = 0.9733557
n = 2 | m = 2 | p = 0 | Prob type I = 0.982106
n = 3 | m = 0 | p = 0 | Prob type I = 5.901733e-05
n = 3 | m = 0 | p = 1 | Prob type I = 0.003994149
n = 3 | m = 0 | p = 2 | Prob type I = 0.200601
n = 3 | m = 0 | p = 3 | Prob type I = 0.9293284
n = 3 | m = 1 | p = 0 | Prob type I = 0.07393334
n = 3 | m = 1 | p = 1 | Prob type I = 0.665019
n = 3 | m = 1 | p = 2 | Prob type I = 0.9798274
n = 3 | m = 2 | p = 0 | Prob type I = 0.7500993
n = 3 | m = 2 | p = 1 | Prob type I = 0.9864524
n = 3 | m = 3 | p = 0 | Prob type I = 0.990882
n = 4 | m = 0 | p = 0 | Prob type I = 1.66568e-06
n = 4 | m = 0 | p = 1 | Prob type I = 0.0001158324
n = 4 | m = 0 | p = 2 | Prob type I = 0.007636577
n = 4 | m = 0 | p = 3 | Prob type I = 0.3134207
n = 4 | m = 0 | p = 4 | Prob type I = 0.9560934
n = 4 | m = 1 | p = 0 | Prob type I = 0.004198015
n = 4 | m = 1 | p = 1 | Prob type I = 0.09685249
n = 4 | m = 1 | p = 2 | Prob type I = 0.7256616
n = 4 | m = 1 | p = 3 | Prob type I = 0.9847408
n = 4 | m = 2 | p = 0 | Prob type I = 0.1410053
n = 4 | m = 2 | p = 1 | Prob type I = 0.7992839
n = 4 | m = 2 | p = 2 | Prob type I = 0.9897541
n = 4 | m = 3 | p = 0 | Prob type I = 0.855978
n = 4 | m = 3 | p = 1 | Prob type I = 0.9931117
n = 4 | m = 4 | p = 0 | Prob type I = 0.9953741
n = 5 | m = 0 | p = 0 | Prob type I = 4.671933e-08
n = 5 | m = 0 | p = 1 | Prob type I = 3.289577e-06
n = 5 | m = 0 | p = 2 | Prob type I = 0.0002259559
n = 5 | m = 0 | p = 3 | Prob type I = 0.01433312
n = 5 | m = 0 | p = 4 | Prob type I = 0.4459982
n = 5 | m = 0 | p = 5 | Prob type I = 0.9719289
n = 5 | m = 1 | p = 0 | Prob type I = 0.0002158996
n = 5 | m = 1 | p = 1 | Prob type I = 0.005694145
n = 5 | m = 1 | p = 2 | Prob type I = 0.1254661
n = 5 | m = 1 | p = 3 | Prob type I = 0.7787294
n = 5 | m = 1 | p = 4 | Prob type I = 0.988466
n = 5 | m = 2 | p = 0 | Prob type I = 0.00889696
n = 5 | m = 2 | p = 1 | Prob type I = 0.1788336
n = 5 | m = 2 | p = 2 | Prob type I = 0.8408416
n = 5 | m = 2 | p = 3 | Prob type I = 0.9922575
n = 5 | m = 3 | p = 0 | Prob type I = 0.2453087
n = 5 | m = 3 | p = 1 | Prob type I = 0.8874493
n = 5 | m = 3 | p = 2 | Prob type I = 0.994799
n = 5 | m = 4 | p = 0 | Prob type I = 0.9216786
n = 5 | m = 4 | p = 1 | Prob type I = 0.9965092
n = 5 | m = 5 | p = 0 | Prob type I = 0.9976583
(b)
n = 1 | m = 0 | p = 0 | Prob all OK = 0.04391523
n = 1 | m = 0 | p = 1 | Prob all OK = 0.836025
n = 1 | m = 1 | p = 0 | Prob all OK = 1
n = 2 | m = 0 | p = 0 | Prob all OK = 0.0008622994
n = 2 | m = 0 | p = 1 | Prob all OK = 0.07699368
n = 2 | m = 0 | p = 2 | Prob all OK = 0.8912977
n = 2 | m = 1 | p = 0 | Prob all OK = 0.3900892
n = 2 | m = 1 | p = 1 | Prob all OK = 0.9861099
n = 2 | m = 2 | p = 0 | Prob all OK = 1
n = 3 | m = 0 | p = 0 | Prob all OK = 1.567032e-05
n = 3 | m = 0 | p = 1 | Prob all OK = 0.001646751
n = 3 | m = 0 | p = 2 | Prob all OK = 0.1284228
n = 3 | m = 0 | p = 3 | Prob all OK = 0.923812
n = 3 | m = 1 | p = 0 | Prob all OK = 0.03063598
n = 3 | m = 1 | p = 1 | Prob all OK = 0.4278888
n = 3 | m = 1 | p = 2 | Prob all OK = 0.9789305
n = 3 | m = 2 | p = 0 | Prob all OK = 0.485069
n = 3 | m = 2 | p = 1 | Prob all OK = 0.990527
n = 3 | m = 3 | p = 0 | Prob all OK = 1
n = 4 | m = 0 | p = 0 | Prob all OK = 2.821188e-07
n = 4 | m = 0 | p = 1 | Prob all OK = 3.046322e-05
n = 4 | m = 0 | p = 2 | Prob all OK = 0.003118531
n = 4 | m = 0 | p = 3 | Prob all OK = 0.1987396
n = 4 | m = 0 | p = 4 | Prob all OK = 0.9413746
n = 4 | m = 1 | p = 0 | Prob all OK = 0.001109629
n = 4 | m = 1 | p = 1 | Prob all OK = 0.03975118
n = 4 | m = 1 | p = 2 | Prob all OK = 0.4624648
n = 4 | m = 1 | p = 3 | Prob all OK = 0.9744778
n = 4 | m = 2 | p = 0 | Prob all OK = 0.05816511
n = 4 | m = 2 | p = 1 | Prob all OK = 0.5119571
n = 4 | m = 2 | p = 2 | Prob all OK = 0.9843855
n = 4 | m = 3 | p = 0 | Prob all OK = 0.5510398
n = 4 | m = 3 | p = 1 | Prob all OK = 0.9927134
n = 4 | m = 4 | p = 0 | Prob all OK = 1
n = 5 | m = 0 | p = 0 | Prob all OK = 5.05881e-09
n = 5 | m = 0 | p = 1 | Prob all OK = 5.530918e-07
n = 5 | m = 0 | p = 2 | Prob all OK = 5.899106e-05
n = 5 | m = 0 | p = 3 | Prob all OK = 0.005810434
n = 5 | m = 0 | p = 4 | Prob all OK = 0.2807414
n = 5 | m = 0 | p = 5 | Prob all OK = 0.9499773
n = 5 | m = 1 | p = 0 | Prob all OK = 3.648353e-05
n = 5 | m = 1 | p = 1 | Prob all OK = 0.001494098
n = 5 | m = 1 | p = 2 | Prob all OK = 0.051119
n = 5 | m = 1 | p = 3 | Prob all OK = 0.4926606
n = 5 | m = 1 | p = 4 | Prob all OK = 0.9710204
n = 5 | m = 2 | p = 0 | Prob all OK = 0.002346281
n = 5 | m = 2 | p = 1 | Prob all OK = 0.07323064
n = 5 | m = 2 | p = 2 | Prob all OK = 0.5346423
n = 5 | m = 2 | p = 3 | Prob all OK = 0.9796679
n = 5 | m = 3 | p = 0 | Prob all OK = 0.1009589
n = 5 | m = 3 | p = 1 | Prob all OK = 0.5671273
n = 5 | m = 3 | p = 2 | Prob all OK = 0.9871377
n = 5 | m = 4 | p = 0 | Prob all OK = 0.5919764
n = 5 | m = 4 | p = 1 | Prob all OK = 0.9938288
n = 5 | m = 5 | p = 0 | Prob all OK = 1
xxx yyy
when one is in the DB and the other is definitely invalid? Would the confidence score be different if the words were in a different order? Other more general questions: What are real life use cases for this system? (Some sort of CAPTCHA?) Could there potentially be adversarial users of this system? Etc.. – Caseinxxx yyy zzz
were all in the database, I thin the total confidence score should be 100%. If none of the words are in the database, then possibly 0%. In your example, ifxxx yyy
was the input and only one of them were in the database and the other wasn't, then MAYBE it's 50%? There is no such thing as "definitely invalid"; there is however a "definitely valid." The idea is to figure out the likelihood that the groups of words are valid. CAPTCHA would be an application of such a scoring system. – Upheaval