Set of unambiguous looking letters & numbers for user input
D

10

41

Is there an existing subset of the alphanumerics that is easier to read? In particular, is there a subset that has fewer characters that are visually ambiguous, and by removing (or equating) certain characters we reduce human error?

I know "visually ambiguous" is a somewhat vague expression, but it is fairly evident that D, O and 0 are all similar, and that 1 and I are also similar. I would like to maximize the size of the set of alphanumerics while minimizing the number of characters that are likely to be misread.

The only precedent I am aware of for such a set is the Canadian postal code system, which omits the letters D, F, I, O, Q, and U; that subset was chosen to aid the postal system's OCR process.

My initial thought is to use only capital letters and numbers as follows:

A
B = 8
C = G
D = 0 = O = Q
E = F
H
I = J = L = T = 1 = 7
K = X
M
N
P
R
S = 5
U = V = Y
W
Z = 2
3
4
6
9

This problem may be difficult to separate from the typeface in use. The distinctiveness of the characters in the chosen typeface could significantly affect how ambiguous any two characters look, but I expect that in most modern typefaces the characters equated above will look similar enough to warrant equating them.

I would be grateful for thoughts on the above – are the above equations suitable, or perhaps are there more characters that should be equated? Would lowercase characters be more suitable?

Draggletailed answered 12/8, 2012 at 4:20 Comment(9)
Note: "Visually ambiguous" is meant in context of humans, not the OCR system. The solution required is to aid manual input.Culbert
See ux.stackexchange.com/questions/21076/…Massif
@rwb: if you make this into an answer, it will probably pick up the bounty. Discussion in UX is exactly what OP was looking for.Saragossa
Is the bounty closed - I have a 'better' solution..Culbert
@UjjwalSingh: The bounty is closed, but a better solution would still be much appreciated!Draggletailed
Posting on GitHub.. ETA 6 HrsCulbert
I had some time constraints and stackoverflow automatically changed my answer to a comment.Massif
@UjjwalSingh, where on github?Synonymous
@Prof.Falken I have not yet published the code. You may want to check this out: patents.stackexchange.com/q/13629/3127Culbert
S
17

Mainly drawing inspiration from this ux thread, mentioned by @rwb,

  • Several programs use similar things. The list in your post seems to be very similar to those used in these programs, and I think it should be enough for most purposes. You can always add redundancy (error correction) to "forgive" minor mistakes; this will require you to space out your codes (see Hamming distance), though.
  • No references as to particular method used in deriving the lists, except trial and error with humans (which is great for non-ocr: your users are humans)
  • It may make sense to use character grouping (say, groups of 5) to increase context ("first character in the second of 5 groups")
  • Ambiguity can be eliminated by using complete nouns (from a dictionary with few look-alikes; word-edit-distance may be useful here) instead of characters. People may confuse "1" with "i", but few will confuse "one" with "ice".
  • Another option is to make your code into a (fake) word that can be read out loud. A Markov model may help you there.
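
The grouping and Hamming-distance points above can be sketched in a few lines of Python. This is only an illustration: the function names `hamming`, `min_distance` and `group` are hypothetical, and the group size of 5 is the example figure from the list, not a recommendation.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_distance(codes) -> int:
    """Smallest pairwise Hamming distance in a collection of codes."""
    return min(hamming(a, b) for a, b in combinations(codes, 2))


def group(code: str, size: int = 5, sep: str = "-") -> str:
    """Split a code into fixed-size groups so users can keep their place."""
    return sep.join(code[i:i + size] for i in range(0, len(code), size))


print(group("ACEHK34689"))                      # ACEHK-34689
print(min_distance(["AAAA", "AABB", "BBBB"]))   # 2
```

If every pair of issued codes is at least 2 apart, a single mistyped character can never turn one valid code into another; at distance 3 or more, a single mistake could in principle also be corrected.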
Saragossa answered 23/9, 2012 at 12:3 Comment(2)
+1 for using complete nouns; cloudflare uses something similar for their name serversKalat
Error correction is probably underrated in UX. One valuable piece here may be a visual distance metric: for example, O/D/0 are closer to C/Q but further from I/H/R. As mentioned elsewhere, though, this may depend heavily on the font. A symbol-based error correction scheme (e.g. Reed-Solomon) that does not depend on visuals may be simpler and more effective. It's a really great insight, tucuxi, thanks!Draggletailed
J
25

My set of 23 unambiguous characters is:

c,d,e,f,h,j,k,m,n,p,r,t,v,w,x,y,2,3,4,5,6,8,9

I needed a set of unambiguous characters for user input, and I couldn't find anywhere that others have already produced a character set and set of rules that fit my criteria.

My requirements:

  1. No capitals: this is supposed to be used in URIs and typed by people who might not have a lot of typing experience, for whom even the shift key can slow them down and cause uncertainty. I also want someone to be able to say "all lowercase" so as to reduce uncertainty, so I want to avoid capital letters.

  2. Few or no vowels: an easy way to avoid creating foul language or surprising words is to simply omit most vowels. I think keeping "e" and "y" is ok.

  3. Resolve ambiguity consistently: I'm open to using some ambiguous characters, so long as I only use one character from each group (e.g., out of lowercase s, uppercase S, and five, I might only use five); that way, on the backend, I can just replace any of these ambiguous characters with the one correct character from their group. So, the input string "3Sh" would be replaced with "35h" before I look up its match in my database.

  4. Only needed to create tokens: I don't need to encode information like base64 or base32 do, so the exact number of characters in my set doesn't really matter, besides my wanting it to be as large as possible. It only needs to be useful for producing random UUID-type id tokens.

  5. Strongly prefer non-ambiguity: I think it's much more costly for someone to enter a token and have something go wrong than it is for someone to have to type out a longer token. There's a tradeoff, of course, but I want to strongly prefer non-ambiguity over brevity.

The confusable groups of characters I identified:

  • A/4
  • b/6/G
  • 8/B
  • c/C
  • f/F
  • 9/g/q
  • i/I/1/l/7 - just too ambiguous to use; note that european "1" can look a lot like many people's "7"
  • k/K
  • o/O/0 - just too ambiguous to use
  • p/P
  • s/S/5
  • v/V
  • w/W
  • x/X
  • y/Y
  • z/Z/2

Unambiguous characters:

I think this leaves only 9 totally unambiguous lowercase/numeric chars, with no vowels:

d,e,h,j,m,n,r,t,3

Adding back in one character from each of those ambiguous groups (and trying to prefer the character that looks most distinct, while avoiding uppercase), there are 23 characters:

c,d,e,f,h,j,k,m,n,p,r,t,v,w,x,y,2,3,4,5,6,8,9
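
A minimal Python sketch of how requirement 3 could work with this set: each look-alike from the groups above is folded onto the one character kept from its group, and anything else outside the set is rejected. The names `normalize` and `new_token` are hypothetical, and rejecting (rather than guessing at) characters such as `i`, `o`, `0` or `1` is an assumption on my part.

```python
import secrets

ALPHABET = "cdefhjkmnprtvwxy2345689"  # the 23 characters listed above

# Look-alikes folded onto the one character kept from each confusable group.
CANONICAL = {
    "A": "4", "b": "6", "G": "6", "B": "8", "C": "c", "F": "f",
    "g": "9", "q": "9", "K": "k", "P": "p", "s": "5", "S": "5",
    "V": "v", "W": "w", "X": "x", "Y": "y", "z": "2", "Z": "2",
}


def normalize(token: str) -> str:
    """Replace look-alikes with their canonical character; reject the rest."""
    out = []
    for ch in token:
        ch = CANONICAL.get(ch, ch)
        if ch not in ALPHABET:
            raise ValueError(f"character {ch!r} is not allowed in a token")
        out.append(ch)
    return "".join(out)


def new_token(length: int = 8) -> str:
    """Random token drawn uniformly from the unambiguous alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(normalize("3Sh"))  # 35h
```

The lookup table is applied before any case folding because case matters inside the groups: `b` maps to `6` while `B` maps to `8`.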

Analysis:

Using the rule of thumb that a UUID with a numerical equivalent range of N possibilities is sufficient to avoid collisions for sqrt(N) instances:

  • an 8-digit UUID using this character set should be sufficient to avoid collisions for about 300,000 instances
  • a 16-digit UUID using this character set should be sufficient to avoid collisions for about 80 billion instances.
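
The two figures can be checked directly. This short Python sketch applies the same sqrt(N) rule of thumb with N = 23^length; `safe_instances` is a hypothetical helper name.

```python
import math

ALPHABET_SIZE = 23


def safe_instances(length: int) -> int:
    """sqrt(N) rule of thumb, where N = ALPHABET_SIZE ** length."""
    return math.isqrt(ALPHABET_SIZE ** length)


for length in (8, 16):
    print(length, f"{safe_instances(length):,}")
# 8 279,841
# 16 78,310,985,281
```

So "about 300,000" and "about 80 billion" are the rounded values of 23^4 and 23^8.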
Jewelfish answered 25/9, 2019 at 12:26 Comment(1)
My favorite list of unambiguous characters here. Thank you!Hanson
M
21

I needed a replacement for hexadecimal (base 16) for similar reasons (e.g. for encoding a key). The best I could come up with is the following set of 16 characters:

0 1 2 3 4 5 6 7 8 9 A B C D E F     Hexadecimal
H M N 3 4 P 6 7 R 9 T W C X Y F     Replacement

In the replacement set, we consider the following:

All characters used have major distinguishing features that would only be omitted in a truly awful font.

Vowels A E I O U omitted to avoid accidentally spelling words.

Sets of characters that could potentially be very similar or identical in some fonts are avoided completely (none of the characters in any set are used at all):

0 O D Q 
1 I L J
8 B 
5 S
2 Z

By avoiding these characters completely, the hope is that the user will enter the correct characters, rather than trying to correct mis-entered characters.

For sets of less similar but potentially confusing characters, we only use one character in each set, hopefully the most distinctive:

Y U V 

Here Y is used, since it always has the lower vertical section, and a serif in serif fonts

C G         

Here C is used, since it seems less likely that a C would be entered as G, than vice versa

X K         

Here X is used, since it is more consistent in most fonts

F E         

Here F is used, since it is not a vowel

In the case of these similar sets, entry of any character in the set could be automatically converted to the one that is actually used (the first one listed in each set). Note that E must not be automatically converted to F if hexadecimal input might be used (see below).

Note that there are still similar-sounding letters in the replacement set; this is pretty much unavoidable. When reading aloud, a phonetic alphabet should be used.

Where characters that are also present in standard hexadecimal are used in the replacement set, they are used for the same base-16 value. In theory mixed input of hexadecimal and replacement characters could be supported, provided E is not automatically converted to F.

Since this is just a character replacement, it should be easy to convert to/from hexadecimal.
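
A minimal Python sketch of that conversion, under the assumption that input to the decoder is never raw hexadecimal (so E can safely be folded onto F). The function names `to_replacement` and `from_replacement` are hypothetical.

```python
HEX = "0123456789ABCDEF"
REPLACEMENT = "HMN34P67R9TWCXYF"

_HEX_TO_REPL = str.maketrans(HEX, REPLACEMENT)
_REPL_TO_HEX = str.maketrans(REPLACEMENT, HEX)
# Look-alikes folded onto the character actually used. E -> F is only safe
# because from_replacement() in this sketch never accepts raw hexadecimal.
_ALIASES = str.maketrans("UVGKE", "YYCXF")


def to_replacement(hex_string: str) -> str:
    """Rewrite a hexadecimal string using the replacement characters."""
    text = hex_string.upper()
    if any(ch not in HEX for ch in text):
        raise ValueError("not a hexadecimal string")
    return text.translate(_HEX_TO_REPL)


def from_replacement(code: str) -> str:
    """Fold look-alikes, then map replacement characters back to hexadecimal."""
    text = code.upper().translate(_ALIASES)
    if any(ch not in REPLACEMENT for ch in text):
        raise ValueError("unexpected character in code")
    return text.translate(_REPL_TO_HEX)


print(to_replacement("deadbeef"))     # XYTXWYYF
print(from_replacement("kutxwvyf"))   # DEADBEEF
```

To support mixed hexadecimal and replacement input instead, the `E` entry would have to be dropped from the alias table, as noted above.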

Upper case seems best for the "canonical" form for output, although lower case also looks reasonable, except for "h" and "n", which should still be relatively clear in most fonts:

h m n 3 4 p 6 7 r 9 t w c x y f

Input can of course be case-insensitive.

There are several similar systems for base 32; see http://en.wikipedia.org/wiki/Base32. However, these necessarily introduce more similar-looking characters in return for 25% more information per character (5 bits instead of 4).

Apparently the following set was also used for Windows product keys in base 24, but again has more similar-looking characters:

B C D F G H J K M P Q R T V W X Y 2 3 4 6 7 8 9
Milieu answered 13/12, 2014 at 13:16 Comment(2)
Very well thought out, thank you for contributing this answer.Draggletailed
If I have it right, here is a trivial Python gist implementing this.Draggletailed
F
7

If you have the option to use only capitals, I created this set based on characters that users commonly mistyped. However, this depends entirely on the font in which they read the text.

Characters to use: A C D E F G H J K L M N P Q R T U V W X Y 3 4 6 7 9

Characters to avoid:

B similar to 8
I similar to 1
O similar to 0
S similar to 5
Z similar to 2
Fellers answered 5/4, 2020 at 14:30 Comment(0)
C
4

What you seek is an unambiguous, efficient human-computer code. What I recommend is to encode the entire data with literal (meaningful) words, nouns in particular.

I have been developing software to do just that, and most efficiently. I call it WCode.
Technically it's just Base-1024 encoding, wherein you use words instead of symbols.
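
To make "Base-1024 with words" concrete, here is a rough Python sketch of the general idea. It is not WCode's implementation: the 1024-entry vocabulary below is a placeholder of pronounceable pseudo-words generated from syllables, whereas a real scheme would use a curated list of nouns, and `encode`/`decode` are hypothetical names.

```python
from itertools import product

# Placeholder vocabulary (an assumption, not WCode's word list):
# 16 consonants x 2 vowels = 32 syllables, and 32 * 32 = 1024 two-syllable words.
CONSONANTS = "bdfghjklmnprstvz"
VOWELS = "ao"
SYLLABLES = [c + v for c in CONSONANTS for v in VOWELS]
WORDS = [a + b for a, b in product(SYLLABLES, repeat=2)]
INDEX = {word: i for i, word in enumerate(WORDS)}


def encode(data: bytes) -> list[str]:
    """Encode bytes as words, 10 bits per word, zero-padding the last word."""
    bits = len(data) * 8
    chunks = -(-bits // 10)  # ceiling division
    n = int.from_bytes(data, "big") << (chunks * 10 - bits)
    return [WORDS[(n >> (10 * (chunks - 1 - i))) & 1023] for i in range(chunks)]


def decode(words: list[str], length: int) -> bytes:
    """Inverse of encode(); `length` is the original number of bytes."""
    n = 0
    for word in words:
        n = (n << 10) | INDEX[word]
    n >>= len(words) * 10 - length * 8  # drop the padding bits
    return n.to_bytes(length, "big")


words = encode(b"hi!")
print(words)
print(decode(words, 3))
```

Each word carries 10 bits, so a 128-bit key needs 13 words.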

Here are the links:
Presentation: https://docs.google.com/presentation/d/1sYiXCWIYAWpKAahrGFZ2p5zJX8uMxPccu-oaGOajrGA/edit
Documentation: https://docs.google.com/folder/d/0B0pxLafSqCjKOWhYSFFGOHd1a2c/edit
Project: https://github.com/San13/WCode (Please wait while I get around uploading...)

Culbert answered 24/9, 2012 at 20:56 Comment(2)
@BrianM.Hunt Check out the website: WCodes.org. I also made a video and have posted the project on crowdfunding site: IndieGoGo igg.me/at/wcode/x/2245741Culbert
Nowadays probably better to go with e.g. BIP39, which encodes bitcoin secret keys in words. github.com/bitcoin/bips/blob/master/bip-0039.mediawikiMalisamalison
Z
3

Unambiguous looking letters for humans are also unambiguous for optical character recognition (OCR). By removing all pairs of letters that are confusing for OCR, one obtains:

 !+2345679:BCDEGHKLQSUZadehiopqstu

See https://www.monperrus.net/martin/store-data-paper

Zeeba answered 16/5, 2020 at 16:40 Comment(0)
C
2

This would be a general problem in OCR. Thus, for end-to-end solutions in which the OCR encoding is controlled, specialised fonts have been developed to solve the "visual ambiguity" issue you mention. See: http://en.wikipedia.org/wiki/OCR-A_font

As additional information: you may want to know about Base32 encoding, in which the symbol for the digit '1' is not used because users may confuse it with the symbol for the letter 'l'.
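
As a rough illustration, Base32 can be used with a different symbol set by translating the output of a standard encoder. The Python sketch below uses the standard library's `base64.b32encode` and swaps in a Crockford-style alphabet (no I, L, O or U); the choice of alphabet and the names `encode`/`decode` are assumptions for the example.

```python
import base64

STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # RFC 4648 Base32 alphabet
CUSTOM = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"    # Crockford-style: no I, L, O, U

_TO_CUSTOM = str.maketrans(STANDARD, CUSTOM)
_TO_STANDARD = str.maketrans(CUSTOM, STANDARD)


def encode(data: bytes) -> str:
    """Standard Base32, padding stripped, re-lettered with the custom symbols."""
    standard = base64.b32encode(data).decode("ascii").rstrip("=")
    return standard.translate(_TO_CUSTOM)


def decode(text: str) -> bytes:
    """Map custom symbols back, restore padding, and decode as standard Base32."""
    standard = text.upper().translate(_TO_STANDARD)
    standard += "=" * (-len(standard) % 8)
    return base64.b32decode(standard)


code = encode(b"hello world")
print(code)
print(decode(code))
```

The encoding logic itself is untouched; only the 32 symbols change, so any set of 32 sufficiently distinct characters could be substituted for `CUSTOM`.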

Culbert answered 12/8, 2012 at 9:16 Comment(2)
Thanks - Base32 is a good tip. Strictly speaking, the question only relates to OCR by way of the Canada Post precedent of removing characters that are ambiguous to machine readers. I am interested in a character (or glyph, really) set that is less ambiguous to humans.Draggletailed
You may use your custom set of symbols in base32-encoding with implementation part remaining the same.Culbert
O
1

It depends how large you want your set to be. For example, just the set {0, 1} will probably work well. Similarly the set of digits only. But probably you want a set that's roughly half the size of the original set of characters.

I have not done this, but here's a suggestion. Pick a font, pick an initial set of characters, and write some code to do the following. Draw each character to fit into an n-by-n square of black and white pixels, for n = 1 through (say) 10. Cut away any all-white rows and columns from the edge, since we're only interested in the black area. That gives you a list of 10 codes for each character. Measure the distance between any two characters by how many of these codes differ. Estimate what distance is acceptable for your application. Then do a brute-force search for a set of characters which are that far apart.

Basically, use a script to simulate squinting at the characters and see which ones you can still tell apart.
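
A rough Python sketch of that procedure follows. To stay self-contained it uses a few hand-drawn 5x5 bitmaps instead of rendering a real font, which is a big simplification; the glyph shapes, the nearest-pixel downsampling, and the names `crop`, `code`, `distance` and `best_subset` are all assumptions made for illustration.

```python
from itertools import combinations

# Hand-drawn stand-ins for rendered glyphs ('#' = black, '.' = white).
GLYPHS = {
    "O": [".###.", "#...#", "#...#", "#...#", ".###."],
    "D": ["###..", "#..#.", "#..#.", "#..#.", "###.."],
    "I": ["..#..", "..#..", "..#..", "..#..", "..#.."],
    "L": ["#....", "#....", "#....", "#....", "#####"],
}


def crop(bitmap):
    """Cut away all-white rows and columns from the edges."""
    rows = [i for i, row in enumerate(bitmap) if "#" in row]
    cols = [j for row in bitmap for j, px in enumerate(row) if px == "#"]
    return [row[min(cols):max(cols) + 1] for row in bitmap[min(rows):max(rows) + 1]]


def code(bitmap, n):
    """Resample the bitmap onto an n-by-n grid by nearest-pixel lookup."""
    h, w = len(bitmap), len(bitmap[0])
    return tuple(bitmap[r * h // n][c * w // n] for r in range(n) for c in range(n))


def distance(a, b, max_n=10):
    """Number of grid sizes 1..max_n at which the two characters' codes differ."""
    ga, gb = crop(GLYPHS[a]), crop(GLYPHS[b])
    return sum(code(ga, n) != code(gb, n) for n in range(1, max_n + 1))


def best_subset(chars, min_distance):
    """Brute-force search for a largest subset with all pairs far enough apart."""
    for size in range(len(chars), 0, -1):
        for subset in combinations(chars, size):
            if all(distance(a, b) >= min_distance for a, b in combinations(subset, 2)):
                return subset
    return ()


for a, b in combinations(GLYPHS, 2):
    print(a, b, distance(a, b))
```

The brute-force search is exponential in the number of characters, so for a full alphanumeric set a greedy or graph-based search would be needed in practice.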

Obligation answered 16/9, 2012 at 16:59 Comment(1)
This depends heavily on the font, and even the font-size. It could also require some brute-force alignment: L and I share few pixels until you place the vertical strokes so that they overlap.Saragossa
T
1

Here's some Python I wrote to encode and decode integers using the system of characters described in the question.

CHARS = '012345689ACEHKMNPRUW'
# Map each look-alike onto the one character kept from its group (X is read as K).
LOOKALIKES = str.maketrans('BGDOQFIJLT7XSVYZ', '8C000E11111K5UU2')


def base20encode(i):
    """Convert a non-negative integer into a base20 string of unambiguous characters."""
    if not isinstance(i, int):
        raise TypeError('This function must be called on an integer.')
    if i < 0:
        raise ValueError('This function must be called on a non-negative integer.')
    s = ''
    while i > 0:
        i, remainder = divmod(i, 20)
        s = CHARS[remainder] + s
    return s or CHARS[0]  # zero encodes as '0' rather than an empty string


def base20decode(s):
    """Normalize look-alike characters, then return the integer the base20 string encodes."""
    if not isinstance(s, str):
        raise TypeError('This function must be called on a string.')
    s = s.upper().translate(LOOKALIKES)
    i = 0
    for char in s:
        i = i * 20 + CHARS.index(char)  # raises ValueError on an unknown character
    return i


assert base20decode(base20encode(10)) == 10
Tussle answered 15/9, 2017 at 20:10 Comment(0)
D
-2

base58: 123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz

Delft answered 23/4, 2020 at 13:35 Comment(0)
