Reimplementing ToUpper()

Asked 2/12, 2008 at 13:44 Answered 2/12, 2008 at 14:41

Solved language-agnostic unicode localization internationalization toupper

How would you write ToUpper() if it didn't exist? Bonus points for i18n and L10n

Curiosity sparked by this: http://thedailywtf.com/Articles/The-Long-Way-toUpper.aspx

Corneous answered 2/12, 2008 at 13:44 Comment(0)

I download the Unicode tables
I import the tables into a database
I write a method upper().

Here is a sample implementation ;)

public static String upper(String s) throws SQLException {
    if (s == null) {
        return null;
    }

    final int N = s.length(); // Mind the optimization!
    Connection conn = null;
    PreparedStatement stmtName = null;
    PreparedStatement stmtSmall = null;
    ResultSet rsName = null;
    ResultSet rsSmall = null;
    StringBuilder buffer = new StringBuilder (N); // Much faster than StringBuffer!
    try {
        conn = DBFactory.getConnection();
        stmtName = conn.prepareStatement("select name from unicode.chart where codepoint = ?");
        // TODO Optimization: Maybe move this in the if() so we don't create this
        // unless there are lowercase characters in the string.
        stmtSmall = conn.prepareStatement("select codepoint from unicode.chart where name = ?");
        for (int i=0; i<N; i++) {
            int c = s.charAt(i);
            stmtName.setInt(1, c);
            rsName = stmtName.executeQuery();
            if (rsName.next()) {
                String name = rsName.getString(1);
                if (name.contains(" SMALL ")) {
                    name = name.replaceAll(" SMALL ", " CAPITAL ");

                    stmtSmall.setString(1, name);
                    rsSmall = stmtSmall.executeQuery();
                    if (rsSmall.next()) {
                        c = rsSmall.getInt(1);
                    }

                    rsSmall = DBUtil.close(rsSmall);
                }
            }
            rsName = DBUtil.close(rsName);
            // Append the (possibly converted) character to the result
            buffer.append((char) c);
        }
    }
    finally {
        // Always clean up
        rsSmall = DBUtil.close(rsSmall);
        rsName = DBUtil.close(rsName);
        stmtSmall = DBUtil.close(stmtSmall);
        stmtName = DBUtil.close(stmtName);
        conn = DBUtil.close(conn);
    }

    // TODO Optimization: Maybe read the table once into RAM at the start
    // Would waste a lot of memory, though :/
    return buffer.toString();
}

;)

Note: The Unicode charts which you can find on unicode.org contain the name of each character/code point. This string will contain " SMALL " for characters which are lowercase (mind the blanks or it might match "SMALLER" and the like). Now, you can search for a similar name with "SMALL" replaced with "CAPITAL". If you find it, you've found the capital version.
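
For what it's worth, the name trick can be tried without a database, because Python's standard unicodedata module exposes the same character names. This is only a minimal sketch: the function name upper_by_name is made up for illustration, and like the version above it maps one code point at a time, so it cannot produce multi-character results.

```python
import unicodedata


def upper_by_name(s):
    """Uppercase s by swapping " SMALL " for " CAPITAL " in each character name."""
    out = []
    for ch in s:
        # name() returns the default "" for code points that have no name
        name = unicodedata.name(ch, "")
        if " SMALL " in name:
            try:
                ch = unicodedata.lookup(name.replace(" SMALL ", " CAPITAL "))
            except KeyError:
                # No character carries the CAPITAL name; keep the original
                pass
        out.append(ch)
    return "".join(out)
```

Calling upper_by_name("hello, wörld 1") should leave the punctuation and the digit alone and convert only the letters.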

Isherwood answered 2/12, 2008 at 14:41 Comment(0)

I don't think SO can handle the size of the Unicode tables in a single posting :)

Unfortunately, it is not as easy as just calling char.ToUpper() on every character.

Example:

(string-upcase "Straße")    ⇒ "STRASSE"
(string-downcase "Straße")  ⇒ "straße"
(string-upcase "ΧΑΟΣ")      ⇒ "ΧΑΟΣ"
(string-downcase "ΧΑΟΣ")    ⇒ "χαος"
(string-downcase "ΧΑΟΣΣ")   ⇒ "χαοσς"
(string-downcase "ΧΑΟΣ Σ")  ⇒ "χαος σ"
(string-upcase "χαος")      ⇒ "ΧΑΟΣ"
(string-upcase "χαοσ")      ⇒ "ΧΑΟΣ"
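
The same cases can be checked against a built-in, locale-independent implementation. A small sketch, assuming a recent Python 3, whose str.upper() and str.lower() apply Unicode's full case mappings; naive_upper is a hypothetical helper that mimics a strict one-character-in, one-character-out conversion.

```python
def naive_upper(s):
    """Per-character conversion that refuses to change the string length."""
    return "".join(c.upper() if len(c.upper()) == 1 else c for c in s)


# Full mapping: one character may become two.
full = "Straße".upper()
# The naive version has nowhere to put the second S, so ß stays as it is.
naive = naive_upper("Straße")

# Lowercasing a capital sigma depends on its position in the word.
lowered = ["ΧΑΟΣ".lower(), "ΧΑΟΣΣ".lower(), "ΧΑΟΣ Σ".lower()]

# Both lowercase sigma forms map back to the same capital letter.
raised = ["χαος".upper(), "χαοσ".upper()]
```

The point of the comparison is that the correct result for a character can depend on its neighbours, which a plain lookup per character cannot express.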

Lester answered 2/12, 2008 at 13:49 Comment(6)

(string-upcase "Straße") ⇒ "STRAẞE" – Millar 2/12, 2008 at 15:17

Hangy, sorry, that does not render. Also my conversions are locale-independent (guess I should have mentioned that ;p). – Lester 2/12, 2008 at 16:4

And I simply pasted from the R6RS Scheme spec, it could be a typo, will check the tests. – Lester 2/12, 2008 at 16:5

Seems to be correct. These Scheme guys are really pedantic, I will take their word for it :) – Lester 2/12, 2008 at 16:7

The upper case ß was just added to the Unicode standard by updating some ISO standard back in April, so font support is really rare. :) Also, the Duden has not accepted it into the standard language yet, so yours is correct. :) Just wanted to point out another future possibility. – Millar 3/12, 2008 at 8:15

Thanks for the clarification, will reference your post :) – Lester 3/12, 2008 at 18:44

No static table is going to be sufficient because you need to know the language before you know the correct transforms.

For example, in Turkish i needs to go to İ (U+0130), whereas in most other languages it needs to go to I (U+0049), even though the i is the same character, U+0069, in both cases.
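
To make the point concrete, here is a minimal sketch of a conversion that takes the language as an extra argument. The function name, the language-code parameter and the DOTTED_I_LANGUAGES set are all assumptions made for illustration; Azerbaijani ("az") is included because it follows the same dotted/dotless i rule as Turkish.

```python
# Languages in which dotted i and dotless ı are separate letters (illustrative set).
DOTTED_I_LANGUAGES = {"tr", "az"}


def upper_for_language(s, lang):
    """Uppercase s, applying the dotted-i rule when lang asks for it."""
    if lang in DOTTED_I_LANGUAGES:
        # i (U+0069) must become İ (U+0130) instead of I (U+0049)
        s = s.replace("i", "\u0130")
    # The default mapping already sends dotless ı (U+0131) to I (U+0049)
    return s.upper()
```

The same input then gives different results: upper_for_language("istanbul", "tr") starts with İ, while upper_for_language("istanbul", "en") starts with I.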

Sayer answered 2/12, 2008 at 14:25 Comment(1)

Uff. I guess that's why a proper i18n library takes up >10MB. Crazy people. Why couldn't our ancestors just settle for a nice simple SINGLE writing system? :P – Winnie 2/12, 2008 at 14:31

I won't win the bonus points, but here it is for 7-bit ASCII:

char toupper(char c)
{
    if ((c < 'a') || (c > 'z')) { return c; }
    else { return c & 0xdf; }
}
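
For comparison, a sketch of the same idea in Python, working on integer byte values. The name ascii_toupper is made up here. It relies on the fact that in ASCII each lowercase letter differs from its uppercase form only in bit 5, so clearing that bit has the same effect as adding ('A' - 'a').

```python
def ascii_toupper(c):
    """Uppercase one byte value; anything outside 'a'..'z' is returned unchanged."""
    if c < ord("a") or c > ord("z"):
        return c
    # Clear bit 5 (0x20); for 'a'..'z' this equals subtracting 32
    return c & 0xDF


def ascii_upper_bytes(data):
    """Apply ascii_toupper to every byte of a bytes object."""
    return bytes(ascii_toupper(b) for b in data)
```

Byte values of 128 and above fail the range check and pass through untouched.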

Alcine answered 2/12, 2008 at 13:50 Comment(7)

That's pretty much exactly the macro as it used to be in strings.h. – Duplicature 2/12, 2008 at 13:52

@Paul Tomblin: Nice! I was hoping to come close :) – Alcine 2/12, 2008 at 13:54

What about the upper 128 chars? Did you mean 7-bit? – Lester 2/12, 2008 at 13:57

Come to think of it, if I remember correctly, I think the macro actually added ('A'-'a'). And yes, @leppie, it only worked for ASCII, which by definition is 7 bit. – Duplicature 2/12, 2008 at 14:1

the check for (c < 'a') || ( c > 'z') takes care of 128..255 (or 0..-127 if a signed char is provided). Bottom line is that only the 26 characters from 'a' to 'z' are modified – Alcine 2/12, 2008 at 14:26

eJames: the nitpick was that ASCII is only 7 bit. The eighth bit is always 0 or you're not really using ASCII. – Roustabout 2/12, 2008 at 14:48

Ah, OK. Fair enough! I shall modify my answer. – Alcine 2/12, 2008 at 15:5

In Python:

toupper_map = { massive dictionary to handle all cases in all languages }
def to_upper( c ):
    return toupper_map.get( c, c )

or, if you want to do it the "wrong way"

def to_upper( c ):
    for k,v in toupper_map.items():
        if k == c: return v
    return c
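
One way such a dictionary could be filled in, purely as a sketch: let Python's own str.upper() generate the entries. The build_toupper_map name is hypothetical, and the resulting table is language-independent, so it does not cover per-language rules such as the Turkish i.

```python
import sys


def build_toupper_map():
    """Collect every code point whose uppercase form differs from itself."""
    mapping = {}
    for cp in range(sys.maxunicode + 1):
        c = chr(cp)
        u = c.upper()
        if u != c:
            mapping[c] = u  # the value may be longer than one character
    return mapping


toupper_map = build_toupper_map()


def to_upper(c):
    # Characters without an entry (digits, uppercase letters, ...) are returned as-is
    return toupper_map.get(c, c)
```

Note that some values are strings of more than one character, for example the entry for ß.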

Carcajou answered 2/12, 2008 at 14:2 Comment(0)

Let me suggest even more bonus points for languages such as Hebrew, Arabic, Georgian and others that just do not have capital (upper case) letters. :-)

Bidentate answered 2/12, 2008 at 14:16 Comment(1)

for those languages it would be extremely simple ... anyway Arabic and Hebrew have their own set of string manipulation functionality they require. – Carcajou 2/12, 2008 at 14:21
