How to detect UTF-8 in plain C?
E

11

42

I am looking for a code snippet in plain old C that detects that the given string is in UTF-8 encoding. I know the solution with regex, but for various reasons it would be better to avoid using anything but plain C in this particular case.

Solution with regex looks like this (warning: various checks omitted):

#define UTF8_DETECT_REGEXP  "^([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$"

const char *error;
int         error_off;
int         rc;
int         vect[100];
pcre       *utf8_re;
pcre_extra *utf8_pe;

utf8_re = pcre_compile(UTF8_DETECT_REGEXP, PCRE_CASELESS, &error, &error_off, NULL);
utf8_pe = pcre_study(utf8_re, 0, &error);

rc = pcre_exec(utf8_re, utf8_pe, str, len, 0, 0, vect, sizeof(vect)/sizeof(vect[0]));

if (rc > 0) {
    printf("string is in UTF8\n");
} else {
    printf("string is not in UTF8\n");
}
Emancipator answered 23/6, 2009 at 9:57 Comment(5)
Can you post the solution with the regex? - Ogilvy
@Konstantin: The above is not a comment, please edit the question directly and include these details. - Byrle
@Ludwig: Yes, but that's all I need. - Emancipator
@Konstantin: Thank you for the regex. If the regex does not match a string it means the string is certainly not valid UTF-8. The reverse is not true however. If it matches the string it can be any garbage that accidentally happens not to contain any illegal UTF-8 sequences. - Ogilvy
@Konstantin: OK, it should be possible to translate the regex into plain C. What is a little nasty are the {2}s and the {3}. Look at Christoph's solution, that is the way to go. - Ogilvy
E
55

Here's a (hopefully bug-free) implementation of this expression in plain C:

_Bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return 0;
    }

    return 1;
}

Please note that this is a faithful translation of the regular expression recommended by W3C for form validation, which does indeed reject some valid UTF-8 sequences (in particular those containing ASCII control characters).

Also, even after fixing this by making the change mentioned in the comment, it still assumes zero-termination, which prevents embedding NUL characters, although it should technically be legal.
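The zero-termination limitation can be avoided by passing an explicit length. The following is a minimal sketch rather than the original author's code: the names is_utf8_n and is_cont are made up for illustration. It accepts every byte in 0x00-0x7F (so ESC and embedded NUL pass) and keeps the same overlong, surrogate and plane-16 limits as the function above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if b is a continuation byte (10xxxxxx). */
static bool is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical length-based variant: validates exactly len bytes,
 * so NUL and other ASCII control bytes are allowed inside the data. */
bool is_utf8_n(const char *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    size_t i = 0;

    if (!data)
        return false;

    while (i < len) {
        unsigned char b0 = p[i];
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 2nd byte */
        size_t need;                        /* continuation bytes needed */

        if (b0 <= 0x7F) {                   /* any ASCII byte, incl. NUL */
            i++;
            continue;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            need = 1;                       /* non-overlong 2-byte */
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            need = 2;
            if (b0 == 0xE0) lo = 0xA0;      /* excluding overlongs */
            if (b0 == 0xED) hi = 0x9F;      /* excluding surrogates */
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            need = 3;
            if (b0 == 0xF0) lo = 0x90;      /* excluding overlongs */
            if (b0 == 0xF4) hi = 0x8F;      /* nothing above U+10FFFF */
        } else {
            return false;                   /* 0x80-0xC1 or 0xF5-0xFF */
        }

        if (len - i <= need)                /* sequence is cut off */
            return false;
        if (p[i + 1] < lo || p[i + 1] > hi)
            return false;
        for (size_t k = 2; k <= need; k++)
            if (!is_cont(p[i + k]))
                return false;
        i += need + 1;
    }
    return true;
}
```

Because the loop is bounded by len rather than by a terminator, a call such as is_utf8_n(buf, n) never reads past the n bytes it was given.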

When I dabbled in creating my own string library, I went with modified UTF-8 (i.e. encoding NUL as an overlong two-byte sequence). Feel free to use this header as a template for providing a validation routine which doesn't suffer from the above shortcomings.

Election answered 23/6, 2009 at 10:34 Comment(11)
Very nice. I was just hacking my nested ifs, but you were faster. I have not tested your solution but it looks good to me. - Ogilvy
Since you're reading byte +1, +2, +3 and only check that byte != 0, this code can read past the end of the string. Even if it's zero terminated. - Traditor
@Lucas: no, it can't: the && will short-circuit this case because 0 isn't in range of any valid multi-byte sequence - Election
This was amazing, thank you very much for providing this code. - Halvaard
This code will reject a string containing an ASCII ESC (0x1b). Is this right? In my readings on UTF-8 I can't find anything that says this character isn't allowed as a 1-byte sequence. - Casual
@AndrewR: I added a comment and some paragraphs for clarification - Election
How does your answer compare to Danny's IsUTF8Bytes() function in his answer? His seems shorter... https://mcmap.net/q/12198/-how-to-find-out-the-encoding-of-a-file-c - Pinkster
@DanW; Danny's IsUTF8Bytes() only verifies that the byte string conforms to the UTF-8 encoding scheme (see Stefan's answer), whereas mine does some additional validation by excluding non-characters and ASCII control sequences; the latter are actually valid UTF-8 - see comments above - Election
Has anyone tested this thoroughly? Would like to know that it is bug-free instead of just hoping :). 31 upvotes and no comments to the contrary in 7 years is a good sign, though. - Propylene
This code saved us a nice amount of time, thank you! - Phylis
Thank you for the code, but there is a mistake bytes[0] <= 0x7E should be bytes[0] <= 0x7F - Metathesis
M
42

This decoder by Bjoern Hoehrmann is the simplest I've found. It works by feeding it a single byte at a time while keeping a state. The state is very useful for parsing UTF-8 that arrives in chunks over the network.

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

// Copyright (c) 2008-2009 Bjoern Hoehrmann <[email protected]>
// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.

#include <stddef.h>
#include <stdint.h>

#define UTF8_ACCEPT 0
#define UTF8_REJECT 1

static const uint8_t utf8d[] = {
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
  8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
  0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
  0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
  0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
  1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
  1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
  1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

static inline uint32_t
decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
  uint32_t type = utf8d[byte];

  *codep = (*state != UTF8_ACCEPT) ?
    (byte & 0x3fu) | (*codep << 6) :
    (0xff >> type) & (byte);

  *state = utf8d[256 + *state*16 + type];
  return *state;
}

A simple validator/detector doesn't need the code point, so it could be written like this (Initial state is set to UTF8_ACCEPT):

uint32_t validate_utf8(uint32_t *state, const char *str, size_t len) {
    size_t i;
    uint32_t type;

    for (i = 0; i < len; i++) {
        // We don't care about the codepoint, so this is
        // a simplified version of the decode function.
        type = utf8d[(uint8_t)str[i]];
        *state = utf8d[256 + (*state) * 16 + type];

        if (*state == UTF8_REJECT)
            break;
    }

    return *state;
}

If the text is valid UTF-8, UTF8_ACCEPT is returned. If it's invalid, UTF8_REJECT is returned. If more data is needed, some other integer is returned.
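The three possible outcomes can be made explicit when a whole buffer is available at once. The sketch below is only an illustration: the enum utf8_status and the function classify_utf8 are hypothetical names, and the table is repeated from the decoder above purely so that the block compiles on its own.

```c
#include <stddef.h>
#include <stdint.h>

#define UTF8_ACCEPT 0
#define UTF8_REJECT 1

/* Same table as in the decoder above (Copyright (c) 2008-2009 Bjoern Hoehrmann). */
static const uint8_t utf8d[] = {
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
  8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
  0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
  0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
  0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
  1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
  1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
  1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

/* Hypothetical result type naming the three outcomes. */
enum utf8_status { UTF8_OK, UTF8_BAD, UTF8_TRUNCATED };

/* Hypothetical wrapper: runs the state machine over one complete buffer. */
enum utf8_status classify_utf8(const char *buf, size_t len)
{
    uint32_t state = UTF8_ACCEPT;

    for (size_t i = 0; i < len; i++) {
        uint32_t type = utf8d[(uint8_t)buf[i]];
        state = utf8d[256 + state * 16 + type];
        if (state == UTF8_REJECT)
            return UTF8_BAD;        /* an illegal byte was seen */
    }
    /* Any state other than ACCEPT here means the buffer
     * ended in the middle of a multi-byte sequence. */
    return state == UTF8_ACCEPT ? UTF8_OK : UTF8_TRUNCATED;
}
```

Whether UTF8_TRUNCATED should count as an error depends on the caller: for a complete file it probably should, for one chunk of a stream it just means more bytes are expected.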

Usage example with feeding data in chunks (e.g. from the network):

char buf[128];
size_t bytes_read;
uint32_t state = UTF8_ACCEPT;

// Validate the UTF8 data in chunks.
while ((bytes_read = get_new_data(buf, sizeof(buf)))) {
    if (validate_utf8(&state, buf, bytes_read) == UTF8_REJECT) {
        fprintf(stderr, "Invalid UTF8 data!\n");
        return -1;
    }
}

// If everything went well we should have proper UTF8, but
// the data might instead have ended in the middle of a UTF8
// codepoint.
if (state != UTF8_ACCEPT) {
    fprintf(stderr, "Invalid UTF8, incomplete codepoint\n");
}
Mountebank answered 2/3, 2014 at 23:15 Comment(0)
R
11

You cannot detect whether a given string (or byte sequence) is UTF-8 encoded text, because, for example, every series of UTF-8 octets is also a valid (if nonsensical) series of Latin-1 (or some other encoding) octets. However, not every series of valid Latin-1 octets is a valid UTF-8 series. So you can rule out strings that do not conform to the UTF-8 encoding schema:

U+0000-U+007F    0xxxxxxx
U+0080-U+07FF    110yyyxx    10xxxxxx
U+0800-U+FFFF    1110yyyy    10yyyyxx    10xxxxxx
U+10000-U+10FFFF 11110zzz    10zzyyyy    10yyyyxx    10xxxxxx   
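As a rough illustration of ruling strings out with this schema, the sketch below checks only the bit patterns from the table: the lead byte decides the length and every following byte must look like 10xxxxxx. The function names utf8_seq_len and matches_utf8_schema are invented for this example, and the check is deliberately structural, so it does not reject overlong forms, surrogates or values above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper: total sequence length implied by a lead byte,
 * or 0 if the byte cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* 10xxxxxx or 11111xxx */
}

/* Hypothetical structural check: returns 1 if the bytes fit the
 * schema, 0 if they cannot be UTF-8. */
int matches_utf8_schema(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        size_t n = utf8_seq_len(s[i]);

        if (n == 0 || n > len - i)        /* bad lead byte or cut off */
            return 0;
        for (size_t k = 1; k < n; k++)
            if ((s[i + k] & 0xC0) != 0x80) /* must be 10xxxxxx */
                return 0;
        i += n;
    }
    return 1;
}
```

A Latin-1 string such as "caf\xE9" fails this check because 0xE9 announces a three-byte sequence that never follows, which is the typical way non-UTF-8 text gets ruled out.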
Radionuclide answered 23/6, 2009 at 10:10 Comment(0)
D
6

You'd have to parse the string as UTF-8, see http://www.rfc-editor.org/rfc/rfc3629.txt. It's very simple. If the parsing fails, it's not UTF-8. There are several simple UTF-8 libraries around that can do this.

It could perhaps be simplified if you know the string is either plain old ASCII or contains characters outside ASCII that are UTF-8 encoded. In that case you often don't need to care about the difference: UTF-8 was designed so that existing programs that could handle ASCII could, in most cases, transparently handle UTF-8.

Keep in mind that ASCII is encoded in UTF-8 as itself, so ASCII is valid UTF-8.

A C string can be anything. Is the problem you need to solve that you don't know whether the content is ASCII, GB 2312, CP437, UTF-16, or any of the other dozen character encodings that make a programmer's life hard?

Disposition answered 23/6, 2009 at 10:6 Comment(0)
W
3

It is impossible to detect that a given array of bytes is a UTF-8 string. You can reliably determine that it can't be valid UTF-8 (which doesn't mean it's not invalid UTF-8); and you can determine that it might be a valid UTF-8 sequence but this is subject to false positives.

For a simple example, use a random number generator to generate an array of 3 random bytes and use it to test your code. These are random bytes and therefore not UTF-8, so every string that your code thinks is "possibly UTF-8" is a false positive. My guess is that (under these conditions) your code will be wrong over 12% of the time.

Once you recognise that it's impossible, you can start thinking about returning a confidence level (in addition to your prediction). For example, your function might return something like "I'm 88% sure that this is UTF-8".

Now do this for all other types of data. For example, you might have a function that checks if the data is UTF-16 that might return "I'm 95% confident that this is UTF-16", and then decide that (because 95% is higher than 88%) it's more likely that the data was UTF-16 and not UTF-8.

The next step is to add tricks to increase the confidence levels. For example, if the string seems to mostly contain groups of valid syllables separated by white space, then you can be a lot more confident that it actually is UTF-8. In the same way, if the data might be HTML then you could check for things that might be valid HTML markup and use that to increase your confidence.

Of course the same applies to other types of data. For example, if the data has a valid PE32 or ELF header, or a correct BMP or JPG or MP3 header, then you can be a lot more confident that it's not UTF-8 at all.

A far better approach is to fix the actual cause of the problem. For example, it may be possible to add some sort of "document type" identifier to the start of all files that you care about, or perhaps say "this software assumes UTF-8 and doesn't support anything else", so that you don't need to make dodgy guesses in the first place.

Weil answered 4/3, 2014 at 8:40 Comment(4)
It IS possible to check for UTF-8. Check 1st byte: if it is in 0x00-0x7F, it is valid; if it is 1st byte of 2-byte UTF-8 character (0xC2-0xDF), check whether next byte is a valid trailing byte (0x80-0xBF), in which case those 2 bytes together are a valid 2-byte UTF-8 character; if it is 1st byte of a 3-byte UTF-8 character (0xE0-EF), check whether next 2 bytes are TBs, in which case those 3 bytes are a valid 3-byte UTF-8 character; if it is 1st byte of a 4-byte UTF-8 character (0xF0-0xF4), check whether next 3 bytes are TBs, in which case those 4 bytes are a valid 4-byte UTF-8 character. - Embodiment
@ThomasHedden: Assume you have the 4 bytes 0x41, 0x24, 0x2E, 0x7F. How do you tell the difference between "intended to be UTF-8" and "merely complies with the rules of the encoding scheme but not intended to be UTF-8 at all"? Maybe it's just a floating point number, or a pair of 16-bit signed integers, or... You can't know. It's impossible to know. - Weil
That's always possible. However, if you select a minimum length of 10 or 20 you can detect a lot and rule out most false positives. I wrote a test program and ran it on a JPG file. I deliberately added strings to the JPG file using hexedit, and my test program did find that string. What I got is too long to post here. - Embodiment
@ThomasHedden: No. As the number of bytes increases the total permutations increases, the number of false positives increases, and the amount of proof that you are wrong increases. Essentially; you're hiding more needles in a larger haystack and then attempting to pretend "more needles" is "zero needles". You can never get rid of all false positives. - Weil
L
2

You can use the UTF-8 detector integrated into Firefox. It is found in the universal charset detector, and it's pretty much a standalone C++ library. It should be extremely easy to find the class that recognizes UTF-8 and take only that.
What this class basically does is detect character sequences that are unique to UTF-8.

  • get the latest firefox trunk
  • go to \mozilla\extensions\universalchardet\
  • find the UTF-8 detector class (I don't quite remember what its exact name is)
Lurette answered 23/6, 2009 at 10:7 Comment(0)
A
2

Basically I check if the given key (a string of maximum 4 characters) matches the format from this link: http://www.fileformat.info/info/unicode/utf8.htm

#include <string.h>

/*
** Checks if the given string has all bytes like: 10xxxxxx
** where x is either 0 or 1
*/

static int      chars_are_folow_uni(const unsigned char *chars)
{
    while (*chars)
    {
        if ((*chars >> 6) != 0x2)
            return (0);
        chars++;
    }
    return (1);
}

int             char_is_utf8(const unsigned char *key)
{
    size_t      required_len;

    if (key[0] >> 7 == 0)           /* 0xxxxxxx */
        required_len = 1;
    else if (key[0] >> 5 == 0x6)    /* 110xxxxx */
        required_len = 2;
    else if (key[0] >> 4 == 0xE)    /* 1110xxxx */
        required_len = 3;
    else if (key[0] >> 3 == 0x1E)   /* 11110xxx: shift by 3, not 5 */
        required_len = 4;
    else
        return (0);
    return (strlen((const char *)key) == required_len && chars_are_folow_uni(key + 1));
}

Works fine for me:

unsigned char   buf[5];

ft_to_utf8(L'歓', buf);
printf("%d\n", char_is_utf8(buf)); // => 1
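The helper ft_to_utf8 used in the test is not shown in the answer. A possible stand-in, offered only as an assumption about what such a helper does, is sketched below under the hypothetical name encode_utf8: it writes one code point as UTF-8 followed by a terminating NUL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ft_to_utf8: encodes cp into buf (which
 * must hold at least 5 bytes) and NUL-terminates it. Returns the
 * number of bytes written, or 0 if cp is not encodable. */
size_t encode_utf8(uint32_t cp, unsigned char buf[5])
{
    size_t n;

    buf[0] = '\0';
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        buf[0] = (unsigned char)cp;
        n = 1;
    } else if (cp <= 0x7FF) {               /* 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp <= 0xFFFF) {              /* 1110xxxx + 2 more */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {            /* 11110xxx + 3 more */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        return 0;                           /* beyond U+10FFFF */
    }
    buf[n] = '\0';
    return n;
}
```

For 歓 (U+6B53) this produces the three bytes E6 AD 93, which is the kind of input the test above feeds to char_is_utf8.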
Approver answered 11/4, 2017 at 6:21 Comment(0)
G
1

3 random bytes seem to have a 15.8% chance of being valid UTF-8 according to my calculation:

128^3 possible ASCII-only sequences = 2097152

2^16-2^11 possible 3-byte UTF-8 characters (this is assuming surrogate pairs and noncharacters are allowed) = 63488

1920 2-byte UTF-8 characters either before or after an ASCII character = 1920*128*2 = 491520

Divide by number of 3-byte sequences = (2097152+63488+491520)/16777216.0 = 0.1580810546875
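This arithmetic can be cross-checked by brute force, since there are only 2^24 three-byte sequences. The sketch below is an illustration with made-up names (valid3, count_valid_3byte); the allow_surrogates flag is there to mirror the assumption stated above. With surrogates allowed the count should come to 2097152 + 63488 + 491520 = 2652160, and excluding the 2048 surrogate encodings should give 2650112.

```c
#include <stddef.h>

/* True if b is a continuation byte (10xxxxxx). */
static int cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: is this 3-byte buffer a complete, valid
 * UTF-8 string? Overlong forms are always rejected. */
static int valid3(const unsigned char b[3], int allow_surrogates)
{
    size_t i = 0;

    while (i < 3) {
        unsigned char c = b[i];
        size_t left = 3 - i;

        if (c <= 0x7F) {
            i += 1;
        } else if (c >= 0xC2 && c <= 0xDF) {
            if (left < 2 || !cont(b[i + 1]))
                return 0;
            i += 2;
        } else if (c >= 0xE0 && c <= 0xEF) {
            if (left < 3 || !cont(b[i + 1]) || !cont(b[i + 2]))
                return 0;
            if (c == 0xE0 && b[i + 1] < 0xA0)
                return 0;                   /* overlong */
            if (c == 0xED && b[i + 1] > 0x9F && !allow_surrogates)
                return 0;                   /* surrogate half */
            i += 3;
        } else {
            return 0;   /* stray continuation, C0/C1, or a lead byte
                           whose sequence cannot fit in 3 bytes */
        }
    }
    return 1;
}

/* Counts how many of the 2^24 possible 3-byte sequences are valid. */
unsigned long count_valid_3byte(int allow_surrogates)
{
    unsigned long count = 0;

    for (unsigned long v = 0; v < (1UL << 24); v++) {
        unsigned char b[3] = {
            (unsigned char)(v >> 16),
            (unsigned char)(v >> 8),
            (unsigned char)v
        };
        count += (unsigned long)valid3(b, allow_surrogates);
    }
    return count;
}
```

Dividing count_valid_3byte(1) by 16777216.0 gives the fraction discussed here; the strict variant count_valid_3byte(0) is only slightly smaller.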

IMHO this vastly overestimates the number of incorrect matches, because the file is only 3 bytes long. The intersection goes way down as the number of bytes increases. Also, actual text in a non-UTF-8 encoding is not random: it contains a large number of lone bytes with the high bit set, which is not valid UTF-8.

A more useful metric for guessing the odds of failure is how likely a sequence of bytes with the high bit set is to be valid UTF-8. I get these values:

1 byte = 0% # the really important number that is often ignored
2 byte = 11.7%
3 byte = 3.03% (assumes surrogate halves are valid)
4 byte = 1.76% (includes two 2-byte characters)
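One way these percentages can be reproduced is sketched below. It is my reading of the figures, not the author's own calculation: each value is taken as the number of valid UTF-8 strings of n bytes, all with the high bit set, divided by the 128^n possible sequences of such bytes. The function name high_bit_valid_fraction is hypothetical.

```c
/* Hypothetical helper: fraction of n-byte sequences (every byte has
 * the high bit set) that form valid UTF-8, for n = 1..4.
 * Returns -1.0 for any other n. */
double high_bit_valid_fraction(int n)
{
    const double two   = 1920.0;     /* 2-byte sequences: 30 leads * 64 */
    const double three = 63488.0;    /* 2^16 - 2^11, surrogate halves
                                        counted as valid, as stated */
    const double four  = 1048576.0;  /* U+10000..U+10FFFF = 2^20 */
    double total = 1.0;
    double valid;

    switch (n) {
    case 1: valid = 0.0;              break; /* a lone high byte is never valid */
    case 2: valid = two;              break;
    case 3: valid = three;            break; /* 2+1 or 1+2 would need an ASCII byte */
    case 4: valid = four + two * two; break; /* one 4-byte or two 2-byte characters */
    default: return -1.0;
    }
    for (int i = 0; i < n; i++)
        total *= 128.0;                      /* 128 high-bit bytes per position */
    return valid / total;
}
```

Multiplying the result by 100 gives a percentage that can be compared with the list above.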

It is also useful to try to find an actual readable string (in any language and any encoding) that is also a valid UTF-8 string. This is very very difficult, indicating that this is not a problem with real data.

Grazynagreabe answered 2/6, 2014 at 18:48 Comment(0)
D
0

I know it's an old thread, but I thought I'd post my solution here since I think it's an improvement over @Christoph 's wonderful solution (which I upvoted).

I'm no expert, so I may have read the RFC wrong, but it seems to me a 32 byte map can be used instead of a 256 byte map, saving both memory and time.

This led me to a simple macro that advances a string pointer by one UTF-8 character, storing the UTF8 code-point in a 32bit signed integer and storing the value -1 in case of error.

Here's the code with some comments.

#include <stddef.h>
#include <stdint.h>
/**
 * Maps the top 5 bits in a byte (0b11111xxx) to a UTF-8 codepoint length.
 *
 * Codepoint length 0 == error.
 *
 * The first valid length can be any value between 1 to 4 (5== error).
 *
 * An intermediate (second, third or fourth) valid length must be 5.
 *
 * The map was populated using the following Ruby script:
 *
 *      map = []; 32.times { map << 0 }; (0..0b1111).each {|i| map[i] = 1} ;
 *      (0b10000..0b10111).each {|i| map[i] = 5} ;
 *      (0b11000..0b11011).each {|i| map[i] = 2} ;
 *      (0b11100..0b11101).each {|i| map[i] = 3} ;
 *      map[0b11110] = 4; map;
 */
static uint8_t fio_str_utf8_map[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                                     1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5,
                                     5, 5, 2, 2, 2, 2, 3, 3, 4, 0};

/**
 * Advances the `ptr` by one utf-8 character, placing the value of the UTF-8
 * character into the i32 variable (which must be a signed integer with 32bits
 * or more). On error, `i32` will be equal to `-1` and `ptr` will not step
 * forwards.
 *
 * The `end` value is only used for overflow protection.
 */
#define FIO_STR_UTF8_CODE_POINT(ptr, end, i32)                                 \
  switch (fio_str_utf8_map[((uint8_t *)(ptr))[0] >> 3]) {                      \
  case 1:                                                                      \
    (i32) = ((uint8_t *)(ptr))[0];                                             \
    ++(ptr);                                                                   \
    break;                                                                     \
  case 2:                                                                      \
    if (((ptr) + 2 > (end)) ||                                                 \
        fio_str_utf8_map[((uint8_t *)(ptr))[1] >> 3] != 5) {                   \
      (i32) = -1;                                                              \
      break;                                                                   \
    }                                                                          \
    (i32) =                                                                    \
        ((((uint8_t *)(ptr))[0] & 31) << 6) | (((uint8_t *)(ptr))[1] & 63);    \
    (ptr) += 2;                                                                \
    break;                                                                     \
  case 3:                                                                      \
    if (((ptr) + 3 > (end)) ||                                                 \
        fio_str_utf8_map[((uint8_t *)(ptr))[1] >> 3] != 5 ||                   \
        fio_str_utf8_map[((uint8_t *)(ptr))[2] >> 3] != 5) {                   \
      (i32) = -1;                                                              \
      break;                                                                   \
    }                                                                          \
    (i32) = ((((uint8_t *)(ptr))[0] & 15) << 12) |                             \
            ((((uint8_t *)(ptr))[1] & 63) << 6) |                              \
            (((uint8_t *)(ptr))[2] & 63);                                      \
    (ptr) += 3;                                                                \
    break;                                                                     \
  case 4:                                                                      \
    if (((ptr) + 4 > (end)) ||                                                 \
        fio_str_utf8_map[((uint8_t *)(ptr))[1] >> 3] != 5 ||                   \
        fio_str_utf8_map[((uint8_t *)(ptr))[2] >> 3] != 5 ||                   \
        fio_str_utf8_map[((uint8_t *)(ptr))[3] >> 3] != 5) {                   \
      (i32) = -1;                                                              \
      break;                                                                   \
    }                                                                          \
    (i32) = ((((uint8_t *)(ptr))[0] & 7) << 18) |                              \
            ((((uint8_t *)(ptr))[1] & 63) << 12) |                             \
            ((((uint8_t *)(ptr))[2] & 63) << 6) |                              \
            (((uint8_t *)(ptr))[3] & 63);                                      \
    (ptr) += 4;                                                                \
    break;                                                                     \
  default:                                                                     \
    (i32) = -1;                                                                \
    break;                                                                     \
  }

/** Returns 1 if the String is UTF-8 valid and 0 if not. */
inline static size_t fio_str_utf8_valid2(char const *str, size_t length) {
  if (!str)
    return 0;
  if (!length)
    return 1;
  const char *const end = str + length;
  int32_t c = 0;
  do {
    FIO_STR_UTF8_CODE_POINT(str, end, c);
  } while (c > 0 && str < end);
  return str == end && c >= 0;
}
Demy answered 31/7, 2018 at 20:44 Comment(0)
A
0

Additional Suggestion With Expressive Code

I am suggesting the following code. I hope it is a bit more expressive and easier to understand (though maybe not as fast as other suggestions in this post).

The test cases below demonstrate how this answers the initial question.

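#include <stdio.h>   /* printf, used by the test cases */
#include <stdlib.h>  /* exit, used by the test cases */

#define __FALSE (0)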
#define __TRUE  (!__FALSE)

#define MS1BITCNT_0_IS_0xxxxxxx_NO_SUCCESSOR        (0)
#define MS1BITCNT_1_IS_10xxxxxx_IS_SUCCESSOR        (1)
#define MS1BITCNT_2_IS_110xxxxx_HAS_1_SUCCESSOR     (2)
#define MS1BITCNT_3_IS_1110xxxx_HAS_2_SUCCESSORS    (3)
#define MS1BITCNT_4_IS_11110xxx_HAS_3_SUCCESSORS    (4)

typedef int __BOOL;

int CountMS1BitSequenceAndForward(const char **p) {
    int     Mask;
    int     Result = 0;
    char    c = **p;
    ++(*p);
    for (Mask=0x80;c&(Mask&0xFF);Mask>>=1,++Result);
    return Result;
}


int MS1BitSequenceCount2SuccessorByteCount(int MS1BitSeqCount) {
    switch (MS1BitSeqCount) {
    case MS1BITCNT_2_IS_110xxxxx_HAS_1_SUCCESSOR: return 1;
    case MS1BITCNT_3_IS_1110xxxx_HAS_2_SUCCESSORS: return 2;
    case MS1BITCNT_4_IS_11110xxx_HAS_3_SUCCESSORS: return 3;
    }
    return 0;
}

__BOOL ExpectUTF8SuccessorCharsOrReturnFalse(const char **Str, int NumberOfCharsToExpect) {
    while (NumberOfCharsToExpect--) {
        if (CountMS1BitSequenceAndForward(Str) != MS1BITCNT_1_IS_10xxxxxx_IS_SUCCESSOR) {
            return __FALSE;
        }
    }
    return __TRUE;
}

__BOOL IsMS1BitSequenceCountAValidUTF8Starter(int Number) {
    switch (Number) {
    case MS1BITCNT_0_IS_0xxxxxxx_NO_SUCCESSOR:
    case MS1BITCNT_2_IS_110xxxxx_HAS_1_SUCCESSOR:
    case MS1BITCNT_3_IS_1110xxxx_HAS_2_SUCCESSORS:
    case MS1BITCNT_4_IS_11110xxx_HAS_3_SUCCESSORS:
        return __TRUE;
    }
    return __FALSE;
}

#define NO_FURTHER_CHECKS_REQUIRED_IT_IS_NOT_UTF8       (-1)
#define NOT_ALL_EXPECTED_SUCCESSORS_ARE_10xxxxxx        (-1)

int CountValidUTF8CharactersOrNegativeOnBadUTF8(const char *Str) {
    int NumberOfValidUTF8Sequences = 0;
    if (!Str || !Str[0]) { return 0; }
    while (*Str) {
        int MS1BitSeqCount = CountMS1BitSequenceAndForward(&Str);
        if (!IsMS1BitSequenceCountAValidUTF8Starter(MS1BitSeqCount)) {
            return NO_FURTHER_CHECKS_REQUIRED_IT_IS_NOT_UTF8;
        }
        if (!ExpectUTF8SuccessorCharsOrReturnFalse(&Str, MS1BitSequenceCount2SuccessorByteCount(MS1BitSeqCount))) {
            return NOT_ALL_EXPECTED_SUCCESSORS_ARE_10xxxxxx;
        }
        if (MS1BitSeqCount) { ++NumberOfValidUTF8Sequences; }
    }
    return NumberOfValidUTF8Sequences;
}

I also wrote a few test cases:

static void TestUTF8CheckOrDie(const char *Str, int ExpectedResult) {
    int Result = CountValidUTF8CharactersOrNegativeOnBadUTF8(Str);
    if (Result != ExpectedResult) {
        printf("TEST FAILED: %s:%i: check on '%s' returned %i, but expected was %i\n", __FILE__, __LINE__, Str, Result, ExpectedResult);
        exit(1);
    }
}

void SimpleUTF8TestCases(void) {
    TestUTF8CheckOrDie("abcd89234", 0);  // neither valid nor invalid UTF8 sequences
    TestUTF8CheckOrDie("", 0);           // neither valid nor invalid UTF8 sequences
    TestUTF8CheckOrDie(NULL, 0);
    TestUTF8CheckOrDie("asdföadkg", 1);  // contains one valid UTF8 character sequence
    TestUTF8CheckOrDie("asdföadäkg", 2); // contains two valid UTF8 character sequences
    TestUTF8CheckOrDie("asdf\xF8" "adäkg", -1); // contains at least one invalid UTF8 sequence
}
Ansate answered 6/11, 2020 at 17:34 Comment(0)
E
-1

The program below reads UTF-8 strings (ASCII and non-ASCII characters such as the euro sign) from stdin. Each line is passed to func_find_utf8. Because UTF-8 characters can span multiple bytes, func_find_utf8 checks the leading bits of each byte to determine whether the character is ASCII or non-ASCII. If the character is non-ASCII, the bit pattern gives its width in bytes. That width and the position where the character was found are passed to print_non_ascii.

#include <stdio.h>
#include <string.h>

/* UTF-8 : BYTE_BITS*/
/* B0_BYTE : 0XXXXXXX */
/* B1_BYTE : 10XXXXXX */
/* B2_BYTE : 110XXXXX */
/* B3_BYTE : 1110XXXX */
/* B4_BYTE : 11110XXX */
/* B5_BYTE : 111110XX */
/* B6_BYTE : 1111110X */
#define B0_BYTE 0x00
#define B1_BYTE 0x80
#define B2_BYTE 0xC0
#define B3_BYTE 0xE0
#define B4_BYTE 0xF0
#define B5_BYTE 0xF8
#define B6_BYTE 0xFC
#define B7_BYTE 0xFE

/* Please tune this as per number of lines input */
#define MAX_UTF8_STR 10
/* 600 is used because 6byteX100chars */
#define MAX_UTF8_CHR 600

void func_find_utf8 (char *ptr_to_str);
void print_non_ascii (int bytes, char *pbyte);

char strbuf[MAX_UTF8_STR][MAX_UTF8_CHR];

int
main (int ac, char *av[])
{
  int i = 0;
  char no_newln_str[MAX_UTF8_CHR];

  printf ("\n\nYou can enter utf-8 string or Q/q to QUIT\n\n");
  while (i < MAX_UTF8_STR)
    {
      /* stop on EOF or read error */
      if (!fgets (strbuf[i], MAX_UTF8_CHR, stdin))
        break;
      if ((strbuf[i][0] == 'Q') || (strbuf[i][0] == 'q'))
        break;
      strcpy (no_newln_str, strbuf[i]);
      /* strip the trailing newline, if there is one */
      no_newln_str[strcspn (no_newln_str, "\n")] = 0;
      func_find_utf8 (no_newln_str);
      ++i;
    }
  return 0;
}

void
func_find_utf8 (char *ptr_to_str)
{
  int found_non_ascii = 0;
  int bytes;
  char *pbyte = ptr_to_str;

  while (*pbyte)
    {
      if ((*pbyte & B1_BYTE) == B0_BYTE)
        {
          pbyte++;
          continue;
        }
      found_non_ascii = 1;
      if ((*pbyte & B7_BYTE) == B6_BYTE)
        bytes = 6;
      else if ((*pbyte & B6_BYTE) == B5_BYTE)
        bytes = 5;
      else if ((*pbyte & B5_BYTE) == B4_BYTE)
        bytes = 4;
      else if ((*pbyte & B4_BYTE) == B3_BYTE)
        bytes = 3;
      else if ((*pbyte & B3_BYTE) == B2_BYTE)
        bytes = 2;
      else
        {
          /* stray continuation byte or 0xFE/0xFF: skip it,
             otherwise the loop would never advance */
          pbyte++;
          continue;
        }
      /* do not step past the terminating NUL of a truncated sequence */
      if (strlen (pbyte) < (size_t) bytes)
        break;
      print_non_ascii (bytes, pbyte);
      pbyte += bytes;
    }
  if (found_non_ascii)
    printf (" These are Non Ascii chars\n");
}

void
print_non_ascii (int bytes, char *pbyte)
{
  char store[6];
  int i;

  memset (store, 0, 6);
  memcpy (store, pbyte, bytes);
  i = 0;
  while (i < bytes)
    printf ("%c", store[i++]);
  printf ("%c", ' ');
  fflush (stdout);
}
Excerpta answered 10/5, 2015 at 10:31 Comment(1)
Please don't post code only answers, also include an explanation how this answers the question. - Skylight
