Reading Emojis through a pipe in C
C

5

7

I have a pipe with an endless stream of strings being written to it. These strings are a mix of ASCII and Emojis. The problem is that I am reading them like this:

char msg[100];
int length = read(fd,&msg,99);
msg[length] =0;

But sometimes an emoji, which I'm guessing is multibyte, gets cut in half, and then when I print to the screen I get the diamond question mark symbol that is shown for invalid UTF-8.

If anyone knows how to prevent this please fill me in; I've been searching for a while now.

Credulity answered 2/1, 2020 at 23:58 Comment(7)
When the buffer you read ends with an incomplete utf-8 code point, keep it around until you read more data and finish it.Winze
“These strings are a mix of ASCII and Emojis” — That’s unlikely: ASCII is a distinct text encoding that cannot encode Emojis. I guess technically you could encode these fragments with headers that specify different encodings but I’m guessing you’re not actually doing that, and the whole input stream is actually some Unicode encoding such as UTF-8, correct?Stieglitz
The input is actually strings from Java going through jni and encoded as utf-8 into char*Credulity
ASCII is a subset of UTF-8. So the pipe is delivering just UTF-8 data, it just happens that ASCII characters are single-byte in UTF-8.Virgilvirgilia
On an unrelated note, using &msg is semantically incorrect, as that gives a pointer to the array itself, not to its first element as expected. The type of &msg is char (*)[100], which is very different from the common char * expected (which you get from either &msg[0] or just plain msg (as it decays to a pointer to its first element, i.e. &msg[0])).Apothecary
I think it would be helpful for others to have an example sequence of char values that's causing you problems, and the output you expect. Could you post one?Farrahfarrand
Also, when you say "cut in half"... what is your code doing to cut them? Adding newlines or other separators?Farrahfarrand
N
9

If you're reading chunks of bytes, and want to output chunks of UTF-8, you'll have to do at least some minimal UTF-8 decoding yourself. The simplest check is to look at each byte (let's call it b) and see whether it is a continuation byte:

bool is_cont = (0x80 == (0xC0 & b));

Any byte that is not a continuation starts a sequence, which continues until the next non-continuation byte. You'll need a 4-byte buffer to hold the chunks.
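
A minimal sketch of that idea, assuming the input is otherwise valid UTF-8. The helper names utf8_is_cont, utf8_seq_len and utf8_complete_prefix are made up for this illustration and are not part of any library. The last function reports how many bytes at the front of a buffer form whole sequences, so the remaining bytes (at most 3) can be kept back until more data arrives.

```c
#include <stddef.h>

/* Hypothetical helper: nonzero if b is a UTF-8 continuation byte (10xxxxxx). */
static int utf8_is_cont(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: sequence length implied by a lead byte. Returns 1 for
   ASCII and for bytes that are not valid lead bytes, so bad input is passed
   through instead of being held back. */
static size_t utf8_seq_len(unsigned char lead) {
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Returns how many leading bytes of buf[0..len) consist only of complete
   sequences. The remaining len - result bytes are the start of a sequence
   whose tail has not arrived yet. */
static size_t utf8_complete_prefix(const char *buf, size_t len) {
    size_t i = len;
    size_t back = 0;
    /* Step back over at most 3 trailing continuation bytes. */
    while (i > 0 && back < 3 && utf8_is_cont((unsigned char)buf[i - 1])) {
        --i;
        ++back;
    }
    if (i == 0) {
        return len; /* no lead byte in the buffer; nothing to hold back */
    }
    size_t have = len - (i - 1); /* bytes from the last lead byte to the end */
    size_t need = utf8_seq_len((unsigned char)buf[i - 1]);
    return need > have ? i - 1 : len;
}
```

In a read loop you would print the first utf8_complete_prefix(buf, len) bytes, move the leftover bytes to the front of the buffer, and let the next read() append after them.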

Nottingham answered 3/1, 2020 at 0:16 Comment(6)
So check the last byte and if the bool is true read another byte until it is falseCredulity
No, you can't just check the last byte, because there's no way to detect "last byte". You have to detect "first byte" (which is equivalent to "not a continuation"), and then grab bytes until the next first byte.Nottingham
so, I just tried for the past hour to do what you said. so for a char[] the first byte is not a continuation and the last byte is the continuation just before the next first byte. haha im trying to wrap my head around thisCredulity
im trying to use like at least a char[100] because im trying to process fastCredulity
I have started a bounty to try to get an example because I cannot figure this out. I think that your answer is correct but I need further detailCredulity
I did it using wchar_tCredulity
C
2

The hint provided by lee-daniel-crocker is a good way to check whether a particular byte is a continuation byte of a multi-byte UTF-8 sequence.

Along with this, you need some more logic. When you find a partial UTF-8 sequence at the end of the data you have read, look back in the buffer to locate where that partial sequence starts.

Once you have found the start of the partial sequence, store those bytes, remove them from the buffer, and process the rest of the buffer. Prepend the stored bytes to the buffer in the next read cycle. This lets you reassemble a UTF-8 sequence that was split across two read() operations.

Below is sample code for testing and validation.

App.c

// gcc -Wall app.c

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

volatile sig_atomic_t g_process_run = 1;

void signal_handler(int signal) { g_process_run = 0; }

int child_process(int *pipe) {
  close(pipe[0]); // close read pipe
  srand(1234);
  int chars_to_send[] = {95, 97, 99, 100, 101, 103, 104, 105,
                         95, 97, 99, 100, 101, 103, 104, 105};
  // int chars_to_send[] = {6, 7, 8, 9,12,14,15,16};
  int fd = open("a.txt", O_RDONLY);
  if (fd == -1) {
    printf("Child: can't open file\n");
    return -1;
  }
  struct stat sb;
  if (fstat(fd, &sb) == -1) {
    printf("Child: can't get file stat\n");
    return -1;
  }
  off_t file_size = sb.st_size;
  char *addr = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (addr == MAP_FAILED) {
    printf("Child:mmap failed\n");
    return -1;
  }
  int start_address = 0;
  while (g_process_run != 0) {
    int index = rand() % 16; // always a valid index into chars_to_send
    int len = chars_to_send[index];
    if (start_address + len > file_size) {
      start_address = 0;
    }
    len = write(pipe[1], &addr[start_address], len);
    if (len < 0) { // write failed, stop sending
      break;
    }
    start_address = start_address + len;
    sleep(1);
  }
  munmap(addr, file_size);
  close(fd);
  close(pipe[1]);
  printf("child process exiting\n");
  return 0;
}
int parent_process(int *pipe) {
  close(pipe[1]); // close write pipe
  const int BUFF_SIZE = 99;
  char buff[BUFF_SIZE + 1];
  char buff_temp[4]; // a partial UTF-8 sequence is at most 3 bytes long
  int continueCount = 0;
  while (g_process_run != 0) {
    ssize_t n = read(pipe[0], &buff[continueCount], BUFF_SIZE - continueCount);
    if (n <= 0) { // end of stream or read error
      break;
    }
    // adjust the length for the partial UTF-8 sequence carried over
    // from the previous cycle
    int len = (int)n + continueCount;
    continueCount = 0;
    // step back over trailing continuation bytes (10xxxxxx) to the
    // lead byte of the last sequence
    int i = len - 1;
    while (i > 0 && len - i < 4 && (buff[i] & 0xC0) == 0x80) {
      --i;
    }
    unsigned char lead = (unsigned char)buff[i];
    int need = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : lead >= 0xC0 ? 2 : 1;
    if (need > len - i) { // last sequence is incomplete: save it
      continueCount = len - i;
      memcpy(buff_temp, &buff[i], continueCount);
      len = i;
    }
    buff[len] = '\0';
    printf("Parent:%s\n", buff);
    if (continueCount > 0) { // put partial utf-8 sequence to start of buffer,
                             // so it will prepend in next read cycle.
      printf("will resume with %d partial bytes\n", continueCount);
      memcpy(buff, buff_temp, continueCount);
    }
  }
  close(pipe[0]);
  wait(NULL);
  printf("parent process exiting\n");
  return 0;
}
int init_signal() {
  if (signal(SIGINT, signal_handler) == SIG_ERR) {
    return -1;
  }
  return 0;
}

int main(int argc, char **argv) {
  if (init_signal() != 0)
    return -1;
  int pipefd[2];
  if (pipe(pipefd) == -1) {
    printf("can't create pipe\n");
    return -1;
  }
  pid_t pid = fork();
  if (pid == -1) {
    printf("Can't fork process\n");
    return -1;
  } else if (pid == 0) { // child process
    return child_process(pipefd);
  }
  return parent_process(pipefd);
}

a.txt

12abc😫️a23😀️s345🤑️24ee🙃️dai😕️iodqs😥️dqk😓️pdoo9😛️93wd🤑️qd3👤️2om🍕️de9🤐️312abc😫️a23😀️s345🤑️24ee🙃️dai😕️iodqs😥️dqk😓️pdoo9😛️93wd🤑️qd3👤️2om🍕️de9🤐️312abc😫️a23😀️s345🤑️24ee🙃️dai😕️iodqs😥️dqk😓️pdoo9😛️93wd🤑️qd3👤️2om🍕️de9🤐️312abc😫️a23😀️s345🤑️24ee🙃️dai😕️iodqs😥️dqk😓️pdoo9😛️93wd🤑️qd3👤️2om🍕️de9🤐️312abc😫️a23😀️s345🤑️24ee🙃️dai😕️iodqs😥️dqk😓️pdoo9😛️93wd🤑️qd3👤️2om🍕️de9🤐️312abc😫️a23😀️s345🤑️24ee🙃️dai😕️iodqs😥️dqk😓️pdoo9😛️93wd🤑️qd3👤️2om🍕️de9🤐️312abc😫️a23😀️s345🤑️24ee🙃️dai😕️iodqs😥️dqk😓️pdoo9😛️93wd🤑️qd3👤️2om🍕️de9🤐️312abc😫️a23😀️s345🤑️24ee🙃️dai😕️iodqs😥️dqk😓️pdoo9😛️93wd🤑️qd3👤️2om🍕️de9🤐️3

You can find this code and test file here.

Corina answered 8/1, 2020 at 9:55 Comment(1)
I guess the check "if (0 != (0x80 & buff[i]))" is not quite right because "0 != (0x80 & buff[i])" is true even for last byte in multi-byte sequence. So, for the stream that contains only multi-bytes, this program will get into infinite loop. Example input: perl -e 'print("\xf0\x9f\x98\xab"x1000);' > a.txtBruns
I
1

The example code below uses stdin, but you can switch to the commented-out fdopen(fd, "r"); to read from a pipe file descriptor instead.

Here's a super simple example of how to do this. It might be a bit slower, but I would try it first and see if it meets your needs. You can also read in larger chunks using fgetws().

The program below reads UTF-8 characters and prints them back out properly.

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
  FILE *input_stream = stdin; //or fdopen(fd, "r");
  FILE *output_stream = stdout;

  setlocale(LC_ALL, "en_US.utf8");
  fputws(L"Program started\n", output_stream); //note the wide string `L` prefix

  wint_t wc;
  while ((wc = fgetwc(input_stream)) != WEOF) {
    //use CTRL+D to send WEOF to stdin
    fputwc(wc, output_stream);
  }

  fputws(L"Program ended\n", output_stream); //note the wide string `L` prefix

  //note that this example omits error handling for writing output and setlocale()
  return EXIT_SUCCESS;
}

Can be used with pipes as well:

$ echo "hello. кошкâ" | ./a.out
Program started
hello. кошкâ
Program ended
Incrocci answered 3/1, 2020 at 0:17 Comment(5)
This is a bit difficult to adapt to the scenario in the question where the data is being read from a pipe (via a file descriptor, not a file stream) and the intent is to read and process characters on the fly (some of which require more than one byte to be a complete character), rather than read everything and then analyze what was received.Assailant
@JonathanLeffler I updated the example of it working with a simple pipe. It also works with stdin just fine. Am I missing something?Incrocci
Your code works with file streams, not file descriptors. Pipes at the C level are created via the function int pipe(int fd[2]); which takes two file descriptors. The code in the question, minimal though it is, shows this: char msg[100]; int length = read(fd, &msg, 99); msg[length] = 0; It is using the read() system call, which takes a file descriptor, often denoted by fd. Your code does not fit into that scenario easily. Sure, at the shell level, it can read from a pipe connected to standard input; that is not in question. The code in the question is not reading from that.Assailant
@JonathanLeffler thanks for clarifying. I thought OP meant a shell pipe. But in their case, couldn't they just use fdopen() as detailed here: #1517266Incrocci
Perhaps — fdopen() is one of the main reasons for only saying 'a bit difficult to adapt' rather than anything stronger.Assailant
B
1

I would go with something like this:

#include <stdio.h>
#include <unistd.h>

#define BUFFER_LENGTH   53

void print_function(char* message) {
        // \r or 0x0d - UTF-8 carriage return
        printf("%s\r", message);
}

void read_pipe(int pipe, void (*print_func)(char*))
{
        char message[BUFFER_LENGTH];
        char to_print[1 + BUFFER_LENGTH];
        char* pointer = message;
        do
        {
                int bytes_read = read(pipe, pointer, BUFFER_LENGTH - (pointer - message));
                if (bytes_read <= 0) // end of input (0) or read error (-1)
                {
                        // print remaining bytes
                        *pointer = '\0';
                        print_func(message);
                        break;
                }

                // add bytes remained from previous run
                bytes_read += (pointer - message);

                // copy complete characters to buffer to_print
                int char_p = 0;
                char* to_print_p = to_print;
                for (int i = 0; i != bytes_read; ++i)
                {
                        if (0x80 != (0xc0 & *(message + i)))
                        {
                                for (; char_p != i; ++char_p)
                                {
                                        *(to_print_p++) = *(message + char_p);
                                }
                        }
                }

                // finish buffer with complete characters and print it
                *to_print_p = '\0';
                print_func(to_print);


                // move tail to the beginning of the input buffer,
                // pointer will point to the first free element in message buffer
                pointer = message;
                for (; char_p != bytes_read; ++char_p)
                {
                        *(pointer++) = *(message + char_p);
                }
        } while (1);
}

int main()
{
        read_pipe(STDIN_FILENO, print_function);

        return 0;
}

Here read_pipe reads from the given pipe descriptor until end of input and prints the data using the given print_func function.

The idea is to read a buffer from the pipe and copy only complete characters to the print buffer (condition courtesy of Lee Daniel Crocker), assuming the input is a valid UTF-8 sequence. If the buffer ends with an incomplete UTF-8 character, those bytes are used as the beginning of the next portion of data. We loop like this until the end of the pipe.

For simplicity I use stdin as a pipe descriptor. To run and test:

gcc -Wall main.c -o run && perl -e 'print "\xf0\x9f\x98\xab"x1000;' > test.txt && ./run < test.txt > output.txt

P.S. Another approach would be to get the character length from the lead byte, as described in "UTF-8 Continuation bytes":

#include <stdio.h>
#include <unistd.h>

#define BUFFER_LENGTH   53

void print_function(char* message) {
        // \n or 0x0a - line feed
        printf("%s\n", message);
}

void read_pipe(int pipe, void (*print_func)(char*))
{
        char message[BUFFER_LENGTH];
        char to_print[1 + BUFFER_LENGTH];
        char* pointer = message;
        do
        {
                int bytes_read = read(pipe, pointer, BUFFER_LENGTH - (pointer - message));
                if (bytes_read <= 0) // end of input (0) or read error (-1)
                {
                        *pointer = '\0';
                        print_func(message);
                        break;
                }

                // add bytes remained from previous run
                bytes_read += (pointer - message);

                // copy complete characters to buffer to_print
                int char_p = 0;
                char* to_print_p = to_print;

                int length;
                do
                {
                        unsigned char c = *(message + char_p);
                        if (0xc0 == (0xc0 & c))
                        {
                                length = 0;
                                while (0 != (0x80 & c))
                                {
                                        c <<= 1;
                                        ++length;
                                }

                                if (char_p + length > bytes_read)
                                {
                                        break;
                                }
                        }
                        else
                        {
                                length = 1;
                        }

                        for (int i = 0; i != length; ++i)
                        {
                                *(to_print_p++) = *(message + char_p++);
                        }
                } while (char_p != bytes_read);

                // finish buffer with complete characters and print it
                *to_print_p = '\0';
                print_func(to_print);


                // move tail to the beginning of the input buffer,
                // pointer will point to the first free element in message buffer
                pointer = message;
                for (; char_p != bytes_read; ++char_p)
                {
                        *(pointer++) = *(message + char_p);
                }
        } while (1);
}

int main()
{
        read_pipe(STDIN_FILENO, print_function);

        return 0;
}
Bruns answered 13/1, 2020 at 8:51 Comment(0)
H
0
#include <stdio.h>
#include <wchar.h>

int strlen_utf8(const char* s)
{
    //http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
    int i = 0, j = 0;
    while (s[i])
    {
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
    }
    return j;
}

void utf8_to_wchar_t(wchar_t * ws,const char* s)
{
    //utf8--------------------------------------
    //0xxxxxxx                              1 byte
    //110xxxxx 10xxxxxx                     2 byte
    //1110xxxx 10xxxxxx 10xxxxxx            3 byte
    //11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   4 byte

    int total = strlen_utf8(s);

    unsigned char c = 0;
    wchar_t wc = 0;   //unsigned int wc;  !!!! we use 16 bit

    int i = 0;  //s[i]
    int j = 0;  //ws[j]
    for(j=0;j<total;j++)
    { 
        c = s[i++]; //read 1 byte first

        if (c >> 7 == 0b0)      //1 byte    0xxxxxxx
        {
            wc = (c & 0b01111111);
        }
        if (c >> 5 == 0b110)    //2 byte    110xxxxx    10xxxxxx
        {
            wc  = (c & 0b00011111) << 6;
            wc += (s[i++]& 0b00111111);
        }
        if (c >> 4 == 0b1110)   //3 byte    1110xxxx    10xxxxxx    10xxxxxx
        {

            wc  = (c & 0b00001111) << 12;
            wc += (s[i++] & 0b00111111) << 6;
            wc += (s[i++] & 0b00111111);
        }

        if (c >> 3 == 0b11110)  //4 byte    11110xxx    10xxxxxx    10xxxxxx    10xxxxxx
        {
            wc  = (c & 0b00000111) << 18;
            wc += (s[i++] & 0b00111111) << 12;
            wc += (s[i++] & 0b00111111) << 6;
            wc += (s[i++] & 0b00111111);
        }
        ws[j] = wc;
    }
    ws[total] = L'\0';


}
void test()
{

    char s[] = { 0xc5,0x9f,0xe2,0x98,0xba,0x00 };//test utf8
    wchar_t ws[100];

    utf8_to_wchar_t(ws, s);


    //write 8bit
    FILE* fp = fopen("a.txt", "wb");
    fwrite(s, 1, 5, fp);
    fclose(fp);

    //write 16bit
    FILE* fp2 = fopen("a2.txt", "wb");
    fwrite("\xff\xfe", 1, 2, fp2);  //little endian
    fwrite(ws, 1, 4, fp2); fclose(fp2);


}
Hower answered 14/1, 2020 at 14:49 Comment(0)
