I am in the process of making a small program that reads a file, that contains UTF-8 elements, char by char. After reading a char it compares it with a few other characters and if there is a match it replaces the character in the file with an underscore '_'.
(Well, it actually makes a duplicate of that file with specific letters replaced by underscores.)
I'm not sure where exactly I'm messing up here but it's most likely everywhere.
Here is my code:
FILE *fpi;
FILE *fpo;
char ifilename[FILENAME_MAX];
char ofilename[FILENAME_MAX];
wint_t sample;
fpi = fopen(ifilename, "rb");
fpo = fopen(ofilename, "wb");
while (!feof(fpi)) {
fread(&sample, sizeof(wchar_t*), 1, fpi);
if ((wcscmp(L"ά", &sample) == 0) || (wcscmp(L"ε", &sample) == 0) ) {
fwrite(L"_", sizeof(wchar_t*), 1, fpo);
} else {
fwrite(&sample, sizeof(wchar_t*), 1, fpo);
}
}
I have omitted the code that has to do with the filename generation because it has nothing to offer to the case. It is just string manipulation.
If I feed this program a file containing the words γειά σου κόσμε.
I would want it to return this:
γει_ σου κόσμ_.
Searching the internet didn't help much as most results were very general or talking about completely different things regarding UTF-8. It's like nobody needs to manipulate single characters for some reason.
Anything pointing me the right way is most welcome. I am not, necessarily, looking for a straightforward fixed version of the code I submitted, I would be grateful for any insightful comments helping me understand how exactly the wchar mechanism works. The whole wbyte, wchar, L, no-L, thing is a mess to me.
Thank you in advance for your help.
wchar_t
is not UTF-8; I don't see how you could expect this code to work since you're reading a fixed number of bytes (and the wrong number;sizeof(wchar_t*)
is not the same as the size of the pointed-to object) and UTF-8 is a variable-length encoding. – Lovely