Removing punctuation and capitalizing in C

I'm writing a program for school that has to read text from a file, capitalize everything, and remove the punctuation and spaces. The file "Congress.txt" contains:

(Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the government for a redress of grievances.)

The file reads in correctly, but the code I have so far to remove the punctuation and spaces and to capitalize the letters produces a lot of junk characters. My code so far is:

void processFile(char line[]) {
    FILE *fp;
    int i = 0;
    char c;

    if (!(fp = fopen("congress.txt", "r"))) {
        printf("File could not be opened for input.\n");
        exit(1);
    }

    line[i] = '\0';
    fseek(fp, 0, SEEK_END);
    fseek(fp, 0, SEEK_SET);
    for (i = 0; i < MAX; ++i) {
        fscanf(fp, "%c", &line[i]);
        if (line[i] == ' ')
            i++;
        else if (ispunct((unsigned char)line[i]))
            i++;
        else if (islower((unsigned char)line[i])) {
            line[i] = toupper((unsigned char)line[i]);
            i++;
        }
        printf("%c", line[i]);
        fprintf(csis, "%c", line[i]);
    }

    fclose(fp);
}

I don't know if it's an issue, but I have MAX defined as 272 because that's the length of the text file, including punctuation and spaces.

The output I am getting is:

    C╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠
    ╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠Press any key to continue . . .
Squirt answered 19/4, 2015 at 21:44 Comment(7)
Use fgetc() instead of fscanf(). Please post the caller function.Ecdysis
One problem is that you are double-incrementing i when you read a punctuation character. You only want to increment i when you copy something into the array. You should also null terminate the string when you exit the loop (before returning, at any rate).Isooctane
@JonathanLeffler, I think that OP should just let the for loop increment i, no need to touch it anywhere else.Quarterback
@Zoltán: actually, no. The loop probably needs to be rewritten as a while ((c = getc(fp)) != EOF) loop, and the increment to i only occurs when an assignment is made.Isooctane
Leaving everything as is except changing my for loop to the above-mentioned while loop now prints the exact same output, except with one more odd character in place of the "C".Squirt
Get the size of a file via: 'fseek(fp, 0L, SEEK_END); max = ftell(fp); fseek(fp, 0L, SEEK_SET);'. Also, there is no guarantee that the 'line' passed in by the caller is long enough to contain the whole file.Floodgate
Where did the csis file pointer originate?Floodgate

The fundamental algorithm needs to be along the lines of:

while next character is not EOF
    if it is alphabetic
        save the upper case version of it in the string
null terminate the string

which translates into C as:

int c;
int i = 0;

while ((c = getc(fp)) != EOF)
{
    if (isalpha(c))
        line[i++] = toupper(c);
}
line[i] = '\0';

This code doesn't need the (unsigned char) cast with the functions from <ctype.h> because c is guaranteed to contain either EOF (in which case it doesn't get into the body of the loop) or the value of a character converted to unsigned char anyway. You only have to worry about the cast when you use char c (as in the code in the question) and try to write toupper(c) or isalpha(c). The problem is that plain char can be a signed type, so some characters, notoriously ÿ (y-umlaut, U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS), will appear as a negative value, and that breaks the requirements on the inputs to the <ctype.h> functions. This code will attempt to case-convert characters that are already upper-case, but that's probably cheaper than a second test.
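
To illustrate the other case, where the characters come from a char array rather than from getc(), here is a minimal sketch. The helper name upper_letters and its parameters are made up for illustration; they are not part of the question's code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from the question): copies the letters of src
 * into dst as upper case and drops everything else. At most size - 1
 * characters are stored, followed by a terminating null.
 * Returns the number of characters stored, not counting the null. */
size_t upper_letters(const char *src, char *dst, size_t size)
{
    size_t j = 0;

    assert(size > 0);
    for (size_t i = 0; src[i] != '\0' && j < size - 1; ++i) {
        /* src[i] is a plain char and may be negative, so convert it to
         * unsigned char before handing it to the <ctype.h> functions. */
        unsigned char uc = (unsigned char)src[i];

        if (isalpha(uc))
            dst[j++] = (char)toupper(uc);
    }
    dst[j] = '\0';
    return j;
}
```

Called with "Hi, there!" and a large enough buffer, this should leave "HITHERE" in dst. The cast happens once, at the point where the plain char is read, so every later <ctype.h> call sees a valid argument.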

What else you do in the way of printing, etc. is up to you. The csis file stream is a global scope variable; that's a bit (tr)icky. You should probably terminate the output printing with a newline.
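
One possible way around the global, sketched under the assumption that the caller owns the log stream and passes it in (the name emit_line is hypothetical):

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes line plus a trailing newline to both the
 * console and a caller-supplied log stream, instead of relying on a
 * global such as csis. Returns 0 on success, -1 if either write fails. */
int emit_line(FILE *log, const char *line)
{
    assert(log != NULL && line != NULL);

    if (printf("%s\n", line) < 0)
        return -1;
    if (fprintf(log, "%s\n", line) < 0)
        return -1;
    return 0;
}
```

The caller would open the log file once, call emit_line(csis, line) after the string has been built, and close the file when done. Printing the whole string once also avoids the per-character printf calls in the loop.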

The code shown is vulnerable to buffer overflow. If the length of line is MAX, then you can modify the loop condition to:

while (i < MAX - 1 && (c = getc(fp)) != EOF)

If, as would be a better design, you change the function signature to:

void processFile(int size, char line[]) {

and assert that the size is strictly positive:

    assert(size > 0);

then the loop condition changes to:

while (i < size - 1 && (c = getc(fp)) != EOF)

Obviously, you change the call too:

char line[4096];

processFile(sizeof(line), line);
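
Putting the pieces together, a minimal sketch might look like the following. It assumes the caller has already opened the file and passes the FILE * in; that extra parameter, the return value, and the name read_upper_letters are illustrative choices, not something the question's code has.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Sketch of the bounded loop. Reads from an already-open stream, keeps
 * only alphabetic characters, stores them as upper case, and always
 * null terminates. Returns the number of letters stored. */
int read_upper_letters(FILE *fp, int size, char line[])
{
    int c;      /* int, not char, so EOF stays distinguishable */
    int i = 0;

    assert(fp != NULL);
    assert(size > 0);

    /* Test the bound first so no character is read and then thrown
     * away once the buffer is full. */
    while (i < size - 1 && (c = getc(fp)) != EOF) {
        if (isalpha(c))
            line[i++] = (char)toupper(c);
    }
    line[i] = '\0';
    return i;
}
```

The caller would do the fopen() and error check as in the question, then call read_upper_letters(fp, sizeof(line), line) and fclose(fp). Because the index only advances when a letter is stored, there are no gaps of uninitialized characters in line.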
Isooctane answered 19/4, 2015 at 21:59 Comment(1)
Nice explanation about when (unsigned char) needed.Fathom

The posted code does no further processing on the stored characters, so the following version drops the 'line[]' input parameter and writes each character as it is read:

void processFile()
{
    FILE *fp = NULL;

    if (!(fp = fopen("congress.txt", "r")))
    {
        printf("File could not be opened for input.\n");
        exit(1);
    }

    // implied else, fopen successful

    int c; // must be int so EOF can be told apart from every valid character
    while ((c = fgetc(fp)) != EOF)
    {
        if (isalpha(c)) // a...z or A...Z; spaces and punctuation are dropped
        {
            // note toupper has no effect on upper case characters
            printf("%c", toupper(c));
            fprintf(csis, "%c", toupper(c));
        }
    }
    printf( "\n" );

    fclose(fp);
} // end function: processFile
Floodgate answered 20/4, 2015 at 3:13 Comment(0)

Okay, so what I did was create a second character array. My first array reads in the entire file. The second array takes only the alphabetical characters from the first array and makes them uppercase. My completed function for that part of my homework is as follows:

void processFile(char line[], char newline[]) {
    FILE *fp;
    int i = 0;
    int j = 0;

    if (!(fp = fopen("congress.txt", "r"))) {                 //checks file open
        printf("File could not be opened for input.\n");
        exit(1);
    }
    line[i] = '\0';
    fseek(fp, 0, SEEK_END);               //seek to the end and back to the start; the position ends up where fopen left it
    fseek(fp, 0, SEEK_SET);

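    for (i = 0; i < MAX; ++i) {           //reads the file into the first array
        if (fscanf(fp, "%c", &line[i]) != 1)   //stop early if the file is shorter than MAX
            break;
    }
    int n = i;                            //number of characters actually read

    for (i = 0; i < n; ++i) {
        if (isalpha((unsigned char)line[i])) {             //if it's an alphabetical character
            newline[j] = toupper((unsigned char)line[i]);  //store its uppercase form in the new array
            j++;
        }
    }
    newline[j] = '\0';                    //null terminate; newline needs room for up to MAX + 1 chars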

    fclose(fp);
}

Keep in mind that the new array holds fewer characters than your defined MAX. To make it easy I just counted the removed punctuation and spaces (50 of them), so later "for" loops became:

for (i = 0; i < MAX - 50; ++i)
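
An alternative to counting by hand, sketched under the assumption that the new array is null terminated after the copy loop: later loops can take their bound from strlen(). The function count_letter below is a made-up example of such a later step, not part of the homework.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical follow-up step: counts how often letter occurs in text.
 * The loop bound comes from strlen, so it adapts to however many
 * characters were removed from the original file contents. */
size_t count_letter(const char *text, char letter)
{
    assert(text != NULL);

    size_t n = strlen(text);   /* length up to the terminating null */
    size_t count = 0;

    for (size_t i = 0; i < n; ++i) {
        if (text[i] == letter)
            ++count;
    }
    return count;
}
```

With this approach the magic number 50 disappears, so the code keeps working if the contents of the text file change.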
Squirt answered 21/4, 2015 at 7:11 Comment(0)
