
UTF-32 to UTF-8 converter in C, buffer is full of nulls / zeroes


I've been trying forever to get this working. The program is supposed to take two arguments, one for the buffer size and another for a file name, and convert that file from UTF-32 to UTF-8. I've been using the fgetc() function to fill an int array with the Unicode code points. I've tested it by printing out the contents of my buffer, and it has all these null characters instead of one code point per element.

For example, for a file consisting of only the character 'A':

buffer[0] is 0
buffer[1] is 0
buffer[2] is 0
buffer[3] is 41

The codepoints for anything above U+7F end up getting split apart.

Here is the code for initializing my buffer:

int main(int argc, char** argv) {
  if (argc != 3) {
    printf("Must input a buffer size and a file name :D");
    return 0;
  }

  FILE* input = fopen(argv[2], "r");
  if (!input) {
    printf("The file %s does not exist.", argv[2]);
    return 0;
  } else {
    int bufferLimit = atoi(argv[1]);
    int buffer[bufferLimit];
    int charReplaced = 0;
    int fileEndReached = 0;
    int i = 0;
    int j = 0;

    while(1) {
      // fill the buffer with the characters from the file.
      for(i = 0; i < bufferLimit; i++){
        buffer[i] = fgetc(input);
        // if EOF reached, move onto next step and mark that
        // it has finished.
        if (buffer[i] == EOF) {
          fileEndReached = 1;
          break;
        }
      }
      // output buffer of chars until EOF or end of buffer
      for(j = 0; j <= i; j++) {
        if(buffer[j] == EOF) {
          break;
        }
        // check for Character Replacements
        charReplaced += !convert(buffer[j]);
      }
      if(fileEndReached != 0) {
        break;
      } 
    }  
    //return a 1 if any Character Replacements were used
    if(charReplaced != 0) {
      return 1;
    }
  }
}

Solution

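  • fgetc() returns a single byte (an unsigned char value converted to int, or EOF), not a Unicode code point.

    A UTF-32 file stores every code point as four bytes, so each call to fgetc() hands you one quarter of a code point. That is exactly what your output shows: 'A' is U+0041, stored big-endian as the bytes 00 00 00 41, and each byte lands in its own buffer element. Everything after that point builds on the false assumption, so the whole thing falls down.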
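
    One way to fix it is to read four bytes at a time and assemble them into a single code point before converting. The following is only a rough sketch, not a drop-in replacement for your convert(): the names read_utf32be and utf8_encode are made up for illustration, it assumes the file is big-endian (which your 00 00 00 41 output suggests), and it does not skip a leading byte order mark. The file should also be opened with "rb" rather than "r", so that no newline translation touches the bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read one big-endian UTF-32 code unit (4 bytes).
 * Returns 1 on success, 0 on a clean end of file,
 * and -1 on a read error or a truncated (1 to 3 byte) unit. */
int read_utf32be(FILE *fp, uint32_t *cp)
{
    unsigned char b[4];
    size_t n = fread(b, 1, sizeof b, fp);

    if (n == 0)
        return ferror(fp) ? -1 : 0;
    if (n != sizeof b)
        return -1;

    /* Most significant byte comes first in big-endian order. */
    *cp = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
          ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return 1;
}

/* Hypothetical helper: encode one code point as UTF-8 into out.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate or lies above U+10FFFF and so cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not valid here */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

    In your loop you would then call read_utf32be() once per buffer slot instead of fgetc(), stop when it returns 0 or -1, and pass each stored code point to the encoder. A return value of 0 from utf8_encode() is the natural place to count a character replacement. Storing the code points in a uint32_t array rather than an int array also avoids confusing a real value with EOF.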