How to convert Roman numerals to int while rejecting invalid numbers using standard C?

Asked 10/5, 2017 at 5:14 Answered 10/5, 2017 at 21:47

What it means to be proper Roman numerals may vary. For simplicity (no Unicode, no multiplicative principle, no double subtractives, no overbars, no large numbers, etc) for the sake of this question, valid Roman numerals are defined by the regex:

^(M{0,3})(D?C{0,3}|CM|CD)(L?X{0,3}|XC|XL)(V?I{0,3}|IX|IV)$

Code example with POSIX regexec(). The regex matches Roman numerals in 1..3999 range represented using "strict" rules.

There are many solutions that can convert Roman numerals if we don't need to reject invalid numerals, for example:

int roman_numeral_value(unsigned char c)
{
  switch(toupper(c)) {
  case 'I': return 1;
  case 'V': return 5;
  case 'X': return 10;
  case 'L': return 50;
  case 'C': return 100;
  case 'D': return 500;
  case 'M': return 1000;
  default: return 0; // error
  }
}

int roman_numeral_to_int(const char *s, int size)
{
  int total = 0, prev = 0;
  for (int i = size-1; i >= 0; --i) { // in reverse order
    int value = roman_numeral_value(s[i]);
    total += value < prev ? -value : value; // subtract if necessary
    prev = value;
  }
  return total;
}

It works for valid Roman numerals. But roman_numeral_to_int() accepts numerals such as IIIII that are rejected by the regex. Is there a similarly simple cross-platform solution that doesn't require pcre_exec() or other external dependencies that works for valid Roman numerals and only for them?

Humble answered 10/5, 2017 at 5:14 Comment(7)

Perhaps a separate count, e.g. int total ... n = 0; for (...) { int value ...; if (n < 3 && (value == 1000 || ...) { total...; n++;} else n = 0; ...}. A very (rough) but brute force approach? Are there any restrictions on how you want to approach it? – Linguist 10/5, 2017 at 5:29

do you want us to implement DFA in pure c? – Ashford 10/5, 2017 at 5:30

The only other I have seen has 28 conditionals checking, e.g. if (100) .."C".. else if (200) .."CC".. else if (700) .."DCC", etc. – Linguist 10/5, 2017 at 5:33

@DavidC.Rankin: all restrictions are in the question. Basically, I'm hoping that I'm missing some elegant approach to this problem. – Humble 10/5, 2017 at 5:37

The only other elegant approach I could think of would be a short lookup table that could be used in lieu of the conditionals that would provide a more compact way of doing the same thing. Good question, but unless somebody just implemented something similar, it is one that will take a bit of thought and tinkering... I'll tinker, but make no promise :) – Linguist 10/5, 2017 at 5:43

@J.F.Sebastian Please clarify. One critique was that the ruleset may change. If you want a solution that works for only valid Roman numerals and that ruleset may change, what is the form of the driving definition going to be? Some regex like ^(M{0,3})... I{0,3}|IX|IV)$ ? If that is the case, then validation code needs to be a generic regex parser. Otherwise you need to define how the ruleset may change or is encoded. – Wesle 10/5, 2017 at 20:31

@chux: "for the sake of this question, valid Roman numerals are defined by the regex:" i.e., it is ok if your solution doesn't support anything else (that is why I've upvoted it). The whole problem is relatively easy -- if any solution doesn't work -- a new one can be created from scratch (so the fact that your approach is inflexible is a downside but it is not a significant one -- it is worth mentioning it in a comment). The changes in the requirements can be arbitrary (follow the first three links in the answer to see a variety possible of Roman numerals. – Humble 10/5, 2017 at 20:58

To create some level of rule flexibility, the following Roman_string_to_unsigned0() employs a table.

It follows the strtol() style of functionality in that an end-pointer is returned indicating where parsing stopped. De-ref and test against '\0' for success.

The function has a bool subtractive parameter to steer the two major types of Roman Numeral parsing: basic, subtractive.

static const struct Roman_digit {
  char ch[3];
  bool subtractive;
  unsigned char limit;
  unsigned char nextdown;  // with parse success, offset to next element to try
  unsigned value;
} Roman_table[] = {
    { "I", false, 4, 1, 1 }, //
    { "IV", true, 1, 2, 4 }, //
    { "V", false, 1, 2, 5 }, //
    { "IX", true, 1, 4, 9 }, //
    { "X", false, 4, 1, 10 }, //
    { "XL", true, 1, 2, 40 }, //
    { "L", false, 1, 2, 50 }, //
    { "XC", true, 1, 4, 90 }, //
    { "C", false, 4, 1, 100 }, //
    { "CD", true, 1, 2, 400 }, //
    { "D", false, 1, 2, 500 }, //
    { "CM", true, 1, 4, 900 }, //
    { "M", false, 4, 1, 1000 }, //
};
#define Roman_table_N (sizeof Roman_table / sizeof Roman_table[0])

const char *Roman_string_to_unsigned0(unsigned *dest, const char *src, bool subtractive){
  *dest = 0;
  for (unsigned i = Roman_table_N; i > 0;) {
    const struct Roman_digit *digit = &Roman_table[i - 1];
    if (!subtractive && digit->subtractive) {
      i--;
      continue;
    }
    unsigned limit = digit->limit;  // repeat count
    if (limit > 1 && subtractive) limit--;
    size_t ch_length = strlen(digit->ch);
    size_t next_i = i-1;
    for (unsigned j=0; j<limit; j++) {
      if (strncmp(src, digit->ch, ch_length) == 0) {
        *dest += digit->value;
        if (*dest < digit->value) { // Overflow detection
          return (char*) src;
        }
        src += ch_length;
        next_i = i - digit->nextdown;  // With success, maybe skip down the list 
      } else {
        break;
      }
    }
    i = next_i;
  }
  return (char*) src;
}

Notes: Case insensitivity not yet encoded. An empty string returns 0. By this code working most-to-least significant, "XXXMMM" does not pass.

Wesle answered 10/5, 2017 at 21:47 Comment(3)

Per my testing, I am confident this passes ^(M{0,3})(D?C{0,3}|CM|CD)(L?X{0,3}|XC|XL)(V?I{0,3}|IX|IV)$ when it should and fails otherwise. – Wesle 10/5, 2017 at 21:50

@J.F.Sebastian Grumble - cut and paste error - grumble - code fixed. – Wesle 11/5, 2017 at 2:26

it passes my tests now. – Humble 11/5, 2017 at 3:30

By generating C code from a higher-level specification, we can get a readable solution that uses only standard C. For example, the regex:

  ^(?P<thousands>       M{,3})
   (?P<hundreds>CM|CD|D?C{,3})
   (?P<tens>    XC|XL|L?X{,3})
   (?P<units>   IX|IV|V?I{,3})$

can be represented as a FSM using Ragel finite-state machine compiler:

thousands =                       ('M' %{ n += 1000; }){,3};
hundreds = "CM" %{ n += 900; }
         | "CD" %{ n += 400; }
         | ('D' %{ n += 500; } )? ('C' %{ n += 100;  }){,3};
tens     = "XC" %{ n += 90;  }
         | "XL" %{ n += 40;  }
         | ('L' %{ n += 50;  } )? ('X' %{ n += 10;   }){,3};
units    = "IX" %{ n += 9;   }
         | "IV" %{ n += 4;   }
         | ('V' %{ n += 5;   } )? ('I' %{ n += 1;    }){,3};
numeral = thousands hundreds tens units;

main := numeral > { n = 0; } ;

it is a complete specification: it converts all valid Roman numerals. It rejects all that are invalid
it is concise: you can put it on a card
it is straightforward: initialize n with zero and add thousands, hundreds, tens, and units. 100s, 10s, 1s follow a simple pattern: nine | four | (five? ten{0,3}) e.g., tens part is either 90 or 40 or optional 50 plus upto three 10s.
it is declarative and easy to extend without introducing errors e.g., to support four consecutive numerals such as IIII in addition to subtractive IV, it is enough to replace {,3} with {,4}. To support Unicode/lower/upper case letters, the corresponding literals such as 'M' could be replaced with M where M = 'M' | 'm' | "Ⅿ" | "ⅿ";
ragel translates it to a fast table- or goto-driven FSM in pure C.

Complete code example (with Unicode and IIII extensions mentioned above). The generated roman_numerals.c has no third-party dependencies.

Humble answered 10/5, 2017 at 5:46 Comment(5)

Why do you think that ragel is not 3rd party dependency? – Ashford 10/5, 2017 at 5:52

@Lashane: "The generated roman_numerals.c has no third-party dependencies." (the file in the gist should have *.rl extension -- github must have changed it on submit to .c -- I've fixed it) – Humble 10/5, 2017 at 5:52

@Lashane: the linked roman_numerals.rl is the compete code example (the comments show how to use it). – Humble 10/5, 2017 at 5:59

Comments show me that I need to execute some 3rd party tool, which could die next year and this answer will be useless, where is complete code example? – Ashford 10/5, 2017 at 6:0

@Lashane: I see no point to post generated code here. The answer is not a software distribution (I would understand if you wanted the generated source included in a source tarball). The answer demonstrates a specific approach on how to solve this problem ("generate C code from higher-level specification"). It works. (I find these comments amusing. I had a problem. I've solved and posted a solution that works for me. It may work for somebody else too. As a thank you, I get downvotes). – Humble 10/5, 2017 at 6:21

The Roman digits come in two classes, the "ones" (I, X, C, M) and the "fives" (V,L, D). The "fives" can't be repeated and cannot be substracted. The "ones" can be repeated up to three times when they don't come after a smaller number and the can be subtracted from a number that isn't greater than the next "one".

When parsing, a digit can be in three different modes: It is either added normally or it is a number to be subtracted or it is a number from which the preceding number was substracted.

You can enforce these rules as you build your number. In addition to the value of a digit, you need a function that classifies the digit. In the code below, the function repeat does this. It returns the maximum number of repetitions per number, but it also serves as classification: 3 means a "one" and 1 means a "five."

The code below seems to produce the same results as your code with regex validation. It returns a positive number for valid Roman numbers and −1 otherwise. (And it has fewer than 28 conditionals.)

int digit(int c)
{
    if (c == 'I') return 1;
    if (c == 'V') return 5;
    if (c == 'X') return 10;
    if (c == 'L') return 50;
    if (c == 'C') return 100;
    if (c == 'D') return 500;
    if (c == 'M') return 1000;
    return 0;
}

int repeat(int c)
{
    if (c == 'I') return 3;
    if (c == 'V') return 1;
    if (c == 'X') return 3;
    if (c == 'L') return 1;
    if (c == 'C') return 3;
    if (c == 'D') return 1;
    if (c == 'M') return 3;
    return 0;
}

int from_roman(const char *s)
{
    int res = 0;                // running result
    int prev = 10000;           // value of previous digit

    if (s == NULL || *s == '\0') return -1;

    while (*s) {
        int c = *s++;                           // Roman digit
        int count = 1;                          // count of consecutive numbers
        int value = digit(c);                   // digit value
        int max = repeat(c);                    // allowed repetitions

        if (value == 0) return -1;              // illegal Roman digit

        while (*s == c) {
            s++;
            count++;
        }

        if (*s && digit(*s) > value) {
            int next = digit(*s++);

            if (max != 3) return -1;            // can only subtract I, X, C
            if (count > 1) return -1;           // can only subtract once
            if (next > 10 * value) return -1;   // IM,ID, IC, IL etc. invalid
            if (value * 10 > prev) return -1;   // VIV, VIX etc. invalid

            res += next - value;
        } else {
            if (count > max) return -1;         // too many repetitions
            if (value >= prev) return -1;       // must decrease

            res += count * value;
        }

        prev = value;
    }

    return res;
}

Edit: The first two drafts of my code had errors, which are now fixed.

Since the validation of correctness is done via a regular expression, another approach is to implement the regex directly, while at the same time calculating the value of the Roman number. Also, given how complicated it is to get the logic to the Roman numbers right, this might be the better approach.

An implementation of this approach could be:

/*
 *      Returns the length of the digit stretch and advances the pointer
 */
static int stretch(const char **s, int m, int max)
{
    int n = 0;

    while (n < max && **s == m) {
        (*s)++;
        n++;
    }
    return n;
}

/*
 *      Parses (I II III IV V VI VII VIII IX) for ones,
 *      tens and hundreds and advances the pointer.
 */
static int parse(const char **s, int x, int v, int i)
{
    int res = 0;

    if (**s == i && *(*s + 1) == x) {
        res += 9;
        *s += 2;
    } else if (**s == i && *(*s + 1) == v) {
        res += 4;
        *s += 2;
    } else {
        res += stretch(s, v, 1) * 5;
        res += stretch(s, i, 3);
    }

    return res;
}

/*
 *      Parse a Roman numeral according the the regex; -1 means failure
 */
int from_roman_regex(const char *s)
{
    int res = 0;

    if (s == NULL || *s == '\0') return -1;

    res += stretch(&s, 'M', 3) * 1000;
    res += parse(&s, 'M', 'D', 'C') * 100;
    res += parse(&s, 'C', 'L', 'X') * 10;
    res += parse(&s, 'X', 'V', 'I') * 1;

    if (*s) return -1;
    return res;
}

The stretch function emulates regexes such as X{0,3}; the parse function emulates regexes such as (V?I{0,3}|IX|IV), but in addition to singalling matching success or failure, it evaluates it as Roman number.

The first approach tries to implement the rules of Roman numbers. This is a bit complicated, but has the advantage that one cound extend it easily to provide exact error messages if one wanted to. The second approach has the advantage that it matches the question's spec exactly: It does what the regex does.

I've tested all Roman numbers up to 3,999 and all combinations of up to 7 Roman digits. The two approaches above and the OP's approach – simple aritgmetic plus regex validation – yielded the same results for all cases.

Autorotation answered 10/5, 2017 at 7:27 Comment(10)

Your code allows Roman numerals that are rejected by the regex. For example, for CMC your code returns 1000 but it is invalid – Humble 10/5, 2017 at 8:54

Oh. I've tested on a large range of partly erroneous numbers and got the same behaviour as your regex. Apparently that's a case I haven't considered. What's the rule here, that you can't use a digit after you have subtracted it? (Will look into it, but I'm busy right now.) – Autorotation 10/5, 2017 at 9:2

The question says explicitly: "for the sake of this question, valid Roman numerals are defined by the regex". If you want the explicit rule with words then one of the links in the question has the following for the subtractive principle: "Only if any numeral following the minuend is smaller than the subtrahend." i.e., you can write: CML (L follows M minuend is less than C subtrahend) but not CMC (C is not less than C). – Humble 10/5, 2017 at 9:27

That's more or less the same rule I proposed, only more formally worded. Do you have a repository of test cases? – Autorotation 10/5, 2017 at 9:32

I test custom invalid numerals that I've encountered in the links and that do not match the regex (e.g., IIII or iij are invalid according to the regex) but you can always find corresponding rules (the specification is rather common though there are variations as it is said in the question). I check all valid (3999) numerals. And use the exhaustive search up to a maxlen for invalid Roman numerals:

for numeral in map(''.join, product(digits, repeat=maxlen)): if not is_valid_according_to_regex(numeral): test_invalid(numeral)

. – Humble 10/5, 2017 at 9:59

Okay, fixed the error. I've also added a solution that does what the regex does. This code is probably very close to the generated code from the other answer. – Autorotation 10/5, 2017 at 11:25

(I've just noticed that the proposal to use generated code is an answer to your own question, so maybe this wasn't intended to invite further answers. Never mind.) – Autorotation 10/5, 2017 at 11:31

now your code fails on VIX (your code accepts it but the regex doesn't). This problem is not as easy as it looks. – Humble 10/5, 2017 at 11:34

Or I am not as mart as I'd like to be. Anyway, that's strange, because I verified that all nonsense combos of up to 7 roman digits yielded the same result with all three methods, but that turned out to be an error with the test. I'm generally wary of regexes in production code, but here it looks like it is the best solution. – Autorotation 10/5, 2017 at 12:5

I had a bug in my testig code, or I would have found the VIV bug, too. Whether the code is more elegant than an automatically generated solution is debatable, though. – Autorotation 10/5, 2017 at 13:46

To create some level of rule flexibility, the following Roman_string_to_unsigned0() employs a table.

It follows the strtol() style of functionality in that an end-pointer is returned indicating where parsing stopped. De-ref and test against '\0' for success.

The function has a bool subtractive parameter to steer the two major types of Roman Numeral parsing: basic, subtractive.

static const struct Roman_digit {
  char ch[3];
  bool subtractive;
  unsigned char limit;
  unsigned char nextdown;  // with parse success, offset to next element to try
  unsigned value;
} Roman_table[] = {
    { "I", false, 4, 1, 1 }, //
    { "IV", true, 1, 2, 4 }, //
    { "V", false, 1, 2, 5 }, //
    { "IX", true, 1, 4, 9 }, //
    { "X", false, 4, 1, 10 }, //
    { "XL", true, 1, 2, 40 }, //
    { "L", false, 1, 2, 50 }, //
    { "XC", true, 1, 4, 90 }, //
    { "C", false, 4, 1, 100 }, //
    { "CD", true, 1, 2, 400 }, //
    { "D", false, 1, 2, 500 }, //
    { "CM", true, 1, 4, 900 }, //
    { "M", false, 4, 1, 1000 }, //
};
#define Roman_table_N (sizeof Roman_table / sizeof Roman_table[0])

const char *Roman_string_to_unsigned0(unsigned *dest, const char *src, bool subtractive){
  *dest = 0;
  for (unsigned i = Roman_table_N; i > 0;) {
    const struct Roman_digit *digit = &Roman_table[i - 1];
    if (!subtractive && digit->subtractive) {
      i--;
      continue;
    }
    unsigned limit = digit->limit;  // repeat count
    if (limit > 1 && subtractive) limit--;
    size_t ch_length = strlen(digit->ch);
    size_t next_i = i-1;
    for (unsigned j=0; j<limit; j++) {
      if (strncmp(src, digit->ch, ch_length) == 0) {
        *dest += digit->value;
        if (*dest < digit->value) { // Overflow detection
          return (char*) src;
        }
        src += ch_length;
        next_i = i - digit->nextdown;  // With success, maybe skip down the list 
      } else {
        break;
      }
    }
    i = next_i;
  }
  return (char*) src;
}

Notes: Case insensitivity not yet encoded. An empty string returns 0. By this code working most-to-least significant, "XXXMMM" does not pass.

Wesle answered 10/5, 2017 at 21:47 Comment(3)

Per my testing, I am confident this passes ^(M{0,3})(D?C{0,3}|CM|CD)(L?X{0,3}|XC|XL)(V?I{0,3}|IX|IV)$ when it should and fails otherwise. – Wesle 10/5, 2017 at 21:50

@J.F.Sebastian Grumble - cut and paste error - grumble - code fixed. – Wesle 11/5, 2017 at 2:26

it passes my tests now. – Humble 11/5, 2017 at 3:30

Use strcmp() or in other words, round-trip the string.

First consider the reverse problem, number --> string.

There are many ways to efficiently convert an integer to a string of Roman numerals. Let us call it:

// return false on error due to `value` range error or scant `size`
bool roman_int_to_string(char *dest, size_t size, int value);

Aside from letter case concerns, there is a one-to-one relationship between a canonical Roman number string and an int. Simply convert the source string to an int and then the int into another test string. If these strings match, we have a winner.

#define ROMAN_STRING_N 20
int roman_numeral_to_int_with_validate(const char *s, int size, bool *is_canonical) {
  int value = roman_numeral_to_int(s, size);

  char test[ROMAN_STRING_N];
  *is_canonical = roman_int_to_string(test, sizeof test, value);
  if (*is_canonical) {
    if (strcmp(s, test)) {  // Or use a case insensitive compare as desired
      *is_canonical = false;
    }
  }

  return value;
}

Lesson: I did code up a direct validation function. To test it I needed the inverse roman_int_to_string(). A random string generator rapidly showed many surprising errors like "CMC" and "CMCD" as well as OP's "IIII". In the end, using a simplistic string-to-int and int-to-string function pair and then doing a string compare was the most resilient.

Wesle answered 10/5, 2017 at 18:18 Comment(3)

it is a very nice idea: It is simple and easy to get right. The downside is that it assumes that there is a one-to-one relationship roman<->int (for normalized case) that is true for the specific ruleset in the question but it does not hold if you want to extend it. – Humble 10/5, 2017 at 18:50

@J.F.Sebastian Agreed. OTOH, the test set of string cases becomes nearly unmanageable or not known complete with relaxed specs. If one wants to really "extend" the rule set, that may simple devolve to valid = int roman_numeral_to_int(s, size) > 0;. Ancient Romans really were loose on what was valid and generally did not use the subtractive principle. IMO, roman string to int follows 1 of 4 rule sets: non-subtractive/subtractive and canonical/anything goes like "(CCC)MMMMDMViiiiL". Just return an endptr like strtol() to denote scanning end. Thanks for the fun post. – Wesle 10/5, 2017 at 19:7

I've used your idea to implement Roman literals for CPython (an analog of the 1st April's PEP-313). It works. I can type 0rXIV, 0Riii in python now. See How to add support for int literals written as Roman numerals to CPython? What does it take? – Humble 30/6, 2017 at 14:9

Simple and Sweet logic using subtract the value

ng sorry code is in a python but you can follow the logic i have used while i have done this programme**

def checkio(data):
roman=""
while(data!=0):
    if data>=1000:
    data-=1000
        roman+='M'
        elif data>=900:
        data-=900
        roman+='CM'
        elif data>=500:
        data-=500
        roman+='D'
        elif data>=400:
        data-=400
        roman+='CD'
        elif data>=100:
        data-=100
        roman+='C'
        elif data>=90:
        data-=90
        roman+='XC'
        elif data>=50:
        data-=50
        roman+='L'
        elif data>=40:
        data-=40
        roman+='XL'
        elif data>=10:
        data-=10
        roman+='X'
        elif data>=9:
        data-=9
        roman+='IX'
        elif data>=5:
        data-=5
        roman+='V'
        elif data>=4:
        data-=4
        roman+='IV'
        elif data>=1:
        data-=1
        roman+='I'
    return roman

Bonds answered 10/5, 2017 at 8:5 Comment(1)

Your code tries to implement to_roman(n: int) -> str while the question is about the opposite direction (from_roman(numeral: str) -> int). The input in the question Roman numerals the output is the corresponding int or an error if the input numerals are invalid according to the regex. – Humble 10/5, 2017 at 8:57

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags