I'm looking for some simple examples and best practices of how to use regular expressions in ANSI C. man regex.h does not provide that much help.
Regular expressions actually aren't part of ANSI C. It sounds like you might be talking about the POSIX regular expression library, which comes with most (all?) *nixes. Here's an example of using POSIX regexes in C (based on this):
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    regex_t regex;
    int reti;
    char msgbuf[100];

    /* Compile regular expression */
    reti = regcomp(&regex, "^a[[:alnum:]]", 0);
    if (reti) {
        fprintf(stderr, "Could not compile regex\n");
        exit(1);
    }

    /* Execute regular expression */
    reti = regexec(&regex, "abc", 0, NULL, 0);
    if (!reti) {
        puts("Match");
    }
    else if (reti == REG_NOMATCH) {
        puts("No match");
    }
    else {
        regerror(reti, &regex, msgbuf, sizeof(msgbuf));
        fprintf(stderr, "Regex match failed: %s\n", msgbuf);
        exit(1);
    }

    /* Free memory allocated to the pattern buffer by regcomp() */
    regfree(&regex);
    return 0;
}
Alternatively, you may want to check out PCRE, a library for Perl-compatible regular expressions in C. The Perl syntax is pretty much the same syntax used in Java, Python, and a number of other languages. The POSIX syntax is the syntax used by grep, sed, vi, etc.
regcomp, cflags, is a bitmask. From pubs.opengroup.org/onlinepubs/009695399/functions/regcomp.html : "The cflags argument is the bitwise-inclusive OR of zero or more of the following flags...". If you OR-together zero, you'll get 0. I see that the Linux manpage for regcomp says "cflags may be the bitwise-or of one or more of the following", which does seem misleading. –
Zumwalt
regmatch_t matches[MAX_MATCHES]; if (regexec(&exp, sz, MAX_MATCHES, matches, 0) == 0) { memcpy(buff, sz + matches[1].rm_so, matches[1].rm_eo - matches[1].rm_so); printf("group1: %s\n", buff); }
Note that group matches start at 1; group 0 is the entire match. Add error checks for out of bounds, etc. –
Granvillegranvillebarker
regex_t again"; this sounds wrong. The documentation says the function frees several internal fields. In other words, you should call regfree() regardless of whether you want to reuse the regex_t. In fact, the way the documentation puts it, you also have to call it on failed regcomp()s... –
Barrelhouse
regcomp fails? I'm not seeing this in any docs I can find, and every example I can find does not call regfree if regcomp fails, e.g. pubs.opengroup.org/onlinepubs/009695399/functions/regcomp.html –
Zumwalt
man regcomp says "regerror() is used to turn the error codes that can be returned by both regcomp() and regexec() into error message strings." What's interesting is that regerror requires as an argument the regex_t that you tried to initialize during regcomp. This doesn't mean the regex_t still has allocated memory to release, but it's suspicious anyway. –
Barrelhouse
regcomp once (regardless of success), you can call regfree as many consecutive times as you want and it won't complain or crash. It really isn't designed like your average variation of the free function. I feel like the mentality of this API is that all pipelines after a regcomp can (and should) include a regfree for symmetric "candy" purposes. Like it wasn't designed by someone used to the malloc-free pattern (where you don't have to deallocate unless the init was successful). –
Barrelhouse
regcomp and friends probably also terminate gracefully during uses more akin to malloc-free. It's what I've been led to infer, anyway. But as a very careful user of these functions, I'd rather play by contracts the documentation and the API seem to promise. –
Barrelhouse
regfree before the exit(1)? –
Barrelhouse
regfree is necessary after a failed regcomp, though it really is rather under-specified, this suggests that it shouldn't be done: redhat.com/archives/libvir-list/2013-September/msg00276.html –
Graiggrail
This is an example of using REG_EXTENDED. This regular expression
"^(-)?([0-9]+)((,|\\.)([0-9]+))?\n$"
allows you to match decimal numbers written with either a decimal comma (the Spanish system) or a decimal point (international). :)
#include <regex.h>
#include <stdlib.h>
#include <stdio.h>

regex_t regex;
int reti;
char msgbuf[100];

int main(int argc, char const *argv[])
{
    while (1) {
        /* Stop at end of input instead of looping forever */
        if (fgets(msgbuf, sizeof(msgbuf), stdin) == NULL)
            break;
        /* The dot is escaped so that it only matches a literal '.' */
        reti = regcomp(&regex, "^(-)?([0-9]+)((,|\\.)([0-9]+))?\n$", REG_EXTENDED);
        if (reti) {
            fprintf(stderr, "Could not compile regex\n");
            exit(1);
        }
        /* Execute regular expression */
        printf("%s\n", msgbuf);
        reti = regexec(&regex, msgbuf, 0, NULL, 0);
        if (!reti) {
            puts("Match");
        }
        else if (reti == REG_NOMATCH) {
            puts("No match");
        }
        else {
            regerror(reti, &regex, msgbuf, sizeof(msgbuf));
            fprintf(stderr, "Regex match failed: %s\n", msgbuf);
            exit(1);
        }
        /* Free memory allocated to the pattern buffer by regcomp() */
        regfree(&regex);
    }
    return 0;
}
regcomp outside the loop? Unless it should be initialized every time it gets used. –
Chum
It's probably not what you want, but a tool like re2c can compile POSIX(-ish) regular expressions to ANSI C. It's written as a replacement for lex, but this approach allows you to sacrifice flexibility and legibility for the last bit of speed, if you really need it.
man regex.h doesn't show any manual entry for regex.h, but man 3 regex shows a page explaining the POSIX functions for pattern matching.
The same functions are described in The GNU C Library: Regular Expression Matching, which explains that the GNU C Library supports both the POSIX.2 interface and the interface the GNU C Library has had for many years.
For example, for a hypothetical program that prints which of the strings passed as arguments match the pattern passed as the first argument, you could use code similar to the following.
#include <errno.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void print_regerror (int errcode, size_t length, regex_t *compiled);

int
main (int argc, char *argv[])
{
  regex_t regex;
  int result;

  if (argc < 3)
    {
      // The number of passed arguments is lower than the number of
      // expected arguments.
      fputs ("Missing command line arguments\n", stderr);
      return EXIT_FAILURE;
    }

  result = regcomp (&regex, argv[1], REG_EXTENDED);
  if (result)
    {
      // Any value different from 0 means it was not possible to
      // compile the regular expression, either for memory problems
      // or problems with the regular expression syntax.
      if (result == REG_ESPACE)
        fprintf (stderr, "%s\n", strerror(ENOMEM));
      else
        fputs ("Syntax error in the regular expression passed as first argument\n", stderr);
      return EXIT_FAILURE;
    }

  for (int i = 2; i < argc; i++)
    {
      result = regexec (&regex, argv[i], 0, NULL, 0);
      if (!result)
        {
          printf ("'%s' matches the regular expression\n", argv[i]);
        }
      else if (result == REG_NOMATCH)
        {
          printf ("'%s' doesn't match the regular expression\n", argv[i]);
        }
      else
        {
          // The function returned an error; print the string
          // describing it.
          // Get the size of the buffer required for the error message.
          size_t length = regerror (result, &regex, NULL, 0);
          print_regerror (result, length, &regex);
          // Release the compiled pattern before bailing out.
          regfree (&regex);
          return EXIT_FAILURE;
        }
    }

  /* Free the memory allocated from regcomp(). */
  regfree (&regex);
  return EXIT_SUCCESS;
}

void
print_regerror (int errcode, size_t length, regex_t *compiled)
{
  char buffer[length];
  (void) regerror (errcode, compiled, buffer, length);
  fprintf(stderr, "Regex match failed: %s\n", buffer);
}
The last argument of regcomp() needs to include at least REG_EXTENDED, or the functions will use basic regular expressions, which means that (for example) you would need to write a\{3\} instead of the a{3} used in extended regular expressions, which is probably what you expect to use.
POSIX.2 also has another function for wildcard matching: fnmatch(). It doesn't let you compile the pattern or get the substrings matching a sub-expression, but it is designed specifically for checking whether a filename matches a wildcard pattern (with the FNM_PATHNAME flag).
While the answer above is good, I recommend using PCRE2. This means you can use pretty much all the regex examples out there as they are, without having to translate them into some ancient regex dialect.
I made an answer for this already, but I think it can help here too.
Regex In C To Search For Credit Card Numbers
// YOU MUST SPECIFY THE UNIT WIDTH BEFORE THE INCLUDE OF pcre2.h
#define PCRE2_CODE_UNIT_WIDTH 8
#include <stdio.h>
#include <string.h>
#include <pcre2.h>
#include <stdbool.h>
int main(){
bool Debug = true;
bool Found = false;
pcre2_code *re;
PCRE2_SPTR pattern;
PCRE2_SPTR subject;
int errornumber;
int i;
int rc;
PCRE2_SIZE erroroffset;
PCRE2_SIZE *ovector;
size_t subject_length;
pcre2_match_data *match_data;
char * RegexStr = "(?:\\D|^)(5[1-5][0-9]{2}(?:\\ |\\-|)[0-9]{4}(?:\\ |\\-|)[0-9]{4}(?:\\ |\\-|)[0-9]{4})(?:\\D|$)";
char * source = "5111 2222 3333 4444";
pattern = (PCRE2_SPTR)RegexStr;// <<<<< This is where you pass your REGEX
subject = (PCRE2_SPTR)source;// <<<<< This is where you pass your buffer that will be checked.
subject_length = strlen((char *)subject);
re = pcre2_compile(
pattern, /* the pattern */
PCRE2_ZERO_TERMINATED, /* indicates pattern is zero-terminated */
0, /* default options */
&errornumber, /* for error number */
&erroroffset, /* for error offset */
NULL); /* use default compile context */
/* Compilation failed: print the error message and exit. */
if (re == NULL)
{
PCRE2_UCHAR buffer[256];
pcre2_get_error_message(errornumber, buffer, sizeof(buffer));
printf("PCRE2 compilation failed at offset %d: %s\n", (int)erroroffset,buffer);
return 1;
}
match_data = pcre2_match_data_create_from_pattern(re, NULL);
rc = pcre2_match(
re,
subject, /* the subject string */
subject_length, /* the length of the subject */
0, /* start at offset 0 in the subject */
0, /* default options */
match_data, /* block for storing the result */
NULL);
if (rc < 0)
{
switch(rc)
{
case PCRE2_ERROR_NOMATCH: //printf("No match\n"); //
pcre2_match_data_free(match_data);
pcre2_code_free(re);
Found = 0;
return Found;
// break;
/*
Handle other special cases if you like
*/
default: printf("Matching error %d\n", rc); //break;
}
pcre2_match_data_free(match_data); /* Release memory used for the match */
pcre2_code_free(re);
Found = 0; /* data and the compiled pattern. */
return Found;
}
if (Debug){
ovector = pcre2_get_ovector_pointer(match_data);
printf("Match succeeded at offset %d\n", (int)ovector[0]);
if (rc == 0)
printf("ovector was not big enough for all the captured substrings\n");
if (ovector[0] > ovector[1])
{
printf("\\K was used in an assertion to set the match start after its end.\n"
"From end to start the match was: %.*s\n", (int)(ovector[0] - ovector[1]),
(char *)(subject + ovector[1]));
printf("Run abandoned\n");
pcre2_match_data_free(match_data);
pcre2_code_free(re);
return 0;
}
for (i = 0; i < rc; i++)
{
PCRE2_SPTR substring_start = subject + ovector[2*i];
size_t substring_length = ovector[2*i+1] - ovector[2*i];
printf("%2d: %.*s\n", i, (int)substring_length, (char *)substring_start);
}
}
else{
if(rc > 0){
Found = true;
}
}
pcre2_match_data_free(match_data);
pcre2_code_free(re);
return Found;
}
Install PCRE2 using:
wget https://ftp.pcre.org/pub/pcre/pcre2-10.31.zip
unzip pcre2-10.31.zip
cd pcre2-10.31
./configure
make
sudo make install
sudo ldconfig
Compile using:
gcc foo.c -lpcre2-8 -o foo
Check my answer for more details.
This example follows the manual page for regex.h as it exists on Ubuntu 22.04.3 LTS.
The regex.h header exposes basic and extended POSIX regular expressions.
The semantics are somewhat analogous to using re.compile in Python: there's a call to compile a regular expression, and a call to match it against a target string. The function available to compile a regular expression is
int regcomp(regex_t *preg, const char *regex, int cflags);
The first parameter, preg, is a pointer to a regex_t. That type is ultimately specified by re_pattern_buffer in the source, and is a struct whose members will contain details related to the regular expression, such as the number of bytes allocated to the regular expression, flags that indicate if sub-expressions exist in the regular expression, and so on. This is where all the state of the compiled regular expression resides. The documentation calls it the pattern buffer storage area.
The second parameter, regex, points to the regular expression string. This is the POSIX regular expression pattern that is searched for in a target string.
The third parameter, cflags, is 0 or a bitwise-or of flags from the following:
- REG_EXTENDED, i.e. use POSIX extended regular expressions
- REG_ICASE, i.e. use case-insensitive pattern matching
- REG_NOSUB, i.e. do not report position of matches (more below)
- REG_NEWLINE, i.e. do not let wildcards match newlines, and let ^ and $ match at line boundaries
Here is a compiled regular expression:
// holds details about the regular expression
static regex_t preg;
// the raw regular expression (static'd ref and val)
static const char *const regex = "^bork.*$";
// pack details needed for matching into preg
// then return 0 if compiled correctly, else error code
int errorcode = regcomp(&preg, regex, REG_ICASE);
The regular expression ^bork.*$ is now represented by the members of preg. The REG_ICASE flag was passed in too, so the case of the target string won't matter.
Compiling a regular expression can go wrong, in at least 14 different ways according to the manual page. To handle this, regcomp returns 0 on success or an error code when something goes wrong.
Each error code corresponds to a textual error message, and regex.h also has a function that accepts the error code and writes that message to a buffer of your choice:
size_t regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size);
The parameters to this function, in order, are
- our error code, errorcode
- pointer to the compiled regular expression, &preg
- pointer to some buffer to dump the message into
- the size of the buffer
One simple pattern that uses regerror is this:
// initialize regular expression buffer struct
regex_t preg;
// initialize bad regular expression string
static const char *const regex = "[A-Z";
// initialize buffer for error message
int ebuffsize = 1000;
char ebuff[ebuffsize];
// compile with no flags
int ecode = regcomp(&preg, regex, 0);
// get the length of the textual error message (regerror returns a size_t)
size_t etextsize = 0;
if (ecode)
etextsize = regerror(ecode, &preg, ebuff, ebuffsize);
// use etextsize to resize ebuff if it's too small ...
To actually perform matching with a compiled regular expression, regex.h exposes
int regexec(
const regex_t *preg, // pointer to pattern buffer storage area
const char *string, // pointer to target string
size_t nmatch, // number of entries in pmatch
regmatch_t pmatch[], // array of regmatch_t types
int eflags // bitwise-or match options
);
The simplest usage occurs when a regular expression is compiled with the REG_NOSUB flag, since regexec then ignores nmatch and pmatch (these two parameters are involved in match addressing, described below). Here's a simple but contrived use case:
// initialize regular expression buffer struct
regex_t preg;
// initialize regular expression string
static const char *const regex = "bork";
// compile regular expression or exit
int ecode = regcomp(&preg, regex, REG_NOSUB);
if (ecode)
exit(-1);
// perform matching and print output
static const char *const target = "spoon-bork-knife";
printf("match value: %i", regexec(&preg, target, 0, NULL, 0));
// prints 0 to indicate success
Usage is a bit more complex when utilizing match addressing. Match addressing makes it possible to extract matching patterns from the target string, as in the manual page example (with added comments):
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>

// macro to determine nmatch from pmatch
#define ARRAY_SIZE(arr) (sizeof((arr)) / sizeof((arr)[0]))

// initialize target string and regular expression string
static const char *const str = "1) John Driverhacker;\n2) John Doe;\n3) John Foo;\n";
static const char *const re = "John.*o";

int main(void)
{
    // pointer into the target string, and the compiled regular expression
    const char *s = str;
    regex_t regex;
    // initialize pmatch container for caching match offset and length
    regmatch_t pmatch[1];
    // initialize offset and length types
    regoff_t off, len;

    // exit if regular expression can't be compiled
    if (regcomp(&regex, re, REG_NEWLINE))
        exit(EXIT_FAILURE);

    printf("String = \"%s\"\n", str);
    printf("Matches:\n");

    // for each possible match
    for (int i = 0; ; i++) {
        // exit if no more matches
        if (regexec(&regex, s, ARRAY_SIZE(pmatch), pmatch, 0))
            break;
        // compute offset of match and length of match and print
        off = pmatch[0].rm_so + (s - str);
        len = pmatch[0].rm_eo - pmatch[0].rm_so;
        printf("#%d:\n", i);
        printf("offset = %jd; length = %jd\n", (intmax_t) off, (intmax_t) len);
        // print the match (the precision argument of %.*s must be an int)
        printf("substring = \"%.*s\"\n", (int) len, s + pmatch[0].rm_so);
        // move the pointer past the end of this match
        s += pmatch[0].rm_eo;
    }

    exit(EXIT_SUCCESS);
}
Here is the output:
String = "1) John Driverhacker;
2) John Doe;
3) John Foo;
"
Matches:
#0:
offset = 25; length = 7
substring = "John Do"
#1:
offset = 38; length = 8
substring = "John Foo"
The last function in regex.h that should be mentioned is
void regfree(regex_t *preg);
This function is passed the compiled pattern buffer and releases the memory that regcomp allocated for it, so it plays a role similar to free in terms of hygiene.