Regular expressions to match C grammar

This page presents Perl regular expressions which match various lexical elements of C: comments, preprocessor directives, string literals, and operators.

The following Perl regular expression matches traditional-style C comments, whether on one line, like /* this */, or spread over several lines, like

/*
this
*/

our $trad_comment_re = qr!
                            /\*
                            (?:
                                # Match "not an asterisk"
                                [^*]
                            |
                                # Match multiple asterisks followed
                                # by anything except an asterisk or a
                                # slash.
                                \*+[^*/]
                            )*
                            # Match multiple asterisks followed by a
                            # slash.
                            \*+/
                        !x;
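
As a usage illustration, the following minimal sketch extracts the traditional comments from a piece of C source and then strips them. The pattern is repeated in compact form so that the example runs on its own; the input text and the variable names $c, @comments and $stripped are invented for the example.

```perl
use strict;
use warnings;

# Same pattern as above, repeated so the example is self-contained.
my $trad_comment_re = qr!
    /\*
    (?:
        [^*]        # anything that is not an asterisk
    |
        \*+[^*/]    # asterisks followed by neither an asterisk nor a slash
    )*
    \*+/
!x;

# Hypothetical input, invented for this example.
my $c = <<'EOF';
int x; /* one line */
/*
 * several lines, with ** stars **
 */
int y; /**/
EOF

# Collect every comment in the order it appears.
my @comments = $c =~ /($trad_comment_re)/g;
print scalar(@comments), " comments found\n";   # prints "3 comments found"

# Remove the comments, leaving the code behind.
(my $stripped = $c) =~ s/$trad_comment_re//g;
print $stripped;
```

Applied to raw source in this way, the expression also matches a /* ... */ sequence which happens to sit inside a string literal, so a full tokenizer needs to match strings and comments together.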

Matching C++-style comments is easier:

our $cxx_comment_re = qr!//.*\n!;
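
Because the pattern ends in \n, the newline is part of the match, and a comment on the final line of a file which lacks a trailing newline is not matched. The sketch below demonstrates this on an invented input. The variant $cxx_comment_eof_re is an assumption of this example, not part of the original set of expressions; it accepts the end of the input in place of the newline.

```perl
use strict;
use warnings;

my $cxx_comment_re = qr!//.*\n!;

# Hypothetical input; the last line deliberately has no newline.
my $c = "int a; // first\nint b; // second";

my @found = $c =~ /($cxx_comment_re)/g;
print scalar(@found), "\n";      # prints 1: only "// first\n" is matched

# Illustrative variant (an assumption, not from the original page):
# accept either a newline or the end of the string.
my $cxx_comment_eof_re = qr!//[^\n]*(?:\n|\z)!;

my @found_eof = $c =~ /($cxx_comment_eof_re)/g;
print scalar(@found_eof), "\n";  # prints 2
```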

The following regular expression matches a C preprocessor directive, including any lines continued with a backslash and any traditional comments within it:

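our $cpp_re = qr/^\h*
                 \#
                 (?:
                     # A traditional comment, possibly spanning lines
                     $trad_comment_re
                 |
                     # Anything except a backslash or a newline
                     [^\\\n]
                 |
                     # A backslash followed by anything except a newline
                     \\[^\n]
                 |
                     # A line continuation: a backslash, then a newline
                     \\\n
                 )+\n
                /mx;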
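
The sketch below shows the expression picking out directives from an invented piece of C source, including one continued with a backslash and one containing a multi-line comment. Both patterns are repeated (the comment pattern in compact form) so that the example runs on its own; $c and @directives are names made up for the example. Note that \h needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Compact form of the traditional comment pattern.
my $trad_comment_re = qr!/\*(?:[^*]|\*+[^*/])*\*+/!;

my $cpp_re = qr/^\h*
                \#
                (?:
                    $trad_comment_re
                |
                    [^\\\n]
                |
                    \\[^\n]
                |
                    \\\n
                )+\n
               /mx;

# Hypothetical input, invented for this example.
my $c = <<'EOF';
#include <stdio.h>
#define MAX(a, b) \
    ((a) > (b) ? (a) : (b))
  # define DEBUG 1 /* spans
two lines */
int main (void) { return 0; }
EOF

# Each match runs from the start of the line to the newline which
# ends the directive, so continued lines stay together.
my @directives = $c =~ /($cpp_re)/g;
print scalar(@directives), " directives\n";   # prints "3 directives"
print "[$_]" for @directives;
```

The /m flag makes ^ match at the start of every line, which is what allows the /g match to find directives in the middle of the text.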

The following regular expressions match a single C string, like "this", and compound C strings, like "this" "one":

our $single_string_re = qr/
                             (?:
                                 "
                                 (?:[^\\"]+|\\[^"]|\\")*
                                 "
                             )
                         /x;

our $string_re = qr/$single_string_re(?:\s*$single_string_re)*/;
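
As an illustration, the following sketch applies $string_re to an invented piece of C containing escaped quotes, two adjacent strings, and an empty string. The patterns are repeated so that the example runs on its own; $c and @strings are names made up for the example.

```perl
use strict;
use warnings;

my $single_string_re = qr/
    (?:
        "
        (?:[^\\"]+|\\[^"]|\\")*
        "
    )
/x;

my $string_re = qr/$single_string_re(?:\s*$single_string_re)*/;

# Hypothetical input, invented for this example.
my $c = <<'EOF';
printf ("He said \"hi\"\n" " and left", x);
puts ("");
EOF

# Adjacent strings separated only by whitespace come back as one match.
my @strings = $c =~ /($string_re)/g;
print scalar(@strings), " strings\n";   # prints "2 strings"
print "$_\n" for @strings;
```

The first match covers both of the adjacent strings, escaped quotes included; the second is the empty string "".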

The following regular expressions match one-character C operators and C operators of any length, respectively.

our $one_char_op_re = qr/[%&+\-=\/|.*:><!?~^]/;

our $operator_re = qr/
                        (?:
                                # Operators with one or two characters
                                # followed by an equals sign. This branch
                                # comes first so that "<<=" is not matched
                                # as "<<".
                                (?:<<|>>|\+|-|\*|\/|%|&|\||\^|<|>|!)
                                =
                            |
                                # Other operators with two characters
                                \|\||&&|<<|>>|--|\+\+|->|==
                            |
                                $one_char_op_re
                            )
                    /x;
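
Perl tries the branches of an alternation from left to right and takes the first one which succeeds, so the order of the branches decides whether a sequence like <<= comes out as one operator or as a shorter operator followed by the rest. The sketch below illustrates the point with a reduced operator list which is invented for the example; building the alternation longest-first is one possible approach, not necessarily how any particular module does it.

```perl
use strict;
use warnings;

# Hypothetical, reduced operator list for illustration only.
my @ops = ('<<=', '<<', '<=', '<', '=', '+', '++', '+=');

# Put the longest operators first so that "<<=" is tried before "<<"
# and "<"; ties are broken alphabetically to keep the order stable.
my $alternation = join '|',
    map { quotemeta ($_) }
    sort { length ($b) <=> length ($a) || $a cmp $b } @ops;

my $op_re = qr/(?:$alternation)/;

my @tokens = 'a <<= b++ + c <= d' =~ /($op_re)/g;
print "@tokens\n";   # prints "<<= ++ + <="
```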

All of these regular expressions are supplied in the Perl CPAN module C::Tokenize.


Copyright © Ben Bullock 2009-2017. All rights reserved. For comments, questions, and corrections, please email Ben Bullock (benkasminbullock@gmail.com).