Regular expressions to match C grammar

This page gives Perl regular expressions for matching various parts of C source code: comments, preprocessor instructions, strings, and operators.

The following Perl regular expression matches traditional-style C comments, either on one line, like /* this */, or spanning several lines, like

/*
this
*/

our $trad_comment_re = qr!
                            /\*
                            (?:
                                # Match "not an asterisk"
                                [^*]
                            |
                                # Match multiple asterisks followed
                                # by anything except an asterisk or a
                                # slash.
                                \*+[^*/]
                            )*
                            # Match multiple asterisks followed by a
                            # slash.
                            \*+/
                        !x;
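
As a quick check of how the pattern behaves, the following sketch applies it to a small C fragment held in a Perl string. The regex is restated without its comments so that the example runs on its own; the variable names $c, @comments and $stripped and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# The traditional-comment regex, restated so this example is self-contained.
my $trad_comment_re = qr!
    /\*
    (?:
        [^*]
    |
        \*+[^*/]
    )*
    \*+/
!x;

# Hypothetical input: a one-line comment, a multi-line comment, and
# a comment consisting only of asterisks.
my $c = "int x; /* this */ int y; /*\nthis\n*/ int z; /***/\n";

# Collect every comment in the order in which it appears.
my @comments = $c =~ /($trad_comment_re)/g;

# Replace each comment with a single space, as a C compiler does.
(my $stripped = $c) =~ s/$trad_comment_re/ /g;

print scalar (@comments), " comments found\n";
print $stripped;
```

The pattern knows nothing about string literals, so text that looks like a comment inside a C string, such as "/* not a comment */", is matched as well.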

Matching the C++-style comments is easier:

our $cxx_comment_re = qr!//.*\n!;
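
One property of this pattern is worth seeing in action: it requires the newline that ends the comment, and that newline is part of the match. The sketch below uses a made-up input string $c whose last comment has no newline after it.

```perl
use strict;
use warnings;

# The C++-comment regex, restated so this example is self-contained.
my $cxx_comment_re = qr!//.*\n!;

# Hypothetical input: the second comment ends at the end of the string
# rather than at a newline.
my $c = "int x; // first\nint y; // last, no newline";

# Collect the comments which the pattern finds.
my @found = $c =~ /($cxx_comment_re)/g;

print scalar (@found), " comment found\n";
```

Only the first comment is found here. If the input may lack a final newline, one possible workaround is to append a newline to the text before matching.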

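The following regular expression matches a C preprocessor instruction, including lines continued with a backslash and any traditional-style comments inside it. It uses $trad_comment_re from above, and the \h (horizontal whitespace) escape needs Perl 5.10 or later: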

our $cpp_re = qr/^\h*\#(?:
                    $trad_comment_re
                |
                    [^\\\n]
                |
                    \\[^\n]
                |
                    \\\n
                )+\n
               /mx;
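
The following sketch shows the three cases the alternation handles: an ordinary directive, a directive continued with a backslash, and a directive containing a comment which spans lines. The comment regex is given in a compact form without whitespace, and the input in $c is made up for illustration.

```perl
use strict;
use warnings;

# Compact form of the traditional-comment regex, without /x whitespace.
my $trad_comment_re = qr!/\*(?:[^*]|\*+[^*/])*\*+/!;

# The preprocessor regex, restated so this example is self-contained.
my $cpp_re = qr/^\h*\#(?:
                    $trad_comment_re
                |
                    [^\\\n]
                |
                    \\[^\n]
                |
                    \\\n
                )+\n
               /mx;

# Hypothetical input. The single-quoted here-document keeps the
# backslash at the end of the "#define" line as it is.
my $c = <<'EOF';
#include <stdio.h>
  #define MAX(a, b) \
      ((a) > (b) ? (a) : (b))
int main (void) { return 0; }
#endif /* multi
line */
EOF

# Collect each directive, including its final newline.
my @directives = $c =~ /($cpp_re)/g;

print scalar (@directives), " directives found\n";
```

The second directive comes back as one string covering both of its lines, and the third includes the whole comment, because the comment alternative is tried before the single-character ones.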

The following regular expressions match a single C string, like "this", and compound C strings, like "this" "one":

our $single_string_re = qr/
                             (?:
                                 "
                                 (?:[^\\"]+|\\[^"]|\\")*
                                 "
                             )
                         /x;

our $string_re = qr/$single_string_re(?:\s*$single_string_re)*/;
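
To illustrate the difference between the two, the following sketch runs $string_re over a line of C containing a compound string with escaped quotes and a separate single string. Both regexes are restated in compact form, and the input in $c is made up for illustration.

```perl
use strict;
use warnings;

# Compact forms of the two string regexes, so this example is self-contained.
my $single_string_re = qr/(?:"(?:[^\\"]+|\\[^"]|\\")*")/;
my $string_re = qr/$single_string_re(?:\s*$single_string_re)*/;

# Hypothetical input. Inside Perl single quotes, \" stays as a
# backslash followed by a double quote, as it would appear in C.
my $c = 'puts ("say \"hi\"" " twice"); char *s = "one";';

# Collect each string, treating adjacent strings as one unit.
my @strings = $c =~ /($string_re)/g;

print "$_\n" for @strings;
```

Two matches are found: the adjacent pair is returned as one compound string, with the escaped quotes kept inside it, and "one" is returned separately.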

The following regular expressions match one-character C operators and all C operators respectively.

our $one_char_op_re = qr/(?:\%|\&|\+|\-|\=|\/|\||\.|\*|\:|>|<|\!|\?|~|\^)/;

our $operator_re = qr/
                        (?:
                                # Operators with one or two characters
                                # followed by an equals sign. These are
                                # tried first so that <<= is not cut
                                # short at <<.
                                (?:<<|>>|\+|-|\*|\/|%|&|\||\^|=|!|<|>)
                                =
                            |
                                # Operators with two characters
                                \|\||&&|<<|>>|--|\+\+|->
                            |
                                $one_char_op_re
                            )
                    /x;
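
The order of the alternatives matters when operators are pulled out of running text, because Perl takes the first alternative that matches, not the longest. The following sketch lists the forms ending in an equals sign first, so that "<<=" and "!=" come out whole. The expression in $c is made up for illustration.

```perl
use strict;
use warnings;

my $one_char_op_re = qr/(?:\%|\&|\+|\-|\=|\/|\||\.|\*|\:|>|<|\!|\?|~|\^)/;

my $operator_re = qr/
    (?:
        # Assignment and comparison forms first, longest prefix first.
        (?:<<|>>|\+|-|\*|\/|%|&|\||\^|=|!|<|>)
        =
    |
        # Other operators with two characters.
        \|\||&&|<<|>>|--|\+\+|->
    |
        $one_char_op_re
    )
/x;

# Hypothetical input expression.
my $c = 'p->n <<= a++ + b != c ? d : ~e';

# Collect the operators in the order in which they appear.
my @ops = $c =~ /($operator_re)/g;

print join (' ', @ops), "\n";
```

This prints "-> <<= ++ + != ? : ~". If the two-character alternative were listed first, "<<=" would come out as "<<" followed by "=".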

All of these regular expressions are supplied in the CPAN module C::Tokenize.

Copyright © Ben Bullock 2009-2012. All rights reserved. For comments, questions, and corrections, please email Ben Bullock (ben.bullock@lemoda.net) / Privacy / Disclaimer