A C++ compile-time string parser prototype

August 15, 2009 by Tor Brede Vekterli

Update 20 November 2011: In an earlier update, I jumped to conclusions that this wouldn't be possible to do with C++11 constexprs, but newer comments have hopefully proven me entirely wrong :) This bodes well for someone creating a better, non-hacky implementation in the future! Also, sorry for the utterly abysmal site update frequency for quite some time now.

When the compiler-adoption of C++11 becomes more commonplace and extensive, we will eventually see the introduction of a language feature known as user-defined literals. This will allow us to transform a constexpr (i.e. known fully at compile-time) string into a variadic template argument list of characters using a special suffix operator, e.g. "hello"_stuffoperator"" _stuff<'h','e','l','l','o'>. This fills a void left by the earlier C++ standards in that it has been impossible for template code to access the contents of a regular string literal.

So why is this interesting? Since we're given the contents of a string at compile-time, it's evident that this is because we would like to process that string somehow at the meta-level, but this is hardly a straight forward task with immutable template programming. In the run-time world we already have excellent parsing support in the form of Boost.Spirit 2.1, ANTLR, YACC et al, but to the best of my knowledge no such solution exists for C++ metaprogramming. This might perhaps be because no one has been crazy enough to attempt it (C++ template programming being infamous for its hilarious error-verbosity and debugging difficulties), but I decided to give developing such a library a go.

This article outlines a highly experimental/prototypical recursive-descent, attribute-based "meta-parser" that is inspired very closely by the concepts found in Boost.Spirit 2.1. In fact, it's sufficiently closely based on them that I would recommend reading the tutorial documentation for its Qi module before continuing to get a feel for what's involved.

Update: added download link for prototype code. Only tested with Visual C++ 2008, very unlikely that it will compile cleanly in GCC et al.

Files associated with this post

Metaprogramming and recursive-descent parsing

There are certain inherent limitations to template metaprogramming that shape the way parsing is performed. The most important one is that metaprogramming is pure functional and immutable, that is, any function call (i.e. template invocation) must happen without any side effects and you cannot change its contents once it has been instantiated. This means that if we want to process a sequence of characters, we cannot simply use a for-style loop, since that would require the ability to change a mutable index counter variable. Instead, recursion is used to handle all elements of the sequence, each element going one level deeper on the stack. Since recursion is the main enabler of advanced template programming, it's a natural choice to use recursive-descent parsing, both because of its expressive power, non-ambiguity and ability to arbitrarily backtrack as well as the fact that it's used by high-end parsers such as Boost.Spirit and ANTLR et al.

Please note: I'm hardly an expert at parsing theory, so there might be some gross oversights in this article.

Attribute-based parsers

Attribute-based parsing is a very intuitive method of generating results from a parser that has recently found its way to native C++ with Boost.Spirit2. It allows parsers to carry a "synthesized" attribute that may come from the parser itself (for primitive parsers like single integers, characters etc) or a composition of its sub-parsers. For example, an integer parser would have as its attribute the integer representation of the text it parsed, while multiple parser attributes (or with operators such as the Kleene star), we get sequences—or other combinations—of attributes. Consider a pseudo-EBNF/PEG grammar using attributes:

intlist: integer (',' integer)*;

Given the string "10,20,30,40", the attribute of intlist would be the sequence [int: 10, int: 20, int: 30, int: 40]. Or with Spirit2, a std::vector<int> with said values. String literals do not carry an attribute (unless explicitly told to) and will thus not be a part of the resulting attribute.

To give a hopefully gentle introduction to the parser prototype, this is how the grammar is represented, as well as the parser invocation:

struct intlist
  : seq<
        int_
      , kleene<
            ch<','>
          , int_
        >
    >
{};

typedef mpl::string<'10,2','0,30',',40'> str;
typedef parse<
    mpl::begin<str>::type
  , mpl::end<str>::type
  , intlist
> parsed_t;

A bit more verbose, certainly, but keep in mind we're working within the constraints of the C++ template syntax. The most awkward part here is the current lack of string literal support—MPL's string type goes a long way, but still cannot change the language itself. All the various nested types given are primitive parsers, and it's important to note that any combination of such primitive parsers is-a parser, that is, combine several parsers and you get a new, bigger parser type. For this reason, intlist is itself a parser. The primitive parsers used here have the following semantics that should be well known to anyone familiar with EBNF-esque parsing or even regular expressions:

  • seq: a sequence of parsers. Only matches if all its nested parsers also match. Attribute is an mpl::vector of the attributes of the nested parsers.
  • int_: a simple (currently unsigned only) integer parser. Has a synthesized attribute of mpl::int_<parsed_value>.
  • kleene: implements the well-known kleene star, which matches its nested parser zero or more times. As with seq, attribute is an mpl::vector, but where each match of the kleene parser appends a new type to the vector. It may be noted that the parser library always maintains a vector state of attributes for each parser being matched (in this case, seq), and that the kleene parser will append to this parent vector rather than creating a new, nested one.
  • ch: simple single-character literal parser. Does not have an associated attribute.

In turn, given the example string above, we would get the resulting attribute type mpl::vector<mpl::int_<​10>, mpl::int_<​20>, mpl::int_<​30>, mpl::int_<​40>>. As mentioned for the kleene parser, since the attribute state for any parser is a vector, the parsers will append to this vector by default, ensuring that we get a single linear type vector as a result rather than some nested mess.

As with Spirit, parsing is supported on any iterable character range. The 3rd argument is the type of the parser itself, in this case intlist. A 4th parser argument is also partially supported for skip parsing, but this is not fully implemented yet. Parse-success is given by parsed_t::parse_ok (an mpl::bool_). Unlike Spirit2, attributes aren't specified in the grammar/rule declarations themselves, but are a dynamic property of the attributes of the recursive parsers.

Modifying attributes and creating type trees

All this would be pretty unwieldy (and let's face it; useless) if we couldn't do something akin to the Fusion-adaptation of data structures in Spirit, and therefore it's made simple to transform the attributes as they are being generated. As an example for parsing strings of the type "123-456":

template <typename Prefix, typename Postfix>
struct num_range {};

struct num_range_parser
  : attr<
        seq<int_, ch<'-'>, int_>
      , num_range<mpl::_1, mpl::_2>
    >
{};

where _1, _2..._n refer to the current attributes of attr<>'s sub-parsers (essentially, the entries in the nested parser's attribute vector). attr<> is a special parser that can modify the returned attribute type of any parser before it is appended. As one might expect, the resulting type here would be num_range<mpl::int_<123>, mpl::int_<456>>. The same concept of being able to near arbitrarily alter the returned attribute types applies to nested parsers and this is where the power lies to create complex, nested hierarchies of types. When attr is used in conjunction with _1, _2... _n, the attribute vector of the nested parser is unwrapped and ignored in favor of the type that attr specifies.

To take a more realistic example, assume we want to create a type tree representing a regular expression-esque character set string such as [a-z0-9_]:

// Types that make up our tree structure
// A range of characters From-To
template <typename From, typename To>
struct make_range { };

// A single char of the charset
template <typename Char>
struct make_char { };

template <typename Left, typename Right>
struct make_binary_node { };

template <typename Root>
struct make_charset { };

// Parser grammars
struct charset_range
  : rule<
      seq<
          char_
        , ch<'-'>
        , char_
      >
    , make_range<mpl::_1, mpl::_2>
  >
{};

struct charset_char
  : rule<
      diff<char_, ch<']'>> // char_ - ']'
    , make_char<_val>
  >
{};

struct charset_node
  : or_<
      seq<
          charset_range
        , optional<
            re_wrap<
                charset_node
              , make_binary_node<mpl::_1, mpl::_2>
            >
          >
      >
    , seq<
          charset_char
        , optional<
            re_wrap<
                charset_node
              , make_binary_node<mpl::_1, mpl::_2>
            >
          >
      >
  >
{};

struct charset
  : seq<
      ch<'['>
    , attr<charset_node, make_charset<mpl::_1>>
    , ch<']'>
  >
{};

re_wrap<> is similar to attr<>, but rather than appending the modified attribute to the parser state, it replaces the entire attribute with its own upon a successful parse. This allows us to effectively wrap the attributes in a parent type (to hopefully make things clearer, this can be seen somewhat analogous to AST root nodes). rule<> is currently merely a semantic alias of attr<> (will probably be removed unless some other use can be found for it).

Given the string of "[a-z0-9_]", the following type tree is produced:

boost::mpl::vector1<
  make_charset<
    make_binary_node<
        make_range<
            boost::mpl::char_<97>  // 'a'
          , boost::mpl::char_<122> // 'z'
        >
      , make_binary_node<
          make_range<
              boost::mpl::char_<48> // '0'
            , boost::mpl::char_<57> // '9'
          >
          , make_char<
            boost::mpl::char_<95> // '_'
          >
        >
    >
  >
>

It is presumed that these types would implement some run-time logic that would make use of its template arguments to perform a useful task (in a sense, this might be considered a way of constructing expression trees via strings rather than operator overloading a-la Boost.Proto). The parser will currently always wrap top-level attributes in a MPL vector, but improved and extended ways of handling attributes will be included in any non-prototype version.

Examples of application areas

What examples we've seen thus far have been fairly contrived, as most of them they could've been done more easily with plain old code, so as a proof of concept I've whipped up a parser that transforms a regular expression string into an intermediate representation akin to what's shown above. This intermediate representation is then transformed into a logically equivalent Boost.Xpressive Proto expression tree. This means that it's possible to have the benefits that static regular expressions give you, but with the familiar syntax of a dynamic regular expression. It's currently very prototypical and deep-copies absolutely everything, but supports most features such as character ranges, greedy and non-greedy quantifiers, (non-)capture groups, backreferences, most special escape characters etc.

To create a static regex for "<(\\w+)>.*?</\\1>" (a HTML tag matcher with back-referencing and non-greedy matching) one could with Boost.Xpressive write something akin to

namespace xpr = boost::xpressive;

xpr::sregex rex = '<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>';
xpr::smatch what;

std::string hello("<boost>greetings!</boost>");

xpr::regex_match(hello, what, rex);
// ...

Although elegant and efficient, it adds yet another domain-specific language to learn on top of the regular expressions. With the current compile-time parser prototype, the static regex construction is reduced to

xpr::sregex rex = static_regex<'<(\\w','+)>.','*?</','\\1>'>::create();

With C++11, it should hopefully be possible to do this much, much cleaner using constexpr functions taking in a string literal and evaluating it at compile-time, but the standard is not entirely clear on this. Please read the article comments for insights and discussions about this.

I haven't yet properly benchmarked compiler time/memory usage with regards to grammar complexity or input string length, but it can process a relatively complex regex grammar without the compiler melting down, so I'd claim there's hopefully some potential (it did use over 300 MiB of memory during the regex parsing and Xpressive tree construction, but I have not investigated whether or not this is primarily from the parser or Boost.Proto, which is also a template processing heavy-weight).

Again, this is only a hacked together proof-of-concept, so there is much work that needs to be done, but if there's any interest I'd gladly try to fix things up and get something more viable put together. Feedback, ideas and/or screams of horror are all very gladly accepted.

Potential future usage areas

  • ... Compiler template benchmarking, maybe?
  • Scaring programmers with gargantuan compilation error-messages

Comments

Programming enthusiast on August 16, 2009

Very interesting development! Keep us updated.

Manfred on September 4, 2009

I like what you've done, have an up-coming project that might leverage some of this. Please do keep iterating over your ideas, and post over to the boost-users list to keep the community up to date - it should be met with favor from a few in the community.

Ben Hanson on September 15, 2009

I'm author of lexertl (http://www.benhanson.net/le...) and very interested in this. I don't think anyone has done a compile time dfa based regex engine yet (even in 'D'), but if it's possible this technique would be a great front end!

Tor Brede Vekterli on September 16, 2009

Thanks for the kind words, I'll try to get something posted to the Spirit/Boost list soon :)

@Ben: I'm currently using Spirit.Lex to lex python code, and it's a very impressive piece of work. Thanks for contributing it to Boost! DFA construction at compile time is probably quite a bit more tricky than type tree construction, but I think it should theoretically be possible. Whether or not any compilers will survive it is another question ;)

Sohail on October 16, 2009

Ben, compile-time regex has been done in Common Lisp. See http://weitz.de/cl-ppcre/

Ben Hanson on October 22, 2009

I also just found this: http://www.dsource.org/proj...

Andrzej Krzemienski on June 15, 2010

Hi, I believe there is an invalid expectation of C++0x that is often held. The compile-time processing of user defined literals will only be available for float and integer -like literals. I am not sure about character literals, but string literals will definitely not e available in the compile-time form.

Tor Brede Vekterli on July 3, 2010

Hi Andrzej

After consulting the current draft (http://www.open-std.org/jtc...), it does look like you're quite correct. Looking at Section 2.14.8, §5 only C-style strings are passed to the literal function, making any compile-time parsing of proper string literals just as impossible as before. Sigh. Oh well, at least it'll teach me to study the standard closer before jumping to conclusions the next time :)

From a functionality point of view, I fail to see why providing templated characters for integral and floating point literals are included but not for strings, though. Seems like a rather arbitrary decision. Maybe it's to keep people from doing things like this to their compilers ;) Or simply from seeing no usage area for it (in which case, I hope this article may eventually help to prove otherwise).

Thank you for notifying me of this, I will update the article.

Tom Prince on October 27, 2010

Reading 5.19, suggests that if the function is a constexpr, that it could still process the string at compile time. Although it seems that in many contexts, the compiler isn't forced to do it at compile time.

Thomas Petit on December 13, 2010

constexpr works well now with GCC 4.6. Did someone manage to transform a string literal into a variadic list of char and plug it to the metaparser code ?

As for me, I tried to port the metaparser code under GCC but failed. After fixing some obvious stuff (two or three ::template missing, one superfluous typedef) the compilation still stopped in error deep down into the template maze and I couldn't figure out why.

Andrzej Krzemienski on May 22, 2011

Hi,
I believe it is still possible to parse strings at compile-time in C++11 (or C++0x if you will). You can achieve that using the 'constexpr' feature. However, it requires a new approach to compile-time computations. "pure compile-time" tools like template meta-programming or Boost.MPL will not work. I provided some sketch on how it can be done on my blog:
http://akrzemi1.wordpress.c...

Regards,
Andrzej

niXman on June 17, 2011

compile-time-string-hashing with use c++1x constexpr extension: http://liveworkspace.org/co...

minoo on June 3, 2013

Hi
Can you help me please.
I need program for Level, top-down parser !
Where can I find it?
Thank you very much

BHW on April 14, 2017

hey there and thank you for your information – I've definitely picked up anything new from right here.
I did however expertise several technical points using
this web site, since I experienced to reload the website a lot
of times previous to I could get it to load correctly.
I had been wondering if your web host is OK? Not
that I'm complaining, but sluggish loading instances times will often affect your placement in google
and can damage your quality score if ads and marketing with Adwords.

Well I'm adding this RSS to my e-mail and could look out for a
lot more of your respective interesting content.
Ensure that you update this again soon.

Post a comment