Verbose Logging

software development with some really amazing hair


Programming Language Style: Let The Compiler Do It

Posted in Editorial

I was using treetop to do some parsing the other day and it got me thinking. Treetop is a parsing DSL for ruby based on the idea of a parsing expression grammar. This could get dangerous.

lex and yacc (flex and bison)

If you open up the ruby source code, you'll probably find a file named parse.y. This is probably used by bison to generate a parser. The parser is (probably) used in conjunction with flex to parse things. A thing could be a source file of a programming language, an HTTP request, or some other interesting file format.

Using this type of system is great because it's (usually) damn fast, and you can provide decent error reporting, as opposed to something like regular expressions. Not to mention, regular expressions won't work on things that aren't regular.
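To make the "not regular" point concrete, here is a minimal Python sketch. Arbitrarily nested parentheses are the textbook example of something that isn't regular, yet a few lines of recursion handle them. The function name `balanced` is my own invention for illustration, not something from any of the tools mentioned.

```python
def balanced(text: str) -> bool:
    """Return True if text consists only of properly nested parentheses."""

    def group(pos: int) -> int:
        # Consume zero or more "( ... )" groups starting at pos and return
        # the position just past the last one; -1 signals a missing ")".
        while pos < len(text) and text[pos] == "(":
            pos = group(pos + 1)  # recurse for the nested contents
            if pos == -1 or pos >= len(text) or text[pos] != ")":
                return -1
            pos += 1  # step over the closing paren
        return pos

    return group(0) == len(text)
```

The recursion is the part a classic regular expression has no equivalent for: it has to remember how many parens are still open, to any depth.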

The Process (Simplified)

So why all this? That way of parsing things, with a lexer (generated by flex) and a parser (generated by bison), goes like this:

  1. The lexer gets set up with an input.
  2. The parser gets set up with the lexer as its input.
  3. The parser asks the lexer, "what's the next token?"
  4. The lexer uses its rules to match a token (typically eating whitespace).
  5. The parser starts matching rules with that token.
  6. Go back to 3 until the input is done (or the parser says it's done).

Basically.

It's a bit more complicated than that, but for our purposes, we can leave it there. The point is that the lexer typically ignores whitespace between tokens. Who cares if you have one space or twenty between a type declaration and the name of the variable? The system doesn't care.
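The loop described above can be sketched in a few lines of Python. This is a toy, not what flex and bison actually generate: the token rules, the `lex` and `parse_declaration` names, and the little "type name;" language are all assumptions made up for the example.

```python
import re
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, text)

# Hypothetical token rules for a toy "type name;" declaration language.
TOKEN_RULES = [
    ("SKIP", r"\s+"),  # whitespace: matched, then thrown away
    ("IDENT", r"[A-Za-z_]\w*"),
    ("SEMI", r";"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{k}>{p})" for k, p in TOKEN_RULES))


def lex(source: str) -> Iterator[Token]:
    """Yield tokens one at a time, silently eating whitespace."""
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        pos = match.end()
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()


def parse_declaration(tokens: Iterator[Token]) -> Tuple[str, str]:
    """Parser side: keep asking the lexer for the next token."""
    expected = ["IDENT", "IDENT", "SEMI"]
    texts: List[str] = []
    for want in expected:
        kind, text = next(tokens, ("EOF", ""))
        if kind != want:
            raise SyntaxError(f"expected {want}, got {kind}")
        texts.append(text)
    return texts[0], texts[1]
```

Because the `SKIP` tokens never reach the parser, `parse_declaration(lex("int x;"))` and the same call with twenty spaces between `int` and `x` give the same result.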

Python

Now python cares a little bit. Python is partially whitespace-sensitive, which in this case means that it pays attention to your indentation. Python uses indentation to denote blocks. If you write an if statement and indent the first line of its body by 4 spaces (let's say), you indent the next line by 4 spaces too to make it part of that if statement. If you don't indent it, it's not part of the if statement. If you indent by some other value (3 spaces), it complains because you aren't being consistent and it has no idea what the hell you're talking about. No more curly brace madness with your if statements!
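You can watch python make all three of those decisions by handing it source text through the built-in `compile()`. A minimal sketch; the `compiles` helper and the three sample strings are mine, just for illustration.

```python
def compiles(source: str) -> bool:
    """Return True if Python accepts source, False on an indentation error."""
    try:
        compile(source, "<example>", "exec")
    except IndentationError:
        return False
    return True


CONSISTENT = "if True:\n    a = 1\n    b = 2\n"    # both lines indented by 4
OUTSIDE = "if True:\n    a = 1\nb = 2\n"            # b is not part of the if
INCONSISTENT = "if True:\n    a = 1\n   b = 2\n"    # 4 spaces, then 3
```

The first two are accepted (they just mean different things), while the third is rejected with an `IndentationError` before any code runs.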

Let's crank it up to 11

Since a PEG system is "different" than the system I previously described, we can do different things with it. You write your PEG to recognize the text exactly as it is, so to recognize an if statement you'd do something like this:

rule if_start: 'if' space lparen if_body rparen

Assuming the space rule matches exactly one space character, this would not match an if statement with two spaces between the if token and the left paren.
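Here is a rough Python sketch of how that rule behaves, using two hand-rolled PEG-style building blocks. None of this is Treetop's API: `lit`, `seq`, and `matches` are hypothetical helpers, and the definitions of `space`, `lparen`, `rparen`, and `if_body` are my assumptions, since the rule above doesn't spell them out.

```python
from typing import Callable, Optional

# A matcher takes (text, pos) and returns the new position, or None on failure.
Matcher = Callable[[str, int], Optional[int]]


def lit(expected: str) -> Matcher:
    """Match one exact string, character for character."""
    def match(text: str, pos: int) -> Optional[int]:
        return pos + len(expected) if text.startswith(expected, pos) else None
    return match


def seq(*parts: Matcher) -> Matcher:
    """Match each part in order; fail as soon as any part fails."""
    def match(text: str, pos: int) -> Optional[int]:
        for part in parts:
            result = part(text, pos)
            if result is None:
                return None
            pos = result
        return pos
    return match


# Assumed definitions for the rule's pieces.
space = lit(" ")     # exactly one space
lparen = lit("(")
rparen = lit(")")
if_body = lit("x")   # stand-in for a real condition rule

if_start = seq(lit("if"), space, lparen, if_body, rparen)


def matches(text: str) -> bool:
    """True only if if_start consumes the entire input."""
    return if_start(text, 0) == len(text)
```

There is no separate lexer to swallow whitespace here, so `matches("if (x)")` succeeds while `matches("if  (x)")` and `matches("if(x)")` both fail.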

Lack of research

I'm not going to lie, I haven't researched this to the four corners of the earth. I don't know whether or not the traditional system could be made to work to the degree I am thinking. But that doesn't really matter. It's the idea that counts.

Let the compiler do it

My 1+1 was this: if everybody whines about programming style,[1] and there is this parsing system that requires you to specify the text exactly as it should be, why not combine the two and just enforce programming style in the language grammar itself? If you screw up the style, it doesn't compile!

You can turn the dial a bit, to make some parts of the syntax more flexible than others. As an example, you could enforce single spaces between things, but not indentation rules (like python).
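As a sketch of what that dial might look like, here is a toy Python checker for a one-line `if (cond)` header. Everything in it is hypothetical (the rule names, the `accepts` function, the four-space convention), and a regular expression is only good enough because this is a single flat line. Token spacing is always strict; the indentation rule is the part you can loosen or tighten.

```python
import re

# Spacing between tokens is always strict: exactly one space after "if",
# none inside the parentheses.
STRICT_TOKENS = r"if \(\w+\)"

INDENT_ANY = r"[ \t]*"       # dial turned down: any leading whitespace
INDENT_FOUR = r"(?: {4})*"   # dial turned up: multiples of four spaces


def accepts(line: str, strict_indent: bool) -> bool:
    """Check one line against the grammar with the chosen indentation rule."""
    indent = INDENT_FOUR if strict_indent else INDENT_ANY
    return re.fullmatch(indent + STRICT_TOKENS, line) is not None
```

With the dial down, `accepts("   if (x)", strict_indent=False)` passes despite the odd three-space indent; with it up, the same line is rejected. A doubled space after `if` fails either way.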

Pros

The best part is nobody can come in and start throwing extra spaces around, mixing tabs in there, and just generally mucking about. If they don't follow the style that is laid out as a part of the language, it doesn't work. All code looks the same.

Second…well I guess that's about it. It really just helps keep things consistent and under control. That is, however, a pretty big pro in my mind.

Cons

A couple of people argued that style is subjective. Well…maybe. Yes, it is, sort of, but that's not enough of an argument to convince me. I'd much rather see consistent code than be able to write exactly the way I want.

Another argument was that evolving the style would be painful. No more painful than upgrading or changing language features, really. If you change the style of something which results in an upgrade barrier for some (they have to change their code to upgrade to the latest version of the language), how is this any different than ruby 1.9 or python 3? They each had breaking changes in the language spec which required work to upgrade. It's not the end of the world.

Why not?

So why not? Regardless of what parsing system you use, why not enforce the language style in the language itself? It seems like it could be a pretty good (or at least interesting) idea.[2]


  1. Recently I've been having many conversations with friends about programming style problems we've seen.

  2. At a minimum, it looks good on paper, and would be interesting from an academic perspective if it proved to be a bad idea in the real world.