...making Linux just a little more fun!
By Mike Chirico
Lemon is a compact, thread-safe, well-tested parser generator written by Dr. Richard Hipp. Using a parser generator along with a scanner like flex can be advantageous because there is less code to write: you just write the grammar for the parser.
Compare this to writing all of the raw code from scratch. For instance, compare a hand-written C++ desktop calculator to the file below, a syntax file containing the grammar the parser uses. "example1.y" is an example syntax file for a very basic calculator.
1 %token_type {int}
2
3 %left PLUS MINUS.
4 %left DIVIDE TIMES.
5
6 %include {
7 #include <iostream>
8 #include "example1.h"
9 }
10
11 %syntax_error {
12 std::cout << "Syntax error!" << std::endl;
13 }
14
15 program ::= expr(A). { std::cout << "Result=" << A << std::endl; }
16
17 expr(A) ::= expr(B) MINUS expr(C). { A = B - C; }
18 expr(A) ::= expr(B) PLUS expr(C). { A = B + C; }
19 expr(A) ::= expr(B) TIMES expr(C). { A = B * C; }
20 expr(A) ::= expr(B) DIVIDE expr(C). {
21 if(C != 0){
22 A = B / C;
23 }else{
24 std::cout << "divide by zero" << std::endl;
25 }
26 } /* end of DIVIDE */
27 expr(A) ::= INTEGER(B). { A = B; }
As you can see, this file is only 27 lines of code, not counting blank lines. It is much easier to modify the grammar than it is to rewrite larger sections of raw code.
The parser generator (lemon.c and lempar.c) takes the input from the syntax file "example1.y" and creates the parser file "example1.c", along with two other files: "example1.h", which contains definitions for the tokens, and "example1.out", which gives a detailed listing of the shift/reduce states for the grammar in "example1.y".
Let's run through the steps, starting first with compiling the source code of lemon (available here). The first step is to compile lemon.c:
$ gcc -o lemon lemon.c
Now we have our parser generator, lemon, so run the syntax file "example1.y" through it.
$ ./lemon example1.y
This will create example1.c, example1.h, and example1.out. What about lempar.c? Compare "example1.c" with "lempar.c", and you will see that it contains a lot of the same code. "lempar.c" is a template file. You can modify the code if you want, and all modifications will be passed to "example1.c" (including any comments).
But "example1.c" is not complete. We'll append the contents of the file "main_part", which contains a main function and tokens. "main_part" is called a driver.
1 int main()
2 {
3 void* pParser = ParseAlloc (malloc);
4 /* First input:
5 15 / 5
6 */
7 Parse (pParser, INTEGER, 15);
8 Parse (pParser, DIVIDE, 0);
9 Parse (pParser, INTEGER, 5);
10 Parse (pParser, 0, 0);
11 /* Second input:
12 50 + 125
13 */
14 Parse (pParser, INTEGER, 50);
15 Parse (pParser, PLUS, 0);
16 Parse (pParser, INTEGER, 125);
17 Parse (pParser, 0, 0);
18 /* Third input:
19 50 * 125 + 125
20 */
21 Parse (pParser, INTEGER, 50);
22 Parse (pParser, TIMES, 0);
23 Parse (pParser, INTEGER, 125);
24 Parse (pParser, PLUS, 0);
25 Parse (pParser, INTEGER, 125);
26 Parse (pParser, 0, 0);
27 ParseFree(pParser, free );
28 }
So, what is main_part doing? Line 3 initializes the parser. Note that pParser is passed to each call of the "Parse" function, starting at line 7. The expression 15/5, or 15 DIVIDE 5, is performed in lines 7 through 10: first the INTEGER 15 is sent, then the token DIVIDE, which doesn't need a value, so 0 is passed as the third parameter on line 8. Finally, line 10, with 0 as the second parameter in "Parse(pParser, 0, 0);", signals the end of the input for this expression. (Please note that in "example4.y", the grammar will handle termination with a NEWLINE token, and "Parse(pParser, 0, ...);" will only be called at the very end of the input.)
"main_part" is appended to "example1.c". You may want to reference the Makefile with the downloadable example, which has this step:
$ cat main_part >> example1.c
Next, just compile example1.c, and it's good to go.
$ g++ -o ex1 example1.c
Now execute "ex1", and we'll see that the result of 15/5 is, of course, 3. And 50+125 is equal to 175, and 50*125+125 is indeed equal to (50*125)+125= 6375. This last result verifies that TIMES has higher precedence than PLUS.
$ ./ex1
Result=3
Result=175
Result=6375
Now for a closer look at the syntax file (example1.y). Why does TIMES have higher precedence than PLUS? Line 3 and line 4 determine the precedence of the operators PLUS, MINUS, DIVIDE, and TIMES.
3 %left PLUS MINUS.
4 %left DIVIDE TIMES.
Operators listed on earlier lines have lower precedence; this is very important to note. PLUS and MINUS have lower precedence than DIVIDE and TIMES because PLUS and MINUS are on line 3, whereas DIVIDE and TIMES are on line 4. If, for example, exponentiation (EXP) were added, it would go on a new line below line 4, since EXP binds even more tightly than TIMES and DIVIDE.
What if you wanted real numbers in the input, such as 15.5 or 5.2, instead of integers? How would you do that? It's easy. The tokens are currently integers because of the following line in "example1.y":
1 %token_type {int}
So to accommodate a double, line 1 would be changed to:
%token_type {double}
Moving further down "example1.y", there is an %include directive on line 6. It passes along any C statements, inserting them at the beginning of the generated parser file "example1.c", which is where declarations and headers need to go.
...
6 %include {
7 #include <iostream>
8 #include "example1.h"
9 }
...
Note that "example1.h" is generated by "$ ./lemon example1.y". It defines the tokens; that is, it assigns integer values to the token names, starting at 1. Why start at 1, and not 0? Because 0 is reserved for the Parse function. Remember, "Parse(pParser, 0, 0);", with the second parameter set to zero, signals an end to the input.
example1.h (note, this is a generated file; do not add code to it):
#define PLUS 1
#define MINUS 2
#define DIVIDE 3
#define TIMES 4
#define INTEGER 5
"example2.y" differs from "example1.y" by defining the token type as a structure. Specifically, this token type is defined in the file "ex2def.h". Defining our own structure can give us flexibility in the semantic action, or the piece of code on the right of the production rule. Here is an example:
expr(A) ::= expr(B) TIMES expr(C). { A.value = B.value * C.value;
A.n = B.n+1 + C.n+1;
}
The %token_type in "example2.y" is defined as {Token} on line 6.
6 %token_type {Token}
This structure Token is defined in "ex2def.h" as follows:
struct Token {
const char *z;
int value;
unsigned n;
};
Special note: "const char *z" is not used in these examples, but I've kept it in this structure, since it's the next logical step in a calculator, assigning an expression to a variable. For instance, variable1=2+5, where variable1 would be some value in a symbol table. See this reference.
Again, note the change in the include directive, the addition of "ex2def.h", which defines struct Token.
1 %include {
2 #include <iostream>
3 #include "ex2def.h"
4 #include "example2.h"
5 }
6 %token_type {Token}
7 %default_type {Token}
8 %type expr {Token}
9 %type NUM {Token}
10
11 %left PLUS MINUS.
12 %left DIVIDE TIMES.
13
14
15 %syntax_error {
16 std::cout << "Syntax error!" << std::endl;
17 }
18
19 program ::= expr(A). {
20 std::cout << "Result.value=" << A.value << std::endl;
21 std::cout << "Result.n=" << A.n << std::endl;
22 }
23 expr(A) ::= expr(B) MINUS expr(C). { A.value = B.value - C.value;
24 A.n = B.n+1 + C.n+1;
25 }
26 expr(A) ::= expr(B) PLUS expr(C). { A.value = B.value + C.value;
27 A.n = B.n+1 + C.n+1;
28 }
29 expr(A) ::= expr(B) TIMES expr(C). { A.value = B.value * C.value;
30 A.n = B.n+1 + C.n+1;
31 }
32 expr(A) ::= expr(B) DIVIDE expr(C). {
33 if(C.value != 0){
34 A.value = B.value / C.value;
35 A.n = B.n+1 + C.n+1;
36 }else{
37 std::cout << "divide by zero" << std::endl;
38 }
39 } /* end of DIVIDE */
40 expr(A) ::= NUM(B). { A.value = B.value; A.n = B.n+1; }
Taking a close look at lines 23 through 25, the Token structure A now has the members "A.value" and "A.n", with ".value" holding the value of the expression and ".n" the number of times an assignment is made:
23 expr(A) ::= expr(B) MINUS expr(C). { A.value = B.value - C.value;
24 A.n = B.n+1 + C.n+1;
25 }
The .n count is a quick way to see the "shift" and "reduce" dynamically. A "shift" is the number of times a token is pushed on the stack. A "reduce" is the number of times an expression rule has been matched; once it's matched, it can be reduced. As you will recall, when lemon is run, three files are normally created: *.c, *.h, and *.out. The ".out" file contains each step of the grammar, along with the shift and reduce states. If you want a simple summary, run lemon with the "-s" option:
$ ./lemon -s example2.y
Parser statistics: 6 terminals, 3 nonterminals, 6 rules
11 states, 0 parser table entries, 0 conflicts
Again, as in the previous example, "main_part2", the driver, is appended to "example2.c":
$ cat main_part2 >> example2.c
Now "example2.c" can be compiled and executed:
$ g++ -o ex2 example2.c
$ ./ex2
Result.value=17
Result.n=4
Result.value=-9
Result.n=4
Result.value=78
Result.n=10
One advantage of lemon over bison is the ability to free memory used by a symbol, calling the function of your choice. For tokens (terminals), the function defined by %token_destructor is called when the parser is done with them; a %destructor can be defined the same way for a non-terminal such as "expr".
1 %include {
2 #include <iostream>
3 #include "ex3def.h"
4 #include "example3.h"
5 void token_destructor(Token t)
6 {
7 std::cout << "In token_destructor t.value= " << t.value << std::endl;
8 std::cout << "In token_destructor t.n= " << t.n << std::endl;
9 }
10 }
11 %token_type {Token}
12 %default_type {Token}
13 %token_destructor { token_destructor($$); }
...
In line 13, %token_destructor is set to the function call "token_destructor($$);". The function "token_destructor" is defined in lines 5 through 9. For this simple example, no memory is allocated, so there is no need to call free. Instead, to show what is happening, output is written to std::cout.
After the program is compiled, it can be executed as follows. Note that I have added line numbers to the output of "ex3" for easy reference.
$ ./ex3
1 t0.value=4 PLUS t1.value=13
2 In token_destructor t.value= 4
3 In token_destructor t.n= 0
4 Result.value=17
5 Result.n=4
6 parsing complete!
...
After the expression has been reduced, the destructor is called, but it is only called for the token with value 4. Why? For the answer, we have to take a look at "main_part3".
1 int main()
2 {
3 void* pParser = ParseAlloc (malloc);
4 struct Token t0,t1;
5 struct Token mToken;
6 t0.value=4;
7 t0.n=0;
8 t1.value=13;
9 t1.n=0;
10 std::cout << " t0.value=4 PLUS t1.value=13 " << std::endl;
11 Parse (pParser, NUM, t0);
12 Parse (pParser, PLUS, t0);
13 Parse (pParser, NUM, t1);
14 Parse (pParser, 0, t0);
15 std::cout << " t0.value=4 DIVIDE t1.value=13 " << std::endl;
16 Parse (pParser, NUM, t0);
17 Parse (pParser, DIVIDE, t0);
18 Parse (pParser, NUM, t1);
19 Parse (pParser, 0, t1);
...
Line 14 terminates the input with t0 as the third parameter. That third parameter is passed as "$$" to the defined destructor function, "token_destructor(...)". Calling "Parse(pParser, 0, ...)" a second time in immediate succession is undefined, so you should signal the end of input only once, after you're done passing the tokens of an expression. In other words, you would never call "Parse (pParser, 0, t0);" immediately followed by another "Parse (pParser, 0, t0);".
In line 19, token_destructor is called for t1.value=13. If you look at "main_part3", line 19, you'll see that Parse is called with t1 as the third parameter and 0 as the second parameter.
Continuation of the output from the program:
7
8
9 t1.value=13 PLUS t0.value=4
10 In token_destructor t.value= 13
11 In token_destructor t.n= 0
12 Result.value=17
13 Result.n=4
14 parsing complete!
So t0 is passed in the third parameter position in line 14, and t1 in line 19. This shouldn't be a problem: one variable could hold the value of the tokens. For instance, main_part3 could have used Token t0 for both the values 4 and 13, as follows:
...
struct Token t0;
t0.value=4;
t0.n=0;
Parse (pParser, NUM, t0);
Parse (pParser, PLUS, t0);
t0.value=13;
t0.n=0;
Parse (pParser, NUM, t0);
Parse (pParser, 0, t0);
...
Notice that in the last three examples, "Parse(pParser, 0, ...)" had to be called to signal the end of the input for an expression. This is awkward. Instead, the grammar should dictate when the expression can no longer be reduced.
"example4.y" contains the following lines:
1 %include {
2 #include <iostream>
3 #include "ex4def.h"
4 #include "example4.h"
...
23
24 %syntax_error {
25 std::cout << "Syntax error!" << std::endl;
26 }
27
28 /* This is to terminate with a new line */
29 main ::= in.
30 in ::= .
31 in ::= in state NEWLINE.
32 state ::= expr(A). {
33 std::cout << "Result.value=" << A.value << std::endl;
34 std::cout << "Result.n=" << A.n << std::endl;
35 }
36 expr(A) ::= expr(B) MINUS expr(C). { A.value = B.value - C.value;
37 A.n = B.n+1 + C.n+1;
38 }
...
Note lines 29 through 35. "main" and "in" must be defined (lines 29-31). If you're a bison user, you could get away without defining the non-terminal main, but lemon currently requires it.
With this change made to the grammar in "example4.y", "main_part4" can now terminate each expression by passing the token NEWLINE.
Here is a section of main_part4:
1 int main()
2 {
3 void* pParser = ParseAlloc (malloc);
4 struct Token t0,t1;
5 struct Token mToken;
6 t0.value=4;
7 t0.n=0;
8 t1.value=13;
9 t1.n=0;
10 std::cout << std::endl <<" t0.value=4 PLUS t1.value=13 " << std::endl << std::endl;
11 Parse (pParser, NUM, t0);
12 Parse (pParser, PLUS, t0);
13 Parse (pParser, NUM, t1);
14 Parse (pParser, NEWLINE, t1);
15 std::cout << std::endl <<" t0.value=4 TIMES t1.value=13 " << std::endl << std::endl;
Note that line 14 passes the token NEWLINE; checking "example4.h", NEWLINE in this case is defined as the integer 6.
So, looking at the output of "ex4", with line numbers added for clarification, we get the following:
$ ./ex4
1 t0.value=4 PLUS t1.value=13
2
3 In token_destructor t.value= 4
4 In token_destructor t.n= 0
5 Result.value=17
6 Result.n=4
7
8 t0.value=4 TIMES t1.value=13
9
10 In token_destructor t.value= 4
11 In token_destructor t.n= 0
12 Result.value=52
13 Result.n=4
14 parsing complete!
We get the result on line 5, and there was no need to call "Parse(pParser, 0, t0);". Instead, "Parse(pParser, NEWLINE, t0)" worked.
The next example takes input directly from the terminal, and flex will create a scanner for finding the appropriate tokens.
First, a quick look at the flex program "lexer.l", again with line numbers added for clarification:
1 %{
2 #include "lexglobal.h"
3 #include "example5.h"
4 #include <string.h>
5 #include <math.h>
6 int line = 1, col = 1;
7 %}
8 %%
9 [0-9]+|[0-9]*\.[0-9]+ { col += (int) strlen(yytext);
10 yylval.dval = atof(yytext);
11 return NUM; }
12 [ \t] { col += (int) strlen(yytext); } /* ignore but count white space */
13 [A-Za-z][A-Za-z0-9]* { /* ignore but needed for variables */
14 return 0;
15 }
16 "+" { return PLUS; }
17 "-" { return MINUS; }
18 "*" { return TIMES; }
19 "/" { return DIVIDE; }
20 \n { col = 0; ++line; return NEWLINE; }
21 . { col += (int) strlen(yytext); return yytext[0]; }
22 %%
23 /**
24 * reset the line and column count
25 *
26 *
27 */
28 void reset_lexer(void)
29 {
30 line = 1;
31 col = 1;
32 }
33 /**
34 * yyerror() is invoked when the lexer or the parser encounter
35 * an error. The error message is passed via *s
36 *
37 *
38 */
39 void yyerror(char *s)
40 {
41 printf("error: %s at line: %d col: %d\n",s,line,col);
42 }
43 int yywrap(void)
44 {
45 return 1;
46 }
The format for flex is basically a rule on the left side followed by C code to execute on the right side. Take line 9, "[0-9]+|[0-9]*\.[0-9]+", which will match any of 3, .3, 0.3, and 23.4, and will return NUM. What's the value of NUM? It's taken from line 3, which includes the file "example5.h", generated by the lemon parser. On line 10, yylval.dval is assigned the value of "yytext" after it's converted to a double. The structure of yylval is defined in "lexglobal.h", included on line 2.
"lexglobal.h" with line numbers added:
1 #ifndef YYSTYPE
2 typedef union {
3 double dval;
4 struct symtab *symp;
5 } yystype;
6 # define YYSTYPE yystype
7 # define YYSTYPE_IS_TRIVIAL 1
8 #endif
9 /* extern YYSTYPE yylval; */
10 YYSTYPE yylval;
yystype is the union of dval and symtab. Again, symtab is not used in these examples, but should you move to a calculator with variables that can be assigned, you'd use it. See Reference 3 for a full calculator example with flex and bison.
Again looking at lines 9 through 11 in lexer.l:
...
9 [0-9]+|[0-9]*\.[0-9]+ { col += (int) strlen(yytext);
10 yylval.dval = atof(yytext);
11 return NUM; }
...
Both the type of token, NUM, and its value must be passed along. We need to know it's a number, but we also need to know the value of the number.
With PLUS, MINUS, TIMES, and DIVIDE, by contrast, we only need to know that the particular token has been found. Therefore, in lexer.l, lines 16 through 19 only return the token.
16 "+" { return PLUS; }
17 "-" { return MINUS; }
18 "*" { return TIMES; }
19 "/" { return DIVIDE; }
20 \n { col = 0; ++line; return NEWLINE; }
21 . { col += (int) strlen(yytext); return yytext[0]; }
22 %%
Line 20 will match on a NEWLINE. Although the grammar doesn't use them, the variable "line" keeps track of the line number and "col" tracks the column; both are reported by yyerror. This is a good idea; it is helpful when debugging.
The driver, main_part5, contains a lot more code. The low-level read statement is used on stdin. This could easily be changed to accept input coming in on a socket descriptor: if you had a Web-scraping program that scans input from a TCP socket, the socket descriptor would replace "fileno(stdin)" on line 33.
1 #include <stdio.h>
2 #include <unistd.h>
3 #include <sys/types.h>
4 #include <sys/stat.h>
5 #include <fcntl.h>
6 #include <stdlib.h>
7 #define BUFS 1024
8 /**
9 * We have to declare these here - they're not in any header files
10 * we can include. yyparse() is declared with an empty argument list
11 * so that it is compatible with the generated C code from bison.
12 *
13 */
14 extern FILE *yyin;
15 typedef struct yy_buffer_state *YY_BUFFER_STATE;
16 extern "C" {
17 int yylex( void );
18 YY_BUFFER_STATE yy_scan_string( const char * );
19 void yy_delete_buffer( YY_BUFFER_STATE );
20 }
21 int main(int argc,char** argv)
22 {
23 int n;
24 int yv;
25 char buf[BUFS+1];
26 void* pParser = ParseAlloc (malloc);
27 struct Token t0,t1;
28 struct Token mToken;
29 t0.n=0;
30 t0.value=0;
31 std::cout << "Enter an expression like 3+5 <return>" << std::endl;
32 std::cout << " Terminate with ^D" << std::endl;
33 while ( ( n=read(fileno(stdin), buf, BUFS )) > 0)
34 {
35 buf[n]='\0';
36 yy_scan_string(buf);
37 // on EOF yylex will return 0
38 while( (yv=yylex()) != 0)
39 {
40 std::cout << " yylex() " << yv << " yylval.dval " << yylval.dval << std::endl;
41 t0.value=yylval.dval;
42 Parse (pParser, yv, t0);
43 }
44 }
45 Parse (pParser, 0, t0);
46 ParseFree(pParser, free );
47 }
Line 16, 'extern "C"', is necessary because "lexer.l" was run through flex to create C code, as opposed to C++ code:
$ flex lexer.l
See the flex manual, Reference 7. Yes, "flex++" will output C++ code; however, for complex scanning, C code may be faster. "main_part5", which is compiled as a C++ program, makes the transition smoothly.
The parser should always have its input terminated with 0 in the second parameter to "Parse(pParser, 0, ...)". When there is no more input coming into flex, it will return a zero, so the while loop on line 38 terminates; then the read statement on line 33 looks for more input. This is something you would want to do when reading from a socket, since input may have been delayed.
But if the initial read (the first pass through line 33) isn't successful, flex never gets a chance to return a zero. Therefore, line 45 passes a zero as the second parameter to terminate the parse.
...
33 while ( ( n=read(fileno(stdin), buf, BUFS )) > 0)
...
38 while( (yv=yylex()) != 0)
39 {
40 std::cout << " yylex() " << yv << " yylval.dval " << yylval.dval << std::endl;
41 t0.value=yylval.dval;
42 Parse (pParser, yv, t0);
43 }
...
45 Parse (pParser, 0, t0);
46 ParseFree(pParser, free );
lemon is fast, completely in the public domain, well tested in SQLite, and thread safe. Parser generators can help developers write reusable code for complex tasks in a fraction of the time they would need for writing the complete program from scratch. The syntax file, the file that holds the grammar, can be modified to suit multiple needs.
Although I have had no problems with lemon.c, there are a few compiler warnings regarding signed and unsigned integers when compiling it with the -Wall -W flags:
[chirico@third-fl-71 lemon_examples]$ gcc -Wall -W -O2 -s -pipe lemon.c
lemon.c: In function `resolve_conflict':
lemon.c:973: warning: unused parameter `errsym'
lemon.c: In function `main':
lemon.c:1342: warning: unused parameter `argc'
lemon.c: At top level:
lemon.c:2308: warning: return type defaults to `int'
lemon.c: In function `preprocess_input':
lemon.c:2334: warning: comparison between signed and unsigned
lemon.c:2352: warning: control reaches end of non-void function
lemon.c:2311: warning: `start' might be used uninitialized in this function
lemon.c:2313: warning: `start_lineno' might be used uninitialized in this function
lemon.c: In function `Parse':
lemon.c:2393: warning: comparison between signed and unsigned
lemon.c: In function `tplt_open':
lemon.c:2904: warning: implicit declaration of function `access'
lemon.c: In function `append_str':
lemon.c:3019: warning: comparison between signed and unsigned
lemon.c:3011: warning: unused variable `i'
lemon.c: In function `translate_code':
lemon.c:3109: warning: control reaches end of non-void function
These warnings can be an inconvenience when adding the generated parser file to existing code. A fix is on the way; since I expect the warnings to be cleaned up soon, the version of lemon.c used here is the same version you'd get from the author's site, which will make it easier to apply the patch.
There are times when a parser like lemon or bison may be a little too much. These are powerful tools. An interesting alternative, if you're a C++ programmer and you only need to do inline parsing, is the spirit library. See Reference 9.
The complete source for these examples, including the parser itself, can be downloaded here.
Mike Chirico, a father of triplets (all girls) lives outside of
Philadelphia, PA, USA. He has worked with Linux since 1996, has a Masters
in Computer Science and Mathematics from Villanova University, and has
worked in computer-related jobs from Wall Street to the University of
Pennsylvania. His hero is Paul Erdos, a brilliant number theorist who was
known for his open collaboration with others.
Mike's notes page is souptonuts.