Implementation

This document describes how YooLex is implemented.

Input files are parsed using Flex's grammar (I took scan.l and parse.y from Flex and adapted them to YooLex).
NFAs are constructed from regular expressions using Thompson's construction.
DFA tables are constructed using the standard NFA-to-DFA algorithm found in the Dragon book.
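As a rough illustration of that NFA-to-DFA step, here is a minimal sketch of the textbook subset construction. It is not YooLex's code: the dictionary-based NFA layout, the EPS marker and the function names are assumptions made only for this example.
```python
from collections import deque

EPS = None  # hypothetical label for epsilon edges in this sketch


def e_closure(nfa, states):
    """All NFA states reachable from `states` through epsilon edges only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get(EPS, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def nfa_to_dfa(nfa, start, alphabet):
    """Subset construction: returns (transition table, list of NFA-state sets)."""
    first = e_closure(nfa, {start})
    ids = {first: 0}          # NFA-state set -> DFA state number
    sets = [first]
    table = {}                # (DFA state, symbol) -> DFA state
    work = deque([first])
    while work:
        cur = work.popleft()
        for ch in alphabet:
            moved = {t for s in cur for t in nfa.get(s, {}).get(ch, ())}
            if not moved:
                continue      # no transition: left out, i.e. the error state
            nxt = e_closure(nfa, moved)
            if nxt not in ids:
                ids[nxt] = len(sets)
                sets.append(nxt)
                work.append(nxt)
            table[(ids[cur], ch)] = ids[nxt]
    return table, sets
```
Each DFA state is identified by the e_closure set it came from, which is why shrinking those sets reduces the number of DFA states.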
DFA state minimization. Instead of minimizing the DFA states after the DFA has been constructed, which is very difficult and inefficient, another way is to reduce the number of NFA states in each set obtained from e_closure, since those sets are what identify the different transition states. I removed all the non-accepting states that have epsilon out edges. An accepting state is removed from the e_closure set if it is shadowed by other accepting states AND it has no out edges. The method seems effective, though I am not certain that it always produces the minimal number of DFA states. Flex does not seem to do any minimization at all.
Conditions are maintained on a stack. <A,B,C> causes the current condition to be <A,B,C> plus whatever is on the stack. { pushes the current condition onto the stack and } pops it off the stack.
Each condition gets its own base state. YooLex generates the exact base state location for each condition and stores the value in #define directives of the form STATE_##state. A recursive-include trick is then used to place them ahead of the user-defined actions. Flex uses a slightly different strategy that calculates the location of each condition.
For each NFA constructed, that NFA is added to all the condition tables specified by the current condition variable. If the current condition is empty, then the NFA is added to all the inclusive conditions.
For trailing contexts, only those with either a fixed text or a fixed trailing length are supported. Support is implemented by adding extra code to the yyswitch case of that particular action; the code retracts the match by the length of the trailing context.
For EOF, the switch case value is stored in _yy_accept[_baseState], which is not used in scanning. Thus, no extra space is necessary.
For BOL, when creating the DFA table, first push e_closure(start), then push e_closure(bol union start), and do the rest as usual. When scanning, if the scanner is at BOL, set the current state to the base state + 1. I think this is how Flex does it too.
For backtrack checking, whenever a non-starting state is non-accepting, we report the states that lead to that non-accepting state.
For the shadowed regular expression check, first record every case together with its line number. Then set to 0 the entry of every case that appears in the final DFA table. Finally, scan all the cases for non-zero entries and report them.
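The shadowed-rule check can be sketched in a few lines. This is only an illustration of the mark-and-scan idea; the function name and the two input structures (a case-to-line dictionary and a per-state list of accepting cases) are assumptions, not YooLex's actual data layout.
```python
def shadowed_rules(rule_lines, dfa_accepts):
    """Return the source lines of rules that no DFA state ever accepts.

    rule_lines  -- hypothetical dict: case number -> line number of the rule
    dfa_accepts -- hypothetical list: accepting case of each DFA state,
                   with 0 meaning "not an accepting state"
    """
    pending = dict(rule_lines)        # every case starts with its line number
    for case in dfa_accepts:
        if case in pending:
            pending[case] = 0         # case appears in the DFA: mark it 0
    # Whatever is still non-zero was never reached, so it is shadowed.
    return sorted(line for line in pending.values() if line != 0)
```
For example, with rules on lines 10, 11 and 15 and a DFA that only ever accepts cases 1 and 3, the rule on line 11 is reported.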
DFA table compression. The algorithm is detailed in P. Dencker et al., "Optimization of parser tables for portable compilers", ACM Transactions on Programming Languages and Systems, 6(4):546-572, October 1984. A post that Nandakumar Sankaran made on comp.compilers provided some insight. Finally, the output of Flex provided much information on the algorithm, though YooLex's approach to the algorithm is quite different from Flex's.
Basically, the compression has 4 steps. The last 3 steps deal with cases that can be used in combination.
1. Use equivalence classes to group characters. This always saves space and is independent of the other methods (_yy_ecs in YooLex and yy_ec in Flex). This step is actually a by-product of the NFA->DFA conversion, and using equivalence classes there can significantly reduce the time needed to construct the DFA. The classes are found by monitoring the CCL structures used in NFA construction. Thus (a|b) creates 2 equivalence classes while [ab] creates only 1.
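  One way to picture the character grouping is as partition refinement: start with a single class holding every character and split it each time a CCL is seen. The sketch below assumes that reading; the function name, the 128-character alphabet and the string-based CCL representation are all made up for the example.
```python
def equivalence_classes(ccls, alphabet_size=128):
    """Map each character code to a class number (starting at 1).

    ccls -- hypothetical list of character classes, each given as a string;
            a single literal character counts as a one-character class.
    """
    classes = [set(range(alphabet_size))]      # everything starts in one class
    for ccl in ccls:
        codes = {ord(c) for c in ccl}
        refined = []
        for cls in classes:
            # Split every existing class into the part inside the CCL
            # and the part outside it, dropping empty parts.
            inside, outside = cls & codes, cls - codes
            refined.extend(part for part in (inside, outside) if part)
        classes = refined
    ecs = [0] * alphabet_size
    for number, cls in enumerate(sorted(classes, key=min), start=1):
        for code in cls:
            ecs[code] = number
    return ecs
```
  With this sketch, the literals a and b from (a|b) end up in two separate classes, while [ab] keeps them together in one; every other character shares a single leftover class in both cases.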
2. Find the difference between one state and another. This method may not always save space, but it will save space for states such as
```
a)	0 0 0 7 7 7 8 8 8 8 8
b)	0 0 0 7 7 7 8 9 9 8 8
```
  where b can be efficiently represented by using a as the default state and storing just 9 9 as the difference. (Note: b cannot have 0 where a has something else, since their error states would then be different. It is possible to handle 0 in this case, but it would generate too many sub-cases. 0 is the error state here.)
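  A minimal sketch of the difference idea, using the two rows above. The helper names are hypothetical, and the rule that rejects a 0 sitting over a non-zero default entry is my reading of the note above rather than YooLex's exact test.
```python
def diff_against(default_row, row):
    """Return {column: value} overrides needed to rebuild `row` from
    `default_row`, or None if `default_row` cannot serve as its default."""
    diff = {}
    for col, (d, v) in enumerate(zip(default_row, row)):
        if d == v:
            continue
        if v == 0:
            # The row has an error entry where the default has a real
            # transition; this sketch simply refuses such a pairing.
            return None
        diff[col] = v
    return diff


def lookup(default_row, diff, col):
    """Transition for `col`: the stored difference if any, else the default."""
    return diff.get(col, default_row[col])
```
  For rows a and b above, only the two 9 entries need to be stored; every other lookup falls through to a.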
3. Block fitting.
```
	0 0 2 3 4 0 0 0 0 0 0
```
  can be represented using
```
	2 3 4
```
  with a block size of 3 and 0 as the default transition state. Note that in the following case the default transition cannot be the most common state:
```
	1 1 2 3 4 1 1 1 1 1 1
```
  since setting 1 as the default transition would tell us to go to state 1 and match again whenever our symbol is not within the 2 3 4 block. The reason is given in step 4. Also, we could fit blocks such as
```
	a)	1 X X X 1
	b)	2 3 4
```
  where X represents a hole. We could fit b inside a conveniently:
```
	1 2 3 4 1
```
  Holes are generated in cases like:
```
	0 1 0 0 1 0
=>	1 X X 1
```
  or
```
	9 1 9 9 1 0
=>	1 X X 1
```
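  Fitting blocks into each other's holes can be sketched as a first-fit search over a shared array. This is only an illustration under assumed names (HOLE, fit_block); it leaves out whatever bookkeeping the scanner needs at lookup time to tell which state owns a given cell.
```python
HOLE = None  # marks an unused cell (the X in the examples)


def fit_block(table, block):
    """Place `block` into `table` at the first offset where every non-hole
    cell of the block lands on a hole (or past the end); return that offset."""
    offset = 0
    while not all(
        v is HOLE or offset + i >= len(table) or table[offset + i] is HOLE
        for i, v in enumerate(block)
    ):
        offset += 1               # collision: slide the block one cell right
    # Grow the table if the block sticks out past its current end.
    table.extend([HOLE] * (offset + len(block) - len(table)))
    for i, v in enumerate(block):
        if v is not HOLE:
            table[offset + i] = v
    return offset
```
  Fitting 1 X X X 1 first and 2 3 4 second places the second block at offset 1, inside the holes of the first.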
4. The error state is used to handle all the non-zero common states. Thus a state with
```
	1 1 1 1 1 1 1 1
```
  would in fact have to go through the error state checking while a state
```
	0 0 0 0 0 0 0 0
```
  does not. Here are some states and their corresponding blocks and error states, with X representing entries that do not matter.
  
  Example 1
```
	1 1 2 3 4 1 1 1 1 1 1
block:
	2 3 4
error state:
	1 1 X X X 1 1 1 1 1 1
```
  Example 2
```
	0 0 2 2 2 2 3 4 0 0 0
block:
	3 4
error state:
	0 0 2 2 2 2 X X 0 0 0
```
  In this example, the common repeat (other than 0) is 2, so the block can be reduced to 3 4. Chances are that other states share the same common repeat with the same start and end locations, so we can save additional space.
  
  Example 3
```
	0 0 0 0 0 7 8 0 0 0 0
block:
	7 8
error state:
	0 0 0 0 0 X X 0 0 0 0
```
  Note: there is no common repeat (other than 0) in this example. Deciding how much repetition is enough to count as a common repeat is tricky (MINREPEAT is used by YooLex to specify the minimum repeat percentage).
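  The split into block and error state shown in these examples can be sketched as follows. The function name, the returned offset, and the interpretation of the threshold as a fraction of the row length (with 0.3 as default) are assumptions for illustration; YooLex's MINREPEAT may be defined differently.
```python
from collections import Counter

X = None  # "does not matter" marker, as in the examples


def split_row(row, min_repeat=0.3):
    """Split one DFA row into (offset, block, error_row)."""
    counts = Counter(v for v in row if v != 0)
    common = None
    if counts:
        value, n = counts.most_common(1)[0]
        if n / len(row) >= min_repeat:
            common = value            # frequent enough to be the common repeat
    # Entries that are neither 0 nor the common repeat belong to the block.
    in_block = [v != 0 and v != common for v in row]
    error_row = [X if b else v for v, b in zip(row, in_block)]
    if not any(in_block):
        return 0, [], error_row       # block size of 0
    first = in_block.index(True)
    last = len(row) - 1 - in_block[::-1].index(True)
    block = [row[i] if in_block[i] else X for i in range(first, last + 1)]
    return first, block, error_row
```
  With the default threshold, 2 in Example 2 occurs often enough to be the common repeat, while 7 and 8 in Example 3 do not.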
  
  Example 4
```
	1 1 1 1 1 1 1 1 1 1 1
block:
	block size of 0
error state:
	1 1 1 1 1 1 1 1 1 1 1
```
  So, does this mean we cannot compress this state, since we have to store the error state too? Not necessarily. Usually the error state has fewer equivalence classes (see below), and other states may share the same error state. So space can still be saved.
  
  Example 5
```
	1 2 3 4 6 7 1 8 9 0 1
```
  This example is tricky since 1 is not repeated that often. If we treat 1 as the common repeat, then we obtain
```
block:
	2 3 4 6 7 X 8 9
error state:
	1 X X X X X 1 X X 0 1
```
  Note that we have to store both. Although another state could share the same error state, this is unlikely, since it is usually the start state that has the property above. Thus, in this case we just store the state as is, 1 2 3 4 6 7 1 8 9 0 1, without a block or an error state. Because of how our table is organized, we still have to pay for 2 extra pointers (one for the default state, the other for the start address of the row). Identifying states that should stay uncompressed can be tricky, and a parameter (MINRATIO in YooLex) is used to identify them.
  
  After converting all the states to block and error states, we can compress the error state. Here is an example of some error states:
```
	2 2 2 2 X X 0 0 0 0
	1 1 1 1 1 0 0 0 0 0
	1 1 1 1 1 0 0 0 0 0
	0 0 0 1 0 0 0 0 0 0
```
  Note that row 2 and row 3 are the same. The first 3 columns are the same, and so are the last 4 columns, so the above can be reduced to:
```
	2 2 X X 0
	1 1 1 0 0
	0 1 0 0 0
```
  Now the problem is the non-important entries marked X. They can take any value, since they will never be used. It is therefore possible to generate
```
	2 2 0
	1 1 0
	0 1 0
```
  as the final error matrix. However, for performance reasons and because of how my equivalence class computation works (it only works on clearly defined numbers), I convert all the X's to that row's common repeat when no pattern can be found that avoids creating a new error state:
```
	2 X X 2 2 X 0 0 0 0
	2 2 2 2 X X 0 0 0 0
	1 1 1 1 1 0 0 0 0 0
	1 1 1 1 1 0 0 0 0 0
	0 0 0 1 0 0 0 0 0 0
=>
	2 2 2 2 2 2 0 0 0 0
	1 1 1 1 1 0 0 0 0 0
	0 0 0 1 0 0 0 0 0 0
=>
	2 2 2 0
	1 1 0 0
	0 1 0 0
```
  Now we can do block fitting on this matrix. It can be mixed with the block states, with the equivalence classes of the error states stored separately (_yy_meta in YooLex or yy_meta in Flex).
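  The error-state compression just described (fill the X's, merge identical rows, then merge identical columns) can be sketched like this. All names are hypothetical, and filling with the most frequent non-zero value (or 0 when there is none) is an assumption about what "common repeat" means for a row.
```python
from collections import Counter

X = None  # "does not matter" marker


def fill_holes(row):
    """Replace every X with the row's most frequent non-zero value (0 if none)."""
    counts = Counter(v for v in row if v is not X and v != 0)
    fill = counts.most_common(1)[0][0] if counts else 0
    return tuple(fill if v is X else v for v in row)


def compress_error_rows(rows):
    """Return (matrix, row_index, meta) for a list of error-state rows."""
    filled = [fill_holes(r) for r in rows]
    unique = list(dict.fromkeys(filled))        # drop duplicate rows, keep order
    row_index = [unique.index(r) for r in filled]
    columns = list(zip(*unique))
    classes = list(dict.fromkeys(columns))      # distinct columns, first-seen order
    # Column -> class number; this plays the role the text gives to _yy_meta.
    meta = [classes.index(c) for c in columns]
    matrix = [list(r) for r in zip(*classes)]
    return matrix, row_index, meta
```
  Here row_index tells each original state which compressed error row it uses, and meta maps each input column to its column in the compressed matrix.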

$Id: implementation.html,v 1.5 2003/01/20 09:24:14 coconut Exp $
