<A,B,C>
will cause
the current condition to be <A,B,C>
plus whatever on the
stack. {
would cause the current condition to be pushed onto
the stack and }
would cause the current condition to be popped
off the stack.
STATE_##state
.
Then a trick of recursive include is used to include them ahead of user defined
actions. Flex uses a slightly different strategy that calculates the locations
of each condition.
comp.compilers
provided some insight.
Finally, the output of Flex provided much information on the algorithm, though
YooLex's approach to the algorithm is quite different from that of Flex's.
Basically the compression has 4 steps. The last 3-steps deal with cases that can be used in combination.
_yy_ecs
in YooLex and
yy_ec
in Flex). This step is actually the by product of doing
NFA->DFA. Using equivalent class in NFA->DFA can significantly reduce
amount of time to construct DFA. The algorithm is by monitoring the CCL
structures used in NFA construction. Thus (a|b)
will create
2 eqvalent classes while [ab]
will only create 1.
a) 0 0 0 7 7 7 8 8 8 8 8 b) 0 0 0 7 7 7 8 9 9 8 8
where b can be efficently represented using a as the default state, and just store 9 9 as the difference. (Note, b cannot have 0 where a has something else since their error state would then be different. It is possible handle 0 in this case, but it would generate too many sub-cases. 0 is error state here)
0 0 2 3 4 0 0 0 0 0 0can be represented using
2 3 4block size of 3 and 0 as the default transition state. Note in the following case, the default transition cannot be the most common state:
1 1 2 3 4 1 1 1 1 1 1since setting 1 as the default transition would tell us to goto state 1 and match again in case our symbol is not within 2 3 4 block. The reason is in part 4. Also, we could fit blocks
a) 1 X X X 1 b) 2 3 4where X represent holes. we could fit b inside a conveniently:
1 2 3 4 1Holes are generated in cases like;
0 1 0 0 1 0 => 1 X X 1or
9 1 9 9 1 0 => 1 X X 1
1 1 1 1 1 1 1 1would in fact have to go through the error state checking while a state
0 0 0 0 0 0 0 0do not. Anyway, here are some states and their corresponding block and error states, with X representing non-important states.
1 1 2 3 4 1 1 1 1 1 1 block: 2 3 4 error state: 1 1 X X X 1 1 1 1 1 1
0 0 2 2 2 2 3 4 0 0 0 block: 3 4 error state: 0 0 2 2 2 2 X X 0 0 0In this example, common repeat (other than 0) is 2, thus the block can be reduced to
3 4
. Chances are, there will be other
states sharing the same common repeat with the same start and end location.
Thus we could save additional space.
0 0 0 0 0 7 8 0 0 0 0 block: 7 8 error state: 0 0 0 0 0 X X 0 0 0 0note: there is no common repeat (other than 0) in this example. How much repeat is minimum to be recognized as common repeat is tricky. (MINREPEAT is used by YooLex to specifiy the minimum repeat percentage).
1 1 1 1 1 1 1 1 1 1 1 block: block size of 0 error state: 1 1 1 1 1 1 1 1 1 1 1So, does it mean we cannot compress this state since we have to store the error state too? Not necessarily. Usually error state has fewer equivlence class (see below) and there may be other states share the same error state. So, space can be saved.
1 2 3 4 6 7 1 8 9 0 1This example is tricky since 1 is not repeated that often. If we treat 1 as the common repeat, then we obtain
block: 2 3 4 6 7 X 8 9 error state: 1 X X X X X 1 X X 0 1
Note, we have to store both. Although another state could share the same error state, the chance is unlikly since it is usually the start state that shows the property above. Thus, in this case we just store the state as is: 1 2 3 4 6 7 1 8 9 0 1 without blocks and error state. Due to how our table is organized, we still have to pay 2 extra pointers (one is the default state, another is the start address of the row). Identify uncompressed state can be tricky and a parameter (MINRATIO in YooLex) is used to identify it.
After converting all the states to block and error states, we can compress the error state. Here is an example of some error states:
2 2 2 2 X X 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0Note that row 2 and row 3 are the same. First 3 columns and last 4 columns are the same, so the above can be reduced to:
2 2 X X 0 1 1 1 0 0 0 1 1 0 0Now, the problem is the non-important column containing X. They could be either 0 or 1, does not matter since they will not be used. It is therefore possible to generate
2 2 0 1 1 0 0 1 0as the final error matrix. However, due to performance reason and how my equivalence class computation works (which only works on clearly defined numbers). I converted all the X's to that row's common repeat after cannot find a pattern that avoid creating a new error state:
2 X X 2 2 X 0 0 0 0 2 2 2 2 X X 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 => 2 2 2 2 2 2 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 => 2 2 2 0 1 1 1 0 0 1 0 0Now, we could do block fitting on this matrix. It can be mixed with block state, with the eqvuivalence class of error state stored separately (_yy_meta in YooLex or yy_meta in Flex).