public class EarleyParser extends java.lang.Object implements GrammaticalParser
The primary difference between this Earley Parser and the standard version is that input expressions are "pre-parsed" with variable hypotheses replacing the variables in the expression to be parsed. One reason for this change is that Var's and VarHyp's in Metamath are not necessarily in global scope; variable "x" might be a set now but later be a foobar. A drawback of this change is increased difficulty in building ParseTrees after EarleyParse finishes doing its "parse recognition" algorithm. Another drawback is that a restriction is placed on coding a grammatical Type Code, such as "wff" as a constant in a formula (ex. "xyz $a |- ( wff -> ph ) $." is prohibited because the "wff" is picked up as a something where a VarHyp should be...assuming the file that contains that statement uses "wff" as a Type Code.)
Another major change to the algorithm is that, while nulls and type conversions are handled in the mmj Grammar, a process of "combinatorial explosion" is used during the grammar-generation process, and so EarleyParser input "grammar" is solely the Notation Rules and unaltered VarHyp's. A grammar rule match to a subsequence of an expression may point to a Notation Rule that is a composite function of a Notation Syntax Axiom, plus any number of Nulls Permitted and Type Conversion Axioms; so when Earley is done, a fair amount of work still needs to happen to generate the output Parse tree/RPN for that match! It would definitely be feasible to revert the EarleyParse code to handle nulls and type conversions but extra invocations of Predictor and Completor are needed, as described in "Practical Techniques..." (see below) AND that still leaves the problem of how to generate an unambiguous grammar from a database like set.mm that contains "removable" or "non-essential ambiguities" such as overloaded functions.
The output of EarleyParser is an array of ParseTrees, and there is no arbitrary limit on the number of these trees in the event of ambiguity -- except that imposed by the user in establishing the parseTreeArray. The sequence of the output trees *is* arbitrary however (I have no clue), but there are solid reasons to believe that duplicate trees will never be generated, under any circumstances.
The Earley Parse algorithm write-up talks incessantly about Item Sets. These are supposed to be actual *sets*, with no duplications in any one set. To accomplish that efficiently in Java was a coding issue. Instead of deleting duplicates after they are generated, I attempt to prevent generation of duplicate Earley Items in the first place using the pPredictorTyp and pBringForwardTyp arrays. And instead of coding the Item Sets as Java Sets I used arrays, pre-allocated for speed. The concession to quality is that I do do a duplicate check when adding an item to the pCompletedItem array -- no duplicates in that array means no duplicate Parse Trees, end of story (assuming the "build Parse Tree" logic isn't buggy.)
EarleyParser is my 3rd attempt at coding a parser capable of handling Metamath's set.mm database. BottomUpParser was my second attempt. Set.mm's "supeu" was too hairy for Bottom Up, but the algorithm does work -- when it actually gets around to computing an answer. My first attempt handled propositional logic just fine but died deep into set.mm and has since been deleted (not even a backup lives on...)
The nice thing about the Earley Parse algorithm is that it does handle all context free grammars and it does handle ambiguous grammars. If it works for set.mm then it is a helluva parser, because set.mm gives professional logicians gray hair with its intricate syntax!
See Dick Grune and Ceriel J.H. Jacob's "Parsing Techniques -- A Practical Guide", page 149 in Edition 1 -- which is the 139th page in the .pdf book file, available at:
Parsing Techniques -- A Practical Guide
(Note: I went completely bald after working on this project. Fortunately, Dick Grune was incredibly friendly and helped me to understand some gnarly concepts. My hair may return, we shall see! While we're on the subject of heroes, Norm Megill has the patience of a saint. As far as I am concerned he is a prime candidate for the Nobel Prize. One day they will build statues of Norman Megill, and small boys will go to bed with Metamath bubble gum trading cards under their pillows...
Constructor and Description |
---|
EarleyParser(Grammar grammarIn,
int maxFormulaLengthIn)
Construct using reference to Grammar and a parameter signifying the
maximum length of a formula in the database.
|
Modifier and Type | Method and Description |
---|---|
int |
parseExpr(ParseTree[] parseTreeArrayIn,
Cnst formulaTypIn,
ParseNodeHolder[] parseNodeHolderExprIn,
int highestSeqIn)
parseExpr - returns 'n' = the number of ParseTree objects generated for
the input formula and stored in parseTreeArray.
|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
addSettings
public EarleyParser(Grammar grammarIn, int maxFormulaLengthIn)
grammarIn
- Grammar objectmaxFormulaLengthIn
- gives us a hint about what to expect.public int parseExpr(ParseTree[] parseTreeArrayIn, Cnst formulaTypIn, ParseNodeHolder[] parseNodeHolderExprIn, int highestSeqIn) throws VerifyException
The user can control whether or not the first successful parse is returned by passing in an array of length = 1.
parseExpr
in interface GrammaticalParser
parseTreeArrayIn
- holds generated ParseTrees, therefore, length
must be greater than 0. The user can control whether or not
the first successful parse is returned by passing in an array
of length = 1. If the array length is greater than 1 the
function attempts to fill the array with alternate parses
(which if found, would indicate grammatical ambiguity.)
Returned ParseTree objects are stored in array order -- 0, 1,
2, ... -- and the contents of unused array entries is
unspecified. If more than one ParseTree is returned, the
returned ParseTrees are guaranteed to be different (no
duplicate ParseTrees are returned!)formulaTypIn
- Cnst Type Code of the Formula to be parsed.parseNodeHolderExprIn
- Formula's Expression, preloaded into a
ParseNodeHolder[] (see
Formula(dot)getParseNodeHolderExpr(mandVarHypArray).highestSeqIn
- restricts the parse to statements with a sequence
number less than or equal to highestSeq. Set this to
Integer(dot)MAX_VALUE to enable grammatical parsing using all
available rules.VerifyException
- if an error occurred