AsteroidMeta: metamathGrammarFacilities

(Caution: serious ruminating below! This is amateur hour stuff so don't take any wooden nickels!)

Metamath uses "syntax axioms" to recursively define things such as wff's, classes and terms. It is part of the engineered simplicity of Metamath that these syntax constructors are defined as axioms using the same $a keyword used to define logical axioms, such as those of propositional logic. There are shortcomings to this approach, which can be seen in the present coding of the set.mm database.

For example, operator overloading of "=" for sets and classes results in an apparent ambiguity when set.mm's syntax axioms are used directly to create a grammar for parsing the symbol sequences making up the formulas of set.mm. The problem isn't necessarily with set.mm but perhaps with the way the grammar is constructed, and the fact that the Metamath specification does not explicitly cover derivation of formal grammars from .mm files. Here is the situation with "=" in set.mm:

    $c = wff class set $.
    $v x y A B ph ps $.
    wph $f wff ph $.
    wps $f wff ps $.
    vx $f set x $.
    vy $f set y $.
    cA $f class A $.
    cB $f class B $.
    weq $a wff x = y $.
    cv $a class x $.
    wceq $a wff A = B $.

These definitions result in an apparent ambiguity – if the grammar is taken directly from the $a statements:

    cv says that a set variable can be
    converted to a class expression
 
    
    and so, "x = y" can be parsed to either
 
        "vx vy weq" or
 
        "vx cv vy cv wceq".

mmj2 resolves these operator overloading "ambiguities" in the way it builds its grammar. It attempts to divine the attempt of the author by noting that since weq appears first, an occurrence of "x = y" should be associated with weq and the remaining 3 possibilities should be associated with wceq. It does this by using a "combinatorial explosion" of type conversions on the wceq syntax axiom as it builds its grammar rule table: four grammar rules are built involving weq and wceq:

    weq   := set, set 
    wceq1 := class, set 
    wceq2 := set, class
    wceq3 := class, class

This technique works satisfactorily for every syntax axiom involving class variables, though it does result in a grammar rule table of neary 500 rules – a +300% increase over the original syntax axioms in set.mm.

So, operator overloading is something that can, perhaps, be satisfactorily defined away with a properly coded grammar that is based on the syntax axioms but contains extra rules that depend on the sequence in the database of the overloading instances.

There are, however, other desirable syntactic constructions that cannot be so easily defined away. Not without altering expanding the number of grammatical type codes in a Metamath file. Here is a classic example from Schmidt's book on Denotational Semantics (you can google for that):

""" Ambiguous BNF definitions can often be rewritten into an unambiguous form, but the price paid is that the revised definitions contain extra, artificial levels of structure. An unambiguous definition of arithmetic reads:

    <expression> ::= <expression> <lowop> <term> | <term>
 
    <term> ::= <term> <highop> <factor> | <factor>
 
    <factor> ::= <numeral> | ( <expression> )
 
    <lowop> ::= + | -
 
    <highop> ::= × | /
 
    <numeral> ::= <digit> | <digit> <numeral>
 
    <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

(The rules for <numeral> and <digit> remain the same.) This definition solves the ambiguity problem, and now there is only one derivation tree for 4 × 2 + 1, given in Figure 1.4. The tree is more complex than the one in Figure 1.2 (or 1.3) and the intuitive structure of the expression is obscured. Compiler writers further extend BNF definitions so that fast parsers result. Must we use these modified, complex BNF definitions when we study semantics? The answer is no.

We claim that the derivation trees are the real sentences of a language, and strings of symbols are just abbreviations for the trees. Thus, the string 4 × 2 + 1 is an ambiguous abbreviation. The original BNF definition of arithmetic is adequate for specifying the structure of sentences (trees) of arithmetic, but it is not designed for assigning a unique derivation tree to a string purporting to be a sentence. In real life, we use two BNF definitions: one to determine the derivation tree that a string abbreviates, and one to analyze the tree’s structure and determine its semantics. Call these the concrete and abstract syntax definitions, respectively.

"""

The lowop, highop, term and factor types (aka "left hand sides") are synthetic. The user never sees them, just the terminal symbols such as the digits, operators and the resulting expressions. Were we to implement the above BNF definitions in terms of $a statements we would also be compelled to add variables and variable hypotheses for lowop, highop, term and factor even though these are internal to the process of parsing symbol sequences.

Creating these additional rules enables the grammar to give precedence to "*" over "+" so that the correct answer to, say, "2 + 3 * 5" can be given without having to use parentheses.

   2 + 3 * 5 = 17 ? or
   2 + 3 * 5 = 30 ?

Recent spin-offs from Metamath such as Ghilbert and Bourbaki implement expressions as trees.

Even so, with the weq/wceq issue, if an author wishes to retain the distinction between weq and wceq, two function names or symbols will be needed; there can be only one function named "=". Thus, it may be that the function names assigned in languages like Ghilbert and Bourbaki should be considered internal, so that users can be presented with the familiar semiotics of "=", regardless of operand types, whenever equality is intended (likewise for other familiar operators…)

And, the fact that languages like Bourbaki and Ghilbert store their formulas as trees does not solve their notational and typesetting problems. Not unless the users are supposed to input sexp's and read Lisp code instead of math and logic formulas! And that means the problem of parsing symbol sequences does not disappear just because internally formulas are stored in trees. So what is becoming apparent now is that there are sound reasons for defining the grammar of a formal language separately from its axioms – as Schmidt says, "Call these the concrete and abstract syntax definitions, respectively."

: I'm writing up my thoughts on this issue on a separate page entitled Ghilbert syntax plans. --raph

Still, this may not fully resolve all of our problems so easily! Say, for example there is a function that requires as arguments only proper classes – and sets will not do. How then will this be defined in say, Bourbaki whose expressions accept the specified type or any of its sub-types?

For me that is the unresolved question. Let's look again at axiom "cv" in set.mm:

    $( All sets are classes (but not vice-versa!). $)
    cv $a class x $.

Is that a "syntax axiom" or an axiom of logic? Both?

--ocat 23-Oct-2005

Proposal

by marnix 24-Oct-2005

: Haven't got the time to read and digest the above in detail, but it is in the same area as a proposal that I came up with this weekend, independently, but also while trying to parse .mm files for Hmm. So I might as well post it here. NOTE: Norm didn't yet have a chance to see this, probably. NOTE 2: My original proposal had "type axiom" instead of "syntax axiom". I like the latter better.
: This proposal seems to be both minimal and in the spirit of Metamath. Is this (i.e., having a .mm file with verified proofs and verified syntax constructions for all non-syntax axioms) really enough to do unambiguous parsing of all assertions?

For any axiom that is not a 'syntax axiom' (i.e., for which the initial constant is not the initial contant of an active $f statement), optionally allow a 'syntax construction' $= … just before the closing $. For example,

  ax-1 $a |- ( ph -> ( ps -> ph ) ) $= wph wps wph wi wi thwff $.

This behaves just like a proof, and a Metamath verifier should verify that it really results in the stated axiom.

But where does the thwff come from? Well, after processing 'wph wps wph wi wi' the stack contains "wff ( ph → ( ps → ph ) )". Somehow from this, we should be able to conclude that the axiom is syntactically valid, but the axiom has '|-' instead of 'wff'. For this we need a rule saying, "if ps is a wff, then '|- ps' is a syntactically valid (but not necessarily true) assertion". The simplest way to do this, as far as I can see, is to introduce a built-in 'type' $TOP (just choosing a name here which can't be used as a math symbol), and putting in one extra axiom (actually a special syntax axiom):

  thwff $a $TOP |- ps $.

somewhere after $f wff ps $.

So then after processing the syntax construction above, the stack contains "$TOP |- ( ph → ( ps → ph ) )", and this means that the axiom is a syntactically-valid top-level expression.

I'm not sure what this additional mechanism buys you. Keep in mind that all terms in Metamath already have two representations: the string of symbols, and the RPN used to push the term onto the stack within a proof. So, the string "( ph → ( ps → ph ) )" has the RPN representation "wph wps wph wi wi". Using the $a rules, it's always possible to go from the RPN representation to the string. Metamath itself does not offer a mechanism to convert in the other direction.

The issue is quite a bit trickier when the grammar is ambiguous. For example, in set.mm, "x = y" has the two RPN parses "vx vy weq" and "vx cv vy cv wceq". I've chosen to handle this in my MM → GH translation by defining "=" to be the wceq variant and eliminating the weq variant altogether. Another perfectly valid approach would be to define both "=" variants and enforce the distinction between the two literally.

--raph

re: Marnix -->

     Is this (i.e., having a .mm file with verified 
     proofs and verified syntax constructions for 
     all non-syntax axioms) really enough to do 
     unambiguous parsing of all assertions?

I do not see that. For two reasons. 1) Merely having a syntax construction (parse) for non-syntax axioms cannot make an ambiguous grammar unambiguous, and 2) Metamath does not require that provable assertions be parseable (substract new axiom "wxy" from miu.mm and see that the proofs still hold – but add it back and suddendly the grammar is ambiguous!)

--ocat 24-Oct-2005

Over the last days I've been thinking about the above two replies, and most of all, I've been coding today just to see what my thoughts would look like in an actual implementation. What I've been writing (and what I'll probably put in Hmm) is a tool to find all possible parse trees for non-syntax axioms. And what I've done on paper (and should be simple to code up) is to find the parse trees that are implied by the proofs of $p statements.

Let me respond point by point:

ocat: "…cannot make an ambiguous grammar unambiguous"; raph: "I'm not sure what this additional mechanism buys you." My intention here is not to make an ambiguous grammar unambiguous, but to make sure that Metamath-related software can work with parse trees instead of just token sequences. And with as little guessing or database-specific annotations as possible. The explicit 'syntax construction' in RPN format for non-syntax $a statements seems to do the trick. For each assertion, this points the Metamath-related software (Hmm in this case) to exactly one of the possible RPN parses: the one that the user prefers. So this is a matter of guiding Metamath-related software.
ocat: "Metamath does not require that provable assertions be parseable" Agreed, and it's OK to keep things that way. However, in the majority cases (set.mm and ql.mm are of course the best examples) this is the intention of the user. So in my view it is not so strange to have a mechanism to specify and automatically check this intention. And as I said, this mechanism would be optional. And it is only required for (1) $a statements that are (2) non-syntax axioms, and for which (3) multiple RPN parses exist. I don't have numbers yet, but my experiments suggest that this is a small number of statements.
raph: "Metamath itself does not offer a mechanism to convert in the other direction." My proposal is exactly about such a mechanism: let the user provide an RPN representation for every non-syntax $a statement. (If the grammar is ambiguous, the user has a choice, and in this way s/he can indicate, e.g., which version of '=' was intended.) From these RPN representations, it is then possible to use the proofs of the $p statements to find an RPN representation for every $p statement. (Again, if the grammar is ambiguous, there will be multiple RPN parses for the $p statement, but the proof allows exactly one of these parses.)
raph: "…has the two RPN parses "vx vy weq" and "vx cv vy cv wceq"." Yes, the choice is arbitrary because of the ambiguity in the grammar. Hmm will probably choose the shortest one available, and choose an arbitrary one if multiple shortest parses exist.

(I don't have time now to explain why the proof of a $p statement --specifically the RPN parse of the assertion used in the top-level/last proof step– determines the RPN parse of the $p statement.)

Hope this makes things clearer…

Next up in Hmm (but I'm not promising a date): a tool that prints all possible RPN parses for each assertion in a database.

--marnix

Yes. I get it now. Thanks for the clarification. I'm not sure I have a warm, fuzzy feeling about the proposal, but I do see what it's trying to do.

/"all possible RPN parses"/ – doesn't that make the hair stand up on the back of your neck when you think about the exponential blowup?

--raph

One point to consider, based on the example of miu.mm – which is highly ambiguous with syntax axiom wxy's presence – is that the $e statements can effectively introduce syntax. How else could the proofs in miu.mm work in the absence of wxy? A $e is like a syntax axiom, in some respects, if it is not parseable.

Regarding the job of producing all possible RPN parses, I settled on the Earley Parse algorithm in mmj2 because it can handle ambiguous grammars. The default maximum number of parse trees returned in "extended" editing mode is 2, but a constant parameter in mmj.verify.GrammarConstants?.java allows the number to be set to an arbitrary level, like 100. The Earley Parse is fairly speedy even with "extended" ambiguity checking, unlike the Bottom Up parse algorithm which can easily spend an hour or more working on a single formula :)

--ocat

Instead of

  thwff $a $TOP |- ps $.
  ax-1 $a |- ( ph -> ( ps -> ph ) ) $= wph wps wph wi wi thwff $.

could you use the following?

  ax-1 $a |- ( ph -> ( ps -> ph ) ) $.
  ax-1-wff $p wff ( ph -> ( ps -> ph ) ) $= wph wps wph wi wi $.

This is compatible with the current spec and existing proof verifiers. Of course, your own program would be needed to check that ax-1-wff exists and matches ax-1, which current verifiers don't do.

Perhaps a better way - at least until your proposal becomes popular - is to have your axiom syntax proofs in a separate file, automatically generated and regenerated as needed, with a tool derived, say, from mmj2's parser. This way there would be no danger of the syntax proofs getting out of sync with the axioms, as might happen if they are embedded in the source (since current verifiers do not enforce them).

--norm 28-Oct-2005