page created 2003 Aug 12 by Prof. James A. Mason
updated 2003 Aug 13
This page describes recent work that I have been doing to extend the nounphrasesimple.grm example to a grammar that recognizes a much broader set of English noun phrases. The work is incomplete: refinements remain to be made to the grammar, and semantic action and value computations are still to be added to it. Nevertheless, this example illustrates some problems and choices in the process of grammar construction, and it can be used with the ASDTester program to test the ASDParser on a realistically large grammar. It also provides a useful test of the MergeGrammars and DefinedUsed utility programs. (See software.) The grammar is based mostly on my own idiolect of English, which I believe is reasonably typical of the receptive grammar of an older, well-educated native speaker of American English, and secondarily of Canadian English. The design of the grammar is my own, although I have obtained some ideas for it from the best general reference on English grammar that I know: A Comprehensive Grammar of the English Language by R. Quirk, S. Greenbaum, G. Leech and J. Svartvik (Longman, 1985, ISBN 0-582-51734-6). In these notes, specific references to that book are cited not by page number but by section number in the book (e.g. [Quirk 5.41]).
The new grammar npX.grm recognizes as phrase type NPS noun phrases that do not include modifiers involving verbs (i.e., relative clauses and participial phrases). The noun phrases which are recognized include ones with prepositional phrase modifiers, appositives both with and without commas (e.g. "We the people", "I, the owner of this car, ...", and "my dog, the one with the crooked tail, ..."), and conjunctive noun phrases (e.g. "the grey horse, the big black dog, and I"). The grammar further recognizes phrase type NPS as being of type NP, for Noun Phrase. However, the full definition of type NP will not be complete until the grammar is extended to cover noun phrases with clause and participial-phrase modifiers. The file npXexamples.txt provides a very small sample of phrases which can be parsed correctly with ASDParser using npX.grm. (For many of them the parser also finds alternative parsings which are not semantically correct. See the next paragraph.)
For me, the biggest difficulty in constructing an ASD grammar like this one is finding a good balance between syntax and semantics. At one extreme, most of the grammar could be defined through elaborate networks, with constraints such as agreement between singular determiners and singular nouns, and between plural determiners and plural nouns, being expressed by having separate phrase types for singular and plural forms, and connecting nodes for them appropriately in the grammar. At the other extreme, one could define very few syntactic types, and rely on semantic feature settings, through semantic action and semantic value computations attached to grammar nodes, to carry the load of guiding the parser. The best approach seems to lie somewhere between these two extremes. This example grammar is an attempt to illustrate a balanced approach between syntactic and semantic structuring, although the semantic computations are still to be added to it. Until those computations are added, the ASDParser finds many ambiguities when using the grammar. With this version of the grammar, noun phrases which contain conjunctions are particularly ambiguous, as are noun phrases which can be interpreted as involving appositives. When you try parsing noun phrases with this grammar, using ASDTester for example, please keep in mind the grammar's high degree of syntactic ambiguity when you evaluate the results. Right now the important thing to consider is whether the grammar provides at least one appropriate structure for each phrase being tested, not whether it produces inappropriate structures as well, because semantic computations will eliminate most of those.
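The two extremes described above can be contrasted in a small sketch. It is only a toy illustration in Python and does not use the ASD grammar format or the ASDParser. The category names (DET-SG, DET-PL, N-SG, N-PL, DET, N), the four-word lexicon, and the function names are all assumptions made up for this example.

```python
# Approach 1 (syntactic extreme): number agreement is built into the
# category names, so singular and plural forms get separate types.
SYNTACTIC_LEXICON = {
    "this": "DET-SG", "these": "DET-PL",
    "dog": "N-SG", "dogs": "N-PL",
}
# Only matching singular/singular and plural/plural sequences are connected.
SYNTACTIC_RULES = {("DET-SG", "N-SG"), ("DET-PL", "N-PL")}

def accepts_syntactic(det, noun):
    """Accept the pair only if a rule connects their number-specific types."""
    pair = (SYNTACTIC_LEXICON.get(det), SYNTACTIC_LEXICON.get(noun))
    return pair in SYNTACTIC_RULES

# Approach 2 (semantic extreme): one category per part of speech, with
# number stored as a feature and checked by a separate test.
FEATURE_LEXICON = {
    "this": ("DET", {"number": "singular"}),
    "these": ("DET", {"number": "plural"}),
    "dog": ("N", {"number": "singular"}),
    "dogs": ("N", {"number": "plural"}),
}

def accepts_semantic(det, noun):
    """Accept any DET + N sequence whose number features agree."""
    det_entry = FEATURE_LEXICON.get(det)
    noun_entry = FEATURE_LEXICON.get(noun)
    if det_entry is None or noun_entry is None:
        return False
    det_cat, det_features = det_entry
    noun_cat, noun_features = noun_entry
    if (det_cat, noun_cat) != ("DET", "N"):
        return False
    return det_features["number"] == noun_features["number"]

for det, noun in [("this", "dog"), ("these", "dogs"), ("these", "dog")]:
    print(det, noun, accepts_syntactic(det, noun), accepts_semantic(det, noun))
```

Both functions accept and reject the same phrases. The difference is where the constraint lives: in the first it multiplies the number of categories and rules, and in the second it moves into a feature check that must run during parsing.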
The grammar currently consists of the following 18 modules:
approx.grm
notes
conjunc.grm
notes
cardord.grm
notes
fraction-p.grm
notes
cardordirp.grm
notes
adjfunction.grm
notes
quant-vague-p.grm
notes
quantity-np.grm
notes
quantity-p.grm
notes
grader.grm
notes
adj-descr-p.grm
notes
determiner-p.grm
notes
noun-p.grm
notes
pronoun-p.grm
notes
prepphrase.grm
notes
postdescrip.grm
notes
npcomponent.grm
notes
nps.grm
notes
The eighteen modules have been merged in the order shown, using the MergeGrammars utility with build list Xbuild.txt, to produce the combined grammar file npX.grm. (The "X" can stand for "eXtended" or "eXperimental".) The merged grammar contains 500 words and phrase types, including the "dummy" symbol $$, 874 nodes (instances of words or phrase types), and 622 edges connecting nodes. Those counts were obtained by using the ASDCheck utility.
The order in which grammar files are merged determines the order in which the ASDParser will try initial nodes for the same word found in two or more of the grammar modules: The initial nodes that occur in a file earlier in the merge will be tried before the initial nodes for the same word which occur in a file later in the merge. For example, both determiner-p.grm and quantity-np.grm have initial nodes for the word "the". So because of the sequence in which the grammars were merged to produce npX.grm, the parser will try the initial node for "the" in quantity-np.grm before it tries the initial node for "the" in determiner-p.grm. Of course that can be changed by changing the order in which the grammar modules are merged. The merge sequence shown is a preliminary one that will probably be changed as the grammar is refined further.
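The effect of merge order can be modelled with a short sketch. This is not the MergeGrammars utility or its file format; it is a toy Python model in which each module is reduced to a list of words that have initial nodes in it. The function name and the data layout are assumptions for illustration; only the two module names and the word "the" come from the description above.

```python
def initial_node_order(modules):
    """Map each word to the modules holding its initial nodes, in try order.

    `modules` is a list of (module_name, words) pairs given in merge order.
    Initial nodes from earlier modules come first in each word's list.
    """
    order = {}
    for module_name, words in modules:
        for word in words:
            order.setdefault(word, []).append(module_name)
    return order

# quantity-np.grm precedes determiner-p.grm in the merge sequence shown.
merge_sequence = [
    ("quantity-np.grm", ["the"]),
    ("determiner-p.grm", ["the"]),
]
print(initial_node_order(merge_sequence)["the"])

# Reversing the merge sequence reverses the order of the attempts.
print(initial_node_order(list(reversed(merge_sequence)))["the"])
```

In this model the first list places quantity-np.grm ahead of determiner-p.grm for "the", matching the behaviour described for npX.grm, and reversing the merge sequence swaps them.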
The diagrams for all of the modules can be displayed fully with the ASDEditor on a 17-inch or larger monitor at 1024x768 or higher pixel resolution. However, prepphrase.grm includes a very complicated diagram with many crossing edges to connect two-word and three-word prepositions. That diagram is hard to follow when viewed with the ASDEditor. That grammar also contains some simpler diagrams which can be viewed selectively, as described in its notes. It is helpful to examine that grammar file with a text editor as well. Then one can trace the acceptance of two-word and three-word prepositions through the lexical entries. Under Microsoft's Windows operating systems, the Notepad accessory can be used to display ASD grammar files as text, but I recommend using a better text file editor such as TextPad for that purpose.
Of course, the combined file, npX.grm, also has an extremely complicated display which consists of the diagrams of all of the separate modules overlaid on top of one another. Individual connected component diagrams of npX.grm can be displayed by selecting words and instances in the upper-left panes of an ASDEditor window. Selected components can also be removed from the display by using the View>Hide>selectedComponent(s) menu selection at the top of the window, or the Hide>selectedComponent(s) selection from the pop-up menu in the bottom pane. However, the grammar as a whole is best examined with a text editor.
The file Xxrefs.txt, produced with the DefinedUsed utility using Xbuild.txt as input, provides an index and cross reference for the phrase types and modules of the grammar. When the DefinedUsed utility examines the lexicon of a grammar, it considers any entry with at least one uppercase letter to be a phrase type name used in the lexicon. When it examines final nodes in the grammar, it considers the label that occurs as the name of the phrase type ending at that node to be a phrase type name defined by the grammar. Those two ways of finding the phrase type names in a grammar are not always consistent. In particular, Xxrefs.txt lists APOSTROPHE, APOSTROPHEs, I, NUMBER, and UNKNOWN as undefined phrase types. The reasons for these are as follows: "I" is the ordinary first person singular pronoun, but because it violates the convention that words in the lexicon containing uppercase letters are names of phrase types, the DefinedUsed utility considers "I" to be a phrase type name not defined in any of the grammar modules. We can safely ignore that spurious report. NUMBER and UNKNOWN are special keywords that ASDParser recognizes as representing, respectively, numerals and words not appearing in the grammar that it is using; so they are defined as phrase types by default. APOSTROPHE and APOSTROPHEs will be defined later, in the same way that they are defined in the earlier Nounphrase1.java example, using the method EnglishWord.processApostrophe.
Also, Xxrefs.txt lists COMMA-SUBORD, CONJUC-SUBORD, and "a" as unused phrase types. "a" is not really a phrase type, but it occurs in the module determiner-p.grm as if it were a phrase type that ends a one-word phrase "an". (That is a simple way to make "an" and "a" syntactically identical.) So the DefinedUsed utility reports that "a" is an unused phrase type. We can safely ignore that spurious report, too. COMMA-SUBORD and CONJUC-SUBORD really are unused phrase types, defined in the module conjunc.grm, but not yet used in any of the modules. Their definitions have been made in anticipation of further expansion of the grammar to include subordinate clauses.
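The cross-referencing rule described above, and the reason it produces spurious reports for "I" and "a", can be sketched as follows. This is not the DefinedUsed program itself: it is a minimal Python model, assuming that "used" names are lexicon entries containing an uppercase letter and "defined" names are the labels found at final nodes. The function names and the small sample lexicon are hypothetical and are not taken from npX.grm.

```python
def has_uppercase(entry):
    """True if the lexicon entry contains at least one uppercase letter."""
    return any(ch.isupper() for ch in entry)

def cross_reference(lexicon_entries, final_node_labels):
    """Return (undefined, unused) phrase type names under the two heuristics.

    used    = lexicon entries containing an uppercase letter
    defined = labels naming the phrase type that ends at a final node
    """
    used = {entry for entry in lexicon_entries if has_uppercase(entry)}
    defined = set(final_node_labels)
    undefined = sorted(used - defined)  # used in the lexicon, never defined
    unused = sorted(defined - used)     # defined at a final node, never used
    return undefined, unused

# Hypothetical sample data, chosen to reproduce the kinds of report above.
sample_lexicon = ["the", "dog", "I", "NUMBER", "NPS", "NP"]
sample_final_labels = ["NPS", "NP", "a", "COMMA-SUBORD"]

undefined, unused = cross_reference(sample_lexicon, sample_final_labels)
print("undefined:", undefined)
print("unused:", unused)
```

With this sample data, "I" and NUMBER are reported as undefined because they contain uppercase letters but label no final node, while "a" and COMMA-SUBORD are reported as unused because they label final nodes but never appear as uppercase lexicon entries.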