CONSTRAINT GRAMMAR AS A FRAMEWORK FOR PARSING RUNNING TEXT

Fred Karlsson
University of Helsinki, Department of General Linguistics
Hallituskatu 11, SF-00100 Helsinki, Finland
e-mail: KARLSSON@FINUH.bitnet

1. Outline

Grammars which are used in parsers are often directly imported from autonomous grammar theory and descriptive practice that were not devised for the explicit purpose of parsing. Parsers have been designed for English based on e.g. Government and Binding Theory, Generalized Phrase Structure Grammar, and Lexical-Functional Grammar.

We present a formalism to be used for parsing where the grammar statements are closer to real text sentences and more directly address some notorious parsing problems, especially ambiguity. The formalism is a linguistic one. It relies on transitional probabilities only in an indirect way; the probabilities are not part of the description.

The descriptive statements, constraints, do not have the ordinary task of defining the notion 'correct sentence in L'. They are less categorical in nature, more closely tied to morphological features, and more directly geared towards the basic task of parsing. We see this task as one of inferring surface structure from a stream of concrete tokens in a basically bottom-up mode. Constraints are formulated on the basis of extensive corpus studies. They may reflect absolute, rule-like facts, or probabilistic tendencies where a certain risk is judged to be proper to take. Constraints of the former rule-like type are of course preferable.

The ensemble of constraints for language L constitutes a Constraint Grammar (CG) for L. A CG is intended to be used by the Constraint Grammar Parser CGP, implemented as a Lisp interpreter. Our input tokens to CGP are morphologically analyzed word-forms. One central idea is to maximize the use of morphological information for parsing purposes. All relevant structure is assigned directly via lexicon, morphology, and simple mappings from morphology to syntax.
The task of the constraints is basically to discard as many alternatives as possible, the optimum being a fully disambiguated sentence with one syntactic reading only. The second central idea is to treat morphological disambiguation and syntactic labelling by the same mechanism of discarding improper alternatives.

A good parsing formalism should satisfy many requirements: the constraints should be declarative rather than procedural, they should be able to cope with any real-world text-sentence (i.e. with running text, not just with linguists' laboratory sentences), they should be clearly separated from the program code by which they are executed, the formalism should be language-independent, it should be reasonably easy to implement (optimally as finite-state automata), and it should also be efficient to run. The CG formalism adheres to these desiderata.

2. Breaking up the problem of parsing

The problem of parsing running text may be broken up into seven subproblems or 'modules':

• preprocessing,
• morphological analysis,
• local morphological disambiguation,
• morphosyntactic mapping,
• context-dependent morphological disambiguation,
• determination of intrasentential clause boundaries,
• disambiguation of surface syntactic functions.

The first four of these modules are executed sequentially, optimally followed by parallel execution of the last three modules which constitute 'syntax proper'. We thus have a five-stage parsing process. In this general setting, CG is the formalism of the fifth stage, syntax proper. The same CG constraint formalism is used to disambiguate morphological and syntactic ambiguities, and to locate clause boundaries in a complex sentence. Parts of the CG formalism are also used in morphosyntactic mapping.

Real texts are full of idiosyncrasies in regard to headings, footnotes, paragraph structure, interpunctuation, use of upper and lower case, etc. Such phenomena must be properly normalized.
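The module sequencing described above can be sketched as a simple pipeline. The sketch below is illustrative only: the module bodies are hypothetical stand-ins, not the actual CGP modules.

```python
# Hedged sketch of the seven-module staging: four sequential stages,
# then 'syntax proper' (three modules that may run in parallel).
# All module bodies here are trivial placeholders.

def parse(text):
    tokens = preprocess(text)                   # normalization, idioms
    cohorts = [analyze(t) for t in tokens]      # two-level morphology
    cohorts = [local_disambiguate(c) for c in cohorts]
    cohorts = [map_morphosyntax(c) for c in cohorts]
    # 'Syntax proper': context-dependent disambiguation, clause
    # boundaries, and syntactic function disambiguation.
    cohorts = context_disambiguate(cohorts)
    cohorts = mark_clause_boundaries(cohorts)
    cohorts = disambiguate_syntax(cohorts)
    return cohorts

# Placeholder stand-ins so the pipeline runs end to end:
def preprocess(text): return text.split()
def analyze(t): return {"form": t, "readings": [t.lower()]}
def local_disambiguate(c): return c
def map_morphosyntax(c): return c
def context_disambiguate(cs): return cs
def mark_clause_boundaries(cs): return cs
def disambiguate_syntax(cs): return cs
```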
Furthermore, several purely linguistic phenomena must somehow be dealt with prior to single-word morphological analysis, especially idioms and other more or less fixed multi-word expressions. (It would e.g. make no sense to subject the individual words of the expression in spite of to plain morphological analysis.) The existence of an adequate preprocessor is here simply taken for granted. We concentrate on morphological analysis, clause boundary determination, morphological disambiguation, and syntactic function assignment. Viewing the problem of parsing in turn from one or another of these angles clarifies many intricacies. The subproblems assume more manageable proportions and make possible a novel type of modularity.

Morphological analysis is relatively independent, and CGP is always supplied with adequate morphological input. The morphological analyzers are designed according to Koskenniemi's (1983) two-level model. Currently our Research Unit has morphological analyzers available for English (41,000 lexicon entries), Finnish (37,000 entries), and Swedish (42,000 entries).

Below are two morphologically analyzed English word-forms: a has one reading, move four. The set of readings for a word-form we call a cohort. All readings in a cohort have the base-form initially on the line. Upper-case strings are morphological features, except for those containing the designated initial character "@", which denotes that the string following it is the name of a syntactic function, here emanating from the lexicon: "@DN>" = determiner as modifier of the next noun to the right, "@+FMAINV" = finite main verb, "@-FMAINV" = non-finite main verb as member of a verb chain.

a
    "a" DET @DN>
move
    "move" N NOM SG
    "move" V SUBJUNCTIVE @+FMAINV
    "move" V IMP @+FMAINV
    "move" V INF @-FMAINV

An example constraint is:

(@w =0 "PREP" (-1 DET))

stating that if a word (@w) has a reading with the feature "PREP", this very reading is discarded (=0) iff the preceding word (i.e. the word in position -1) has a reading with the feature "DET".
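The discarding mechanism of the PREP-discarding constraint just described can be sketched in a few lines. The data layout below (cohorts as dicts, readings as feature-sets) is an assumption for illustration, not CGP's actual Lisp representation; keeping at least one reading per cohort is likewise a design choice of this sketch.

```python
# Sketch of applying (@w =0 "PREP" (-1 DET)): discard a PREP reading
# when the preceding word has a DET reading.

def has_feature(cohort, feat):
    return any(feat in r for r in cohort["readings"])

def discard_prep_after_det(sentence):
    for i, cohort in enumerate(sentence):
        if i > 0 and has_feature(sentence[i - 1], "DET"):
            kept = [r for r in cohort["readings"] if "PREP" not in r]
            if kept:                    # never discard the last reading
                cohort["readings"] = kept
    return sentence

# 'round' is three-ways ambiguous; after a determiner its PREP
# reading is discarded and only A and N survive.
sent = [
    {"form": "the",   "readings": [{"DET"}]},
    {"form": "round", "readings": [{"PREP"}, {"A", "ABS"}, {"N", "NOM", "SG"}]},
]
discard_prep_after_det(sent)
```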
The domain points out some element to be disambiguated, e.g. (the readings of) a particular word-form. The designated domain @w is a variable over any word-form, used when the target reading is picked out by feature(s) only. The target defines which reading the constraint is about. The target may refer to one particular reading, such as "V PRES -SG3", or to all members of a declared set, such as VFIN. The operator defines which operation to perform on the reading(s). There are three disambiguation operators, here treated in order of decreasing strength.

The operator '=!!' indicates that the target reading is the correct one iff all context conditions are satisfied; all other readings should be discarded. If the context conditions are not satisfied, the target reading itself is discarded. The operator '=!' indicates that the target reading is the correct one iff all context conditions are satisfied; all other readings are then discarded. The operator '=0' discards the target reading iff the context conditions are satisfied; it leaves all other readings intact. Two examples:

(@w =0 VFIN (-1 TO))
("<that>" =! "<Rel>" (-1 NOMHEAD) (1 VFIN))

The first one discards all finite verb readings immediately after the base-form to (itself either a preposition or an infinitive marker). VFIN is a declared set; the constraint is applicable to all strings declared to belong to this set. The second constraint states that the proper reading of the word that is the relative pronoun (i.e. a reading containing the string "<Rel>", itself an inherent feature emanating from the lexicon) iff it occurs immediately after a nominal head and immediately before a finite verb.

There is also a mechanism available for expressing relative position with reference to variable positions established via unbounded dependencies. Let condition (*1 VFIN) be satisfied at absolute position 5, i.e. at the fifth word to the right. Then (L-1 N) would require there to be a feature "N" in absolute position 4, and (L* N) would establish a second unbounded dependency somewhere left of position 5 (but right of position 0), i.e.
looking for satisfaction at one of positions 4, 3, 2, 1.

Often context conditions work on ambiguous cohorts, i.e. one reading satisfies the condition, but this reading is perhaps not the correct one in the first place. If so, should a risk be taken? The CG formalism makes this a matter of deliberate choice. All constraints so far treated allow the context conditions to be satisfied by ambiguous context cohorts. By appending the character C to the position number, one requires the respective condition to be satisfied only if the cohort being tested is itself unambiguous. This is called careful mode, e.g.:

(@w =0 VFIN (-1C TO))

For many constraints it is necessary to require that they do not apply over clause boundaries. This clause boundary mode is effected by appending either of the atoms **CLB (ordinary mode) or **CLBC (careful mode) after the last context condition. Clause boundary mode is typically used in conjunction with unbounded contexts.

A template mechanism is available for expressing partial generalizations. E.g., a template "&NP" could be declared to contain the alternatives ((N)), ((A) (N)), ((DET) (N)), ((DET) (A) (N)), etc. Then the template &NP could be used in the context part of any constraint. At run-time all alternative realizations of &NP would be properly considered.

Every constraint embodies a true statement. Occasionally the constraints might seem quite down-to-earth and even 'trivial', given mainstream conceptions of what constitutes a 'linguistically significant generalization'. But the very essence of CG is that low-level constraints (i) are easily expressible, and (ii) prove to be effective in parsing.
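The operator semantics and careful mode described above can be rendered as a small sketch. The cohort/reading layout and the feature name INFMARK are assumptions of this illustration, not CGP's own representation.

```python
# Sketch of the three disambiguation operators plus careful mode.
# Readings are feature-sets; 'ok' stands for the truth of the
# constraint's context conditions.

def apply_operator(readings, op, target, ok):
    def is_target(r):
        return target in r
    if op == "=!!":
        # Target correct iff context holds; otherwise target itself goes.
        if ok:
            return [r for r in readings if is_target(r)]
        return [r for r in readings if not is_target(r)]
    if op == "=!":
        # Target correct iff context holds; otherwise no action.
        if ok:
            return [r for r in readings if is_target(r)]
        return list(readings)
    if op == "=0":
        # Discard target iff context holds; other readings always survive.
        if ok:
            return [r for r in readings if not is_target(r)]
        return list(readings)
    raise ValueError("unknown operator: " + op)

def condition_holds(cohort, feature, careful=False):
    # Careful mode (position flag C): an ambiguous cohort never
    # satisfies the condition, so no risk is taken.
    if careful and len(cohort["readings"]) > 1:
        return False
    return any(feature in r for r in cohort["readings"])

# (@w =0 VFIN (-1C TO)): this 'to' cohort is ambiguous, so in careful
# mode the condition fails and the finite-verb reading survives.
to_cohort = {"form": "to", "readings": [{"TO", "PREP"}, {"TO", "INFMARK"}]}
verb_cohort = {"form": "move", "readings": [{"V", "IMP"}, {"N", "NOM", "SG"}]}
ok = condition_holds(to_cohort, "TO", careful=True)
result = apply_operator(verb_cohort["readings"], "=0", "V", ok)
```

Note how '=!' and '=!!' differ only when the context fails: '=!' does nothing, while the stronger '=!!' then discards the target itself.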
6. Constraints for intrasentential clause boundaries

Clause boundary constraints establish locations of clause boundaries. They are important especially for the formulation of proper syntactic constraints. E.g., the syntactic constraint "there is only one finite predicate in a simplex non-coordinated clause" presupposes that clause boundary locations are known. Clause boundaries occur i.a. as the inherent feature "<**CLB>" in the input stream; e.g. subjunctions are lexically marked by this feature. But many boundaries must be spotted by specific constraints. Clause boundary constraints have the special operator "=**CLB", stating that there is a clause boundary before the word specified by the target. E.g., given that conjunctions are lexically marked by the inherent feature "<Conj>", the constraint:

(@w =**CLB "<Conj>" (1 NOMHEAD) (2 VFIN))

states that there is a clause boundary before conjunction instances that precede a NOMHEAD followed by a finite verb (e.g., before the conjunction in a sentence such as John eats and Bill drinks).

7. Syntactic constraints

CG syntax is based on dependency and should assign flat, functional, surface labels, optimally one to each word-form, drawn from a classical repertoire of heads and modifiers. CG syntax maps morphological categories and word order information onto syntactic labels. Designated syntactic subsets of verb chain elements, head labels, and modifier labels should be established. For English, these include e.g.:

• verb chain members: @+FAUXV (finite auxiliary V), @-FAUXV (non-finite auxiliary V), @+FMAINV (finite main V), @-FMAINV (non-finite main V), ...
• nominal heads: @SUBJ, @OBJ, @I-OBJ, @PCOMPL-S (subj. pred. compl.), @PCOMPL-O (obj. pred. compl.), @ADVL (adverbial), ...
• nominal modifiers: AN> (adjective as premodifier to N), DN> (determiner as premodifier to N), (premodifier to A),

An arrow in a modifier label indicates the direction (right ">", left "<") of the respective head, which is identified by its part-of-speech label. E.g., the label @

, DN>, NN>, GN> (genitival).

In Constraint Grammar, syntactic labels are assigned in three steps. The basic strategy is: do as much as possible as early as possible. The first step is to provide as many syntactic labels as possible in the lexicon (including morphology). For entries having a reduced set of labels (compared to what that morphological class normally has), those labels will be listed in the lexicon. Thus, output from lexicon and morphology will indicate that he is @SUBJ, and that him is either @OBJ, @I-OBJ, or @

The first element of a mapping statement is a feature string occurring in a morphological reading, the second is either NIL (no conditions) or a list of sublists, each of which is a legal context condition, and finally the requisite grammatical function label(s) are listed. Here are some mapping statements, the first three without context conditions and providing a maximal set of labels:

("PRON GEN" NIL GN> @PCOMPL-S @PCOMPL-O)
("A" NIL AN> @PCOMPL-S @PCOMPL-O @SUBJ @OBJ @I-OBJ)
("N NOM" NIL @SUBJ @OBJ @I-OBJ @PCOMPL-S @PCOMPL-O @APP @NN> @)
("N NOM" ((-1 PREP)) @)

These state e.g. that a nominative noun may act as @NN>, premodifying a subsequently following noun (in compounds, cf. computer screen), and that a noun in the nominative case after a preposition is @

…      "…"      N NOM SG  @SUBJ
saw    "see"    V PAST    @+FMAINV
the    "the"    DET       @DN>
little "little" A ABS     @AN>
dog    "dog"    N NOM SG  @OBJ
in     "in"     PREP      @
park   "park"   N NOM SG  @
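The mapping statements above can be read as data and applied mechanically. The sketch below is an illustrative rendering (readings as feature strings, conditions ignored), not CGP's Lisp internals; labels follow the paper's examples.

```python
# Sketch of morphosyntactic mapping: each statement is a triple of
# (feature string, context conditions or None, candidate labels).
# A statement matches when its features are contained in the reading.

MAPPINGS = [
    ("PRON GEN", None, ["GN>", "@PCOMPL-S", "@PCOMPL-O"]),
    ("A",        None, ["AN>", "@PCOMPL-S", "@PCOMPL-O",
                        "@SUBJ", "@OBJ", "@I-OBJ"]),
    ("N NOM",    None, ["@SUBJ", "@OBJ", "@I-OBJ", "@PCOMPL-S",
                        "@PCOMPL-O", "@APP", "@NN>"]),
]

def map_labels(reading):
    """Collect candidate labels from every matching statement;
    later constraints then discard the improper alternatives."""
    labels = []
    for feats, conditions, candidates in MAPPINGS:
        if set(feats.split()) <= set(reading.split()):
            labels.extend(candidates)
    return labels
```

Assigning the maximal label set here mirrors the overall CG strategy: mapping over-generates, and disambiguation constraints discard.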