Real bad grammar: Realistic grammatical description with grammaticality*


1. Introduction

Sampson (this issue) argues for a concept of “realistic grammatical description” in which the distinction between grammatical and ungrammatical sentences is irrelevant. In this article I also argue for a concept of “realistic grammatical description” but one in which a binary distinction between grammatical and ungrammatical sentences is maintained. In distinguishing between the grammatical and ungrammatical, this kind of grammar differs from that proposed by Sampson, but it does share the important property that invented sentences have no role to play, either as positive or negative evidence.

Our propensity to make mistakes, and the fact that many people are forced to speak and write in a language that is not their native one, mean that sentences are produced which contain grammatical errors. These naturalistic ungrammatical sentences, as opposed to the invented starred examples often used within the linguistics community, have been dismissed as uninteresting. Although I do not wish to give naturalistic ungrammatical sentences the prominence given by Carnie (2002) to invented ungrammatical sentences when he suggests that it is necessary to determine the ungrammatical sentences in a language in order to determine the grammatical ones (see Sampson, this issue), I do, however, think that naturalistic ungrammatical sentences are of interest to linguists studying language production, language loss and language learning, and that the grammatical/ungrammatical distinction cannot therefore be completely dismissed. Also, for grammar development within the field of natural language processing, the grammatical/ungrammatical distinction cannot be ignored or denied because doing so can lead to the development of grammars which do not accurately analyse ungrammatical sentences. This article focuses particularly on this second argument in favour of maintaining the grammatical/ungrammatical distinction, and when I speak of grammar development, I am particularly thinking of those large-scale natural language grammars which are used to automatically parse natural language.

J. Foster. Corpus Linguistics and Linguistic Theory 3-1 (2007), 73-86. 1613-7027/07/0003-0073. DOI 10.1515/CLLT.2007.005. Walter de Gruyter.

Linguistic evidence in the form of grammaticality judgements can be used to distinguish grammatical sentences from ungrammatical ones but, crucially, these judgements should be made only on naturalistic data in context. Sampson (this issue) argues that Chomsky’s conception of language as a set of sentences, with the role of a linguist to establish which strings are in this set, is unhelpful because it focuses undue attention on the grammatical/ungrammatical distinction: I believe that an unfortunate consequence of this definition of language is that it places too much emphasis on the sentence as an isolated unit.

2. Grammars for natural language processing

A fundamental debate within the linguistics community has concerned what a grammar is supposed to model: should a grammar model competence or performance? Should a grammar reflect a psychological reality or a social reality? Lamb (2000), for example, distinguishes between a “theory of the linguistic extension”, which is a theory of the utterances produced by a speaker or community, and a “theory of the linguistic system”, which is a theory of the human cognitive system which is capable of producing and understanding such utterances. In the practical domain of natural language processing, there is no such debate. The grammar of a computer parser which is to form part of a practical application must be a theory of the linguistic extension and must describe the productions of a speech community. In proposing the competence/performance distinction, Chomsky remarked that the language produced by a speech community is rife with slips and imperfections (Chomsky 1961: 130-131). Therefore, if a computer parser has to accurately parse actual language, it will have to accurately parse imperfect language, in particular the kind of imperfect language that we routinely produce and are capable of understanding. It will be able to do this if it is equipped with some knowledge of deviant sentence structures.

A precision grammar distinguishes between the grammatical and the ungrammatical and purposely describes only grammatical sentences. An example is the English Resource Grammar (ERG) (Copestake and Flickinger 2000), a broad-coverage HPSG grammar of English. Baldwin et al. (2004) make the point that, if a grammar is to form the basis of a natural language processing system which performs not just sentence parsing but also sentence generation, it should not be able to generate ungrammatical sentences. A parser using such a grammar will reject ungrammatical sentences outright. However, a parser which gives the response “no” or “ungrammatical” to a sentence such as (1),1 may be capable of distinguishing between the grammatical and ungrammatical, but of what practical use is this ability if it cannot hint at the meaning of an utterance whose ill-formedness is quite commonplace?

(1) Want to saving money?

Of course, one could argue that robust parsing techniques (such as constraint relaxation (Fouvry 2003) or parse-fitting (Penstein Rosé and Lavie 1997)) could be employed to handle ungrammatical sentences, but such techniques will be more effective if they are tailored to specific types of ungrammatical language. A natural extension of this is then to actually let the grammar describe the structure of ungrammatical sentences in the same way that it describes the structure of grammatical sentences.
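The idea of letting the grammar describe ungrammatical structure can be sketched with a toy chart recogniser. This is an illustration only, not the grammar or parser discussed in this article: the lexicon, the category names, the rule set and the function `analyse` are all hypothetical, and the single error rule (an -ing form after infinitival "to") is just one assumed reading of what is wrong with (1). Each rule carries a flag saying whether it belongs to the ungrammatical component, and the recogniser reports the fewest flagged rules needed to analyse the input.

```python
# Hypothetical toy lexicon: word -> preterminal category.
LEXICON = {"want": "WANT", "to": "TO", "save": "VB", "saving": "VBG", "money": "NN"}

# Hypothetical binary rules: (parent, left, right, is_error_rule).
RULES = [
    ("S", "WANT", "INF", False),
    ("INF", "TO", "VP", False),
    ("VP", "VB", "NN", False),
    ("GER", "VBG", "NN", False),
    ("INF", "TO", "GER", True),  # assumed error rule: -ing form after "to"
]

def analyse(sentence):
    """Return 'grammatical', 'ungrammatical' or 'no analysis' for a sentence."""
    words = sentence.lower().rstrip("?.!").split()
    n = len(words)
    # chart[i][j] maps a category spanning words[i:j] to the fewest
    # error rules needed to derive it.
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        if word not in LEXICON:
            return "no analysis"
        chart[i][i + 1][LEXICON[word]] = 0
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right, is_error in RULES:
                    if left in chart[i][k] and right in chart[k][j]:
                        cost = chart[i][k][left] + chart[k][j][right] + int(is_error)
                        # Keep the analysis that uses the fewest error rules.
                        if cost < chart[i][j].get(parent, cost + 1):
                            chart[i][j][parent] = cost
    if n == 0 or "S" not in chart[0][n]:
        return "no analysis"
    return "grammatical" if chart[0][n]["S"] == 0 else "ungrammatical"

print(analyse("Want to saving money?"))
print(analyse("Want to save money?"))
```

Under these assumptions the recogniser still finds a structure for "Want to saving money?" but marks it as ungrammatical, rather than either rejecting it outright or accepting it silently.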

A parser whose grammar is derived automatically from a treebank of naturalistic sentences is unconcerned with whether or not a sentence is grammatical. Typically, grammaticality is assumed, and this assumption will be quite accurate if the treebank sentences come from a high-quality newspaper such as The Wall Street Journal. The fact that such grammars do not purposely set out to exclude ungrammatical sentences together with the fact that such grammars are generally based upon a large body of data means that parsers equipped with such grammars are quite likely to return a parse for an ungrammatical sentence. However, since such a parser does not have a concept of ungrammaticality, it will not be aware that there is something deviant about the sentence, with the result that the parse it produces for the sentence will not necessarily be the correct one, that is, it will not necessarily reflect what the person who produced the ungrammatical sentence intended to express. For example, Charniak’s most recent parser2 (Charniak 2000) will provide the reasonable parse in Figure 1 for sentence (1) but it is less successful, for example, on the ungrammatical (2), returning the parse in Figure 2.

(2) The closure in computed breadth-first.

3. Grammar requirements

The following are the requirements for the type of grammar which I believe should be developed by computational linguists and used by a parsing system:

1. The grammar should have a component which describes the structure of the grammatical sentences that occur in language.

2. The grammar should have a component which describes the structure of the ungrammatical sentences that occur in language.

Like a treebank grammar, this grammar aims to be a direct reflection of language rather than an indirect reflection via linguistic intuition.

However, unlike a treebank grammar, this grammar does explicitly distinguish between the grammatical and the ungrammatical, and this distinction relies on linguistic intuition. This distinction is binary, but this does not mean that the rules in each component of the grammar cannot be probabilistic. A linguistic structure described in the first, grammatical component of the grammar could be assigned a probability based on how frequently this structure appears in grammatical data. Similarly, a linguistic structure described in the ungrammatical component of the grammar could be assigned a probability based on how frequently this structure shows up in ungrammatical data. The grammatical component of the grammar is quite similar to a precision grammar which has been tested using corpus evidence. An example is the aforementioned ERG, which has been tested using sentences from the British National Corpus (Baldwin et al. 2004). The ungrammatical component is, of course, not implemented by a precision grammar.
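One simple way to read "a probability based on how frequently this structure appears" is a relative-frequency estimate computed separately within each component. The sketch below assumes exactly that. The structure labels, the toy counts and the names `estimate_component`, `grammar` and `lookup` are hypothetical and are not taken from any actual corpus or implementation.

```python
from collections import Counter

def estimate_component(structures):
    """Relative-frequency estimate over the structures observed in one component."""
    counts = Counter(structures)
    total = sum(counts.values())
    return {structure: count / total for structure, count in counts.items()}

# Hypothetical observations, already sorted by a binary grammaticality judgement.
grammatical_data = [
    "INF -> to VB", "INF -> to VB", "INF -> to VB",
    "INF -> to start VBG",
]
ungrammatical_data = [
    "INF -> to VBG", "INF -> to VBG",
    "INF -> to VBD",
]

# The judgement decides the component; frequency only sets the probability
# inside that component, so rareness and ungrammaticality stay separate.
grammar = {
    "grammatical": estimate_component(grammatical_data),
    "ungrammatical": estimate_component(ungrammatical_data),
}

def lookup(structure):
    """Return (component, probability) pairs for every component listing the structure."""
    return [(name, probs[structure]) for name, probs in grammar.items() if structure in probs]

print(lookup("INF -> to start VBG"))
print(lookup("INF -> to VBG"))
```

In this toy setting a rare structure in the grammatical component receives a low probability without being moved to the ungrammatical component, and a frequent error receives a high probability without thereby becoming grammatical.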

What kind of evidence is needed in order to develop and test the second component of the grammar, the part of the grammar which describes ungrammatical sentences? Since this grammar is to form the basis of a parsing system, its description of ungrammaticality must reflect the kind of ungrammaticality that actually occurs in language. This means that naturalistic ungrammatical sentences will be needed as evidence rather than imagined ones. Baldwin et al. (2004) argue that naturalistic ungrammatical sentences such as (1) or (2) constitute “haphazard noise” and are useful only to test that a grammar does not overgenerate. I am arguing that a grammar that is capable of generating the kind of ungrammatical sentences that people actually produce is not guilty of overgeneration, provided that the grammar knows that these kinds of sentences are ungrammatical. Therefore, to test the second, ungrammatical component of the grammar, which is essentially a theory of real ungrammaticality, it is necessary to collect a corpus of naturalistic sentences which are considered by speakers of the language to be ungrammatical.

How does this definition of grammar relate to the one suggested by Sampson (this issue)? The two are broadly in agreement since both aim to describe language as it is actually used and both reject the need for negative evidence in grammar development. According to Sampson (2001, and footnote 3, this issue), if a grammar is constructed so that it excludes sentences whose structure has not actually been observed, then negative evidence becomes irrelevant. In order to exclude a sentence from the grammar, it is not necessary to verify that it is ungrammatical. It is enough not to have observed the sentence in practice. Once the sentence is observed, then this observation has the potential to count as a refutation of a grammar which excludes the sentence, and the grammar will need to be modified accordingly. As Sampson notes, this is Popper’s view of the nature of a scientific theory: it should maximize the number of statements it makes which are refutable by observable evidence.3 Where the two notions of grammar differ is in their treatment of situations when an ungrammatical sentence such as (1) (repeated for convenience as (3)) is actually observed.

(3) Want to saving money?

(4) Want to save money?

(5) Want to start saving money?

Surely, if such a string is observed, it should serve as a refutation of any grammar which prohibits it and the grammar in question would need to be modified so that it no longer prohibits this sentence? Sampson (2001) argues that the grammar should not be changed to accommodate such an observation since our knowledge that people make mistakes in language (such as omitting a word, using the wrong verb form, etc.) should allow us to relate this sentence to another sentence such as (4) or (5), both of which are accommodated by the grammar, thereby discounting the observation as a genuine refutation. The ungrammatical sentence (3) would, however, be included in the grammar described here, although it would still be recognised as a different kind of observation from a sentence such as (4) or (5) and thus would be included in the second, ungrammatical component of the grammar. Recognizing it as a different kind of observation is the same thing as making a grammaticality judgement, and a method to make this kind of judgement as reliably as possible is described in the next section.

The type of grammar suggested by Sampson could actually be used as the grammatical component of the grammar advocated here. It would include rare and odd constructions (Sampson’s Dunster constructions), and if it was a probabilistic grammar it could encode rareness, without linking this rareness in any way to grammaticality. In fact, because of the clear distinction between the two components of the grammar, the concepts of grammaticality and frequency are not conflated. This nonconflation is a positive thing, regardless of where one stands with respect to Sampson’s claim that frequency data cannot be used to predict grammaticality status (see Sampson’s discussion of noun phrase variability in the SUSANNE treebank, this issue).

4. Judging grammaticality

The use of grammaticality judgements as linguistic evidence has always been controversial. A large body of literature spanning several decades casts doubt on the validity of grammaticality judgements (see for example Labov 1972; Derwing 1979; Schütze 1996). These critiques cover various problems with the judgement process: defining grammaticality, choice of informant, the measurement scale used to measure judgements and the role of sentential context. After concluding that the grammaticality of a sentence cannot be inferred from its frequency, Sampson (this issue) dismisses as scientifically suspect the alternative method of using grammaticality judgements. The fact that it is difficult to reliably infer grammaticality is one of the arguments he uses to support his claim that the concept of grammaticality should be more or less ignored in grammar development. I disagree to a certain extent: in building treebanks (which are now a fundamental ingredient for natural language processing), we rely on the linguistic intuition of the treebank annotators to parse sentences, and I think that it would be useful to view the grammaticality judgement process in a similar way, as a necessary evil. Grammaticality judgements, although undoubtedly problematic, can be used to effectively carve out a grammatical/ungrammatical distinction (albeit not a particularly exciting one), and I focus for the rest of the article on how this might be achieved, dealing particularly with the problems of sentential context and defining what it means for a sentence to be ungrammatical.

