A universal data model for linguistic annotation tools
Since its inception, the E-MELD project has advocated the use of best practice in the digitization and markup of language data. Best practice means creating resources that are "...long-lasting, accessible, and re-usable by other linguists and speakers" 1. One of the best-articulated calls for best practice was presented by Bird & Simons (2003) and subsequently adopted by the E-MELD community. Focusing directly on the tasks of language documentation and language description, they emphasize seven steps towards best practice for digital data, which together they refer to as the "portability of language data". One of the steps, which Bird & Simons (2003) term format, concerns best practice in encoding and markup. Under this general rubric, several data models have been proposed, in particular for dictionaries, paradigms, and interlinear text. While these proposals emphasize structural and encoding compatibility, mostly by recommending XML and Unicode respectively, the resulting data models are not necessarily interoperable with respect to content. One suggestion, proposed by Lewis, Farrar & Langendoen (2001) and Bird & Simons (2003), was to go beyond mere structural and encoding interoperability and relate the elements of the various data models to a common markup ontology, such as the General Ontology for Linguistic Description (GOLD) (Farrar & Langendoen 2003). As yet, however, there have been few suggestions on how to tie data format to data content, specifically concerning the relationship to an ontology. This paper attempts to fill the gap by describing a data model that has the potential to facilitate migration of XML to richer structures. The data model advocated here has as its impetus the need to share the same data among annotation tools with very different purposes, so that different aspects of the same data can be manipulated by each kind of tool.
For example, consider a lexicon creation tool such as FIELD (Aristar 2003), whose output is a highly structured lexicon. If the data were structured according to a universally recognized model, the results could then be loaded into another kind of tool, for instance, one that produces interlinear text based on the lexicon, or one that adds detailed phonetic annotation to each entry. The most important requirement is that the data exchange format accommodate the fundamental linguistic data types, both of the traditional print variety (e.g., dictionary entries and interlinear text) and of a more technical nature, such as those used in natural language processing applications (e.g., treebanks and computational lexicons). In § 2, we examine these data types to create an inventory of basic elements that serve as the basis for a mapping to an ontology. Another important requirement of the model is that it be conversion "friendly", not only to ensure that the data can be displayed in a human-readable format, but, most importantly, to ensure that the data is compatible with various tools, that is, migratable to a semantically interoperable form. Thus, the main design issues surrounding display- versus content-oriented data structures are discussed in detail in § 4. Third, we discuss the role of current markup standards and how they can be leveraged to create a more structured data exchange format. To that end, in § 5 we discuss the use of the Resource Description Framework (RDF) and the Web Ontology Language (OWL) as a means of adding more structure to vanilla XML. Once these desiderata are established, we present our data model in § 6.
Descriptive linguistics is a discipline that is – in no small way – driven by tradition. This can be seen in the data types that linguists generally use to present analyses of language data. A linguistic data type is any structured entity that acts as a container for annotated data and the elements of annotation, often referred to as the analysis. For example, the tradition of using interlinear glossed text (IGT) is particularly salient in print journals. In various descriptive grammars, on the other hand, there is the tradition of using phonological and morphological paradigms, essentially multidimensional tables showing feature systems of a language. Of course there is the lexicography tradition that focuses largely on how to display or organize lexical entries in a format that is maximally beneficial for the human user in a print environment. There are also traditions of using tree diagrams for morphological, syntactic, and phonological structure. Not to be left out is the tradition of using phonological and syntactic rules in, for example, a deeper grammatical analysis. More recently, however, some branches of linguistics (specifically computational linguistics and natural language processing) have begun to place an emphasis on more formal data structures, used in language resources, specifically tailored for machine readability. For example, there are the treebank data structures that provide a means of representing structural descriptions in an efficient format. Furthermore, there are some very successful electronic dictionaries that encode lexical structures and relations to be read by a computer. Whether print-based or electronic, all these entities can be considered as linguistic data types. The following section presents a discussion of the fundamental data types 2 and summarizes the explicit and implicit content most often associated with each type.
The first data type to consider is interlinear glossed text (IGT), which is characterized by a tabular presentation of morphemes and their labels, usually aligned vertically on the page. Bow, Hughes & Bird (2003, Sec. 3) cite the association of a morpheme with a label as the most consistent feature of IGT. An instance of IGT starts with a segment of text. Usually, the text is presented in some recognized orthography. But as noted by Peterson (2000), there can be several layers of transcription, including a phonetic or phonemic transcription, but also transcriptions in a second orthography. The text is segmented to show morpheme, word, and possibly phrase boundaries. Then, there are the glosses of the morphemes that compose the text, either in the form of lexical items from the language of description or abbreviations for grammatical or semantic categories (e.g., 3PL, PAST, ANIM). This is followed by a free translation in the language of description or some other language of scholarship. This is essentially the explicit information given in an instance of IGT. There is, on the other hand, implicit information that adds to the basic entities discussed above. For instance, consider the Leipzig Glossing Rules, which recommend structures rich in information on morpheme type (Bickel, Comrie & Haspelmath 2004, pp. 2-7). First of all, clitic boundaries (and hence the existence of clitics) may be indicated with an equals sign between morphemes. The characterization of some morpheme as a portmanteau morpheme may also be present, indicated by a period in the gloss line. A grammatical property signaled by a morphophonological change of the stem (e.g., ablaut) is indicated using a backslash to separate the category label from the stem gloss. Also noted in the Glossing Rules are the agent-like and patient-like arguments of a verb. There are "bipartite elements" such as infixes and circumfixes, marked in a number of ways, and morpho-phonological information such as reduplication, indicated with a tilde.
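The implicit information carried by the boundary symbols can be made explicit mechanically. The following is a minimal sketch, assuming a word and its gloss are segmented with the same symbols (hyphen for an affix boundary, equals sign for a clitic boundary, tilde for reduplication) and that a period in the gloss joins the labels of a portmanteau. The names `Morph` and `align_word` are hypothetical, and the sketch deliberately ignores infixes, backslashes, and other conventions.

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass
class Morph:
    form: str          # segment of the transcribed word
    labels: List[str]  # gloss labels; more than one indicates a portmanteau
    boundary: str      # symbol preceding the morph ("" if word-initial)

def align_word(form: str, gloss: str) -> List[Morph]:
    """Pair each morph of one word with its gloss labels."""
    # Split on the boundary symbols, keeping them in the result list.
    form_parts = re.split(r"([-=~])", form)
    gloss_parts = re.split(r"([-=~])", gloss)
    # Odd positions hold the boundary symbols; both lines must agree.
    if form_parts[1::2] != gloss_parts[1::2]:
        raise ValueError("form and gloss lines are not segmented alike")
    morphs = []
    for i in range(0, len(form_parts), 2):
        boundary = form_parts[i - 1] if i > 0 else ""
        morphs.append(Morph(form_parts[i], gloss_parts[i].split("."), boundary))
    return morphs

# Latin "populusque" ('and the people'), segmented for illustration.
morphs = align_word("popul-us=que", "people-NOM.SG=and")
for m in morphs:
    print(m)
```

Here the equals sign identifies `que` as a clitic, and the two labels on `us` identify it as a portmanteau, without either fact being stated anywhere in the original lines.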
Next, we turn to paradigms. For our discussion, we simply highlight a few observations already made by Penton, Bow, Bird & Hughes (2004). According to this work, paradigms are perhaps the most pervasive linguistic data type found in the literature. The underlying model for any paradigm includes an ordered set of forms that show some contrast or systematic variation. This is summed up in the following working definition from Bird (1999), extended by Penton et al. (2004):
Penton et al. (2004, p. 6) also point out that "...linguistic paradigms simply represent an association between linguistic forms and linguistic categories." Important in paradigms, then, is the listing of specific features, construction types, or meanings with which to order the illustrative forms. The generalization that is put forward is as follows:
That is, while paradigms are usually presented in tabular format in print materials, Penton et al. (2004) propose the above underlying structure, meant to describe the information in most paradigms surveyed in the literature.
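To illustrate the idea that a paradigm is an association between forms and categories, independent of any tabular layout, consider the following minimal sketch. It is our own simplification and not the structure proposed by Penton et al. (2004): a paradigm is stored as a mapping from tuples of category values to forms, and the table is derived only when needed. The function name `render_table` and the choice of data are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Present indicative of Latin "amare" ('to love'), keyed by (person, number).
paradigm: Dict[Tuple[str, str], str] = {
    ("1", "SG"): "amo",
    ("2", "SG"): "amas",
    ("3", "SG"): "amat",
    ("1", "PL"): "amamus",
    ("2", "PL"): "amatis",
    ("3", "PL"): "amant",
}

def render_table(cells: Dict[Tuple[str, str], str],
                 rows: List[str], cols: List[str]) -> str:
    """Lay out the form/category associations as a tab-separated table."""
    lines = ["\t".join([""] + cols)]
    for r in rows:
        # A missing combination is shown as "-" rather than raising an error.
        lines.append("\t".join([r] + [cells.get((r, c), "-") for c in cols]))
    return "\n".join(lines)

table = render_table(paradigm, rows=["1", "2", "3"], cols=["SG", "PL"])
print(table)
```

Because the stored content is only the association itself, swapping the row and column dimensions or listing the forms one per line requires a different rendering function, not a different data structure.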
We now consider the most widespread type of linguistic resource: the dictionary. Dictionaries and their accompanying entries are perhaps the most codified of the data types under discussion, considering that there are fields dedicated to their study, namely lexicography and lexicology. Though the contents of dictionary entries vary widely, there are some general consistencies that can be identified. In their survey of various print dictionaries, for example, Bell & Bird (2000) show that a general model for a dictionary entry can be achieved. The body of an entry contains: pronunciation information, usually in the form of a phonetic transcription; morphosyntactic information (syntactic categories, features, etc.); sense information in the form of a definition, semantic realm, or semantic features; mapping information that provides ordering to the set of lexemes; and finally, optional miscellaneous information concerning, for example, "etymology, obsolescence, cross-references, register, informant identity...". Here we simplify the results from Bell & Bird (2000) for the body of a dictionary entry:
Body = {Pron, MSI, Sense, Mapping, (Aux)}
It is clear even from this survey of print resources that a dictionary entry can contain information of a much more varied and open-ended type as compared to the other data types reviewed thus far. Expanding the discussion now to include a broader collection of resources, we cite Calzolari, Grishman & Palmer (2001), who have conducted a survey of existing electronic lexical resources including, among other things, machine-readable dictionaries and computational lexicons. Machine-readable dictionaries are essentially electronic versions of print dictionaries, but "...lack an explicit representation of linguistic information such as inflectional class, obligatory complements, alternations, regular polysemy, etc" (Calzolari et al. 2001, p. 229). Computational lexicons, on the other hand, contain "...explicit morphosyntactic, syntactic and semantic knowledge, partly through an extensive work of extraction from corpora" and are mostly monolingual, though "founded on well-established theoretical frameworks" (Calzolari et al. 2001, p. 229). What perhaps has the potential to set these electronic resources apart from their print counterparts is the inclusion of rich semantic information, for example, "Reference to an ontology of types which are used to classify word senses..." and "[d]ifferent types of relations (e.g. synonymy, antonymy, meronymy, hypernymy, Qualia Roles, etc.) between word senses, etc." (Calzolari et al. 2001, p. 18). This research confirms the findings of Bell & Bird (2000) but also shows that these electronic resources may go beyond even the most complex print dictionaries.
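The simplified entry body given above, Body = {Pron, MSI, Sense, Mapping, (Aux)}, can be sketched as a record type. This is only an illustration under our own assumptions: the field types, the use of a sort key for Mapping, and the class name `EntryBody` are hypothetical and are not taken from Bell & Bird (2000).

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EntryBody:
    pron: str                # Pron: pronunciation, e.g. a phonetic transcription
    msi: Dict[str, str]      # MSI: morphosyntactic information
    sense: List[str]         # Sense: definitions, semantic realm, or features
    mapping: str             # Mapping: here, a key used to order lexemes
    # (Aux): optional and open-ended, so it defaults to an empty mapping.
    aux: Dict[str, str] = field(default_factory=dict)

entries = [
    EntryBody("kæt", {"pos": "noun"}, ["a small domesticated feline"], "cat"),
    EntryBody("bæt", {"pos": "noun"}, ["a nocturnal flying mammal"], "bat",
              aux={"register": "neutral"}),
]

# The Mapping component imposes the ordering on the set of lexemes.
ordered = sorted(entries, key=lambda e: e.mapping)
print([e.mapping for e in ordered])
```

Keeping the auxiliary information in a separate, optional component reflects the observation that this part of an entry is the least predictable across dictionaries.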
Treebanks are data structures containing rich syntactic information. For instance, the Penn Treebank (Marcus, Santorini & Marcinkiewicz 1994) contains information on tokenization, part of speech, constituency, and syntactic function. Furthermore, other, more subtle syntactic information can be encoded, such as trace information produced by movement operations (Cotton & Bird 2002, p. 2). Some treebanks are designed to show dependency relations among syntactic elements, e.g., the Prague Dependency Treebank (Hajic, Böhmová, Hajicová & Vidová-Hladká 2000). Beyond syntactic information, a number of treebanks also include information on morphological categories. For instance, various HPSG-based treebanks show explicit information concerning morphological and syntactic features, e.g., in the BulTreeBank (Simov, Popova & Osenova 2001) for Bulgarian. Also, treebanks may be enriched with semantic information, such as topic and focus in the Prague Dependency Treebank (Hajic et al. 2000, p. 15) or predicate-argument structure and semantic role information in the Susanne corpus (Sampson 1995). Finally, aimed at providing deeper syntactic and semantic annotation of the Penn Treebank, the PropBank project (Kingsbury & Palmer 2002) also contains predicate-argument information, but adds specific semantic markup of verb modifiers, e.g., directional, locative, or manner elements.
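To make the information content of a constituency treebank concrete, the following minimal sketch reads a bracketed structure of the general kind used for Penn Treebank trees and recovers two of the layers mentioned above, constituency and part of speech. The parser is a simplification written for this illustration: it assumes well-formed input, does not handle traces or function tags, and the function names are hypothetical.

```python
from typing import List, Tuple, Union

Tree = Tuple[str, list]  # (label, children); a child is a Tree or a word

def parse_brackets(text: str) -> Tree:
    """Parse a string such as "(NP (DT the) (NN dog))" into nested tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse() -> Tree:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children: List[Union[Tree, str]] = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())   # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return parse()

def tagged_leaves(tree: Tree) -> List[Tuple[str, str]]:
    """Return (word, part-of-speech) pairs in left-to-right order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_leaves(child))
        else:
            pairs.append((child, label))
    return pairs

tree = parse_brackets("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
print(tagged_leaves(tree))
```

The same nested structure thus yields a tokenized, part-of-speech-tagged word list, which is the kind of overlap with lexical and morphological data types that a general model can exploit.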
Now that we have reviewed the fundamental linguistic data types, it should be clear that the data types overlap significantly with one another in terms of their information content. For instance, dictionaries may contain substantial morphological information on the headword, such as its syntactic category (cf. Treebanks) or its morphological features (cf. IGT). On the other hand, morphological markup, in the form of IGT, contains a significant amount of lexical information – enough, perhaps, to create a dictionary, provided there were an adequate number of lexical items represented in the IGT instances. Then, there are morphosyntactic paradigms that contain morphosyntactic feature names and values, the information content of which overlaps with that of IGT, namely, feature values. Furthermore, throughout all of these data types, the most basic entity that shows up again and again is a transcription of linguistic form. Form comes in the guise of the headword in a dictionary entry, the contents of the cell in a paradigm, and the elements in the first line of IGT. Because of this overlap, it seems quite reasonable to reuse as much material as possible to arrive at an underlying, general model. Discovering the generalities expressed by the data types requires being very specific about the type of linguistic object that is being represented: this is precisely what the developers of GOLD have intended by creating a markup ontology. Thus, identifying the linguistic object in each data type is an ontological issue. But instead of delving into an in-depth ontological discussion, we take a more practical approach in developing the model. Our aim, then, is similar to that of some computer scientists who model linguistic data:
From an ontological standpoint, one of the most basic questions to ask is whether an element of annotation is relational. A phonetic transcription, for example, is not considered relational: it is a first order representation of the segmental aspect of raw data. Consider, though, a headword in a bilingual dictionary entry and the associated translation. There is an implied relation between the headword and the translation. Morphological annotations and treebanks, by definition, contain implied morphological and syntactic constituency relations between explicitly represented grammatical elements. Once the basic distinction between relational and non-relational elements is made clear, it is also important to keep in mind the classification of other entity types. For instance, the parts of speech (noun, verb, adjective, etc.) are not the same kinds of entities as grammatical categories (case, tense, number, etc.). But even more fundamentally, there should be a strict delineation between, for example, semantic concepts and grammatical concepts.
Essentially, we need a way to combine the variety of data objects represented in the fundamental data types. It is tempting to create an arbitrarily complex data type whose contents subsume all the elements of the fundamental types. However, any general model should be compatible with linguistic theory and not be an ad hoc collection of data objects, insofar as this is possible. A solution is to use the notion of the linguistic sign (de Saussure 1959/1915, Hjelmslev 1953) as the basis of our data model. Though direct discussion of the linguistic sign is not usually considered a current topic in linguistics, the nature of the sign is still somewhat controversial, cf. Hervey (1979). Therefore, we include here a brief discussion of the basics of our approach to the sign. A linguistic sign is a 3-tuple ⟨F, M, G⟩ consisting of a form component F, a meaning component M, and a grammatical component G. For each linguistic sign, there must be some language L to which the sign belongs.
We define linguistic form F as any annotation entity that represents the phonetic, phonological, orthographic, or otherwise physical manifestation of the sign (e.g., transcription of hand shape for a sign language). As for the meaning component M, this represents the concept which the sign signifies. By meaning component, we refer specifically to semantic units or features of semantic units, e.g., the concept dog or the feature [+Animate]. We do not include in M annotation entities such as the definitions of lexical items or the translations of headwords. While definitions and translations do provide additional semantic annotation, they are essentially shortcuts that rely on form components of other signs. We consider such information as auxiliary to the sign. If the meaning component is annotated, as it is sometimes in dictionaries or instances of IGT, then the units come from an ontology of (possibly language independent) concepts. Finally, the grammatical component G refers to the morphological or syntactic characteristics of the sign. Included here are categories such as the part of speech and morphosyntactic features and values. As an example of a possible XML serialization of this model, consider the following:
Turning to the opposite problem, there is usually more information in annotated data than just the linguistic sign. We have already mentioned translations and etymology, but the list is quite open-ended. Consider that a dictionary entry is one of the most heterogeneous of the data types. It may contain additional information such as semantic realm (e.g., botany), register (e.g., colloquial), and information regarding the speaker (e.g., age=35). We recommend not requiring such information be present with the sign, as with the translation element in the above example. Instead, we suggest creating relations for such auxiliary information which may be linked to the sign. Note that we are focusing on content only, trying to delineate pure linguistic from auxiliary information. In the next section, we present a more specific discussion of content.
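The separation between the sign and its auxiliary information can be sketched in code. The following is a minimal illustration under our own assumptions and is not the serialization from the original paper: the class names `Sign` and `AuxRelation`, the element and attribute names, and the concept identifier are all hypothetical. The sign holds only F, M, G, and its language L; a translation is stored as a separate relation that points to the sign by identifier.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Sign:
    sign_id: str
    language: str            # L: the language to which the sign belongs
    form: str                # F: transcription of the form
    meaning: List[str]       # M: identifiers of concepts in some ontology
    grammar: Dict[str, str]  # G: part of speech, features, and values

@dataclass
class AuxRelation:
    kind: str     # e.g. "translation", "register", "etymology"
    sign_id: str  # the sign to which the information is linked
    value: str

def to_xml(signs: List[Sign], relations: List[AuxRelation]) -> ET.Element:
    """Serialize signs and, separately, the relations linked to them."""
    root = ET.Element("annotation")
    for s in signs:
        el = ET.SubElement(root, "sign", id=s.sign_id, lang=s.language)
        ET.SubElement(el, "form").text = s.form
        meaning = ET.SubElement(el, "meaning")
        for concept in s.meaning:
            ET.SubElement(meaning, "concept", ref=concept)
        grammar = ET.SubElement(el, "grammar")
        for name, value in s.grammar.items():
            ET.SubElement(grammar, "feature", name=name, value=value)
    for r in relations:
        # Auxiliary information lives outside the sign element.
        rel = ET.SubElement(root, "relation", type=r.kind, sign=r.sign_id)
        rel.text = r.value
    return root

perro = Sign("s1", "spa", "perro", ["Dog"], {"pos": "noun", "gender": "masculine"})
root = to_xml([perro], [AuxRelation("translation", "s1", "dog")])
print(ET.tostring(root, encoding="unicode"))
```

With this arrangement, a tool that needs only the signs can ignore the relation elements entirely, while a dictionary tool can follow the `sign` attribute to attach translations, register, or speaker information.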
In the survey of data types presented in § 2, we emphasized how a general model for annotated data must make certain commitments as to content. By content, we simply refer to all elements that can be considered linguistic data or annotation. We contrast elements of content with those of display, that is, those entities that pertain to how data and annotation are to appear on the page. To illustrate the difference, consider two types of markup elements in HTML. The first type includes tags for unordered lists (ul), list items (li), and table data (td). The second includes tags for italics (i), bold (b), and line breaks (br). Whereas the tags in the first group act more like containers for structuring data, the tags in the second control how the data is displayed on the page. Of course, the first group also determines how the data are to be displayed, but the second is solely for display.
Consider XML markup, our central concern, which provides a very general (tree-like) structure for encoding all kinds of data. It provides the ability to specify type and token information and various relationships among data. As such, XML is not intended to be a display-centric format; rather, it is a format that also allows explicit structure. It is tempting to use XML for encoding display information. However, little is actually gained by encoding display concepts at the level of abstraction for which XML was intended. For instance, consider a hypothetical markup scheme for IGT.
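The contrast can be sketched with two hypothetical encodings of the same IGT instance. Neither is drawn from a published scheme; the element names are assumptions made for illustration. The display-oriented version records only lines as they would appear on the page, whereas the content-oriented version records the association between each morph and its gloss.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Display-oriented: three lines and an italic tag, nothing more.
DISPLAY = """
<igt>
  <line>perr-o-s</line>
  <line>dog-M-PL</line>
  <line><i>dogs</i></line>
</igt>
"""

# Content-oriented: each morph is paired with its gloss explicitly.
CONTENT = """
<phrase>
  <word>
    <morph><form>perr</form><gloss>dog</gloss></morph>
    <morph><form>o</form><gloss>M</gloss></morph>
    <morph><form>s</form><gloss>PL</gloss></morph>
  </word>
  <translation>dogs</translation>
</phrase>
"""

def morph_glosses(xml_text: str) -> List[Tuple[str, str]]:
    """Collect (form, gloss) pairs from every morph element, if any exist."""
    root = ET.fromstring(xml_text)
    return [(m.findtext("form"), m.findtext("gloss")) for m in root.iter("morph")]

print(morph_glosses(CONTENT))  # pairs are directly available
print(morph_glosses(DISPLAY))  # nothing to query without re-parsing the lines
```

Recovering the same pairs from the display-oriented version would require splitting the text of the first two lines and guessing which line plays which role, which is exactly the interpretive work that a content-oriented encoding makes unnecessary.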
The E-MELD and OLAC communities have set out to address the larger issues concerning digitization: accounting for authorship, data provenance, language identification, just to name a few. We think these issues have largely been solved, namely by advocating systems of metadata to be embedded within each document instance. In terms of advocating specific markup schemes, the issue is more complex. As has been argued at many E-MELD sponsored events, XML is a useful markup language for linguistic annotation because, among other reasons, it offers a more structured syntax than do other alternatives, for example, HTML or Shoebox code. One reason to have more structure is to facilitate migration which requires the interpretation of markup perhaps orthogonal to, or even at odds with, its original purpose, as summed up here:
In this section we turn to the specific issue of structure and XML and advocate some additions to take advantage of recent developments in markup languages, in particular, the use of RDF (Lassila & Swick 1999) and OWL (McGuinness & van Harmelen 2004). The main advantage of using XML, rather than less structured markup languages, is that the XML may be manipulated, e.g., via XSL transformations (W3C 2001), and thus migrated to other formats suitable for specific tasks like human-oriented display, database applications, manipulation by specific programming languages, an observation summarized by Sperberg-McQueen & Miller (2004). We argued in § 4 for a content-based data model over one that is display-based. It will likely turn out that creating an interoperable data model renderable in a variety of display formats is relatively straightforward, even for content-centric formats such as the one for paradigms described by Penton et al. (2004, p. 1): "The range of presentations possible for the same data set indicate that the underlying structure of the paradigm can be rendered into a variety of visual formats." The idea is that once an adequate data model is established for content interoperability, the difficult work is done, and various stylesheets can be constructed for displaying the data in a variety of ways. We now turn to the more complex issue of designing a model for migration to a semantically interoperable format.
Consider the work of Simons (2003) and Simons (2004), which we consider an excellent test case for such a migration task. Simons developed the Semantic Interpretation Language (SIL) to transform semi-structured data in XML to highly-structured data in RDF serialized as XML. The SIL is a generalized framework implemented using XML and XSL that formally maps the elements and attributes of best practice XML resources to a common semantic schema, vis-à-vis an ontology. The strength of the SIL is that it provides the means to manipulate the original XML at both the syntactic and the semantic level, once the semantics of the markup is defined according to a metaschema (Simons 2003). The metaschema is a document consisting of a set of directives in the SIL language that instructs the processor on how to interpret the original markup elements according to the concepts of the semantic schema. Furthermore, the metaschema formally interprets the original markup structure by declaring what the dominance and linking relations in the XML document structure represent. We have demonstrated in Simons, Fitzsimons, Langendoen, Lewis, Farrar, Lanham, Basham & Gonzalez (2004) and Simons, Lewis, Farrar, Langendoen, Fitzsimons & Gonzalez (2004) that the migration process can be successfully implemented in a scalable, systematic fashion. However, the creation of a metaschema document is not at all straightforward. A particular challenge is determining the meaning of relationships within the document tree. For example, whereas the actual XML document tree consists of constituency relations specific to the Document Object Model (DOM), authors of XML documents often give these relations an implicit meaning. This suggests that methods such as using the SIL language can be made more transparent if such relations are encoded directly. The first structural design principle, then, is to explicitly encode the relations in the XML, and encode them as elements. Consider, for example, the following XML code from Bow et al. (2003) representing a partial instance of IGT:
But even more basic perhaps is the challenge of interpreting non-relational markup tags. Bird & Simons (2003) advocate using tags that are compatible with elements in an ontology, e.g., GOLD. In other words, "[m]ake sure that every element comes from a specific namespace," and ensure that the namespace is from a recognized ontology, rather than "making up your own URIs" (DuCharme & Cowan 2002). For instance, to simplify matters, the default namespace for the XML instance document could be the ontology itself.
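Both structural principles, relations encoded as elements and tags drawn from an ontology namespace, can be checked mechanically. The sketch below is an illustration under stated assumptions: the namespace URI is a placeholder and not the actual GOLD URI, and the element names (`hasPartOfSpeech`, `Noun`, and so on) are hypothetical rather than taken from GOLD. The function reports any element that falls outside the declared namespace.

```python
import xml.etree.ElementTree as ET
from typing import List

ONTOLOGY_NS = "http://example.org/gold#"  # placeholder URI, an assumption

# The ontology is the default namespace, so every element belongs to it.
# Relations (hasPartOfSpeech, hasNumber) are encoded as elements.
GOOD = f"""
<sign xmlns="{ONTOLOGY_NS}">
  <form>perros</form>
  <hasPartOfSpeech><Noun/></hasPartOfSpeech>
  <hasNumber><Plural/></hasNumber>
</sign>
"""

# Here one tag was made up locally and carries no namespace at all.
MIXED = f"""
<g:sign xmlns:g="{ONTOLOGY_NS}">
  <g:form>perros</g:form>
  <myCategory>N</myCategory>
</g:sign>
"""

def elements_outside(xml_text: str, namespace: str) -> List[str]:
    """Return the tags of all elements not qualified by the given namespace."""
    root = ET.fromstring(xml_text)
    prefix = "{" + namespace + "}"  # ElementTree writes tags as {uri}local
    return [el.tag for el in root.iter() if not el.tag.startswith(prefix)]

print(elements_outside(GOOD, ONTOLOGY_NS))
print(elements_outside(MIXED, ONTOLOGY_NS))
```

A check of this kind only confirms that a tag claims to come from the ontology; whether the ontology actually defines a concept of that name would require consulting the ontology itself.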
To summarize, we have discussed best-practice markup for language resources
not only in terms of format but also in terms of content. We have argued that
by including tighter control over the content of markup, migration to semantically
interoperable formats can be facilitated. Furthermore we have discussed
the need for such a content-based model in the design of annotation tools. To
arrive at recommendations for content, we have surveyed various best-practice
approaches for the fundamental data types, including linguistic paradigms, interlinear
glossed text, dictionaries and lexicons, and treebanks. We then turned
to a discussion of the virtues of content- over display-oriented data models. Finally,
we gave a few recommendations on how to add even more structure to
existing XML models by using constructs from RDF and OWL. The overall
recommendations for the model are summarized here:
1 See http://emeld.org/school/what.html.
Aristar, A. (2003), FIELDL, Technical report, presented at the Workshop on
Digitizing and Annotating Texts and Field