Ega Case Study:
Multiple Formats to the Web

The content of this page was developed from the research of
Dafydd Gibbon, Bruce Connell, and Firmin Ahoua.

Introduction

The data presented here form a case study in securing interpretability for Ega, an endangered language of Ivory Coast for which only a small body of legacy data resources is available. The original data were collected in a variety of formats and with non-Unicode fonts, all of which had to be converted to best-practice formats.

The Ega data is currently being collected by Dafydd Gibbon (Bielefeld), Bruce Connell (Oxford), and Firmin Ahoua (Cocody), who have agreed to make their primary materials (including a lexicon, audio recordings, and interlinear texts) available to the E-MELD Project.

Those charged with the conversion of the Ega data (Dafydd Gibbon of the University of Bielefeld, and Cathy Bow and Baden Hughes of the University of Melbourne) were confronted with multiple problems in processing the Ega legacy language documentation materials that were to be archived. These materials consisted of a lexicon, interlinear texts, annotated recordings, and linguistic descriptions. The problems were several: legacy fonts had to be converted, and a lexicon structure as well as phonetic and prosodic annotations had to be interpreted, all expressed in terminology from highly specialized linguistic traditions.

The Lexicon

The Ega project uses Shoebox for storing lexical data. This tool is a hybrid text markup editor and database system, with powerful search, display and output functions. Shoebox is popular for its flexibility: users can invent new fields on the fly, and re-order existing fields at will. It is also popular for its support of ad hoc character encodings: users can represent information from different languages in different fields. However, by the time an archived Shoebox file is accessed, these features have in practice become liabilities. There are three significant interpretability problems with such data. First, if character encodings are not documented, it may not be possible to recover the intended character from its encoding, rendering certain fields quite useless. Second, if the interpretations of field names (e.g. "lx" for lexeme) and abbreviated content (e.g. "N" for noun) are not documented, it may not be possible to recover the intended interpretation of the content. Finally, if the structure of an entry (i.e. the permissible sequence of fields) is not documented, it may not be possible to recover the intended relationships between fields (e.g. whether a field consisting of a comment, translation or cross-reference applies to the previous field or to the entry as a whole).
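
To make these interpretability problems concrete, the following minimal sketch (in Python) parses a hypothetical Shoebox-style entry and checks each backslash field marker against a small documentation table; the field markers, their expansions and the entry itself are invented for illustration and are not taken from the actual Ega lexicon. Any marker without a documented meaning is flagged, since its content cannot be safely interpreted later.

  # Minimal sketch, assuming invented field markers and an invented entry;
  # not the actual Ega lexicon data.
  FIELD_DOC = {
      "lx": "lexeme",
      "ps": "part of speech",
      "ge": "gloss (English)",
  }

  entry = """\\lx aba
  \\ps N
  \\ge house
  \\cf abaka"""

  def parse_entry(text):
      """Return (marker, value) pairs from one backslash-coded record."""
      fields = []
      for line in text.splitlines():
          line = line.strip()
          if line.startswith("\\"):
              marker, _, value = line[1:].partition(" ")
              fields.append((marker, value.strip()))
      return fields

  for marker, value in parse_entry(entry):
      meaning = FIELD_DOC.get(marker)
      if meaning is None:
          print(f"\\{marker}: undocumented marker; value {value!r} cannot be safely interpreted")
      else:
          print(f"\\{marker} ({meaning}): {value}")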

Fortunately, in this case the conversion team was working with those who had produced the files, and was thus able to convert the material appropriately.

More on Lexicons

Character Encodings

Legacy character encodings cannot be interpreted accurately without access to the original font definition along with suitable software and expertise. This problem is not specific to languages that use exotic characters; it is relevant even to documentation that uses just a single font. The case of the Ega language documentation was quite typical: a variety of fonts and encodings were used. It is unrealistic to assume that future users of archived Ega resources will have access to the software and expertise needed to interpret the character encodings correctly. It was therefore necessary to convert them to a standard representation.

The conversion team addressed these problems by first creating a table of correspondences between the characters used in the Ega documentation and Unicode. The table was written in XML and was used by a tool which augmented each field in the source data with additional fields containing the same data in Unicode. Applying the table to the data led to the identification of unmapped characters and to refinement of the table. The table itself was then archived as formal documentation of the legacy encodings. Additionally, the table can be extended to cover a variety of complete legacy fonts, securing interpretability for legacy documentation of other languages.
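
The sketch below (Python) illustrates, under invented names, how such a correspondence table can be applied: an XML table maps legacy characters to Unicode code points, each field value is converted, and characters with no mapping are collected so the table can be refined. The element and attribute names (charmap, mapping, legacy, ucs) and the sample mappings are assumptions for this example, not the actual E-MELD table.

  # Minimal sketch, assuming an invented XML correspondence table; the
  # actual E-MELD table and its element names may differ.
  import xml.etree.ElementTree as ET

  TABLE_XML = """<charmap font="LegacyEgaFont">
    <mapping legacy="E" ucs="025B"/>
    <mapping legacy="O" ucs="0254"/>
  </charmap>"""

  def load_table(xml_text):
      """Build a dict from legacy characters to Unicode characters."""
      root = ET.fromstring(xml_text)
      return {m.get("legacy"): chr(int(m.get("ucs"), 16)) for m in root.iter("mapping")}

  def to_unicode(field_value, table, unmapped):
      """Convert one field value; record non-ASCII characters the table misses."""
      out = []
      for ch in field_value:
          if ch in table:
              out.append(table[ch])
          else:
              out.append(ch)
              if not ch.isascii():
                  unmapped.add(ch)      # candidates for refining the table
      return "".join(out)

  table = load_table(TABLE_XML)
  unmapped = set()
  print(to_unicode("EbE", table, unmapped))   # -> "ɛbɛ" under this invented mapping
  print("characters not yet mapped:", unmapped)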

More on Unicode

Interlinear Glossed Text

Interlinear text is a common presentation format for the expression of linguistic information. Although some specialised tools exist for creating and manipulating interlinear text, most legacy interlinear material depends on visual interpretation for alignment, not on explicit data structures. Alignment is thus usually lost in conversion between formats, between application instances, and between media. In the case of Ega, only very basic instances of interlinear material existed, with phrases and free translations but no detailed morphosyntactic markup.

The team addressed the problem by translating the Ega interlinear sources into the E-MELD model for interlinear text, expressed in a four-level XML-based format. This process had three distinct stages: converting the original interlinear text into a tabular format, converting the tabular format into a tree structure, and then expressing this tree as XML. The team developed a translation tool which automates the mapping from a table structure to XML. In order to handle the specific typological properties of Ega, the granularity of the morphological tier of the E-MELD model was increased to include tiers for lexical tone, morphosyntactic tone, morphological category and morphological paradigm.
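
As a sketch of the final stage described above, the following Python fragment builds a small nested XML structure from a tabular interlinear representation. The element names (phrase, word, morpheme, gloss, translation) and the sample words are illustrative assumptions only; they are not the actual E-MELD IGT schema or real Ega data, and they omit the additional tone and paradigm tiers mentioned above.

  # Minimal sketch, assuming invented element names and invented data;
  # not the actual E-MELD four-level IGT schema.
  import xml.etree.ElementTree as ET

  # One row per word: (word form, [(morpheme form, gloss), ...])
  table = [
      ("aba", [("aba", "house")]),
      ("ni",  [("ni", "in")]),
  ]
  free_translation = "in the house"

  phrase = ET.Element("phrase")
  for word_form, morphemes in table:
      word = ET.SubElement(phrase, "word", form=word_form)
      for m_form, m_gloss in morphemes:
          morpheme = ET.SubElement(word, "morpheme", form=m_form)
          ET.SubElement(morpheme, "gloss").text = m_gloss
  ET.SubElement(phrase, "translation").text = free_translation

  print(ET.tostring(phrase, encoding="unicode"))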

More on IGT

Annotated Recordings

The available audio and video recordings are of questionnaire-based interviews, narratives and other interactions; these primary data are available on DAT and miniDV, and as conversions to WAV, MP3 and AVI formats on CD-ROM. Conversion across signal data formats can lose temporal resolution (and consequently precision of alignment), frequency resolution, and spectral fidelity, especially when converting to lossy formats such as MP3; all of these losses potentially damage the prospects for future linguistic and phonetic analysis and for use in computational information retrieval applications. The files had been annotated using a variety of tools (Praat, Transcriber, esps/waves+, TASX), resulting in a variety of annotation formats and potentially different degrees of precision of alignment. In addition, the time-stamps in some annotation tools are point-based, while in others they are interval-based; the former require additional conventions stating how they apply to intervals.

Securing the long-term interpretability of varied and proprietary binary signal formats on magnetic and CD media is a complex and specialised task being addressed by engineers and archivists worldwide. The team did not take on this task, except to prefer non-compressed data formats and to preserve the available temporal, frequency and amplitude resolutions; they concentrated instead on securing the annotations. For this purpose the TASX XML format was used, a suite of Perl scripts was developed for converting between the various annotation formats and TASX XML, and the results were validated by reverse conversion and file comparison.
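
The fragment below is a minimal Python sketch of the same idea: interval annotations (start time, end time, label) are written into a TASX-like XML layer and then read back to check that nothing was lost. The element and attribute names follow the general shape of TASX but are assumptions, as are the annotation data; a point-based annotation would carry a single time-stamp and would first need an explicit convention mapping it onto an interval.

  # Minimal sketch, assuming TASX-like element names (layer, event, start, end)
  # and invented annotation data; the actual schema used by the team may differ.
  import xml.etree.ElementTree as ET

  intervals = [
      (0.00, 0.42, "aba"),
      (0.42, 0.97, "ni"),
  ]

  layer = ET.Element("layer", name="words")
  for start, end, label in intervals:
      event = ET.SubElement(layer, "event", start=f"{start:.3f}", end=f"{end:.3f}")
      event.text = label

  xml_text = ET.tostring(layer, encoding="unicode")
  print(xml_text)

  # Round-trip check in the spirit of validation by reverse conversion:
  parsed = [(float(e.get("start")), float(e.get("end")), e.text)
            for e in ET.fromstring(xml_text).iter("event")]
  assert parsed == intervals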

More on audio conversion

More on video conversion

Linguistic Descriptions

A major interpretability problem arose with the morphosyntactic descriptions of Ega. The nomenclature of different linguistic traditions often has a common core, but details vary greatly, and the different terminologies associated with different national languages compound the problem. There is currently little support for the individual linguist in deciding which terms to adopt and how to express them in the data, or in relating these terms to higher-level cross-linguistic ontologies that allow for consistency of semantic content.

To address these problems, the team used GOLD, the General Ontology for Linguistic Description, as a potentially suitable source of consistent morphosyntactic terminology. The process of assessing the linguistic annotation terminology and creating the relevant mapping to a higher-level ontology involved several steps. First, they assessed the terminology used within the Ega description; then they considered terminological idiosyncrasies in the English and French descriptions. Finally, they identified correspondences with GOLD categories and added these terms as annotations of the Ega descriptions.
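
The Python sketch below illustrates the kind of mapping produced by the last step: project-internal labels from the English and French descriptions are annotated with candidate GOLD concept names. The particular labels and concept names here are illustrative assumptions, not the actual correspondences identified for Ega.

  # Minimal sketch, assuming invented source labels and candidate GOLD
  # concept names; not the actual Ega-to-GOLD correspondences.
  TERM_TO_GOLD = {
      "N": "Noun",      # English-tradition abbreviation
      "nom": "Noun",    # French-tradition label for the same category
      "V": "Verb",
  }

  def gold_annotation(term):
      """Return the candidate GOLD concept for a source term, or None."""
      return TERM_TO_GOLD.get(term)

  for term in ["N", "nom", "adj."]:
      concept = gold_annotation(term)
      if concept is None:
          print(f"{term!r}: no GOLD correspondence recorded yet")
      else:
          print(f"{term!r} -> {concept}")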



Follow the path of the Ega data

  1. Get Started: Summary of the Ega conversion
  2. Build a Lexicon: Lexicons page (Classroom)
  3. Encode Characters: Unicode pages (Classroom)
  4. Create an IGT: IGT pages (Classroom)
  5. Convert Audio Data: Audio pages (Classroom)
  6. Convert Video Data: Video pages (Classroom)
  7. Utilize an Ontology: Ontology pages (Classroom)
