The Berkeley Interlinear Text Collector (BITC)
Presented by: Jeff Good and Ronald Sprouse, University of California, Berkeley
Project / Software Title: The Berkeley Interlinear Text Collector (BITC)
Project / Software URL:
Access / Availability: This software is available free of charge on request from the developer at ronald*_AT_* (though it is not designed or packaged for easy installation).
Description:

The Berkeley Interlinear Text Collector (BITC) is a system for collecting interlinear texts and is especially designed for group collaboration. BITC is installed on a network server, and users access it through a web browser. This arrangement suits group work, since practically anyone can join a project without having to install special software. Because all work contributes to a shared word list, every project participant benefits from the texts collected by others.

BITC was developed as part of the Ingush Grammar, Dictionary and Texts project and has also been used in several UC Berkeley Field Methods courses.

Collaborative, web-based interface: BITC uses a collaborative, web-based model for data input in which each researcher controls their own data but has access to all the data in the project corpus. Because it runs in a web browser, the input system is platform-independent on the user's end.

Hierarchical data model: The word and the sentence are at the core of the BITC data model. Besides sentence-level and word-level glosses, grammatical notes can be attached to particular words or to the whole sentence. Higher levels of structure can also be encoded, including paragraph, scene (for a play), and text.
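
The hierarchy described above can be pictured with a minimal sketch. This is not BITC's actual schema, which is not documented here; the class names, field names, and the `interlinear` helper are all assumptions made for illustration, and `wordA`/`wordB` are placeholder forms rather than real language data. Further levels such as scene could be added in the same way as `Paragraph`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    form: str    # word in the source language
    gloss: str   # word-level gloss
    notes: List[str] = field(default_factory=list)  # grammatical notes on this word


@dataclass
class Sentence:
    words: List[Word]
    translation: str  # sentence-level gloss
    notes: List[str] = field(default_factory=list)  # notes on the whole sentence

    def interlinear(self) -> str:
        """Render source line, gloss line, and translation, column-aligned."""
        widths = [max(len(w.form), len(w.gloss)) for w in self.words]
        forms = "  ".join(w.form.ljust(n) for w, n in zip(self.words, widths))
        glosses = "  ".join(w.gloss.ljust(n) for w, n in zip(self.words, widths))
        return f"{forms.rstrip()}\n{glosses.rstrip()}\n'{self.translation}'"


@dataclass
class Paragraph:
    sentences: List[Sentence]


@dataclass
class Text:
    title: str
    paragraphs: List[Paragraph]


# Placeholder data (hypothetical forms, not taken from any real corpus).
sentence = Sentence(
    words=[Word("wordA", "dog"), Word("wordB", "run.PST", notes=["past tense"])],
    translation="The dog ran.",
)
text = Text(title="Sample text", paragraphs=[Paragraph(sentences=[sentence])])

print(sentence.interlinear())
```

Keeping notes as a list on both `Word` and `Sentence` mirrors the statement that notes may attach at either level.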

Flexible searching and on-the-fly concordance generation: BITC has a number of built-in search options that allow the user to look up words (or parts of words) across an entire corpus, in either the source language or the gloss language. The user may also search and retrieve records based on the notes field. Results are presented in one of two formats: 1) a simple list of glosses for each word found; 2) a list of glosses with hyperlinks to the records in which the word and its associated gloss occur. The 'hyperlink' option is essentially an on-the-fly concordance generator, which is very useful for keyword-in-context data retrieval.
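
The two result formats can be sketched roughly as follows. This is an illustration under stated assumptions, not BITC's implementation: the record layout, the function names (`find`, `gloss_list`, `concordance`), and the sample corpus are hypothetical, and a plain record identifier stands in for the hyperlink.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Hypothetical layout: a record is (record_id, [(form, gloss), ...]).
Record = Tuple[str, List[Tuple[str, str]]]

corpus: List[Record] = [
    ("text1.s1", [("wordA", "dog"), ("wordB", "run.PST")]),
    ("text1.s2", [("wordC", "cat"), ("wordB", "run.PST")]),
    ("text2.s1", [("wordA", "hound"), ("wordD", "sleep")]),
]


def find(corpus: List[Record], query: str,
         in_gloss: bool = False) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (record_id, position, form, gloss) for each word containing query."""
    for record_id, words in corpus:
        for i, (form, gloss) in enumerate(words):
            target = gloss if in_gloss else form
            if query in target:  # substring match covers "parts of words"
                yield record_id, i, form, gloss


def gloss_list(corpus: List[Record], query: str,
               in_gloss: bool = False) -> Dict[str, List[str]]:
    """Format 1: each matching word with its sorted, de-duplicated glosses."""
    found = defaultdict(set)
    for _, _, form, gloss in find(corpus, query, in_gloss):
        found[form].add(gloss)
    return {form: sorted(glosses) for form, glosses in found.items()}


def concordance(corpus: List[Record], query: str, in_gloss: bool = False,
                window: int = 1) -> List[Tuple[str, str, str]]:
    """Format 2: (record_id, gloss, keyword in context) for each match."""
    by_id = dict(corpus)
    lines = []
    for record_id, i, form, gloss in find(corpus, query, in_gloss):
        forms = [f for f, _ in by_id[record_id]]
        left = " ".join(forms[max(0, i - window):i])
        right = " ".join(forms[i + 1:i + 1 + window])
        lines.append((record_id, gloss, f"{left} [{form}] {right}".strip()))
    return lines


print(gloss_list(corpus, "wordA"))
print(concordance(corpus, "wordB"))
```

Both formats share one matching pass (`find`); the concordance simply keeps the record identifier and surrounding words that the gloss list throws away.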

Semi-automatic glossing: As each word is entered in a text, its gloss is stored as part of the project's shared word list. From then on, that gloss appears in the word's suggested gloss list every time the word is encountered in a text. This feature makes data entry faster and easier and encourages consistent glossing.
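
A minimal sketch of the suggestion mechanism, assuming the shared word list is a simple mapping from word forms to the glosses entered for them so far. The `SharedWordList` class and its method names are hypothetical and are not taken from BITC; how BITC orders or stores suggestions is not stated in the description above.

```python
from collections import defaultdict
from typing import Dict, List


class SharedWordList:
    """Project-wide word list: every gloss entered becomes a later suggestion."""

    def __init__(self) -> None:
        self._glosses: Dict[str, List[str]] = defaultdict(list)

    def record(self, form: str, gloss: str) -> None:
        """Store a gloss for a word, keeping first-entered order, no duplicates."""
        if gloss not in self._glosses[form]:
            self._glosses[form].append(gloss)

    def suggest(self, form: str) -> List[str]:
        """Return previously entered glosses for this word (empty if unseen)."""
        return list(self._glosses.get(form, []))


word_list = SharedWordList()
word_list.record("wordA", "dog")    # entered by one participant
word_list.record("wordA", "hound")  # entered by another participant
word_list.record("wordA", "dog")    # repeated entry adds nothing new

print(word_list.suggest("wordA"))
print(word_list.suggest("wordZ"))
```

Because the list is shared, a gloss entered by any participant is offered to all others, which is where the consistency benefit comes from.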

Metadata: The latest version of BITC allows metadata to be recorded as part of each file in a free-form metadata field. The user may record copyright information, access restrictions, or other types of data in this field. Currently BITC does not act on any restrictions recorded in the metadata, but a future version might.

Ongoing Issues and Future Direction
BITC has been very successful for the specific purpose for which it was originally designed: as a collaborative text collection tool for the Ingush project. It has also been used in some Field Methods courses at UC Berkeley, with some successes and some partial successes. We present some of the issues we have encountered in adapting BITC for use by projects other than the original. These issues suggest features that should be incorporated into an improved interlinear text collection system.

Extensibility: BITC installations require some custom programming for new projects, and it is not simple to add new field types. In fact, users have no control over which fields are available. They have some control over which fields are actually displayed at any particular time, but they can't add to the inventory of available fields. A better design would allow users to define fields on a per-project or per-text basis, e.g. a field for lexical tone for a language that has tone. Addition of new fields should be accomplished through an interface that does not require any programming knowledge by the user.
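
One way to picture the proposed per-project field definitions is the sketch below. It describes a possible design, not a feature BITC has; `FieldDef`, `Project`, `define_field`, and `validate` are hypothetical names, and the restriction to "word" and "sentence" levels is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List

LEVELS = ("word", "sentence")  # assumed annotation levels


@dataclass
class FieldDef:
    name: str
    level: str  # which unit the field annotates
    description: str = ""


@dataclass
class Project:
    name: str
    fields: Dict[str, FieldDef] = field(default_factory=dict)

    def define_field(self, name: str, level: str, description: str = "") -> None:
        """Add a field to this project's inventory; no code changes required."""
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        if name in self.fields:
            raise ValueError(f"field already defined: {name}")
        self.fields[name] = FieldDef(name, level, description)

    def validate(self, level: str, annotation: Dict[str, str]) -> List[str]:
        """Return the annotation keys not defined for this level, sorted."""
        return sorted(
            key for key in annotation
            if key not in self.fields or self.fields[key].level != level
        )


project = Project("tonal-language-demo")
project.define_field("tone", "word", "lexical tone")

print(project.validate("word", {"tone": "H"}))
print(project.validate("word", {"tone": "H", "stress": "1"}))
```

In a design along these lines, field definitions are data rather than code, so a form-based interface could expose `define_field` to users without programming knowledge.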

Data Model: The BITC data model does not apply perfectly to all types of texts, though it does represent collections of sentences and phrases well, and also some higher-level structures like paragraph and complete text. A more flexible data model would allow for more kinds of texts to be collected comfortably. At the least, a selection of templates would be valuable, for instance, a template for illustrating paradigmatic forms. In addition, while the BITC data model has proven successful for Ingush and Chechen, two languages with fairly rich morphology, the fact that its data model does not easily represent structure below the level of the word may make it of limited value for languages with particularly extensive morphology.

Shared Dictionary: Even though BITC compiles a shared word list as new words are entered, along with glosses, there is no way to coordinate this word list with a larger, full-featured lexical resource, such as a lexical database. It would be much better to integrate the shared word list with a database of this type.

Server Reliance: Currently BITC requires access to a web server, which is a positive feature in classroom situations or in locations where Internet access is easily available. It would be useful if BITC also had a standalone mode that would function in the absence of the Internet, with the possibility of incorporating data collected offline into a larger online database when Internet access is available.

Sample Screen Shot
The screen shot below should give some sense of the nature of the BITC user interface.
