What are Best Practices?
'Best practices' are practices intended to make digital language documentation optimally long-lasting, accessible, and reusable by other linguists and speakers. Recommendations of best practices cover all aspects of digitizing and archiving language documentation, including how to record it, annotate it, catalogue it, store it, and display it in such a way as to respect the intellectual property rights of stakeholders. The recommendations on this site have grown out of some of the larger language digitization and language engineering projects. E-MELD has been charged with disseminating these recommendations and encouraging community feedback.
Please note: the recommendations on the E-MELD site are intended as suggestions, not directives. You are free to organize your documentation in any way you choose. However, following recommended practices has numerous advantages, not only for you but also for the future generations of speakers and scholars who may wish to consult your work. Particularly if you are documenting a lesser-known or endangered language, you are creating irreplaceable scientific material which will probably increase in value as the years pass. It is important that we all do what we can to ensure that such materials are as impervious as possible to decay of the physical media and to premature obsolescence brought about by technological change.
Read further on this site to find out more about current recommendations of best practice. A full explanation of the need for such practices can be found in Bird and Simons (2003), Seven Dimensions of Portability for Language Documentation and Description.
Bird and Simons (2003) identify seven areas in which consistent approaches can make digital language resources more useful to the discipline of linguistics and to speech communities.
One of the most pressing problems is the inconsistent use of terminology across resources. For example, the term "nominative" is ambiguous: it can refer, among other things, to the case of the subject of any clause, the case of the subject of an intransitive verb only, or a noun which is unmodified. A computer therefore cannot know which sense of "nominative" is meant when it encounters the term, and this makes it very difficult to compare resources.
Terminology should be linked to a common ontology, through which varying terminology can be interpreted by a computer.
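As a rough illustration of what linking terminology to a common ontology can look like, the sketch below maps each resource's local use of a term to a shared concept identifier, so that a program can tell whether two resources mean the same thing. The resource names (`grammar_a`, `grammar_b`, `grammar_c`) and the `concept:` identifiers are hypothetical placeholders, not taken from any actual ontology; a real project would substitute the identifiers published by the ontology it adopts.

```python
# Hypothetical mapping from each resource's local terminology to shared
# concept identifiers. All names and identifiers here are illustrative only.
TERM_MAPS = {
    "grammar_a": {"nominative": "concept:SubjectCase"},
    "grammar_b": {"nominative": "concept:IntransitiveSubjectCase"},
    "grammar_c": {
        "nominative": "concept:UnmodifiedNoun",
        "subject case": "concept:SubjectCase",
    },
}


def resolve(resource, term):
    """Return the shared concept a resource means by a term, or None if unmapped."""
    return TERM_MAPS.get(resource, {}).get(term.lower())


def same_concept(resource_a, term_a, resource_b, term_b):
    """True only if both terms are mapped and point to the same shared concept."""
    concept_a = resolve(resource_a, term_a)
    concept_b = resolve(resource_b, term_b)
    return concept_a is not None and concept_a == concept_b


if __name__ == "__main__":
    # The same label in two resources, but two different concepts.
    print(same_concept("grammar_a", "nominative", "grammar_b", "nominative"))
    # Different labels, same underlying concept.
    print(same_concept("grammar_a", "nominative", "grammar_c", "subject case"))
```

The point of the design is that comparison happens on the concept identifiers rather than on the labels, so identical labels with different meanings are kept apart and different labels with the same meaning are brought together.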
Resources are frequently uninterpretable to researchers. Much of the time this is because of the use of non-standard fonts, or because files are in proprietary formats.
All characters should be encoded using Unicode. This ensures that a given character code always represents the same character, regardless of the machine on which it is displayed. Open, or at least published, file formats should be used, so that files can be accessed by a greater range of software. Files marked up in XML and documented by a schema or DTD are a good choice here.
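A minimal sketch of this recommendation, using only the Python standard library: text is normalized to a single Unicode form and written out as UTF-8 encoded XML. The element names (`entry`, `form`, `gloss`) are hypothetical and do not follow any particular schema or DTD, and the choice of NFC normalization is an assumption made for the example rather than part of the recommendation.

```python
import unicodedata
import xml.etree.ElementTree as ET


def build_entry(headword, gloss):
    """Build a small XML element for one lexical entry (illustrative structure)."""
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form")
    # Normalize so that visually identical strings share one code point sequence.
    form.text = unicodedata.normalize("NFC", headword)
    gloss_el = ET.SubElement(entry, "gloss")
    gloss_el.text = gloss
    return entry


def to_xml_bytes(entry):
    """Serialize as UTF-8 bytes with an XML declaration naming the encoding."""
    return ET.tostring(entry, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # "e" followed by a combining acute accent, as some keyboards produce it.
    data = to_xml_bytes(build_entry("cafe\u0301", "coffee house"))
    print(data.decode("utf-8"))
```

Because the encoding is declared in the file itself and the markup is plain text, any XML-aware software should be able to read the result without the font or application that produced it.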
Finding data is perhaps one of the most difficult tasks for a linguist. Generalized search engines such as Google and Yahoo work well for many purposes, but they return a large number of irrelevant results.
You should list your resources with a linguistic search engine, e.g. the OLAC archive search engine on LINGUIST List.
Many linguistic data sources are not available through the Internet. Indeed, most linguistic data exists only in the form of tapes and note-cards. Other data, while available on the Internet, can be accessed only by certain people.
Material which is in a non-electronic form can still be made available on the Internet. The ORE Repository Editor allows you to make such a resource discoverable to the linguistic search engines, and provides information on how the material can be accessed. Material which is electronically available only to certain groups can also be made accessible, through a graded-access system which E-MELD is developing in conjunction with IMDI and AILLA.
Citation is one of the major problems with online resources: URLs may become inaccessible, or resources may move without leaving any record of their new location.
An archival copy of all material should be placed in a stable online linguistic archive, and this fact should be noted in the record submitted to the linguistic search engine with which the material is listed.
Preservation is a problem both for digital archives, where file formats become obsolete over time, and for physical archives, where media can deteriorate, be lost or damaged, or be in so archaic a format that no current hardware will support them. In addition, material can be lost through physical accidents.
If long-term preservation is important, files are best stored by digital archives that undertake to migrate formats as they change. If this is not possible, textual material should be archived in XML format. Material should also be stored in multiple copies, in more than one physical location.
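When material is kept in multiple copies, it helps to be able to confirm that the copies are still identical. The sketch below shows one common way to do this, by comparing SHA-256 checksums; this technique is an assumption added for illustration and is not prescribed by the recommendations themselves. The function names and file paths are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(master, copies):
    """Map each copy's path to True if its checksum equals the master's."""
    expected = sha256_of(master)
    return {str(copy): sha256_of(copy) == expected for copy in copies}


if __name__ == "__main__":
    # Hypothetical locations; replace with the real master file and its copies.
    master = Path("archive/lexicon.xml")
    copies = [Path("/mnt/offsite/lexicon.xml"), Path("/mnt/backup/lexicon.xml")]
    if master.exists():
        for location, ok in copies_match(master, [c for c in copies if c.exists()]).items():
            print(location, "OK" if ok else "DIFFERS")
```

A check of this kind detects silent corruption or accidental edits in a copy, but it does not replace format migration: an intact file in an obsolete format is still unreadable.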
Resource creators, researchers, and the speech communities who provide the primary data have different priorities regarding who should have access to language resources.