Workshop on
The Digitization of Language Data:
The Need for Standards

Working Group on Markup


New: Working Group Responses

Request for the Markup Working Group:

LINGUIST has received funding from the National Science Foundation to digitize data from ten minority languages, as part of the general E-MELD project. An essential component of this work is to mark up the data in a form that maximizes interchangeability and interoperability among linguistic servers, which means we need to reach a consensus on best practice in this area. We are asking you, the members of the Markup Working Group, for suggestions about the kinds of markup our software should be designed to handle (morphological markup, annotation for sound alignment, formatting) and what the nature of that markup should be. Essentially, we would like you to try out one of two variant markups, which we hope to have for you, and write a brief report (1 page or less) that we can use as a springboard for discussion at the workshop.

The two markup sets we would like you to use are:

  • The Dobes Markup (Dokumentation Bedrohter Sprachen): The goal of this project, which is funded by the VolkswagenStiftung, is to document endangered languages.
  • The LACITO Markup (Langues et Civilisations à Tradition Orale): The goal of the LACITO is to archive linguistic documents associating transcription and recorded speech in a format which guarantees their conservation and their availability for research, and to disseminate the results.

We hope you will:

  • Try them out: Please take some of your data and simply try to mark it up using one of these schemes. If you have no suitable resource to annotate, just look at one or both of the sets and try to draw some conclusions. We are interested in the answers to questions like:
    • Are the tags and attributes clearly named and described?
    • Do they allow you to target the right information, i.e., the aspects of your data that you consider important and/or that other linguists might want to search for?
    • Do you think the system(s) would be reasonably easy to use, given appropriate software, e.g., a tag editor?
    • Do you have any other suggestions for markup schema? For example, what markup are you currently using on your data, and how does it compare to these?
  • Write a brief (1 page) report of your results: If you email your report to Helen Aristar-Dry (hdry@linguistlist.org) by June 14, we will put it on the website prior to the workshop. Otherwise, we ask you to bring 12 copies of your report to the workshop. Your conclusions and suggestions will be the springboard for the discussion in the Markup Working Group sessions.

We have put together a page providing some background on markup, in which we attempt briefly to answer questions like:

  • What is markup?
  • What other standards exist?
  • Do we all have to use the same markup system?

For additional information, consult the pages on linguistic annotation at the Linguistic Data Consortium.


