The original report of the Markup Working Group was for some reason not saved along with the other working group reports. This version was prepared years later, just prior to the 2005 E-MELD workshop, based on notes that I saved from the 2001 workshop, and on Doug Whalen's Final Report. I have also added some remarks that indicate the kind of follow-up that has taken place.
The Markup Working Group was charged with recommending a markup language for use in the E-MELD project, and with recommending how to establish standards for marking up project materials.
The group agreed that XML should be the markup language used in the project, as it is the world standard for content markup, as opposed to formatting markup. HTML is clearly inadequate for the purpose of representing linguistic information. There was some discussion of whether XML is adequate to represent all of the kinds of information that are of importance to linguists, and of the lack of tools, particularly ones that linguists can afford, for preparing XML documents and making them available on the Web.
The group discussed two approaches to the development of linguistic markup standards, both represented in the Text Encoding Initiative that I reported on at the workshop. One is to define one or more markup syntax specifications that are simple to use, and that provide just enough detail to do specific jobs, such as representing the parse structure of a text, aligning a sound recording with its transcription, and representing the structure of a lexicon. Remark 1. The other is to develop a markup "metalanguage" that enables the encoder to represent any desired linguistic analysis. This approach is best carried out using "standoff markup", in which the analysis markup is kept separate from the unanalyzed data and the two are linked by pointers. Remark 2.
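The standoff idea can be illustrated with a minimal Python sketch. It is not an E-MELD, TEI, or GOLD format: the element names (`text`, `w`, `analysis`, `ann`), the attribute names, the `resolve` helper, and the sample words are all hypothetical, chosen only for illustration. In this sketch the analysis document points at identifiers in the data, which is one common arrangement and an assumption here.

```python
import xml.etree.ElementTree as ET

# Unanalyzed data: a transcription whose words carry only identifiers.
# Element and attribute names are hypothetical, not a project schema.
BASE = """<text>
  <w id="w1">perros</w>
  <w id="w2">ladran</w>
</text>"""

# The analysis lives in a separate document; each <ann> points at a word by id.
STANDOFF = """<analysis>
  <ann target="w1" gloss="dog-PL" pos="noun"/>
  <ann target="w2" gloss="bark-3PL" pos="verb"/>
</analysis>"""


def resolve(base_xml, standoff_xml):
    """Pair each annotation with the word form it points to."""
    words = {w.get("id"): w.text for w in ET.fromstring(base_xml).iter("w")}
    resolved = []
    for ann in ET.fromstring(standoff_xml).iter("ann"):
        target = ann.get("target")
        if target not in words:
            raise KeyError(f"annotation points to unknown id: {target}")
        resolved.append((words[target], ann.get("gloss"), ann.get("pos")))
    return resolved


if __name__ == "__main__":
    for form, gloss, pos in resolve(BASE, STANDOFF):
        print(f"{form}\t{gloss}\t{pos}")
```

Because the data document is never modified, a second, competing analysis could be stored in another standoff document that points at the same identifiers.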
We agreed that we would begin our effort with recommendations for morphosyntactic markup, on two fronts: (1) the structure of morphosyntactic representations, and (2) lists of commonly used (and less commonly used, but still important) terms. The problems with (1) are best described in the "Trail of Tiers" section of Doug Whalen's Final Report. Remark 3. For (2), we will start with additions to and modifications of the list being prepared by the EUROTYP project. Remark 4.
Bonny Sands pointed out the need for 'confidence level' indicators for the various components of an analysis. Remark 5.
A listserv will be implemented to continue discussion of these issues, and one or more advisory boards should be set up. Remark 6.
Remark 1. The approach of building simple markup modules is represented by the various "best practice" markup recommendations for glossed texts, lexicons, etc.
Remark 2. The use of standoff markup was not discussed systematically by the group, but soon became the central focus of much of the standardization effort within the project, particularly the creation of GOLD and of tools for relating it to linguistic descriptions.
Remark 3. So far we have developed best-practice recommendations only for simple "three-line" morphosyntactic markup, which accounts for the vast preponderance of the material available on the Web, and Will Lewis has been developing tools for migrating that material to best practice.
Remark 4. Constructing term lists and their definitions was one of the first tasks we undertook for the development of GOLD. We started not with the EUROTYP list but with a list that Gary Simons provided from the SIL Lingua Links project.
Remark 5. To my knowledge, this suggestion has not been followed up on. My own preference would be to treat it as a metadata problem, and refer it to that working group.
Remark 6. The original listserv for this working group was not active. Several other listservs have since been set up with varying degrees of activity. An advisory board was formed for the GOLD Community at the "Fresno summit" in November 2004.