About metadata
What do we mean by 'metadata'?
Metadata is data about data. Most of us are familiar with one kind of metadata: the information enclosed in <META> tags at the top of web pages. This metadata describes the content of the web page, and it helps search engines find and classify the page correctly.
In the current context, we are using the term 'metadata' for a description of a whole digitized language resource, including, for example, the name of the linguist who created the resource, the language it is in, the content, the format (an audio file or a text file, for instance), and so forth. The metadata can either be included in the resource itself, like the metadata at the top of web pages, or it can reside elsewhere and be associated with the resource via a link. Metadata may be encoded in various ways, e.g., HTML or XML, and may conform to various standards, e.g., Dublin Core, MARC, or the OLAC standard we are trying to create.
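As an illustration of such a description, here is a minimal sketch in Python that encodes one record in XML using element names from the Dublin Core element set (creator, language, description, format, identifier). The `record` wrapper element, the `build_record` helper, and all of the field values are hypothetical, chosen only for this example; this is not the OLAC format, which is still to be agreed upon.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_record(fields):
    """Return an XML element describing one resource.

    `fields` maps Dublin Core element names to text values.
    The <record> wrapper is a hypothetical container, not part of any standard.
    """
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# All values below are invented for illustration.
example = build_record({
    "creator": "A. Linguist",
    "language": "Language A",
    "description": "Field recordings of narrative speech",
    "format": "audio/wav",
    # The metadata lives apart from the resource and points to it by a link.
    "identifier": "http://archive.example.org/recordings/0001.wav",
})

xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)
```

Here the record resides apart from the resource and is associated with it through the link in `identifier`, which corresponds to the second of the two arrangements described above.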
Most archives of language resources include information that we would call metadata. The EAGLES/ISLE Metadata Initiative usefully identifies some examples:
Existing corpora such as Childes or ESF Second Learner Corpus have each corpus file include a so-called header with information that we would now describe as the resources meta-data in a proprietary format. Also important initiatives such as TEI and CES/xCES worked out sets of tags that describe a whole transcription file and would be called meta-data within this initiative. Institutions such as Helsinki University started to build web sites with samples of corpora where hyperlinks and commentary text containing typical meta-data allow the user to easily navigate between the corpus samples. . . . Recently the MPI Browsable Corpus project and the ICE project came up with . . . huge distributed sets of linked meta descriptions of resources that can be parsed and navigated by suitable browsers.
What do we intend to do with metadata?
The LINGUIST List is going to become an OLAC "service provider." This means that we will collect metadata on available language data and documentation and make it available to the linguistics community; we will become something like the "union catalog" of language- and linguistics-related resources. Ideally, a linguist will be able to query our database of metadata to retrieve information about virtually any language-related resource; the metadata will tell the linguist who created the resource, where it is, what format it is in, who has access to it, and so forth.
With the exception of the material in our demonstration project (more on this at the Workshop), we will not collect data and documentation ourselves, but only the metadata. We will thus continue to act as a hub, an information center from which you can navigate to other relevant sites.
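To make the "union catalog" idea concrete, the following is a rough sketch in Python of a query over metadata gathered from several archives. The `Record` fields mirror the questions mentioned above (who created the resource, where it is, what format it is in, who has access to it). The record type, the `find` function, and the catalog entries are all assumptions made for this illustration and do not describe the actual service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One metadata description; the resource itself stays at `location`."""
    creator: str
    language: str
    format: str
    access: str
    location: str


# Hypothetical metadata collected from two hypothetical archives.
CATALOG = [
    Record("A. Linguist", "Language A", "audio", "public",
           "http://archive-one.example.org/rec/1"),
    Record("B. Linguist", "Language A", "text", "restricted",
           "http://archive-two.example.org/texts/7"),
    Record("A. Linguist", "Language B", "text", "public",
           "http://archive-one.example.org/texts/2"),
]


def find(catalog, **criteria):
    """Return the records whose fields equal every given criterion.

    An unknown field name raises AttributeError.
    """
    return [
        record for record in catalog
        if all(getattr(record, field) == value
               for field, value in criteria.items())
    ]


# The query returns descriptions only; each one links out to the holding site.
for record in find(CATALOG, language="Language A"):
    print(record.creator, record.format, record.access, record.location)
```

Only the descriptions are held centrally in this sketch; the `location` field sends the user on to the archive that holds the resource, which matches the hub role described above.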
We will collect metadata in a number of ways. The most important are:
In sum: metadata is a simple concept, but the metadata format agreed upon and the degree of consensus about it will be very important to the linguistics community in the future. It will determine how well we are able to find language resources in the vast, and rapidly expanding, realm of the Internet.