Comments on Language Codes

Randy J. LaPolla

City University of Hong Kong

 

ISO-639 is clearly inadequate.  The Ethnologue codes are already widely used by a number of multilanguage sites (e.g. The Rosetta Project), and cover a sufficient number of languages.  The Ethnologue system of classification is not perfect, but (at least for Sino-Tibetan) it can be lived with.  The information on the genetic affiliations and alternate names of languages needs to be updated in many cases, but this would not usually affect the basic code for a particular language.  As suggested in the article by Constable & Simons, a tag such as "i-sil" could be registered under RFC 1766, so that the codes could be used in metadata descriptions and be recognized there as Ethnologue codes.  Some system for secondary identification of dialects would also be nice to have, though it is not essential.  Here I am assuming a small number of true dialects, not languages that are called dialects, as in the case of Chinese.  One possibility is a sub-code of two or three letters, e.g. RAW-MW (Rawang, Matwang dialect).  As long as the primary and secondary codes are searched for in a uniform way, they should not interfere with each other.
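
A minimal sketch of how a primary code with an optional dialect sub-code might be parsed and searched uniformly is given here.  The format (three letters, then optionally a hyphen and two or three letters) is taken from the RAW-MW example; the names parse_code and matches are hypothetical and are not part of any existing Ethnologue or RFC 1766 tooling.

```python
import re
from typing import NamedTuple, Optional

# Assumed format: three-letter primary code, optionally followed by a
# hyphen and a two- or three-letter dialect sub-code, e.g. "RAW-MW".
CODE_PATTERN = re.compile(r"^([A-Z]{3})(?:-([A-Z]{2,3}))?$")


class LanguageCode(NamedTuple):
    primary: str
    dialect: Optional[str]


def parse_code(tag: str) -> LanguageCode:
    """Split a tag into its primary code and optional dialect sub-code."""
    match = CODE_PATTERN.match(tag.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised code: {tag!r}")
    return LanguageCode(match.group(1), match.group(2))


def matches(query: str, resource_tag: str) -> bool:
    """True if a resource tagged resource_tag satisfies the query.

    A query without a dialect matches every dialect of that language;
    a query with a dialect matches only resources tagged with that dialect.
    """
    q = parse_code(query)
    r = parse_code(resource_tag)
    if q.primary != r.primary:
        return False
    return q.dialect is None or q.dialect == r.dialect
```

Because the two parts are always separated before comparison, a search on RAW finds resources tagged RAW-MW, while a search on RAW-MW does not return resources tagged only RAW.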

 

Protolanguages need to be given tags as languages, so that information about a protolanguage can be searched for in the same way as information about other (living and dead) languages.  The metadata for a language resource should give the code for the individual language (and possibly dialect) together with the codes for all nodes in the family tree to which it belongs, in the correct order.  Depending on which is more efficient computationally, these could take the form of individual values in a series of language identification attribute-value pairs, or they could be lined up in one attribute-value statement.  Searches could then be done on any level in the tree and turn up everything below that level, if that is what is wanted.  Searches could also be constrained so that only a certain level is retrieved, or so that all the information is retrieved but laid out in the hierarchical structure.  Assuming the creator of a language resource is the one who is clearest on the affiliations of the language, the positioning of the language in the family should not be problematic.  A program for creating trees out of these codes could be written, but it would not be absolutely necessary.  Information on affiliations is already given in Ethnologue (where the entry under a language tag has a link to a higher level grouping), though this could be implemented more thoroughly.  Resource creators could inform Ethnologue if their own view of the place of a language is different from that already on file, and give reasons for their view.  Ethnologue could keep variant proposals on file in the entries, tagged with the name of the proposer, and possibly even the criteria used for classification.
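
The kinds of search described above can be sketched as follows, assuming each resource carries an ordered list of codes from the top of the tree down to the node it describes.  Apart from RAW, the node codes here ("ST", "TB", "XXA") are invented placeholders, and the class and function names are hypothetical; this illustrates the idea and is not a proposal for the actual codes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Resource:
    title: str
    # Codes from the top of the family tree down to the node described,
    # in order.  "ST", "TB" and "XXA" are invented placeholder codes.
    lineage: List[str]


RESOURCES = [
    Resource("Rawang word list", ["ST", "TB", "RAW"]),
    Resource("Protolanguage reconstructions", ["ST", "TB"]),
    Resource("Texts in a placeholder language", ["ST", "XXA"]),
]


def search_subtree(resources: List[Resource], node: str) -> List[Resource]:
    """Everything tagged at the given node or anywhere below it."""
    return [r for r in resources if node in r.lineage]


def search_level(resources: List[Resource], node: str) -> List[Resource]:
    """Only resources tagged at exactly this node, e.g. a protolanguage."""
    return [r for r in resources if r.lineage and r.lineage[-1] == node]


def build_tree(resources: List[Resource]) -> Dict[str, dict]:
    """Lay the codes out as a nested dictionary mirroring the family tree."""
    tree: Dict[str, dict] = {}
    for resource in resources:
        branch = tree
        for code in resource.lineage:
            branch = branch.setdefault(code, {})
    return tree
```

With this data, search_subtree(RESOURCES, "TB") returns the first two resources, search_level(RESOURCES, "TB") returns only the protolanguage resource, and build_tree(RESOURCES) gives the hierarchical layout.  Variant classifications could be handled by storing more than one lineage per resource, each tagged with its proposer.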

 

There should be separate classifications for typological similarities and areal groupings, as these are both of interest to linguists and will need to be searched for.  Areal groupings (possibly not just geographic, but also 'sphere of influence', e.g. Indosphere vs. Sinosphere) should not be too difficult, but there will probably be disagreements over typological characteristics (some of us don't care for simplistic characterizations such as "SVO" vs. "SOV").  If we can come up with an agreed-upon set of tags for markup of data, we might be able to turn these tags into a type of metadata that can help in typological searches.  For example, if our metadata included some of the patterns found in the data, e.g. "PPREF-N" for pronominal prefixes on nouns, and we wanted to find all languages which have pronominal prefixes on nouns, we could search the metadata for that particular feature.
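
A minimal sketch of such a feature search, assuming each language's metadata holds a set of agreed-upon feature tags.  Only "PPREF-N" comes from the example above; the tag "TONE", the language codes "AAA", "BBB" and "CCC", and the function name are invented placeholders and make no claim about any real language.

```python
from typing import Dict, List, Set

# Hypothetical metadata: language code -> feature tags found in its data.
FEATURES: Dict[str, Set[str]] = {
    "AAA": {"PPREF-N", "TONE"},
    "BBB": {"TONE"},
    "CCC": {"PPREF-N"},
}


def languages_with(features: Dict[str, Set[str]], *required: str) -> List[str]:
    """Return, in sorted order, the codes of languages having every required tag."""
    wanted = set(required)
    return sorted(code for code, tags in features.items() if wanted <= tags)
```

Here languages_with(FEATURES, "PPREF-N") returns the codes AAA and CCC, and asking for both "PPREF-N" and "TONE" narrows the result to AAA.  The same lookup would work for areal tags if those were stored in the same way.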