Comments on Language Codes
Randy J. LaPolla
City University of Hong Kong
ISO-639 is clearly inadequate. The Ethnologue codes are already widely used
by a number of multilanguage sites (e.g. The Rosetta Project), and can cover a
sufficient number of languages. Their
system of classification is not perfect, but (at least for
Sino-Tibetan) it is workable. The
information on the genetic affiliations and alternate names of languages needs
to be updated in many cases, but this would not usually affect the basic code
for a particular language. As suggested
in the article by Constable & Simons, a tag such as "i-sil" could
be registered under RFC 1766, so that the codes could be used in metadata
descriptions and recognized there as Ethnologue
codes. Some system for secondary
identification of dialects would also be useful, though not essential,
possibly a sub-code of two or three letters, e.g. RAW-MW (the Matwang dialect
of Rawang). This assumes we are talking about a small number of true dialects,
not languages called dialects, as in the case of Chinese. Provided the primary
and secondary codes are searched for in a uniform way, there should not
be interference between them.
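
The primary/secondary split described above can be sketched as follows. This is only an illustration under stated assumptions: the hyphen separator follows the RAW-MW example, the three-letter length check on the primary code and the names parse_tag and matches are hypothetical, and nothing here is an existing Ethnologue or RFC 1766 facility.

```python
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    language: str            # primary code, e.g. "RAW"
    dialect: Optional[str]   # optional secondary (dialect) code, e.g. "MW"


def parse_tag(tag: str) -> LanguageTag:
    """Split a tag such as 'RAW-MW' into a primary and an optional dialect code.

    Assumption: primary codes are three letters and dialect sub-codes are
    two or three letters, separated by a single hyphen.
    """
    parts = tag.strip().upper().split("-")
    if len(parts) > 2:
        raise ValueError(f"too many components in tag: {tag!r}")
    language = parts[0]
    dialect = parts[1] if len(parts) == 2 else None
    if len(language) != 3 or not language.isalpha():
        raise ValueError(f"bad primary code in tag: {tag!r}")
    if dialect is not None and not (2 <= len(dialect) <= 3 and dialect.isalpha()):
        raise ValueError(f"bad dialect code in tag: {tag!r}")
    return LanguageTag(language, dialect)


def matches(query: str, resource_tag: str) -> bool:
    """True if a resource tagged `resource_tag` satisfies `query`.

    A query without a dialect matches every dialect of that language;
    a query with a dialect matches only that dialect.
    """
    q, r = parse_tag(query), parse_tag(resource_tag)
    if q.language != r.language:
        return False
    return q.dialect is None or q.dialect == r.dialect
```

Because the two codes are compared position by position, a dialect sub-code can never be mistaken for a primary code: matches("MWX", "RAW-MWX") is false even though the string "MWX" occurs in the resource tag.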
Protolanguages need to be given tags as
languages, so that information about the protolanguage can be searched for in
the same way that other (living and dead) languages can be searched for. In the metadata for a language resource, the
code for the individual language (and possibly dialect) and codes for all nodes
in the family tree to which it belongs should be given, and in the correct
order. Depending on which is more
efficient computationally, these could take the form of individual values in a
series of individual language identification attribute-value pairs, or they could
be lined up in one attribute-value statement. Searches could then be done at
any level in the tree and return everything below that level, if that is what
is wanted (searches could also be constrained so that only a certain
level is retrieved, or so that all the information is retrieved but laid out in its
hierarchical structure). Assuming the
creator of a language resource is the one who is clearest on the affiliations
of the language, the positioning of the language in the family should not be
problematic. A program for creating trees
out of these codes could be written, but it would not be absolutely
necessary. Information on affiliations
is already given in Ethnologue (where the entry under a language tag links to
a higher-level grouping), though this could be implemented more
thoroughly. Resource creators could inform Ethnologue if their own view of
the place of a language differs from the one already on file, and give
reasons for their view. Ethnologue
could keep variant proposals on file in the entries, tagged with the name of
the proposer, and possibly even the criteria used for classification.
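
A minimal sketch of the ordered family-tree metadata and the searches it would allow is given below. All codes (XFAM, XBR1, AAA, and so on), titles, and function names are invented placeholders rather than real Ethnologue codes, and the "/" separator for the single-statement form is likewise an assumption. A resource on a protolanguage is simply one whose code list ends at an internal node of the tree.

```python
# Each resource lists the codes of every node above it, highest node first,
# ending with the code of the language (or protolanguage) it describes.
# All codes and titles are illustrative placeholders.
RESOURCES = [
    {"title": "Wordlist A", "lineage": ["XFAM", "XBR1", "AAA"]},
    {"title": "Grammar B", "lineage": ["XFAM", "XBR1", "BBB"]},
    {"title": "Proto-XBR1 reconstruction", "lineage": ["XFAM", "XBR1"]},
    {"title": "Texts C", "lineage": ["XFAM", "XBR2", "CCC"]},
]


def to_single_value(lineage):
    """Line the codes up in one attribute value (separator is an assumption)."""
    return "/".join(lineage)


def from_single_value(value):
    """Recover the ordered list of codes from the single-value form."""
    return value.split("/")


def search_below(resources, node):
    """Everything at or below `node` in the tree."""
    return [r for r in resources if node in r["lineage"]]


def search_exact(resources, node):
    """Only resources describing `node` itself, e.g. a protolanguage."""
    return [r for r in resources if r["lineage"][-1] == node]


def build_tree(resources):
    """Assemble a nested dict {code: {child_code: {...}}} from the lineages."""
    tree = {}
    for r in resources:
        level = tree
        for code in r["lineage"]:
            level = level.setdefault(code, {})
    return tree


def render(tree, depth=0):
    """Return the tree as indented lines, one code per line."""
    lines = []
    for code in sorted(tree):
        lines.append("  " * depth + code)
        lines.extend(render(tree[code], depth + 1))
    return lines
```

With these placeholder records, search_below(RESOURCES, "XBR1") returns the two language resources and the protolanguage resource, while search_exact(RESOURCES, "XBR1") returns only the latter. Variant classifications from different proposers could be handled by storing more than one lineage per resource, each tagged with the proposer's name, but that is not shown here.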
There should be separate classifications for
typological similarities and areal groupings, as these are both of interest to
linguists and will need to be searched for.
Areal groupings (possibly not just geographic, but also 'sphere of
influence', e.g. Indosphere vs. Sinosphere) should not be too difficult, but
there will probably be disagreements over typological characteristics (e.g.
some of us don't care for simplistic "SVO" vs. "SOV"
characterizations). If we can
come up with an agreed-upon set of tags for markup of data, we might be able to
turn these tags into a type of metadata that can help in typological
searches. For example, if our metadata
included some of the patterns found in the data, e.g. "PPREF-N" for
pronominal prefixes on nouns, and we wanted to search for all languages which
had pronominal prefixes on nouns, we could then search the metadata for that
particular feature.
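
A feature search of this kind might look like the sketch below. Only the tag "PPREF-N" comes from the discussion above; the other tags, the language codes, and the function name languages_with are hypothetical, and a real tag set would first have to be agreed upon.

```python
# Metadata records pairing a language code with the markup tags attested
# in its data. Codes and all tags except "PPREF-N" are placeholders.
FEATURE_RECORDS = [
    {"language": "AAA", "features": {"PPREF-N", "FEAT-X"}},
    {"language": "BBB", "features": {"FEAT-X"}},
    {"language": "CCC", "features": {"PPREF-N", "FEAT-Y"}},
]


def languages_with(records, *features):
    """Return the sorted codes of languages whose metadata has every given tag."""
    wanted = set(features)
    return sorted({r["language"] for r in records if wanted <= r["features"]})
```

For example, languages_with(FEATURE_RECORDS, "PPREF-N") picks out the languages recorded as having pronominal prefixes on nouns, and passing several tags narrows the result to languages that have all of them.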