On Romanization (Pt. 3)

So we’ve established that cataloging gets messy when it comes to deciding whether or not to include the vernacular in addition to romanized forms. But who decides what this looks like?

In short, it’s the American Library Association and the Library of Congress. They’ve produced tables for 75 different languages and/or scripts, although certain documents combine languages that use a single script (Hebrew and Yiddish, Non-Slavic Languages [in Cyrillic Script]), while others separate out languages that use those same scripts (Judeo-Arabic, Russian, etc.).

In some cases, the romanization scheme has been very thoughtfully constructed by all parties involved. For example, in 2012, the library world collaborated with the Cherokee people to produce a romanization scheme that was amenable to all. In other cases, however, the scheme was clearly assembled by people with little knowledge of the language involved.

If, for example, you look at the non-Slavic languages in Cyrillic script chart, you see that there has been no consideration for how each language behaves. Instead, there was merely a failed attempt to assign every possible Cyrillic letter a Romanized equivalent. If the scheme had been successful, that would be one thing, but it’s horribly inconsistent. Take a look at some of the following:

  • Tatar, Syriac, Kazakh ә is romanized as ă
  • Tatar-Kryashen, Mari, Karelian ӓ is romanized as ă
  • Khanty ӓ is romanized as ä
  • Chuvash ӑ is romanized as ă

Knowing what I know about these languages, the only two romanizations I can agree with are those for Khanty and Chuvash; these are the romanizations that most linguists would use. For Tatar, Mari, Kazakh, etc., I would use ä. The romanization scheme is inconsistent: either provide a 1-to-1 romanization for every possible Cyrillic letter or treat each language individually.
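To make that alternative concrete, here is a minimal Python sketch of what a per-language approach could look like, with one small table per language instead of one table per script. The letter values follow the preferences described above; the language keys and the sample word are illustrative only, not an official scheme.

```python
# Illustrative per-language romanization tables; the values follow the
# preferences discussed above and are NOT a complete or official scheme.
PER_LANGUAGE_TABLES = {
    "tatar":   {"ә": "ä", "ӓ": "ä"},
    "kazakh":  {"ә": "ä"},
    "mari":    {"ӓ": "ä"},
    "khanty":  {"ӓ": "ä"},
    "chuvash": {"ӑ": "ă"},
}

def romanize(text: str, language: str) -> str:
    """Romanize character by character, using the table for one specific language."""
    table = PER_LANGUAGE_TABLES[language]
    return "".join(table.get(ch, ch) for ch in text)

print(romanize("әни", "tatar"))  # -> "äni" (Tatar "mom", chosen as an example word)
```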

As I just noted, scholarly treatment of these languages rarely aligns with ALA-LC. This is because

  1. ALA-LC attempts to create a 1-to-1 system whereby you can easily work out the vernacular form from the romanized form
  2. ALA-LC attempts to create internal consistency based on script rather than language (see the Cyrillic examples above)

For scholars working on minority languages, especially, it can be frustrating trying to locate materials in these languages when the romanization in the catalog does not align with the rest of the scholarly literature.

It’s bad enough to annoy scholars, but what about actual speakers of a language? What happens when they have their own Romanization schemes? What happens when a language shifts from one script to a Latin-based one? This has happened several times in the former Soviet Union: Azerbaijani, Turkmen, Kazakh, Crimean Tatar, and Uzbek have all converted from a Cyrillic script to a Latin-based one. While neither Tatar nor Belarusian has made this shift, both groups have clear preferences for how their languages should be presented in Latin script. As an example, here are some of the mismatches you’ll encounter when comparing ALA-LC to the official Latin standard for Azerbaijani:

Cyrillic | ALA-LC | Official Latin
Ҹ ҹ | j | c
Ч ч | ch | ç
Ә ә | ă | ə
Ҝ ҝ | ġ | g
Ө ө |   | ö

This disconnect is less than ideal because it means that a native speaker of Azerbaijani not only has to know the Latin script that is currently taught in schools and the Cyrillic script that was used up until the early 90s, but also has to learn the ALA-LC Romanization that is used in American and British libraries. And if that same speaker were to go to Germany, they would have to learn the system used there!

I’m not opposed to romanization. While we do have access to input methods for most of the world’s languages, they are still imperfect. I, for example, have no problem reading Cyrillic, yet still have a hard time typing it. Romanization makes things easy.

However, the current system is broken. There is no internal consistency, and the wishes of native speakers are overlooked. I have a few suggestions for how to proceed, and I’ll describe them in the next post.

On Romanization (Pt. 2)

I mentioned in the previous post that current standards allow for the inclusion of only certain vernacular scripts. Most of these are scripts that do not have an easy 1-to-1 mapping to Latin script.

In the case of Chinese, Japanese, and Korean, this is because these writing systems employ Chinese characters (called kanji in Japanese and hanja in Korean). The Chinese writing system is logographic, which means that a word or a part of a word is represented by a single character. Over time, these characters have lost much of their relationship to pronunciation, and there exist many characters that sound the same but express different ideas. For example, 烔, 罿, and 硐 are all pronounced tóng, but mean “heated”, “bird net”, and “grind”, respectively. This presented a problem for catalogers because there was no way to work backward from a romanized version of Chinese to the original. Korean has largely dropped the use of hanja, which simplifies matters somewhat, as Korean uses an alphabet that is arranged into syllabic blocks. Japanese is even more complicated because it still employs kanji but also uses syllabaries called katakana and hiragana, meaning that a single sentence or title could potentially contain three different scripts. Therefore, there is no way to map between romanized titles (or names) and the vernacular in these languages.
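A tiny Python sketch of the problem, using the characters and glosses just cited: the mapping from readings back to characters is one-to-many, so the romanized form alone cannot tell you which character was intended.

```python
# Several distinct characters share the reading "tóng", so romanization is lossy:
# you can go from character to reading, but not unambiguously back again.
READING_TO_CHARACTERS = {
    "tóng": ["烔", "罿", "硐"],  # "heated", "bird net", "grind"
}

print(READING_TO_CHARACTERS["tóng"])  # ['烔', '罿', '硐'] -- which one did the romanizer mean?
```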

Other scripts that have presented a problem are Hebrew and the Perso-Arabic scripts. Neither writing system requires the writing of vowels. Although there are optional vowel symbols, they are typically not written except in children’s books and language-learning materials. For example, תל־אביב is Tel Aviv, yet the e in Tel is not written. Likewise, المغرب‎ is al-maghrib “Morocco”, yet the a and i are not written. Again, this prevents a 1-to-1 relationship between the romanized and vernacular versions.

To remedy this, in 1979 the Library of Congress worked with a few other library groups to implement the JACKPHY initiative (Japanese, Arabic, Chinese, Korean, Persian, Hebrew, Yiddish). In practice, this expanded to include all languages using any of these scripts (Urdu, Kurdish, various Jewish languages, pre-French Vietnamese). The result of this was that cataloging for materials in these languages would include not only a romanized version, but also a vernacular version.

Later, the Library of Congress added Greek and Cyrillic (Russian, Bulgarian, Kyrgyz, etc.) to its list of approved languages, likely due to the large number of materials available in the West that were published in languages using these scripts.

In practice, this looks like this:

In this image, there are linkages between the romanized and vernacular versions of this title, which is taken from the OCLC cataloging software Connexion. This catalog entry is in MARC format.

While this looks like a good solution, look at what happens when library software actually translates it:

The vernacular has been shifted down to a series of 880 fields called “Alternate Graphic Representation”. This is because pure MARC doesn’t allow for the kind of linking seen in OCLC; instead it uses a complex series of codes to link the fields.
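For the curious, here is a rough sketch of that linking mechanism as it looks in code, using the pymarc library (5.x subfield syntax) and a made-up Russian title. Each field carries a $6 subfield whose occurrence number ties the romanized 245 to its vernacular 880 twin.

```python
# A sketch of a MARC record pairing a romanized field with its 880 counterpart.
# Assumes pymarc 5.x; the title is a hypothetical example.
from pymarc import Field, Record, Subfield

record = Record()

# Romanized title statement: $6 "880-01" says "my vernacular twin is an 880 field,
# occurrence 01".
record.add_field(
    Field(
        tag="245",
        indicators=["1", "0"],
        subfields=[
            Subfield(code="6", value="880-01"),
            Subfield(code="a", value="Voĭna i mir"),
        ],
    )
)

# Vernacular title, demoted to an 880 "Alternate Graphic Representation";
# its $6 "245-01" points back at the romanized 245.
record.add_field(
    Field(
        tag="880",
        indicators=["1", "0"],
        subfields=[
            Subfield(code="6", value="245-01"),
            Subfield(code="a", value="Война и мир"),
        ],
    )
)

print(record)  # the text view shows the 245/880 pair linked by matching $6 codes
```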

The result of all of this is that romanized versions of data are prioritized, whereas the vernacular is an afterthought. There is, in fact, no mandate for the inclusion of the vernacular.

Here’s where standards come into play. The PCC, the Program for Cooperative Cataloging, sets cataloging standards in the US. Under its purview are four sub-groups that set standards for particular types of records.

  • BIBCO: monographic records (books and other one-time publications)
  • CONSER: continuing resource records (serials, journals, etc.)
  • NACO: name authority records
  • SACO: subject authority records

Of these four groups, only BIBCO allows for the inclusion of (potentially) any script in a catalog record. The rest require that records use only characters from the MARC-8 character sets. This has a chilling effect on BIBCO, as the vast majority of its records do not take advantage of the fact that non-JACKPHY+ scripts may be used.

For most libraries there is no requirement that they follow PCC guidelines; they could potentially enter data in whatever script they want. However, most libraries also do not have catalogers dedicated to foreign languages. Because catalog records are shared, and because PCC libraries are overwhelmingly well-funded and influential, most library records are produced by, or standardized by, libraries following PCC standards.

As a result, tons of languages with rich literary traditions do not see their vernaculars represented in library catalogs. Most of these are from South Asia (Hindi, Bengali, Gujarati, Tamil) and Southeast Asia (Thai, Lao, Khmer). We also lose out on Armenian, Cherokee, Georgian, Tibetan, Mongolian, Berber, Inuktitut, and many other languages with unsupported writing systems.

All of the languages I have mentioned are fully supported by Unicode, yet the library world’s dependence on MARC-8 means that they can’t be represented. Granted, BIBCO records could (and sometimes do!) include vernacular representation of these languages, yet this is far too rare.
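To see how blunt the restriction is, here is a rough Python sketch. It is not a real MARC-8 validator; the Unicode ranges below only approximate the scripts MARC-8 covers, but it is enough to flag titles that could not survive in a MARC-8-only record.

```python
# Approximate the MARC-8 repertoire with the Unicode blocks of the scripts it
# (roughly) covers: Latin, Greek, Cyrillic, Hebrew, Arabic, kana, CJK, hangul.
ALLOWED_RANGES = [
    (0x0000, 0x024F),  # Basic Latin + Latin Extended (enough for ALA-LC romanization)
    (0x0370, 0x03FF),  # Greek
    (0x0400, 0x04FF),  # Cyrillic
    (0x0590, 0x05FF),  # Hebrew
    (0x0600, 0x06FF),  # Arabic
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul syllables
]

def roughly_marc8_safe(text: str) -> bool:
    """Return True if every character falls inside a block MARC-8 roughly covers."""
    return all(any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_RANGES) for ch in text)

print(roughly_marc8_safe("Война и мир"))      # True  -- Cyrillic is in the club
print(roughly_marc8_safe("ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ"))       # False -- Cherokee is not
print(roughly_marc8_safe("தமிழ் இலக்கியம்"))    # False -- neither is Tamil
```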

On Romanization (Pt. 1)

As I’ve shown in previous posts, cataloging multilingual library materials is not so simple. For the next few posts, I plan to discuss my issues with Romanization (i.e., the conversion of non-Latin scripts into Latin characters). As with MARC-8, there is a long (and very good) historical precedent for Romanization; also as with MARC-8, technology and interconnectedness have advanced to the point where Romanization has become a problem. I propose that we stop Romanizing and instead enter data in the vernacular form, leaving it up to algorithms or other forms of automation to Romanize should this be necessary (a rough sketch of what that automation might look like follows the list below). A few of the problems I have with Romanization are as follows; each will be discussed in later posts.

  • Current standards allow for the inclusion of only certain vernacular scripts (JACKPHY, Cyrillic, and Greek – more on these later). These standards are hangovers from MARC-8, so they should be discarded immediately. This restriction also prevents native-script users from searching the catalog in their own scripts.
  • For most non-Latin scripts there exists a wide variety of Romanization schemes. Why force library users to learn ALA-LC (the standard in US cataloging)?
  • Quite a few languages have official Latin-based alphabets, either because they switched from a non-Latin alphabet to a Latin-based one (e.g. Azerbaijani, Uzbek) or because a governmental body promotes a specific Romanization (as in South Korea).
  • It is not unheard of for a title (or other tidbit of catalogable material) to exist in multiple scripts. This makes Romanization messy.
  • Romanization goes against the spirit of RDA. If RDA instructs us to enter data as it exists on the piece, then why do we Romanize?
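As a sketch of what “leave it to automation” could mean in practice, here is a minimal Python example. The table is an illustrative, incomplete Azerbaijani Cyrillic-to-Latin mapping, and the function name and sample title are my own; the point is simply that the catalog stores only the vernacular and derives a Latin form on demand.

```python
# A sketch of "store the vernacular, romanize on demand". The table below is an
# illustrative, incomplete Azerbaijani Cyrillic -> official Latin mapping; a real
# system would need full tables (and smarter rules) per language.
AZERBAIJANI_CYRILLIC_TO_LATIN = {
    "А": "A", "а": "a", "Ә": "Ə", "ә": "ə", "Б": "B", "б": "b",
    "Ҹ": "C", "ҹ": "c", "Ч": "Ç", "ч": "ç", "Д": "D", "д": "d",
    "Ҝ": "G", "ҝ": "g", "И": "İ", "и": "i", "Ј": "Y", "ј": "y",
    "Л": "L", "л": "l", "Н": "N", "н": "n", "Ө": "Ö", "ө": "ö",
    "Р": "R", "р": "r", "З": "Z", "з": "z",
    # ...remaining letters omitted in this sketch
}

def derive_latin(vernacular: str, table: dict[str, str]) -> str:
    """Generate a Latin form from stored vernacular data; the catalog keeps only the vernacular."""
    return "".join(table.get(ch, ch) for ch in vernacular)

stored_title = "Азәрбајҹан дили"  # hypothetical vernacular title ("Azerbaijani language")
print(derive_latin(stored_title, AZERBAIJANI_CYRILLIC_TO_LATIN))  # -> Azərbaycan dili
```

The same approach generalizes: keep one table (or transliteration routine) per language and per target scheme, and the catalog never has to privilege any single romanization.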

For further details on the “official” Romanization, see the ALA-LC Romanization Tables.