It’s been a while since I’ve updated this blog, in part because I presently only catalog English-language materials. I’ll be leaving everything up for now, but I have turned off commenting as I get hundreds of spam comments every month. If you need to get in touch with me, send me an email or get in touch using one of the methods listed on my main site.
Open Access & Unicode
I’ve recently discovered Language Science Press, a phenomenal publisher that makes all of its publications available under an open access license.
Of particular interest to readers of this blog is the title The Unicode cookbook for linguists: Managing writing systems using orthography profiles. I’m looking forward to browsing through it!
On Romanization (Pt. 4)
Time for some solutions.
As described in the previous posts, there are tons of problems with the romanization of names and titles in bibliographic metadata. I’ve found articles and letters going back to the earliest days of the implementation of this romanization decrying how it doesn’t serve our communities and is impractical. I’ve found a 2009 article by Michael Brewer that describes learning ALA-LC romanizations as a key competency for students of Slavic studies. If our users have to learn new and unnecessary skills in order to use our libraries, then we aren’t doing our jobs. To sum up my complaints (and the complaints of others), here is a quick breakdown:
- Libraries are reluctant to include any vernacular script that is not one of the JACKPHY+ scripts. This makes it hard to search in the languages that use these scripts and disadvantages the Global South.
- ALA-LC romanization is non-intuitive and inconsistent. It does not usually match the romanization schemes preferred by scholars or native speakers of a language. It often presupposes that letters from a single script are employed the same way in every language.
- Libraries have a habit of taking on tasks that they ought not to. They do things like standardize place names, invent language codes, and come up with romanization schemes. Please stop. Others do this better.
- It goes against the spirit of RDA. Previous cataloging rules allowed for all kinds of shorthands and abbreviations, but RDA emphasizes transcribing information directly from the piece. Romanizing feels like a violation of that principle.
So how do we remedy this?
This is a tricky question, as MARC, the current standard for bibliographic metadata, is (supposedly) dying. There has been great excitement in the library world over the introduction of BIBFRAME, a new standard that will further separate current practice from the practices that were employed when catalogers had to type out information on catalog cards. Because fixing romanization is a forward-looking project, I’ll remain standard-neutral and put out some ideas that could theoretically be applied to any standard.
- Transcribe what is on the piece in the vernacular for all relevant elements. This could be the title, author name, publication information, or other details. Make this the main piece of information, rather than a secondary one.
- Create software that can automatically romanize into a variety of romanization schemes. For Cyrillic, this could include ISO 9, Russian passport transliteration, scholarly transliteration, Azeri Cyrillic to Azeri Latin, and even ALA-LC. I know this sort of algorithm is difficult, but others have done it and we don’t need to re-invent the wheel. (A toy sketch of the idea follows this list.)
- Find a way of linking these romanized bits of information to the vernacular. This is already done (see my previous posts about 880 fields). Also include the type of romanization that is being used in any given field so that we can adjust in the future if, say, a Thai user using a Latin script keyboard is having a hard time finding material in their native language because current schemes are insufficient.
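To make the idea concrete, here is a minimal sketch in Python. It is character-by-character only (real schemes need context rules and digraph handling), the transliteration tables cover just the letters needed for this one example, the diacritics are approximations, and the field names are mine rather than anyone’s standard.

```python
# A toy illustration, not a production transliterator: the vernacular string is
# the primary value, and romanizations are derived from it on demand, each one
# tagged with the scheme that produced it.
COMMON = {"к": "k", "р": "r", "а": "a", "с": "s", "н": "n", "о": "o"}
SCHEMES = {
    "ALA-LC": {**COMMON, "я": "i͡a"},
    "ISO 9": {**COMMON, "я": "â"},
    "scholarly": {**COMMON, "я": "ja"},
}

def romanize(text: str, scheme: str) -> str:
    """Character-by-character mapping; unknown characters pass through unchanged."""
    table = SCHEMES[scheme]
    return "".join(table.get(ch.lower(), ch) for ch in text).capitalize()

record = {"title_vernacular": "Красноярск"}
record["title_romanized"] = {
    scheme: romanize(record["title_vernacular"], scheme) for scheme in SCHEMES
}
print(record["title_romanized"])
# {'ALA-LC': 'Krasnoi͡arsk', 'ISO 9': 'Krasnoârsk', 'scholarly': 'Krasnojarsk'}
```

The point of the design is that the vernacular is stored once, and any number of romanizations can be generated and labeled from it.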
As a community, we have a few further decisions to make. Will these transcribed titles be stored in our bibliographic records, or will romanization automatically happen when the user interface communicates with our database? Should we further indicate whether we have employed the vernacular or romanized information in our records? (I say yes.)
As cataloging improves and advances with new technology, we have the opportunity to change how we deal with non-Latin scripts. Let’s enter the 21st Century and use UNICODE, use the input tools that every computer currently offers, use the vernacular that our users should expect.
On Romanization (Pt. 3)
So we’ve established that cataloging is a bit messy when it comes to deciding whether or not to include the vernacular in addition to romanized forms. But who decides what this looks like?
In short, it’s the American Library Association and the Library of Congress. They’ve produced tables for 75 different languages and/or scripts, although certain documents combine languages that use a single script (Hebrew and Yiddish, Non-Slavic Languages [in Cyrillic Script]), while others separate out languages that use these same scripts (Judeo-Arabic, Russian, etc.).
In some cases, the romanization scheme has been very thoughtfully constructed by all parties involved. For example, in 2012, the library world collaborated with the Cherokee people to produce a romanization scheme that was amenable to all. In other cases, however, the scheme was clearly assembled by people with little knowledge of the language involved.
If, for example, you look at the non-Slavic languages in Cyrillic script chart, you see that there has been no consideration for how each language behaves. Instead, there was merely a failed attempt to assign every possible Cyrillic letter a Romanized equivalent. If the scheme had been successful, that would be one thing, but it’s horribly inconsistent. Take a look at some of the following:
- Tatar, Syriac, Kazakh ә is romanized as ă
- Tatar-Kryashen, Mari, Karelian ӓ is romanized as ă
- Khanty ӓ is romanized as ä
- Chuvash ӑ is romanized as ă
Knowing what I know about these languages, the only two romanizations I can agree with are for Khanty and Chuvash; these are the romanizations that most linguists would use. For Tatar, Mari, Kazakh, etc. I would use ä. The romanization scheme is inconsistent – either provide a 1-to-1 romanization for all possible Cyrillic letters or treat each language individually.
As I just noted, scholarly treatment of these languages rarely aligns with ALA-LC. This is because:
- ALA-LC attempts to create a 1-to-1 system whereby you can easily work out the vernacular form from the romanized form
- ALA-LC has attempted to create an internal consistency based on script and not language (see the Cyrillic examples above)
For scholars working on minority languages, especially, it can be frustrating trying to locate materials in these languages when the romanization in the catalog does not align with the rest of the scholarly literature.
It’s bad enough to annoy scholars, but what about actual speakers of a language? What happens when they have their own Romanization schemes? What happens when a language shifts from one script to a Latin-based one? This has happened several times in the former Soviet Union. Azerbaijani, Turkmen, Kazakh, Crimean Tatar and Uzbek have all converted from a Cyrillic script to a Latin-based one. While neither Tatar nor Belorussian have made this shift, both groups have definitive preferences for how their languages should be presented in Latin script. As an example, here are some of the mismatches you’ll encounter when comparing ALA-LC to the official Latin standard for Azerbaijani:
Cyrillic | ALA-LC | Official Latin |
---|---|---|
Ҹ ҹ | j | c |
Ч ч | ch | ç |
Ә ә | ă | ə |
Ҝ ҝ | ġ | g |
Ө ө | ȯ | ö |
This disconnect is less than ideal because it means that a native speaker of Azerbaijani not only has to know the Latin script that is currently taught in schools and the Cyrillic script that was used up until the early 90s, but also has to learn the ALA-LC Romanization that is used in American and British libraries. And if that same speaker were to go to Germany, they would have to learn the system used there!
I’m not opposed to romanization. While we do have access to input methods for most of the world’s languages, they are still imperfect. I, for example, have no problem reading Cyrillic, yet still have a hard time typing it. Romanization makes things easy.
However, the current system is broken. There is no internal consistency, and the wishes of native speakers are overlooked. I have a few suggestions for how to proceed, and I’ll describe them in the next post.
On Romanization (Pt. 2)
I mentioned in the previous post that current standards allow for the inclusion of only certain vernacular scripts. Most of these are scripts that do not have an easy 1-to-1 mapping to Latin script.
In the case of Chinese, Japanese, and Korean, this is because these writing systems employ Chinese characters (called kanji in Japanese and hanja in Korean). The Chinese writing system is logographic, which means that a word or a part of a word is represented by a single character. Over time, these characters have lost much of their relationship to pronunciation, and there exist many characters that sound the same but which express different ideas. For example, 烔, 罿, and 硐 are all pronounced tóng, but mean “heated”, “bird net” and “grind”, respectively. This presented a problem for catalogers because there was no way to work backward from a Romanized version of Chinese to the original. Korean has largely dropped the use of hanja, which simplifies matters somewhat, as Korean uses an alphabet that is arranged into syllabic blocks. Japanese is even more complicated because it still employs kanji, but also uses syllabaries called katakana and hiragana, meaning that a single sentence or title could potentially have three different scripts in it. Therefore, there is no way to map between Romanized titles (or names) and the vernacular in these languages.
Other scripts that have presented a problem are Hebrew and the Perso-Arabic scripts. Neither writing system requires the writing of vowels. Although there are optional vowel symbols, they are typically not written except in children’s books and language-learning materials. For example, תל־אביב is Tel Aviv, yet the e in Tel is not written. Likewise المغرب is al-maġhrib “Morocco”, yet the a and i are not written. Again, this prohibits the existence of a 1-to-1 relation between the Romanized and vernacular versions.
To remedy this, in 1979 the Library of Congress worked with a few other library groups to implement the JACKPHY initiative (Japanese, Arabic, Chinese, Korean, Persian, Hebrew, Yiddish). In practice, this expanded to include all languages using any of these scripts (Urdu, Kurdish, various Jewish languages, pre-French Vietnamese). The result of this was that cataloging for materials in these languages would include not only a romanized version, but also a vernacular version.
Later, the Library of Congress added Greek and Cyrillic (Russian, Bulgarian, Kyrgyz, etc.) to its list of approved languages, likely due to the large number of materials available in the West that were published in languages using these scripts.
In practice, OCLC’s cataloging software Connexion displays the romanized and vernacular versions of a title side by side, with linkages between them, all within a single MARC catalog entry.
While this looks like a good solution, consider what happens when library software actually translates the record:
The vernacular has been shifted down to a series of 880 fields called “Alternate Graphic Representation”. This is because pure MARC doesn’t allow for the kind of linking seen in OCLC; instead it uses a complex series of codes to link the fields.
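To give a flavor of that linking mechanism, here is a small Python sketch of how the $6 subfield ties a romanized field to its 880 counterpart. The field and occurrence values are hypothetical; only the general shape of the $6 convention (linked tag, occurrence number, and an optional script code after a slash) comes from MARC 21.

```python
# Hypothetical $6 values as they might appear in a record: the romanized 245
# points at "880-01", and the matching 880 points back at "245-01", with a
# script code ("(N" for Cyrillic) tacked on after a slash.
def parse_linkage(subfield_6: str):
    """Split a $6 value like '880-01' or '245-01/(N' into (tag, occurrence, script)."""
    link, _, script = subfield_6.partition("/")
    tag, occurrence = link.split("-")
    return tag, occurrence, script or None

romanized_6 = parse_linkage("880-01")      # found in the romanized 245 field
vernacular_6 = parse_linkage("245-01/(N")  # found in the 880 field holding the Cyrillic

# Matching occurrence numbers ("01") are what tell software that the two
# fields describe the same element in two different scripts.
assert romanized_6[1] == vernacular_6[1]
print(romanized_6, vernacular_6)
```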
The result of all of this is that romanized versions of data are prioritized, whereas the vernacular is an afterthought. There is, in fact, no mandate for the inclusion of the vernacular.
Here’s where standards come into play. PCC, the Program for Cooperative Cataloging, sets cataloging standards in the US. Under its purview are four sub-groups that set standards for specific types of records.
- BIBCO: monographic records (books and other one-time publications)
- CONSER: continuing resource records (serials, journals, etc.)
- NACO: name authority records
- SACO: subject authority records
Of these four groups, only BIBCO allows for the inclusion of (potentially) any script in a catalog record. The rest require that any record use only characters from MARC-8 character sets. This has a chilling effect on BIBCO, as the vast majority of records do not take advantage of the fact that non-JACKPHY+ scripts may be used.
For most libraries there is no requirement that they follow PCC guidelines; they could potentially enter data in whatever script they want. However, most libraries also do not have catalogers dedicated to foreign languages. Because catalog records are shared, and because PCC libraries are overwhelmingly well-funded and influential, most library records are produced by, or standardized by, libraries following PCC standards.
As a result, tons of languages with rich literary traditions do not see their vernaculars represented in library catalogs. Most of these are from South Asia (Hindi, Bengali, Gujarati, Tamil) and Southeast Asia (Thai, Lao, Khmer). We also lose out on Armenian, Cherokee, Georgian, Tibetan, Mongolian, Berber, Inuktitut, and many other languages with non-supported writing systems.
All of the languages I have mentioned are fully supported by UNICODE, yet the library world’s dependence on MARC-8 means that they can’t be represented. Granted, BIBCO records could (and sometimes do!) have vernacular representation of these languages, yet this is far too rare.
On Romanization (Pt. 1)
As I’ve shown in previous posts, cataloging multilingual library materials is not so simple. For the next few posts, I plan to discuss my issues with Romanization (i.e., the conversion of non-Latin scripts into Latin characters). As with MARC-8, there is a long (and very good) historical precedent for Romanization. And as with MARC-8, technology and interconnectedness have advanced to the point where Romanization has become a problem. I propose that we stop Romanizing and instead enter data in the vernacular form, leaving it up to algorithms or other forms of automation to Romanize should this be necessary. A few of the problems I have with Romanization are as follows; each will be discussed in later posts.
- Current standards allow for the inclusion of only certain vernacular scripts (JACKPHY, Cyrillic, and Greek – more on these later). These standards are hangovers from MARC-8 and should be discarded immediately; they also prevent native-script users from searching the catalog in their own scripts.
- For most non-Latin scripts there exists a wide variety of Romanization schemes. Why force library users to learn ALA-LC (the standard in US cataloging)?
- Quite a few languages have official Latin-based alphabets, either because they switched from a non-Latin alphabet to a Latin-based one (e.g. Azerbaijani, Uzbek) or because a governmental body promotes a specific Romanization (as in South Korea).
- It is not unheard of for a title (or other tidbit of catalogable material) to exist in multiple scripts. This makes Romanization messy.
- Romanization goes against the spirit of RDA. If RDA instructs us to enter data as it exists on the piece, then why do we Romanize?
For further details on the “official” Romanization, see the ALA-LC Romanization Tables.
What is a Character Set? (Part Three): UNICODE: the future is now!
If you thought it would be strange for computers to still rely on 40-year-old technology to deal with the world’s many writing systems, you would be correct. While the library world was content to stick with MARC-8, the computing world evolved constantly.
For most computer users in the early days of MARC-8, having access to many code points was not especially important. Libraries, as multilingual environments, were among the few institutions where the availability of multiple scripts mattered.
By the late 1980s and early 1990s, however, the rapid adoption of the Internet and then the World Wide Web meant that computers around the world could talk to each other. At the same time, computing power was growing rapidly. The phones that we carry in our pockets have greater computing power than NASA used to put humans in space, so processing lots and lots of bits is no longer a problem.
Having a huge number of encoding schemes was counter-productive, as communication in one locale would be rendered as gibberish in another. (Does anyone remember this from the early days of the Internet? Running through encoding settings in order to make a website readable?)
To solve this, computer scientists began working toward a universal standard that could unite all existing standards, and all of the scripts and characters they expressed, into one single standard. The result was Unicode; the Unicode Consortium was incorporated in 1991, growing out of discussions between engineers at Xerox and Apple.
Unicode is a bit confusing from an encoding standpoint, as it is not an encoding, per se, but rather a standard. Unicode can be implemented in several ways, but the most common are UTF-8, UTF-16, and UTF-32. UTF-32 is a fixed-width scheme: every character takes up exactly 32 bits, which makes it simple but space-hungry, so it is not very common. UTF-16 uses 16 bits for most characters but needs a pair of 16-bit units for characters outside the Basic Multilingual Plane; it is widely used internally by systems like Windows and Java, but is rare as an interchange format. UTF-8 is a variable-width scheme, with characters taking up 8, 16, 24 or 32 bits. This flexibility means that all characters can be expressed easily, yet less space is taken up than with UTF-32.
Other encodings have been implemented or proposed, yet none are very common. UTF-8 is the web standard, although any other UTF encoding should present little difficulty to browsers or other document readers.
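If you want to see the size differences for yourself, here’s a quick check in Python of how many bytes a few sample characters occupy under each encoding (the characters are arbitrary picks, and the little-endian variants are used only to keep the byte-order mark out of the counts).

```python
# Byte counts for the same characters under the three common Unicode encodings.
# Note that UTF-16 also grows to 4 bytes for characters outside the Basic
# Multilingual Plane, such as the Deseret letter at the end.
for ch in ["a", "ə", "Ꮳ", "𐐷"]:
    print(
        ch, f"U+{ord(ch):04X}:",
        len(ch.encode("utf-8")), "bytes in UTF-8,",
        len(ch.encode("utf-16-le")), "in UTF-16,",
        len(ch.encode("utf-32-le")), "in UTF-32",
    )
```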
Now that we’ve covered encodings, let’s cover what Unicode is. Unicode, as I’ve mentioned, is a standard, and it is governed by the Unicode Consortium. This consortium is made up of members, mostly tech companies like Apple and Adobe and Oracle, but also many governments, linguistics institutions, universities, and even interested individuals. These members collaborate to ensure that Unicode truly works as a (near) universal standard, to ensure that every computer can produce and interpret Unicode-compliant material, and to expand the standard to include the scripts necessary to digitally (re)produce the sum of human knowledge.
As a standard, Unicode assigns hexadecimal code points to characters, which are represented in slightly different ways depending upon the encoding selected. These characters are conceived of as belonging to blocks. The capital letter <Q> belongs to the C0 Controls and Basic Latin block and has the specific code U+0051. (0051 is the code point; “U+” is usually added to specify that we’re talking about Unicode.)
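Most programming languages expose these code points directly. For instance, Python’s built-in unicodedata module will report the code point and official character name of anything you type; the sample characters below are just examples.

```python
import unicodedata

# Code point and canonical Unicode name for a few characters.
for ch in ["Q", "ә", "Ꮳ"]:
    print(ch, f"U+{ord(ch):04X}", unicodedata.name(ch))
# Q U+0051 LATIN CAPITAL LETTER Q
# ә U+04D9 CYRILLIC SMALL LETTER SCHWA
# Ꮳ U+13E3 CHEROKEE LETTER TSA
```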
As the need for more scripts or symbols grows, Unicode can add new blocks or assign characters to new code points. After the Rohingya crisis in Myanmar, Unicode rushed to include the Rohingya script in its standard to ensure that agencies could produce Rohingya-language materials that could be interpreted by any computer. In the past, a new script would necessitate a new font with its own encoding, and there was no guarantee that a given computer could read a document written in this way. Unicode solves this by putting every script into a single standard that is readable by just about any modern computer.
So given that Unicode seems to be the solution to our outdated MARC-8 system, why do we still stick to MARC-8? To a certain extent, we actually have made the switch. OCLC, a global cataloging cooperative, allows for the creation of records in Unicode, then allows participating libraries to export records in a variety of encodings, including MARC-8. The reason the switch hasn’t happened completely is largely due to money and tradition.
On the money end, re-encoding the catalog has the potential to be quite costly. Library software is notoriously expensive and difficult to implement, so many libraries use legacy systems that don’t play well with Unicode. This decision, however, is largely at the level of the individual library, and I have no doubt that clever catalogers could tweak their software and their catalogs to use the Unicode standard.
Tradition plays a role as well. In the MARC catalog record, field 066 is used to indicate what scripts are present. This would be unnecessary if all scripts were inherently supported. Have a look at the mess we deal with now to see why keeping MARC-8 is such a bad idea. PCC, the Program for Cooperative Cataloging, has also been slow to adopt Unicode. It has 4 divisions: BIBCO (most bibliographic records), NACO (name authorities), SACO (subject authorities), and CONSER (serials). Of these four, only BIBCO allows for the use of Unicode; the rest require MARC-8. And because PCC is the gold standard for cataloging, it effectively controls the keys to Unicode.
It’s 2019. My phone can type and read just about any language I want it to. My computer has no trouble rendering Cyrillic or Mongolian or even Egyptian hieroglyphics. A person from Thailand could conceivably search for an author in an American catalog, but if that author’s name is in Thai, they have to resort to transliterating it according to a prescribed standard because NACO doesn’t allow us to enter names in vernacular scripts that MARC-8 doesn’t support.
The library world has resigned itself to abandoning MARC, not because it was bad, but because we have technology that allows us to do so much more. We should be proud of the brilliant librarians who developed MARC and MARC-8, and the best way to honor their legacy is to move on to something even better.
Delays and Applications
I’ve been quite busy as of late, working on other projects, spending time with family, and a boatload of job applications.
I thought it would be nice to share the slides for a job presentation I recently gave. I’ve removed some identifying details. They’re available here.
I’m still working on part 3 of the character set series, which I hope to have completed soon. I’ve recently had some fruitful discussions about transcription, Unicode, and libraries giving up control of things like language names, transcription schemes, and script codes, and I have plans to write more about those things soon.
What is a Character Set? (Part Two): MARC-8
Libraries are necessarily multilingual environments. Library materials arrive at catalogers’ desks in a wide variety of languages and catalog records need to be created for these materials.
In the late 1960s, libraries saw the potential of computers to organize the vast amount of data previously held in paper catalogs. Henriette Avram at the Library of Congress spearheaded this initiative, alongside staff at OCLC, which was founded in 1967.
A major early problem to be tackled was that of encoding. ASCII couldn’t cover anything but English, and certainly couldn’t begin to cover any non-Latin alphabet. As the computing world had not yet come up with a solid multi-script encoding standard, it was clear that libraries would have to develop their own.
Enter MARC-8. MARC-8 was introduced alongside the other MARC standards, and, for the most part, resembled ASCII in its earliest incarnations. Unlike ASCII, it employed 8 bits, rather than 7. To remedy the lack of non-English characters, a standard called ANSEL (American National Standard for Extended Latin Alphabet Coded Character Set for Bibliographic Use) was introduced. ANSEL added a number of characters not found in English, as well as a few symbols: æ, ð, ©, ư, ı, etc. Crucially, ANSEL included a number of diacritics that could be stacked above or below existing letters to create the accented characters: á, ğ, è, ü. These diacritics also took up 8 bits and were placed before the character that they modify, e.g. ¨u.
What is brilliant about this system is that we gain a huge number of new characters without having to expand the number of bits that every single character uses. In MARC-8/ANSEL, we have 256 possible code points. If, however, we had to add new code points for every single combination of diacritic + character, we would quickly run out of space.
Although 8 bits should allow for 256 code points, in reality each character set only gives us 94. The high bit selects between two tables of 128 code points each; within each table, 32 positions are control characters and 2 are reserved. By referring back to a 7-bit system, MARC-8 was able to maintain a level of backward compatibility with ASCII.
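Unicode later adopted essentially the same trick with its combining marks, which makes the space savings easy to demonstrate in Python. (The counts at the end are illustrative, not the actual ANSEL inventory, and note one difference: ANSEL put the diacritic before the base letter, whereas Unicode combining marks follow it.)

```python
import unicodedata

# "ü" as one precomposed code point versus a base letter plus a combining mark.
precomposed = "ü"                                        # U+00FC
decomposed = unicodedata.normalize("NFD", precomposed)   # "u" + U+0308 COMBINING DIAERESIS
print([f"U+{ord(c):04X}" for c in decomposed])           # ['U+0075', 'U+0308']

# The arithmetic behind the design: with B base letters and D diacritics,
# precomposing every combination needs B * D code points, while separate
# combining marks need only B + D.
B, D = 94, 14                                            # illustrative numbers only
print(B * D, "precomposed combinations vs", B + D, "code points")
```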
This system worked well enough for a time. Whenever a non-Latin script was encountered, catalogers could represent the characters in that script with Latin equivalents that could easily be converted back to the original script, but that could also easily be entered and searched for by anyone whose computer and keyboard could only work with Latin characters. Over time, these correspondences were formalized under the ALA-LC Romanization scheme. This meant that anyone who saw Krasnoi︠a︡rsk in a MARC record knew that the Cyrillic equivalent should be Красноярск.
Where this system ran into trouble was with non-alphabetic scripts. Ideally, there should be a direct correspondence between non-Latin symbols and Latin symbols. This works well for many familiar writing systems like Cyrillic and Greek, not to mention Hindi, Cherokee, Mongolian, or Amharic. However, some writing systems (Arabic, Hebrew) do not require vowels to be written. And others (like Chinese hanzi, which are also often employed in Japan and the Koreas) use single symbols to represent whole words or parts of words. For these writing systems, any Romanization scheme would either produce something illegible (e.g. alardn to represent Arabic Al-ʾUrdunn “Jordan”) or simply would not work, as with Chinese hanzi.
To remedy this, the JACKPHY initiative was founded in 1979. JACKPHY, which stands for Japanese, Arabic, Chinese, Korean, Persian, Hebrew, Yiddish, sought to include the native script for these languages within MARC records. This is the point at which non-Roman characters entered the MARC record.
While allowing for the use of non-Roman scripts in MARC records solved many problems, it created new ones. Namely, how do we encode the enormous number of new characters that we are allowing into our record sets?
For character sets with a relatively small number of characters, the answer was fairly simple: set up separate code charts for these alphabets, then tell the computer program reading the MARC record which character set to use. In the 880 fields, which contain the native script equivalent of romanized fields, a tag is inserted telling the computer how to interpret the MARC-8 codes:
Code | Script |
---|---|
(3 | Arabic |
(B | Latin |
$1 | Chinese, Japanese, Korean |
(N | Cyrillic |
(S | Greek |
(2 | Hebrew |
066 subfield ǂc can also be used to tell the computer what scripts to watch out for.
Where things get weird is with Chinese, Japanese, and Korean (the CJK scripts). Due simply to the huge number of characters employed by Chinese, it would be impossible to encode everything in 8 bits. Instead, the East Asian Character Code (EACC) was adopted. When the $1 tag is present, the computer knows to read three 7-bit bytes as one character. In effect, this gives the benefit of having 21 bits available without actually having to have a 21-bit character set.
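The arithmetic is easy to sketch. This is schematic only: the byte values below are made up, and the packing shows the general idea of treating three 7-bit values as one number, not the actual EACC table layout.

```python
# Schematic: three 7-bit values concatenated give a 21-bit code space.
def pack_three_bytes(b1: int, b2: int, b3: int) -> int:
    for b in (b1, b2, b3):
        assert 0 <= b < 128           # each byte contributes 7 usable bits
    return (b1 << 14) | (b2 << 7) | b3

print(pack_three_bytes(0x21, 0x30, 0x45))  # one value out of...
print(2 ** 21)                             # ...2,097,152 possibilities
```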
Looking at the tags above, you’ll notice that Greek and Cyrillic are also included. These two scripts were added later on, likely due to the number of materials in these languages received by American libraries, and due to the cultural significance to American scholarship.
For some 40 years, American libraries have relied on this system to encode and display information in a wide range of languages. For the most part, attempts to update encoding systems have failed, likely due to cost and complication, and because so many records at so many institutions would have to be changed if a new encoding standard were adopted.
Next post: UNICODE: the future is already here.
What is a Character Set? (Part One)
One of the core issues in multilingual cataloging is the character set. You may have noticed that many library catalogs allow for titles and other information to appear in Russian, Chinese, Hebrew, or Arabic, but not in Hindi, Thai, Armenian, or Cherokee. This all has to do with the character sets supported by our cataloging systems.
At its core, a character set is a way to tell a computer to display a series of 1s and 0s as the symbols you see on your screen. The way in which these codes are processed is known as a character encoding. Although the world has mostly reached some agreed-upon standards for character encodings (more on this later), sometimes you may see documents displayed using incorrect encoding. This is why you sometimes see webpages or e-mails display as gibberish, rather than displaying in the correct form. For further examples, see the W3 page explaining this.
The reason we have these problems is that space and processing constraints required us to use different character encodings that were capable of displaying different character sets. Computers operate on a series of binary values: yes/no, true/false, 0/1. Each of these values is a bit, which takes up space and which needs to be read by a program, which takes time.
Consider, then, that we need to express the alphabet in bits. In the early days of computing, programmers focused on English (in the US, at least). This requires 26 uppercase letters, 26 lowercase letters, 10 numerals, and the punctuation and symbols necessary to encode mathematical and accounting concepts. Additionally, we need to express things like spaces and carriage returns. One of the earliest systems for encoding letters as bits is ASCII: the American Standard Code for Information Interchange, which was officially implemented in 1963. This system uses 7 bits to express these symbols: 2^7=128, which means we can express 128 characters in total. The ASCII chart on Wikipedia shows how this looks.
A word like “cat” takes up 21 bits, “book” takes up 28, etc. Overall, not too bad. This remained the standard in the USA until the 1980s; other countries created their own 6- and 7-bit encoding systems to meet their own needs.
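A quick Python check of that arithmetic (the 7-bits-per-character figure is ASCII’s, even though Python itself stores strings differently):

```python
# Each ASCII character fits in 7 bits, so a word costs 7 bits per letter.
for word in ["cat", "book"]:
    codes = [ord(c) for c in word]          # ASCII code points
    print(word, codes, "->", len(word) * 7, "bits")
# cat [99, 97, 116] -> 21 bits
# book [98, 111, 111, 107] -> 28 bits
```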
As computers gained storage and increased in processing speed, it became possible to increase the number of bits in an encoding system. This allowed us to add characters necessary to encode non-English symbols: ñ, ÿ, é, etc. These 8-bit encodings are usually called extended ASCII; the extra bit doubles the number of possible characters to 256, at the cost of a little more space and processing power. At the same time, standards were set for a number of different writing systems and character sets. Computers needed to be able to distinguish these encoding standards, otherwise gibberish would result. From the 1980s until the 1990s, this multitude of systems co-existed with varying degrees of success.
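Here’s what that gibberish looks like in practice: the same single byte means different things under different 8-bit code pages (Latin-1, Windows-1251, and the old IBM PC code page are just three examples).

```python
# One byte, three readings: decode the same data with different 8-bit code pages.
data = bytes([0xE9])
print(data.decode("latin-1"))   # é  (Western European)
print(data.decode("cp1251"))    # й  (Cyrillic, Windows-1251)
print(data.decode("cp437"))     # Θ  (original IBM PC code page)
```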
In the late 1960s, libraries began to explore digitizing catalogs. Because few standards had been set by this point, and because libraries dealt with materials in a wide range of languages, the library community had to decide how to deal with this problem. More MARC-8 and UNICODE in the next post.