What is a Character Set? (Part Two): MARC-8

Libraries are necessarily multilingual environments. Library materials arrive at catalogers’ desks in a wide variety of languages, and catalog records need to be created for these materials.

In the late 1960s, libraries saw the potential of computers to organize the vast amount of data previously held in paper catalogs. Henriette Avram at the Library of Congress spearheaded this initiative, alongside staff at OCLC, which was founded in 1967.

A major early problem to be tackled was that of encoding. ASCII couldn’t cover anything but English, and certainly couldn’t begin to cover any non-Latin alphabet. As the computing world had not yet come up with a solid multi-script encoding standard, it was clear that libraries would have to develop their own.

Enter MARC-8. MARC-8 was introduced alongside the other MARC standards and, in its earliest incarnations, largely resembled ASCII, except that it employed 8 bits rather than 7. To remedy the lack of non-English characters, a standard called ANSEL (American National Standard for Extended Latin Alphabet Coded Character Set for Bibliographic Use) was introduced. ANSEL added a number of characters not found in English, as well as a few symbols: æ, ð, ©, ư, ı, etc. Crucially, ANSEL included a number of diacritics that could be stacked above or below existing letters to create accented characters: á, ğ, è, ü. Each of these diacritics also took up 8 bits and was placed before the character it modifies, e.g. ¨u.
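To make that ordering concrete, here is a minimal Python sketch of the reordering a MARC-8-to-Unicode converter has to perform. It uses Unicode combining characters as stand-ins for the actual ANSEL byte values (which a real converter would also have to map); the point here is only the prefix-versus-suffix ordering.

```python
import unicodedata

def marc8_order_to_unicode(chars: str) -> str:
    """Move combining marks from *before* their base letter (the
    MARC-8/ANSEL convention) to *after* it (the Unicode convention)."""
    out = []
    pending = []  # combining marks waiting for their base letter
    for ch in chars:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    return unicodedata.normalize("NFC", "".join(out))

# A combining diaeresis placed before the "u", mimicking the MARC-8
# ordering described above; the result is "ü".
print(marc8_order_to_unicode("\u0308u"))
```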

What is brilliant about this system is that we gain a huge number of new characters without having to expand the number of bits that every single character uses. In MARC-8/ANSEL, we have 256 possible code points. If, however, we had to add new code points for every single combination of diacritic + character, we would quickly run out of space.

Although 8 bits should allow for 256 code points, in reality we only have 94 to work with. The leading bit is used to select between 2 different tables; of each table’s 128 positions, 32 are control characters and 2 are reserved (space and delete). By preserving this underlying 7-bit structure, MARC-8 was able to maintain a level of backward compatibility with ASCII.
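The arithmetic is worth spelling out. The numbers below are only illustrative (the size of the diacritic inventory, in particular, is a guess), but they show why stackable diacritics are such an economical design:

```python
table_size = 2**7     # the leading bit selects one of two 128-slot tables
control = 32          # control characters
reserved = 2          # space and delete
print(table_size - control - reserved)   # 94 usable graphic code points

# If every letter + diacritic pairing needed its own precomposed code
# point, even a modest inventory would overflow that space:
base_letters = 52     # A-Z plus a-z
diacritics = 14       # an assumed, modest diacritic inventory
print(base_letters * diacritics)         # 728 code points needed
```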

This system worked well enough for a time. Whenever a non-Latin script was encountered, catalogers could represent its characters with Latin-script equivalents that could easily be converted back to the original script, but that could also be entered and searched by anyone whose computer and keyboard handled only Latin characters. Over time, these correspondences were formalized under the ALA-LC Romanization scheme. This meant that anyone who saw Krasnoi︠a︡rsk in a MARC record knew that the Cyrillic equivalent should be Красноярск.
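As a toy illustration of what such a correspondence table looks like in practice, here is a deliberately tiny Python sketch. It is not the real ALA-LC table, which is far larger and fussier about ligature ties and ambiguous sequences; it only shows the idea of a character-by-character mapping that can, in principle, be run in reverse.

```python
# A handful of Cyrillic-to-Latin correspondences (toy subset only).
CYR_TO_LAT = {
    "К": "K", "к": "k", "р": "r", "а": "a", "с": "s",
    "н": "n", "о": "o", "я": "i͡a",  # ALA-LC marks multi-letter values with a tie
}

def romanize(word: str) -> str:
    # Characters missing from the toy table pass through unchanged.
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in word)

print(romanize("Красноярск"))  # roughly Krasnoi͡arsk with this subset
```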

Where this system ran into trouble was with non-alphabetic scripts. Ideally, there should be a direct correspondence between non-Latin symbols and Latin symbols. This works well for many familiar writing systems like Cyrillic and Greek, not to mention Hindi, Cherokee, Mongolian, or Amharic. However, some writing systems (Arabic, Hebrew) do not require vowels to be written. And others (like Chinese hanzi, which are also often employed in Japan and the Koreas) use single symbols to represent whole words or parts of words. For these writing systems, any Romanization scheme would either produce something illegible (e.g. alardn to represent Arabic Al-ʾUrdunn “Jordan”) or simply would not work, as with Chinese hanzi.

To remedy this, the JACKPHY initiative was founded in 1979. JACKPHY, which stands for Japanese, Arabic, Chinese, Korean, Persian, Hebrew, Yiddish, sought to include the native script for these languages within MARC records. This is the point at which non-Roman characters entered the MARC record.

While allowing for the use of non-Roman scripts in MARC records solved many problems, it created new ones. Namely, how do we encode the enormous number of new characters that we are allowing into our record sets?

For character sets with a relatively small number of characters, the answer was fairly simple: set up separate code charts for these alphabets, then tell the computer program reading the MARC record which character set to use. In the 880 fields, which contain the native script equivalent of romanized fields, a tag is inserted telling the computer how to interpret the MARC-8 codes:

(3 = Arabic
(B = Latin
$1 = Chinese, Japanese, Korean
(N = Cyrillic
(S = Greek
(2 = Hebrew

066 subfield ǂc can also be used to tell the computer what scripts to watch out for.
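As a rough sketch of how software makes use of these codes, here is a small Python function that splits up the subfield $6 of an 880 field. The function and field names are mine, and real MARC software does far more than this, but the shape of the value (linked tag and occurrence number, then the script code, then an optional orientation flag) is what the table above feeds into:

```python
SCRIPTS = {
    "(3": "Arabic",
    "(B": "Latin",
    "$1": "Chinese, Japanese, Korean",
    "(N": "Cyrillic",
    "(S": "Greek",
    "(2": "Hebrew",
}

def parse_subfield_6(value: str) -> dict:
    """Split an 880 $6 value such as "245-01/(2/r" into its parts."""
    parts = value.split("/")
    linked_tag, occurrence = parts[0].split("-")
    return {
        "linked_tag": linked_tag,
        "occurrence": occurrence,
        "script": SCRIPTS.get(parts[1], "unknown"),
        "orientation": "right-to-left" if len(parts) > 2 and parts[2] == "r"
                       else "left-to-right",
    }

print(parse_subfield_6("245-01/(2/r"))
# {'linked_tag': '245', 'occurrence': '01', 'script': 'Hebrew',
#  'orientation': 'right-to-left'}
```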

Where things get weird is with Chinese, Japanese, and Korean (the CJK scripts). Due simply to the huge number of characters employed by Chinese, it would be impossible to encode everything in 8 bits. Instead, the East Asian Character Code (EACC) was adopted. When the $1 tag is present, the computer knows to read three 7-bit strings as one character. In effect, this gives the benefit of having 21 bits available without actually having to define a 21-bit character set.
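Here is a sketch of that byte-grouping, with the caveat that the bit-packing shown is only for illustration; actually turning a triple into a character requires looking it up in the EACC code table, which is out of scope here:

```python
def eacc_triples(raw: bytes):
    """Group EACC-encoded data (flagged by the $1 escape) into 3-byte
    characters, each holding three 7-bit values, i.e. 21 bits in all."""
    if len(raw) % 3:
        raise ValueError("EACC data should come in 3-byte groups")
    for i in range(0, len(raw), 3):
        b1, b2, b3 = raw[i], raw[i + 1], raw[i + 2]
        # Pack the three 7-bit values into one 21-bit number (illustrative).
        yield (b1 << 14) | (b2 << 7) | b3
```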

Looking at the tags above, you’ll notice that Greek and Cyrillic are also included. These two scripts were added later on, likely due to the number of materials in these languages received by American libraries and to their cultural significance for American scholarship.

For some 40 years, American libraries have relied on this system to encode and display information in a wide range of languages. For the most part, attempts to update encoding systems have failed, likely due to cost and complication, and because so many records at so many institutions would have to be changed if a new encoding standard were adopted.

Next post: UNICODE: the future is already here.

What is a Character Set? (Part One)

One of the core issues in multilingual cataloging is the character set. You may have noticed that many library catalogs allow for titles and other information to appear in Russian, Chinese, Hebrew, or Arabic, but not in Hindi, Thai, Armenian, or Cherokee. This all has to do with the character sets supported by our cataloging systems.

At its core, a character set is a way to tell a computer to display a series of 1s and 0s as the symbols you see on your screen. The way in which these codes are processed is known as a character encoding. Although the world has mostly settled on agreed-upon standards for character encodings (more on this later), you may still come across documents displayed with the wrong encoding. This is why webpages or e-mails sometimes display as gibberish rather than in their correct form. For further examples, see the W3 page explaining this.
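You can reproduce the gibberish effect yourself in a couple of lines of Python: take the bytes of a UTF-8 string and deliberately read them back with a different encoding.

```python
raw = "café".encode("utf-8")   # the bytes as a UTF-8 encoder writes them
print(raw.decode("latin-1"))   # cafÃ© : the classic mojibake
print(raw.decode("utf-8"))     # café  : the same bytes, read correctly
```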

The reason we have these problems is that space and processing constraints required us to use different character encodings capable of displaying different character sets. Computers operate on a series of binary values: yes/no, true/false, 0/1. Each of these values is a bit, which takes up space and which needs to be read by a program, which takes time.

Consider, then, that we need to express the alphabet in bits. In the early days of computing, programmers focused on English (in the US, at least). This requires 26 uppercase letters, 26 lowercase letters, 10 numerals, and the punctuation and symbols necessary to encode mathematical and accounting concepts. Additionally, we need to express things like spaces and carriage returns. One of the earliest systems for encoding letters as bits is ASCII: the American Standard Code for Information Interchange, which was officially implemented in 1963. This system uses 7 bits to express these symbols: 2^7 = 128, which means we can express 128 characters in total. This image from Wikipedia shows how this looks:

https://upload.wikimedia.org/wikipedia/commons/c/cf/USASCII_code_chart.png

A word like “cat” takes up 21 bits, “book” takes up 28, and so on. Overall, not too bad. This remained the standard in the USA until the 1980s; other countries with their own needs created various 6- and 7-bit encoding systems.
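The arithmetic is easy to check: every ASCII character costs 7 bits, so a word’s cost in bits is just seven times its length.

```python
for word in ("cat", "book"):
    print(word, 7 * len(word), "bits")   # cat 21 bits / book 28 bits

# The 7-bit code for each letter of "cat", as in the chart above:
print([format(ord(ch), "07b") for ch in "cat"])
# ['1100011', '1100001', '1110100']
```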

As computers gained storage and processing speed, it became possible to increase the number of bits in an encoding system. This allowed us to add the characters necessary to encode non-English symbols: ñ, ÿ, é, etc. These systems are usually called extended ASCII; they employ 8 bits (for a total of 256 characters), which increases the number of possible characters but also the space and processing power required. At the same time, standards were set for a number of different writing systems and character sets. Computers needed to be able to distinguish these encoding standards, otherwise gibberish would result. From the 1980s through the 1990s, this multitude of systems co-existed with varying degrees of success.
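To see why the computer has to be told which 8-bit standard is in play, consider a single byte, 0xE9, decoded three different ways (the three encodings below are just convenient examples):

```python
b = bytes([0xE9])
print(b.decode("latin-1"))     # é : ISO 8859-1, Western European
print(b.decode("koi8-r"))      # И : a Cyrillic letter
print(b.decode("iso8859-7"))   # ι : a Greek letter
```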

In the late 1960s, libraries began to explore digitizing catalogs. Because few standards had been set by this point, and because libraries dealt with materials in a wide range of languages, the library community had to decide how to deal with this problem. More MARC-8 and UNICODE in the next post.

Welcome!

I’ve got a PhD in linguistics, and an MSLIS. I have a lot to say about languages, character sets, accessibility, and the current way the library catalog treats less common languages.

I’ll be including some explanatory information, and will also periodically argue for changes that could be made to make the library catalog more accessible, up-to-date, and web-friendly.