What is a Character Set? (Part One)

One of the core issues in multilingual cataloging is the character set. You may have noticed that many library catalogs allow for titles and other information to appear in Russian, Chinese, Hebrew, or Arabic, but not in Hindi, Thai, Armenian, or Cherokee. This all has to do with the character sets supported by our cataloging systems.

At its core, a character set is a way to tell a computer to display a series of 1s and 0s as the symbols you see on your screen. The way in which these codes are processed is known as a character encoding. Although the world has mostly settled on agreed-upon standards for character encodings (more on this later), you may still come across documents displayed using the wrong encoding. This is why webpages or e-mails sometimes appear as gibberish rather than as readable text. For further examples, see the W3 page explaining this.
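To make the gibberish problem concrete, here is a minimal Python sketch. The sample word, the helper name misread, and the choice of UTF-8 and Latin-1 are my own assumptions for illustration; any pair of mismatched encodings would show a similar effect.

```python
# Illustrative sketch: the helper name and sample text are hypothetical.
def misread(text: str, written_as: str, read_as: str) -> str:
    """Store text as bytes in one encoding, then read the bytes back in another."""
    return text.encode(written_as).decode(read_as)

original = "café"
print(misread(original, "utf-8", "utf-8"))    # same encoding both ways: café
print(misread(original, "utf-8", "latin-1"))  # mismatched encodings: cafÃ©
```

The bytes never change between the two calls. Only the rule used to interpret them changes, and that is enough to turn one accented letter into two unrelated symbols.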

We have these problems because space and processing constraints once forced us to use many different character encodings, each capable of displaying a different character set. Computers operate on binary values: yes/no, true/false, 0/1. Each of these values is a bit, which takes up space and must be read by a program, which takes time.

Consider, then, that we need to express the alphabet in bits. In the early days of computing, programmers focused on English (in the US, at least). This requires 26 uppercase letters, 26 lowercase letters, 10 numerals, and the punctuation and symbols necessary to encode mathematical and accounting concepts. Additionally, we need to express things like spaces and carriage returns. One of the earliest systems for encoding letters as bits is ASCII, the American Standard Code for Information Interchange, which was first published as a standard in 1963. This system uses 7 bits to express these symbols: 2^7=128, which means we can express 128 characters in total. This image from Wikipedia shows how this looks:

https://upload.wikimedia.org/wikipedia/commons/c/cf/USASCII_code_chart.png

A word like “cat” takes up 21 bits, “book” takes up 28, and so on. Overall, not too bad. This remained the standard in the USA until the 1980s; other countries with their own needs created various 6- and 7-bit encoding systems.
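The arithmetic above can be checked with a short Python sketch. The function name to_ascii_bits is hypothetical; it simply looks up each character's ASCII code and writes it out as a 7-digit binary number.

```python
def to_ascii_bits(word: str) -> list[str]:
    """Return the 7-bit ASCII pattern for each character in word."""
    patterns = []
    for ch in word:
        code = ord(ch)
        if code > 127:
            # Anything above 127 does not fit in 7 bits, so it is not ASCII.
            raise ValueError(f"{ch!r} is not an ASCII character")
        patterns.append(format(code, "07b"))
    return patterns

for word in ("cat", "book"):
    bits = to_ascii_bits(word)
    print(word, bits, sum(len(b) for b in bits), "bits")
# cat ['1100011', '1100001', '1110100'] 21 bits
# book ['1100010', '1101111', '1101111', '1101011'] 28 bits
```

Passing in a character such as ñ raises an error, which is exactly the limitation that later encodings set out to address.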

As computers gained storage and increased in processing speed, it became possible to increase the number of bits in an encoding system. This allowed us to add the characters necessary to encode non-English symbols: ñ, ÿ, é, etc. These systems are usually called extended ASCII. They employ 8 bits (2^8=256 characters in total), which increases the number of possible characters, but also the necessary space and processing power. At the same time, standards were set for a number of different writing systems and character sets. Computers needed to be able to tell these encoding standards apart; otherwise, gibberish would result. From the 1980s until the 1990s, this multitude of systems co-existed with varying degrees of success.
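Here is a rough Python sketch of why co-existing 8-bit standards caused trouble. The byte value and the three encodings (Latin-1, Windows-1251, and ISO 8859-7) are examples I picked for illustration, and decode_with is a hypothetical helper name.

```python
def decode_with(data: bytes, encodings: list[str]) -> dict[str, str]:
    """Decode the same bytes under several encodings and collect the results."""
    return {name: data.decode(name) for name in encodings}

raw = bytes([0xF1])  # a single 8-bit value, 11110001 in binary

results = decode_with(raw, ["latin-1", "cp1251", "iso8859_7"])
for name, character in results.items():
    print(name, character)
# latin-1 ñ     (Western European)
# cp1251 с      (Cyrillic small letter es)
# iso8859_7 ρ   (Greek small letter rho)
```

One byte, three different letters. Unless a document says which standard it was written in, the computer reading it can only guess.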

In the late 1960s, libraries began to explore digitizing catalogs. Because few standards had been set by this point, and because libraries dealt with materials in a wide range of languages, the library community had to decide how to deal with this problem. More on MARC-8 and Unicode in the next post.