What is a Character Set? (Part Three): UNICODE: the future is now!

If you thought it would be strange that computers would still rely on 40-year-old technology to deal with the world’s many writing systems, you would be correct. While the library world was content to stick with MARC-8, the computing world evolved constantly.

For most computer users in the early days of MARC-8, having access to many code points was not especially important. Libraries, as multilingual environments, were one of the few institutions where the availability of multiple scripts was important.

By the 1990s, however, the rapid adoption of the Internet and the World Wide Web meant that computers around the world could talk to each other. At the same time, computing power was growing rapidly. The phones that we carry in our pockets have greater computing power than NASA used to put humans in space, so processing lots and lots of bits is no longer a problem.

Having a huge number of encoding schemes was counter-productive, as text written in one locale would be rendered as gibberish in another. (Does anyone remember this from the early days of the Internet? Running through encoding settings in order to make a website readable?)

To solve this, computer scientists began working toward a universal standard that could unite all existing standards, and all of the scripts and characters they expressed, into one single standard. The result was Unicode, whose governing consortium was incorporated in 1991 following discussions between engineers at Xerox and Apple.

Unicode is a bit confusing from an encoding standpoint, as it is not an encoding, per se, but rather a standard. Unicode can be implemented in several ways, but the most common (and current) are UTF-8, UTF-16, and UTF-32. UTF-32 is a fixed-width scheme, which means that each character takes up exactly 32 bits; it is simple but space-hungry, so it is rarely used for storage or interchange. UTF-16 uses 16 bits for most characters and 32 bits (a pair of "surrogates") for the rest; it is uncommon on the web but widely used internally by systems like Windows and Java. UTF-8 is a variable-width scheme, with characters taking up 8, 16, 24, or 32 bits. This flexibility means that all characters can be expressed easily, yet far less space is taken up than with UTF-32.
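To make the size differences concrete, here is a quick sketch (mine, not from the standard — any Python 3 interpreter will do) that encodes the same three characters under each scheme and counts the bytes:

```python
# Encode the same string under the three common Unicode encodings
# and compare how many bytes each one needs.
text = "Qи漢"  # Latin capital Q, Cyrillic и, and a CJK ideograph

for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    data = text.encode(encoding)
    print(f"{encoding}: {len(data)} bytes")

# utf-8:     1 + 2 + 3 = 6 bytes  (variable width: ASCII stays tiny)
# utf-16-be: 2 + 2 + 2 = 6 bytes  (these are all in the Basic Multilingual Plane)
# utf-32-be: 4 + 4 + 4 = 12 bytes (fixed width: every character costs 32 bits)
```

Notice that UTF-8 spends only one byte on the plain Latin letter, which is why it is so efficient for the largely-ASCII text that dominates the web.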

Other encodings have been implemented or proposed, yet none are very common. UTF-8 is the web standard, although any other UTF encoding should present little difficulty to browsers or other document readers.

Now that we’ve covered encodings, let’s cover what Unicode is. Unicode, as I’ve mentioned, is a standard, and it is governed by the Unicode Consortium. This consortium is made up of members, mostly tech companies like Apple, Adobe, and Oracle, but also many governments, linguistics institutions, universities, and even interested individuals. These members collaborate to ensure that Unicode truly works as a (near) universal standard, to ensure that every computer can produce and interpret Unicode-compliant material, and to expand the standard to include the scripts necessary to digitally (re)produce the sum of human knowledge.

As a standard, Unicode assigns hexadecimal code points to characters, which are represented in slightly different ways depending upon the encoding selected. These characters are conceived of as belonging to blocks. The capital letter <Q> belongs to the C0 Controls and Basic Latin block and has the specific code point U+0051. (0051 is the hexadecimal value; “U+” is added to specify that we’re talking about Unicode.)
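You can poke at code points yourself with Python's built-in tools (a small sketch using the standard `unicodedata` module):

```python
import unicodedata

ch = "Q"
code_point = ord(ch)               # the character's integer code point
print(f"U+{code_point:04X}")       # formatted the way Unicode writes it: U+0051
print(unicodedata.name(ch))        # the official character name: LATIN CAPITAL LETTER Q
print(chr(0x0051))                 # round-trip from code point back to the character: Q
```

The same round trip works for any script in the standard, which is exactly the point: one numbering scheme for every character.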

As the need for more scripts or symbols grows, Unicode can add new blocks or assign characters to new code points. After the Rohingya crisis in Myanmar, Unicode moved quickly to include the Rohingya script in its standard so that agencies could produce Rohingya-language materials that any computer could interpret. In the past, a new script would necessitate a new font with its own encoding, and there was no guarantee that a given computer had that font or could read a document written with it. Unicode solves this by putting every script into a single standard that is readable by just about any modern computer.

So given that Unicode seems to be the solution to our outdated MARC-8 system, why do we still stick to MARC-8? To a certain extent, we actually have made the switch. OCLC, a global cataloging cooperative, allows for the creation of records in Unicode, then allows participating libraries to export records in a variety of encodings, including MARC-8. The reason the switch hasn’t happened completely is largely due to money and tradition.

On the money end, re-encoding the catalog has the potential to be quite costly. Library software is notoriously expensive and difficult to implement, so many libraries use legacy systems that don’t play well with Unicode. This decision, however, is largely at the level of the individual library, and I have no doubt that clever catalogers could tweak their software and their catalogs to use the Unicode standard.

Tradition plays a role as well. In the MARC catalog record, field 066 is used to indicate what scripts are present. This would be unnecessary if all scripts were inherently supported. Have a look at the mess we deal with now to see why keeping MARC-8 is such a bad idea. The PCC, the Program for Cooperative Cataloging, has also been slow to adopt Unicode. It has four programs: BIBCO (most bibliographic records), NACO (name authorities), SACO (subject authorities), and CONSER (serials). Of these four, only BIBCO allows for the use of Unicode; the rest require MARC-8. And because the PCC sets the gold standard for cataloging, its requirements effectively hold the keys to Unicode adoption.

It’s 2019. My phone can type and read just about any language I want it to. My computer has no trouble rendering Cyrillic or Mongolian or even Egyptian hieroglyphics. A person from Thailand could conceivably search for an author in an American catalog, but if that author’s name is in Thai, they have to resort to transliterating it according to a prescribed standard, because NACO doesn’t allow us to enter names in vernacular scripts that MARC-8 doesn’t support.

The library world has resigned itself to abandoning MARC, not because it was bad, but because we have technology that allows us to do so much more. We should be proud of the brilliant librarians who developed MARC and MARC-8, and the best way to honor their legacy is to move on to something even better.
