ABA Law Practice Management Section
Law Technology Today (EDD, Litigation, and Law Office Technology)

VOL 1 NO 4   In this Issue of Law Technology Today :: June 2007

 

BYTES IN BRIEF

Dealing with Foreign Language Documents: A Primer (Part 1)

In the first article of a two-part series, Tredennick simplifies the process of translating foreign language documents and metadata in litigation and uncovers the key to making applications foreign-language friendly.

Each January, I try to attend LegalTech New York and this year was no exception. Like a lot of people, I come to the show wondering what will be the next “Big Thing.” If you go, you probably know what I mean. There always seems to be something hot at the show each year. In the 90s it was the latest release of WordPerfect or a Word-WP shootout. In other years it was time and billing software or a new version of Summation or Concordance. Most recently, it was web-based repositories and then EDD.

This year, to my surprise, it was foreign languages. Wherever I went I heard people talking about foreign language documents and what to do with them. We had a booth at the show and that was certainly the big topic there. “Can your software handle foreign language documents?” “We have Chinese files we need to search and review.” “What about Japanese, Russian, German. . .” And so on.

I have no idea why that topic was suddenly hot. During 20+ years as a trial lawyer I only recall seeing a handful of foreign-language documents. When I did, I moved them to the bottom of the stack in favor of the English ones that I could read and understand. OK, just kidding on the last part but I don’t recall having to deal with very many foreign language documents and I did a lot of discovery over the years.

But, in this Flat-Earth era of globalization, foreign language documents are quickly becoming a part of the landscape. If you are representing a multi-national, you will probably be required to collect and review documents from a number of countries to determine relevance and privilege. If you are suing a multi-national, you will probably receive foreign-language documents which you will have to master. Many of them will contain several languages: emails with combinations of Japanese, Chinese and English, for example. If you haven’t hit this issue yet, my bet is you will, soon.

This is a two-part column about dealing with foreign language documents and foreign language metadata in litigation. I will start with the basics and teach you what you need to know about ASCII and Unicode, which is the key to making applications foreign-language friendly. In the next issue, I will turn to problems inherent to searching foreign language documents and talk about the unique problems of searching the CJK languages: Chinese, Japanese, Korean, and Thai (although that gets left out of the acronym). This has become increasingly important for many as we move to a borderless world that works together seamlessly but speaks in many tongues.

ASCII: The Base for English-Language Programs

To get a handle on foreign language documents you need to first understand ASCII and its limitations. ASCII, pronounced “Ask ee,” is an acronym for American Standard Code for Information Interchange. First developed in the 1960s, ASCII was a system to encode the basic characters used by computers to communicate with people.

The task was to create a universal way to represent all of the basic characters one needed to use a computer, from writing programming code to drafting a research memo using a word processing program. And, because computers run off binary code (bits and bytes), ASCII needed to be expressed in bits and bytes as well.

In ASCII, each character you see on this page (and a number you can’t see) is presented to the computer not as letters but as a “byte” of code. A byte consists of 8 individual “bits” that are either a “1” or a “0”.

The Printable ASCII Characters

!"#$%&'()*+,-./
0123456789:;<=>?
@ABCDEFGHIJKLMNO
PQRSTUVWXYZ[\]^_
`abcdefghijklmno
pqrstuvwxyz{|}~

Thus, the letter “A” would be encoded in 7 bits as:

100 0001

The letter “B” is:

100 0010

The letter “C” is:

100 0011

And so on through the alphabet (large and small letters). ASCII also includes the 10 digits (0-9) along with standard punctuation characters ($ % * & + =, etc.). It also reserves 32 codes (plus the delete character) for control functions like tabs, line feeds and carriage returns.
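The bit patterns above can be reproduced in a few lines of Python. This is only an illustrative sketch and not part of the original column; the helper name `seven_bit` is made up for the example.

```python
def seven_bit(ch: str) -> str:
    """Return the 7-bit ASCII pattern for one character, e.g. 'A' -> '100 0001'."""
    code = ord(ch)                      # numeric code assigned to the character
    if code > 127:
        raise ValueError(f"{ch!r} is not a 7-bit ASCII character")
    bits = format(code, "07b")          # zero-padded to 7 binary digits
    return bits[:3] + " " + bits[3:]    # grouped 3 + 4, as in the text above

for letter in "ABCabc":
    print(letter, ord(letter), seven_bit(letter))
```

Note that a capital letter and its small counterpart differ by a single bit, which is one reason the table is laid out the way it is.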

You probably noticed that the letters shown above use only 7 bits rather than the 8 that make up a byte of code. It is a historical anomaly: the drafters of the standard felt that 128 characters would be plenty to represent the letters, numbers and “control characters” they would need, but most computers required 8 bits as a minimum unit. So, the eighth bit was often used for error checking. Remember, this was the early sixties and people weren’t carrying laptops or Blackberries back then.
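One common form of that error checking was a parity bit. The sketch below assumes even parity, which was only one of several conventions in use; the function names are hypothetical and chosen for the illustration.

```python
def add_even_parity(code: int) -> int:
    """Place an even-parity bit in the eighth (top) bit of a 7-bit code."""
    if not 0 <= code < 128:
        raise ValueError("expected a 7-bit value")
    ones = bin(code).count("1")       # how many 1 bits the 7-bit code has
    parity = ones % 2                 # 1 only when that count is odd
    return (parity << 7) | code       # the full byte now has an even count of 1s

def parity_ok(byte: int) -> bool:
    """True if the byte holds an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0

sent = add_even_parity(ord("C"))      # 'C' is 100 0011: three 1s, so the parity bit is 1
damaged = sent ^ 0b00000100           # simulate one bit flipped in transit
print(format(sent, "08b"), parity_ok(sent), parity_ok(damaged))
```

A single flipped bit makes the count of 1s odd, so the receiver can tell the byte was damaged, although it cannot tell which bit changed.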

Anyway, this system worked great in a world that spoke English and it has held up well for almost a half century. Over the years it became the base for text transcripts (“Could I get an ASCII copy of the transcript please?”), the core of most word processing programs, and the heart of most of the programming code used in litigation support applications.

There was only one problem with 7-bit ASCII and I have already alluded to it. Having 128 possible combinations of 1s and 0s works fine if your alphabet only has 26 letters. But what if you want to compute in French or German or Russian or Hebrew? Even more fun, what if you are one of the billions of Chinese or Japanese or other Asian speakers who were blessed with a written language that has tens of thousands of characters?

Suddenly you have a problem.

Extended ASCII

Not willing to change their mother language just to use computers, many countries began developing their own character set encodings and began to extend the ASCII standard. Simply moving from 7 to 8 bits doubled the range of characters to 256, which helped for many languages. Most of these encodings included the first 128 characters from ASCII but added other characters in the extended range, sometimes called High or extended ASCII. At the same time, English-speaking programmers started using the additional values to represent all kinds of other characters, such as line-drawing symbols and horizontal and vertical bars, so you could make spiffy drawings on your page.

As you can imagine, this got a bit crazy. People were creating proprietary code sets for all kinds of things including word processing programs, drawing programs and, of course, to support different languages. Computer operating systems had to be able to recognize and handle each one.
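To see why competing 8-bit code sets caused trouble, consider a single byte read under three different ones. The sketch below uses three code pages that ship with Python's standard codecs (Western European Latin-1, the original IBM PC code page 437, and Windows Cyrillic 1251); they are chosen purely as examples and are not named in this column.

```python
raw = bytes([0xE9])   # one byte with a value above the 7-bit ASCII range

# The same byte means a different character under each code page.
for code_page in ("latin-1", "cp437", "cp1251"):
    ch = raw.decode(code_page)
    print(f"{code_page:8} U+{ord(ch):04X} {ch}")

# The IBM PC code page also spent its upper range on line-drawing characters.
horizontal_bar = bytes([0xC4]).decode("cp437")

# The first 128 values agree everywhere, which is why plain English text survived.
same_everywhere = all(
    b"Hello".decode(cp) == "Hello" for cp in ("latin-1", "cp437", "cp1251")
)
print(same_everywhere)
```

Unless the software knows which code set a document was written in, it has no way to tell which of those three characters the author meant.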

This worked passably well at first. By the 90s, however, people started to see the need for a universal code that could go beyond ASCII and embrace all possible languages. That realization was the impetus for the Unicode movement, which provides the modern foundation for handling foreign language documents around the world.

Unicode

Unicode was a big leap forward, providing for as many as 1,114,112 (2^20 plus 2^16) possible characters. At the moment only about 100,000 of them are assigned. The first 128 characters remain as they were in ASCII (and the first 256 match the widely used Latin-1 extension), which is why it has been easy to adopt. The majority of the remaining characters (96,000) are used to express the Chinese, Japanese and Korean languages, which are pictorial in nature and require more than one byte per character. You can even express made-up languages like Klingon, using the ranges Unicode sets aside for private use.
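The 1,114,112 figure is easy to check. The short sketch below does the arithmetic and compares it with the limit Python itself reports; the description of the range as 17 "planes" of 65,536 code points each is standard Unicode terminology rather than something taken from this column.

```python
import sys

planes = 17                     # the original 16-bit range plus 16 supplementary planes
per_plane = 2 ** 16             # 65,536 code points in each plane
total = 2 ** 20 + 2 ** 16       # the figure quoted above

print(f"{total:,}")
# All three ways of counting agree: Python's highest code point is total - 1.
assert total == planes * per_plane == sys.maxunicode + 1

# ASCII characters keep their old numbers inside Unicode.
assert ord("A") == 0b1000001 == 65
```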

Unicode can handle such a wide range of characters because it can use more than one byte to express a character. In the years since Unicode was introduced it has become a global standard and is the encoding used in modern operating systems like Windows (NT, XP, Vista) and Mac OS X, as well as in XML, the .NET framework and Java, among many others.

UTF-8

There are several approaches to Unicode representations but UTF-8 (8-bit Unicode Transformation Format) is the standard today. It can use up to four bytes (often called octets) to represent any Unicode character (the original design allowed as many as six bytes, but the current standard stops at four). The ASCII characters are still represented by a single byte using the same digits. Two bytes cover roughly another 1,900 characters, including accented Latin letters and the Greek, Cyrillic, Hebrew and Arabic alphabets. Three bytes cover the rest of the first 65,536 characters, which include most of the characters needed by the CJK languages. Four bytes reach everything else, such as the rarer CJK characters.
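The number of bytes UTF-8 spends on a character can be observed directly. The sample characters in this sketch are arbitrary picks for illustration, written as escape codes so the example does not depend on how this page itself is encoded.

```python
samples = {
    "A": "Latin capital A (plain ASCII)",
    "\u00e9": "e with acute accent (French)",
    "\u0416": "Cyrillic capital Zhe (Russian)",
    "\u4e2d": "common CJK character meaning 'middle'",
    "\U00020000": "rare CJK character beyond the first 65,536",
}

for ch, label in samples.items():
    encoded = ch.encode("utf-8")                      # the bytes UTF-8 uses
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X}  {len(encoded)} byte(s)  {hex_bytes:<12} {label}")

# Byte counts in the same order as the samples above.
lengths = [len(ch.encode("utf-8")) for ch in samples]
```

The ASCII letter keeps its single familiar byte, which is why an English-only file is already valid UTF-8.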

Because of its versatility and consistency with ASCII, UTF-8 is steadily becoming the preferred encoding for e-mail, web pages, and other places where characters are stored or streamed.

Why Does This Matter?

For starters, unless your software application is written to support Unicode (for example, through UTF-8), it won’t handle most foreign language characters. This is a problem with many off-the-shelf litigation packages. They were written in the 90s, when language compatibility didn’t matter, and it will take a lot of effort to rewrite them.

It also matters when you want to process email and other documents that contain foreign languages. Most of the processing software in use today has its roots in the 90s as well and assumes ASCII. As a result, it will fail to decode email and documents with foreign language characters, typically representing them with ?????.
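The string of question marks is easy to reproduce. The sketch below imitates an ASCII-only processing step using Python's built-in "replace" error handling; real processing tools fail in their own ways, so treat this as an illustration of the effect rather than a model of any particular product.

```python
# A German greeting followed by "Tokyo" in Japanese, written as escape codes.
text = "Gr\u00fc\u00dfe, \u6771\u4eac"

# An ASCII-only step has no code for these characters, so each becomes "?".
ascii_only = text.encode("ascii", errors="replace").decode("ascii")
print(ascii_only)

# Count how many characters were lost in the process.
lost = sum(1 for before, after in zip(text, ascii_only) if before != after)
print(lost)
```

Once the characters have been replaced, the original text cannot be recovered from the output; the document has to be processed again from the source.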

Thus, before you can begin to search and review your foreign-language documents, you need to make sure that the software used to collect and process your documents (email extraction, for example) is Unicode compliant. Then the documents need to be loaded into a database or other software that supports the Unicode character set. Otherwise, you will be looking at gibberish and that won’t get you anywhere.
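One rough way to check a processing step is to push known multilingual text through it and confirm that the text comes back unchanged. The sketch below assumes the step can be wrapped as a function from text to text; `survives`, `unicode_aware` and `ascii_bound` are hypothetical stand-ins, not the interface of any real product.

```python
from typing import Callable

SAMPLES = [
    "plain English",
    "Fran\u00e7ais, Stra\u00dfe",                                # French and German
    "\u0420\u0443\u0441\u0441\u043a\u0438\u0439",                # Russian
    "\u65e5\u672c\u8a9e / \u4e2d\u6587 / \ud55c\uad6d\uc5b4",    # Japanese / Chinese / Korean
]

def survives(step: Callable[[str], str]) -> bool:
    """True if the processing step returns every sample unchanged."""
    return all(step(sample) == sample for sample in SAMPLES)

def unicode_aware(text: str) -> str:
    """Stand-in for a tool that stores text as UTF-8."""
    return text.encode("utf-8").decode("utf-8")

def ascii_bound(text: str) -> str:
    """Stand-in for a tool that can only store ASCII."""
    return text.encode("ascii", errors="replace").decode("ascii")

print(survives(unicode_aware), survives(ascii_bound))
```

A test like this only covers the samples it is given, so it is worth including text in every language you expect to see in the collection.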

The next step is to talk about searching foreign languages and particularly to address the difficult issues around searching multi-byte languages like the CJK pictorial languages. I will hit that subject head on in the next issue.

About the Author

John C Tredennick Jr


John Tredennick spent more than 20 years as a nationally-recognized trial lawyer and litigation partner with Holland & Hart in Denver, Colorado. One of the early pioneers in litigation technology, John published the ABA bestselling books Winning with Computers, Volumes 1 and 2 in 1990 and 1991. Since then he has authored two other books on litigation technology along with scores of articles and columns for the leading legal publications. He also regularly speaks at legal technology conferences around the world.

In 2000, John founded Catalyst Repository Systems (formerly CaseShare Systems). Catalyst provides secure, online repository systems to help professional teams manage large volumes of electronic documents and work together on complex legal, financial and insurance matters. A pioneer in the industry, Catalyst is used by many of the largest corporations and law firms in the world.

Copyright American Bar Association. http://www.abanet.org