American Bar Association - Defending Liberty, Pursuing Justice

ABA Law Practice Management Section
Law Technology Today (EDD, Litigation, and Law Office Technology)

VOL 1 NO 5   In this Issue of Law Technology Today :: July 2007

Fire Wire

Searching Foreign Language: A Primer


In this second article of a two-part series, Tredennick offers additional advice on handling foreign language documents. Learn how your computer searches a document, what tokenization means and, most importantly, how to manage documents in pictorial languages, such as Chinese, and even documents in multiple languages.

Last issue, I began a series on dealing with foreign language documents. We started by looking at foreign language characters and the problems they cause with computers raised on the ASCII character code. You may recall that ASCII is the basic encoding of most early computers. Standard ASCII defines 128 characters, and the 8-bit extensions built on it support up to 256, using a single byte (8 bits) to hold all those combinations of ones and zeroes. That was enough to cover our basic alphabet, along with some punctuation, but not much else. If you were born speaking a language with thousands of characters, you were out of luck.

This problem led to the Unicode movement in the early nineties. The goal was to create a universal system to describe the characters in all the world's languages. Its designers did so by using more than one byte to describe the additional characters that were needed. This gave us a single encoding system that could cover thousands of language variants, even the tens of thousands of pictorial characters that people use in the Far East, namely in China, Japan and Korea. It is also why many of these languages are called "double byte" languages.
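To make the byte counts concrete, here is a minimal Python sketch. The sample characters and the helper name `byte_lengths` are my own choices for illustration. It shows how many bytes one character needs in plain ASCII and in two common Unicode encodings.

```python
# Compare how many bytes one character needs in three encodings.
samples = ["A", "é", "中", "톰"]

def byte_lengths(ch):
    """Return the encoded size of one character in several encodings."""
    sizes = {}
    for encoding in ("ascii", "utf-8", "utf-16-le"):
        try:
            sizes[encoding] = len(ch.encode(encoding))
        except UnicodeEncodeError:
            sizes[encoding] = None  # this encoding has no code for the character
    return sizes

for ch in samples:
    print(ch, byte_lengths(ch))
```

The letter A fits in one ASCII byte, while the Chinese and Korean characters cannot be encoded in ASCII at all and take two bytes in UTF-16 and three in UTF-8.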

So, now we have our computer systems ready to support foreign language characters. What's next? Why, search, of course. If you can't search all of these foreign language documents, how are you going to find that smoking gun you need to make your case? Search is the next step in handling foreign language documents.

How does search work?

When we open a Word document and invoke Ctrl-F for "Find," we are performing a basic search. A text box appears allowing us to enter a search word. Once the search is invoked, the computer plows through the document from beginning to end looking for words or letters that match the combination we entered. If the document is short, the search happens pretty quickly. If the document is long, you might find yourself staring at an hourglass while the search works its way to the end of the file.

To speed up the search process, computers were programmed to index all of the documents in advance of the search. Indexing simply means finding all of the words in a document in advance and putting them in a search dictionary, actually a word index.

You can imagine it as a simple word list in alphabetical order. Beside each word you would find a reference to the document or documents containing that word. A sophisticated index will not only tell you what document contains the word but it will also give you a page and line number (even the position in the line) where it could be found.

That way, when you enter a search for Alaska, the computer doesn't have to open and search through all the documents in your repository. Rather, it just needs to open up the index, go down the As in the alphabet until it gets to "Alaska" and then read from the index which documents include that word.

This happens very quickly, even when your repository has millions of pages in it. Indexing is how we can search millions of pages of case documents using Lexis or Westlaw and the same principles apply to litigation support systems.
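The word index described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the three sample documents and the function names `build_index` and `search` are invented for the example, and commercial systems store far more than this.

```python
from collections import defaultdict

# Invented sample repository for illustration.
documents = {
    "doc1": "Alaska is the largest state",
    "doc2": "The pipeline runs across Alaska",
    "doc3": "Denver is in Colorado",
}

def build_index(docs):
    """Map each lowercased word to a list of (document id, word position)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

def search(index, word):
    """Look the word up in the index instead of rescanning every document."""
    return sorted({doc_id for doc_id, _ in index.get(word.lower(), [])})

index = build_index(documents)
print(search(index, "Alaska"))
```

The work of reading every document happens once, when the index is built. After that, each search is a single lookup no matter how many documents the repository holds.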

Tokenization

So what does all this have to do with foreign language search? A lot actually. Indexing can be a big problem for some foreign language documents. Here is why.

When a computer creates an index from a document, it has no idea what words mean or even if the document has words in it. Rather, in the computer world words are generally defined as one or more characters separated by a space or certain punctuation.

Thus, if a computer could read this sentence, it would find and index the following words:

Thus

if

a

computer

could

read

And so on. The computer is not actually reading the words. Rather, it identifies a word by the spaces or punctuation around it. Misspellings can be treated as words. Numbers can be treated as words. Anything can be treated as a word.

The process of identifying these words is called "tokenization." Indeed, search purists don't define the letter/number combinations in an index as words but rather call them "tokens." Thus, the trick to creating a search index (which provides fast search) is to tokenize each document into a series of words called tokens. Whether the token is actually a "word" is irrelevant to the computer. Now we get to the foreign language part.

Most Western languages share a common feature that is particularly helpful to search indexing—their words are always separated by spaces or punctuation. That means a search engine can index a French document just as easily as an English one. It may not realize that Bonjour is the French word for Hello. But it will recognize that the letters in Bonjour are surrounded by spaces and thus should be indexed as a separate word (or token to use computer speak). The same would be true for words with different accent marks or even those using radically different characters, like Cyrillic. So long as there are spaces or punctuation between the words, most search programs can index them.
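Here is a minimal Python sketch of that kind of tokenizer. It assumes the simplest possible rule (a token is any unbroken run of letters, digits or underscores) and the sample sentences are my own. Real indexers use more elaborate rules for hyphens, apostrophes and the like.

```python
import re

def tokenize(text):
    """Return every unbroken run of letters, digits or underscores."""
    return re.findall(r"\w+", text)

# English, including a misspelling and a number: all become tokens.
print(tokenize("Thus, if a computer could read this sentence"))
print(tokenize("Recieved 42 widgets, ref. AB123"))

# French and Cyrillic text split just as easily, because spaces and
# punctuation still mark the word boundaries.
print(tokenize("Bonjour, tout le monde"))
print(tokenize("Привет, мир"))
```

Notice that the tokenizer never checks a dictionary. The misspelled "Recieved" and the number "42" are indexed just like any other token.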

The Problem with CJK

The problem begins when you realize that not all languages use spaces or punctuation to separate words. Take Chinese for example. The Chinese language generally doesn't use spaces or punctuation to delineate words. Rather, the characters run together without any clear break points. Japanese is similar in this respect, as is Thai. The Korean language uses spaces but contains "compound" words that are really several words without spaces. Chinese, Japanese and Korean are commonly referred to together as the "CJK" languages.

With a CJK language a sentence might look something like this:

Thedogatemydinnerbeforeicouldstophimnexttimeiwill
puthimoutbeforeieat

Editor's Note: Line break added.

Without spaces or punctuation, how is the search engine going to define the individual words in that sentence and create its index? The answer is: not very well, unless it has special tokenizers that can recognize where the "words" begin and end in each language.
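To see the difference, here is a small Python sketch using part of the run-together English sentence. It assumes one simple approach, greedy longest-match against a word list, and the tiny `VOCABULARY` is hypothetical. I am not suggesting any particular product works this way; commercial tokenizers rely on full lexicons and much smarter logic.

```python
# Hypothetical toy vocabulary; a real tokenizer ships with a full lexicon.
VOCABULARY = {"the", "dog", "ate", "my", "dinner", "before",
              "i", "could", "stop", "him"}

def segment(text, vocabulary):
    """Greedy longest-match: at each position take the longest known word."""
    longest = max(len(word) for word in vocabulary)
    tokens, start = [], 0
    while start < len(text):
        for end in range(min(len(text), start + longest), start, -1):
            if text[start:end] in vocabulary:
                tokens.append(text[start:end])
                start = end
                break
        else:
            # No known word starts here; emit one character and move on.
            tokens.append(text[start])
            start += 1
    return tokens

sentence = "thedogatemydinnerbeforeicouldstophim"
print(sentence.split())                # space-based tokenizing finds one giant token
print(segment(sentence, VOCABULARY))   # dictionary-based tokenizing finds the words
```

The space-based approach indexes the whole sentence as a single token, so a search for "dog" would find nothing. The dictionary-based approach recovers the ten words, although a greedy rule like this one can still guess wrong when a longer word happens to span two shorter ones.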

We use a search engine called "Fast Instream" that can handle this special tokenization but not many of the basic litigation support packages have this capability. If you expect to get CJK language documents, make sure your search engine has tokenizers for the languages you will want to search.

Dealing with Pictorial Languages

The challenge is compounded by the fact that Chinese, and Japanese in part, are written with symbols called "logograms" rather than letters of an alphabet. For example, there are two forms of written Chinese: simplified and traditional. Both use symbols to represent words, parts of words and even phrases. The characters they use are not like letters in an alphabet, and their words are not simply combinations of letters as in our alphabet.

So, tokenization becomes even more difficult. Not only must the computer figure out how to group the pictorial characters into words, it must recognize the fact that how the characters are combined can control their meaning—literally what words are being conveyed.

Let's take an example. The Traditional Chinese word for "Chinese" consists of three pictorial characters:

中國人

These three characters translate to three English words:

Middle country person

For historical interest, the word was created at a time when the Chinese viewed themselves as being in the center of the world. Thus, the middle country was China and a Chinese was a person from the middle country.

So, if you wanted to say China rather than Chinese, you use two characters:

中國

Now you have a two character word that loosely translates as "middle country." Or, you could use each of the symbols for their single word meaning.

When the pictures are all lumped together, how does the tokenizer know whether the first two characters are meant to refer to China and the third character is meant to be part of the next word or phrase? The answer is: Some very smart linguists have designed software that can read the characters and understand their context sufficiently to create words in an index.
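A short Python sketch shows why the tokenizer has a real choice to make. It assumes a hypothetical toy lexicon holding only the words from this example and simply lists every way the three characters can be divided into known words. Choosing among the candidates is where the linguists' software earns its keep, and this sketch does not attempt that part.

```python
# Hypothetical toy lexicon containing only the words discussed above.
LEXICON = {"中", "國", "人", "中國", "中國人"}

def segmentations(text, lexicon):
    """Return every way to split text into words found in the lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results

for candidate in segmentations("中國人", LEXICON):
    print(" | ".join(candidate))
```

Even with three characters and five lexicon entries there are three valid readings: three separate words, "China" followed by "person," or the single word "Chinese." A full sentence multiplies those choices quickly.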

To take another example, the Korean language is flexible about word order in a sentence. The verb comes at the end, but you can put the other words preceding it in any order you choose. That can be quite confusing to a person used to searching for word phrases based on the notion that the words (or related symbols) will always be in a set order.

Take this simple sentence "Tom eats food" and you will get an idea of what I mean. Here is the sentence in Korean:

톰이 음식을 먹는다

In Korean the verb will always come last. From there the sentence could be:

Tom food eat

or

Food Tom eat

To eliminate this confusion, Koreans indicate as part of the word form whether the word is a subject or object. So the sentence becomes more like this:

Tom-subject food-object eat-verb.

The problem is the subject/object designations are included as particles of the word forms to which they refer. The tokenizers have to remove these particles as does the search engine when you look for those words.

Here is how it breaks out in Korean:

톰 = Tom (phonetically transliterated)

이 = subject particle (written 가 after a vowel)

음식 = food

을 = object particle

먹는다 = eats
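As a rough illustration of particle removal, here is a Python sketch. It assumes a deliberately tiny, hypothetical particle list and a naive rule that strips one trailing particle from each space-separated word form. The subject particle is written 이 after a consonant and 가 after a vowel, so both appear in the list. Real Korean tokenizers use full morphological analysis, because a naive rule like this will also clip ordinary words that merely happen to end in one of those syllables.

```python
# Hypothetical, deliberately tiny particle list; real Korean analyzers use
# full morphological dictionaries.
PARTICLES = ("이", "가", "을", "를", "은", "는")

def strip_particle(token):
    """Remove one trailing particle, keeping at least one syllable of stem."""
    for particle in PARTICLES:
        if len(token) > len(particle) and token.endswith(particle):
            return token[: -len(particle)]
    return token

def index_terms(sentence):
    # Korean separates word forms with spaces, so split on whitespace first.
    return [strip_particle(token) for token in sentence.split()]

print(index_terms("톰이 음식을 먹는다"))   # Tom food eat
print(index_terms("음식을 톰이 먹는다"))   # Food Tom eat
```

Both word orders produce the same three index terms, so a search for 음식 (food) finds the sentence whether or not the object particle is attached and wherever the word sits.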

The problem isn't limited to the CJK languages. Arabic uses spaces to delineate words, but it attaches different grammatical forms to the words themselves, and these must also be recognized by a special Arabic-language tokenizer.

Dealing with Multiple Languages

To handle foreign language documents, your indexing engine has to first determine what language is being used. While some documents contain a header with this information, most don't, and the indexing engine has to determine this by looking at the characters in the document. Unless it can figure out what language is being used, it won't know what tokenizer to use and may have trouble determining word boundaries.
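One simple way to "look at the characters" is to count which writing systems appear in the text. The Python sketch below does that using the standard `unicodedata` module. The prefix table, the tokenizer labels and the function names are assumptions made for this example. Character counting cannot tell English from French, for instance, so real engines also use statistical language identification.

```python
import unicodedata
from collections import Counter

# Assumed mapping from Unicode character-name prefixes to coarse scripts.
SCRIPT_PREFIXES = {
    "CJK UNIFIED IDEOGRAPH": "Han",
    "HIRAGANA": "Kana",
    "KATAKANA": "Kana",
    "HANGUL": "Hangul",
}
# Hypothetical tokenizer labels; anything else falls back to space-based splitting.
TOKENIZER_FOR_SCRIPT = {"Han": "chinese", "Kana": "japanese", "Hangul": "korean"}

def script_of(ch):
    """Return a coarse script label for one character, or None."""
    name = unicodedata.name(ch, "")
    for prefix, script in SCRIPT_PREFIXES.items():
        if name.startswith(prefix):
            return script
    return "Other" if ch.isalpha() else None

def guess_tokenizer(text):
    """Pick a tokenizer label from the dominant script in the text."""
    counts = Counter(s for s in map(script_of, text) if s)
    if not counts:
        return "whitespace"
    # Japanese mixes kana with Han characters, so count them together
    # whenever any kana appear.
    if counts["Kana"]:
        counts["Kana"] += counts.pop("Han", 0)
    dominant = counts.most_common(1)[0][0]
    return TOKENIZER_FOR_SCRIPT.get(dominant, "whitespace")

for sample in ["中國人", "私は学生です", "톰이 음식을 먹는다", "Bonjour tout le monde"]:
    print(sample, "->", guess_tokenizer(sample))
```

The kana rule matters because Japanese borrows Chinese characters. Without it, a Japanese sentence could easily be handed to the Chinese tokenizer.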

What about documents containing multiple languages? This too can be a problem for many search engines. If the engine starts tokenizing words based on the Japanese language, what happens to the parts in Chinese? Or in English, for that matter?

The answer is that sophisticated search engines will analyze documents in parts to determine the predominant language in that part of the document. Thus, it might recognize that part of an email body is in Japanese and part in Korean. A good tokenizing engine will apply two different tokenizing schemes—one for Japanese and one for Korean—to the different parts of the document. That way you can search on both languages. It will do the same for English and foreign mixes. We see a lot of email in our repositories that has three or more languages in it. Make sure your search indexer can handle them all in the same document.
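Here is a Python sketch of the part-by-part idea, using an invented email line that mixes English, Japanese and Korean. It makes a loud simplifying assumption: any Han or kana character is treated as Japanese, which is only safe because this sample contains no Chinese. The labels and function names are hypothetical.

```python
import unicodedata

def script_of(ch):
    """Coarse label from the Unicode character name (a sketch-level assumption)."""
    name = unicodedata.name(ch, "")
    if name.startswith(("HIRAGANA", "KATAKANA", "CJK UNIFIED IDEOGRAPH")):
        return "japanese"  # assumes Han characters here belong to Japanese text
    if name.startswith("HANGUL"):
        return "korean"
    if name.startswith("LATIN"):
        return "latin"
    return None  # spaces, digits and punctuation stay with the current run

def split_by_script(text):
    """Cut the text into consecutive (label, run) pieces, one per script."""
    runs, buffer, current = [], [], None
    for ch in text:
        script = script_of(ch)
        if script and current and script != current:
            runs.append((current, "".join(buffer).strip()))
            buffer = []
        if script:
            current = script
        buffer.append(ch)
    if "".join(buffer).strip():
        runs.append((current, "".join(buffer).strip()))
    return runs

email_body = "Meeting notes 会議は明日です 회의는 내일입니다"
for label, run in split_by_script(email_body):
    print(label, "->", run)
```

Each run can then be handed to the tokenizer for its own language, which is how one document ends up searchable in all three.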

In that regard, there is one further problem with foreign language search that I should mention. A number of search engines that offer foreign language support have a serious limit: they can only handle one language for search at a time. If your documents span several languages, this can be a big problem.

Let me explain why. Many of the older search engines (Verity is an example that we know well) offered foreign language support but required a separate index for each language being searched. What that meant was that if you wanted to index Chinese documents, you needed the Chinese indexing package. If you wanted Japanese, you needed the Japanese package. And for English, you needed the English package.

The problem is that most cases involving foreign language documents encompass multiple languages, e.g. English, Japanese, Korean and more. If you can only search one language at a time, you have to run your searches against each collection of documents—a new search for each language covered in the documents. Then you have to combine your results from the different searches into one set so you can make sense of what you have. And, you can't run a combined search spanning several languages like English and Chinese.
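The extra work can be sketched in Python. The three per-language indexes below are invented for illustration (each maps a term meaning "contract" to the documents containing it), and the function names are hypothetical.

```python
# Hypothetical per-language indexes: each maps a term to the ids of the
# documents that contain it.
INDEXES = {
    "english": {"contract": {"doc1", "doc4"}},
    "japanese": {"契約": {"doc2", "doc4"}},
    "korean": {"계약": {"doc3"}},
}

def search_one(language, term):
    """Search a single language's index."""
    return INDEXES[language].get(term, set())

def search_all(terms_by_language):
    """Run one search per language index, then combine the hits by hand."""
    combined = set()
    for language, term in terms_by_language.items():
        combined |= search_one(language, term)
    return sorted(combined)

print(search_all({"english": "contract", "japanese": "契約", "korean": "계약"}))

# A search spanning two languages (documents using both terms) also has to
# be stitched together outside the engine.
both = search_one("english", "contract") & search_one("japanese", "契約")
print(sorted(both))
```

Every step outside `search_one` is work the reviewer or the litigation support team has to do manually when the engine keeps a separate index per language.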

Modern search engines allow you to index multiple languages in a single repository and to search against all the documents in one or more languages at the same time. If your proposed system does not have this capability, you had better look at alternatives.

In the Final Analysis

Foreign language search is a lot tougher to pull off than you might think. You need a system that will handle Unicode so the foreign characters can be displayed properly. Then you need an indexer that can tokenize foreign languages properly and, generally, handle all the different ways that people express words and concepts. Lastly, you need a search engine that is language agnostic and can roll with the punches. If it balks at handling some of the foreign languages you might encounter or requires you to run multiple searches to get what you need, take a pass. The modern engines are geared for the new "Flat Earth" we live in and can help you find what you need in discovery.

After all, foreign language documents are no longer an oddity. They are quickly becoming a routine part of modern litigation. You will just have to deal with them.

About the Author

John C Tredennick Jr


Editor in Chief

John Tredennick spent more than 20 years as a nationally-recognized trial lawyer and litigation partner with Holland & Hart in Denver, Colorado. One of the early pioneers in litigation technology, John published the ABA bestselling books Winning with Computers, Volumes 1 and 2 in 1990 and 1991. Since then he has authored two other books on litigation technology along with scores of articles and columns for the leading legal publications. He also regularly speaks at legal technology conferences around the world.

In 2000, John founded Catalyst Repository Systems (formerly CaseShare Systems). Catalyst provides secure, online repository systems to help professional teams manage large volumes of electronic documents and work together on complex legal, financial and insurance matters. A pioneer in the industry, Catalyst is used by many of the largest corporations and law firms in the world.
