Climbing the Tower of Babel: Advances in Unsupervised Multilingual Learning

Regina Barzilay
Computer Science and Artificial Intelligence Laboratory

For a computer to automatically perform language-oriented tasks, e.g., summarizing articles, and translating between languages, it must “understand” the text. This understanding is typically embodied as some structural representation of the document’s underlying meaning and organization. Today, supervised learning techniques provide the most effective inductions of such representations. That is, given training instances of sentences with corresponding linguistic structures (e.g., syntactic trees), these algorithms learn how to map words in a text into a desired representation. The availability and quality of training instances is critical to the accuracy of these methods, but creating such annotated resources is prohibitively expensive — it may take years of human effort to compile a sizable corpus needed for training text processing tools. Reliance on such resources is a major hindrance in the development of automatic text processing tools.

Joint morphological segmentation, part-of-speech tagging and parsing for three languages.

Joint morphological segmentation, part-of-speech tagging and parsing for three languages.

As an alternative to supervised learning, we develop a new approach for language analysis that we call unsupervised multilingual learning. Fundamentally, our work is based on the idea that different languages exhibit different patterns of ambiguity — what is confusing to an automated system in one language may be clarified using another language. For example, a word with several senses in one language may correspond to an unambiguous word in the other language. The word “can” in English may function as an auxiliary verb, a noun, or a regular verb, whereas many other languages, such as Hebrew, are likely to express these different senses with three distinct lexemes. The key idea of multilingual learning is that by
combining natural cues from multiple languages, the structure of each becomes more readily apparent.

To achieve effective learning performance, a desired multilingual model must accurately model a common cross-lingual structure, yet remain flexible to the idiosyncrasies of each language. From a computational standpoint, the main challenge is to ensure that the model will scale well as the number of languages increases, avoiding an exponential increase in the parameter space.

We have developed a class of non-parametric Bayesian models that satisfy these two requirements. As an example, consider a model that predicts part-of-speech tags (nouns, verbs, etc.) for words in a sentence. For each language, the multilingual model contains a Hidden Markov Model-like substructure; these substructures are connected to one another by means of cross-lingual latent variables. These variables, which we refer to as superlingual tags, capture repeated multilingual patterns and thus reduce the overall uncertainty in tagging decisions. The model scales linearly with the number of languages, allowing us to incorporate as many languages as are available. We have applied multilingual learning algorithms to all levels of linguistic analysis, ranging from morphological segmentation to syntactic parsing. The results of our experiments show that the performance of multilingual learning algorithms yields substantial improvements over state-of-the-art approaches across a range of tasks.

We have also found a surprising increase in performance as the number of languages jointly processed by the model increases. Another important benefit of the multilingual learning algorithms is that they are applicable to hundreds of human languages with no annotated resources. Such resource-poor languages are mostly out of reach for existing text processing methods. For instance, we can accurately perform morphological analysis of ancient Aramaic scripts by simultaneously analyzing these texts with Hebrew and Arabic translations. Finally, this new direction in statistical language learning has the potential to unveil fascinating facts about the hidden connections among human languages.

Tags: , , , , , ,

2 Responses to “Climbing the Tower of Babel: Advances in Unsupervised Multilingual Learning”

  1. Sagive says:

    i see hebrew and arabic.. what is the third
    language shown in that “morphological segmentation”
    is it aramic?

  2. i see hebrew and arabic.. what is the third
    language shown in that “morphological segmentation”
    is it aramic?

Leave a Reply

Replies which add to the content of this article in the EECS Newsletter are welcome. The Department reserves the right to moderate all comments. If you would like to provide any updated information please send an email to newsletter@eecs.mit.edu.