Publication Date

2013

Document Type

Dissertation

Committee Members

Keke Chen (Committee Member), Amit Sheth (Committee Member), Krishnaprasad Thirunarayan (Committee Member), Shaojun Wang (Advisor), Xinhui Zhang (Committee Member)

Degree Name

Doctor of Philosophy (PhD)

Abstract

The n-gram model is the most widely used language model (LM) in statistical machine translation systems, owing to its simplicity and scalability. However, it encodes only the local lexical relations between adjacent words and ignores the rich syntactic and semantic structure of natural language. Attempting to increase the order of an n-gram model to describe longer-range dependencies immediately runs into the curse of dimensionality. Although previous studies have increased the n-gram order on large corpora, they observed no obvious improvement beyond 6-grams. Meanwhile, other LMs, such as syntactic language models and topic language models, encode long-range dependencies from different perspectives of natural language, but how to effectively combine these language models to capture multiple linguistic phenomena remains an open question. This dissertation presents a study of building a large-scale distributed composite language model, formed by seamlessly combining an n-gram model, a structured language model, and probabilistic latent semantic analysis under a directed Markov random field paradigm, to simultaneously account for local word lexical information, mid-range sentence syntactic structure, and long-span document semantic content. The composite language model is trained by a convergent N-best-list approximate EM algorithm and a follow-up EM algorithm. To improve word prediction power, the composite LM is distributed under a client-server paradigm and trained on corpora of up to a billion tokens, with its orders increased to 5-gram and 4-headword. The large-scale distributed composite language model yields drastic perplexity reductions over n-grams and achieves significantly better translation quality, measured by BLEU score and the "readability" of translations, when applied to re-ranking the N-best lists from a state-of-the-art parsing-based machine translation system. Moreover, we propose an A*-search-based lattice rescoring strategy to integrate the large-scale distributed composite language model into a phrase-based machine translation system. Experiments show that A*-based lattice rescoring demonstrates the advantage of the composite language model over the n-gram model more effectively than traditional N-best-list rescoring.
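For reference, the n-gram factorization and the perplexity metric that the abstract appeals to can be written as follows. These are the standard textbook definitions, not formulas reproduced from the dissertation itself; the composite model's own factorization combines these local probabilities with syntactic headwords and latent topic variables and is not shown here.

P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P\!\left(w_i \mid w_{i-n+1}, \ldots, w_{i-1}\right)

\mathrm{PPL} = \exp\!\left(-\frac{1}{m} \sum_{i=1}^{m} \log P\!\left(w_i \mid w_{i-n+1}, \ldots, w_{i-1}\right)\right)

Under the first (Markov) assumption, each word is conditioned only on the preceding n-1 words, which is why raising n to capture longer dependencies multiplies the number of contexts to estimate. The "drastic perplexity reduction" reported above corresponds to a lower value of the second quantity on held-out text.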

Page Count

121

Department or Program

Department of Computer Science and Engineering

Year Degree Awarded

2013

