Committee Members

Keke Chen (Committee Member), Shaojun Wang (Advisor), Xinhui Zhang (Committee Member)

Degree Name

Master of Science in Computer Engineering (MSCE)


Abstract

The language model is a crucial component of a statistical machine translation system. The basic language model is the N-gram model, which predicts the next word based on the previous N-1 words; it has been used in state-of-the-art commercial machine translation systems for years. However, the N-gram model ignores the rich syntactic and semantic structure of natural languages. We propose a composite semantic N-gram language model that combines a probabilistic latent semantic analysis (PLSA) model with the N-gram model as a generative model. We implemented the proposed composite language model on a supercomputer with a thousand processors and trained it on a 1.3-billion-token corpus. Compared with the simple N-gram model, the large-scale composite language model achieves a significant perplexity reduction and a BLEU score improvement on an n-best list re-ranking task for machine translation.
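As a point of reference for the N-gram baseline described above, the following is a minimal sketch (not the thesis implementation) of a trigram language model with add-one smoothing and the perplexity measure used to compare models. All function names are illustrative; the actual system, the PLSA component, and the distributed training are not shown.

```python
from collections import defaultdict
import math


def train_ngram(corpus, n=3):
    """Count n-grams and their (n-1)-word contexts over tokenized sentences."""
    counts, context_counts = defaultdict(int), defaultdict(int)
    vocab = set()
    for sentence in corpus:
        # Pad with sentence-boundary markers so every word has a full context.
        tokens = ["<s>"] * (n - 1) + sentence + ["</s>"]
        vocab.update(tokens)
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return counts, context_counts, vocab


def prob(word, context, counts, context_counts, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(word | previous n-1 words)."""
    return (counts[context + (word,)] + 1) / (context_counts[context] + vocab_size)


def perplexity(corpus, counts, context_counts, vocab, n=3):
    """Perplexity = exp(-average log-probability per predicted token);
    lower values mean the model predicts the text better."""
    log_prob, n_tokens = 0.0, 0
    for sentence in corpus:
        tokens = ["<s>"] * (n - 1) + sentence + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            log_prob += math.log(
                prob(tokens[i], context, counts, context_counts, len(vocab)))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)
```

A composite model in the spirit of the abstract would replace the purely count-based `prob` with one that also conditions on a latent semantic topic, which is where the perplexity reduction over the plain N-gram comes from.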

Department or Program

Department of Computer Science and Engineering
