Publication Date

2011

Document Type

Thesis

Committee Members

Keke Chen (Committee Member), Shaojun Wang (Advisor), Xinhui Zhang (Committee Member)

Degree Name

Master of Science in Computer Engineering (MSCE)

Abstract

A language model is a crucial component of a statistical machine translation system. The basic language model is the N-gram model, which predicts the next word from the previous N-1 words; it has been used in state-of-the-art commercial machine translation systems for years. However, the N-gram model ignores the rich syntactic and semantic structure of natural language. We propose a composite semantic N-gram language model that combines the probabilistic latent semantic analysis (PLSA) model with the N-gram model as a single generative model. We implemented the proposed composite language model on a supercomputer with a thousand processors and trained it on a corpus of 1.3 billion tokens. Compared with the simple N-gram model, the large-scale composite language model achieves a significant perplexity reduction and BLEU score improvement on an n-best list re-ranking task for machine translation.
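
To make the combination concrete, below is a minimal Python sketch of one plausible way an N-gram estimate could be mixed with PLSA topic probabilities. The linear-interpolation form, the weight lam, and all names are illustrative assumptions, not the exact composite generative model formulated in the thesis.

    # Illustrative sketch only: mixes an N-gram estimate with a PLSA-style
    # topic mixture, p(w | h, d) ~ lam * p_ngram(w | h)
    #                              + (1 - lam) * sum_g p(w | g) * p(g | d).
    # The interpolation form and all parameter names are assumptions,
    # not the thesis's exact model.

    class CompositeLM:
        def __init__(self, ngram_probs, word_given_topic, lam=0.7):
            self.ngram_probs = ngram_probs            # {(w1, ..., wN): p}
            self.word_given_topic = word_given_topic  # {topic: {word: p}}
            self.lam = lam                            # mixing weight (assumed)

        def prob(self, word, history, topic_posterior):
            # p(word | history, document) under the illustrative mixture;
            # the 1e-6 floor stands in for real smoothing of unseen events.
            p_ngram = self.ngram_probs.get(tuple(history) + (word,), 1e-6)
            p_plsa = sum(p_g * self.word_given_topic[g].get(word, 1e-6)
                         for g, p_g in topic_posterior.items())
            return self.lam * p_ngram + (1.0 - self.lam) * p_plsa

    # Toy usage: one trigram entry and two latent topics.
    lm = CompositeLM(
        ngram_probs={("the", "stock", "market"): 0.2},
        word_given_topic={0: {"market": 0.05}, 1: {"market": 0.001}},
    )
    print(lm.prob("market", ["the", "stock"], {0: 0.9, 1: 0.1}))

This sketch only conveys the intuition that document-level topic context reweights N-gram predictions; the thesis itself trains the combination as a joint generative model on a distributed cluster.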

Page Count

38

Department or Program

Department of Computer Science and Engineering

Year Degree Awarded

2011

