Publication Date
2011
Document Type
Thesis
Committee Members
Keke Chen (Committee Member), Shaojun Wang (Advisor), Xinhui Zhang (Committee Member)
Degree Name
Master of Science in Computer Engineering (MSCE)
Abstract
The language model is a crucial component of a statistical machine translation system. The basic language model is the N-gram model, which predicts the next word from the previous N-1 words. It has been used in state-of-the-art commercial machine translation systems for years. However, the N-gram model ignores the rich syntactic and semantic structure of natural languages. We propose a composite semantic N-gram language model that combines the probabilistic latent semantic analysis (PLSA) model with the N-gram model as a generative model. We implemented the proposed composite language model on a supercomputer with a thousand processors and trained it on a corpus of 1.3 billion tokens. Compared with the simple N-gram model, the large-scale composite language model achieves a significant perplexity reduction and BLEU score improvement on an n-best list re-ranking task for machine translation.
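For orientation, one standard way such a composition can be written is sketched below; this is illustrative only, the exact factorization used in the thesis may differ, and the topic variable g and document variable d are assumptions of the sketch:

N-gram:     p(w_1, \dots, w_T) = \prod_{k=1}^{T} p(w_k \mid w_{k-N+1}, \dots, w_{k-1})
PLSA:       p(w \mid d) = \sum_{g} p(w \mid g)\, p(g \mid d)
Composite:  p(w_k \mid w_{k-N+1}, \dots, w_{k-1}, d) = \sum_{g} p(w_k \mid w_{k-N+1}, \dots, w_{k-1}, g)\, p(g \mid d)

Under this reading, the N-gram component supplies local lexical context while the PLSA topic variable injects document-level semantic information into each prediction.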
Page Count
38
Department or Program
Department of Computer Science and Engineering
Year Degree Awarded
2011
Copyright
Copyright 2011, all rights reserved. This open access ETD is published by Wright State University and OhioLINK.