Kno.e.sis Publications

Text Classification in Asian Languages Without Word Segmentation

Document Type

Conference Proceeding

Publication Date

7-7-2003

Abstract

We present a simple approach for Asian language text classification without word segmentation, based on statistical n-gram language modeling. In particular, we examine Chinese and Japanese text classification. With character n-gram models, our approach avoids word segmentation. However, unlike traditional ad hoc n-gram models, the statistical language modeling approach has a strong information-theoretic basis and avoids an explicit feature selection procedure, which can discard a significant amount of useful information. We systematically study the key factors in language modeling and their influence on classification. Experiments on Chinese TREC and Japanese NTCIR topic detection show that the simple approach can achieve better performance than traditional approaches while avoiding word segmentation, which demonstrates its superiority in Asian language text classification.
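
To make the idea in the abstract concrete, the following is a minimal sketch of classification with one character n-gram language model per class: each document is scored under every class model, and the class with the highest log-likelihood (plus log prior) wins. It is not the authors' implementation. The class name `CharNgramClassifier`, the choice of bigrams, the add-one smoothing, and the toy training sentences are all assumptions made for illustration; the abstract does not specify the n-gram order or the smoothing method.

```python
import math
from collections import Counter, defaultdict


class CharNgramClassifier:
    """Hypothetical helper: one character n-gram language model per class.

    Add-one smoothing is an assumption of this sketch, not a detail
    taken from the abstract.
    """

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # class -> (context, char) counts
        self.context_counts = defaultdict(Counter)  # class -> context counts
        self.doc_counts = Counter()                 # class -> number of training documents
        self.vocab = set()                          # all characters seen in training

    def _ngrams(self, text):
        # Pad the start so that every character has a full (n-1)-character context.
        padded = "\x02" * (self.n - 1) + text
        for i in range(len(text)):
            yield padded[i:i + self.n - 1], padded[i + self.n - 1]

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            self.vocab.update(text)
            for context, char in self._ngrams(text):
                self.ngram_counts[label][(context, char)] += 1
                self.context_counts[label][context] += 1
        return self

    def log_scores(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen characters
        scores = {}
        for label in self.doc_counts:
            # Start from the log prior of the class.
            score = math.log(self.doc_counts[label] / total_docs)
            for context, char in self._ngrams(text):
                num = self.ngram_counts[label][(context, char)] + 1
                den = self.context_counts[label][context] + vocab_size
                score += math.log(num / den)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy data, invented for illustration only; no word segmentation is applied.
train_texts = ["球队赢得了比赛", "球员在比赛中进球", "股票市场今天上涨", "银行下调了利率"]
train_labels = ["sports", "sports", "finance", "finance"]

clf = CharNgramClassifier(n=2).fit(train_texts, train_labels)
sports_guess = clf.predict("比赛进球")
finance_guess = clf.predict("银行利率")
print(sports_guess, finance_guess)
```

Because the models operate directly on characters, the raw text is never tokenized into words, and no separate feature selection step is needed: every observed n-gram contributes to the class score.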

Comments

Presented at the 6th International Workshop on Information Retrieval with Asian Languages, Sapporo, Japan, July 7, 2003.

Repository Citation

Peng, F., Huang, X., Schuurmans, D., & Wang, S. (2003). Text Classification in Asian Languages Without Word Segmentation. Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages, 11, 41-48.
https://corescholar.libraries.wright.edu/knoesis/1018

DOI

10.3115/1118935.1118941
