Word segmentation for the Myanmar language

Tun Thura Thet, Jin Cheon Na*, Wunna Ko Ko

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

24 Citations (Scopus)

Abstract

This study reports the development of a Myanmar word segmentation method using Unicode standard encoding. Word segmentation is an essential step prior to natural language processing in the Myanmar language, because a Myanmar text is a string of characters without explicit word boundary delimiters. The proposed method has two phases: syllable segmentation and syllable merging. A rule-based heuristic approach was adopted for syllable segmentation, and a dictionary-based statistical approach for syllable merging. Evaluation of test results showed that the method is very effective for the Myanmar language.
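As a rough illustration of the second phase, the sketch below merges an already-segmented syllable list into words by greedy longest match against a dictionary, and adds a pointwise mutual information score of the kind often used to measure collocation strength. It is not the authors' implementation: the function names `merge_syllables` and `pmi`, the `max_len` limit, the romanized placeholder syllables, and the use of a single shared total in the PMI formula are all assumptions made for the example. The rule-based syllable segmentation phase is not shown.

```python
import math


def merge_syllables(syllables, dictionary, max_len=4):
    """Greedy longest-match merging of syllables into dictionary words.

    Hypothetical helper: `max_len` (maximum syllables per word) is an
    assumed parameter, not a value taken from the paper.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first and shrink until a dictionary hit.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback word.
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information, log2(P(x, y) / (P(x) * P(y))).

    Assumes, for simplicity, that all three counts are normalized by
    the same `total`.
    """
    return math.log2((count_xy * total) / (count_x * count_y))


# Romanized placeholder syllables stand in for Myanmar script here.
syllables = ["my", "an", "mar", "lan", "guage", "x"]
dictionary = {"myanmar", "language"}
words = merge_syllables(syllables, dictionary)
score = pmi(2, 4, 4, 16)
```

A statistical score such as `pmi` could be used to decide whether adjacent syllables that are missing from the dictionary should still be joined, although how the paper combines the dictionary and the statistics is not stated in the abstract.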

Original language: English
Pages (from-to): 688-704
Number of pages: 17
Journal: Journal of Information Science
Volume: 34
Issue number: 5
Publication status: Published - Oct 2008
Externally published: Yes

ASJC Scopus Subject Areas

  • Information Systems
  • Library and Information Sciences

Keywords

  • Collocation strength
  • Mutual information
  • Myanmar language
  • Natural language processing
  • Syllable merging
  • Syllable segmentation
  • Word segmentation
