Abstract
This study reports the development of a Myanmar word segmentation method using Unicode standard encoding. Word segmentation is an essential step prior to natural language processing in the Myanmar language, because a Myanmar text is a string of characters without explicit word boundary delimiters. The proposed method has two phases: syllable segmentation and syllable merging. A rule-based heuristic approach was adopted for syllable segmentation, and a dictionary-based statistical approach for syllable merging. Evaluation of test results showed that the method is very effective for the Myanmar language.
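The two phases described above can be illustrated with a minimal sketch. It is not the authors' rule set or scoring function: the syllable rule is a simplified assumption (a syllable starts at a consonant that is neither stacked nor syllable-final), the merging step combines longest dictionary match with a pointwise mutual information test, and all names, the toy dictionary, and the threshold are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed simplified rule (not the paper's): a syllable starts at a Myanmar
# consonant (U+1000..U+1021) unless it is stacked (preceded by virama U+1039)
# or syllable-final (followed by asat U+103A or virama U+1039).
SYLLABLE_START = re.compile(r"(?<!\u1039)[\u1000-\u1021](?![\u103a\u1039])")


def segment_syllables(text):
    """Phase 1: split a Myanmar string into syllables with the rule above."""
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


def pmi(pair_counts, unit_counts, left, right):
    """Pointwise mutual information (log base 2) of an adjacent syllable pair."""
    pair = pair_counts[(left, right)]
    if pair == 0:
        return float("-inf")
    p_pair = pair / sum(pair_counts.values())
    p_left = unit_counts[left] / sum(unit_counts.values())
    p_right = unit_counts[right] / sum(unit_counts.values())
    return math.log2(p_pair / (p_left * p_right))


def merge_syllables(syllables, dictionary, pair_counts, unit_counts, threshold=1.0):
    """Phase 2: merge syllables into words.

    Tries the longest dictionary match first; otherwise merges the next two
    syllables when their PMI reaches the (hypothetical) threshold.
    """
    words = []
    i = 0
    while i < len(syllables):
        match_len = 0
        for j in range(len(syllables), i, -1):
            if "".join(syllables[i:j]) in dictionary:
                match_len = j - i
                break
        if match_len:
            words.append("".join(syllables[i:i + match_len]))
            i += match_len
        elif (i + 1 < len(syllables)
              and pmi(pair_counts, unit_counts, syllables[i], syllables[i + 1]) >= threshold):
            words.append(syllables[i] + syllables[i + 1])
            i += 2
        else:
            words.append(syllables[i])
            i += 1
    return words


# Sample input written with escapes: three syllables, eight code points.
sample = "\u1019\u103c\u1014\u103a\u1019\u102c\u1005\u102c"
toy_dictionary = {"\u1019\u103c\u1014\u103a\u1019\u102c"}  # hypothetical one-entry dictionary
syllables = segment_syllables(sample)
words = merge_syllables(syllables, toy_dictionary, Counter(), Counter())
```

With this toy dictionary, the first two syllables are merged into one word and the third is left on its own. In practice the counts passed to `pmi` would come from a syllable-segmented corpus, and the threshold would need tuning against annotated data.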
Original language | English |
---|---|
Pages (from-to) | 688-704 |
Number of pages | 17 |
Journal | Journal of Information Science |
Volume | 34 |
Issue number | 5 |
Publication status | Published - Oct 2008 |
Externally published | Yes |
ASJC Scopus Subject Areas
- Information Systems
- Library and Information Sciences
Keywords
- Collocation strength
- Mutual information
- Myanmar language
- Natural language processing
- Syllable merging
- Syllable segmentation
- Word segmentation