TY - GEN
T1 - Automatic classification of web search results
T2 - 10th International Conference on Asian Digital Libraries, ICADL 2007
AU - Thet, Tun Thura
AU - Na, Jin Cheon
AU - Khoo, Christopher S.G.
PY - 2007
Y1 - 2007
AB - This study seeks to develop an automatic method to identify product review documents on the Web using the snippets (summary information that includes the URL, title, and summary text) returned by the Web search engine. The aim is to allow the user to extend topical search with genre-based filtering or categorization. First, we applied a common machine learning technique, SVM (Support Vector Machine), to investigate which features of the snippets are useful for classification. The best results were obtained using just the title and URL (domain and folder names) of the snippets as phrase terms (n-grams). Then we developed a heuristic approach that utilizes domain knowledge constructed semi-automatically, and found that it performs comparatively well, with only a small drop in accuracy rates. A hybrid approach that combines both the machine learning and heuristic approaches performs slightly better than the machine learning approach alone.
KW - Genre classification
KW - Product review documents
KW - Snippets
KW - Web search results
UR - http://www.scopus.com/inward/record.url?scp=38149045775&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=38149045775&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-77094-7_13
DO - 10.1007/978-3-540-77094-7_13
M3 - Conference contribution
AN - SCOPUS:38149045775
SN - 9783540770930
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 65
EP - 74
BT - Asian Digital Libraries
PB - Springer Verlag
Y2 - 10 December 2007 through 13 December 2007
ER -