Discovery of concept entities from web sites using web unit mining

Ming Yin Ming; Dion Hoe-lian Goh; Ee Peng Lim; Aixin Sun

doi:10.1108/17440080580000088

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)

Abstract

A web site usually contains a large number of concept entities, each consisting of one or more web pages connected by hyperlinks. In order to discover these concept entities for more expressive web site queries and other applications, the web unit mining problem has been proposed. Web unit mining aims to determine web pages that constitute a concept entity and classify concept entities into categories. Nevertheless, the performance of an existing web unit mining algorithm, iWUM, suffers as it may create more than one web unit (incomplete web units) from a single concept entity. This paper presents two methods to solve this problem. The first method introduces a more effective web fragment construction method so as to reduce later classification errors. The second method incorporates site-specific knowledge to discover and handle incomplete web units. Experiments show that incomplete web units can be removed and overall accuracy has been significantly improved, especially on the precision and F1 measures.
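The abstract reports gains on the precision and F1 measures but does not spell out how they are computed over web units. As a minimal sketch, assuming the standard definitions of precision, recall, and F1 from raw counts, the following shows why extra incomplete web units would be expected to lower precision and F1. The function name and the toy counts are hypothetical illustrations, not figures from the paper.

```python
def precision_recall_f1(true_positives: int, false_positives: int, false_negatives: int):
    """Standard precision, recall, and F1 computed from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy counts (an assumption for illustration only):
# 8 concept entities exist. A miner returns 8 correct web units plus
# 4 spurious incomplete units, which are counted as false positives.
fragmented = precision_recall_f1(true_positives=8, false_positives=4, false_negatives=0)

# The same miner after the incomplete units have been merged or removed.
merged = precision_recall_f1(true_positives=8, false_positives=0, false_negatives=0)

print("fragmented:", fragmented)
print("merged:    ", merged)
```

In this toy setting recall is unchanged, while precision and F1 rise once the spurious units are gone, which is consistent with the direction of improvement the abstract describes.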

Original language	English
Pages (from-to)	123-136
Number of pages	14
Journal	International Journal of Web Information Systems
Volume	1
Issue number	3
DOIs	https://doi.org/10.1108/17440080580000088
Publication status	Published - Aug 1 2005
Externally published	Yes

ASJC Scopus Subject Areas

Information Systems
Computer Networks and Communications

Keywords

Web classification
Web information organization

Cite this

@article{c5484abe545c47a594a1fde9dacc5861,

title = "Discovery of concept entities from web sites using web unit mining",

abstract = "A web site usually contains a large number of concept entities, each consisting of one or more web pages connected by hyperlinks. In order to discover these concept entities for more expressive web site queries and other applications, the web unit mining problem has been proposed. Web unit mining aims to determine web pages that constitute a concept entity and classify concept entities into categories. Nevertheless, the performance of an existing web unit mining algorithm, iWUM, suffers as it may create more than one web unit (incomplete web units) from a single concept entity. This paper presents two methods to solve this problem. The first method introduces a more effective web fragment construction method so as to reduce later classification errors. The second method incorporates site-specific knowledge to discover and handle incomplete web units. Experiments show that incomplete web units can be removed and overall accuracy has been significantly improved, especially on the precision and F1 measures.",

keywords = "Web classification, Web information organization",

author = "{Yin Ming}, Ming and {Hoe-lian Goh}, Dion and Lim, {Ee Peng} and Aixin Sun",

year = "2005",

month = aug,

day = "1",

doi = "10.1108/17440080580000088",

language = "English",

volume = "1",

pages = "123--136",

journal = "International Journal of Web Information Systems",

issn = "1744-0084",

publisher = "Emerald Group Publishing Ltd.",

number = "3",

}

TY - JOUR

T1 - Discovery of concept entities from web sites using web unit mining

AU - Yin Ming, Ming

AU - Hoe-lian Goh, Dion

AU - Lim, Ee Peng

AU - Sun, Aixin

PY - 2005/8/1

Y1 - 2005/8/1

N2 - A web site usually contains a large number of concept entities, each consisting of one or more web pages connected by hyperlinks. In order to discover these concept entities for more expressive web site queries and other applications, the web unit mining problem has been proposed. Web unit mining aims to determine web pages that constitute a concept entity and classify concept entities into categories. Nevertheless, the performance of an existing web unit mining algorithm, iWUM, suffers as it may create more than one web unit (incomplete web units) from a single concept entity. This paper presents two methods to solve this problem. The first method introduces a more effective web fragment construction method so as to reduce later classification errors. The second method incorporates site-specific knowledge to discover and handle incomplete web units. Experiments show that incomplete web units can be removed and overall accuracy has been significantly improved, especially on the precision and F1 measures.

AB - A web site usually contains a large number of concept entities, each consisting of one or more web pages connected by hyperlinks. In order to discover these concept entities for more expressive web site queries and other applications, the web unit mining problem has been proposed. Web unit mining aims to determine web pages that constitute a concept entity and classify concept entities into categories. Nevertheless, the performance of an existing web unit mining algorithm, iWUM, suffers as it may create more than one web unit (incomplete web units) from a single concept entity. This paper presents two methods to solve this problem. The first method introduces a more effective web fragment construction method so as to reduce later classification errors. The second method incorporates site-specific knowledge to discover and handle incomplete web units. Experiments show that incomplete web units can be removed and overall accuracy has been significantly improved, especially on the precision and F1 measures.

KW - Web classification

KW - Web information organization

UR - http://www.scopus.com/inward/record.url?scp=33745785079&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33745785079&partnerID=8YFLogxK

U2 - 10.1108/17440080580000088

DO - 10.1108/17440080580000088

M3 - Article

AN - SCOPUS:33745785079

SN - 1744-0084

VL - 1

SP - 123

EP - 136

JO - International Journal of Web Information Systems

JF - International Journal of Web Information Systems

IS - 3

ER -
