Focused Crawling for Building Web Comment Corpora

Authors

M. Neunerdt, B. Trevisan, M. Niermann, R. Mathar

Abstract

        Web 2.0 provides various types of social media applications, e.g., blogs, forums and news sites that allow users to post Web comments. This kind of communication plays an important role in acceptance research. To extract different opinions from such data, it is necessary to build Web comment corpora. Building such corpora requires focused crawling. Many focused Web crawling algorithms are known to build topic-specific Web collections. However, the type of Web pages is typically not considered. In this paper, we introduce a new type-specific focused crawler, which uses a classifier based on HTML meta information. Its application allows for collecting only Web pages that cover Web comments from various domains.

BibTeX Reference Entry

@inproceedings{NeTrNiMa13,
	author = {Melanie Neunerdt and Bianka Trevisan and Markus Niermann and Rudolf Mathar},
	title = "Focused Crawling for Building Web Comment Corpora",
	pages = "676--679",
	booktitle = "The 10th {IEEE} Consumer Communications {\&} Networking Conference CCNC 2013",
	address = {Las Vegas, Nevada USA},
	month = Jan,
	year = 2013,
	hsb = hsb999910282939,
	}

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.