The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
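Internet computing has been proposed as a way to exploit personal computing resources across the Internet to build large-scale Web applications at low cost. This paper proposes a DHT-based distributed Web crawling model built on the concept of Internet computing. It also proposes two optimizations that reduce the download time and the waiting time of Web crawling tasks in order to increase the system's throughput and update rate. Based on a contributor-friendly download scheme, download time is improved by shortening crawler-crawlee RTTs; to estimate the RTTs accurately, a network coordinate system is combined with the underlying DHT. Waiting time is improved by redirecting incoming crawling tasks to lightly loaded crawlers so that the queues on all crawlers stay equally sized. A simple Web site partition method is also proposed that splits a large Web site into smaller pieces to reduce task granularity. All proposed methods were evaluated through real Internet tests and simulations, with satisfactory results.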
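The two optimizations named in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give its details. The sketch assumes that each crawler and each target site has a 2-D network coordinate whose Euclidean distance approximates RTT in milliseconds, and that load is measured by queue length. The names `Crawler`, `estimated_rtt`, `pick_crawler`, `assign`, and the `max_rtt_ms` threshold are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Coord = Tuple[float, float]  # hypothetical 2-D network coordinate (ms)


@dataclass
class Crawler:
    name: str
    coord: Coord
    queue: List[str] = field(default_factory=list)  # pending crawl tasks


def estimated_rtt(a: Coord, b: Coord) -> float:
    """Assumption: RTT is approximated by Euclidean distance between coordinates."""
    return math.dist(a, b)


def pick_crawler(crawlers: List[Crawler], site: Coord, max_rtt_ms: float) -> Crawler:
    """Prefer nearby crawlers, then the shortest queue among them.

    If no crawler is within max_rtt_ms of the site, fall back to the nearest one.
    """
    near = [c for c in crawlers if estimated_rtt(c.coord, site) <= max_rtt_ms]
    if not near:
        return min(crawlers, key=lambda c: estimated_rtt(c.coord, site))
    # Shortest queue first; estimated RTT breaks ties.
    return min(near, key=lambda c: (len(c.queue), estimated_rtt(c.coord, site)))


def assign(crawlers: List[Crawler], url: str, site: Coord,
           max_rtt_ms: float = 50.0) -> Crawler:
    """Redirect an incoming task to the chosen crawler and enqueue it."""
    chosen = pick_crawler(crawlers, site, max_rtt_ms)
    chosen.queue.append(url)
    return chosen


if __name__ == "__main__":
    pool = [Crawler("a", (0.0, 0.0)), Crawler("b", (10.0, 0.0)),
            Crawler("c", (200.0, 0.0))]
    for i in range(3):
        print(assign(pool, f"http://example.com/page{i}", (5.0, 0.0)).name)
```

In this sketch, tasks for a site are spread over the crawlers that are estimated to be close to it, so queues stay roughly level without sending downloads over long paths. A real system would obtain the coordinates from the network coordinate system running alongside the DHT rather than hard-coding them.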
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Xiao XU, Weizhe ZHANG, Hongli ZHANG, Binxing FANG, "Efficient Distributed Web Crawling Utilizing Internet Resources" in IEICE TRANSACTIONS on Information,
vol. E93-D, no. 10, pp. 2747-2762, October 2010, doi: 10.1587/transinf.E93.D.2747.
Abstract: Internet computing is proposed to exploit personal computing resources across the Internet in order to build large-scale Web applications at lower cost. In this paper, a DHT-based distributed Web crawling model based on the concept of Internet computing is proposed. Also, we propose two optimizations to reduce the download time and waiting time of the Web crawling tasks in order to increase the system's throughput and update rate. Based on our contributor-friendly download scheme, the improvement on the download time is achieved by shortening the crawler-crawlee RTTs. In order to accurately estimate the RTTs, a network coordinate system is combined with the underlying DHT. The improvement on the waiting time is achieved by redirecting the incoming crawling tasks to light-loaded crawlers in order to keep the queue on each crawler equally sized. We also propose a simple Web site partition method to split a large Web site into smaller pieces in order to reduce the task granularity. All the methods proposed are evaluated through real Internet tests and simulations showing satisfactory results.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E93.D.2747/_p
@ARTICLE{e93-d_10_2747,
author={Xiao XU and Weizhe ZHANG and Hongli ZHANG and Binxing FANG},
journal={IEICE TRANSACTIONS on Information},
title={Efficient Distributed Web Crawling Utilizing Internet Resources},
year={2010},
volume={E93-D},
number={10},
pages={2747-2762},
abstract={Internet computing is proposed to exploit personal computing resources across the Internet in order to build large-scale Web applications at lower cost. In this paper, a DHT-based distributed Web crawling model based on the concept of Internet computing is proposed. Also, we propose two optimizations to reduce the download time and waiting time of the Web crawling tasks in order to increase the system's throughput and update rate. Based on our contributor-friendly download scheme, the improvement on the download time is achieved by shortening the crawler-crawlee RTTs. In order to accurately estimate the RTTs, a network coordinate system is combined with the underlying DHT. The improvement on the waiting time is achieved by redirecting the incoming crawling tasks to light-loaded crawlers in order to keep the queue on each crawler equally sized. We also propose a simple Web site partition method to split a large Web site into smaller pieces in order to reduce the task granularity. All the methods proposed are evaluated through real Internet tests and simulations showing satisfactory results.},
keywords={},
doi={10.1587/transinf.E93.D.2747},
ISSN={1745-1361},
month={October},}
TY  - JOUR
TI  - Efficient Distributed Web Crawling Utilizing Internet Resources
T2  - IEICE TRANSACTIONS on Information
SP  - 2747
EP  - 2762
AU  - Xiao XU
AU  - Weizhe ZHANG
AU  - Hongli ZHANG
AU  - Binxing FANG
PY  - 2010
DO  - 10.1587/transinf.E93.D.2747
JO  - IEICE TRANSACTIONS on Information
SN  - 1745-1361
VL  - E93-D
IS  - 10
JA  - IEICE TRANSACTIONS on Information
Y1  - October 2010
AB  - Internet computing is proposed to exploit personal computing resources across the Internet in order to build large-scale Web applications at lower cost. In this paper, a DHT-based distributed Web crawling model based on the concept of Internet computing is proposed. Also, we propose two optimizations to reduce the download time and waiting time of the Web crawling tasks in order to increase the system's throughput and update rate. Based on our contributor-friendly download scheme, the improvement on the download time is achieved by shortening the crawler-crawlee RTTs. In order to accurately estimate the RTTs, a network coordinate system is combined with the underlying DHT. The improvement on the waiting time is achieved by redirecting the incoming crawling tasks to light-loaded crawlers in order to keep the queue on each crawler equally sized. We also propose a simple Web site partition method to split a large Web site into smaller pieces in order to reduce the task granularity. All the methods proposed are evaluated through real Internet tests and simulations showing satisfactory results.
ER  -