Web robots are software programs that automatically traverse the hyperlink structure of the World Wide Web in order to locate and retrieve information. There are many reasons why it is important to identify visits by Web robots and distinguish them from other users. First, e-commerce retailers are particularly concerned about the unauthorized deployment of robots that gather business intelligence at their Web sites. In addition, Web robots tend to consume considerable network bandwidth at the expense of other users. Sessions due to Web robots also make it more difficult to perform clickstream analysis effectively on Web data. Conventional techniques for detecting Web robots are often based on identifying the IP address and user agent of the Web client. While these techniques are applicable to many well-known robots, they may not be sufficient to detect camouflaged and previously unknown robots. In this paper, we propose an alternative approach that uses the navigational patterns in clickstream data to determine whether a session is due to a robot. Experimental results on our Computer Science department Web server logs show that highly accurate classification models can be built using this approach. We also show that these models are able to discover many camouflaged and previously unidentified robots.
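As a rough illustration of the idea behind the navigational-pattern approach (this is a hypothetical sketch, not the classification model used in the paper), one can derive simple per-session features from a Web server log and apply a decision rule to them. All feature names and thresholds below are invented for illustration:

```python
# Hypothetical sketch of navigational-feature-based robot detection.
# A session is a list of (HTTP method, requested path) tuples parsed
# from an access log; the features and thresholds are illustrative only.

def session_features(requests):
    """Compute simple navigational features for one session."""
    total = len(requests)
    head = sum(1 for method, _ in requests if method == "HEAD")
    images = sum(1 for _, path in requests
                 if path.lower().endswith((".gif", ".jpg", ".png")))
    return {
        # Robots often probe pages with HEAD requests.
        "head_ratio": head / total,
        # Human browsers fetch embedded images; many robots skip them.
        "image_ratio": images / total,
        # Well-behaved robots request the robot-exclusion file.
        "requested_robots_txt": any(p == "/robots.txt" for _, p in requests),
    }

def looks_like_robot(feat):
    # Toy decision rule with hypothetical thresholds; a real system
    # would learn such rules from labeled sessions.
    return (feat["requested_robots_txt"]
            or feat["head_ratio"] > 0.5
            or feat["image_ratio"] < 0.05)
```

A session such as `[("GET", "/robots.txt"), ("GET", "/index.html")]` would be flagged, while a browser-like session that also fetches page images would not. The point of the approach is that such behavioral signals remain visible even when a robot spoofs its user-agent string or rotates IP addresses.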
Bibliographical note

Funding Information:
This work was partially supported by NSF grant #ACI-9982274 and by Army High Performance Computing Research Center contract number DAAH04-95-C-0008. The content of this work does not necessarily reflect the position or policy of the government, and no official endorsement should be inferred. Access to computing facilities was provided by AHPCRC and the Minnesota Supercomputing Institute.
Copyright 2008 Elsevier B.V., All rights reserved.
Keywords:
- Data mining
- Web robot detection
- Web usage mining