Thursday, 19 May 2011

Using Statistics to Detect Search Engine Spam

Using Statistics to Detect Search Engine Optimization Spam
An example of an application of statistical methods to detect web spam is presented in the paper "Spam, Damn Spam and Statistics" by Dennis Fetterly, Mark Manasse and Marc Najork from Microsoft. They used two sets of pages downloaded from the Internet. The first set was crawled repeatedly from November 2002 to February 2003 and consisted from 150 million URLs. For each page the researches recorded HTTP status, time of download, document length, number of non-markup words, and a vector indicating the changes in page content between downloads. A sample of this set (751 pages) was inspected manually and 61 spam pages were discovered, or 8.1% of the set with a confidence interval of 1.95% at 95% confidence.
Another set was crawled between July and September 2002 and comprises 429 million pages and 38 million HTTP redirects. For this set the following properties were recorded: URL, URLs of outgoing links; for the HTTP redirects - the source and the target URL. 535 pages were manually inspected and 37 of them were identified as spam (6.9%).

The research concentrates on studying the following properties of web pages:

URL properties, including length and percentage of non-alphabetical characters (dashes, digits, dots etc.). *
Host name resolutions.
Linkage properties.
Content properties.
Content evolution properties.
Clustering properties.