COMMENTARY
(William W. Savage III)

Thanks to some jackass robot, I recently learned about something known as web scraping.

Basically, and in the sense that it’s relevant to NonDoc, people use computer programs to extract content from web pages and then populate their own websites with that content. (For a more comprehensive definition and history, see this article from cloud-security provider Distil Networks.)
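To make the mechanics concrete, here is a minimal sketch in Python of what such a scraper does. The URL, the HTML selectors and the library choices are my own illustrative assumptions, not the code any particular scraper actually runs:

import requests
from bs4 import BeautifulSoup

def scrape_article(url):
    # Fetch the page much like a browser would
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # Parse the HTML and pull out the headline and body paragraphs
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("h1").get_text(strip=True)
    paragraphs = [p.get_text(strip=True) for p in soup.select("article p")]
    return title, paragraphs

# Hypothetical usage with a placeholder address
title, body = scrape_article("https://example.com/some-article")

A script like this, pointed at a list of target pages and wired to a publishing tool, is all it takes to populate an entire site with someone else’s work.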

The content that third parties scrape can be optimized to boost a fake site’s search-engine results, with the aim of eventually fooling Google and other search indexes into thinking the site is legit and not fully automated.

A human can easily tell the difference because, when original content like NonDoc’s gets scraped this way, the regurgitated copy reads as nonsense. Pre-programmed bots, however, can’t tell the difference.

With well over half of all web traffic estimated to come from non-humans, the need to make sense to human readers has taken a backseat to the need to be comprehensively optimized for search engines. Nonsense or not, a person running such an automated site could eventually generate ad revenue through a program like Google AdSense. At the very least, perpetrators of web scraping rob the original site of traffic and potential revenue. Further, the legality of such activity is dubious at best.

Our real-life example

(Screencap)
I came across the example above during a moment of vanity — doing a Google image search with my name as the search term.

What I found was confusing, but it taught me about web scraping.

The weird site ThatVilla.com uses a scraper, and it scraped my photo post about the Plaza Alley Walls from back in September. (You may have to hit escape to stop your browser from redirecting right after you click that link; like I said, the site is weird.)

RELATED

“Plaza Alley Walls open at District Fest” by Josh McBee

Because I had included photo credits under every image in my original post, I spied the pictures in my vain Google search, but some of the results pointed to ThatVilla instead of NonDoc.

Not only is it annoying to see one’s original content so blatantly misappropriated, it’s (potentially) a detriment to our site’s traffic because it siphons search-engine ranking, keyword authority and other revenue-related metrics away from our site. Further, it violates our terms of service and constitutes copyright infringement.

Alas, enforcing such claims against web scrapers seems to be a Sisyphean effort.

To show how blatantly the site in question plagiarizes, here is some of my post’s page copy as it appeared originally on NonDoc:

The 17th annual Plaza District Festival benefited from excellent weather Saturday, Sept. 26. In addition to the usual festival fare of vendors, food trucks and live music, the opening of the Plaza Alley Walls public art project punctuated the day.

Now here’s the text as it appears on ThatVilla after its programs mangled it for their illicit use:

The 17th annual Plaza District Festival benefited from glorious continue Saturday, Sept. 26. In further to a common festival transport of vendors, food trucks and live music, a opening of a Plaza Alley Walls open art plan punctuated a day.

While not a direct copy, it seems obvious that some kind of computer script was used to change words like “the” to “a,” “excellent” to “glorious,” and “public” to “open.” Not only is it plagiarism to reproduce my work, but the scraped result also fails to make sense beyond the goal of populating a dummy site with text for the purpose of stealing site traffic and generating ill-gotten revenue.
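For illustration, a few lines of Python are enough to perform that kind of crude word swap. The substitution list below is my own guess inferred from the comparison above; it is not ThatVilla’s actual code:

import re

# Hypothetical word list inferred from the before/after passages above
SUBSTITUTIONS = {
    "the": "a",
    "excellent": "glorious",
    "public": "open",
    "addition": "further",
}

def spin(text):
    # Replace whole words only, leaving the rest of the sentence intact
    def swap(match):
        word = match.group(0)
        return SUBSTITUTIONS.get(word.lower(), word)
    return re.sub(r"\b\w+\b", swap, text)

print(spin("The 17th annual Plaza District Festival benefited from excellent weather."))
# -> "a 17th annual Plaza District Festival benefited from glorious weather."

The output is broken in exactly the way the ThatVilla copy is, which is presumably why a human spots the fraud instantly while a search crawler may not.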

Problems for other industries

The detriments of web scraping are not limited to humble operations such as NonDoc. Curt Beardsley, operator of Realtor.com, recently outlined the concerns the real estate industry as a whole has (or should have) regarding the web scraping of listing data.

According to Beardsley, Realtor.com was facing 1.5 million scraping attempts per day. Through the use of sophisticated programming techniques and eventually the hiring of a third-party firm, Beardsley said he was able to block almost 99 percent of all scrape attempts, but now the scrapers are moving on to easier targets.

“My problem has now become an MLS and broker problem,” he said in the article.

What can be done?

We run our site on the WordPress platform here at NonDoc. As such, a multitude of plugins is available to endlessly customize and extend our site’s functionality on both the back and front ends. We’re currently experimenting with some anti-scraping plugins, but those measures have only recently been implemented. If anyone has advice or tips on how to limit a site’s vulnerability to scraping attacks, please share them in the comments below.
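As one illustration of the general idea (and only that), here is a rough Python sketch of a tactic many anti-scraping tools use: throttling clients that request pages far faster than any human reader would. NonDoc’s actual measures live in WordPress plugins, so treat the names and thresholds here as assumptions for explanation, not our configuration:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # look at the last minute of activity
MAX_REQUESTS = 30     # more hits than this per minute looks automated

recent_hits = defaultdict(deque)  # client IP -> timestamps of recent requests

def allow_request(client_ip):
    now = time.time()
    hits = recent_hits[client_ip]

    # Forget requests that have aged out of the window
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()

    # Too many recent requests: serve a block page or CAPTCHA instead of content
    if len(hits) >= MAX_REQUESTS:
        return False

    hits.append(now)
    return True

Rate limiting alone won’t stop a determined scraper that rotates IP addresses, which is part of why operations like Realtor.com end up hiring specialist firms.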

My initial research leads me to believe NonDoc will never be able to prevent every scraping attempt, but we can limit our site’s vulnerability to the easier attacks. If you run a site, you may want to do the same.