Prevent site data from being crawled and ripped (spam-prevention)

Accepted answer
Score: 20

Any site that is visible to human eyes is, in theory, potentially rippable. If you're going to even try to be accessible then this, by definition, must be the case (how else will speaking browsers be able to deliver your content if it isn't machine readable?).

Your best bet is to look into watermarking your content, so that at least if it does get ripped you can point to the watermarks and claim ownership.
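One low-tech way to do that for text content, purely as an illustrative sketch and not a standard technique from this answer, is to hide a site-specific fingerprint in the text using zero-width characters; all names and the fingerprint format below are assumptions.

    # A minimal sketch of one text-watermarking approach: hide an invisible,
    # site-specific fingerprint using zero-width characters. The helper names
    # and fingerprint format are illustrative, not a standard API.
    ZERO_WIDTH = {"0": "\u200b", "1": "\u200c"}  # zero-width space / non-joiner

    def embed_watermark(text: str, fingerprint: str) -> str:
        """Append an invisible binary fingerprint to a block of text."""
        bits = "".join(f"{ord(c):08b}" for c in fingerprint)
        hidden = "".join(ZERO_WIDTH[b] for b in bits)
        return text + hidden

    def extract_watermark(text: str) -> str:
        """Recover the fingerprint from suspected copied text, if present."""
        reverse = {v: k for k, v in ZERO_WIDTH.items()}
        bits = "".join(reverse[ch] for ch in text if ch in reverse)
        chars = [chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits) - 7, 8)]
        return "".join(chars)

    if __name__ == "__main__":
        marked = embed_watermark("Our original article text.", "example.com#42")
        print(extract_watermark(marked))  # -> example.com#42

A copier who normalizes whitespace strips this out, so it only helps you claim ownership after the fact, which is exactly the point being made above.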

Score: 13

Between this:

What are the measures I can take to prevent malicious crawlers from ripping

and this:

I wouldn't want to block legitimate crawlers altogether.

you're asking for a lot. Fact is, if you're going to try and block malicious scrapers, you're going to end up blocking all the "good" crawlers too.

You have to remember that if people want to scrape your content, they're going to put in a lot more manual effort than a search engine bot will... So get your priorities right. You've two choices:

  1. Let the peasants of the internet steal your content. Keep an eye out for it (searching Google for some of your more unique phrases) and send take-down requests to ISPs. This choice has barely any impact on you apart from the time it takes.
  2. Use AJAX and rolling encryption to request all your content from the server. You'll need to keep the method changing, or even randomise it, so each page load carries a different encryption scheme (a rough sketch follows below). But even this will be cracked if somebody wants to crack it. You'll also drop off the face of the search engines and therefore take a hit in traffic from real users.
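The sketch below shows roughly what option 2 might look like: an endpoint that obfuscates the payload with a fresh per-request key, which the page's JavaScript would have to undo before rendering. Flask, the route name and the XOR scheme are assumptions for illustration, and, as the answer says, whatever the browser can decode a determined scraper can decode too.

    import base64, os
    from flask import Flask, jsonify

    app = Flask(__name__)
    ARTICLE = "The content you want to make slightly harder to rip."

    def xor_obfuscate(data: bytes, key: bytes) -> bytes:
        # Simple rolling XOR; the point is only that the scheme changes per request.
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

    @app.route("/content")
    def content():
        key = os.urandom(8)                      # different key on every page load
        payload = xor_obfuscate(ARTICLE.encode(), key)
        # Client-side JS receives both pieces and reverses the XOR before rendering.
        return jsonify(key=base64.b64encode(key).decode(),
                       data=base64.b64encode(payload).decode())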
Score: 6

Good crawlers will follow the rules you specify in your robots.txt; malicious ones will not. You can set up a "trap" for bad robots, as explained here: http://www.fleiner.com/bots/.
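A minimal sketch of that trap idea, assuming Flask and made-up paths: disallow a hidden URL in robots.txt, link to it invisibly from your pages, and blacklist any client that requests it anyway.

    from flask import Flask, request, abort

    app = Flask(__name__)
    BAD_IPS = set()

    @app.route("/robots.txt")
    def robots():
        # Good crawlers obey the Disallow line and never hit the trap.
        return "User-agent: *\nDisallow: /trap/\n", 200, {"Content-Type": "text/plain"}

    @app.route("/trap/")
    def trap():
        # Only clients ignoring robots.txt (or following invisible links) get here.
        BAD_IPS.add(request.remote_addr)
        abort(403)

    @app.before_request
    def block_known_bad():
        if request.remote_addr in BAD_IPS:
            abort(403)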
But then again, if you put your content on the internet, I think it's better for everyone if it's as painless as possible to find (in fact, you're posting here and not at some lame forum where experts exchange their opinions).

Score: 6

Realistically you can't stop malicious crawlers - and any measures that you put in place to prevent them are likely to harm your legitimate users (aside from perhaps adding entries to robots.txt to allow detection).

So what you have to do is plan on the content being stolen - it's more than likely to happen in one form or another - and understand how you will deal with unauthorized copying.

Prevention isn't possible - and trying to make it so will be a waste of your time.

The only sure way of making sure that the content on a website isn't vulnerable to copying is to unplug the network cable...

To detect it, something like http://www.copyscape.com/ may help.

Score: 5

Don't even try to erect limits on the web!

It really is as simple as this.

Every potential measure to discourage ripping (aside from a very strict robots.txt) will harm your users. Captchas are more pain than gain. Checking the user agent shuts out unexpected browsers. The same is true for "clever" tricks with javascript.

Please keep the web open. If you don't want anything to be taken from your website, then do not publish it there. Watermarks can help you claim ownership, but that only helps when you want to sue after the harm is done.

Score: 3

The only way to stop a site being machine-ripped is to make the user prove that they are human.

You could make users perform a task that is easy for humans and hard for machines, e.g. a CAPTCHA. When a user first gets to your site, present a CAPTCHA and only allow them to proceed once it has been completed. If the user starts moving from page to page too quickly, re-verify.
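A sketch of that flow, assuming Flask sessions; the session keys, the speed threshold and the CAPTCHA check itself are placeholders, not a specific implementation from this answer.

    import time
    from flask import Flask, session, redirect, request, url_for

    app = Flask(__name__)
    app.secret_key = "change-me"          # needed for session cookies
    MIN_SECONDS_BETWEEN_PAGES = 2.0       # tune for how fast a human plausibly reads

    @app.before_request
    def require_human():
        if request.endpoint == "captcha":
            return                                        # don't gate the gate itself
        now = time.time()
        last = session.get("last_hit", 0.0)
        too_fast = now - last < MIN_SECONDS_BETWEEN_PAGES
        session["last_hit"] = now
        if not session.get("verified") or too_fast:
            session["verified"] = False                   # re-verify fast browsers
            return redirect(url_for("captcha"))

    @app.route("/captcha", methods=["GET", "POST"])
    def captcha():
        if request.method == "POST" and captcha_answer_is_correct(request.form):
            session["verified"] = True
            return redirect("/")
        return "CAPTCHA page goes here"

    def captcha_answer_is_correct(form) -> bool:
        return True  # placeholder for a real CAPTCHA service check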

This is not 100% effective and hackers are always trying to break them.

Alternatively, you could make responses slow. You don't need to make them crawl, but pick a speed that is reasonable for humans (this would be very slow for a machine). This just makes scraping your site take longer; it doesn't make it impossible.
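A sketch of that tarpit idea, with Flask and the delay value as assumptions: a small fixed delay is barely noticeable to a person reading one page, but turns a bulk rip of thousands of pages into hours of extra wall-clock time.

    import time
    from flask import Flask

    app = Flask(__name__)
    DELAY_SECONDS = 1.5   # barely noticeable per page for a human reader

    @app.before_request
    def tarpit():
        # Delay every response; a scraper fetching 10,000 pages pays 10,000x the cost.
        time.sleep(DELAY_SECONDS)

Note that this ties up a worker for the duration of each sleep, so the delay has a server-side cost of its own.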

OK. Out of ideas.

Score: 2

In short: you cannot prevent ripping. Malicious bots commonly use IE user agents and are fairly intelligent nowadays. If you want your site to be accessible to the maximum number of users (i.e. screen readers, etc.), you cannot use javascript or one of the popular plugins (Flash), simply because they can inhibit a legitimate user's access.

Perhaps you could have a cron job that picks a random snippet out of your database and googles it to check for matches. You could then try and get hold of the offending site and demand they take the content down.
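A sketch of that scheduled check; the database schema, the domain and especially the search call are placeholders, since the real thing depends on your data model and on whichever search API you have access to.

    import random
    import sqlite3

    OWN_DOMAIN = "example.com"   # assumption: your own site's domain

    def random_snippet(db_path: str = "site.db") -> str:
        # Pull one longish sentence from a random article (hypothetical schema).
        conn = sqlite3.connect(db_path)
        rows = conn.execute("SELECT body FROM articles").fetchall()
        conn.close()
        sentences = random.choice(rows)[0].split(". ")
        candidates = [s for s in sentences if len(s) > 60]
        return random.choice(candidates or sentences)

    def search_exact_phrase(phrase: str) -> list[str]:
        """Placeholder: call whatever search API you use and return result URLs."""
        raise NotImplementedError

    def check_for_copies() -> list[str]:
        phrase = random_snippet()
        hits = search_exact_phrase(f'"{phrase}"')
        return [url for url in hits if OWN_DOMAIN not in url]

    # Run check_for_copies() from cron (e.g. daily) and review anything it returns.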

You could also monitor the number of requests from a given IP and block it if it passes a threshold, although you may have to whitelist legitimate bots, and it would be no use against a botnet (but if you are up against a botnet, perhaps ripping is not your biggest problem).
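A sketch of that per-IP threshold, assuming Flask; the window, limit and whitelist check are made-up values, and the whitelist should really verify crawler identity rather than trust the User-Agent string.

    import time
    from collections import defaultdict, deque
    from flask import Flask, request, abort

    app = Flask(__name__)
    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_WINDOW = 120
    WHITELISTED_AGENTS = ("Googlebot", "Bingbot")   # verify these properly in production

    hits: dict[str, deque] = defaultdict(deque)

    @app.before_request
    def rate_limit():
        agent = request.headers.get("User-Agent", "")
        if any(bot in agent for bot in WHITELISTED_AGENTS):
            return                                   # let whitelisted crawlers through
        now = time.time()
        window = hits[request.remote_addr]
        window.append(now)
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()                         # drop hits outside the window
        if len(window) > MAX_REQUESTS_PER_WINDOW:
            abort(429)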

Score: 2

If you're making a public site, then it's very difficult. There are methods that involve server-side scripting to generate content, or the use of non-text (Flash, etc.) to minimize the likelihood of ripping.

But to be honest, if you consider your content to be so good, just password-protect it and remove it from the public arena.

My opinion is that the whole point of the web is to propagate useful content to as many people as possible.

Score: 1

If the content is public and freely available, even with page view throttling or whatever, there is nothing you can do. If you require registration and/or payment to access the data, you might restrict it a bit, and at least you can see who reads what and identify the users that seem to be scraping your entire database.
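A sketch of the "see who reads what" part, with the storage, threshold and user model all assumed: log page views per logged-in account and flag any account whose coverage of the catalogue looks like scraping rather than reading.

    from collections import defaultdict

    TOTAL_ARTICLES = 10_000
    SUSPICIOUS_COVERAGE = 0.30   # assumption: no human reads 30% of a large catalogue

    pages_seen: dict[str, set[str]] = defaultdict(set)

    def record_view(user_id: str, article_id: str) -> None:
        pages_seen[user_id].add(article_id)

    def suspected_scrapers() -> list[str]:
        # Accounts that have viewed an implausibly large share of all articles.
        return [uid for uid, seen in pages_seen.items()
                if len(seen) / TOTAL_ARTICLES > SUSPICIOUS_COVERAGE]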

However, I think you should rather face the fact that this is how the net works; there are not many ways to prevent a machine from reading what a human can. Outputting all your content as images would of course discourage most, but then the site is not accessible anymore, let alone the fact that even non-disabled users will not be able to copy-paste anything - which can be really annoying.

All in all, this sounds like DRM/game protection systems - pissing the hell off your legit users only to prevent some bad behavior that you can't really prevent anyway.

Score: 0

You could try using Flash / Silverlight / Java to display all your page contents. That would probably stop most crawlers in their tracks.

Score: 0

I used to have a system that would block or allow based on the User-Agent header. It relies on the crawler setting its User-Agent, but it seems most of them do.

It won't work if they use a fake header to emulate a popular browser, of course.
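A sketch of that User-Agent filter, assuming Flask; the blocked fragments are illustrative, and as noted above a scraper that spoofs a browser header sails straight through.

    from flask import Flask, request, abort

    app = Flask(__name__)
    BLOCKED_AGENT_FRAGMENTS = ("curl", "wget", "python-requests", "scrapy")

    @app.before_request
    def filter_by_user_agent():
        agent = request.headers.get("User-Agent", "").lower()
        # Reject empty User-Agents and anything matching a known scraper fragment.
        if not agent or any(frag in agent for frag in BLOCKED_AGENT_FRAGMENTS):
            abort(403)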
