Let’s say Google crawls site A every hour and site B once a day. Site B writes an article; site A copies it, changing the time stamp. Site A gets crawled first by Googlebot. Whose content is original in Google’s eyes and will rank highly? And if it’s A, how does that do justice to site B?
I could get into a lot of really interesting stuff about how to crawl the web. If you really want to know about a signal, the Nyquist rate says you should sample at twice its highest frequency. But the fact is: you can always change a web page. So the whole conception of crawling the entire web and having a perfect copy at every instant is a little bit flawed, because at any time we can only go and fetch a certain finite number of pages. If we tried to fetch them all (and our architecture could almost support that), the web might crash from all of those requests. And we try to crawl in a relatively polite way.
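To make the Nyquist idea concrete, here is a minimal sketch of how a crawler might pick a recrawl interval, assuming a hypothetical scheduler (this is an illustration, not Google’s actual implementation): if a page changes roughly every N seconds, fetch it every N/2 seconds, i.e. sample at twice the change frequency, subject to a politeness floor so we never hammer a site.

```python
# Hypothetical Nyquist-inspired crawl scheduling (not Google's real system):
# recrawl a page at twice its observed change frequency, but never more
# often than a politeness floor allows.

def crawl_interval(change_interval_s: float, politeness_floor_s: float = 60.0) -> float:
    """Return seconds to wait between fetches of a page that changes
    roughly every `change_interval_s` seconds."""
    nyquist_interval = change_interval_s / 2.0
    return max(nyquist_interval, politeness_floor_s)

# A page that changes hourly would be fetched every 30 minutes:
print(crawl_interval(3600.0))  # 1800.0
# A page that changes every minute is capped by politeness:
print(crawl_interval(60.0))    # 60.0
```

Even this toy version shows why a perfect copy is impossible: the politeness floor means fast-changing pages are always sampled below their Nyquist rate.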
The question is essentially: if A is getting crawled a lot but the original article starts on B, what if A rips off B?
Well, there are ways that help to guard against that. For example, if you do a Tweet, people will see it, people may link to it, and we may follow those links faster than we’ll discover it on the other site.
Another thing that you can do is hook up things like PubSubHubbub, which will ping various places. We use PubSubHubbub only to a very limited extent to help improve our crawl, and that might change over time. But it’s a great way to sort of asynchronously say: “hey, there’s a new article or there’s a new blog post”.
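For the curious, a PubSubHubbub publish ping is just a small form-encoded POST from the publisher to its hub, per the PubSubHubbub (now WebSub) spec; subscribers that listen to the hub learn about new content right away instead of waiting for the next poll. The hub and feed URLs below are placeholders.

```python
# Minimal sketch of a PubSubHubbub "publish" ping: the publisher POSTs
# hub.mode=publish plus the updated feed URL to its hub. URLs are examples.

from urllib.parse import urlencode
from urllib.request import Request

def build_publish_ping(hub_url: str, topic_url: str) -> Request:
    """Build the POST request that notifies the hub of new content."""
    body = urlencode({"hub.mode": "publish", "hub.url": topic_url})
    return Request(
        hub_url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

req = build_publish_ping("https://hub.example.com/",
                         "https://example.com/feed.xml")
# urllib.request.urlopen(req) would actually deliver the ping; omitted here.
```

Many blogging platforms send this ping automatically on every new post, which is exactly the asynchronous “there’s a new article” signal described above.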
Let’s go ahead and play with this hypothetical scenario. If A has copied your article and changed the time stamp, that’s a little bit deceptive; it’s as if they’re claiming that they have written it. So you can do a couple of things. Number one, if you are the author of that article, you can always send what’s known as a Digital Millennium Copyright Act notice, where you file a DMCA request; you can find the information at google.com/DMCA.html. Basically what you’re saying is: “this site copied me, but I’m the original author”. The other site can then counter-notify, which means they dispute that. They say: “I wrote this page”, which carries penalties if they’re lying. Or they can choose not to dispute it, and the content disappears from the other site. So if someone’s ripping you off, you can always file a DMCA notice.
Also, if it’s an auto-generated site and they’re ripping off or scraping a bunch of people, you can do a spam report, because that’s not a high-quality site; that’s not the sort of thing that we want to have within our index.
Let’s just play it all the way out to the corner case. It is, in theory, possible that we will find an article on one site before we find it on the other site. It is definitely the case that we try hard to find out who is the original creator of a particular piece of content, but I wouldn’t claim that we’re perfect. We do everything I can think of to give people ways to indicate that they wrote the content. And in fact, in Google News, we just introduced a couple of new tags, almost as an experiment to see how well it works, to sort of say, “here’s the original author of this content”. So there are approaches that we’re exploring to figure out if there are other ways to do that.
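The “couple of new tags” are presumably the `syndication-source` and `original-source` metatags that Google News announced around the time of this talk; a small sketch of the markup a republishing site could emit (the tag names are from that announcement, the URLs are examples):

```python
# Sketch of the Google News source-attribution metatags: a page can point
# at the original story it was syndicated from. URLs below are examples.

from typing import Optional

def source_meta_tags(original_url: str, syndicated_from: Optional[str] = None) -> str:
    """Render the original-source (and, if syndicated, syndication-source)
    meta tags for a page's <head>."""
    tags = [f'<meta name="original-source" content="{original_url}">']
    if syndicated_from:
        tags.append(f'<meta name="syndication-source" content="{syndicated_from}">')
    return "\n".join(tags)

print(source_meta_tags("https://site-b.example.com/article",
                       "https://site-a.example.com/article"))
```

In the scenario from the question, site A (the copier) could credit site B this way; of course, a scraper that is willing to fake a time stamp is unlikely to add honest attribution tags, which is why the DMCA and spam-report routes above still matter.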
by Matt Cutts - Google's Head of Search Quality Team