This is a really tough task, even in theory, because the Internet is volatile and can change at any second, but there are a few signals and hints that search engines can pick up. For example: the time (or place) at which they first crawled the content, pings from content management systems, PageRank (highly reputable sites usually produce valuable content of their own), rel canonical, rel author, and site-level signals (where a scraper-esque-looking site may be flagged as such and then not given credit, even for content that is genuinely its own).
How does Google determine the canonical source for a piece of content?
The answer to this question has changed over time as we write new algorithms and find different ways of nailing down where content was originally written and where it first appeared (that is, the first time or the first place on the web that we saw it).
- If you happen to write something and publish it, and we crawl it and we see all that content, and then it shows up two years later somewhere else, well, it’s more likely that the source is where we first saw it a couple years earlier.
- You can also do things like pinging. If you have a blog or a content management system (WordPress, Blogger, and a lot of these sites), whenever you actually update or publish a blog post, you can send a ping to various blog search and real-time search engines, and to Google. That can help narrow down the time at which some content was posted as well.
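The ping mechanism mentioned above is classically the XML-RPC `weblogUpdates.ping` method, which blogging platforms call on update services when a post goes live. As a rough sketch (the blog name and URL are placeholders), the payload such a ping carries can be serialized with Python's standard library:

```python
import xmlrpc.client

# weblogUpdates.ping is the classic XML-RPC method that blog platforms
# (WordPress, Blogger, etc.) call on ping/update services when a post is
# published. The blog name and URL below are placeholder values.
payload = xmlrpc.client.dumps(
    ("Example Blog", "https://example.com/"),
    methodname="weblogUpdates.ping",
)
print(payload)
```

The serialized XML is what gets POSTed to a ping endpoint; the interesting part for canonicalization is simply that the search engine receives a timestamped notification close to the moment of publication.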
- Prioritization by PageRank – if you see the exact same content, and one really, really reputable site has it, and another site looks brand new, fly-by-night, never seen before, a little suspicious, not obviously of the highest quality, then you can imagine thinking "well, the content is more likely to have originated on the more reputable site".
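The first two bullets combine naturally into a tie-breaking heuristic: prefer the copy crawled first, and fall back on site reputation when crawl times are close or equal. This is a purely hypothetical sketch, not Google's actual logic; every field name and score here is invented for illustration.

```python
from datetime import datetime

def pick_likely_original(copies):
    """Hypothetical heuristic: among duplicate copies of the same content,
    prefer the earliest-crawled copy, breaking ties by a higher reputation
    score (standing in for a PageRank-like signal)."""
    return min(copies, key=lambda c: (c["first_crawled"], -c["reputation"]))

copies = [
    {"url": "https://scraper.example/post",
     "first_crawled": datetime(2012, 3, 1), "reputation": 0.1},
    {"url": "https://original.example/post",
     "first_crawled": datetime(2010, 5, 4), "reputation": 0.9},
]
print(pick_likely_original(copies)["url"])  # the copy seen years earlier wins
```

The ordering of the key tuple encodes the priority: crawl time dominates, and reputation only matters when two copies were first seen at the same moment.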
- Of course, there’s rel canonical – which is a very explicit signal that says “this is the preferred location for this content”.
- You can imagine slightly more indirect methods, like rel=author, which is a way you can annotate the web and say "hey, I actually wrote this" or "here's where my author profile page is". That gives a little bit of a hint about where content came from, and about its prominence.
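Both rel=canonical and rel=author are just `<link>` elements in a page's `<head>`, so a crawler can read them with any HTML parser. Here is a minimal sketch using Python's standard-library `html.parser`; the sample page and its URLs are made up for illustration:

```python
from html.parser import HTMLParser

class RelLinkParser(HTMLParser):
    """Collects rel -> href mappings from <link> tags."""
    def __init__(self):
        super().__init__()
        self.rels = {}

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            d = dict(attrs)
            rel, href = d.get("rel"), d.get("href")
            if rel and href:
                self.rels[rel] = href

# Made-up sample document showing both explicit hints.
html_doc = """
<html><head>
<link rel="canonical" href="https://example.com/original-post">
<link rel="author" href="https://example.com/about-me">
</head><body>...</body></html>
"""
parser = RelLinkParser()
parser.feed(html_doc)
print(parser.rels["canonical"])  # https://example.com/original-post
print(parser.rels["author"])     # https://example.com/about-me
```

The point is that these are explicit, machine-readable declarations by the publisher, in contrast to the inferred signals (crawl time, reputation) described above.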
- There’s also the idea, in theory, of doing site-level signals. If we think one particular site is scraping a whole lot, and then we see content show up on that site and on another site, then maybe we’re a little less likely to think the scraper-esque-looking site is the original author, compared to some other site that has a good history of producing original content.
There are a lot of different factors you could think of. It can be tricky because Googlebot is effectively sampling the web: the web is infinite, and it can change from millisecond to millisecond.
It can be very tricky if you are crawling the web to find out exactly when and where content appeared first. We do try to do a good job of it. Sometimes we mess up, and we’re happy to get feedback about when we do, but it’s definitely the case that there are a lot of different hints and signals and potential ways that you can try to figure out what the canonical or original source for some particular content is.
by Matt Cutts - Google's Head of Search Quality Team