Uncrawled URLs in search results


Matt's answer:

CUTTS: Okay. I wanted to talk to you today about robots.txt. One complaint that we often hear is, “I blocked Google from crawling this page in robots.txt, and you clearly violated that robots.txt by crawling the page, because it’s showing up in Google’s search results.” It’s a very common complaint, so here’s how you can debug it. We’ve had the same robots.txt handling for years and years, and we haven’t found any bugs in it for several years, so most of the time what’s happening is this.

When someone says, “I blocked example.com/go in robots.txt,” it turns out that the snippet we return in the search results looks like this. You’ll notice that, unlike most search results, there’s no descriptive text here. The reason is that we didn’t actually crawl this page. We did abide by robots.txt: you told us this page was blocked, so we did not fetch it. Instead, this is an uncrawled URL, a URL reference. We saw a link to it, but we didn’t fetch the page itself. And because we didn’t fetch the page, you don’t see a description or any sort of snippet here.

So it’s kind of interesting, because people often ask, “Well, why do you show uncrawled URLs at all? What’s the possible use case for that?” Let me take you over here. At one point, the California Department of Motor Vehicles, which is www.dmv.ca.gov, had a robots.txt that blocked all search engines. These days pretty much every site is savvy enough not to do that, but at one point the New York Times, eBay, and a whole bunch of other sites used robots.txt that way. Now, if someone comes to Google and types in “California DMV,” there’s pretty much one answer, and this is what you want to be able to return. So even though they were using robots.txt to say, “You’re not allowed to crawl this page,” we still saw a lot of people linking to that page with the anchor text “California DMV.” So if someone does the query “California DMV,” it makes sense that this page is probably relevant to them, and we can return it even though we haven’t crawled it. That’s the policy reason why we sometimes show an uncrawled URL: even though we didn’t fetch the URL itself, we still know from the anchor text of all the people pointing to it that it’s probably going to be a useful result.

Now, the interesting thing is, suppose you have a site like Nissan. For a long time, Nissan, and also Metallica, used robots.txt to block all search engines from crawling their sites. This was years and years ago. Again, what we found is that we could go and find information in the Open Directory Project, where Nissan and metallica.com were both listed. So sometimes you’ll see a snippet that looks almost as if the page had been crawled, but that description doesn’t really come from crawling the page; it comes from something like the Open Directory Project. So we’re able to return something that can be very helpful to users without violating robots.txt, because we never crawled that page.

Now, if you truly don’t want a page to show up, one of the best things you can do is let us crawl it and then use a “noindex” meta tag at the top of the page. When we see a noindex tag, we’ll drop the page from our search results completely. Another option is the URL removal tool: if you block a site completely in robots.txt, you can then use the URL removal tool to remove the entire site from Google’s index, and it will never show up that way either.
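To make the distinction concrete, here is a minimal sketch of the two approaches Matt describes, reusing the hypothetical /go path from the example above. The robots.txt rule only stops Google from fetching the page, so an uncrawled URL reference can still appear in results; the noindex meta tag, placed in the page’s own HTML, requires the page to remain crawlable and tells Google to drop it from the results entirely.

    # robots.txt at the site root: blocks crawling of /go,
    # but the URL can still show up as an uncrawled reference
    User-agent: *
    Disallow: /go

    <!-- On the page itself (it must NOT be blocked in robots.txt,
         or Google never sees this tag) -->
    <html>
      <head>
        <meta name="robots" content="noindex">
      </head>
      ...
    </html>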
But it turns out that for users, being able to return these uncrawled URLs can be very useful. That’s the reason we do it, and most of the time, probably 90% of the time, when someone says, “You’re violating my robots.txt; you’ve clearly crawled these pages,” what’s really happening is that we’re returning one of these uncrawled URL references. It’s not that we’ve crawled those pages. So those are a couple of easy ways to keep your site or your pages from showing up: you can block us in robots.txt and use the URL removal tool, or on the individual pages you can use a “noindex” tag, and once we crawl a page and see the noindex tag, we’ll drop that page from our index completely.
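For reference, a whole-site block of the kind the DMV, Nissan, and Metallica once used, and the kind of block that needs to be in place before removing an entire site with the URL removal tool, looks roughly like this. The file below is illustrative, not any site’s actual robots.txt.

    # robots.txt at the site root: asks all crawlers not to fetch any page
    User-agent: *
    Disallow: /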


by Matt Cutts - Head of Google's Search Quality Team

 
