That’s a really fun question for a couple of reasons. So you can think about PDFs specifically. And there’s not that much to do in terms of optimization.
For one thing, I’d make sure that it’s actually text, because you can have a PDF that’s primarily composed of images. And we might be able to OCR over time. But really, if you have text in that document, it’s a lot easier for us to index.
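One rough way to check the "is it actually text" point on your own files is to measure how many pages yield any extractable text. The sketch below is only an illustration: the function names, the 20-character threshold, and the use of the third-party pypdf library are assumptions, not anything Google has described, and a low score only suggests that a PDF is mostly scanned images.

```python
from typing import Iterable, List


def text_coverage(page_texts: Iterable[str], min_chars: int = 20) -> float:
    """Return the fraction of pages with at least min_chars non-whitespace characters.

    min_chars is an arbitrary illustrative threshold, not a known cutoff.
    """
    pages = list(page_texts)
    if not pages:
        return 0.0
    # Strip all whitespace before counting so blank-looking pages score zero.
    with_text = sum(1 for text in pages if len("".join(text.split())) >= min_chars)
    return with_text / len(pages)


def extract_page_texts(path: str) -> List[str]:
    """Extract the text of each page with pypdf (assumed to be installed)."""
    from pypdf import PdfReader  # lazy import: text_coverage works without pypdf

    reader = PdfReader(path)
    # extract_text() can come back empty for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


if __name__ == "__main__":
    # Hypothetical page texts: one real text page, two image-only pages.
    sample = ["This page has real, selectable text in it.", "", "   "]
    print(f"Pages with text: {text_coverage(sample):.0%}")
```

For a real file, you would pass `extract_page_texts("your-file.pdf")` into `text_coverage`; a result near zero is a hint that the document is mostly images.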
You want to make sure that you choose good titles. You probably don’t want to just have massive numbers of PDFs if it’s all shovelware, like you’re just auto-generating content and just throwing it up there. But there’s a more interesting question underneath this question to me, which is, how do you rank web pages versus PDF documents? PDF documents tend to be longer.
Maybe because it can be a jarring experience to click on a link, and you immediately get thrown into a PDF reader piece of software. And so you really do have apples and oranges. And Google’s philosophy is to try to determine, as best we can, what’s the utility of the next result? Is the user better served by returning a PDF? Or are they better served by returning a web document? And it’s a really hard problem.
Fundamentally, these are different data types. One might be a book in PDF, and one might be a 400-word web page. And trying to figure out what’s the relative utility of those is really, really difficult. Different people will disagree. Different search engines will have different philosophies. We essentially try to say, given what we know about the user, given everything else, given all the relevant signals that we have, try to make our best guess about, OK, the next most useful thing will be a PDF versus a web page. It’s an imperfect science.
Different people will have different philosophies. Some people don’t like to get PDFs. And then some PDFs have nothing more than a few matches, and it’s a book-length thing. And so having a few matches in a PDF might not be as helpful as having the same number of matches in a web document. It’s fundamentally a hard problem. But we do our best to try to say, given these different types of media, given these different types of documents, what’s the best match for the user? What’s going to give them the best value and help them out the most in terms of their information need?
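To put rough numbers on the point about a few matches in a book-length PDF, here is a small arithmetic sketch. It is not how Google ranks anything: the function name, the count of five matches, and the 100,000-word book length are made up for illustration, and only the 400-word web page comes from the example above.

```python
def matches_per_thousand_words(match_count: int, word_count: int) -> float:
    """Return query matches per 1,000 words of document text."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 1000.0 * match_count / word_count


if __name__ == "__main__":
    # Same number of matches, very different document lengths (hypothetical values).
    web_page = matches_per_thousand_words(match_count=5, word_count=400)
    book_pdf = matches_per_thousand_words(match_count=5, word_count=100_000)
    print(f"400-word web page:    {web_page:.2f} matches per 1,000 words")
    print(f"100,000-word PDF:     {book_pdf:.2f} matches per 1,000 words")
```

With the same five matches, the short page is far denser in matches than the book, which is one simple way to see why identical match counts can mean different things for the two document types.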
by Matt Cutts - Google's Head of Search Quality Team