Isn't this just the cost of having a "free" product? Bots are not really a problem. It's just that their traffic cannot be monetized. If you could monetize bot traffic, your problem would be solved. Or put another way, if you framed the issue as a business model one, not a technical one, it might be a useful exercise.
> if you framed the issue as a business model one, not
> a technical one, it might be a useful exercise.
That was kind of my point. Clearly most of the bots are trying to scrape my search engine for some specific data. I would (generally) be happy to just sell them that data rather than have them waste time trying to scrape us (that is the business model, which goes something like "Hey, we have a copy of a big chunk of the web on our servers, what do you want to know?"), but none of the bot writers seem willing to go there. They don't even send an email to ask us "Hey, could we get a list of every site you've crawled that uses the following WordPress theme?" No, instead they send query after query for "/theme/xxx" p=1, p=2, ... p=300.
On a good day I just ban their IP for a while; when I'm feeling annoyed I send them back bogus results. But the weird thing is you can't even start a conversation with these folks. I suppose that would be like looters saying "Well, OK, how about you help load this on a truck for me for 10 cents on the dollar and then your store won't be damaged," or something.
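For anyone curious what "ban their IP for a while" can look like, here is a minimal sketch, not what we actually run. The class name `TempBanList` and its thresholds (`max_requests`, `window_seconds`, `ban_seconds`) are made up for illustration. It counts requests per IP in a sliding window and refuses that IP for a fixed period once it goes over the limit.

```python
import time
from collections import defaultdict, deque


class TempBanList:
    """Hypothetical per-IP sliding-window limiter with temporary bans."""

    def __init__(self, max_requests=30, window_seconds=60.0, ban_seconds=600.0):
        # All three thresholds are illustrative assumptions, not tuned values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.ban_seconds = ban_seconds
        self._hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self._banned_until = {}           # ip -> time at which the ban expires

    def allow(self, ip, now=None):
        """Return True if the request should be served, False if the IP is banned."""
        now = time.monotonic() if now is None else now

        # Refuse while an existing ban is active; forget it once it has expired.
        until = self._banned_until.get(ip)
        if until is not None:
            if now < until:
                return False
            del self._banned_until[ip]

        # Record this request and drop timestamps that fell out of the window.
        hits = self._hits[ip]
        hits.append(now)
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()

        # Too many requests inside the window: start a temporary ban.
        if len(hits) > self.max_requests:
            self._banned_until[ip] = now + self.ban_seconds
            hits.clear()
            return False
        return True
```

A scraper walking p=1, p=2, ... p=300 trips the limit within the first window and then gets refused until the ban expires, while someone running a handful of queries never notices. Passing `now` explicitly is only there to make the behavior easy to check without waiting.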
Google posts lots of contact information on their contact page. You would probably want to reach business development. I don't think they are willing to sell access to that index, however; we (at Blekko) would. I suppose you could also try to pull it out of Common Crawl.
It need not be a commercial service. For example, Wikipedia is a donation-only service. A bot visit is generally no different from a typical user visit (I'd assume most users don't donate anyway). Wikipedia doesn't really mind serving users who aren't donating, but bots, while generally no different from normal users, are stealing resources away from actual users.