The PDF is a little short on details. It sounds like webmasters would all have to cooperate with allowing crawls from an "OWI" bot.

One of the challenges of creating a "web index" is first creating indexes of each website. "Crawling" to discover every page of a website, as well as all links to external sites, is labour-intensive and relatively inefficient. Part of that is because there is no 100% reliable way to know, before we begin accessing a website, each and every URL for each and every page of the site. There are inconsistent efforts such as "site index" pages or the "sitemap" protocol (introduced by Google), but we cannot rely on all websites to create a comprehensive list of pages and to share it.
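To illustrate the inconsistency: about the most a crawler can do automatically is ask whether a site advertises a sitemap, e.g. via the optional "Sitemap:" lines in robots.txt. A minimal sketch in Python (example.com is a placeholder; many sites will simply return nothing):

    # Discover any sitemaps a site chooses to advertise via robots.txt.
    # The "Sitemap:" directive is optional, which is why this kind of
    # discovery is unreliable across the web at large.
    import urllib.robotparser

    def discover_sitemaps(site):
        # site: scheme + host, e.g. "https://example.com"
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(site.rstrip("/") + "/robots.txt")
        rp.read()
        return rp.site_maps() or []  # None when no Sitemap: lines exist

    print(discover_sitemaps("https://example.com"))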

However, I believe there is a way to generate such a list from something that almost all websites do create: logs.

When Google crawls a website, it is often, perhaps always, the case that the site logs every HTTP request that googlebot makes.

If a website were to share publicly, in some standardised format, the portion of their log where googlebot has most recently crawled the site, we might see a URL for each and every page of the site that Google has requested.

If this procedure of sharing listings of googlebot HTTP requests were automated, the public could generate a "site index" directly from the source, using the googlebot requests recorded in the logs.
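A minimal sketch of what that might look like, assuming logs in the common "combined" format and a flat, deduplicated URL list as the published output (the log format and the Googlebot user-agent match are assumptions for illustration, not an existing standard):

    # Derive a "site index" from an ordinary access log: keep only
    # requests whose User-Agent claims to be Googlebot and collect the
    # distinct paths it fetched.
    # e.g. 66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /about HTTP/1.1"
    #      200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; ...)"
    import re
    import sys

    LOG_LINE = re.compile(
        r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]+" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"')

    def googlebot_paths(log_lines):
        paths = set()
        for line in log_lines:
            m = LOG_LINE.search(line)
            if m and "Googlebot" in m.group("agent"):
                paths.add(m.group("path"))
        return sorted(paths)

    if __name__ == "__main__":
        for path in googlebot_paths(sys.stdin):
            print(path)

Feed it the raw access log on stdin and publish the resulting list at some agreed-upon, well-known URL.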

Allowing crawls from a "new" bot would not be necessary.

Webmasters know what URLs they offer to Google. Google knows as well. The public, however, does not.

It is a public web. Absent mistakes by webmasters, any pages that Google is allowed to crawl are intended to be public.

Why should the public not have access to a list of all the pages of websites that Google crawls?

I don't know, but there must be reasons I have failed to consider.

What are the reasons the public should not know what pages are publicly available via the web, except as made visible (or invisible) through a middleman like Google?

There are none.

Being able to see logs of all the googlebot requests would be one way to see what Google has in their index without actually accessing Google.



Isn't the act of sharing these logs vulnerable to the same problem as sitemaps?

Not everyone will do it, and those that do may not do it to 100% completeness: people may not keep their HTTP logs in good order, for example.


"Not everyone will do it..."

Not everyone will provide CCBot with the same access that they provide to Googlebot. The question is: how many will?

It is a catch-all objection to anything on the web: "Not everyone will do it." I am not sure anyone aims for 100% participation where the web is concerned.

There is always some variation in participation in anything across the entire www.



