There is an easy way to increase your detection of redirects and parked pages: make two requests, one to the real URL and one to a URL which is intentionally broken. (example.com/i-am-a-link and example.com/fklsdfasdifo for example) Run a heuristic for difference on the resulting content. This won't catch all of them, particularly if you use a really naive heuristic that can't deal with e.g. ads changing, but it's a heck of a lot quicker than comparing manually.
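A minimal sketch of that two-request trick in Python (the uuid-generated bogus path and the 0.9 similarity cutoff are my own assumptions, not something the parent specified):

    # Fetch the real URL and a deliberately bogus sibling, then compare the bodies.
    # Near-identical responses suggest a parked page or catch-all redirect.
    import difflib
    import urllib.request
    import uuid

    def looks_like_soft_404(url, timeout=10):
        bogus = url.rstrip("/") + "/" + uuid.uuid4().hex  # stand-in for /fklsdfasdifo
        try:
            real_body = urllib.request.urlopen(url, timeout=timeout).read()
            bogus_body = urllib.request.urlopen(bogus, timeout=timeout).read()
        except Exception:
            # Hard failures (404s, DNS errors, timeouts) are ordinary broken links,
            # not the soft failures this heuristic is after.
            return False
        ratio = difflib.SequenceMatcher(
            None,
            real_body.decode("utf-8", "ignore"),
            bogus_body.decode("utf-8", "ignore")).ratio()
        return ratio > 0.9  # naive threshold; ads and timestamps will add noise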
If you see a lot of them, go to Webmaster Tools; if you see them there too, it's not some kind of test but some other reason, mostly their shitty JS parsing, which treats anything with a / as a relative URL...
I had a project/startup working on dealing with link rot for a while. It would not just tell you a link was 404ing, but recognize when the content of the page had changed significantly and let you know. The fun part was to automatically recommend a good replacement page from within the site, nearby, or the internet archive.
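For the Internet Archive part, a recommender can lean on the Wayback Machine's availability endpoint (https://archive.org/wayback/available). A rough Python sketch of that step, not the project's actual code:

    # Ask the Wayback Machine for the closest archived copy of a dead link.
    import json
    import urllib.parse
    import urllib.request

    def wayback_replacement(dead_url, timestamp="2010"):
        query = urllib.parse.urlencode({"url": dead_url, "timestamp": timestamp})
        with urllib.request.urlopen(
                "https://archive.org/wayback/available?" + query, timeout=10) as resp:
            data = json.load(resp)
        closest = data.get("archived_snapshots", {}).get("closest")
        # Returns e.g. http://web.archive.org/web/<timestamp>/<dead_url>, or None.
        return closest["url"] if closest and closest.get("available") else None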
Based on a quick test, it seemed it would take a site owner about 5-10 minutes per link to find a good replacement once they knew it was broken. That's fine for a personal portfolio with 50 links, but for a site like Boing Boing, getting all the broken links working again looked like full-time work for a year.
I'm curious if you have many outbound links in your 'scalable content'? Do you spend much time maintaining them?
I just want to point out that this won't really work if the site tries to redirect you to a search engine. Naturally broken links will come up with all kinds of search results, but deliberately broken links will come up with "None".
"You can expect to lose about a quarter of them every seven years" can be translated in one of two ways:
1. Every 7 years, 25% of the original links are lost: in 14 years half are working, and in 28 years you have approximately 0% of your links working. This is a steady, linear rate which does not have a half life.
2. Every 7 years, 25% of the remaining links are lost: in 7 years 75% are working, in 14 years 0.75^2 = 56% of the links are still working, and in 28 years 0.75^4 = 31.6% are working. This is exponential decay, which has a half-life and contradicts the stated linear rate (see the quick comparison below).
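A quick comparison of the two readings, using only the 25%-per-7-years figure from the quote:

    # Reading 1: lose 25% of the *original* links per 7-year period (linear).
    # Reading 2: lose 25% of the *remaining* links per 7-year period (exponential).
    for years in (7, 14, 28):
        periods = years / 7
        linear = max(0.0, 1 - 0.25 * periods)
        exponential = 0.75 ** periods
        print(f"{years:>2} yrs: linear {linear:.0%} alive, exponential {exponential:.0%} alive")
    # 7 yrs: 75% either way; 14 yrs: 50% vs 56%; 28 yrs: 0% vs 32%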
From looking at the data, and from the explicitness with which he states that the decay is linear, I'd conclude that the former is true. It would be hard to make a case for the latter.
Given that the data for the first few years come from a small number of pages, the graph doesn't allow one to reliably distinguish between the two interpretations, and his (and your) interpretation is not really justified. If you look just at the complete, reliable years (links from 2000 to 2010), the graph looks convex, just as you'd expect for exponential decay but not for linear decay. So while the first interpretation is probably the one the author intended, I'd say he is overstating his case and is most probably wrong.
Yes, indeed. That is a nice graph, and in the absence of a mechanism that would explain why no link can last longer than 28 years, it should quite convincingly show that linear decay has not been established.
Half-life isn't a steady rate; it's gradual exponential decay. For example, if the element yahelium had a half-life of a year, it would lose half of what remains every year. So the degradation during the first year would be twice the degradation during the second year. That's not linear.
I discovered when I got home that the graph was blocked by my office's web filter. Definitely makes more sense with it there. Thanks for your understanding!
I think it's about time that some government or billionaire throws a few million at an internet archive project. The Internet Archive is nice, but more regular snapshots with wider coverage would be something I'm certain future historians would love to get their hands on (and they will hate us if we don't do it).
A big problem with the Internet Archive is how easy it is to opt out of it, either by robots.txt or explicit request. I understand why they do it, but it seriously damages their mission.
The fact that you can retroactively opt out of it also makes it particularly odd, compared to previous media. If you change your mind about publishing a book, you can't unpublish existing copies, and chances are that some library will still have a copy that they continue to make available to the public (a handful of people in history have succeeded in suppressing already-published books, but it's hard). But if you decide you want to unpublish a webpage, you can actually get it excised from all the archives available to the public as well, erasing it from the historical record, even if it had been available for years and read by thousands, because, unlike with books, typically none of those thousands of readers will have a copy that they can claim ownership to.
Even worse is that others can retroactively opt out on someone else's behalf. If someone owned a domain with no exclusion, their content seemingly disappeared in 2003 because they lost ownership of or stopped caring about the domain, and the new owner then put up a robots.txt, the original owner's content will no longer be publicly accessible.
At least that was the case a few years ago. I know the Wayback Machine has seen some changes in the last year.
> I think it's about time that some government or billionaire throws a few million at an internet archive project.
It may be that one or two governments have already done that. You are, of course, referring to a publicly accessible Internet archive.
As for what a benevolent millionaire (it wouldn't have to be a full billionaire for this to start up) could fund, pg has suggested, "There is room to do to Wikipedia what Wikipedia did to Britannica."
It's interesting that pg thought then that Wikipedia's problem is excessive deletionism, while I (after being a registered Wikipedian and working on various articles) think that Wikipedia's problem is lack of thorough research to prepare article content.
Whatever one's opinion of what's wrong with Wikipedia, the best way to prompt improvement in Wikipedia (or to replace it, if you prefer) is to build another site that does some of what Wikipedia does but does it better somehow. That's not easy, not easy at all, but it's not terribly expensive. I have looked at the Wikimedia Foundation financial reports, and building a strong competitor to Wikipedia is a project that is well within the grasp of several individual millionaires, and within the grasp of quite a few nonprofit charitable organizations. A business corporation that can find a way to monetize a Wikipedia competitor might have a great business opportunity.
> It's interesting that pg thought then that Wikipedia's problem is excessive deletionism, while I (after being a registered Wikipedian and working on various articles) think that Wikipedia's problem is lack of thorough research to prepare article content.
Yep, that's the usual distinction. Non-Wikipedians believe that Wikipedia should be a compendium of any information that could be useful, however unverifiable or incomplete. Wikipedians want there to be higher standards, but pg makes the common mistake of thinking this is because they are all OCD.[1] The Wikipedia system relies on group verifiability. Low quality info imposes long-term costs on the administrators. The article will be flagged more frequently than others. So pruning low quality info is a matter of administrator self-defence, even if you ignore the ideals of achieving a trustworthy encyclopedia.
A Wikipedia successor would have to abandon trustworthiness (or figure out some way to indicate that certain pages were untrustworthy). Or figure out how not to impose the costs of maintaining unverifiable information on administrators. One way might be to connect the info with the community that cares about it in a more direct and intimate way. Wikipedia fails REALLY badly at the latter, to the point where the wiki-insiders sometimes have more control than the audience for a topic.
> That's not easy, not easy at all, but it's not terribly expensive.
In terms of software and services, it would be no problem at all. But you are overlooking the cost of creating a new Wikipedia in a world where Wikipedia already exists.
Wikipedia content is also famously intractable to reuse in any system other than MediaWiki. We hope to begin alleviating that this year with the big parser redesign. A side effect should be to enable competitors to try different things with our content.
[1] They are, but this is not the primary reason. ;)
AT telling the world they will rip any content they want regardless of the owners' wishes, and then trolling about it, probably soured most of us on your 'team'.
Have you considered that if you're despised by many, it might be for a reason?
There's a big difference between the way the Archive Team operates and the way the Internet Archive operates. IA is professional and polite, AT is brutish.
Sadly, I think technological advances have only accelerated this phenomenon. We've gone from an era of static pages, whose overall layout took considerable effort to change, to CMSes that we can twiddle and upgrade with nary a concern for backward link compatibility.
Personally I think it should be a principle of every professional web developer that you just don't break links, period.
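In practice that mostly means keeping old URLs alive with permanent redirects whenever a site is restructured. A minimal sketch with Flask; the /archive/<slug> to /posts/<slug> mapping here is hypothetical:

    from flask import Flask, redirect, url_for

    app = Flask(__name__)

    @app.route("/posts/<slug>")
    def post(slug):
        return f"New home of {slug}"

    @app.route("/archive/<slug>")  # the pre-redesign URL scheme
    def old_post(slug):
        # 301 tells crawlers and bookmarks the move is permanent.
        return redirect(url_for("post", slug=slug), code=301)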
Users may prune their own bookmarks when they discover the links are broken – especially considering some of the pre-Pinboard systems (like in-browser bookmarking) from which the earliest data in this analysis comes. So I suspect this underestimates link rot.