ArchiveTeam Warrior backing up Reddit (archiveteam.org)
102 points by timdaub on Dec 16, 2021 | 71 comments


Is it possible to write a program that connects to an HTTPS server and archives content while keeping track of the session keys and the encrypted data coming from the server, recording all of that session traffic in a file? Replaying the file would then allow anyone to observe that the data truly did come from that specific server, because it's signed with the cert for that server.

In other words, is it possible to get any HTTPS website to give you what is essentially a digitally signed copy of content you want to prove originated with that site? And is it true that this digital signature can easily be verified as belonging to the original website?


Unfortunately it's not possible, because TLS negotiates a symmetric key which is then used to encrypt and authenticate the rest of the session. If you post the transcript of a TLS session in an attempt to "prove" that you retrieved a specific document, a third party can verify that you did in fact negotiate a symmetric key with the correct server; but since it's a symmetric key, anyone with knowledge of the key can arbitrarily modify the transcript of the session [well, the part of the session where the HTTP request and response happen]. This obviously includes the original prover, and so a TLS transcript proves nothing at all.
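To make the symmetry concrete, here's a minimal Python sketch (using the cryptography package; the key, nonces, and payloads are made up for illustration) of why holding the session's AEAD key lets either party mint a perfectly valid record:

    # Both endpoints of a TLS session share the same AEAD key,
    # so either one can "authenticate" arbitrary data under it.
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=128)  # in TLS, both sides derive this
    aead = AESGCM(key)

    genuine = aead.encrypt(os.urandom(12), b"HTTP/1.1 200 OK\r\n\r\nreal page", b"")

    # The "prover" can just as easily mint a record the server never sent:
    forged = aead.encrypt(os.urandom(12), b"HTTP/1.1 200 OK\r\n\r\nfake page", b"")

    # Both records decrypt and authenticate correctly under the session
    # key, so a transcript containing either proves nothing about origin.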


TLSNotary tries to solve this problem: https://tlsnotary.org/

It was posted here a while ago: https://news.ycombinator.com/item?id=29090604


The problem is that, by the time you try to verify the data down the line, the original certificate (up to and including the root cert, or even the authority itself) will have expired, so it won't be possible to trust it.


You can still verify a signature from an expired certificate. With certificate transparency, you can even verify the issuance of that certificate.
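Nothing in the signature math consults the validity window; expiry is a policy check applied during path validation, which tools let you pin to a past time (e.g. openssl verify's -attime option). A sketch with Python's cryptography package, assuming an RSA cert and PKCS#1 v1.5 signatures:

    # Verifying a signature with an expired certificate's public key.
    # Expiry never enters into it: verification is pure math.
    from cryptography import x509
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding

    def verify_with_expired_cert(cert_pem: bytes, data: bytes, sig: bytes) -> None:
        cert = x509.load_pem_x509_certificate(cert_pem)
        # cert.not_valid_after may be years in the past; .verify() below
        # raises InvalidSignature on mismatch and ignores the dates.
        cert.public_key().verify(sig, data, padding.PKCS1v15(), hashes.SHA256())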


Wouldn't you want to validate it for the time at which it was served? In which case you'd just need an archive of the CA roots from that time.


And how do you verify that the CA roots are authentic?


The Hg history from Mozilla-Central?


Sure, but that's still a third-party dependency. And then you have to verify that the Firefox source is authentic. My point is there's no way to have a fully self-verifying archive.


Not sure if this is entirely what you're after, but check out https://github.com/WICG/webpackage



I think this requires more explanation. Is this some kind of cloud archiving? Are they using IPFS or something like that?


Volunteers run Docker containers or VirtualBox VMs at home so that the traffic looks residential and does not get banned. For example: https://imgur.com/a/QXrhudA

Most useful content gets packaged by ArchiveTeam and sent to the Internet Archive (there is no affiliation between the two).
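Spinning one up is basically a one-liner; here's a sketch using the Docker SDK for Python (the image name below is from memory, so confirm it against the ArchiveTeam wiki before relying on it):

    # Start an ArchiveTeam Warrior container; manage it via the web UI.
    import docker

    client = docker.from_env()
    client.containers.run(
        "atdr.meo.ws/archiveteam/warrior-dockerfile",  # check wiki for current image
        name="archiveteam-warrior",
        detach=True,
        ports={"8001/tcp": 8001},               # web UI at http://localhost:8001
        restart_policy={"Name": "unless-stopped"},
    )

You then pick a project (or "ArchiveTeam's choice") in the web UI and the container does the rest.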


But it's not really necessary for Reddit: their API is fairly robust and there are numerous options for scraping the site.
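For instance, appending .json to nearly any Reddit URL returns the page's data as JSON, no API key required (subject to rate limits). A minimal sketch:

    # Fetch the newest posts in a subreddit via Reddit's public JSON
    # endpoints: just append .json to the listing URL.
    import requests

    resp = requests.get(
        "https://www.reddit.com/r/DataHoarder/new.json",
        params={"limit": 100},
        headers={"User-Agent": "archival-example/0.1"},  # default UAs get throttled
    )
    resp.raise_for_status()
    for post in resp.json()["data"]["children"]:
        print(post["data"]["created_utc"], post["data"]["permalink"])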


Reddit will IPO soon, which is why this is frontpaged now, I presume. Folks expect the API to soon become much more restrictive. Other mirror projects don't have the ideology or reputation of ArchiveTeam.


AT is not limited to Reddit scraping; take a look at their wiki.


I'm aware of AT and the work they do. It's a great initiative.

I'm just saying that other projects like https://pushshift.io/ have been capturing all reddit posts and comments for years now.


I'm also not totally familiar with what's going on, but I discovered it on /r/datahoarder and I think it's because redditors are scared that content will start vanishing now that Reddit has filed for an IPO.

Anyways, here's a further description: http://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior


Of course it will. Reddit is probably one of the largest porn sites.


ArchiveTeam usually sends all the data they collect into Internet Archive. The two are independent, unrelated organizations, but IA is pretty open about accepting hoards of this type.


A 2018-era article suggests the Internet Archive held 46PB of content. It’s probably much more now.

This warrior has extracted ~880TB so far. I wouldn’t be surprised if the result occupies a material proportion of the IA’s capacity, at significant cost.

Still, better than letting it all get burnt by the shareholders a few years down the line.


>Still, better than letting it all get burnt by the shareholders

You wish. By the looks of it, it will be burned down by yet another "redesign" when they inevitably shut down the sane UI of https://old.reddit.com because it's not pushing ads and "social" "features" strongly enough.

(If you don't use https://old.reddit.com/ instead of the "new", aka default, reddit, treat yourself to some sanity. Imagine if Reddit was more like HackerNews. Wait, you don't need to imagine, that's what the link goes to.)

I really like how they are pushing so hard for video, while struggling to display more than a thousand text comments or even links.

Feels like Reddit is digging its own grave with these moves, and the shareholders' money with it.

I really hope to revisit this comment in 10 years and say how utterly stupid and wrong I was, because I've gotten an incredible amount out of the communities on Reddit, and poured a lot into them too. Particularly, support groups.


Their video player is also garbage, especially on mobile. I'll have to open and close a video multiple times to get it to play, the quality will take a nosedive midway through and just stay that way for the rest of the video, and videos take forever to load.


> I really like how they are pushing so hard for video, while struggling to display more than a thousand text comments or even links.

I don't know that this is the case of course, but can I blame Facebook? They "pivoted to video" while lying about their metrics and it wouldn't surprise me if the rest of Silicon Valley is like "Hey, they're making lots of money, why don't we do what they're doing?"


I really don't think you can blame Facebook for the idiotic decisions of others who are cargo-culting its strategies.

It's not like Facebook held a gun to Reddit and forced them to have videos.

The answer to the "why" question should be obvious, and it's the same in every case: "We shouldn't do what FAANG is doing, because we are not FAANG."

On that point: just because Floyd Mayweather makes a ton of money in the ring doesn't mean you should step into one, even though you also have fists and can throw a punch.

Same with the framework/platform hype. If you don't have Google's scale, you don't have the problems Google has devised solutions for. A megatanker perhaps isn't the best boat for a weekend fishing getaway.

But I digress.


They are going to get popped by journos.

Twitter got popped left and right by journos, despite the platform being a huge source of revenue for the media, and despite the owner's politics.

The week they go public, the stories will start getting published: about the content on there, the weak moderation (community moderation...lol), journos emailing companies to ask whether they like their ads appearing alongside X or Y figure, etc.

The line that Huffman has taken before on this stuff (basically live and let live) works if you are private. It will go down extremely badly after the IPO.

Sorry. Reddit is already dead.

(Btw, I have no idea why this is...it makes no logical sense because being a public company changes nothing. But the media seems to understand that they can print something, that thing can potentially move the stock, and then they can bounce management into doing something).


To be fair to reddit, their combo of admin + community moderation has improved a lot in the past 5 years or so.

Remember /r/The_Donald? Or /r/jailbait? /r/coontown?

I mean, yes, it's a very low bar, but they deserve some commendation for making an effort that, I feel, paid off.

As far as being popped by journos, I feel like reddit has been on the radar for quite a while, and fared well under fire. Take /r/HermanCainAward, which got some pretty negative (and, I feel, misguided) press. The sub is not only still there, it's (quite sadly) thriving. (Sadly because nobody should receive that "award", but given quite a surplus of qualified recipients, something like that sub serves a purpose. Everyone there will be happy the day the sub stops being active because of lack of submissions, but alas, that day is yet to come).

So after catching flak, /r/HermanCainAward told people to cross out faces and names if they are not a public figure, and that seems to work fine with everyone. The journos are not vultures: they picked on that sub, and the outcome was that it improved. When people are civil to each other, the journos don't have a scathing article to print.

Getting popped by journos can be a good thing. Maybe some of them will write a hit piece on how waiting 7 seconds for a text page to load is shameful in 2021.


ArchiveTeam != Internet Archive


Take a look at where ArchiveTeam stores its stuff.


I wonder how much Google stores. My dumb butt has close to 100TB of…uhh…Linux ISOs encrypted and hooked up to Plex for $20 a month. Seems like a loss leader for them. But I always think of the multiple spinning disks I'm taking up over there for content and redundancy, since it can't be deduped.


How do you have all that for $20 a month?


Google Drive is $20 a month for enterprise, and I'm an enterprise of one. Unlimited storage.


And with Plex? That’s great


Yep. Rclone supports encryption natively now, but I started this before that was a thing, so it's just an rclone mount with EncFS.
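For reference, the layering is: rclone mounts the remote as stored (ciphertext), then EncFS provides the decrypted view on top. A sketch of the two steps (paths and remote names are placeholders; driven from Python only to match the other examples in this thread):

    # Classic pre-rclone-crypt setup: raw remote mount + EncFS overlay.
    import subprocess

    # 1. Mount the Google Drive remote (ciphertext as stored) via rclone.
    subprocess.run(["rclone", "mount", "gdrive:media", "/mnt/gdrive-raw",
                    "--daemon"], check=True)

    # 2. Layer EncFS over it; this prompts for the EncFS password.
    subprocess.run(["encfs", "/mnt/gdrive-raw", "/mnt/media"], check=True)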


This is so very necessary. Reddit has banned great content from many subs whose ideology they didn’t agree with. Unfortunately it isn’t even possible to know the URLs of posts from banned subreddits to look them up in the first place.

Does anyone know if there are backups of banned subreddits already?


And with "permanently suspended accounts" nothing they wrote is visible.

And reddit is very liberal with their banhammers.


In some cases it is positively ridiculous. I got banned from /r/coronavirus for posting a scientific article suggesting that breakthrough cases in fully vaccinated people were a possibility (this was fairly early, March 2021 maybe). The mod denounced me as an antivaxxer, which I certainly am not. Lo and behold, breakthrough cases are a real thing.


> I got banned from /r/coronavirus for ...

Reddit doesn't moderate individual subreddits. Non-employee moderators do.

Most subreddits are just an extension of whatever their moderators want to see. They're a huge shaping influence on the content and character of Reddit, but their actions are mostly invisible.


Where does ArchiveTeam find all the reddit posts and comments to archive? Do they have a script automatically going through the "New" section or are they finding posts through Google or link crawling?


Besides their ArchiveTeam Warrior distributed crawler, I imagine PushShift[0] is probably a starting point for them.

[0] https://files.pushshift.io/
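Pushshift offers both the bulk monthly dumps at that URL and a search API. A sketch of the latter, with the endpoint and parameters as they worked circa 2021 (subject to change):

    # Pull recent comments from a subreddit via the Pushshift search API.
    import requests

    resp = requests.get(
        "https://api.pushshift.io/reddit/search/comment/",
        params={"subreddit": "DataHoarder", "size": 100, "sort": "desc"},
    )
    resp.raise_for_status()
    for comment in resp.json()["data"]:
        print(comment["created_utc"], comment["author"], comment["id"])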


In general, ArchiveTeam has scripts which hit random links to see if there is any content. They have coordination servers which share info on which slugs have been checked before to avoid duplicate effort.
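Conceptually the split looks something like the sketch below. Note this is a hypothetical illustration (the tracker URL and JSON shape are invented); the real protocol lives in ArchiveTeam's seesaw tooling and per-project grab scripts:

    # Hypothetical warrior loop: claim an unchecked slug from the
    # coordination server, probe it, report back so it's never re-issued.
    import requests

    TRACKER = "https://tracker.example.org"  # made-up coordination server

    def work_one_item() -> None:
        item = requests.get(TRACKER + "/claim?project=reddit").json()
        slug = item["slug"]

        resp = requests.get("https://www.reddit.com/" + slug + ".json",
                            headers={"User-Agent": "warrior-sketch/0.1"})

        # A 404 is still useful: it marks the slug as checked.
        requests.post(TRACKER + "/done",
                      json={"slug": slug, "found": resp.status_code == 200})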



There are several sites that let you view deleted reddit comments. I wonder if they have complete text backups, or if they get the comments from somewhere else.


The ones I know are powered by https://pushshift.io/


That's right. I'm the author of the one called Reveddit, and I maintain a FAQ about how it works if anyone's interested:

https://www.reveddit.com/about/faq/


I used Removeddit for years, then it suddenly stopped working due to some sort of Reddit API change. Is this true for yours too?


Reveddit works fine with the current Reddit and archive-service APIs.

Removeddit's domain lapsed. Prior to that it was working, except on Firefox with Tracking Protection (see https://www.reveddit.com/about/faq/#firefox ).


Many thanks for your efforts!!


What a waste of the Internet Archive's space. Just because something exists doesn't mean it should be backed up; I would be surprised if anyone actually needed something from Usenet, for example. Things like this are going to kill the Archive eventually.


> I would be surprised if anyone actually needed something from Usenet, for example

Totally disagree. Usenet is the only place recording the history of a huge number of influential projects from the '80s and '90s. That history deserves to be recorded.


Absolutely disagree.

Some of the most valuable and insightful anthropological artifacts are mere shop ledgers and discourse on bathroom walls -- and we've never been better equipped to document/store/search/access the entirety of the saved artifacts of the modern age. Presumably, our mastery of information technology as a domain will only improve and make things easier.


I would love to look up my old Usenet posts from the early '90s.


My point was that most people wouldn't get much value out of it aside from sentimental reasons, and you can back up your own Reddit posts if this matters to you, instead of backing up the entirety of it; that's the only part that should matter, and it would have a much smaller footprint. If Reddit goes down for whatever reason, you could try to merge your backup with others', too.


If you try telling a historian in a hundred years' time that there wasn't anything of value worth saving on a site where millions of regular people had conversations about pretty much everything of relevance to life in the early 21st century, they will robustly disagree with you.


The groups.google.com search is working again. Why not?


> Just because something exists doesn't mean it should be backed up.

/r/DataHoarder disagrees. It's pretty crazy what people will back up these days.


I'm thinking about becoming a data hoarder. One of my favorite hobbies is listening to music on YouTube and reading people's comments about how that song was important in some moment of their lives. Most of the videos are unofficial, since the official ones have comments turned off. From time to time, one of these music videos gets DMCA'd, and sadly the comments are all lost.


There is a limit to storage, especially when you don't own it, and the usefulness of the content collected should be considered when you're backing up what is primarily a news and meme aggregator. They aren't backing up the Library of Babel, so where does the line get drawn?


I think the line is drawn where nobody cares enough to archive it.

How interesting would it be to trace the evolution of e.g. /r/The_Donald from parody to party to sewer, without the retrospective edits and deletions?

Yeah, there's a limit to storage, but it's growing faster than you perhaps assume.


The 20 Newsgroups dataset is a collection of ~20k newsgroup documents and is super popular for experimentation in text applications of various machine learning techniques. Without Usenet (and archives of Usenet) that probably wouldn't exist.
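It's still one import away in scikit-learn:

    # The Usenet-derived 20 Newsgroups corpus remains a standard
    # text-classification benchmark, bundled with scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    train = fetch_20newsgroups(subset="train")
    print(len(train.data), "documents across", len(train.target_names), "groups")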


You never know what will be useful to the future. To give an example, the field of papyrology is largely built around trying to construct a view of the past using scraps of texts excavated from ancient dumps.


This is why individual people should never be left to make such decisions alone. They're likely to throw away things of enormous value for parochial reasons.


Archive Team is not the same as the Internet Archive.


Where do you think all of this data is being pushed to exactly?


To Archive Team contributors' hard drives. You should probably check out how Archive Team works; it's pretty cool.


The "Warrior" program only downloads the data temporarily, and then uploads it to a staging server run by AT members. The staging server packs the WARCs into MegaWARCs and sends them to live at the Internet Archive.
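For anyone unfamiliar with the format: a WARC is just a sequence of archived request/response records, and a MegaWARC is essentially many WARCs packed together. A sketch of writing one record with the warcio library (URL and payload are placeholders):

    # Write a single HTTP response into a gzipped WARC file with warcio.
    from io import BytesIO
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    with open("example.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        http_headers = StatusAndHeaders(
            "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.1")
        record = writer.create_warc_record(
            "https://www.reddit.com/r/example/",  # placeholder URL
            "response",
            payload=BytesIO(b"<html>archived page</html>"),
            http_headers=http_headers,
        )
        writer.write_record(record)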


The end result is the data flowing to the IA... sure, they are "not affiliated", but the data is sent there, which is what the GP is talking about.


Thank you, drive through


...why?

It's a pretty high bar to say that something should not be archived and it's a waste.

I am not even going to state my opinion on the issue now, I am just disappointed about the level of discourse comments like this create. No justification, no logic.

The TL;DR is that you don't like Reddit. This is not useful to anyone.


No one cared about the snapshot of Bill Clinton dancing with an intern either, until later.



