Is it possible to write a program that connects to an HTTPS server and archives content, while keeping track of the session keys and the encrypted data coming from the server, and then records all that session traffic in a file? Replaying the file would allow anyone to observe that the data truly did come from that specific server, because it's signed with the cert for that server.
In other words, is it possible to get any HTTPS website to give you what is essentially a digitally signed copy of content you want to prove originated with that site? And is it true that that digital signature is easily verified to belong to the original website?
Unfortunately it's not possible, because TLS negotiates a symmetric key which is then used to encrypt and authenticate the rest of the session. If you post the transcript of a TLS session in an attempt to "prove" that you retrieved a specific document, a third party can verify that you did in fact negotiate a symmetric key with the correct server; but since it's a symmetric key, anyone with knowledge of the key can arbitrarily modify the transcript of the session [well, the part of the session where the HTTP request and response happen]. This obviously includes the original prover, and so a TLS transcript proves nothing at all.
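To make the symmetric-key objection concrete, here is a minimal sketch in Python. The key name and record contents are hypothetical stand-ins for what TLS actually derives and exchanges; the point is only that a symmetric authenticator (HMAC here) lets *either* endpoint produce valid-looking records after the fact.

```python
import hmac
import hashlib

# Hypothetical session key, standing in for the symmetric key TLS derives
# during the handshake. Both client and server hold the same key.
session_key = b"negotiated-symmetric-key"

def mac(key: bytes, record: bytes) -> bytes:
    """Authenticate a record the way both TLS endpoints can: symmetrically."""
    return hmac.new(key, record, hashlib.sha256).digest()

# The "genuine" record the server sent during the session...
genuine = b"HTTP/1.1 200 OK\r\n\r\noriginal content"
genuine_tag = mac(session_key, genuine)

# ...but the client knows the same key, so it can author a different record
# with an equally valid tag after the fact.
forged = b"HTTP/1.1 200 OK\r\n\r\nfabricated content"
forged_tag = mac(session_key, forged)

# A third party replaying the transcript cannot tell which record the
# server actually produced: both verify under the session key.
assert hmac.compare_digest(genuine_tag, mac(session_key, genuine))
assert hmac.compare_digest(forged_tag, mac(session_key, forged))
```

Contrast this with an asymmetric signature, where only the holder of the private key could have produced the tag; that is exactly the property a TLS transcript lacks.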
The problem is that when you try to verify the data down the line, the original certificate (up to and including the root cert, or even the authority itself) will have expired, so it won’t be possible to trust it.
Sure, but that's still a third-party dependency. And then you have to verify that the Firefox source is authentic. My point is there's no way to have a fully self-verifying archive.
Volunteers run Docker containers or VirtualBox VMs at home so that the traffic looks residential and does not get banned. For example: https://imgur.com/a/QXrhudA
Most useful content gets packaged by the ArchiveTeam and sent to the Internet Archive (no affiliation between the two).
Reddit will IPO soon, which is why this is frontpaged now, I presume. Folks expect the API to soon become much more restrictive. Other mirror projects don't have the ideology or reputation of ArchiveTeam.
I'm also not totally familiar with what's going on, but I discovered it on /r/datahoarder and I think it's because redditors are scared that content will start vanishing now that Reddit has filed for an IPO.
ArchiveTeam usually sends all the data they collect into Internet Archive. The two are independent, unrelated organizations, but IA is pretty open about accepting hoards of this type.
A 2018-era article suggests the Internet Archive held 46PB of content. It’s probably much more now.
This warrior has extracted ~880TB so far. I wouldn’t be surprised if the result occupies a material proportion of the IA’s capacity, at significant cost.
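A quick back-of-the-envelope check on "material proportion", using the two figures quoted above (both approximate, and the IA's total has surely grown since 2018):

```python
# ~880 TB extracted by the Warrior project so far,
# vs the 46 PB the Internet Archive reportedly held circa 2018.
reddit_archive_tb = 880
ia_total_tb = 46 * 1000  # 46 PB in TB

share = reddit_archive_tb / ia_total_tb
print(f"{share:.1%}")  # roughly 1.9% of the 2018 figure
```

So on 2018 numbers it's on the order of a couple of percent of the entire Archive, which is indeed material for a single site.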
Still, better than letting it all get burnt by the shareholders a few years down the line.
>Still, better than letting it all get burnt by the shareholders
You wish. By the looks of it, it will be burned down by yet another "redesign" when they inevitably shut down the sane UI of https://old.reddit.com because it's not pushing ads and "social" "features" strongly enough.
(If you don't use https://old.reddit.com/ instead of the "new", aka default, reddit, treat yourself to some sanity. Imagine if Reddit was more like HackerNews. Wait, you don't need to imagine, that's what the link goes to.)
I really like how they are pushing so hard for video, while struggling to display more than a thousand text comments or even links.
Feels like reddit is digging its own grave with these moves, and the shareholders' money with it.
I really hope to revisit this comment in 10 years and say how utterly stupid and wrong I was, because I've gotten an incredible amount out of the communities on reddit, and poured a lot into them too. Particularly, support groups.
Their video player is also garbage, especially on mobile. I'll have to open and close a video multiple times to get it to play, the quality will take a nosedive midway through and just stay that way for the rest of the video, and they take forever to load.
> I really like how they are pushing so hard for video, while struggling to display more than a thousand text comments or even links.
I don't know that this is the case of course, but can I blame Facebook? They "pivoted to video" while lying about their metrics and it wouldn't surprise me if the rest of Silicon Valley is like "Hey, they're making lots of money, why don't we do what they're doing?"
I really don't think you can blame Facebook for the idiotic decisions of others who're cargo culting its strategies.
It's not like Facebook held a gun to Reddit and forced them to have videos.
The answer to the "why" question should be obvious: it's the same in every case. "We shouldn't do what FAANG is doing, because we are not FAANG."
On that point, just because Floyd Mayweather makes a ton of money in the ring, you probably should not, even though you also have fists and can throw a punch.
Same with the framework/platform hype. You don't have Google scale, which means you don't have the problems Google has devised solutions for. A megatanker perhaps isn't the best boat for a weekend fishing getaway.
Twitter got popped left and right by journos and this was despite the platform being a huge source of revenue for the media, and the owner's politics.
The week they go public, the stories will start getting published about the content on there, the weak moderation (community moderation...lol), journos sending emails to companies asking whether they like their ads appearing along X or Y figure, etc.
The line that Huffman has taken before on this stuff (basically live and let live) works if you are private. It will go down extremely badly after the IPO.
Sorry. Reddit is already dead.
(Btw, I have no idea why this is...it makes no logical sense because being a public company changes nothing. But the media seems to understand that they can print something, that thing can potentially move the stock, and then they can bounce management into doing something).
To be fair to reddit, their combo of admin + community moderation has improved a lot in the past 5 years or so.
Remember /r/TheDonald? Or /r/jailbait? /r/coontown?
I mean, yes, it's a very low bar, but they deserve some commendation for making an effort that, I feel, paid off.
As far as being popped by journos, I feel like reddit has been on the radar for quite a while, and fared well under fire. Take /r/HermanCainAward, which got some pretty negative (and, I feel, misguided) press. The sub is not only still there, it's (quite sadly) thriving. (Sadly because nobody should receive that "award", but given quite a surplus of qualified recipients, something like that sub serves a purpose. Everyone there will be happy the day the sub stops being active because of lack of submissions, but alas, that day is yet to come).
So after catching flak, /r/HermanCainAward told people to cross out faces and names if they are not a public figure - and that seems to work fine with everyone. Like, the journos are not vultures: they picked on that sub, and the outcome was that it improved. When people are civil to each other, the journos don't have a hot scathing article to print.
Getting popped by journos can be a good thing. Maybe some of them will write a hit piece on how waiting for 7 seconds for a text page to load is shameful in 2021.
I wonder how much google stores. My dumb butt has close to 100TB of…uhh…Linux isos encrypted and hooked up to plex for $20 a month. Seems like a loss leader for them. But I always think of the multiple spinning discs I’m taking over there for content and residency since it can’t be deduped
This is so very necessary. Reddit has banned great content from many subs whose ideology they didn’t agree with. Unfortunately it isn’t even possible to know the URLs of posts from banned subreddits to look them up in the first place.
Does anyone know if there are backups of banned subreddits already?
In some cases it is positively ridiculous. I got banned from /r/coronavirus for posting a scientific article suggesting breakthrough cases in fully vaccinated people were a possibility (this was fairly early, Mar 2021 maybe). The mod denounced me as an antivaxxer, which I certainly am not. Lo and behold - breakthrough cases are a real thing.
Most subreddits are just an extension of whatever their moderators want to see. They're a huge shaping influence on the content and character of Reddit, but their actions are mostly invisible.
Where does ArchiveTeam find all the reddit posts and comments to archive? Do they have a script automatically going through the "New" section or are they finding posts through Google or link crawling?
In general, ArchiveTeam has scripts which hit random links to see if there is any content. They have coordination servers which share info on which slugs have been checked before to avoid duplicate effort.
There are several sites that let you view deleted reddit comments. I wonder if they have complete text backups, or if they get the comments from somewhere else.
What a waste of the space of the Internet Archive. Just because something exists doesn't mean it should be backed up, I would be surprised if anyone actually needed something from Usenet for example. Things like this are going to kill the Archive eventually.
> I would be surprised if anyone actually needed something from Usenet for example
Totally disagree. Usenet is the only place recording the history of a huge number of influential projects from the 80s and 90s. That history deserves to be recorded.
Some of the most valuable and insightful anthropological artifacts are merely shop ledgers and discourse on bathroom walls -- and we've never been better equipped to document/store/search/access the entirety of the saved artifacts from the modern age -- and presumably our mastery over information technology as a domain will only improve and make things easier.
My point was that most people wouldn't get much value out of it aside from sentimental reasons. You can back up your own Reddit posts if this matters to you, instead of backing up the entirety of the site; your own posts are the only part that should matter, and they have a much smaller footprint. If Reddit goes down for whatever reason, you could try to merge with others, too.
If you try telling a historian in a hundred years time that there wasn't anything of value worth saving on a site where millions of regular people had conversations about pretty much everything of relevance to life in the early 21st century they will robustly disagree with you.
I'm thinking about becoming a data hoarder. One of my favorite hobbies is listening to music on YouTube and reading people's comments about how that song was important in some moment of their lives. Most of the videos are unofficial, since the official ones have comments turned off. From time to time, one of these music videos gets DMCA'd and sadly the comments are all lost.
There is a limit to storage, especially when you don't own it, and the usefulness of the content collected should be considered when you're backing up what is primarily a news and meme aggregator. They aren't backing up the Library of Babel so where does the line get drawn?
The 20 Newsgroups dataset is a collection of ~20k newsgroup documents and is super popular for experimentation in text applications of various machine learning techniques. Without Usenet (and archives of Usenet) that probably wouldn't exist.
You never know what will be useful to the future. To give an example, the field of papyrology is largely built around trying to construct a view of the past using scraps of texts excavated from ancient dumps.
This is why individual people should never be left to make such decisions alone. They're likely to throw away things of enormous value for parochial reasons.
The "Warrior" program only downloads the data temporarily, and then uploads it to a staging server run by AT members. The staging server packs the WARCs into MegaWARCs and sends them to live at the Internet Archive.
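A simplified sketch of the "packing" step: gzipped WARC files are streams of gzip members, so a combined archive can be built by plain concatenation and still decompress as one valid stream. (The real megawarc tooling also writes index metadata; the record contents below are fabricated stand-ins.)

```python
import gzip
import io

def fake_warc(record: bytes) -> bytes:
    """Stand-in for one small gzipped WARC file."""
    return gzip.compress(record)

warc_a = fake_warc(b"WARC/1.0 record from site A\r\n")
warc_b = fake_warc(b"WARC/1.0 record from site B\r\n")

# Concatenation is the whole trick: the result is itself a valid
# multi-member gzip stream.
megawarc = warc_a + warc_b

# Python's gzip module transparently reads concatenated members back-to-back.
with gzip.open(io.BytesIO(megawarc)) as f:
    combined = f.read()

assert combined == (b"WARC/1.0 record from site A\r\n"
                    b"WARC/1.0 record from site B\r\n")
```

This is why the staging server can pack thousands of small uploads into MegaWARCs cheaply: no re-compression is needed.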
It's a pretty high bar to say that something should not be archived and it's a waste.
I am not even going to state my opinion on the issue now, I am just disappointed about the level of discourse comments like this create. No justification, no logic.
TL;DR is that you don't like reddit. This is not useful to anyone.