What do you think the chances are that Facebook actually does any delete operati...

mdjasper · on May 20, 2019

There are a million reasons to agree that the data is never actually deleted

* need to retain data to fulfill government requests

* internal auditing

* it's all backed up in some "data lake" somewhere to do internal ml or analytics on

* hundreds of copies in database backups from different times

* internal logs that contain the data

* it's already been analyzed and aggregated into learning products and models that aren't going to be recomputed

It's not being "paranoid". As someone who has worked on large scale saas, I say: there is zero, 0, ZERO, 0.00 chance of that data ever actually being deleted

athenot · on May 20, 2019

> As someone who has worked on large scale saas, I say: there is zero, 0, ZERO, 0.00 chance of that data ever actually being deleted

Unless that is built by design. I happen to also work on a large scale SaaS where we take this stuff very seriously and I can say it is possible to protect this data. However I will agree that this adds considerable complexity, but for some organizations, that is totally worth it.

> need to retain data to fulfill government requests

That's a choice, not a requirement. If you encrypt the data and purposely don't store the keys yourself but instead have the customer store them, then you don't have anything of value for the government.
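A minimal sketch of that idea, assuming a hypothetical `Provider` class that only ever receives ciphertext. The XOR one-time pad here is a stdlib-only stand-in so the example runs without dependencies; a real system would use an authenticated cipher such as AES-GCM from a vetted library. All names are illustrative, not any real service's API.

```python
import secrets
from typing import Dict, Tuple


def encrypt(plaintext: bytes) -> Tuple[bytes, bytes]:
    """Customer side: return (key, ciphertext). The key never leaves the customer."""
    # One-time pad: a random key as long as the message, used exactly once.
    key = secrets.token_bytes(len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
    return key, ciphertext


def decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Customer side: XOR with the same key recovers the plaintext."""
    return bytes(c ^ k for c, k in zip(ciphertext, key))


class Provider:
    """Hypothetical SaaS storage: holds opaque blobs and no keys."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, record_id: str, ciphertext: bytes) -> None:
        self._blobs[record_id] = ciphertext

    def get(self, record_id: str) -> bytes:
        return self._blobs[record_id]
```

Under these assumptions, anything the provider could hand over is unreadable without the customer's key, and a customer who destroys the key has effectively deleted the data everywhere it was copied.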

> internal auditing

Personally Identifiable Info is not something we want to peruse. In fact we purposely don't want to see it because that eliminates a potential for mishandling.

> it's all backed up in some "data lake" somewhere to do internal ml or analytics on

That kind of application shouldn't give carte blanche to disregard retention policies. You can run those applications against replicated shards of the original data; and when the original gets reclaimed, so does the replica.

> hundreds of copies in database backups from different times

Storing useless data forever is not cheap, especially at scale. Better store what needs to be stored and free up what can be freed when retention policies kick in (or user requests it).

> internal logs that contain the data

That's grounds for failing certain compliance audits. Logs should never contain PII in the first place; that's an operational failure.
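One way to enforce that at the logging layer, as a rough sketch: a `logging.Filter` that scrubs values before any handler writes them. The email regex and the `redacted_line` helper are assumptions for illustration; real PII detection needs far more than one pattern.

```python
import logging
import re

# Hypothetical pattern: only email addresses are treated as PII here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactEmails(logging.Filter):
    """Scrub email addresses from a record before any handler sees it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its args first, then redact the result.
        record.msg = EMAIL_RE.sub("[REDACTED]", record.getMessage())
        record.args = None
        return True  # keep the record, just with the PII removed


def redacted_line(message: str, *args: object) -> str:
    """Illustrative helper: run one message through the filter."""
    record = logging.LogRecord("app", logging.INFO, "app.py", 0, message, args, None)
    RedactEmails().filter(record)
    return record.getMessage()
```

In an application the filter would be attached with `handler.addFilter(RedactEmails())` so every record passing through that handler is scrubbed.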

> it's already been analyzed and aggregated into learning products and models that aren't going to be recomputed

That's a tricky one, but if those are actual models instead of giant lookup tables, one could assume the data is not reconstructible. However, that needs to be a design consideration of the models themselves, to prevent user data from persisting.

ForHackernews · on May 20, 2019

Ever is a very very long time indeed. Facebook will go out of business or get sold eventually and delete it all to save money or by accident.

See also: Myspace accidentally dumping everything pre-2015 https://www.engadget.com/2019/03/18/myspace-lost-12-years-mu...

monocasa · on May 20, 2019

For similar reasons, I doubt very much that they actually deleted everything they said they did. I know that they were backing up to tape and storing it at Iron Mountain years prior.

dx034 · on May 21, 2019

But someone has to pay that bill. If Facebook ever fails, I doubt that someone would keep data centers full of information around without getting paid.

monocasa · on May 21, 2019

They would sell it with their other assets.

nkassis · on May 20, 2019

Right, and deleting is usually expensive in large systems; it's easier to tombstone and delete later.

ddorian43 · on May 21, 2019

Say someone uploads pedophilia images to Facebook. Does it ever get deleted? Or does it stay in the data lake?

jyriand · on May 20, 2019

Last I remember, it was supposed to be good software engineering practice to never actually delete data from a database but just mark it as deleted.

javagram · on May 20, 2019

GDPR has changed this.

FWIW I’ve seen comments from a privacy engineer at google who posted here and said they actually do work hard to delete your data.

I’d expect Facebook and google to successfully delete the data (after giving 3-6 months for backups to age out) but wouldn’t trust most smaller operations to do so. And yeah, that doesn’t mean ML models or whatever but just the retrievable copies of your photos and text posts.

srcmap · on May 20, 2019

After learning that Google has been tracking the precise locations of a few hundred million people for the last 10+ years, I went through the exercise of deleting all the activity history from Maps, YouTube and Search and disabling all activity tracking.

For the first few hours, I saw new default YouTube suggestions. After a few days, I saw a lot of old searched/viewed videos pop back onto the YouTube home page. YT still seems to "suggest" videos for me to watch based on the past viewing info.

xmprt · on May 20, 2019

The model is probably based off of your watch history before you deleted everything. There would be no way (theoretically) for anyone to see your watch history but YouTube still thinks those old videos are ones that you'd be interested in and fortunately you haven't watched them yet (as far as the algorithm is concerned) so it's recommending them to you again.

bduerst · on May 20, 2019

It's likely they have an ML "recommender" model that was built off of your history, but it doesn't actually store the history, nor is it reverse-engineerable. Once you hit a certain threshold of activity or time duration, the model will probably rebuild.

givinguflac · on May 20, 2019

This is a big reason I view the new “we’ll automatically delete your data!” crap as disingenuous at best. They’ve already learned everything from the data in a few hours; it’s useless after that anyhow.

javagram · on May 20, 2019

A big problem with data is not the company itself using it, but other ways it can come back to bite you.

E.g. a sexual photo retrieved from a private social media album is no threat to your reputation when it’s been assumed into some machine learning model for detecting sexual photos, but it’s certainly a threat if a future data leak allows your enemies to get the actual .PNG or .JPG and send it to the news media or your loved ones. Knowing that the photo can actually be deleted is valuable in this case and I’m sure there are many other similar ones that could be listed.

bduerst · on May 20, 2019

Is that really a problem for users though?

Users want to delete watch history, which they can. Classifier models predicting what you want to watch are not the video watch history, nor are the models capable of producing it.

teddyh · on May 21, 2019

> Classifier models predicting what you want to watch

But that probably is not the only model generated from your data, is it? They probably have many other models generated from your data, everything from ad-displaying models to profiling models for Hydra.

srcmap · on May 20, 2019

But those "suggested" videos were things I watched months/years ago, not something I watched recently, after I deleted the view history.

bduerst · on May 21, 2019

That's what I'm talking about. Think of the recommender model as your own personal neural net that is trained, over time, to show you videos it thinks you might like. It is not capable of telling you which videos you watched, because it is not a database or list of watched videos, but it is instead a classifier that predicts what you probably want to watch.

hombre_fatal · on May 20, 2019

> GDPR has changed this.

That's a far, far cry from the reality of "possibly prosecutable in EU in some scenarios." We aren't even using a website right now that's in GDPR jurisdiction. But there's also a weird fetishization of GDPR I've noticed where people invoke it like "heh, my dad works for Xbox. just you wait buddy, he'll have you banned."

javagram · on May 20, 2019

I think it’s changed what is best practice. Unless you plan to never introduce your product in the EU, designing your product so that it cannot ever delete data is a bad idea.

In my experience I’ve definitely seen GDPR result in a large company having developers looking at how data can actually be deleted and not just set to deleted=true. I don’t think my company’s lawyers were alone in thinking this suddenly became more important than before.

ironmagma · on May 20, 2019

When a feature (e.g. thoroughly scrubbed data deletion procedure) is introduced into a product, it is often far easier to apply it laterally to all customers than just a subset. For this reason, GDPR has knock-on effects that benefit all users of certain services.

paulddraper · on May 20, 2019

I'm guessing GDPR increased the chances a lot.

qxcbr · on May 20, 2019

For legal reasons, there's a 0% chance that it's immediately deleted, but I'm sure that after a few months some cronjob prunes the database.

ilikehurdles · on May 20, 2019

> What do you think the chances are that Facebook actually does any delete operation on their data?

Close to 0%. There is no way they're actually deleting things on the backend, but this could help prevent your content from being indexed by search engines inside and outside of Facebook.