What do you think the chances are that Facebook actually does any delete operation on their data? I'm sure this is simply marking it as hidden in their database. Maybe I'm just paranoid, but to me the only way to take back something you say on social media is to never have posted it in the first place.


There are a million reasons to agree that the data is never actually deleted:

* need to retain data to fulfill government requests

* internal auditing

* it's all backed up in some "data lake" somewhere to do internal ML or analytics on

* hundreds of copies in database backups from different times

* internal logs that contain the data

* it's already been analyzed and aggregated into learning products and models that aren't going to be recomputed

It's not being "paranoid". As someone who has worked on large-scale SaaS, I say: there is zero, 0, ZERO, 0.00 chance of that data ever actually being deleted.


> As someone who has worked on large-scale SaaS, I say: there is zero, 0, ZERO, 0.00 chance of that data ever actually being deleted

Unless that is built by design. I happen to also work on a large scale SaaS where we take this stuff very seriously and I can say it is possible to protect this data. However I will agree that this adds considerable complexity, but for some organizations, that is totally worth it.

> need to retain data to fulfill government requests

That's a choice, not a requirement. If you encrypt the data and purposely don't store the keys yourself but instead have the customer store them, then you don't have anything of value for the government.
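
A toy sketch of that idea, assuming a one-time pad stands in for a real cipher (in production you'd reach for a vetted library and something like AES-GCM). The function names are made up for illustration; the point is only that the service stores ciphertext and never sees the key.

```python
import secrets


def generate_customer_key(length: int) -> bytes:
    """Runs on the customer's side; the key never leaves their hands."""
    return secrets.token_bytes(length)


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """One-time-pad XOR. The same call encrypts and decrypts.

    Toy stand-in for a real cipher: the key must be as long as the data
    and must never be reused.
    """
    if len(key) != len(data):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


# The customer encrypts before upload; the service only ever holds `stored`.
message = b"private post"
key = generate_customer_key(len(message))
stored = xor_bytes(message, key)

# Only the key holder can recover the plaintext.
recovered = xor_bytes(stored, key)
```

If the customer throws the key away, what's left on the service side is effectively noise, which is the "nothing of value" property described above.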

> internal auditing

Personally Identifiable Info is not something we want to peruse. In fact we purposely don't want to see it because that eliminates a potential for mishandling.

> it's all backed up in some "data lake" somewhere to do internal ML or analytics on

That kind of application shouldn't give carte blanche to disregard retention policies. You can run those applications against replicated shards of the original data; and when the original gets reclaimed, so does the replica.

> hundreds of copies in database backups from different times

Storing useless data forever is not cheap, especially at scale. Better to store what needs to be stored and free up what can be freed when retention policies kick in (or a user requests it).

> internal logs that contain the data

That's grounds for failing certain compliance audits. Logs should never contain PII in the first place; that's an operational failure.
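
One way to enforce that at the logging layer, as a minimal sketch: a standard-library `logging.Filter` that scrubs email addresses before a record is emitted. The `RedactPII` class, the `redact` helper and the regex are illustrative assumptions, and a real deployment would need to cover far more than emails.

```python
import logging
import re

# Deliberately simple pattern; real PII detection needs much more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Replace anything that looks like an email address."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


class RedactPII(logging.Filter):
    """Rewrite each record so the formatted message is scrubbed."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format msg % args first, then redact the final string.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # keep the record, just with a cleaned message


logger = logging.getLogger("app")
logger.addFilter(RedactPII())
```

A filter like this is a safety net, not a substitute for keeping PII out of log calls to begin with.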

> it's already been analyzed and aggregated into learning products and models that aren't going to be recomputed

That's a tricky one, but if those are actual models instead of giant lookup tables, one could assume the data is not reconstructible. However, that needs to be a design consideration of the models themselves, to prevent user data from persisting.


Ever is a very very long time indeed. Facebook will go out of business or get sold eventually and delete it all to save money or by accident.

See also: Myspace accidentally dumping everything pre-2015 https://www.engadget.com/2019/03/18/myspace-lost-12-years-mu...


For similar reasons, I doubt very much that they actually deleted everything they said they did. I know that they were backing up to tape and storing it at Iron Mountain years prior.


But someone has to pay that bill. If Facebook ever fails, I doubt that someone would keep data centers full of information around without getting paid.


They would sell it with their other assets.


Right, and deleting is usually expensive in large systems; it's easier to tombstone and delete later.
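
Roughly what that pattern looks like, sketched with an in-memory SQLite table. The `posts` schema, the `deleted_at` column and the function names are hypothetical, not anything Facebook is known to use.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT, deleted_at REAL)"
    )
    return conn


def soft_delete(conn: sqlite3.Connection, post_id: int, now: float) -> None:
    """Cheap path: write a tombstone timestamp, leave the row in place."""
    conn.execute(
        "UPDATE posts SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (now, post_id),
    )


def purge(conn: sqlite3.Connection, now: float, grace_seconds: float) -> int:
    """Expensive path, run later in bulk: physically remove old tombstones."""
    cur = conn.execute(
        "DELETE FROM posts WHERE deleted_at IS NOT NULL AND deleted_at <= ?",
        (now - grace_seconds,),
    )
    return cur.rowcount  # number of rows actually removed


def visible_posts(conn: sqlite3.Connection) -> list:
    """What the user sees: tombstoned rows are hidden immediately."""
    rows = conn.execute(
        "SELECT id FROM posts WHERE deleted_at IS NULL ORDER BY id"
    )
    return [r[0] for r in rows]
```

Whether the data is ever really gone then depends entirely on whether something like `purge` is scheduled, and on whether backups and replicas get the same treatment.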


Say someone uploads child sexual abuse images to Facebook. Does that ever get deleted? Or does it stay in the data lake?


Last I remember, it was supposed to be good software engineering practice to never actually delete data from a database but just mark it as deleted.


GDPR has changed this.

FWIW I’ve seen comments from a privacy engineer at Google who posted here and said they actually do work hard to delete your data.

I’d expect Facebook and Google to successfully delete the data (after giving 3-6 months for backups to age out) but wouldn’t trust most smaller operations to do so. And yeah, that doesn’t mean ML models or whatever, just the retrievable copies of your photos and text posts.


After learning that Google has been tracking the precise locations of a few hundred million people for the last 10+ years, I went through the exercise of deleting all my activity history from Maps, YouTube and Search, and disabled all activity tracking.

For the first few hours, I saw new default YouTube suggestions. After a few days, I saw a lot of old searched/viewed videos pop back onto the YouTube home page. YT still seems to suggest videos for me to watch based on my past viewing info.


The model is probably based off of your watch history before you deleted everything. There would be no way (theoretically) for anyone to see your watch history but YouTube still thinks those old videos are ones that you'd be interested in and fortunately you haven't watched them yet (as far as the algorithm is concerned) so it's recommending them to you again.


It's likely they have an ML "recommender" model that was built off of your history, but doesn't actually store the history nor is it reverse engineer-able. Once you hit a certain threshold of activity or time duration, the model will probably rebuild.


This is a big reason I view the new “we’ll automatically delete your data!” crap as disingenuous at best. They’ve already learned everything from the data in a few hours; it’s useless after that anyhow.


A big problem with data is not the company itself using it, but other ways it can come back to bite you.

E.g. a sexual photo retrieved from a private social media album is no threat to your reputation when it’s been assumed into some machine learning model for detecting sexual photos, but it’s certainly a threat if a future data leak allows your enemies to get the actual .PNG or .JPG and send it to the news media or your loved ones. Knowing that the photo can actually be deleted is valuable in this case and I’m sure there are many other similar ones that could be listed.


Is that really a problem for users though?

Users want to delete watch history, which they can. Classifier models predicting what you want to watch are not the video watch history, nor are the models capable of producing it.


> Classifier models predicting what you want to watch

But that probably is not the only model generated from your data, is it? They probably have many other models generated from your data, everything from ad-displaying models to profiling models for Hydra.


But those "suggested" videos were things I watched months or years ago, not something I watched recently after I deleted the view history.


That's what I'm talking about. Think of the recommender model as your own personal neural net that is trained, over time, to show you videos it thinks you might like. It is not capable of telling you which videos you watched, because it is not a database or list of watched videos, but it is instead a classifier that predicts what you probably want to watch.
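
A deliberately tiny sketch of what I mean, and emphatically not how YouTube actually works: here the "model" is just a per-topic counter, and every name below is made up. The history list and the learned profile live separately, so clearing one leaves the other intact.

```python
from dataclasses import dataclass, field


@dataclass
class UserState:
    history: list = field(default_factory=list)  # explicit list of watched ids
    profile: dict = field(default_factory=dict)  # aggregated topic weights


def watch(user: UserState, video_id: str, topics: set) -> None:
    """Record the view and fold its topics into the learned profile."""
    user.history.append(video_id)
    for topic in topics:
        user.profile[topic] = user.profile.get(topic, 0) + 1


def delete_history(user: UserState) -> None:
    """Clears the list of watched videos; the profile is left untouched."""
    user.history.clear()


def recommend(user: UserState, catalog: dict) -> list:
    """Rank videos not in the history by overlap with the profile."""

    def score(video_id: str) -> int:
        return sum(user.profile.get(t, 0) for t in catalog[video_id])

    candidates = [v for v in catalog if v not in user.history]
    return sorted(candidates, key=lambda v: (-score(v), v))
```

With a catalog of two cat videos and one car video, watching one cat video and then deleting the history puts that same video back at the top of the recommendations: the profile still says "cats", and the system no longer knows you've seen it.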


> GDPR has changed this.

That's a far, far cry from the reality of "possibly prosecutable in EU in some scenarios." We aren't even using a website right now that's in GDPR jurisdiction. But there's also a weird fetishization of GDPR I've noticed where people invoke it like "heh, my dad works for Xbox. just you wait buddy, he'll have you banned."


I think it’s changed what is best practice. Unless you plan to never introduce your product in the EU, designing your product so that it cannot ever delete data is a bad idea.

In my experience I’ve definitely seen GDPR result in a large company having developers looking at how data can actually be deleted and not just set to deleted=true. I don’t think my company’s lawyers were alone in thinking this suddenly became more important than before.


When a feature (e.g. thoroughly scrubbed data deletion procedure) is introduced into a product, it is often far easier to apply it laterally to all customers than just a subset. For this reason, GDPR has knock-on effects that benefit all users of certain services.


I'm guessing GDPR increased the chances a lot.


For legal reasons, there's a 0% chance that it's immediately deleted, but I'm sure that after a few months some cronjob prunes the database.


> What do you think the chances are that Facebook actually does any delete operation on their data?

Close to 0%. There is no way they're actually deleting things on the backend, but this could help prevent your content from being indexed by search engines inside and outside of Facebook.



