At the 33c3 I showed how you can often uniquely identify a single anonymized user from a dataset containing three million people by using publicly posted links from his/her Twitter timeline. I also showed that this is possible with other types of public information as well, such as YouTube video ratings or reviews on Google Maps.
The math behind it is quite simple and very reliable for many datasets, which makes it very easy to build robust fingerprints based on browsing / location / behavior data. In my opinion, this is what most big companies rely on today for identifying users, as this is more robust than cookie-based mechanisms, which become more ineffective as the use of multiple devices and blockers increases.
Here's the link to the video (you can choose the language in the menu; by default it's German, but the talk is also available in English and French):
https://media.ccc.de/v/33c3-8034-build_your_own_nsa
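To give a flavor of the math, here's a minimal, hypothetical sketch of the core scoring idea (the function names and the popularity weighting are my own illustration, not the exact model from the talk): shared links that are rare on the web are much stronger evidence than shared popular ones.

```python
import math

def score(history_links, candidate_feed, link_popularity):
    """How well does a candidate's feed 'explain' an anonymized browsing
    history? link_popularity maps a link to the fraction of users whose
    feed contains it, a probability in (0, 1]. Shared rare links count
    far more than shared popular ones: -log(p) grows as p shrinks."""
    return sum(-math.log(link_popularity.get(link, 1.0))
               for link in history_links & candidate_feed)

def best_candidate(history_links, feeds, link_popularity):
    """feeds: dict mapping user -> set of links in that user's feed."""
    return max(feeds, key=lambda user: score(history_links,
                                             feeds[user], link_popularity))
```

With enough links in the history, the top-scoring feed is very often the true owner's, which is the uniqueness effect demonstrated in the talk.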
> In my opinion, this is what most big companies rely on today for identifying users, as this is more robust than cookie-based mechanisms, which become more ineffective as the use of multiple devices and blockers increases.
Correct. Another thing this is used for is identifying mobile/PC pairs belonging to the same person.
I encountered something similar the other day. I was on a moving website to schedule a move across town. I put in my info, and they gave me a quote, but the process was weird. It wouldn't let me back out of the quote completely, and if I returned to the main page of the site, it still showed my quote.
So I opened an incognito window and went through the process, no problems. Closed all incognito windows. Opened another, went to the site, and my quote was up, in a new incognito tab.
It was jarring because it meant that they were tracking me somehow, obviously not through the standard mechanisms as I was incognito.
This was the talk with the URLs that showed whether you are logged in and which profile you're on (worked with different services), right?
What I didn't understand: how was this data captured? Third-party tracking through ad networks and the like? (Would I be safe with NoScript or even some privacy-aware adblocker?)
Sorry if this wasn't clear; what we mean is that client-side blocking won't help you stay anonymous, because even a server-side tracker can collect enough information to deanonymize you if enough of your "fundamental data" (such as your IP) does not change. So in addition to client-side blocking you should also use a VPN service to mask and (frequently) change your IP address.
It has been a couple of days since I watched the video, but if I recall correctly their point was not about tracking by browser extensions (which is horrible, and as mentioned above is in most cases blockable) but about the ease with which these datasets can be used to de-anonymize users completely. Most large companies will be able to construct at least a partial dataset by licensing data from various tracking providers.
This is why I use isolated sessions when browsing. I compartmentalize my surfing these days because of this exact type of attack (identities and other browsing artifacts spilling over into serendipitous/casual/random browsing). Mozilla are even going to ship this strategy in Firefox soon[1]. Another strategy to lessen the amount of data collected on you is to outright disable Facebook Like buttons and Twitter share buttons, because these widgets track you as you navigate around the web. This can be done in uBlock Origin[2] under the '3rd-party filters' tab by selecting Fanboy's Annoyance filter list alongside the Anti-ThirdpartySocial filter list.
> Individuals behave differently in the world when they are in different contexts. The way they act at work may differ from how they act with their family. Similarly, users have different contexts when they browse the web. They may not want to mix their social network context with their work context. The goal of this project is to allow users to separate these different contexts while browsing the web on Firefox. Each context will have its own local state which is separated from the state of other contexts
As I keep reminding people here, there's no such thing as "anonymized data" - there's only "anonymized until combined with other data sets". Thanks for the demonstration, and thanks 'MaruizioP for posting this.
From a cursory reading of the Wikipedia article on differential privacy, I understand this is about limiting access to data to a set of queries that are designed to make it difficult to use the query results to infer data the provider wants to protect. That doesn't seem to help in any way, though, if the malicious actor owns the dataset and can run whatever queries they like on it. I'm also not sure how this addresses the problem of including additional datasets to help correlate out identities.
Differential privacy uses the concept of adding 'noise' so that it is statistically provable that, if queries are restricted to a set A, B, C..., then an adversary can only tell the target from noise with some bounded probability. Restricting access to a subset of queries isn't really the main way DP improves privacy; rather, it's what makes the formal proofs doable.
> I'm also not sure how this addresses the problem of including additional datasets to help correlate out identities.
I agree with you on the first part of your post, but this part is a little off the mark. In the original paper[1], Cynthia Dwork confronts the issue you point out head on; the paper actually starts with an impossibility proof showing that no treatment of the data will get you the property "access to a statistical database should not enable one to learn anything about an individual that could not be learned without access". The impossibility result relies on the existence of outside datasets.
DP instead tries to quantify the probability of identification, and adds differing amounts of Laplace noise to get this. The idea is that the dataset shouldn't look "too different" with or without your information in it. If your participation doesn't change the dataset much, how could someone tell if you are in it or not, or moreover link you to a data point in it?
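To make "shouldn't look too different" precise: the standard ε-DP guarantee requires that for any two datasets D and D' differing in one person's data, and any set of outputs S, the mechanism M satisfies

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

So observing the output shifts anyone's belief about whether you are in the dataset by at most a factor of e^ε.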
If the data owner is malicious, you are right that there is no guarantee differential privacy is being used. That's a far cry from "there is no such thing as anonymized data"; you could just as easily say "there is no such thing as encrypted data", which I think we agree is wrong?
One can provide differentially private access to web histories that provably mask the presence/absence of individuals in that dataset, even when combined with arbitrary exciting side information, like social networks.
Even with something as simple as differentially private counting, you can pull out correlations between visits, finding statistically interesting "people who visit page X then visit page Y surprisingly often" nuggets, which are exciting to people who don't have their own search logs.
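As a concrete sketch of that kind of counting (the epsilon value and the predicate here are placeholders): a count query has sensitivity 1, since adding or removing one person changes the true count by at most 1, so Laplace noise with scale 1/ε makes it ε-differentially private.

```python
import numpy as np

def dp_count(records, predicate, epsilon):
    """Epsilon-DP count: sensitivity is 1 (one person's presence or
    absence moves the true count by at most 1), so Laplace noise with
    scale 1/epsilon masks any individual."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# e.g. a noisy "visited X then visited Y" count over browsing records:
# dp_count(histories, lambda h: "x.com" in h and "y.com" in h, epsilon=0.5)
```

The noisy count is still useful in aggregate (at ε = 0.5 the noise scale is only 2), while any single person's contribution drowns in the noise.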
At the launch of G+ I tested a way to link the search terms that people used to get to my page to G+ users - if they clicked on a +1 button (which had some nifty callbacks back then). It kinda worked: I could assign a search term to a set of users, which I then narrowed down with manual review.
I think Google must have been aware of the issue of intermingling Search with Social, since a pretty short time later they stripped the search query referrers from (organic) Google Search.
Note: It was just a proof of concept to see if I could. Also: Nobody used G+ anyway.
Search query referrers were dropped due to the adoption of HTTPS/TLS. When a URL is visited from an HTTPS site, the browser doesn't send the referrer URL, only the host.
It's still possible to get "aggregate" information about keywords via Google Webmaster Tools.
Yeah, default HTTPS for logged-in users, with part of the referrer stripped, started in 2011 - besides being a good idea, it came right after the launch of G+ profiles, as there was now a public "real name" piece of information that connected search queries to G+ users.
Google Search Console data doesn't let you connect Google search queries to sessions.
> Our approach is based on a simple observation: each person has a distinctive social network, and thus the set of links appearing in one's feed is unique.
Compartmentalization mitigates that threat. I have multiple online personas, and Mirimir is the only one who goes on about privacy, anonymity, etc. The only one who visits HN, Wilders, etc. Mirimir and other online personas also share no contacts. However, Mirimir does use pseudonyms ;)
Couldn't you also be identified to a certain probability based on your writing style (stylometry)? I remember reading something about that years ago.
Makes me think someone should make a Chrome extension that rewrites everything someone posts while keeping the concepts they are trying to get across, to avoid identification via stylometry.
I find that stylometry is one of the easier barriers to overcome. For example, one pseudonym can persistently make a grammatical error (e.g. failing to capitalise proper nouns) and another can use UK English spellings.
And then there's the general 'vibe' of the forum which shapes how a pseudonym writes. If I were to write the same thing on HN versus IRC I would hope that the styles would be very different.
What's really difficult is compartmentalising information, so that even on the same website two pseudonyms don't demonstrate that they have the same knowledge.
Is it worth the bother? Possibly not, but I find it also makes me concentrate on what I'm writing.
I hope "erehwon" is just one of those pseudonyms and doesn't connect to any other of your identities :).
That said, deanonymization methods stack up; add geotemporal correlation of activities and one could presumably connect your various identities together ;).
Is there really much value in keeping the same HN account? Perhaps account rotation would be a much better method of obscuring your identity. Some social networks do have value for maintaining the same friends, while others do not.
If you have one browsing history and one Twitter profile, it is easy to compare them. However, the main technical challenge is that we have to find some way to compare a given browsing history to every single profile on Twitter, which in theory is hundreds of millions of comparisons, plus we would have to find some way to get hundreds of millions of news feeds. Solving this took us several months. Of course in the paper we explained how we reduced the search space so we only have to consider a limited set of candidates, and only have to crawl part of the news feed in real time.
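For a hedged sketch of what such a search-space reduction can look like (my own simplification, not the paper's exact procedure): an inverted index from link to users means only profiles sharing at least one link with the history ever get scored.

```python
from collections import defaultdict

def build_index(feeds):
    """Inverted index: link -> set of users whose feed contains it.
    feeds: dict mapping user -> set of links in that user's feed."""
    index = defaultdict(set)
    for user, feed in feeds.items():
        for link in feed:
            index[link].add(user)
    return index

def candidates(history_links, index):
    """Score only users who share at least one link with the history,
    instead of all hundreds of millions of profiles."""
    found = set()
    for link in history_links:
        found |= index.get(link, set())
    return found
```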
Of course if a couple of grad students could do this in a few months with only publicly available data (plus donated browsing history), then I'm sure ad networks could easily do this too, and given how many ad networks have parts of your browsing history I would say that it's scary that it's so "easy."
Well as the paper notes, we don't quite have their browsing history, just the domains (since HTTPS encrypts the paths). But yes, the domains you visit make you unique and therefore identifiable.
This is useful in that it limits the set of people who have access to your browsing history. But first-party tracking will still be a thing (i.e. if you are visiting a website, the owners of that website will know you are visiting it). Actually, there is a paper showing that even just the New York Times links that people click on from Twitter make users uniquely identifiable in many cases, so a company like the NYT could (theoretically) track people without using any third-party tracking at all.