A few years back I had a traveling salesperson problem; I googled a bit, somehow didn't find anything interesting, so I just created a random loop and randomly mutated it until it got better. A few months ago I discovered a neat little heuristic: the best loop never self-intersects. And a few days ago I found that there are algorithms with a worst-case guarantee relative to the optimal tour.
My question is: how do you find all those algorithms? Wikipedia never really feels like a good introductory or discovery place. In particular, some problems have been thoroughly studied by scholars, but you don't find them because you haven't found the keywords that will direct Google in your search. And I am not a researcher, so I don't keep tabs on a domain; I'm a jack of all trades and work on a wide set of things.
Your method is similar to a popular probabilistic technique called simulated annealing [1].
The general setting for this type of algorithm is that you have a large number of states and you want to find the state that maximizes a particular score. In the TSP, the states are paths and the score is the (negative of the) path length. The idea is to define a Markov chain whose stationary distribution favors high-scoring states. This means that if you jump from state to state according to the probabilities defined by the chain, the probability of eventually ending up in a given state grows with that state's score.
In the case of the TSP, the states can be represented as permutations of the cities, and a state's neighbors are obtained by swapping the positions of two cities. The probability of moving to a neighbor is higher when the neighbor's path is shorter than the current one.
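A minimal sketch of that idea in Python (assuming cities are 2D points; the two-city swap move and the geometric cooling schedule are just illustrative choices, not the only ones):

    import math
    import random

    def tour_length(tour, cities):
        """Total length of a closed loop visiting the cities in the given order."""
        return sum(
            math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
            for i in range(len(tour))
        )

    def anneal_tsp(cities, steps=100_000, start_temp=10.0, end_temp=1e-3):
        """Simulated annealing with a two-city swap move (illustrative parameters)."""
        tour = list(range(len(cities)))
        random.shuffle(tour)
        best = list(tour)
        best_len = cur_len = tour_length(tour, cities)
        for step in range(steps):
            # Geometric cooling schedule.
            temp = start_temp * (end_temp / start_temp) ** (step / steps)
            i, j = random.sample(range(len(tour)), 2)
            tour[i], tour[j] = tour[j], tour[i]
            new_len = tour_length(tour, cities)
            # Always accept improvements; accept worsenings with probability exp(-delta / temp).
            if new_len < cur_len or random.random() < math.exp((cur_len - new_len) / temp):
                cur_len = new_len
                if cur_len < best_len:
                    best, best_len = list(tour), cur_len
            else:
                tour[i], tour[j] = tour[j], tour[i]  # undo the swap
        return best, best_len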
> My question is: how do you find all those algorithms? Wikipedia never really feels like a good introductory or discovery place.
I know a place.
From the GitHub repo Coding Interview University[0] there is a link to a Jupyter notebook[1] on various ways to solve the traveling salesman problem. It is a very good, detailed resource.
Knowing the right keywords and roughly how they fit together is a huge component of domain expertise. Building general background knowledge through experience and readings goes a long way for this reason.
I think that most of the time they are particular to a domain, developed to solve a particular problem. For example, some researcher might have a grant to develop optical inspection of a circuit board using a camera; moving the camera efficiently over the 2D Euclidean plane is a TSP, and you can develop a polynomial-time approximation scheme (PTAS) to solve it.
At first it's novel; then at some point it appears in books dedicated to polynomial-time approximation schemes. Look at the references in Wikipedia: they often point to a larger body of work than the wiki article can possibly hope to cover.
I would try to look for academic papers on the topic. Even if you can’t find an exact match, papers typically have a section that describes other related work. You can also scan through the citations to see if anything might be a match. If there’s some relevant paper, you’ll probably find it after searching through a few papers.
When reading a paper I would just scan the abstract, conclusion, introduction, related work section (possibly in that order) to see if it’s relevant.
For finding papers I like semantic scholar and Google scholar.
OOC: how did you create your initial random loop? This was a Kaggle problem about seven years ago. The competitive solutions found a good initial guess by breaking up the domain into a grid. They would solve TSP in each grid cell, and then stitch these solutions together to get a reasonable initial global solution. If you created your initial loop greedily (as I did) you got a garbage solution, and also wasted a lot of time that you could have spent doing random swaps or simulated annealing.
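For what it's worth, here's a rough sketch of the grid-seeding idea (hypothetical, not one of the actual Kaggle solutions): bucket the cities into cells, then snake through the cells so consecutive cities stay close; the per-cell TSP solve and stitching would slot in where noted.

    from collections import defaultdict

    def grid_seed_tour(cities, cell_size):
        """Rough initial tour: bucket cities into grid cells, then visit cells in a
        snake order so consecutive cells stay adjacent. This is only a seeding
        heuristic; each cell's sub-order is left as-is (a per-cell TSP solve
        would slot in here)."""
        cells = defaultdict(list)
        for idx, (x, y) in enumerate(cities):
            cells[(int(x // cell_size), int(y // cell_size))].append(idx)
        tour = []
        for col in sorted({c for c, _ in cells}):
            rows = sorted(r for c, r in cells if c == col)
            if col % 2:          # alternate row order in every other column
                rows = rows[::-1]
            for row in rows:
                tour.extend(cells[(col, row)])
        return tour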
Hey, author here! I've been working on infrastructure and database-y things at Heap for the last couple years, ranging from improving Postgres performance and availability to building out services (like this one!) and refactoring, encapsulating, and optimizing core systems.
I'll try to answer any questions about the post (technical or otherwise).
Yes! You are correct. This ensures that we are able to maintain a cohesive view of the entire "user" across all their devices, browsers, etc. (so long as the Heap customer has a method for identifying their end user).
A classic example is pre- and post- signup behavior for a single user. When a user first lands on a page, they will be anonymous and lack a canonical identity. They may come from specific referrers (search, ad, social media, direct), land on a specific page, or engage with certain parts of the site. All of these actions are tracked and stored using an anonymous id. After the user creates an account and is assigned a canonical id (via the `identify` API call), we still want to associate all the previously tracked data with the canonical identity. This allows our users to perform analyses using events and data points from before and after identification.
> merge the entire set of anonymous ids
In the previous example the "set of anonymous ids" is just a single ID. There are use cases where a user may already have a canonical identity but we want to change or update that canonical id. In this case, we merge all the data associated with both canonical identities (the set of anonymous ids associated with the existing canonical user and the set of ids associated with the new canonical identity) and create a single combined user with a cohesive view of all actions on our customer's site/app, etc.
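To make the semantics concrete, here's a toy parent-pointer model of the merge behavior described above (the names and API are illustrative, not Heap's actual schema):

    class IdentityStore:
        """Toy model: every user id points at a parent, and the root of a tree
        is the canonical identity for every id in that tree."""

        def __init__(self):
            self.parent = {}

        def find(self, user_id):
            self.parent.setdefault(user_id, user_id)
            while self.parent[user_id] != user_id:
                # Path halving keeps lookups cheap as trees grow.
                self.parent[user_id] = self.parent[self.parent[user_id]]
                user_id = self.parent[user_id]
            return user_id

        def identify(self, anonymous_id, canonical_id):
            """Merge the anonymous id's entire tree into the canonical identity's tree."""
            anon_root, canon_root = self.find(anonymous_id), self.find(canonical_id)
            if anon_root != canon_root:
                self.parent[anon_root] = canon_root

    store = IdentityStore()
    store.identify("anon-123", "user@example.com")   # pre-signup events now roll up
    assert store.find("anon-123") == store.find("user@example.com")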
It seems to me that under this scheme, if I make a single erroneous identify call, I will irreversibly merge two users. This is a surprising approach. Given that identify calls may occasionally be wrong, I would expect that
identify(anonymous_id, new_canonical_id)
would map anonymous_id => new_canonical_id, but would leave the rest of the set find(anonymous_id) alone.
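To illustrate what I mean, a toy parent-pointer map (hypothetical, not Heap's schema) with the semantics I'd expect:

    # Each id points at its parent; following parents leads to the canonical identity.
    parent = {"anon-1": "old-canonical", "anon-2": "old-canonical",
              "old-canonical": "old-canonical", "new-canonical": "new-canonical"}

    def find(user_id):
        while parent[user_id] != user_id:
            user_id = parent[user_id]
        return user_id

    # Proposed semantics: re-point only anon-1, leave anon-2 with the old identity.
    parent["anon-1"] = find("new-canonical")
    assert find("anon-1") == "new-canonical"
    assert find("anon-2") == "old-canonical"   # rest of the old set is untouched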
Yea, it seems like compared to all of the other data they're logging per user, separately preserving the parent id and canonical id in the tree would have little cost and allow them to fix canonicalization errors later.
Then there's a write throughput vs read latency trade-off for reading statistics aggregated by canonical ID, but my guess is that trade-off can be made in a way they're happy with in exchange for the ability to undo mistakes.
That is an interesting idea. Off the top of my head, I'm not sure - I personally do not have much experience with graph databases (mostly focused on Postgres). We are doing some more infrastructure-y work in this area and changing the underlying persistence layer isn't out of the question...
People are so jaded about whiteboarding algorithms/data structures that it's great to see bona fide examples of their use in production outside of Skiena's war stories. Great content.
Somehow "4 hours to 15 minutes" sounds way more impressive than 15x speed up, at least to me. From the article, it's unclear how the hashing mechanism you use actually approximates rank at all; most hashes I'm familiar with are essentially one way functions, ideally preserving no information about the input at all. How are you hashing ID's so that unions result in smaller hashes?
And great question! The hashing heuristic is a bit of a cute trick and I definitely glossed over it a bit in the post. I'll try to elaborate a bit more here...
> most hashes ... are essentially one way functions, ideally preserving no information about the input at all
You are correct, an individual hash does not preserve any information about the node, nor does it approximate rank in the union-find graph. We only get the effect in aggregate when many unions are performed and we repeatedly hash and select the lower value as the new root.
The hash for a given node in the graph is deterministic and fixed; it is effectively a random value. But the hash for a tree of nodes (i.e., the hash of the tree's root node) can only go down as unions are performed. This is because we select the root with the lower hash to become the root of the combined tree on every union. As a result, the node with the lowest hash is the root of each tree, and with more union operations, larger trees will tend to have lower and lower hashes, effectively approximating a rank.
Put another (more handwave-y) way, we essentially roll the dice and get a fixed random number (hash) for each node in the tree. Since the tree is assigned the lowest random number contained within it, larger trees will probably have lower values than smaller trees (more dice rolls mean more chances to land on a low number).
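Here's a minimal sketch of that heuristic (SHA-1 is just an illustrative choice of hash, and the real system certainly differs in the details):

    import hashlib

    def node_hash(user_id):
        # Deterministic, effectively random value per node.
        return int.from_bytes(hashlib.sha1(user_id.encode()).digest()[:8], "big")

    class HashUnionFind:
        def __init__(self):
            self.parent = {}

        def find(self, x):
            self.parent.setdefault(x, x)
            while self.parent[x] != x:
                x = self.parent[x]
            return x

        def union(self, a, b):
            root_a, root_b = self.find(a), self.find(b)
            if root_a == root_b:
                return
            # The root with the lower hash wins, so a tree's hash only ever goes down
            # as it absorbs other trees; larger trees tend toward lower hashes,
            # approximating union by rank.
            if node_hash(root_a) <= node_hash(root_b):
                self.parent[root_b] = root_a
            else:
                self.parent[root_a] = root_b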
Ahhh I see now. I was so caught up trying to think of ways to make the hash function give you this property, I didn't realize that it didn't matter and that it emerged as a result of you picking which hash to assign to a tree. Makes sense now, thanks.
I think a simplified explanation is that, assuming your hash function is ideal (can be assumed to be random, except for being deterministic), you end up building a treap, minus the treap's restriction on the in-order traversal of keys and restriction on the number of children. The height of a treap is O(log N), so your solution is expected to be within a constant factor of an optimal solution.
Woah these look great! Sad I didn't encounter them earlier.
Our team is actively working on improvements in this part of our data infrastructure (hence this project & blog post) so maybe there will be a follow up coming up with the next version of our Identity implementation...
Isn't that expected (within a constant factor of optimal depth, a.k.a. O(log N) tree height)? They're essentially building a treap, minus the in-order traversal of keys restriction.
I feel like, 25 years post-undergrad and over 10 years post grad school, I should go back and re-read some data structures classics: I spend all of my time running database queries and debugging web service calls, and only rarely dip into real algorithmic optimization. I wonder how much performance optimization I'm leaving on the table because my data structures are so rusty.
Some of the more advanced postgres SQL blog posts over the years have come from the Heap team. I'm curious what their good enough, optimized postgres solution was that they replaced with this project. :)
The biggest and most challenging trend is properly devising a data design schema for:
1. an appropriate privacy and deanonymization level for each data item in a structure (corporate privacy, trade secret, account ID)
2. an appropriate privacy level for a tuple of data items (i.e., PII: name, ID, birthdate)
3. a security context for each class of end users (support, engineering team, marketing) specifying what amount and degree of deanonymized data they may use.
I don't see the point of hashing the key instead of picking the lexicographically smaller key. It will have the same effect and comparison is O(n), same as hashing.
Great question! You are correct, these methods would be equivalent w.r.t. tree "shape". The problem emerges downstream: in our shared database, we rely on the fact that user IDs are uniformly distributed. If we directly compared values, we would start skewing the roots toward the lower end of the range. By comparing hashes, we get the same heuristic effect without disrupting the uniform distribution.
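A toy illustration of that skew effect (not our actual pipeline): picking the minimum raw value pulls the chosen roots toward the low end of the ID range, while picking the minimum by hash leaves them uniformly distributed over the range.

    import hashlib
    import random

    def h(x):
        return hashlib.sha1(str(x).encode()).hexdigest()

    ids = [random.randrange(10**9) for _ in range(100_000)]
    groups = [ids[i:i + 10] for i in range(0, len(ids), 10)]

    roots_by_value = [min(g) for g in groups]          # roots skew toward small ids
    roots_by_hash  = [min(g, key=h) for g in groups]   # roots stay uniform over id space

    print(sum(roots_by_value) / len(roots_by_value))   # well below the range midpoint
    print(sum(roots_by_hash) / len(roots_by_hash))     # close to the midpoint (~5e8)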
It's a good way to capture user tracking data and be precise in knowing exactly who is doing what on a site or mobile app. Hopefully this data structure can be extended to completely remove every bit of tracking information collected about a user, so that it can comply with GDPR and respect privacy.
Be careful in tracking users; it may fall foul of GDPR. If your company is using analytics like this, it will be almost impossible to remove personally identifiable tracking information from your website or mobile app platform.
This means it will become a nightmare to remove the user tracking information from every system and every system log. This is one of the reasons most solutions from the USA are ill-suited for Europe. Hopefully the USA can really start respecting privacy and build systems that keep privacy on top.
Given that the most admired companies are built on invasion of privacy (Facebook, Google, Amazon, Netflix), I doubt there is the will to remove privacy-invading technologies from the core. I see some efforts by Apple, but the larger app ecosystem still relies on trading privacy for free apps or utilities, and that will be hard to move away from.
> Hopefully this data structure can be extended to completely remove every bit of tracking information collected about a user, so that it can comply with GDPR and respect privacy
This is already the case! Interestingly enough, our identity tracking mechanisms actually make it easier to delete users and purge _all_ of their data in the same way it makes unifying all their data for analysis easier in the first place.
When a GDPR request comes in through our API, we search for the matching user in our database. If we find a matching user, we'll look in our Identity system ('s union-find data structure) to find all the other user_ids which were associated with that canonical identity. Then we'll go through and delete all the data for every constituent user. This is essentially a reverse lookup (find the set of user_ids for a canonical user_id/identity) and is easy to execute.
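Roughly, that lookup looks like this (a sketch, assuming a parent-pointer map `parent` and a `find` function like the union-find structures discussed upthread; `delete_events_for` is a stand-in for the actual event deletion, not a real API):

    def ids_for_identity(parent, find, user_id):
        """Every id whose parent chain resolves to the same canonical root as user_id.
        A linear scan is fine for a sketch; a production system would index this."""
        root = find(user_id)
        return [uid for uid in parent if find(uid) == root]

    def purge_user(parent, find, user_id, delete_events_for):
        """Hypothetical deletion flow: hand every constituent id to delete_events_for."""
        for uid in ids_for_identity(parent, find, user_id):
            delete_events_for(uid)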