
If you have a priori knowledge of the data distribution, you can construct an ideal partitioning scheme. In practice, this runs into some significant problems:

- The average distribution of the data and the instantaneous distribution of the data can be very, very different. This means that some cells are overloaded while others are idle, and the whole system runs as slow as the overloaded cell. The canonical (and fairly benign) example is data following the sun.

- Many data sources have inherently unpredictable data distributions.

- Spatial joins across different data sources (say, weather and social media) require congruent partitioning or they won't scale. Unrelated data sources tend to have unrelated data distributions, so this is a problem.
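The congruent-partitioning point in the last bullet can be sketched concretely: if both sources map records into the *same* fixed grid, the join decomposes into independent per-cell joins with no cross-partition traffic. A minimal sketch in Python, assuming a toy equal-angle grid (the `cell_key` and `partitioned_join` helpers are hypothetical, for illustration only):

```python
from collections import defaultdict

def cell_key(lat, lon, cells_per_axis=8):
    """Map a point to a fixed grid cell. Because BOTH datasets use the
    same grid, records that could join land in the same partition.
    (Hypothetical helper, not from any particular system.)"""
    row = min(int((lat + 90.0) / 180.0 * cells_per_axis), cells_per_axis - 1)
    col = min(int((lon + 180.0) / 360.0 * cells_per_axis), cells_per_axis - 1)
    return (row, col)

def partitioned_join(weather, social):
    """Bucket both sources by the shared cell key; join within cells only."""
    buckets = defaultdict(lambda: ([], []))
    for rec in weather:
        buckets[cell_key(rec["lat"], rec["lon"])][0].append(rec)
    for rec in social:
        buckets[cell_key(rec["lat"], rec["lon"])][1].append(rec)
    # Each non-empty cell can go to a separate worker independently.
    return {k: v for k, v in buckets.items() if v[0] and v[1]}
```

The catch the comment raises is visible here: a fixed grid is only congruent for both sources if it ignores each source's own distribution, so cells will be unevenly loaded whenever the two distributions differ.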

Making your partitions match your data distribution is good practice for static data layers. With some caveats, this can be modified for spatial joins across static data layers as well. For dynamic data sources, you run into issues with data and load skew at scale.
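The load-skew problem with dynamic sources can be made concrete with a toy model: shard a global service into 15°-longitude bands and assume traffic in each band follows a day/night cosine curve peaking at local noon. This is an illustrative assumption, not measured data, and the `daytime_load` helper is hypothetical:

```python
import math

def daytime_load(hour_utc, n_shards=24, total_requests=100_000):
    """Toy diurnal model: shard i roughly covers the UTC+i longitude band.
    Traffic peaks at local noon and falls to zero at night.
    (Illustrative assumption, not measured data.)"""
    weights = []
    for shard in range(n_shards):
        local_hour = (hour_utc + shard) % 24
        # cosine day/night curve, clipped at zero overnight
        weights.append(max(0.0, math.cos((local_hour - 12) / 24 * 2 * math.pi)))
    total = sum(weights)
    return [total_requests * w / total for w in weights]

loads = daytime_load(hour_utc=0)
```

In this model the hottest shard carries roughly 3x the average load at any instant while the night-side shards sit idle, even though the *average* distribution over a full day is perfectly uniform — which is exactly the gap between average and instantaneous distribution described above.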



The sun is pretty predictable...

everyone who has built an online service knows the access pattern is a wave that follows the time of day.

do you mean to say you've seen people trying to shard a geodb by latitude instead of longitude? ... that would be very sloppy initial research.


really? i got downvoted because i find it hard to imagine that people don't know their systems will get more requests from the area where it is daytime?

or i guess the sun is not predictable? sigh...



