If you have a priori knowledge of the data distribution, you can construct an ideal partitioning scheme. In practice, this runs into some significant problems:
- The average distribution of the data and the instantaneous distribution of the data can be very, very different. This means that some cells are overloaded while others are idle, and the whole system runs as slow as the overloaded cell. The canonical (and fairly benign) example is data following the sun.
- Many data sources have inherently unpredictable data distributions.
- Spatial joins across different data sources (say, weather and social media) require congruent partitioning or they won't scale. Unrelated data sources tend to have unrelated data distributions, so this is a problem.
Making your partitions match your data distribution is good practice for static data layers. With some caveats, this can be modified for spatial joins across static data layers as well. For dynamic data sources, you run into issues with data and load skew at scale.
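The follow-the-sun skew can be sketched with a toy model (all numbers and names here are hypothetical, just for illustration): partition the globe into 24 fixed 15-degree longitude cells and assume traffic at each longitude peaks at local noon. Averaged over a day, every cell sees the same load, but at any single instant the hottest cell carries several times the mean:

```python
import math

NUM_PARTITIONS = 24  # fixed 15-degree longitude cells

def partition_for(longitude_deg):
    """Map a longitude in [-180, 180) to one of 24 fixed cells."""
    return int((longitude_deg + 180.0) // 15) % NUM_PARTITIONS

def instantaneous_load(utc_hour, requests_per_hour=1000):
    """Toy model: traffic at each longitude peaks at local noon
    (a cosine bump, clamped to zero during local night)."""
    load = [0.0] * NUM_PARTITIONS
    for cell in range(NUM_PARTITIONS):
        center_lon = cell * 15 - 180 + 7.5
        local_hour = (utc_hour + center_lon / 15.0) % 24
        intensity = max(0.0, math.cos((local_hour - 12) / 24 * 2 * math.pi))
        load[cell] = requests_per_hour * intensity
    return load

load = instantaneous_load(utc_hour=12)
avg = sum(load) / NUM_PARTITIONS
peak = max(load)
print(f"peak/average load at one instant: {peak / avg:.1f}x")
```

Under this model the hottest cell runs at roughly 3x the average while half the cells sit idle, so the system as a whole runs at the pace of the hottest cell. Real traffic is spikier than a cosine, so the real ratio is usually worse.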
really? i got downvoted, but i find it hard to imagine that people don't know their systems will get more requests from the area where it's daytime.