I enjoyed the article, and wondered what queries you support. Something like "SELECT * FROM world WHERE (distance_from_me < 500 meters);"?
I ask because a lot of what you discuss seems to resonate with folks in the graphics business who are, for example, culling object geometry from a render scene by doing what is effectively the above operation. It's also what a 'Whisper' or 'Yik Yak' type of app does when trying to find people near the user, although in that case the problem is greatly simplified by assuming a sphere and testing for containment in the sphere (or the circle, if working in the plane).
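For what it's worth, the planar version of that containment test is about as simple as spatial filtering gets; a rough sketch in Python (all names made up):

    def within_radius(points, center, radius_m):
        # Naive "people near me" filter: keep points inside a circle of
        # radius_m around center. Assumes planar (x, y) coordinates in
        # meters; on a sphere you would swap in a great-circle (haversine)
        # distance instead of this Euclidean test.
        cx, cy = center
        r_sq = radius_m ** 2
        return [p for p in points if (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= r_sq]

    # e.g. within_radius([(10, 20), (900, 0)], center=(0, 0), radius_m=500)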
You can do much more sophisticated operations than simple point-in-polygon queries.
The geospatial implementation in SpaceCurve is quite sophisticated and in some aspects the state-of-the-art. Complex geometry types and operators are obviously supported, and the implementation is fully geodetic and very high precision by default. Complex polygon constraints, aggregates, joins, intersections, etc on data models that also contain complex geometries are supported and massively parallelized. There is nothing else comparable for big data in this regard that I know of.
It is useful to understand that the entire platform, from the database engine on up, is fundamentally spatially organized, even for plain old SQL "text and numbers" data. The SQL implementation is a thin layer that is translated into the underlying spatial algebra. You can use it for non-geospatial things; it just lends itself uniquely well to geospatial data models because it can properly represent complex spatial relationships at scale.
Great, I was actually wondering if you have an API manual online somewhere? It sounds like it is more than just a performant PostGIS (no disrespect intended; PostGIS is quite slow), but beyond the basic "stuff near here" queries, how is the spatial algebra expressed from the client to the underlying database?
Graphics will tend to want 3D spatial queries, though, and this gives the impression of being optimized for, if not restricted to, 2D or 2D-plus-curvature. The article consistently talks about "polygons" as opposed to "polytopes", for example.
The geospatial implementation is a true 3-space model. You can content-address objects and features past low-earth orbit by default. All surfaces upon which geometries are constructed are embedded in this space. Polygons relative to an embedded surface, such as a geodetic 2-ellipsoid, are actually represented as 3-space shapes under the hood; the 2-surface is a convenience for popular use cases but the data is not organized that way.
There are an unbounded number of possible shapes and surfaces in 3-space. Storing and content-addressing them is easy enough. The challenge is that to be useful you need, at a minimum, generally correct intersection algorithms for all of the representable shapes. It is an open-ended computational geometry problem and we continuously extend geometry capabilities as needed. Currently, the most complex shapes can be constructed relative to a number of well-behaved mathematical surfaces embedded in 3-space. Shapes directly constructible in 3-space are significantly more limited but I expect that to expand over time as well, particularly if there are specific requirements and use cases.
The underlying representation was intended to mirror the spatiotemporal organization of the physical world at a data level. We can generate projections but the data lives in 3-space.
One GIS user/developer I spoke with mentioned a project they had: find the best place in town to build a mall.
This included information such as land elevation, flood areas, soil types, distances to roads and their importance (highway, major motorway, number of lanes), distances to facilities (electricity/gas/water/sewer), population demographics (wealth, buses), etc.
So beyond just measuring distances and polygon coverage, your GIS will have a lot of layers of data, each layer either representing different data in itself or just adding more detail. The query then is to find the best matches given the data, the criteria, and their rankings.
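As a toy illustration of that last step (the layer names, weights, and scores here are made up), the ranking part can be as simple as a weighted sum over normalized per-layer scores:

    # Hypothetical layers: each maps a candidate parcel to a score in [0, 1]
    # (e.g. flood safety, road access, demographics), already normalized.
    layers = {
        "flood_safety": {"parcel_a": 0.9, "parcel_b": 0.3},
        "road_access":  {"parcel_a": 0.4, "parcel_b": 0.8},
        "demographics": {"parcel_a": 0.7, "parcel_b": 0.6},
    }
    weights = {"flood_safety": 0.5, "road_access": 0.3, "demographics": 0.2}

    def rank_sites(layers, weights):
        # Weighted sum of per-layer scores, best candidate first.
        parcels = next(iter(layers.values())).keys()
        scored = {p: sum(weights[name] * layer[p] for name, layer in layers.items())
                  for p in parcels}
        return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

    print(rank_sites(layers, weights))

The hard part, of course, is producing those per-layer scores from the spatial queries in the first place.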
I am aware of PostGIS, the author specifically said PostGIS was unable to keep up. I was wondering about the types of operations that SpaceCurve could do, and keep up, that PostGIS cannot.
Here's one I think is probably 101 level for others here, but what is the difference between a multidimensional scalar and a vector? I'll gratefully read through any reference you can link to.
Whether something is 101-level or a difficult thesis topic often depends on your point of view. Different fields seem to define scalars differently, with the physicists sometimes arguing that the mathematicians have got it all wrong: http://imechanica.org/node/15857
On a related note, I was tremendously impressed with the quality of discussion at imechanica.org. At a glance, it may be the best example of a well-functioning online community that I've ever seen.
Great part:
"They should not call the elements in the number field F scalars. These elements already have names: they are called numbers. And we physicists and engineers need the word scalar to name some other objects. Mathematicians, like physicians, should at least learn to do no harm. "
A vector belongs to a vector space, which implies certain algebraic rules relating it to other vectors in the space (http://en.wikipedia.org/wiki/Vector_space is probably enough reference). A multidimensional scalar is a more general data type, as any or all of the vector axioms can be relaxed or eliminated.
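For reference, the textbook definition (nothing specific to the article): for all u, v, w in V and scalars a, b in F, a vector space requires

    % Vector space axioms:
    \begin{align*}
      & u + (v + w) = (u + v) + w, && u + v = v + u, \\
      & \exists\, 0 \in V : v + 0 = v, && \forall v\ \exists\, (-v) : v + (-v) = 0, \\
      & a(bv) = (ab)v, && 1v = v, \\
      & a(u + v) = au + av, && (a + b)v = av + bv.
    \end{align*}

A multidimensional scalar, e.g. a bare tuple of numbers with no such structure imposed, need not satisfy any of these.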
These tend to be the aggregate flow of pervasive sensor networks, i.e. millions of machines generating this data autonomously. Collected personal location data alone is 6-10 PB/day globally, most of which is currently discarded because few people can handle it. Some sources are just incredibly high-bandwidth platforms, which are more common in industrial and sensor settings. There are more than a couple of operational environments today generating billions of records every second that must be stored (not summarized).
Some of the sexier petabyte-per-day sources include mobile phone telemetry, vehicle telemetry, and countless remote-sensing networks and platforms (of which there are far more than most people imagine). A petabyte per day is not that much data per entity; there are simply a lot of entities, and you can support this data rate in a single rack of commodity hardware.
A more typical generic IoT application at a large company is 1-100 TB/day, but that is quickly creeping upward as it becomes more cost effective to deploy sensors and store/analyze the data. And more often than not, they need to store weeks or months of that data so the overall data set size is still quite large.
How many companies have million-sensor networks? Few, I'd guess. And even at, e.g., 400k records per second with an average record size of 1 KB, you'll generate ~35 TB per day. I just can't imagine there are many situations with more data incoming.
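Quick back-of-envelope for that figure (decimal units):

    records_per_sec = 400_000
    record_size     = 1_000          # ~1 KB per record
    seconds_per_day = 86_400

    tb_per_day = records_per_sec * record_size * seconds_per_day / 1e12
    print(tb_per_day)                # ~34.6 TB/day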
Can you share company names or more concrete examples?
Even UPS only delivers 17M packages/day.
Edit: also, which industries customarily have thousand-vertex polygons? Thanks for the interesting blog post.
The interesting time series are heavy-machinery telemetry feeds (for example, a Rolls-Royce jet engine has its own communications subsystem to phone home to an RR engineering station via satellite, which, however, likely makes it low-bandwidth).
You list hyperdimensional hashing as one of the early methods that you tried without success. While only good for point-to-point distance comparison in its simplest application, geohashes seem to address the scalar/vector problem that you describe. Do you have any experience with geohashes, and if so, what's your opinion of them?
Loved the article. I strongly believe that when building geospatial indices, it is quite important to establish up front the nature and distribution of your data.
Searching for a generic/theoretically all-encompassing solution is quite akin to looking for the perfect pub-sub system for all workloads. :)
My comment was related to lat/lon-indexed data clustered around cities. If you know beforehand that you're going to deal with such a dataset (static or dynamic), you can pragmatically decide on some sort of by-convention partitioning (e.g. partitioning by continent/region, which brings it down to in-memory indices that don't need continuous disk access).
I love the non-Euclidean bit in the piece you wrote. Anyone who has tried to do a k-nearest-neighbor query east of New Zealand will appreciate that bit. :D
If you have a priori knowledge of the data distribution, you can construct an ideal partitioning scheme. In practice, this runs into some significant problems:
- The average distribution of the data and the instantaneous distribution of the data can be very, very different. This means that some cells are overloaded while others are idle, and the whole system runs as slow as the overloaded cell. The canonical (and fairly benign) example is data following the sun.
- Many data sources have inherently unpredictable data distributions.
- Spatial joins across different data sources (say, weather and social media) require congruent partitioning or it won't scale. Unrelated data sources tend to have unrelated data distributions, so this is a problem.
Making your partitions match your data distribution is good practice for static data layers. With some caveats, this can be modified for spatial joins across static data layers as well. For dynamic data sources, you run into issues with data and load skew at scale.
Really? I got downvoted because I find it hard to imagine that people don't know their systems will get more requests from the part of the world where it is daytime?
Interesting article. I'm very new to the field, but I have an exciting application for GIS: I'm a biologist at Harvard and we have a new method to generate biological measurements at atomic scale inside biological specimens, and I believe the best way to store and analyze this data is with a GIS. Consider the example in your paper: "What are the optimal aircraft settings for my aircraft in this weather?" For our data, the questions are more like "What is the molecular spatial architecture of this tumor, and how can that inform treatment?" If this seems interesting to you, it would be great to have a conversation about ways SpaceCurve could advance our project.
The lower bound for reporting the points of intersection is Ω(n log n + k) for n line segments and k intersections, but obviously k can be Θ(n^2) in the worst case.
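For context, the standard results for reporting all k intersections among n segments (not from the article):

    \text{brute force: } O(n^2), \qquad
    \text{Bentley--Ottmann sweep: } O\!\big((n + k)\log n\big), \qquad
    \text{Chazelle--Edelsbrunner (optimal): } O(n \log n + k), \qquad
    k = \Theta(n^2) \text{ in the worst case.}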
This is an OK introduction to the issues of geospatial DBs (a meta meta article as it were) but two interesting (to me) points were made in passing:
> Most software engineers assume today, as I did a decade ago, that the design of scalable geospatial databases is a straightforward task.
You could substitute almost any computing topic for "geospatial databases" and still (sadly) have a perfectly reasonable sentence. Partly that is because our field is so new, and partly it's simple naïveté, which is occasionally powerful but mostly dangerous.
> Algorithms in computer science, with rare exception, leverage properties unique to one-dimensional scalar data models. In other words, data types you can abstractly represent as an integer.
Alas, this is painfully true, and the field of algorithms is so complex that the one-dimensional nature of our exploration is easy to overlook.
I'm not sure why the author is dismissing space-filling curves so quickly, just based on the age of the idea. I've worked with them extensively, in the context of spatial indexing in a database system, in both research and product settings. They address a number of problems described for interval data types. The techniques I've been working with "decompose" a spatial object into a number of regions by recursively partitioning the space, ending up with a small number of "z-values" for each. A z-value is represented by an integer, which is then used as a key in a conventional index, e.g. a sorted array, balanced binary tree, B-tree, skiplist, etc.
Getting a uniform distribution for sharding is easy in one dimension. I'm not sure what kind of scaling the author has in mind, but if you put the z-values in a B-tree, it scales exactly like a B-tree, i.e., extremely well. I agree there is no exploitable ordering of the spatial objects, but the trick is to order the z-values. If a given spatial object gives rise to 4 z-values, then they -- and the spatial object -- appear in 4 places in the ordering. That's fine. Run your search (in one dimension), get qualifying records, and then get rid of duplicates later.
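A minimal sketch of the z-value idea described above (Morton/z-order codes over 2-D integer cell coordinates; the names are mine, not from any particular product):

    def morton_code(x, y, bits=32):
        # Interleave the bits of two cell coordinates into one integer key.
        # Nearby cells tend to get nearby codes, so the codes can live in any
        # ordered structure (sorted array, B-tree, skiplist) and be range-scanned.
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i)
            z |= ((y >> i) & 1) << (2 * i + 1)
        return z

    def z_values_for_bbox(xmin, ymin, xmax, ymax):
        # Crude decomposition: one z-value per covered cell of a bounding box.
        # A real implementation recursively merges cells into a handful of
        # larger quadtree tiles instead of enumerating every cell.
        return sorted(morton_code(x, y)
                      for x in range(xmin, xmax + 1)
                      for y in range(ymin, ymax + 1))

    # Query side: range-scan the sorted codes for each tile covering the search
    # region, collect candidate object ids, de-duplicate, then refine exactly.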
The point was really that space-filling curve indexes do not solve the underlying issues with polygon indexing; their age was not a factor. That type of indexing is so old that the patents (yes, they were patented by the big database companies) have long since expired. And yet they are rarely used commercially, and there is good reason for that. There is a resurgence in their usage by people new to spatial indexing, but it is pretty obvious that few implementors are aware of the extent of the prior computer science, which goes back into the 1960s.
The assumption of size was "large-scale", which in this day and age usually means massively distributed. A B-tree is not a solution to that problem domain.
> The assumption of size was "large-scale", which in this day and age usually means massively distributed. A B-tree is not a solution to that problem domain.
Another thought on this one: space-filling curves decompose the problem nicely. Scale out your favorite ordered key/value store, and then layer spatial indexing on top. No need to develop a new data structure that both scales out and handles spatial data.
I'm still not sure what problems of polygon indexing are not solved by space-filling curves.
About commercial usage:
- Oracle uses R-trees now, I think. Before that, they had what I thought was a bad implementation of something sort of grid-like. Some notes I wrote a few years ago on their "hh-encoded" index:

Oracle 9 Spatial offered both R-tree and quadtree indexes (http://download-uk.oracle.com/docs/cd/B10501_01/appdev.920/a96630/sdo_intro.htm#i877656). The generation of index terms for a spatial object is called "tessellation". They have two approaches to tessellation: 1) fixed-size tiles, in which the space is partitioned to a fixed resolution and the covering cells give rise to index terms; 2) variable-size tiles, in which spatial decomposition stops when a configurable number of tiles has been generated for the spatial object being indexed.

Now here's where it gets strange. An Oracle quadtree index is either pure-fixed or it is hybrid. Hybrid is a fixed index AND a variable index. The docs recommend not using hybrid. I could not find an explanation of why they even offer a hybrid option (as opposed to pure-variable).

This Oracle 9 doc is from 2002. I also found a paper from around that time comparing the performance of R-trees and fixed-tile indexes (http://www.dpi.inpe.br/cursos/ser303/oracle_r_tree.pdf). That paper concludes that R-trees are usually best, although it identifies situations in which quadtrees win. Oracle 11 appears to have dropped quadtrees altogether (http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28400/sdo_intro.htm#i877656).

A pure variable-sized tile option was never offered in their product, and was not examined in the 2002 performance comparison. As that paper points out, tuning fixed-size tiles is tricky. If the tile size is too small, you get a large number of index entries per spatial object; they take space, I/O bandwidth to read, CPU time to process, etc. If the tile size is too large, you run the risk of poor accuracy, retrieving more objects than necessary, again generating more I/O and CPU. And if your data set has objects of various sizes, there is no single good tile size.

- Microsoft is less clear to me. From some of the tuning options on their spatial indexes, it sounds like a fixed-size tile approach with tunable resolution. But I could be wrong, and I haven't looked into it in years.
- PostGIS uses R-trees. The problem with this approach is that it's a separate implementation from their main B-tree, so it always lags behind. With space-filling curves, you just use your B-tree, and improvements to it automatically apply to your spatial indexes.
- MySQL also uses R-trees, and only in conjunction with MyISAM. So that's pretty bad.
- I know less about distributed spatial indexes, but I would think that space-filling curves are more amenable to distributed structures because they don't rely on a tree to organize the space. You can partition the 1-d space of z-values to load balance the shards.
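That last bit is essentially 1-D range partitioning; a rough sketch, with made-up shard boundaries standing in for sampled key quantiles:

    import bisect

    # Split points in z-value space, e.g. chosen from sampled key quantiles so
    # each shard holds roughly the same number of index entries (hypothetical).
    split_points = [2**20, 2**21, 3 * 2**20]

    def shard_for(z_value, splits=split_points):
        # Route a z-value (and its record) to a shard by 1-D range partitioning.
        return bisect.bisect_right(splits, z_value)

    # A query region decomposes into a few z-value ranges; each range touches
    # only the shards whose key ranges it overlaps, so the fan-out stays small.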
Algorithms for spatially indexing points are tractable - see k-d trees etc. R-trees seem pretty lame to be honest, like an ad-hoc data structure to partly meet a number of requirements.
Of course, some kinds of indexing and searching are inherently harder than others. But I don't see why one couldn't program any needed spatial algorithm directly from just an index of all points in the system.
For example, suppose you have an index of triangles. If you want to find all triangles intersecting triangle (x, y, z), you could iteratively expand a near-neighbor search from each point. Every point you find that is closer than the given triangle point's furthest point is a potential neighbor and can be checked reasonably quickly. This gives the set S of triangles intersecting a given triangle in log^n(size(S)), making the process relatively scalable.
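Something like this sketch of the candidate-search step (it uses a k-d tree over triangle centroids rather than per-point expansion, assumes SciPy is available, and leaves the exact triangle-triangle test to a geometry library):

    from scipy.spatial import cKDTree   # assumes SciPy

    def centroid(tri):
        return (sum(p[0] for p in tri) / 3.0, sum(p[1] for p in tri) / 3.0)

    def radius(tri):
        # Distance from the centroid to the furthest vertex.
        cx, cy = centroid(tri)
        return max(((p[0] - cx) ** 2 + (p[1] - cy) ** 2) ** 0.5 for p in tri)

    def candidate_triangles(query_tri, all_tris):
        # Index every triangle by its centroid. Any triangle intersecting the
        # query must have its centroid within r_query + r_other of the query
        # centroid, so searching with the largest r_other gives a safe superset.
        tree = cKDTree([centroid(t) for t in all_tris])
        search_r = radius(query_tri) + max(radius(t) for t in all_tris)
        hits = tree.query_ball_point(centroid(query_tri), search_r)
        return [all_tris[i] for i in hits]   # run exact intersection tests on these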
And there's more:
"Storing objects in a space-partitioning data structure (kd-tree or BSP for example) makes it easy and fast to perform certain kinds of geometry queries – for example in determining whether a ray intersects an object, space partitioning can reduce the number of intersection test to just a few per primary ray, yielding a logarithmic time complexity with respect to the number of polygons."
Lends credence to my argument that a polygon-polygon intersection should be obtainable in log^n or maybe even log time.
The problem is less with the mathematics than with efficient storage and retrieval of a large number of points (i.e. when the index won't fit in memory) from disk. Doing a search across a tree stored on disk is really slow.
Even with mathematically sound storage trees as indexes, locations which are close in space could easily be split between two arms of the tree, making proximity searches quite expensive - you would have to traverse the entire tree to ensure that you do not miss data separated by a poor splitting plane.
There are a number of theoretically efficient solutions to this, but they all become remarkably inefficient when you add in a hard drive.
I think if you're going to be doing sufficiently 'biggish data' with those kinds of retrieval characteristics, you've got to just give up on disk and use flash only.
Flash solves the "added overhead for random access" portion of the problem, but the overhead for any access at all is still remarkably high, particularly when compared to main memory.
Perhaps when we get more "flash connected to the main bus" options (there are a few commercial options which sit in PCI Express, AGP, and even DRAM slots), we can begin discounting the overhead more and more, but we're not there quite yet.
Well, supposedly, cache-oblivious algorithms exist to deal with this kind of thing - by grouping related data together regardless of exactly what cache-scale you are at.
My experience playing around with KD trees is that they are super effective for multidimensional indexing (including spatial) and range search so long as the data is relatively static. The difficulty comes when trying to update and re-balance them dynamically, which is where other structures perform better.
A possible solution to the firehose problem is stream databases; the catchphrase is "rather than data waiting for queries, it is queries waiting for data".
The people behind Truviso (http://en.wikipedia.org/wiki/Truviso, subsequently bought out by Cisco) had an implementation of stream processing; an overview is here:
We looked at Truviso when we had to solve video-player analytics, and it seemed to us to solve that problem. I imagine a tree structure of stream queries, where the leaves are the data-gathering points and summaries are aggregated up the tree, would address the scaling problem.
The problem with stream processing is that spatiotemporal analytics are rarely summarizations. Typically, the only aggregates that are computed at ingest are path, vector, and polygon features (e.g. an entity path reconstructed from samples). This is not the real-time analysis part of the application, just the ingest processing, and these features must be continuously indexed and online.
Most spatiotemporal analytics compare events and entity behavior in the present with the past at a pretty fine-grained level, which is why summarizations and in-memory shortcuts do not work that well in practice. Your working set is frequently measured in days to months of data for common use cases, limited primarily by the storage budget.
In short, spatiotemporal analytics requires fully online indexing of the data in the stream, and a pretty big index at that for most apps.
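To make the ingest-time feature concrete, here is the simplest possible version of "an entity path reconstructed from samples" (hypothetical record layout; real ingest does this incrementally, not in one batch):

    from collections import defaultdict

    def build_paths(samples):
        # samples: iterable of (entity_id, timestamp, lon, lat).
        # Group by entity and order by time to get a per-entity polyline,
        # which is then indexed alongside the raw samples.
        grouped = defaultdict(list)
        for entity_id, ts, lon, lat in samples:
            grouped[entity_id].append((ts, lon, lat))
        return {entity: [(lon, lat) for _, lon, lat in sorted(points)]
                for entity, points in grouped.items()}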
I'd love to hear some opinions (from the author?) on how well the current offerings of the major DBMS vendors (e.g. Microsoft, Oracle, ...) handle geospatial data. How well do they scale, and do they have inherent limitations?
I respect the author's experience, but "Database Engines Cannot Handle Real-Time Geospatial" is probably not true. We (MemSQL) have developed geospatial support on top of our existing skiplist indexes, and can deliver thousands of intersection queries per second, over billions of points, in milliseconds per query, on very modest hardware. If you want more performance, just add more hardware.
We previewed our geo code at the Esri Developer's Conference last week. The demo delivers queries in millisecond time over 170M records without any tuning. The naive version worked fine.
With all due respect, 170M points is a trivial data model, it fits on a thumb drive. That is seconds worth of data in IoT; you can brute-force the implementation and look great. Several well-known companies do. It just doesn't scale to anything real.
The location analytics data models I need to support, e.g. mobile telecom, have trillions of polygons. Indexing millions of records per second continuously concurrent with sub-second queries is pretty normal for IoT, and easy to support with good computer science. (Location is almost never described as a point by primary sources but as polygons; these are converted to centroid approximations for platforms incapable of doing analysis on the source.)
While I do not know the details of your implementation, I do know that indexing complex geometries using a skiplist index won't achieve the necessary volume or velocity required for IoT-like applications. If you restrict yourself to point data, people have been building extremely fast, extremely large quad-trees for years; you just can't scale high-value analytics with them, which limits their utility.
Fair enough; though I'm intrigued by the mention of trillions of polygons. How complex are they? 10 points? 100? 1000? What benefit does the raw data give that centroids can't?
To illustrate how tiny 170M records is: I checked the count of shapes in a dataset, from a side business of mine, supporting a couple of First Nations in a single Canadian province. It adds up to 38 million shapes, ranging from points to polygons consisting of millions of points per polygon. I don't have the patience to extract the full number of points, but it's certainly well in excess of 170M, more likely in the billions.
Consider that this is a dataset for organisations that have someone doing GIS part-time a few hours a month, and that it is the type of dataset often manipulated in desktop GIS apps.
Also, doing point intersections with polygons is about the lowest you could possibly set the bar, and it is quite easy to optimise. E.g. you can scale it "naively" to arbitrary size by sharding the point data and segmenting whatever polygon you want to intersect with.
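(The per-shard test itself is just the classic even-odd ray cast; a minimal version for a simple polygon, as a sketch:)

    def point_in_polygon(pt, polygon):
        # Even-odd (ray casting) test against a simple polygon given as a list
        # of (x, y) vertices. In the naive sharded setup described above, each
        # shard runs this over its own slice of the points.
        x, y = pt
        inside = False
        n = len(polygon)
        for i in range(n):
            x1, y1 = polygon[i]
            x2, y2 = polygon[(i + 1) % n]
            if (y1 > y) != (y2 > y):
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x < x_cross:
                    inside = not inside
        return inside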
A more realistic test of how well you're doing (and the type of queries typically run against the above mentioned dataset) would be full polygon to polygon intersections.
What sorts of areas are being modeled with millions of points?
(I'm just curious and realize it may not be appropriate for you to answer, or that you might not want to take the time, or whatever, so please feel free to ignore)
Several Canadian provinces have vast forest management areas with hugely intricate rights (e.g. Alberta has dozens of First Nations, dozens of oil companies, and dozens of forestry companies, all with different rights to the same overlapping areas), and many of them have huge datasets that need to be coordinated. E.g. years back, when we started doing this app, I visited one forest management company that was doing logging in an area the size of Ireland.
Most of the polygons will be smaller, but there are any number of objects mapping out boundaries of massive areas to a ridiculous level of detail. More often the datasets will map out large numbers of moderate-size (tens to thousands of points) polygons: anything from buildings to traditional hunting trails, to oil and gas pipelines or mineral deposits.
Many of the huge polygons could be simplified just fine without any noticeable loss of precision, but that's not always a decision anyone is willing to take responsibility for, so the system needs to work on whatever overly detailed data is passed in (the customer knows best, and all that). Often, though, the right answer for indexing this kind of data is just that: simplify the outlines however drastically works for your application, with the most extreme simplification being the bounding box; intersect that, discard all the misses, and do the expensive full intersect on what's left.
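That filter-then-refine pattern, reduced to its crudest form (bounding boxes only; the surviving candidates go on to an exact polygon-polygon intersection in whatever geometry library you use):

    def bbox(polygon):
        # Axis-aligned bounding box of a list of (x, y) vertices.
        xs = [p[0] for p in polygon]
        ys = [p[1] for p in polygon]
        return min(xs), min(ys), max(xs), max(ys)

    def bbox_overlap(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

    def candidates(query_poly, layer):
        # Filter step: keep only shapes whose bounding box touches the query's;
        # the vast majority of misses are discarded before the expensive test.
        qb = bbox(query_poly)
        return [shape for shape in layer if bbox_overlap(qb, bbox(shape))]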