I always wonder how readable those queries really are. It's a nice claim, but has anyone done any serious research on this?
For example, the same in SPARQL 1.1:
MATCH (cypher:QueryLanguage)-[:QUERIES]->(graphs)
MATCH (cypher)<-[:USES]-(u:User) WHERE u.name IN ['Oracle', 'Apache Spark', 'Tableau', 'Structr']
MATCH (openCypher)-[:MAKES_AVAILABLE]->(cypher)
RETURN cypher.attributes
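For reference, here is one way that Cypher query might be rendered in SPARQL 1.1. This is a sketch only: the base URI and the predicate IRIs are assumptions made to mirror the Cypher relationship types, not a published vocabulary.

```
BASE <http://example.org/opencypher/>

SELECT ?attributes
WHERE {
    ?cypher a <QueryLanguage> ;
            <QUERIES> ?graphs ;
            <attributes> ?attributes .
    # Note the forward direction on USES: user -> cypher
    ?user <USES> ?cypher ;
          <name> ?name .
    <openCypher> <MAKES_AVAILABLE> ?cypher .
    VALUES ?name { 'Oracle' 'Apache Spark' 'Tableau' 'Structr' }
}
```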
In the SPARQL case the graph flow does not reverse direction on the <USES> edge (it could, using `?cypher ^<USES> ?user`, but that would be unusual and, in larger queries, very confusing). The SPARQL version also tends to group related concepts together.
This assumes a default BASE URI is selected for the SPARQL version, one that covers all the modeled relations; in a straight comparison with Cypher, that is fair.
I find Gremlin a lot nicer than Cypher, and a lot more powerful as well. Also, to date, Neo4J just has not scaled all that well; I am awaiting the LDBC benchmark results for Neo4J to see if I am wrong.
What Neo4J has been great at is making a nice, solid product aimed at solving developer problems. As a database, I believe it has not been as good at solving enterprise or life-science community problems: it is still a single database instance without federation on demand.
Yeah, I actually wrote that query just as an example to show what Cypher looks like and some fun around the openCypher announcement. Wasn't expecting it to end up on HN, let alone have Marko convert it to Gremlin...
I disagree with you on the statement that Neo4j does not work well in life sciences. I am a data scientist building large scale systems for mining genomic data, and we built a fairly critical piece of that infrastructure around Neo4j. I actually presented an overview of that work at GraphConnect this week:
Many meaningful lineages in life sciences can be hundreds to thousands of levels deep (our datasets are great examples). Neo4j is the only graph database I have evaluated that handles traversals across lineages of this depth while still delivering the performance scalability promised by index-free adjacency, regardless of which node in the cluster a traversal is sent to.
I am just going to point to our work at sparql.uniprot.org.
A graph database with 17 billion edges and 3+ billion nodes, containing the NCBI taxonomy and GO trees in their entirety, which you can access for free over HTTP using standard SPARQL 1.1.
This does not run on a cluster but on single nodes with Virtuoso 7.2.1.
I am not saying that Neo4J is a bad choice; I am saying that, due to its lack of federation support, it is an expensive choice for the life sciences. That is an economic argument rather than a technical one, looking not at one project at a time but at the community in general. Neo4J and Cypher will never support federation in the way that SPARQL allows. All the URI business in RDF is annoying when modelling your data, but it is critical when merging datasets on demand between separate databases, e.g. joining ChEMBL & UniProt & MeSH & PubChem etc.
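The on-demand merging mentioned above is what SPARQL 1.1's SERVICE keyword provides. A minimal sketch: the UniProt endpoint is real, but the second data source and its predicate are placeholders for illustration.

```
PREFIX up: <http://purl.uniprot.org/core/>

SELECT ?protein ?compound
WHERE {
    # Part of the pattern is evaluated remotely at the UniProt endpoint...
    SERVICE <https://sparql.uniprot.org/sparql> {
        ?protein a up:Protein .
    }
    # ...and joined on shared URIs with data in the local store
    # (<http://example.org/targets> is a hypothetical predicate).
    ?compound <http://example.org/targets> ?protein .
}
```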
We in the life sciences rarely do graph traversals for traversal's sake; we tend to join trees, e.g. intersecting a branch of a taxonomic tree with a branch of the GO tree. There are cases where real graph traversals are done (assembly & variation graphs).
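In SPARQL, that kind of tree join is a pair of property paths over the two hierarchies. A sketch, assuming rdfs:subClassOf hierarchies; the root IRIs and the two linking predicates are illustrative, not actual UniProt vocabulary.

```
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein
WHERE {
    # Branch of the taxonomic tree: everything at or below a chosen taxon
    ?taxon rdfs:subClassOf* <http://example.org/taxon/Mammalia> .
    # Branch of the GO tree: everything at or below a chosen GO term
    ?goTerm rdfs:subClassOf* <http://example.org/go/GO_0006915> .
    # Intersect the two branches via entities linked to both
    # (both predicates here are hypothetical)
    ?protein <http://example.org/organism> ?taxon ;
             <http://example.org/classifiedWith> ?goTerm .
}
```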
OpenCypher is a great step forward. Now Neo4J needs an open, public standard for serializing graphs to disk that can be imported into Neo4J and other databases. RDF being supported by so many different databases allows us (at UniProt) to support many more of our users, even if they don't use SPARQL or our choice of graph database themselves.
There's nothing stopping you from flipping that relationship order though, or making that pattern more compact. In fact, I'd prefer an overall reversed order, something like:
MATCH (openCypher)-[:MAKES_AVAILABLE]->(cypher:QueryLanguage)-[:QUERIES]->(graphs),
(u:User)-[:USES]->(cypher)
WHERE u.name IN ['Oracle', 'Apache Spark', 'Tableau', 'Structr']
RETURN cypher.attributes
I guess, in this particular case, it's a subjective preference which language you feel expresses the query pattern most legibly. I certainly prefer the visual approach of Cypher.
I like Cypher, then Gremlin, for small queries. The problem is that the queries I see are much, much larger, and at about 10 lines in, SPARQL's use of whitespace starts to make a real difference in readability, in my opinion.
That may of course also be affected by my slight reading disability, where the shape of words matters; that shape can be disturbed by the connecting sigils in both Gremlin and Cypher. So I understand that my preference might not hold for the whole population :)
Does anyone know whether the semantics available in Cypher would be practical to use when querying a distributed graph database? Or is it useful to have a closer to the metal implementation considering the distributed system tradeoffs?