[xquery-talk] the sad state of query languages for semi-structured data in the NoSQL industry

daniela florescu dflorescu at me.com
Thu May 28 14:20:10 PDT 2015

The NoSQL industry is extremely successful, used everywhere, and considered by many the child prodigy of the database industry.

They are proud of themselves because they satisfy user needs, i.e. they store data:
(a) which is not in 1st normal form (aka nested, pre-aggregated)
(b) without schema

…to the practical  benefit of:
(a) the application getting the data out of the database exactly as the application needs it, and not 
altered through a normalization phase.
(b) the lack of a fixed schema helps with data flexibility… things change extremely quickly inside an application
these days (fields being added, deleted, changed, etc.)
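
To make the two benefits concrete, here is a minimal Python sketch. The order document, its field names, and the `reassemble` helper are all invented for illustration and do not come from any particular product.

```python
import json

# Hypothetical nested document, in the shape the application wants it.
order_doc = {
    "order_id": 1,
    "customer": "alice",
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
}

# The same information in first normal form: two flat "tables".
orders = [{"order_id": 1, "customer": "alice"}]
order_items = [
    {"order_id": 1, "sku": "A1", "qty": 2},
    {"order_id": 1, "sku": "B7", "qty": 1},
]


def reassemble(order_id):
    """Rebuild the nested shape from the normalized rows (a join plus a group)."""
    order = next(o for o in orders if o["order_id"] == order_id)
    items = [
        {"sku": i["sku"], "qty": i["qty"]}
        for i in order_items
        if i["order_id"] == order_id
    ]
    return {**order, "items": items}


# (a) With normalized storage the application does this reassembly itself.
assert reassemble(1) == order_doc

# (b) Without a fixed schema, one document can simply gain a field.
order_doc_v2 = {**order_doc, "gift_note": "happy birthday"}
print(json.dumps(order_doc_v2, indent=2))
```

In the nested case the document comes back as stored; in the normalized case the join and regrouping are the application's job, and adding `gift_note` would mean altering a table.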

So far so good, and I think up to this point they are right.

[[ One may think that this looks a little bit like … XML, but hey, they don’t like XML. Fine.]]

The problems come when they try to QUERY this data.

The NoSQL industry is re-inventing the wheel from scratch, and in a very chaotic and ad-hoc manner.

Just look at the sad state of affairs in terms of query languages and their semantics.

I am just looking at the ones who claim that they can store nested and schema-less data (JSON-like or XML-like).

(1) MongoDB
http://docs.mongodb.org/manual/tutorial/query-documents/

Note: pure JSON. Couldn’t find a simple sort, for example. Etc. Etc.

(2) Cassandra/DataStax
http://www.datastax.com/wp-content/uploads/2013/03/cql_3_ref_card.pdf

Note: not even an OR, or a NOT. And what does it mean to sort on schema-less data?

(3) Spark/DataBricks
https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

Note: sounds more like an import/export facility… but they call it a JSON Query language

(4) Elasticsearch
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html

Note: very sophisticated full text, but no structured search of any serious kind. Just some simple aggregates (sum, etc.)

(5) Mulesoft
https://www.mulesoft.com/press-center/new-release-june-2015?utm_source=linkedin&sthash.axJqiSBn.mjjo

Note: not only do they seem to have their own JSON query language, but even their own XML query language, it seems. Couldn’t find more details.

(6) Hive
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Note: multiple languages (XPath, some JSON, some SQL, glued together somewhat chaotically)

I can fill tons of pages with YET-ANOTHER-LANGUAGE-LIKE-THIS.

(7) MarkLogic



Now I can spot several mistakes here:

1. None of those query languages has a clearly designed, mathematical data model. In the absence of such a data model, one that describes the input, the output
and the intermediate results of a query, how can we define a clean semantics?

2. All of them have a hacky semantics, a “let’s run it and we’ll see what the result is” kind of thing. The semantics in most corner cases (and by definition
semi-structured data is ONLY corner cases) is not defined.

3. Some try to piggyback on the SQL semantics, ignoring the fact that SQL was designed to work on relations, and JSON (or in general, nested data)
has nothing to do with relations. SQL semantics cannot be “ported”… just because we reuse the same keywords.

4. None attempted to define a type system (even a basic one for atomic types like dates, and arithmetic on them) or a schema language.
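
As a small illustration of point 2, here is a hedged Python sketch of what happens when a sort meets schema-less data. The `people` collection and both functions are hypothetical, and the ordering rule in `sort_with_explicit_rule` is just one arbitrary choice, not the semantics of any product or standard.

```python
# Hypothetical schema-less collection: "age" is a number, a string, or missing.
people = [
    {"name": "ann", "age": 31},
    {"name": "bob", "age": "unknown"},
    {"name": "cy"},
]


def naive_sort(docs):
    """Sort by "age" with no rule for mixed types or missing fields."""
    return sorted(docs, key=lambda d: d.get("age"))


def sort_with_explicit_rule(docs):
    """Sort under one stated rule (an assumption made for this sketch):
    numbers first in numeric order, then strings, then missing values."""
    def key(d):
        v = d.get("age")
        if isinstance(v, (int, float)):
            return (0, v, "")
        if isinstance(v, str):
            return (1, 0, v)
        return (2, 0, "")
    return sorted(docs, key=key)


try:
    naive_sort(people)
    outcome = "sorted"
except TypeError as exc:
    # Python refuses to compare an int with a str or with None.
    outcome = type(exc).__name__

print(outcome)  # TypeError
print([d["name"] for d in sort_with_explicit_rule(people)])  # ['ann', 'bob', 'cy']
```

The point is not the particular rule but that some rule has to be written down: without a data model that says how numbers, strings and absent values compare, the result is whatever the implementation happens to do.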


Now maybe it’s clear why I am so sad. The younger and naive NoSQL community still believes that SQL is “good enough”, and that using
the SELECT-FROM-WHERE keywords is the magic bullet to define the semantics of any kind of query language. Instead of trying to help it, the XQuery community
is still looking at its own navel and marveling, like the well-known CEO: "we can handle flexible data" !!!

Just compare the languages I listed above with the work that has been done in the past 16 years in XQuery, and the correctness and the complexity of the result
vs. the hacky solutions above.

P.S. And yes, that work from XQuery was used 100% in the design of JSONiq, which was designed with a dual goal in mind:
(a) reuse 100% of the experience of design and implementation of XQuery and
(b) provide a query language that is syntactically and semantically acceptable for the JSON community.

Whether we succeeded or not is another story, but I am not aware of any other solution that even comes CLOSE to that goal.

Best regards
