[xquery-talk] the sad state of query languages for semi-structured data in the NoSQL industry

Thu May 28 22:33:50 PDT 2015

Another message that I sent this morning, and it didn't make it through.....until now.

Thanks, MarkLogic, for opening up the blockade.

I guess the MarkLogic lawyers needed a little bit of time to scratch their heads about what to do.....(and BTW,
silencing me isn't a solution... I lived in a communist country for 22 years... they've tried that ... didn't work)

But the following message is a serious discussion about the state of affairs in the query languages universe for NoSQL
databases.

> On May 28, 2015, at 2:20 PM, daniela florescu <dflorescu at me.com> wrote:
> 
> The NoSQL industry is extremely successful, used everywhere, and considered by many the child prodigy of the database industry.
> 
> 
> They are proud of themselves because they satisfy user needs, namely, they store data:
> (a) which is not in 1st normal form (i.e., nested, pre-aggregated)
> (b) without schema
> 
> …to the practical benefit of:
> (a) the application getting the data out of the database exactly as the application needs it, and not
> altered through a normalization phase.
> (b) the lack of fixed schema helps with data flexibility… things change extremely quickly inside an application
> these days (fields being added, deleted, changed, etc.)
> 
> 
> So far so good, and I think up to this point they are right.
> 
> [[ One may think that this looks a little bit like … XML, but hey, they don’t like XML. Fine.]]
> 
> The problems come when they try to QUERY this data.
> 
> 
> The NoSQL industry is re-inventing the wheel from scratch, and in a very chaotic and ad-hoc manner.
> 
> Just look at the sad state of affairs in terms of query languages and their semantics.
> 
> I am just looking at the ones who claim that they can store nested and schema-less data (JSON-like or XML-like)
> 
> (1) MongoDB
> http://docs.mongodb.org/manual/tutorial/query-documents/
> 
> Note: pure JSON. Couldn’t find a simple sort, for example. Etc. Etc.
> 
> (2) Cassandra/DataStax
> http://www.datastax.com/wp-content/uploads/2013/03/cql_3_ref_card.pdf
> 
> Note: not even an OR, or a NOT. And what does it mean to sort on schema-less data?
> 
> (3) Spark/DataBricks
> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
> 
> Note: sounds more like an import/export facility… but they call it a JSON Query language
> 
> (4) Elasticsearch
> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html
> 
> Note: very sophisticated full text, but no structured search of any serious kind. Just some simple aggregates (sum, etc.)
> 
> 
> (5) MuleSoft
> https://www.mulesoft.com/press-center/new-release-june-2015?utm_source=linkedin&sthash.axJqiSBn.mjjo
> 
> Note: not only do they seem to have their own JSON query language, but even their own XML query language, it seems. Couldn’t find more details.
> 
> (6) Hive
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> 
> Note: multiple languages (XPath, some JSON, some SQL, glued together somehow chaotically)
> 
> I can fill in tons of pages with YET-ANOTHER-LANGUAGE-LIKE-THIS.
> 
> (7) MarkLogic
> 
> https://docs.marklogic.com/8.0/guide/app-dev/json
> 
> 
> 
> ==============
> 
> Now I can spot several mistakes here:
> 
> 1. None of those query languages has a clearly designed, mathematical data model. In the absence of such a data model, one that describes the input, the output
> and the intermediate results of a query, how can we define a clean semantics?
> 
> 2. All of them have a hacky semantics, a “let’s run it and we’ll see what the result is” kind of thing. The semantics in most corner cases -- and by definition
> semi-structured data is ONLY corner cases -- is not defined.
> 
> 3. Some try to piggyback on the SQL semantics, ignoring the fact that SQL was designed to work on relations, and JSON (or in general, nested data)
> has nothing to do with relations.  SQL semantics cannot be “ported”….just because we reuse the same keywords.
> 
> 4. None attempted to define a type system (even a basic one for atomic types like dates, and arithmetic on them..) and a schema language.
> 
> ==============
> 
> 
> Now maybe it’s clear why I am so sad. The younger and naive NoSQL community still believes that
> SQL is “good enough”, and that using the SELECT-FROM-WHERE keywords is the magic bullet to define the semantics of any kind of query language. Instead of trying to help it, the XQuery community
> is still looking at its own navel, and marveling, like the well known CEO: "we can handle flexible data" !!!
> 
> Just compare those languages I listed above with the work that has been done in the past 16 years in XQuery, and the correctness and the complexity of the result
> vs. the hacky solutions above.
> 
> P.S. And yes, that work from XQuery was used 100% in the design of JSONiq, which was designed with a dual goal in mind:
> (a) reuse 100% of the experience of design and implementation of XQuery and
> (b) provide a query language that is syntactically and semantically acceptable for the JSON community.
> 
> Whether we succeeded or not is another story, but I am not aware of any other solution that even comes CLOSE to that goal.
> 
> 
> Best regards
> Dana
> 
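
To make point 2 and the sorting question under (2) concrete, here is a minimal Python sketch. It is not taken from any of the products listed above; the documents, the `price` field, and the ordering rule (missing < numbers < strings) are all made up for illustration. It only shows that "sort" on schema-less data has no meaning until somebody defines what happens for missing fields and mixed types.

```python
# Hypothetical schema-less documents: "price" is a number, a string, or absent.
docs = [
    {"id": 1, "price": 10},
    {"id": 2, "price": "9.99"},
    {"id": 3},
    {"id": 4, "price": 2.5},
]


def naive_sort(items):
    """Sort on the raw field value; return None when no ordering exists."""
    try:
        return sorted(items, key=lambda d: d["price"])
    except (KeyError, TypeError):
        # KeyError: the field is missing. TypeError: e.g. int vs. str.
        return None


def sort_key(doc):
    """One explicit (assumed) rule: missing < numbers < strings."""
    value = doc.get("price")
    if value is None:
        return (0, 0, "")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return (1, value, "")
    return (2, 0, str(value))


def explicit_sort(items):
    """Sort with the rule above, so every corner case has a defined place."""
    return sorted(items, key=sort_key)


print(naive_sort(docs))                        # no defined ordering
print([d["id"] for d in explicit_sort(docs)])  # ordering under the assumed rule
```

Any other rule (strings first, missing last, numeric strings coerced to numbers) would be just as defensible, which is the point: without a data model that fixes these choices, each engine ends up with its own answer.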
