[xquery-talk] [xml-dev] XML to graph

Ihe Onwuka ihe.onwuka at gmail.com
Wed Jul 1 15:02:11 PDT 2015

Hmmm, I wonder whether this would have worked on the scraped ratings data
that I had to clean. Well, I did that with XPath in XSLT; I might take a look
and see.

I have 3 different movie data sets from different sources: the one I just
posted, containing 3m movies; one with 180k movies (created with JSONiq
running against Freebase); and another with about 50k movies. I have managed
to cast all of them in XML.

So there is plenty of data to experiment with.

I look forward to your trip, I'll be around for a few months myself.

On Wed, Jul 1, 2015 at 2:59 PM, daniela florescu <dflorescu at me.com> wrote:

> Ihe,
> transforming XQuery to be able to do data cleaning has been a LONG desire
> of mine.
> Helena Galhardas was a PhD student of mine. She is now a professor in
> Lisbon.
> She and her students wrote the data cleaning package in Zorba; it’s 100%
> clean XQuery,
> so you can reuse it for other engines.
> Let me know how it goes.
> On the 7th I am leaving for Europe for 3-4 months.
> I will certainly visit London often.
> Hope we can talk, best
> Dana
> On Jul 1, 2015, at 11:54 AM, daniela florescu <dflorescu at me.com> wrote:
> Ihe,
> before you load anything anywhere, you need to do data cleaning on this
> data
> if you do integration from the Web and the data has no unique ids...
> In particular, entity resolution...
> Literature is full of data cleaning and entity resolution algorithms.
> One that you will find familiar (because it looks very much like XQuery
> :-) is here:
> http://www.inesc-id.pt/ficheiros/publicacoes/1259.pdf
> Best regards
> Dana
> On Jul 1, 2015, at 10:04 AM, Ihe Onwuka <ihe.onwuka at gmail.com> wrote:
> You will note that the data doesn't have a unique id. Title certainly
> isn't unique, if you consider how many movies there have been called Batman
> or Treasure Island.
> Now I may encounter data about this movie from another source that covers
> different facets, for example its box office takings or movie reviews.
> So it's a classic semantic web application. I want to amalgamate disparate
> data about the same fact in one entity. As I said I have a transformation
> that does this but it doesn't scale very well because I have to search the
> entire movie base to find the best match. To overcome this I have to adopt
> a MapReduce-ish approach to solve the problem.
> The thinking is a graphical representation would eliminate that problem
> because a graph gives me a persistent data structure already indexed for
> retrieval via several different axes, whereas indexes constructed in the
> XSLT transformation for the same purpose are ephemeral and would need to
> be reconstructed every time you ran the transformation.
> On Wed, Jul 1, 2015 at 12:46 PM, Peter Hunsberger <
> peter.hunsberger at gmail.com> wrote:
>> Should be pretty straightforward to import that into Neo4j or Titan.
>> Neo might be simplest, in particular via conversion of the data into JSON.
>> However, Titan might give you other capabilities such as using Hadoop-type
>> processing either for import or for subsequent analytics. Without knowing
>> more about the business requirements, I can't really give you much more than
>> that...
>> Peter Hunsberger
>> On Wed, Jul 1, 2015 at 11:32 AM, Ihe Onwuka <ihe.onwuka at gmail.com> wrote:
>>> I would like to convert the XML snippet below to a multi-relational
>>> graph representation.
>>> One way is to transform to a triple store via RDF. Another, which I am less
>>> familiar with, is to transform to GraphML followed by an import
>>> into some graph database tool.
>>> The graphical representation is desirable for processing rather than
>>> visualization reasons. Chiefly, I have a matching algorithm implemented in
>>> XSLT which works fine but doesn't scale well, a problem that I think can be
>>> solved with a graphical representation.
>>> I am keen to hear from my elders and betters on the subject.
>>> <movie title="20000 lieues sous les mers">
>>> <actors>
>>> <person name="Méliès, Georges"/>
>>> </actors>
>>> <alias>
>>> <title title="20,000 Leagues Under the Sea " year="1907"/>
>>> <title title="Amid the Workings of the Deep " year="1907"/>
>>> <title title="Deux cent mille lieues sous les mers " year="1907"/>
>>> <title title="Le cauchemar d'un pêcheur " year="1907"/>
>>> <title title="Under the Seas " year="1907"/>
>>> </alias>
>>> <directors>
>>> <person name="Méliès, Georges"/>
>>> </directors>
>>> <genres>
>>> <tag name="adventure"/>
>>> <tag name="fantasy"/>
>>> <tag name="sci-fi"/>
>>> <tag name="short"/>
>>> </genres>
>>> <keywords>
>>> <tag name="based-on-novel"/>
>>> <tag name="dream"/>
>>> <tag name="fish"/>
>>> <tag name="number-in-title"/>
>>> <tag name="submarine"/>
>>> <tag name="undersea-monster"/>
>>> <tag name="underwater"/>
>>> </keywords>
>>> <producers>
>>> <person name="Méliès, Georges"/>
>>> </producers>
>>> </movie>
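
For readers following the thread, here is a minimal sketch (not the XSLT
discussed above) of how a movie element shaped like the snippet could be
flattened into labelled edges and then indexed by target, which is roughly
what a graph store would keep persistently. It assumes Python's standard
xml.etree.ElementTree; the function names movie_edges and index_by_target,
the abbreviated SNIPPET, and the choice of the container element name as the
edge label are illustrative assumptions, not anything proposed by the posters.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the snippet from the thread (assumption: same shape).
SNIPPET = """<movie title="20000 lieues sous les mers">
  <actors><person name="Méliès, Georges"/></actors>
  <alias><title title="20,000 Leagues Under the Sea " year="1907"/></alias>
  <directors><person name="Méliès, Georges"/></directors>
  <genres><tag name="adventure"/><tag name="sci-fi"/></genres>
</movie>"""


def movie_edges(xml_text):
    """Return (movie title, edge label, target) triples for one movie.

    The edge label is simply the container element name (actors, alias,
    directors, ...); this is an illustrative choice, not a fixed schema.
    """
    movie = ET.fromstring(xml_text)
    source = movie.get("title")
    edges = []
    for group in movie:          # e.g. <actors>, <alias>, <genres>
        for item in group:       # e.g. <person>, <title>, <tag>
            target = (item.get("name") or item.get("title") or "").strip()
            if target:
                edges.append((source, group.tag, target))
    return edges


def index_by_target(edges):
    """Map (label, target) to the set of movie titles that point at it."""
    index = {}
    for source, label, target in edges:
        index.setdefault((label, target), set()).add(source)
    return index


if __name__ == "__main__":
    for edge in movie_edges(SNIPPET):
        print(edge)
```

With such an index, candidate matches for an incoming record can be looked
up by a shared director, alias, or genre rather than by scanning every
movie. Whether that is selective enough for the 3m-movie set is something
only an experiment on the real data would show.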