[xquery-talk] deduplicating information in XML files

Helena Galhardas helena.galhardas at ist.utl.pt
Wed Oct 17 03:18:36 PDT 2012


Dear Robby,

The Zorba XQuery processor currently provides a data cleaning module
(look for "data cleaning" at http://www.zorba-xquery.com/html/modules)
that you may want to try.

In principle, the requirements of your data de-duplication problem can be
addressed by writing an XQuery program that invokes some of the functions available
in this data cleaning module.
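
As a rough starting point, here is a minimal sketch in plain XQuery 3.0. It does
not call the data cleaning module at all (its function names are not shown here);
it only groups topics whose body text is exactly the same after whitespace
normalization. The collection name "features-benefits" and the <group> and
<duplicate> report elements are assumptions made up for illustration, and how
collection() resolves depends on your processor or XML database.

```xquery
xquery version "3.0";

(: Assumption: all topic files are reachable through one collection.
   The name "features-benefits" is hypothetical. :)
declare variable $topics := collection("features-benefits");

for $topic in $topics/content
(: Build a comparison key from the paragraphs of the body, with
   whitespace normalized so layout differences do not hide duplicates. :)
let $key := string-join(
              for $p in $topic/body/p return normalize-space($p),
              "&#10;")
group by $key
count $n
return
  (: One <group> per distinct body; $topic now holds every topic
     that shares this body. The target file name is illustrative. :)
  <group target="FandB_{$n}.xml">{
    for $t in $topic
    return <duplicate id="{$t/meta/id}" uri="{base-uri($t)}"/>
  }</group>
```

Each resulting group lists the topics that could be replaced by one generic
topic, which is the information needed to rewrite the href values in the maps.
Near-duplicates (small wording differences) are not detected by this sketch;
that is where similarity functions would come in.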

We would appreciate your feedback if you decide to do so. Please let us know if we
can help in any way.

Thanks.
Best Regards,
Helena Galhardas

On Oct 12, 2012, at 1:02 PM, Robby Pelssers wrote:

> Hi all,
> 
> This time I have a rather challenging task at hand.  Let me first describe the use case.  We have lots of product information stored in XML.  Some of that information describes 
> - Technical applications
> - Features and benefits
> - Technical summary
> 
> One of the problems is that many products have, for example, the same features and benefits because they belong to the same product family or group.  Because we stored that info per product, it got duplicated.  Now we want to deduplicate it by generating DITA maps and topics (both are just XML).  For simplicity, let's assume we generate the following content for product1 and product2.  The goal is to get from INPUT to OUTPUT by checking whether the bodies of the linked topics are duplicates, then creating 1 generic topic and rewriting the links in the maps to point to that single topic.
> 
> I have XSLT / XQuery (XMLDB) and Java at my disposal, and I'm not sure which will be the easiest way to get the job done.  Keep in mind also that my INPUT will contain a few thousand files (maps and linked topics) and I will need to deduplicate the whole set ;-)
> 
> Thx upfront for any input,
> Robby  
> 
> INPUT
> 
> Product1_map.xml
> <map>
>   <features-benefits-ref href="features-benefits/Product1_FandB.xml"/>
> </map>
> 
> Product1_FandB.xml:
> <content>
>   <meta>
>     <id>product1</id>
>   </meta>
>   <body>
>     <p>Suitable for high frequency applications due to fast switching characteristics</p>
>     <p>Suitable for logic level gate drive sources</p>
>   </body>
> </content>
> 
> Product2_map.xml
> <map>
>   <features-benefits-ref href="features-benefits/Product2_FandB.xml"/>
> </map>
> 
> Product2_FandB.xml:
> <content>
>   <meta>
>     <id>product2</id>
>   </meta>
>   <body>
>     <p>Suitable for high frequency applications due to fast switching characteristics</p>
>     <p>Suitable for logic level gate drive sources</p>
>   </body>
> </content>
> 
> Expected output:
> 
> Product1_map.xml
> <map>
>   <features-benefits-ref href="features-benefits/FandB_1.xml"/>
> </map>
> 
> Product2_map.xml
> <map>
>   <features-benefits-ref href="features-benefits/FandB_1.xml"/>
> </map>
> 
> FandB_1.xml:
> <content>
>   <meta>
>     <id><!-- can become empty --></id>
>   </meta>
>   <body>
>     <p>Suitable for high frequency applications due to fast switching characteristics</p>
>     <p>Suitable for logic level gate drive sources</p>
>   </body>
> </content>
> 
> 
> _______________________________________________
> talk at x-query.com
> http://x-query.com/mailman/listinfo/talk



