[xquery-talk] Doing some Pattern Frequency Distribution

Fri Jun 9 00:46:54 PDT 2006

> I would recommend to go with XQuery and a real XML database 
> for those sizes - I'm not much of an expert for XSLT 
> processors, but I doubt you'll get good results with XSLT on 
> datasets of 5 GB. 

I had a report the other day of a user running a Saxon XSLT transformation
over a 3 GB input file in 12 minutes. However, this requires a very careful
and stereotypical coding pattern to take advantage of Saxon's serial
processing mode - see
http://www.saxonica.com/documentation/sourcedocs/serial.html 

Unfortunately, Saxon's implementation of xsl:for-each-group with value-based
grouping (and the same goes for distinct-values()) requires that all the
groups are held in memory, so I suspect this won't work for your problem -
although it could work if the phone numbers you are analyzing represent a
small part of the total data.

Really you want to visit each element exactly once, in sequence, and
maintain a set of totals as you go. This doesn't lend itself well to
implementation in either XSLT or XQuery. I think I would write this one in
Java as a pure SAX application.
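
As a rough illustration of that approach, here is a minimal sketch of such a
SAX application. It is not tested against your data: the element name "phone"
and the notion of "pattern" (every digit replaced by 9) are my assumptions,
as is the class name PatternCounter. Only the standard JAXP/SAX classes are
used.

```java
import java.io.FileInputStream;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PatternCounter extends DefaultHandler {
    // Hypothetical element name; the real data layout is not shown in the thread.
    private static final String TARGET = "phone";

    // Running totals, one entry per distinct pattern seen so far.
    private final Map<String, Long> totals = new TreeMap<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inTarget = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (TARGET.equals(qName)) {
            inTarget = true;
            text.setLength(0); // start collecting a fresh value
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so append.
        if (inTarget) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (TARGET.equals(qName)) {
            inTarget = false;
            totals.merge(toPattern(text.toString().trim()), 1L, Long::sum);
        }
    }

    // Assumed definition of a pattern: digits become 9, other characters are kept.
    static String toPattern(String value) {
        return value.replaceAll("[0-9]", "9");
    }

    // Parses the input once, in document order, and returns the totals.
    static Map<String, Long> count(InputSource in) throws Exception {
        PatternCounter handler = new PatternCounter();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.totals;
    }

    public static void main(String[] args) throws Exception {
        try (FileInputStream file = new FileInputStream(args[0])) {
            count(new InputSource(file))
                .forEach((pattern, n) -> System.out.println(n + "\t" + pattern));
        }
    }
}
```

Memory use here grows with the number of distinct patterns rather than with
the size of the document, since the only state kept between elements is the
map of totals.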

Incidentally, XML databases vary greatly in their ability to handle large
documents. Many are optimized to handle large numbers of small documents,
and behave rather poorly when asked to deal with small numbers of
gigabyte-sized documents.

Michael Kay
http://www.saxonica.com/