[xquery-talk] A Poor Man's XPROC
ihe.onwuka at gmail.com
Wed Jan 22 08:12:50 PST 2014
Just sharing something I've used to get by with a bash shell
transformation workflow. I have already been served a large portion
of "Why don't you use XPROC/xmlsh" and the answer is because I didn't
think of it at the outset.
The two "tricks" here are process substitution (used with tee, a means
of piping one output into several subsequent steps) and piping
generated output directly into the shell for execution.
I lay no claim to originality or shell expertise; it's just something
that has worked for me and is worth sharing. Some of my terminology
may also be off for the same reason, but this works.
Suppose we wish to parse some HTML, let's say Amazon. We start out
with a bog-standard curl request piped into tagsoup to make it
well-formed:
curl -s --request GET "www.amazon.com" | java -jar $HOME/tagsoup-1.2.1.jar --nons |
This is where the fun starts. tagsoup has given us some XHTML and we
want to a) save it and b) parse it further, so...
tee amazon.xhtml | java -jar $HOME/saxon9he.jar -s:- -xsl:createMetadata.xsl |
tee saves the XHTML to amazon.xhtml and passes it on to the next step,
which is our first transform, creating some metadata from the XHTML.
Now we want to pass this metadata on to three processes:
1. A transformation to create a metadata Header record.
2. A transformation to create a metadata Tail record.
3. A step to process the XHTML (which I will explain later).
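For anyone who wants to try the save-and-pass-on step without tagsoup or
Saxon installed, here is a minimal sketch. The functions fetch_page and
create_metadata are made-up stand-ins for the curl | tagsoup stage and
the createMetadata transform; they are not part of my actual workflow.

```shell
#!/bin/bash
# Hypothetical stand-in for the curl | tagsoup stage: emits a tiny document.
fetch_page() {
  printf '<html><title>demo</title></html>\n'
}

# Hypothetical stand-in for the createMetadata transform: pulls out the title.
create_metadata() {
  sed -n 's:.*<title>\(.*\)</title>.*:<meta>\1</meta>:p'
}

workdir=$(mktemp -d)

# tee writes a copy to page.xhtml and forwards the same bytes down the pipe.
metadata=$(fetch_page | tee "$workdir/page.xhtml" | create_metadata)

echo "$metadata"   # prints <meta>demo</meta>
```

After this runs, page.xhtml in the temporary directory holds the full
document, while the metadata variable holds only what the second stage
produced from it.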
This is where the first use of process substitution kicks in. A quick
note: process substitution is a bash feature rather than part of plain
sh, so you need to start your script with #!/bin/bash rather than
#!/bin/sh to get this to work.
We invoke process substitution by tee >(some process) >(some other
process) etc. What this will do is take the input that was piped into
tee and pass a copy of it to each subprocess in the workflow, where
the thing in parentheses is what I call a subprocess. Note you need
the > sign immediately before the opening parenthesis, with no space
between them. So let's apply this to the output of the createMetadata
stage:
tee >(java -jar $HOME/saxon9he.jar -s:- -xsl:metadataHeader.xsl | curl -s --request PUT etc)
That stores the metadata header in the database, but we want the same
input to go into the transform that creates the metadata Tail, so...
>(java -jar $HOME/saxon9he.jar -s:- -xsl:metadataTail.xsl | curl -s --request PUT etc)
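The fan-out can be tried on its own without Saxon or a database. This
is only a sketch: make_header and make_tail are hypothetical stand-ins
for the two transforms, and the input line is invented.

```shell
#!/bin/bash
# Process substitution is a bash feature, hence the bash shebang.

# Hypothetical stand-ins for the metadataHeader and metadataTail transforms.
make_header() { sed 's/^/HEADER: /'; }
make_tail()   { sed 's/^/TAIL: /'; }

# tee copies its input to each >(...) branch. Its own stdout is discarded,
# so the only output that reaches the pipe is what the two branches print.
fan_out() {
  tee >(make_header) >(make_tail) > /dev/null
}

# The order in which the two branches finish is not guaranteed,
# so the lines are sorted before being captured.
result=$(printf 'title=demo\n' | fan_out | sort)
echo "$result"
```

One thing to be aware of: the >(...) branches run asynchronously, so in
a longer script tee can return before they have finished. In this
sketch the capture waits for them only because both branches write to
the same pipe that sort is reading.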
And now we have a third use for the output of the createMetadata
transform. We are going to generate the rest of the shell script to
complete the workflow. What I did here was to write a transformation
that produced bash shell code to do the rest of the work. The reason
for doing so was to take advantage of the power of XSLT (or XQuery)
to handle the conditional logic that would determine how the job would
proceed. For example, maybe a certain step is only to execute if a
certain metadata element is present (or a certain XML file exists).
Instead of testing for this and trying to introduce conditional logic
into the bash script, you could, say, use doc-available to check for
the file and then, depending on the outcome, generate the appropriate
script code. To execute it, all you need to do is pipe the output into
bash.
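As a rough illustration of generating code and piping it into bash,
here is a sketch in which awk plays the part of the transformation.
The generate_script function and the key=value input are assumptions
made for the example; in my workflow the generator is an XSLT
stylesheet and the input is the metadata XML.

```shell
#!/bin/bash
# Hypothetical stand-in for a transform that emits shell code.
# It reads key=value lines on stdin and prints one command per line,
# but only for lines that meet its condition (here: the key is "review").
generate_script() {
  awk -F= '$1 == "review" { printf "echo \"processing review %s\"\n", $2 }'
}

# Piping the generated text into bash executes it line by line.
output=$(printf 'title=demo\nreview=42\nreview=43\n' | generate_script | bash)
echo "$output"
```

The conditional logic lives entirely in the generator; bash only sees
the commands that survived it. Since whatever the generator emits gets
executed, it should only interpolate values you trust.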
Here is the final process substitution step. Recall we still have
access to the output of the createMetadata step.
>(java -jar $HOME/saxon9he.jar -s:- -xsl:generateReviews.xsl | bash)
where generateReviews transforms the metadata to the appropriate bash
shell script code and that gets executed by piping it into bash.
Not claiming it is elegant or efficient - it is just something that
has worked for me. Somebody may find it or bits of it useful or I may
get told of a better way (xmlsh/Xproc excepted).