[xquery-talk] [SEARCHING XML FOR DATA] Total newbie question...

Jeff Dexter jeff.dexter at rainingdata.com
Mon May 22 08:38:31 PDT 2006


Petri,

 

Unfortunately my mailer garbled your original XML with its own HTML, so I
had to work from the original HTML at your link, but hopefully this helps.
Also note that I'm using some TigerLogic conventions for loading the HTML
and modeling it in XQuery (the two-argument doc( ) call below, for instance,
is not standard XQuery), so you'll need to change these for your setup.

 

The basic problem is that you're searching the space of table elements in
the document, of which there are many, for one table that has few
identifying characteristics. Some tables in the document are well identified
by id attributes but, alas, not the one you are searching for, which means
we need to identify it in some other manner. I've used the first header cell
of the table as its identifying characteristic, as follows.

 

    declare default element namespace 'http://www.w3.org/1999/xhtml';

    doc( 'http://it.finance.yahoo.com/q/cp?s=%5EMIB30', 'text/html' )
        //table[ (tr/td)[1] eq 'Codice' ]

 

Note that I've restricted the comparison to the first cell in the table. Had
I not done that, I would get an error (the eq operator accepts at most one
item on each side), and the restriction also means I'm searching a minimal
amount of the document for my table. Also note that I've surrounded tr/td in
parentheses. (tr/td)[1] means the first td element out of the entire
sequence selected by the tr/td path in this table. Had I simply written
tr/td[1], it would mean the first td element in each tr element of the
table, which is quite different and would have led to an error, since there
are multiple such items and the eq operator is designed to handle only one.
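
The difference between the two predicates can be checked on a small inline
table. This is only a sketch: the two-row table is a hypothetical stand-in
for the real page, and it uses no namespace so that it runs without the
TigerLogic-specific doc( ) call.

```xquery
(: Hypothetical two-row table standing in for the real page. :)
let $table :=
  <table>
    <tr><td>Codice</td><td>Nome</td></tr>
    <tr><td>AL.MDD</td><td>ALLEANZA ASS</td></tr>
  </table>
return (
  count( ($table/tr/td)[1] ),    (: 1: the very first cell of the table :)
  count( $table/tr/td[1] ),      (: 2: the first cell of each row :)
  ($table/tr/td)[1] eq 'Codice'  (: true: one item on each side, so eq is legal :)
)
```

If you actually want to ask whether any first cell equals 'Codice', the
general comparison operator = accepts sequences on either side, so
tr/td[1] = 'Codice' would not raise an error.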

 

Since you don't want the table but rather the contents of the table, you can
iterate over these as follows. 

 

    declare default element namespace 'http://www.w3.org/1999/xhtml';

    for $i in doc( 'http://it.finance.yahoo.com/q/cp?s=%5EMIB30', 'text/html' )
        //table[ (tr/td)[1] eq 'Codice' ]/tr
    return
        $i

 

You can then construct the result you want by returning something other than
$i, as in:

 

    declare default element namespace 'http://www.w3.org/1999/xhtml';

    for $i in doc( 'http://it.finance.yahoo.com/q/cp?s=%5EMIB30', 'text/html' )
        //table[ (tr/td)[1] eq 'Codice' ]/tr
    return
        <titolo><codice>{ data($i/td[ 1 ]) }</codice> . </titolo>

 

Note the use of the fn:data( ) function here. The key point is that each
table cell can contain varying amounts of markup intended to format the text
inside it. Using fn:data( ) ensures that all of the text is extracted from
the table cell but none of the markup. Each of the columns you want in your
final result can be extracted with the same code as above, using a different
index in the predicate to pick the column at that index (e.g. $i/td[ 1 ],
$i/td[ 2 ], etc.).
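
As a rough illustration of extracting two columns at once, here is a sketch
against a hypothetical inline row modeled on the data rows of the page. The
element name nome is my guess at what you might want, not something taken
from your message.

```xquery
(: Hypothetical data row; the first cell carries formatting markup. :)
let $row :=
  <tr>
    <td><b><a href="/q?s=AL.MDD">AL.MDD</a></b></td>
    <td>ALLEANZA ASS</td>
  </tr>
return
  <titolo>
    <codice>{ data( $row/td[1] ) }</codice>
    <nome>{ data( $row/td[2] ) }</nome>
  </titolo>
```

The codice element ends up containing only the text AL.MDD. Had the
expression been { $row/td[1] } without data( ), the whole td element,
including the b and a markup, would have been copied into the result.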

 

Other things you may want to do to improve this query:

 

-          Use a positional variable on the ForExpr to eliminate the first
row. Even though it isn't marked up with th, it's just the table header, so
you probably want to eliminate it from the result.

-          Clean up the strings returned from each table cell. In many cases
they're fully formatted with currency indicators, percentages, etc. Your XML
would be better served by marking these up as complex elements, with decimal
types for the values and attributes describing their properties. Being able
to search on price could be important, but you can't do that if the Euro
symbol is carried along with the price.
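
Both suggestions can be sketched together as follows. The inline table is a
hypothetical stand-in for the real page, and the prezzo element and valuta
attribute are names I made up for illustration. The cleanup assumes the
price cell holds only a number with a comma as decimal separator, an
optional Euro symbol and spaces; cells on the real page that carry more text
(a date, for instance) or thousands separators would need further handling.

```xquery
(: Hypothetical table: a header row followed by one data row. :)
let $table :=
  <table>
    <tr><td>Codice</td><td>Prezzo</td></tr>
    <tr><td>AL.MDD</td><td>9,4900 €</td></tr>
  </table>
(: "at $pos" binds the 1-based position of each row in the sequence. :)
for $row at $pos in $table/tr
(: translate( ) maps ',' to '.' and deletes '€' and ' ', which have no
   counterpart in the third argument. :)
let $clean := translate( string( $row/td[2] ), ',€ ', '.' )
where $pos gt 1   (: skip the header row :)
return
  <titolo>
    <codice>{ data( $row/td[1] ) }</codice>
    <prezzo valuta="EUR">{ xs:decimal( $clean ) }</prezzo>
  </titolo>
```

Once the price is an xs:decimal rather than a formatted string, a predicate
such as [ xs:decimal(prezzo) gt 9 ] can compare it numerically.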

 

Hope this helps.

 

Jeff Dexter.

Chief Architect, TigerLogic

www.rainingdata.com

 

  _____  

From: talk-bounces at xquery.com [mailto:talk-bounces at xquery.com] On Behalf Of
Petri Alessandro
Sent: Saturday, May 20, 2006 7:23 AM
To: talk at xquery.com
Subject: [xquery-talk] [SEARCHING XML FOR DATA] Total newbie question...

 

 

Hi everyone. I'm doing a project for a university exam and I need advice on
the XQuery involved.
I developed an application which parses the HTML taken from a web page and
translates it into well-formed XML.
I then query it through the XQEngine Java library. I basically want to
extract the data from the central table at this URL:
http://it.finance.yahoo.com/q/cp?s=%5EMIB30. I'd like the returned XML to be
formed more or less this way:


AL.MDD
ALLEANZA ASS
9,4900
-0,94%
0


for each table row. I really need some hints here, as I can perform easy
queries on the document but can't work out the one I need to extract this
data.

Thanks in advance to anyone who answers :)

PS: the XML document I got from the transformed HTML is the following (sorry
if it's big):



...cut...


<table
width="100%"
cellpadding="0"
cellspacing="0"
border="0"
class="yfnc_tableout1">


<table
width="100%"
cellpadding="2"
cellspacing="1"
border="0">

<td
class="yfnc_tablehead1"
align="center">Codice
<td
class="yfnc_tablehead1"
align="center">Nome
<td
class="yfnc_tablehead1"
align="center">Prezzo
<td
class="yfnc_tablehead1"
align="center">Variazione
<td
class="yfnc_tablehead1"
align="center">Volumi


<td
class="yfnc_tabledata1">

<a
href="/q?s=AL.MDD">AL.MDD


<td
class="yfnc_tabledata1">
ALLEANZA ASS

<td
class="yfnc_tabledata1"
align="center">
9,4900 € 

18 mag

<td
class="yfnc_tabledata1"
align="center">
<img
width="10"
height="14"
border="0"
src="http://us.i1.yimg.com/us.yimg.com/i/us/fi/03rd/down_r.gif"
alt="Down" /> 
<b
style="color:#cc0000;">0,0900
(0,94%)
<td
class="yfnc_tabledata1"
align="right">0



...cut...

<td
class="yfnc_tabledata1">
UNICREDITO ITALIANO

<td
class="yfnc_tabledata1"
align="center">
6,0650 € 

18 mag

<td
class="yfnc_tabledata1"
align="center">
<img
width="10"
height="14"
border="0"
src="http://us.i1.yimg.com/us.yimg.com/i/us/fi/03rd/down_r.gif"
alt="Down" /> 
<b
style="color:#cc0000;">0,1750
(2,80%)
<td
class="yfnc_tabledata1"
align="right">0





...cut...

 

 
