I am using Web-Harvest with XQuery to get data from a website.
I have two XQuery variables with the following data:
$text:
<p> <strong>Psoria-Shield Inc.</strong> (<a href="http://www.psoria-shield.com/"></a><a href="/Tracker?data=gB90UgQvS9bs99znBBkklh-mudx4NTcPFIy_wiP7zUJ-qBXYABNid0GYgW4g7qVsjn3_dv2FPGzaYgKnhq_Ujg%3D%3D" target="_top">www.psoria-shield.com</a>) is a Tampa FL based company specializing in design, manufacturing, and distribution of medical devices to domestic and international
markets. PSI employs full-time engineering, production, sales staff, and manufactures within an ISO 13485 certified quality
system. PSI's flagship product, Psoria-Light®, is FDA-cleared and CE marked and delivers targeted UV phototherapy for
the treatment of certain skin disorders. Psoria-Shield Inc., was acquired by Wellness Center USA Inc. ("WCUI") in August 2012,
and is now a wholly-owned subsidiary.
</p>
<p> <strong>AminoFactory</strong> (<a href="http://www.aminofactory.com/"></a><a href="/Tracker?data=O0xbFRJiVuWDzRDq7SVwVR9xAPYLIGQyBw4mDziUrH4KB3DIYUasiO_O78eteJsv2doAGtg4kRhAqmnvkQ-9LA%3D%3D" target="_top">www.aminofactory.com</a>), a division of Wellness Center USA, Inc., is an online supplement store that markets and sells a wide range of high-quality
nutritional vitamins and supplements. By utilizing AminoFactory's online catalog, bodybuilders, athletes, and health conscious
consumers can choose and purchase the highest quality nutritional products from a wide array of offerings in just a few clicks.
</p>
<pre>At Wellness Center Usa, Inc.
Tel: (847) 925-1885 <a href="/Tracker?data=rhuzXSqaPgDJ--ByIIMSm7wrtVUZmqiD7wl78d4gUHajkKceardtmAscrHABzvo360XXBJCWn_Rb_s-yPMVXTw_XJrSieD88bIXbE9snPn4%3D" target="_top">www.wellnescenterusa.com</a> Investor Relations Contact:
Arthur Douglas & Associates, Inc.
Arthur Batson
Phone: 407-478-1120 <a href="/Tracker?data=9uKwR5tr9QwjFw830lvFTIWgz-s_eHaywZHwDl3el2RfYe5VuQZd_8sJU4J7HoFgOdyCn8br77RK60SIqLZkCy468cEKHpGUgE-nanwYfHo%3D" target="_top">www.arthurdouglasinc.com</a></pre> </span><span class="dt-green">
and $contact:
At Wellness Center Usa, Inc.
Tel: (847) 925-1885 <a href="/Tracker?data=rhuzXSqaPgDJ--ByIIMSm7wrtVUZmqiD7wl78d4gUHajkKceardtmAscrHABzvo360XXBJCWn_Rb_s-yPMVXTw_XJrSieD88bIXbE9snPn4%3D" target="_top">www.wellnescenterusa.com</a> Investor Relations Contact:
Arthur Douglas & Associates, Inc.
Arthur Batson
Phone: 407-478-1120 <a href="/Tracker?data=9uKwR5tr9QwjFw830lvFTIWgz-s_eHaywZHwDl3el2RfYe5VuQZd_8sJU4J7HoFgOdyCn8br77RK60SIqLZkCy468cEKHpGUgE-nanwYfHo%3D" target="_top">www.arthurdouglasinc.com</a>
(The text above is just an example.)
What I want to do is remove the content of $contact from $text.
So far I have come up with the following code:
{
for $x in $text
return if(matches($contact, '')) then $x
else if(matches($contact, $x)) then '' else $x
}
It is not working. I don't know where I am going wrong. Please let me know the right way of doing this.
Do not use matches(...) for exact string comparison. It is made for regular expressions, so you would need to escape a bunch of special characters.
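To make the regex pitfall concrete, here is a minimal sketch using the phone line from the sample data. The variable name $phone is only illustrative and not part of the original query. The parentheses are read as a regex group, so the literal string does not match itself, whereas contains(...) does a plain substring test. The last line also shows why the first branch of the posted code always fires: an empty pattern matches any input.

```xquery
(: $phone is a hypothetical variable holding one line of the contact block :)
let $phone := 'Tel: (847) 925-1885'
return (
  (: "(847)" is parsed as a capturing group, so the literal parentheses are never matched :)
  matches($phone, $phone),     (: false :)
  (: contains() compares plain strings, no escaping needed :)
  contains($phone, $phone),    (: true :)
  (: an empty regular expression matches every string :)
  matches($phone, '')          (: true :)
)
```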
If the HTML subtree is the exact same, use this:
$text[not(deep-equal(., <pre>{ $contact }</pre>))]
If you only want to compare its contents, use data(...):
$text[not(data(.) = string-join(data($contact), ''))]
But given the data you posted, you'd be fine just removing all <pre/> nodes:
$text[local-name() != 'pre']
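As a quick check of that last filter, here is a small self-contained sketch. The inline sample data is made up and much shorter than the real page, and it assumes $text is a sequence of top-level element nodes (the predicate only sees the items of the sequence itself, not their descendants).

```xquery
(: hypothetical stand-in for the scraped content :)
let $text := (
  <p>Company description</p>,
  <pre>Tel: (847) 925-1885</pre>
)
(: keep every item whose element name is not "pre" :)
return $text[local-name() != 'pre']
(: yields <p>Company description</p> :)
```

If the <pre> element were nested inside another element of $text, this predicate would not reach it, and the filter would have to be applied while rebuilding the parent.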