XPath for first word?


For this HTML / XML:

<div class="contentBlock">
  <h2> </h2>
  <h1></h1>
  <h1>DBS055 - single  module packages</h1>
</div>

I want to extract only DBS055 with XPath, not the entire text.


Solution

  • XPath 2.0

    //h1[normalize-space()]/replace(normalize-space(),'^([\w\-]+).*', '$1')
    

    This returns the first word of the string value of each h1 element whose string value contains at least one non-whitespace character.
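For readers without an XPath 2.0 processor at hand, the same logic can be sketched in Python. This is an emulation under stated assumptions, not an XPath evaluation: the function name `first_words` is hypothetical, Python's `str.split()` stands in for `normalize-space()`, and Python's `\w` is assumed to be close enough to the XPath regex `\w` for this input.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<div class="contentBlock">
  <h2> </h2>
  <h1></h1>
  <h1>DBS055 - single  module packages</h1>
</div>"""


def first_words(xml_text):
    """Emulate //h1[normalize-space()]/replace(normalize-space(), ...)."""
    root = ET.fromstring(xml_text)
    words = []
    for h1 in root.iter("h1"):
        # Rough equivalent of normalize-space(): trim and collapse whitespace.
        text = " ".join("".join(h1.itertext()).split())
        if text:  # predicate [normalize-space()]: skip empty/blank h1 elements
            # Same pattern as the XPath replace(); '$1' becomes r'\1' in Python.
            words.append(re.sub(r"^([\w\-]+).*", r"\1", text))
    return words


print(first_words(SAMPLE))  # ['DBS055']
```

As with the XPath expression, a string that does not start with a word character is returned unchanged, because the pattern does not match.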

  • XPath 1.0

    substring-before(
      concat(
        normalize-space(
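          translate(//h1[normalize-space()][1], ',;/.', '    ')), ' '), ' ')
    

    approximates the more robust XPath 2.0 solution. Expand ',;/.' as necessary to include other characters you consider to be word boundaries, and keep one space in the third argument for each character in the second: translate() deletes any character that has no counterpart there instead of replacing it.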

    Explanation:

    1. Select the first h1 that has a non-whitespace-only string value.
    2. Map all word boundary characters to spaces.
    3. Normalize spacing.
    4. Append a space to cover the single-word case.
    5. Return the substring before the first space.
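The XPath 1.0 steps can be traced one at a time with a small Python sketch. It is an illustration only: the helper names `normalize_space` and `first_word_xpath1_style` are hypothetical, the XPath functions are imitated with plain string operations, and each boundary character is mapped to its own space.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div class="contentBlock">
  <h2> </h2>
  <h1></h1>
  <h1>DBS055 - single  module packages</h1>
</div>"""


def normalize_space(s):
    # Rough equivalent of XPath normalize-space(): trim and collapse whitespace.
    return " ".join(s.split())


def first_word_xpath1_style(xml_text, boundaries=",;/."):
    root = ET.fromstring(xml_text)
    # Select the first h1 (document order) with a non-whitespace string value.
    value = ""
    for h1 in root.iter("h1"):
        candidate = "".join(h1.itertext())
        if normalize_space(candidate):
            value = candidate
            break
    # Map every boundary character to a space (one space per character).
    mapped = value.translate(str.maketrans(boundaries, " " * len(boundaries)))
    # Normalize spacing so the string has no leading or repeated spaces.
    normalized = normalize_space(mapped)
    # Append a space so a single-word string still contains a separator.
    padded = normalized + " "
    # Return everything before the first space, like substring-before().
    return padded.partition(" ")[0]


print(first_word_xpath1_style(SAMPLE))  # DBS055
```

If no h1 has any text, the sketch returns an empty string, which matches what substring-before() yields for an empty node-set.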