ruby xhtml nokogiri libxml2

Why does Nokogiri's to_xhtml create new `id` attributes from `name`?


Consider the following code:

require 'nokogiri' # v1.5.2
doc = Nokogiri.XML('<body><a name="foo">ick</a></body>')

puts doc.to_html
#=> <body><a name="foo">ick</a></body>

puts doc.to_xml
#=> <?xml version="1.0"?>
#=> <body>
#=>   <a name="foo">ick</a>
#=> </body>

puts doc.to_xhtml
#=> <body>
#=>   <a name="foo" id="foo">ick</a>
#=> </body>

Notice the new id attribute that has been created.

  1. Who is responsible for this, Nokogiri or libxml2?
  2. Why does this occur? (Is this enforcing a standard?)
    The closest I can find is this spec describing how you may put both an id and name attribute with the same value.
  3. Is there any way to avoid this, given the desire to use the to_xhtml method on input that may have <a name="foo">?

This problem arises because the input I am parsing has an id attribute on one element and a name attribute with the same value on a separate element, so the generated id collides with the existing one.


Solution

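  • Apparently this is a feature of libxml2's XHTML serializer rather than of Nokogiri itself. The rationale appears to be XHTML 1.0, section 4.10 (http://www.w3.org/TR/xhtml1/#h-4.10), where we find: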

    In XML, fragment identifiers are of type ID, and there can only be a single attribute of type ID per element. Therefore, in XHTML 1.0 the id attribute is defined to be of type ID. In order to ensure that XHTML 1.0 documents are well-structured XML documents, XHTML 1.0 documents MUST use the id attribute when defining fragment identifiers on the elements listed above.
    [...]
    Note that in XHTML 1.0, the name attribute of these elements is formally deprecated, and will be removed in a subsequent version of XHTML.

    The best 'workaround' I've come up with is:

    # Replace each <a name="..."> that has no id and no href with its children,
    # but only if another element in the document already uses that value as its id.
    # Note: the interpolated selector assumes the name is a valid CSS identifier.
    doc.xpath('//a[@name][not(@id)][not(@href)]').each do |a|
      a.replace(a.children) if doc.at_css("##{a['name']}")
    end
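
To make the rule quoted above concrete, here is a minimal pure-Ruby sketch of what the serializer appears to be doing: when an element from the spec's list has a name but no id, an id with the same value is written out. This is only a model for illustration, not libxml2's actual code. The helper name `xhtml_attributes` and the constant `NAME_ANCHOR_ELEMENTS` are made up for this example, and the element list is taken from XHTML 1.0 section 4.10, so libxml2's own list may differ.

```ruby
require 'set'

# Elements that XHTML 1.0 section 4.10 lists as carrying a `name` fragment
# identifier. Assumption: libxml2's internal list may not match this exactly.
NAME_ANCHOR_ELEMENTS = Set.new(%w[a applet form frame iframe img map]).freeze

# Hypothetical helper: returns the attributes a serializer following that rule
# would write for one element. The input hash is never modified.
def xhtml_attributes(element_name, attributes)
  # Elements outside the list are left alone.
  return attributes.dup unless NAME_ANCHOR_ELEMENTS.include?(element_name)
  # Only add an id when there is a name and no id yet.
  return attributes.dup unless attributes.key?('name') && !attributes.key?('id')

  attributes.merge('id' => attributes['name'])
end

# <a name="foo"> gains an id, which is the behaviour seen in the question.
p xhtml_attributes('a', { 'name' => 'foo' })

# An element that already has an id keeps it unchanged.
p xhtml_attributes('a', { 'name' => 'foo', 'id' => 'bar' })

# Elements outside the list are not touched.
p xhtml_attributes('span', { 'name' => 'foo' })
```

Seen this way, the conflict in the question is a side effect of the copy: the generated id is never checked against ids that already exist elsewhere in the document, which is why the workaround has to do that check itself before serializing.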