
RegExp for furigana (Japanese)


I'm trying to create a regex that removes furigana (ruby annotations) from Japanese words:

<ruby><rb>二度</rb><rp>(</rp><rt>にど</rt><rp>)</rp>と</ruby> //old string
二度と // new string

I created new = old.replace(/<rt>.*<\/rt>/,'').replace(/<rp>.*<\/rp>/,'').replace('<ruby><rb>','').replace('</rb></ruby>','') and it works... almost.

When there are multiple ruby tags, it doesn't work as desired:

<ruby><rb>息</rb><rp>(</rp><rt>いき</rt><rp>)</rp></ruby>を<ruby><rb>切</rb><rp>(</rp><rt>き</rt><rp>)</rp></ruby>らして
息らして //new string, using function above (wrong)
息を切らして //should be this

I'm very new to RegExp, so I'm not sure how to handle this one.


Solution

  • Try this:

    var newstring = oldstring.replace(/<rb>([^<]*)<\/rb>|<rp>[^<]*<\/rp>|<rt>[^<]*<\/rt>|<\/?ruby>/g, "$1");
    

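    The idea is to capture the content of each rb tag and put it back through $1 in the replacement. The rp and rt tags are removed together with their content, and the ruby tags themselves are removed too. When any alternative other than the rb one matches, $1 is empty, so that match is simply deleted. The g flag applies the replacement to every match, not just the first.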

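    Content between tags is matched with [^<]* (any run of characters other than <), which is safe because these tags (rb, rp, rt) can't be nested. Unlike the greedy .* in the question, it cannot run past a closing tag into the next ruby element.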
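
As a rough sketch of the above, the regex can be wrapped in a small function and run against both strings from the question. The name stripFurigana and the variable names are made up for illustration; only the regex comes from the answer. The last few lines show why the greedy .* from the question loses text when there is more than one ruby element.

```javascript
// Hypothetical helper name; the pattern is the one from the answer above.
function stripFurigana(html) {
  return html.replace(
    /<rb>([^<]*)<\/rb>|<rp>[^<]*<\/rp>|<rt>[^<]*<\/rt>|<\/?ruby>/g,
    "$1" // rb content is kept; for every other alternative $1 is empty
  );
}

// The two sample strings from the question.
const single =
  "<ruby><rb>二度</rb><rp>(</rp><rt>にど</rt><rp>)</rp>と</ruby>";
const multiple =
  "<ruby><rb>息</rb><rp>(</rp><rt>いき</rt><rp>)</rp></ruby>を" +
  "<ruby><rb>切</rb><rp>(</rp><rt>き</rt><rp>)</rp></ruby>らして";

console.log(stripFurigana(single));   // 二度と
console.log(stripFurigana(multiple)); // 息を切らして

// For comparison: the greedy .* matches from the first <rt> to the LAST </rt>,
// so everything in between (including を and 切) is removed with it.
const greedyResult = multiple.replace(/<rt>.*<\/rt>/, "");
console.log(greedyResult.includes("を")); // false
```

This assumes the markup is as regular as in the question (no attributes on the tags, no other elements inside rb, rp, or rt). For arbitrary HTML, a real parser would be a safer choice than a regex.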