mysqlregexcjkruby-characters

MySQL regexp with Japanese furigana


I have a large database (~2700 entries) of vocabulary. Each row contains an English word, the Japanese equivalent, and other data not relevant to this problem. I have created a facility to search and display the results in a table, but I'm having a small problem with the furigana.

Japanese sentences are written with a mix of Chinese characters (kanji) and the phonetic scripts (kana). Not everyone can read every kanji, and sometimes the same kanji has multiple readings. In those cases, the phoetic kana is placed above the kanji - this is called furigana:

enter image description here

I present these phonetic readings to the user with the <ruby> tag in the following format:

<ruby>
  <rb>勉強</rb>    <!-- the kanji -->
  <rp>(</rp>      <!-- define where the phonetic part starts in the string -->
    <rt>べんきょう</rt>   <!-- the phonetic kana itself -->
  <rp>)</rp>      <!-- define the end of the phonetic part -->
</ruby>する        <!-- the last part is already phonetic so needs no ruby -->

The strings are stored in my database like this:

勉強(べんきょう)する

where anything between the parentheses is the reading for the kanji immediately preceeding it. Storing the strings this way allows fallback for browsers that don't support ruby tags (such as, amazingly, Firefox).

All of this is fine, but the problem comes when a user is searching. If they search for

勉強

Then it will show up. But if they try to search for

勉強する

it won't work, because in the database there is a string defining the phonetic pronunciation in the middle.

The full-width parentheses in the above example are used only to denote this phonetic script. Given this, I am looking for a way to essentially tell the MySQL search to ignore anything it finds between rounded parentheses. I have a basic knowledge of how to do most simple queries in MySQL, but I'm certainly not an expert. I have looked at the docs, but (to me, at least) they are not very user-friendly. Perhaps not very beginner-friendly. I thought it might be possible with some sort of construction involving a regular expression, but I can't figure out how.

Is there a way to do what I want?


Solution

  • As said in How to do a regular expression replace in MySQL?, there seems to be impossible without an user-defined function (you can only replace explicit sequences).

    Rather dirty solution: you can tolerate anything between two consecutive Japanese characters, LIKE '勉%強%す%る'. I never suggested that.

    Or, you can keep an optional field in your table that potentially contains a version with furigana.