[SOLVED] Removing duplicate words mysql concat

Removing duplicate words mysql concat_ws

I have a query that selects the data I need for a Sphinx index. One of the things it does is a concat_ws of multiple name aliases (different languages and such). This presents a problem when the names overlap. For example, one entry has the name "Clannad" and the alternative title "ＣＬＡＮＮＡＤ　-クラナド-". Another has the names "Clannad After Story", "クラナドアフターストーリー" and "Clannad: After Story". Bear with me: I know this would be easily resolved in this particular case, but I'd like a fix that applies across the board. If you search for "Clannad", you get the After Story entry first because of the double match on 'Clannad'.

What I'd like to do is remove all duplicate (non-unique) words from the concat_ws result, if that is even possible.

The query looks something like:

SELECT CONCAT_WS(' ',a.Name,a.Name2,a.Name3,a.Name4) AS name

(I hope I structured this question correctly, this being my first here.) Thank you.

Solution

As Marc has suggested in a comment, this is quite painful to manage in SQL (as far as I can see). I'd suggest caching the processed value in another column and then indexing that.

SELECT a.name_words AS name, ...

Combining each of your name values and then getting the distinct words is a separate matter - but that really depends on what language you have at hand. Regular expressions should be of some help though - here's a quick attempt in Ruby:

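[name, name2, name3, name4].compact.join(' ')  # compact skips nil (NULL) names
  .scan(/[[:word:]]+/)   # keep runs of letters/digits; drops spaces and punctuation
  .map(&:downcase)       # compare words case-insensitively
  .uniq                  # first occurrence of each word wins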
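
For something you can run as-is, the attempt above can be wrapped into a small helper and tried against the titles from the question. This is only a sketch using core Ruby: the method name unique_name_words is made up for illustration, and the NFKC normalization step is my own assumption (it folds full-width forms such as "ＣＬＡＮＮＡＤ" into "CLANNAD" so they count as duplicates). Drop that step if you want the full-width spelling to stay searchable as a separate word.

```ruby
# Hypothetical helper: collapse several name columns into one string of
# unique words, e.g. for storing in a cached column like name_words.
def unique_name_words(*names)
  names.compact                                  # skip nil (NULL) columns
       .map { |n| n.unicode_normalize(:nfkc) }   # assumption: fold full-width forms to ASCII
       .join(' ')
       .scan(/[[:word:]]+/)                      # runs of letters/digits; punctuation is dropped
       .map(&:downcase)                          # compare case-insensitively
       .uniq                                     # keeps the first occurrence, preserves order
       .join(' ')
end

puts unique_name_words("Clannad", "ＣＬＡＮＮＡＤ　-クラナド-")
# prints: clannad クラナド

puts unique_name_words("Clannad After Story", "クラナドアフターストーリー", "Clannad: After Story", nil)
# prints: clannad after story クラナドアフターストーリー
```

Both entries now contain "clannad" exactly once, so the After Story entry no longer gets a double match. Note that the words come out lowercased, which should be fine for an index column but is not suitable for display.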