javascript node.js string chinese-locale

How to find Chinese and English characters in Node.js?


I have a string containing Chinese and English characters and I want to split the string into the individual Chinese and English characters.

Here are some examples:

  1. hello 你好
  2. 你好 你好 hello

This page explains how to detect Chinese characters, but it didn't work when I tried to split up the string.

Thanks in advance


Solution

  • You could split the string at every space, or break it up at every Chinese character, like so:

      const chiStr = "你好 你好 hello";
      chiStr.split(' '); // split the string at every space
      // result: ["你好", "你好", "hello"]

      const REGEX_CHINESE = /[\u4e00-\u9fff]|[\u3400-\u4dbf]|[\u{20000}-\u{2a6df}]|[\u{2a700}-\u{2b73f}]|[\u{2b740}-\u{2b81f}]|[\u{2b820}-\u{2ceaf}]|[\uf900-\ufaff]|[\u3300-\u33ff]|[\ufe30-\ufe4f]|[\u{2f800}-\u{2fa1f}]/u;
      const hasChinese = (str) => REGEX_CHINESE.test(str);

      // chiStr.split(REGEX_CHINESE) would discard the matched characters themselves,
      // so match tokens instead: each Chinese character on its own,
      // and each run of ASCII letters as one word
      const REGEX_TOKENS = new RegExp(REGEX_CHINESE.source + '|[A-Za-z]+', 'gu');
      chiStr.match(REGEX_TOKENS);
      // result: ["你", "好", "你", "好", "hello"]
    
    

    Another good approach is to sort the Chinese words and the English words into separate arrays, like so:

    const REGEX_CHINESE = /[\u4e00-\u9fff]|[\u3400-\u4dbf]|[\u{20000}-\u{2a6df}]|[\u{2a700}-\u{2b73f}]|[\u{2b740}-\u{2b81f}]|[\u{2b820}-\u{2ceaf}]|[\uf900-\ufaff]|[\u3300-\u33ff]|[\ufe30-\ufe4f]|[\u{2f800}-\u{2fa1f}]/u;
    const hasChinese = (str) => REGEX_CHINESE.test(str);

    const separateWords = (str) => {
       const words = str.split(' '); // space-separated words
       const chiWords = words.filter((word) => hasChinese(word)); // words containing a Chinese character
       const engWords = words.filter((word) => !hasChinese(word)); // all remaining words
       return [chiWords, engWords];
    };
    console.log(separateWords("你好 你好 hello")); // [["你好", "你好"], ["hello"]]
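
If you want the individual characters rather than whole words, one possible alternative is a Unicode property escape instead of the hand-written ranges. The sketch below is not part of the answer above: `splitChars` is a hypothetical helper name, and it assumes a Node.js version that supports `\p{Script=Han}` with the `u` flag (10 or later). Note that the Han script also covers characters used in Japanese and Korean text, and that "English" here is taken to mean ASCII letters only.

```javascript
// Assumption: Node.js 10+ so that Unicode property escapes are available.
const HAN = /\p{Script=Han}/u; // any character in the Han script
const LATIN = /[A-Za-z]/;      // ASCII letters only

// splitChars is a hypothetical helper, named here for illustration.
const splitChars = (str) => {
  // Array.from iterates by code point, so characters outside the BMP stay intact
  const chars = Array.from(str);
  return {
    chinese: chars.filter((ch) => HAN.test(ch)),
    english: chars.filter((ch) => LATIN.test(ch)),
  };
};

console.log(splitChars("hello 你好"));
// { chinese: ["你", "好"], english: ["h", "e", "l", "l", "o"] }
```

Whitespace, digits and punctuation end up in neither array; if you need them, add another filter for the remaining characters.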