javascript arrays split kannada

Split Kannada word into syllabic clusters


Is there a way to split a Kannada word into its syllabic clusters using JavaScript?

For example, I want to split the word ಕನ್ನಡ into the syllabic clusters ["ಕ", "ನ್ನ", "ಡ"]. But when I split it with split(''), the array I actually get is ["ಕ", "ನ", "್", "ನ", "ಡ"].
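For reference, here is a small sketch of what split('') sees for this word. It only prints the code point behind each element, which shows that the virama (U+0CCD) is a separate character between the two ನ. The variable names are purely illustrative.

```javascript
var word = 'ಕನ್ನಡ';

// Map each UTF-16 code unit to a "U+XXXX" label.
var codes = word.split('').map(function (ch) {
  return 'U+' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0');
});

console.log(codes);
```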



Solution

  • I cannot say that this is a complete solution, but it works to an extent and relies on a basic understanding of how Kannada words are formed:

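    var k = 'ಕನ್ನಡ';
    var arr = [];
    for (var i = 0; i < k.length; i++) {
      // Start a new cluster with the current character.
      var s = k.charAt(i);

      // Keep appending while there is a next char and either it is not a
      // swara/vyanjana (outside U+0C85..U+0CB9) or the current char is a virama (U+0CCD).
      // The extra parentheses stop && from binding only to the first comparison.
      while ((i + 1) < k.length &&
             (k.charCodeAt(i + 1) < 0xC85 || k.charCodeAt(i + 1) > 0xCB9 || k.charCodeAt(i) == 0xCCD)) {
        s += k.charAt(i + 1);
        i++;
      }
      arr.push(s);
    }
    console.log(arr); // ["ಕ", "ನ್ನ", "ಡ"]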
    

    As the comments in the code say, we keep appending characters to the current cluster as long as the next character is not a swara (vowel) or vyanjana (consonant), or the character just appended was a virama. You will need to try different words to make sure the different cases are covered. In particular, this code does not handle digits.

    For the character codes, you can refer to the Unicode Kannada chart: http://www.unicode.org/charts/PDF/U0C80.pdf
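The same rule can also be sketched as a regular expression wrapped in a function. This is only a rough sketch and not part of the original answer. The name splitKannadaClusters is hypothetical, and the mark ranges are my own reading of the Unicode chart linked above. Unlike the loop, it leaves anything that is not a Kannada letter (spaces, ASCII, digits) as a separate element instead of gluing it to the previous cluster. ZWJ/ZWNJ are not handled.

```javascript
// Hypothetical helper: split a string into approximate Kannada syllabic clusters.
function splitKannadaClusters(text) {
  // Base: swara or vyanjana, U+0C85..U+0CB9 (same range as the loop above).
  // Then any number of:
  //   - virama (U+0CCD) followed by another base letter (a conjunct), or
  //   - a combining sign: U+0C81..U+0C83 (such as anusvara and visarga) or
  //     U+0CBC..U+0CD6 (nukta, vowel signs, virama, length marks; assumed range).
  // Fallback: any single code point, returned on its own.
  var pattern = /[\u0C85-\u0CB9](?:\u0CCD[\u0C85-\u0CB9]|[\u0C81-\u0C83\u0CBC-\u0CD6])*|[\s\S]/gu;
  return text.match(pattern) || [];
}

console.log(splitKannadaClusters('ಕನ್ನಡ'));
```

For ಕನ್ನಡ this yields the same three clusters as the loop. As with the loop, treat the ranges as a starting point and check them against real words before relying on them.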