I would like to convert characters with accents or similar to the corresponding ordinary character:
á, à, â should become "a"
é, ê should become "e"
Ç should become "C"
It could be done by chaining a million .replace(...) calls, but I'm looking for a more elegant solution. The difficulty is finding out which ordinary character belongs to which extended character. I can easily see that an á is an extension of an a. But how do I automate this step?
Why I want to do this:
I have an interface between two applications. Application One provides data that contains said accents. Application Two can only work with data that matches [a-zA-Z].
You can use the library latinize, installable through:
npm install latinize
Since you are using TypeScript, you can also install its type definitions:
npm install @types/latinize
Usage:
var latinize = require('latinize');
latinize('ỆᶍǍᶆṔƚÉ áéíóúýčďěňřšťžů'); // => 'ExAmPlE aeiouycdenrstzu'
Internally, it uses a regex with a callback function to replace every character that is not a basic Latin letter or an Arabic numeral:
function latinize(str) {
  if (typeof str === 'string') {
    return str.replace(/[^A-Za-z0-9]/g, function(x) {
      return latinize.characters[x] || x;
    });
  } else {
    return str;
  }
}
It finds the replacement for each character in a predefined lookup table, latinize.characters.
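The same lookup-table idea can be sketched in TypeScript without the library. This is only an illustration: accentMap and replaceAccents are hypothetical names, and the table holds just the characters from the question, whereas a real table such as the one latinize ships is far larger.

```typescript
// Hypothetical, deliberately tiny lookup table (keys written as escapes
// so the source file's own encoding cannot change them).
const accentMap = new Map<string, string>([
  ["\u00E1", "a"], // á
  ["\u00E0", "a"], // à
  ["\u00E2", "a"], // â
  ["\u00E9", "e"], // é
  ["\u00EA", "e"], // ê
  ["\u00C7", "C"], // Ç
]);

// Replace every character outside [A-Za-z0-9] with its mapped value.
// Characters that are not in the table are returned unchanged.
function replaceAccents(input: string): string {
  return input.replace(/[^A-Za-z0-9]/g, (ch) => accentMap.get(ch) ?? ch);
}

console.log(replaceAccents("\u00C7\u00E0 \u00E9v\u00EA")); // "Ca eve"
```

Anything missing from the table passes through untouched, so the output is only guaranteed to be plain Latin if the table covers every character Application One can send.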
In the end, this solution is also a search-and-replace approach. I know you want to automate the discovery of the characters, yet character encoding doesn't work that way.
The computer, and hence JavaScript, is unaware of the design and the meaning of a character. To the computer, a character is nothing but an assigned number (a code point) that identifies a symbol. The assignment is fairly arbitrary, with little internal consistency.
So even though you know that â should relate to a by its design, the computer only knows that in Unicode it has the code point U+00E2. You want it to be U+0061, though.
Yet there is no connection just from knowing the number. You would have to compare the symbols visually, and that is hardly possible, especially with very similar-looking symbols, e.g. Α (U+0391, Greek capital alpha) versus A (U+0041).
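To see those numbers for yourself, a small sketch like the following prints the code point of a character. toCodePoint is a hypothetical helper name; codePointAt is the standard string method.

```typescript
// Format the first code point of a string in the usual U+XXXX notation.
function toCodePoint(ch: string): string {
  const cp = ch.codePointAt(0);
  if (cp === undefined) {
    throw new Error("expected a non-empty string");
  }
  const hex = cp.toString(16).toUpperCase();
  // Left-pad to at least four hex digits.
  return "U+" + "0000".slice(hex.length) + hex;
}

console.log(toCodePoint("\u00E2")); // â: U+00E2
console.log(toCodePoint("a"));      // U+0061
console.log(toCodePoint("\u0391")); // Greek capital alpha: U+0391
console.log(toCodePoint("A"));      // U+0041
```

Nothing in U+00E2 points to U+0061, and nothing in U+0391 reveals that it is drawn almost exactly like U+0041.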
There is no way to compute meaning. You'll have to map an extended character to its Latin counterpart yourself (or via the help of a library).
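One mapping you do not have to maintain yourself is the one built into the runtime: Unicode defines canonical decompositions for many accented letters, and the standard String.prototype.normalize("NFD") applies them. It is still a lookup table, just one that ships with the engine. The sketch below assumes that removing the combining marks in the range U+0300 to U+036F is enough for your data. stripDiacritics and isPlainLatin are hypothetical helper names.

```typescript
// Decompose each character into base letter + combining marks,
// then drop the combining marks in U+0300..U+036F.
function stripDiacritics(input: string): string {
  return input.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Application Two only accepts [a-zA-Z], so verify the result
// instead of assuming every character could be converted.
function isPlainLatin(input: string): boolean {
  return /^[a-zA-Z]*$/.test(input);
}

// á à â é ê Ç
console.log(stripDiacritics("\u00E1\u00E0\u00E2\u00E9\u00EA\u00C7")); // "aaaeeC"

// ø (U+00F8) has no canonical decomposition, so it survives unchanged.
console.log(isPlainLatin(stripDiacritics("\u00F8"))); // false
```

Letters without a decomposition, such as ø, are left as they are, so an explicit table or a library like latinize is still needed to cover those.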