php character-encoding unicode-string

How to convert strange strong/bold Unicode characters to non-bold UTF-8 characters in PHP?


I'm trying to store a tweet in my database using the Twitter API, but I get these strange characters, which seem to be "natural" bold characters.

NORMAL CHARS:

azertyuio

STRANGE CHARS:

𝘀𝗲𝘁 𝗶𝘀 𝗿𝗲𝗮𝗱𝘆 𝗳𝗼𝗿 𝘁𝗵𝗲 𝗱𝗶𝘀𝗰𝘂𝘀𝘀𝗶𝗼𝗻!!

If I paste the bold characters into my NetBeans editor, I get something like square characters...

I've never seen that before. Could you help me convert this text to non-bold characters in PHP?


Solution

  • This is one of the reasons for using a Unicode encoding such as UTF-8 (or HTML entities) rather than ANSI. Unicode allows you to store and display characters like these (and those from other languages), handle searches when someone inputs these characters in those languages/charsets (which will only match things written in those same characters), and so on.

    The alternative would be for you to write a "conversion" for every odd character set that people choose to use. Still, converting these is possible to do -- you'll just need to decide whether it is really worth your time.

    The characters you submitted are called Mathematical Sans-Serif Bold characters. You can find the list here at w3.org. There are also regular, italic, and bold italic variations of these (use the previous and next links at the top of that page).

    The problem you will encounter is that, unlike switching capitalized characters to lowercase (add 32 to the decimal value, or chr(ord($x) + 32)), there is no single decimal offset that converts every Mathematical Bold character to its ANSI equivalent; each character group needs its own. In addition, ord() and chr() will not work for these characters, because they operate on single bytes and each of these characters takes four bytes in UTF-8. The multibyte counterparts are mb_ord() and mb_chr() (PHP 7.2+ with mbstring).

    Example:

    𝗮 is 120302, a is 97. 120302 - 97 = 120205
    𝗔 is 120276, A is 65. 120276 - 65 = 120211
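
    The numbers above can be checked with a short script. This is only a sketch, assuming PHP 7.2+ with the mbstring extension; the variable names are illustrative. It also shows why ord() is not usable here: it returns only the first byte of the four-byte UTF-8 sequence.

    ```php
    <?php
    // Assumes PHP 7.2+ with mbstring. The "\u{...}" escapes avoid pasting
    // characters that some editors render as squares.
    $boldLowerA = "\u{1D5EE}"; // MATHEMATICAL SANS-SERIF BOLD SMALL A
    $boldUpperA = "\u{1D5D4}"; // MATHEMATICAL SANS-SERIF BOLD CAPITAL A

    // ord() is byte-based: it sees only the first of the 4 UTF-8 bytes.
    $byteLength = strlen($boldLowerA);
    $firstByte  = ord($boldLowerA);

    // mb_ord() decodes the whole sequence into a code point.
    $lowerCode = mb_ord($boldLowerA, 'UTF-8');
    $upperCode = mb_ord($boldUpperA, 'UTF-8');

    // The offset to plain ASCII differs between the two subsets.
    $lowerOffset = $lowerCode - ord('a');
    $upperOffset = $upperCode - ord('A');

    printf("bytes: %d, ord: %d, mb_ord: %d\n", $byteLength, $firstByte, $lowerCode);
    printf("offsets: %d (a-z) vs %d (A-Z)\n", $lowerOffset, $upperOffset);
    // prints:
    // bytes: 4, ord: 240, mb_ord: 120302
    // offsets: 120205 (a-z) vs 120211 (A-Z)
    ```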

    Thus, subtracting 120205 would give you the correct lowercase a for 𝗮; however, the same would not work for 𝗔. That means you would have to determine which charset the character belongs to (Mathematical Bold, Slanted Mathematical, etc.), identify the subset it belongs to (a-z, A-Z, 0-9), then use the corresponding offset you calculated to correct it. To do that, you have to check every character of every tweet for characters that fit one of your supported conversion charsets, then convert those characters to plain letters.
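
    As a rough illustration of the range-and-offset approach just described, here is a minimal sketch that handles Mathematical Sans-Serif Bold only. The function name unboldSansSerif() is hypothetical, the code assumes PHP 7.4+ with mbstring and valid UTF-8 input, and every other style (italic, bold italic, and so on) would need its own ranges added.

    ```php
    <?php
    // Sketch only: covers Mathematical Sans-Serif Bold letters and digits.
    // Assumes PHP 7.4+ (mb_str_split) with mbstring and valid UTF-8 input.
    function unboldSansSerif(string $text): string
    {
        // [first code point, last code point, ASCII character the range maps onto]
        $ranges = [
            [0x1D5D4, 0x1D5ED, 'A'], // bold sans-serif A-Z
            [0x1D5EE, 0x1D607, 'a'], // bold sans-serif a-z
            [0x1D7EC, 0x1D7F5, '0'], // bold sans-serif 0-9
        ];

        $out = '';
        // Walk the string character by character, not byte by byte.
        foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
            $codePoint = mb_ord($char, 'UTF-8');
            foreach ($ranges as [$first, $last, $asciiStart]) {
                if ($codePoint >= $first && $codePoint <= $last) {
                    // Keep the position within the subset, change the base.
                    $char = chr(ord($asciiStart) + ($codePoint - $first));
                    break;
                }
            }
            $out .= $char; // anything outside the ranges is left untouched
        }
        return $out;
    }

    // "set" written in Mathematical Sans-Serif Bold, followed by plain "!!".
    echo unboldSansSerif("\u{1D600}\u{1D5F2}\u{1D601}!!"), PHP_EOL; // prints "set!!"
    ```

    Storing the start of each range rather than a precomputed offset keeps the table readable; the subtraction inside the loop is the same arithmetic as in the example above.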

    That might be worth doing if there are a large number of tweets using Mathematical Bold only, but if you're importing large sets of tweets that can contain all sorts of potential characters, you're in for a lot of work.

    If you think it is worthwhile, the first thing you'll need to do is look at the raw character encoding you're receiving from the API and determine whether it needs to be converted. Then decide whether you want to map between charsets using an array of characters, use a range of values for the subsets, or use some other method. You also need to decide how you'll scan for those characters.
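
    For the array-mapping option, one possible shape is sketched below: build a lookup table once, scan each tweet with a regular expression, and let strtr() replace whatever matches. The function names buildSansBoldMap() and hasMathAlphanumerics() are made up for this example, the table again covers Mathematical Sans-Serif Bold only, and PHP 7.2+ with mbstring and UTF-8 input are assumed.

    ```php
    <?php
    // Sketch only. Assumes PHP 7.2+ with mbstring and UTF-8 input.

    // Build ["𝗔" => "A", ..., "𝗮" => "a", ..., "𝟬" => "0", ...] once.
    function buildSansBoldMap(): array
    {
        $map = [];
        // [first code point, ASCII start character, number of characters]
        $subsets = [[0x1D5D4, 'A', 26], [0x1D5EE, 'a', 26], [0x1D7EC, '0', 10]];
        foreach ($subsets as [$start, $ascii, $count]) {
            for ($i = 0; $i < $count; $i++) {
                $map[mb_chr($start + $i, 'UTF-8')] = chr(ord($ascii) + $i);
            }
        }
        return $map;
    }

    // Cheap scan: does the text contain anything from the Mathematical
    // Alphanumeric Symbols block (U+1D400..U+1D7FF)?
    function hasMathAlphanumerics(string $text): bool
    {
        return preg_match('/[\x{1D400}-\x{1D7FF}]/u', $text) === 1;
    }

    $map   = buildSansBoldMap();
    $tweet = "\u{1D5FF}\u{1D5F2}\u{1D5EE}\u{1D5F1}\u{1D606}!!"; // bold "ready" + "!!"

    // Only run the replacement on tweets that need it.
    $clean = hasMathAlphanumerics($tweet) ? strtr($tweet, $map) : $tweet;
    echo $clean, PHP_EOL; // prints "ready!!"
    ```

    The scan matches the whole block, while the map covers only one style, so characters from unsupported styles pass through unchanged until their ranges are added to the table.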

    All in all, the answer to your question is that it is possible to convert them, but your situation and particulars are going to determine whether it is worthwhile and how you accomplish it. It's not something that can be written for you.