sqloracle-databaseutf-8symbolscharacter-set

How to identify non utf8 symbols in a range of text?


In our database we have Latin and Cyrillic alphabets and all is fine with those characters. But, sometimes people enter characters that appear as ? or squares. Is there a way to find those specific symbols, because looking manually through a couple of tables with >400 000 records for 1 mistake is insane.

I have looked for online solutions that I can load my data as text, I have tried notepad++, but either I get all Cyrillic marked or I don't get anything.

EDIT: L.􀈉.D. is an example.


Solution

  • Based on the ASCII value of the character you can filter the text.Here i am assuming all characters greater than 65534 as outside the normal range.You can modify the range according to your requirement.The db fiddle here

     with test (col) as (
      select 
        'L.A.D' 
      from 
        dual 
      union all 
      select 
        'L.􀈉.D.' 
      from 
        dual
    ) 
    select 
      col, 
      case when max(ascii_of_one_character) >= 65535 then 'NOT OK' else 'OK' end result 
    from 
      (
        select 
          col, 
          substr(col, column_value, 1) one_character, 
          ascii(
            substr(col, column_value, 1)
          ) ascii_of_one_character 
        from 
          test cross 
          join table(
            cast(
              multiset(
                select 
                  level 
                from 
                  dual connect by level <= length(col)
              ) as sys.odcinumberlist
            )
          )
      ) 
    group by 
      col
      having max(ascii_of_one_character) >= 65535;