Tags: sql, postgresql, encoding, utf-8, windows-1255

How to import from a mixed-encoding file to a PostgreSQL table


I have a 30 GB text file. The file is encoded as UTF-8, but it also contains some Windows-1252 characters. When I try to import it, I get the following error:

ERROR:  invalid byte sequence for encoding "UTF8": 0x9b

How can I fix this?

The file is already in UTF-8 format: when I run the 'file' command on it, it reports the encoding as UTF-8. However, it also contains some byte sequences that are not valid UTF-8. For example, when I run the \copy command, after a while it gives the error above for this row:

0B012234    Basic study of <img src="/fulltext-image.asp?format=htmlnonpaginated&src=323K744431152658_html\233_2    basic study of img src fulltext image asp format htmlnonpaginated src 323k744431152658_html 233_2   1975        Semigroup Forum semigroup forum 04861B53        19555

Solution

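  • The issue is caused by the backslash (\) in the data, not by the file's encoding.
    In COPY's default text format, a backslash followed by octal digits is an escape for the byte with that value, so \233 in the row above is read as byte 0x9b, which is not valid UTF-8.
    Use CSV format, which does not treat backslash as a special character, for example: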

    \copy t from myfile.txt with csv quote E'\x1' delimiter E'\x2'
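
To see where the 0x9b in the error message comes from, here is a minimal Python sketch. It mimics only one rule of COPY's text format (a backslash followed by one to three octal digits stands for the byte with that value) and is not a full reimplementation of the parser. The helper name decode_octal_escapes is hypothetical and exists only for this illustration; the sample bytes are taken from the failing row in the question.

```python
import re

def decode_octal_escapes(raw: bytes) -> bytes:
    """Illustration only: apply the backslash + octal-digits rule
    of COPY text format, and nothing else."""
    return re.sub(
        rb"\\([0-7]{1,3})",
        # Octal digits -> integer -> single byte (masked to 8 bits).
        lambda m: bytes([int(m.group(1), 8) & 0xFF]),
        raw,
    )

# Fragment of the row that triggers the error.
sample = rb"323K744431152658_html\233_2"
decoded = decode_octal_escapes(sample)
print(decoded)

# Octal 233 is byte 0x9b, which cannot start a UTF-8 sequence.
try:
    decoded.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError as exc:
    is_valid_utf8 = False
    print("invalid UTF-8:", exc)
```

The raw file bytes are valid UTF-8 on their own, which is why 'file' reports UTF-8. The invalid byte only appears after the escape is interpreted, and CSV format skips that interpretation.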