pythonutf-8python-unicodewindows-1252cp1252

Correctly reading text from Windows-1252(cp1252) file in python


so okay, as the title suggests the problem I have is with correctly reading input from a windows-1252 encoded file in python and inserting said input into SQLAlchemy-MySql table.

The current system setup:
Windows 7 VM with "Roger Access Control System" which outputs the file;
Ubuntu 12.04 LTS VM with a shared-folder to the Windows system so I can access the file, using "Python 2.7.3".

Now to the actual problem, for the input file I have a "VM shared-folder" that contains a file that is genereate on a Windows 7 system through Roger Access Control System(roger.pl for more details), this file is called "PREvents.csv" which suggests to it's contents, a ";" seperated list of data.

An example format of the data:

2013-03-19;15:58:30;100;Jānis;Dumburs;1;Uznemums1;0;Ieeja;
2013-03-19;15:58:40;100;Jānis;Dumburs;1;Uznemums1;2;Izeja;

The 4th field contains the card owners name and 5th contains the owners lastname, the 6th contains the owners assigned group.

The issue comes from the fact that any one of the 3 above mentioned fields can contain characters specific to Latvian language, in the example file the word "Jānis" contains the letter "ā" which in unicode is 257.

As I'm used to, I open the file as such:

try:
    f = codecs.open(file, 'rb', 'cp1252')
except IOError:
    f = codecs.open(file, 'wb', 'cp1252')

So far, everything works - it opens the file and so I move on to iterate over each line of the file(this is a continuos running script so pardon the loop):

while True:
    line = f.readline()

    if not line:
        # Pause loop for 1 second
        time.sleep(1)
    else:
        # Split the line into list
        date, timed, userid, firstname, lastname, groupid, groupname, typed, pointname, empty = line.split(';')

And this is where the issues start, if I print repr(firstname) it prints u'J\xe2nis' which is, as far as I undestand, not correct - `\xe2\ does not represent the Latvian character "ā".
Further down the loop depending on event type I assign the variables to SQLAlchemy object and insert/update:

if typed == '0':  # Entry type
    event = Events(
        period,
        fullname,
        userid,
        groupname,
        timestamp,
        0,
        0
    )
    session.add(event)
else:  # Exit type
    event = session.query(Events).filter(
        Events.period == period,
        Events.exit == 0,
        Events.userid == userid
    ).first()
    if event is not None:
        event.exit = timestamp
        event.spent = timestamp - event.entry

# Commit changes to database
session.commit()

In my search for answers I've found how to define the default encoding to use:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Which hasn't helped me in any way.

Basically, this is all leads to the me not being able to insert the correct owners First/last name aswell as owners assigned groupname if they contain any of Latvian-specific characters, for example:

Instead of the character "ā" it inserts "â"

I'd also like to add that I cannot change the "PREvents.csv" file encoding and the "RACS" system does not support inserting into UTF-8 or Unicode files - if you try either way, the system inserts random symbols for the Latvian-specific characters.

Please let me now if any other information is needed, I'll gladly provide it :)

Any help would be highly appreciated.


Solution

  • CP1252 cannot represent ā; your input contains the similar character â. repr just displays an ASCII representation of a unicode string in Python 2.x:

    >>> print(repr(b'J\xe2nis'.decode('cp1252')))
    u'J\xe2nis'
    >>> print(b'J\xe2nis'.decode('cp1252'))
    Jânis