pythondataframemysql-connectorvarbinary

Printing varbinary data in Python retrieve from mysql-connector query


Thanks for all the knowledged shared within Stackoverflow.

I have an issue printing varbinary data retrieved from a MySQL database using Python3 and the mysql_connector library. The varbinary column (RawTxt) is stored as varbinary in MySQL and shows up like a list of numbers within brackets [] when retrieved in Python.

Would I cast it first then print? I tried adding " charset='utf8' " parameter to the connection variable to no avail. Not sure if it is even a charset / codec problem. This seems a little specialized for I can't find much on the internet about it.

Below is the code, a print using df.head(), and then what it looks like when sent to a .csv file with df.to_csv

Code:

import mysql.connector as mysql
import pandas as pd

dbconn = mysql.connect(
    host = '1234.com',
    user = 'root',
    password = '1234',
    database = 'tq_verbatims'
)

dbcursor = dbconn.cursor()
dbcursor.execute("""
SELECT a.RawID, b.RawTxt, c.GibberishOdds, a.cntResp 
FROM tq_verbatims.tblRawSpellingsCoded a, tq_verbatims.tblRawSpellings b, tq_verbatims.tblGibberishOdds c 
WHERE a.RawID = b.RawID AND b.SanitizedRawID = c.SanitizedRawID;
"""
)

tblRawSpellingCoded = dbcursor.fetchall()
df = pd.DataFrame(tblRawSpellingCoded, columns=dbcursor.column_names)
df.to_csv (r'C:\Users\export_dataframe.csv', index = False, header=True)
df.head()

df.head() result:

  RawID   RawTxt      GibberishOdds   cntResp
0   116 [40, 85, 78, 75, 78, 79, 87, 78, 41]    0.050513    1
1   237 [46, 46, 46, 46, 46, 46]    2.718280    1
2   284 [46, 78, 79, 84, 32, 83, 85, 82, 69]    0.079919    1
3   585 [51, 32, 77]    2.718280    6
4   658 [51, 77]    2.718280    403

Within the .csv file, the RawTxt field shows as bytearray(b'(UNKOWN)') in the .csv file output where it should have just had the text 'UNKNOWN'. The RawTxt actual data for those 4 rows is 'UNKNOWN', '......', 'NOT SURE', '3 M'

RawID, RawTxt, GibberishOdds, cntResp
116,bytearray(b'(UNKNOWN)'),0.0505134,1
237,bytearray(b'......'),2.71828,1
284,bytearray(b'.NOT SURE'),0.0799194,1
585,bytearray(b'3 M'),2.71828,6

Thanks in advance.


Solution

  • The reason the RawTxt field is being exported as bytearray is because the cursor is returning the values in that field as bytearray values. Without knowing the exact schema of your tables, it's difficult to tell why RawTxt is converted to bytearray by the mysql-connector package.

    In any case, if you are certain that RawTxt contains valid string values, then you can convert the values to a string before or after you've passed the data to Pandas.

    Before Pandas:
    import mysql.connector as mysql
    import pandas as pd
    
    dbconn = mysql.connect(
        host = '1234.com',
        user = 'root',
        password = '1234',
        database = 'tq_verbatims'
    )
    
    dbcursor = dbconn.cursor()
    dbcursor.execute("""
    SELECT a.RawID, b.RawTxt, c.GibberishOdds, a.cntResp 
    FROM tq_verbatims.tblRawSpellingsCoded a, tq_verbatims.tblRawSpellings b, tq_verbatims.tblGibberishOdds c 
    WHERE a.RawID = b.RawID AND b.SanitizedRawID = c.SanitizedRawID;
    """
    )
    tblRawSpellingCoded = []
    for row in dbcursor.fetchall():
        row[1] = row[1].decode('utf8')
        tblRawSpellingCoded.append(row)
    df = pd.DataFrame(tblRawSpellingCoded, columns=dbcursor.column_names)
    df.to_csv (r'C:\Users\export_dataframe.csv', index = False, header=True)
    df.head()
    
    After Pandas:
    import mysql.connector as mysql
    import pandas as pd
    
    dbconn = mysql.connect(
        host = '1234.com',
        user = 'root',
        password = '1234',
        database = 'tq_verbatims'
    )
    
    dbcursor = dbconn.cursor()
    dbcursor.execute("""
    SELECT a.RawID, b.RawTxt, c.GibberishOdds, a.cntResp 
    FROM tq_verbatims.tblRawSpellingsCoded a, tq_verbatims.tblRawSpellings b, tq_verbatims.tblGibberishOdds c 
    WHERE a.RawID = b.RawID AND b.SanitizedRawID = c.SanitizedRawID;
    """
    )
    
    tblRawSpellingCoded = dbcursor.fetchall()
    df = pd.DataFrame(tblRawSpellingCoded, columns=dbcursor.column_names)
    df.RawTxt = df.RawTxt.str.decode('utf8')
    df.to_csv (r'C:\Users\export_dataframe.csv', index = False, header=True)
    df.head()
    

    Either of the solutions should convert the RawTxt values from bytearray values to string, with the assumption that the values are valid string values.