Thanks for all the knowledged shared within Stackoverflow.
I have an issue printing varbinary data retrieved from a MySQL database using Python3 and the mysql_connector library. The varbinary column (RawTxt) is stored as varbinary in MySQL and shows up like a list of numbers within brackets [] when retrieved in Python.
Would I cast it first then print? I tried adding " charset='utf8' " parameter to the connection variable to no avail. Not sure if it is even a charset / codec problem. This seems a little specialized for I can't find much on the internet about it.
Below is the code, a print using df.head(), and then what it looks like when sent to a .csv file with df.to_csv
Code:
import mysql.connector as mysql
import pandas as pd
dbconn = mysql.connect(
host = '1234.com',
user = 'root',
password = '1234',
database = 'tq_verbatims'
)
dbcursor = dbconn.cursor()
dbcursor.execute("""
SELECT a.RawID, b.RawTxt, c.GibberishOdds, a.cntResp
FROM tq_verbatims.tblRawSpellingsCoded a, tq_verbatims.tblRawSpellings b, tq_verbatims.tblGibberishOdds c
WHERE a.RawID = b.RawID AND b.SanitizedRawID = c.SanitizedRawID;
"""
)
tblRawSpellingCoded = dbcursor.fetchall()
df = pd.DataFrame(tblRawSpellingCoded, columns=dbcursor.column_names)
df.to_csv (r'C:\Users\export_dataframe.csv', index = False, header=True)
df.head()
df.head() result:
RawID RawTxt GibberishOdds cntResp
0 116 [40, 85, 78, 75, 78, 79, 87, 78, 41] 0.050513 1
1 237 [46, 46, 46, 46, 46, 46] 2.718280 1
2 284 [46, 78, 79, 84, 32, 83, 85, 82, 69] 0.079919 1
3 585 [51, 32, 77] 2.718280 6
4 658 [51, 77] 2.718280 403
Within the .csv file, the RawTxt field shows as bytearray(b'(UNKOWN)') in the .csv file output where it should have just had the text 'UNKNOWN'. The RawTxt actual data for those 4 rows is 'UNKNOWN', '......', 'NOT SURE', '3 M'
RawID, RawTxt, GibberishOdds, cntResp
116,bytearray(b'(UNKNOWN)'),0.0505134,1
237,bytearray(b'......'),2.71828,1
284,bytearray(b'.NOT SURE'),0.0799194,1
585,bytearray(b'3 M'),2.71828,6
Thanks in advance.
The reason the RawTxt
field is being exported as bytearray
is because the cursor
is returning the values in that field as bytearray
values. Without knowing the exact schema of your tables, it's difficult to tell why RawTxt
is converted to bytearray
by the mysql-connector
package.
In any case, if you are certain that RawTxt
contains valid string values, then you can convert the values to a string before or after you've passed the data to Pandas
.
Pandas
:import mysql.connector as mysql
import pandas as pd
dbconn = mysql.connect(
host = '1234.com',
user = 'root',
password = '1234',
database = 'tq_verbatims'
)
dbcursor = dbconn.cursor()
dbcursor.execute("""
SELECT a.RawID, b.RawTxt, c.GibberishOdds, a.cntResp
FROM tq_verbatims.tblRawSpellingsCoded a, tq_verbatims.tblRawSpellings b, tq_verbatims.tblGibberishOdds c
WHERE a.RawID = b.RawID AND b.SanitizedRawID = c.SanitizedRawID;
"""
)
tblRawSpellingCoded = []
for row in dbcursor.fetchall():
row[1] = row[1].decode('utf8')
tblRawSpellingCoded.append(row)
df = pd.DataFrame(tblRawSpellingCoded, columns=dbcursor.column_names)
df.to_csv (r'C:\Users\export_dataframe.csv', index = False, header=True)
df.head()
Pandas
:import mysql.connector as mysql
import pandas as pd
dbconn = mysql.connect(
host = '1234.com',
user = 'root',
password = '1234',
database = 'tq_verbatims'
)
dbcursor = dbconn.cursor()
dbcursor.execute("""
SELECT a.RawID, b.RawTxt, c.GibberishOdds, a.cntResp
FROM tq_verbatims.tblRawSpellingsCoded a, tq_verbatims.tblRawSpellings b, tq_verbatims.tblGibberishOdds c
WHERE a.RawID = b.RawID AND b.SanitizedRawID = c.SanitizedRawID;
"""
)
tblRawSpellingCoded = dbcursor.fetchall()
df = pd.DataFrame(tblRawSpellingCoded, columns=dbcursor.column_names)
df.RawTxt = df.RawTxt.str.decode('utf8')
df.to_csv (r'C:\Users\export_dataframe.csv', index = False, header=True)
df.head()
Either of the solutions should convert the RawTxt
values from bytearray
values to string
, with the assumption that the values are valid string values.