I am trying to convert the text files in my Azure blob container from ANSI to UTF-8 encoding, without downloading the files locally, using Python. I get an error when I try to import BlockBlobService in my Python code to work with Azure Blob Storage. I believe I have the correct Python modules installed, but a module I am not aware of may be missing, or one of the installed modules may be the wrong version. The "pip list" command shows the following on my VM. Any help would be appreciated.
pip list
Package Version
azure-common 1.1.25
azure-core 1.4.0
azure-nspkg 3.0.2
azure-storage 0.36.0
azure-storage-blob 12.3.0
azure-storage-common 2.1.0
azure-storage-nspkg 3.1.0
bcrypt 3.1.7
certifi 2020.4.5.1
cffi 1.14.0
chardet 3.0.4
cryptography 2.9
idna 2.9
isodate 0.6.0
msrest 0.6.13
oauthlib 3.1.0
paramiko 2.7.1
pip 20.0.2
pycparser 2.20
PyNaCl 1.3.0
python-dateutil 2.8.1
requests 2.23.0
requests-oauthlib 1.3.0
setuptools 41.2.0
six 1.14.0
urllib3 1.25.8
wheel 0.34.2
If the blob's encoding is not UTF-8, the storage service will not change it for you. You said you want to use create_blob_from_text, so I assume your text file is not UTF-8 and you want to convert it to UTF-8 when you upload it.
First, note that if your text file is already UTF-8, you don't need to change anything: just upload it and it will still be UTF-8. However, if your file is not UTF-8, the upload will not convert it for you. create_blob_from_text simply encodes the string it is given, so you have to decode the file with its original encoding first and then let the upload encode the result as UTF-8. Once this is clear, you will know how to upload your file to Azure Blob Storage with UTF-8 encoding.
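The point above can be checked locally without touching Azure. The following is a minimal sketch, where `transcode_to_utf8` is a hypothetical helper name and GBK is only the example source encoding: the bytes are decoded with their original codec first and only then encoded as UTF-8.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes with their original codec, then re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Two Chinese characters, standing in for the content of a GBK file.
gbk_bytes = "\u4f60\u597d".encode("gbk")
utf8_bytes = transcode_to_utf8(gbk_bytes, "gbk")

# Same text, different byte sequences: GBK uses 2 bytes per character here,
# UTF-8 uses 3.
print(len(gbk_bytes), len(utf8_bytes))  # 4 6
```

Passing the wrong `source_encoding` either raises `UnicodeDecodeError` or silently produces the wrong characters, so the original encoding has to be known beforehand.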
For example, below I upload a text file whose encoding is GBK.
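# Assumes block_blob_service is a BlockBlobService from the legacy SDK (azure-storage-blob 2.1.0 or earlier).
with open('D:/hello.txt', encoding='gbk') as f:  # decode the GBK text while reading
    txt = f.read()  # read the whole file as a str
charset = 'UTF-8'
# create_blob_from_text encodes the str with `encoding` before uploading it.
block_blob_service.create_blob_from_text(container_name='test', blob_name='test-gbk.txt', text=txt, encoding=charset)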
In the picture below, the left side is the original file with GBK encoding, and the right side is the file downloaded from the Azure blob, which is encoded as UTF-8.
Update: I read the text file into a BytesIO buffer and then upload it with the code below. You can ignore the latin-1 step.
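from io import BytesIO

# Reading as latin-1 maps each byte to one character, so encoding back to ISO-8859-1 gives back the original byte values.
with open('E:/test.txt', encoding='latin-1') as f:
    text = f.read()
# 'ANSI' is a Windows-only codec alias for the system code page; elsewhere name the codec explicitly (e.g. 'cp1252').
buf = BytesIO(text.encode('ISO-8859-1').decode('ANSI').encode('UTF-8'))
block_blob_service.create_blob_from_stream(container_name='test', blob_name='test.txt', stream=buf)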
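On the import error itself: the `pip list` output in the question shows azure-storage-blob 12.3.0. `BlockBlobService` belongs to the legacy SDK (azure-storage-blob 2.1.0 and earlier) and is not part of the 12.x package, which would explain the failed import; the 12.x entry point is `BlobServiceClient`. Below is a minimal sketch of the same conversion with the 12.x client, done in memory without writing a local file. The connection string, the container and blob names, the helper names, and `cp1252` as the meaning of "ANSI" are all assumptions for illustration.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode with the original codec and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_blob_to_utf8(conn_str: str, container: str, blob_name: str,
                         source_encoding: str = "cp1252") -> None:
    """Rewrite one blob in place as UTF-8, keeping the data in memory only."""
    # Imported here so the pure helper above works without the SDK installed.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(conn_str)
    blob = service.get_blob_client(container=container, blob=blob_name)

    raw = blob.download_blob().readall()  # blob content as bytes, in memory
    blob.upload_blob(to_utf8(raw, source_encoding), overwrite=True)


# Local check of the transcoding step (no Azure call is made here).
print(to_utf8(b"caf\xe9"))  # b'caf\xc3\xa9'
```

The bytes still travel to the machine running the script and back, but nothing is written to disk. For very large blobs a chunked approach would be needed instead of `readall()`.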