I am "playing" with Apache Beam/Dataflow in Datalab. I am trying to read a CSV file from GCS. When I create the PCollection using:
lines = p | 'ReadMyFile' >> beam.io.ReadFromText('gs://' + BUCKET_NAME + '/' + input_file, coder='StrUtf8Coder')
I get the following error:
LookupError: unknown encoding: "THE","NAME","OF","COLUMNS"
It seems the column names are being interpreted as the encoding?
I do not understand what's wrong. If I do not specify the "coder", I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 1045: invalid continuation byte
Outside Apache Beam I am able to handle this error by reading the file from GCS:
blob = storage.Blob(gs_path, bucket)
data = blob.download_as_string()
data.decode('utf-8', 'ignore')
I read that Apache Beam only supports UTF-8, and the file is not entirely valid UTF-8.
Should I download the file and then convert it to a PCollection?
Any suggestions?
I would suggest changing the encoding of the actual file. If you save the file with "Save as", you can select UTF-8 as the encoding, both for Excel CSVs and for regular .txt files. Once you do that, make sure you add a step like:
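import apache_beam as beam

class DoWork(beam.DoFn):
    def process(self, text):
        # Encode the element itself, not the whole PCollection.
        text = text.encode('utf-8')
        # Do other stuff, then emit the result.
        yield text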
This isn't how I would like to do it because it isn't code-centric, but it has worked for me before. Unfortunately, I don't have a code-centric solution.