I have a CSV file whose rows contain newline characters (\n). When I read the file from Cloud Storage using Apache Beam's TextIO.read(), each \n is treated as the start of a new record. How can I overcome this issue?
I have tried extending FileBasedSource, but it reads only the first line of the CSV file when we apply PTransforms.
Any help will be appreciated.
Thanks in advance.
TextIO cannot do this: it always splits input on line breaks and is not aware of the CSV-specific quoting that protects some of them.
However, Beam 2.2 includes a transform, FileIO, that makes it very easy to write the CSV-specific (or any other file-format-specific) reading code yourself. Do something like this:
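    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.channels.Channels;
    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    p.apply(FileIO.match().filepattern("gs://..."))
     .apply(FileIO.readMatches())
     .apply(ParDo.of(new DoFn<FileIO.ReadableFile, TableRow>() {
       @ProcessElement
       public void process(ProcessContext c) throws IOException {
         // Each element is one whole file, so quoted line breaks are never split apart.
         try (InputStream is = Channels.newInputStream(c.element().open())) {
           // ... Use your favorite Java CSV library ...
           // ... c.output(next csv record) ...
         }
       }
     }));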
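The "use your favorite Java CSV library" step is where the quoted line breaks actually get handled. In practice a library such as Apache Commons CSV is probably the better choice, but to show the idea without any dependency, here is a minimal sketch of a parser that keeps line breaks inside double-quoted fields. The class name QuotedCsvParser and its parse method are hypothetical (they are not part of Beam), and the sketch assumes RFC 4180-style input: comma delimiter, double-quote quoting, and a doubled quote ("") as an escaped quote.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Beam: a minimal RFC 4180-style CSV parser.
class QuotedCsvParser {

    // Reads all records from the reader. Line breaks inside double-quoted
    // fields are kept as part of the field instead of ending the record.
    static List<List<String>> parse(Reader reader) throws IOException {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;     // currently inside a quoted field
        boolean pendingQuote = false; // just saw '"' while inside quotes
        boolean hasContent = false;   // current record is not an empty line
        int ch;
        while ((ch = reader.read()) != -1) {
            char c = (char) ch;
            if (pendingQuote) {
                pendingQuote = false;
                if (c == '"') {
                    field.append('"'); // "" inside quotes is a literal quote
                    continue;
                }
                inQuotes = false;      // the previous quote closed the field
            }
            if (inQuotes) {
                if (c == '"') {
                    pendingQuote = true;
                } else {
                    field.append(c);   // includes \n and \r inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;
                hasContent = true;
            } else if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
                hasContent = true;
            } else if (c == '\n') {
                // Unquoted newline: end of record (empty lines are skipped).
                if (hasContent) {
                    record.add(field.toString());
                    records.add(record);
                }
                record = new ArrayList<>();
                field.setLength(0);
                hasContent = false;
            } else if (c == '\r') {
                // Dropped outside quotes, so \r\n behaves like \n.
            } else {
                field.append(c);
                hasContent = true;
            }
        }
        // Flush the last record if the input does not end with a newline.
        if (hasContent) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }
}
```

Inside the DoFn above, one way to use it would be to wrap the stream in a `BufferedReader` over an `InputStreamReader` (with the file's charset, for example UTF-8), call `QuotedCsvParser.parse(...)`, and emit each record with `c.output(...)` after converting it to the output type. Note that this sketch holds all records of one file in memory, so for very large files a streaming library is likely a better fit.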