[SOLVED] Records are not ingested correctly through MLCP when special characters are found in the CSV

I am ingesting data into MarkLogic using MLCP, but many records are skipped because of invalid characters in the file.

Is there any way to ignore the invalid characters and ingest all records present in the CSV without skipping records?

Below is the error message in the logs:

WARN Skipped record: abc.csv at line 1414, reason: invalid char between encapsulated token and delimiter

Solution

It would help to see an example of the records that trigger the warning. However, the most common cause is that the delimiter is a comma and a value contains quotes that do not encapsulate the entire value.

For instance:

"foo","bar" Y,"foo"

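In this case, "bar" Y is invalid, because the characters after the closing quote (a space and Y) appear before the next delimiter. You could fix that by wrapping the whole value in quotes and escaping the inner quotes by doubling them:

"foo","""bar"" Y","foo"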

https://www.marklogic.com/blog/delimited_text_mlcp

What does the exception mean?

"Invalid char between encapsulated token and delimiter" means that there are unexpected characters between a closing encapsulator and the next delimiter. What is an encapsulator? Put simply, it is the character used to wrap a CSV field or column that may contain special characters, such as line breaks. In most cases, people use double quotes as the encapsulator.
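To make that condition concrete, here is a minimal Python sketch that scans a single line and reports the first character sitting between a closing quote and the next delimiter. It is not MLCP's parser and may not match its behavior in every case; the function name `find_invalid_char` is made up for illustration, and it assumes a one-line record with a comma delimiter and double-quote encapsulator (quoted fields that span several lines are not handled).

```python
def find_invalid_char(line, delimiter=",", quote='"'):
    """Return the 0-based index of the first character that follows a closing
    quote without being a delimiter, or None if the line looks well formed.

    Hypothetical helper for illustration only; assumes one record per line.
    """
    START, UNQUOTED, QUOTED, CLOSED = range(4)
    line = line.rstrip("\r\n")  # ignore the record terminator
    state = START
    i = 0
    while i < len(line):
        ch = line[i]
        if state == START:
            if ch == quote:
                state = QUOTED        # field opens with an encapsulator
            elif ch != delimiter:
                state = UNQUOTED      # plain, unquoted field
        elif state == UNQUOTED:
            if ch == delimiter:
                state = START
        elif state == QUOTED:
            if ch == quote:
                if i + 1 < len(line) and line[i + 1] == quote:
                    i += 1            # doubled quote: an escaped literal quote
                else:
                    state = CLOSED    # closing encapsulator
        else:  # CLOSED: only a delimiter may follow a closing quote
            if ch != delimiter:
                return i
            state = START
        i += 1
    return None


print(find_invalid_char('"foo","bar" Y,"foo"'))      # 11 (the space after "bar")
print(find_invalid_char('"foo","""bar"" Y","foo"'))  # None
```

Running something like this over the file before loading it would point you at the lines that need fixing, such as line 1414 in the log above.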

How to work around the exception?

The best way to get around this exception is to avoid having malformed CSV data in the first place. If that is not possible, you can escape the double quotes in the field if you really want them to be part of the string. But remember, you must escape double quotes using another double quote in CSV!
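If you control the program that generates the CSV, letting a CSV library do the quoting avoids the problem. The sketch below uses Python's standard `csv` module, which doubles embedded quotes by default; the helper name `to_csv_line` is only illustrative, and your export tool may well be something other than Python.

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one record with every field quoted and inner quotes doubled."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(fields)
    return buffer.getvalue().rstrip("\n")


line = to_csv_line(["foo", '"bar" Y', "foo"])
print(line)  # "foo","""bar"" Y","foo"

# Reading the line back returns the original three values.
parsed = next(csv.reader([line]))
print(parsed)  # ['foo', '"bar" Y', 'foo']
```

The second field keeps its literal quotes after the round trip, which is what the doubled-quote escaping is for.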