I'm trying to clean up raw data that has embedded \r\n or \n inside CSV lines. The line terminator is \r\n.
I was able to put together the Mule 4 DataWeave code below, except for the punctuation translation logic. I used reduce, but it's not translating correctly.
DataWeave code:
%dw 2.0
output application/csv header=true
var translateMap = {
"‘": "'", "’": "'", "‚": "'", "‛": "'",
"“": "\"", "”": "\"", "„": "\"",
"–": "-", "—": "--", "―": "--",
"…": "...", "•": "*",
"′": "'", "″": "\"",
"‹": "<", "›": ">", "«": "<<", "»": ">>",
" ": " ",
"‐": "-", "‑": "-", "‒": "-", "−": "-",
"©": "(c)", "®": "(R)", "™": "(TM)"
}
fun cleanField(value: String) = (
    translateMap reduce ((acc, pair) -> acc replace pair.key with pair.value)
        replace /(\r\n|\n)/ with " "
        replace /[^\x00-\x7F]/ with ""
)
---
payload map (row) ->
    row mapObject (key, value) -> {
        (value) : cleanField(key)
    }
Sample Data:
Header1|Header2|Header3|Header4|Header5|Header6\r\n
Value1A|Value1B|Value1C|Value1D|Value1E|Value1F\r\n
Value2A|Value2B|Value2C—with—emdash|Value2D|Value2E|Value2F\r\n
Value3A|Value3B|Value3C|Value3D ␍\f mid-line |Value3E|Value3F\r\n
Value4A|‘Single’Quote|“Double”Quote|Value4D|Value4E|Value4F\r\n
Value5A|Value5B|Value5C|Value5D|Value5E|Value5F‐hyphen\r\n
Explanation of the Sample Data:
Line Terminator: Each line ends with \r\n as requested.
Pipe Separated: Fields within each record are separated by the pipe symbol |.
Header: The first line contains the header row:
Header1|Header2|Header3|Header4|Header5|Header6.
Five Records: There are five data rows following the header.
Six Columns: Each record has six values separated by pipes.
UTF-8 Punctuation Marks: Line 4 contains curly single quotation marks ‘ ’ and curly double quotation marks “ ”.
\r\n in Mid-Line: Line 3 contains \r\n mid-line. The \r (carriage return) and \n (line feed) characters are embedded within the "Value3D" field.
Emdash: Line 2 contains an emdash — within the "Value2C" field.
UTF-8 Hyphen: Line 5 contains a non-standard hyphen (U+2010, Hyphen) at the end of the "Value5F" field.
Expected output:
Translate UTF-8 punctuation marks to their ASCII equivalents.
Remove embedded line breaks (\r\n or \n), but preserve the \r\n line terminator.
Remove any other UTF-8 characters that are outside the ASCII range.
Current output:
Embedded line breaks and non-ASCII characters are being cleaned correctly, but the punctuation translation isn't working.
New output:
I couldn't attach the screenshot, so here it is as text:
1. Header1|Header2|Header3|Header4|Header5|Header6
2. Value1A|Value1B|Value1C\r\n
|Value1D|Value1E|Value1F\r\n
3. Value2A|Value2B|Value2C--with--emdash|Value2D|Value2E|Value2F
4. Value3A|Value3B|Value3C|Value3D mid-line |Value3E|Value3F
5. Value4A|'Single'Quote|\"Double\"Quote|Value4D|Value4E|Value4F
6. Value5A|Value5B|Value5C|Value5D|Value5E|Value5F-hyphen
Cleaning the embedded \r\n while preserving the real terminator is tricky. The replace below does not have any effect on the line:
replace /(\r\n|\n)/ with " "
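It is hard to say without seeing the expected output, but the problems seem to be in how reduce is applied to the translation map and in the line-break regex.
I fixed those issues below: reduce now iterates over namesOf(translateMap) with the accumulator initialized to the field value, and the regex is escaped so that it matches the literal \r\n text. I also used replaceAll() instead of replace for the translation map.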
%dw 2.0
output application/csv header=true
import replaceAll from dw::core::Strings
var translateMap = {
"‘": "'", "’": "'", "‚": "'", "‛": "'",
"“": "\"", "”": "\"", "„": "\"",
"–": "-", "—": "--", "―": "--",
"…": "...", "•": "*",
"′": "'", "″": "\"",
"‹": "<", "›": ">", "«": "<<", "»": ">>",
" ": " ",
"‐": "-", "‑": "-", "‒": "-", "−": "-",
"©": "(c)", "®": "(R)", "™": "(TM)"
}
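fun cleanField(value: String) = (
    // Apply each translation in turn, starting from the original field value
    namesOf(translateMap) reduce ((key, out=value) -> replaceAll(out, key, translateMap[key]))
        // Replace embedded line breaks written as literal \r\n or \n text with a space
        replace /(\\r\\n|\\n)/ with " "
        // Drop any remaining characters outside the ASCII range
        replace /[^\x00-\x7F]/ with ""
)
---
payload map (row) ->
    // mapObject passes (value, key) to the lambda, in that order
    row mapObject (value, key) -> {
        (key) : cleanField(value as String)
    }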
Output:
Header1,Header2,Header3,Header4,Header5,Header6\r\n
Value1A,Value1B,Value1C,Value1D,Value1E,Value1F
Value2A,Value2B,Value2C--with--emdash,Value2D,Value2E,Value2F
Value3A,Value3B,Value3C,Value3D \\f mid-line ,Value3E,Value3F
Value4A,'Single'Quote,\"Double\"Quote,Value4D,Value4E,Value4F
Value5A,Value5B,Value5C,Value5D,Value5E,Value5F-hyphen
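To illustrate why the escaped pattern in cleanField behaves differently from the unescaped one, here is a minimal sketch. It assumes the fields contain the literal two-character sequences \r and \n (a backslash followed by a letter) rather than real control characters, which is what the sample data suggests but is not confirmed. The variables literalText and controlText are made up for this illustration.

```dataweave
%dw 2.0
output application/json
// Hypothetical inputs, not taken from the real payload:
// literalText holds the characters backslash-r backslash-n as plain text,
// controlText holds a real carriage return followed by a real line feed.
var literalText = "Value1C\\r\\nValue1D"
var controlText = "Value1C\r\nValue1D"
---
{
    // /\\r\\n/ matches a literal backslash followed by r, then a backslash followed by n
    literalFixed: literalText replace /(\\r\\n|\\n)/ with " ",
    // /\r\n/ matches the actual CR LF control characters
    controlFixed: controlText replace /(\r\n|\n)/ with " ",
    // the unescaped pattern finds nothing to replace in the literal text
    literalUnchanged: literalText replace /(\r\n|\n)/ with " "
}
```

Under that assumption, literalFixed and controlFixed should both end up with a single space where the line break was, while literalUnchanged should keep the original text. If the real input contains actual control characters instead, the unescaped pattern is the one that applies.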