utf-8 dataweave mule4 non-ascii-characters

Translate UTF-8 punctuation to normal ASCII punctuation marks


I'm trying to clean up raw data that has embedded \r\n or \n inside CSV lines. The line terminator is \r\n.

I was able to write the Mule 4 DataWeave code below, except for the punctuation translation logic. I used reduce, but it's not translating correctly.

DataWeave code:

    %dw 2.0
    output application/csv header=true
    var translateMap = {
      "‘": "'", "’": "'", "‚": "'", "‛": "'",
      "“": "\"", "”": "\"", "„": "\"",
      "–": "-", "—": "--", "―": "--",
      "…": "...", "•": "*",
      "′": "'", "″": "\"",
      "‹": "<", "›": ">", "«": "<<", "»": ">>",
      " ": " ", 
      "‐": "-", "‑": "-", "‒": "-", "−": "-",
      "©": "(c)", "®": "(R)", "™": "(TM)"
    }

    fun cleanField(value: String) = (
        translateMap reduce ((acc, pair) -> acc replace pair.key with pair.value)
          replace /(\r\n|\n)/ with " "
          replace /[^\x00-\x7F]/ with ""
    )
    ---
    payload map (row) ->
      row mapObject (key, value) -> {
         (value) : cleanField(key)
      }

Sample Data:

Header1|Header2|Header3|Header4|Header5|Header6\r\n
Value1A|Value1B|Value1C|Value1D|Value1E|Value1F\r\n
Value2A|Value2B|Value2C—with—emdash|Value2D|Value2E|Value2F\r\n
Value3A|Value3B|Value3C|Value3D ␍\f mid-line |Value3E|Value3F\r\n
Value4A|‘Single’Quote|“Double”Quote|Value4D|Value4E|Value4F\r\n
Value5A|Value5B|Value5C|Value5D|Value5E|Value5F‐hyphen\r\n

Explanation of the Sample Data:

Expected output:

Current output:

Non-ASCII characters and embedded line breaks are being cleaned correctly. The translation of punctuation isn't working.


New output:

I couldn't attach the screenshot.


 1. Header1|Header2|Header3|Header4|Header5|Header6
 2. Value1A|Value1B|Value1C\r\n
|Value1D|Value1E|Value1F\r\n
 3. Value2A|Value2B|Value2C--with--emdash|Value2D|Value2E|Value2F
 4. Value3A|Value3B|Value3C|Value3D mid-line |Value3E|Value3F
 5. Value4A|'Single'Quote|\"Double\"Quote|Value4D|Value4E|Value4F
 6. Value5A|Value5B|Value5C|Value5D|Value5E|Value5F-hyphen

Cleaning the embedded \r\n while preserving the real terminator is tricky. The replace below does not have any effect on the line.

replace /(\r\n|\n)/ with " "



Solution

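  • It is hard to say without seeing the expected output, but the problems seem to be:

    1. Incorrect usage of reduce(): the reduce() lambda takes the current item first and the accumulator second, and the accumulator has to start out as the field value. You want reduce to apply all the keys of the translation map to the value and return the transformed value, so I iterate over the key names of the map and initialize the accumulator with the value.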
    2. Using a static selector instead of a dynamic selector.
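
    As a general illustration of the selector distinction, here is a minimal sketch. The names m and k and the two-entry map are made up for this example and are not part of the original script.

    ```dataweave
    %dw 2.0
    output application/json
    // Hypothetical two-entry map, for illustration only
    var m = { "—": "--", "…": "..." }
    var k = "—"
    ---
    {
        // Static selector: looks for a field literally named "k", which m does not have
        staticLookup: m.k,
        // Dynamic selector: evaluates k first, then looks up the field with that name
        dynamicLookup: m[k]
    }
    ```

    Here staticLookup should come out as null, while dynamicLookup should resolve to "--".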

    I fixed those issues below. I also used replaceAll() instead of replace for the translation map.

    %dw 2.0
    output application/csv header=true
    import replaceAll from dw::core::Strings    
    
    var translateMap = {
        "‘": "'", "’": "'", "‚": "'", "‛": "'",
        "“": "\"", "”": "\"", "„": "\"",
        "–": "-", "—": "--", "―": "--",
        "…": "...", "•": "*",
        "′": "'", "″": "\"",
        "‹": "<", "›": ">", "«": "<<", "»": ">>",
        " ": " ", 
        "‐": "-", "‑": "-", "‒": "-", "−": "-",
        "©": "(c)", "®": "(R)", "™": "(TM)"
    }
    
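    fun cleanField(value: String) = (
        // Seed the accumulator (out) with the field value, then apply each mapping in turn
        namesOf(translateMap) reduce ((key, out=value) -> replaceAll(out, key, translateMap[key]))
          // The doubled backslashes match the literal text \r\n as typed in the sample data;
          // use /(\r\n|\n)/ instead if the fields contain real CR/LF characters
          replace /(\\r\\n|\\n)/ with " "
          replace /[^\x00-\x7F]/ with ""
    )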
    ---
    payload map (row) ->
      // mapObject passes its arguments as (value, key), in that order
      row mapObject (value, key) -> {
         (key) : cleanField(value as String)
      }
    

    Output:

    Header1,Header2,Header3,Header4,Header5,Header6\r\n
    Value1A,Value1B,Value1C,Value1D,Value1E,Value1F 
    Value2A,Value2B,Value2C--with--emdash,Value2D,Value2E,Value2F 
    Value3A,Value3B,Value3C,Value3D \\f mid-line ,Value3E,Value3F 
    Value4A,'Single'Quote,\"Double\"Quote,Value4D,Value4E,Value4F 
    Value5A,Value5B,Value5C,Value5D,Value5E,Value5F-hyphen
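
    If you prefer to keep the pair.key / pair.value shape from the question, one possible variation is to turn the map into an array of pairs with pluck first and then reduce over that array. This is only a sketch under assumptions: smallMap, sample, pairs, from and to are names made up for this example, and the input is an inline string rather than your CSV payload.

    ```dataweave
    %dw 2.0
    output application/json
    import replaceAll from dw::core::Strings

    // Hypothetical two-entry map and input string, for illustration only
    var smallMap = { "—": "--", "“": "\"" }
    var sample = "a—b “c"

    // pluck passes (value, key); build an array of { from, to } pairs
    var pairs = smallMap pluck ((v, k) -> { from: k as String, to: v })
    ---
    // reduce passes (item, accumulator); acc = sample seeds the accumulator with the input
    pairs reduce ((p, acc = sample) -> replaceAll(acc, p.from, p.to))
    ```

    With these assumed inputs the result should be the string a--b "c, which makes it easy to check the fold on its own before wiring it into cleanField.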