
How to convert a paragraph HTML string to plain text without HTML tags in Google Apps Script?


This is a follow-up to my previous question. I'm having trouble converting HTML strings to plain text without HTML tags in Google Apps Script, using the approach referenced in that question. This time, however, the HTML is in paragraph format.

This is the script that I use:

function pullDataFromWorkday() {
  // CSV link from the Workday report. Double quotes are needed because the placeholder contains an apostrophe.
  var url = "https://services1.myworkday.com/ccx/service/customreport2/[company name]/[owner's email]/[Report Name]?format=csv";
  var b64 = 'asdfghjklkjhgfdfghj=='; // placeholder for our Workday password in Base64
  var response = UrlFetchApp.fetch(url, {
      headers: {
        Authorization: 'Basic '+ b64
      }
  });

//Parse   
  if (response.getResponseCode() >= 200 && response.getResponseCode() < 300) {
    var blob = response.getBlob();
    var string = blob.getDataAsString();
    var data = Utilities.parseCsv(string, ",");

    // Skip the header row. Columns 0 and 1 are left unchanged; columns 2 to 5 are converted.
    for (var i = 1; i < data.length; i++) {
      for (var j = 2; j <= 5; j++) {
        data[i][j] = toStringFromHtml(data[i][j]);
      }
    }

    // Paste it in
    var ss = SpreadsheetApp.getActive();
    var sheet = ss.getSheetByName('Sheet1');
    sheet.clear();
    sheet.getRange(1, 1, data.length, data[0].length).setValues(data);
  } else {
    return;
  }
}



function toStringFromHtml(html) {
  html = '<div>' + html + '</div>';
  html = html.replace(/<br>/g, "");
  var document = XmlService.parse(html);
  var strText = XmlService.getPrettyFormat().format(document);
  strText = strText.replace(/<[^>]*>/g, "");
  return strText.trim();
}

This is a sample of the data that I want:

[image: sample of the desired output]

Or you can use this sample spreadsheet.

Is there any step that I missed or did wrong?

Thank you in advance for answering.


Solution

  • In your situation, how about modifying toStringFromHtml as follows?

    Modified script:

    function toStringFromHtml(html) {
      html = '<div>' + html + '</div>';
      html = html.replace(/<br>/g, "").replace(/<p><\/p><p><\/p>/g, "<p></p>").replace(/<span>|<\/span>/g, "");
      var document = XmlService.parse(html);
      var strText = XmlService.getPrettyFormat().setIndent("").format(document);
      strText = strText.replace(/<[^>]*>/g, "");
      return strText.trim();
    }
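
The string clean-up in the modified script runs before `XmlService.parse` and can be tried on its own in plain JavaScript. The sketch below repeats only those three `replace` calls; the function name `preprocessHtml` and the sample string are made up for illustration, and the real Workday cell contents may look different.

```javascript
// Hypothetical sample cell value; the real Workday export may differ.
var sample = '<p><span>First paragraph.</span></p><p></p><p></p><p>Second paragraph.<br></p>';

// Hypothetical helper: the same string clean-up as in the modified script,
// without the XmlService steps that follow it.
function preprocessHtml(html) {
  html = '<div>' + html + '</div>';             // wrap in a single root element
  return html
    .replace(/<br>/g, "")                       // remove unclosed <br> tags
    .replace(/<p><\/p><p><\/p>/g, "<p></p>")    // collapse two empty paragraphs into one
    .replace(/<span>|<\/span>/g, "");           // unwrap attribute-free <span> elements
}

var cleaned = preprocessHtml(sample);
console.log(cleaned);
// <div><p>First paragraph.</p><p></p><p>Second paragraph.</p></div>
```

Note that the `<span>` pattern only matches tags without attributes, so a value such as `<span style="...">` would pass through this step unchanged.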
    

    Note: