javascriptarraysgoogle-sheetsgoogle-apps-scriptunique

Remove Duplicates from Very Large Array, Keeping Oldest Timestamp


I have a large dataset (400K+ rows by 4 columns). I'm currently grabbing the data into an array using the following (thanks to tanaike):

function getValues_({ spreadsheetId, sheetName, start = 1, maxRow, limit = 100000 }) {
  return [...Array(Math.ceil(maxRow / limit))].flatMap((_) => {
    const last = start - 1 + limit;
    const range = `'${sheetName}'!A${start}:${last > maxRow ? maxRow : last}`;
    const temp = Sheets.Spreadsheets.Values.get(spreadsheetId, range).values;
    start += limit;
    return temp;
  });
}

function setValues_({ spreadsheetId, sheetName, start = 1, maxRow, limit = 100000, values }) {
  Array.from(Array(Math.ceil(maxRow / limit))).forEach((_) => {
    const v = values.splice(0, limit);
    Sheets.Spreadsheets.Values.update({ values: v }, spreadsheetId, `'${sheetName}'!A${start}`, { valueInputOption: "USER_ENTERED" });
    start += limit;
  });
}

The first function grabs data 100K rows at a time and places it into an array. The second function pastes data 100K rows at a time. Works great. I then do some work on it via loops.

I now find myself needing to remove duplicates. Here is what my data looks like:

ID Content Completion Date Pathway
1 abc 01/01/2024 Apple
1 def 01/01/2024 Apple
1 ghi 01/01/2024 Apple
1 def 01/11/2024 Apple
1 abc 01/01/2023 Apple
1 abc 01/01/2024 Apple

I would like to remove those rows that are duplicates based on ID, Content and Pathway...keeping the oldest date in the Completion Date column. It should look like this:

ID Content Completion Date Pathway
1 def 01/01/2024 Apple
1 ghi 01/01/2024 Apple
1 abc 01/01/2023 Apple

How would you approach this?

You should know:

Update:


Solution

  • Use Another Script to Reduce Your Data

    You may add another script which will filter out your current data based on the oldest date. You may achieve this by using the following script which uses map(), filter(), and sort functions.

    Script

    function redeuceData() {
      var ss = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
      var data = ss.getDataRange().getDisplayValues();
      var uniqueContent = [...new Set(data.map(x=>x[1]))];
      var out = uniqueContent.map(x=>data.filter(y=>y[1]==x).sort()).map(z=>z[0])
      ss.clearContents();
      ss.getRange(1,1,out.length,out[0].length).setValues(out);
    }
    

    The [...new Set(data.map(x=>x[1]))] creates a new set of data with unique values for the column content. This set will become the basis of grouping all rows according to the value of content. After grouping the rows together, the sort function is applied so that the oldest date will be positioned on to index 0 of the groupings after which is reduced with the help of map which only gets all rows with index 0.

    Output:

    Using this sample data:

    Sample data

    Running the script will result in:

    result

    References:

    Update:

    If you wish to reduce your data based on 3 columns, you may use the following script which expands the filter based on ID, Content, and Pathway:

    function redeuceData2() {
      var ss = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
      var data = ss.getDataRange().getDisplayValues();
      var uniqueContent = [...new Set(data.map(x => [x[0],x[1],x[3]].join("|")))];
      var out = uniqueContent.map(x => data.filter(y => [y[0],y[1],y[3]].join("|") == x).sort()).map(z => z[0]);
      ss.clearContents();
      ss.getRange(1, 1, out.length, out[0].length).setValues(out);
    }
    

    Output:

    Using another sample data with different IDs and Pathways:

    sample data 2

    Running the updated script should result in:

    Output 2