I have a large dataset (400K+ rows by 4 columns). I'm currently grabbing the data into an array using the following (thanks to tanaike):
function getValues_({ spreadsheetId, sheetName, start = 1, maxRow, limit = 100000 }) {
  return [...Array(Math.ceil(maxRow / limit))].flatMap((_) => {
    const last = start - 1 + limit;
    const range = `'${sheetName}'!A${start}:${last > maxRow ? maxRow : last}`;
    const temp = Sheets.Spreadsheets.Values.get(spreadsheetId, range).values;
    start += limit;
    return temp;
  });
}
function setValues_({ spreadsheetId, sheetName, start = 1, maxRow, limit = 100000, values }) {
  Array.from(Array(Math.ceil(maxRow / limit))).forEach((_) => {
    const v = values.splice(0, limit);
    Sheets.Spreadsheets.Values.update({ values: v }, spreadsheetId, `'${sheetName}'!A${start}`, { valueInputOption: "USER_ENTERED" });
    start += limit;
  });
}
The first function grabs data 100K rows at a time and places it into an array. The second function pastes data 100K rows at a time. Works great. I then do some work on it via loops.
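The chunking arithmetic those helpers rely on can be checked in isolation, outside of Apps Script. This is a small sketch (the `chunkRanges` name and the "Sheet1" value are illustrative, not part of the original code) that reproduces the A1 range strings each 100K-row API call would use:

```javascript
// Standalone check of the chunking math used by getValues_/setValues_:
// given maxRow and limit, list the A1 ranges each Sheets API call covers.
function chunkRanges(sheetName, maxRow, limit, start = 1) {
  const ranges = [];
  for (let i = 0; i < Math.ceil(maxRow / limit); i++) {
    const last = start - 1 + limit;
    // Clamp the final chunk so it never runs past the last row of data.
    ranges.push(`'${sheetName}'!A${start}:${Math.min(last, maxRow)}`);
    start += limit;
  }
  return ranges;
}

console.log(chunkRanges("Sheet1", 400000, 100000));
// → ["'Sheet1'!A1:100000", "'Sheet1'!A100001:200000",
//    "'Sheet1'!A200001:300000", "'Sheet1'!A300001:400000"]
```

Note how the last chunk is clamped to `maxRow`, so an uneven row count (say 250K rows with a 100K limit) ends with a short final range rather than reading past the data.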
I now find myself needing to remove duplicates. Here is what my data looks like:
ID | Content | Completion Date | Pathway |
---|---|---|---|
1 | abc | 01/01/2024 | Apple |
1 | def | 01/01/2024 | Apple |
1 | ghi | 01/01/2024 | Apple |
1 | def | 01/11/2024 | Apple |
1 | abc | 01/01/2023 | Apple |
1 | abc | 01/01/2024 | Apple |
I would like to remove the rows that are duplicates based on ID, Content, and Pathway, keeping the oldest date in the Completion Date column. It should look like this:
ID | Content | Completion Date | Pathway |
---|---|---|---|
1 | def | 01/01/2024 | Apple |
1 | ghi | 01/01/2024 | Apple |
1 | abc | 01/01/2023 | Apple |
How would you approach this?
Update: .removeDuplicates() took ~30 seconds on about 400K rows. That's actually not bad, but I'm curious if there's another way.

---

You may add another script that filters your current data down to the oldest date per group. You can achieve this with the following script, which uses the map(), filter(), and sort() functions:
function reduceData() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  var data = sheet.getDataRange().getDisplayValues();
  // Unique values in the Content column (index 1).
  var uniqueContent = [...new Set(data.map(x => x[1]))];
  // Group rows by Content, sort each group by Completion Date (parsing the
  // dates — a plain sort() compares MM/DD/YYYY strings lexicographically and
  // misorders dates across years), then keep only the first row per group.
  var out = uniqueContent
    .map(c => data.filter(r => r[1] == c).sort((a, b) => new Date(a[2]) - new Date(b[2])))
    .map(g => g[0]);
  sheet.clearContents();
  sheet.getRange(1, 1, out.length, out[0].length).setValues(out);
}
The [...new Set(data.map(x => x[1]))] expression creates a set of the unique values in the Content column. That set becomes the basis for grouping all rows by their Content value. After the rows are grouped, each group is sorted by Completion Date so that the row with the oldest date lands at index 0, and a final map() reduces each group to just that first row.
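The group → sort → take-index-0 pattern can be tried on plain in-memory arrays, using the sample rows from the question. One caveat baked into this sketch: the sort uses an explicit date comparator, because the default sort() compares MM/DD/YYYY strings character by character and can misorder dates that span years (e.g. it would put "01/01/2024" before "12/31/2023"):

```javascript
// Sample rows from the question: [ID, Content, Completion Date, Pathway].
const data = [
  ["1", "abc", "01/01/2024", "Apple"],
  ["1", "def", "01/01/2024", "Apple"],
  ["1", "ghi", "01/01/2024", "Apple"],
  ["1", "def", "01/11/2024", "Apple"],
  ["1", "abc", "01/01/2023", "Apple"],
  ["1", "abc", "01/01/2024", "Apple"],
];

// Unique Content values, in order of first appearance.
const uniqueContent = [...new Set(data.map(x => x[1]))];

// Group by Content, sort each group oldest-date-first, keep index 0.
const out = uniqueContent
  .map(c => data.filter(r => r[1] === c)
              .sort((a, b) => new Date(a[2]) - new Date(b[2])))
  .map(g => g[0]);

console.log(out);
// → [["1","abc","01/01/2023","Apple"],
//    ["1","def","01/01/2024","Apple"],
//    ["1","ghi","01/01/2024","Apple"]]
```

This reproduces the asker's expected table: one row per Content value, each carrying the oldest Completion Date seen for that value.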
Using the sample data above, running the script returns one row per Content value, keeping the row with the oldest Completion Date.
If you wish to reduce your data based on 3 columns, you may use the following script, which expands the grouping key to ID, Content, and Pathway:
function reduceData2() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  var data = sheet.getDataRange().getDisplayValues();
  // Build a composite key from ID, Content, and Pathway (columns 0, 1, 3).
  var uniqueKeys = [...new Set(data.map(x => [x[0], x[1], x[3]].join("|")))];
  // Group rows by composite key, sort each group by parsed Completion Date,
  // and keep only the oldest row per group.
  var out = uniqueKeys
    .map(k => data.filter(r => [r[0], r[1], r[3]].join("|") == k)
                  .sort((a, b) => new Date(a[2]) - new Date(b[2])))
    .map(g => g[0]);
  sheet.clearContents();
  sheet.getRange(1, 1, out.length, out[0].length).setValues(out);
}
With sample data containing different IDs and Pathways, the updated script keeps one row per ID/Content/Pathway combination, again with the oldest Completion Date.
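One performance note for 400K rows: the map()/filter() approach re-scans the entire dataset once per unique key, which is O(n·u). A single pass that tracks the oldest row per key in a Map is O(n) and avoids sorting entirely. This is a sketch, not the answerer's code — the `dedupeOldest` name is mine, the column layout is assumed to match the question ([ID, Content, Date, Pathway]), and a real sheet's header row should be split off before calling it:

```javascript
// One pass over the rows: remember, per composite key, the row whose
// Completion Date parses to the earliest Date.
function dedupeOldest(rows) {
  const seen = new Map();
  for (const row of rows) {
    const key = [row[0], row[1], row[3]].join("|");
    const prev = seen.get(key);
    if (!prev || new Date(row[2]) < new Date(prev[2])) {
      seen.set(key, row);
    }
  }
  // Map preserves first-seen key order, matching the original row order.
  return [...seen.values()];
}

const rows = [
  ["1", "abc", "01/01/2024", "Apple"],
  ["1", "def", "01/01/2024", "Apple"],
  ["1", "ghi", "01/01/2024", "Apple"],
  ["1", "def", "01/11/2024", "Apple"],
  ["1", "abc", "01/01/2023", "Apple"],
  ["1", "abc", "01/01/2024", "Apple"],
];
console.log(dedupeOldest(rows));
// → [["1","abc","01/01/2023","Apple"],
//    ["1","def","01/01/2024","Apple"],
//    ["1","ghi","01/01/2024","Apple"]]
```

On the asker's data this could then be combined with the chunked getValues_/setValues_ helpers: read all rows into memory, dedupe in one pass, and write the result back.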