I need to scrub an SQL Server table on a regular basis, but my solution is taking ridiculously long (about 12 minutes for 73,000 records).
My table has 4 fields:
id1
id2
val1
val2
For every group of records with the same "id1", I need to keep the first (lowest id2) and last (highest id2) and delete everything in between UNLESS val1 or val2 has changed from the previous (next lowest "id2") record.
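To illustrate with made-up values, here is a single "id1" group and what should happen to each record:

id1  id2  val1  val2    action
100    1    10   1.5    keep   (first record of the group)
100    2    10   1.5    delete (no change from id2=1)
100    3    12   1.5    keep   (val1 changed from id2=2)
100    4    12   1.5    delete (no change from id2=3)
100    5    12   1.5    keep   (last record of the group)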
If you're following me so far, what would a more efficient algorithm be? Here is my Java code:
boolean bDEL = false;
int thisID1, thisID2, thisVAL1, lastID1 = -1, lastID2 = -1, lastVAL1 = -1;
double thisVAL2, lastVAL2 = -1;

qps = conn.prepareStatement("SELECT id1, id2, val1, val2 FROM STATUS_DATA ORDER BY id1, id2");
qrs = qps.executeQuery();

// KEEP FIRST & LAST, DISCARD EVERYTHING ELSE *EXCEPT* WHERE CHANGE IN val1 OR val2
while (qrs.next()) {
    thisID1 = qrs.getInt("id1");
    thisID2 = qrs.getInt("id2");
    thisVAL1 = qrs.getInt("val1");
    thisVAL2 = qrs.getDouble("val2");
    if (thisID1 == lastID1) {
        if (bDEL) { // Previous record matched its predecessor and is not the last of its group
            qps2 = conn2.prepareStatement(
                "DELETE FROM STATUS_DATA WHERE id1=" + lastID1 + " AND id2=" + lastID2);
            qps2.executeUpdate();
            qps2.close();
            bDEL = false;
        }
        if (thisVAL1 == lastVAL1 && thisVAL2 == lastVAL2) {
            bDEL = true;
        }
    } else if (bDEL) {
        bDEL = false;
    }
    lastID1 = thisID1;
    lastID2 = thisID2;
    lastVAL1 = thisVAL1;
    lastVAL2 = thisVAL2;
}
UPDATE 4/20/2015 @ 11:10 AM
OK, so here is my final solution: for every record flagged for deletion, the Java code appends an XML element to a string, which is written to a file every 10,000 records; Java then calls a stored procedure on SQL Server and passes it the file name to read. The stored procedure can only use the file name as a variable if dynamic SQL is used to execute the OPENROWSET. I will experiment with the interval of procedure execution, but so far my performance results are as follows:
BEFORE (one DELETE per record):
73,000 records processed, 101 records per second

AFTER (bulk XML import):
1.4 million records processed, 5,800 records per second
JAVA SNIPPET:
String ts, sXML = "<DataRecords>\n";
boolean bDEL = false;
int i = 0;
int thisID1, thisID2, thisVAL1, lastID1 = -1, lastID2 = -1, lastVAL1 = -1;
double thisVAL2, lastVAL2 = -1;

qps = conn.prepareStatement("SELECT id1, id2, val1, val2 FROM STATUS_DATA ORDER BY id1, id2");
qrs = qps.executeQuery();

// KEEP FIRST & LAST, DISCARD EVERYTHING ELSE *EXCEPT* WHERE CHANGE IN val1 OR val2
while (qrs.next()) {
    thisID1 = qrs.getInt("id1");
    thisID2 = qrs.getInt("id2");
    thisVAL1 = qrs.getInt("val1");
    thisVAL2 = qrs.getDouble("val2");
    if (bDEL && thisID1 == lastID1) { // Previous record is neither first nor last of its group
        sXML += "<nxtrec id1=\"" + lastID1 + "\" id2=\"" + lastID2 + "\"/>\n";
        if ((i + 1) % 10000 == 0) { // Flush every 10,000 records
            sXML += "</DataRecords>\n";                            // Close off parent tag
            ts = String.valueOf((new java.util.Date()).getTime()); // Each XML file uniquely named
            writeFile(sDir, "ds" + ts + ".xml", sXML);             // Write XML to file
            conn2 = dataSource.getConnection();
            cs = conn2.prepareCall("EXEC SCRUB_DATA ?");
            cs.setString(1, sDir + "ds" + ts + ".xml");
            cs.executeUpdate();                                    // Execute stored procedure
            cs.close(); conn2.close();
            deleteFile(sDir, "ds" + ts + ".xml");                  // Delete file
            sXML = "<DataRecords>\n";
        }
        bDEL = false;
    }
    if (thisID1 == lastID1 && thisVAL1 == lastVAL1 && thisVAL2 == lastVAL2) {
        bDEL = true;
    } else if (bDEL) {
        bDEL = false;
    }
    lastID1 = thisID1;
    lastID2 = thisID2;
    lastVAL1 = thisVAL1;
    lastVAL2 = thisVAL2;
    i++;
}
qrs.close(); qps.close(); conn.close();

// Flush whatever is left in the final partial batch
sXML += "</DataRecords>\n";
ts = String.valueOf((new java.util.Date()).getTime());
writeFile(sDir, "ds" + ts + ".xml", sXML);
conn2 = dataSource.getConnection();
cs = conn2.prepareCall("EXEC SCRUB_DATA ?");
cs.setString(1, sDir + "ds" + ts + ".xml");
cs.executeUpdate();
cs.close(); conn2.close();
deleteFile(sDir, "ds" + ts + ".xml");
XML FILE OUTPUT:
<DataRecords>
<nxtrec id1="100" id2="1112"/>
<nxtrec id1="100" id2="1113"/>
<nxtrec id1="100" id2="1117"/>
<nxtrec id1="102" id2="1114"/>
...
<nxtrec id1="838" id2="1112"/>
</DataRecords>
SQL SERVER STORED PROCEDURE:
CREATE PROCEDURE [dbo].[SCRUB_DATA] @floc VARCHAR(100) -- file location (dir + filename) as the only parameter
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @sql AS VARCHAR(MAX);
    SET @sql = '
        DECLARE @XmlFile XML;

        SELECT @XmlFile = BulkColumn
        FROM OPENROWSET(BULK ''' + @floc + ''', SINGLE_BLOB) x;

        CREATE TABLE #TEMP_TABLE (id1 INT, id2 INT);

        INSERT INTO #TEMP_TABLE (id1, id2)
        SELECT
            id1 = DataTab.value(''@id1'', ''int''),
            id2 = DataTab.value(''@id2'', ''int'')
        FROM @XmlFile.nodes(''/DataRecords/nxtrec'') AS XTbl(DataTab);

        DELETE FROM D
        FROM STATUS_DATA D
        INNER JOIN #TEMP_TABLE T ON (T.id1 = D.id1) AND (T.id2 = D.id2);
    ';
    EXEC (@sql);
END
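For reference, the procedure can also be invoked directly for testing; the path below is illustrative:

EXEC dbo.SCRUB_DATA 'C:\data\ds1429542600000.xml';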
It is almost certain that your performance issues lie not in your algorithm, but in the implementation. If, for example, your cleanup step has to remove 10,000 records, that means 10,000 round trips to your database server.
Instead of doing that, write each of the id pairs to be deleted to an XML file, and send that XML file to a SQL Server stored proc that shreds the XML into a corresponding temp table or table variable. Then use a single DELETE ... FROM (or equivalent) to delete all 10K rows.
If you don't know how to shred XML in TSQL, it is well worth the time to learn. Take a look at a simple example like the one below to get you started, or just check out a couple of search results for "tsql shred xml" to get going.
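Here is a minimal shredding sketch (not the asker's actual procedure) for an XML document shaped like the one shown above:

-- Shred an XML document of id pairs into a table variable.
DECLARE @x XML = N'
<DataRecords>
  <nxtrec id1="100" id2="1112"/>
  <nxtrec id1="100" id2="1113"/>
</DataRecords>';

DECLARE @pairs TABLE (id1 INT, id2 INT);

INSERT INTO @pairs (id1, id2)
SELECT
    rec.value('@id1', 'int'),
    rec.value('@id2', 'int')
FROM @x.nodes('/DataRecords/nxtrec') AS T(rec);

SELECT * FROM @pairs; -- two rows: (100, 1112) and (100, 1113)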
ADDED
Pulling 10K records to the client should take under a second, and your Java code should process them just as quickly. If you don't have the time to learn to use XML as suggested, you could write a quick and dirty stored proc that accepts 10 (20, 50?) pairs of ids and deletes the corresponding records from within the stored proc; a sketch follows below. I use the XML approach regularly to "batch" stuff from the client. If your batches are "large", you might take a look at the BULK INSERT command on SQL Server -- but the XML is easy and a bit more flexible, as it can contain nested data structures, e.g., master/detail relationships.
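A sketch of such a "quick and dirty" proc, with illustrative names and a batch size of five pairs (unused slots are left NULL and match nothing):

-- Hypothetical batching proc: deletes up to five (id1, id2) pairs per call.
CREATE PROCEDURE dbo.DELETE_STATUS_BATCH
    @id1_1 INT = NULL, @id2_1 INT = NULL,
    @id1_2 INT = NULL, @id2_2 INT = NULL,
    @id1_3 INT = NULL, @id2_3 INT = NULL,
    @id1_4 INT = NULL, @id2_4 INT = NULL,
    @id1_5 INT = NULL, @id2_5 INT = NULL
AS
BEGIN
    SET NOCOUNT ON;
    DELETE D
    FROM STATUS_DATA D
    WHERE (D.id1 = @id1_1 AND D.id2 = @id2_1)
       OR (D.id1 = @id1_2 AND D.id2 = @id2_2)
       OR (D.id1 = @id1_3 AND D.id2 = @id2_3)
       OR (D.id1 = @id1_4 AND D.id2 = @id2_4)
       OR (D.id1 = @id1_5 AND D.id2 = @id2_5);
END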
ADDED
I just did this locally
create table #tmp
(
    id int not null,
    primary key (id)
)
GO

insert #tmp (id)
select 4
union
select 5
GO
-- #tmp now has two rows

delete from L
from TaskList L
inner join #tmp T on (T.id = L.taskID)

-- (2 row(s) affected)
-- and the two rows are no longer in TaskList
I.e., this should not be a problem unless you are doing it wrong somehow. Are you creating the temp table and then attempting to use it in different database connections/sessions? If the sessions are different, the temp table won't be visible in the second session.
Hard to think of another way for this to go wrong off the top of my head.
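For reference, local temp tables (names starting with #) are scoped to the session that created them, which is easy to verify:

-- Session 1:
create table #scope_demo (id int);
insert #scope_demo (id) values (1);
select * from #scope_demo;   -- works: returns 1 row

-- Session 2 (a different connection):
select * from #scope_demo;   -- fails: Invalid object name '#scope_demo'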