asp.net file-upload xmlhttprequest valums-file-uploader

Uploading files: detecting duplicates and knowing when to update an existing file


In our system, when a user uploads a file, it is stored in a unique file system structure and a database record is generated. The file is uploaded from the web browser via XMLHttpRequest and then moved from the temporary upload area into the FS.

How can I detect, after a file has been uploaded, that it already exists in my FS? There are two cases:

  • The uploaded file is identical to one already uploaded.
  • It is the same file, but its content has been updated, which means I need to update the file in the FS.

I am ignoring file names as a way of knowing whether the file already exists, because a filename cannot be considered unique. For example, some cameras name photos using an incrementing number that rolls over after a time. Also, when a file is uploaded via the web browser, the source file path (e.g. C:\Users\Drive\File\Uploaded\From) is masked, so I can't use that to figure out whether the file has already been uploaded.

How do I know whether the file being uploaded already exists because its content is the same, or whether it exists but the uploaded copy has been changed, so that I can just update the stored file?

Microsoft Word documents create a challenge as Word regenerates the file on every save.

If the user renames a file of their own accord, I could say tough luck.


Solution

  • I would start by finding files with identical content via a SHA hash. You could use something like the following to get a list of files whose hash matches that of the newly uploaded file, and then take some action.

    Just an example of getting the hash of the new file:

    // Hex-encoded SHA-1 hash of the newly uploaded file's content.
    string newfile;
    using (FileStream fs = new FileStream("C:\\Users\\Drive\\File\\Uploaded\\From\\newfile.txt", FileMode.Open))
    {
        using (System.Security.Cryptography.SHA1Managed sha1 = new System.Security.Cryptography.SHA1Managed())
        {
            newfile = BitConverter.ToString(sha1.ComputeHash(fs));
        }
    }
    

    This goes through all files and gets a list of file names and hashes:

    var allfiles = Directory.GetFiles(@"C:\Users\Drive\File\Uploaded\From\", "*.*")
        .Select(f =>
        {
            // Dispose the stream and the hash object so file handles are not leaked.
            using (var stream = new FileStream(f, FileMode.Open, FileAccess.Read))
            using (var sha1 = new System.Security.Cryptography.SHA1Managed())
            {
                return new { FileName = f, FileHash = sha1.ComputeHash(stream) };
            }
        })
        .ToList();

    This loops through them all and looks for a match to the new one:

    foreach (var fi in allfiles)
    {
        if (newfile == BitConverter.ToString(fi.FileHash))
            Console.WriteLine("Match!!!");
        Console.WriteLine(fi.FileName + ' ' + BitConverter.ToString(fi.FileHash));
    }
    

    Ideally you would store the hash (for example, alongside the database record) when the file is uploaded, since recomputing the hash of every stored file on each upload is very expensive.
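
As a rough sketch of storing the hash at upload time, something like the following could work. The `UploadHashIndex` class and its methods are hypothetical names, and the in-memory dictionary is only a stand-in for a hash column on the database record, which is an assumption about how the system might persist it. Like the snippets above, it only detects identical content.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper: an in-memory stand-in for a hash column in the database.
public static class UploadHashIndex
{
    // Maps the hex-encoded SHA-1 of a file's content to the path where it is stored.
    private static readonly Dictionary<string, string> HashToPath =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    // Computes the SHA-1 of the file content, formatted as "AA-BB-..." by BitConverter.
    public static string ComputeHash(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        using (SHA1 sha1 = SHA1.Create())
        {
            return BitConverter.ToString(sha1.ComputeHash(fs));
        }
    }

    // Returns the path of an already registered file with identical content.
    // If there is none, registers the uploaded file and returns null.
    public static string FindDuplicateOrRegister(string uploadedPath)
    {
        string hash = ComputeHash(uploadedPath);
        string existingPath;
        if (HashToPath.TryGetValue(hash, out existingPath))
        {
            return existingPath;
        }
        HashToPath[hash] = uploadedPath;
        return null;
    }
}
```

With this shape, each upload is hashed once, and the duplicate check becomes a single lookup instead of a scan over every stored file.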