java path filenames race-condition realpath

How to determine if 2 instances of java.nio.file.Path are pointing to the same file?


I have a simple problem -- I have multiple threads that are doing appends to a file, based upon a file name that the user can choose.

At first, I tried to synchronize on instances of Path, but after reading comments on Reddit (HERE and HERE), it appears that this is the wrong approach. Namely, 2 different instances of String will produce 2 different instances of Path that are not equal according to Path.equals(), yet DO point to the exact same file. I tested this myself, and it appears that they are right. For example, consider this snippet that I ran on my Windows 11 machine.

final Path a = Path.of("abc.java"); //find abc.java in the current directory
final Path b = Path.of("./abc.java"); //go to current directory, then find abc.java
System.out.println(a.equals(b)); //false, but they are the same file!

(And to be clear, the reason I care about Path.equals() is that it determines uniqueness in a Map<Path, V> or a Set<Path>, and I can then use those containers to always get the same instance. After all, the synchronized(someObject) statement handles thread synchronization based on whether or not a == b and NOT on a.equals(b). Just wanted to clarify, since my intentions weren't clear before.)

I need something to synchronize on. And it is clear that just synchronizing on an instance of Path that I put into my Map<Path, V> made via Path.of(sanitizedStringFromUser) isn't going to cut it. Sure, I could play "whack-a-mole", and try to find all of the possible ways that 2 different instances of String could resolve to the same file when served to Path.of(). But I am certain that I am reinventing the wheel here.

How should I resolve this, using idiomatic Java?


Solution

  • I have a simple problem

    No you don't. For example, the answer to your question depends on how you care to define the word "the same". You have a complex problem that cannot be answered at all without a list of caveats.

    I'd better get into those caveats then.

    It's a JVM, not an OS

    Any other process on the system can, of course, write to those files too and cause all sorts of havoc.

    One commonly employed fix is to stop right at your "I append to a file" and call it then and there: do not do that. If you instead have a file that just poofs into existence fully formed, that makes many things a lot simpler. For example, there is no need to worry about 'what if some other process sees that file, goes: Swell! I can read it!, and then fails because you weren't done appending to it?'.

    To do this, you create a temporary file (which you can create using a method invocation that ensures it truly is unique), append until you're happy with it, and then atomically rename it into the correct place. File systems do support atomic creation and renaming (meaning: if 2 processes simultaneously try to atomically rename some file into the same x.foo file, all but one of them will fail, guaranteed). They do not support atomic filling, i.e. you can't ask the OS: I want to write to this file for a while, can you make it look to other processes like it does not exist at all until I tell you I'm done with it?

    Which is why you fake that by saying 'give me a unique file, atomically and guaranteed' (filesystems can do that) and then 'move it to its final location atomically, i.e. only if it does not already exist, and if 2 processes attempt to do this simultaneously, all but one will fail' (filesystems can do that as well). This then means that to other processes there is no file until all of a sudden there is a file, and it is in its complete, finished state.

    If you think that is not appropriate: it is. You just need to redesign the systems that need this file so that they work this way.

    Unless, of course, you're talking about log files. Or rather, even then - having multiple separate systems that all try to make a sensible, consecutive log file is not possible either - instead each process should write its own log, and if you want, you can merge them later (either once all are done, or if these are forever-running processes, they should rotate their logs and you can thus merge all the rotated-out logs, as they are 'finished'). We're now back to the start: You have processes using atomic access to create unique files they definitely own and there is no risk.

    But the JVM is one system!

    So use the JVM's tools. Synchronize on some logger object and have everything send its logs to that logger object; this logger is then the one and only bit of code that needs to open a file and write to it. Alternatively, have each part of the JVM write to a unique file, and merge them later.
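    A minimal sketch of the first option, assuming one shared object per output file. The class name SharedAppender and its method appendLine are hypothetical, chosen only for illustration; every thread calls the same synchronized method, so only this class ever opens the file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical single-writer appender: the only code in the JVM that opens the file. */
final class SharedAppender implements AutoCloseable {
    private final BufferedWriter out;

    SharedAppender(Path file) throws IOException {
        // CREATE + APPEND: create the file if it is missing, otherwise append to it.
        this.out = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** All threads go through this one monitor, so whole lines are never interleaved. */
    synchronized void appendLine(String line) throws IOException {
        out.write(line);
        out.newLine();
        out.flush();
    }

    @Override
    public synchronized void close() throws IOException {
        out.close();
    }
}
```

    This only protects against threads inside the same JVM that share the same SharedAppender instance; it does nothing about other processes, or about a second instance opened on the same file.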

    Never mind all that, I just want an answer to my question

    Well, what you want is impossible, which is why you need to go with the alternatives.

    Take, for example, this 'trick':

    touch foo.txt
    ln foo.txt bar.txt
    

    You now have 2 files - foo.txt and bar.txt. They are separate in all ways. No possible imagination of "path equality" would ever call these 2 things the same.

    Nevertheless, write to one and you end up changing the other. Because they are hardlinked together. There is no canonical path here - even though I first created foo and then hardlinked it into bar, as far as the file system is concerned, foo and bar are peers. foo is not 'more canonical' than bar. Had you reversed the operation (create bar first, then hardlink it into foo), the bits on disk are identical in every way except, possibly, timestamps, which surely you don't want to look at, and which can be made equal trivially.

    And yet, if within your JVM you decide to open an append stream on both of these, it'll be one heck of a mess. It's understandable that you'd want to avoid this, but you can't. At least, not in a way that Java supports, i.e. not in an OS-independent way.

    If you merely want to get somewhat close, there's path.toRealPath(), which follows softlinks and resolves .. and . as well (note that it throws an IOException if the file does not exist). However, this does not give you a 100% guarantee that you won't end up with 2 appenders that congeal their output into a big old mess.
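    As a rough illustration of what toRealPath() buys you, here is a small sketch. The class RealPathDemo and the helper sameRealPath are hypothetical names; the demo creates its own temporary directory so that the file is guaranteed to exist.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class RealPathDemo {
    /** True when both paths resolve to the same real path. Both files must exist. */
    static boolean sameRealPath(Path a, Path b) throws IOException {
        return a.toRealPath().equals(b.toRealPath());
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("realpath-demo");
        Path a = Files.createFile(dir.resolve("abc.java"));
        Path b = dir.resolve(".").resolve("abc.java"); // same file, spelled differently

        System.out.println(a.equals(b));        // compares the spelling of the paths only
        System.out.println(sameRealPath(a, b)); // compares the resolved locations
    }
}
```

    A real path is a reasonable key for a Map<Path, V>, but it still treats hard links to one file as distinct entries.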

    On presumably most systems, you could use Files.isSameFile. That method should return true if you give it 2 paths to different locations that are hardlinks of each other. The javadoc is rather vague, but as per a comment from Sweeper, it works on macOS, and therefore presumably on all posixy systems, at least. Note that Windows also has hardlinks these days, made with fsutil if memory serves; you should test this if you want to use it.
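    One way to run that test yourself is sketched below. HardLinkDemo and makeHardLinkedPair are hypothetical names; Files.createLink is the standard way to create a hard link from Java, and it may throw UnsupportedOperationException on file systems without hard-link support.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class HardLinkDemo {
    /** Creates foo.txt plus a hard link bar.txt inside dir and returns {foo, bar}. */
    static Path[] makeHardLinkedPair(Path dir) throws IOException {
        Path foo = Files.createFile(dir.resolve("foo.txt"));
        // createLink(link, existing) returns the path of the new directory entry.
        Path bar = Files.createLink(dir.resolve("bar.txt"), foo);
        return new Path[] {foo, bar};
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("hardlink-demo");
        Path[] pair = makeHardLinkedPair(dir);

        // The real paths still differ: neither name is "more canonical".
        System.out.println(pair[0].toRealPath().equals(pair[1].toRealPath()));
        // Whether this reports the two names as one file is what you need to verify per platform.
        System.out.println(Files.isSameFile(pair[0], pair[1]));
    }
}
```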

    Presumably you'd have a list of all existing appenders, and anytime any code wants to make another appender, you'd check every item in the existing list with isSameFile; no lookup is possible here, you'd have to do this possibly relatively expensive operation. You might want to therefore cache the results of such an operation.
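    A minimal sketch of such a list, assuming you want one shared lock object per underlying file. FileLockRegistry and lockFor are hypothetical names; the linear scan is the point, since isSameFile gives you nothing to hash on.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical registry that hands out one lock object per underlying file. */
final class FileLockRegistry {
    private static final class Entry {
        final Path path;
        final Object lock = new Object();

        Entry(Path path) {
            this.path = path;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /**
     * Returns the lock already registered for the same file, or registers a new one.
     * The file should already exist; isSameFile may throw if it cannot access it.
     */
    synchronized Object lockFor(Path candidate) throws IOException {
        for (Entry e : entries) {
            // One file-system check per registered path: O(n) per lookup.
            if (Files.isSameFile(e.path, candidate)) {
                return e.lock;
            }
        }
        Entry created = new Entry(candidate);
        entries.add(created);
        return created.lock;
    }
}
```

    Callers would then write synchronized (registry.lockFor(path)) { ... } around each append. This inherits every caveat of isSameFile and still only coordinates threads inside one JVM.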

    But isSameFile doesn't let you write a guaranteed system. To get that guarantee, you have each appender atomically create a new file. Now it is not possible for them to clash, and it's guaranteed.

    How do I do that?

    To create a file in a way that you know, 100% guaranteed, there is no clash, you use:

    try (var out = Files.newOutputStream(pathToFile, StandardOpenOption.CREATE_NEW)) {
     ... 
    }
    

    The only way to run into trouble here is if some other process (or some other code inside your JVM process) finds that file and decides to also write to it. At that point, it's 'pilot error'. You can't stop the user from tossing their computer in a blender either. The point is, if every 'appender' uses CREATE_NEW it is impossible to get a clash.
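    To see the guarantee in action, here is a small sketch. ClaimFile and tryClaim are hypothetical names; the behaviour relied on is that CREATE_NEW fails with FileAlreadyExistsException when the file is already there.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ClaimFile {
    /** Returns true if this call created the file, false if it already existed. */
    static boolean tryClaim(Path file) throws IOException {
        // CREATE_NEW makes "check that it is absent" and "create it" one atomic step.
        try (OutputStream out = Files.newOutputStream(file, StandardOpenOption.CREATE_NEW)) {
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // somebody else got there first
        }
    }
}
```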

    If you want to rename them into the right place, you use:

    Files.move(pathToTempFile, pathToFinalFile, StandardCopyOption.ATOMIC_MOVE)
    

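    This performs the move as a single atomic file system operation (resolving soft links, aliases, and .. / . as needed), so no other process or thread ever sees a half-moved file. Be careful with the 'only if pathToFinalFile doesn't already exist' part, though: according to the javadoc of ATOMIC_MOVE, it is implementation specific whether an existing target is replaced or the call fails with an IOException. If you rely on all-but-one of several simultaneous movers failing, test that on your target platform.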

    That just leaves 'how do I make a temp file'. Generally, you just loop: append a random number to the name and call Files.newOutputStream(..., CREATE_NEW) until it works, then use that file. You can use Java's built-in temp file generator, Files.createTempFile, to do this if you must.
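    Putting the pieces together, a sketch of the whole flow might look like the following. PublishWhenDone and publish are hypothetical names, and the ".tmp" naming scheme is an assumption. The temp file is created next to the target so that both live on the same file system, which an atomic move generally requires.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ThreadLocalRandom;

final class PublishWhenDone {
    /** Writes content to a uniquely named temp file next to target, then moves it into place. */
    static void publish(Path target, String content) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        while (true) {
            long suffix = ThreadLocalRandom.current().nextLong(Long.MAX_VALUE);
            Path temp = dir.resolve(target.getFileName() + "." + suffix + ".tmp");
            try (OutputStream out = Files.newOutputStream(temp, StandardOpenOption.CREATE_NEW)) {
                out.write(content.getBytes(StandardCharsets.UTF_8));
            } catch (FileAlreadyExistsException e) {
                continue; // name already taken: pick another random suffix
            }
            // Whether an existing target is replaced or this call fails is implementation specific.
            Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
            return;
        }
    }
}
```

    Readers of the target never see a partially written file: either it is absent, or it is complete.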