c linux x86-64 checkpoint

Is it possible to interrupt a process and checkpoint it to resume it later on?


Let's say you have an application that is consuming all the computational power, and you now want to do some other necessary work. Is there any way on Linux to interrupt that application and checkpoint its state, so that it can later be resumed from the state in which it was interrupted?

I am especially interested in a way to stop the application and restart it on another machine. Is that possible too?


Solution

  • In general terms, checkpointing a process is not entirely possible, because a process is not only an address space but also holds other resources, such as file descriptors and TCP/IP sockets.

    In practice, you can use a checkpointing library such as BLCR. Under certain restrictions, you might be able to migrate a checkpoint image from one system to another, provided the target is very similar to the source (same kernel, same versions of libraries and compilers, etc.).

    Migrating images is also possible at the virtual machine level, and some virtual machine systems are quite good at it.

    You could also design and implement your software with your own checkpointing machinery. In that case, it helps to think in terms of garbage collection techniques and terminology. Also look at the unexec.c file in Emacs (or XEmacs), which is heavily machine dependent.

    Some language implementations and runtimes have checkpointing primitives. SBCL (a free Common Lisp implementation) is able to save a core image and restart it later. SML/NJ is able to export an image. Squeak (a Smalltalk implementation) also has this ability.

    As another example of checkpointing, the GCC compiler is able to compile a single *.h header into a pre-compiled header file, which is a persistent image of the GCC heap produced with persistence techniques.

    Read more about orthogonal persistence, which is also a research subject. Serialization is also relevant (and you might want to use textual formats à la JSON, YAML, XML, ...). You might also use hibernation techniques (at the whole-system level).
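
To make the "own checkpointing machinery" option above a bit more concrete, here is a minimal sketch in C of application-level checkpointing. Everything in it is an assumption made for illustration: the `struct state` layout, the function names `save_checkpoint`, `load_checkpoint` and `run`, and the toy workload (a sum of squares). The `max_steps` parameter merely stands in for an interruption; a real program would more likely checkpoint periodically or when asked to stop. The state is written as a raw binary struct, so the file is only usable on a machine with the same ABI, which mirrors the "very similar system" restriction mentioned above; a textual format would travel better.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical application state: everything needed to resume the loop. */
struct state {
    uint64_t next_i;  /* next iteration to execute */
    uint64_t sum;     /* partial result so far */
};

/* Write the state to a temporary file, then rename it over the real one,
   so an interruption in mid-write does not leave a truncated checkpoint.
   (No fsync is done here, so this does not guard against power loss.) */
int save_checkpoint(const char *path, const struct state *s)
{
    char tmp[4096];
    assert(path != NULL && s != NULL);
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    FILE *f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    int ok = fwrite(s, sizeof *s, 1, f) == 1;
    if (fclose(f) != 0)
        ok = 0;
    if (!ok) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Returns 0 and fills *s on success, -1 if no usable checkpoint exists. */
int load_checkpoint(const char *path, struct state *s)
{
    struct state loaded;
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    int ok = fread(&loaded, sizeof loaded, 1, f) == 1;
    fclose(f);
    if (!ok)
        return -1;
    *s = loaded;
    return 0;
}

/* Sum i*i for i in [0, total), resuming from the checkpoint at `path`
   if there is one, and running at most max_steps iterations in this call.
   Returns 1 when the work is finished, 0 if it stopped early,
   and -1 if the checkpoint could not be written. */
int run(const char *path, uint64_t total, uint64_t max_steps, struct state *s)
{
    if (load_checkpoint(path, s) != 0) {
        s->next_i = 0;  /* no checkpoint: start from scratch */
        s->sum = 0;
    }
    for (uint64_t steps = 0; s->next_i < total && steps < max_steps; steps++) {
        s->sum += s->next_i * s->next_i;
        s->next_i++;
    }
    if (save_checkpoint(path, s) != 0)
        return -1;
    return s->next_i >= total;
}

#ifdef CHECKPOINT_DEMO_MAIN
int main(void)
{
    struct state s;
    int done = run("demo.ckpt", 1000000, 250000, &s);
    if (done < 0) {
        perror("checkpoint");
        return 1;
    }
    printf("next_i=%llu sum=%llu %s\n",
           (unsigned long long)s.next_i, (unsigned long long)s.sum,
           done ? "(finished)" : "(run again to continue)");
    return 0;
}
#endif
```

Compiled with `-DCHECKPOINT_DEMO_MAIN`, each run of the program does a quarter of the work and leaves `demo.ckpt` behind; copying that file to another, similar machine and running the same binary there continues from where the first run stopped. Note that this only captures what the program chooses to put into `struct state`; open file descriptors and sockets are not preserved, as pointed out above.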