Running long computations

When running a long computation on a Linux machine, you may want to log out while it runs. There are several ways to do this: the simplest is 'nohup', a more advanced one is 'screen'.

In all cases, be aware that running more programs than there are processors on a system is not useful: the processes just get slower. This also holds for processes being run interactively, so be nice to your fellow students and don't run too many processes in the background at the same time. Also, please make sure that the processes you start actually finish at some point. Nobody likes runaway processes that eat CPU time for weeks without doing useful work.
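Before starting a long job, it can help to check how many processors the machine has and how busy it already is. A minimal sketch, assuming the GNU coreutils 'nproc' command is installed (the variable names are only for illustration):

```shell
# Number of processors available to you (nproc is part of GNU coreutils).
cpus=$(nproc)

# uptime ends with the load averages over the last 1, 5 and 15 minutes.
# A load average near or above the processor count means the machine is full.
load=$(uptime)

echo "processors: $cpus"
echo "$load"
```

If the load average is already close to the number of processors, starting another job will mostly slow everybody down.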

FIXME Add a section about 'screen'.


Nohup means 'no hangup', and it originates from the time when people dialed into a computer by modem over a phone line. When you disconnected (literally, hung up the phone), all running processes were killed. The nohup command was invented to prevent this from happening.

Suppose you want to run

sleep 20

while you are logged out.

You can do this by wrapping the command into a nohup command, like

nohup sleep 20 >& output &

Three new things are happening here:

* The command you want to run is prefixed with the 'nohup' command.
* The output of the command, both normal output and error messages, is captured in the file 'output' (you are not logged on, so the output cannot be displayed). Take care not to dump too much information into the file, as it eats disk space. If you are not interested in the output at all, you can use '/dev/null' instead of 'output', a device that throws away anything you send to it.
* The '&' at the end makes your prompt return immediately (instead of waiting until the program finishes).
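The '>&' redirection is understood by csh-style shells, bash and zsh, but not by every plain 'sh'. A sketch of the same command in a spelling that should work in any POSIX-style shell; the variable 'pid' is only for illustration:

```shell
# Send normal output to the file 'output', then point error output (2)
# at the same place as normal output (1).
nohup sleep 20 > output 2>&1 &

# $! holds the pid of the most recently started background job.
pid=$!
echo "started background job with pid $pid"
```

Storing the pid right away saves you from searching for it later.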

In a shell, it looks like:

$ nohup sleep 20 >& /dev/null &
[1] 12144

It started a new process with process ID (pid) 12144, and this is the first process running in the background (the '[1]') in this shell. You can ask about the jobs of this shell with 'jobs'. If you wait long enough (and depending on your shell), the system may also report finishing its job after about 20 seconds.

$ jobs
[1]  + running    nohup sleep 20 >& /dev/null
[1]  + done       nohup sleep 20 >& /dev/null

Note that the last reported line '[1] + done …' came all by itself, nothing was entered at the prompt.

After starting the job, you can log out if you like.

If you want to find this process again from another shell (that is, after logging in again on the same system), you need to use 'ps'.

$ ps ux
user     12144  0.0  0.0  58896   548 pts/8    SN   09:41   0:00 sleep 20

'ps' lists the active processes on the system. By default it shows only the processes started from the current terminal; the 'x' includes all processes owned by you, also those without a terminal, and the 'u' selects a user-oriented output format with the owner, CPU and memory usage, and start time. Amongst the output is of course also the 'sleep 20' command. You can recognize it by its pid (the number '12144') and by the command itself at the end of the line. If the process has already finished, it will not show up in the 'ps' output. If it is still running, you can stop it with

$ kill 12144
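If you already know the pid, you can also check on that one process directly instead of reading the whole 'ps' list. A minimal sketch; it starts its own 'sleep 20' as a stand-in for your computation, and the variables 'pid' and 'status' are only for illustration:

```shell
# Stand-in for the long computation, so the example is self-contained.
nohup sleep 20 > /dev/null 2>&1 &
pid=$!

# Show only this process: its pid, elapsed running time and command line.
ps -p "$pid" -o pid,etime,args

# 'kill -0' sends no signal; it only tests whether the process still exists
# and whether you are allowed to signal it.
if kill -0 "$pid" 2>/dev/null; then
    kill "$pid"                        # ask the process to terminate
    wait "$pid" 2>/dev/null || true    # collect it; only works for jobs of this shell
    status="stopped"
else
    status="already finished"
fi
echo "$status"
```

From a new login shell the 'wait' has no effect, because the process is no longer a job of that shell; the 'ps -p' and 'kill' lines work the same.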

After the process is finished, you can look at the output file ('output' above). You can also look at the output before stopping the process, but the last part of the file may not have been written in that case. Don't forget to remove the output file afterwards.
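To peek at an output file without opening all of it, 'tail' and 'wc' are usually enough. A small sketch; it creates a temporary stand-in file with 'mktemp' instead of using a real job's output, and the variable names are only for illustration:

```shell
# Stand-in for a job's output file.
out=$(mktemp)
printf 'step 1\nstep 2\nstep 3\n' > "$out"

last=$(tail -n 2 "$out")    # the last two lines written so far
size=$(wc -c < "$out")      # size of the file in bytes

echo "$last"
echo "size in bytes: $size"

# Remove the file when you no longer need it.
rm "$out"
```

While a job is still running, 'tail -f' on its output file keeps printing new lines as they are written; press Ctrl-C to stop watching (this does not stop the job).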


FIXME Write a section about 'ulimit -a' and 'ulimit -t'

sesystems/running_long_computations.txt · Last modified: Monday, 15 August 2011 : 10:15:56 by hat