jobflow by rofl0r
=================

this program is inspired by the functionality of GNU parallel, but tries
to keep overhead low and follow the UNIX philosophy of doing one thing well.

how it works
------------

basically, it works by processing stdin, launching one process per line.
the actual line can be passed to the started program as an argument.
this allows for easy parallelization of standard unix tasks.
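
as a rough mental model (not jobflow's actual implementation), the per-line
behaviour resembles the shell loop below. the function name `run_per_line` is
made up for this sketch, and unlike jobflow it runs the jobs one after another
rather than in parallel.

```shell
#!/bin/sh
# run_per_line CMD [ARGS...]: read stdin and start CMD once per input line,
# passing the line as the last argument. hypothetical helper, sequential only.
run_per_line() {
    while IFS= read -r line; do
        # </dev/null keeps the job from eating the lines still queued on stdin
        "$@" "$line" </dev/null
    done
}

printf 'one\ntwo\nthree\n' | run_per_line echo processing
```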

it is possible to save the number of the line currently being processed, so
that a killed task can be continued later.

example usage
-------------

you have a list of things, and a tool that processes a single thing.

    cat things.list | jobflow -threads=8 -exec ./mytask {}

    seq 100 | jobflow -threads=100 -exec echo {}

    cat urls.txt | jobflow -threads=32 -exec wget {}

    find . -name '*.bmp' | jobflow -threads=8 -exec bmp2jpeg {.}.bmp {.}.jpg

run jobflow without arguments to see a list of possible command line options
and argument permutations.

starting from version 1.3.1, jobflow can also be used to extract a range of
lines, e.g.:

    seq 100 | jobflow -skip 10 -count 10  # print lines 11 to 20

Comparison with GNU parallel
----------------------------

GNU parallel is written in perl, which has the following disadvantages:
- requires a perl installation.
  even though most people already have perl installed anyway, installing it
  just for this purpose requires up to 50 MB of storage (and potentially up to
  several hours to compile it from source on slow devices)
- requires a lot of time on startup (parsing sources, etc)
- requires a lot of memory (typically between 5 and 60 MB)

jobflow OTOH is written in C, which has numerous advantages:
- once compiled to a tiny static binary, can be used without 3rd party stuff
- very little and constant memory usage (typically a few KB)
- no startup overhead
- much higher execution speed

apart from the chosen language and related performance differences, the
following other differences exist between GNU parallel and jobflow:

+ supports rlimits passed to started processes
- doesn't support ssh (usage of remote cpus)
- doesn't support all kinds of argument permutations:
  while GNU parallel has a rich set of options to permute the input,
  this doesn't adhere to the UNIX philosophy.
  jobflow can achieve the same result by passing the unmodified input
  to a user-created script that does the required permutations with other
  standard tools.
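
as an illustration of that approach, the sketch below is a hypothetical wrapper
script (called `copy_pair.sh` here; the name, the `RUN` variable and the
two-field input format are assumptions made for this example). it assumes
jobflow hands it the whole input line as a single argument via {}, splits the
line at the first space, and calls the real tool with both parts.

```shell
#!/bin/sh
# copy_pair.sh (hypothetical): each input line has the form "SOURCE DEST".
# intended use:  cat pairs.txt | jobflow -threads=4 -exec ./copy_pair.sh {}

split_and_run() {
    line=$1
    case $line in
        *" "*) ;;
        *) echo "expected 'SOURCE DEST', got: $line" >&2; return 1 ;;
    esac
    src=${line%% *}    # everything before the first space
    dst=${line#* }     # everything after the first space
    # RUN is an assumption of this sketch; it defaults to cp and can be
    # overridden for a dry run, e.g. RUN=echo
    ${RUN:-cp} "$src" "$dst"
}

# do nothing when called without an argument
if [ $# -gt 0 ]; then
    split_and_run "$1"
fi
```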

available command line options
------------------------------

    -skip N -threads N -resume -statefile=/tmp/state -delayedflush
    -delayedspinup N -buffered -joinoutput -limits mem=16M,cpu=10
    -eof=XXX
    -exec ./mycommand {}

-skip N

    N=number of entries to skip
-count N

    N=number of lines to process (after skipping)
-threads N (alternative: -j N)

    N=number of parallel processes to spawn
-resume

    resume from last jobnumber stored in statefile
-eof XXX

    use XXX as the EOF marker on stdin
    if the marker is encountered, behave as if stdin was closed
    not compatible with pipe/bulk mode
-statefile XXX

    XXX=filename
    saves last launched jobnumber into a file
-delayedflush

    only write to statefile whenever all processes are busy,
    and at program end
-delayedspinup N

    N=maximum number of milliseconds
    ...to wait when spinning up a fresh set of processes
    a random value between 0 and the chosen amount is used to delay initial
    spinup.
    this can be handy to circumvent an I/O lockdown caused by a burst of
    activity on program startup
-buffered

    store the stdout and stderr of each launched process in temporary files,
    which are printed after the process has finished.
    this prevents the output of different processes from being mixed up.
-joinoutput

    if -buffered, write both stdout and stderr into the same file.
    this preserves the chronological order of the output, and the combined
    output will only be printed to stdout.
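
the effect of -buffered together with -joinoutput can be approximated in plain
shell as follows. this is a sketch of the idea only, not how jobflow implements
it: `run_buffered` is a made-up name, and for simplicity it prints the buffers
once all jobs are done, whereas jobflow prints each one after its process has
finished.

```shell
#!/bin/sh
# run_buffered CMD [ARGS...]: start CMD once per stdin line in the background,
# sending stdout and stderr of every job to its own temporary file.
# hypothetical helper; it prints the buffers in launch order at the very end.
run_buffered() {
    tmpdir=$(mktemp -d) || return 1
    n=0
    while IFS= read -r line; do
        n=$((n + 1))
        "$@" "$line" >"$tmpdir/$n" 2>&1 </dev/null &
    done
    wait    # block until every background job has exited
    i=1
    while [ "$i" -le "$n" ]; do
        cat "$tmpdir/$i"    # each job's output comes out as one piece
        i=$((i + 1))
    done
    rm -rf "$tmpdir"
}

# both jobs run at the same time, yet their lines are not interleaved
printf 'a\nb\n' | run_buffered sh -c 'echo "start $0"; sleep 1; echo "end $0"'
```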
-bulk N

    do bulk copies with a buffer of N bytes. only usable in pipe mode.
    this passes (almost) the entire buffer to the next scheduled job.
    the passed buffer will be truncated to the last line break boundary,
    so jobs always get entire lines to work with.
    this option is useful when you have huge input files and relatively short
    task runtimes. by using it, syscall overhead can be reduced to a minimum.
    N must be a multiple of 4KB. the suffixes G/M/K are detected.
    actual memory allocation will be twice the amount passed.
    note that pipe buffer size is limited to 64K on linux, so anything higher
    than that probably doesn't make sense.
-limits [mem=N,cpu=N,stack=N,fsize=N,nofiles=N]

    sets the rlimits of the newly created processes.
    see "man setrlimit" for an explanation. the suffixes G/M/K are detected.
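
for readers unfamiliar with rlimits: the shell's `ulimit` builtin sets the same
kind of per-process limits, and child processes inherit them. the sketch below
is a comparison only; `with_nofiles` is a made-up helper, and the assumption
that it corresponds to jobflow's nofiles=N is not confirmed by this document.
`ulimit -n` is not required by POSIX, but common shells such as bash and dash
provide it.

```shell
#!/bin/sh
# with_nofiles N CMD [ARGS...]: run CMD with at most N open file descriptors.
# hypothetical helper; the subshell keeps the limit away from the caller.
with_nofiles() {
    limit=$1
    shift
    (
        ulimit -n "$limit" || exit 1
        exec "$@"
    )
}

with_nofiles 64 sh -c 'echo "open file limit: $(ulimit -n)"'
```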
-exec command with args

    everything past -exec is treated as the command to execute on each line of
    stdin received. the line can be passed as an argument using {}.
    {.} passes everything before the last dot in a line as an argument.
    it is possible to use multiple substitutions inside a single argument,
    but currently only of one type.
    if -exec is omitted, input will merely be dumped to stdout (like cat).
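
the {.} substitution behaves much like the shell's `${var%.*}` expansion, which
removes the last dot and everything after it. the sketch below only illustrates
that analogy; `strip_ext` is a made-up name, and how jobflow treats a line
without any dot is not covered here.

```shell
#!/bin/sh
# strip_ext LINE: print LINE without its last dot and whatever follows it.
# hypothetical helper mirroring what {.} is described to pass.
strip_ext() {
    printf '%s\n' "${1%.*}"
}

strip_ext ./pics/cat.bmp       # prints ./pics/cat
strip_ext archive.tar.gz       # prints archive.tar
```

so for the input line `./pics/cat.bmp`, the command `bmp2jpeg {.}.bmp {.}.jpg`
from the example usage section would be started as
`bmp2jpeg ./pics/cat.bmp ./pics/cat.jpg`.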

BUILD
-----

just run `make`.

you may override variables used in the Makefile and set optimization
CFLAGS and similar things using a file called `config.mak`, e.g.:

    echo "CFLAGS=-O2 -g" > config.mak
    make -j2