1 #!/usr/bin/perl -w
# SPDX-FileCopyrightText: 2021-2023 Ole Tange, http://ole.tange.dk and Free Software Foundation, Inc.
4 # SPDX-License-Identifier: GFDL-1.3-or-later
5 # SPDX-License-Identifier: CC-BY-SA-4.0
7 =encoding utf8
9 =head1 GNU PARALLEL EXAMPLES
11 =head2 EXAMPLE: Working as xargs -n1. Argument appending
GNU B<parallel> can work similarly to B<xargs -n1>.
15 To compress all html files using B<gzip> run:
17 find . -name '*.html' | parallel gzip --best
19 If the file names may contain a newline use B<-0>. Substitute FOO BAR with
20 FUBAR in all files in this dir and subdirs:
22 find . -type f -print0 | \
23 parallel -q0 perl -i -pe 's/FOO BAR/FUBAR/g'
25 Note B<-q> is needed because of the space in 'FOO BAR'.
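To check how GNU B<parallel> will quote and combine a command without
touching any files, B<--dry-run> prints the jobs instead of running
them (a sketch of the command above):

  find . -type f -print0 | \
    parallel -q0 --dry-run perl -i -pe 's/FOO BAR/FUBAR/g'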
28 =head2 EXAMPLE: Simple network scanner
30 B<prips> can generate IP-addresses from CIDR notation. With GNU
31 B<parallel> you can build a simple network scanner to see which
32 addresses respond to B<ping>:
34 prips 130.229.16.0/20 | \
35 parallel --timeout 2 -j0 \
36 'ping -c 1 {} >/dev/null && echo {}' 2>/dev/null
39 =head2 EXAMPLE: Reading arguments from command line
41 GNU B<parallel> can take the arguments from command line instead of
42 stdin (standard input). To compress all html files in the current dir
43 using B<gzip> run:
45 parallel gzip --best ::: *.html
47 To convert *.wav to *.mp3 using LAME running one process per CPU run:
49 parallel lame {} -o {.}.mp3 ::: *.wav
52 =head2 EXAMPLE: Inserting multiple arguments
54 When moving a lot of files like this: B<mv *.log destdir> you will
55 sometimes get the error:
57 bash: /bin/mv: Argument list too long
59 because there are too many files. You can instead do:
61 ls | grep -E '\.log$' | parallel mv {} destdir
63 This will run B<mv> for each file. It can be done faster if B<mv> gets
as many arguments as will fit on the line:
66 ls | grep -E '\.log$' | parallel -m mv {} destdir
68 In many shells you can also use B<printf>:
70 printf '%s\0' *.log | parallel -0 -m mv {} destdir
73 =head2 EXAMPLE: Context replace
75 To remove the files I<pict0000.jpg> .. I<pict9999.jpg> you could do:
77 seq -w 0 9999 | parallel rm pict{}.jpg
79 You could also do:
81 seq -w 0 9999 | perl -pe 's/(.*)/pict$1.jpg/' | parallel -m rm
The first will run B<rm> 10000 times, while the last will only run
B<rm> as many times as needed to keep the command line length short
enough to avoid B<Argument list too long> (it typically runs 1-2 times).
87 You could also run:
89 seq -w 0 9999 | parallel -X rm pict{}.jpg
This will also only run B<rm> as many times as needed to keep the
command line length short enough.
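The difference between B<-m> and B<-X> is easiest to see with B<echo>
standing in for B<rm> (a minimal sketch):

  parallel -m echo pict{}.jpg ::: 1 2 3
  # => pict1 2 3.jpg
  parallel -X echo pict{}.jpg ::: 1 2 3
  # => pict1.jpg pict2.jpg pict3.jpg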
95 =head2 EXAMPLE: Compute intensive jobs and substitution
97 If ImageMagick is installed this will generate a thumbnail of a jpg
98 file:
100 convert -geometry 120 foo.jpg thumb_foo.jpg
102 This will run with number-of-cpus jobs in parallel for all jpg files
103 in a directory:
105 ls *.jpg | parallel convert -geometry 120 {} thumb_{}
107 To do it recursively use B<find>:
109 find . -name '*.jpg' | \
110 parallel convert -geometry 120 {} {}_thumb.jpg
Notice how the argument has to start with B<{}> as B<{}> will include
the path (e.g. running B<convert -geometry 120 ./foo/bar.jpg
thumb_./foo/bar.jpg> would clearly be wrong). The command will
generate files like ./foo/bar.jpg_thumb.jpg.
117 Use B<{.}> to avoid the extra .jpg in the file name. This command will
118 make files like ./foo/bar_thumb.jpg:
120 find . -name '*.jpg' | \
121 parallel convert -geometry 120 {} {.}_thumb.jpg
124 =head2 EXAMPLE: Substitution and redirection
126 This will generate an uncompressed version of .gz-files next to the .gz-file:
128 parallel zcat {} ">"{.} ::: *.gz
130 Quoting of > is necessary to postpone the redirection. Another
131 solution is to quote the whole command:
133 parallel "zcat {} >{.}" ::: *.gz
135 Other special shell characters (such as * ; $ > < | >> <<) also need
136 to be put in quotes, as they may otherwise be interpreted by the shell
137 and not given to GNU B<parallel>.
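A minimal sketch of the difference: quoted, the pipe runs inside each
job; unquoted, it would be applied once to the combined output of GNU
B<parallel>:

  parallel 'echo {} | wc -c' ::: a bb ccc   # one count per job: 2 3 4
  parallel echo {} ::: a bb ccc | wc -c     # one count in total: 9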
140 =head2 EXAMPLE: Composed commands
142 A job can consist of several commands. This will print the number of
143 files in each directory:
145 ls | parallel 'echo -n {}" "; ls {}|wc -l'
147 To put the output in a file called <name>.dir:
149 ls | parallel '(echo -n {}" "; ls {}|wc -l) >{}.dir'
151 Even small shell scripts can be run by GNU B<parallel>:
153 find . | parallel 'a={}; name=${a##*/};' \
154 'upper=$(echo "$name" | tr "[:lower:]" "[:upper:]");'\
155 'echo "$name - $upper"'
157 ls | parallel 'mv {} "$(echo {} | tr "[:upper:]" "[:lower:]")"'
159 Given a list of URLs, list all URLs that fail to download. Print the
160 line number and the URL.
162 cat urlfile | parallel "wget {} 2>/dev/null || grep -n {} urlfile"
164 Create a mirror directory with the same file names except all files and
165 symlinks are empty files.
167 cp -rs /the/source/dir mirror_dir
168 find mirror_dir -type l | parallel -m rm {} '&&' touch {}
170 Find the files in a list that do not exist
172 cat file_list | parallel 'if [ ! -e {} ] ; then echo {}; fi'
175 =head2 EXAMPLE: Composed command with perl replacement string
You have a bunch of files. You want them sorted into dirs. The dir of
each file should be named after the first letter of the file name.
180 parallel 'mkdir -p {=s/(.).*/$1/=}; mv {} {=s/(.).*/$1/=}' ::: *
183 =head2 EXAMPLE: Composed command with multiple input sources
You have a dir with files named as 24 hours in 5 minute intervals:
00:00, 00:05, 00:10 .. 23:55. You want to find the missing files:
188 parallel [ -f {1}:{2} ] "||" echo {1}:{2} does not exist \
189 ::: {00..23} ::: {00..55..5}
192 =head2 EXAMPLE: Calling Bash functions
194 If the composed command is longer than a line, it becomes hard to
195 read. In Bash you can use functions. Just remember to B<export -f> the
196 function.
  doit() {
    echo Doing it for $1
    sleep 2
    echo Done with $1
  }
  export -f doit
  parallel doit ::: 1 2 3
  doubleit() {
    echo Doing it for $1 $2
    sleep 2
    echo Done with $1 $2
  }
  export -f doubleit
  parallel doubleit ::: 1 2 3 ::: a b
214 To do this on remote servers you need to transfer the function using
215 B<--env>:
217 parallel --env doit -S server doit ::: 1 2 3
218 parallel --env doubleit -S server doubleit ::: 1 2 3 ::: a b
220 If your environment (aliases, variables, and functions) is small you
221 can copy the full environment without having to
222 B<export -f> anything. See B<env_parallel>.
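A minimal sketch of the B<env_parallel> approach (assuming B<bash>;
B<env_parallel --install> activates it for future shells):

  . "$(which env_parallel.bash)"
  doit() { echo Doing it for $1; }
  env_parallel doit ::: 1 2 3
  env_parallel -S server doit ::: 1 2 3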
225 =head2 EXAMPLE: Function tester
227 To test a program with different parameters:
  tester() {
    if (eval "$@") >&/dev/null; then
      perl -e 'printf "\033[30;102m[ OK ]\033[0m @ARGV\n"' "$@"
    else
      perl -e 'printf "\033[30;101m[FAIL]\033[0m @ARGV\n"' "$@"
    fi
  }
  export -f tester
  parallel tester my_program ::: arg1 arg2
  parallel tester exit ::: 1 0 2 0
240 If B<my_program> fails a red FAIL will be printed followed by the failing
241 command; otherwise a green OK will be printed followed by the command.
=head2 EXAMPLE: Continuously show the latest line of output
246 It can be useful to monitor the output of running jobs.
This shows the most recent output line until a job finishes, after
which the output of the job is printed in full:
251 parallel '{} | tee >(cat >&3)' ::: 'command 1' 'command 2' \
252 3> >(perl -ne '$|=1;chomp;printf"%.'$COLUMNS's\r",$_." "x100')
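Newer versions of GNU B<parallel> have B<--latest-line>, which does
this without the B<tee>/B<perl> plumbing (assuming your version
supports it):

  parallel --latest-line '{}' ::: 'command 1' 'command 2'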
255 =head2 EXAMPLE: Log rotate
257 Log rotation renames a logfile to an extension with a higher number:
258 log.1 becomes log.2, log.2 becomes log.3, and so on. The oldest log is
259 removed. To avoid overwriting files the process starts backwards from
260 the high number to the low number. This will keep 10 old versions of
261 the log:
263 seq 9 -1 1 | parallel -j1 mv log.{} log.'{= $_++ =}'
264 mv log log.1
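To see the generated commands without renaming anything, add
B<--dry-run>:

  seq 9 -1 1 | parallel -j1 --dry-run mv log.{} log.'{= $_++ =}'
  # mv log.9 log.10
  # ...
  # mv log.1 log.2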
267 =head2 EXAMPLE: Removing file extension when processing files
When processing files, removing the file extension using B<{.}> is
often useful.
272 Create a directory for each zip-file and unzip it in that dir:
274 parallel 'mkdir {.}; cd {.}; unzip ../{}' ::: *.zip
276 Recompress all .gz files in current directory using B<bzip2> running 1
277 job per CPU in parallel:
279 parallel "zcat {} | bzip2 >{.}.bz2 && rm {}" ::: *.gz
281 Convert all WAV files to MP3 using LAME:
283 find sounddir -type f -name '*.wav' | parallel lame {} -o {.}.mp3
Put all converted files in the same directory:
287 find sounddir -type f -name '*.wav' | \
288 parallel lame {} -o mydir/{/.}.mp3
291 =head2 EXAMPLE: Replacing parts of file names
293 If you deal with paired end reads, you will have files like
294 barcode1_R1.fq.gz, barcode1_R2.fq.gz, barcode2_R1.fq.gz, and
295 barcode2_R2.fq.gz.
297 You want barcodeI<N>_R1 to be processed with barcodeI<N>_R2.
299 parallel --plus myprocess {} {/_R1.fq.gz/_R2.fq.gz} ::: *_R1.fq.gz
301 If the barcode does not contain '_R1', you can do:
303 parallel --plus myprocess {} {/_R1/_R2} ::: *_R1.fq.gz
306 =head2 EXAMPLE: Removing strings from the argument
If you have a directory with tar.gz files and want these extracted in
the corresponding dir (e.g. foo.tar.gz will be extracted in the dir
foo) you can do:
312 parallel --plus 'mkdir {..}; tar -C {..} -xf {}' ::: *.tar.gz
314 If you want to remove a different ending, you can use {%string}:
316 parallel --plus echo {%_demo} ::: mycode_demo keep_demo_here
You can also remove a starting string with {#string}:
320 parallel --plus echo {#demo_} ::: demo_mycode keep_demo_here
322 To remove a string anywhere you can use regular expressions with
323 {/regexp/replacement} and leave the replacement empty:
325 parallel --plus echo {/demo_/} ::: demo_mycode remove_demo_here
328 =head2 EXAMPLE: Download 24 images for each of the past 30 days
330 Let us assume a website stores images like:
332 https://www.example.com/path/to/YYYYMMDD_##.jpg
334 where YYYYMMDD is the date and ## is the number 01-24. This will
335 download images for the past 30 days:
  getit() {
    date=$(date -d "today -$1 days" +%Y%m%d)
    num=$2
    echo wget https://www.example.com/path/to/${date}_${num}.jpg
  }
  export -f getit

  parallel getit ::: $(seq 30) ::: $(seq -w 24)
346 B<$(date -d "today -$1 days" +%Y%m%d)> will give the dates in
347 YYYYMMDD with B<$1> days subtracted.
350 =head2 EXAMPLE: Download world map from NASA
NASA provides tiles to download on earthdata.nasa.gov. Download tiles
for the Blue Marble world map and create a 10240x20480 map.
355 base=https://map1a.vis.earthdata.nasa.gov/wmts-geo/wmts.cgi
356 service="SERVICE=WMTS&REQUEST=GetTile&VERSION=1.0.0"
357 layer="LAYER=BlueMarble_ShadedRelief_Bathymetry"
358 set="STYLE=&TILEMATRIXSET=EPSG4326_500m&TILEMATRIX=5"
359 tile="TILEROW={1}&TILECOL={2}"
360 format="FORMAT=image%2Fjpeg"
361 url="$base?$service&$layer&$set&$tile&$format"
363 parallel -j0 -q wget "$url" -O {1}_{2}.jpg ::: {0..19} ::: {0..39}
364 parallel eval convert +append {}_{0..39}.jpg line{}.jpg ::: {0..19}
365 convert -append line{0..19}.jpg world.jpg
368 =head2 EXAMPLE: Download Apollo-11 images from NASA using jq
Search NASA using their API to get JSON for images related to 'apollo
11' that have 'moon landing' in the description.
373 The search query returns JSON containing URLs to JSON containing
collections of pictures. One of the pictures in each of these
collections is I<large>.
377 B<wget> is used to get the JSON for the search query. B<jq> is then
378 used to extract the URLs of the collections. B<parallel> then calls
379 B<wget> to get each collection, which is passed to B<jq> to extract
380 the URLs of all images. B<grep> filters out the I<large> images, and
381 B<parallel> finally uses B<wget> to fetch the images.
383 base="https://images-api.nasa.gov/search"
384 q="q=apollo 11"
385 description="description=moon landing"
386 media_type="media_type=image"
387 wget -O - "$base?$q&$description&$media_type" |
388 jq -r .collection.items[].href |
389 parallel wget -O - |
390 jq -r .[] |
391 grep large |
392 parallel wget
395 =head2 EXAMPLE: Download video playlist in parallel
B<youtube-dl> is an excellent tool to download videos. It cannot,
however, download videos in parallel. This takes a playlist and
downloads 10 videos in parallel.
401 url='youtu.be/watch?v=0wOf2Fgi3DE&list=UU_cznB5YZZmvAmeq7Y3EriQ'
402 export url
403 youtube-dl --flat-playlist "https://$url" |
404 parallel --tagstring {#} --lb -j10 \
405 youtube-dl --playlist-start {#} --playlist-end {#} '"https://$url"'
408 =head2 EXAMPLE: Prepend last modified date (ISO8601) to file name
410 parallel mv {} '{= $a=pQ($_); $b=$_;' \
411 '$_=qx{date -r "$a" +%FT%T}; chomp; $_="$_ $b" =}' ::: *
413 B<{=> and B<=}> mark a perl expression. B<pQ> perl-quotes the
414 string. B<date +%FT%T> is the date in ISO8601 with time.
416 =head2 EXAMPLE: Save output in ISO8601 dirs
418 Save output from B<ps aux> every second into dirs named
419 yyyy-mm-ddThh:mm:ss+zz:zz.
421 seq 1000 | parallel -N0 -j1 --delay 1 \
422 --results '{= $_=`date -Isec`; chomp=}/' ps aux
425 =head2 EXAMPLE: Digital clock with "blinking" :
The : in a digital clock blinks. To make every other line have a ':'
and the rest a ' ', a perl expression is used to look at the 3rd input
source. If the value modulo 2 is 1, use ":"; otherwise use " ":
431 parallel -k echo {1}'{=3 $_=$_%2?":":" "=}'{2}{3} \
432 ::: {0..12} ::: {0..5} ::: {0..9}
435 =head2 EXAMPLE: Aggregating content of files
437 This:
439 parallel --header : echo x{X}y{Y}z{Z} \> x{X}y{Y}z{Z} \
440 ::: X {1..5} ::: Y {01..10} ::: Z {1..5}
442 will generate the files x1y01z1 .. x5y10z5. If you want to aggregate
443 the output grouping on x and z you can do this:
445 parallel eval 'cat {=s/y01/y*/=} > {=s/y01//=}' ::: *y01*
447 For all values of x and z it runs commands like:
449 cat x1y*z1 > x1z1
451 So you end up with x1z1 .. x5z5 each containing the content of all
452 values of y.
455 =head2 EXAMPLE: Breadth first parallel web crawler/mirrorer
The script below will crawl and mirror a URL in parallel. It
downloads first pages that are 1 click down, then 2 clicks down, then
3; instead of the normal depth first, where the first link on each
page is fetched first.
462 Run like this:
464 PARALLEL=-j100 ./parallel-crawl http://gatt.org.yeslab.org/
466 Remove the B<wget> part if you only want a web crawler.
468 It works by fetching a page from a list of URLs and looking for links
469 in that page that are within the same starting URL and that have not
470 already been seen. These links are added to a new queue. When all the
pages from the list are done, the new queue is moved to the list of
472 URLs and the process is started over until no unseen links are found.
474 #!/bin/bash
476 # E.g. http://gatt.org.yeslab.org/
477 URL=$1
478 # Stay inside the start dir
479 BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
480 URLLIST=$(mktemp urllist.XXXX)
481 URLLIST2=$(mktemp urllist.XXXX)
482 SEEN=$(mktemp seen.XXXX)
484 # Spider to get the URLs
485 echo $URL >$URLLIST
486 cp $URLLIST $SEEN
488 while [ -s $URLLIST ] ; do
489 cat $URLLIST |
490 parallel lynx -listonly -image_links -dump {} \; \
491 wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
492 perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and
493 do { $seen{$1}++ or print }' |
494 grep -F $BASEURL |
495 grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
496 mv $URLLIST2 $URLLIST
497 done
499 rm -f $URLLIST $URLLIST2 $SEEN
502 =head2 EXAMPLE: Process files from a tar file while unpacking
504 If the files to be processed are in a tar file then unpacking one file
505 and processing it immediately may be faster than first unpacking all
506 files.
508 tar xvf foo.tgz | perl -ne 'print $l;$l=$_;END{print $l}' | \
509 parallel echo
511 The Perl one-liner is needed to make sure the file is complete before
512 handing it to GNU B<parallel>.
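For example, to compress each file with B<gzip> as soon as B<tar> has
finished writing it (a sketch reusing the one-liner above):

  tar xvf foo.tgz | perl -ne 'print $l;$l=$_;END{print $l}' | \
    parallel gzip --best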
515 =head2 EXAMPLE: Rewriting a for-loop and a while-read-loop
517 for-loops like this:
519 (for x in `cat list` ; do
520 do_something $x
521 done) | process_output
523 and while-read-loops like this:
525 cat list | (while read x ; do
526 do_something $x
527 done) | process_output
529 can be written like this:
531 cat list | parallel do_something | process_output
For example: Find which host name in a list has IP address 1.2.3.4:
535 cat hosts.txt | parallel -P 100 host | grep 1.2.3.4
If the processing requires more steps, the for-loop like this:
539 (for x in `cat list` ; do
540 no_extension=${x%.*};
541 do_step1 $x scale $no_extension.jpg
542 do_step2 <$x $no_extension
543 done) | process_output
545 and while-loops like this:
547 cat list | (while read x ; do
548 no_extension=${x%.*};
549 do_step1 $x scale $no_extension.jpg
550 do_step2 <$x $no_extension
551 done) | process_output
553 can be written like this:
555 cat list | parallel "do_step1 {} scale {.}.jpg ; do_step2 <{} {.}" |\
556 process_output
558 If the body of the loop is bigger, it improves readability to use a function:
560 (for x in `cat list` ; do
561 do_something $x
562 [... 100 lines that do something with $x ...]
563 done) | process_output
565 cat list | (while read x ; do
566 do_something $x
567 [... 100 lines that do something with $x ...]
568 done) | process_output
570 can both be rewritten as:
  doit() {
    x=$1
    do_something $x
    [... 100 lines that do something with $x ...]
  }
  export -f doit
  cat list | parallel doit
580 =head2 EXAMPLE: Rewriting nested for-loops
582 Nested for-loops like this:
584 (for x in `cat xlist` ; do
585 for y in `cat ylist` ; do
586 do_something $x $y
587 done
588 done) | process_output
590 can be written like this:
592 parallel do_something {1} {2} :::: xlist ylist | process_output
594 Nested for-loops like this:
596 (for colour in red green blue ; do
597 for size in S M L XL XXL ; do
598 echo $colour $size
599 done
600 done) | sort
602 can be written like this:
604 parallel echo {1} {2} ::: red green blue ::: S M L XL XXL | sort
607 =head2 EXAMPLE: Finding the lowest difference between files
609 B<diff> is good for finding differences in text files. B<diff | wc -l>
610 gives an indication of the size of the difference. To find the
611 differences between all files in the current dir do:
613 parallel --tag 'diff {1} {2} | wc -l' ::: * ::: * | sort -nk3
615 This way it is possible to see if some files are closer to other
616 files.
619 =head2 EXAMPLE: for-loops with column names
When doing multiple nested for-loops it can be easier to keep track of
the loop variable if it is named instead of just having a number. Use
B<--header :> to let the first argument be a named alias for the
positional replacement string:
626 parallel --header : echo {colour} {size} \
627 ::: colour red green blue ::: size S M L XL XXL
629 This also works if the input file is a file with columns:
631 cat addressbook.tsv | \
632 parallel --colsep '\t' --header : echo {Name} {E-mail address}
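Here is a hypothetical I<addressbook.tsv> for the command above (the
first line names the columns):

  Name<TAB>E-mail address
  Ole<TAB>ole@example.com
  Maja<TAB>maja@example.com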
635 =head2 EXAMPLE: All combinations in a list
637 GNU B<parallel> makes all combinations when given two lists.
639 To make all combinations in a single list with unique values, you
640 repeat the list and use replacement string B<{choose_k}>:
642 parallel --plus echo {choose_k} ::: A B C D ::: A B C D
644 parallel --plus echo 2{2choose_k} 1{1choose_k} ::: A B C D ::: A B C D
646 B<{choose_k}> works for any number of input sources:
648 parallel --plus echo {choose_k} ::: A B C D ::: A B C D ::: A B C D
650 Where B<{choose_k}> does not care about order, B<{uniq}> cares about
651 order. It simply skips jobs where values from different input sources
652 are the same:
654 parallel --plus echo {uniq} ::: A B C ::: A B C ::: A B C
655 parallel --plus echo {1uniq}+{2uniq}+{3uniq} \
656 ::: A B C ::: A B C ::: A B C
The behaviour of B<{choose_k}> is undefined if the input values of each
source are different.
662 =head2 EXAMPLE: From a to b and b to c
664 Assume you have input like:
  aardvark
  babble
  cab
  dab
  each
672 and want to run combinations like:
674 aardvark babble
675 babble cab
676 cab dab
677 dab each
679 If the input is in the file in.txt:
681 parallel echo {1} - {2} ::::+ <(head -n -1 in.txt) <(tail -n +2 in.txt)
683 If the input is in the array $a here are two solutions:
685 seq $((${#a[@]}-1)) | \
686 env_parallel --env a echo '${a[{=$_--=}]} - ${a[{}]}'
687 parallel echo {1} - {2} ::: "${a[@]::${#a[@]}-1}" :::+ "${a[@]:1}"
690 =head2 EXAMPLE: Count the differences between all files in a dir
692 Using B<--results> the results are saved in /tmp/diffcount*.
694 parallel --results /tmp/diffcount "diff -U 0 {1} {2} | \
695 tail -n +3 |grep -v '^@'|wc -l" ::: * ::: *
697 To see the difference between file A and file B look at the file
698 '/tmp/diffcount/1/A/2/B'.
701 =head2 EXAMPLE: Speeding up fast jobs
703 Starting a job on the local machine takes around 3-10 ms. This can be
704 a big overhead if the job takes very few ms to run. Often you can
705 group small jobs together using B<-X> which will make the overhead
706 less significant. Compare the speed of these:
708 seq -w 0 9999 | parallel touch pict{}.jpg
709 seq -w 0 9999 | parallel -X touch pict{}.jpg
711 If your program cannot take multiple arguments, then you can use GNU
712 B<parallel> to spawn multiple GNU B<parallel>s:
714 seq -w 0 9999999 | \
715 parallel -j10 -q -I,, --pipe parallel -j0 touch pict{}.jpg
717 If B<-j0> normally spawns 252 jobs, then the above will try to spawn
718 2520 jobs. On a normal GNU/Linux system you can spawn 32000 jobs using
719 this technique with no problems. To raise the 32000 jobs limit raise
720 /proc/sys/kernel/pid_max to 4194303.
722 If you do not need GNU B<parallel> to have control over each job (so
723 no need for B<--retries> or B<--joblog> or similar), then it can be
724 even faster if you can generate the command lines and pipe those to a
725 shell. So if you can do this:
727 mygenerator | sh
729 Then that can be parallelized like this:
731 mygenerator | parallel --pipe --block 10M sh
733 E.g.
  mygenerator() {
    seq 10000000 | perl -pe 'print "echo This is fast job number "';
  }
  mygenerator | parallel --pipe --block 10M sh
The overhead is 100000 times smaller, namely around 100 nanoseconds
per job.
744 =head2 EXAMPLE: Using shell variables
746 When using shell variables you need to quote them correctly as they
747 may otherwise be interpreted by the shell.
749 Notice the difference between:
751 ARR=("My brother's 12\" records are worth <\$\$\$>"'!' Foo Bar)
752 parallel echo ::: ${ARR[@]} # This is probably not what you want
754 and:
756 ARR=("My brother's 12\" records are worth <\$\$\$>"'!' Foo Bar)
757 parallel echo ::: "${ARR[@]}"
759 When using variables in the actual command that contains special
760 characters (e.g. space) you can quote them using B<'"$VAR"'> or using
761 "'s and B<-q>:
763 VAR="My brother's 12\" records are worth <\$\$\$>"
764 parallel -q echo "$VAR" ::: '!'
765 export VAR
766 parallel echo '"$VAR"' ::: '!'
768 If B<$VAR> does not contain ' then B<"'$VAR'"> will also work
769 (and does not need B<export>):
771 VAR="My 12\" records are worth <\$\$\$>"
772 parallel echo "'$VAR'" ::: '!'
774 If you use them in a function you just quote as you normally would do:
776 VAR="My brother's 12\" records are worth <\$\$\$>"
777 export VAR
778 myfunc() { echo "$VAR" "$1"; }
779 export -f myfunc
780 parallel myfunc ::: '!'
783 =head2 EXAMPLE: Group output lines
785 When running jobs that output data, you often do not want the output
786 of multiple jobs to run together. GNU B<parallel> defaults to grouping
787 the output of each job, so the output is printed when the job
788 finishes. If you want full lines to be printed while the job is
789 running you can use B<--line-buffer>. If you want output to be
790 printed as soon as possible you can use B<-u>.
792 Compare the output of:
794 parallel wget --progress=dot --limit-rate=100k \
795 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
796 ::: {12..16}
797 parallel --line-buffer wget --progress=dot --limit-rate=100k \
798 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
799 ::: {12..16}
800 parallel --latest-line wget --progress=dot --limit-rate=100k \
801 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
802 ::: {12..16}
803 parallel -u wget --progress=dot --limit-rate=100k \
804 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
805 ::: {12..16}
807 =head2 EXAMPLE: Tag output lines
809 GNU B<parallel> groups the output lines, but it can be hard to see
810 where the different jobs begin. B<--tag> prepends the argument to make
811 that more visible:
813 parallel --tag wget --limit-rate=100k \
814 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
815 ::: {12..16}
817 B<--tag> works with B<--line-buffer> but not with B<-u>:
819 parallel --tag --line-buffer wget --limit-rate=100k \
820 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
821 ::: {12..16}
823 Check the uptime of the servers in I<~/.parallel/sshloginfile>:
825 parallel --tag -S .. --nonall uptime
828 =head2 EXAMPLE: Colorize output
830 Give each job a new color. Most terminals support ANSI colors with the
831 escape code "\033[30;3Xm" where 0 <= X <= 7:
833 seq 10 | \
834 parallel --tagstring '\033[30;3{=$_=++$::color%8=}m' seq {}
835 parallel --rpl '{color} $_="\033[30;3".(++$::color%8)."m"' \
836 --tagstring {color} seq {} ::: {1..10}
838 To get rid of the initial \t (which comes from B<--tagstring>):
840 ... | perl -pe 's/\t//'
843 =head2 EXAMPLE: Keep order of output same as order of input
845 Normally the output of a job will be printed as soon as it
846 completes. Sometimes you want the order of the output to remain the
847 same as the order of the input. This is often important, if the output
848 is used as input for another system. B<-k> will make sure the order of
849 output will be in the same order as input even if later jobs end
850 before earlier jobs.
852 Append a string to every line in a text file:
854 cat textfile | parallel -k echo {} append_string
856 If you remove B<-k> some of the lines may come out in the wrong order.
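A minimal way to see the difference:

  parallel -k 'sleep {}; echo {}' ::: 3 2 1  # prints 3 2 1 (input order)
  parallel 'sleep {}; echo {}' ::: 3 2 1     # prints 1 2 3 (finish order)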
858 Another example is B<traceroute>:
860 parallel traceroute ::: qubes-os.org debian.org freenetproject.org
862 will give traceroute of qubes-os.org, debian.org and
863 freenetproject.org, but it will be sorted according to which job
864 completed first.
866 To keep the order the same as input run:
868 parallel -k traceroute ::: qubes-os.org debian.org freenetproject.org
870 This will make sure the traceroute to qubes-os.org will be printed
871 first.
873 A bit more complex example is downloading a huge file in chunks in
874 parallel: Some internet connections will deliver more data if you
875 download files in parallel. For downloading files in parallel see:
876 "EXAMPLE: Download 10 images for each of the past 30 days". But if you
877 are downloading a big file you can download the file in chunks in
878 parallel.
880 To download byte 10000000-19999999 you can use B<curl>:
882 curl -r 10000000-19999999 https://example.com/the/big/file >file.part
884 To download a 1 GB file we need 100 10MB chunks downloaded and
885 combined in the correct order.
887 seq 0 99 | parallel -k curl -r \
888 {}0000000-{}9999999 https://example.com/the/big/file > file
891 =head2 EXAMPLE: Parallel grep
893 B<grep -r> greps recursively through directories. GNU B<parallel> can
894 often speed this up.
896 find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}
This will run 1.5 jobs per CPU and give 1000 arguments to B<grep>.
900 There are situations where the above will be slower than B<grep -r>:
902 =over 2
904 =item *
906 If data is already in RAM. The overhead of starting jobs and buffering
907 output may outweigh the benefit of running in parallel.
909 =item *
911 If the files are big. If a file cannot be read in a single seek, the
912 disk may start thrashing.
914 =back
916 The speedup is caused by two factors:
918 =over 2
920 =item *
922 On rotating harddisks small files often require a seek for each
923 file. By searching for more files in parallel, the arm may pass
924 another wanted file on its way.
926 =item *
NVMe drives often perform better by having multiple commands running
in parallel.
931 =back
934 =head2 EXAMPLE: Grepping n lines for m regular expressions.
936 The simplest solution to grep a big file for a lot of regexps is:
938 grep -f regexps.txt bigfile
940 Or if the regexps are fixed strings:
942 grep -F -f regexps.txt bigfile
944 There are 3 limiting factors: CPU, RAM, and disk I/O.
946 RAM is easy to measure: If the B<grep> process takes up most of your
947 free memory (e.g. when running B<top>), then RAM is a limiting factor.
CPU is also easy to measure: If the B<grep> takes >90% CPU in B<top>,
then the CPU is a limiting factor, and parallelization will speed this
up.
953 It is harder to see if disk I/O is the limiting factor, and depending
954 on the disk system it may be faster or slower to parallelize. The only
955 way to know for certain is to test and measure.
958 =head3 Limiting factor: RAM
960 The normal B<grep -f regexps.txt bigfile> works no matter the size of
961 bigfile, but if regexps.txt is so big it cannot fit into memory, then
962 you need to split this.
964 B<grep -F> takes around 100 bytes of RAM and B<grep> takes about 500
965 bytes of RAM per 1 byte of regexp. So if regexps.txt is 1% of your
966 RAM, then it may be too big.
If you can convert your regexps into fixed strings do that. E.g. if
the lines you are looking for in bigfile all look like:
971 ID1 foo bar baz Identifier1 quux
972 fubar ID2 foo bar baz Identifier2
974 then your regexps.txt can be converted from:
976 ID1.*Identifier1
977 ID2.*Identifier2
979 into:
981 ID1 foo bar baz Identifier1
982 ID2 foo bar baz Identifier2
984 This way you can use B<grep -F> which takes around 80% less memory and
985 is much faster.
987 If it still does not fit in memory you can do this:
989 parallel --pipe-part -a regexps.txt --block 1M grep -F -f - -n bigfile | \
990 sort -un | perl -pe 's/^\d+://'
992 The 1M should be your free memory divided by the number of CPU threads and
993 divided by 200 for B<grep -F> and by 1000 for normal B<grep>. On
994 GNU/Linux you can do:
996 free=$(awk '/^((Swap)?Cached|MemFree|Buffers):/ { sum += $2 }
997 END { print sum }' /proc/meminfo)
998 percpu=$((free / 200 / $(parallel --number-of-threads)))k
1000 parallel --pipe-part -a regexps.txt --block $percpu --compress \
1001 grep -F -f - -n bigfile | \
1002 sort -un | perl -pe 's/^\d+://'
1004 If you can live with duplicated lines and wrong order, it is faster to do:
1006 parallel --pipe-part -a regexps.txt --block $percpu --compress \
1007 grep -F -f - bigfile
1009 =head3 Limiting factor: CPU
1011 If the CPU is the limiting factor parallelization should be done on
1012 the regexps:
1014 cat regexps.txt | parallel --pipe -L1000 --round-robin --compress \
1015 grep -f - -n bigfile | \
1016 sort -un | perl -pe 's/^\d+://'
1018 The command will start one B<grep> per CPU and read I<bigfile> one
1019 time per CPU, but as that is done in parallel, all reads except the
1020 first will be cached in RAM. Depending on the size of I<regexps.txt> it
1021 may be faster to use B<--block 10m> instead of B<-L1000>.
1023 Some storage systems perform better when reading multiple chunks in
1024 parallel. This is true for some RAID systems and for some network file
1025 systems. To parallelize the reading of I<bigfile>:
1027 parallel --pipe-part --block 100M -a bigfile -k --compress \
1028 grep -f regexps.txt
1030 This will split I<bigfile> into 100MB chunks and run B<grep> on each of
1031 these chunks. To parallelize both reading of I<bigfile> and I<regexps.txt>
1032 combine the two using B<--cat>:
1034 parallel --pipe-part --block 100M -a bigfile --cat cat regexps.txt \
1035 \| parallel --pipe -L1000 --round-robin grep -f - {}
1037 If a line matches multiple regexps, the line may be duplicated.
1039 =head3 Bigger problem
1041 If the problem is too big to be solved by this, you are probably ready
1042 for Lucene.
1045 =head2 EXAMPLE: Using remote computers
To run commands on a remote computer SSH needs to be set up and you
must be able to log in without entering a password (the commands
B<ssh-copy-id>, B<ssh-agent>, and B<sshpass> may help you do that).
1051 If you need to login to a whole cluster, you typically do not want to
1052 accept the host key for every host. You want to accept them the first
1053 time and be warned if they are ever changed. To do that:
1055 # Add the servers to the sshloginfile
1056 (echo servera; echo serverb) > .parallel/my_cluster
1057 # Make sure .ssh/config exist
1058 touch .ssh/config
1059 cp .ssh/config .ssh/config.backup
1060 # Disable StrictHostKeyChecking temporarily
1061 (echo 'Host *'; echo StrictHostKeyChecking no) >> .ssh/config
1062 parallel --slf my_cluster --nonall true
1063 # Remove the disabling of StrictHostKeyChecking
1064 mv .ssh/config.backup .ssh/config
The servers in B<.parallel/my_cluster> are now added to B<.ssh/known_hosts>.
1068 To run B<echo> on B<server.example.com>:
1070 seq 10 | parallel --sshlogin server.example.com echo
1072 To run commands on more than one remote computer run:
  seq 10 | parallel --sshlogin s1.example.com,s2.example.net echo

Or:
1078 seq 10 | parallel --sshlogin server.example.com \
1079 --sshlogin server2.example.net echo
1081 If the login username is I<foo> on I<server2.example.net> use:
1083 seq 10 | parallel --sshlogin server.example.com \
1084 --sshlogin foo@server2.example.net echo
1086 If your list of hosts is I<server1-88.example.net> with login I<foo>:
1088 seq 10 | parallel -Sfoo@server{1..88}.example.net echo
1090 To distribute the commands to a list of computers, make a file
1091 I<mycomputers> with all the computers:
1093 server.example.com
1094 foo@server2.example.com
1095 server3.example.com
1097 Then run:
1099 seq 10 | parallel --sshloginfile mycomputers echo
1101 To include the local computer add the special sshlogin ':' to the list:
  server.example.com
  foo@server2.example.com
  server3.example.com
  :
1108 GNU B<parallel> will try to determine the number of CPUs on each of
1109 the remote computers, and run one job per CPU - even if the remote
1110 computers do not have the same number of CPUs.
If the number of CPUs on the remote computers is not identified
correctly, the number of CPUs can be added in front. Here the computer
has 8 CPUs:
1116 seq 10 | parallel --sshlogin 8/server.example.com echo
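To check how many CPU threads GNU B<parallel> detects on a remote
computer (a sketch assuming GNU B<parallel> is also installed there):

  parallel -S server.example.com --nonall parallel --number-of-threads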
1119 =head2 EXAMPLE: Transferring of files
1121 To recompress gzipped files with B<bzip2> using a remote computer run:
1123 find logs/ -name '*.gz' | \
1124 parallel --sshlogin server.example.com \
1125 --transfer "zcat {} | bzip2 -9 >{.}.bz2"
1127 This will list the .gz-files in the I<logs> directory and all
1128 directories below. Then it will transfer the files to
1129 I<server.example.com> to the corresponding directory in
1130 I<$HOME/logs>. On I<server.example.com> the file will be recompressed
1131 using B<zcat> and B<bzip2> resulting in the corresponding file with
1132 I<.gz> replaced with I<.bz2>.
1134 If you want the resulting bz2-file to be transferred back to the local
1135 computer add I<--return {.}.bz2>:
1137 find logs/ -name '*.gz' | \
1138 parallel --sshlogin server.example.com \
1139 --transfer --return {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"
1141 After the recompressing is done the I<.bz2>-file is transferred back to
1142 the local computer and put next to the original I<.gz>-file.
1144 If you want to delete the transferred files on the remote computer add
1145 I<--cleanup>. This will remove both the file transferred to the remote
1146 computer and the files transferred from the remote computer:
1148 find logs/ -name '*.gz' | \
1149 parallel --sshlogin server.example.com \
1150 --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2"
If you want to run on several computers add the computers to
I<--sshlogin> either using ',' or multiple I<--sshlogin>:
1155 find logs/ -name '*.gz' | \
1156 parallel --sshlogin server.example.com,server2.example.com \
1157 --sshlogin server3.example.com \
1158 --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2"
1160 You can add the local computer using I<--sshlogin :>. This will disable the
1161 removing and transferring for the local computer only:
1163 find logs/ -name '*.gz' | \
1164 parallel --sshlogin server.example.com,server2.example.com \
1165 --sshlogin server3.example.com \
1166 --sshlogin : \
1167 --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2"
1169 Often I<--transfer>, I<--return> and I<--cleanup> are used together. They can be
1170 shortened to I<--trc>:
1172 find logs/ -name '*.gz' | \
1173 parallel --sshlogin server.example.com,server2.example.com \
1174 --sshlogin server3.example.com \
1175 --sshlogin : \
1176 --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"
1178 With the file I<mycomputers> containing the list of computers it becomes:
1180 find logs/ -name '*.gz' | parallel --sshloginfile mycomputers \
1181 --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"
If the file I<~/.parallel/sshloginfile> contains the list of computers
the special shorthand I<-S ..> can be used:
1186 find logs/ -name '*.gz' | parallel -S .. \
1187 --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"
1190 =head2 EXAMPLE: Advanced file transfer
Assume you have files in in/*, want them processed on the server,
and transferred back into /other/dir:
1195 parallel -S server --trc /other/dir/./{/}.out \
1196 cp {/} {/}.out ::: in/./*
1199 =head2 EXAMPLE: Distributing work to local and remote computers
1201 Convert *.mp3 to *.ogg running one process per CPU on local computer
1202 and server2:
1204 parallel --trc {.}.ogg -S server2,: \
1205 'mpg321 -w - {} | oggenc -q0 - -o {.}.ogg' ::: *.mp3
1208 =head2 EXAMPLE: Running the same command on remote computers
1210 To run the command B<uptime> on remote computers you can do:
1212 parallel --tag --nonall -S server1,server2 uptime
1214 B<--nonall> reads no arguments. If you have a list of jobs you want
1215 to run on each computer you can do:
1217 parallel --tag --onall -S server1,server2 echo ::: 1 2 3
1219 Remove B<--tag> if you do not want the sshlogin added before the
1220 output.
1222 If you have a lot of hosts use '-j0' to access more hosts in parallel.
1225 =head2 EXAMPLE: Running 'sudo' on remote computers
1227 Put the password into passwordfile then run:
1229 parallel --ssh 'cat passwordfile | ssh' --nonall \
1230 -S user@server1,user@server2 sudo -S ls -l /root
1233 =head2 EXAMPLE: Using remote computers behind NAT wall
1235 If the workers are behind a NAT wall, you need some trickery to get to
1236 them.
1238 If you can B<ssh> to a jumphost, and reach the workers from there,
1239 then the obvious solution would be this, but it B<does not work>:
1241 parallel --ssh 'ssh jumphost ssh' -S host1 echo ::: DOES NOT WORK
It does not work because the command is dequoted by B<ssh> twice,
whereas GNU B<parallel> only expects it to be dequoted once.
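You can see the double evaluation with a hypothetical
I<jumphost>/I<host1> pair:

  # The quotes protect $HOME for one hop only, so it is expanded by
  # the shell on jumphost - not on host1 as intended:
  ssh jumphost ssh host1 echo '$HOME'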
1246 You can use a bash function and have GNU B<parallel> quote the command:
1248 jumpssh() { ssh -A jumphost ssh $(parallel --shellquote ::: "$@"); }
1249 export -f jumpssh
1250 parallel --ssh jumpssh -S host1 echo ::: this works
1252 Or you can instead put this in B<~/.ssh/config>:
1254 Host host1 host2 host3
1255 ProxyCommand ssh jumphost.domain nc -w 1 %h 22
It requires B<nc> (netcat) to be installed on the jumphost. With this
you can simply:
1260 parallel -S host1,host2,host3 echo ::: This does work
1262 =head3 No jumphost, but port forwards
1264 If there is no jumphost but each server has port 22 forwarded from the
1265 firewall (e.g. the firewall's port 22001 = port 22 on host1, 22002 = host2,
1266 22003 = host3) then you can use B<~/.ssh/config>:
1268 Host host1.v
1269 Port 22001
1270 Host host2.v
1271 Port 22002
1272 Host host3.v
1273 Port 22003
1274 Host *.v
1275 Hostname firewall
1277 And then use host{1..3}.v as normal hosts:
1279 parallel -S host1.v,host2.v,host3.v echo ::: a b c
1281 =head3 No jumphost, no port forwards
If ports cannot be forwarded, you need some sort of VPN to traverse
the NAT-wall. TOR is one option for that, as it is very easy to get
working.
You need to install TOR and set up a hidden service. In B<torrc> put:
1289 HiddenServiceDir /var/lib/tor/hidden_service/
1290 HiddenServicePort 22 127.0.0.1:22
1292 Then start TOR: B</etc/init.d/tor restart>
1294 The TOR hostname is now in B</var/lib/tor/hidden_service/hostname> and
1295 is something similar to B<izjafdceobowklhz.onion>. Now you simply
1296 prepend B<torsocks> to B<ssh>:
1298 parallel --ssh 'torsocks ssh' -S izjafdceobowklhz.onion \
1299 -S zfcdaeiojoklbwhz.onion,auclucjzobowklhi.onion echo ::: a b c
1301 If not all hosts are accessible through TOR:
1303 parallel -S 'torsocks ssh izjafdceobowklhz.onion,host2,host3' \
1304 echo ::: a b c
1306 See more B<ssh> tricks on https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Proxies_and_Jump_Hosts
1309 =head2 EXAMPLE: Use sshpass with ssh
1311 If you cannot use passwordless login, you may be able to use B<sshpass>:
  seq 10 | parallel -S user-with-password:MyPassword@server echo

or:
1317 export SSHPASS='MyPa$$w0rd'
1318 seq 10 | parallel -S user-with-password:@server echo
1321 =head2 EXAMPLE: Use outrun instead of ssh
1323 B<outrun> lets you run a command on a remote server. B<outrun> sets up
1324 a connection to access files at the source server, and automatically
1325 transfers files. B<outrun> must be installed on the remote system.
1327 You can use B<outrun> in an sshlogin this way:
  parallel -S 'outrun user@server' command

or:
1333 parallel --ssh outrun -S server command
1336 =head2 EXAMPLE: Slurm cluster
1338 The Slurm Workload Manager is used in many clusters.
1340 Here is a simple example of using GNU B<parallel> to call B<srun>:
1342 #!/bin/bash
1344 #SBATCH --time 00:02:00
1345 #SBATCH --ntasks=4
1346 #SBATCH --job-name GnuParallelDemo
1347 #SBATCH --output gnuparallel.out
1349 module purge
1350 module load gnu_parallel
1352 my_parallel="parallel --delay .2 -j $SLURM_NTASKS"
1353 my_srun="srun --export=all --exclusive -n1"
1354 my_srun="$my_srun --cpus-per-task=1 --cpu-bind=cores"
1355 $my_parallel "$my_srun" echo This is job {} ::: {1..20}
1358 =head2 EXAMPLE: Parallelizing rsync
1360 B<rsync> is a great tool, but sometimes it will not fill up the
1361 available bandwidth. Running multiple B<rsync> in parallel can fix
1362 this.
1364 cd src-dir
1365 find . -type f |
1366 parallel -j10 -X rsync -zR -Ha ./{} fooserver:/dest-dir/
1368 Adjust B<-j10> until you find the optimal number.
1370 B<rsync -R> will create the needed subdirectories, so all files are
1371 not put into a single dir. The B<./> is needed so the resulting command
1372 looks similar to:
1374 rsync -zR ././sub/dir/file fooserver:/dest-dir/
1376 The B</./> is what B<rsync -R> works on.
If you are unable to push data, but need to pull them, and the files
are called digits.png (e.g. 000000.png), you might be able to do:
1381 seq -w 0 99 | parallel rsync -Havessh fooserver:src/*{}.png destdir/
1384 =head2 EXAMPLE: Use multiple inputs in one command
1386 Copy files like foo.es.ext to foo.ext:
1388 ls *.es.* | perl -pe 'print; s/\.es//' | parallel -N2 cp {1} {2}
1390 The perl command spits out 2 lines for each input. GNU B<parallel>
1391 takes 2 inputs (using B<-N2>) and replaces {1} and {2} with the inputs.
1393 Count in binary:
1395 parallel -k echo ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1
1397 Print the number on the opposing sides of a six sided die:
1399 parallel --link -a <(seq 6) -a <(seq 6 -1 1) echo
1400 parallel --link echo :::: <(seq 6) <(seq 6 -1 1)
Convert files from all subdirs to PNG-files with consecutive numbers
(useful for making input PNGs for B<ffmpeg>):
1405 parallel --link -a <(find . -type f | sort) \
1406 -a <(seq $(find . -type f|wc -l)) convert {1} {2}.png
1408 Alternative version:
1410 find . -type f | sort | parallel convert {} {#}.png
1413 =head2 EXAMPLE: Use a table as input
1415 Content of table_file.tsv:
1417 foo<TAB>bar
1418 baz <TAB> quux
1420 To run:
1422 cmd -o bar -i foo
1423 cmd -o quux -i baz
1425 you can run:
1427 parallel -a table_file.tsv --colsep '\t' cmd -o {2} -i {1}
1429 Note: The default for GNU B<parallel> is to remove the spaces around
1430 the columns. To keep the spaces:
1432 parallel -a table_file.tsv --trim n --colsep '\t' cmd -o {2} -i {1}
1435 =head2 EXAMPLE: Output to database
1437 GNU B<parallel> can output to a database table and a CSV-file:
1439 dburl=csv:///%2Ftmp%2Fmydir
1440 dbtableurl=$dburl/mytable.csv
1441 parallel --sqlandworker $dbtableurl seq ::: {1..10}
1443 It is rather slow and takes up a lot of CPU time because GNU
1444 B<parallel> parses the whole CSV file for each update.
A better approach is to use an SQLite database and then convert that to CSV:
1448 dburl=sqlite3:///%2Ftmp%2Fmy.sqlite
1449 dbtableurl=$dburl/mytable
1450 parallel --sqlandworker $dbtableurl seq ::: {1..10}
1451 sql $dburl '.headers on' '.mode csv' 'SELECT * FROM mytable;'
1453 This takes around a second per job.
1455 If you have access to a real database system, such as PostgreSQL, it
1456 is even faster:
1458 dburl=pg://user:pass@host/mydb
1459 dbtableurl=$dburl/mytable
1460 parallel --sqlandworker $dbtableurl seq ::: {1..10}
1461 sql $dburl \
1462 "COPY (SELECT * FROM mytable) TO stdout DELIMITER ',' CSV HEADER;"
1464 Or MySQL:
1466 dburl=mysql://user:pass@host/mydb
1467 dbtableurl=$dburl/mytable
1468 parallel --sqlandworker $dbtableurl seq ::: {1..10}
1469 sql -p -B $dburl "SELECT * FROM mytable;" > mytable.tsv
1470 perl -pe 's/"/""/g; s/\t/","/g; s/^/"/; s/$/"/;
1471 %s=("\\" => "\\", "t" => "\t", "n" => "\n");
1472 s/\\([\\tn])/$s{$1}/g;' mytable.tsv
1475 =head2 EXAMPLE: Output to CSV-file for R
1477 If you have no need for the advanced job distribution control that a
1478 database provides, but you simply want output into a CSV file that you
1479 can read into R or LibreCalc, then you can use B<--results>:
1481 parallel --results my.csv seq ::: 10 20 30
1483 > mydf <- read.csv("my.csv");
1484 > print(mydf[2,])
1485 > write(as.character(mydf[2,c("Stdout")]),'')
1488 =head2 EXAMPLE: Use XML as input
1490 The show Aflyttet on Radio 24syv publishes an RSS feed with their audio
1491 podcasts on: http://arkiv.radio24syv.dk/audiopodcast/channel/4466232
1493 Using B<xpath> you can extract the URLs for 2019 and download them
1494 using GNU B<parallel>:
1496 wget -O - http://arkiv.radio24syv.dk/audiopodcast/channel/4466232 | \
1497 xpath -e "//pubDate[contains(text(),'2019')]/../enclosure/@url" | \
1498 parallel -u wget '{= s/ url="//; s/"//; =}'
1501 =head2 EXAMPLE: Run the same command 10 times
1503 If you want to run the same command with the same arguments 10 times
1504 in parallel you can do:
1506 seq 10 | parallel -n0 my_command my_args
1509 =head2 EXAMPLE: Working as cat | sh. Resource inexpensive jobs and evaluation
GNU B<parallel> can work similarly to B<cat | sh>.
1513 A resource inexpensive job is a job that takes very little CPU, disk
1514 I/O and network I/O. Ping is an example of a resource inexpensive
1515 job. wget is too - if the webpages are small.
1517 The content of the file jobs_to_run:
1519 ping -c 1 10.0.0.1
1520 wget http://example.com/status.cgi?ip=10.0.0.1
1521 ping -c 1 10.0.0.2
  wget http://example.com/status.cgi?ip=10.0.0.2
  ...
1524 ping -c 1 10.0.0.255
1525 wget http://example.com/status.cgi?ip=10.0.0.255
1527 To run 100 processes simultaneously do:
1529 parallel -j 100 < jobs_to_run
1531 As there is not a I<command> the jobs will be evaluated by the shell.
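A minimal sketch: each line read from stdin is run as a shell command:

  (echo 'echo job 1'; echo 'echo job 2') | parallel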
1534 =head2 EXAMPLE: Call program with FASTA sequence
1536 FASTA files have the format:
1538 >Sequence name1
1539 sequence
1540 sequence continued
1541 >Sequence name2
1542 sequence
1543 sequence continued
1544 more sequence
1546 To call B<myprog> with the sequence as argument run:
1548 cat file.fasta |
1549 parallel --pipe -N1 --recstart '>' --rrs \
1550 'read a; echo Name: "$a"; myprog $(tr -d "\n")'
1553 =head2 EXAMPLE: Call program with interleaved FASTQ records
1555 FASTQ files have the format:
1557 @M10991:61:000000000-A7EML:1:1101:14011:1001 1:N:0:28
  CTCCTAGGTCGGCATGATGGGGGAAGGAGAGCATGGGAAGAAATGAGAGAGTAGCAAGG
  +
1560 #8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF
1562 Interleaved FASTQ starts with a line like these:
1564 @HWUSI-EAS100R:6:73:941:1973#0/1
1565 @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
1566 @EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1
where '/1' and ' 1:' determine this is read 1.
1570 This will cut big.fq into one chunk per CPU thread and pass it on
1571 stdin (standard input) to the program fastq-reader:
1573 parallel --pipe-part -a big.fq --block -1 --regexp \
1574 --recend '\n' --recstart '@.*(/1| 1:.*)\n[A-Za-z\n\.~]' \
1575 fastq-reader
1578 =head2 EXAMPLE: Processing a big file using more CPUs
1580 To process a big file or some output you can use B<--pipe> to split up
1581 the data into blocks and pipe the blocks into the processing program.
1583 If the program is B<gzip -9> you can do:
1585 cat bigfile | parallel --pipe --recend '' -k gzip -9 > bigfile.gz
1587 This will split B<bigfile> into blocks of 1 MB and pass that to B<gzip
1588 -9> in parallel. One B<gzip> will be run per CPU. The output of B<gzip
-9> will be kept in order and saved to B<bigfile.gz>.
1591 B<gzip> works fine if the output is appended, but some processing does
1592 not work like that - for example sorting. For this GNU B<parallel> can
1593 put the output of each command into a file. This will sort a big file
1594 in parallel:
1596 cat bigfile | parallel --pipe --files sort |\
1597 parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort
1599 Here B<bigfile> is split into blocks of around 1MB, each block ending
1600 in '\n' (which is the default for B<--recend>). Each block is passed
1601 to B<sort> and the output from B<sort> is saved into files. These
1602 files are passed to the second B<parallel> that runs B<sort -m> on the
1603 files before it removes the files. The output is saved to
1604 B<bigfile.sort>.
1606 GNU B<parallel>'s B<--pipe> maxes out at around 100 MB/s because every
1607 byte has to be copied through GNU B<parallel>. But if B<bigfile> is a
1608 real (seekable) file GNU B<parallel> can by-pass the copying and send
1609 the parts directly to the program:
1611 parallel --pipe-part --block 100m -a bigfile --files sort |\
1612 parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort
1615 =head2 EXAMPLE: Grouping input lines
1617 When processing with B<--pipe> you may have lines grouped by a
1618 value. Here is I<my.csv>:
1620 Transaction Customer Item
1621 1 a 53
1622 2 b 65
1623 3 b 82
1624 4 c 96
1625 5 c 67
1626 6 c 13
1627 7 d 90
1628 8 d 43
1629 9 d 91
1630 10 d 84
1631 11 e 72
1632 12 e 102
1633 13 e 63
1634 14 e 56
1635 15 e 74
1637 Let us assume you want GNU B<parallel> to process each customer. In
1638 other words: You want all the transactions for a single customer to be
1639 treated as a single record.
1641 To do this we preprocess the data with a program that inserts a record
1642 separator before each customer (column 2 = $F[1]). Here we first make
1643 a 50 character random string, which we then use as the separator:
1645 sep=`perl -e 'print map { ("a".."z","A".."Z")[rand(52)] } (1..50);'`
1646 cat my.csv | \
1647 perl -ape '$F[1] ne $l and print "'$sep'"; $l = $F[1]' | \
1648 parallel --recend $sep --rrs --pipe -N1 wc
1650 If your program can process multiple customers replace B<-N1> with a
1651 reasonable B<--blocksize>.
1654 =head2 EXAMPLE: Running more than 250 jobs workaround
1656 If you need to run a massive amount of jobs in parallel, then you will
1657 likely hit the filehandle limit which is often around 250 jobs. If you
1658 are super user you can raise the limit in /etc/security/limits.conf
1659 but you can also use this workaround. The filehandle limit is per
1660 process. That means that if you just spawn more GNU B<parallel>s then
1661 each of them can run 250 jobs. This will spawn up to 2500 jobs:
1663 cat myinput |\
1664 parallel --pipe -N 50 --round-robin -j50 parallel -j50 your_prg
1666 This will spawn up to 62500 jobs (use with caution - you need 64 GB
1667 RAM to do this, and you may need to increase /proc/sys/kernel/pid_max):
1669 cat myinput |\
1670 parallel --pipe -N 250 --round-robin -j250 parallel -j250 your_prg
1673 =head2 EXAMPLE: Working as mutex and counting semaphore
1675 The command B<sem> is an alias for B<parallel --semaphore>.
1677 A counting semaphore will allow a given number of jobs to be started
in the background. When that many jobs are running in the
1679 background, GNU B<sem> will wait for one of these to complete before
1680 starting another command. B<sem --wait> will wait for all jobs to
1681 complete.
1683 Run 10 jobs concurrently in the background:
1685 for i in *.log ; do
1686 echo $i
1687 sem -j10 gzip $i ";" echo done
1688 done
1689 sem --wait
A mutex is a counting semaphore allowing only one job to run. This
will edit the file I<myfile> and prepend lines with the numbers 1 to 3
to the file.
1695 seq 3 | parallel sem sed -i -e '1i{}' myfile
1697 As I<myfile> can be very big it is important only one process edits
1698 the file at the same time.
1700 Name the semaphore to have multiple different semaphores active at the
1701 same time:
1703 seq 3 | parallel sem --id mymutex sed -i -e '1i{}' myfile
1706 =head2 EXAMPLE: Mutex for a script
1708 Assume a script is called from cron or from a web service, but only
1709 one instance can be run at a time. With B<sem> and B<--shebang-wrap>
1710 the script can be made to wait for other instances to finish. Here in
1711 B<bash>:
1713 #!/usr/bin/sem --shebang-wrap -u --id $0 --fg /bin/bash
1715 echo This will run
1716 sleep 5
1717 echo exclusively
1719 Here B<perl>:
1721 #!/usr/bin/sem --shebang-wrap -u --id $0 --fg /usr/bin/perl
1723 print "This will run ";
1724 sleep 5;
1725 print "exclusively\n";
1727 Here B<python>:
  #!/usr/local/bin/sem --shebang-wrap -u --id $0 --fg /usr/bin/python3

  import time
  print("This will run ")
  time.sleep(5)
  print("exclusively")
1737 =head2 EXAMPLE: Start editor with file names from stdin (standard input)
1739 You can use GNU B<parallel> to start interactive programs like emacs or vi:
1741 cat filelist | parallel --tty -X emacs
1742 cat filelist | parallel --tty -X vi
1744 If there are more files than will fit on a single command line, the
1745 editor will be started again with the remaining files.
1748 =head2 EXAMPLE: Running sudo
1750 B<sudo> requires a password to run a command as root. It caches the
1751 access, so you only need to enter the password again if you have not
1752 used B<sudo> for a while.
1754 The command:
1756 parallel sudo echo ::: This is a bad idea
1758 is no good, as you would be prompted for the sudo password for each of
1759 the jobs. Instead do:
1761 sudo parallel echo ::: This is a good idea
1763 This way you only have to enter the sudo password once.
1765 =head2 EXAMPLE: Run ping in parallel
1767 B<ping> prints out statistics when killed with CTRL-C.
1769 Unfortunately, CTRL-C will also normally kill GNU B<parallel>.
1771 But by using B<--open-tty> and ignoring SIGINT you can get the wanted effect:
1773 parallel -j0 --open-tty --lb --tag ping '{= $SIG{INT}=sub {} =}' \
1774 ::: 1.1.1.1 8.8.8.8 9.9.9.9 21.21.21.21 80.80.80.80 88.88.88.88
1776 B<--open-tty> will make the B<ping>s receive SIGINT (from CTRL-C).
1777 CTRL-C will not kill GNU B<parallel>, so that will only exit after
1778 B<ping> is done.
1781 =head2 EXAMPLE: GNU Parallel as queue system/batch manager
1783 GNU B<parallel> can work as a simple job queue system or batch manager.
1784 The idea is to put the jobs into a file and have GNU B<parallel> read
1785 from that continuously. As GNU B<parallel> will stop at end of file we
1786 use B<tail> to continue reading:
1788 true >jobqueue; tail -n+0 -f jobqueue | parallel
1790 To submit your jobs to the queue:
1792 echo my_command my_arg >> jobqueue
1794 You can of course use B<-S> to distribute the jobs to remote
1795 computers:
1797 true >jobqueue; tail -n+0 -f jobqueue | parallel -S ..
Output will only be printed when the next input is read after a job
has finished: so you need to submit a job after the first has finished
to see the output from the first job.
1803 If you keep this running for a long time, jobqueue will grow. A way of
1804 removing the jobs already run is by making GNU B<parallel> stop when
1805 it hits a special value and then restart. To use B<--eof> to make GNU
1806 B<parallel> exit, B<tail> also needs to be forced to exit:
1808 true >jobqueue;
1809 while true; do
1810 tail -n+0 -f jobqueue |
1811 (parallel -E StOpHeRe -S ..; echo GNU Parallel is now done;
1812 perl -e 'while(<>){/StOpHeRe/ and last};print <>' jobqueue > j2;
1813 (seq 1000 >> jobqueue &);
1814 echo Done appending dummy data forcing tail to exit)
1815 echo tail exited;
1816 mv j2 jobqueue
1817 done
1819 In some cases you can run on more CPUs and computers during the night:
1821 # Day time
1822 echo 50% > jobfile
1823 cp day_server_list ~/.parallel/sshloginfile
1824 # Night time
1825 echo 100% > jobfile
1826 cp night_server_list ~/.parallel/sshloginfile
1827 tail -n+0 -f jobqueue | parallel --jobs jobfile -S ..
1829 GNU B<parallel> discovers if B<jobfile> or B<~/.parallel/sshloginfile>
1830 changes.
1833 =head2 EXAMPLE: GNU Parallel as dir processor
If you have a dir in which users drop files that need to be processed
you can do this on GNU/Linux (if you know what B<inotifywait> is
called on other platforms, file a bug report):
1839 inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |\
1840 parallel -u echo
1842 This will run the command B<echo> on each file put into B<my_dir> or
1843 subdirs of B<my_dir>.
1845 You can of course use B<-S> to distribute the jobs to remote
1846 computers:
1848 inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |\
1849 parallel -S .. -u echo
1851 If the files to be processed are in a tar file then unpacking one file
1852 and processing it immediately may be faster than first unpacking all
1853 files. Set up the dir processor as above and unpack into the dir.
1855 Using GNU B<parallel> as dir processor has the same limitations as
1856 using GNU B<parallel> as queue system/batch manager.
1859 =head2 EXAMPLE: Locate the missing package
1861 If you have downloaded source and tried compiling it, you may have seen:
1863 $ ./configure
1864 [...]
1865 checking for something.h... no
1866 configure: error: "libsomething not found"
1868 Often it is not obvious which package you should install to get that
1869 file. Debian has `apt-file` to search for a file. `tracefile` from
1870 https://codeberg.org/tange/tangetools can tell which files a program
1871 tried to access. In this case we are interested in one of the last
1872 files:
1874 $ tracefile -un ./configure | tail | parallel -j0 apt-file search
1877 =head1 AUTHOR
1879 When using GNU B<parallel> for a publication please cite:
1881 O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login:
1882 The USENIX Magazine, February 2011:42-47.
1884 This helps funding further development; and it won't cost you a cent.
1885 If you pay 10000 EUR you should feel free to use GNU Parallel without citing.
1887 Copyright (C) 2007-10-18 Ole Tange, http://ole.tange.dk
1889 Copyright (C) 2008-2010 Ole Tange, http://ole.tange.dk
1891 Copyright (C) 2010-2023 Ole Tange, http://ole.tange.dk and Free
1892 Software Foundation, Inc.
Parts of the manual concerning B<xargs> compatibility are inspired by
the manual of B<xargs> from GNU findutils 4.4.2.
1898 =head1 LICENSE
1900 This program is free software; you can redistribute it and/or modify
1901 it under the terms of the GNU General Public License as published by
1902 the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.
1905 This program is distributed in the hope that it will be useful,
1906 but WITHOUT ANY WARRANTY; without even the implied warranty of
1907 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
1908 GNU General Public License for more details.
1910 You should have received a copy of the GNU General Public License
1911 along with this program. If not, see <https://www.gnu.org/licenses/>.
1913 =head2 Documentation license I
1915 Permission is granted to copy, distribute and/or modify this
1916 documentation under the terms of the GNU Free Documentation License,
1917 Version 1.3 or any later version published by the Free Software
1918 Foundation; with no Invariant Sections, with no Front-Cover Texts, and
1919 with no Back-Cover Texts. A copy of the license is included in the
1920 file LICENSES/GFDL-1.3-or-later.txt.
1922 =head2 Documentation license II
1924 You are free:
1926 =over 9
1928 =item B<to Share>
1930 to copy, distribute and transmit the work
1932 =item B<to Remix>
1934 to adapt the work
1936 =back
1938 Under the following conditions:
1940 =over 9
1942 =item B<Attribution>
1944 You must attribute the work in the manner specified by the author or
1945 licensor (but not in any way that suggests that they endorse you or
1946 your use of the work).
1948 =item B<Share Alike>
1950 If you alter, transform, or build upon this work, you may distribute
1951 the resulting work only under the same, similar or a compatible
1952 license.
1954 =back
1956 With the understanding that:
1958 =over 9
1960 =item B<Waiver>
1962 Any of the above conditions can be waived if you get permission from
1963 the copyright holder.
1965 =item B<Public Domain>
1967 Where the work or any of its elements is in the public domain under
1968 applicable law, that status is in no way affected by the license.
1970 =item B<Other Rights>
1972 In no way are any of the following rights affected by the license:
1974 =over 2
1976 =item *
1978 Your fair dealing or fair use rights, or other applicable
1979 copyright exceptions and limitations;
1981 =item *
1983 The author's moral rights;
1985 =item *
1987 Rights other persons may have either in the work itself or in
1988 how the work is used, such as publicity or privacy rights.
1990 =back
1992 =back
1994 =over 9
1996 =item B<Notice>
1998 For any reuse or distribution, you must make clear to others the
1999 license terms of this work.
2001 =back
A copy of the full license is included in the file
LICENCES/CC-BY-SA-4.0.txt.
2007 =head1 SEE ALSO
2009 B<parallel>(1), B<parallel_tutorial>(7), B<env_parallel>(1),
2010 B<parset>(1), B<parsort>(1), B<parallel_alternatives>(7),
2011 B<parallel_design>(7), B<niceload>(1), B<sql>(1), B<ssh>(1),
2012 B<ssh-agent>(1), B<sshpass>(1), B<ssh-copy-id>(1), B<rsync>(1)
2014 =cut