#!/usr/bin/perl -w

# SPDX-FileCopyrightText: 2021-2024 Ole Tange, http://ole.tange.dk and Free Software Foundation, Inc.
# SPDX-License-Identifier: GFDL-1.3-or-later
# SPDX-License-Identifier: CC-BY-SA-4.0

=encoding utf8

=head1 GNU PARALLEL EXAMPLES

=head2 EXAMPLE: Working as xargs -n1. Argument appending

GNU B<parallel> can work similarly to B<xargs -n1>.

To compress all html files using B<gzip> run:

  find . -name '*.html' | parallel gzip --best

If the file names may contain a newline use B<-0>. Substitute FOO BAR
with FUBAR in all files in this dir and subdirs:

  find . -type f -print0 | \
    parallel -q0 perl -i -pe 's/FOO BAR/FUBAR/g'

Note B<-q> is needed because of the space in 'FOO BAR'.
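
One way to inspect what will actually be run is B<--dry-run>. Without
B<-q> the quoting is lost when the command is composed, so B<perl>
would see 's/FOO' and 'BAR/FUBAR/g' as two separate arguments:

  find . -type f -print0 | \
    parallel --dry-run -q0 perl -i -pe 's/FOO BAR/FUBAR/g'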

=head2 EXAMPLE: Simple network scanner

B<prips> can generate IP-addresses from CIDR notation. With GNU
B<parallel> you can build a simple network scanner to see which
addresses respond to B<ping>:

  prips 130.229.16.0/20 | \
    parallel --timeout 2 -j0 \
      'ping -c 1 {} >/dev/null && echo {}' 2>/dev/null

=head2 EXAMPLE: Reading arguments from command line

GNU B<parallel> can take the arguments from the command line instead
of stdin (standard input). To compress all html files in the current
dir using B<gzip> run:

  parallel gzip --best ::: *.html

To convert *.wav to *.mp3 using LAME running one process per CPU run:

  parallel lame {} -o {.}.mp3 ::: *.wav

=head2 EXAMPLE: Inserting multiple arguments

When moving a lot of files like this: B<mv *.log destdir> you will
sometimes get the error:

  bash: /bin/mv: Argument list too long

because there are too many files. You can instead do:

  ls | grep -E '\.log$' | parallel mv {} destdir

This will run B<mv> for each file. It can be done faster if B<mv> gets
as many arguments as will fit on the line:

  ls | grep -E '\.log$' | parallel -m mv {} destdir

In many shells you can also use B<printf>:

  printf '%s\0' *.log | parallel -0 -m mv {} destdir

=head2 EXAMPLE: Context replace

To remove the files I<pict0000.jpg> .. I<pict9999.jpg> you could do:

  seq -w 0 9999 | parallel rm pict{}.jpg

You could also do:

  seq -w 0 9999 | perl -pe 's/(.*)/pict$1.jpg/' | parallel -m rm

The first will run B<rm> 10000 times, while the last will only run
B<rm> as many times as needed to keep the command line length short
enough to avoid B<Argument list too long> (it typically runs 1-2 times).
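
To check how many B<rm> invocations that amounts to on your system,
you can count the command lines generated by B<--dry-run>:

  seq -w 0 9999 | perl -pe 's/(.*)/pict$1.jpg/' | \
    parallel -m --dry-run rm | wc -l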

You could also run:

  seq -w 0 9999 | parallel -X rm pict{}.jpg

This will also only run B<rm> as many times as needed to keep the
command line length short enough.

=head2 EXAMPLE: Compute intensive jobs and substitution

If ImageMagick is installed this will generate a thumbnail of a jpg
file:

  convert -geometry 120 foo.jpg thumb_foo.jpg

This will run with number-of-cpus jobs in parallel for all jpg files
in a directory:

  ls *.jpg | parallel convert -geometry 120 {} thumb_{}

To do it recursively use B<find>:

  find . -name '*.jpg' | \
    parallel convert -geometry 120 {} {}_thumb.jpg

Notice how the second argument has to start with B<{}> as B<{}> will
include the path (e.g. running B<convert -geometry 120 ./foo/bar.jpg
thumb_./foo/bar.jpg> would clearly be wrong). The command will
generate files like ./foo/bar.jpg_thumb.jpg.

Use B<{.}> to avoid the extra .jpg in the file name. This command will
make files like ./foo/bar_thumb.jpg:

  find . -name '*.jpg' | \
    parallel convert -geometry 120 {} {.}_thumb.jpg

=head2 EXAMPLE: Substitution and redirection

This will generate an uncompressed version of .gz-files next to the
.gz-file:

  parallel zcat {} ">"{.} ::: *.gz

Quoting of > is necessary to postpone the redirection. Another
solution is to quote the whole command:

  parallel "zcat {} >{.}" ::: *.gz

Other special shell characters (such as * ; $ > < | >> <<) also need
to be put in quotes, as they may otherwise be interpreted by the shell
and not given to GNU B<parallel>.
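
For example, to run a pipe inside each job (rather than piping the
combined output of GNU B<parallel>), the | must be quoted so it
becomes part of the job:

  parallel 'echo {} | wc -c' ::: a bb ccc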

=head2 EXAMPLE: Composed commands

A job can consist of several commands. This will print the number of
files in each directory:

  ls | parallel 'echo -n {}" "; ls {}|wc -l'

To put the output in a file called <name>.dir:

  ls | parallel '(echo -n {}" "; ls {}|wc -l) >{}.dir'

Even small shell scripts can be run by GNU B<parallel>:

  find . | parallel 'a={}; name=${a##*/};' \
    'upper=$(echo "$name" | tr "[:lower:]" "[:upper:]");'\
    'echo "$name - $upper"'

  ls | parallel 'mv {} "$(echo {} | tr "[:upper:]" "[:lower:]")"'

Given a list of URLs, list all URLs that fail to download. Print the
line number and the URL:

  cat urlfile | parallel "wget {} 2>/dev/null || grep -n {} urlfile"

Create a mirror directory with the same file names except all files
and symlinks are empty files:

  cp -rs /the/source/dir mirror_dir
  find mirror_dir -type l | parallel -m rm {} '&&' touch {}

Find the files in a list that do not exist:

  cat file_list | parallel 'if [ ! -e {} ] ; then echo {}; fi'

=head2 EXAMPLE: Composed command with perl replacement string

You have a bunch of files. You want them sorted into dirs. The dir of
each file should be named the first letter of the file name.

  parallel 'mkdir -p {=s/(.).*/$1/=}; mv {} {=s/(.).*/$1/=}' ::: *

=head2 EXAMPLE: Composed command with multiple input sources

You have a dir with files named as 24 hours in 5 minute intervals:
00:00, 00:05, 00:10 .. 23:55. You want to find the files missing:

  parallel [ -f {1}:{2} ] "||" echo {1}:{2} does not exist \
    ::: {00..23} ::: {00..55..5}

=head2 EXAMPLE: Calling Bash functions

If the composed command is longer than a line, it becomes hard to
read. In Bash you can use functions. Just remember to B<export -f> the
function.

  doit() {
    echo Doing it for $1
    sleep 2
    echo Done with $1
  }
  export -f doit
  parallel doit ::: 1 2 3

  doubleit() {
    echo Doing it for $1 $2
    sleep 2
    echo Done with $1 $2
  }
  export -f doubleit
  parallel doubleit ::: 1 2 3 ::: a b

To do this on remote servers you need to transfer the function using
B<--env>:

  parallel --env doit -S server doit ::: 1 2 3
  parallel --env doubleit -S server doubleit ::: 1 2 3 ::: a b

If your environment (aliases, variables, and functions) is small you
can copy the full environment without having to B<export -f>
anything. See B<env_parallel>.
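
A minimal sketch of that approach (assuming B<env_parallel.bash> is
installed next to GNU B<parallel>):

  # Source once per session to enable env_parallel
  . $(which env_parallel.bash)
  doit() { echo Doing it for $1; }
  # No export -f needed: env_parallel copies the environment itself
  env_parallel doit ::: 1 2 3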

=head2 EXAMPLE: Function tester

To test a program with different parameters:

  tester() {
    if (eval "$@") >&/dev/null; then
      perl -e 'printf "\033[30;102m[ OK ]\033[0m @ARGV\n"' "$@"
    else
      perl -e 'printf "\033[30;101m[FAIL]\033[0m @ARGV\n"' "$@"
    fi
  }
  export -f tester
  parallel tester my_program ::: arg1 arg2
  parallel tester exit ::: 1 0 2 0

If B<my_program> fails a red FAIL will be printed followed by the
failing command; otherwise a green OK will be printed followed by the
command.

=head2 EXAMPLE: Identify a few failing jobs

B<--bar> works best if jobs have no output. If the failing jobs have
output you can identify the jobs like this:

  job-with-few-failures() {
    # Force reproducibility
    RANDOM=$1
    # This fails 1% (328 of 32768)
    if [ $RANDOM -lt 328 ] ; then
      echo Failed $1
    fi
  }
  export -f job-with-few-failures
  seq 1000 | parallel --bar --tag job-with-few-failures

=head2 EXAMPLE: Continuously show the latest line of output

It can be useful to monitor the output of running jobs.

This shows the most recent output line until a job finishes, after
which the output of the job is printed in full:

  parallel '{} | tee >(cat >&3)' ::: 'command 1' 'command 2' \
    3> >(perl -ne '$|=1;chomp;printf"%.'$COLUMNS's\r",$_." "x100')

=head2 EXAMPLE: Log rotate

Log rotation renames a logfile to an extension with a higher number:
log.1 becomes log.2, log.2 becomes log.3, and so on. The oldest log
is removed. To avoid overwriting files the process starts backwards
from the high number to the low number. This will keep 10 old
versions of the log:

  seq 9 -1 1 | parallel -j1 mv log.{} log.'{= $_++ =}'
  mv log log.1

=head2 EXAMPLE: Removing file extension when processing files

When processing files removing the file extension using B<{.}> is
often useful.

Create a directory for each zip-file and unzip it in that dir:

  parallel 'mkdir {.}; cd {.}; unzip ../{}' ::: *.zip

Recompress all .gz files in current directory using B<bzip2> running
1 job per CPU in parallel:

  parallel "zcat {} | bzip2 >{.}.bz2 && rm {}" ::: *.gz

Convert all WAV files to MP3 using LAME:

  find sounddir -type f -name '*.wav' | parallel lame {} -o {.}.mp3

Put all converted files in the same directory:

  find sounddir -type f -name '*.wav' | \
    parallel lame {} -o mydir/{/.}.mp3

=head2 EXAMPLE: Replacing parts of file names

If you deal with paired-end reads, you will have files like
barcode1_R1.fq.gz, barcode1_R2.fq.gz, barcode2_R1.fq.gz, and
barcode2_R2.fq.gz.

You want barcodeI<N>_R1 to be processed with barcodeI<N>_R2.

  parallel --plus myprocess {} {/_R1.fq.gz/_R2.fq.gz} ::: *_R1.fq.gz

If the barcode does not contain '_R1', you can do:

  parallel --plus myprocess {} {/_R1/_R2} ::: *_R1.fq.gz

=head2 EXAMPLE: Removing strings from the argument

If you have a directory with tar.gz files and want these extracted in
the corresponding dir (e.g. foo.tar.gz will be extracted in the dir
foo) you can do:

  parallel --plus 'mkdir {..}; tar -C {..} -xf {}' ::: *.tar.gz

If you want to remove a different ending, you can use {%string}:

  parallel --plus echo {%_demo} ::: mycode_demo keep_demo_here

You can also remove a starting string with {#string}:

  parallel --plus echo {#demo_} ::: demo_mycode keep_demo_here

To remove a string anywhere you can use regular expressions with
{/regexp/replacement} and leave the replacement empty:

  parallel --plus echo {/demo_/} ::: demo_mycode remove_demo_here

=head2 EXAMPLE: Download 24 images for each of the past 30 days

Let us assume a website stores images like:

  https://www.example.com/path/to/YYYYMMDD_##.jpg

where YYYYMMDD is the date and ## is the number 01-24. This will
download images for the past 30 days:

  getit() {
    date=$(date -d "today -$1 days" +%Y%m%d)
    num=$2
    echo wget https://www.example.com/path/to/${date}_${num}.jpg
  }
  export -f getit

  parallel getit ::: $(seq 30) ::: $(seq -w 24)

B<$(date -d "today -$1 days" +%Y%m%d)> will give the dates in
YYYYMMDD with B<$1> days subtracted.

=head2 EXAMPLE: Download world map from NASA

NASA provides tiles to download on earthdata.nasa.gov. Download tiles
for Blue Marble world map and create a 10240x20480 map.

  base=https://map1a.vis.earthdata.nasa.gov/wmts-geo/wmts.cgi
  service="SERVICE=WMTS&REQUEST=GetTile&VERSION=1.0.0"
  layer="LAYER=BlueMarble_ShadedRelief_Bathymetry"
  set="STYLE=&TILEMATRIXSET=EPSG4326_500m&TILEMATRIX=5"
  tile="TILEROW={1}&TILECOL={2}"
  format="FORMAT=image%2Fjpeg"
  url="$base?$service&$layer&$set&$tile&$format"

  parallel -j0 -q wget "$url" -O {1}_{2}.jpg ::: {0..19} ::: {0..39}
  parallel eval convert +append {}_{0..39}.jpg line{}.jpg ::: {0..19}
  convert -append line{0..19}.jpg world.jpg

=head2 EXAMPLE: Download Apollo-11 images from NASA using jq

Search NASA using their API to get JSON for images related to 'apollo
11' that have 'moon landing' in the description.

The search query returns JSON containing URLs to JSON containing
collections of pictures. One of the pictures in each of these
collections is I<large>.

B<wget> is used to get the JSON for the search query. B<jq> is then
used to extract the URLs of the collections. B<parallel> then calls
B<wget> to get each collection, which is passed to B<jq> to extract
the URLs of all images. B<grep> selects the I<large> images, and
B<parallel> finally uses B<wget> to fetch the images.

  base="https://images-api.nasa.gov/search"
  q="q=apollo 11"
  description="description=moon landing"
  media_type="media_type=image"
  wget -O - "$base?$q&$description&$media_type" |
    jq -r .collection.items[].href |
    parallel wget -O - |
    jq -r .[] |
    grep large |
    parallel wget

=head2 EXAMPLE: Download video playlist in parallel

B<youtube-dl> is an excellent tool to download videos. It cannot,
however, download videos in parallel. This takes a playlist and
downloads 10 videos in parallel.

  url='youtu.be/watch?v=0wOf2Fgi3DE&list=UU_cznB5YZZmvAmeq7Y3EriQ'
  export url
  youtube-dl --flat-playlist "https://$url" |
    parallel --tagstring {#} --lb -j10 \
      youtube-dl --playlist-start {#} --playlist-end {#} '"https://$url"'

=head2 EXAMPLE: Prepend last modified date (ISO8601) to file name

  parallel mv {} '{= $a=pQ($_); $b=$_;' \
    '$_=qx{date -r "$a" +%FT%T}; chomp; $_="$_ $b" =}' ::: *

B<{=> and B<=}> mark a perl expression. B<pQ> perl-quotes the
string. B<date +%FT%T> is the date in ISO8601 with time.

=head2 EXAMPLE: Save output in ISO8601 dirs

Save output from B<ps aux> every second into dirs named
yyyy-mm-ddThh:mm:ss+zz:zz.

  seq 1000 | parallel -N0 -j1 --delay 1 \
    --results '{= $_=`date -Isec`; chomp=}/' ps aux

=head2 EXAMPLE: Digital clock with "blinking" :

The : in a digital clock blinks. To make every other line have a ':'
and the rest a ' ' a perl expression is used to look at the 3rd input
source. If the value modulo 2 is 1: Use ":" otherwise use " ":

  parallel -k echo {1}'{=3 $_=$_%2?":":" "=}'{2}{3} \
    ::: {0..12} ::: {0..5} ::: {0..9}

=head2 EXAMPLE: Aggregating content of files

This:

  parallel --header : echo x{X}y{Y}z{Z} \> x{X}y{Y}z{Z} \
    ::: X {1..5} ::: Y {01..10} ::: Z {1..5}

will generate the files x1y01z1 .. x5y10z5. If you want to aggregate
the output grouping on x and z you can do this:

  parallel eval 'cat {=s/y01/y*/=} > {=s/y01//=}' ::: *y01*

For all values of x and z it runs commands like:

  cat x1y*z1 > x1z1

So you end up with x1z1 .. x5z5 each containing the content of all
values of y.

=head2 EXAMPLE: Breadth first parallel web crawler/mirrorer

The script below will crawl and mirror a URL in parallel. It
downloads first pages that are 1 click down, then 2 clicks down, then
3; instead of the normal depth first, where the first link on each
page is fetched first.

Run like this:

  PARALLEL=-j100 ./parallel-crawl http://gatt.org.yeslab.org/

Remove the B<wget> part if you only want a web crawler.

It works by fetching a page from a list of URLs and looking for links
in that page that are within the same starting URL and that have not
already been seen. These links are added to a new queue. When all the
pages from the list are done, the new queue is moved to the list of
URLs and the process is started over until no unseen links are found.

  #!/bin/bash

  # E.g. http://gatt.org.yeslab.org/
  URL=$1
  # Stay inside the start dir
  BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
  URLLIST=$(mktemp urllist.XXXX)
  URLLIST2=$(mktemp urllist.XXXX)
  SEEN=$(mktemp seen.XXXX)

  # Spider to get the URLs
  echo $URL >$URLLIST
  cp $URLLIST $SEEN

  while [ -s $URLLIST ] ; do
    cat $URLLIST |
      parallel lynx -listonly -image_links -dump {} \; \
        wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
      perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and
        do { $seen{$1}++ or print }' |
      grep -F $BASEURL |
      grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
    mv $URLLIST2 $URLLIST
  done

  rm -f $URLLIST $URLLIST2 $SEEN

=head2 EXAMPLE: Process files from a tar file while unpacking

If the files to be processed are in a tar file then unpacking one
file and processing it immediately may be faster than first unpacking
all files.

  tar xvf foo.tgz | perl -ne 'print $l;$l=$_;END{print $l}' | \
    parallel echo

The Perl one-liner is needed to make sure the file is complete before
handing it to GNU B<parallel>.
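
The one-liner simply delays each file name by one line. A commented
version of the same idea:

  tar xvf foo.tgz | perl -ne '
      print $l;        # print the previous name: tar has moved on,
                       # so that file is fully unpacked
      $l = $_;         # remember the current name
      END { print $l } # print the last name when tar is done
  ' | parallel echo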

=head2 EXAMPLE: Rewriting a for-loop and a while-read-loop

for-loops like this:

  (for x in `cat list` ; do
    do_something $x
  done) | process_output

and while-read-loops like this:

  cat list | (while read x ; do
    do_something $x
  done) | process_output

can be written like this:

  cat list | parallel do_something | process_output

For example: Find which host name in a list has IP address 1.2.3.4:

  cat hosts.txt | parallel -P 100 host | grep 1.2.3.4

If the processing requires more steps the for-loop like this:

  (for x in `cat list` ; do
    no_extension=${x%.*};
    do_step1 $x scale $no_extension.jpg
    do_step2 <$x $no_extension
  done) | process_output

and while-loops like this:

  cat list | (while read x ; do
    no_extension=${x%.*};
    do_step1 $x scale $no_extension.jpg
    do_step2 <$x $no_extension
  done) | process_output

can be written like this:

  cat list | parallel "do_step1 {} scale {.}.jpg ; do_step2 <{} {.}" |\
    process_output

If the body of the loop is bigger, it improves readability to use a
function:

  (for x in `cat list` ; do
    do_something $x
    [... 100 lines that do something with $x ...]
  done) | process_output

  cat list | (while read x ; do
    do_something $x
    [... 100 lines that do something with $x ...]
  done) | process_output

can both be rewritten as:

  doit() {
    x=$1
    do_something $x
    [... 100 lines that do something with $x ...]
  }
  export -f doit
  cat list | parallel doit

=head2 EXAMPLE: Rewriting nested for-loops

Nested for-loops like this:

  (for x in `cat xlist` ; do
    for y in `cat ylist` ; do
      do_something $x $y
    done
  done) | process_output

can be written like this:

  parallel do_something {1} {2} :::: xlist ylist | process_output

Nested for-loops like this:

  (for colour in red green blue ; do
    for size in S M L XL XXL ; do
      echo $colour $size
    done
  done) | sort

can be written like this:

  parallel echo {1} {2} ::: red green blue ::: S M L XL XXL | sort

=head2 EXAMPLE: Finding the lowest difference between files

B<diff> is good for finding differences in text files. B<diff | wc -l>
gives an indication of the size of the difference. To find the
differences between all files in the current dir do:

  parallel --tag 'diff {1} {2} | wc -l' ::: * ::: * | sort -nk3

This way it is possible to see if some files are closer to other
files.
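
Every file is also compared to itself, which always gives a
difference of 0. A small sketch that filters away these
self-comparisons (assuming file names without whitespace):

  parallel --tag 'diff {1} {2} | wc -l' ::: * ::: * | \
    awk '$1 != $2' | sort -nk3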

=head2 EXAMPLE: for-loops with column names

When doing multiple nested for-loops it can be easier to keep track
of the loop variable if it is named instead of just having a number.
Use B<--header :> to let the first argument be a named alias for the
positional replacement string:

  parallel --header : echo {colour} {size} \
    ::: colour red green blue ::: size S M L XL XXL

This also works if the input file is a file with columns:

  cat addressbook.tsv | \
    parallel --colsep '\t' --header : echo {Name} {E-mail address}

=head2 EXAMPLE: All combinations in a list

GNU B<parallel> makes all combinations when given two lists.

To make all combinations in a single list with unique values, you
repeat the list and use the replacement string B<{choose_k}>:

  parallel --plus echo {choose_k} ::: A B C D ::: A B C D

  parallel --plus echo 2{2choose_k} 1{1choose_k} ::: A B C D ::: A B C D

B<{choose_k}> works for any number of input sources:

  parallel --plus echo {choose_k} ::: A B C D ::: A B C D ::: A B C D

Where B<{choose_k}> does not care about order, B<{uniq}> cares about
order. It simply skips jobs where values from different input sources
are the same:

  parallel --plus echo {uniq} ::: A B C ::: A B C ::: A B C
  parallel --plus echo {1uniq}+{2uniq}+{3uniq} \
    ::: A B C ::: A B C ::: A B C

The behaviour of B<{choose_k}> is undefined if the input values of
each source are different.

=head2 EXAMPLE: From a to b and b to c

Assume you have input like:

  aardvark
  babble
  cab
  dab
  each

and want to run combinations like:

  aardvark babble
  babble cab
  cab dab
  dab each

If the input is in the file in.txt:

  parallel echo {1} - {2} ::::+ <(head -n -1 in.txt) <(tail -n +2 in.txt)

If the input is in the array $a here are two solutions:

  seq $((${#a[@]}-1)) | \
    env_parallel --env a echo '${a[{=$_--=}]} - ${a[{}]}'
  parallel echo {1} - {2} ::: "${a[@]::${#a[@]}-1}" :::+ "${a[@]:1}"

=head2 EXAMPLE: Count the differences between all files in a dir

Using B<--results> the results are saved in /tmp/diffcount*.

  parallel --results /tmp/diffcount "diff -U 0 {1} {2} | \
    tail -n +3 |grep -v '^@'|wc -l" ::: * ::: *

To see the difference between file A and file B look at the file
'/tmp/diffcount/1/A/2/B'.

=head2 EXAMPLE: Speeding up fast jobs

Starting a job on the local machine takes around 3-10 ms. This can be
a big overhead if the job takes very few ms to run. Often you can
group small jobs together using B<-X> which will make the overhead
less significant. Compare the speed of these:

  seq -w 0 9999 | parallel touch pict{}.jpg
  seq -w 0 9999 | parallel -X touch pict{}.jpg

If your program cannot take multiple arguments, then you can use GNU
B<parallel> to spawn multiple GNU B<parallel>s:

  seq -w 0 9999999 | \
    parallel -j10 -q -I,, --pipe parallel -j0 touch pict{}.jpg

If B<-j0> normally spawns 252 jobs, then the above will try to spawn
2520 jobs. On a normal GNU/Linux system you can spawn 32000 jobs using
this technique with no problems. To raise the 32000 jobs limit raise
/proc/sys/kernel/pid_max to 4194303.
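
Raising the limit could look like this (requires root):

  echo 4194303 | sudo tee /proc/sys/kernel/pid_max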

If you do not need GNU B<parallel> to have control over each job (so
no need for B<--retries> or B<--joblog> or similar), then it can be
even faster if you can generate the command lines and pipe those to a
shell. So if you can do this:

  mygenerator | sh

Then that can be parallelized like this:

  mygenerator | parallel --pipe --block 10M sh

E.g.

  mygenerator() {
    seq 10000000 | perl -pe 'print "echo This is fast job number "';
  }
  mygenerator | parallel --pipe --block 10M sh

The overhead is 100000 times smaller, namely around 100 nanoseconds
per job.

=head2 EXAMPLE: Using shell variables

When using shell variables you need to quote them correctly as they
may otherwise be interpreted by the shell.

Notice the difference between:

  ARR=("My brother's 12\" records are worth <\$\$\$>"'!' Foo Bar)
  parallel echo ::: ${ARR[@]} # This is probably not what you want

and:

  ARR=("My brother's 12\" records are worth <\$\$\$>"'!' Foo Bar)
  parallel echo ::: "${ARR[@]}"

When using variables in the actual command that contain special
characters (e.g. space) you can quote them using B<'"$VAR"'> or using
"'s and B<-q>:

  VAR="My brother's 12\" records are worth <\$\$\$>"
  parallel -q echo "$VAR" ::: '!'
  export VAR
  parallel echo '"$VAR"' ::: '!'

If B<$VAR> does not contain ' then B<"'$VAR'"> will also work
(and does not need B<export>):

  VAR="My 12\" records are worth <\$\$\$>"
  parallel echo "'$VAR'" ::: '!'

If you use them in a function you just quote as you normally would do:

  VAR="My brother's 12\" records are worth <\$\$\$>"
  export VAR
  myfunc() { echo "$VAR" "$1"; }
  export -f myfunc
  parallel myfunc ::: '!'

=head2 EXAMPLE: Group output lines

When running jobs that output data, you often do not want the output
of multiple jobs to run together. GNU B<parallel> defaults to grouping
the output of each job, so the output is printed when the job
finishes. If you want full lines to be printed while the job is
running you can use B<--line-buffer>. If you want output to be
printed as soon as possible you can use B<-u>.

Compare the output of:

  parallel wget --progress=dot --limit-rate=100k \
    https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
    ::: {12..16}
  parallel --line-buffer wget --progress=dot --limit-rate=100k \
    https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
    ::: {12..16}
  parallel --latest-line wget --progress=dot --limit-rate=100k \
    https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
    ::: {12..16}
  parallel -u wget --progress=dot --limit-rate=100k \
    https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
    ::: {12..16}

=head2 EXAMPLE: Tag output lines

GNU B<parallel> groups the output lines, but it can be hard to see
where the different jobs begin. B<--tag> prepends the argument to make
that more visible:

  parallel --tag wget --limit-rate=100k \
    https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
    ::: {12..16}

B<--tag> works with B<--line-buffer> but not with B<-u>:

  parallel --tag --line-buffer wget --limit-rate=100k \
    https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
    ::: {12..16}

Check the uptime of the servers in I<~/.parallel/sshloginfile>:

  parallel --tag -S .. --nonall uptime

=head2 EXAMPLE: Colorize output

Give each job a new color. Most terminals support ANSI colors with
the escape code "\033[30;3Xm" where 0 <= X <= 7:

  seq 10 | \
    parallel --tagstring '\033[30;3{=$_=++$::color%8=}m' seq {}
  parallel --rpl '{color} $_="\033[30;3".(++$::color%8)."m"' \
    --tagstring {color} seq {} ::: {1..10}

To get rid of the initial \t (which comes from B<--tagstring>):

  ... | perl -pe 's/\t//'

=head2 EXAMPLE: Keep order of output same as order of input

Normally the output of a job will be printed as soon as it
completes. Sometimes you want the order of the output to remain the
same as the order of the input. This is often important, if the output
is used as input for another system. B<-k> will make sure the order of
output will be in the same order as input even if later jobs end
before earlier jobs.

Append a string to every line in a text file:

  cat textfile | parallel -k echo {} append_string

If you remove B<-k> some of the lines may come out in the wrong order.

Another example is B<traceroute>:

  parallel traceroute ::: qubes-os.org debian.org freenetproject.org

will give traceroute of qubes-os.org, debian.org and
freenetproject.org, but it will be sorted according to which job
completed first.

To keep the order the same as input run:

  parallel -k traceroute ::: qubes-os.org debian.org freenetproject.org

This will make sure the traceroute to qubes-os.org will be printed
first.

A bit more complex example is downloading a huge file in chunks in
parallel: Some internet connections will deliver more data if you
download files in parallel. For downloading files in parallel see
"EXAMPLE: Download 24 images for each of the past 30 days". But if you
are downloading a big file you can download the file in chunks in
parallel.

To download byte 10000000-19999999 you can use B<curl>:

  curl -r 10000000-19999999 https://example.com/the/big/file >file.part

To download a 1 GB file we need 100 10MB chunks downloaded and
combined in the correct order:

  seq 0 99 | parallel -k curl -r \
    {}0000000-{}9999999 https://example.com/the/big/file > file
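
The 100 chunks above are tied to the 1 GB size. A sketch that derives
the chunk count from the size reported by the server (assuming the
server sends I<Content-Length> and honours range requests):

  url=https://example.com/the/big/file
  # Ask for the size; keep the last header in case of redirects
  size=$(curl -sIL "$url" |
         awk 'tolower($1)=="content-length:" {print $2+0}' | tail -n1)
  seq 0 $(( (size - 1) / 10000000 )) | parallel -k curl -r \
    {}0000000-{}9999999 "$url" > file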

=head2 EXAMPLE: Parallel grep

B<grep -r> greps recursively through directories. GNU B<parallel> can
often speed this up.

  find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}

This will run 1.5 jobs per CPU, and give 1000 arguments to B<grep>.

There are situations where the above will be slower than B<grep -r>:

=over 2

=item *

If data is already in RAM. The overhead of starting jobs and
buffering output may outweigh the benefit of running in parallel.

=item *

If the files are big. If a file cannot be read in a single seek, the
disk may start thrashing.

=back

The speedup is caused by two factors:

=over 2

=item *

On rotating harddisks small files often require a seek for each
file. By searching for more files in parallel, the arm may pass
another wanted file on its way.

=item *

NVMe drives often perform better by having multiple commands running
in parallel.

=back

=head2 EXAMPLE: Grepping n lines for m regular expressions

The simplest solution to grep a big file for a lot of regexps is:

  grep -f regexps.txt bigfile

Or if the regexps are fixed strings:

  grep -F -f regexps.txt bigfile

There are 3 limiting factors: CPU, RAM, and disk I/O.

RAM is easy to measure: If the B<grep> process takes up most of your
free memory (e.g. when running B<top>), then RAM is a limiting factor.

CPU is also easy to measure: If the B<grep> takes >90% CPU in B<top>,
then the CPU is a limiting factor, and parallelization will speed this
up.

It is harder to see if disk I/O is the limiting factor, and depending
on the disk system it may be faster or slower to parallelize. The only
way to know for certain is to test and measure.

=head3 Limiting factor: RAM

The normal B<grep -f regexps.txt bigfile> works no matter the size of
bigfile, but if regexps.txt is so big it cannot fit into memory, then
you need to split this.

B<grep -F> takes around 100 bytes of RAM and B<grep> takes about 500
bytes of RAM per 1 byte of regexp. So if regexps.txt is 1% of your
RAM, then it may be too big.
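
E.g. with 8 GB of RAM a regexps.txt of 80 MB (1% of RAM) needs
roughly 80 MB * 100 = 8 GB with B<grep -F> and 80 MB * 500 = 40 GB
with normal B<grep>.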

If you can convert your regexps into fixed strings do that. E.g. if
the lines you are looking for in bigfile all look like:

  ID1 foo bar baz Identifier1 quux
  fubar ID2 foo bar baz Identifier2

then your regexps.txt can be converted from:

  ID1.*Identifier1
  ID2.*Identifier2

into:

  ID1 foo bar baz Identifier1
  ID2 foo bar baz Identifier2

This way you can use B<grep -F> which takes around 80% less memory
and is much faster.

If it still does not fit in memory you can do this:

  parallel --pipe-part -a regexps.txt --block 1M grep -F -f - -n bigfile | \
    sort -un | perl -pe 's/^\d+://'

The 1M should be your free memory divided by the number of CPU
threads and divided by 200 for B<grep -F> and by 1000 for normal
B<grep>. On GNU/Linux you can do:

  free=$(awk '/^((Swap)?Cached|MemFree|Buffers):/ { sum += $2 }
              END { print sum }' /proc/meminfo)
  percpu=$((free / 200 / $(parallel --number-of-threads)))k

  parallel --pipe-part -a regexps.txt --block $percpu --compress \
    grep -F -f - -n bigfile | \
    sort -un | perl -pe 's/^\d+://'

If you can live with duplicated lines and wrong order, it is faster
to do:

  parallel --pipe-part -a regexps.txt --block $percpu --compress \
    grep -F -f - bigfile

=head3 Limiting factor: CPU

If the CPU is the limiting factor parallelization should be done on
the regexps:

  cat regexps.txt | parallel --pipe -L1000 --round-robin --compress \
    grep -f - -n bigfile | \
    sort -un | perl -pe 's/^\d+://'

The command will start one B<grep> per CPU and read I<bigfile> one
time per CPU, but as that is done in parallel, all reads except the
first will be cached in RAM. Depending on the size of I<regexps.txt>
it may be faster to use B<--block 10m> instead of B<-L1000>.

Some storage systems perform better when reading multiple chunks in
parallel. This is true for some RAID systems and for some network
file systems. To parallelize the reading of I<bigfile>:

  parallel --pipe-part --block 100M -a bigfile -k --compress \
    grep -f regexps.txt

This will split I<bigfile> into 100MB chunks and run B<grep> on each
of these chunks. To parallelize both reading of I<bigfile> and
I<regexps.txt> combine the two using B<--cat>:

  parallel --pipe-part --block 100M -a bigfile --cat cat regexps.txt \
    \| parallel --pipe -L1000 --round-robin grep -f - {}

If a line matches multiple regexps, the line may be duplicated.

=head3 Bigger problem

If the problem is too big to be solved by this, you are probably
ready for Lucene.

=head2 EXAMPLE: Using remote computers

To run commands on a remote computer SSH needs to be set up and you
must be able to login without entering a password (the commands
B<ssh-copy-id>, B<ssh-agent>, and B<sshpass> may help you do that).

If you need to login to a whole cluster, you typically do not want to
accept the host key for every host. You want to accept them the first
time and be warned if they are ever changed. To do that:

  # Add the servers to the sshloginfile
  (echo servera; echo serverb) > .parallel/my_cluster
  # Make sure .ssh/config exist
  touch .ssh/config
  cp .ssh/config .ssh/config.backup
  # Disable StrictHostKeyChecking temporarily
  (echo 'Host *'; echo StrictHostKeyChecking no) >> .ssh/config
  parallel --slf my_cluster --nonall true
  # Remove the disabling of StrictHostKeyChecking
  mv .ssh/config.backup .ssh/config

The servers in B<.parallel/my_cluster> are now added in
B<.ssh/known_hosts>.

To run B<echo> on B<server.example.com>:

  seq 10 | parallel --sshlogin server.example.com echo

To run commands on more than one remote computer run:

  seq 10 | parallel --sshlogin s1.example.com,s2.example.net echo

Or:

  seq 10 | parallel --sshlogin server.example.com \
    --sshlogin server2.example.net echo

If the login username is I<foo> on I<server2.example.net> use:

  seq 10 | parallel --sshlogin server.example.com \
    --sshlogin foo@server2.example.net echo

If your list of hosts is I<server1-88.example.net> with login I<foo>:

  seq 10 | parallel -Sfoo@server{1..88}.example.net echo

To distribute the commands to a list of computers, make a file
I<mycomputers> with all the computers:

  server.example.com
  foo@server2.example.com
  server3.example.com

Then run:

  seq 10 | parallel --sshloginfile mycomputers echo

To include the local computer add the special sshlogin ':' to the
list:

  server.example.com
  foo@server2.example.com
  server3.example.com
  :

GNU B<parallel> will try to determine the number of CPUs on each of
the remote computers, and run one job per CPU - even if the remote
computers do not have the same number of CPUs.

If the number of CPUs on the remote computers is not identified
correctly the number of CPUs can be added in front. Here the computer
has 8 CPUs.

  seq 10 | parallel --sshlogin 8/server.example.com echo

=head2 EXAMPLE: Transferring of files

To recompress gzipped files with B<bzip2> using a remote computer run:

  find logs/ -name '*.gz' | \
    parallel --sshlogin server.example.com \
      --transfer "zcat {} | bzip2 -9 >{.}.bz2"

This will list the .gz-files in the I<logs> directory and all
directories below. Then it will transfer the files to
I<server.example.com> to the corresponding directory in
I<$HOME/logs>. On I<server.example.com> the file will be recompressed
using B<zcat> and B<bzip2> resulting in the corresponding file with
I<.gz> replaced with I<.bz2>.

If you want the resulting bz2-file to be transferred back to the
local computer add I<--return {.}.bz2>:

  find logs/ -name '*.gz' | \
    parallel --sshlogin server.example.com \
      --transfer --return {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"

After the recompressing is done the I<.bz2>-file is transferred back
to the local computer and put next to the original I<.gz>-file.

If you want to delete the transferred files on the remote computer
add I<--cleanup>. This will remove both the file transferred to the
remote computer and the files transferred from the remote computer:

  find logs/ -name '*.gz' | \
    parallel --sshlogin server.example.com \
      --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2"

If you want to run on several computers add the computers to
I<--sshlogin> either using ',' or multiple I<--sshlogin>:

  find logs/ -name '*.gz' | \
    parallel --sshlogin server.example.com,server2.example.com \
      --sshlogin server3.example.com \
      --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2"

You can add the local computer using I<--sshlogin :>. This will
disable the removing and transferring for the local computer only:

  find logs/ -name '*.gz' | \
    parallel --sshlogin server.example.com,server2.example.com \
      --sshlogin server3.example.com \
      --sshlogin : \
      --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2"

Often I<--transfer>, I<--return> and I<--cleanup> are used together.
They can be shortened to I<--trc>:

  find logs/ -name '*.gz' | \
    parallel --sshlogin server.example.com,server2.example.com \
      --sshlogin server3.example.com \
      --sshlogin : \
      --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"

With the file I<mycomputers> containing the list of computers it
becomes:

  find logs/ -name '*.gz' | parallel --sshloginfile mycomputers \
    --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"

If the file I<~/.parallel/sshloginfile> contains the list of
computers the special short hand I<-S ..> can be used:

  find logs/ -name '*.gz' | parallel -S .. \
    --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"

=head2 EXAMPLE: Advanced file transfer

Assume you have files in in/*, want them processed on the server,
and transferred back into /other/dir:

  parallel -S server --trc /other/dir/./{/}.out \
    cp {/} {/}.out ::: in/./*

=head2 EXAMPLE: Distributing work to local and remote computers

Convert *.mp3 to *.ogg running one process per CPU on the local
computer and server2:

  parallel --trc {.}.ogg -S server2,: \
    'mpg321 -w - {} | oggenc -q0 - -o {.}.ogg' ::: *.mp3

=head2 EXAMPLE: Running the same command on remote computers

To run the command B<uptime> on remote computers you can do:

  parallel --tag --nonall -S server1,server2 uptime

B<--nonall> reads no arguments. If you have a list of jobs you want
to run on each computer you can do:

  parallel --tag --onall -S server1,server2 echo ::: 1 2 3

Remove B<--tag> if you do not want the sshlogin added before the
output.

If you have a lot of hosts use '-j0' to access more hosts in parallel.

=head2 EXAMPLE: Running 'sudo' on remote computers

Put the password into passwordfile then run:

  parallel --ssh 'cat passwordfile | ssh' --nonall \
    -S user@server1,user@server2 sudo -S ls -l /root

=head2 EXAMPLE: Using remote computers behind NAT wall

If the workers are behind a NAT wall, you need some trickery to get
to them.

If you can B<ssh> to a jumphost, and reach the workers from there,
then the obvious solution would be this, but it B<does not work>:

  parallel --ssh 'ssh jumphost ssh' -S host1 echo ::: DOES NOT WORK

It does not work because the command is dequoted by B<ssh> twice
whereas GNU B<parallel> only expects it to be dequoted once.

You can use a bash function and have GNU B<parallel> quote the
command:

  jumpssh() { ssh -A jumphost ssh $(parallel --shellquote ::: "$@"); }
  export -f jumpssh
  parallel --ssh jumpssh -S host1 echo ::: this works

Or you can instead put this in B<~/.ssh/config>:

  Host host1 host2 host3
    ProxyCommand ssh jumphost.domain nc -w 1 %h 22

It requires B<nc> (netcat) to be installed on the jumphost. With this
you can simply:

  parallel -S host1,host2,host3 echo ::: This does work

=head3 No jumphost, but port forwards

If there is no jumphost but each server has port 22 forwarded from
the firewall (e.g. the firewall's port 22001 = port 22 on host1,
22002 = host2, 22003 = host3) then you can use B<~/.ssh/config>:

  Host host1.v
    Port 22001
  Host host2.v
    Port 22002
  Host host3.v
    Port 22003
  Host *.v
    Hostname firewall

And then use host{1..3}.v as normal hosts:

  parallel -S host1.v,host2.v,host3.v echo ::: a b c

=head3 No jumphost, no port forwards

If ports cannot be forwarded, you need some sort of VPN to traverse
the NAT-wall. TOR is one option for that, as it is very easy to get
working.

You need to install TOR and set up a hidden service. In B<torrc> put:

  HiddenServiceDir /var/lib/tor/hidden_service/
  HiddenServicePort 22 127.0.0.1:22

Then start TOR: B</etc/init.d/tor restart>

The TOR hostname is now in B</var/lib/tor/hidden_service/hostname>
and is something similar to B<izjafdceobowklhz.onion>. Now you simply
prepend B<torsocks> to B<ssh>:

  parallel --ssh 'torsocks ssh' -S izjafdceobowklhz.onion \
    -S zfcdaeiojoklbwhz.onion,auclucjzobowklhi.onion echo ::: a b c

If not all hosts are accessible through TOR:

  parallel -S 'torsocks ssh izjafdceobowklhz.onion,host2,host3' \
    echo ::: a b c

See more B<ssh> tricks on
https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Proxies_and_Jump_Hosts

=head2 EXAMPLE: Use sshpass with ssh

If you cannot use passwordless login, you may be able to use
B<sshpass>:

  seq 10 | parallel -S user-with-password:MyPassword@server echo

Or:

  export SSHPASS='MyPa$$w0rd'
  seq 10 | parallel -S user-with-password:@server echo

=head2 EXAMPLE: Use outrun instead of ssh

B<outrun> lets you run a command on a remote server. B<outrun> sets
up a connection to access files at the source server, and
automatically transfers files. B<outrun> must be installed on the
remote system.

You can use B<outrun> in an sshlogin this way:

  parallel -S 'outrun user@server' command

Or:

  parallel --ssh outrun -S server command

=head2 EXAMPLE: Slurm cluster

The Slurm Workload Manager is used in many clusters.

Here is a simple example of using GNU B<parallel> to call B<srun>:

  #!/bin/bash

  #SBATCH --time 00:02:00
  #SBATCH --ntasks=4
  #SBATCH --job-name GnuParallelDemo
  #SBATCH --output gnuparallel.out

  module purge
  module load gnu_parallel

  my_parallel="parallel --delay .2 -j $SLURM_NTASKS"
  my_srun="srun --export=all --exclusive -n1"
  my_srun="$my_srun --cpus-per-task=1 --cpu-bind=cores"
  $my_parallel "$my_srun" echo This is job {} ::: {1..20}

=head2 EXAMPLE: Parallelizing rsync

B<rsync> is a great tool, but sometimes it will not fill up the
available bandwidth. Running multiple B<rsync> in parallel can fix
this.

  cd src-dir
  find . -type f |
    parallel -j10 -X rsync -zR -Ha ./{} fooserver:/dest-dir/

Adjust B<-j10> until you find the optimal number.

B<rsync -R> will create the needed subdirectories, so all files are
not put into a single dir. The B<./> is needed so the resulting
command looks similar to:

  rsync -zR ././sub/dir/file fooserver:/dest-dir/

The B</./> is what B<rsync -R> works on.

If you are unable to push data, but need to pull them and the files
are called digits.png (e.g. 000000.png) you might be able to do:

  seq -w 0 99 | parallel rsync -Havessh fooserver:src/*{}.png destdir/

=head2 EXAMPLE: Use multiple inputs in one command

Copy files like foo.es.ext to foo.ext:

  ls *.es.* | perl -pe 'print; s/\.es//' | parallel -N2 cp {1} {2}

The perl command spits out 2 lines for each input. GNU B<parallel>
takes 2 inputs (using B<-N2>) and replaces {1} and {2} with the
inputs.

Count in binary:

  parallel -k echo ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1

Print the number on the opposing sides of a six sided die:

  parallel --link -a <(seq 6) -a <(seq 6 -1 1) echo
  parallel --link echo :::: <(seq 6) <(seq 6 -1 1)

Convert files from all subdirs to PNG-files with consecutive numbers
(useful for making input PNG's for B<ffmpeg>):

  parallel --link -a <(find . -type f | sort) \
    -a <(seq $(find . -type f|wc -l)) convert {1} {2}.png

Alternative version:

  find . -type f | sort | parallel convert {} {#}.png

=head2 EXAMPLE: Use a table as input

Content of table_file.tsv:

  foo<TAB>bar
  baz <TAB> quux

To run:

  cmd -o bar -i foo
  cmd -o quux -i baz

you can run:

  parallel -a table_file.tsv --colsep '\t' cmd -o {2} -i {1}

Note: The default for GNU B<parallel> is to remove the spaces around
the columns. To keep the spaces:

  parallel -a table_file.tsv --trim n --colsep '\t' cmd -o {2} -i {1}

=head2 EXAMPLE: Output to database

GNU B<parallel> can output to a database table and a CSV-file:

  dburl=csv:///%2Ftmp%2Fmydir
  dbtableurl=$dburl/mytable.csv
  parallel --sqlandworker $dbtableurl seq ::: {1..10}

It is rather slow and takes up a lot of CPU time because GNU
B<parallel> parses the whole CSV file for each update.

A better approach is to use an SQLite database and then convert that
to CSV:

  dburl=sqlite3:///%2Ftmp%2Fmy.sqlite
  dbtableurl=$dburl/mytable
  parallel --sqlandworker $dbtableurl seq ::: {1..10}
  sql $dburl '.headers on' '.mode csv' 'SELECT * FROM mytable;'

This takes around a second per job.

If you have access to a real database system, such as PostgreSQL, it
is even faster:

  dburl=pg://user:pass@host/mydb
  dbtableurl=$dburl/mytable
  parallel --sqlandworker $dbtableurl seq ::: {1..10}
  sql $dburl \
    "COPY (SELECT * FROM mytable) TO stdout DELIMITER ',' CSV HEADER;"

Or MySQL:

  dburl=mysql://user:pass@host/mydb
  dbtableurl=$dburl/mytable
  parallel --sqlandworker $dbtableurl seq ::: {1..10}
  sql -p -B $dburl "SELECT * FROM mytable;" > mytable.tsv
  perl -pe 's/"/""/g; s/\t/","/g; s/^/"/; s/$/"/;
            %s=("\\" => "\\", "t" => "\t", "n" => "\n");
            s/\\([\\tn])/$s{$1}/g;' mytable.tsv

=head2 EXAMPLE: Output to CSV-file for R

If you have no need for the advanced job distribution control that a
database provides, but you simply want output into a CSV file that
you can read into R or LibreCalc, then you can use B<--results>:

  parallel --results my.csv seq ::: 10 20 30

  > mydf <- read.csv("my.csv");
  > print(mydf[2,])
  > write(as.character(mydf[2,c("Stdout")]),'')

=head2 EXAMPLE: Use XML as input

The show Aflyttet on Radio 24syv publishes an RSS feed with their
audio podcasts on:
http://arkiv.radio24syv.dk/audiopodcast/channel/4466232

Using B<xpath> you can extract the URLs for 2019 and download them
using GNU B<parallel>:

  wget -O - http://arkiv.radio24syv.dk/audiopodcast/channel/4466232 | \
    xpath -e "//pubDate[contains(text(),'2019')]/../enclosure/@url" | \
    parallel -u wget '{= s/ url="//; s/"//; =}'

=head2 EXAMPLE: Run the same command 10 times

If you want to run the same command with the same arguments 10 times
in parallel you can do:

  seq 10 | parallel -n0 my_command my_args

=head2 EXAMPLE: Working as cat | sh. Resource inexpensive jobs and evaluation

GNU B<parallel> can work similarly to B<cat | sh>.

A resource inexpensive job is a job that takes very little CPU, disk
I/O and network I/O. Ping is an example of a resource inexpensive
job. wget is too - if the webpages are small.

The content of the file jobs_to_run:

  ping -c 1 10.0.0.1
  wget http://example.com/status.cgi?ip=10.0.0.1
  ping -c 1 10.0.0.2
  wget http://example.com/status.cgi?ip=10.0.0.2
  ...
  ping -c 1 10.0.0.255
  wget http://example.com/status.cgi?ip=10.0.0.255

To run 100 processes simultaneously do:

  parallel -j 100 < jobs_to_run

As there is no I<command>, the jobs will be evaluated by the shell.
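
Because each line is evaluated by the shell, the lines in jobs_to_run
may themselves contain shell syntax such as pipes and redirections:

  echo 'ping -c 1 10.0.0.1 | tail -1 >> ping.log' | parallel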

=head2 EXAMPLE: Call program with FASTA sequence

FASTA files have the format:

  >Sequence name1
  sequence
  sequence continued
  >Sequence name2
  sequence
  sequence continued
  more sequence

To call B<myprog> with the sequence as argument run:

  cat file.fasta |
    parallel --pipe -N1 --recstart '>' --rrs \
      'read a; echo Name: "$a"; myprog $(tr -d "\n")'

=head2 EXAMPLE: Call program with interleaved FASTQ records

FASTQ files have the format:

  @M10991:61:000000000-A7EML:1:1101:14011:1001 1:N:0:28
  CTCCTAGGTCGGCATGATGGGGGAAGGAGAGCATGGGAAGAAATGAGAGAGTAGCAAGG
  +
  #8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF

Interleaved FASTQ starts with a line like these:

  @HWUSI-EAS100R:6:73:941:1973#0/1
  @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
  @EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1

where '/1' and ' 1:' determine this is read 1.

This will cut big.fq into one chunk per CPU thread and pass it on
stdin (standard input) to the program fastq-reader:

  parallel --pipe-part -a big.fq --block -1 --regexp \
    --recend '\n' --recstart '@.*(/1| 1:.*)\n[A-Za-z\n\.~]' \
    fastq-reader

=head2 EXAMPLE: Processing a big file using more CPUs

To process a big file or some output you can use B<--pipe> to split
up the data into blocks and pipe the blocks into the processing
program.

If the program is B<gzip -9> you can do:

  cat bigfile | parallel --pipe --recend '' -k gzip -9 > bigfile.gz

This will split B<bigfile> into blocks of 1 MB and pass that to
B<gzip -9> in parallel. One B<gzip> will be run per CPU. The output
of B<gzip -9> will be kept in order and saved to B<bigfile.gz>.

B<gzip> works fine if the output is appended, but some processing
does not work like that - for example sorting. For this GNU
B<parallel> can put the output of each command into a file. This will
sort a big file in parallel:

  cat bigfile | parallel --pipe --files sort |\
    parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort

Here B<bigfile> is split into blocks of around 1MB, each block ending
in '\n' (which is the default for B<--recend>). Each block is passed
to B<sort> and the output from B<sort> is saved into files. These
files are passed to the second B<parallel> that runs B<sort -m> on
the files before it removes the files. The output is saved to
B<bigfile.sort>.

GNU B<parallel>'s B<--pipe> maxes out at around 100 MB/s because
every byte has to be copied through GNU B<parallel>. But if
B<bigfile> is a real (seekable) file GNU B<parallel> can by-pass the
copying and send the parts directly to the program:

  parallel --pipe-part --block 100m -a bigfile --files sort |\
    parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort

=head2 EXAMPLE: Grouping input lines

When processing with B<--pipe> you may have lines grouped by a
value. Here is I<my.csv>:

  Transaction Customer Item
      1       a        53
      2       b        65
      3       b        82
      4       c        96
      5       c        67
      6       c        13
      7       d        90
      8       d        43
      9       d        91
      10      d        84
      11      e        72
      12      e        102
      13      e        63
      14      e        56
      15      e        74

Let us assume you want GNU B<parallel> to process each customer. In
other words: You want all the transactions for a single customer to
be treated as a single record.

To do this we preprocess the data with a program that inserts a
record separator before each customer (column 2 = $F[1]). Here we
first make a 50 character random string, which we then use as the
separator:

  sep=`perl -e 'print map { ("a".."z","A".."Z")[rand(52)] } (1..50);'`
  cat my.csv | \
    perl -ape '$F[1] ne $l and print "'$sep'"; $l = $F[1]' | \
    parallel --recend $sep --rrs --pipe -N1 wc

If your program can process multiple customers replace B<-N1> with a
reasonable B<--blocksize>.

=head2 EXAMPLE: Running more than 250 jobs workaround

If you need to run a massive amount of jobs in parallel, then you
will likely hit the filehandle limit which is often around 250 jobs.
If you are super user you can raise the limit in
/etc/security/limits.conf but you can also use this workaround. The
filehandle limit is per process. That means that if you just spawn
more GNU B<parallel>s then each of them can run 250 jobs. This will
spawn up to 2500 jobs:

  cat myinput |\
    parallel --pipe -N 50 --round-robin -j50 parallel -j50 your_prg

This will spawn up to 62500 jobs (use with caution - you need 64 GB
RAM to do this, and you may need to increase
/proc/sys/kernel/pid_max):

  cat myinput |\
    parallel --pipe -N 250 --round-robin -j250 parallel -j250 your_prg

=head2 EXAMPLE: Working as mutex and counting semaphore

The command B<sem> is an alias for B<parallel --semaphore>.

A counting semaphore will allow a given number of jobs to be started
in the background. When that number of jobs are running in the
background, GNU B<sem> will wait for one of these to complete before
starting another command. B<sem --wait> will wait for all jobs to
complete.

Run 10 jobs concurrently in the background:

  for i in *.log ; do
    echo $i
    sem -j10 gzip $i ";" echo done
  done
  sem --wait

A mutex is a counting semaphore allowing only one job to run. This
will edit the file I<myfile> and prepend lines with the numbers 1 to
3 to the file:

  seq 3 | parallel sem sed -i -e '1i{}' myfile

As I<myfile> can be very big it is important that only one process
edits the file at a time.

Name the semaphore to have multiple different semaphores active at
the same time:

  seq 3 | parallel sem --id mymutex sed -i -e '1i{}' myfile

=head2 EXAMPLE: Mutex for a script

Assume a script is called from cron or from a web service, but only
one instance can be run at a time. With B<sem> and B<--shebang-wrap>
the script can be made to wait for other instances to finish. Here in
B<bash>:

  #!/usr/bin/sem --shebang-wrap -u --id $0 --fg /bin/bash

  echo This will run
  sleep 5
  echo exclusively

Here B<perl>:

  #!/usr/bin/sem --shebang-wrap -u --id $0 --fg /usr/bin/perl

  print "This will run ";
  sleep 5;
  print "exclusively\n";

Here B<python>:

  #!/usr/local/bin/sem --shebang-wrap -u --id $0 --fg /usr/bin/python

  import time
  print "This will run ";
  time.sleep(5)
  print "exclusively";

=head2 EXAMPLE: Start editor with file names from stdin (standard input)

You can use GNU B<parallel> to start interactive programs like emacs
or vi:

  cat filelist | parallel --tty -X emacs
  cat filelist | parallel --tty -X vi

If there are more files than will fit on a single command line, the
editor will be started again with the remaining files.

=head2 EXAMPLE: Running sudo

B<sudo> requires a password to run a command as root. It caches the
access, so you only need to enter the password again if you have not
used B<sudo> for a while.

The command:

  parallel sudo echo ::: This is a bad idea

is no good, as you would be prompted for the sudo password for each
of the jobs. Instead do:

  sudo parallel echo ::: This is a good idea

This way you only have to enter the sudo password once.

=head2 EXAMPLE: Run ping in parallel

B<ping> prints out statistics when killed with CTRL-C.

Unfortunately, CTRL-C will also normally kill GNU B<parallel>.

But by using B<--open-tty> and ignoring SIGINT you can get the wanted
effect:

  parallel -j0 --open-tty --lb --tag ping '{= $SIG{INT}=sub {} =}' \
    ::: 1.1.1.1 8.8.8.8 9.9.9.9 21.21.21.21 80.80.80.80 88.88.88.88

B<--open-tty> will make the B<ping>s receive SIGINT (from CTRL-C).
CTRL-C will not kill GNU B<parallel>, so that will only exit after
B<ping> is done.

=head2 EXAMPLE: GNU Parallel as queue system/batch manager

GNU B<parallel> can work as a simple job queue system or batch
manager. The idea is to put the jobs into a file and have GNU
B<parallel> read from that continuously. As GNU B<parallel> will stop
at end of file we use B<tail> to continue reading:

  true >jobqueue; tail -n+0 -f jobqueue | parallel

To submit your jobs to the queue:

  echo my_command my_arg >> jobqueue

You can of course use B<-S> to distribute the jobs to remote
computers:

  true >jobqueue; tail -n+0 -f jobqueue | parallel -S ..

Output will only be printed when the next input is read after a job
has finished: So you need to submit a job after the first has
finished to see the output from the first job.

If you keep this running for a long time, jobqueue will grow. A way
of removing the jobs already run is by making GNU B<parallel> stop
when it hits a special value and then restart. To use B<--eof> to
make GNU B<parallel> exit, B<tail> also needs to be forced to exit:

  true >jobqueue;
  while true; do
    tail -n+0 -f jobqueue |
      (parallel -E StOpHeRe -S ..; echo GNU Parallel is now done;
       perl -e 'while(<>){/StOpHeRe/ and last};print <>' jobqueue > j2;
       (seq 1000 >> jobqueue &);
       echo Done appending dummy data forcing tail to exit)
    echo tail exited;
    mv j2 jobqueue
  done

In some cases you can run on more CPUs and computers during the
night:

  # Day time
  echo 50% > jobfile
  cp day_server_list ~/.parallel/sshloginfile
  # Night time
  echo 100% > jobfile
  cp night_server_list ~/.parallel/sshloginfile
  tail -n+0 -f jobqueue | parallel --jobs jobfile -S ..

GNU B<parallel> discovers if B<jobfile> or B<~/.parallel/sshloginfile>
changes.

=head2 EXAMPLE: GNU Parallel as dir processor

If you have a dir in which users drop files that need to be processed
you can do this on GNU/Linux (if you know what B<inotifywait> is
called on other platforms file a bug report):

  inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |\
    parallel -u echo

This will run the command B<echo> on each file put into B<my_dir> or
subdirs of B<my_dir>.

You can of course use B<-S> to distribute the jobs to remote
computers:

  inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |\
    parallel -S .. -u echo

If the files to be processed are in a tar file then unpacking one
file and processing it immediately may be faster than first unpacking
all files. Set up the dir processor as above and unpack into the dir.

Using GNU B<parallel> as dir processor has the same limitations as
using GNU B<parallel> as queue system/batch manager.

=head2 EXAMPLE: Locate the missing package

If you have downloaded source and tried compiling it, you may have
seen:

  $ ./configure
  [...]
  checking for something.h... no
  configure: error: "libsomething not found"

Often it is not obvious which package you should install to get that
file. Debian has `apt-file` to search for a file. `tracefile` from
https://codeberg.org/tange/tangetools can tell which files a program
tried to access. In this case we are interested in one of the last
files:

  $ tracefile -un ./configure | tail | parallel -j0 apt-file search

=head1 AUTHOR

When using GNU B<parallel> for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login:
The USENIX Magazine, February 2011:42-47.

This helps funding further development; and it won't cost you a cent.
If you pay 10000 EUR you should feel free to use GNU Parallel without
citing.

Copyright (C) 2007-10-18 Ole Tange, http://ole.tange.dk

Copyright (C) 2008-2010 Ole Tange, http://ole.tange.dk

Copyright (C) 2010-2024 Ole Tange, http://ole.tange.dk and Free
Software Foundation, Inc.

Parts of the manual concerning B<xargs> compatibility are inspired by
the manual of B<xargs> from GNU findutils 4.4.2.

=head1 LICENSE

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
at your option any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.

=head2 Documentation license I

Permission is granted to copy, distribute and/or modify this
documentation under the terms of the GNU Free Documentation License,
Version 1.3 or any later version published by the Free Software
Foundation; with no Invariant Sections, with no Front-Cover Texts,
and with no Back-Cover Texts.  A copy of the license is included in
the file LICENSES/GFDL-1.3-or-later.txt.

=head2 Documentation license II

You are free:

=over 9

=item B<to Share>

to copy, distribute and transmit the work

=item B<to Remix>

to adapt the work

=back

Under the following conditions:

=over 9

=item B<Attribution>

You must attribute the work in the manner specified by the author or
licensor (but not in any way that suggests that they endorse you or
your use of the work).

=item B<Share Alike>

If you alter, transform, or build upon this work, you may distribute
the resulting work only under the same, similar or a compatible
license.

=back

With the understanding that:

=over 9

=item B<Waiver>

Any of the above conditions can be waived if you get permission from
the copyright holder.

=item B<Public Domain>

Where the work or any of its elements is in the public domain under
applicable law, that status is in no way affected by the license.

=item B<Other Rights>

In no way are any of the following rights affected by the license:

=over 2

=item *

Your fair dealing or fair use rights, or other applicable
copyright exceptions and limitations;

=item *

The author's moral rights;

=item *

Rights other persons may have either in the work itself or in
how the work is used, such as publicity or privacy rights.

=back

=back

=over 9

=item B<Notice>

For any reuse or distribution, you must make clear to others the
license terms of this work.

=back

A copy of the full license is included in the file
LICENSES/CC-BY-SA-4.0.txt.

=head1 SEE ALSO

B<parallel>(1), B<parallel_tutorial>(7), B<env_parallel>(1),
B<parset>(1), B<parsort>(1), B<parallel_alternatives>(7),
B<parallel_design>(7), B<niceload>(1), B<sql>(1), B<ssh>(1),
B<ssh-agent>(1), B<sshpass>(1), B<ssh-copy-id>(1), B<rsync>(1)

=cut