User avatar
Posts: 4830
Joined: Fri Jul 22, 2016 9:09 pm
Location: Eberbach, Germany
Contact: Website Twitter YouTube

Parallel recursive bash mergesort / fun / challenge

Wed Sep 15, 2021 7:23 pm

I recently learned how to use Linux "comm" command to merge two sorted lists:

And in same thread, how to use Linux "tee" command to split input onto more than one pipes.

I have not found a solution yet, how to just use tee and comm to implement mergesort (in bash) -- that is the challenge (for you?).

Linux sort command is so much faster, bash implementation of mergesort can only be done for fun.
I did it, 26 lines, and only 19 if you eliminate empty lines: ... a5cccada44

My Pi400 directory listing has 495 lines, and Linux "sort" needs 22ms to sort that:

Code: Select all

pi@raspberrypi400:~ $ time (ls -l | sort >/dev/null)

real	0m0.022s
user	0m0.019s
sys	0m0.008s
pi@raspberrypi400:~ $ ls -l | wc
    495    4471   30713
pi@raspberrypi400:~ $ 

I created a sequential version of mergesort bash script, minimal diff:

Code: Select all

pi@raspberrypi400:~ $ diff mergesort mergesortn
< if [[ n1 -gt 1 ]]; then (cat ${tms}0 | $0 ${1}0)& else cp ${tms}0 ${tsm}0& fi
< pid1=$!
< if [[ n2 -gt 1 ]]; then (cat ${tms}1 | $0 ${1}1)& else cp ${tms}1 ${tsm}1& fi
< pid2=$!
< wait $pid1 $pid2
> if [[ n1 -gt 1 ]]; then (cat ${tms}0 | $0 ${1}0); else cp ${tms}0 ${tsm}0; fi
> if [[ n2 -gt 1 ]]; then (cat ${tms}1 | $0 ${1}1); else cp ${tms}1 ${tsm}1; fi
pi@raspberrypi400:~ $ 
I like the "wait command" waiting until both recursive parallel processes have completed.

As expected, the parallel version runs twice as fast (and 344 times slower than "sort" command):

Code: Select all

pi@raspberrypi400:~ $ time (ls -l | ./mergesort >/dev/null)

real	0m7.560s
user	0m13.322s
sys	0m9.396s
pi@raspberrypi400:~ $ time (ls -l | ./mergesortn >/dev/null)

real	0m14.317s
user	0m10.259s
sys	0m6.636s
pi@raspberrypi400:~ $ 

What the script does is, splitting odd and even lines from input into two files, and then call mergesort recursively for those two files. Finally, after both files got sorted recursively, use above mentioned "comm" command solution to merge the two sorted files into final sorted output. Recursion ends when a file contains 1 line only (that is sorted by definition).

A lot of files get created. All file sizes added on a single level add up to original file size. For N line input there are log₂(N) levels. Files are "/tmp/ms.{0,1}^k" with 1<=k<=log₂(N) for halved input files, and "/tmp/sm.{0,1}^k" containing recursively sorted results. All unneeded files are deleted on the fly by mergesort.

The parallel execution had a nice side effect -- since recursive call is done by "$0 ...", the call is a new process (horrible performance wise). But it allows to count the number of running mergesort instance easily using "ps"! This command reports one more than the number of mergesort processes continuously:

Code: Select all

while true; do ps -ef | grep mergesort | wc -l; done

I did run the parallel version of bash mergesort and did run mergesort process counting in parallel -- more than 900(!) processes in between ;-)
Peek_2021-09-15_17-01.gif (112.92 KiB) Viewed 466 times

I found zsh version of mergesort on rosettacode: ... #UnixPipes

It runs only with zsh, uses "sort -m" for merging two sorted files (I don't like that).
And it does not handle lines correctly, I want this:

Code: Select all

pi@raspberrypi400:~ $ echo -e "4 1\n2 3" | ./mergesort 
2 3
4 1
pi@raspberrypi400:~ $ 
And not this:

Code: Select all

pi@raspberrypi400:~ $ echo -e "4 1\n2 3" | ./zmergesort 
pi@raspberrypi400:~ $

Return to “General programming discussion”