Re: Why Avoid BASIC on RPi?
Ah, Steve, I was wondering what you meant by using big integers via software interrupts. It seems instead of having shared libraries one could just plug the functionality into the OS itself. Cunning.
If you present the code I'd sure be tempted to boot RISC OS and try it out.
Memory in C++ is a leaky abstraction.
Re: Why Avoid BASIC on RPi?
Although the Karatsuba algorithm was published in 1962, it is possible the size of the big numbers expected for programs written in BASIC on RISC OS in the 80s was sufficiently small that the asymptotic superiority of Karatsuba multiplication wasn't a clear win over the simplicity of the O(n^2) algorithm. At the same time, computing the 4784969th Fibonacci number on a Pi in 5 minutes using an O(n^2) algorithm, even coded in assembly, sounds almost too good to be true. It would be interesting to know how it is done.
Steve Drain wrote: ↑Fri Jan 04, 2019 5:36 pm
I may do, but it only applies to BBC BASIC running on a RISC OS machine with the Numbers module loaded.
To clarify for anyone unfamiliar with RISC OS, modules are operating system extensions. They can be written in assembler, C or, just about, in compiled BASIC. They provide software interrupts (SWIs) which can be called by any language. In this case the Numbers module was written by a third party quite a while ago and provides a fairly comprehensive suite of big-number operations.
You mistake my meaning. An assembler version would call the SWIs in the same way, but the overheads would be much reduced.
ejolson wrote:A version with inline assembler instead of SWIs also sounds nice. Do you know whether the assembler implements an O(n^2) algorithm for the multiplication or one with greater asymptotic efficiency such as Karatsuba?
I do not know how multiplication is done. I have the source, but have not examined it. I do not think it is Karatsuba, because it is a development of algorithms originally written in the 1980s.
Re: Why Avoid BASIC on RPi?
There have been faster methods over the past 10 years, one by De, Kurur, Saha and Saptharishi but it doesn't get faster than the Schönhage–Strassen FFT method (fastest one currently in GMP) until your numbers have 10 ** (10 ** 4797) bits!
ejolson wrote: ↑Fri Jan 04, 2019 6:44 pm
Although the Karatsuba algorithm was published in 1962, it is possible the size of the big numbers expected for programs written in BASIC on RISC OS in the 80s was sufficiently small that the asymptotic superiority of Karatsuba multiplication wasn't a clear win over the simplicity of the O(n^2) algorithm. At the same time, computing the 4784969th Fibonacci number on a Pi in 5 minutes using an O(n^2) algorithm, even coded in assembly, sounds almost too good to be true. It would be interesting to know how it is done.
She who travels light — forgot something.
Please note that my name doesn't start with the @ character so can people please stop writing it as if it does!
Re: Why Avoid BASIC on RPi?
Interesting. I looked up De, Kurur, Saha and Saptharishi big-number multiplication and found
Paeryn wrote: ↑Sat Jan 05, 2019 4:19 am
There have been faster methods over the past 10 years, one by De, Kurur, Saha and Saptharishi but it doesn't get faster than the Schönhage–Strassen FFT method (fastest one currently in GMP) until your numbers have 10 ** (10 ** 4797) bits!
ejolson wrote: ↑Fri Jan 04, 2019 6:44 pm
Although the Karatsuba algorithm was published in 1962, it is possible the size of the big numbers expected for programs written in BASIC on RISC OS in the 80s was sufficiently small that the asymptotic superiority of Karatsuba multiplication wasn't a clear win over the simplicity of the O(n^2) algorithm. At the same time, computing the 4784969th Fibonacci number on a Pi in 5 minutes using an O(n^2) algorithm, even coded in assembly, sounds almost too good to be true. It would be interesting to know how it is done.
I remember when 64K was the size of the universe. I guess that goes to show how important efficient expressivity is when coding asymptotically fast big-number multiplication routines. Maybe if Christoph had written the algorithm in line-numbered BASIC the crossover point would have been sooner, though it's possible he was avoiding BASIC. I hear the optimiser results in too much volatility. So how big would n have to be for the nth Fibonacci number to fill the whole universe?
Christoph Lüders wrote:In this diploma thesis, results of an implementation of DKSS multiplication are presented: runtime is about 30 times larger than SSA, while memory requirements are about 3.75 times higher than SSA. A possible crossover point is estimated to be out of reach even if we utilized the whole universe for computer memory.
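Curiously, the "fill the universe" question has a quick back-of-envelope answer via Binet's formula, since F(n) ≈ φ^n/√5 means the decimal digit count grows linearly in n. A small sketch (fibDigits is my own name, not something from the thread):

```cpp
#include <cmath>

// Decimal digit count of F(n) via Binet's formula: F(n) ~ phi^n / sqrt(5),
// so digits(F(n)) = floor(n*log10(phi) - log10(sqrt(5))) + 1, valid for n >= 2.
long long fibDigits(long long n) {
    const double phi = (1.0 + std::sqrt(5.0)) / 2.0;
    return (long long)std::floor(n * std::log10(phi) - std::log10(std::sqrt(5.0))) + 1;
}
```

Here fibDigits(4784969) gives 1000000, matching the premise of the thread that F(4784969) is the first Fibonacci number with a million digits. With roughly 10^80 atoms in the observable universe and one digit per atom, n would need to be on the order of 10^80 / log10(φ), i.e. around 5e80.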
Re: Why Avoid BASIC on RPi?
I was having a play with our fibo() under Google Benchmark and came across an interesting result.
First we make a benchmark for fibo():
Then running this produces an output like so:
Or one can get a simple CSV output.
I did the same for ejolson's formulation of the fibo doubling that I have implemented in C++. Hmm... that's odd, ejolson's formulation seems to be faster up until some suitably large N. So I plot both results:
Code: Select all
// Define fibo benchmark
static void BM_fibo(benchmark::State& state) {
    int n = state.range(0);
    for (auto _ : state) {
        if (!fiboInitialized) {
            fiboInit();
            fiboInitialized = true;
        }
        benchmark::DoNotOptimize(fibo(n));
    }
    state.SetComplexityN(n);  // needed for ->Complexity() to fit a Big-O curve
}
BENCHMARK(BM_fibo)
    ->RangeMultiplier(2)
    ->Range(1<<4, 1<<25)
    ->Complexity()
    ->Unit(benchmark::kMillisecond);
Code: Select all
$ ./bench
2019-01-05 20:40:56
Running ./bench
Run on (8 X 3401 MHz CPU s)
Load Average: 0.52, 0.58, 0.59

---------------------------------------------------------
Benchmark                 Time             CPU  Iterations
---------------------------------------------------------

BM_fibo/16 0.001 ms 0.001 ms 640000
BM_fibo/32 0.002 ms 0.002 ms 320000
BM_fibo/64 0.004 ms 0.004 ms 154483
BM_fibo/128 0.009 ms 0.009 ms 74667
BM_fibo/256 0.017 ms 0.017 ms 40727
BM_fibo/512 0.035 ms 0.035 ms 20364
BM_fibo/1024 0.070 ms 0.070 ms 8960
BM_fibo/2048 0.140 ms 0.138 ms 4978
BM_fibo/4096 0.284 ms 0.289 ms 2489
BM_fibo/8192 0.578 ms 0.572 ms 1120
BM_fibo/16384 1.18 ms 1.17 ms 560
BM_fibo/32768 2.49 ms 2.46 ms 280
BM_fibo/65536 5.20 ms 5.16 ms 112
BM_fibo/131072 11.4 ms 11.5 ms 64
BM_fibo/262144 25.5 ms 25.7 ms 28
BM_fibo/524288 59.6 ms 59.7 ms 11
BM_fibo/1048576 144 ms 144 ms 5
BM_fibo/2097152 364 ms 367 ms 2
BM_fibo/4194304 950 ms 938 ms 1
BM_fibo/8388608 2551 ms 2562 ms 1
BM_fibo/16777216 7095 ms 7094 ms 1
BM_fibo/33554432 20137 ms 20141 ms 1
Last edited by Heater on Sat Jan 05, 2019 9:36 pm, edited 3 times in total.
Re: Why Avoid BASIC on RPi?
Perhaps this plot shows what is going on more clearly at the high end:
Note 1: I have disabled the memoization of my fibo in all these benchmarks, for an apples-to-apples comparison.
Note 2: These are using my same bint big integer class to get the arithmetic done.
Re: Why Avoid BASIC on RPi?
Depending on the code, n=2^k or n=2^k+1 are possible special cases for the doubling formulas. It would be interesting to see how straight the resulting lines are when other values of n are selected, perhaps randomly.
Heater wrote: ↑Sat Jan 05, 2019 7:29 pm
Perhaps this plot shows what is going on more clearly at the high end:
Note 1: I have disabled the memoization of my fibo in all these benchmarks, for an apples-to-apples comparison.
Note 2: These are using my same bint big integer class to get the arithmetic done.
fibo_comparison_3.png
Do either lines have the slope log(3)/log(2)? Maybe plot C*n^(log(3)/log(2)) for some suitably chosen value of C to compare.
Re: Why Avoid BASIC on RPi?
Hmm... If I understand you correctly....
We can look at the raw data output of the benchmark, at the high end:
Code: Select all
..
"BM_fibo/262144",26,25.634,25.2404,ms,,,,,
"BM_fibo/524288",11,59.3675,59.6591,ms,,,,,
"BM_fibo/1048576",5,143.325,143.75,ms,,,,,
"BM_fibo/2097152",2,364.529,359.375,ms,,,,,
"BM_fibo/4194304",1,945.639,953.125,ms,,,,,
"BM_fibo/8388608",1,2540.48,2546.88,ms,,,,,
"BM_fibo/16777216",1,7060.5,7046.88,ms,,,,,
"BM_fibo/33554432",1,20041.1,20031.2,ms,,,,,
...
"BM_fiboEjOlson/262144",64,11.9017,12.207,ms,,,,,
"BM_fiboEjOlson/524288",19,36.0646,35.3618,ms,,,,,
"BM_fiboEjOlson/1048576",6,108.45,109.375,ms,,,,,
"BM_fiboEjOlson/2097152",2,326.485,328.125,ms,,,,,
"BM_fiboEjOlson/4194304",1,990.295,1000,ms,,,,,
"BM_fiboEjOlson/8388608",1,2983.46,2953.12,ms,,,,,
"BM_fiboEjOlson/16777216",1,8897.75,8906.25,ms,,,,,
"BM_fiboEjOlson/33554432",1,26804.4,26796.9,ms,,,,,
We can calculate the slopes of the lines on a log/log graph as:

Heater curve:
> (Math.log(20031.2) - Math.log(25.2404)) / (Math.log(33554432) - Math.log(262144))
1.3760426230727143
ejolson curve:
> (Math.log(26796.9) - Math.log(12.207)) / (Math.log(33554432) - Math.log(262144))
1.58573453007763
Meanwhile log(3)/log(2) is:
> Math.log10(3) / Math.log10(2)
1.584962500721156
Hmm... bang on for your curve. A bit high for mine.
What does it all mean?
I'll try and find time to test with random numbers later.
Re: Why Avoid BASIC on RPi?
I'm not sure which exact codes you are comparing, but my guess is
Heater wrote: ↑Sun Jan 06, 2019 8:43 am
Hmm... If I understand you correctly....
We can look at the raw data output of the benchmark, at the high end:
Code: Select all
..
"BM_fibo/262144",26,25.634,25.2404,ms,,,,,
"BM_fibo/524288",11,59.3675,59.6591,ms,,,,,
"BM_fibo/1048576",5,143.325,143.75,ms,,,,,
"BM_fibo/2097152",2,364.529,359.375,ms,,,,,
"BM_fibo/4194304",1,945.639,953.125,ms,,,,,
"BM_fibo/8388608",1,2540.48,2546.88,ms,,,,,
"BM_fibo/16777216",1,7060.5,7046.88,ms,,,,,
"BM_fibo/33554432",1,20041.1,20031.2,ms,,,,,
...
"BM_fiboEjOlson/262144",64,11.9017,12.207,ms,,,,,
"BM_fiboEjOlson/524288",19,36.0646,35.3618,ms,,,,,
"BM_fiboEjOlson/1048576",6,108.45,109.375,ms,,,,,
"BM_fiboEjOlson/2097152",2,326.485,328.125,ms,,,,,
"BM_fiboEjOlson/4194304",1,990.295,1000,ms,,,,,
"BM_fiboEjOlson/8388608",1,2983.46,2953.12,ms,,,,,
"BM_fiboEjOlson/16777216",1,8897.75,8906.25,ms,,,,,
"BM_fiboEjOlson/33554432",1,26804.4,26796.9,ms,,,,,
We can calculate the slopes of the lines on a log/log graph as:
Heater curve:
> (Math.log(20031.2) - Math.log(25.2404)) / (Math.log(33554432) - Math.log(262144))
1.3760426230727143
ejolson curve:
> (Math.log(26796.9) - Math.log(12.207)) / (Math.log(33554432) - Math.log(262144))
1.58573453007763
Meanwhile log(3)/log(2) is:
> Math.log10(3) / Math.log10(2)
1.584962500721156
Hmm... bang on for your curve. A bit high for mine.
What does it all mean?
I'll try and find time to test with random numbers later.
1. The flatter curve has a significant O(n) term that comes from memory allocations.
2. The steeper curve, if it uses the same recursion as the original fibonacci.c program, has a constant C in the asymptotically dominant term C*n^(log(3)/log(2)) that is 2 to 4 times larger than it needs to be.
Re: Why Avoid BASIC on RPi?
I'm comparing my C++ interpretation of your C fibo() algorithm vs my C++ interpretation of Paeryn's Haskell FiboFast example.
The actual codes used to obtain these results are below. They are now in their own fibo.cpp file here:
https://github.com/ZiCog/fibo_4784969/b ... B/fibo.cpp (Note: The version in the repo is using memoization, what I'm running here does not). I think my C++ interpretation of your fibo is the same as the current C version.
Do your comments still apply to these codes?:
Code: Select all
// This Fibonacci version derived from ejolson's doubling formula C example.
// Note: bint, zero, one, memo and isEven() are defined elsewhere in fibo.cpp.
bint a;
bint b;

void fiboEjOlson(const int n) {
    if (n == 0) {
        a = zero;
        b = one;
        return;
    }
    fiboEjOlson(n / 2);
    bint ta = a;
    if (isEven(n)) {
        a = ta * ((b + b) - ta);
        b = (ta * ta) + (b * b);
    } else {
        a = (a * a) + (b * b);
        b = b * ((ta + ta) + b);
    }
}

void fiboInit() {
    // Initialize the fibo's memo.
    memo.clear();
    memo[0] = zero;
    memo[1] = one;
    memo[2] = one;
}
// This Fibonacci version derived from Paeryn's Haskell example.
const bint fibo(int n) {
    if (memo.find(n) != memo.end()) {
        return memo[n];
    }
    int k = n / 2;
    const bint a = fibo(k);
    const bint b = fibo(k - 1);
    if (isEven(n)) {
        return a * (two * b + a);
    }
    bint twoa = two * a;
    if ((n % 4) == 1) {
        return (twoa + b) * (twoa - b) + two;
    }
    return (twoa + b) * (twoa - b) - two;
}
Re: Why Avoid BASIC on RPi?
Yes, that is the version of the (a,b)-linear recurrence which is 2 to 4 times slower than it needs to be. To provide a frame of reference, the original C program has not been improved since it was first posted.
A more optimal way of writing the same doubling formula can be found in the FreeBASIC code missing the Karatsuba algorithm and, of course, the program written in line-numbered BASIC. Incidentally, it made me happy to hear you describe that particular program as strangely beautiful.
Re: Why Avoid BASIC on RPi?
Hmm... so I thought I'd transcribe that FreeBASIC fibo to C++ and see how it flies. So far it's not working.
ejolson wrote:A more optimal way of writing the same doubling formula can be found in the FreeBASIC code...
But isn't there something wrong with this in fibowork:
Code: Select all
sub fibowork(n as integer,a as bignum,b as bignum)
if n=0 then
a.n=0:b.n=1:b.d(1)=1
return
end if
a.n=1:a.d(1)=0
Aren't we to be setting a to zero there?
Re: Why Avoid BASIC on RPi?
Yes, a is zero. The code trims leading zeros from numbers, so a number with zero digits is also zero.
Heater wrote: ↑Sun Jan 06, 2019 8:45 pm
Hmm... so I thought I'd transcribe that FreeBASIC fibo to C++ and see how it flies. So far it's not working.
But isn't there something wrong with this in fibowork:
Code: Select all
sub fibowork(n as integer,a as bignum,b as bignum)
if n=0 then
a.n=0:b.n=1:b.d(1)=1
return
end if
a.n=1:a.d(1)=0
Aren't we to be setting a to zero there?
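For what it's worth, the zero convention ejolson describes can be sketched in C++ along these lines (a toy illustration with invented names, not the bint or bignum code from the thread):

```cpp
#include <cstdint>
#include <vector>

// Toy digit vector, least significant limb first. Trimming leading zeros
// makes the canonical representation of zero an empty digit vector, which
// is the convention the FreeBASIC code relies on when it sets a.n=0.
std::vector<uint32_t> trim(std::vector<uint32_t> d) {
    while (!d.empty() && d.back() == 0) d.pop_back();
    return d;  // zero digits == the number zero
}

bool isZero(const std::vector<uint32_t>& d) {
    return trim(d).empty();
}
```

Under this convention {0}, {0,0} and {} all denote zero, and trim() maps each of them to the canonical empty form.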
Re: Why Avoid BASIC on RPi?
Ah, OK. In my C++ I treat zero-length bints as Not a Number (BINT_NULL).
Anyway: Yay C++ fibo(4784969) is 15% faster again!
So I transposed the fibo of the FreeBASIC version into C++. It's now about 15% faster than previously. More than twice as fast as my old version with memoization removed. Twice as fast as the previous ejolson inspired version.
I'll post the code and do some benchmarking on it tomorrow but preliminary results look like this:
Note: The BM_fibo result below is without memoization. Even with memoization this new version beats it by 100ms or so!
Code: Select all
$ ./bench
2019-01-07 04:21:42
Running ./bench
Run on (8 X 3401 MHz CPU s)
Load Average: 0.52, 0.58, 0.59

---------------------------------------------------------
Benchmark                 Time             CPU  Iterations
---------------------------------------------------------

BM_fibo/4784969 1162 ms 1156 ms 1
BM_fiboEjOlson/4784969 1196 ms 1219 ms 1
BM_fiboEjOlsonNew/4784969 528 ms 516 ms 1
Re: Why Avoid BASIC on RPi?
That's why I had expected the line-numbered BASIC program to do better than it did. Unfortunately, it seems that language is just not expressive enough to efficiently code Karatsuba multiplication.
Re: Why Avoid BASIC on RPi?
I've been optimizing the original fibonacci.c code and created a new program parallel.c which uses OpenMP to obtain additional efficiency when running on multi-core hardware. Results are as follows when running the new program on a Pi 3B in 32-bit mode underclocked with arm_freq=900:
I will be posting the new code shortly. For now I would note that optimizations to the original serial code led to a factor 2.5 times improvement while parallelization to the 4 cores gave another 3.6 times improvement. It is also worth noting that parallel.c took significantly less time to write and debug than the line-numbered BASIC program.
Code: Select all
$ make
/usr/local/gcc-6.4/bin/gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -Wall -o serial parallel.c
/usr/local/gcc-6.4/bin/gcc -fopenmp -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -Wall -o parallel parallel.c
/usr/local/gcc-6.4/bin/gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -Wall -o fibonacci fibonacci.c
$ time ./fibonacci >fibonacci.txt
real 0m52.750s
user 0m52.728s
sys 0m0.010s
$ time ./serial >serial.txt
real 0m20.777s
user 0m20.756s
sys 0m0.020s
$ time ./parallel >parallel.txt
real 0m5.768s
user 0m21.004s
sys 0m0.010s
$ md5sum fibonacci.txt
c926caa99ed1e30fe0d3c966da901738 fibonacci.txt
$ md5sum serial.txt
c926caa99ed1e30fe0d3c966da901738 serial.txt
$ md5sum parallel.txt
c926caa99ed1e30fe0d3c966da901738 parallel.txt
Re: Why Avoid BASIC on RPi?
Wow. Can't wait to try it out and see the code.
A 3.6x speedup over 4 cores is a pretty good scaling factor.
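For context, Amdahl's law gives a feel for how good 3.6x on 4 cores is: solving 1/S = f + (1-f)/p for the serial fraction f. This is my own back-of-envelope helper, not anything from parallel.c:

```cpp
// Serial fraction implied by Amdahl's law: 1/S = f + (1 - f)/p
// rearranges to f = (p/S - 1)/(p - 1), where S is the measured
// speedup and p the number of cores.
double serialFraction(double speedup, double cores) {
    return (cores / speedup - 1.0) / (cores - 1.0);
}
```

serialFraction(3.6, 4.0) comes out at about 0.037, so the measured scaling is consistent with only roughly 3.7% of the work remaining serial.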
Re: Why Avoid BASIC on RPi?
This is interesting:
ejolson's new fibo() formulation is a clear winner across the board up to fibo(134217728) by a factor of 2 and much more at the low end.
But if we look at the high end we see the original formulation I am using is rising at a lower rate. It will win again somewhere further out...
Attachments:
new_2.png (26.04 KiB)
new_1.png (29.62 KiB)
Re: Why Avoid BASIC on RPi?
A programming language which allows for the efficient expression of complicated algorithms is liberating in the sense that the person who learns such a language is no longer constrained by the functionality of built-in features and standard libraries but has the opportunity to develop and implement new algorithms that solve new problems and old problems in new ways. While it might appear that the goal is to compute million-digit Fibonacci numbers, the title of the thread is why avoid BASIC on the Raspberry Pi.
Heater wrote: ↑Fri Jan 04, 2019 4:22 pm
How these things are implemented under the hood is of no consequence, as long as they perform well enough. Perhaps the language run time makes use of libs written in other languages, perhaps it does not. We are talking about the language here not any particular implementation.
I find it profoundly important to pay attention to what goes on behind the scenes, because for me coding the doubling formulas and Karatsuba multiplication is best seen as an exercise to help determine how expressive various programming languages might be for efficiently implementing algorithms in general. For example, there is good reason to believe that built-in big-number arithmetic has little to do with the yet-to-be-invented non-bipartite womble algorithm.
When a swimmer performs a series of exercises on land, it is clear the goal is not the exercise but building the strength needed for subsequent swimming. When a calculus student is confronted with a series of exercises, if finding answers takes priority over the goal of building the skills and understanding necessary for subsequent work, then many students will notice the answers are easiest to find by consulting the key at the end of the book. When the goal is understanding why to avoid BASIC, then what I want to know is whether BASIC has the ability to efficiently express anything the hardware is capable of. If finding the 4784969th Fibonacci number takes priority, then we forget the purpose of the exercise.
What caused the end of the golden age of personal computing? Was the end caused by the failure of BASIC to liberate a person who learned that language? With the Pi a second age of personal computing has begun. Does Python also lack the ability to efficiently express new algorithms in a way that repeats the mistakes of the past?
Last edited by ejolson on Tue Jan 08, 2019 12:40 am, edited 3 times in total.
Re: Why Avoid BASIC on RPi?
It does look like you turned an O(n^1.58) algorithm into an O(n^1.38) one. However, I'd be very surprised if the red and blue lines ever really crossed.
Re: Why Avoid BASIC on RPi?
ejolson,
Great. Where do I pick up my Turing award?
ejolson wrote:It does look like you turned an O(n^1.58) algorithm into an O(n^1.38) one.
Dang... There goes fame and fortune... asymptotes are a bitch.
ejolson wrote:However, I'd be very surprised if the red and blue lines ever really crossed.
Re: Why Avoid BASIC on RPi?
This post includes a new code parallel.c demonstrating the expressiveness of the C programming language with OpenMP extensions in the context of computing the 4784969th Fibonacci number using the doubling formulas and Karatsuba multiplication. This program was inspired by the original fibonacci.c program with the following modifications:
1. Algebraically simplified versions of the doubling formulas given by
F(2k) = F(k)[2F(k+1)-F(k)]
F(2k+1) = F(k+1)[2F(k+1)-F(k)]+(-1)^(k+1)
F(2k+1) = F(k)[2F(k)+F(k+1)]+(-1)^(k)
F(2k+2) = F(k+1)[2F(k)+F(k+1)]
have been used which reduce the number of multiplications per doubling step from three to two.
2. The final return of the doubling formula occurs in a separate routine and only calculates the value of b, since that is the only value needed.
3. All memory is allocated from the execution stack using C99 variable-length arrays. The routines for big-number addition and subtraction have been rewritten to take an additional argument specifying the stack-allocated buffers in which to store the big numbers. This change allows lockless memory allocation for the multi-threaded version, since every thread has its own execution stack.
4. Signed 32-bit integers have been used to store the digits of the big numbers. This is more cache friendly and allows addition and subtraction to be performed as vector addition and vector subtraction on the arrays of digits which represent the big numbers. The resulting loops are vectorizable. The O(n^2) multiplication has been modified to use a 64-bit temporary workspace with a delayed carry that avoids any intermediate divisions until the very end.
5. A horrible bigcarry routine has been written to fix the mess made by the simplified vector addition and subtraction routines.
6. The multiplication, addition, subtraction and carry routines are composable. They were actually composable before, but the difficulty in freeing temporary memory allocated for the return values discouraged use of that feature in the original fibonacci.c code. Now the memory is stackallocated and automatically freed at just the right time.
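The simplified formulas in item 1 can be checked numerically with ordinary 64-bit integers before trusting them in big-number code. This is a standalone sketch of mine (checkDoubling is an invented name), covering every k for which the products still fit in uint64_t:

```cpp
#include <cstdint>

// Verify the four doubling identities from the post, with s = (-1)^(k+1),
// against a table of Fibonacci numbers (F(93) is the largest that fits
// in an unsigned 64-bit integer).
bool checkDoubling() {
    uint64_t F[94]; F[0] = 0; F[1] = 1;
    for (int i = 2; i < 94; i++) F[i] = F[i-1] + F[i-2];
    for (int k = 1; 2*k + 2 < 94; k++) {
        int64_t s = (k % 2 == 0) ? -1 : 1;  // s = (-1)^(k+1)
        if (F[2*k] != F[k] * (2*F[k+1] - F[k])) return false;
        if ((int64_t)F[2*k+1] != (int64_t)(F[k+1] * (2*F[k+1] - F[k])) + s) return false;
        if ((int64_t)F[2*k+1] != (int64_t)(F[k] * (2*F[k] + F[k+1])) - s) return false;
        if (F[2*k+2] != F[k+1] * (2*F[k] + F[k+1])) return false;
    }
    return true;
}
```

Each halving step only ever needs the two products F(k)[2F(k+1)-F(k)] and F(k+1)[2F(k)+F(k+1)] (plus a correction of ±1), which is where the reduction from three multiplications to two comes from.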
The changes itemized above as 1 through 6 result in a program that runs 2.5 times faster than the original fibonacci.c code. Note that most of the speed improvement results from 1 and 2. The final change is to introduce parallelism.
Most of the parallelism comes from the Karatsuba multiplication algorithm; however, parallelism was also introduced into the use of the doubling formula as well. As the Karatsuba algorithm is by nature a recursive divide-and-conquer algorithm, the fork-join model of dynamic parallelism is easy to implement.
The current code relies on a header file that defines the macros spawn and sync. In order to accommodate either OpenMP or the Cilk parallel processing extensions to the C programming language, the syntax of spawn is a bit irritating. The irritation seems worthwhile, however, as it allows the program to be compiled using either OpenMP or Cilk with no further changes.
7. Make copies of the serial versions of the subroutines fibowork and bigmul3 called fiboworks and bigmul3s. Then introduce the spawn and sync keywords to perform the multiplications in parallel and a crossover point to determine when to switch to the serial code.
It should be pointed out that once steps 1 through 6 were completed adding the parallelism in step 7 took about 5 minutes and worked the first time. It should also be pointed out that addition, subtraction and bigcarry have not been parallelized. The current code scales to two cores almost perfectly and obtains a factor 3.6 speedup on four cores. Parallel efficiency when scaling to eight cores is only 70 percent. It is likely further changes would have to be made in order to scale well using more than eight cores.
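The fork-join pattern with a serial crossover described here can be illustrated with bare OpenMP tasks. This is an illustrative toy of my own (fibTask and fibTop are invented names), not parallel.c or its spawn/sync macros; without -fopenmp the pragmas are simply ignored and the code runs serially:

```cpp
// Naive recursive Fibonacci with task parallelism and a serial crossover,
// mirroring the divide-and-conquer fork-join structure of the post.
static long fibTask(int n) {
    if (n < 20) return n < 2 ? n : fibTask(n - 1) + fibTask(n - 2);  // serial below crossover
    long a, b;
    #pragma omp task shared(a)
    a = fibTask(n - 1);          // forked child task
    b = fibTask(n - 2);          // parent keeps working
    #pragma omp taskwait         // join before combining results
    return a + b;
}

static long fibTop(int n) {
    long r = 0;
    #pragma omp parallel
    #pragma omp single
    r = fibTask(n);              // one thread seeds the task tree
    return r;
}
```

fibTop(30) returns 832040 whether or not OpenMP is enabled; the crossover constant plays the same role as the TUNE crossover that switches parallel.c back to the serial bigmul3s.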
The code is
Code: Select all
/* parallel.c -- Compute the nth Fibonacci Number
Written January 7, 2019 by Eric Olson
This program demonstrates the expressiveness of C with OpenMP
as measured by explicitly coding a parallel version of the
Karatsuba multiplication algorithm for big-number arithmetic
and then using the doubling formulas
F(2k) = F(k)[2F(k+1)-F(k)]
F(2k+1) = F(k+1)[2F(k+1)-F(k)]+(-1)^(k+1)
F(2k+1) = F(k)[2F(k)+F(k+1)]+(-1)^(k)
F(2k+2) = F(k+1)[2F(k)+F(k+1)]
to compute the nth Fibonacci number. Note that n is specified
in the first line of the main routine.
*/
#include <stdio.h>
#include <math.h>
#include <string.h>
#include <strings.h>
#include <stdlib.h>
#include <alloca.h>
#include <sys/time.h>
#include <sys/resource.h>
#define TUNE1 44
#define TUNE2 1024
#define TUNE3 1048576
#include "parallel.h"
typedef unsigned long long llu;
typedef int limb;
typedef unsigned long long limb2;
typedef struct {
int n; limb *d;
} bignum;
static int rexp,cmult,digits,nproc;
static limb rbase,rcarry;
static limb2 cadd,cmax;
static void bigprint(bignum x){
char fmt[10];
sprintf(fmt,"%%0%dllu",rexp);
if(!x.n){
printf("0\n");
return;
}
for(int i=x.n-1;i>=0;i--){
if(i==x.n-1) printf("%llu",(llu)x.d[i]);
else printf(fmt,(llu)x.d[i]);
}
printf("\n");
}
static bignum bigtrim(bignum x){
for(int k=x.n-1;k>=0;k--){
if(x.d[k]!=0){
x.n=k+1;
return x;
}
}
x.n=0;
return x;
}
static bignum bigcarry(bignum x){
int imax=x.n-1;
for(int i=0;i<imax;i++){
if(x.d[i]>=rbase){
do {
x.d[i]-=rbase; x.d[i+1]++;
} while(x.d[i]>=rbase);
} else if(x.d[i]<0){
do {
x.d[i]+=rbase; x.d[i+1]--;
} while(x.d[i]<0);
}
}
if(x.d[imax]<rbase) return bigtrim(x);
x.n++;
x.d[imax]-=rbase; x.d[imax+1]=1;
while(x.d[imax]>=rbase){
x.d[imax]-=rbase; x.d[imax+1]++;
}
return x;
}
static bignum bigcopy(limb zd[],bignum x){
bignum z; z.d=zd;
memcpy(z.d,x.d,sizeof(limb)*x.n);
z.n=x.n;
return z;
}
static bignum bigadd3(limb zd[],bignum x,bignum y){
bignum z; z.d=zd;
if(x.n<y.n){
for(int i=0;i<x.n;i++) z.d[i]=x.d[i]+y.d[i];
for(int i=x.n;i<y.n;i++) z.d[i]=y.d[i];
z.n=y.n;
} else {
for(int i=0;i<y.n;i++) z.d[i]=x.d[i]+y.d[i];
for(int i=y.n;i<x.n;i++) z.d[i]=x.d[i];
z.n=x.n;
}
return z;
}
#ifdef NOTUSED
static bignum bigsub3(limb zd[],bignum x,bignum y){
bignum z; z.d=zd;
for(int i=0;i<y.n;i++) z.d[i]=x.d[i]-y.d[i];
for(int i=y.n;i<x.n;i++) z.d[i]=x.d[i];
z.n=x.n;
return z;
}
#endif
static bignum bigsub2(bignum x,bignum y){
for(int i=0;i<y.n;i++) x.d[i]-=y.d[i];
return x;
}
static bignum bigadd2(bignum x,bignum y){
if(x.n<y.n){
if(x.n>0) {
x.d[0];
for(int i=0;i<x.n;i++) x.d[i]+=y.d[i]rcarry;
x.d[x.n]=y.d[x.n]+1;
for(int i=x.n+1;i<y.n;i++) x.d[i]=y.d[i];
x.n=y.n;
} else {
x=bigcopy(x.d,y);
}
} else {
x.d[0]--;
for(int i=0;i<y.n;i++) x.d[i]+=y.d[i]-rcarry;
if(y.n==x.n) {
x.n++;
x.d[y.n]=1;
} else {
x.d[y.n]++;
}
}
return x;
}
static bignum bigdec1(bignum x){
int imax=x.n-1;
x.d[0]--;
for(int i=0;i<imax;i++){
if(x.d[i]>=0) return x;
x.d[i]+=rbase; x.d[i+1]--;
}
return x;
}
static bignum biginc1(bignum x){
int imax=x.n-1;
if(imax>=0) {
x.d[0]++;
for(int i=0;i<imax;i++){
if(x.d[i]<rbase) return x;
x.d[i]-=rbase; x.d[i+1]++;
}
if(x.d[imax]<rbase) return x;
x.d[imax]-=rbase;
}
x.n++;
x.d[imax+1]=1;
return x;
}
static bignum bigmul3w(limb zd[],bignum x,bignum y){
bignum z; z.d=zd;
int imax=x.n+y.n;
limb2 s[imax];
bzero(s,sizeof(limb2)*imax);
imax--;
for(int i=0;i<x.n;i++){
for(int j=0;j<y.n;j++){
s[i+j]+=(limb2)x.d[i]*y.d[j];
}
if((i+1)%cmult==0){
for(int k=0;k<imax;k++){
if(s[k]<cmax) continue;
s[k]-=cmax; s[k+1]+=cadd;
}
}
}
for(int k=0;k<imax;k++){
s[k+1]+=s[k]/rbase; z.d[k]=s[k]%rbase;
}
z.d[imax]=s[imax];
z.n=imax+1;
return bigtrim(z);
}
static bignum biglow(bignum x,int n){
if(x.n>n) x.n=n;
return x;
}
static bignum bighigh(bignum x,int n){
if(x.n<n) x.n=0;
else {
x.n-=n;
x.d=&x.d[n];
}
return x;
}
static bignum bigmul3s(limb zd[],bignum x,bignum y){
if(x.n<TUNE1||y.n<TUNE1) return bigmul3w(zd,x,y);
int M=x.n<y.n?y.n:x.n; int n=M/2;
bignum z;
bignum x0=biglow(x,n),x1=bighigh(x,n);
bignum y0=biglow(y,n),y1=bighigh(y,n);
limb z0d[M+1],z1d[M+3],z1yd[n+1],z2d[M+3];
bignum z0,z2,z1x,z1y,z1;
z0=bigmul3s(z0d,x0,y0);
z2=bigmul3s(z2d,x1,y1);
z1x=bigcarry(bigadd3(z1d,x0,x1));
z1y=bigcarry(bigadd3(z1yd,y0,y1));
z1=bigmul3s(z1x.d,z1x,z1y);
z1=bigcarry(bigsub2(bigsub2(z1,z0),z2));
z=bigcopy(zd,z0);
z=bigadd3(&z.d[n],(bignum){z.n-n,&z.d[n]},z1);
z=bigadd2((bignum){z.n-n,&z.d[n]},z2);
z.n=z.n+2*n; z.d=zd;
return bigcarry(z);
}
static bignum bigmul3(limb zd[],bignum x,bignum y){
if(x.n<TUNE1||y.n<TUNE1) return bigmul3w(zd,x,y);
int M=x.n<y.n?y.n:x.n; int n=M/2;
if(M<TUNE2) return bigmul3s(zd,x,y);
bignum z;
bignum x0=biglow(x,n),x1=bighigh(x,n);
bignum y0=biglow(y,n),y1=bighigh(y,n);
limb z0d[M+1],z1d[M+3],z1yd[n+1],z2d[M+3];
bignum z0,z2,z1x,z1y,z1;
spawn(z0=) bigmul3(z0d,x0,y0);
spawn(z2=) bigmul3(z2d,x1,y1);
z1x=bigcarry(bigadd3(z1d,x0,x1));
z1y=bigcarry(bigadd3(z1yd,y0,y1));
z1=bigmul3(z1x.d,z1x,z1y);
sync;
z1=bigcarry(bigsub2(bigsub2(z1,z0),z2));
z=bigcopy(zd,z0);
z=bigadd3(&z.d[n],(bignum){z.n-n,&z.d[n]},z1);
z=bigadd2((bignum){z.n-n,&z.d[n]},z2);
z.n=z.n+2*n; z.d=zd;
return bigcarry(z);
}
static bignum t1,a,b;
static void fiboworks(int n){
if(n==0){
a.n=0; b.n=1; b.d[0]=1;
return;
}
fiboworks(n/2);
if(n%2==0){
// [a,b]=[a*(2*b-a),b*(2*b-a)-(-1)^k]
t1=bigcarry(bigsub2(bigadd3(t1.d,b,b),a));
a=bigmul3(a.d,a,t1);
b=bigmul3(b.d,b,t1);
if(n%4==0) b=bigdec1(b); else b=biginc1(b);
} else {
// [a,b]=[a*(2*a+b)+(-1)^k,b*(2*a+b)]
t1=bigcarry(bigadd2(bigadd3(t1.d,a,a),b));
b=bigmul3(b.d,b,t1);
a=bigmul3(a.d,a,t1);
if(n%4==1) a=biginc1(a); else a=bigdec1(a);
}
return;
}
static void fibowork(int n){
if(n==0){
a.n=0; b.n=1; b.d[0]=1;
return;
}
if(n<TUNE3) fiboworks(n/2);
else fibowork(n/2);
if(n%2==0){
// [a,b]=[a*(2*b-a),b*(2*b-a)-(-1)^k]
t1=bigcarry(bigsub2(bigadd3(t1.d,b,b),a));
spawn(a=) bigmul3(a.d,a,t1);
b=bigmul3(b.d,b,t1);
if(n%4==0) b=bigdec1(b); else b=biginc1(b);
sync;
} else {
// [a,b]=[a*(2*a+b)+(-1)^k,b*(2*a+b)]
t1=bigcarry(bigadd2(bigadd3(t1.d,a,a),b));
spawn(b=) bigmul3(b.d,b,t1);
a=bigmul3(a.d,a,t1);
if(n%4==1) a=biginc1(a); else a=bigdec1(a);
sync;
}
return;
}
static bignum fibo(int n, limb xd[]){
b.d=xd;
if(n<2){
b.n=1; b.d[0]=n;
return b;
}
limb ad[digits]; a.d=ad;
fibowork((n-1)/2);
if(n%2==0){
// b=b*(2*a+b)
t1=bigcarry(bigadd2(bigadd3(t1.d,a,a),b));
b=bigmul3(b.d,b,t1);
} else {
// b=b*(2*b-a)-(-1)^k
t1=bigcarry(bigsub2(bigadd3(t1.d,b,b),a));
b=bigmul3(b.d,b,t1);
if(n%4==1) b=bigdec1(b); else b=biginc1(b);
}
return b;
}
static int work(int n){
digits=n*log10(0.5*(1+sqrt(5.0)))/rexp+4;
limb t1d[digits]; t1.d=t1d;
limb xd[digits];
workers(nproc){
bignum x=fibo(n,xd);
bigprint(x);
}
return 0;
}
int main(){
int n=4784969;
setrlimit(RLIMIT_STACK,
&(const struct rlimit)
{RLIM_INFINITY,RLIM_INFINITY});
if(sizeof(limb)*2!=sizeof(limb2)){
fprintf(stderr,
"sizeof(limb)=%llu must be half of sizeof(limb2)=%llu!\n",
(llu)sizeof(limb),(llu)sizeof(limb2));
return 1;
}
limb limbmax=((limb2)1<<(8*sizeof(limb)-1))-1;
limb2 limb2max=((limb2)1<<(8*sizeof(limb)))-1;
limb2max=(limb2max<<(8*sizeof(limb)))+limb2max;
rexp=log(limbmax)/log(10);
rbase=pow(10,rexp)+0.5; rcarry=rbase-1;
cmult=(limb2max-2*rbase)/rbase/rbase/2;
cadd=(limb2)rbase*cmult; cmax=cadd*rbase;
nproc=get_ncpu();
return work(n);
}
Code:
/* parallel.h - Macros for parallel processing
Written Jan 7, 2019 by Eric Olson */
#ifndef _PARALLEL_H
#define _PARALLEL_H
#ifdef _OPENMP
# include <omp.h>
# define spawn(X) _Pragma("omp task default(shared)") X
# define sync _Pragma("omp taskwait")
# define get_ncpu(X) omp_get_num_procs(X)
# define workers(X) \
if(X) omp_set_num_threads(X); \
else omp_set_num_threads(1); \
_Pragma("omp parallel") \
_Pragma("omp single")
#else
# ifdef _CILKPLUS
# include <cilk/cilk.h>
# include <cilk/cilk_api.h>
# define spawn(X) X cilk_spawn
# define sync cilk_sync
# define get_ncpu(X) __cilkrts_get_nworkers(X)
# define workers(X) \
__cilkrts_end_cilk(); \
if(X){ \
char buf[32]; \
sprintf(buf,"%d",X); \
int r=__cilkrts_set_param("nworkers",buf); \
if(r!=__CILKRTS_SET_PARAM_SUCCESS){ \
fprintf(stderr,"Error: unable to set nworkers to %d!\n",X); \
exit(1); \
} \
__cilkrts_init(); \
}
# else
# define spawn(X) X
# define sync
# define get_ncpu(X) 1
# define workers(X)
# endif
#endif
#endif
Last edited by ejolson on Tue Jan 08, 2019 9:05 am, edited 1 time in total.
Re: Why Avoid BASIC on RPi?
Impressive!
I am getting better than 4x speed up with a quad core PC (hyper-threading turned off).
Compared to the fibo.c version 4
Code:
$ gcc -O3 fibo.c -o fibo_plain
$ gcc -O3 fibo_gmp.c -o fibo_gmp -lgmp
$ gcc -O3 -fopenmp fibo_par.c -o fibo_par
$
$ time ./fibo_plain >/dev/null
real 0m1.562s
user 0m1.560s
sys 0m0.004s
$ time ./fibo_gmp >/dev/null
real 0m0.195s
user 0m0.192s
sys 0m0.000s
$ time ./fibo_par >/dev/null
real 0m0.293s
user 0m1.084s
sys 0m0.000s
$
$ pc 1.562 / 0.293
5.33105802047782
Re: Why Avoid BASIC on RPi?
Thanks for testing the new code. I made one minor cosmetic change to the bigcopy routine and just updated the previous post. The modification doesn't change the speed but only makes the calling conventions more regular. I had been hoping the new code would catch the GMP single-thread performance, but that seems not to be the case. Maybe one needs a system with more and slower cores for that to happen. Have you tried changing the TUNE1, TUNE2 and TUNE3 parameters for better performance on your system?
jahboater wrote: ↑Tue Jan 08, 2019 8:20 am
Impressive!
Note that to separate the algorithmic optimisations from the parallel speedup, you can compile the new code without the -fopenmp flag to obtain a serial version of the code.
Last edited by ejolson on Tue Jan 08, 2019 9:24 am, edited 1 time in total.
Re: Why Avoid BASIC on RPi?
Awesome. No time to look into that just now but it builds and runs here:
For a speed up of about 3.6 over 8 (hyper)threads on 4 cores.
One thing that immediately jumped out at me is this declaration in work():
t1d is an unused variable. But comment it out and it segfaults. Something cunning going on here!
How do we build that with cilk?
Code:
$ gcc -Wall -Wextra -O3 -o parallel parallel.c -lm
$ time ./parallel | tail -c 32
4856539211500699706378405156269
real 0m0.919s
user 0m0.844s
sys 0m0.063s
$
$ gcc -Wall -Wextra -O3 -fopenmp -o parallel parallel.c -lm
$ time ./parallel | tail -c 32
4856539211500699706378405156269
real 0m0.257s
user 0m1.281s
sys 0m0.094s
All in all 2.4 times faster than the single core c++:
Code:
$ time ../c++/fibo_karatsuba | tail -c 32
4856539211500699706378405156269
real 0m0.640s
user 0m0.609s
sys 0m0.000s
Code:
limb t1d[digits]; t1.d=t1d;
Memory in C++ is a leaky abstraction.