I tried recompiling some C code on a Raspberry Pi 2 that does a single-precision 2D inverse FFT on 4096 x 4096 elements, once with the current Raspbian gcc and no special flags, and once with gcc-4.8 and the flags below. I did 4 runs with each version, and the time in the FFT routine averaged 62 seconds in both cases. Is this expected, or should there be an improvement? There was one tiny difference: the older GCC build took 62 or 63 seconds on each run, while the newer build varied from 61 to 64 seconds. The system was otherwise unloaded (except for 'top' running, showing the FFT code at 99.8% ... 100.3% CPU) in both cases.
Code:
Compile flags:
(Case 1) gcc -ansi -pedantic -Wall -O4 (Raspbian gcc version 4.6.3)
(Case 2) gcc-4.8 -ansi -pedantic -Wall -O4 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard -funsafe-math-optimizations
If anyone's interested, the code is
http://www.bealecorner.org/best/gforge/Hflab095.zip, and inside the interpreter my test was:
EDIT: using the built-in repeat function the run-to-run variability went away, but the speed stayed nearly the same.
Code:
HL>repeat 10 cfill 4096; tic; ifft; toc; pop
62 seconds.
62 seconds.
61 seconds.
62 seconds.
62 seconds.
62 seconds.
62 seconds.
61 seconds.
62 seconds.
62 seconds.
HL-ARMv7>repeat 10 cfill 4096; tic; ifft; toc; pop
62 seconds.
62 seconds.
61 seconds.
61 seconds.
61 seconds.
61 seconds.
61 seconds.
61 seconds.
61 seconds.
61 seconds.
Note: with 4 threads each at 100% CPU, my power meter reads 0.47 A @ 5.02 V = 2.36 W, and after a few minutes the CPU temperature is 53 C.
Interestingly, if I run four instances of the "hl" program, each doing a large FFT job, each one takes about 156 seconds to complete instead of 62 seconds, so with all four CPUs busy each thread runs at about 40% of the speed it had when running alone. I guess the bottleneck is memory bus access: each 4096 x 4096 complex array takes up 134 MB, so it can't fit in any kind of local cache, and the FFT access pattern is highly non-local, touching every element frequently.