User avatar
OneMadGypsy
Posts: 359
Joined: Wed Apr 28, 2021 1:57 am
Location: New Orleans, Louisiana
Contact: Website

Faster Sine and Cosine

Tue Sep 21, 2021 4:55 pm

I've been working on a graphics library. My current focus with it is gradient fills. My code currently supports: HORIZONTAL, VERTICAL and RADIAL fills using 2 or more colors with custom offsets for each individual color position. It also supports flipX|Y, originX|Y, scaleX|Y, and rotation. I have it working quite fast. Up until about 20 minutes ago the biggest bottleneck (from the drawing aspect) was rotation. It's actually still a bit of a bottleneck but much less of one due to thee code that I am about to share with you.

With my Pico overclocked to 250mhz and drawing a 60X60 gradient rect of 2 colors, scaled up 4 times, and rotated 45 degrees:

stock sine and cosine = 40ms
below sine and cosine = 28ms

Before I share the code I want to say that I am well aware that the below sine and cosine are not as accurate as the built in ones. However, for the purposes of gradient fills (and probably other stuff, too) it is quite sufficient. Also, both functions expect degrees instead of radians. This is because the degrees (once ranged) literally are the index in sine_table. If the functions accepted radians I would have to waste more clock cycles converting it back to degrees.

Code: Select all

const uint8_t sine_table[91] = {
    0,4,8,13,17,22,26,31,35,39,44,48,53,57,61,65,70,74,78,83,87,91,95,99,103,107,111,115,119,123,
    127,131,135,138,142,146,149,153,156,160,163,167,170,173,177,180,183,186,189,192,195,198,200,203,206,208,211,213,216,218,
    220,223,225,227,229,231,232,234,236,238,239,241,242,243,245,246,247,248,249,250,251,251,252,253,253,254,254,254,254,254,
    255
};

float isin(int16_t i) {
    i = (i%360+360)%360;
    if (i < 180) return sine_table[(i<90)?i:180-i]/255.0f;
    return -(sine_table[(i<270)?i-180:360-i]/255.0f);
}

float icos(int16_t i) {
    return isin(i+90);
}

SIDE:
I haven't looked at the actual numbers to determine how imprecise these functions are in comparison to their built-in counterparts, but I have run numerous visual comparison tests. The difference from visual results is almost nonexistent. Also, if I draw the exact same fill without any rotation it takes 2ms, so my 28ms time is almost entirely calculating rotation. Do you know a better way? If so, I'd love to learn it. This is the best I can do. I have tried a LOT of things.

If anybody is interested, below is how I concocted sine_table. I just ran the script, copied the printout, pasted it, deleted the 2 square brackets, and manually separated the lines into 30 elements each.

Code: Select all

print('const uint8_t sine_table[91] = {{\n\t{}\n}};'.format([math.floor(math.sin(i*(math.pi/180))*255) for i in range(91)]))
Last edited by OneMadGypsy on Tue Sep 21, 2021 6:14 pm, edited 2 times in total.
"Focus is a matter of deciding what things you're not going to do." ~ John Carmack

slimhazard
Posts: 61
Joined: Sat Apr 03, 2021 8:47 pm

Re: Faster Sine and Cosine

Tue Sep 21, 2021 5:57 pm

If you can afford the 91 bytes of memory for the lookup table, can you afford 360? If so you could pre-compute all 360 possible values, using i-180 or 180-i and so forth where necessary, and just do it all with one lookup, without the logic to find the range.

And to take it a step further, if you can afford 1440 bytes for a table of 360 floats, you could pre-compute all of the values divided by 255.0. Then it's really just one lookup without any further math, after getting the value in the range of 0 to 359. All you'd need to say in the source code is [0/255.0, 4/255.0, 8/255.0, ...] and so on. Run a clever awk one-liner over your existing array and you've got it.

The most expensive operations in your function are the divides, two integer divides for the mods and especially the floating point divide by 255.0, although your compiler may well have replaced that with a multiply by 1/255.0. There may be an ingenious trick to replace the mods with multiplies and some bit twiddling -- I seem to recall reading about such a thing, but might be mistaken. But even without a trick like that, you'll gain some speed if a lookup is all that's necessary after getting a number in the right range.

If you must have a table of integer values, then multiply by 1/255.0 instead of dividing. Very likely your compiler has already spotted that optimization, but you might as well make it explicit.

User avatar
OneMadGypsy
Posts: 359
Joined: Wed Apr 28, 2021 1:57 am
Location: New Orleans, Louisiana
Contact: Website

Re: Faster Sine and Cosine

Tue Sep 21, 2021 6:10 pm

Wow! Thank you! I am going to try all of the methods you suggested. I think I can afford whatever. I've stripped the firmware down significantly so, my pico is practically empty. I've been working on my own firmware that only does what I need it to for like 6 months. I stripped down micropython, too. My endgame is a video game device that allows users to easily create games/(apps?) with micropython. The user doesn't need SPI, UART, I2C, Pin, etc...


I actually came back here to fix a mistake I had made in my code. I goofed and didn't account for the negative side. I fixed my code.
"Focus is a matter of deciding what things you're not going to do." ~ John Carmack

slimhazard
Posts: 61
Joined: Sat Apr 03, 2021 8:47 pm

Re: Faster Sine and Cosine

Tue Sep 21, 2021 6:26 pm

I goofed and didn't account for the negative side.
I was wondering about that, that's the purpose in this line of adding 360 after the mod, and then taking the mod again, right?

Code: Select all

i = (i%360+360)%360;
If so, then how about:

Code: Select all

if (i < 0)
	i = -i;
i %= 360;
When you need to optimize math in hot code, get rid of divides whenever you can.

User avatar
OneMadGypsy
Posts: 359
Joined: Wed Apr 28, 2021 1:57 am
Location: New Orleans, Louisiana
Contact: Website

Re: Faster Sine and Cosine

Tue Sep 21, 2021 6:36 pm

my thought process behind i = (i%360+360)%360; was/is:

Code: Select all

force into range:       i%360
force positive:         above + 360
rerange :               above % 360
I know it's a whack-a-mole way of doing things, cause the code assumes that it needs to fix the i argument even if it doesn't.

Regarding your insight into division:

I appreciate this info. I am still sort of a noob at C programming and I don't know intricacies like what you shared. 100% of my C knowledge is "by observation" and "trial and error". I still get confused by pointers and references sometimes when dealing with arrays/objects. I do have an extensive programming background, just not in this language. However, I think i do a decent job of figuring things out. Everything I have made works and I keep revisiting them as I learn more to make them work better.
Last edited by OneMadGypsy on Tue Sep 21, 2021 6:46 pm, edited 1 time in total.
"Focus is a matter of deciding what things you're not going to do." ~ John Carmack

User avatar
jojopi
Posts: 3539
Joined: Tue Oct 11, 2011 8:38 pm

Re: Faster Sine and Cosine

Tue Sep 21, 2021 6:38 pm

OneMadGypsy wrote:
Tue Sep 21, 2021 4:55 pm
Also, both functions expect degrees instead of radians.
Since degrees are completely arbitrary, why not use 256ths of a circle instead?

Your range scaling can be &0xff instead of %360, and there is no need to make the argument positive first.

dom
Raspberry Pi Engineer & Forum Moderator
Raspberry Pi Engineer & Forum Moderator
Posts: 5809
Joined: Wed Aug 17, 2011 7:41 pm
Location: Cambridge

Re: Faster Sine and Cosine

Tue Sep 21, 2021 6:39 pm

Get rid of the mod. Treat circle as having 512 (or 256 if that accurate enough), then you just need:

Code: Select all

i &= 511;

User avatar
OneMadGypsy
Posts: 359
Joined: Wed Apr 28, 2021 1:57 am
Location: New Orleans, Louisiana
Contact: Website

Re: Faster Sine and Cosine

Tue Sep 21, 2021 6:47 pm

@jojopi and dom

Noted. Thank You.


Edit:

The 255 happened because I don't know what the hell I'm doing and 255 was initially an uint8_t instead of a float. :D I think if any of you were inside of my head while I was "inventing" this you'd probably wonder if I understand math, at all! :D Meh, break it til you make it.
Last edited by OneMadGypsy on Tue Sep 21, 2021 8:03 pm, edited 2 times in total.
"Focus is a matter of deciding what things you're not going to do." ~ John Carmack

slimhazard
Posts: 61
Joined: Sat Apr 03, 2021 8:47 pm

Re: Faster Sine and Cosine

Tue Sep 21, 2021 6:58 pm

Note that I'm just going along with your decision that a lookup table for the 360 possible values of sin(degrees) is good enough. That can be a perfectly good optimization, if you're sure.

Numeric error becomes a problem when it accumulates -- if your code will compute y = isin(x), then z = isin(y), then w = isin(z) and so on (or generally results that depend on previous iterations of the function), then what may have not looked like a problem after the first iteration could conceivably start to look pretty bad further down the line.

I couldn't tell from your problem description if that's likely to happen. If so I suggest you try to push it out to large number of iterations and make sure you're happy with the results. If not you may have to live with the library trig functions.

User avatar
OneMadGypsy
Posts: 359
Joined: Wed Apr 28, 2021 1:57 am
Location: New Orleans, Louisiana
Contact: Website

Re: Faster Sine and Cosine

Tue Sep 21, 2021 7:11 pm

slimhazard wrote:
Tue Sep 21, 2021 6:58 pm
Note that I'm just going along with your decision that a lookup table for the 360 possible values of sin(degrees) is good enough. That can be a perfectly good optimization, if you're sure.

Numeric error becomes a problem when it accumulates -- if your code will compute y = isin(x), then z = isin(y), then w = isin(z) and so on (or generally results that depend on previous iterations of the function), then what may have not looked like a problem after the first iteration could conceivably start to look pretty bad further down the line.

I couldn't tell from your problem description if that's likely to happen. If so I suggest you try to push it out to large number of iterations and make sure you're happy with the results. If not you may have to live with the library trig functions.

Well, I can tell you exactly what I am doing, and then maybe you can tell me if I'm making a bad decision, if you like. Honestly, it would probably be a lot easier for you to read my code than for me to explain it. I'll keep it to "meat and potatoes" and omit the boiler plate.

creates a list of precomputed gradients. self is "fill" in the draw function (last codein this post). ox/oy are origins (ie. where to start drawing in gradient fill local space). DIST just spits out the sqrt of the Pythagorean theorem. The first 2 lines are just trying to determine what the maximum end point is. I have dimensions, origins and offsets to consider. I know it's junky, but it's right. I'll figure out a cleaner way later today.

Code: Select all

    int16_t sx = MIN(0, self->ox), sy = MIN(0, self->oy);
    int16_t ex = MAX(mp_obj_get_int(ofs_[e])-sx, MAX(self->w-sx, self->ox)), ey = MAX(mp_obj_get_int(ofs_[e])-sy, MAX(self->h-sy, self->oy)); //FIXME: too convoluted
    
    self->sl     = DIST(0, 0, ex, ey);
    self->swatch = m_malloc(sizeof(uint16_t)*self->sl);
    
    for (uint8_t i=0; i<e; ++i) {
        o1 = mp_obj_get_int(ofs_[i]); 
        o2 = mp_obj_get_int(ofs_[i+1]); 
        o  = ((i)?(o2-o1):o2); 
          
        for (uint16_t n=0; n<o; n++) {              //for every in difference
            c1 = (o2 > ofs)?mp_obj_get_int(pal_[i]):mp_obj_get_int(pal_[i+1]); //these 2 lines basically make a wrap decision in their color choice
            c2 = (o1 > ofs)?mp_obj_get_int(pal_[i]):mp_obj_get_int(pal_[i+1]);
            
            //parsed color channels
            r1 = R31(c1); r2 = R31(c2);
            g1 = G31(c1); g2 = G31(c2); 
            b1 = B31(c1); b2 = B31(c2);
            
            //multiplied channel
            r_ = (uint8_t)(n*((float)(r2-r1)/o)+r1);
            g_ = (uint8_t)(n*((float)(g2-g1)/o)+g1);
            b_ = (uint8_t)(n*((float)(b2-b1)/o)+b1);
            
            c = RGB565((uint32_t)((r_<<19)|(g_<<11)|(b_<<3)));
            
            self->swatch[ofs++] = c;
        }
    }
    
    
    while (ofs<self->sl) self->swatch[ofs++] = mp_obj_get_int(pal_[e]); //fill any remaining indexes with end color
    

my sin/cos implementation of distance. This is the only part that actually uses my sin/cos

Code: Select all

uint16_t distance_fast(int16_t deg, int16_t x, int16_t y, picomod_Fill_obj_t *fill) {
    int16_t nx=x-fill->ox, ny=y-fill->oy;
    if (deg) {
        float s = isin(deg);
        float c = icos(deg);
        nx = (int16_t)(nx * c - ny * s);
        ny = (int16_t)(nx * s + ny * c);
    }
    
    return MIN(((uint16_t)((fill->t&FILL_HORIZONTAL)?nx:((fill->t&FILL_VERTICAL)?ny:DIST(x, y, fill->ox, fill->oy)))), fill->sl-1); //HORIZONTAL, VERTICAL, RADIAL or max possible
}

body of the function that actually uses that info and draws. distance slow/fast are just my sin/cos vs original. I have them both so I can toggle which one I use and compare. sx/sy are scale

Code: Select all


    for (uint32_t p=0; p<fill->l; p++) {
        px = p%fill->w; py = p/fill->w;
        nx = ((fill->w-1)-(px<<1)) * fx + px;   //flip x if fx
        ny = ((fill->h-1)-(py<<1)) * fy + py;   //flip y if fy
        c  = fill->swatch[(s)?distance_slow(r, nx, ny, fill):distance_fast(r, nx, ny, fill)];
        
        if (fill->tc == c) //if transparent skip
            continue;
                
        clipped_solid_rect(self, sx, sy, px*sx+x, py*sy+y, c); //self, width, height, x, y, color ~ ignores anything outside of the screen, but also adjust if only partially on the screen
    }

All in all, the entire system is just figuring out distance and using it as an index for a precomputed list of colors. The results are actually incredible. I wish I could take an accurate picture of the screen to share it. Every time I try the resulting image looks nothing like my display.
"Focus is a matter of deciding what things you're not going to do." ~ John Carmack

stevend
Posts: 421
Joined: Fri Oct 11, 2013 12:28 pm

Re: Faster Sine and Cosine

Tue Sep 21, 2021 8:14 pm

You might like to research a bit - many algorithms have been developed for efficient drawing using only integer arithmetic. For example, a circle is symmetrical or mirrored about both X and Y axes, so you only need to compute one eighth of the radius.
Long time since I did any of this, but look at Bresenham's algorithm as an example for efficiently drawing straight lines at an angle, and also one for drawing circles

User avatar
OneMadGypsy
Posts: 359
Joined: Wed Apr 28, 2021 1:57 am
Location: New Orleans, Louisiana
Contact: Website

Re: Faster Sine and Cosine

Tue Sep 21, 2021 8:37 pm

Thanks for the links! I will definitely check them out.

EDIT: I looked at the line algorithm link, and I am already using that algorithm, but I didn't learn it. I just made it up. Everything I do is made up. I haven't gotten to circles yet. At least not officially.
-----


I changed this:

Code: Select all

for (uint32_t p=0; p<fill->l; p++) {
        px = p%fill->w; py = p/fill->w;
to this:

Code: Select all

    float w = 1.0f/fill->w;
    for (uint32_t p=0; p<fill->l; p++) {
        px = p%fill->w; py = p*w;
And it added 3ms to the draw time. I don't understand. 3ms is significant ... it's a 9th of the original draw time. Why is division heavily out-performing multiplication?
Last edited by OneMadGypsy on Tue Sep 21, 2021 8:51 pm, edited 1 time in total.
"Focus is a matter of deciding what things you're not going to do." ~ John Carmack

slimhazard
Posts: 61
Joined: Sat Apr 03, 2021 8:47 pm

Re: Faster Sine and Cosine

Tue Sep 21, 2021 8:50 pm

I'm afraid I'm not motivated to put that much time and effort into code review for a forum post. I get to do that kind of thing at work, this is free time.

Besides, what you've posted doesn't necessarily address the question I wanted to point out, namely whether the trig approximation might lead to accumulation of numeric error, although I get the impression that it doesn't so much. From what I can tell, you're using sine and cosine for the distance function, which in turn is used to index the swatch table, and that's the end of the math. But if those results enter into further floating point computations, then I may be wrong. You're still the best judge of whether the results are sufficient for your requirements.

The library math functions do what they do for numeric accuracy for a variety of reasons, one of the most prominent of which is to minimize error propagation in long sequences of computation. That's why I wanted to bring it up if you're going for a fast approximation. Don't let me talk you out of it, "good enough" is always the best standard.

And I agree with stevend that computer geometry and graphics are very well researched subjects from which you can benefit. I saw some ways to speed up the math in inner-loop code, but chances are that there are well-understood algorithms that beat the tar out of what we're micro-optimizing here.

slimhazard
Posts: 61
Joined: Sat Apr 03, 2021 8:47 pm

Re: Faster Sine and Cosine

Tue Sep 21, 2021 8:59 pm

I changed this:

Code: Select all

for (uint32_t p=0; p<fill->l; p++) {
        px = p%fill->w; py = p/fill->w;
to this:

Code: Select all

    float w = 1.0f/fill->w;
    for (uint32_t p=0; p<fill->l; p++) {
        px = p%fill->w; py = p*w;
And it added 3ms to the draw time.
Might be because the first version has both mod and divide by fill->w inside the loop, which can be usefully combined and the compiler noticed it, but no such lucky combination in the second version.

User avatar
OneMadGypsy
Posts: 359
Joined: Wed Apr 28, 2021 1:57 am
Location: New Orleans, Louisiana
Contact: Website

Re: Faster Sine and Cosine

Tue Sep 21, 2021 9:09 pm

"I'm afraid I'm not motivated to put that much time and effort into code review for a forum post. I get to do that kind of thing at work, this is free time."

Fair enough. I greatly appreciate all the help you have given.

@algorithms

Ya know, I just can't bring myself to learn them. It's like cheating, to me. My walls are covered in whiteboards with my own algorithms/ideas, and whereas I know I may never write algorithms that are as amazing as the ones you are referring to, I also know that it's entirely possible that I will. I live in my head. I'd rather make something that is pretty good and all of my own doing than something that is amazing which is primarily composed of none of my own doing. I mean, why stop at learning an algorithm? I could just go find some c implementation that definitely already exists and paste it. If we follow that train of thought to the end ... why do any of it? I could just hack some handheld knockoff and call it a day. I believe in me.

Don't get me wrong. I don't intend to figure out "the world" with nothing but my own thoughts. I will push myself as far as I can go, though, and when I hit a wall I go make a bunch of observations. Those observations may include looking at some code that works, which may also be the very algorithms you are referring to, and comparing it with what I am doing. In that action, I may "observe" myself right into understanding the very algorithms that I refuse to have handed to me.
"Focus is a matter of deciding what things you're not going to do." ~ John Carmack

pidd
Posts: 2606
Joined: Fri May 29, 2020 8:29 pm
Location: Wirral, UK
Contact: Website

Re: Faster Sine and Cosine

Tue Sep 21, 2021 9:32 pm

How about putting tables (gradient or trig tables) into ROMS (eeprom or whatever) and interrogating with i2c.

kilograham
Raspberry Pi Engineer & Forum Moderator
Raspberry Pi Engineer & Forum Moderator
Posts: 958
Joined: Fri Apr 12, 2019 11:00 am
Location: austin tx

Re: Faster Sine and Cosine

Tue Sep 21, 2021 9:57 pm

a bit of an aside - and i'm certainly not suggesting use of divides in the above - but divide on Pico can be very very cheap.

We have h/w divider (one per core) which take 8 cycles and can happen concurrently with code execution (I think there is an example of perspective correct texture mapping in pico_playground).

the build in C / % operators are a bit expensive (better than software GCC divide) because of the need for a function call and no opportunity to interleave with code, but there are plenty of methods here https://github.com/raspberrypi/pico-sdk ... /divider.h for faster stuff..

e.g. you can do

Code: Select all

hw_divider_divmod_s32_start(a, b); // start signed divide a/b
// some other code that takes at least 8 cycles...

int32_t result = to_quotient_s32(hw_divider_result_nowait()); 
entire overhead is probably 6 cycles or so (one constant load, 2 SIO writes and 2 SIO reads)...

Return to “SDK”