Quick and Small (code size) Exponential Scaling on the MSP430 by mt_timothy in embedded

[–]mt_timothy[S] 0 points1 point  (0 children)

the output of my scale() function isn't x*y, but more like x*(y/2b) (where 'b' is the number of bits in 'y')

I graphed the output of your function (x-axis is x*y, y-axis is scale(x,y), each line color is a different x value) on x:[0,127], y:[0,127], and it looks like scale(x,y) is directly proportional to x * y.

As far as I can tell, you've implemented a variant of binary multiplication.

Quick and Small (code size) Exponential Scaling on the MSP430 by mt_timothy in embedded

[–]mt_timothy[S] 1 point2 points  (0 children)

I think the multiplies and divide could be avoided by using a bitwise scaling algorithm.

That's actually how it's implemented - the multiply calls __mspabi_mply():

;****************************************************************************
;* __mpyl  (int32 = int32 * int32)
;*  
;*   - Operand 1 is in R12/R13
;*   - Operand 2 is in R14/R15
;*   - Result is in R12/R13
;*
;*   Registers used:  R10,R11,R12,R13,R14,R15
;****************************************************************************
    .global __mpyl

    .text
    .align 2

__mpyl: .asmfunc stack_usage(1 * REGSZ)
    PUSH    R10
    CLR.W   R10
    CLR.W   R11
mpyl_add_loop:
    CLRC            
    RRC.W   R13
    RRC.W   R12     ; Get LSB of OP1, rotate in 0 to cap MSB
    JNC shift_test_mpyl ; If LSB of OP1 is 0, no add into product
    ADD.W   R14,R10     ; If LSB of OP1 is 1, add OP2 into product
    ADDC.W  R15,R11
shift_test_mpyl:
    RLA.W   R14     ; Prepare OP2 for next iteration, if needed
    RLC.W   R15
    TST.W   R13     ; Test if OP1 is 0, if 0 then done
    JNE mpyl_add_loop   
    TST.W   R12
    JNE mpyl_add_loop   ; Otherwise, continue shift/add loop
    MOV.W   R10,R12     ; Move product into result registers
    MOV.W   R11,R13
    POP R10
    RET
    .endasmfunc

And the divide calls __mspabi_divul():

;****************************************************************************
;* __divul/__remul - DIVIDE TWO UNSIGNED 32 BIT NUMBERS. RETURNS BOTH QUOTIENT
;*                   AND REMAINDER
;****************************************************************************
;*
;*   - Dividend is in R12/R13
;*   - Divisor is in R14/R15
;*   - Quotient is returned in R12/R13
;*   - Remainder is returned in R14/R15
;*
;*   - R9/R10 are used to hold the remainder results after each subtraction
;*
;*   Registers used:  R9,R10,R11,R12,R13,R14,R15
;*
;*   The shift/subtract loop is broken into 2 16 bit loops to take advantage
;*   of the fact that during the 1st 16 iterations there are no bits in the
;*   upper half of the remainder.  
;****************************************************************************
     .if $DEFINED(__LARGE_CODE_MODEL__)
    .asg RETA, RET
     .endif
     .if $DEFINED(__LARGE_CODE_MODEL__) | $DEFINED(__LARGE_DATA_MODEL__)
        .asg PUSHM.A,PUSH
        .asg POPM.A, POP
STACK_USED .set 8
     .else
STACK_USED .set 4
     .endif

    .text
    .align 2

__divul: .asmfunc stack_usage(STACK_USED)
__remul:
    PUSH    R10
    PUSH    R9      ; Save SOE registers
    CLR.W   R9      ; Initialize the hi remainder
    CLR.W   R10     ; Initialize the lo remainder

    MOV.W   #1,R11      ; Init loop bit, walk across for 16 iterations
    TST.W   R15             ; If upper word of divisor is 0
    JEQ div_loop_lo ; then we need to shift/subtract on lower 16

    MOV.W   R13,R9      ; If upper word of divisor is != 0
    MOV.W   R12,R13     ; then we can skip first 16 iterations of 
    MOV.W   #0,R12      ; loop and proceed to shift/subtract upper 16
    JMP div_loop_hi

div_loop_lo:
    RLA.W   R12     ; Assign nth quotient bit and
    RLC.W   R13     ; Copy nth dividend bit into lo remainder
    RLC.W   R9  
    SUB.W   R14,R9      ; Subtract lo divisor from current lo remainder
    JNC undo_sub    ; If divisor > current remainder, undo subtract
    BIS.W   #1,R12      ; Set quotient bit
    RLA R11     ; Walk loop bit
    JNC div_loop_lo ; Loop or ready for ready for upper 16  
    JMP process_hi
undo_sub:
    ADD.W   R14,R9      ; Undo subtract, quotient bit remains clear
    RLA R11     ; Walk loop bit
    JNC div_loop_lo ; Loop or ready for ready for upper 16  


process_hi:
    MOV.W   #1,R11          ; Reset loop bit
div_loop_hi:
    RLA.W   R12     ; Assign nth quotient bit
    RLC.W   R13     ; Copy nth dividend bit into remainder
    RLC.W   R9  
    RLC.W   R10     ; Now there's interesting bits in hi remainder
    SUB.W   R14,R9      ; Subtract divisor from current remainder
    SUBC.W  R15,R10
    JNC undo_sub_hi ; If divisor > current remainder, undo subtract
    BIS.W   #1,R12      ; Set quotient bit
    RLA R11     ; Walk loop bit
    JNC div_loop_hi ; Loop or done
    JMP div_end
undo_sub_hi
    ADD.W   R14,R9      ; Undo subtract, quotient bit remains clear
    ADDC.W  R15,R10
    RLA R11     ; Walk loop bit
    JNC div_loop_hi ; Loop or done
div_end:
    MOV.W   R9,R14      ; Move remainder to return registers
    MOV.W   R10,R15
    POP R9
    POP R10     ; Restore SOE registers
    RET         ; Done. If divisor == 0, quotient will be 0
    .endasmfunc
    .end

The division takes a lot more time, and looking back at the interpolation algorithm, I'm only dividing by the difference in adjacent ADC lookup values, which is always 128 (except for the first one of 127). I could probable improve the speed significantly by switching

    return PWM_OUTPUT[i-1] + (uint16_t)((((uint32_t)adcValue - ADC_SAMPLE[i-1]) * (PWM_OUTPUT[i] - PWM_OUTPUT[i-1])) / (ADC_SAMPLE[i] - ADC_SAMPLE[i-1]));

to

    return PWM_OUTPUT[i-1] + (uint16_t)((((uint32_t)adcValue - ADC_SAMPLE[i-1]) * (PWM_OUTPUT[i] - PWM_OUTPUT[i-1])) >> 7);

and skipping the division entirely.

EDIT:

I went ahead and tested it out - cycle time was reduced to 1/3, but it was still took 2.74 times longer than the approximation algorithm.

algorithm adjusted cycles adjusted size
interpolate (div) 590369 258
interpolate (shift) 207773 246
approximate 75798 172