Direct 2:3 (8:12) DCT upscaling with fast 12x12 IDCT

A fast 16x16 IDCT has been developed to provide a 'djpeg -scale 2/1' option for 1:2 (8:16) image upscaling.
This feature can also be used for direct internal color upsampling during normal (1:1) decoding of 2hx2v color-subsampled JPEG images (and also of 2hx1v and 1hx2v images with the enhanced library).

A fast 12x12 IDCT has been developed to provide a 'djpeg -scale 3/2' option for 2:3 (8:12) image upscaling.
This is interesting because it is the first non-integer upsampling feature in the IJG context.

Computational efficiency of algorithms

The computational efficiency of the developed algorithms is estimated here by the number of multiplications (mults) per output pixel. The dequantization mults are not taken into account. Including them would make the upscaling results look even better, because only the input values (fewer than the output values in the case of upscaling) need to be dequantized.

Derivation of the fast 12x12 IDCT algorithm

Using the NxN DCT formula we can derive the 12x12 DCT as follows.
Note that we only need to calculate the upper 8 of the 12 full matrix rows, since they are applied only to 8x8 input values (see the transposed IDCT scheme below).
The dotted vertical line in the middle marks the central (anti)symmetry axis, so we can simplify the computation further by calculating only the left half and mirroring it to the right half: even rows are mirrored as is, odd rows with inverted sign.
Furthermore we omit the scalar scaling factor because it turns out that the final scaling is always by a factor of 1/8 for any output (I)DCT size.
col   0     1     2     3     4     5     6     7     8     9    10    11
index
                                       :
  /  C6    C6    C6    C6    C6    C6  : C6    C6    C6    C6    C6    C6  \
  |                                    :                                   |
  |  C1    C3    C5    C7    C9    C11 :-C11  -C9   -C7   -C5   -C3   -C1  |
  |                                    :                                   |
  |  C2    C6    C10  -C10  -C6   -C2  :-C2   -C6   -C10   C10   C6    C2  |
  |                                    :                                   |
  |  C3    C9   -C9   -C3   -C3   -C9  : C9    C3    C3    C9   -C9   -C3  |
  |                                    :                                   |
  |  C4    0    -C4   -C4    0     C4  : C4    0    -C4   -C4    0     C4  |
  |                                    :                                   |
  |  C5   -C9   -C1   -C11   C3    C7  :-C7   -C3    C11   C1    C9   -C5  |
  |                                    :                                   |
  |  C6   -C6   -C6    C6    C6   -C6  :-C6    C6    C6   -C6   -C6    C6  |
  |                                    :                                   |
  |  C7   -C3   -C11   C1   -C9   -C5  : C5    C9   -C1    C11   C3   -C7  |
  |....................................:...................................|
  |                                    :                                   |

  where  Ck = cos(k*pi/24)
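
The matrix entries and the mirror rule can be checked numerically. The following is a minimal sketch, not part of the original derivation; the helper names (C, dct_entry, mirror_sign) are made up for illustration. It assumes the unscaled entry cos((2*col+1)*row*pi/24), with C6 used for row 0 as in the matrix above.

```python
import math

N = 12     # size of the full DCT matrix
ROWS = 8   # only the upper 8 rows are applied to 8x8 input


def C(k):
    """Ck = cos(k*pi/24), as defined in the text."""
    return math.cos(k * math.pi / 24)


def dct_entry(row, col):
    """Unscaled 12-point DCT entry; row 0 is written as C6 in the matrix."""
    if row == 0:
        return C(6)
    return math.cos((2 * col + 1) * row * math.pi / (2 * N))


matrix = [[dct_entry(r, c) for c in range(N)] for r in range(ROWS)]


def mirror_sign(row):
    """+1 for even rows (symmetric), -1 for odd rows (antisymmetric)."""
    return 1 if row % 2 == 0 else -1


# Largest deviation between the right half and the mirrored left half.
max_err = max(
    abs(matrix[r][N - 1 - c] - mirror_sign(r) * matrix[r][c])
    for r in range(ROWS)
    for c in range(N // 2)
)
print("max mirror error:", max_err)
```

A result at the level of floating-point rounding confirms that only the left half of each row has to be computed.
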
Now the IDCT is the transpose of the DCT, hence
col   0     1     2     3     4     5     6     7
index

  /  C6    C1    C2    C3    C4    C5    C6    C7
  |
  |  C6    C3    C6    C9    0    -C9   -C6   -C3
  |
  |  C6    C5    C10  -C9   -C4   -C1   -C6   -C11
  |
  |  C6    C7   -C10  -C3   -C4   -C11   C6    C1
  |
  |  C6    C9   -C6   -C3    0     C3    C6   -C9
  |
  |  C6    C11  -C2   -C9    C4    C7   -C6   -C5
  |----------------------------------------------
  |  C6   -C11  -C2    C9    C4   -C7   -C6    C5
  |
  |  C6   -C9   -C6    C3    0    -C3    C6    C9
  |
  |  C6   -C7   -C10   C3   -C4    C11   C6   -C1
  |
  |  C6   -C5    C10   C9   -C4    C1   -C6    C11
  |
  |  C6   -C3    C6   -C9    0     C9   -C6    C3
  |
  \  C6   -C1    C2   -C3    C4   -C5    C6   -C7
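
In the transposed scheme the mirror axis is the horizontal line: row 11-n repeats row n with the sign of the odd columns inverted. A minimal sketch of how this halves the work, assuming the same unscaled entries as above; the function names are hypothetical and the input values are arbitrary test data.

```python
import math

N, K = 12, 8   # 12 output samples from 8 input coefficients


def basis(k, n):
    """Unscaled IDCT entry for output n and coefficient k (C6 for k = 0)."""
    if k == 0:
        return math.cos(6 * math.pi / 24)
    return math.cos((2 * n + 1) * k * math.pi / (2 * N))


def idct12_direct(x):
    """Reference: full 12x8 matrix-vector product."""
    return [sum(basis(k, n) * x[k] for k in range(K)) for n in range(N)]


def idct12_mirrored(x):
    """Evaluate only rows 0..5, then mirror them to rows 11..6."""
    out = [0.0] * N
    for n in range(N // 2):
        even = sum(basis(k, n) * x[k] for k in range(0, K, 2))
        odd = sum(basis(k, n) * x[k] for k in range(1, K, 2))
        out[n] = even + odd           # upper half
        out[N - 1 - n] = even - odd   # lower half: odd columns change sign
    return out


x = [10.0, -3.0, 2.5, 1.0, -0.5, 4.0, 0.25, -1.5]   # arbitrary test input
a = idct12_direct(x)
b = idct12_mirrored(x)
print(max(abs(p - q) for p, q in zip(a, b)))
```

Both functions should agree up to rounding. The split into an even and an odd part is what the optimizations further down operate on.
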
With ck = sqrt(2) * Ck and C6 = 1/sqrt(2) we get
col   0     1     2     3     4     5     6     7
index

  /  1     c1    c2    c3    c4    c5    1     c7
  |
  |  1     c3    1     c9    0    -c9   -1    -c3
  |
  |  1     c5    c10  -c9   -c4   -c1   -1    -c11
  |
  |  1     c7   -c10  -c3   -c4   -c11   1     c1
  |
  |  1     c9   -1    -c3    0     c3    1    -c9
  |
  |  1     c11  -c2   -c9    c4    c7   -1    -c5
  |----------------------------------------------
  |  1    -c11  -c2    c9    c4   -c7   -1     c5
  |
  |  1    -c9   -1     c3    0    -c3    1     c9
  |
  |  1    -c7   -c10   c3   -c4    c11   1    -c1
  |
  |  1    -c5    c10   c9   -c4    c1   -1     c11
  |
  |  1    -c3    1    -c9    0     c9   -1     c3
  |
  \  1    -c1    c2   -c3    c4   -c5    1    -c7

where

  c1  = 1.402114769
  c2  = 1.366025404
  c3  = 1.306562965
  c4  = 1.224744871
  c5  = 1.121971054
 [c6  = 1 not needed]
  c7  = 0.860918669
 [c8  not needed]
  c9  = 0.541196100
  c10 = 0.366025404
  c11 = 0.184591911
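
The listed constants follow directly from ck = sqrt(2) * cos(k*pi/24). A short sketch that recomputes them and compares against the table; the dictionary name and the tolerance are choices made for this example.

```python
import math


def c(k):
    """ck = sqrt(2) * Ck = sqrt(2) * cos(k*pi/24)."""
    return math.sqrt(2.0) * math.cos(k * math.pi / 24)


# Values as listed in the table above (9 decimal places).
listed = {
    1: 1.402114769,
    2: 1.366025404,
    3: 1.306562965,
    4: 1.224744871,
    5: 1.121971054,
    7: 0.860918669,
    9: 0.541196100,
    10: 0.366025404,
    11: 0.184591911,
}

max_dev = max(abs(c(k) - value) for k, value in listed.items())

for k in sorted(listed):
    print(f"c{k:<2} = {c(k):.9f}")
print("max deviation from table:", max_dev)
```
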
Even part optimization (columns 0, 2, 4, 6):
Only 3 multipliers occur in this part: c4, c2, and c10. Furthermore, the values listed above show that c10 = c2 - 1, so the multiplication by c10 can be replaced by a subtraction (c10 * x = c2 * x - x). This leaves just 2 mults in the even part calculation (by the factors c4 and c2).
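
The even part can be sketched as follows. This is an illustration under the assumptions of this text (rows 0..5 of columns 0, 2, 4, 6 read from the matrix above, scaling omitted), not the library's code; the function names are hypothetical.

```python
import math


def c(k):
    return math.sqrt(2.0) * math.cos(k * math.pi / 24)


C2, C4, C10 = c(2), c(4), c(10)


def even_part_direct(x0, x2, x4, x6):
    """Rows 0..5 of columns 0, 2, 4, 6, read straight from the matrix."""
    return [
        x0 + C2 * x2 + C4 * x4 + x6,
        x0 + x2 - x6,
        x0 + C10 * x2 - C4 * x4 - x6,
        x0 - C10 * x2 - C4 * x4 + x6,
        x0 - x2 + x6,
        x0 - C2 * x2 + C4 * x4 - x6,
    ]


def even_part_fast(x0, x2, x4, x6):
    """Same result with only two multiplications."""
    t4 = C4 * x4     # multiplication 1
    t2 = C2 * x2     # multiplication 2
    t10 = t2 - x2    # c10 * x2, using c10 = c2 - 1
    return [
        x0 + t2 + t4 + x6,
        x0 + x2 - x6,
        x0 + t10 - t4 - x6,
        x0 - t10 - t4 + x6,
        x0 - x2 + x6,
        x0 - t2 + t4 - x6,
    ]


d = even_part_direct(10.0, 2.5, -0.5, 0.25)
f = even_part_fast(10.0, 2.5, -0.5, 0.25)
print(max(abs(p - q) for p, q in zip(d, f)))
```

Rows 6..11 need no further work in the even part, since they repeat rows 5..0.
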

Odd part optimization (columns 1, 3, 5, 7):
Rows 1 and 4 form a 'rotation' expression (see the fast 4x4 IDCT derivation) that spans the full odd column block (1, 3, 5, 7). It is 'normalized' by substituting the differences x1 - x7 and x3 - x5, which are then rotated with the factors c3 and c9.
Column 3 has just 2 multipliers (c3 and c9). The remaining elements can be reduced to 8 multiplications.
This gives us 3 (rotation) + 2 (column 3) + 8 = 13 mults in the odd part calculation.
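
The rotation for rows 1 and 4 can be sketched like this. It uses the common 3-multiplication form of a rotation, which is an assumption about the intended factorization; the combined constants c3 - c9 and c3 + c9 would be precomputed in practice. Function names are hypothetical.

```python
import math


def c(k):
    return math.sqrt(2.0) * math.cos(k * math.pi / 24)


C3, C9 = c(3), c(9)
C3_MINUS_C9 = C3 - C9   # precomputed constant
C3_PLUS_C9 = C3 + C9    # precomputed constant


def rotation_direct(x1, x3, x5, x7):
    """Odd-column terms of rows 1 and 4, read straight from the matrix."""
    row1 = C3 * x1 + C9 * x3 - C9 * x5 - C3 * x7
    row4 = C9 * x1 - C3 * x3 + C3 * x5 - C9 * x7
    return row1, row4


def rotation_3mult(x1, x3, x5, x7):
    """Same two values with three multiplications."""
    a = x1 - x7                # substitution x1 - x7
    b = x3 - x5                # substitution x3 - x5
    z = C9 * (a + b)           # multiplication 1
    row1 = z + C3_MINUS_C9 * a   # multiplication 2: c3*a + c9*b
    row4 = z - C3_PLUS_C9 * b    # multiplication 3: c9*a - c3*b
    return row1, row4


r_direct = rotation_direct(-3.0, 1.0, 4.0, -1.5)
r_fast = rotation_3mult(-3.0, 1.0, 4.0, -1.5)
print(r_direct, r_fast)
```
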

Note that the rotation with c3 and c9 is again the same as in the even part of the 8x8 point LL&M IDCT algorithm.
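
Combining the two counts gives 2 + 13 = 15 mults for one 1-D pass over 8 inputs. A rough per-pixel figure can be derived from this if one assumes a separable two-pass scheme (8 column transforms followed by 12 row transforms). That pass structure is an assumption of this sketch and is not stated above.

```python
EVEN_MULTS = 2    # even part, from the text
ODD_MULTS = 13    # odd part, from the text
mults_1d = EVEN_MULTS + ODD_MULTS

# Assumed pass structure: each of the 8 input columns is expanded to
# 12 samples, then each of the 12 resulting rows (8 values wide) is
# expanded to 12 samples.
column_passes = 8
row_passes = 12

total_mults = mults_1d * (column_passes + row_passes)
output_pixels = 12 * 12
mults_per_pixel = total_mults / output_pixels

print(mults_1d, total_mults, round(mults_per_pixel, 2))
```

Dequantization mults are excluded here, in line with the efficiency measure defined earlier.
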