Non Power of Two Fast Fourier Transform

The Fast Fourier Transform (FFT) is most efficient for vectors whose length is a power of two. However, sometimes you need the Fourier transform of data whose length is not a power of two. After two, the next most common factor of arbitrary lengths is three. This article describes how FFTs of power of three and power of six sizes can be implemented efficiently.

The Fast Fourier Transform is a fast algorithm to calculate the discrete Fourier Transform:

F_f = Σ_{t=0}^{N-1} x_t Exp[2iπtf/N]
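
Evaluated directly, this sum costs O(N²) operations. A brute-force reference implementation is handy for checking the faster routines below; a minimal sketch (the name slow_dft is just for illustration), assuming the interleaved real/imaginary double array layout and the sign convention used throughout this article, is:


#include <math.h>

/* O(n^2) reference DFT: in and out each hold n complex values stored as
   interleaved real/imaginary pairs.  Only useful for testing. */
static void slow_dft(const double *in, double *out, int n)
{
	int t, f;
	
	for (f = 0; f < n; f++)
	{
		double sumr = 0.0, sumi = 0.0;
		
		for (t = 0; t < n; t++)
		{
			double c = cos(2 * M_PI * (double) t * f / n);
			double s = sin(2 * M_PI * (double) t * f / n);
			
			/* (in[2t] + i in[2t+1]) * Exp[2 i pi t f / n] */
			sumr += in[2*t] * c - in[2*t + 1] * s;
			sumi += in[2*t] * s + in[2*t + 1] * c;
		}
		
		out[2*f] = sumr;
		out[2*f + 1] = sumi;
	}
}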

Power of Two FFTs

For power of two lengths, this sum can be efficiently evaluated by splitting the problem in half: the even-indexed and odd-indexed points are each handled by a half-size transform.

i.e. F_f = (Σ_{t=0}^{N/2-1} x_{2t} Exp[2iπtf/(N/2)]) + Exp[2iπf/N] (Σ_{t=0}^{N/2-1} x_{2t+1} Exp[2iπtf/(N/2)])

= F_f(even points) + Exp[2iπf/N] F_f(odd points)

The results from the two sub-transforms can be combined to produce the full-length transform in O(n) operations, by multiplying by the "twiddle" or "phase" factors. The recursion takes O(lg(n)) levels, yielding a total computational complexity of O(n lg(n)), a great saving over the brute-force method, which takes O(n²) operations. (This is known as the Cooley-Tukey FFT algorithm.)

If the order of the input data is rearranged into bit-reversed order, then the transformation can be made in-place. This can be done with an algorithm in O(n) steps:


static void swap(double * restrict x, double * restrict y)
{
	double temp = *x;
	*x = *y;
	*y = temp;
}

static void bitreverse(double *data, int n)
{
	int i, j = 1;
	
	for (i = 1; i < n * 2; i += 2)
	{
		int m = n;
		
		if (j > i)
		{
			swap(&data[j-1], &data[i-1]);
			swap(&data[j], &data[i]);
		}
		
		while ((m >= 2) && (j > m))
		{
			j -= m;
			m >>= 1;
		}
		
		j += m;
	}
}
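
For example, with n = 8 the permutation swaps points 1 and 4 (binary 001 and 100) and points 3 and 6 (binary 011 and 110), while points 0, 2, 5, and 7 are their own bit-reversals and stay in place.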

Also, the twiddle factors can be pre-computed:


#include <math.h>
#include <stdlib.h>
#include <err.h>

/* Twiddle factors */
static double *tc;
static double *ts;

static void init_twiddle(int size)
{
	int i;
	
	tc = malloc(sizeof(double) * size);
	ts = malloc(sizeof(double) * size);
	
	if (!tc || !ts) errx(1, "Out of memory in init_twiddle\n");
	
	for (i = 0; i < size; i++)
	{
		tc[i] = cos(2*M_PI*i/size);
		ts[i] = sin(2*M_PI*i/size);
	}
}

static void fini_twiddle(void)
{
	free(tc);
	free(ts);
}

Finally, in some cases the complex multiplications can be made cheaper or avoided entirely. When the twiddle factors are 1, -1, +i, or -i, the multiplications are "free", turning into a relabelling of the data. In addition, when the transformation size is eight, the twiddle factors are symmetric, and can also be handled more efficiently. Using these speed-ups, an algorithm for power of two sized transforms is:


/* Size four FFT of data in bit-reversed order (the twiddle factors are just 1 and i) */
static void fft4_aux(double *data)
{
	double tempr = data[2];
	double tempi = data[3];
	
	data[2] = data[0] - tempr;
	data[3] = data[1] - tempi;
	
	data[0] += tempr;
	data[1] += tempi;

	tempr = data[6];
	tempi = data[7];
	
	data[6] = data[4] - tempr;
	data[7] = data[5] - tempi;
	
	tempr += data[4];
	tempi += data[5];
	
	data[4] = data[0] - tempr;
	data[5] = data[1] - tempi;
	
	data[0] += tempr;
	data[1] += tempi;

	tempr = data[7];
	tempi = data[6];
	
	data[6] = data[2] + tempr;
	data[7] = data[3] - tempi;
	
	data[2] -= tempr;
	data[3] += tempi;
}

static void rec_fft2_aux(double *data, int rsize, int scal)
{
	int size = rsize / 2;
	int ii;
	
	if (rsize == 4)
	{
		fft4_aux(data);
		
		return;
	}

	rec_fft2_aux(data, size, scal * 2);
	rec_fft2_aux(data + rsize, size, scal * 2);
	
	if (rsize == 8)
	{
		double s2 = tc[scal];

		double tempr = data[8];
		double tempi = data[9];
		
		data[8] = data[0] - tempr;
		data[9] = data[1] - tempi;
	
		data[0] += tempr;
		data[1] += tempi;

		tempr = s2 * (data[10] - data[11]);
		tempi = s2 * (data[11] + data[10]);
		
		data[10] = data[2] - tempr;
		data[11] = data[3] - tempi;
	
		data[2] += tempr;
		data[3] += tempi;

		tempr = data[13];
		tempi = data[12];
		
		data[12] = data[4] + tempr;
		data[13] = data[5] - tempi;
	
		data[4] -= tempr;
		data[5] += tempi;

		tempr = s2 * (data[14] + data[15]);
		tempi = s2 * (data[14] - data[15]);
		
		data[14] = data[6] + tempr;
		data[15] = data[7] - tempi;
	
		data[6] -= tempr;
		data[7] += tempi;

		return;
	}
	
	for (ii = 0; ii < size; ii++)
	{
		int i = ii * 2;
		int j = ii * scal;
		
		double c = tc[j];
		double s = ts[j];
		
		double tempr = c * data[i + rsize] - s * data[i + rsize + 1];
		double tempi = c * data[i + rsize + 1] + s * data[i + rsize];
		
		data[i + rsize] = data[i] - tempr;
		data[i + rsize + 1] = data[i + 1] - tempi;
	
		data[i] += tempr;
		data[i + 1] += tempi;
	}
}

static void rec_fft2(double *data, int rsize)
{
	bitreverse(data, rsize);
	
	if (rsize < 4)
	{
		if (rsize == 1) return;
		if (rsize == 2)
		{
			double tempr = data[2];
			double tempi = data[3];
		
			data[2] = data[0] - tempr;
			data[3] = data[1] - tempi;
		
			data[0] += tempr;
			data[1] += tempi;
		
			return;
		}
	}

	rec_fft2_aux(data, rsize, 1);
}

The above keeps the recursion from the derivation explicit, to make the structure of the algorithm clear. However, note that it may be slightly more efficient to convert the recursion into iteration, depending on the processor and compiler.
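
As a usage sketch (a hypothetical driver, not part of the original routines; it assumes the functions above are in the same file and that the transform length is a power of two):


#include <stdlib.h>

int main(void)
{
	int i, n = 1024;
	double *data = malloc(sizeof(double) * n * 2);
	
	if (!data) errx(1, "Out of memory\n");
	
	/* The twiddle table must cover the transform length */
	init_twiddle(n);
	
	/* Example input: a unit impulse, whose transform is all ones */
	for (i = 0; i < n * 2; i++) data[i] = 0.0;
	data[0] = 1.0;
	
	rec_fft2(data, n);
	
	fini_twiddle();
	free(data);
	
	return 0;
}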

The transform can be optimized further. It is more efficient to split it into four parts at each stage rather than two. This is because the twiddle factors for size four transforms are related in a simple way by powers of the fourth root of unity, i. Multiplications by i (and by -1, 1, and -i) are trivial, making the algorithm more efficient. Note that we use a radix four digit reversal, instead of the bit reversal used above.


static void fft4_aux4(double *data)
{
	double sr0 = data[0] + data[4];
	double si0 = data[1] + data[5];

	double dr0 = data[0] - data[4];
	double di0 = data[1] - data[5];

	double sr1 = data[2] + data[6];
	double si1 = data[3] + data[7];

	double dr1 = data[2] - data[6];
	double di1 = data[3] - data[7];

	data[0] = sr0 + sr1;
	data[1] = si0 + si1;

	data[4] = sr0 - sr1;
	data[5] = si0 - si1;

	data[2] = dr0 - di1;
	data[3] = di0 + dr1;

	data[6] = dr0 + di1;
	data[7] = di0 - dr1;
}


static void rec_fft4_aux(double *data, int rsize, int scal)
{
	int size = rsize / 4;
	int rhsize = rsize / 2;
	int ii;

	if (rsize == 4)
	{
		fft4_aux4(data);
		
		return;
	}

	rec_fft4_aux(data, size, scal * 4);
	rec_fft4_aux(data + rhsize, size, scal * 4);
	rec_fft4_aux(data + rsize, size, scal * 4);
	rec_fft4_aux(data + rsize + rhsize , size, scal * 4);
	
	for (ii = 0; ii < size; ii++)
	{
		int i = ii * 2;
		int j1 = ii * 1 * scal;
		int j2 = ii * 2 * scal;
		int j3 = ii * 3 * scal;
		
		double c1 = tc[j1];
		double s1 = ts[j1];
		
		double tempr1 = c1 * data[i + rhsize] - s1 * data[i + rhsize + 1];
		double tempi1 = c1 * data[i + rhsize + 1] + s1 * data[i + rhsize];
		
		double c2 = tc[j2];
		double s2 = ts[j2];
		
		double tempr2 = c2 * data[i + rsize] - s2 * data[i + rsize + 1];
		double tempi2 = c2 * data[i + rsize + 1] + s2 * data[i + rsize];
		
		double c3 = tc[j3];
		double s3 = ts[j3];
		
		double tempr3 = c3 * data[i + rhsize + rsize] - s3 * data[i + rhsize + rsize + 1];
		double tempi3 = c3 * data[i + rhsize + rsize + 1] + s3 * data[i + rhsize + rsize];
		
		double sr0 = data[i] + tempr2;
		double si0 = data[i + 1] + tempi2;
		
		double dr0 = data[i] - tempr2;
		double di0 = data[i + 1] - tempi2;
		
		double sr1 = tempr1 + tempr3;
		double si1 = tempi1 + tempi3;
		
		double dr1 = tempr1 - tempr3;
		double di1 = tempi1 - tempi3;
		
		data[i] = sr0 + sr1;
		data[i + 1] = si0 + si1;
		
		data[i + rsize] = sr0 - sr1;
		data[i + rsize + 1] = si0 - si1;
		
		data[i + rhsize] = dr0 - di1;
		data[i + rhsize + 1] = di0 + dr1;
		
		data[i + rhsize + rsize] = dr0 + di1;
		data[i + rhsize + rsize + 1] = di0 - dr1;
	}
}

/* Radix reverse with Ruis algorithm */
static void radixreverse(double *data, int rsize, int m)
{
	int k, l, r, i, nm;
	
	int *j = malloc(sizeof(int) * rsize);
	if (!j) errx(1, "OOM in radixreverse()\n");
	
	j[0] = 0;
	
	nm = rsize/m;
	
	for (i = 1; i < m; i++)
	{
		j[i] = i * nm;
		
		/* Skip self-swaps (i == j[i] happens when rsize == m) */
		if (i != j[i])
		{
			swap(&data[i * 2], &data[j[i] * 2]);
			swap(&data[i * 2 + 1], &data[j[i] * 2 + 1]);
		}
	}
	
	for (i = m, l = m, r=nm/m; r > 0; l *= m, r /= m)
	{
		for (k = 0; k < (m-1)*l; k++, i++)
		{
			j[i] = j[k] + r;
			if (i < j[i])
			{
				swap(&data[i * 2], &data[j[i] * 2]);
				swap(&data[i * 2 + 1], &data[j[i] * 2 + 1]);
			}
		}
	}
	
	free(j);
}

static void rec_fft4(double *data, int rsize)
{
	radixreverse(data, rsize, 4);
		
	if (rsize < 4)
	{
		if (rsize == 1) return;
		if (rsize == 2)
		{
			double tempr = data[2];
			double tempi = data[3];
		
			data[2] = data[0] - tempr;
			data[3] = data[1] - tempi;
		
			data[0] += tempr;
			data[1] += tempi;
		
			return;
		}
	}

	rec_fft4_aux(data, rsize, 1);
}

The above is a fair bit more efficient, taking about 4.25n lg(n) + O(n) operations for radix 4, compared to 5n lg(n) + O(n) for the radix 2 algorithm. However, it can be improved further. The trick is to use a split-radix algorithm, which uses a half-sized sub-transform for the even points, and two quarter-sized sub-transforms for the odd points. This wins because the half-sized sub-transform needs no twiddle factors when it is combined into the full-sized transform.
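
In the notation used earlier, the split being computed is

F_f = (Σ_{t=0}^{N/2-1} x_{2t} Exp[2iπtf/(N/2)]) + Exp[2iπf/N] (Σ_{t=0}^{N/4-1} x_{4t+1} Exp[2iπtf/(N/4)]) + Exp[6iπf/N] (Σ_{t=0}^{N/4-1} x_{4t+3} Exp[2iπtf/(N/4)])

The resulting algorithm looks something like: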


static void rec_ffts_aux(double *data, int rsize, int scal)
{
	int size = rsize / 4;
	int rhsize = rsize / 2;
	int ii;
	
	if (rsize == 2)
	{
		double tempr = data[2];
		double tempi = data[3];
		
		data[2] = data[0] - tempr;
		data[3] = data[1] - tempi;
		
		data[0] += tempr;
		data[1] += tempi;
		return;
	}

	if (rsize == 4)
	{
		fft4_aux(data);
		
		return;
	}
	
	rec_ffts_aux(data, rhsize, scal * 2);
	rec_ffts_aux(data + rsize, size, scal * 4);
	rec_ffts_aux(data + rsize + rhsize, size, scal * 4);

	if (rsize == 8)
	{
		double s2 = tc[scal];
		
		double tempr1 = data[8];
		double tempi1 = data[9];
				
		double tempr3 = data[12];
		double tempi3 = data[13];
				
		double sr1 = tempr1 + tempr3;
		double si1 = tempi1 + tempi3;
		
		double dr1 = tempr1 - tempr3;
		double di1 = tempi1 - tempi3;
		
		data[8] = data[0] - sr1;
		data[9] = data[1] - si1;
		
		data[12] = data[4] + di1;
		data[13] = data[5] - dr1;
		
		data[0] += sr1;
		data[1] += si1;
		
		data[4] -= di1;
		data[5] += dr1;

		tempr1 = s2 * (data[10] - data[11]);
		tempi1 = s2 * (data[10] + data[11]);
		
		tempr3 = s2 * (data[14] + data[15]);
		tempi3 = s2 * (data[14] - data[15]);
		
		sr1 = tempr1 - tempr3;
		si1 = tempi1 + tempi3;
		
		dr1 = tempr1 + tempr3;
		di1 = tempi1 - tempi3;
		
		data[10] = data[2] - sr1;
		data[11] = data[3] - si1;
		
		data[14] = data[6] + di1;
		data[15] = data[7] - dr1;
		
		data[2] += sr1;
		data[3] += si1;
		
		data[6] -= di1;
		data[7] += dr1;

		return;
	}

	for (ii = 0; ii < size; ii++)
	{
		int i = ii * 2;
		int j1 = ii * scal;
		int j3 = ii * 3 * scal;
		
		double c1 = tc[j1];
		double s1 = ts[j1];
		
		double tempr1 = c1 * data[i + rsize] - s1 * data[i + rsize + 1];
		double tempi1 = c1 * data[i + rsize + 1] + s1 * data[i + rsize];
				
		double c3 = tc[j3];
		double s3 = ts[j3];
		
		double tempr3 = c3 * data[i + rhsize + rsize] - s3 * data[i + rhsize + rsize + 1];
		double tempi3 = c3 * data[i + rhsize + rsize + 1] + s3 * data[i + rhsize + rsize];
				
		double sr1 = tempr1 + tempr3;
		double si1 = tempi1 + tempi3;
		
		double dr1 = tempr1 - tempr3;
		double di1 = tempi1 - tempi3;
		
		data[i + rsize] = data[i] - sr1;
		data[i + rsize + 1] = data[i + 1] - si1;
		
		data[i + rhsize + rsize] = data[i + rhsize] + di1;
		data[i + rhsize + rsize + 1] = data[i + rhsize + 1] - dr1;
		
		data[i] += sr1;
		data[i + 1] += si1;
		
		data[i + rhsize] -= di1;
		data[i + rhsize + 1] += dr1;
	}
}

static void rec_ffts(double *data, int rsize)
{
	bitreverse(data, rsize);
	
	if (rsize < 4)
	{
		if (rsize == 1) return;
		if (rsize == 2)
		{
			double tempr = data[2];
			double tempi = data[3];
		
			data[2] = data[0] - tempr;
			data[3] = data[1] - tempi;
		
			data[0] += tempr;
			data[1] += tempi;
		
			return;
		}
	}

	rec_ffts_aux(data, rsize, 1);
}

The above takes 4n lg(n) + O(n) floating point operations to do the transform, which is a small yet significant improvement. Unfortunately, trying to improve this algorithm by splitting into more pieces, or trying something like a 1/2,1/4,1/8,1/8 split doesn't really help. However, the best known power of two algorithm is based on the above. It notes that a complex multiplication may be converted into:

(a+ib)(c+id)=a(1+ib/a)(c+id), or b(a/b+i)(c+id)

This doesn't save any operations as-is, but if the pre-factor is chosen correctly, it may cancel with other multiplications in the algorithm. This is known as the "tangent" method, and was originally discovered by J. Van Buskirk for size 64 transforms. If this is done, then the operation count improves to (34/9)n lg(n) + O(n) for power of two FFTs.
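
As a rough illustration of the idea (only the flavour, not the exact bookkeeping of the tangent FFT): a twiddle factor Exp[2iπf/N] = cos θ + i sin θ, with θ = 2πf/N, can be factored as

cos θ (1 + i tan θ)

Multiplying by (1 + i tan θ) needs only two real multiplications and two additions, and the left-over scalar factors cos θ can be propagated into the scale factors of neighbouring butterflies, where many of them combine or cancel.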

Power of Three FFTs

So now let's look at power of three length transformations. Here, the obvious step is to divide the transformation into three sub-transformations, and recurse as in the power of two case.
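
In the notation used for the power of two case, the decomposition is

F_f = (Σ_{t=0}^{N/3-1} x_{3t} Exp[2iπtf/(N/3)]) + Exp[2iπf/N] (Σ_{t=0}^{N/3-1} x_{3t+1} Exp[2iπtf/(N/3)]) + Exp[4iπf/N] (Σ_{t=0}^{N/3-1} x_{3t+2} Exp[2iπtf/(N/3)])

In code, this looks like: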


static void rec_fft3_aux(double *data, int rsize, int scal)
{
	int ii;
	int size = rsize / 3;
	int rwsize = size * 2;
	
	if (rsize == 1) return;
	
	rec_fft3_aux(data, size, scal * 3);
	rec_fft3_aux(data + rwsize, size, scal * 3);
	rec_fft3_aux(data + rwsize*2, size, scal * 3);
	
	for (ii = 0; ii < size; ii++)
	{
		int i = ii * 2;
		int j1 = ii * scal;
		int j2 = ii * 2 * scal;
		
		/* The primitive cube root of unity: Exp[2iπ/3] = -1/2 + i*sqrt(3)/2 */
		double vr = -0.5;
		double vi = sqrt(3.0)*0.5;
		
		double c1 = tc[j1];
		double s1 = ts[j1];
		
		double xr1 = c1 * data[i + rwsize] - s1 * data[i + rwsize + 1];
		double xi1 = c1 * data[i + rwsize + 1] + s1 * data[i + rwsize];
		
		double tempr1_1 = xr1 * vr - xi1 * vi;
		double tempi1_1 = xr1 * vi + xi1 * vr;
		
		double tempr1_2 = xr1 * vr + xi1 * vi;
		double tempi1_2 = -xr1 * vi + xi1 * vr;
				
		double c2 = tc[j2];
		double s2 = ts[j2];
		
		double xr2 = c2 * data[i + rwsize*2] - s2 * data[i + rwsize*2 + 1];
		double xi2 = c2 * data[i + rwsize*2 + 1] + s2 * data[i + rwsize*2];
		
		double tempr2_1 = xr2 * vr - xi2 * vi;
		double tempi2_1 = xr2 * vi + xi2 * vr;
			
		double tempr2_2 = xr2 * vr + xi2 * vi;
		double tempi2_2 = -xr2 * vi + xi2 * vr;
				
		data[i + rwsize] = data[i] + tempr1_1 + tempr2_2;
		data[i + rwsize + 1] = data[i + 1] + tempi1_1 + tempi2_2;
		
		data[i + rwsize * 2] = data[i] + tempr1_2 + tempr2_1;
		data[i + rwsize * 2 + 1] = data[i + 1] + tempi1_2 + tempi2_1;
		
		data[i] += xr1 + xr2;
		data[i + 1] += xi1 + xi2;
	}
}

static void rec_fft3(double *data, int rsize)
{
	if (rsize == 1) return;
	
	radixreverse(data, rsize, 3);
	
	rec_fft3_aux(data, rsize, 1);
}

The problem with the above is that the twiddle factors are not related by a nice power of i, so extra full complex multiplications are required, slowing down the transformation. (This is the reason that non power of two transformations are slow; the same problem appears for every other radix.) Some of the above can be optimized by rearranging the operations and removing common sub-expressions. However, this doesn't alter the fact that the extra multiplications kill performance.

The new trick discussed here is a way to remove the cost of many of these extra complex multiplications. The key observation is that we don't have to represent numbers in the form a + ib; we can re-express them as c + kd, where k is chosen to make the extra multiplications trivial. Here, we make the choice k = (1 + i√3)/2. This k has several nice properties:

k² = k - 1
k³ = -1
k⁴ = -k
k⁵ = -k + 1

Addition of k-tuples is easy: (a+kb)+(c+kd) = (a+c)+k(b+d)

Multiplication is also nice: (a+kb)(c+kd) = ac + k(ad+bc) + k²bd = ac + k(ad+bc) + (k-1)bd = (ac-bd) + k(ad+bc+bd)

The extra complex multiplications in the FFT turn into multiplications by k² and k⁴, which in turn simplify:

k²(a+kb) = (k-1)(a+kb) = -a - kb + ka + (k-1)b = -(a+b) + ka
k⁴(a+kb) = -k(a+kb) = -ka - (k-1)b = b - k(a+b)

These multiplications thus turn into additions of the real and k parts, together with negations that can be absorbed into other calculations in the FFT.

By re-arranging the calculation of the k-tuple multiply, we can perform it in the same number of operations as a normal complex multiply.

(a+kb)(c+kd) = (ac-bd) + k(ad+bc+bd)
= (a+b)c - b(c+d) + k(b(c+d) + ad)
= a(c+d) - (a+b)d + k((a+b)d + bc)
= ac - bd + k((a+b)(c+d) - ac)

Since c and d are known in the twiddle multiplication, so is their sum, which can be precomputed. Using any of the above identities then reduces the k-tuple twiddle multiply to six floating point operations.
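
As a sketch, the k-tuple twiddle multiplication can be packaged as a small helper (a hypothetical function, named kmul here for illustration; the transform code below performs the same computation inline, reading c, d, and c + d from precomputed tables):


/* Multiply the k-tuple a + k*b by the twiddle factor c + k*d, where
   cd holds the precomputed sum c + d.  Three multiplications and
   three additions: six floating point operations in total. */
static void kmul(double a, double b, double c, double d, double cd,
		double *outr, double *outk)
{
	double t = (a + b) * c;		/* ac + bc */
	double u = b * cd;		/* bc + bd */
	
	*outr = t - u;			/* ac - bd */
	*outk = u + a * d;		/* ad + bc + bd */
}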

Using the above formulation, the FFT of power of three sizes turns into:


/* Complex to k: a + ib = c + kd with k = (1 + i*sqrt(3))/2,
   so d = 2b/sqrt(3) and c = a - b/sqrt(3) */
static void itok(double *c)
{
	c[1] *= 1.0/sqrt(3.0);
	c[0] -= c[1];
	c[1] += c[1];
}

/* k to complex: a = c + d/2, b = d*sqrt(3)/2 */
static void ktoi(double *c)
{
	c[0] += 0.5 * c[1];
	c[1] *= sqrt(3.0)*0.5;
}

static void datatok(double *data, int rsize)
{
	int i;
	
	for (i = 0; i < rsize * 2; i += 2)
	{
		itok(data + i);
	}
}

static void datatoi(double *data, int rsize)
{
	int i;
	
	for (i = 0; i < rsize * 2; i += 2)
	{
		ktoi(data + i);
	}
}

static void fast_rec_fft3_aux(double *data, int rsize, int scal)
{
	int ii;
	int size = rsize / 3;
	int rwsize = size * 2;
	
	if (rsize == 3)
	{
		double xr1 = data[2];
		double xi1 = data[3];
		
		double xa1 = xr1 + xi1;
				
		double xr2 = data[4];
		double xi2 = data[5];
		
		double xa2 = xr2 + xi2;
				
		data[2] = data[0] - xa1 + xi2;
		data[3] = data[1] + xr1 - xa2;
		
		data[4] = data[0] + xi1 - xa2;
		data[5] = data[1] - xa1 + xr2;
		
		data[0] += xr1 + xr2;
		data[1] += xi1 + xi2;
	
		return;
	}
		
	fast_rec_fft3_aux(data, size, scal * 3);
	fast_rec_fft3_aux(data + rwsize, size, scal * 3);
	fast_rec_fft3_aux(data + rwsize*2, size, scal * 3);
	
	for (ii = 0; ii < size; ii++)
	{
		int i = ii * 2;
		int j1 = ii * scal;
		int j2 = ii * 2 * scal;
		
		/* Twiddle multiply in k form: (a+kb)(c+kd) = (a+b)c - b(c+d) + k(b(c+d) + ad),
		   with c = tkc[j], c+d = tkx[j], d = tks[j].  The sum xa = xr + xi is also
		   formed, since the k^2 and k^4 relabellings need it. */
		double x1 = data[i + rwsize] + data[i + rwsize + 1];
		double x1c = tkc[j1] * x1;
		double x1b = tkx[j1] * data[i + rwsize + 1];
		double x1d = tks[j1] * data[i + rwsize];
		double xr1 = x1c - x1b;
		double xi1 = x1b + x1d;
		double xa1 = x1c + x1d;
		
		double x2 = data[i + rwsize*2] + data[i + rwsize*2 + 1];
		double x2c = tkc[j2] * x2;
		double x2b = tkx[j2] * data[i + rwsize*2 + 1];
		double x2d = tks[j2] * data[i + rwsize*2];
		double xr2 = x2c - x2b;
		double xi2 = x2b + x2d;
		double xa2 = x2c + x2d;
				
		data[i + rwsize] = data[i] - xa1 + xi2;
		data[i + rwsize + 1] = data[i + 1] + xr1 - xa2;
		
		data[i + rwsize * 2] = data[i] + xi1 - xa2;
		data[i + rwsize * 2 + 1] = data[i + 1] - xa1 + xr2;
		
		data[i] += xr1 + xr2;
		data[i + 1] += xi1 + xi2;
	}
}

static void fast_rec_fft3(double *data, int rsize)
{
	if (rsize == 1) return;
	
	radixreverse(data, rsize, 3);
	
	datatok(data, rsize);
	
	fast_rec_fft3_aux(data, rsize, 1);
	
	datatoi(data, rsize);
}

Where the twiddle factors in k-tuple form are calculated by


/* Twiddle factors in k-tuple form */
static double *tkc;
static double *tks;
static double *tkx;

static void init_twiddle(int size)
{
	int i;
	
	tkc = malloc(sizeof(double) * size);
	tks = malloc(sizeof(double) * size);
	tkx = malloc(sizeof(double) * size);
	
	if (!tkc || !tks || !tkx) errx(1, "Out of memory in init_twiddle\n");
	
	for (i = 0; i < size; i++)
	{
		double tc = cos(2*M_PI*i/size);
		double ts = sin(2*M_PI*i/size);
		
		/* Convert to k tuple */
		tks[i] = ts/sqrt(3.0);
		tkc[i] = tc - tks[i];
		tks[i] += tks[i];
		
		tkx[i] = tks[i] + tkc[i];
	}
}

static void fini_twiddle(void)
{
	free(tkc);
	free(tks);
	free(tkx);
}

What this has done is use 3n floating point operations to convert to k-tuple form, and 3n to convert back. If the size of the transformation is large enough, this cost is dominated by the improvement to the O(n lg(n)) term in the algorithm, giving an asymptotic operation count of 26/(3 lg 3) n lg(n) + O(n) ~ 5.47n lg(n) + O(n). This is still a fair bit slower than power of two sizes, but is much faster than the standard algorithm.

Power of Six FFTs

The above k-tuple trick also works for power of six sized transformations. (In fact, k is the primitive sixth root of unity.) Doing this gains us the same type of speed increase that going from radix 2 to radix 4 gave.
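
Since k = Exp[iπ/3], every entry Exp[2iπtf/6] of the size six DFT matrix is a power of k, and by the identities listed earlier each power is one of 1, k, k-1, -1, -k, or 1-k. In k-tuple form all of these multiplications reduce to additions and sign changes, which is why the size six butterfly in the code below contains no multiplications at all.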


static void fast_rec_fft6_aux(double *data, int rsize, int scal)
{
	int ii;
	int size = rsize / 6;
	int rwsize = size * 2;
	
	if (rsize == 6)
	{
		double xr1 = data[2];
		double xi1 = data[3];
		double x1 = xr1 + xi1;
		
		double xr2 = data[4];
		double xi2 = data[5];
		double x2 = xr2 + xi2;
		
		double xr4 = data[8];
		double xi4 = data[9];
		double x4 = xr4 + xi4;
		
		double xr5 = data[10];
		double xi5 = data[11];
		double x5 = xr5 + xi5;
		
		double t1 = data[6] + xr1 + xr5;
		double t2 = data[0] + xr2 + xr4;
		
		double t3 = data[7] + xi1 + xi5;
		double t4 = data[1] + xi2 + xi4;
		
		double t5 = data[6] - x5 + xi1;
		double t6 = data[0] - x2 + xi4;
		
		double t7 = data[7] - x1 + xr5;
		double t8 = data[1] - x4 + xr2;
		
		double t9 = data[6] - x1 + xi5;
		double t10 = data[0] - x4 + xi2;
		
		double t11 = data[7] - x5 + xr1;
		double t12 = data[1] - x2 + xr4;
		
		data[0] = t1 + t2;
		data[1] = t3 + t4;
		
		data[2] = t6 - t5;
		data[3] = t8 - t7;
		
		data[4] = t10 + t9;
		data[5] = t12 + t11;
		
		data[6] = t2 - t1;
		data[7] = t4 - t3;
		
		data[8] = t6 + t5;
		data[9] = t8 + t7;
		
		data[10] = t10 - t9;
		data[11] = t12 - t11;
	
		return;
	}
	
	fast_rec_fft6_aux(data, size, scal * 6);
	fast_rec_fft6_aux(data + rwsize, size, scal * 6);
	fast_rec_fft6_aux(data + rwsize*2, size, scal * 6);
	fast_rec_fft6_aux(data + rwsize*3, size, scal * 6);
	fast_rec_fft6_aux(data + rwsize*4, size, scal * 6);
	fast_rec_fft6_aux(data + rwsize*5, size, scal * 6);

	for (ii = 0; ii < size; ii++)
	{
		int i = ii * 2;
		int j1 = ii * scal;
		int j2 = ii * 2 * scal;
		int j3 = ii * 3 * scal;
		int j4 = ii * 4 * scal;
		int j5 = ii * 5 * scal;
		
		double x1 = data[i + rwsize] + data[i + rwsize + 1];
		double x1c = tkc[j1] * x1;
		double x1b = tkx[j1] * data[i + rwsize + 1];
		double x1d = tks[j1] * data[i + rwsize];
		double xr1 = x1c - x1b;
		double xi1 = x1b + x1d;
		double xa1 = x1c + x1d;
		
		double x2 = data[i + rwsize*2] + data[i + rwsize*2 + 1];
		double x2c = tkc[j2] * x2;
		double x2b = tkx[j2] * data[i + rwsize*2 + 1];
		double x2d = tks[j2] * data[i + rwsize*2];
		double xr2 = x2c - x2b;
		double xi2 = x2b + x2d;
		double xa2 = x2c + x2d;
		
		double x3 = data[i + rwsize*3] + data[i + rwsize*3 + 1];
		double x3c = tkc[j3] * x3;
		double x3b = tkx[j3] * data[i + rwsize*3 + 1];
		double x3d = tks[j3] * data[i + rwsize*3];
		double xr3 = x3c - x3b;
		double xi3 = x3b + x3d;
		
		double x4 = data[i + rwsize*4] + data[i + rwsize*4 + 1];
		double x4c = tkc[j4] * x4;
		double x4b = tkx[j4] * data[i + rwsize*4 + 1];
		double x4d = tks[j4] * data[i + rwsize*4];
		double xr4 = x4c - x4b;
		double xi4 = x4b + x4d;
		double xa4 = x4c + x4d;
		
		double x5 = data[i + rwsize*5] + data[i + rwsize*5 + 1];
		double x5c = tkc[j5] * x5;
		double x5b = tkx[j5] * data[i + rwsize*5 + 1];
		double x5d = tks[j5] * data[i + rwsize*5];
		double xr5 = x5c - x5b;
		double xi5 = x5b + x5d;
		double xa5 = x5c + x5d;	
			
		double t1 = xr3 + xr1 + xr5;
		double t2 = data[i] + xr2 + xr4;
		
		double t3 = xi3 + xi1 + xi5;
		double t4 = data[i + 1] + xi2 + xi4;
		
		double t5 = xr3 - xa5 + xi1;
		double t6 = data[i] - xa2 + xi4;
		
		double t7 = xi3 - xa1 + xr5;
		double t8 = data[i + 1] - xa4 + xr2;
		
		double t9 = xr3 - xa1 + xi5;
		double t10 = data[i] - xa4 + xi2;
		
		double t11 = xi3 - xa5 + xr1;
		double t12 = data[i + 1] - xa2 + xr4;
		
		data[i] = t1 + t2;
		data[i + 1] = t3 + t4;
		
		data[i + rwsize] = t6 - t5;
		data[i + rwsize + 1] = t8 - t7;
		
		data[i + rwsize * 2] = t10 + t9;
		data[i + rwsize * 2 + 1] = t12 + t11;
		
		data[i + rwsize * 3] = t2 - t1;
		data[i + rwsize * 3 + 1] = t4 - t3;
		
		data[i + rwsize * 4] = t6 + t5;
		data[i + rwsize * 4 + 1] = t8 + t7;
		
		data[i + rwsize * 5] = t10 - t9;
		data[i + rwsize * 5 + 1] = t12 - t11;
	}
}

static void fast_rec_fft6(double *data, int rsize)
{
	if (rsize == 1) return;
	
	radixreverse(data, rsize, 6);
	
	datatok(data, rsize);
	
	fast_rec_fft6_aux(data, rsize, 1);
	
	datatoi(data, rsize);
}

The above computes an FFT in 70/(6 lg 6) n lg(n) + O(n) ~ 4.51n lg(n) + O(n) floating point operations, nearly as fast as a radix four transformation, and actually faster than a radix two transformation. Thus any power of three in a transformation size should probably be combined with a power of two to make a power of six. However, the above is not the end of the matter. We can use the same split-radix trick as was used to improve the radix four case. A 1/2, 1/6, 1/6, 1/6 split seems to be the best here. Doing this yields the following algorithm:


static void record_fft3(int *index, int rsize, int offset, int scal)
{
	int size = rsize / 3;
	
	if (rsize == 1)
	{
		*index = offset;
		return;
	}
	
	record_fft3(index, size, offset, scal * 3);
	record_fft3(index + size, size, offset + scal, scal * 3);
	record_fft3(index + size * 2, size, offset + scal * 2, scal * 3);
}

static void record_fft6(int *index, int rsize, int offset, int scal)
{
	int size = rsize / 6;
	
	if (rsize & 1)
	{
		record_fft3(index, rsize, offset, scal);
		return;
	}
	
	record_fft6(index, size * 3, offset, scal * 2);
	record_fft6(index + size * 3, size, offset + scal, scal * 6);
	record_fft6(index + size * 4, size, offset + scal * 3, scal * 6);
	record_fft6(index + size * 5, size, offset + scal * 5, scal * 6);
}

typedef struct fft6_data fft6_data;
struct fft6_data
{
	int *counts;
	int *swaps;
};

/* Initialize fft permutation */
static void get_fft_permute(int rsize, fft6_data *fd)
{
	int i, j, k;
	
	int start, next, count;
	
	int *temp;
	int *counts = malloc(sizeof(int) * rsize);
	int *swaps = malloc(sizeof(int) * rsize);

	int *index = malloc(sizeof(int) * rsize);
	if (!counts || !swaps || !index) errx(1, "Out of memory in get_fft_permute\n");
	
	record_fft6(index, rsize, 0, 1);
	
	j = 0;
	k = 0;
	for (start = 1; start < rsize;)
	{
		for (count = 0, i = start; index[i] != -1; i = next, count++)
		{
			swaps[j] = i;
			j++;
			next = index[i];
			
			/* Set used locations to a flag value so we can ignore them later */
			index[i] = -1;
		}
		
		counts[k] = count;
		k++;
	
		/* Find a new start: skip used values and empty loops, stopping at the end of the array */
		while ((start < rsize) && ((index[start] == -1) || (index[start] == start))) start++;
	}
	
	counts[k] = 0;
	
	/* Try to shrink memory usage */
	temp = realloc(counts, sizeof(int) * (k + 1));
	if (temp) counts = temp;
	
	fd->counts = counts;
	fd->swaps = swaps;
	
	free(index);
}

/* Invoke saved permutation */
static void invoke_fft_permute(double *data, fft6_data *fd)
{
	int i;
	
	int *counts;
	int *swaps = fd->swaps;
	
	int start, old, new;
	
	double br, bi;
	
	for (counts = fd->counts; *counts; counts++)
	{
		old = *swaps * 2;
		start = old;
		br = data[old];
		bi = data[old + 1];
	
		swaps++;
		
		for (i = 1; i < *counts; i++)
		{
			new = *swaps * 2;
			data[old] = data[new];
			data[old + 1] = data[new + 1];
			
			swaps++;
			old = new;
		}
		
		data[old] = br;
		data[old + 1] = bi;
	}
}

static void split_rec_fft6_aux(double *data, int rsize, int scal)
{
	int ii;
	int size = rsize / 6;
	int rwsize = size * 2;
	
	if (rsize & 1)
	{
		if (rsize == 1) return;
		fast_rec_fft3_aux(data, rsize, scal);
		return;
	}
	
	split_rec_fft6_aux(data, size * 3, scal * 2);
	split_rec_fft6_aux(data + rwsize*3, size, scal * 6);
	split_rec_fft6_aux(data + rwsize*4, size, scal * 6);
	split_rec_fft6_aux(data + rwsize*5, size, scal * 6);
	
	if (rsize == 6)
	{
		double xr1 = data[6];
		double xi1 = data[7];
		double xa1 = xr1 + xi1;
		
		double xr3 = data[8];
		double xi3 = data[9];
		
		double xr5 = data[10];
		double xi5 = data[11];
		double xa5 = xr5 + xi5;
		
		double t1 = xr3 + xr1 + xr5;
		double t2 = xi3 + xi1 + xi5;
		double t3 = xr3 - xa5 + xi1;
		double t4 = xr5 - xa1 + xi3;
		double t5 = xr3 - xa1 + xi5;
		double t6 = xi3 - xa5 + xr1;
				
		data[6] = data[0] - t1;
		data[7] = data[1] - t2;
		
		data[8] = data[2] + t3;
		data[9] = data[3] + t4;
		
		data[10] = data[4] - t5;
		data[11] = data[5] - t6;
		
		data[0] += t1;
		data[1] += t2;
		
		data[2] -= t3;
		data[3] -= t4;
		
		data[4] += t5;
		data[5] += t6;
	
		return;
	}

	for (ii = 0; ii < size; ii++)
	{
		int i = ii * 2;
		int j1 = ii * scal;
		int j3 = ii * 3 * scal;
		int j5 = ii * 5 * scal;
		
		double x1 = data[i + rwsize*3] + data[i + rwsize*3 + 1];
		double x1c = tkc[j1] * x1;
		double x1b = tkx[j1] * data[i + rwsize*3 + 1];
		double x1d = tks[j1] * data[i + rwsize*3];
		double xr1 = x1c - x1b;
		double xi1 = x1b + x1d;
		double xa1 = x1c + x1d;
				
		double x3 = data[i + rwsize*4] + data[i + rwsize*4 + 1];
		double x3c = tkc[j3] * x3;
		double x3b = tkx[j3] * data[i + rwsize*4 + 1];
		double x3d = tks[j3] * data[i + rwsize*4];
		double xr3 = x3c - x3b;
		double xi3 = x3b + x3d;
		
		double x5 = data[i + rwsize*5] + data[i + rwsize*5 + 1];
		double x5c = tkc[j5] * x5;
		double x5b = tkx[j5] * data[i + rwsize*5 + 1];
		double x5d = tks[j5] * data[i + rwsize*5];
		double xr5 = x5c - x5b;
		double xi5 = x5b + x5d;
		double xa5 = x5c + x5d;
		
		double t1 = xr3 + xr1 + xr5;
		double t2 = xi3 + xi1 + xi5;
		double t3 = xr3 - xa5 + xi1;
		double t4 = xr5 - xa1 + xi3;
		double t5 = xr3 - xa1 + xi5;
		double t6 = xi3 - xa5 + xr1;
				
		data[i + rwsize * 3] = data[i] - t1;
		data[i + rwsize * 3 + 1] = data[i + 1] - t2;
		
		data[i + rwsize * 4] = data[i + rwsize] + t3;
		data[i + rwsize * 4 + 1] = data[i + rwsize + 1] + t4;
		
		data[i + rwsize * 5] = data[i + rwsize * 2] - t5;
		data[i + rwsize * 5 + 1] = data[i + rwsize * 2 + 1] - t6;
		
		data[i] += t1;
		data[i + 1] += t2;
		
		data[i + rwsize] -= t3;
		data[i + rwsize + 1] -= t4;
		
		data[i + rwsize * 2] += t5;
		data[i + rwsize * 2 + 1] += t6;
	}
}


static void split_rec_fft6(double *data, int rsize, fft6_data *fd)
{
	if (rsize == 1) return;
	
	invoke_fft_permute(data, fd);
	
	datatok(data, rsize);
	
	split_rec_fft6_aux(data, rsize, 1);
	
	datatoi(data, rsize);
}

Unfortunately, the above requires more storage than the older algorithms, because the permutation required is non-trivial. To reduce its overhead from O(n lg(n)) to O(n) operations, O(n) extra storage is required. The advantage of the above is speed: it reduces the operation count to 44/(6(1 + lg(3)/2)) n lg(n) + O(n) ~ 4.09n lg(n) + O(n). This is only 2% slower than the standard split-radix algorithm, a difference smaller than the overhead of cache effects. (Using power of two strides is particularly bad, and can affect the speed of implementations of the power of two FFT.)
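
As a hypothetical call sequence (assuming the k-tuple version of init_twiddle is the one compiled in, and that the transform length is a power of six such as 7776 = 6^5):


#include <stdlib.h>

static void do_fft6(double *data, int n)
{
	fft6_data fd;
	
	/* Build the k-tuple twiddle tables and the radix six permutation */
	init_twiddle(n);
	get_fft_permute(n, &fd);
	
	split_rec_fft6(data, n, &fd);
	
	free(fd.counts);
	free(fd.swaps);
	fini_twiddle();
}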

The next step is to use something like the Tangent FFT algorithm to further increase speed. However, since the twiddle multiplication algorithm is different, and more twiddle factors are required, a simple conversion is not possible. Another possibility to increase speed on real machines is to convert the recursive algorithm into iteration. (But maintaining the depth-first scanning of the above algorithm is probably best, due to its cache-oblivious properties.)

Finally, note that some of the operations used above are not required if a convolution is wanted. Converting back from k-tuples to complex numbers isn't needed as the convolution's multiplication can be efficiently done in k-tuple form. In addition, convolutions don't care about the order of the frequency elements. Thus the permutation of the array can also be avoided. These two optimizations save O(n) operations.

Can the k-tuple trick be used for other radix FFTs? Unfortunately, the answer is no. In effect, the advantage of the k-tuple system is that it uses particular summation relations between the sixth roots of unity. Such relations don't exist for other bases. However, since three is the second most common factor (other than two), the above tricks can be used quite often in real-world situations.

Comments

sfuerst said...
I've updated the article to fix the O() terminology.
Lars Gregersen said...
It would be nice to learn how these methods compare to FFTW:
http://www.fftw.org/

FFTW handles input vectors of arbitrary size (and as long as the number of elements is not a prime number it is really fast)
sfuerst said...
FFTW factorizes the length of its input, and uses a suite of algorithms customized for the problem at hand. For powers of two, I gather that it uses the Tangent FFT method. For powers of three and six, I'm fairly sure it stays in complex representation. However, it's highly likely that it uses symmetries of the third and sixth roots of unity to simplify the calculation somewhat from a naive implementation.

Unfortunately, the FFTW source code isn't exactly easy to read. (Much of the inner engine is machine-generated, and highly obfuscated by macros.)
DX said...
Can you kindly explain me in a little bit of detail the radix 3 algorithm :)
