The Blog

DXT Compression Revisited

Pre(r)amble

I made a previous post on a new method for compressing images into the DXT format.  Various sources, including former NVIDIA employee Ignacio Castaño, agreed that the quality was not quite what it should be, and I began a more thorough investigation into why.  Some of my test images also had a few bad artifacts and I needed to iron this out anyway.

The changes I have made are significant and I felt it better to just make a new post rather than edit the previous.  The method of 2 linear regression stages followed by a fix-up, however, is still the same.

On a side note, the reason I have not updated in so long is because I just started a new job at tri-Ace where I help maintain their ASKA engine, showcased last year with this beautiful video.  The CEO and CTO Yoshiharu Gotanda will be speaking at GDC this year so take a seat at his presentation if you are attending GDC.

 

Back to the Start

Because my previous article already provides some background on DXT compression and more information can be found online, I will mention only a few facts that are fairly important.

  1. Texels are compressed into 4×4 blocks.
  2. Each block has 2 16-bit color keys.
  3. 2 more colors are made by interpolating between the 2 key colors at 33.33% and 66.67%, giving the block a total of 4 colors.  Instead of making a distinction between the key colors and interpolated colors, it is easier to just think of 4 colors in a line at points 0.0%, 33.33%, 66.67%, and 100.0%.  Furthermore, it is best to think of these as buckets ranging from 0 to 3.
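The bucket idea above can be made concrete with a small sketch.  This is not the engine's code; `Rgb`, `Expand565`, and `Palette` are made-up names, and real decoders round the 1/3 and 2/3 interpolation slightly differently between vendors, so treat this as an illustration of the 4 buckets only.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// One 8-bit-per-channel color.
struct Rgb { int r, g, b; };

// Expand one RGB565 key color to 8 bits per channel.
Rgb Expand565( std::uint16_t c ) {
	int r5 = (c >> 11) & 0x1F, g6 = (c >> 5) & 0x3F, b5 = c & 0x1F;
	// Replicate the high bits so 0x1F -> 255 and 0x3F -> 255 exactly.
	return { (r5 << 3) | (r5 >> 2), (g6 << 2) | (g6 >> 4), (b5 << 3) | (b5 >> 2) };
}

// Build the 4 bucket colors for an opaque block: the 2 key colors plus
// 2 colors interpolated at 1/3 and 2/3 between them, ordered 0%..100%.
std::array<Rgb, 4> Palette( std::uint16_t key0, std::uint16_t key1 ) {
	Rgb c0 = Expand565( key0 ), c1 = Expand565( key1 );
	Rgb c2 = { (2 * c0.r + c1.r) / 3, (2 * c0.g + c1.g) / 3, (2 * c0.b + c1.b) / 3 };
	Rgb c3 = { (c0.r + 2 * c1.r) / 3, (c0.g + 2 * c1.g) / 3, (c0.b + 2 * c1.b) / 3 };
	std::array<Rgb, 4> p = { { c0, c2, c3, c1 } };	// Buckets 0..3 along the line.
	return p;
}
```

With key colors black and white, the buckets come out to 0, 85, 170, and 255 per channel, which is the evenly spaced line the article describes.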

Essentially, the quality of a DXT compressor depends on how well it picks the key colors.  Here, I will present my revised method of finding good key colors using 2 layers of linear regression.  As with any routine that hopes to do this with any level of decency, it is vital to understand exactly how to determine that one set of colors is better than another.

 

Checking Image Quality

When checking how similar one image is to another, the peak signal-to-noise ratio (PSNR) is typically used.  It begins by computing the mean square error (MSE) of the image, which appears in the divisor, meaning a higher PSNR is better while a higher MSE is worse.  My comparisons do not need the actual error value, only enough information to accurately determine that one set of colors is worse (when compared against the original) than another.  The concept is the same as comparing distances.  A common optimization is to omit the call to sqrt() used to get the actual distance, because the square root does not change which of two distances is greater.  Since we don’t need the actual distance to know that it is greater than another given distance, the square root is wasted time.  The same holds true when comparing one set of colors against another: we want to cut the comparison off at the earliest point that still obeys the rule “anything beyond this point does not change the greater-than/equal-to/less-than comparison between the two operands”.
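The skip-the-sqrt trick is small enough to show directly.  Below is a minimal sketch (`DistSq` and `CloserThan` are hypothetical names, not from the engine) demonstrating that squared distances order the same way the true distances do:

```cpp
#include <cassert>
#include <cmath>

// Squared distance between two RGB texels (values in 0..1).  Sufficient
// for "which is closer" questions because sqrt() is monotonic.
float DistSq( const float a[3], const float b[3] ) {
	float dr = a[0] - b[0], dg = a[1] - b[1], db = a[2] - b[2];
	return dr * dr + dg * dg + db * db;
}

// The relative comparison gives the same answer with or without the sqrt,
// so the sqrt is pure overhead.
bool CloserThan( const float a[3], const float b[3], const float target[3] ) {
	return DistSq( a, target ) < DistSq( b, target );
}
```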

As far as PSNR is concerned, once the MSE has been calculated, the rest of the equation follows the above rule: a higher MSE always means a lower dB (the result of PSNR), therefore we can stop with just the MSE. For a computer, though, correctly guessing how close one image is to another as humans perceive them is impossible unless the images are identical.  MSE (and sometimes RMSE, or root mean square error) is a decent way to get close, but should not be taken as the end-all-be-all determination of how close two images appear to humans.  While RMSE is common, I do not use it myself: square-rooting 2 numbers does not change the result of a relative comparison, so the square root is just overhead.  The image below illustrates this rather clearly.

In this image, the left-most image is the original.  The center area is the difference, and the right-most image is the image we are comparing against the original.  Both right-side images are the same as the original except that they have only 16 colors each.  The bottom image was dithered.  According to the computer, humans will perceive the top image as a closer match to the original (the 2 MSE values, 4.61 and 1.72, are the “literal error” and the “perceived error”, respectively).  Do you feel the computer has guessed correctly?

The perceived value is more important in most cases.  The dithered image had an MSE about 1.3 times that of the non-dithered image, but its perceived error is even greater, at ~1.4 times higher.  Ultimately, MSE can get you within a certain range of closeness, but human observation is required for best results. As for the algorithm I am about to present, it usually achieves an MSE fairly close to that of ATI’s The Compressonator, which is the compressor I use as the basis for comparison for 3 reasons:

  1. It shows me the MSE of my images, so it is useful for benchmarking.
  2. It is my impression that it is the industry standard to beat.  See reason #3 as to why I can’t check for myself, but it is apparently better than even the open-source NVIDIA tool based on Squish, from Simon Brown.
  3. My apologies to Ignacio Castaño, but I haven’t yet had the chance to test the open-sourced version of NVIDIA’s tools because I don’t have libpng12.dll and a few other DLL files, and I have not had time to look into the error shown to me when I load the project with Visual Studio 2005.  For those interested, Ignacio rightfully pointed out that my old comparisons were against a far inferior version of NVIDIA’s tools, and even my naive implementation of this method was able to beat it handily.  The open-sourced version, however, has improved greatly.  I can’t personally say by how much it has improved, but I can give an idea of how much mine has improved: the old routine on my main test image had a perceived MSE of almost 6, whereas The Compressonator had an MSE of 4.92.  My revisions have left me at 5.02, only 0.1 above ATI.  I am guessing the open-sourced NVIDIA tools are also somewhere in this range.

 

Linear Regression?

I have been getting this question a lot.  The image in the Wikipedia article explains it best.

It is used to find the best-fit line through a given set of points.  It is similar to an eigenvector approach except in how the result is expressed: linear regression provides a single linear equation (or, more precisely, values you can plug into a linear equation to get results).  Read on to see how I have applied this in 2 layers to get decent DXT-compressed images.
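The closed-form least-squares fit is short enough to show on its own.  This sketch (`FitLine` is a hypothetical name) computes the same sums and the same `w0`/`w1` terms that appear as `fW0`/`fW1` in the code later in this post:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Ordinary least-squares fit of y = w0 + w1*x through n points.
void FitLine( const float *x, const float *y, std::size_t n, float &w0, float &w1 ) {
	float sx = 0.0f, sy = 0.0f, sxx = 0.0f, sxy = 0.0f;
	for ( std::size_t i = 0; i < n; ++i ) {
		sx += x[i]; sy += y[i];
		sxx += x[i] * x[i]; sxy += x[i] * y[i];
	}
	float m = static_cast<float>( n );
	w1 = (m * sxy - sx * sy) / (m * sxx - sx * sx);	// Slope.
	w0 = sy / m - w1 * (sx / m);					// Intercept.
}
```

For points (0,1), (1,3), (2,5) this yields a slope of 2 and an intercept of 1, i.e. the exact line through them.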

 

So What is the Algorithm?

As with my original algorithm, there are 3 passes.  The 1st pass is applied only once and just determines a fairly decent starting point for us.  I treat the R, G, and B channels separately and sort the values in each channel from low to high.  Unlike in the graph above, that means each point along the X axis is guaranteed to be equal to or higher than the previous point, so we do not observe an up-and-down motion as in the graph, only rises of various degrees.  I use linear regression to find the best line segment through these points.

	/**
	 * Takes an array of 16 texels in floating-point format and determines the two colors best suited to represent
	 *	the given colors via simple interpolation.  In other words, it determines the two key colors for a DXT
	 *	block.
	 *
	 * \param _bBlock The input block of 4-by-4 texels.
	 * \param _bMax Upon return, this holds the computed maximum color value.
	 * \param _bMin Upon return, this holds the computed minimum color value.
	 * \param _ui32Width Width of the block.
	 * \param _ui32Height Height of the block.
	 * \param _fAlphaHigh The highest alpha value in the block.
	 * \param _fAlphaLow The lowest alpha value in the block.
	 * \param _pfFormat The DXTn format to which the block is to be converted.
	 */
	LSVOID LSE_CALL CImage::GetKeyColors( const LSI_BLOCK _bBlock[4][4], LSI_BLOCK &_bMax, LSI_BLOCK &_bMin, LSUINT32 _ui32Width, LSUINT32 _ui32Height,
		LSFLOAT _fAlphaHigh, LSFLOAT _fAlphaLow, LSI_PIXEL_FORMAT _pfFormat ) {
		if ( !_ui32Width || !_ui32Height ) { return; }	// An empty block; not possible.
		static const LSFLOAT fErrorRes[4] = {
			31.0f,
			63.0f,
			31.0f,
			1.0f,
		};

		// For each channel.
		CSet<LSFLOAT> sColors;
		LSUINT32 ui32TotalColors[4] = { 0UL };
		for ( LSUINT32 I = 4UL; I--; ) {
			// We want only unique colors, and we want them sorted.
			sColors.ResetNoDealloc();
			//vColors.ResetNoDealloc();
			for ( LSUINT32 Y = _ui32Height; Y--; ) {
				for ( LSUINT32 X = _ui32Width; X--; ) {
					//LSFLOAT fValue = ::floorf( _bBlock[Y][X].fValues[I] * fErrorRes[I] + 0.5f ) / fErrorRes[I];
					//LSFLOAT fValue = ::floorf( _bBlock[Y][X].fValues[I] * fErrorRes[I] ) / fErrorRes[I];
					LSFLOAT fValue = _bBlock[Y][X].fValues[I];
					sColors.Insert( fValue );
				}
			}
			ui32TotalColors[I] = sColors.Length();

			// If there is only one color, try to match it exactly by putting it in the middle.
			//	The two interpolated middle colors are more precise than the edge colors.
			if ( sColors.Length() == 1UL ) {
				_bMax.fValues[I] = _bMin.fValues[I] = sColors.GetByIndex( 0UL );
				continue;
			}

			// If there are 2 colors, they are the high and low.
			if ( sColors.Length() == 2UL ) {
				_bMax.fValues[I] = sColors.GetByIndex( 1UL );
				_bMin.fValues[I] = sColors.GetByIndex( 0UL );
				continue;
			}

			// Find the average slope of the colors using linear regression.
			LSFLOAT fM = static_cast<LSFLOAT>(sColors.Length());
			LSFLOAT fSumX = 0.0f;
			LSFLOAT fSumXSq = 0.0f;
			LSFLOAT fSumXByY = 0.0f;
			LSFLOAT fSumY = 0.0f;
			for ( LSUINT32 J = 0UL; J < sColors.Length(); ++J ) {
				LSFLOAT fThis = sColors.GetByIndex( J );
				fSumXByY += static_cast<LSFLOAT>(J) * fThis;
				fSumX += static_cast<LSFLOAT>(J);
				fSumY += fThis;
				fSumXSq += static_cast<LSFLOAT>(J) * static_cast<LSFLOAT>(J);
			}

			LSFLOAT fW1 = (fM * fSumXByY - fSumX * fSumY) / (fM * fSumXSq - fSumX * fSumX);
			LSFLOAT fW0 = (1.0f / fM) * fSumY - (fW1 / fM) * fSumX;
			// Plug it into the linear equation to get the 2 end points.
			LSFLOAT fLow = fW0;	// Simplified 0 term for X.
			LSFLOAT fHi = (fM - 1.0f) * fW1 + fW0;
			// The set is sorted, so its first and last entries are the channel's actual min and max.
			LSFLOAT fMin = sColors.GetByIndex( 0UL );
			LSFLOAT fMax = sColors.GetByIndex( sColors.Length() - 1UL );
			_bMin.fValues[I] = CStd::Clamp( fLow, fMin, fMax );
			_bMax.fValues[I] = CStd::Clamp( fHi, fMin, fMax );
		}

		// Refine the key colors iteratively.  Channels whose min and max already match are skipped internally.
		RefineKeyColors( _bBlock, _bMax, _bMin, _ui32Width, _ui32Height, 500UL, _fAlphaHigh, _fAlphaLow, _pfFormat, ui32TotalColors );

		for ( LSUINT32 I = 4UL; I--; ) {
			_bMax.fValues[I] = CStd::Clamp( _bMax.fValues[I], 0.0f, 1.0f );
			_bMin.fValues[I] = CStd::Clamp( _bMin.fValues[I], 0.0f, 1.0f );
		}
	}

You can see that I have commented out some code that rounds colors off to 16-bit precision.  The first commented line rounds to nearest, while the second rounds only down.  Initially, the round-to-nearest version provided better results, but later the 3rd version (the one actually in use, which performs no rounding) opened up reliable optimizations otherwise unavailable.  The middle version (which just truncates) has never provided any benefit.

One change was that I clamped the values between the high and low of the input data.  Originally I allowed the key colors to go higher and lower than the actual colors, reasoning that the actual colors might be more accurately represented through the midpoint colors, since the midpoint colors are not restricted to 5 or 6 bits of precision.  In practice this never happened, and it only served to cause the artifacts I mentioned in the preamble.  In reality, it is best to distribute each channel so that the key colors are the max and min of each channel, except in specific cases that can be detected and handled manually.  So I clamped the values here.

I have also left out some sugar-coating code.  If there is only one unique color in a channel, I apply an algorithm that checks whether all channels have only one color.  If they do, I find the absolute best possible match for that color.  I also omitted some code for handling the case where there are only 2 unique colors.

 

Let’s Find Good Matches

The case in which there are 2 unique colors in a channel is too complex to explain here, so I will focus only on the case in which there is 1 unique value in a channel.  This applies only if all channels have one color each, and those colors can all be matched by the 0, 1/3, 2/3, or 1 buckets provided by the 2 key colors. Every channel must be in the same or opposable “buckets”.  That is, if R is best matched at 0% between the 2 colors, or bucket index 0, then the other channels must match well in buckets 0 or 3.  If R is best matched at bucket index 1, then G and B must also be well matched in buckets 1 or 2.

ryg decided to make a table of the best possible key colors, but since I plan to support iOS devices and I expect my online conversion to be only a back-up for off-line conversions, I perform a brute-force test.  Since this is a one-time search for a given block, the performance is not impacted noticeably; if it were, I would cache the results in a smaller look-up table, since for any given image, if a single color dominates a whole block, it is likely that that specific color dominates other blocks as well.  Those blocks would benefit from the on-the-fly look-up table. Although I have already implemented this in my own code, it is straightforward enough to be beneficial as a reader’s exercise yet long enough that I prefer to omit it for the sake of brevity.
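As a starting point for that exercise, here is a heavily simplified sketch of the brute-force idea for a single 5-bit channel (`BestFlat5Bit` is a hypothetical helper, not my code): exhaustively try every endpoint pair and every bucket for one flat 8-bit value.  A real version must search R, G, and B jointly so that the winning buckets agree across channels, as described above.

```cpp
#include <cassert>
#include <cstdlib>

// Brute-force the best pair of 5-bit endpoints to represent one flat
// 8-bit channel value, considering all 4 buckets.  Returns the absolute
// error achieved and writes the winning endpoints to bestLo/bestHi.
int BestFlat5Bit( int target8, int &bestLo, int &bestHi ) {
	int bestErr = 1 << 30;
	for ( int lo = 0; lo < 32; ++lo ) {
		for ( int hi = 0; hi < 32; ++hi ) {
			int e0 = (lo << 3) | (lo >> 2);		// Expand 5 -> 8 bits.
			int e1 = (hi << 3) | (hi >> 2);
			// The 4 buckets this endpoint pair offers.
			int pal[4] = { e0, (2 * e0 + e1) / 3, (e0 + 2 * e1) / 3, e1 };
			for ( int b = 0; b < 4; ++b ) {
				int err = std::abs( pal[b] - target8 );
				if ( err < bestErr ) { bestErr = err; bestLo = lo; bestHi = hi; }
			}
		}
	}
	return bestErr;
}
```

Note how the interpolated buckets let some values be matched exactly even though no 5-bit endpoint expands to them; that is precisely why a flat color is best served by a dedicated search rather than by the regression passes.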

The results of this pass aren’t really horrible, but there is obviously much to improve.

Original

LSEngine: 1 Pass.  MSE: 7.03

 

Let’s Refine Those Matches

The first pass starts us off in a fairly good direction, but it does not consider how many texels share each of the 4 points available via the key colors and interpolation.  This is still a per-channel discussion, so we are still acting only on reds, then on greens, then blues, but within each channel it might be worth our while to provide more accuracy to some values and less to others if some values occur more frequently.  We would call these “dominant values”.

For an example, let’s pretend that in a block we have 7 values in a channel (a real 4×4 block has 16 values, of course, but this keeps the example manageable): 1 texel at value 0.24, 5 at value 0.4, and 1 at value 1.0.  The 4 colors available to us are 0.24, 0.49, 0.75, and 1.0.  The 5 values at 0.4 are better served by the 0.49 sample than any other, but that leaves 5 values with a fairly large amount of error.  Since most of the block is dominated by this value, it makes sense to adjust the line so that one of the points gets closer to 0.4, even if it means adding some amount of error to one of the other values (and it will).
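The per-bucket data this example produces can be computed directly.  The sketch below (`BucketErrors` is a hypothetical helper) assigns each value to its nearest bucket and accumulates plain signed error per bucket; my actual code accumulates signed *squared* error, but the shape of the data fed to the second regression layer is the same.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Assign each value to its nearest of the 4 bucket colors and accumulate
// the signed error each bucket contributes, plus a per-bucket count.
void BucketErrors( const float *vals, std::size_t n, const float pal[4],
                   float errOut[4], int countOut[4] ) {
	for ( int b = 0; b < 4; ++b ) { errOut[b] = 0.0f; countOut[b] = 0; }
	for ( std::size_t i = 0; i < n; ++i ) {
		int best = 0;
		for ( int b = 1; b < 4; ++b ) {
			if ( std::fabs( pal[b] - vals[i] ) < std::fabs( pal[best] - vals[i] ) ) { best = b; }
		}
		errOut[best] += pal[best] - vals[i];	// Signed, so over- and undershoot can cancel.
		++countOut[best];
	}
}
```

Running this on the 7-value example puts 5 texels in bucket 1 with an accumulated signed error of 5 × (0.49 − 0.4) = 0.45, while buckets 0 and 3 match exactly; that lopsided error table is what tells the refinement pass to pull the line toward 0.4.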

So how do you decide how much to tweak the line so that it gives just the right amount of attention to those 5 values, given that you are guaranteed to decrease your ability to match either the 0.24 or the 1.0 value in doing so?  My method is to use a second layer of linear regression, this time on the error metrics themselves.

In this diagram, the 4 possible values offered by the DXT compression for a single block are indicated in red and have a given amount of signed error.  The second bucket (remember that it is easier to refer to the 4 possible values as buckets) was a perfect match—any of the actual values that were closest to that bucket were also an exact match for that bucket.  But the other buckets were off by some amount, either up or down.  If we store the amount of signed error each bucket offers to the actual values in the 4×4 block, we can use linear regression through that data to determine by how much we should adjust our original end points.  The blue line shows the results of linear regression through this particular set of errors.

Once we have obtained the end-point values of the blue line, -0.194 and 0.355 here, we offset the original line by those amounts and repeat.  This is an iterative process.  As the original line is adjusted, the original set of colors will of course have different amounts of error for each bucket, but values will also at times suddenly jump into a different bucket, giving the previous bucket no error and the new bucket some amount of error.  If we had infinite loops at our disposal, the routine would eventually converge on the optimal solution for each channel, even though at times it would hop back and forth due to the above.  Since we don’t have infinite loops at our disposal, we have to reduce the chances of back-and-forth hopping and also decide how many loops are necessary for the quality we want to achieve.  My research shows that dividing the adjustments by 4 and using 500 loops are the respective sweet spots.

	/**
	 * Refines the given key colors iteratively, with each iteration bringing it closer to the most optimal solution.
	 *
	 * \param _bBlock The input block of 4-by-4 texels.
	 * \param _bMax The current maximum key color; holds the refined maximum upon return.
	 * \param _bMin The current minimum key color; holds the refined minimum upon return.
	 * \param _ui32Width Width of the block.
	 * \param _ui32Height Height of the block.
	 * \param _ui32Iters Iteration count for the 2nd phase.
	 * \param _fAlphaHigh The highest alpha value in the block.
	 * \param _fAlphaLow The lowest alpha value in the block.
	 * \param _pfFormat The DXTn format to which the block is to be converted.
	 * \param _ui32TotalColors The total unique colors in the channels.
	 */
	LSVOID LSE_CALL CImage::RefineKeyColors( const LSI_BLOCK _bBlock[4][4], LSI_BLOCK &_bMax, LSI_BLOCK &_bMin, LSUINT32 _ui32Width, LSUINT32 _ui32Height,
		LSUINT32 _ui32Iters, LSFLOAT /*_fAlphaHigh*/, LSFLOAT _fAlphaLow, LSI_PIXEL_FORMAT _pfFormat, LSUINT32 _ui32TotalColors[4] ) {
		LSBOOL bSkip[4] = {
			_ui32TotalColors[0] < 3UL,
			_ui32TotalColors[1] < 3UL,
			_ui32TotalColors[2] < 3UL,
			_ui32TotalColors[3] < 3UL,
		};
		static const LSFLOAT fErrorRes[4] = {
			31.0f,
			63.0f,
			31.0f,
			1.0f,
		};

		// Determine if this block will have an alpha component that would cause the color keys to be 3 colors instead of 4.
		LSUINT32 ui32Colors = (_pfFormat == LSI_PF_DXT1 && _fAlphaLow < 0.5f) ? 3UL : 4UL;
		LSBOOL bOmitLast = ui32Colors == 3UL;
		for ( ; _ui32Iters--; ) {
			// First create the 4 points that can be matched.
			LSI_BLOCK bMatchPoints[4];
			Get4Colors( _bMax, _bMin, bMatchPoints, bOmitLast );

			// For each channel.
			for ( LSUINT32 I = 3UL; I--; ) {
				if ( bSkip[I] ) { continue; }
				// Make the error "buckets".
				LSFLOAT fError[4] = { 0.0f };
				// The number of entries in each bucket.  All buckets must be filled so if any are 0
				//	we have to make a second pass to add values into them.
				LSUINT32 ui32Total[4] = { 0UL };
				LSUINT32 ui32TotalBucketsUsed = 0UL;

				// For each texel.
				for ( LSUINT32 Y = _ui32Height; Y--; ) {
					for ( LSUINT32 X = _ui32Width; X--; ) {
						// Find out to which point it is closest.
						LSUINT32 ui32Match = 0UL;
						LSFLOAT fMinError = GetError( I, bMatchPoints, _bBlock[Y][X], ui32Match, bOmitLast );

						// Add its error to the appropriate bucket.
						if ( fMinError < 0.0f ) {
							fMinError = -fMinError * fMinError;
						}
						else {
							fMinError *= fMinError;
						}
						fError[ui32Match] -= fMinError;
						if ( !ui32Total[ui32Match] ) {
							++ui32TotalBucketsUsed;
						}
						++ui32Total[ui32Match];
					}
				}
				// If only one bucket was used, there is no need to make any adjustments.
				if ( ui32TotalBucketsUsed < 2UL ) {
					continue;
				}

				// Normalize the error buckets.
				for ( LSUINT32 A = ui32Colors; A--; ) {
					fError[A] /= static_cast<LSFLOAT>(_ui32Width * _ui32Height);
				}

				// Apply linear regression to the error table.
				LSFLOAT fM = static_cast<LSFLOAT>(ui32TotalBucketsUsed);
				LSFLOAT fSumX = 0.0f;
				LSFLOAT fSumXSq = 0.0f;
				LSFLOAT fSumXByY = 0.0f;
				LSFLOAT fSumY = 0.0f;
				for ( LSUINT32 J = 0UL; J < ui32Colors; ++J ) {
					if ( !ui32Total[J] ) { continue; }
					LSFLOAT fThis = fError[J];
					fSumXByY += static_cast<LSFLOAT>(J) * fThis;
					fSumX += static_cast<LSFLOAT>(J);
					fSumY += fThis;
					fSumXSq += static_cast<LSFLOAT>(J) * static_cast<LSFLOAT>(J);
				}

				LSFLOAT fW1 = (fM * fSumXByY - fSumX * fSumY) / (fM * fSumXSq - fSumX * fSumX);
				LSFLOAT fW0 = (1.0f / fM) * fSumY - (fW1 / fM) * fSumX;
				// Plug it into the linear equation to get the 2 end points.
				LSFLOAT fLow = fW0;	// Simplified 0 term for X.
				LSFLOAT fHi = (fM - 1.0f) * fW1 + fW0;
				fLow *= 0.25f;
				fHi *= 0.25f;
				if ( _bMin.fValues[I] + fLow == _bMin.fValues[I] && _bMax.fValues[I] + fHi == _bMax.fValues[I] ) {
					// The values have converged.
					bSkip[I] = true;
				}
				_bMin.fValues[I] += fLow;
				_bMax.fValues[I] += fHi;

				// This could cause the low and high to change order.
				fLow = _bMin.fValues[I];
				_bMin.fValues[I] = CStd::Min( _bMin.fValues[I], _bMax.fValues[I] );
				_bMax.fValues[I] = CStd::Max( _bMax.fValues[I], fLow );

			}
		}

A few new features were added to the routine, mostly for enhanced speed.  The routine can now early-out when either it converges or when a channel has 1 or 2 unique values.  As I mentioned, the cases where there are only 1 or 2 unique values can be handled in the first pass, although I did not present the code for doing so.  If you were not to implement the handlers for those cases, you should not early-out in the second pass on those conditions.

Another improvement was that the linear regression is no longer hard-coded to 4 points.  Some of my biggest artifacts were caused by some buckets being completely empty (no original values were closest to them) but still being used in the linear regression equation.  This is a somewhat difficult case to handle, but I got best results by simply skipping those buckets and using the actual amount of used buckets in my linear-regression code.

The results of this pass are a great improvement over doing just the first pass.

Original

LSEngine: 2 Passes. MSE: 5.52

 

Adjusting for Perceived Quality

Until now we have worked on each channel separately.  The perceived closeness of one color to another requires that all 3 channels be considered together, so another step is necessary in order to further decrease the MSE and improve the actual quality of the image.  Humans do not perceive red, green, and blue equally.  Accounting for this is called “weighting”.

According to Adobe, the weights for each channel are 0.3, 0.59, and 0.11 for red, green, and blue respectively.  According to ATI’s The Compressonator, the weights are 0.3086, 0.6094, and 0.082.  While both are big names in the graphics industry, I am going to trust Adobe over ATI on the grounds that Adobe has spent many more millions of dollars, and much more time, specifically on image-quality research, whereas ATI has focused more on chipsets and optimizations.  Unfortunately, since I am using ATI’s The Compressonator to get my MSE values, I am obviously subject to biased results.  There is another reason I am subject to biased results as well, which I will discuss later.  Update: It was suggested on my previous post that I use the weights 0.212656, 0.715158, and 0.072186.  Doing so not only reduced my final image’s MSE (see below for details) but also leaves me with what I feel is a better-looking overall image.
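Any of these weight sets plugs into the error metric the same way.  A minimal sketch (`WeightedErrSq` is a made-up name, not the engine's GetError) using the updated weights, which closely match the Rec. 709 luma coefficients; swap in 0.3/0.59/0.11 or ATI's set to compare:

```cpp
#include <cassert>

// Perceptually weighted squared error between two RGB colors (values in 0..1).
// The weights approximate how strongly each channel contributes to perceived
// brightness, so a green error costs far more than an equal blue error.
float WeightedErrSq( const float a[3], const float b[3] ) {
	const float wr = 0.212656f, wg = 0.715158f, wb = 0.072186f;
	float dr = a[0] - b[0], dg = a[1] - b[1], db = a[2] - b[2];
	return dr * dr * wr + dg * dg * wg + db * db * wb;
}
```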

The final pass considers all 3 channels together in an effort to gain a better perceptual result.  It also helps to account for any rounding errors that may have been introduced and also for the fact that our second pass was not allowed infinite looping.

		// Last step to refining is to adjust each channel by 1 and test if the result is a closer match.
		static const LSINT32 i32Gran[4] = {
			4,			// Red.
			4,			// Green.
			4,			// Blue.
			0,			// Alpha.
		};
		LSUINT16 ui16IntForm[2] = {
			ConvertBlockTexelTo16Bit( _bMin ),
			ConvertBlockTexelTo16Bit( _bMax )
		};

		LSFLOAT fError = GetError( _bBlock,
			Convert16BitToBlockTexel( ui16IntForm[1] ),
			Convert16BitToBlockTexel( ui16IntForm[0] ),
			_ui32Width, _ui32Height, bOmitLast, 100.0f );
		if ( fError != 0.0f ) {
			LSUINT16 ui16WinnerMin = ui16IntForm[0];
			LSUINT16 ui16WinnerMax = ui16IntForm[1];

			// Y is applied to the min, X to the max.
			// For all reds.
			for ( LSINT32 Ry = -i32Gran[0]; Ry <= i32Gran[0] && fError; ++Ry ) {
				// Create the new 16-bit value based off these adjustments.
				LSUINT16 ui16TempMin0 = ui16IntForm[0];
				if ( !Offset16BitColorChannel( 0UL, Ry, ui16TempMin0 ) ) { continue; }
				for ( LSINT32 Rx = Ry; Rx <= i32Gran[0] && fError; ++Rx ) {
					// Create the new 16-bit value based off these adjustments.
					LSUINT16 ui16TempMax0 = ui16IntForm[1];
					if ( !Offset16BitColorChannel( 0UL, Rx, ui16TempMax0 ) ) { continue; }

					// For all greens.
					for ( LSINT32 Gy = -i32Gran[1]; Gy <= i32Gran[1] && fError; ++Gy ) {
						// Create the new 16-bit value based off these adjustments.
						LSUINT16 ui16TempMin1 = ui16TempMin0;
						if ( !Offset16BitColorChannel( 1UL, Gy, ui16TempMin1 ) ) { continue; }
						for ( LSINT32 Gx = Gy; Gx <= i32Gran[1] && fError; ++Gx ) {
							// Create the new 16-bit value based off these adjustments.
							LSUINT16 ui16TempMax1 = ui16TempMax0;
							if ( !Offset16BitColorChannel( 1UL, Gx, ui16TempMax1 ) ) { continue; }

							// For all blues.
							for ( LSINT32 By = -i32Gran[2]; By <= i32Gran[2] && fError; ++By ) {
								// Create the new 16-bit value based off these adjustments.
								LSUINT16 ui16TempMin2 = ui16TempMin1;
								if ( !Offset16BitColorChannel( 2UL, By, ui16TempMin2 ) ) { continue; }
								for ( LSINT32 Bx = By; Bx <= i32Gran[2] && fError; ++Bx ) {
									// Create the new 16-bit value based off these adjustments.
									LSUINT16 ui16TempMax2 = ui16TempMax1;
									if ( !Offset16BitColorChannel( 2UL, Bx, ui16TempMax2 ) ) { continue; }

									if ( Ry == 0 && Rx == 0 &&
										Gy == 0 && Gx == 0 &&
										By == 0 && Bx == 0 ) { continue; }

									// Get the amount of error in this new configuration.
									LSFLOAT fThisError = GetError( _bBlock,
										Convert16BitToBlockTexel( ui16TempMax2 ),
										Convert16BitToBlockTexel( ui16TempMin2 ),
										_ui32Width, _ui32Height, bOmitLast, fError );

									// If the error has decreased, keep the changes.
									if ( fThisError < fError ) {
										fError = fThisError;
										ui16WinnerMin = ui16TempMin2;
										ui16WinnerMax = ui16TempMax2;
									}
								}
							}
						}
					}
				}
			}
			ui16IntForm[0] = ui16WinnerMin;
			ui16IntForm[1] = ui16WinnerMax;
		}

This highly nested loop adjusts both endpoints of each channel, stored in their 16-bit forms, one by one, searching for a better match.  The loop itself is not particularly interesting.  The work done by GetError() is much more interesting, because it largely decides the end quality of your results.

	/**
	 * Gets the amount of error of an entire block between 2 colors.  The order of the highs and lows in each channel does
	 *	not impact the result.
	 *
	 * \param _bBlock The input block of 4-by-4 texels.
	 * \param _bMax The max color.
	 * \param _bMin The min color.
	 * \param _ui32Width Width of the block.
	 * \param _ui32Height Height of the block.
	 * \param _bOmitLast If true, 3 colors are returned instead of 4.
	 * \param _fCurError The current amount of error.  Knowing this helps the routine early-out.
	 * \return Returns the amount of error among the block given the 2 end colors.
	 */
	LSFLOAT LSE_CALL CImage::GetError( const LSI_BLOCK _bBlock[4][4], const LSI_BLOCK &_bMax, const LSI_BLOCK &_bMin,
		LSUINT32 _ui32Width, LSUINT32 _ui32Height, LSBOOL _bOmitLast, LSFLOAT _fCurError ) {
		// First create the 4 points that can be matched.
		LSI_BLOCK bMatchPoints[4][4];
		Get4Colors( _bMax, _bMin, bMatchPoints[0], _bOmitLast );
		CStd::MemCpy( bMatchPoints[1], bMatchPoints[0], sizeof( bMatchPoints[0] ) );
		CStd::MemCpy( bMatchPoints[2], bMatchPoints[0], sizeof( bMatchPoints[0] ) );
		CStd::MemCpy( bMatchPoints[3], bMatchPoints[0], sizeof( bMatchPoints[0] ) );

		LSUINT32 ui32Colors = _bOmitLast ? 3UL : 4UL;
		// Loop over only the colors in use so the mirrored index cannot underflow in 3-color mode.
		for ( LSUINT32 I = ui32Colors; I--; ) {
			bMatchPoints[1][I].fValues[0] = bMatchPoints[0][ui32Colors-1UL-I].fValues[0];	// Swap red.
			bMatchPoints[2][I].fValues[1] = bMatchPoints[0][ui32Colors-1UL-I].fValues[1];	// Swap green.
			bMatchPoints[3][I].fValues[2] = bMatchPoints[0][ui32Colors-1UL-I].fValues[2];	// Swap blue.
		}

		CVector4 vChannelErrors[4] = {
			CVector4( LSM_ZERO, LSM_ZERO, LSM_ZERO, LSM_ZERO ),
			CVector4( LSM_ZERO, LSM_ZERO, LSM_ZERO, LSM_ZERO ),
			CVector4( LSM_ZERO, LSM_ZERO, LSM_ZERO, LSM_ZERO ),
			CVector4( LSM_ZERO, LSM_ZERO, LSM_ZERO, LSM_ZERO ),
		};
		LSFLOAT fTotal[4] = { 0.0f };
		LSFLOAT fMaxError[4] = { 0.0f };
		LSFLOAT fBucketMinError = _fCurError;
		LSFLOAT fNormalizer = 1.0f / (_ui32Width * _ui32Height);

		// For each error bucket.
		for ( LSUINT32 B = 4UL; B--; ) {
			// For each texel.
			LSFLOAT fTempTotal = 0.0f;
			for ( LSUINT32 Y = _ui32Height; Y--; ) {
				for ( LSUINT32 X = _ui32Width; X--; ) {
					// Find out to which point it is closest.
					LSFLOAT fMinError = 10.0f;
					LSUINT32 ui32Slot = 0UL;
					CVector4 vThisError;
					for ( LSUINT32 A = ui32Colors; A--; ) {
						LSFLOAT fThisError = GetError( bMatchPoints[B][A], _bBlock[Y][X] );
						if ( fThisError < fMinError ) {
							fMinError = fThisError;
							ui32Slot = A;
							vThisError.x = bMatchPoints[B][A].s.fR - _bBlock[Y][X].s.fR;
							vThisError.y = bMatchPoints[B][A].s.fG - _bBlock[Y][X].s.fG;
							vThisError.z = bMatchPoints[B][A].s.fB - _bBlock[Y][X].s.fB;
							vThisError.w = bMatchPoints[B][A].s.fA - _bBlock[Y][X].s.fA;
							vThisError.x *= vThisError.x;
							vThisError.y *= vThisError.y;
							vThisError.z *= vThisError.z;
							vThisError.w *= vThisError.w;
							if ( fThisError == 0.0f ) { break; }
						}
					}

					// Add its error to the appropriate bucket.
					fMaxError[B] = CStd::Max( fMaxError[B], fMinError );
					vChannelErrors[B] += vThisError;

					LSFLOAT fA = vChannelErrors[B].x * fNormalizer;
					LSFLOAT fB = vChannelErrors[B].y * fNormalizer;
					LSFLOAT fC = vChannelErrors[B].z * fNormalizer;
#define LSI_ERROR( A, B, C )	((A) * 0.3f + (B) * 0.59f + (C) * 0.11f)
					fTempTotal = LSI_ERROR( fA, fB, fC );
					fTempTotal += fMaxError[B] * fMaxError[B] * fNormalizer;
					if ( fTempTotal >= fBucketMinError ) {
						// It can't win.  Another bucket was already closer.
						// Make sure this bucket can't win later.
						fTempTotal = 100.0f;
						vChannelErrors[B].x = 100.0f;
						X = Y = 0;
						break;
					}
				}
			}
			fBucketMinError = CStd::Min( fBucketMinError, fTempTotal );
			if ( fBucketMinError == 0.0f ) { return 0.0f; }
		}

		// Get the MSE of each bucket.
		for ( LSUINT32 B = 0UL; B < 4UL; ++B ) {
			vChannelErrors[B] *= fNormalizer;
			fTotal[B] = LSI_ERROR( vChannelErrors[B].x, vChannelErrors[B].y, vChannelErrors[B].z );
			fTotal[B] += fMaxError[B] * fMaxError[B] * fNormalizer;	// Add a penalty based on the maximum error.
		}

		// Find the minimum error.
		LSFLOAT fError = fTotal[0];
		for ( LSUINT32 B = 1UL; B < 4UL; ++B ) {
			fError = CStd::Min( fError, fTotal[B] );
		}
		return fError;
	}
	/**
	 * Gets the amount of error between 2 colors.
	 *
	 * \param _bColor0 Color 1.
	 * \param _bColor1 Color 2.
	 * \return Returns the amount of error between 2 colors.
	 */
	LSE_INLINE LSFLOAT LSE_CALL CImage::GetError( const LSI_BLOCK &_bColor0, const LSI_BLOCK &_bColor1 ) {
		/*return (_bColor0.s.fR - _bColor1.s.fR) * (_bColor0.s.fR - _bColor1.s.fR) * 0.3086f +
			(_bColor0.s.fG - _bColor1.s.fG) * (_bColor0.s.fG - _bColor1.s.fG) * 0.6094f +
			(_bColor0.s.fB - _bColor1.s.fB) * (_bColor0.s.fB - _bColor1.s.fB) * 0.082f;*/
		return (_bColor0.s.fR - _bColor1.s.fR) * (_bColor0.s.fR - _bColor1.s.fR) * 0.3f +
			(_bColor0.s.fG - _bColor1.s.fG) * (_bColor0.s.fG - _bColor1.s.fG) * 0.59f +
			(_bColor0.s.fB - _bColor1.s.fB) * (_bColor0.s.fB - _bColor1.s.fB) * 0.11f;
	}

Not as simple as you might expect.  Because each channel was sorted independently, every channel runs low-to-high; but in the original set of colors, green may well be decreasing while red is increasing, so we can’t just convert them to a single 4-bucket array and run our checks.  To account for this we build a total of 4 sets of 4-color buckets.  Index 0 is the raw set that was passed to us; the rest are the same set but with one channel reversed.  There is no need to make sets with 2 channels reversed, because reversing 2 channels is exactly the same as reversing only the remaining channel instead.
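Building the 4 candidate sets can be sketched as follows (a hypothetical standalone version; the struct and function names are illustrative, not the engine’s actual code).  Variant 0 is the raw set, and variants 1 through 3 reverse the red, green, or blue channel respectively:

```cpp
#include <cassert>

// Illustrative sketch only: build the 4 candidate bucket sets from the
// 4 colors along the line, whose channels were each sorted low-to-high.
struct LSI_COLOR { float fR, fG, fB; };

void BuildVariants( const LSI_COLOR _cBuckets[4], LSI_COLOR _cVariants[4][4] ) {
	for ( int iVar = 0; iVar < 4; ++iVar ) {
		for ( int I = 0; I < 4; ++I ) {
			int iRev = 3 - I;	// Index with the bucket order reversed.
			// Variant 0 copies the raw set; variants 1-3 reverse one channel.
			_cVariants[iVar][I].fR = (iVar == 1) ? _cBuckets[iRev].fR : _cBuckets[I].fR;
			_cVariants[iVar][I].fG = (iVar == 2) ? _cBuckets[iRev].fG : _cBuckets[I].fG;
			_cVariants[iVar][I].fB = (iVar == 3) ? _cBuckets[iRev].fB : _cBuckets[I].fB;
		}
	}
}
```

Reversing all 3 channels at once merely mirrors the whole line, which leaves the bucket matching unchanged, which is why 2-channel reversals collapse into the 1-channel cases above.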

The 4 MSE values are calculated and the lowest is taken as the actual score for the 2 given key colors.  One of the things I had to fix in order to improve substantially on my previous results was how the first GetError() function accumulated its results.  Previously I simply summed the results of the bottom GetError() function.  This was a mistake because I was accumulating already-weighted errors.  The correct method is to use the bottom function only to find the best bucket for each color, then accumulate the raw per-channel errors separately, applying the weights just once at the end.

The rest of the logic you see here is for early-outs.  Error values only ever increase, so once any bucket’s running error passes the best error found so far, that bucket can be discarded.  This is why the caller of GetError() passes in its current best error value.  This early-out saves a huge amount of time.
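The accumulate-then-weight fix and the early-out can be shown together in isolation (a simplified standalone sketch; all names here are illustrative, not the engine’s actual code):

```cpp
#include <cassert>
#include <cmath>

// Illustrative sketch: accumulate raw squared error per channel, apply the
// luminance weights only once at the end, and early-out as soon as the
// running total can no longer beat the best score found so far.
float GetBlockError( const float (*pfA)[3], const float (*pfB)[3],
	int iTexels, float fBest ) {
	static const float fWeights[3] = { 0.3f, 0.59f, 0.11f };
	float fErr[3] = { 0.0f, 0.0f, 0.0f };
	for ( int I = 0; I < iTexels; ++I ) {
		for ( int J = 0; J < 3; ++J ) {
			float fD = pfA[I][J] - pfB[I][J];
			fErr[J] += fD * fD;	// Unweighted accumulation per channel.
		}
		// Error only ever grows, so bail once this block cannot win.
		float fSoFar = (fErr[0] * fWeights[0] + fErr[1] * fWeights[1] +
			fErr[2] * fWeights[2]) / iTexels;
		if ( fSoFar >= fBest ) { return fBest; }
	}
	return (fErr[0] * fWeights[0] + fErr[1] * fWeights[1] +
		fErr[2] * fWeights[2]) / iTexels;
}
```

Because the weighted running total is monotonically non-decreasing, returning early the moment it reaches the caller’s best score is always safe.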

 

The Last Step

Since we have still been keeping channels individually sorted from low to high, the last step is to see if there is a need to reverse any channels.

		// Until now we have worked on each channel separately such that the mins and maxes are both on one side or the other for each
		//	channel.
		// In the cases in an image where one channel goes down as another channel goes up, this will leave us with nothing but poor matches
		//	when we try to figure out how much of each of the key colors to use for the respective block texel.
		// Here we try every combination of flipping each channel to find the one that is best.
		static const LSUINT32 ui32Shifts[4] = {
			11,			// Red.
			5,			// Green.
			0,			// Blue.
			0,			// Alpha.
		};
		static const LSUINT32 ui32Bits[4] = {
			5,			// Red.
			6,			// Green.
			5,			// Blue.
			0,			// Alpha.
		};
		// For each channel.
		for ( LSUINT32 I = 3UL; I--; ) {
			LSUINT32 ui32Mask = (1UL << ui32Bits[I]) - 1UL;
			fError = GetErrorStrict( _bBlock,
				Convert16BitToBlockTexel( ui16IntForm[1] ),
				Convert16BitToBlockTexel( ui16IntForm[0] ),
				_ui32Width, _ui32Height, bOmitLast );

			LSUINT32 ui32Min = (ui16IntForm[0] >> ui32Shifts[I]) & ui32Mask;
			LSUINT32 ui32Max = (ui16IntForm[1] >> ui32Shifts[I]) & ui32Mask;

			// See if the error decreases with the channel swapped.
			// Create the new 16-bit value based off these adjustments.
			LSUINT16 ui16TempMin = static_cast<LSUINT16>(ui16IntForm[0] & ~(ui32Mask << ui32Shifts[I]));
			LSUINT16 ui16TempMax = static_cast<LSUINT16>(ui16IntForm[1] & ~(ui32Mask << ui32Shifts[I]));

			ui16TempMin |= ui32Max << ui32Shifts[I];
			ui16TempMax |= ui32Min << ui32Shifts[I];

			if ( GetErrorStrict( _bBlock,
				Convert16BitToBlockTexel( ui16TempMin ),
				Convert16BitToBlockTexel( ui16TempMax ),
				_ui32Width, _ui32Height, bOmitLast ) < fError ) {
				// Swapped is better.
				ui16IntForm[0] = ui16TempMin;
				ui16IntForm[1] = ui16TempMax;
			}
		}

		_bMin = Convert16BitToBlockTexel( CStd::Min( ui16IntForm[0], ui16IntForm[1] ) );
		_bMax = Convert16BitToBlockTexel( CStd::Max( ui16IntForm[0], ui16IntForm[1] ) );
	}

The results are fairly impressive.

Original

LSEngine: 3 Passes. MSE: 5.02

The Compressonator. MSE: 4.92

The difference, according to the computer, is only 0.1.  Visually, however, there are only a few noticeable differences between LSEngine’s and ATI’s results, and while most of them have no clear better-or-worse factor, a few are definitely in my favor.  The first one that will draw your attention is the green area above the woman’s hair.  This kind of color set is notorious for giving DXT compressors trouble; there is no way to make both the hair and the green area look good, so the compressor has to choose.

ATI chose to maintain quality on the hair while my routine chose to keep the green more intact.  Unfortunately for ATI, the disruptions in their green are more noticeable than the higher quality of the hair.

The “I” in “FIGHT” is clearly in my favor as well.  In fact, so is every other letter in the word except the “H”.

LSEngine

ATI

One of the highlights of my previous implementation was its ability to maintain higher levels of contrast.  This has not changed in my revised implementation.

This glaring artifact introduced by The Compressonator ruins the highlight on the hair, and is visible without zooming in.  Below I have circled the areas each compressor handled worse than the other; more circles means a worse result.

Most of these red marks are a matter of contrast, but the left-most circle on the ATI side is simply an error.  These things won’t be visible when viewing at normal scale, but the total combined improvements, small as they may be, matter.  People will not know why they prefer one image over another—they just know they do.

Of note: After adding the top-most circle on the ATI side it became less obvious why I added that circle (the circle itself adds contrast to that area).  Checking the image above this one should make it more apparent.

All-in-all, my results have a higher amount of contrast and fewer blocky artifacts (see the lower-right circle in ATI).

Update: It was suggested that I use the weights 0.212656, 0.715158, and 0.072186 (the Rec. 709 luma coefficients), and they turned out to provide not only a lower MSE (4.99) but also a better human-perceived result.  They pick more accurate colors in many areas, increase contrast in some areas, and eliminate some artifacts.

LSEngine New Weights. MSE: 4.99

 

LSEngine: 3 Passes. MSE: 5.02

Notice the removed artifact under the “!” in the upper image and a slight increase in overall contrast.  Below I have circled the areas I feel were most improved by the new weights.

 

The GPU Factor

I mentioned another reason why—other than the different weights ATI uses—I am at a disadvantage when using ATI’s tools for checking the quality of my images.  Ignacio Castaño points out in his article that the method for decoding DXT data is not explicitly defined, and GPU vendors are allowed some creative control in this aspect.  NVIDIA in particular uses colors other than those at 0.00%, 33.3%, 66.7%, and 100%.  Both my own and ATI’s MSE went up when I sent them through an NVIDIA GPU.

But why does this put me at a disadvantage here?  Because The Compressonator is decoding its own data—the same data it generated—for the MSE comparison.  I get my results by sending the data through the GPU and taking a screenshot.  In other words, my encoding process is designed with assumptions on how the data will later be decoded, but if my assumptions are wrong my end result will be off as well.  The Compressonator encodes to DXT, then decodes that data for displaying on the screen and MSE calculations.  It can round the mid-point colors to the nearest values up or down, whereas a GPU is likely to simply truncate for the sake of speed.
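As a concrete illustration of how decoders can disagree (these helper functions are hypothetical, not any vendor’s documented formula): the 1/3 and 2/3 interpolated colors are not exactly representable in integers, so a decoder that rounds and one that truncates can differ by one step per channel.

```cpp
#include <cassert>

// The color at the 1/3 point between two channel values, rounded to nearest.
unsigned int InterpThirdRounded( unsigned int ui32C0, unsigned int ui32C1 ) {
	return (2U * ui32C0 + ui32C1 + 1U) / 3U;	// +1 rounds a remainder of 2 up.
}
// The same color, truncated, as a speed-conscious GPU decoder might do.
unsigned int InterpThirdTruncated( unsigned int ui32C0, unsigned int ui32C1 ) {
	return (2U * ui32C0 + ui32C1) / 3U;
}
```

An encoder tuned against one of these decode behaviors will score slightly worse when its output is decoded by the other, which is exactly the mismatch the screenshots below expose.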

This is demonstrated easily by sending the compressed ATI result through the GPU.  I even used an ATI GPU for fairness since I would assume The Compressonator would generate a result compliant with ATI’s own GPU decoding method.

Even ATI’s own GPUs do not maintain the results of their DXT tool.

Here I replaced the black color with white to highlight the pixels that changed after being decoded by the GPU instead of by The Compressonator itself.  Sending it through an NVIDIA GPU only increased the error (1.1/0.65).  Throughout all of these tests, including those in my previous article, only my images were actually sent through the GPU, and thus only mine are representative of what you would actually get in a game.  Some of the pixels are off by more than just 1, so it is likely that ATI’s GPUs use a decompression routine similar to NVIDIA’s.  This means that as long as my routine assumes colors at 0/3, 1/3, 2/3, and 3/3, I can never get an MSE as low as The Compressonator’s unless I stop sending my image through the GPU and decode it myself, as The Compressonator does.

 

Future Work

I plan to make a command-line tool with this algorithm.  Since there are differences between how GPUs decode DXT files, I would include a flag for targeting one GPU over another.  This would mean improved texture quality on PlayStation 3 compared to Xbox 360.

In-house, we often have textures in which each channel has a special proprietary meaning, such as assigning specular intensity to R, reflectivity to B, etc.  It would be useful if artists could assign their own weights to these channels to preserve quality in one channel more than in another.
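A minimal sketch of such an artist-weighted error term (purely illustrative; this helper is hypothetical and not part of the engine):

```cpp
#include <cassert>

// Squared error between two texels, scaled per channel by weights the
// artist supplies (e.g. a heavy weight on the specular-intensity channel).
float WeightedError( const float *pfTexel0, const float *pfTexel1,
	const float *pfWeights, int iChannels ) {
	float fSum = 0.0f;
	for ( int I = 0; I < iChannels; ++I ) {
		float fD = pfTexel0[I] - pfTexel1[I];
		fSum += fD * fD * pfWeights[I];	// Channel importance set by the artist.
	}
	return fSum;
}
```

The luminance weights used throughout this article are just one fixed choice of pfWeights; exposing them per texture would let each proprietary channel keep the precision it actually needs.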

For normal maps, the DXT5 method used by John Carmack is decent, but could be improved even further.  I will save that for another article.

About L. Spiro

L. Spiro is a professional actor, programmer, and artist, with a bit of dabbling in music. || [Senior Core Tech Engineer]/[Motion Capture] at Deep Silver Dambuster Studios on: * Homefront: The Revolution * UNANNOUNCED || [Senior Graphics Programmer]/[Motion Capture] at Square Enix on: * Luminous Studio engine * Final Fantasy XV || [R&D Programmer] at tri-Ace on: * Phantasy Star Nova * Star Ocean: Integrity and Faithlessness * Silent Scope: Bone Eater * Danball Senki W || [Programmer] on: * Leisure Suit Larry: Beach Volley * Ghost Recon 2 Online * HOT PXL * 187 Ride or Die * Ready Steady Cook * Tennis Elbow || L. Spiro is currently a GPU performance engineer at Apple Inc. || Hyper-realism (pencil & paper): https://www.deviantart.com/l-spiro/gallery/4844241/Realism || Music (live-played classical piano, remixes, and original compositions): https://soundcloud.com/l-spiro/

2 Awesome Comments So Far

  1. mikiex
    March 17, 2012 at 6:51 PM #

    A good article, it would be interesting to see LSEngine vs Squish vs ATI over a large range of images to be complete.

    On trusting Adobe (not that it has anything to do with this subject): they don’t even resample sRGB images in linear space.  They didn’t come up with 30% 59% 11%; these numbers have been around for years for luminance.  It’s possible ATI is accounting for error with their numbers (I don’t know, just guessing).

    I’d be interested in trying out your executable. Currently where I work we can select between different compressors per texture and preview the results. (That being said most people leave them the default).

    ATI is not our default, but I have been thinking about using ATI as the default.

    • L. Spiro
      March 17, 2012 at 7:51 PM #

      Thank you.

      Perhaps I will extend it with some more comparison shots after I finish a few tweaks etc.

      I don’t really know who to trust.  New numbers may mean new and improved research.  Plus there is blue shift—in low light, as your rods take over, you become more sensitive to blues.  So perhaps there is not really just one set of numbers to use at all.  Perhaps a more complex comparison would yield even better results, at the cost of speed.
      I wish an expert in the field would step forward and announce the official best way to determine luminance.

      I have no clue right now when the tool will be made and released, since I have hopped to another part of the engine momentarily, but it should be within a month or two.

      L. Spiro
