Multi-Language Support/UTF

Overview

Support for complex languages such as Chinese is crucial.  Tempting as it may be to use plain ASCII strings to get fast results so you can move on to the juicy bits, the best way to avoid bugs and to properly support non-Latin languages is to build that support in from the beginning.  In this article I will clear up a few commonly misunderstood issues regarding UTF and provide sample source code that is easy to modify to fit into any project.

UTF

UTF (Unicode Transformation Format) is the family of encodings of choice for working with strings that can include characters from any language.  UTF-8, UTF-16, and UTF-32 are the biggest players in this field.  L. Spiro Engine natively runs on UTF-8, but has converters for UTF-16 and UTF-32 as well, which are posted below.

Some people question whether or not UTF-8, being mostly one byte per character for Western text, can encode the entire Unicode set.  It is capable of encoding every Unicode code point, and the original, unrestricted scheme could in fact encode values well beyond the Unicode range.

UTF-32

UTF-32 is the easiest format to describe, so I begin here.  The 32-bit character value is itself the actual Unicode character (code point).  There is no transformation to perform.  Here is my function for getting the next UTF-32 character out of a sequence of UTF-32 data:

	/**
	 * Get the next UTF-32 character from a UTF-32 string.
	 *
	 * \param _putf32Char Pointer to the next character to decode.  String must be in UTF-32 format.
	 * \param _ui32Len Length, in LSUTF32 units, of the string.  This is not the number of Unicode
	 *	characters, but the actual number of LSUTF32 characters in the buffer.
	 * \param _pui32Size If not NULL, this holds the returned size of the character in LSUTF32 units.
	 *	However, the UTF-32 coding scheme always uses 1 LSUTF32 character per Unicode character;
	 *	therefore, this value, if not NULL, will always be set to 1.  It is here only for compatibility
	 *	with the other UTF functions.
	 * \return Returns an LSUINT32 value representing the decoded Unicode character.  Also returns the
	 *	size of the character in LSUTF32 units.
	 */
	LSE_INLINE LSUINT32 LSE_CALL CStd::NextUtf32Char( const LSUTF32 * _putf32Char, LSUINT32 _ui32Len, LSUINT32 * _pui32Size ) {
		if ( _ui32Len == 0UL ) { return 0UL; }
		if ( _pui32Size ) { (*_pui32Size) = 1UL; }
		LSUINT32 ui32Ret = (*_putf32Char);
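		// Reject anything at or above 0x200000; no Unicode code point is ever that large.  This is only
		//	a loose sanity check: values from 0x110000 to 0x1FFFFF and the surrogate range are not
		//	specifically rejected here.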
		if ( ui32Ret & 0xFFE00000 ) { return LSSUTF_INVALID; }
		return ui32Ret;
	}

LSUTF32 is simply a typedef for a 32-bit unsigned integer.  LSE_CALL can be __cdecl, __stdcall, __fastcall, or blank.  LSSUTF_INVALID is a macro defined as ~static_cast<LSUINT32>(0).  The rest are pretty straightforward.
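
For reference, here is a minimal sketch of what those supporting types and macros might look like.  The exact definitions in the engine may differ; this is only an assumption based on the descriptions above (fixed-width unsigned integers, an optional calling convention, and an all-bits-set invalid marker), using <cstdint> for the fixed-width types.

	#include <cstdint>

	typedef std::uint8_t	LSUTF8;			// One UTF-8 code unit.
	typedef std::uint16_t	LSUTF16;		// One UTF-16 code unit.
	typedef std::uint32_t	LSUTF32;		// One UTF-32 code unit.
	typedef std::uint32_t	LSUINT32;		// General-purpose unsigned 32-bit integer.

	#define LSE_CALL						// Or __cdecl, __stdcall, or __fastcall.
	#define LSE_INLINE		inline			// Presumably just inline.
	#define LSSUTF_INVALID	(~static_cast<LSUINT32>(0))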

UTF-16

UTF-16 stores what it can in a single 16-bit value, but some Unicode characters require a second 16-bit value; the two values together are called a surrogate pair.  The range of code points from 0xD800 to 0xDFFF is reserved and never assigned to characters, which frees up a pattern of high bits that can be set on the first 16-bit value to indicate that another 16-bit value follows and pairs with it.

During encoding of UTF-16, if the character cannot be stored in a single 16-bit value, its value is decreased by 0x10000, leaving a 20-bit offset, and the upper 10 bits and lower 10 bits of that offset each go into one of the two 16-bit surrogate values.  The first value gets the upper 10 bits, combined (OR'ed) with 0xD800.  The second value gets the lower 10 bits, combined (OR'ed) with 0xDC00.

For decoding, we look for a 16-bit value whose upper 6 bits are 110110 (the 0xD800-0xDBFF range), followed by a 16-bit value whose upper 6 bits are 110111 (the 0xDC00-0xDFFF range).  Take the lower 10 bits from both values, shift the 10 bits from the first value up by 10, combine them with the 10 bits from the second, and add 0x10000 to the result.  Examine the code if this seems tricky.
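
As a quick sanity check of that math (this is not part of the engine's code, and it assumes a C++11 compiler for static_assert), the surrogate pair 0xD83D 0xDE00 decodes to U+1F600:

	// 0xD83D is a valid first value: its upper 6 bits are 110110.
	// 0xDE00 is a valid second value: its upper 6 bits are 110111.
	static_assert( (0xD83DUL & 0x03FFUL) == 0x003DUL, "Upper 10 bits of the offset." );
	static_assert( (0xDE00UL & 0x03FFUL) == 0x0200UL, "Lower 10 bits of the offset." );
	static_assert( (((0x003DUL << 10) | 0x0200UL) + 0x10000UL) == 0x1F600UL, "Recombined code point." );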

	/**
	 * Get the next UTF-16 character from a UTF-16 string.
	 *
	 * \param _putf16Char Pointer to the next character to decode.  String must be in UTF-16 format.
	 * \param _ui32Len Length, in LSUTF16 units, of the string.  This is not the number of Unicode
	 *	characters, but the actual number of LSUTF16 characters in the buffer.
	 * \param _pui32Size If not NULL, this holds the returned size of the character in LSUTF16 units.
	 * \return Returns an LSUINT32 value representing the decoded Unicode character.  Also returns the
	 *	size of the character in LSUTF16 units.  Returns LSSUTF_INVALID for invalid characters.
	 */
	LSUINT32 CStd::NextUtf16Char( const LSUTF16 * _putf16Char, LSUINT32 _ui32Len, LSUINT32 * _pui32Size ) {
		if ( _ui32Len == 0UL ) { return 0UL; }

		// Get the low bits (which may be all there are).
		LSUINT32 ui32Ret = (*_putf16Char);

		LSUINT32 ui32Top = ui32Ret & 0xFC00UL;
		// Check to see if this is a surrogate pair.
		if ( ui32Top == 0xD800UL ) {
			if ( _ui32Len < 2UL ) {
				// Not enough space to decode correctly.
				if ( _pui32Size ) { (*_pui32Size) = 1UL; }
				return LSSUTF_INVALID;
			}

			// Need to add the next character to it.
			// Remove the 0xD800.
			ui32Ret &= ~0xD800UL;
			ui32Ret <<= 10UL;

			// Get the second set of bits.
			LSUINT32 ui32Next = (*++_putf16Char);
			if ( (ui32Next & 0xFC00UL) != 0xDC00UL ) {
				// Invalid second character.
				// Standard defines this as an error.
				if ( _pui32Size ) { (*_pui32Size) = 1UL; }
				return LSSUTF_INVALID;
			}
			if ( _pui32Size ) { (*_pui32Size) = 2UL; }

			ui32Next &= ~0xDC00UL;

			// Add the second set of bits.
			ui32Ret |= ui32Next;

			return ui32Ret + 0x10000UL;
		}

		if ( _pui32Size ) { (*_pui32Size) = 1UL; }
		return ui32Ret;
	}
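
Below is a minimal usage sketch, not part of the engine, showing how the function is meant to be driven.  The CountUtf16Chars() name and the plain bool return type are my own inventions for this example; it counts the Unicode characters in a counted UTF-16 buffer and reports whether every sequence was well formed:

	bool CountUtf16Chars( const LSUTF16 * _putf16Str, LSUINT32 _ui32Len, LSUINT32 &_ui32Total ) {
		_ui32Total = 0UL;
		while ( _ui32Len ) {
			LSUINT32 ui32Size;
			if ( CStd::NextUtf16Char( _putf16Str, _ui32Len, &ui32Size ) == LSSUTF_INVALID ) {
				// Lone or mismatched surrogate.
				return false;
			}
			// The decoder always reports how many LSUTF16 units were consumed.
			_putf16Str += ui32Size;
			_ui32Len -= ui32Size;
			++_ui32Total;
		}
		return true;
	}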

UTF-8

For code points below 0x80, one UTF-8 byte is all that is needed, and that byte is exactly the same as its ASCII counterpart.  Otherwise multiple bytes must be combined to produce the Unicode character.

Because NUL (0x00) is below 0x80, NUL is never the start of a multi-byte sequence.  Furthermore, the trailing bytes of a sequence are guaranteed never to be 0x00 either, so functions that copy NUL-terminated strings (strcpy(), etc.) still work properly on all UTF-8 strings.  strlen() can still be used as well, but it returns the number of bytes in the string rather than the number of characters.  Regardless of this type of compatibility, I suggest treating UTF-8 strings as their own type and writing dedicated functions for checking their lengths.
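
Here is a minimal sketch of the kind of dedicated length function I mean.  It is not part of the engine (the Utf8CharCount() name is mine), and it performs no validation; it simply counts the Unicode characters in a NUL-terminated UTF-8 string by skipping the trailing bytes described below, whereas strlen() counts every byte:

	LSUINT32 Utf8CharCount( const LSUTF8 * _putf8Str ) {
		LSUINT32 ui32Total = 0UL;
		for ( ; (*_putf8Str) != 0U; ++_putf8Str ) {
			// Trailing bytes have the form 10xxxxxx; every other byte starts a new character.
			if ( ((*_putf8Str) & 0xC0U) != 0x80U ) { ++ui32Total; }
		}
		return ui32Total;
	}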

If the first byte has a 1 in its uppermost bit (val & 0x80), it begins a multi-byte character, and the number of consecutive high bits set in that first byte equals the total number of bytes in the character.

For example, 0x7F (01111111) is just one byte, so that is the Unicode character it represents.
0xCC (11001100) has its highest bit set, so the number of consecutive high bits tells us how many bytes are in the character.  Here that is 2, so the next byte is also part of the sequence.

Each byte after the first has the form 10xxxxxx in its highest bits.  The lower 6 bits are read from each byte and concatenated to form the final Unicode character, with earlier bytes in the sequence contributing the more-significant bits.  The code explains it all.
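
As a worked example (again, not part of the engine's code, and assuming C++11 for static_assert), U+20AC, the euro sign, needs 16 bits of payload and therefore a 3-byte sequence of the form 1110xxxx 10xxxxxx 10xxxxxx:

	// 0x20AC = 0010 0000 1010 1100 in binary.
	static_assert( (0xE0UL | ((0x20ACUL >> 12) & 0x0FUL)) == 0xE2UL, "Lead byte: 1110 0010." );
	static_assert( (0x80UL | ((0x20ACUL >> 6) & 0x3FUL)) == 0x82UL, "First trailing byte: 10 000010." );
	static_assert( (0x80UL | (0x20ACUL & 0x3FUL)) == 0xACUL, "Second trailing byte: 10 101100." );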

	/**
	 * Get the next UTF-8 character from a UTF-8 string.
	 *
	 * \param _putf8Char Pointer to the next character to decode.  String must be in UTF-8 format.
	 * \param _ui32Len Length, in LSUTF8 units, of the string.  This is not the number of Unicode
	 *	characters, but the actual number of LSUTF8 characters in the buffer.
	 * \param _pui32Size If not NULL, this holds the returned size of the character in LSUTF8 units.
	 * \return Returns an LSUINT32 value representing the decoded Unicode character.  Also returns the
	 *	size of the character in LSUTF8 units.  Returns LSSUTF_INVALID for invalid characters.
	 */
	LSUINT32 CStd::NextUtf8Char( const LSUTF8 * _putf8Char, LSUINT32 _ui32Len, LSUINT32 * _pui32Size ) {
		if ( _ui32Len == 0UL ) { return 0UL; }

		// Get the low bits (which may be all there are).
		LSUINT32 ui32Ret = (*_putf8Char);

		// The first byte is a special case.
		if ( (ui32Ret & 0x80UL) == 0UL ) {
			// We are done.
			if ( _pui32Size ) { (*_pui32Size) = 1UL; }
			return ui32Ret;
		}

		// We are in a multi-byte sequence.  Get bits from the top, starting
		//	from the second bit.
		LSUINT32 I = 0x20;
		LSUINT32 ui32Len = 2UL;
		LSUINT32 ui32Mask = 0xC0UL;
		while ( ui32Ret & I ) {
			// Add this bit to the mask to be removed later.
			ui32Mask |= I;
			I >>= 1UL;
			++ui32Len;
			if ( I == 0UL ) {
				// Invalid sequence.
				if ( _pui32Size ) { (*_pui32Size) = 1UL; }
				return LSSUTF_INVALID;
			}
		}

		// Bounds checking.
		if ( ui32Len > _ui32Len ) {
			if ( _pui32Size ) { (*_pui32Size) = _ui32Len; }
			return LSSUTF_INVALID;
		}

		// We know the size now, so set it.
		// Even if we return an invalid character we want to return the correct number of
		//	bytes to skip.
		if ( _pui32Size ) { (*_pui32Size) = ui32Len; }
		// If the length is greater than 4, it is invalid.
		if ( ui32Len > 4UL ) {
			// Invalid sequence.
			return LSSUTF_INVALID;
		}

		// Mask out the leading bits.
		ui32Ret &= ~ui32Mask;

		// For every trailing bit, add it to the final value.
		for ( I = ui32Len - 1UL; I--; ) {
			LSUINT32 ui32This = (*++_putf8Char);
			// Validate the byte.
			if ( (ui32This & 0xC0UL) != 0x80UL ) {
				// Invalid.
				return LSSUTF_INVALID;
			}

			ui32Ret <<= 6UL;
			ui32Ret |= (ui32This & 0x3F);
		}

		// Finally done.
		return ui32Ret;
	}
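
As with the UTF-16 decoder, a small usage sketch (not part of the engine; the DecodeUtf8() name is mine) shows the intended driving pattern: walk a counted UTF-8 buffer, collect the raw code points, and substitute U+FFFD, the Unicode replacement character, for malformed sequences:

	#include <vector>

	std::vector<LSUINT32> DecodeUtf8( const LSUTF8 * _putf8Str, LSUINT32 _ui32Len ) {
		std::vector<LSUINT32> vCodePoints;
		while ( _ui32Len ) {
			LSUINT32 ui32Size;
			LSUINT32 ui32Char = CStd::NextUtf8Char( _putf8Str, _ui32Len, &ui32Size );
			if ( ui32Char == LSSUTF_INVALID ) { ui32Char = 0xFFFDUL; }	// U+FFFD REPLACEMENT CHARACTER.
			vCodePoints.push_back( ui32Char );
			// Even on errors the decoder reports how many LSUTF8 units to skip.
			_putf8Str += ui32Size;
			_ui32Len -= ui32Size;
		}
		return vCodePoints;
	}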

UCS-2 and Windows®

Microsoft®, believing for some reason that no more than 65,536 characters would ever be defined, relied entirely on the UCS-2 encoding for the internal workings of Windows® prior to Windows® 2000.  UCS-2 simply stores every Unicode character in a single 16-bit value.  When the Unicode standard was extended in July of 1996 (Unicode 2.0), UCS-2 was implicitly deprecated, and Microsoft® has had a headache ever since.  Unicode now defines 1,114,112 code points, so don't bother even messing with this format.  I will present no code for working with UCS-2.

Microsoft® Visual Studio® defines wchar_t as 16 bits, another leftover of the UCS-2 era; on most other systems it is 32 bits.  Once again, never use wchar_t except for handling internal strings (error messages, etc.) that must be defined with L"" literals.  Never save wchar_t strings to a file or send them over a network; keep in mind that they change size between platforms and convert them to something of a consistent size first.  As can be seen from my code above, this really applies to all types, since int and long also change depending on the target platform and compiler.  You should be defining custom types that clearly indicate their sizes and signedness.
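
As a quick illustration (not from the engine, and assuming a C++11 compiler for static_assert), that advice can be made self-documenting in code.  The LSUNICODE32 name is hypothetical, and LSWCHAR_T here just stands for whatever an L"" literal gives you on the current platform:

	#include <cstdint>

	typedef wchar_t			LSWCHAR_T;		// Matches L"" literals: 2 bytes under Visual Studio®, 4 bytes on most other systems.
	typedef std::uint32_t	LSUNICODE32;	// Hypothetical fixed-width type that is safe to serialize.

	static_assert( sizeof( LSUNICODE32 ) == 4, "Serialized character type must be exactly 32 bits." );
	// No equivalent assertion can hold for LSWCHAR_T on every platform, which is exactly why
	//	wchar_t strings should never be written to a file or sent over a network.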

Encoding

The functions above are already a strong foundation for a set of utilities that can count how many characters are in a UTF string and so on, but converting between formats is also quite helpful.  I have already described how each format is encoded, so here I will only present the functions needed to encode them.

	/**
	 * Convert a raw 32-bit Unicode character to a UTF-32 character.  Returns the UTF-32 character as
	 *	an LSUINT32 value.  The returned length is the number of LSUTF32 characters returned, which is
	 *	always 1.
	 *
	 * \param _ui32Raw The raw Unicode value to convert.
	 * \param _ui32Len The length, in LSUTF32 characters, of the converted value.  Always 1.
	 * \return Returns the converted character in LSUINT32 form along with the length, in units of
	 *	LSUTF32, of the returned value.  Because the mapping between UTF-32 and raw 32-bit Unicode values
	 *	is one-to-one, this value is always 1.
	 */
	LSE_INLINE LSUINT32 LSE_CALL CStd::RawUnicodeToUtf32Char( LSUINT32 _ui32Raw, LSUINT32 &_ui32Len ) {
		_ui32Len = 1UL;
		return _ui32Raw;
	}
	/**
	 * Convert a raw 32-bit Unicode character to a UTF-16 character.  Returns the UTF-16 character as
	 *	an LSUINT32 value.  The returned length is the number of LSUTF16 characters returned.
	 *
	 * \param _ui32Raw The raw Unicode value to convert.
	 * \param _ui32Len The length, in LSUTF16 characters, of the converted value.
	 * \return Returns the converted character in LSUINT32 form along with the length, in units of
	 *	LSUTF16, of the returned value.
	 */
	LSUINT32 LSE_CALL CStd::RawUnicodeToUtf16Char( LSUINT32 _ui32Raw, LSUINT32 &_ui32Len ) {
		if ( (_ui32Raw & 0xFFFF0000) == 0UL ) {
			_ui32Len = 1UL;
			return _ui32Raw;
		}

		_ui32Len = 2UL;

		// Break into surrogate pairs.
		_ui32Raw -= 0x10000UL;
		LSUINT32 ui32Hi = (_ui32Raw >> 10UL) & 0x3FF;
		LSUINT32 ui32Low = _ui32Raw & 0x3FF;

		return (0xD800UL | ui32Hi) |
			((0xDC00UL | ui32Low) << 16UL);
	}
	/**
	 * Convert a raw 32-bit Unicode character to a UTF-8 character.  Returns the UTF-8 character as
	 *	an LSUINT32 value.  The returned length is the number of LSUTF8 characters returned.
	 *
	 * \param _ui32Raw The raw Unicode value to convert.
	 * \param _ui32Len The length, in LSUTF8 characters, of the converted value.
	 * \return Returns the converted character in LSUINT32 form along with the length, in units of
	 *	LSUTF8, of the returned value.
	 */
	LSUINT32 LSE_CALL CStd::RawUnicodeToUtf8Char( LSUINT32 _ui32Raw, LSUINT32 &_ui32Len ) {
		// Handle the single-character case separately since it is a special case.
		if ( _ui32Raw < 0x80UL ) {
			_ui32Len = 1UL;
			return _ui32Raw;
		}

		// Upper bounds checking.
		if ( _ui32Raw > 0x10FFFFUL ) {
			// Invalid character.  What should we do?
			// Return a default character.
			_ui32Len = 1UL;
			return '?';
		}

		// Every other case uses bit markers.
		// Start from the lowest encoding and check upwards.
		LSUINT32 ui32High = 0x00000800UL;
		LSUINT32 ui32Mask = 0xC0;
		_ui32Len = 2UL;
		while ( _ui32Raw >= ui32High ) {
			ui32High <<= 5UL;
			ui32Mask = (ui32Mask >> 1) | 0x80UL;
			++_ui32Len;
		}

		// Encode the first byte.
		LSUINT32 ui32BottomMask = ~((ui32Mask >> 1UL) | 0xFFFFFF80UL);
		LSUINT32 ui32Ret = ui32Mask | ((_ui32Raw >> ((_ui32Len - 1UL) * 6UL)) & ui32BottomMask);
		// Now fill in the rest of the bits.
		LSUINT32 ui32Shift = 8UL;
		for ( LSUINT32 I = _ui32Len - 1UL; I--; ) {
			// Shift down, mask off 6 bits, and add the 10xxxxxx flag.
			LSUINT32 ui32This = ((_ui32Raw >> (I * 6UL)) & 0x3F) | 0x80;

			ui32Ret |= ui32This << ui32Shift;
			ui32Shift += 8UL;
		}

		return ui32Ret;
	}
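
The only non-obvious part of using these encoders is unpacking the returned LSUINT32: the first code unit lives in the low bits and each subsequent unit sits above it, in 8-bit steps for UTF-8 and 16-bit steps for UTF-16.  Here is a minimal sketch, not part of the engine (the Append names are mine), that writes one encoded character into an output buffer; chained with NextUtf8Char() or NextUtf16Char() above, it gives a complete converter between the two formats:

	#include <vector>

	void AppendUtf8Char( std::vector<LSUTF8> &_vDst, LSUINT32 _ui32Raw ) {
		LSUINT32 ui32Units;
		LSUINT32 ui32Packed = CStd::RawUnicodeToUtf8Char( _ui32Raw, ui32Units );
		for ( LSUINT32 I = 0UL; I < ui32Units; ++I ) {
			// Peel the code units off from the low byte upward.
			_vDst.push_back( static_cast<LSUTF8>( ui32Packed >> (I * 8UL) ) );
		}
	}

	void AppendUtf16Char( std::vector<LSUTF16> &_vDst, LSUINT32 _ui32Raw ) {
		LSUINT32 ui32Units;
		LSUINT32 ui32Packed = CStd::RawUnicodeToUtf16Char( _ui32Raw, ui32Units );
		for ( LSUINT32 I = 0UL; I < ui32Units; ++I ) {
			// Same idea, but the units are 16 bits wide.
			_vDst.push_back( static_cast<LSUTF16>( ui32Packed >> (I * 16UL) ) );
		}
	}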

Closing

UTF is often misunderstood, and many programmers are not sure how to support it properly.  I hope this article clears up some common misunderstandings, especially regarding UTF-16 and UCS-2, and regarding wchar_t and cross-platform support.

It is easy to put off proper UTF support in your projects simply because you want to move on to the juicy things.  These functions, which are cross-platform, optimized, tested, and easy to modify to fit into any project, can help you get UTF support in place quickly so that you can move on to the graphics code you want to write.

L. Spiro

About L. Spiro

L. Spiro is a professional actor, programmer, and artist, with a bit of dabbling in music.

[Senior Core Tech Engineer]/[Motion Capture] at Deep Silver Dambuster Studios on:
  * Homefront: The Revolution
  * UNANNOUNCED

[Senior Graphics Programmer]/[Motion Capture] at Square Enix on:
  * Luminous Studio engine
  * Final Fantasy XV

[R&D Programmer] at tri-Ace on:
  * Phantasy Star Nova
  * Star Ocean: Integrity and Faithlessness
  * Silent Scope: Bone Eater
  * Danball Senki W

[Programmer] on:
  * Leisure Suit Larry: Beach Volley
  * Ghost Recon 2 Online
  * HOT PXL
  * 187 Ride or Die
  * Ready Steady Cook
  * Tennis Elbow

L. Spiro is currently a GPU performance engineer at Apple Inc.

Hyper-realism (pencil & paper): https://www.deviantart.com/l-spiro/gallery/4844241/Realism
Music (live-played classical piano, remixes, and original compositions): https://soundcloud.com/l-spiro/

4 Awesome Comments So Far

  1. Adel
    June 10, 2012 at 11:52 PM #

    Nice article. I work very often with the win32 api (Windows SDK) and over time I’ve adopted the custom of using TCHAR for all my string needs and compiling with UNICODE defined, so all my strings are actually wchar_t strings. However after reading this and two other articles about unicode strings, I’m beginning to doubt the wisdom of sticking to Windows’ UTF16 strings. The only drawback I can think of for just switching back to char* strings is having to convert to UTF16 every time I want to pass a string to Windows. What are your thoughts?

    • L. Spiro
      June 11, 2012 at 2:40 PM #

      I use UTF-8 strings for everything. Since Microsoft® does not recognize UTF-8 when accessing files and other things, I convert to UCS-2 just for the call to the Windows® API (which is your main concern). While this may seem like unnecessary overhead, there are actually 3 points that make this the better choice.

      #1: Universally storing all of your strings with 2 bytes instead of 1 takes a lot more space. I prefer unnecessary overhead to unnecessary storage.
      #2: This conversion only takes place on file access and a few other areas which in themselves are slow operations. You shouldn’t be accessing files too often and when you are, the overhead of the conversion is nothing compared to the access of the file. It won’t make a dent on your load times and it should never happen during run-time, so the overhead of the conversion is really moot.
      #3: Other operating systems do recognize UTF-8 when opening files etc. If you ever plan to branch out (it seems very practical that you might look into iOS development soon) you will find yourself wishing you had just stuck with UTF-8 all along.

      Speaking of portability, I have my own LSWCHAR_T typedef, but its size does not change on Windows®. Instead, the meaning of my typedef is that it changes sizes on different platforms. It is basically always the size of L"" strings, which is 2 bytes on Windows® and 4 bytes on most other platforms.

      L. Spiro

  2. Adel
    June 12, 2012 at 7:00 AM #

    It’s me again. In my applications I store GUI strings in a string table resource within the executable/dll. I keep a variable which stores the module handle of the binary which contains the string table resource, which is by default the executable itself.

    [code]HMODULE hResModule = ;[/code]

    When I need a string, I use a function (GetString()) which uses the ::LoadString() API to load the string from the resource binary. Different languages can then be added by placing resource-only dll files for the language in a lang directory, and changing languages in the application then means just loading a specific dll and storing its handle in hResModule for GetString() to work with.

    The advantage of this over using regular text files containing the strings is that I also store dialog templates and menus in the resource binary. This is important if one wants to add support for a right-to-left language (Arabic), because then even the layout of dialog box items needs to change, which is easier to do using a proper resource editor.

    I guess what I’m trying to say is that switching over to UTF-8 may be a bit more pain than it’s worth for some Windows applications (or rather Windows-only applications), especially if the application’s use of strings is mostly for GUI stuff eventually drawn on screen using Windows routines.

    I’m not sure what I said makes a lot of sense. I’ve actually not needed to do any porting to other languages yet so I don’t really have actual experience, plus I’m too tired to muster enough focus to revise what I just wrote :s
    Any thoughts?

    • L. Spiro
      June 12, 2012 at 8:03 PM #

      If you are just doing Windows® API and mostly applications rather than games, that is the way to go. My method assumes you are making games or doing cross-platform coding.

      L. Spiro
