This lesson discusses the language changes introduced in DataFlex to support working with Unicode. A new native string data type, ‘WString,’ has been added. It is a UTF-16 string, created to simplify communication with the Windows API and other external APIs that use UTF-16. ‘WString’ should not be used elsewhere in the code unless it is needed. The runtime functions operate on ‘String,’ so a ‘WString’ first has to be converted to ‘String’ and the result converted back afterwards, which is inefficient. Simply working with ‘String’ is faster.

Conversions between ‘String’ and ‘WString’ can be made simply by using a ‘Move’ statement. The actual conversion between UTF-8 and UTF-16 is then done automatically by the runtime.
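
As a minimal sketch of this, assuming a small test program (the variable names are illustrative and do not come from the lesson), a value can be moved in either direction and the runtime takes care of the conversion:

```dataflex
// Sketch only: sName and wName are hypothetical variable names.
String sName
WString wName

Move "Müller" to sName   // regular UTF-8 String
Move sName to wName      // runtime converts UTF-8 to UTF-16
Move wName to sName      // runtime converts UTF-16 back to UTF-8

Showln sName
```

Each ‘Move’ between the two types costs a conversion, which is why the round trip is worth avoiding when no UTF-16 API is involved.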

Demonstration

This example calls the Windows function ‘PathFileExists,’ using the external function defined in the packages, and passes it a ‘String’ parameter.
The definition of the function shows that the wide (UTF-16) version is called and that the parameter is defined as a ‘WString.’ Prior to DataFlex 2021, it was a ‘String.’ The calling code does not need to change even though a ‘WString’ is now expected, because when the ‘String’ is passed to the external function, the runtime automatically converts it to a ‘WString.’
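
The call described above might look roughly like the following. This is a sketch under assumptions: the ‘Use’ line, the path and the variable names are hypothetical, and the declaration in the comment only illustrates the shape the text describes (wide entry point, ‘WString’ parameter). It is not copied from the package.

```dataflex
Use DFAllEnt.pkg   // assumption: this pulls in the package that declares PathFileExists

// Shape of the packaged declaration as the text describes it
// (sketch only; the parameter name is hypothetical):
//   External_Function PathFileExists "PathFileExistsW" shlwapi.dll WString sPath Returns Integer

String sPath
Boolean bExists

Move "C:\Temp\example.txt" to sPath       // hypothetical path
Move (PathFileExists(sPath)) to bExists   // the String is converted to WString by the runtime

If (bExists) Showln "File exists"
Else Showln "File not found"
```

The calling code keeps using a ‘String’ variable; only the declaration changed.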

Before DataFlex 2021, the length of the string (the number of characters in the string) was always the same as the size of the string (the number of bytes). As explained in lesson 6, this is no longer the case. They can now be different.

A character can take a different number of bytes. In DataFlex 2021, the ‘Length’ function still returns the number of characters in a string, and it also works for ‘WString,’ but it should no longer be used to get the size of a string (the number of bytes). There is a new function for that, called ‘SizeOfString.’ It returns the number of bytes, which in UTF-8 is equal to the number of code units. A similar function, ‘SizeOfWString,’ exists for the ‘WString’ type. It returns the number of UTF-16 code units, that is, the number of two-byte units.

In this example, ‘SizeOfString’ returns the value 6, because the capital A is one byte, the accented a is two bytes and the euro sign is three bytes. In the last line, ‘SizeOfWString’ is called on the same ‘String.’ Because ‘SizeOfWString’ expects a ‘WString,’ the runtime first converts ‘sLabel’ to a ‘WString’ and then performs ‘SizeOfWString’ on it. A value of 3 (three code units) is returned, because in this instance each character is one code unit. In a UTF-16 string each code unit is two bytes, so the string size in memory is 6 bytes.
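
The on-screen code is not part of this transcript, so the following is a reconstruction of what the description implies. Only ‘sLabel’ is named in the text; the other variable names, and the assumption that the accented a is the single precomposed character á, are illustrative.

```dataflex
// Reconstruction of the described example; names other than sLabel are hypothetical.
String sLabel
Integer iChars iBytes iUnits

Move "Aá€" to sLabel

Move (Length(sLabel)) to iChars          // characters: 3
Move (SizeOfString(sLabel)) to iBytes    // UTF-8 bytes: 1 + 2 + 3 = 6
Move (SizeOfWString(sLabel)) to iUnits   // sLabel is converted to WString first: 3 code units

Showln "Length: " iChars
Showln "SizeOfString: " iBytes
Showln "SizeOfWString: " iUnits
Showln "UTF-16 bytes: " (iUnits * 2)     // 3 code units x 2 bytes = 6
```

Both encodings happen to need 6 bytes here, but the byte counts come from different arithmetic, so they should not be expected to match in general.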

All other string functions, such as ‘Pos,’ ‘Left’ and ‘Mid,’ work on the regular UTF-8 string and are character based. Existing code that uses them will probably still work well, unless a result is used as a number of bytes, for memory operations for example.

Demonstration

This demo application further demonstrates the differences between string size, string length and the use of string functions. The test string is “Hello my name is Stefan Müller,” and it contains one character that is not part of the basic set of 128 characters. The length of the string is 31 characters, or 31 code points, both in UTF-8 and UTF-16, but the size of the string is different. In UTF-8, the function ‘SizeOfString’ returns 32 code units, but for ‘WString’ the function ‘SizeOfWString’ returns 31. In other words, in UTF-16 every character of this string fits in a single code unit. Note that each code unit is 2 bytes for a ‘WString,’ so the size of this string is 62 bytes.
Adding a smiley character shows that in UTF-8 this character is 4 bytes, or 4 code units, so the string size increases by 4. In UTF-16 this character is also 4 bytes, but 2 code units, so ‘SizeOfWString’ increases by 2.
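
A short sketch of the smiley case on its own, assuming the source file is saved as UTF-8 so the literal survives (the variable name is hypothetical):

```dataflex
// Sketch only: measures a single smiley character in isolation.
String sSmiley

Move "😀" to sSmiley

Showln "Length: " (Length(sSmiley))                // per the text, one character
Showln "SizeOfString: " (SizeOfString(sSmiley))    // 4 bytes (4 code units) in UTF-8
Showln "SizeOfWString: " (SizeOfWString(sSmiley))  // 2 code units (4 bytes) in UTF-16
```
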
In Unicode, there can be special situations, such as when uppercasing. When the German word “Süßigkeit” is uppercased using the ‘Uppercase’ function, the sharp S (ß) becomes a double S, which means two characters instead of one. This changes the number of characters, although in this case the size of the string in UTF-8 stays the same, because ß takes two bytes and each S takes one. In UTF-16 the size would increase.
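
To observe this effect, a sketch along the following lines could be used. The variable names are hypothetical and no output values are given here; the expected behaviour is the one described in the text.

```dataflex
// Sketch only: compares character count and sizes before and after uppercasing.
String sWord sUpper

Move "Süßigkeit" to sWord
Move (Uppercase(sWord)) to sUpper

Showln sUpper
Showln "Characters:   " (Length(sWord)) " -> " (Length(sUpper))
Showln "UTF-8 bytes:  " (SizeOfString(sWord)) " -> " (SizeOfString(sUpper))
Showln "UTF-16 units: " (SizeOfWString(sWord)) " -> " (SizeOfWString(sUpper))
```
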
The ‘Pos,’ ‘Mid,’ ‘Right’ and ‘Left’ functions still work as expected. The M, for example, is the 29th character. Here, the M is the 7th character, because the smiley counts as one character.
These functions also work on ‘WString,’ but note that the runtime first converts the value to UTF-8 before the function is actually executed, which makes the process a bit slower.
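
A sketch of the character-based functions on both types follows. The test string and variable names are hypothetical and are not the ones used in the demo.

```dataflex
// Sketch only: the string and the names are illustrative.
String sText
WString wText
Integer iPos

Move "Hello😀Müller" to sText
Move sText to wText

Move (Pos("M", sText)) to iPos   // character based; per the text the smiley counts as one character
Showln "Pos of M: " iPos
Showln "Left 5: " (Left(sText, 5))
Showln "Right 6: " (Right(sText, 6))

// The same call on a WString works, but the runtime converts it to UTF-8 first.
Move (Pos("M", wText)) to iPos
Showln "Pos of M in WString: " iPos
```

The position can be passed on to ‘Left’ or ‘Mid,’ but it should not be used as a byte offset for memory operations.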


Migrating to DataFlex 2021 Part 2

Lesson 8: Unicode language changes
