Irfan's World

January 25, 2011

What is Unicode and non Unicode data formats?

Filed under: Data Warehouse — Tags: , — Irfan @ 12:33 am

In my earlier post, I have explained about data. In this post, I am going to explain you some basics about Unicode and non Unicode formats:

Unicode :
A Unicode character takes more bytes to store the data in the database. As we all know, many global industries wants to increase their business worldwide and grow at the same time, they would want to widen their business by providing services to the customers worldwide by supporting different languages like Chinese, Japanese, Korean and Arabic. Many websites these days are supporting international languages to do their business and to attract more and more customers and that makes life easier for both the parties.

To store the customer data into the database the database must support a mechanism to store the international characters, storing these characters is not easy, and many database vendors have to revised their strategies and come up with new mechanisms to support or to store these international characters in the database. Some of the big vendors like Oracle, Microsoft, IBM and other database vendors started providing the international character support so that the data can be stored and retrieved accordingly to avoid any hiccups while doing business with the international customers.

The difference in storing character data between Unicode and non-Unicode depends on whether non-Unicode data is stored by using double-byte character sets. All non-East Asian languages and the Thai language store non-Unicode characters in single bytes. Therefore, storing these languages as Unicode uses two times the space that is used specifying a non-Unicode code page. On the other hand, the non-Unicode code pages of many other Asian languages specify character storage in double-byte character sets (DBCS). Therefore, for these languages, there is almost no difference in storage between non-Unicode and Unicode.

Encoding Formats:
Some of the common encoding formats for Unicode are UCS-2, UTF-8, UTF-16, UTF-32 have been made available by database vendors to their customers. For SQL Server 7.0 and higher versions Microsoft uses the encoding format UCS-2 to store the UTF-8 data. Under this mechanism, all Unicode characters are stored by using 2 bytes.

Unicode data can be encoded in many different ways. UCS-2 and UTF-8 are two common ways to store bit patterns that represent Unicode characters. Microsoft Windows NT, SQL Server, Java, COM, and the SQL Server ODBC driver and OLEDB provider all internally represent Unicode data as UCS-2.

The options for using SQL Server 7.0 or SQL Server 2000 as a backend server for an application that sends and receives Unicode data that is encoded as UTF-8 include:

For example, if your business is using a website supporting ASP pages, then this is what happens:

If your application uses Active Server Pages (ASP) and you are using Internet Information Server (IIS) 5.0 and Microsoft Windows 2000, you can add "<% Session.Codepage=65001 %>" to your server-side ASP script. This instructs IIS to convert all dynamically generated strings (example: Response.Write) from UCS-2 to UTF-8 automatically before sending them to the client.

If you do not want to enable sessions, you can alternatively use the server-side directive "<%@ CodePage=65001 %>".

Any UTF-8 data sent from the client to the server via GET or POST is also converted to UCS-2 automatically. The Session.Codepage property is the recommended method to handle UTF-8 data within a web application. This Codepage setting is not available on IIS 4.0 and Windows NT 4.0.

Sorting and other operations :

The effect of Unicode data on performance is complicated by a variety of factors that include the following:

1. The difference between Unicode sorting rules and non-Unicode sorting rules
2. The difference between sorting double-byte and single-byte characters
3. Code page conversion between client and server

Performing operations like >, <, ORDER BY are resource intensive and will be difficult to get correct results if the codepage conversion between client and server is not available.

Sorting lots of Unicode data can be slower than non-Unicode data, because the data is stored in double bytes. On the other hand, sorting Asian characters in Unicode is faster than sorting Asian DBCS data in a specific code page, because DBCS data is actually a mixture of single-byte and double-byte widths, while Unicode characters are fixed-width.

Non-Unicode :

Non Unicode is exactly opposite to Unicode. Using non Unicode it is easy to store languages like ‘English’ but not other Asian languages that need more bits to store correctly otherwise truncation will occur.

Now, let’s see some of the advantages of not storing the data in Unicode format:

1. It takes less space to store the data in the database hence we will save lot of hard disk space.
2. Moving of database files from one server to other takes less time.
3. Backup and restore of the database makes huge impact and it is good for DBA’s that it takes less time

Non-Unicode vs. Unicode Data Types: Comparison Chart

The primary difference between unicode and non-Unicode data types is the ability of Unicode to easily handle the storage of foreign language characters which also requires more storage space.

Non-Unicode Unicode
(char, varchar, text) (nchar, nvarchar, ntext)
Stores data in fixed or variable length Same as non-Unicode
char: data is padded with blanks to fill the field size. For example, if a char(10) field contains 5 characters the system will pad it with 5 blanks nchar: same as char
varchar: stores actual value and does not pad with blanks nvarchar: same as varchar
requires 1 byte of storage requires 2 bytes of storage
char and varchar: can store up to 8000 characters nchar and nvarchar: can store up to 4000 characters
Best suited for US English: "One problem with data types that use 1 byte to encode each character is that the data type can only represent 256 different characters. This forces multiple encoding specifications (or code pages) for different alphabets such as European alphabets, which are relatively small. It is also impossible to handle systems such as the Japanese Kanji or Korean Hangul alphabets that have thousands of characters."1 Best suited for systems that need to support at least one foreign language: "The Unicode specification defines a single encoding scheme for most characters widely used in businesses around the world. All computers consistently translate the bit patterns in Unicode data into characters using the single Unicode specification. This ensures that the same bit pattern is always converted to the same character on all computers. Data can be freely transferred from one database or computer to another without concern that the receiving system will translate the bit patterns into characters incorrectly.

I think I should stop here, I will go on writing otherwise.


January 24, 2011

SQL Server Dates – Part 2 (Quarter)

In this post, I will be showing you how to generate Quarterly dates using getdate() database function.

Quarter Dates

The following code will generate First Day of the Previous Quarter, Last Day of the Previous Quarter, First Day of the Current Quarter, Last Day of the Current Quarter.

select ‘First Day of the Previous Quarter :’ , DATEADD(QQ,DATEDIFF(QQ,0,getdate())-1,0)
select ‘Last Day of the Previous Quarter :’ , DATEADD(QQ,DATEDIFF(QQ,0,getdate()),0)-1
select ‘First Day of the Current Quarter :’ , DATEADD(QQ,DATEDIFF(QQ,0,getdate()),0)
select ‘Last Day of the Current Quarter :’ , DATEADD(QQ,DATEDIFF(QQ,0,getdate())+1,0)-1
select ‘First Day of the Next Quarter :’ , DATEADD(QQ,DATEDIFF(QQ,0,getdate())+1,0)

Wait for my next post for more Quarterly dates. Click for Date Part 3

SQL Server Dates – Part 1 (Year)

To display today’s date in SQL Server 2008 select GETDATE().

Following are some of the dates that can be used in SQL Queries to fulfill the end user needs while retrieving data from the database or creating custom objects for delivering reports in BI tools.

select ‘First Day of the Previous Year :’ , DATEADD(YY,DATEDIFF(YY,0,getdate())-1,0)
select ‘Last Day of the Previous Year :’ , DATEADD(YY,DATEDIFF(YY,0,getdate()),0)-1
select ‘First Day of the Current Year :’ , DATEADD(YY,DATEDIFF(YY,0,getdate()),0)
select ‘Last Day of the Current Year :’ , DATEADD(YY,DATEDIFF(YY,0,getdate())+1,0)-1
select ‘First Day of the Next Year :’ , DATEADD(YY,DATEDIFF(YY,0,getdate())+1,0)

I will be continuing on how to generate Quarterly dates in my next post Quarter Dates

January 23, 2011

SQL Server Dates – Part 4 (Month)

Most of the times when we develop Business Intelligence reports for trending, like trend reports, cross tabs or charts, compare analysis ( current month vs. previous month, current rolling 6 months vs. previous rolling 6 months, current rolling 12 months vs. previous rolling 12 months) which are widely used in BI industry.

Following are some date examples of how to generate those dates using GETDATE() function from SQL Server 2008 database that can be beneficial for your report development. These date examples are also very useful for scheduling reports. (i.e. something we call dynamic dates as parameters for scheduling the reports)

Month Dates

select ‘First Day of the Previous Month :’ , DATEADD(MM,DATEDIFF(MM,0,getdate())-1,0)
select ‘Last Day of the Previous Month :’ , DATEADD(MM,DATEDIFF(MM,0,getdate()),0)-1
select ‘First Day of the Current Month :’ , DATEADD(MM,DATEDIFF(MM,0,getdate()),0)
select ‘Last Day of the Current Month :’ , DATEADD(MM,DATEDIFF(MM,0,getdate())+1,0)-1
select ‘First Day of the Next Month :’ , DATEADD(MM,DATEDIFF(MM,0,getdate())+1,0)

Rolling 12 Months

select ‘Current Rolling 12 Months / Begin Date :’ , DATEADD(MM,DATEDIFF(MM,0,getdate())-11,0)
select ‘Current Rolling 12 Months / End Date :’ , DATEADD(MM,DATEDIFF(MM,0,getdate())+1,0)-1
select ‘Previous Rolling 12 Months / Begin Date :’ , DATEADD(MM,DATEDIFF(MM,0,getdate())-23,0)
select ‘Previous Rolling 12 Months / End Date :’ , DATEADD(MM,DATEDIFF(MM,0,getdate())-11,0)-1

Wait for my next post on Weekly dates.

SQL Server Dates – Part 3 (Quarter ..)

Quarterly Dates continued … Following are mostly used Quarterly Dates in Data Warehousing and Business Intelligence reports.

Previous Year Quarters:

select ‘First Day of the Previous Year / First Quarter :’ , DATEADD(Q,0,DATEADD(YY,DATEDIFF(YY,0,getdate())-1,0))
select ‘First Day of the Previous Year / Second Quarter :’ , DATEADD(Q,1,DATEADD(YY,DATEDIFF(YY,0,getdate())-1,0))
select ‘First Day of the Previous Year / Third Quarter :’ , DATEADD(Q,2,DATEADD(YY,DATEDIFF(YY,0,getdate())-1,0))
select ‘First Day of the Previous Year / Fourth Quarter :’ , DATEADD(Q,3,DATEADD(YY,DATEDIFF(YY,0,getdate())-1,0))

Current Year Quarters:

select ‘First Day of the Current Year / First Quarter :’ , DATEADD(Q,0,DATEADD(YY,DATEDIFF(YY,0,getdate()),0))
select ‘First Day of the Current Year / Second Quarter :’ , DATEADD(Q,1,DATEADD(YY,DATEDIFF(YY,0,getdate()),0))
select ‘First Day of the Current Year / Third Quarter :’ , DATEADD(Q,2,DATEADD(YY,DATEDIFF(YY,0,getdate()),0))
select ‘First Day of the Current Year / Fourth Quarter :’ , DATEADD(Q,3,DATEADD(YY,DATEDIFF(YY,0,getdate()),0))
select ‘First Day of the Current Year / Third Quarter :’ , DATEADD(QQ,DATEDIFF(QQ,3,getdate())-1,0)
select ‘Last Day of the Current Year / Second Quarter :’ , DATEADD(QQ,DATEDIFF(QQ,3,getdate())-1,0)-1

Create a free website or blog at