Irfan's World

January 25, 2011

What is data ? How it is stored in Database ?

Filed under: Data Warehouse — Tags: — Irfan @ 12:29 am

Today, I am going to explain you a little about data. This four letter character is a huge topic that cannot be explained in a single post. I will try to keep it concise. In addition to that, we will see different types of data encoding schemes, their use in businesses. During this era where many companies all building their data warehouses by supporting Unicode or non Unicode formats, this topic has been highly discussed by DBA’s or Data Architect’s to support their projects and is top in their discussion list. In this post, I am leaning more towards Microsoft SQL Server and how it is used.?

Data :

A set of characters belonging to a language plus some special characters from a keyboard.

For example, A character set of English Language may contain letters from A..Z, numbers, punctuations, commas, ampersand, plus, minus, hyphen, backslash, tilde, ENTER, SHIFT and other special characters.

How Character data is stored:

In a computer, characters are represented by different patterns of bits being either ON or OFF.

There are 8 bits in a byte, and the 8 bits can be turned ON and OFF in 256 (28) different patterns. A program that uses 1 byte to store each character can therefore represent up to 256 different characters by assigning a character to each of the bit patterns.

There are 16 bits in 2 bytes, and 16 bits can be turned ON and OFF in 65,536 unique patterns. A program that uses 2 bytes to represent each character can represent up to 65,536 (216) characters.

Single-byte code pages (SBCS) are definitions of the characters mapped to each of the 256 bit patterns possible in a byte. Code pages define bit patterns for uppercase and lowercase characters, digits, symbols, and special characters such as !, @, #, or %. Each European language, such as German or Spanish, has its own single-byte code page. Although the bit patterns used to represent the Latin alphabet characters A through Z are the same for all the code pages, the bit patterns used to represent accented characters such as ‘é’ and ‘á’ vary from one code page to the next.

For example, If data is exchanged between computers running different code pages, all character data must be converted from the code page of the sending computer to the code page of the receiving computer. If the source data has extended characters that are not defined in the code page of the receiving computer, data is lost. When a database serves clients from many different countries/regions, it is difficult to pick a code page for the database that contains all the extended characters required by all the client computers.

The common encoding formats used to store the data are American Standard Code for Information Interchange (ASCII) and IBM’s Extended Binary Coded Decimal Interchange Code (EBCDIC). Mostly these codes were used to represent a common language like English.

The advancement of technological challenges arose to support International Languages for global audience, a committee came with a new type of encoding format i.e. Unicode. There are different types of Unicode formats available such as UTF-8, UTF-16, UTF-32 and similarly UCS-2 used by SQL Server.

In order to support multiple languages around the world, Microsoft came up with their own UCS-2 encoding scheme to store and compare the data multiple language data in the database and it gave rise to a concept called Collation that will unique identify database’s language, sort rules, code page etc..

In brief, I have taken some excerpts about collation from Microsoft website that is shown below:

Collation:

English meaning of Collation from dictionary.com : assembling in proper numerical or logical sequence

A collation encodes the rules governing the proper use of characters for either a language, such as Greek or Polish, or an alphabet.

A SQL Server collation defines how the database engine stores and operates on character and Unicode data.

The physical storage of character strings in Microsoft® SQL Server™ 2000 is controlled by collations. A collation specifies the bit patterns that represent each character and the rules by which characters are sorted and compared.

Collation Properties:

  • The sort order to use for Unicode data types (nchar, nvarchar, and ntext). A sort order defines the sequence in which characters are sorted (ascending or descending), and the way characters are evaluated in comparison operations. (<, >, =, <=, >=, < > ).
  • The sort order to use for non-Unicode character data types (char, varchar, and text).
  • The code page used to store non-Unicode character data.

SQL Server database collations can be specified at many levels :

  • When you install an instance of SQL Server database, you specify the default collation for that instance.
  • Each time you create a database, you can specify the default collation used for the database. If you do not specify a collation, the default collation for the database is the default collation for the instance.
  • Whenever you define a character column, you can specify its collation. If you do not specify a collation, the column is created with the default collation of the database.

SQL_SortRules[_Pref]_CPCodepage_<ComparisonStyle>

SortRules (Language Character Set) 

A string identifying the alphabet or language whose sorting rules are applied when dictionary sorting is specified. Examples are Latin1_General or Polish.

Pref

Specifies uppercase preference.

Codepage

Specifies a one- to four-digit number that identifies the code page used by the collation. CP1 specifies code page 1252, for all other code pages the complete code page number is specified. For example, CP1251 specifies code page 1251 and CP850 specifies code page 850.

CaseSensitivity

CI specifies case-insensitive, CS specifies case-sensitive.

AccentSensitivity

AI specifies accent-insensitive, AS specifies accent-sensitive.

BIN
Specifies the binary sort order to be used.

Codepages

Codepages are usually defined to support specific languages or groups of languages which share common writing systems. For example, codepage 1253 provides character codes required in the Greek writing system.

The order of the character codes in a codepage allows the system to provide the appropriate character code to an application like ASP.NET or Web Services.

When a new codepage is loaded, a set of different character codes are provided to the application. In Windows, codepages can be changed on-the-fly by the user, without changing the default system language in use. An application can determine which codepages a specific font supports and then can present language options to the user.

Advertisements

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: