Sunday 22 July 2012

Units of SQL Server Data Storage – Extent and Data Pages

Sunday 22 July 2012
SQL Server Data Storage – Extent and Data Pages
For storing information in database we use Table. For that we create table by writing Create Table statement. But before we do that, let’s take a look at how SQL Server stores data.
In SQL SERVER data is stored in the form of Data Pages. Basic unit of storage to provide space for tables is Extent. Every Extent has number of Data Pages. When new records are inserted new data pages are allocated. Each Extent consists of eight data pages. New Extent will be allocated after all the eight pages are consumed. While Extent is the basic storage unit from database point of view, page is a unit of allocation within Extent.
Everything is stored on 8K Data Pages (technically 8060 bytes are available). 8 Data Pages (64K) combines into an extent.
This means 1 EXTENT = collection of 8 Data Pages.
The Extent is the basic unit of allocation for tables and indexes and all objects are saved in a table of some kind, all objects are stored in extents.

Fig. 1Extent and Data Page


SQL Server has two types of extents:

» Uniform:   In Uniform Extents, all eight pages are used by the same object or stores
                   data pages belong to the one object
.

» Mixed:       Mixed Extents are used by objects that are too small to take up eight
                   pages, so more than one object is stored in the extent or
stores data
                   pages belong to the different objects
.First 8 pages for the objects are stored in the mixed extents, after that only uniform extents are used. All space allocation is done based on extents - 64K blocks regardless of the type (mixed or uniform).
Fig.-2: Extent
A new table or index is generally allocated pages from mixed extents. When the table or index grows to the point that it has eight pages, it then switches to use uniform extents for subsequent allocations. If you create an index on an existing table that has enough rows to generate eight pages in the index, all allocations to the index are in uniform extents.
When a table or an index needs additional storage space, another extent is allocated to that object. A new extent will generally not be allocated for a table or index until all pages on that extent have been used. This process of allocating extents rather than individual pages to objects serves two useful purposes. First, the time-consuming process of allocation takes place in one batch rather than forcing each allocation to occur whenever a new page is needed. Second, it forces the pages allocated to an object to be at least somewhat contiguous. If pages were allocated directly, on an as needed basis, then pages belonging to a single object would not be next to each other in the data file. Page 1 might belong to table 1, page 2 might belong to index 3, page 3 might belong to table 5, and so on. This is called fragmentation. Fragmentation can have a significant negative impact on performance. When pages for a single object are contiguous, though, reads and writes can occur much more quickly.
Pages are contained in Extent. Every extent will have around eight data pages. But all the eight data pages are not created at once; it’s created depending on data demand. So when a pages becomes full it creates a new page, this process is called “Page Split”..
At the most fundamental level, everything in SQL Server is stored on an 8KB page. The page is the one common denominator for all objects in SQL Server. Many types of pages exist, but every page has some factors in common. Pages are always 8KB in size and always have a header, leaving about 8,060 bytes of usable space on every page
There are a few special data pages types SQL Server is using to track extents allocation. Those pages are basically bitmaps – every bit handles one extent. So one page can cover 64,000 extents or almost 4 GB of data. Let’s take a quick look at them:
SQL Server has eight primary types of pages:
v  Data pages:
Data pages hold the actual database records. The data page is 8,192 bytes, but only 8,060 of those bytes are available for data storage because a header at the beginning of each data page contains information about the page itself. Rows are not allowed to span more than one page, but if you have variable-length columns that exceed this limit, you can move them to a page in the ROW_OVERFLOW_DATA allocation unit.
v  Index pages:
Index pages store the index keys and levels making up the entire index tree. Unlike data pages, you have no limit for the total number of entries you can make on an index page.
v  Text/Image pages:
Text and image pages hold the actual data associated with text, ntext, and image data types. When a text field is saved, the record will contain a 16-byte pointer to a linked list of text pages that hold the actual text data. Only the 16-byte pointer inside the record is counted against the 8,060-byte record-size limit.
v  Global Allocation Map pages:
The Global Allocation Map (GAM) page type keeps track of which extents in a data file are allocated and which are still available.
v  Index Allocation Map pages:
Index Allocation Map (IAM) pages keep track of what an extent is being used for—specifically, to which table or index the extent has been allocated.
v  Page Free Space pages:
This is not an empty page; rather, it is a special type of page that keeps track of free space on all the other pages in the database. Each Page Free Space page can keep track of the amount of free space of up to 8,088 pages or about 64Mb of data.
 v   Bulk Changed Map pages:         This page contains information about other pages that have been modified by bulk operations (such as BULK INSERT) since the last BACKUP LOG statement.

 v   Differential Changed Map pages:
      This page contains information about other pages that have changes since the 
      last BACKUP DATABASE statement
v  Shared Global Allocation Map (SGAM):  
It tracks if extents are mixed extent and have at least one data page available for use.
Different sections in Page:

Page has three important sections:-
Ø  Page Header
Ø  Actual Data i.e. Data Row
Ø  Row pointers or Row Offset

Fig.-3 - Data Page

Let’s dive into one level deeper and take a look at the structure of the data page.




Fig.- 4- Data Page In Details

First 2 bytes contain header information. Next 2 bytes store the length of the fixed width data following by the data. Next are 2 bytes for the number of columns. Next, null bitmask (1 byte per 8 nullable columns). It follows by 2 bytes store number of variable width columns, variable width  column offset array (2 bytes per variable column) and variable width data. And finally there is the optional 14 bytes pointer to the version store. This one is used for optimistic isolation levels (snapshot, read committed snapshot), MARS, etc.

First of all, fixed width data always uses space even when null. Variable width data uses 2 extra bytes for offset storage for every value. Null values of variable width data are not stored although there are still 2 bytes in the offset array unless null values are last in the row.
So where does it lead us? Let’s think about the table which accepts some transactional data.Create table dbo.AppData
(
LastTranDate          Datetime       Not Null,
Amount                     float             not null,
IsApproved               int                   not null,
IsPending                  int                   not null,
LMDT                          Datetime       not null constraint DEF_ AppData_LMDT
                                                            default (getDate())

)

5 fields we have in this table require 8 + 8 + 4 + 4 + 8 = 32 bytes. Now let’s think for a minute. Do we really need to store LastTranDate date/time with precision up to 3 milliseconds? Would 1 minute be OK? Same about LMDT column. Can we use 1 second precision? What about Amount? Can we use smallmoney or maybe decimal(9,3)?
Let’s modify the table a little bit:Create table dbo.AppData
(
LastTranDate          smalldatetime  Not Null,
Amount                    
decimal(9,3)                      not null,
IsApproved               BIT                               not null,
IsPending                  BIT                               not null,
LMDT                          Datetime2(0)             not null constraint DEF_ AppData_LMDT
                                                                        default (getDate())

)

Now it requires: 4 + 5 + 1 + 0 (8 bit fields shares 1 byte of storage space) + 6 = 16 bytes. We ended up with 16 bytes of saving. Not much. On other hand, this is about 16K per 1000 rows. Or about 16Mb per 1M rows. And what if system collects 1M rows per day? It would be ~5.8Gb per year. What if you have 50M rows per day..?
Finally it’s not only about the storage size. It greatly affects performance of the system because of the extra IO operations and other things also.

So always we have use correct data types
.

Sunday 15 July 2012

Comparison between DBMS and RDBMS

Sunday 15 July 2012
Difference between DBMS and RDBMS:
In general if we categorically consider two types of Database, one is DBMS and another is RDBMS. In general term DBMS stands for Database Management System to control the storage of data and RDMBS stand for Relational Database Management System. This is the most common form of DBMS.
In 1969-1970s, Edgar Frank Codd introduced the theory of relational database. The only way to view the data is as a set of tables. Because there can be relationships between the tables, people often assume that is what the word "relational" means. Not so. Codd was a mathematician and the word "relational" is a mathematical term from the science of set theory. It means, actually "based on tables".
DBMS provides a systematic and a organized way of storing, managing and retrieving information from a logically related information collection. RDBMS also provides what DBMS provides with extended feature called Integrity Relationship or Relational Integrity or Termed as Referential Integrity.
We can conceptually visualize as:
RDBMS       =        DBMS                    +        Referential Integrity


Fig-1: Relationship
In the example above figure every Developer have a project which showing or indicating a referential integrity between ‘Developer’ and ‘Project’. If we omit this referential integrity in DBMS and File’s, it will allow you, but in RDBMS, it will restrict you from saving data if we have defined the referential integrity between these two entities. By establishing “Foreign Keys” we maintain this relationship in any RDBMS.

Database has to be persistent, meaning that the information stored in a database has to continue to exist even after the application(s) that saved and manipulated the information have ceased to run. A database also has to provide some uniform methods that are not dependent on a specific application for accessing the information that is stored inside the database.
This is a pretty liberal definition of a database. Lotus Notes calls its message stores "databases", and by this definition they qualify. There are a number of new database technologies that include object-oriented databases and associative databases, and they seem to qualify as databases under this definition too.
If we consider Text or flat binary files don't qualify as databases under this definition, since only the application that created one of these files knows enough about the file's contents to make use of the information stored within the file. They meet the persistence part of the DBMS definition, but not the independent access part of the definition.
Other "standards" like the Berkeley DB format supported by Perl, Python, and related languages do more or less qualify as a DBMS. While it isn't what most people think of when they think about DBMS setups, it does meet both the persistence and uniform access conditions for a DBMS.
An RDBMS is a Relational Data Base Management System. This adds the additional condition that the system supports a tabular structure for the data, with enforced relationships between the tables. This excludes the databases that I've listed so far since they either don't support a tabular structure at all, or don't enforce relationships between tables.
Microsoft's Jet database engine qualifies as an RDBMS under this definition, even though it seems like the majority of its users ignore the "relational" side of the engine by failing to declare foreign keys. Individual FoxPro files do not qualify because they don't have any built-in method for declaring or supporting relationships, even though nearly every FoxPro system I've ever seen expects or relies on these relationships.
Most DBAs think of an RDBMS as a client/server system. The database engine runs on a server, and client applications connect and request data from the server. Microsoft SQL Server, Oracle, DB2, and most of the other "industrial grade" databases in use today use this mental model.
Summarizing the topics as point to point:
DBMS:
A DBMS is a storage area that persist the data in files. To perform the database operations, the file should be in use.
Relationship can be established between 2 files.
There are limitations to store records in a single database file depending upon the database manager used.
DBMS allows the relations to be established between 2 files.
Data is stored in flat files with metadata.
DBMS does not support client / server architecture.
It does not support distributed architecture. It can share the data server to the client.
DBMS does not follow normalization. Only single user can access the data.
DBMS does not impose integrity constraints.
ACID properties of database must be implemented by the user or the developer
RDBMS:
RDBMS stores the data in tabular form.
It has additional condition for supporting tabular structure or data that enforces relationships among tables.
RDBMS supports client/server architecture.
It supports distributed architecture. It can share the data server to the client.
RDBMS follows normalization.
RDBMS allows simultaneous access of users to data tables.
RDBMS imposes integrity constraints.
ACID properties of the database are defined in the integrity constraints.
N.B:
“ACID” is a set of rule which are laid down to ensure that Database Transaction is reliable.
IT stands for:
A        ->       Atomicity
C       ->       Consistency
I         ->       Isolation
D       ->       Durability
We will focus on this topic in another future blog.

Sunday 8 July 2012

Why we needed Database:

Sunday 08 July 2012
Why we needed Database?
We are using MS SQL Server as Database for our application. So what is the actual need to use Database? That is the actual essence of this article.
Many people have a good habit to write daily dairy of every in and outs of the day. So dairy is an entity of storage to store or manage something or some information of our daily life.
If we think in broad perspective, Database provides a systematic and organized way of storing, managing and retrieving from collection of logically related information.
Secondly the information has to be persistent, that means even after the application is closed the information should be persisted.
Finally it should provide an independent way of accessing data and should not be dependent on the application to access the information.
Now explaining the third aspect with an example.
Figure of a text file containing some information as:

The first column of the information is Name, Second Address and third the Mobile number. This is simple text file which is designed by a programmer for a specific application.
It works fine in the boundary of the application. Now some years down the line a third party application has to be  integrated with this file, so in order the third party application integrates properly it has the following options:-
v  Use interface of the original application
v  Understand the complete detail of how the text file is organized, example the first column is Name, then Address and finally Mobile number. After analyzing write a code which can read the file, parse it etc.
That’s what the main difference between a simple file and database. Database has independent way of accessing information while simple files do not. File meets the storing, managing and retrieving part of a database but not the independent way of accessing data.