Summary
Writing dynamic SQL is definitely taking your game to a higher level. Code that creates code. Cool.
Remember to always consider whether the parameters can be used to enable SQL injection.
A few key points from this chapter:
■ Build up the @SQLStr variable from the inside out: start with the dynamic list and then append the SELECT prolog.
■ If the WHERE clause is dynamic, chances are good the FROM clause will be too.
■ When writing dynamic SQL, add a PRINT statement to output the @SQLStr variable during development. It makes debugging much easier.
■ Use sp_executesql.
■ If there's a different way to make the code flexible, such as the parse-and-join method mentioned in the best practice, do that instead of dynamic SQL.
■ All the standard database integrity features (e.g., foreign keys) help defend against SQL injection.
■ Always think like a hacker. Where can an SQL injection string be used to alter the intention of the code?
■ Dynamic SQL is not necessarily ad-hoc SQL. Never permit ad-hoc SQL to your database.
This concludes a ten-chapter discussion on T-SQL development that began with "what is a batch" and progressed to the point of code-generating batches. If you're up to writing code-generating code in T-SQL, you're doing well as a SQL Server database developer. I congratulate you.
Data Connectivity
IN THIS PART
Chapter 30: Bulk Operations
Chapter 31: Executing Distributed Queries
Chapter 32: Programming with ADO.NET
Chapter 33: Sync Framework
Chapter 34: LINQ
Chapter 35: Asynchronous Messaging with Service Broker
Chapter 36: Replicating Data
Chapter 37: Performing ETL with Integration Services
Chapter 38: Access as a Front End to SQL Server
As much as I'd like to think that Management Studio is the ultimate UI and there's no need for any other interface to SQL Server, the truth is that SQL Server needs to connect to nearly any possible data conduit.
Other than Chapter 5, "Client Connectivity," all the code so far has occurred inside SQL Server. Part V focuses on the myriad ways that data can be brought into and synchronized with SQL Server.
Some of the connectivity technologies are well known and familiar: the simple but mighty bulk insert, distributed queries and linked servers, ADO.NET, replication, and Microsoft Access.
Other connectivity technologies are newer. Integration Services replaced DTS with SQL Server 2005, and Service Broker was also introduced with SQL Server 2005. LINQ and the Sync Framework are new with SQL Server 2008.
If SQL Server is the box, then this part busts out of the box and pumps data in and out of SQL Server.
Bulk Operations
IN THIS CHAPTER
Mass inserts from comma-delimited files
T-SQL ETL processing
Bulk insert
BCP
Performing bulk operations
Often, DBAs need to load copious amounts of data quickly, whether it's a nightly data load or a conversion from comma-delimited text files. When a few hundred megabytes of data need to get into SQL Server in a limited time frame, a bulk operation is the way to get the heavy lifting done.
XML's popularity may be growing, but its file sizes seem to be growing even faster. XML's data tags add significant bloat to a data file, sometimes quadrupling the file size or more. For very large files, IT organizations are sticking with CSV (also known as comma-delimited) files. For these old standby files, the best way to insert the data is a bulk operation.
In SQL Server, bulk operations pump data directly to the data file; how they interact with the transaction log depends on the recovery model (a brief T-SQL sketch follows this list):
■ Simple recovery model: No problem with recovery; the transaction log is used for current transactions only.
■ Bulk-logged recovery model: No problem with recovery; the bulk operation bypasses the transaction log, but the entire bulk operation's data is still captured when the log is backed up. One complication with bulk-logged recovery is that if bulk operations are undertaken, point-in-time recovery is not possible for the time period covered by the transaction log. To regain point-in-time recovery, the log has to be backed up. Because extent allocations are logged for bulk operations, a log backup after bulk operations will contain all the pages from extents that have been added, which results in a large transaction log backup.
■ Full recovery model: In the full recovery model, bulk operations are not minimally logged; the engine does full logging of the inserts. To restart the transaction log recoverability process following the bulk operation, perform a complete backup, and restart the transaction logs.
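A minimal T-SQL sketch of the surrounding steps, with an illustrative database name and backup path (neither is taken from this chapter's examples):
-- Switch to bulk-logged recovery before the load
ALTER DATABASE AdventureWorks SET RECOVERY BULK_LOGGED;

-- ... perform the bulk operation here ...

-- Return to full recovery and back up the log to restore point-in-time recovery going forward
ALTER DATABASE AdventureWorks SET RECOVERY FULL;
BACKUP LOG AdventureWorks TO DISK = 'C:\Backup\AdventureWorks_log.trn';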
For more details on recovery models and how to set them, see Chapter 41, "Recovery Planning." Details on the transaction log are covered in Chapter 66, "Managing Transactions, Locking, and Blocking."
Technically, the SELECT INTO syntax is also a bulk-logged operation, and it too bypasses the transaction log. SELECT INTO creates a table from the results of a SELECT statement; it is discussed in Chapter 15, "Modifying Data."
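For reference, a minimal SELECT INTO takes this form (the destination table name here is purely illustrative):
-- Creates dbo.AddressArchive on the fly from the result set
SELECT AddressID, City, PostalCode
INTO dbo.AddressArchive
FROM Person.Address;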
Bulk insert operations are normally one step of a nightly ETL (extract-transform-load) process. While developing these ETL processes in T-SQL is perfectly acceptable, Integration Services is a strong alternative, and it includes bulk operations. For more details about developing Integration Services solutions, see Chapter 37, "Performing ETL with Integration Services."
Bulk insert is extremely fast, and I've had good success using it in production environments. My one word of caution is that the data must be clean. Variations in data type, irregular columns, and missing columns will cause trouble.
Bulk operations can be performed from a command prompt using BCP (a command-prompt utility to copy data to and from SQL Server), within T-SQL using the BULK INSERT command, or using Integration Services.
Bulk Insert
The BULK INSERT command can be used within any T-SQL script or stored procedure to import data into SQL Server. The parameters of the command specify the table receiving the data, the location of the source file, and the options.
To test the BULK INSERT command, use the Address.csv file that's part of the build script to load the AdventureWorks sample database. It's probably already on your hard drive, or it can be downloaded from MSDN. The 4MB file has 19,614 rows of address data, which is small by ETL norms.
The following batch bulk inserts from the Address.csv file in the AdventureWorks directory into the AWAddressStaging table:
USE tempdb;

CREATE TABLE AWAddressStaging (
  ID INT,
  Address VARCHAR(500),
  Region VARCHAR(500),
  PostalCode VARCHAR(500),
  GUID VARCHAR(500),
  Updated DATETIME
);
BULK INSERT AWAddressStaging
  FROM 'C:\Program Files\Microsoft SQL Server\90\Tools\Samples\AdventureWorks OLTP\Address.csv'
  WITH (FIRSTROW = 1, ROWTERMINATOR = '\n');
On my Dell notebook, the BULK INSERT completes in less than a half-second.
The first thing to understand about BULK INSERT is that every column from the source file is simply inserted directly into the destination table using a one-to-one mapping. The first column from the source file is dumped into the first column of the destination table; each column lines up. If there are too many columns in the destination table, then it will fail. However, if there are not enough columns in the destination table, then BULK INSERT will work, as the extra data is placed into the bit bucket and simply discarded.
The BULK INSERT command won't accept a string concatenation or a variable in the FROM parameter, so if you're assembling the string of the file location and the filename, then you need to assemble a dynamic SQL statement to execute the BULK INSERT. Building and executing dynamic SQL is covered in Chapter 29, "Dynamic SQL and Code Generation," which contains many examples.
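A minimal sketch of that approach, assuming the AWAddressStaging table created earlier and a hypothetical @FilePath value (concatenate only trusted values, per the SQL injection warnings in Chapter 29):
DECLARE @FilePath VARCHAR(500) = 'C:\ETL\Address.csv';  -- hypothetical file location
DECLARE @SQLStr NVARCHAR(MAX);

SET @SQLStr = N'BULK INSERT AWAddressStaging
  FROM ''' + @FilePath + N'''
  WITH (FIRSTROW = 1, ROWTERMINATOR = ''\n'');';

PRINT @SQLStr;               -- echo the generated statement during development
EXEC sp_executesql @SQLStr;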
Best Practice
Because BULK INSERT is dependent on the column position of both the source file and the destination table, it is best practice to use a view as an abstraction layer between the BULK INSERT command and the table. If the structure of either the source file or the destination table is altered, then modifying the view can keep the BULK INSERT running without having to change the other object's structure.
Another best practice is to BULK INSERT the data into a staging table, check the data, and then perform the rest of the transformations as you merge the data into the permanent tables. As long as you don't mind copying the data twice, this works well.
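A minimal sketch of the view-as-abstraction-layer idea, reusing the AWAddressStaging table from earlier (the file path is a placeholder):
CREATE VIEW vAWAddressStaging
AS
SELECT ID, Address, Region, PostalCode, GUID, Updated
FROM AWAddressStaging;
GO

-- The bulk insert now targets the view; column-order changes are absorbed by the view
BULK INSERT vAWAddressStaging
  FROM 'C:\ETL\Address.csv'
  WITH (FIRSTROW = 1, ROWTERMINATOR = '\n');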
Bulk Insert Options
In practice, I've always needed to use some options when using BULK INSERT (a combined example follows this list):
■ Field Terminator specifies the character used to delimit or separate columns in the source file. The default, of course, is a comma, but I've also seen the pipe character (|) used in production.
■ Row Terminator specifies the character that ends a row in the source file. '\n' means end of row and is the typical setting. However, files from mainframes or other systems sometimes don't use a clean end of line. In these cases, use a hex editor to view the actual end-of-line characters and specify the row terminator in hex. For example, a hex value of 0A is coded as follows:
ROWTERMINATOR = '0x0A'
■ FirstRow is useful for handling incoming files with column headers. If the file does have column headers, then use this option to indicate that the first row of data is actually the second row of the file.
■ TabLock places an exclusive lock on the entire table and saves SQL Server the trouble of having to lock the table's data pages being created. This option can dramatically improve performance, but at the cost of blocking data readers during the bulk insert. If the bulk insert is part of an ETL into a staging table, then there's no problem, but if it's a bulk insert into a production system with potential users selecting data, then this might not be a good idea.
Multiple bulk-import streams can potentially block each other. To prevent this, SQL Server provides a special internal lock, called a bulk-update (BU) lock. To get a BU lock, specify the TABLOCK option with each bulk-import stream; the streams can then load without blocking one another.
■ Rows per Batch tells SQL Server to insert n rows in a single batch, rather than the entire file. Tweaking the batch size can improve performance. I've found it works best to begin with 100 and then experiment to find the best size for the particular set of data. This helps performance because the logging is done less often. Too many rows, however, can exceed the memory cache and may create waits. In my experience, 2,000 rows is often the best number.
■ Max Errors specifies how many rows can fail before the bulk insert fails. Depending on the business requirement for the data, you may need to set this to zero.
■ The Errorfile option points to a file that will collect any rows not accepted by the bulk insert operation. This is a great idea and should be used with every BULK INSERT command in production.
Other options, which I've never needed in production, include Check_Constraints, CodePage, DataFileType, Fire_Triggers, KeepIdentity, KeepNulls, Kilobytes_per_batch, and Order. (The best practice of bulk inserting into a staging table and then performing the ETL merge into the permanent tables makes these options less useful.)
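The sketch below combines the options above in one statement. The pipe-delimited source file, paths, and values are purely illustrative, and BATCHSIZE and MAXERRORS are the T-SQL keywords corresponding to the Rows per Batch and Max Errors descriptions above:
BULK INSERT AWAddressStaging
  FROM 'C:\ETL\Address.dat'            -- hypothetical pipe-delimited source file
  WITH (
    FIELDTERMINATOR = '|',             -- columns separated by pipes
    ROWTERMINATOR = '0x0A',            -- hex line feed as the row terminator
    FIRSTROW = 2,                      -- skip the column-header row
    TABLOCK,                           -- exclusive lock / BU lock for speed
    BATCHSIZE = 2000,                  -- rows per batch
    MAXERRORS = 0,                     -- fail on the first bad row
    ERRORFILE = 'C:\ETL\Address.err'   -- rejected rows land here
  );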
Best Practice
BULK INSERT handles columns in the order they appear in the source comma-delimited file, and the columns must be in the same order in the receiving SQL table. Bulk inserting into a view provides a data abstraction layer so that any changes in column order won't break the BULK INSERT code.
When developing a BULK INSERT statement, it's generally useful to open the source file using Excel and examine the data. Excel often reformats data, so it's best not to save the file from Excel. Sorting the data by the columns can help find data formatting anomalies.
BCP
BCP, short for bulk copy program (or bulk copy Porsche, a reference among DBAs to its speed), is a command-line variation of bulk operations. BCP differs from BULK INSERT in that it is executed from the command line and can import or export data. It uses many of the same options as BULK INSERT. The basic syntax is as follows:
BCP destination table direction datafile options
For the destination, use the server name along with the complete three-part name (server, plus database.schema.object). For a complete listing of the syntax, just type BCP at the command prompt.
Because this is an external program, it needs authorization to connect to SQL Server. You have two options: use the -P password option and hard-code your password into the batch file script, or omit the -P, in which case it will prompt for a password. Neither is a very good option. You can also use integrated security, which is usually considered the best practice.
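A sketch of a BCP export using integrated security; the server name and output path are placeholders. Here -T uses integrated security, -c exports character (text) data, and -t, sets a comma as the field terminator:
bcp AdventureWorks.Person.Address out C:\ETL\Address.dat -S MyServer -T -c -t,
Changing the direction keyword from out to in would import the same file back into the table.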
For straightforward ETL operations, I prefer using T-SQL and BULK INSERT. For complex ETL loads, Integration Services is great. To be frank, I have little use for automating ETL processes using DOS batch scripts and BCP, although PowerShell may make a believer of me yet.
Summary
This chapter explained a specific T-SQL command: BULK INSERT. Bulk operations provide the additional horsepower needed to import massive amounts of data by ignoring the transaction log and pumping the data directly to the table. The downside is that this complicates the recovery plan. The best way to perform a bulk operation is with the BULK INSERT T-SQL command.
The next chapter continues the theme of moving data with a personal favorite of mine, distributed queries, in which one SQL Server instance executes a query against a second SQL Server instance.
Executing Distributed Queries
IN THIS CHAPTER
Understanding distributed queries
Making the connection with remote data sources
T-SQL distributed queries
Pass-through queries
Two-phase commits and distributed transactions
Data is seldom in one place. In today's distributed world, most new projects enhance, or at least connect to, existing data. That's not a problem; SQL Server can read and write data to most other data sources. Heterogeneous joins can even merge SQL Server data with an Excel spreadsheet.
SQL Server offers several methods for accessing data external to the current database. From simply referencing another local database to executing pass-through queries that engage an Oracle server, SQL Server can handle it.
Distributed Query Concepts
Linking to an external data source is nothing more than configuring the name of the linked server, along with the necessary location and login information, so that SQL Server can access data on the linked server.
Linking is a one-way configuration, as illustrated in Figure 31-1. If Server A links to Server B, then Server A knows how to access and log into Server B. As far as Server B is concerned, Server A is just another user.
If linking a server is a new concept to you, then it could easily be confused with registering a server in Management Studio. As illustrated in Figure 31-1, Management Studio only communicates with the servers as a client application. Linking the servers enables SQL Server instance A to communicate directly with SQL Server instance B.
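As a preview of the T-SQL approach discussed next, a minimal sketch that links to a second instance and maps the current login; the name ServerB is a placeholder for the remote instance:
-- Register the remote instance as a linked server
EXEC sp_addlinkedserver
    @server = N'ServerB',
    @srvproduct = N'SQL Server';

-- Connect to ServerB using the current login's own credentials
EXEC sp_addlinkedsrvlogin
    @rmtsrvname = N'ServerB',
    @useself = N'true';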
Links can be established in Management Studio or with T-SQL code (which could, of course, be created by configuring the link in Management Studio and then generating a script). The latter has the advantage of repeatability in case a