Summary
Writing dynamic SQL is definitely taking your game to a higher level. Code that creates code. Cool.
Remember to always consider whether the parameters can be used to enable SQL injection.
A few key points from this chapter:
■ Build up the @SQLStr variable from the inside out: start with the dynamic list and then append the SELECT prolog.
■ If the WHERE clause is dynamic, chances are good the FROM clause will be too.
■ When writing dynamic SQL, add a PRINT statement to output the @SQLStr variable during development. It makes debugging much easier.
■ Use sp_executesql.
■ If there's a different way to make the code flexible, such as the parse-and-join method mentioned in the best practice, do that instead of dynamic SQL.
■ All the standard database integrity features (e.g., foreign keys) help defend against SQL injection.
■ Always think like a hacker. Where can an SQL injection string be used to alter the intention of the code?
■ Dynamic SQL is not necessarily ad-hoc SQL. Never permit ad-hoc SQL to your database.
This concludes a ten-chapter discussion on T-SQL development that began with "what is a batch" and progressed to the point of code-generating batches. If you're up to writing code-generating code in T-SQL, you're doing well as a SQL Server database developer. I congratulate you.
Data Connectivity
IN THIS PART
Chapter 30: Bulk Operations
Chapter 31: Executing Distributed Queries
Chapter 32: Programming with ADO.NET
Chapter 33: Sync Framework
Chapter 34: LINQ
Chapter 35: Asynchronous Messaging with Service Broker
Chapter 36: Replicating Data
Chapter 37: Performing ETL with Integration Services
Chapter 38: Access as a Front End to SQL Server
As much as I'd like to think that Management Studio is the ultimate UI and there's no need for any other interface to SQL Server, the truth is that SQL Server needs to connect to nearly any possible data conduit.
Other than Chapter 5, "Client Connectivity," all the code so far has occurred inside SQL Server. Part V focuses on the myriad ways that data can be brought into and synchronized with SQL Server.
Some of the connectivity technologies are well known and familiar: the simple but mighty bulk insert, distributed queries and linked servers, ADO.NET, replication, and Microsoft Access.
Other connectivity technologies are newer. Integration Services replaced DTS with SQL Server 2005, and Service Broker was also introduced with SQL Server 2005. LINQ and the Sync Framework are new with SQL Server 2008.
If SQL Server is the box, then this part busts out of the box and pumps data in and out of SQL Server.
Bulk Operations
IN THIS CHAPTER
Mass inserts from comma-delimited files
T-SQL ETL processing
Bulk insert
BCP
Performing bulk operations
Often, DBAs need to load copious amounts of data quickly, whether it's a nightly data load or a conversion from comma-delimited text files. When a few hundred megabytes of data need to get into SQL Server in a limited time frame, a bulk operation is the way to get the heavy lifting done.
XML's popularity may be growing, but its file sizes seem to be growing even faster. XML's data tags add significant bloat to a data file, sometimes quadrupling the file size or more. For very large files, IT organizations are sticking with CSV (also known as comma-delimited) files. For these old standby files, the best way to insert the data is a bulk operation.
In SQL Server, bulk operations pump data directly to the data file; how they interact with the transaction log depends on the recovery model (a brief T-SQL sketch follows this list):
■ Simple recovery model: No problem with recovery; the transaction log is used for current transactions only.
■ Bulk-logged recovery model: No problem with recovery; the bulk operation bypasses the transaction log, but the entire bulk operation's data is still captured when the log is backed up. One complication with bulk-logged recovery is that if bulk operations are undertaken, point-in-time recovery is not possible for the time period covered by the transaction log. To regain point-in-time recovery, the log has to be backed up. Because extent allocations are logged for bulk operations, a log backup after bulk operations will contain all the pages from extents that have been added, which results in a large transaction log backup.
■ Full recovery model: In the full recovery model, bulk operations are not minimally logged; the engine does full logging of the inserts. To restart the transaction log recoverability process following the bulk operation, perform a complete backup, and restart the transaction logs.
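A minimal T-SQL sketch of the surrounding steps, with an illustrative database name and backup path (neither is taken from this chapter's examples):
-- Switch to bulk-logged recovery before the load
ALTER DATABASE AdventureWorks SET RECOVERY BULK_LOGGED;

-- ... perform the bulk operation here ...

-- Return to full recovery and back up the log to restore point-in-time recovery going forward
ALTER DATABASE AdventureWorks SET RECOVERY FULL;
BACKUP LOG AdventureWorks TO DISK = 'C:\Backup\AdventureWorks_log.trn';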
For more details on recovery models and how to set them, see Chapter 41, "Recovery Planning." Details on the transaction log are covered in Chapter 66, "Managing Transactions, Locking, and Blocking."
Technically, the SELECT INTO syntax is also a bulk-logged operation, and it too bypasses the transaction log. SELECT INTO creates a table from the results of a SELECT statement; it is discussed in Chapter 15, "Modifying Data."
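For reference, a minimal SELECT INTO takes this form (the destination table name here is purely illustrative):
-- Creates dbo.AddressArchive on the fly from the result set
SELECT AddressID, City, PostalCode
INTO dbo.AddressArchive
FROM Person.Address;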
Bulk insert operations are normally one step of a nightly ETL (extract-transform-load) process. While developing these ETL processes in T-SQL is perfectly acceptable, Integration Services is a strong alternative, and it includes bulk operations. For more details about developing Integration Services solutions, see Chapter 37, "Performing ETL with Integration Services."
Bulk insert is extremely fast, and I've had good success using it in production environments. My one word of caution is that the data must be clean. Variations in data type, irregular columns, and missing columns will cause trouble.
Bulk operations can be performed from a command prompt using BCP (a command-prompt utility to copy data to and from SQL Server), within T-SQL using the BULK INSERT command, or using Integration Services.
Bulk Insert
The BULK INSERT command can be used within any T-SQL script or stored procedure to import data into SQL Server. The parameters of the command specify the table receiving the data, the location of the source file, and the options.
To test the BULK INSERT command, use the Address.csv file that's part of the build script to load the AdventureWorks sample database. It's probably already on your hard drive, or it can be downloaded from MSDN. The 4MB file has 19,614 rows of address data, which is small by ETL norms.
The following batch bulk inserts from the Address.csv file in the AdventureWorks directory into the AWAddressStaging table:
USE tempdb;

CREATE TABLE AWAddressStaging (
  ID INT,
  Address VARCHAR(500),
  Region VARCHAR(500),
  PostalCode VARCHAR(500),
  GUID VARCHAR(500),
  Updated DATETIME
);
BULK INSERT AWAddressStaging
  FROM 'C:\Program Files\Microsoft SQL Server\90\Tools\Samples\AdventureWorks OLTP\Address.csv'
  WITH (FIRSTROW = 1, ROWTERMINATOR = '\n');
On my Dell notebook, the BULK INSERT completes in less than a half-second.
The first thing to understand about BULK INSERT is that every column from the source file is simply inserted directly into the destination table using a one-to-one mapping. The first column from the source file is dumped into the first column of the destination table; each column lines up. If there are too many columns in the destination table, then it will fail. However, if there are not enough columns in the destination table, then BULK INSERT will work, as the extra data is placed into the bit bucket and simply discarded.
The BULK INSERT command won't accept a string concatenation or a variable in the FROM parameter, so if you're assembling the string of the file location and the filename, then you need to assemble a dynamic SQL statement to execute the BULK INSERT. Building and executing dynamic SQL is covered in Chapter 29, "Dynamic SQL and Code Generation," which contains many examples.
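A minimal sketch of that approach, assuming the AWAddressStaging table created earlier and a hypothetical @FilePath value (concatenate only trusted values, per the SQL injection warnings in Chapter 29):
DECLARE @FilePath VARCHAR(500) = 'C:\ETL\Address.csv';  -- hypothetical file location
DECLARE @SQLStr NVARCHAR(MAX);

SET @SQLStr = N'BULK INSERT AWAddressStaging
  FROM ''' + @FilePath + N'''
  WITH (FIRSTROW = 1, ROWTERMINATOR = ''\n'');';

PRINT @SQLStr;               -- echo the generated statement during development
EXEC sp_executesql @SQLStr;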
Best Practice
Because BULK INSERT is dependent on the column position of both the source file and the destination table, it is best practice to use a view as an abstraction layer between the BULK INSERT command and the table. If the structure of either the source file or the destination table is altered, then modifying the view can keep the BULK INSERT running without having to change the other object's structure.
Another best practice is to BULK INSERT the data into a staging table, check the data, and then perform the rest of the transformations as you merge the data into the permanent tables. As long as you don't mind copying the data twice, this works well.
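A minimal sketch of the view-as-abstraction-layer idea, reusing the AWAddressStaging table from earlier (the file path is a placeholder):
CREATE VIEW vAWAddressStaging
AS
SELECT ID, Address, Region, PostalCode, GUID, Updated
FROM AWAddressStaging;
GO

-- The bulk insert now targets the view; column-order changes are absorbed by the view
BULK INSERT vAWAddressStaging
  FROM 'C:\ETL\Address.csv'
  WITH (FIRSTROW = 1, ROWTERMINATOR = '\n');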
Bulk Insert Options
In practice, I've always needed to use some options when using BULK INSERT (a combined example follows this list):
■ Field Terminator specifies the character used to delimit or separate columns in the source file. The default, of course, is a comma, but I've also seen the pipe character (|) used in production.
■ Row Terminator specifies the character that ends a row in the source file. '\n' means end of row and is the typical setting. However, files from mainframes or other systems sometimes don't use a clean end of line. In these cases, use a hex editor to view the actual end-of-line characters and specify the row terminator in hex. For example, a hex value of 0A is coded as follows:
ROWTERMINATOR = '0x0A'
■ FirstRow is useful for handling incoming files with column headers. If the file does have column headers, then use this option to indicate that the first row of data is actually the second row of the file.
■ TabLock places an exclusive lock on the entire table and saves SQL Server the trouble of having to lock the table's data pages being created. This option can dramatically improve performance, but at the cost of blocking data readers during the bulk insert. If the bulk insert is part of an ETL into a staging table, then there's no problem, but if it's a bulk insert into a production system with potential users selecting data, then this might not be a good idea.
Multiple bulk-import streams can potentially block each other. To prevent this, SQL Server provides a special internal lock, called a bulk-update (BU) lock. To get a BU lock, specify the TABLOCK option with each bulk-import stream; the streams can then load without blocking one another.
■ Rows per Batch tells SQL Server to insert n rows in a single batch, rather than the entire file. Tweaking the batch size can improve performance. I've found it works best to begin with 100 and then experiment to find the best size for the particular set of data. This helps performance because the logging is done less often. Too many rows, however, can exceed the memory cache and may create waits. In my experience, 2,000 rows is often the best number.
■ Max Errors specifies how many rows can fail before the bulk insert fails. Depending on the business requirement for the data, you may need to set this to zero.
■ The Errorfile option points to a file that will collect any rows not accepted by the bulk insert operation. This is a great idea and should be used with every BULK INSERT command in production.
Other options, which I've never needed in production, include Check_Constraints, CodePage, DataFileType, Fire_Triggers, KeepIdentity, KeepNulls, Kilobytes_per_batch, and Order. (The best practice of bulk inserting into a staging table and then performing the ETL merge into the permanent tables makes these options less useful.)
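The sketch below combines the options above in one statement. The pipe-delimited source file, paths, and values are purely illustrative, and BATCHSIZE and MAXERRORS are the T-SQL keywords corresponding to the Rows per Batch and Max Errors descriptions above:
BULK INSERT AWAddressStaging
  FROM 'C:\ETL\Address.dat'            -- hypothetical pipe-delimited source file
  WITH (
    FIELDTERMINATOR = '|',             -- columns separated by pipes
    ROWTERMINATOR = '0x0A',            -- hex line feed as the row terminator
    FIRSTROW = 2,                      -- skip the column-header row
    TABLOCK,                           -- exclusive lock / BU lock for speed
    BATCHSIZE = 2000,                  -- rows per batch
    MAXERRORS = 0,                     -- fail on the first bad row
    ERRORFILE = 'C:\ETL\Address.err'   -- rejected rows land here
  );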
Best Practice
BULK INSERT handles columns in the order they appear in the source comma-delimited file, and the columns must be in the same order in the receiving SQL table. Bulk inserting into a view provides a data abstraction layer so that any changes in column order won't break the BULK INSERT code.
When developing a BULK INSERT statement, it's generally useful to open the source file using Excel and examine the data. Excel often reformats data, so it's best not to save the file from Excel. Sorting the data by the columns can help find data formatting anomalies.
BCP
BCP, short for bulk copy program (or bulk copy Porsche, a reference among DBAs to its speed), is a command-line variation of bulk operations. BCP differs from BULK INSERT in that it is executed from the command line and can import or export data. It uses many of the same options as BULK INSERT. The basic syntax is as follows:
BCP destination table direction datafile options
For the destination, use the server name along with the complete three-part name (server, plus database.schema.object). For a complete listing of the syntax, just type BCP at the command prompt.
Because this is an external program, it needs authorization to connect to SQL Server. You have two options: use the -P password option and hard-code your password into the batch file script, or omit the -P, in which case it will prompt for a password. Neither is a very good option. You can also use integrated security, which is usually considered the best practice.
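A sketch of a BCP export using integrated security; the server name and output path are placeholders. Here -T uses integrated security, -c exports character (text) data, and -t, sets a comma as the field terminator:
bcp AdventureWorks.Person.Address out C:\ETL\Address.dat -S MyServer -T -c -t,
Changing the direction keyword from out to in would import the same file back into the table.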
For straightforward ETL operations, I prefer using T-SQL and BULK INSERT. For complex ETL loads, Integration Services is great. To be frank, I have little use for automating ETL processes using DOS batch scripts and BCP, although PowerShell may make a believer of me yet.
Summary
This chapter explained a specific T-SQL command: BULK INSERT. Bulk operations provide the additional horsepower needed to import massive amounts of data by ignoring the transaction log and pumping the data directly to the table. The downside is that this complicates the recovery plan. The best way to perform a bulk operation is with the BULK INSERT T-SQL command.
The next chapter continues the theme of moving data with a personal favorite of mine, distributed queries, in which one SQL Server instance executes a query against a second SQL Server instance.
Executing Distributed Queries
IN THIS CHAPTER
Understanding distributed queries
Making the connection with remote data sources
T-SQL distributed queries
Pass-through queries
Two-phase commits and distributed transactions
Data is seldom in one place. In today's distributed world, most new projects enhance, or at least connect to, existing data. That's not a problem; SQL Server can read and write data to most other data sources. Heterogeneous joins can even merge SQL Server data with an Excel spreadsheet.
SQL Server offers several methods for accessing data external to the current database. From simply referencing another local database to executing pass-through queries that engage an Oracle server, SQL Server can handle it.
Distributed Query Concepts
Linking to an external data source is nothing more than configuring the name of the linked server, along with the necessary location and login information, so that SQL Server can access data on the linked server.
Linking is a one-way configuration, as illustrated in Figure 31-1. If Server A links to Server B, then Server A knows how to access and log into Server B. As far as Server B is concerned, Server A is just another user.
If linking a server is a new concept to you, then it could easily be confused with registering a server in Management Studio. As illustrated in Figure 31-1, Management Studio only communicates with the servers as a client application. Linking the servers enables SQL Server instance A to communicate directly with SQL Server instance B.
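As a preview of the T-SQL approach discussed next, a minimal sketch that links to a second instance and maps the current login; the name ServerB is a placeholder for the remote instance:
-- Register the remote instance as a linked server
EXEC sp_addlinkedserver
    @server = N'ServerB',
    @srvproduct = N'SQL Server';

-- Connect to ServerB using the current login's own credentials
EXEC sp_addlinkedsrvlogin
    @rmtsrvname = N'ServerB',
    @useself = N'true';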
Links can be established in Management Studio or with T-SQL code (which could, of course, be created by configuring the link in Management Studio and then generating a script). The latter has the advantage of repeatability in case a