CHAPTER 50 SQL Server Full-Text Search
Here are some examples using FREETEXT and FREETEXTTABLE:
Use AdventureWorks;
SELECT * from Person.Contact where Freetext(*,'Barack Obama')
Use AdventureWorks;
SELECT * FROM Sales.Individual as s
JOIN (SELECT [key], [rank] FROM FREETEXTTABLE(Person.Contact, *, '"jon"',100)) AS k
ON k.[key]=s.Contactid order by [rank] desc
Notice that the FREETEXTTABLE example does the functional equivalent of a CONTAINSTABLE
query because the search term is wrapped in double quotation marks inside the string literal.
Stop Lists
Stop lists are used when you want to hide words in searches or to keep words that would
otherwise bloat your full-text index, and might cause performance problems, from being
indexed. Stop lists (also known as noise word lists or stop word lists) are a legacy
component from decades ago when disk space was very expensive. Back then, using stop
lists could save considerable disk space. However, with disk space now being relatively cheap,
the use of stop lists is no longer as critical as it once was. You can create your own stop word
list by expanding your database in SSMS and then right-clicking on the Full-Text Stoplists
node and selecting New Full-Text Stoplist. You have the option of basing your stop list on
the system stop list, creating an empty one, or basing it on another stop list in a
different database. Each full-text index can have its own stop list, which is a frequently
demanded feature because some search consumers want to be able to prevent some words
from being indexed in one table but want those words indexed in a different table. After you
create a stop word list, you can maintain it by right-clicking on it in the Full-Text
Stoplists node and selecting Properties. Figure 50.5 illustrates this option.
The options are to add a stop word, delete a stop word, delete all stop words, and clear the
stop list. After selecting the option you want, you can enter a stop word and the language
to which you want that stop word to be applied.
Keep in mind that the stop lists are applied at query time (while searching) and index
time (while indexing). Changes made to a stop list are reflected in real time in searches but
are applied only to newly indexed words. Stop words that were already indexed remain in the
catalog until you rebuild the catalog. It is a best practice to rebuild your catalog as soon as
you have made changes to your stop word list. To rebuild your full-text catalog, right-click
on the catalog in SSMS and select Rebuild.
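The same stop list tasks can also be scripted instead of being done through SSMS. The following is a minimal sketch, not a definitive procedure. The stop list name MyStoplist, the stop word 'widget', and the catalog name MyCatalog are hypothetical, and the sketch assumes Person.Contact already has a full-text index and that the database runs at compatibility level 100.

```sql
USE AdventureWorks;
GO
-- Create a stop list seeded from the system stop list
-- (MyStoplist is a hypothetical name).
CREATE FULLTEXT STOPLIST MyStoplist FROM SYSTEM STOPLIST;
GO
-- Add a stop word for one language (1033 = U.S. English).
-- 'widget' is only an illustrative word.
ALTER FULLTEXT STOPLIST MyStoplist ADD 'widget' LANGUAGE 1033;
GO
-- Attach the stop list to the full-text index on one table.
ALTER FULLTEXT INDEX ON Person.Contact SET STOPLIST MyStoplist;
GO
-- Rebuild the catalog so that already-indexed stop words are removed
-- (MyCatalog is a hypothetical catalog name).
ALTER FULLTEXT CATALOG MyCatalog REBUILD;
GO
```

Scripting these steps makes it easier to apply the same stop list changes consistently across development and production databases.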
Full-Text Search Maintenance
After you create full-text catalogs and indexes that you can query, you have to maintain
them. The catalogs and indexes largely maintain themselves, but you need to focus on backing
up and restoring them as well as tuning your search solution for optimal performance. In
SQL Server 2008, the full-text catalogs get fragmented from time to time, especially if you
are using the Automatic (Track Changes Automatically) setting. You can check the level of
fragmentation by using the following command:
SELECT table_id, status FROM sys.fulltext_index_fragments WHERE status=4 OR status=6;
FIGURE 50.5 Maintaining a full-text stop list
If you notice that your tables are highly fragmented, you should optimize your full-text
indexes. Here is the command you would use to do this:
ALTER FULLTEXT CATALOG AdventureWorks2008 REORGANIZE;
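To judge whether a table is "highly fragmented," it can help to count the queryable fragments per table rather than list them one by one. The following sketch groups the same catalog view by table. What fragment count counts as too high is a judgment call that this chapter does not specify.

```sql
-- Summarize queryable fragments (status 4 or 6) for each
-- full-text indexed table in the current database.
SELECT OBJECT_NAME(f.table_id) AS table_name,
       COUNT(*)                AS fragment_count,
       SUM(f.data_size)        AS total_data_size,
       SUM(f.row_count)        AS total_rows
FROM sys.fulltext_index_fragments AS f
WHERE f.status IN (4, 6)
GROUP BY f.table_id
ORDER BY fragment_count DESC;
```

Tables at the top of this result are the first candidates for a reorganize.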
Full-Text Search Performance
SQL Server FTS performance is most sensitive to the number of rows in the result set and
the number of search terms in the query. You should limit your result set to a practical
number; most searchers are conditioned to look only at the first page of results for what
they are looking for, and if they don’t see what they need there, they refine the search
and search again. A good practical limit for the number of rows to return is 200. You
should try, if at all possible, to use simple queries because they perform better than more
complex ones. As a rule, you should use CONTAINS rather than FREETEXT because it offers
better performance, and you should use CONTAINSTABLE rather than FREETEXTTABLE for the
same reason.
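As an illustration of both recommendations, the following sketch uses CONTAINSTABLE with its optional top_n_by_rank argument to cap the result set at 200 rows. It assumes that Person.Contact in AdventureWorks is full-text indexed with ContactID as its key column. The search term is only an example.

```sql
USE AdventureWorks;

-- Return at most the 200 highest-ranked matches for a simple term.
SELECT c.ContactID, c.FirstName, c.LastName, k.[RANK]
FROM CONTAINSTABLE(Person.Contact, *, '"jon"', 200) AS k
JOIN Person.Contact AS c
  ON c.ContactID = k.[KEY]
ORDER BY k.[RANK] DESC;
```

Because the limit is applied inside the full-text engine, fewer rows have to be joined back to the base table than with a TOP clause applied afterward.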
Several factors are involved in delivering an optimal Full-Text Search solution. Consider
the following:
Avoid indexing binary content. Convert it to text, if possible. Most IFilters do not
perform as well as the text IFilter.
Use an integer column on the base table as the column that comprises your unique index.
Partition large tables into smaller tables. There seems to be a sweet spot around 50
million rows, but your results may vary. Ensure that for large tables, each table has
its own catalog. Place this catalog on a RAID 10 array, preferably on its own
controller.
SQL Full-Text Search benefits from multiple processors, preferably four or more. A
sweet spot exists on eight-way machines or better. You will find 64-bit hardware also
offers substantial performance benefits over 32-bit.
Dedicate at least 512MB to 1GB of RAM to MSFTESQL by setting the maximum server
memory to 1GB less than the installed memory. Set resource usage to run at 5 to
give a performance boost to the indexing process (that is, sp_fulltext_service
'resource_usage',5), set ft crawl bandwidth (max) and ft notify bandwidth
(max) to 0, and set max full-text crawl range to the number of CPUs on your
system. Use sp_configure to make these changes.
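The configuration changes in the last item could be scripted roughly as follows. This is a sketch under stated assumptions: the server is assumed to have 4 CPUs and 8GB of installed memory, so the crawl range and memory values are illustrative only. Check the documentation for your version to confirm that each setting still has an effect (resource_usage in particular), and try the script on a non-production instance first.

```sql
-- The full-text settings are advanced options, so expose them first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Give the indexing process the highest resource usage level.
EXEC sp_fulltext_service 'resource_usage', 5;

-- Bandwidth and crawl range settings named in the text.
EXEC sp_configure 'ft crawl bandwidth (max)', 0;
EXEC sp_configure 'ft notify bandwidth (max)', 0;
EXEC sp_configure 'max full-text crawl range', 4;   -- assumed: 4 CPUs

-- Leave about 1GB for full-text: 8192MB installed (assumed) minus 1024MB.
EXEC sp_configure 'max server memory (MB)', 7168;
RECONFIGURE;
```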
Full-Text Search Troubleshooting
The first question you should ask yourself when you have a problem with SQL Full-Text
Search is this: “Is the problem with searching or with indexing?” To help you make this
determination, Microsoft has included three DMVs in SQL Server 2008:
sys.dm_fts_index_keywords
sys.dm_fts_index_keywords_by_document
sys.dm_fts_parser
The first two DMVs display the contents of your full-text index. The first DMV returns
the following columns:
Keyword—Each keyword in varbinary form
Display_term—The keyword as indexed; all the accents are removed from the word
Column_ID—The column ID where the word exists
Document_Count—The number of documents (rows) in which the word occurs in that column
The second DMV breaks down the keywords by document. Like the first DMV, it contains
the Keyword, Display_term, and Column_ID columns, but in addition it contains the
following two columns:
Document_ID—The row in which the keyword occurs
Occurrence_count—The number of times the word occurs in the cell (a cell is also
known as a tuple; it is a row-column combination—for example, the contents of the
third column in the fifth row)
The first DMV, sys.dm_fts_index_keywords, is used primarily to determine candidate
noise words; it can also be used to diagnose indexing problems. The second DMV,
sys.dm_fts_index_keywords_by_document, is used to determine what is stored in your
index for a particular cell.
Here are some examples of their usage:
select * From sys.dm_fts_index_keywords(DB_ID(),Object_iD('MyTable'))
select * From sys.dm_fts_index_keywords_by_document(DB_ID(),Object_iD('MyTable'))
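The two uses described above (finding candidate noise words and inspecting a single cell) might be sketched as follows. MyTable is the placeholder table name used in the examples. The document ID 5 and column ID 3 are assumptions chosen to match "the third column in the fifth row" and presume an integer full-text key.

```sql
-- Candidate noise words: the terms that occur in the most rows.
SELECT TOP (20) display_term, column_id, document_count
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('MyTable'))
WHERE display_term <> N'END OF FILE'   -- skip the end-of-data marker
ORDER BY document_count DESC;

-- What is stored in the index for one cell
-- (assumed: full-text key value 5, column ID 3).
SELECT display_term, occurrence_count
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('MyTable'))
WHERE document_id = 5
  AND column_id = 3;
```

Terms that show up in nearly every row add little to relevance and are usually the first ones worth considering for a stop list.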
These two DMVs are used to determine what occurs at index time. The third DMV,
sys.dm_fts_parser, is used primarily to determine what happens at search time; in other
words, how SQL Server Full-Text Search interprets your search phrase. Here is an example
of its usage:
select * from sys.dm_fts_parser(@QueryString, @LCID, @StopListID, @AccentSensitive)
@QueryString is your search word or phrase, @LCID is the locale ID for your language
(determinable by querying sys.fulltext_languages), @StopListID is the ID of your stop
list (determinable by querying sys.fulltext_stoplists), and @AccentSensitive allows you
to set accent sensitivity (0 for not sensitive, 1 for sensitive to accents). Here is an
example of how this works:
select * from sys.dm_fts_parser('café', 1033, 0, 1)
select * from sys.dm_fts_parser('café', 1033, 0, 0)
In the second example, you will notice that the Display_term is cafe and not café. These
queries return the following columns:
Keyword—This is a varbinary representation of your keyword.
Group_id—The query parser builds a parse tree of the search phrase. If you have any
Boolean searches, it assigns different group IDs to each part of the search term. For
example, in the search phrase '"Hillary Clinton" OR "Barack Obama"', Hillary and
Clinton belong to Group ID 1, and Barack and Obama belong to Group ID 2.
Phrase_id—Some words are indexed in multiple forms; for example, data-base is
indexed as data, base, and database. In this case, data and base have the same phrase
ID, and database has another phrase ID.
Occurrence—This is the position of the word in the parsed search string (the first
word is 1, the second is 2, and so on).
Special_term—This column refers to any delimiters that the parser finds in the
search phrase. Possible values are Exact Match, End of Sentence, End of
Paragraph, and End of Chapter.
Display_term—This is how the term would be stored in the index.
Expansion_type—This is the type of expansion, whether it is a thesaurus expansion
(4), an inflectional expansion (2), or not expanded (0). For example, the following
query shows the stemmed variants of the word run:
select * from sys.dm_fts_parser('FORMSOF( INFLECTIONAL, run)', 1033, 0, 0)
Source_Term—This is the source term as it appears in your query.
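To see several of these columns at work, you could look up the parser inputs and then parse the Boolean phrase used in the Group_id description. This is a sketch: it assumes U.S. English (LCID 1033), the system stop list (ID 0), and accent-insensitive parsing.

```sql
-- Look up the values to pass for the LCID and stop list ID arguments.
SELECT lcid, name FROM sys.fulltext_languages WHERE name = N'English';
SELECT stoplist_id, name FROM sys.fulltext_stoplists;

-- Parse a Boolean search phrase and show how its terms are grouped.
SELECT group_id, phrase_id, occurrence, special_term,
       display_term, expansion_type, source_term
FROM sys.dm_fts_parser(N'"Hillary Clinton" OR "Barack Obama"', 1033, 0, 0)
ORDER BY group_id, occurrence;
```

Comparing the output with the column descriptions above is a quick way to confirm how a problematic search phrase is actually being interpreted.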
When troubleshooting indexing problems, you should consult the full-text error log,
which can be found in C:\Program Files\Microsoft SQL
Server\MSSQL10.MSSQLSERVER\MSSQL\LOG. The log file name starts with the prefix SQLFT,
followed by the database ID (padded with leading zeros), the catalog ID (query
sys.fulltext_catalogs for this value), and then the extension .log. You may find many
versions of the log, each with a numerical extension, such as SQLFT0001800005.LOG.4; this
is the fourth version of this log. These full-text indexing logs can be read by any text editor.
You might find entries in this log that indicate documents were retried or documents
failed indexing, in addition to error messages returned from the IFilters.
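Rather than working out the log file name by hand, you could derive it from the catalog metadata. The following sketch assumes that both IDs are padded to five digits, which is inferred from the sample name SQLFT0001800005.LOG (database ID 18, catalog ID 5).

```sql
-- Build the expected full-text log file name for each catalog
-- in the current database (five-digit padding is an assumption).
SELECT name AS catalog_name,
       'SQLFT'
       + RIGHT('00000' + CAST(DB_ID() AS varchar(10)), 5)
       + RIGHT('00000' + CAST(fulltext_catalog_id AS varchar(10)), 5)
       + '.LOG' AS expected_log_file
FROM sys.fulltext_catalogs;
```

Older versions of the same log carry an additional numeric suffix after .LOG, as described above.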
Summary
SQL Server 2008 Full-Text Search offers extremely fast and powerful querying of textual
content stored in tables. In SQL Server 2008, the full-text index creation statements are
highly symmetrical with the table index creation statements, making SQL Server FTS
much more intuitive to use than previous versions of SQL Server FTS. Also new is the
tremendous increase in indexing and querying speeds. These features make SQL Server
Full-Text Search a very attractive component of SQL Server 2008.
CHAPTER 51 SQL Server 2008 Analysis Services
What’s New in SSAS
Understanding SSAS and OLAP
Understanding the SSAS Environment Wizards
An Analytics Design Methodology
An OLAP Requirements Example: CompSales International
SQL Server 2008 Analysis Services (SSAS) continues to
expand with numerous data warehousing, data mining, and
online analytical processing (OLAP)-rich tools and
technologies. Microsoft continues to attack the data
warehousing/business intelligence (BI) market by pouring millions
and millions of dollars into this area. Microsoft knows that
the world is hungry for analytics and is betting the farm on
it. As a part of its internal project named “Madison,”
Microsoft has been acquiring other complementary BI
technologies to accelerate its plans (such as acquiring the MPP
data warehousing appliance company DATAllegro and
rolling it under its BI offering). Other more traditional (and
much more expensive) OLAP and BI platforms such as
Cognos, Hyperion, Business Objects, and MicroStrategy
are being challenged, if not completely replaced, by this
new version of SSAS.
A chief data architect from a prominent Silicon Valley
company said recently, “I can now build [using SSAS]
sound, extremely usable, highly scalable, OLAP cubes
myself, faster and smarter than the entire data warehouse
team could do only a few years ago.” This is what Microsoft
has been trying to bring to the forefront for years—“BI for
the masses.”
What’s New in SSAS
SQL Server 2005 was the big jump into completely
redeploying Analysis Services, from the architecture, to the
development environment, to the multidimensional
languages supported, and even to the wizard-driven
deployments. SQL Server 2008 R2 raises this core work up a few
more notches with enhancements at almost every part of SSAS and with the addition of
major scale-out capabilities. Following are some of the top new features and enhancements:
Microsoft has improved and streamlined the Cube Designer.
Several subtle enhancements have been made around the Dimension and
Aggregation Designers.
You can now create attribute relationships with the new Attribute Relationship
Designer.
You can use subspace computations to optimize performance for your
Multidimensional Expressions (MDX) queries.
Multidimensional OLAP (MOLAP) enables write-back capabilities that support
high-performance “what if” scenarios.
A shared read-only Analysis Services database between several Analysis Services
servers enables you to “scale out” easily and efficiently.
You are able to use localized analytical data in native languages, including
translation capabilities and automatic currency conversions.
A highly compressed and optimized data cache is maintained automatically.
Backup performance is optimized.
SQL Server PowerPivot for Excel is a new feature.
The master data hub in SQL Server 2008 R2 helps manage your master data services
more efficiently.
And, last, but not least,
SQL Server 2008 R2 Parallel Data Warehouse is a highly scalable data warehouse
appliance-based massively parallel processing (MPP) solution that knows no bounds.
Understanding SSAS and OLAP
Because OLAP is at the heart of SSAS, you need to understand what it is and how it solves
the requirements of decision makers in a business. As you might already know, data
warehousing requirements typically include all the capability needed to report on a business’s
transactional history, such as sales history. This transactional history is often organized
into subject areas and tiers of aggregated information that can support some online
querying and usually much more batch reporting. Data warehouses and data marts typically
extract data from online transaction processing (OLTP) systems and serve data up to these
business users and reporting systems. In general, these are all called decision support
systems (DSS), or BI systems, and the latency of this data is determined by the business
requirements it must support. Typically, this latency is daily or weekly, depending on the
business needs, but more and more, we are seeing real-time (or near-real-time)
reporting requirements.
FIGURE 51.1 Multidimensional representation of business facts (an OLAP cube with time, product, and geography dimensions; the highlighted Sales Units cell for France, Feb 01 of 2010, IBM Laptop Model 451D holds 996)
OLAP falls squarely into the realm of BI. The purpose of OLAP is to provide for a mostly
online reporting environment that can support various end user reporting requirements.
Typically, OLAP data is represented as OLAP cubes. A cube is a multidimensional
representation of basic business facts that can be accessed easily and quickly to provide you with
the specific information you need to make a critical decision. It is useful to note that a
cube can be composed of from 1 to N dimensions. However, remember that the business
facts represented in a cube must exist for all the dimensions being defined for the fact. In
other words, all dimensional values (that is, intersections) have to be present for a fact
value to be stored in the cube.
Figure 51.1 illustrates the Sales_Units historical business fact, which is the intersection of
time, product, and geography dimensional data. For a particular point in time (February
2010), for a particular product (IBM laptop model 451D), and in a particular country
(France), the sales units were 996 units. With an OLAP cube, you can easily see how many
of these laptop computers were sold in France in February 2010.
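As a rough relational analogy only (this is not how SSAS stores or queries a cube), the same fact can be pictured as one row at the intersection of three dimension values. The table variable @SalesFact, its column names, and every row other than the 996-unit row are made up for illustration.

```sql
-- Hypothetical fact table: one row per (month, product, country) intersection.
DECLARE @SalesFact TABLE (
    sale_month  date        NOT NULL,
    product     varchar(50) NOT NULL,
    country     varchar(50) NOT NULL,
    sales_units int         NOT NULL
);

INSERT INTO @SalesFact (sale_month, product, country, sales_units)
VALUES ('20100201', 'IBM Laptop Model 451D', 'France',  996),
       ('20100101', 'IBM Laptop Model 451D', 'France',  450),  -- made-up row
       ('20100201', 'IBM Laptop Model 451D', 'Germany', 333);  -- made-up row

-- The fact value at the intersection of all three dimensions.
SELECT SUM(sales_units) AS sales_units
FROM @SalesFact
WHERE sale_month = '20100201'
  AND product    = 'IBM Laptop Model 451D'
  AND country    = 'France';
```

With these sample rows, the query returns 996. A cube answers the same question without scanning detail rows at query time.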
Basically, cubes enable you to look at business facts via well-defined and organized
dimensions (time, product, and geography dimensions, in this example). Note that each of these
dimensions is further organized into hierarchical representations that correspond to the
way data is looked at from the business point of view. This provides for the capability to
drill down into the next level from a higher, broader level (like drilling down into a
specific country’s data within a geographic region, such as France’s data within the
European geographic region).
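The idea of hierarchy levels and drilling down can be approximated in plain T-SQL with GROUP BY ROLLUP, which produces a total at each level of a region-to-country hierarchy. This is an analogy under stated assumptions, not an SSAS feature: the table variable, its columns, and all sample values are hypothetical.

```sql
-- Hypothetical sales rows with a two-level geography hierarchy.
DECLARE @SalesFact TABLE (
    region      varchar(20) NOT NULL,
    country     varchar(20) NOT NULL,
    sales_units int         NOT NULL
);

INSERT INTO @SalesFact (region, country, sales_units)
VALUES ('Europe',        'France',  996),
       ('Europe',        'Germany', 450),    -- made-up row
       ('North America', 'USA',     1203);   -- made-up row

-- One row per country, one subtotal per region, and one grand total.
-- NULL in a column marks the rolled-up (higher) level.
SELECT region, country, SUM(sales_units) AS sales_units
FROM @SalesFact
GROUP BY ROLLUP (region, country)
ORDER BY GROUPING(region), region, GROUPING(country), country;
```

Reading the result from the grand total down to a region subtotal and then to a single country mirrors a drill-down through the geography hierarchy.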
SSAS directly supports this and other data warehousing capabilities. In addition, SSAS
allows a designer to implement OLAP cubes using a variety of physical storage techniques
that are directly tied to data aggregation requirements and other performance
considerations. You can easily access any OLAP cube built with SSAS via the Pivot Table Service, you
can write custom client applications by using MDX with OLE DB for OLAP or ActiveX
Data Objects Multidimensional (ADO MD), and you can use a number of third-party “OLE
DB for OLAP” compliant tools.
Microsoft utilizes something called the Unified Dimensional Model (UDM) to
conceptualize all multidimensional representations in SSAS. It is also worth noting that many of the
leading OLAP and statistical analysis software vendors have joined the Microsoft Data
Warehousing Alliance and are building front-end analysis and presentation tools for SSAS.
The data mining capabilities that are part of SSAS provide a new avenue for organized data
discovery. This includes using SQL Server DMX.
This chapter takes you through the major components of SSAS, discusses a
mini-methodology for OLAP cube design, and leads you through creating and managing robust OLAP
cubes that can easily be used to meet a company’s BI needs.
Understanding the SSAS Environment Wizards
Welcome to the “land of wizards.” This implementation of SSAS, as with older versions of
SSAS, is heavily wizard oriented. SSAS has a Cube Wizard, a Dimension Wizard, a Partition
Wizard, a Storage Design Wizard, a Usage Analysis Wizard, a Usage-Based Optimization
Wizard, an Aggregation Wizard, a Calculated Cells Wizard, a Mining Model Wizard, and a
few other wizards. All of them are useful, and many of their capabilities are also available
through editors and designers. Using a wizard is helpful for those who need to have a
little structure in the definition process and who want to rely on defaults for much of
what they need. The wizards are also plug-and-play oriented and have been made
available in all SQL Server and .NET development environments. In other words, you can
access these wizards from wherever you need to, when you need to. All the wizard-based
capabilities can also be coded in MDX, DMX, and ASSL.
Figure 51.2 shows how SSAS fits into the overall scheme of the SQL Server 2008
environment. SSAS has become completely integrated into the SQL Server platform. Utilizing many
different mechanisms, such as SSIS and direct data source access capabilities, a vast amount
of data can be funneled into the SSAS environment. Most of the cubes you build will likely
be read-only because they will be for BI. However, a write-enabled capability (WriteBack) is
available in SSAS for situations that meet certain data updatability requirements.
As you can also see in Figure 51.2, the basic components in SSAS are all focused on building
and managing data cubes. SSAS consists of the analysis server, processing services,
integration services, and a number of data providers. SSAS has both server-based and
client-/local-based SSAS capabilities. This essentially provides a complete platform for OLAP.
You create cubes by preprocessing aggregations (that is, precalculated summary data) that
reflect the desired levels within dimensions and support the type of querying that will be
done. These aggregations provide the mechanism for rapid and uniform response times to
queries. You create them before the user uses the cube. All queries utilize either these
aggregations, the cube’s source data, a copy of this data in a client cube, data in cache, or a
combination of these sources. A single Analysis Server can manage many cubes. You can
have multiple SSAS instances on a single machine.
FIGURE 51.2 SSAS as part of the overall SQL Server 2008 environment
By orienting around UDM, SSAS allows for the definition of a cube that contains data
measures and dimensions. Each cube dimension can contain a hierarchy of levels to
specify the natural categorical breakdown that users need to drill down into for more
details. Look back at Figure 51.1, and you can see a product hierarchy, time hierarchy, and
geography hierarchy representation.
The data values within a cube are represented by measures (the facts). Each measure of
data might utilize different aggregation options, depending on the type of data. Unit data
might require the SUM (summarization) function, Date of Receipt data might require the
MAX function, and so on. Members of a dimension are the actual level values, such as the
particular product number, the particular month, and the particular country. Microsoft
has solved most of the limitations within SSAS. SSAS addresses up to 2,147,483,647 of
most anything within its environment (for example, dimensions in a database, attributes
in a dimension, databases in an instance, levels in a hierarchy, cubes in a database,
measures in a cube). In reality, you will probably not have more than a handful of
dimensions. Remember that dimensions are the paths to the interesting facts. Dimension
members should be textual and are used as criteria for queries and as row and column
headers in query results.
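The point that different measures call for different aggregation functions can be pictured with an ordinary T-SQL aggregate query. This is only a relational sketch of the idea, not SSAS syntax. The table variable @Receipts, its columns, and its rows are hypothetical.

```sql
-- Hypothetical detail rows carrying two measures per dimension member.
DECLARE @Receipts TABLE (
    product         varchar(50) NOT NULL,
    country         varchar(50) NOT NULL,
    units           int         NOT NULL,
    date_of_receipt date        NOT NULL
);

INSERT INTO @Receipts (product, country, units, date_of_receipt)
VALUES ('IBM Laptop Model 451D', 'France', 500, '20100205'),   -- made-up rows
       ('IBM Laptop Model 451D', 'France', 496, '20100219');

-- Unit data is summed; Date of Receipt keeps only the latest value.
SELECT product, country,
       SUM(units)           AS total_units,
       MAX(date_of_receipt) AS last_receipt
FROM @Receipts
GROUP BY product, country;
```

Here the dimension members (product and country) act as the row headers, and each measure is rolled up with the function that suits its meaning.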