Building Management Systems

A CM Domain White Paper

By Bob Boiko

This white paper is produced from the Content Management Domain, which features the full text of the book "Content Management Bible," by Bob Boiko. Owners of the book may access the CM Domain at www.metatorial.com.

Table of Contents

What's in a Management System?
Building a Repository
Essential and recommended repository functions
The content model
Storing Content
Relational database repositories
Relational database basics
Storing component classes and instances
Fully parsing structured text
Partially parsing structured text
Not parsing structured text
Breaking the spell of rows and columns
Storing access structures
Hierarchies in a relational database
Indexes in relational databases
Cross-references in relational databases
Sequences in relational databases
Storing the content model
XML-based repositories
Object databases vs XML
Storing component classes and instances
Storing access structures
Hierarchies in XML
Indexes in XML
Cross-references in XML
Sequences in XML
Storing the content model
File-based repositories
Implementing Localization Strategies
Doing Management Physical Design
A repository-wide DTD
Link checking
Media checking
Search and replace
Management integrations
Summary

The management system within a content management system holds and organizes all the content that you've collected. In addition to storing content, the management system can provide a full cataloging and administration system for your content and related data.

In this white paper I discuss the variety of databases and functions that you may encounter or need to create to store and administer your content.

What's in a Management System?

Many CMS companies describe their entire product as a management system. I take a different tack. For me, although it's of course true that a content management system is a management system, it's more instructive to focus the term management on the specific parts of the CMS that deal with the content that's in the system and differentiate them from the other parts of the CMS that enable you to get content in (collection) and get it out (publication).

The management system within a CMS has these parts:

• A repository: All the content and control files for the system are stored here. The repository houses databases and files that hold content. The repository can also store configuration and administrative files and databases that specify how the CMS runs and produces publications.

• A repository interface: This enables the collection, publishing, workflow, and administrative systems to access and work with the repository. The interface provides functions for input, access, and output of components as well as other files and data that you store in the repository.

• An administrative module: This module enables you to configure the CMS.

In this white paper I focus most on the repository itself to give you a central place from which to understand management.

Building a Repository

The repository is the heart of the management system and of the CMS as a whole. Into the repository flow all the raw materials on which any publication is built. Within the repository, components are stored and can be continually fortified to increase the quality of their metadata or content. Out of the repository flow the components and other parts that a page of a publication needs (as shown in Figure 1).

Figure 1: A high-level view of a CMS repository that shows its different parts and the content storage options that you have

As a first approximation, you can think of the repository as a database. As does a database, a repository enables you to store and retrieve information. The repository, however, is much more. For one thing, the repository can house many databases. It can house files as well. It has an interface to other systems that goes beyond what a standalone database usually does. If you stand back from the repository and look at it as a single unit, however, most of what you may know about databases helps you understand the functions of the repository. In fact, most repositories have a database at their core. The database, however, is wrapped in so much custom code and a user interface that end users aren't likely to ever see the database.

You can, and often should, add a component to the repository before it's fully authored, converted, edited, and has had metadata added to it. After it's in the repository, these processes can be brought under the control of your workflow module.

Essential and recommended repository functions

At the most basic level, a repository must provide the same functions as any database, as follows:

• It must hold your content. Whether you employ a vast distributed network of databases or a simple file structure on a computer under someone's desk, the central function of a management system is to contain your content in one "place." In addition, the system must have some way of segmenting content into individually locatable units (such as files or database records).

• It must enable you to input content. Whether you have tools for loading multiple components at a time (bulk processing), automatic inputs via syndication, or one-by-one entries via Web-based forms, the management system must give you some way to get content in.

• It must enable you to locate content. Whether it employs sophisticated natural language searches or a simple index, you must be able to find content in the system.

• It must enable you to output content. Whether it supports advanced transformations or only the simplest tab-delimited format, the management system must enable you to retrieve a copy of content that you've found in a format you can use.

• It must enable you to remove content. Whether it archives automatically or you must delete old content by hand, a management system without the capability to remove content is inadequate.

Although a repository that performs the preceding minimum functions would be sufficient to build a CMS on, it would be far from ideal. In fact, most repositories go far beyond these basic functions to enable you to do the following:

• Support the concept of components: Although all management systems must somehow segment information, a good system facilitates inputting, naming, cataloging, locating, and extracting content based on its type (or, in my language, its component class).

• Track your content: The management system ought to provide statistics and reports on your components that enable you to assess the status of individual components or groups of components.

• Support the notion of workflow: Although not part of the repository, the workflow module must be tightly integrated with it. As one example among many, events that occur within the repository, such as adding new components or deleting them, should be capable of triggering workflow processes.

• Support element and full-text search: You're likely to know one of two things about components that you want to find in the repository: the value of some piece of metadata that they contain or some piece of text that you remember that they contain. In the first case, you want what's called an element search. (In relational databases, this is usually called a fielded search.) To do an element search, what you want most is a list of the elements and a place where you can type or select the value that you want. To find components by author, for example, you want to see an Author box into which you can type a name. For a bonus, the system can help you type only valid possibilities. The Author box, for example, can be a list from which you simply choose an author rather than typing her name. In the second case, where you remember some piece of text that the component contains, a full-text search is what you want. Here, what you want is to type a word or phrase in a box and have the system find components that contain that word or phrase in any element. For spice, the repository can enable you to combine full-text and fielded search or to type Boolean operators such as AND, OR, and NOT to make more precise searches of either type.

• Support bulk processes: Managing components one at a time is far too slow for many situations. A good repository enables you to specify an operation and then do it over and over to all the components it applies to. Suppose, for example, that your lead metator is out of town and you want to extend the expiration date on any components that "turn off" while she's out. You could do an element search for all components with an expiration date between today and the day that she returns. Then you could open each of these components and change its Expire Date element to sometime next week.

• Support all field types: Any repository enables you to type metadata as text, but the one that you want can do much more. The best kind of repository supports all the types of fields that I describe in white paper 20, "Working with Metadata," in the section "Metadata fields." In any repository, for example, you can type the name of an author into each component's Author element. Spelling errors and variations on the same name (Christopher Scott vs. C. Scott), however, eventually cause problems. It would be better if you had one place where you could type all author names once. Then, whenever an author needs to be specified, you can choose the name rather than type it. The best would be a system that can be linked to the main sources of metadata in your organization. People log into an organization's network, for example, based on a user ID and password. This information - as well as the organizational groups to which they belong - is stored in a registry. Wouldn't you most like to work with a system that could connect to this registry and find all the people that are in the Authors group? Then, to have access to all authors' names (not to mention any other information that the registry stores), you just need to make sure that the Authors group is correctly maintained by your organization's system administrators. Similarly, if your repository holds master copies of metadata lists, you want it to be openly accessible to your organization's other systems.

• Support organization standards: Your repository should access and work within whatever user security and other network standards that you employ. If you aren't running a TCP/IP network protocol, for example, the CMS's Web-based forms and administrative tools can't work on your local area network.
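The search and bulk-processing functions in this list can be made concrete with a small sketch. What follows is only an illustration, assuming a repository backed by SQLite through Python's standard sqlite3 module; the components table, its columns, the sample rows, and the three function names are all hypothetical and don't come from any CMS product. The full-text search is deliberately naive (a substring match), where a real repository would more likely use a dedicated text index.

```python
import sqlite3

# Hypothetical repository table: one row per component (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE components ("
    " id INTEGER PRIMARY KEY, author TEXT, expire_date TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO components (author, expire_date, body) VALUES (?, ?, ?)",
    [
        ("Christopher Scott", "2024-06-03", "Dental coverage now includes cleanings."),
        ("Laura Smith", "2024-06-20", "Vision coverage is unchanged this year."),
        ("Christopher Scott", "2024-06-05", "Parking passes renew every quarter."),
    ],
)


def element_search(author):
    """Element (fielded) search: match the value of one named element."""
    rows = conn.execute(
        "SELECT id FROM components WHERE author = ? ORDER BY id", (author,)
    )
    return [row[0] for row in rows]


def full_text_search(phrase):
    """Naive full-text search: look for the phrase in any text element."""
    pattern = f"%{phrase}%"
    rows = conn.execute(
        "SELECT id FROM components WHERE author LIKE ? OR body LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return [row[0] for row in rows]


def extend_expiring(start, end, new_date):
    """Bulk process: move every expiration date in [start, end] to new_date."""
    cursor = conn.execute(
        "UPDATE components SET expire_date = ? WHERE expire_date BETWEEN ? AND ?",
        (new_date, start, end),
    )
    return cursor.rowcount  # number of components changed in one operation
```

The bulk function does in one statement what the metator example would otherwise require opening each component to do. Dates are stored as ISO 8601 text here so that they compare correctly as strings.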

The content model

Database developers create data models (or database schemas). These models establish how each table in the database is constructed and how it relates to the other tables in the database. XML developers create DTDs (or XML Schemas). DTDs establish how each element in the XML file is constructed and how it relates to the other elements in the file. CMS developers create content models that serve the same function - they establish how each component is constructed and how it relates to the other components in the system.

In particular, the content model specifies the following:

• The name of each component class

• The allowed elements of each component class

• The element type and allowed values for each element

• The access structures in which each component class and instance participate

The content model puts bones and sinew on the content domain. Although the content domain is a simple statement, the content model is a fully detailed framework. On the other hand, all the components that you detail in the model ought to be specifically in support of the domain. If you can't determine quickly how a particular component serves the domain, you should reconsider the necessity of the component or the validity of the domain statement.

If your CMS is built on a relational database, your content model gives rise to a database schema. If your CMS is built on XML files or an XML database, your content model gives rise to a DTD. The content model, however, isn't simply reducible to either of these models. Suppose, for example, that you establish that you want an Author element that's an open list. This fact can't be coded in either a database schema or a DTD. Rather, it must be established in the authoring environment that you use. Still, the majority of the content model can be coded either explicitly or implicitly in the database or XML schema that you develop. The rest of the content model becomes part of the access structures in your repository and the rules that you institute in your collection system.
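To make the four parts of a content model tangible, here is a minimal sketch that writes one down as plain data. It is an assumption-laden illustration and not the format of any particular CMS: the Element and ComponentClass types, the element type names, the sample values, and the accepts function are all hypothetical. It also shows why an open list ends up as a rule in the authoring environment: the rule lives in code that runs at entry time, not in a schema.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    name: str
    element_type: str  # hypothetical type names: "text", "closed list", "open list"
    allowed_values: list = field(default_factory=list)


@dataclass
class ComponentClass:
    name: str
    elements: list
    access_structures: list = field(default_factory=list)


# One component class with its elements, their types and values, and the
# access structures it participates in (all sample values are invented).
hr_benefit = ComponentClass(
    name="HRBenefit",
    elements=[
        Element("Name", "text"),
        Element("Text", "text"),
        Element("Author", "open list", ["Christopher Scott"]),
        Element("EmployeeType", "closed list", ["Full-time", "Part-time"]),
    ],
    access_structures=["benefits hierarchy", "author index"],
)


def accepts(element, value):
    """Authoring-time rule: closed lists reject new values, open lists take them."""
    if element.element_type == "closed list":
        return value in element.allowed_values
    return True
```

With this model, an authoring form could call accepts before saving: a new author name passes because Author is an open list, whereas an unlisted employee type is refused.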

Storing Content

Most content management systems store components in databases. Some store metadata in databases and keep the component content in files. Although almost all content management systems use some sort of database, the exact database they employ and how the components are stored in the database vary widely. The two major classes of databases that a CMS may use to store content components are shown in Figure 2.

Figure 2: A CMS may store components in a relational database or an XML database

To date, content management systems have stored content in the following general ways:

• In relational databases, which are the computer industry's standard place to store large amounts of information

• In an object (or XML) database, which stores information as XML

Sometimes the component body elements are stored in files. In these cases, management elements are generally stored apart from the body elements in a relational or XML database.

As I write, many CMS companies are experimenting with new technologies that seek to make the best of both the database world and the world of files. In addition, database product companies themselves are breaking the established boundaries by creating hybrid object-relational databases that overlay XML Schema onto the basic relational database infrastructure.

Regardless of the type of storage system that you use, it must be capable of storing components, relationships between components, and the content model, as follows:

• Storing component instances: The primary function of a CMS repository is to store the content components that you intend to manage. Suppose, for example, that you want to manage a type of information called an HR benefit that includes a name and some text. If your system has 50 HR benefits, there must be 50 separately stored entities, each following the HRBenefits class structure, which can be retrieved one at a time or in groups.

• Storing component classes: To store component instances, the repository needs some way of representing component classes. Somewhere in your storage system, for example, there must be a template for an HRBenefit component. After you create a new HRBenefit component, the system uses this template to decide what the new HRBenefit includes.

• Storing relationships between components: The repository must have some way of representing and storing the access structures that you create. Any indexes that you decide that you need, for example, must be capable of being represented somewhere in the repository and must be capable of linking to the components that are indexed.

• Storing the content model: Your repository system must somehow account for all the rules in your content model. Most are covered by storing the components and their relationships, but some aren't. If certain component elements are required (meaning that, in every component instance in which that element is present, that element must not be blank), for example, that fact must be somehow stored so that it can be upheld. Similarly, if the content of a particular element must be a date, or can't be longer than 100 characters, these facts must also be stored somewhere so that you can enforce these rules.
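One place such rules can be stored is the database schema itself. The sketch below is an illustration only, assuming SQLite through Python's sqlite3 module; the table layout, the ExpireDate column, and the try_insert helper are hypothetical. It records the three kinds of rule just mentioned (required, must be a date, no longer than 100 characters) as column constraints, so that the database upholds them on every insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE HRBenefits (
        ID          INTEGER PRIMARY KEY,
        -- required, and between 1 and 100 characters long
        BenefitName TEXT NOT NULL CHECK (length(BenefitName) BETWEEN 1 AND 100),
        -- optional, but must parse as a date when present
        ExpireDate  TEXT CHECK (ExpireDate IS NULL OR date(ExpireDate) IS NOT NULL)
    )
    """
)


def try_insert(name, expire_date):
    """Return True if the row satisfies the stored rules, False otherwise."""
    try:
        conn.execute(
            "INSERT INTO HRBenefits (BenefitName, ExpireDate) VALUES (?, ?)",
            (name, expire_date),
        )
        return True
    except sqlite3.IntegrityError:
        # SQLite rejected the row because a NOT NULL or CHECK rule failed.
        return False
```

Rules that a schema can't express, such as the open list discussed earlier, still have to be stored somewhere else, for example in the collection system.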

Relational database repositories

The relational database was invented as a way to store large amounts of related information efficiently. At this task, it has excelled. The vast majority of computer systems that work with more than a small amount of information have relational databases behind them. Today, there are a handful of database product companies (Oracle, Microsoft, IBM, and the like) that supply database systems to most of the programmers around the world. Programmers use these commercial database systems to quicken their own time-to-market and increase their capability to integrate with the databases currently in use by their customers.

The majority of CMS product companies also base their repositories on these commercial database products. In fact, many require that you buy your database directly from the manufacturer. (This fact, by the way, puts a convenient-for-them and inconvenient-for-you firewall between the CMS product support staff and that of the database company.) Buying a database (or, more accurately, a license) from a commercial company is no big problem; database vendors are happy to sell to you directly. What's much more of an issue is whether the CMS requires that you administer the database separately. You may give preference to CMS products that have integrated database administration into their own user interface and don't require you to administer the databases separately.

Relational database basics

To help readers with less background in data storage, I provide some database basics before going into the more technical aspects of representing content in a relational database.

Whatever you store in a relational database must fit into the database's predefined structures, as follows:

• Databases have tables: Tables contain all the database's content. Loosely, one table represents one type of significant entity. You may create a table, for example, to hold your HRBenefit components. The structure of that table represents the structure of the component it stores. Tables can be related to each other. (This is where the relations in relational databases are.) Rather than typing in the name of each author, for example, your HRBenefits table may be linked (via a unique ID) to a separate author table that has the name and e-mail address of each author.

• Tables have rows (also called records): Loosely, each row represents one instance of its table's entity. Each HRBenefit component, for example, can occupy one row of the HRBenefits table.

• Rows have columns (also called fields): Strictly, each field contains a particular piece of uniquely named information that can be individually accessed. An HRBenefit component, for example, may have an element called Benefit Name. In a relational database, that element may be stored in a field called Benefit Name. Using the database's access functions, you can extract individual Benefit Name elements from the component (or row) that contains them.

• Columns have data types: As you create the column, you assign it one of a limited number of types. The Benefit Name column, for example, would likely be of the type "text" (generally with a maximum length of 255 characters). Other relevant column data types include integer, date, binary large object, or BLOB (for large chunks of binary data such as images), and large text or memo (for text that's longer than 255 characters).
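These four structures can be seen together in a few lines of SQL. The following is a minimal sketch, assuming SQLite driven from Python's sqlite3 module; the column names beyond Benefit Name, the sample author, and the e-mail address are invented for illustration. Note that SQLite accepts a declared length such as VARCHAR(255) but doesn't enforce it, unlike many other database products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- A second table that HRBenefits rows link to by unique ID.
    CREATE TABLE Authors (
        ID    INTEGER PRIMARY KEY,
        Name  TEXT,
        Email TEXT
    );
    -- One table per entity type; each column has a declared data type.
    CREATE TABLE HRBenefits (
        ID          INTEGER PRIMARY KEY,
        BenefitName VARCHAR(255),
        Body        TEXT,
        Photo       BLOB,
        ExpireDate  DATE,
        AuthorID    INTEGER REFERENCES Authors(ID)
    );
    """
)

# One row per instance (the values here are sample data).
conn.execute(
    "INSERT INTO Authors (Name, Email) VALUES (?, ?)",
    ("Christopher Scott", "cscott@example.com"),
)
conn.execute(
    "INSERT INTO HRBenefits (BenefitName, Body, AuthorID) VALUES (?, ?, ?)",
    ("Dental", "We now do teeth!", 1),
)

# Columns are individually accessible: fetch one element from one row.
benefit_name = conn.execute(
    "SELECT BenefitName FROM HRBenefits WHERE ID = 1"
).fetchone()[0]
```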

As you see, even given these exacting constraints, there are many ways to represent content in a relational database. I don't present the following examples to give you a guide to building a CMS database. (You'd need much more than I provide.) In addition, if you purchase a CMS product, you work with a database that the product company's already designed. What I intend is to give you insight into how the needs of a CMS mesh with the constraints of a relational database so that you can understand and evaluate the databases that you encounter.

Storing component classes and instances

The simplest way to represent components in a relational database is one component class per table, one component instance per row, and one element per column. An example of an HRBenefits component class in Microsoft Access is shown in Figure 3.

Figure 3: A simple table representing the HRBenefits component class

Note

Even if you know nothing about databases, you can likely see that this is very well structured. Everything is named and organized into a nice table. It's not hard to imagine how database programs could help you manage, validate, and access content stored in tables. In fact, database programs are quite mature and can handle tremendous amounts of data in tables. They offer advanced access and connect easily to other programs. It's no wonder that relational databases are the dominant players in component storage.

The component class is called HRBenefits. There are three HRBenefit component instances, one in each row of the table. As shown in the figure, HRBenefit components have six elements, one per column. Interestingly, you'd likely only ever type two of the elements - Name and Text. The ID element can be filled in automatically by the database, which has a unique ID feature.

Even this most simple representation of a component in a relational database isn't so simple. There are really four tables involved in storing component information. The Type, Author, and EmployeeType columns contain references to other tables (lookup tables in database parlance). Behind the scenes, what's actually stored in the column isn't the words shown but rather the unique ID of a row in some other table that contains the words. In CMS vocabulary, you can say that Type, Author, and EmployeeType are closed list elements. The lists are stored in other tables and can be made available at author time to ensure that you enter correct values in the fields for these elements. There may, for example, be three drop-down lists on the form that you use to create HRBenefit components. In the first is a list of Types, in the second a list of Authors, and in the third a list of Employee Types. The words in the list are filled in from the values in three database tables.
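The lookup-table arrangement can be sketched as follows. This is an illustration under assumptions, not the layout of Figure 3: it uses SQLite through Python's sqlite3 module, shows only two of the three lookup tables (an Authors table would work the same way), and the Label column, the sample values, and both function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Types (ID INTEGER PRIMARY KEY, Label TEXT);
    CREATE TABLE EmployeeTypes (ID INTEGER PRIMARY KEY, Label TEXT);
    CREATE TABLE HRBenefits (
        ID             INTEGER PRIMARY KEY,
        Name           TEXT,
        TypeID         INTEGER REFERENCES Types(ID),
        EmployeeTypeID INTEGER REFERENCES EmployeeTypes(ID)
    );
    INSERT INTO Types (Label) VALUES ('Health'), ('Retirement');
    INSERT INTO EmployeeTypes (Label) VALUES ('Full-time'), ('Part-time');
    -- The component row stores IDs, not the words themselves.
    INSERT INTO HRBenefits (Name, TypeID, EmployeeTypeID) VALUES ('Dental', 1, 2);
    """
)

LOOKUP_TABLES = {"Types", "EmployeeTypes"}


def list_choices(table):
    """Return (ID, Label) pairs to fill a drop-down list on the authoring form."""
    if table not in LOOKUP_TABLES:
        raise ValueError(f"unknown lookup table: {table}")
    return conn.execute(f"SELECT ID, Label FROM {table} ORDER BY ID").fetchall()


def display_row(benefit_id):
    """Resolve the stored IDs back into the words that a reader expects to see."""
    return conn.execute(
        """
        SELECT b.Name, t.Label, e.Label
        FROM HRBenefits AS b
        JOIN Types AS t ON t.ID = b.TypeID
        JOIN EmployeeTypes AS e ON e.ID = b.EmployeeTypeID
        WHERE b.ID = ?
        """,
        (benefit_id,),
    ).fetchone()
```

Because the form offers only the rows of the lookup table, an author can't enter a value that isn't on the list, which is what makes the element a closed list.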

I continue to complicate the example to show some of the other issues that come into play whenever you store components in a relational database. Suppose that there's an image that goes along with each benefit component (an image of a happy employee, perhaps). To represent the image, you have the following two choices:

• You can actually store the image in the database.

• You can store a reference to the file that contains the image.

The second technique is the usual choice because, historically, databases have been lousy at storing binary large objects (BLOBs). They became bloated and lost performance. This is often no longer true, but the perception remains. More important, images (and other media) stored within a database aren't very accessible to the people who must work on them. Anyone can go to a directory and use her favorite tool to open, edit, and then resave an image, but you need a special interface to extract and restore the same image in a database field. This advantage of file references is rendered moot in many of the more advanced CMS products that create extensive revision histories. To have your changes logged, you must extract and restore your files by using some sort of interface anyway. The same interface can easily store and retrieve BLOBs from the database. All in all, although referencing files instead of storing them in the repository is still the most popular way to include media in a component, there's often little real advantage to it.

An HR Benefits table with images is shown in Figure 4.

Figure 4: The HR component table with an image reference added

Notice the new ImagePath column where you can enter the directory and name of the picture to be included with this component. The image path shown here is a relative path. That is, it doesn't start from a drive letter or URL domain name (such as http://www.xyz.com). Rather, the path assumes that the file resides in an images/hr directory and that some program adds the right drive or domain name or the rest of the path later. This ensures that, even if the computer that houses these files changes, the ImagePath values can stay the same. In real life, you probably wouldn't even type a relative path. Most likely, you upload a file from your hard drive to the system, which then decides what the appropriate directory is for the file.
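The "some program" that completes a relative path can be very small. Here is a minimal sketch in Python using only the standard library; the two function names and the /var/www root directory are assumptions made for illustration, while the images/hr directory and the laurasmile.jpg file name follow the examples in this paper.

```python
import posixpath
from urllib.parse import urljoin


def to_url(base_url, image_path):
    """Prefix a stored relative ImagePath with the current site address."""
    return urljoin(base_url, image_path)


def to_file_path(root_dir, image_path):
    """Prefix a stored relative ImagePath with the current root directory."""
    return posixpath.join(root_dir, image_path)


# The stored value stays the same no matter where the files are hosted.
stored = "images/hr/laurasmile.jpg"
web_address = to_url("http://www.xyz.com/", stored)
disk_location = to_file_path("/var/www", stored)
```

If the files move to another server, only the base URL or root directory passed in changes; none of the ImagePath values in the database need to be touched.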

The elements of the component are stored in the columns of the database. This system works very well if you require a small number of management elements and a few larger elements of body text. It works less well if you have a large number of management elements and a large number of body text elements. It works very poorly if a component can have a variable number of management or body elements. Suppose, for example, that the text of an HRBenefit component looks something like the following example:

<TEXT>
<TITLE>We now do Teeth!</TITLE>
<PULLQUOTE><B>My gums and molars never felt so good.</B>
<IMAGE>laurasmile.jpg</IMAGE>
</PULLQUOTE>
<H1>a paragraph of text here</H1>
<CONCLUSION>
Put your mouth where our money is-<B>use the plan!</B>
</CONCLUSION>
</TEXT>

Rather than a paragraph or two of untagged text, you have a complex composition that includes its own images, metadata, and body elements. How should this be represented in a relational database? Here are the choices:

• Full parsing: You can create a column for each element in the text chunk.

• Partial parsing: You can store the entire text chunk in a single column but pull out some of the elements into separate columns so that you can access them separately.

• No parsing: You can store the entire text chunk in a single column and not worry about its internal tagging.

Of these, the last is the most common.

Fully parsing structured text

Certainly, if you wanted to make the elements of your component maximally accessible, you'd create a column for each element. You want to "explode" the elements of the text chunk into a set of database columns that you can then access individually. Why? Well, suppose, for example, that you wanted to get to just the pull-quotes and images to create a gallery of smiling employees. It would be nice to have each of these in its own column so that you could easily find them and work with them.

To explode the structured text, you parse it and store it. Parsing is the process of finding and selecting elements, and storing is the process of finding the right database row and column and putting the element's text within it.

Well, given the sample text in the preceding section, this would yield seven extra columns in the HR Benefits database table (Title, Pullquote, Image, H1, H2, Conclusion, and B). That doesn't sound excessive, but then again, it's far from the whole story. Table 32-1 shows how the text is divided - you can see that this approach doesn't work.

Table 32-1: An Impossible Repository

Column       Content
Pullquote    My gums and molars never felt so good.
B            My gums and molars never felt so good.
H1           a paragraph of text here
Conclusion   Put your mouth where our money is

First, breaking each element into a column ruins the element nesting that's critical to the text. How would you know, for example, that the B column must go inside the Conclusion column for the text to make sense? Second, columns are repeated. Both the B column and the H1 column occur twice. This isn't allowed in a database, in which each column must be uniquely named. Finally, and most important (I could cite more problems, but I'm sure that you get the idea), this is the text for just one HRBenefit component. Is it reasonable to expect that the others have exactly these columns? I think not. Others should follow the same general form but not the exact order and number of elements. The number and names of the elements (and thus the columns) can vary from component to component, and that's not allowed in a database.

Strange as it may seem, these problems aren't insurmountable. In fact, I know of at least one CMS company that's working to completely "explode" rule-based but variable text (XML, that is) in a relational database. But it's not easy. As you can see from the example that I give, the basic rules of a relational database are at odds with the needs of the text. The rigid regularity of rows and columns is too far removed from the subtler regularity of well-formed text to enable the two to overlap easily.
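The first two problems are easy to reproduce. The sketch below is only an illustration, assuming Python's standard xml.etree.ElementTree parser; the SAMPLE string is a shortened, well-formed adaptation of the benefit text (not the exact chunk shown earlier), and the explode and repeated_columns names are hypothetical. It flattens every element into a column name and a value, the way a naive full parse would, and then reports which column names collide.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened, well-formed adaptation of the sample chunk (an assumption).
SAMPLE = (
    "<TEXT>"
    "<TITLE>We now do Teeth!</TITLE>"
    "<PULLQUOTE><B>My gums and molars never felt so good.</B>"
    "<IMAGE>laurasmile.jpg</IMAGE></PULLQUOTE>"
    "<CONCLUSION>Put your mouth where our money is-<B>use the plan!</B>"
    "</CONCLUSION>"
    "</TEXT>"
)


def explode(xml_text):
    """Flatten every element into a (column name, text) pair, ignoring nesting."""
    root = ET.fromstring(xml_text)
    return [
        (el.tag, (el.text or "").strip()) for el in root.iter() if el is not root
    ]


def repeated_columns(pairs):
    """Column names that occur more than once and so can't be unique columns."""
    counts = Counter(name for name, _ in pairs)
    return sorted(name for name, count in counts.items() if count > 1)
```

Run against SAMPLE, the flattened list contains two B entries, and nothing in it records that one belongs inside PULLQUOTE and the other inside CONCLUSION. The PULLQUOTE entry itself comes out empty, because its words live in the nested B element.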

Partially parsing structured text

Given the difficulty of fully exploding structured text into rows and columns, most systems don't try. Luckily, there are more modest approaches to storing structured text that can often suffice. The most common modest approach is to parse the text block, looking for relevant elements and storing them in their own columns.

Consider again the text chunk to be included in an HRBenefit component, as follows:

<TEXT>
<TITLE>We now do Teeth!</TITLE>
<PULLQUOTE><B>My gums and molars never felt so good.</B>
<IMAGE>laurasmile.jpg</IMAGE>
</PULLQUOTE>
<H1>a paragraph of text here</H1>
<CONCLUSION>
Put your mouth where our money is-<B>use the plan!</B>
</CONCLUSION>
</TEXT>

How many of the elements here do you really need to access separately? Certainly not all of them. It's hard to think of a reason, for example, why you'd need to get to all <B> elements. For the purpose of illustration, say that you really need only the pull-quotes separately. In this case, you can simply add a Pullquote column to your HR Benefits table, and you're ready (see Figure 5).

Figure 5: The HR Benefits table with a Pullquote column added

It doesn't do to remove the pull-quote from the text. Its position and nesting in the text chunk may be critical. Rather, you must make a copy of the pull-quote and put it in the Pullquote column. Notice that you don't need to put the pull-quote tag in the column - only the text. The column name itself serves as the tag for the pull-quote element. Similarly, if you locate the entire text chunk (delimited by <TEXT></TEXT>), you need copy only what's inside to the Text column of the database.

This type of solution enables you to have your text chunk and your metadata, too. It requires, however, that you do the following:

• Program or do manual work: You must create either custom programming code or a manual process for inputting elements to database columns.

• Synchronize columns: You must make sure that you keep the elements in the text and database columns in synch. If someone edits the pull-quote in the text, you must recognize the event and make sure that you update the same text that's duplicated in the Pullquote column.

• Synchronize constraints: You must ensure that the constraints on the element and database column match. It doesn't do for the database column to be limited to 255 characters if the pull-quote element can be as long as the author wants it to be.

Given the extra work involved in pulling metadata from structured text, most people keep the number of elements they treat this way to a bare minimum.
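The custom programming that partial parsing calls for might look like the following. It is a minimal sketch under stated assumptions: SQLite via Python's sqlite3 module, the standard xml.etree.ElementTree parser, a pull-quote that contains only text and inline formatting, and hypothetical function names. For simplicity it keeps the <TEXT> wrapper in the Text column and doesn't copy only what's inside.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE HRBenefits (ID INTEGER PRIMARY KEY, "Text" TEXT, Pullquote TEXT)'
)


def extract_pullquote(chunk):
    """Copy the pull-quote's words (tags stripped) out of the chunk, if present."""
    element = ET.fromstring(chunk).find("PULLQUOTE")
    if element is None:
        return None
    # Join the text of the element and its children, then normalize spacing.
    return " ".join("".join(element.itertext()).split())


def save_benefit(benefit_id, chunk):
    """Store the whole chunk and refresh the duplicated Pullquote column with it."""
    conn.execute(
        'INSERT OR REPLACE INTO HRBenefits (ID, "Text", Pullquote) VALUES (?, ?, ?)',
        (benefit_id, chunk, extract_pullquote(chunk)),
    )
```

Routing every save through one function is one way to handle the synchronization requirement: the Pullquote column is recomputed whenever the text changes, so the two copies can't drift apart. The constraint requirement is met here only because neither column has a length limit.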

Not parsing structured text

Exploding structured text completely into database columns is often prohibitively complex. A controlled explosion of only certain elements into columns is more reasonable but still presents problems. What most people end up doing, then, is just storing the entire chunk of structured text in one database column and ignoring what's inside the text chunk.

This isn't as bad a solution as it may seem at first glance. For one thing, you don't always need to access elements within a text chunk. In many situations, it's fine to simply wrap an unparsed text chunk with a few metadata columns and call it done.

For another thing, not all text is structured. The previous solutions demand that the text chunk you're dealing with be tagged well enough that you can locate elements within it reliably. Most text chunks that you encounter aren't so well structured. In fact, unless someone's put the effort into delivering XML code, it's likely that you don't even have the starting place for constructing any sort of automatic explosion process. Thus, for systems that end up storing text in HTML or any other less-than-easily parsed markup language, storing entire text chunks may be the only possible option.


But even if you have well-structured text, it may work out well to save it in a single database field. Just because the database can't get inside the block of text and deliver a single element doesn't mean that you can't do so by other methods. Suppose, for example, that you store an entire HRBenefit text chunk in a single column called Text. Using relational database methodologies, it's not possible to get directly to the pull-quote elements that may be within the Text column. You can certainly get the entire text chunk and parse it yourself, however, to find the pull-quote element! In other words, getting to the pull-quote elements could be a two-step process: The database gives you the text chunk, and you parse it by using nondatabase tools to get to the pull-quote element.

Obviously, this is less convenient than having the database just give you what you want. It also makes it hard to do a fielded search against elements (where you say, for example, "Give me all the components with this text in their pull-quote element"). Still, storing all your structured text in one field can get you the elements that you desire. In general recognition of this approach, at least one CMS product offers XML processing tools that you can use against any XML code that's stored in its relational database. Moreover, some of the database product companies themselves have developed XML overlays that enable you to package the two-stage process into a single query.
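The two-step process described above can be sketched in a few lines of Python. This is only an illustration, not any product's API: the HRBenefit table layout, the sample text, and the pull_quotes function are assumptions, and SQLite stands in for whatever relational database you use.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical one-table layout: the whole structured chunk sits in the Text column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE HRBenefit (ID INTEGER PRIMARY KEY, Title TEXT, Text TEXT)")
conn.execute(
    "INSERT INTO HRBenefit (ID, Title, Text) VALUES (?, ?, ?)",
    (1, "Dental plan",
     "<TEXT><PULLQUOTE>Use the plan!</PULLQUOTE><P>Details follow.</P></TEXT>"),
)

def pull_quotes(connection, component_id):
    """Step 1: the database returns the chunk. Step 2: an XML parser finds the elements."""
    row = connection.execute(
        "SELECT Text FROM HRBenefit WHERE ID = ?", (component_id,)
    ).fetchone()
    if row is None:
        return []
    root = ET.fromstring(row[0])  # only works if the chunk is well-formed XML
    return [pq.text for pq in root.iter("PULLQUOTE")]

quotes = pull_quotes(conn, 1)
```

Notice that a fielded search ("all components with this text in their pull-quote") would need to run this parse over every row, which is the inconvenience the text mentions.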

Breaking the spell of rows and columns

I've discussed the simplest approach to storing components in relational databases: one table per component class, one row per component, and one element per column. Although this approach is commonly used, it's not the only one possible. The one-table approach has the following advantages:

• It's easy to understand: You don't need to hunt around in the database to find your components. You simply look for the table that has the same name as the components that you're seeking.

• It has high performance: Databases go fastest if they're simple. In fact, even the straightforward process of referencing other tables instead of retyping author names and the like (called normalizing) can slow a database down. For this reason, databases that need to be used in high-volume situations (high-transaction Web sites, for example) are often "de-normalized" first. Their table relationships are broken, and single tables are produced from a set of related tables.

The disadvantages of the single-table approach are as follows:

• It's inflexible: To add a new component class to the system, you must create a new table. To add an element to a component, you must add a column to a table. Although this may seem simple, in many database systems, it requires a fair amount of effort. Relational databases aren't really designed to have their tables constantly modified, created, and destroyed.

• It has a hard time dealing with irregular information: For example, either a table has a particular column or it doesn't. In a CMS, a component instance may or may not have a particular column.

• It has a hard time dealing with extra information: The one-table approach enables you to specify the component name (in the table name), the component element name (in the column names), and the basic data type of the component element (in the data types of the columns). For other types of information about the component or its elements (whether a column's an open list or a reference, for example), there's no obvious place to store it.

An approach more subtle than the "one table/one component class" model can help overcome some of these disadvantages. Consider the two tables shown in Figure 6.


Figure 6: Two tables that represent any component class

In the background, there's a Components table. It simply lists all the component classes in the system and gives them a unique ID. (In real life, a table such as this would be much more complex and most likely consist of an entire family of tables.) In the foreground, there's an Elements table. This table stores all the elements of all the components in the system.

Each element has the following:

• A unique ID: The ID is one piece of information about the element that uniquely identifies it, never changes, and can be used to quickly locate the element in any search that you may do.

• A component class to which it's tied: The illustration shows that a drop-down feature's linked to this column so that you don't need to type the name of the component class with which the element is associated. In this example, each element is associated with only one component class. In real life, an element may be associated with any number of component classes, requiring a more complicated set of tables.

• The element's name: The name is the phrase that people use to recognize the element if they see it on entry forms and reports.

• The element's type: The types shown here correspond loosely to the metadata field types. The CMS can use these types to decide what to do with an element. If the type is Path, for example, the CMS can verify that any typed text has the format of a directory and file name.

This way of representing components has some very nice features. First, to add a new component class, rather than needing to create a whole new table, you need only add a row to the Components table. Similarly, to add an element to a component class, you need only add a row to the Elements table and fill in its values. Second, you can easily extend what the database "knows" about a component or element. The ElementType column, for example, has additional information about what the system expects an element to contain. The ElementType extends the simpler idea of data type. In addition to being able to say what data type an element must have (date, text, BLOB, and so on), you can use the element type to create additional rules.

Notice that these tables represent component classes only. There's no place to actually store the component instances. For this, you need an additional table, as shown in Figure 7.

Figure 7: The Component Instances table stores only element values


This table stores element values and associates them to a particular component instance and to a general component class as follows:

• Element values: The Value column has the specific content of one element of one component.

• Classes: The ComponentID column identifies the class of the component, and the ElementID specifies the various elements of the chosen component class.

• Instances: All rows in this table with the same InstanceID are part of the same component instance. All the preceding rows, for example, are part of component instance number 1. Again, in real life this single table may actually be a family of tables.
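As a rough sketch of the three tables working together, consider the following Python and SQLite example. The column names follow the ones named in the text; the exact keys, the sample rows, and the load_instance function are my assumptions, not the layout in the figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Components (ComponentID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Elements (
    ElementID   INTEGER PRIMARY KEY,
    ComponentID INTEGER NOT NULL REFERENCES Components(ComponentID),
    Name        TEXT NOT NULL,
    ElementType TEXT NOT NULL
);
CREATE TABLE ComponentInstances (
    InstanceID  INTEGER NOT NULL,
    ComponentID INTEGER NOT NULL REFERENCES Components(ComponentID),
    ElementID   INTEGER NOT NULL REFERENCES Elements(ElementID),
    Value       TEXT,
    PRIMARY KEY (InstanceID, ElementID)
);
""")
conn.execute("INSERT INTO Components VALUES (1, 'HRBenefit')")
conn.executemany(
    "INSERT INTO Elements VALUES (?, ?, ?, ?)",
    [(1, 1, "Title", "Text"), (2, 1, "ImagePath", "Pattern:Path")],
)
# One row per element value; rows sharing InstanceID 1 form one component instance.
conn.executemany(
    "INSERT INTO ComponentInstances VALUES (?, ?, ?, ?)",
    [(1, 1, 1, "Dental plan"), (1, 1, 2, "images/dental.jpg")],
)

def load_instance(connection, instance_id):
    """Reassemble one component instance as an element-name-to-value mapping."""
    rows = connection.execute(
        "SELECT e.Name, ci.Value FROM ComponentInstances ci "
        "JOIN Elements e ON e.ElementID = ci.ElementID "
        "WHERE ci.InstanceID = ? ORDER BY e.ElementID",
        (instance_id,),
    ).fetchall()
    return dict(rows)

# Adding an element to the class is an INSERT, not an ALTER TABLE.
conn.execute("INSERT INTO Elements VALUES (3, 1, 'Author', 'OpenList')")
benefit = load_instance(conn, 1)
```

The join needed just to read back one component hints at the cost of the abstract approach: more flexibility, but more work per retrieval.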

In summary, this more abstract way of representing components uses tables to define component classes and tables to represent actual components. It's far more flexible than the simpler "one table/one component class" system. Of course, you don't get something for nothing. Clearly, the more abstract system is harder to understand (and program). In addition, the extra structure and relationships in the abstract system could slow the CMS down in high-transaction environments. Still, as your CMS needs become more complex, a more subtle approach to representing components becomes necessary.

Overall, a more abstract database facilitates authoring, while a more concrete database facilitates delivery. Because of this, many CMS designers choose to keep one database structure for authoring and then transform it to a simpler structure as they move the content from the authoring platform (the local LAN, say) to the delivery environment (an Internet server outside the firewall, for example).

Storing access structures

In the following sections, I discuss how you may store each kind of access structure in a relational database.

Hierarchies in a relational database

Relational databases aren't great at representing outlines. They just don't fit conveniently into rows and columns. Instead, a more abstract approach is needed to put an outline in a table. As an example of how it's done, consider the following Table of Contents for an intranet:


unmanageable. How do you add or update a line? What if the name of a listed component changes? How does that change get into the outline that you typed into the field or cells?

Unfortunately, a more sophisticated approach is needed. To be truly useful, the approach must accomplish the following tasks:

• Represent nesting: It must represent the nesting in the outline.

• Reference IDs: It must reference components by ID (and not require you to type the component name itself if you refer to it) so that, if the name changes, you don't need to retype anything.

• Be complete: There must be enough information in the outline to enable the system to format it later as a set of links to the component pages (assuming that one of your outputs is a Web publication).

As one solution among many, consider the table shown in Figure 8.

Figure 8: A simple hierarchy database table

The following list takes the table apart, column by column, to see how it meets the tasks that I set out for it:

• The TOCID column simply gives each line of the outline a unique ID.

• The Text column specifies the text that should appear on each line of the outline. Notice at this point that there are two kinds of lines in the outline: lines that name the folders of the outline and lines that name particular components. You can tell the two types apart because the rows with no ComponentID correspond to lines of the outline that are folders. Folder rows have the name of the folder in the Text column, while component rows have the name of the component in the Text column. Because you're entering the component IDs, you don't actually need to put in component names. I include them in the preceding table only to make the table easier to read.

• The ParentID and ChildNumber columns establish the nesting of the outline. Notice, for example, that the ParentID of the HR Benefits folder is 1. This is the TOCID of the Our Intranet folder. This means that the HR Benefits folder is under the Our Intranet folder. The ChildNumber of the HR Benefits folder is 1. That means that it's the first child under Our Intranet. The Events folder, with a ParentID of 1 and a ChildNumber of 2, is the second child under Our Intranet. The News folder is child number 3 under the Our Intranet folder.

• The ComponentID column has a component ID in it if the row has a component, and not a folder, listed in it. The CMS uses this distinction to decide which outline rows to make expandable folders and which to make links to components.

If this seems a bit complex, you've gotten the general idea. You must play some tricks to get an outline into a table. After the trick is done, however, the database performs as expected and can store any outline you want effectively. One of the nice things about the preceding structure is that it can store as many outlines as you want. All you need to do is create a new folder that has no parent. Then any other folder or component that lists the new folder as its parent is in a new outline. This feature comes in handy, as you may need more than one outline in your CMS.
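To make the trick concrete, here is a minimal Python and SQLite sketch that rebuilds a nested outline from such a table. The table name TOC, the sample component row, and the outline function are assumptions; the folder rows follow the values described for Figure 8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TOC (TOCID INTEGER PRIMARY KEY, Text TEXT, "
    "ParentID INTEGER, ChildNumber INTEGER, ComponentID INTEGER)"
)
conn.executemany("INSERT INTO TOC VALUES (?, ?, ?, ?, ?)", [
    (1, "Our Intranet", None, 1, None),   # no parent: the root of one outline
    (2, "HR Benefits", 1, 1, None),
    (3, "Events", 1, 2, None),
    (4, "News", 1, 3, None),
    (5, "Dental plan", 2, 1, 33),         # hypothetical component row
])

def outline(connection, parent_id=None, depth=0):
    """Walk the table recursively, ordering siblings by ChildNumber."""
    rows = connection.execute(
        "SELECT TOCID, Text, ComponentID FROM TOC "
        "WHERE ParentID IS ? ORDER BY ChildNumber",  # SQLite's IS also matches NULL
        (parent_id,),
    ).fetchall()
    lines = []
    for toc_id, text, component_id in rows:
        # Rows without a ComponentID are folders; the rest link to components.
        kind = "folder" if component_id is None else f"link to component {component_id}"
        lines.append("  " * depth + f"{text} [{kind}]")
        lines.extend(outline(connection, toc_id, depth + 1))
    return lines

toc_lines = outline(conn)
```

Calling outline with parent_id set to None returns every outline in the table, one after another, because each parentless folder starts its own outline.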

Indexes in relational databases

An index connects phrases to content (or to other phrases). In books, index entries point to pages. In a CMS, index entries (also called keywords) can point to pages, components, elements, or text positions.

Indexes that point to components are fairly straightforward to represent in a relational database. A very simple but quite adequate index table is shown in Figure 9.

Figure 9: A simple Index database table

The first column lists the index term, while the second column lists the IDs of the components to which the index term applies. In real life, you may make some modifications to this simple format to increase its quality. First, you may want to make it a two-level index, like what you see in the back of a book. To do so, you'd need to somehow represent an outline in the table. A set of fields such as the ones that I used previously to create an outline would do, or you could do something simpler given that it's only a two-level outline and that it's alphabetical. Second, as most database programmers would tell you immediately, you'd probably not want to list all the component IDs in one column. Rather, you'd put them in a separate "bridge" table that holds only one component ID in each row. The hard part of indexes, of course, isn't creating the database tables to support them but putting the effort into indexing your content.

If your system is going to produce primarily Web pages, you may be tempted to follow the most well-worn path of associating index keywords with Web pages. Most indexes on the Web today use the <META> tag to create a keyword facility that their own site (as well as any Web-crawling search engine) can use, as follows:


This is a fine approach, and in many cases, it's the only approach to indexing a Web site. It needn't preclude creating an index in your CMS that points to components, however, and not pages. As you build a page, it's easy enough to populate the values of the <META> tag from the index terms that apply to any of the components that you've put on the page. That way, as components are added or deleted from the page, their keywords take care of themselves.
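The bridge-table idea and the keyword population step can be sketched together in Python and SQLite. The table names IndexTerms and TermComponents, the sample terms, and the meta_keywords function are assumptions made for illustration only.

```python
import sqlite3
from html import escape

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE IndexTerms (TermID INTEGER PRIMARY KEY, Term TEXT UNIQUE NOT NULL);
-- Bridge table: one term-to-component pairing per row.
CREATE TABLE TermComponents (
    TermID      INTEGER NOT NULL REFERENCES IndexTerms(TermID),
    ComponentID INTEGER NOT NULL,
    PRIMARY KEY (TermID, ComponentID)
);
""")
conn.executemany("INSERT INTO IndexTerms VALUES (?, ?)",
                 [(1, "benefits"), (2, "dental")])
conn.executemany("INSERT INTO TermComponents VALUES (?, ?)",
                 [(1, 23), (1, 33), (2, 33)])

def meta_keywords(connection, component_ids):
    """Build a keywords tag from the index terms of the components on a page."""
    placeholders = ",".join("?" for _ in component_ids)
    rows = connection.execute(
        "SELECT DISTINCT t.Term FROM IndexTerms t "
        "JOIN TermComponents tc ON tc.TermID = t.TermID "
        f"WHERE tc.ComponentID IN ({placeholders}) ORDER BY t.Term",
        list(component_ids),
    ).fetchall()
    terms = ", ".join(term for (term,) in rows)
    return f'<META NAME="keywords" CONTENT="{escape(terms, quote=True)}">'

page_tag = meta_keywords(conn, [23, 33])
```

Because the tag is rebuilt from the components actually placed on the page, removing component 33 from the page drops "dental" from the keywords with no manual editing.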

Cross-references in relational databases

Cross-references have referents and references. The referent is the thing linked to. The reference is the thing doing the pointing. As an example, consider the following HTML cross-reference:

<A HREF="target.htm">Click me</A>

The referent is target.htm. You go there if you follow the link. The reference is the entire line. It's the thing in HTML that says, "There's a link here."

Cross-references can present a bit of a dilemma in a relational database. Although the referent can usually be a component, the reference can be an entire component, an element within a component (an image, say), or even a single word within an element. In database lingo, a reference can apply to a row, a column within a row, or to a word within a column. Applying a cross-reference to a component is relatively easy. In Figure 10, you can see a table that does the trick.

Figure 10: A table to cross-reference components to other components

The ReferenceID column has the ID of the component from which you're linking. The ReferentID has the ID of the component that you're linking to. The LinkText column has the text that becomes the link. The idea here is that, if you publish the component onto some sort of page, your software can look in this table to see whether there are any links. If there are, you can render them in the appropriate way. Having your cross-references organized this way gives you a tremendous advantage in keeping them under control. If you delete component 23, for example, this table tells you that you'd better fix the cross-reference to it in component 33. In addition, you can use this table to tell you which components are linked to most often and other very nice-to-know facts about your cross-referencing system.

Notice that this approach is neutral with respect to the kind of publication that you want to create. You can, for example, use the same information to create an HTML link, as follows:

<A HREF="23.htm">More about benefits</A>

Or you can create a print link, as follows:

More about benefits can be found in the section titled "Benefits and You."
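A short Python and SQLite sketch shows one table feeding both renderings, plus the "what breaks if I delete this?" check. The CrossReferences and Components table names, the Title column, and both functions are assumptions; the sample row matches the component 33 to component 23 link used in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Components (ComponentID INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE CrossReferences (ReferenceID INTEGER, ReferentID INTEGER, LinkText TEXT);
""")
conn.executemany("INSERT INTO Components VALUES (?, ?)",
                 [(23, "Benefits and You"), (33, "Dental plan")])
conn.execute("INSERT INTO CrossReferences VALUES (33, 23, 'More about benefits')")

def render_links(connection, component_id, medium):
    """Render the links that start in component_id for a Web or print publication."""
    rows = connection.execute(
        "SELECT x.ReferentID, x.LinkText, c.Title FROM CrossReferences x "
        "JOIN Components c ON c.ComponentID = x.ReferentID "
        "WHERE x.ReferenceID = ?",
        (component_id,),
    ).fetchall()
    if medium == "html":
        return [f'<A HREF="{referent}.htm">{text}</A>' for referent, text, _ in rows]
    return [f'{text} can be found in the section titled "{title}."'
            for _, text, title in rows]

def referencing_components(connection, referent_id):
    """List components whose links would break if referent_id were deleted."""
    rows = connection.execute(
        "SELECT ReferenceID FROM CrossReferences WHERE ReferentID = ?",
        (referent_id,),
    ).fetchall()
    return [reference for (reference,) in rows]

html_links = render_links(conn, 33, "html")
print_links = render_links(conn, 33, "print")
```

Running referencing_components(conn, 23) before deleting component 23 gives you the list of components to fix.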

So much for cross-references where the reference is an entire component. How about ones where you need to link from an element within a component? It becomes a bit stickier here, and you have some choices based on the assumptions that you can make, as follows:

• Extra metadata: If the element is always going to have a link, you can add an extra column to the database that holds the link referent (and link text if necessary). In other words, the cross-reference can become another element of the component that's always tied to a particular other element.

• Element references: You can add a column to the table in Figure 10 that contains an element name or ID. As each component element is rendered, your software can look in the table and see whether that element has a link. This is a bit of overkill, however, because the vast majority of elements don't have links. Still, it would work.

• Link as structured text: You can treat the element link as a text link, as I describe next.

The most common type of cross-reference is between a phrase and a component. Too bad that it's so unnatural for a relational database to manage such links. The problem lies in the fact that databases have no built-in tools to look into columns and deal with what's inside of them. As long as you obey the data type of the column, you can put as many broken and malformed links inside them as you want.

So how do you manage cross-references where the reference is a word or phrase? First, rather than typing any sort of link in the text of the column, you may instead put in just a link ID. Rather than a link that looks as follows:

<A HREF="23.htm">More about benefits</A>

Put in something like the following example:

Now there's no information in the text of the column that can change or go bad. Next, use a table similar to the one shown in Figure 11.

Figure 11: A table for managing links that are embedded within database columns

This table looks a lot like the one I used to link component to component. That fact comes in handy in just a moment. First, I need to discuss how this table works. If you publish component 33, your CMS finds the text:

After it does, the software can look up LinkID 55 to retrieve the information to make a link. If component 23 (the referent of the link) is deleted, no problem; just as before, your software can look in this table and tell you to go back to component 33 and change link 55.
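The lookup can be sketched as follows in Python. The placeholder syntax <LINK ID="55"/> is purely hypothetical, chosen for this illustration; the Links table columns and the expand_links function are likewise assumptions.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Links (LinkID INTEGER PRIMARY KEY, ReferenceID INTEGER, "
    "ReferentID INTEGER, LinkText TEXT)"
)
conn.execute("INSERT INTO Links VALUES (55, 33, 23, 'More about benefits')")

# Hypothetical placeholder that authors type instead of a full link.
PLACEHOLDER = re.compile(r'<LINK ID="(\d+)"\s*/>')

def expand_links(connection, text):
    """Swap each link ID placeholder for an HTML link built from the Links table."""
    def replace(match):
        row = connection.execute(
            "SELECT ReferentID, LinkText FROM Links WHERE LinkID = ?",
            (int(match.group(1)),),
        ).fetchone()
        if row is None:
            return match.group(0)  # unknown ID: leave the placeholder for review
        referent_id, link_text = row
        return f'<A HREF="{referent_id}.htm">{link_text}</A>'
    return PLACEHOLDER.sub(replace, text)

published = expand_links(conn, 'See <LINK ID="55"/> for details.')
```

Only the ID lives in the text column, so renaming or moving component 23 changes the table (or the publishing rule) and never the stored text.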

So you have ways to deal with links from any level of a component. But rather than three different ways to handle three different kinds of links, it would be much better to have one way that covers all three situations, as shown in Figure 12.

Figure 12: A table to deal with every level of linking

I've added LinkType and Element columns to the table, and now it can cover any situation, as follows:


• A text-level link: It has a link type of Text and lists the name of the element that contains the

Sequences in relational databases

Component sequences specify a next and a previous component relative to the one that you happen to be positioned on. They're the easiest of the structures to store in a relational database. First, the component outline that you create is a built-in sequence. The components that are next and previous in the outline are likely to be of use to you. If that's the only sequence that you need, you have no work to do to store a sequence.

Many sequences can be generated on the fly without any storage at all. A sequence by date, for example, can simply query the repository for components and sort them by a date column.

If you need other sequences, you can create a three-column table. In column 1 is a sequence ID, in column 2 is a component ID, and in column 3 is an order (1, 2, 3, and so on). To construct a sequence, you find all the rows that have the sequence ID that you want to use and order them by the number that's in the order column.
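The three-column table can be sketched in Python and SQLite as follows. The table name Sequences, the column name SortOrder (ORDER is a reserved word in SQL), the sample rows, and both functions are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sequences (SequenceID INTEGER, ComponentID INTEGER, "
    "SortOrder INTEGER, PRIMARY KEY (SequenceID, SortOrder))"
)
conn.executemany("INSERT INTO Sequences VALUES (?, ?, ?)",
                 [(1, 33, 1), (1, 23, 2), (1, 41, 3)])

def sequence(connection, sequence_id):
    """Return the component IDs of one sequence in their stored order."""
    rows = connection.execute(
        "SELECT ComponentID FROM Sequences WHERE SequenceID = ? ORDER BY SortOrder",
        (sequence_id,),
    ).fetchall()
    return [component_id for (component_id,) in rows]

def neighbors(connection, sequence_id, component_id):
    """Return (previous, next) for a component; None marks either end."""
    ordered = sequence(connection, sequence_id)
    position = ordered.index(component_id)  # raises ValueError if not in the sequence
    previous = ordered[position - 1] if position > 0 else None
    following = ordered[position + 1] if position + 1 < len(ordered) else None
    return previous, following

around_23 = neighbors(conn, 1, 23)
```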

Storing the content model

As I mention in the section "The Content Model," earlier in this white paper, the content model is all the rules of formation for your content components. The rules fall into the following categories:

• The name of each component class

• The allowed elements of each component class

• The element type and allowed values for each element

• The access structures in which each component class and instance participates

I've covered the storage of much of the content model in discussing how content and its relationships are stored in a relational database. Here, however, I summarize and discuss the issues in a bit more depth. I cover both the simple and abstract relational database models.

In the simple "one table/one component class" scheme that I discuss earlier in this white paper, you create a single table for each component class that you intend to manage. In the more abstract scheme that I discuss earlier in this white paper, you create tables that define component classes and elements and other tables where the element values are stored.

The major difference between these two approaches, and the reason that I contrast them in some detail, is that, in the simple scheme, the content model can't be stored explicitly. Rather, it's implicit in the names of the database parts that you create. In the abstract scheme, most of the content model can be stored explicitly. At first, this may seem like a small distinction, but I don't believe it is. Your content model is the heart of your CMS. If it's buried in the base structure of your repository and not available for review and frequent tweaking, your system isn't flexible enough to flow with the changes that you're going to want to make.

In the simple scheme, the name of each component class is stored implicitly in the name of a table. The HRBenefits class, for example, is named in the name of its table. The allowed elements for each class are stored implicitly in the column names in the simple scheme. The fact that the HR Benefits table has five columns means, for example, that the HR Benefits component is allowed five elements.


In the abstract scheme, classes are named explicitly in the Components table. There's an actual column where you type in the name of the component class. To rename a class, you need only change the value in one row and column. The elements allowed in each component are also named explicitly in the Elements table. There's a particular row and column intersection where you type the name of the element.

Unlike the simple scheme, the abstract scheme gives you the capability to view and modify your component class structures easily.

Element field types can't be stored explicitly in the simple database scheme. They can, however, be represented more or less. A closed list, for example, can be created for the author element by linking it to an Author table. Only people listed in the Author table can be chosen. Similarly, if the user can add a new row to the Author table as well as select one from the existing list, you can create an open list. It's important to notice, however, that the list isn't open or closed because of the database structure; it's open or closed based on the user interface that you put around the database structure.

In the abstract scheme, there's a place to enter the element field type explicitly. Each element has a specific column for just this purpose (the ElementType column), as shown in Figure 13.

Figure 13: The two tables of the abstract component system

The types shown here extend the element field types. The Pattern field type, for example, also states what kind of pattern to expect (ImagePath expects a path and Text expects XML).

As opposed to the simple scheme in which a list was only open or closed based on the user interface that you applied to it, here the list is explicitly set to open or closed. Your system can now read the fact that Author is an open list and provide the appropriate user interface.

Allowed element values may or may not be explicit in the simple scheme. On the one hand, I can explicitly set the data type of a column to "date." Other allowed values can't be made explicit. There's no place, for example, to type explicitly the rule that a pattern element called ImagePath must have a valid path and file name in it. This rule is implicit in the validation program that you may create that checks the contents of this column. In general, the best that you can do to represent allowed values and element types in the simple scheme is to match them to the closest data type.

In the abstract scheme, of course, all allowed values can be made explicit. The ImagePath element, for example, is set specifically to look for a path pattern. You still need to code a validation that enforces this pattern, but at least you can code it once and then automatically apply it to any element of the type Pattern:Path. In general, you can use an abstract database schema to represent any sort of element field type that you want and include enough extra information that your validation and form-building software can figure out how to handle element fields of that type.
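The "code it once" idea can be sketched as a lookup from element type to validator. In this Python sketch, Pattern:Path comes from the text; the Pattern:XML type name, the path rule itself, and all function names are assumptions, and in a real system the ELEMENT_TYPES mapping would be read from the Elements table.

```python
import re
import xml.etree.ElementTree as ET

# Assumed rule: optional directories separated by "/", then a file name with an extension.
PATH_PATTERN = re.compile(r"^(?:[\w.-]+/)*[\w.-]+\.\w+$")

def valid_path(value):
    return bool(PATH_PATTERN.match(value))

def valid_xml(value):
    try:
        ET.fromstring(value)
        return True
    except ET.ParseError:
        return False

# One validator per element type, written once and shared by every element of that type.
VALIDATORS = {"Pattern:Path": valid_path, "Pattern:XML": valid_xml}

# Stand-in for rows of the Elements table (element name -> ElementType).
ELEMENT_TYPES = {"ImagePath": "Pattern:Path", "Text": "Pattern:XML"}

def validate(element_name, value):
    """Look up the element's type, then apply the validator registered for it."""
    return VALIDATORS[ELEMENT_TYPES[element_name]](value)

image_ok = validate("ImagePath", "images/dental.jpg")
```

Adding a new element of type Pattern:Path then requires only a new row in the Elements table; no new validation code is needed.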


XML-based repositories

Object databases were invented as a convenient way to store or serialize the data that programmers needed to handle in their object-oriented programming. Programmers, who traditionally used a relational database to store their data, got tired of trying to fit hierarchical data into rows and columns, so they invented a hierarchical storage system.

Object-oriented data mirrors the structure of the programming objects that process it. As a very simple example, suppose that you're writing a program to deal with a university curriculum. You may create an object called Course that does all the processing for particular courses, an object called Department that does the department-level processing, and a final object called School that contains all the functions that are needed for the school as a whole. These three objects have the following relationships:

• The way that you process a course depends on which department that course is in. All courses in the English department, for example, may require the taker to be an English major. Thus you want to access the Department object whenever doing the Course-Signup function.

• The way that you process a department may depend on which school it's in. All course changes in the English department, for example, may need to be approved by the dean of the Humanities school. Thus you need to access the School object in performing the Department-ChangeCurriculum function.

In plain English, you can represent this relationship as follows:

School (has approver's name)
  Department (has course requirement)
    Course (has course data)

Somewhere, you must store the approver's name, course requirements, and course data, as well as a lot of other data for large varieties of courses, departments, and schools. A programmer could (and still most often does) create relational database tables to keep track of all this data. There's no natural fit, however, between this hierarchical data and the rows and columns of a database. A much more straightforward way to store the data for this object model may be something like the following example:

To store data into this structure, you simply "walk" down the object hierarchy storing object names and their data. To reload a set of data, you read the data hierarchy, create object instances as you come across their names, and load them with the data that's listed. Of course, in real life there's a bit more to it than the simple explanation that I've given, but I hope that you get the point. The preceding XML structure fits the structure of an object-oriented programmer's world like a hand in a glove. A relational database fits an object world like a square peg in a round hole.
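The walk-down and reload steps can be sketched in Python. The three classes mirror the School, Department, and Course objects from the example; the XML element and attribute names, the sample data, and both functions are my assumptions for illustration.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET

@dataclass
class Course:
    name: str
    data: str

@dataclass
class Department:
    name: str
    requirement: str
    courses: list = field(default_factory=list)

@dataclass
class School:
    name: str
    approver: str
    departments: list = field(default_factory=list)

def to_xml(school):
    """Walk down the object hierarchy, storing object names and their data."""
    root = ET.Element("School", Name=school.name, Approver=school.approver)
    for dept in school.departments:
        node = ET.SubElement(root, "Department",
                             Name=dept.name, Requirement=dept.requirement)
        for course in dept.courses:
            ET.SubElement(node, "Course", Name=course.name).text = course.data
    return root

def from_xml(root):
    """Read the data hierarchy and create object instances as names come up."""
    school = School(root.get("Name"), root.get("Approver"))
    for node in root.findall("Department"):
        dept = Department(node.get("Name"), node.get("Requirement"))
        for item in node.findall("Course"):
            dept.courses.append(Course(item.get("Name"), item.text or ""))
        school.departments.append(dept)
    return school

humanities = School("Humanities", "Dean of Humanities", [
    Department("English", "English majors only",
               [Course("ENG101", "Introductory composition")]),
])
restored = from_xml(ET.fromstring(ET.tostring(to_xml(humanities))))
```

The stored form has the same shape as the objects, so no translation into rows and columns is needed in either direction.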


So object databases were invented to provide the programmer with a more straightforward (and faster) way to store and retrieve data for their object hierarchies. I wouldn't have bothered with such a detailed discussion but for the fact that content components are a lot like programming objects. Just as objects do, components come in a sort of hierarchy. And component hierarchies are just as inconvenient to store in relational databases as are object hierarchies. Thus content programmers have turned to object databases for many of the same reasons that object programmers have.

The hierarchical way that object databases store information is a compelling reason to consider them for a CMS. More important is the fact that the syntax that they use to create the information hierarchies is XML. For a CMS that uses XML as its basic content format, an object database is a natural choice for a repository.

Object databases vs XML

You don't need an object database to program in XML. In fact, few XML programmers know much about object databases. They work with XML files. An XML file carries all the same structure and syntax of an object database. On the other hand, there are a number of reasons you may choose (as some CMS product companies have) to use an object database:

• Multiple files: You may need to work with many XML files and would like them all united by a single hierarchy and searchable by the cross-file capabilities that an object database supplies.

• Delivery: You may need a more sophisticated delivery environment than a file system can offer. Many object databases, for example, come with the Web caching, load balancing, and replication functionality you need to run a high-throughput site.

• Development environment: You may prefer the programming and administration environment that an object database provides. Many object databases, for example, come with their own equivalents or extensions of XML standards such as XPath and XSLT that you may prefer to use rather than the less developed standards.

• Performance: Object databases offer some performance gains over XML files. Many provide indexes, for example, that enable you to find commonly searched-for metadata more quickly than you could in an XML file.

Just about all of what I cover in the sections that follow applies equally well to an object database or XML files. In either case, the main event isn't the container (the file or database) but the XML it contains.

Storing component classes and instances

The simplest way to store components in XML is inside a single element that bears the name of the component class:


</HRBENEFIT>

The component is an XML element, and all the component's elements are XML elements. You can get a lot more out of the component's XML code by making some of the elements into attributes, as shown in the following example:

<HRBENEFIT ID="HR1" Type="Standard" AuthorID="A21"

I chose to make the ID, Type, Author, and EmployeeType attributes for particular reasons based on the kinds of element fields that they are, as follows:

• The ID element is a unique identifier. It's automatically generated by the CMS (and possibly never seen by a user). By making it an attribute, you can specify (in a DTD or schema) that it's an ID and must be unique across all HRBenefit components. One small change is needed to create the ID attribute. In XML, IDs must begin with a letter or underscore, not a digit, so I changed the value from 1 to HR1.

• The Type and EmployeeType component elements are closed lists. The user chooses a value from a constrained set of choices. By making them attributes, you can define them as closed lists (in a DTD or schema) and specify the valid choices. Unfortunately, there's no equally easy way to create open lists in an XML DTD.

• The Author component element is more complex. Assume that there are Author components elsewhere in the system that have author names, job titles, e-mail addresses, and the like. In this component, rather than duplicate information that resides in an Author component, it's much wiser to simply refer to the correct Author component. In other words, the Author component element is an open list whose values are references to another component. By making it an attribute, you can specify that it's a reference (called an IDREF in DTD parlance) and enforce that it always points to an existing Author component. Because it's an IDREF attribute, I changed its name to AuthorID and changed its value from the author's name to the ID of the author's component.

To store all the HRBenefit components, you simply wrap them in a higher element:

If you get the feeling that XML is easy to create, you're right and wrong. It's very easy to type blocks of text like the preceding ones; it's very hard to create a large, interconnected system of elements that's controlled as rigorously as you need. In other words, following the syntax rules of XML is easy (the concept of well-formed XML). Creating and then following the complex set of construction rules specified by a DTD or schema is hard (the concept of valid XML).

Nevertheless, you can see that XML is quite capable of representing components.
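To make the wrapped structure concrete, here is a minimal Python sketch that reads a set of HRBenefit components from a wrapper element. It uses only the standard library. The `<HRBENEFITS>` wrapper name follows the DTD shown later in this paper; the NAME values, the second component, and the function name `load_components` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical repository fragment. The NAME values and the second
# component are invented for illustration.
REPOSITORY_XML = """
<HRBENEFITS>
  <HRBENEFIT ID="HR1" Type="Standard" AuthorID="A21" EmployeeType="FT">
    <NAME>401K Plan</NAME>
    <IMAGEPATH>images/hr/joesmile.jpg</IMAGEPATH>
    <TEXT>Our great 401K plan</TEXT>
  </HRBENEFIT>
  <HRBENEFIT ID="HR2" Type="Extended" AuthorID="A21" EmployeeType="ALL">
    <NAME>Dental Plan</NAME>
    <IMAGEPATH>images/hr/dental.jpg</IMAGEPATH>
    <TEXT>Coverage details go here</TEXT>
  </HRBENEFIT>
</HRBENEFITS>
"""


def load_components(xml_text):
    """Return one dict per HRBENEFIT, merging its attributes and child elements."""
    root = ET.fromstring(xml_text)
    components = []
    for benefit in root.findall("HRBENEFIT"):
        record = dict(benefit.attrib)          # ID, Type, AuthorID, EmployeeType
        for child in benefit:                  # NAME, IMAGEPATH, TEXT
            record[child.tag] = (child.text or "").strip()
        components.append(record)
    return components


components = load_components(REPOSITORY_XML)
```

Each dictionary treats attributes and child elements alike, which mirrors the point that both are simply component elements stored in different XML forms.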

If you're storing your components in a relational database, as I discuss earlier in this white paper, you have some hard thinking to do about how to store structured text in rows and columns. In an XML repository, the issue is much simpler. Suppose that you have the following structure in the Text element of the HRBenefit component:

<TEXT>

</PULLQUOTE>

<H1>a paragraph of text here

</H1>

<CONCLUSION>Put your mouth where our money is-<B>use the

plan!</B></CONCLUSION>

</TEXT>

To store this structure in an XML file or database, you simply include it within the larger component XML and you're done. There's no need for special techniques for retrieving the structured text or elevating any part of it to special metadata containers. The elements that are within the <TEXT> tags are as accessible to the users of the repository as are any other elements of the component. In fact, they're just more elements of the component. Furthermore, it doesn't matter if some components have a lot of different structures in their <TEXT> elements and others very little. Whatever is there is stored and accessible right down to the lowest <B> tag.
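As a rough illustration of that accessibility, the following Python sketch queries into the `<TEXT>` element with the same path syntax used for any other element. The fragment reuses the `<CONCLUSION>` markup from the example above; the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

COMPONENT_XML = """
<HRBENEFIT ID="HR1">
  <TEXT>
    <CONCLUSION>Put your mouth where our money is-<B>use the plan!</B></CONCLUSION>
  </TEXT>
</HRBENEFIT>
"""

component = ET.fromstring(COMPONENT_XML)

# Reach the lowest-level <B> elements with an ordinary path query.
bold_phrases = [b.text for b in component.findall("./TEXT//B")]

# Collect the full text of the conclusion with the markup stripped.
conclusion = component.find("./TEXT/CONCLUSION")
conclusion_text = "".join(conclusion.itertext())
```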

Of course, you don't get this sort of advantage for nothing. What I conveniently skipped over in the preceding description is that a lot of work may need to go into the text to get it to the point where you can just paste it into the <TEXT> element.

In particular, you need the following:

• The text must be well-formed XML. Although this isn't strictly true (you can put in any text you want if you use an XML feature called a CDATA section), it's true enough if you want to access and use the text. It may be relatively easy to get the text into XML (from HTML, for example) or quite difficult (from some old word-processing format, for example).

• The text must be valid XML. Although this isn't strictly true (you don't need to validate your XML), it's true enough if you want to get any of the advantages from having rules around your XML that are listed in a DTD or schema.

In a relational database, it doesn't matter what kind of text is in the rows and columns; the main problem is recognizing the structure in structured text. In object databases and XML files, it matters a lot what kind of text you store in each element. Although relational databases have a hard time with structured text, XML files and databases have a hard time dealing with text that's not structured.
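The trade-off between well-formed text and a CDATA section can be sketched in a few lines of Python. The sample strings and the helper name `is_well_formed` are assumptions; the parser behavior shown is that of the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# HTML-style text with an unclosed <BR> tag is not well-formed XML.
raw_html = "<TEXT>First line<BR>Second line</TEXT>"

# Wrapped in a CDATA section, the same text parses, but the parser sees one
# opaque string: the <BR> is no longer an element that you can query.
wrapped = "<TEXT><![CDATA[First line<BR>Second line]]></TEXT>"

# Cleaned up into well-formed XML, the structure becomes accessible.
cleaned = "<TEXT>First line<BR/>Second line</TEXT>"
```

The CDATA route gets the text into the repository quickly, at the cost of the element-level access that makes an XML repository worthwhile.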

Storing access structures

Most of the access structures are fairly easy to represent and work with in an XML structure.

• Nesting: The XML represents the nesting in the outline by having folder elements embedded in other folder elements.

• ID references: The XML references components by ID so that, if the name changes, you don't need to retype anything. The ID is enough information for the system to later go fetch the items' names and create a set of links to the component pages.

Notice that this outline assumes that there are elements somewhere in the XML code that contain the IDs referred to. In content management parlance, you'd say that this outline refers to a set of HR, Site, IStory, and OStory components.
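The two ideas above, nested folder elements and references by ID, can be sketched as follows. The element names `<OUTLINE>`, `<FOLDER>`, and `<COMPONENTID>`, the `Name` attribute, and the lookup table of component names are all assumptions made for this illustration; they are not necessarily the names used in the original outline.

```python
import xml.etree.ElementTree as ET

# Hypothetical outline: folders nest inside folders, and components are
# referenced only by ID.
OUTLINE_XML = """
<OUTLINE>
  <FOLDER Name="Benefits">
    <COMPONENTID>HR1</COMPONENTID>
    <FOLDER Name="Retirement">
      <COMPONENTID>HR2</COMPONENTID>
    </FOLDER>
  </FOLDER>
</OUTLINE>
"""

# Stand-in for fetching component names from elsewhere in the repository.
COMPONENT_NAMES = {"HR1": "Health Plan", "HR2": "401K Plan"}


def render_outline(element, depth=0):
    """Walk the nested folders and return indented lines of text."""
    lines = []
    for child in element:
        if child.tag == "FOLDER":
            lines.append("  " * depth + child.get("Name"))
            lines.extend(render_outline(child, depth + 1))
        elif child.tag == "COMPONENTID":
            # The name is looked up at publish time, so renaming a
            # component never requires editing the outline.
            lines.append("  " * depth + COMPONENT_NAMES[child.text.strip()])
    return lines


outline_lines = render_outline(ET.fromstring(OUTLINE_XML))
```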

Indexes in XML

Representing an index in XML is fairly straightforward and follows the same logic as it does in a relational database. The index of HR components that I describe earlier in this white paper, for example, can be represented as follows in XML:

The following is happening in the preceding example:

• The <INDEX> element sets the index apart from other structures in the XML repository.

• The <TERM> element encloses a single index entry. It has a name attribute that has the term that's being indexed and an ID attribute that uniquely identifies the term.

• The <COMPONENTID> element marks a single component that's indexed by the term. Only the ID of the component is given. The name of the component (or its page number if you're producing a print publication) is retrieved by using the ID as the index is published.

By the way, I could have used a shortened form of XML where all the component IDs are packed into a single attribute, but it would have been harder to read and explain.
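The publishing step described in the bullets, retrieving component names by ID as the index is published, might look like the following sketch. The `<INDEX>`, `<TERM>`, and `<COMPONENTID>` elements come from the description above; the attribute spellings `Name` and `ID`, the sample terms, and the name lookup table are assumptions.

```python
import xml.etree.ElementTree as ET

INDEX_XML = """
<INDEX>
  <TERM Name="401K" ID="T1">
    <COMPONENTID>HR1</COMPONENTID>
    <COMPONENTID>HR3</COMPONENTID>
  </TERM>
  <TERM Name="Dental" ID="T2">
    <COMPONENTID>HR2</COMPONENTID>
  </TERM>
</INDEX>
"""

# Stand-in for the component names stored elsewhere in the repository.
COMPONENT_NAMES = {
    "HR1": "401K Plan",
    "HR2": "Dental Plan",
    "HR3": "Retirement Overview",
}


def publish_index(index_xml, names):
    """Map each index term to the names of the components it indexes."""
    index = ET.fromstring(index_xml)
    entries = {}
    for term in index.findall("TERM"):
        ids = [c.text.strip() for c in term.findall("COMPONENTID")]
        entries[term.get("Name")] = [names[component_id] for component_id in ids]
    return entries


index_entries = publish_index(INDEX_XML, COMPONENT_NAMES)
```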

Cross-references in XML

In XML, cross-references all follow the structure of a text link that I discuss in the section "Cross-References in Relational Databases," earlier in this white paper. The simplest form of a cross-reference may look as follows in an XML structure:

<LINK>For more information, see <REFERENTID>HR3</REFERENTID></LINK>

The reference is enclosed in the <LINK> element, and the referent ID is enclosed in the <REFERENTID> element. To add more control to the link, you can add some extra attributes, as in the following example:

<LINK Type="formore" Autotext = "yes"

Position="inline"><REFERENTID>HR3</REFERENTID></LINK>

In this case, the following extra attributes are in the link:

• The Type attribute names the cross-referencing strategy that this link is part of. In this case, I specified that the link is part of the "formore" strategy. Based on the value of this attribute, I could trigger different publication and management functions for the cross-reference.

• The Autotext attribute specifies whether the reference is created by using a standard text string or the string provided in the <LINK> element. In this case, I specified that the link should use standard text. Thus I could make the reference text, "For more information, see," into a standard text block that the CMS publishing templates automatically access and insert. If I have multiple publication formats, I could create different standard text blocks that are appropriate to each format (Web, print, e-mail, and so on).

• The Position attribute specifies how the link ought to be positioned in the publications you produce. In this case, I specified "inline," which I intend to mean, "Position the link right at the place the <LINK> tag is in the text." I may have other options for positioning the link at the top or bottom of the element or component in which it's embedded.

There's nothing special about the attributes and values I chose to include in the <LINK> element. I invented these particular structures to illustrate the point that you can add as much structure to the representation of your links as you want to achieve the kind of control you want over them. As I've said before, however, creating structure is easy; finding the staff time to learn and enter all the structure you create is the hard part.
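A publishing template could act on these attributes along the following lines. This is only a sketch: the standard text blocks, the publication format keys, the component name lookup, and the function name `render_link` are assumptions, and the Position attribute is ignored here for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical standard text blocks, keyed by link type and publication format.
STANDARD_TEXT = {
    ("formore", "web"): "For more information, see ",
    ("formore", "print"): "See also ",
}

# Stand-in for fetching the referent's name from the repository.
COMPONENT_NAMES = {"HR3": "Dental Plan"}


def render_link(link_xml, publication):
    """Return the reference text for one <LINK> in the given publication format."""
    link = ET.fromstring(link_xml)
    referent_id = link.findtext("REFERENTID").strip()
    if link.get("Autotext") == "yes":
        # Use the standard text block for this strategy and format.
        lead = STANDARD_TEXT[(link.get("Type"), publication)]
    else:
        # Fall back to whatever text the author typed inside <LINK>.
        lead = link.text or ""
    return lead + COMPONENT_NAMES[referent_id]


auto_link = ('<LINK Type="formore" Autotext="yes" Position="inline">'
             '<REFERENTID>HR3</REFERENTID></LINK>')
manual_link = "<LINK>For more information, see <REFERENTID>HR3</REFERENTID></LINK>"

print_text = render_link(auto_link, "print")
web_text = render_link(manual_link, "web")
```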

Because XML is always accessible down to the smallest element, it's easy to link to and from any level of content. As I mention in the section "Cross-References in Relational Databases," earlier in this white paper, it's troublesome in a relational database to link from any chunk smaller than a whole component. Because all links are text links in XML, the link can be embedded at any level of a component. In addition, in a relational database it's difficult to link to any chunk smaller than a full component. In XML, any level of element can have an ID associated with it and so be the target of a cross-reference. Consider, for example, the following link:

<LINK>For more information, see <REFERENTID>HR3</REFERENTID></LINK>

What chunk of content does the ID HR3 refer to? It may be a component, a component element, or even a low-level <P> tag within a component element. Except for any extra differences in presentation that you may want to have, it makes no difference to the XML what level of structure you refer to.
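The point that any level of element can be a link target can be sketched by building one lookup table over every element that carries an ID. The fragment below is hypothetical: it places IDs on a component, a component element, and a `<P>` tag purely to show that the lookup treats them identically.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with IDs at three different levels of structure.
REPOSITORY_XML = """
<HRBENEFITS>
  <HRBENEFIT ID="HR1">
    <TEXT ID="HR2"><P ID="HR3">Our great 401K plan</P></TEXT>
  </HRBENEFIT>
</HRBENEFITS>
"""


def build_id_map(root):
    """Map every ID in the tree to its element, whatever the element's level."""
    return {el.get("ID"): el for el in root.iter() if el.get("ID")}


id_map = build_id_map(ET.fromstring(REPOSITORY_XML))

# Resolving the referent works the same way for a component or a paragraph.
target = id_map["HR3"]
```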

You can use the Type attribute of the <SEQUENCE> element to trigger any particular management or publishing functions you may have that are particular to this sort of sequence. "Topics" type sequences, for example, may have a particular icon that's used in the sequence links.

Storing the content model

As is the case with relational databases, much of the content model is in the XML code that stores the components. Generally, XML behaves more like the abstract relational database model, in which the names and allowed values of the component classes and elements are stored in one place and the values are stored elsewhere. In the relational database world, programmers had to invent a way to separate the structure of their components from the components themselves. In the XML world, structure is always separate from data. XML uses Document Type Definitions (DTDs) or XML Schemas to define the structure that the data must follow.

Here is the segment of XML that I used earlier to represent an HR component:

For simplicity, I made all component elements into XML elements. In real life, you'd make some of your component elements XML attributes, as in the following example:

<IMAGEPATH>images/hr/joesmile.jpg</IMAGEPATH>

<TEXT>Our great 401K plan </TEXT>

</HRBENEFIT>

With one exception, these two versions of the HRBenefit component are informationally equivalent. In the second example, rather than typing in the name of an author, I use an author ID that points to an author structure that's somewhere else. Although the two versions may be informationally equivalent, the second is much easier to manage. By making some of the component elements XML attributes, you can use some of the features of an XML DTD to control them. Here's a DTD that you may use to specify the allowed structure of HR components of the second variety:

<!ELEMENT HRBENEFITS (HRBENEFIT)+>

<!ELEMENT HRBENEFIT (NAME, IMAGEPATH, TEXT)>

<!ATTLIST HRBENEFIT

ID ID #REQUIRED

Type (Standard | Extended) #REQUIRED

AuthorID IDREF #REQUIRED

EmployeeType (FT | PT | ALL) #REQUIRED

>

<!ELEMENT NAME (#PCDATA)>

<!ELEMENT IMAGEPATH (#PCDATA)>

<!ELEMENT TEXT (#PCDATA)>

Without going into too much detail about how to construct a DTD, I can point out some significant ways in which this DTD specifies component and element structure, as follows:

• Nesting: The first two lines of the DTD establish the way XML tags may nest within each other. The first line states that an <HRBENEFITS> element may have within it one or more <HRBENEFIT> elements. That's what the phrase (HRBENEFIT)+ means. The second line establishes that the <HRBENEFIT> element must contain a <NAME>, <IMAGEPATH>, and <TEXT> element in that order. That's what the phrase (NAME, IMAGEPATH, TEXT) means. DTDs enable you to specify and enforce existence, number, and order of all allowed elements.

• Component classes: The DTD has no specific syntax for classes. In fact, as I've said before, any XML element can be a component class. In this case, the <HRBENEFITS> element defines the class. If I wanted to say more than simply that the class exists, I could add attributes or other child elements to the <HRBENEFITS> element to add parameters to the class as a whole. I may use this feature, for example, to specify how often components of this class are to be reviewed.

• Component instances: The <HRBENEFITS> element defines parameters for the class as a whole. The <HRBENEFIT> element defines how component instances must be structured. In this case, line 2 of the DTD specifies that each HRBenefit component has a <NAME>, <IMAGEPATH>, and <TEXT> child element. Lines 3 through 8 specify that each HRBenefit component has an ID, Type, AuthorID, and EmployeeType attribute. ATTLIST means attribute list. The list begins on the line with the word ATTLIST and ends with a closing angle bracket (>).

• XML attributes as component elements: The lines of the DTD that define attributes tell you a lot about what kinds of component elements they are. Line 4 states that the ID component element is a required ID. That means that it must be supplied; it must begin with a letter; and it must be unique throughout the XML document (and so throughout a repository that's held in a single document). Line 5 states that the Type component element is a required closed list with allowed values of "Standard" or "Extended." Line 6 states that the AuthorID component element is a required ID reference. That means that it must contain a valid ID from some other element in the repository (some author component in this case). Finally, line 7 states that the EmployeeType component element is a required closed list with allowed values of "FT," "PT," or "ALL."

• XML elements as component elements: The final three lines of the DTD define the structure of the Name, ImagePath, and Text component elements. In this case, they're all defined in the same way. Each is allowed to be just text with no extra markup inside it. If I had wanted to go further in the definition, I could have defined attributes and child elements for any of or all these to further define them. Given that the <TEXT> element may contain additional markup, for example, I could have defined its allowed child elements and attributes to whatever level I wanted.
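To show what these rules amount to in practice, here is a hand-written Python check of the main constraints the DTD expresses. It is a stand-in for illustration only, not a DTD validator: a validating XML parser would enforce the DTD directly, and Python's standard `xml.etree.ElementTree` doesn't validate. The function name, the sample documents, and the `other_ids` parameter (IDs of author components assumed to live elsewhere in the same document) are assumptions. The sketch doesn't check the rule that IDs begin with a letter or that the child elements contain only text.

```python
import xml.etree.ElementTree as ET

# Closed lists and child order copied from the DTD above.
CLOSED_LISTS = {"Type": {"Standard", "Extended"},
                "EmployeeType": {"FT", "PT", "ALL"}}
CHILD_ORDER = ["NAME", "IMAGEPATH", "TEXT"]


def check_benefits(root, other_ids):
    """Return a list of rule violations found under an HRBENEFITS element."""
    problems = []
    benefits = root.findall("HRBENEFIT")
    if not benefits:
        problems.append("HRBENEFITS must contain at least one HRBENEFIT")
    seen = set(other_ids)
    for benefit in benefits:                        # ID: required and unique
        cid = benefit.get("ID")
        if cid is None:
            problems.append("HRBENEFIT is missing its ID")
        elif cid in seen:
            problems.append(f"duplicate ID {cid}")
        else:
            seen.add(cid)
    for benefit in benefits:
        cid = benefit.get("ID")
        for attr, allowed in CLOSED_LISTS.items():  # closed lists
            if benefit.get(attr) not in allowed:
                problems.append(f"{cid}: {attr} is not an allowed value")
        if benefit.get("AuthorID") not in seen:     # IDREF must resolve
            problems.append(f"{cid}: AuthorID does not match any ID")
        if [child.tag for child in benefit] != CHILD_ORDER:
            problems.append(f"{cid}: children must be NAME, IMAGEPATH, TEXT in order")
    return problems


GOOD = """<HRBENEFITS>
  <HRBENEFIT ID="HR1" Type="Standard" AuthorID="A21" EmployeeType="FT">
    <NAME>401K Plan</NAME><IMAGEPATH>images/hr/joesmile.jpg</IMAGEPATH><TEXT>Our great 401K plan</TEXT>
  </HRBENEFIT>
</HRBENEFITS>"""

BAD = """<HRBENEFITS>
  <HRBENEFIT ID="HR1" Type="Deluxe" AuthorID="A99" EmployeeType="FT">
    <IMAGEPATH>images/hr/joesmile.jpg</IMAGEPATH><NAME>401K Plan</NAME><TEXT>Our great 401K plan</TEXT>
  </HRBENEFIT>
</HRBENEFITS>"""

good_problems = check_benefits(ET.fromstring(GOOD), other_ids={"A21"})
bad_problems = check_benefits(ET.fromstring(BAD), other_ids={"A21"})
```

The second document breaks three rules at once (a Type outside the closed list, an AuthorID that points nowhere, and child elements out of order), which is exactly the kind of error the DTD exists to catch.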

Not all element types can be represented in a DTD. Because you type the allowed values of a list right into the definition of a list attribute, for example, it's not possible (without some trickery) to create an open list by using a DTD. To get around the limitations of DTDs, you can create an abstraction similar to the one in this white paper for relational databases. Rather than coding the list right into the DTD, you can store it as data in an XML file. Even given the limited example that I provide, I hope that you can still get a feeling for how the content model may be represented in a DTD.
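One way to picture an open list stored as data is sketched below. Everything in it is an assumption made for illustration: the `<LISTS>`, `<LIST>`, and `<VALUE>` element names, the "Department" list, and the two helper functions. The point is only that the allowed values live in XML data that the system can read and extend, rather than in the DTD itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical file of list definitions kept alongside the components.
LISTS_XML = """
<LISTS>
  <LIST Name="Department">
    <VALUE>Finance</VALUE>
    <VALUE>Engineering</VALUE>
  </LIST>
</LISTS>
"""


def load_list(lists_root, name):
    """Return the current values of the named list."""
    for lst in lists_root.findall("LIST"):
        if lst.get("Name") == name:
            return [v.text.strip() for v in lst.findall("VALUE")]
    raise KeyError(name)


def add_value(lists_root, name, value):
    """Open lists can grow: append a new value to the named list."""
    for lst in lists_root.findall("LIST"):
        if lst.get("Name") == name:
            ET.SubElement(lst, "VALUE").text = value
            return
    raise KeyError(name)


lists_root = ET.fromstring(LISTS_XML)
before = load_list(lists_root, "Department")
add_value(lists_root, "Department", "Legal")
after = load_list(lists_root, "Department")
```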

relational database or XML structure. In these cases, the files are referred to rather than included in the database or XML. In addition, you may choose to store some of your components as separate XML files, as illustrated in Figure 14.

Figure 14: Files can be stored and referenced in a database. In addition, components can be stored as separate XML files.

Rather than putting an image or other media file into a database, you can simply refer to its file name in the database and keep the file in a directory where it can be accessed easily if needed.

Note

If you store media in files, make sure that the directory is protected. You don't want people to access these files directly. The CMS must be between the users and the files to stay synchronized and in control of them.

Many of today's systems keep a set of individual XML files inside the repository to store individual components. If you use a system such as this, make sure that you have some way of overcoming the following potential problems:

• Structure inconsistency: Each XML file type is its own world. It's validated by its own DTD with its own structure and rules. If you have ten different kinds of XML files in your repository, you need ten different DTDs to validate them. You need some overarching process or technology to ensure that the structures don't conflict with each other and can be managed as a whole.

• Integrity: Within a single XML file, it's easy to ensure that your links are valid. By using ID and IDREF attribute types, any validating XML parser tells you whether you have a nonunique ID or a reference to an ID that doesn't exist. You don't get this advantage across XML files. Instead, you need extra software to resolve and validate any references that cross file boundaries.

• Performance: There may be times when you need to open a large number of individual files to find what you're looking for. In these cases, your system may be extremely slow. It gets proportionally slower the more files you add. And, although there's no file size limitation to an XML file, there may be file size limitations in the tools you're using to handle the XML, or the performance of these tools may start to degrade if your XML file gets extremely large.

• Access structures: Access structures span components. To create access structures in a system that has multiple files, you must create yet more files that reference the individual files. It's not impossible to do so; it's just a lot of extra work that you wouldn't need to do if you use a single XML file or object database.

If you choose to store multiple XML files, make sure that you understand the cost and are prepared to take responsibility for knitting them into a single, unified repository.
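The "extra software" needed for integrity across files can be quite small in principle. The following sketch assumes a repository held as a set of named XML documents (shown here as strings rather than files on disk) and checks that IDs are unique and that every AuthorID reference resolves somewhere in the set. The file names, the `<AUTHORS>` and `<AUTHOR>` elements, and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical multi-file repository, keyed by file name.
FILES = {
    "authors.xml": '<AUTHORS><AUTHOR ID="A21"><NAME>Example Author</NAME></AUTHOR></AUTHORS>',
    "hr1.xml": '<HRBENEFIT ID="HR1" AuthorID="A21"/>',
    "hr2.xml": '<HRBENEFIT ID="HR2" AuthorID="A99"/>',
}


def check_cross_file_references(files, ref_attributes=("AuthorID",)):
    """Return (duplicate IDs, broken references) across all the files."""
    roots = {name: ET.fromstring(text) for name, text in files.items()}

    # First pass: collect every ID and note any that appear twice.
    id_locations = {}
    duplicates = []
    for name, root in roots.items():
        for el in root.iter():
            element_id = el.get("ID")
            if element_id is None:
                continue
            if element_id in id_locations:
                duplicates.append((element_id, name))
            else:
                id_locations[element_id] = name

    # Second pass: every reference must point at an ID collected above.
    broken = []
    for name, root in roots.items():
        for el in root.iter():
            for attr in ref_attributes:
                ref = el.get(attr)
                if ref is not None and ref not in id_locations:
                    broken.append((name, ref))
    return duplicates, broken


duplicates, broken = check_cross_file_references(FILES)
```

Opening every file for each check also illustrates the performance concern noted above: the cost of this pass grows with the number of files.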

Implementing Localization Strategies

Your ability to localize content is gauged most by the amount of human effort that you can muster to do the localization. The more people you have, the more you can localize. To make the most of the localization resources you can muster, you need a streamlined and efficient localization system in your CMS. The most efficient system presents only the exact information that needs to be localized and doesn't require the localizer to understand the general structure of your CMS repository.

In designing your repository to support localization, you have the following two basic options for how to store localized content:

• By component: You can choose to make multiple complete copies of each component you intend to localize.

• By element: You can choose to make multiple copies of only the elements of the components you intend to localize.

Suppose, for example, that you have a component that needs to be localized for three localities. You could make three copies of the component, one per locality. Each variant would somehow be keyed to the same base component. If you determine that only the title and body elements need to be localized, however, you could create one component that has three versions of the title and three versions of the body element.

The element method has the advantage of less duplication. If you make complete copies of the component, you must duplicate all the information that's not localized in all the variants of the component. You have an added burden of keeping all the versions of the component synchronized. One significant disadvantage of the element method is the difficulty of data sharing. Depending on how your repository is created, it may be hard to share ownership of individual elements of a component without giving up ownership of the whole component.

Clearly, there are ways to mitigate the disadvantages of both localization methods. So it's not the case that you should necessarily prefer one method to the other. In addition, if you're using a commercial CMS, the localization functionality it offers may decide for you what level of localization you can support. Still, if you have a choice, you may design for an element localization model to minimize duplication and maximize flexibility.

I describe here both localization models using the XML form of the HRBenefit components that I introduce in the section "Storing Component Classes and Instances," earlier in this white paper. First, to do component-level localization, you need only add the following two new attributes on the HRBenefit component:

<HRBENEFIT Locale="IntEnglish" FamilyID="HRF1" ID= "U100"

Type="Standard" AuthorID="A21" EmployeeType="FT">

As an alternative, you could simply pack more information into the original ID attribute. You could, for example, create IDs of the form HR1_IntEnglish. In this way, you can save on attributes; the more elaborate method, however, is more flexible.
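With the Locale and FamilyID attributes in place, choosing the right copy of a component becomes a simple lookup. The sketch below assumes a second, NAEnglish copy with the hypothetical ID U101 and a fallback to the IntEnglish copy when a locale has no variant; neither assumption comes from the original example.

```python
import xml.etree.ElementTree as ET

# Two complete copies of one component, keyed to the same family.
REPOSITORY_XML = """
<HRBENEFITS>
  <HRBENEFIT Locale="IntEnglish" FamilyID="HRF1" ID="U100"
             Type="Standard" AuthorID="A21" EmployeeType="FT"/>
  <HRBENEFIT Locale="NAEnglish" FamilyID="HRF1" ID="U101"
             Type="Standard" AuthorID="A21" EmployeeType="FT"/>
</HRBENEFITS>
"""


def find_variant(root, family_id, locale, fallback="IntEnglish"):
    """Return the copy of a component family for a locale, or the fallback copy."""
    by_locale = {
        benefit.get("Locale"): benefit
        for benefit in root.findall("HRBENEFIT")
        if benefit.get("FamilyID") == family_id
    }
    variant = by_locale.get(locale)
    # Compare with None explicitly; an element without children is falsy.
    return variant if variant is not None else by_locale.get(fallback)


root = ET.fromstring(REPOSITORY_XML)
north_american = find_variant(root, "HRF1", "NAEnglish")
unlocalized = find_variant(root, "HRF1", "French")
```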

To modify the HRBenefit component for element-level localization, you may use a structure like that of the following example:

system renders this component, it can decide which locale to use and find the appropriate elements based on the Locale attribute.

<VARIANT> elements also have ID and Rev attributes. These two attributes can form the basis of a localization management system. Because each variant has its own ID, it can be located and updated individually. You can find, display, and update local elements individually. The Rev attribute can tell you if a change is made to one variant that needs to be propagated to the others. Notice, for example, that the IntEnglish variant of the <TITLE> element is at Rev 3 while the NAEnglish variant is only at Rev 2. This can trigger a workflow step to review the localization of the NAEnglish variant and update it with the IntEnglish variant.
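Both uses of the `<VARIANT>` attributes, selecting by Locale and spotting out-of-date variants by Rev, can be sketched briefly. The `<VARIANT>` element and its Locale, ID, and Rev attributes come from the description above; the title text, the variant IDs, and the function names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical component with two localized variants of its title.
COMPONENT_XML = """
<HRBENEFIT ID="HR1">
  <TITLE>
    <VARIANT Locale="IntEnglish" ID="V1" Rev="3">Our Pension Programme</VARIANT>
    <VARIANT Locale="NAEnglish" ID="V2" Rev="2">Our 401K Program</VARIANT>
  </TITLE>
</HRBENEFIT>
"""


def localized_text(component, element_name, locale):
    """Return the text of an element's variant for one locale, or None."""
    for variant in component.findall(f"./{element_name}/VARIANT"):
        if variant.get("Locale") == locale:
            return variant.text
    return None


def stale_variants(component, element_name):
    """Return IDs of variants whose Rev lags behind the newest variant."""
    variants = component.findall(f"./{element_name}/VARIANT")
    latest = max(int(v.get("Rev")) for v in variants)
    return [v.get("ID") for v in variants if int(v.get("Rev")) < latest]


component = ET.fromstring(COMPONENT_XML)
title = localized_text(component, "TITLE", "NAEnglish")
needs_review = stale_variants(component, "TITLE")
```

A list such as `needs_review` is the kind of result that could feed the workflow step mentioned above.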

In real life, your localization structures and system may be quite a bit more complex. You should be able to see from these examples, however, the increased flexibility that the element method gives you over the component method. The way that you store localized content can lead naturally to particular localization processes.

Doing Management Physical Design

In the logical design process, you decide on the kinds of content that you need to manage. In the physical design process, you figure out the software and hardware that you need to effectively accomplish your logical design. In the sections that follow, I outline some of the physical design issues that you may need to tackle as you build your own or enhance a commercial CMS. I present what I consider to be the ultimate or ideal functionality that a CMS should offer. You may get nowhere near this ideal in your system, but you should at least keep the ideal in mind and move toward it as you design and then implement your CMS.

A repository-wide DTD

Given that you can represent a very complex and complete content model in a DTD (as discussed earlier in this white paper), it's surprising to me that I haven't come across a CMS product company with an XML repository that uses DTDs (or schemas) to control the overall structure of its repository. What I've seen is that the repository as a whole must be well-formed XML but is never validated as a whole. Many products support the use of any number of DTDs to validate content before it's input to the repository. This results in a repository that has multiple, possibly conflicting, DTDs in charge of different parts of the repository. I can only guess that these companies believe that the task of creating a structure for the entire repository is prohibitively complex for most customers. Be this as it may, until you can model and control the entire repository, you can never control the entire system. If you develop your own repository and you choose to use XML, make sure that you find a way to have a repository-wide DTD or schema.

Note

Of course, if you're using a relational database repository, the same comments apply. You still want to have an overall repository structure. In the relational database world, however, this has always been a natural thing to do.

Link checking

Make sure that you build robust link checking into your system if it's not part of a system that you buy. A robust linking system has the following sorts of functions:

• Authoring: The system should have the capability to author links at the component, element, or text level (as I describe in the section "Cross-References in Relational Databases," earlier in this white paper).

• Structure: The system ought to enable you to pack additional information into a link. In addition to the reference and referent ID, the system ought to enable you to specify link IDs, link types, and any other parameters that you may want to manage and render the link to your specifications.

• Management: The system ought to enable you to find and repair broken links automatically. A broken link can have an internal or external referent. For internal referents, the system ought to alert you if you delete a component that's used as a referent somewhere else. The ideal user interface for this would be some kind of dialog box that lists the titles and IDs of the components that link to the one that you're trying to delete. From this dialog box, you ought to be able to globally change the referent to be a different component or open referring components individually to decide what to do with each link. A much more subtle piece of functionality would be to have your system detect if you change the title (or other significant elements) of a component and ask you whether you want to review links. Often significant content changes to a component invalidate some of the links you've forged to it. For external links, the system must check the link periodically. For links to Web URLs, the system can try to access the URL and see whether it still exists. Of course, this doesn't tell you whether the URL is still appropriate. As an enhancement, the system can try to determine whether the target page has changed since the last access by storing an image of the page linked to and comparing it with the current one. If the current page is different from the last one, the system can trigger a workflow to have someone verify that the page is still valid. If this process can't be automated, the system should be capable of compiling lists periodically and initiating workflow to tell someone that it's time to check the validity of the links manually.

• Rendering: If the system has only one way of rendering links (as HTML <A> tags, for example), you may need to augment it so that it can produce the other sorts of links that you may need. You may, for example, need to write your own extension to create print-format links ("for more information, see" and so on).
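Two of the management functions above lend themselves to a short sketch: listing the components that refer to one you're about to delete, and detecting that an external page has changed. The sketch makes several assumptions. The `<REPOSITORY>` wrapper, the sample components, and the function names are hypothetical. For change detection it swaps in a SHA-256 fingerprint of the page content in place of a stored copy of the page, and it leaves the actual fetching of the URL out entirely.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical repository in which HR1 links to HR3.
REPOSITORY_XML = """
<REPOSITORY>
  <HRBENEFIT ID="HR1"><NAME>401K Plan</NAME>
    <TEXT><LINK>For more information, see <REFERENTID>HR3</REFERENTID></LINK></TEXT>
  </HRBENEFIT>
  <HRBENEFIT ID="HR2"><NAME>Dental Plan</NAME><TEXT>No links here</TEXT></HRBENEFIT>
  <HRBENEFIT ID="HR3"><NAME>Health Plan</NAME><TEXT>Details</TEXT></HRBENEFIT>
</REPOSITORY>
"""


def referring_components(root, referent_id):
    """List (ID, name) of each component that links to referent_id."""
    found = []
    for component in root:
        refs = [r.text.strip() for r in component.iter("REFERENTID")]
        if referent_id in refs:
            found.append((component.get("ID"), component.findtext("NAME")))
    return found


def page_fingerprint(page_bytes):
    """Summarize a page's content so that it can be compared later."""
    return hashlib.sha256(page_bytes).hexdigest()


def page_changed(stored_fingerprint, current_page_bytes):
    """True if the page content differs from what was seen last time."""
    return stored_fingerprint != page_fingerprint(current_page_bytes)


root = ET.fromstring(REPOSITORY_XML)
blockers = referring_components(root, "HR3")
```

A delete operation could show `blockers` in the dialog box described above before letting the deletion proceed. A changed fingerprint says only that the page is different, not that the link is wrong, so it should trigger a review rather than an automatic repair.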

Note

My comments on cross-references apply also to all the access structures that you may store in your repository. You may get a lot of Web-format TOC functionality out of your CMS, for example, but need to write your own print-format TOC functions.

Media checking

Make sure that, in some way, your system has the capability to check the validity of media references that appear within text. In many systems, the media references (such as <IMG> tags in HTML) that are embedded in the middle of a block of text go unnoticed and can be broken without any clue. The best way to track these internal media references is to track them as separate components. Then you can create a status element that you can query and report on to know the status of each media item in your system. As a bonus, the media components can have extra elements for whatever variations you may need in different publications. If print publications need one image caption while Web publications need another, for example, your media component can have elements to store the caption variants.

The following two media management functions are vital to create if they aren't included in any system that you buy:

• The first "must-have" function validates media references. This function should be run periodically against all media references to files that aren't controlled by the CMS. If the system includes references to sound files that are on a Web server somewhere, for example, the validator should check periodically that these files are still there.

• The second "must-have" function deploys files. This function finds and deploys the files that are mentioned in components. (By deploy, I mean copying the file to its destination location for the publication, most typically a "live" Web server.) This function is especially necessary if you don't store media (and other files) as separate components but rather as simple references within text blocks (for example, if you store image references with an <IMG> tag in an HTML text element). In this case, it's likely that your system can't find and deploy the file just because it's in some <IMG> tag. You must add the code to ensure that files that are referenced this way can be found and deployed automatically.

• Even files that are stored in separate components may need extra code to find their way to the correct deployment destination. It's entirely possible that your deployment needs for files go beyond the stock functionality of your system. You may, for example, need to deploy media to servers that have the storage capacity to hold large files based on the size of the files. You may deploy them by type to servers that have the special software needed to deliver different media types. You may have a combination of the two methods and include other deployment methods as well.

As complex as your deployment system is, if it's logical (and it may take considerable effort to make it so), it can be reduced to a mechanical process.
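Finding media references that are buried in text blocks is the first step for both functions. The following sketch uses Python's standard `html.parser` to pull `<IMG>` sources out of a block of HTML text and compares them against a set of files the system knows about. The sample text, the second image path, and the set of available files are assumptions; a real validator would check the file system or a Web server instead.

```python
from html.parser import HTMLParser


class ImageReferenceCollector(HTMLParser):
    """Collect the src value of every <IMG> tag found in a text block."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_media_references(text_block):
    """Return the image paths referenced inside a block of HTML text."""
    collector = ImageReferenceCollector()
    collector.feed(text_block)
    collector.close()
    return collector.sources


def missing_media(text_block, available_files):
    """Return referenced images that aren't among the available files."""
    return [src for src in find_media_references(text_block)
            if src not in available_files]


TEXT_BLOCK = ('See <IMG SRC="images/hr/joesmile.jpg"> and '
              '<img src="images/hr/missing.jpg"/> for details.')
AVAILABLE = {"images/hr/joesmile.jpg"}

references = find_media_references(TEXT_BLOCK)
broken_media = missing_media(TEXT_BLOCK, AVAILABLE)
```

The same list of references could drive a deployment step that copies each file to its destination.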

Search and replace

Given the obvious and large need for a trustworthy and comprehensive search-and-replace capacity in a CMS, it's surprising that there isn't better support for it in commercial products. A thorough search-and-replace function in a CMS should do the following:

• Work in a familiar way: Just about every editing program has a search-and-replace function. The most basic ones just enable you to enter a phrase to find and a phrase to replace it with. Most enable you to replace phrases one at a time or in bulk. The better ones enable you to match case, find whole words or fragments of words, search forward or backward in the file, and undo. The best give you all this and also enable you to restrict the search to any part of the structure of the information you're searching. A good CMS deserves the best kind of search-and-replace functionality.

• Search everywhere: Think of all the places that text is stored in a CMS. Aside from the repository, which can have an enormous variety of nooks and crannies in which a phrase may be found, there's the entire configuration system of the CMS with text in many flat files, XML files, system registries, database tables, and who knows where else? Then there's the whole publishing system. Phrases may certainly need to be found and replaced in templates and other publication-related files.

• If the CMS did no more than state somewhere all the locations where text you may want to change is stored, it would help you a lot. But that's only half the battle. (And as hard as it is, it's still the easier half.) The other half is creating a system that can open, search, and update each text storage file correctly. The CMS must have enough knowledge of each type of text storage file that it can manipulate it without messing it up. Now, obviously, there are going to be configuration files that aren't worth interfacing with. But these unparsable files should at least be listed so that you know where else to check if you haven't found the text you're seeking. Any files that can be parsed and updated (and that should be most of them) ought to be accessible to the search-and-replace function.

• Follow structure: The search-and-replace function ought to show you the structure of at least your repository (if not some of the other text storage files) so that you can choose where to search. If you're using an XML system, your structure is a hierarchy from which you may choose branches to search. Or you may want to search by element or attribute, regardless of where in the hierarchy it is. If you're using a relational database, your structure is tables and columns from which to narrow your search. Or you may want to search for a phrase, regardless of what column it's in.

• You can imagine that, the more complex your structure becomes, the harder it is to represent. By layering your user interface, however, you should be able to make a lot of the repository structure accessible. Design the structure selector to zoom in and out on the structure. The highest-level view (only first-level XML elements or whole relational database tables) ought to be useful to anyone and should be shown by default. More advanced users should be able to drill down as far into the structure as they desire. The layering also makes the selector a great learning tool. Users can explore the structure of the repository in a comprehensive but nonthreatening way.

• Support regular expressions: Regular expressions are a search syntax that enables you to specify just about any text-matching rule you can imagine. The familiar wildcard character (*) is just one simple example of the kind of text matching that regular expressions make possible (in a regular expression, the equivalent of the wildcard is written .*). Regular expressions express OR directly through alternation and, in many dialects, can express NOT and AND through negated character classes and lookahead. They also offer a number of "stand-in" characters that work in ways similar to the * wildcard. Adding regular expressions to a search-and-replace function is like adding steroids to an athlete. They give search-and-replace a much finer-grained discrimination than you can achieve in a regular text box.

• Be programmable: A search-and-replace function this good shouldn't be left only to end users. It ought to be programmable. If your CMS search-and-replace function is designed as a programmable object, there's no reason why you can't use it in any automation programs you write for the CMS.

• Follow security: Of course, the search-and-replace function ought to be completely personalizable. That is, for each user, the search should be confined to the text to which she's been granted access. And, of course, the system should be able to use the access permissions that are stored and managed in the operating system's user administration.
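The search-and-replace qualities listed above (following structure, supporting regular expressions, being programmable, and following security) can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular CMS: the class name SearchAndReplace, the sample repository, and the permissions dictionary are all hypothetical, and the dictionary merely stands in for the access permissions that a real system would read from the operating system's user administration.

```python
import re
import xml.etree.ElementTree as ET


class SearchAndReplace:
    """Hypothetical programmable search-and-replace over an XML repository."""

    def __init__(self, root, permissions):
        self.root = root                # parsed XML repository
        self.permissions = permissions  # assumed: user name -> set of editable element tags

    def replace(self, user, pattern, replacement, tag=None):
        """Apply a regular-expression replacement to element text.

        tag limits the search to one element type wherever it occurs in the
        hierarchy; None searches every element. Returns the number of
        elements whose text changed.
        """
        allowed = self.permissions.get(user, set())
        regex = re.compile(pattern)
        changed = 0
        for element in self.root.iter(tag):
            # Skip text the user has no access to, and elements with no text.
            if element.tag not in allowed or element.text is None:
                continue
            new_text, count = regex.subn(replacement, element.text)
            if count:
                element.text = new_text
                changed += 1
        return changed


# Usage with a made-up repository and made-up users.
repository = ET.fromstring(
    "<repository>"
    "<article><title>Color guide</title><body>Pick a color or colour.</body></article>"
    "<note><body>No colour here.</body></note>"
    "</repository>"
)
tool = SearchAndReplace(repository, {"editor": {"title", "body"}, "guest": {"title"}})

# Search only <body> elements, at any depth, for either spelling.
changed = tool.replace("editor", r"\bcolou?r\b", "hue", tag="body")
```

Because the function is an ordinary object, an automation script can call it exactly as the user interface would. The same call made as the hypothetical user "guest" changes nothing in the body elements, because that user has access only to titles.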

• XML as an integration medium: XML has become the standard method to interface between any two systems that need to share data. The provider system outputs XML, and the receiver system inputs XML. Both systems use the same DTD or schema to ensure that there's no mismatch. If you have an XML repository, you can provide a transformation program that maps your DTD to the shared one. The transformation is subject to problems of ambiguity, no correspondence, and no existence (as I outline in white paper 30, "Processing Content," in the section "Principles of mapping") but can often be programmed without too much trouble by standard XML tools. If you have a relational database repository, the same applies. You must resolve the same transformation problems, but the rest of the mapping should be fairly easy to accomplish. All in all, XML solves the mechanical problems of transferring data but still leaves you to figure out how to fit together two structures that conflict. At least if you do your part of the work, XML provides the tools to implement your solution.

• Standard database interfaces: Your CMS may need to include an easy-to-use tool for probing and building connections to standard databases. Using standards such as ODBC, you can get tools that enable you to connect to, issue queries against, and view data in a surprisingly large percentage of the standard database products. The best of these have a nontechnical interface for connecting to simple databases that can be used by administrators, plus the capability for the connection to be extended by programmers using a standard language.

• Launching: Make sure that you have some way to launch other applications from within your system. You want to launch all your standard authoring tools as well as any CMS services
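As a rough illustration of the XML integration and standard database interface points above, the following Python sketch reads rows from a relational repository through a standard database interface and writes them out under the element names of a shared structure. Everything named here is an assumption made for the example: the articles table, the COLUMN_TO_SHARED mapping, the shared element names, and the export_rows function are hypothetical, and Python's built-in sqlite3 module stands in for whatever ODBC-style connection a real system would use.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping from local column names to element names in the shared
# structure. A column with no counterpart is left out of the map; deciding
# that is the human part of the work that XML does not do for you.
COLUMN_TO_SHARED = {"headline": "title", "author_name": "byline", "body_text": "body"}


def export_rows(connection, table, record_tag="item"):
    """Return an <export> element holding one record_tag child per table row.

    The table name is assumed to come from trusted configuration, not from
    user input, because it is placed directly into the SQL text.
    """
    cursor = connection.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    export = ET.Element("export")
    for row in cursor:
        record = ET.SubElement(export, record_tag)
        for column, value in zip(columns, row):
            shared_name = COLUMN_TO_SHARED.get(column)
            if shared_name is None or value is None:
                continue  # no correspondence in the shared structure
            ET.SubElement(record, shared_name).text = str(value)
    return export


# Usage with an in-memory database and one made-up row.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE articles (id INTEGER, headline TEXT, author_name TEXT, body_text TEXT)"
)
connection.execute(
    "INSERT INTO articles VALUES (1, 'Launch day', 'A. Writer', 'We shipped.')"
)
shared = export_rows(connection, "articles")
xml_text = ET.tostring(shared, encoding="unicode")
```

The mechanical part (querying the rows and emitting well-formed XML) takes only a few lines. The mapping table is where the problems of ambiguity, no correspondence, and no existence have to be resolved by a person.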
