Pattern Identification in Public Health Data Sets: The Potential Offered by Graph Theory
Peter A Bath, Cheryl Craigs, Ravi Maheswaran, John Raymond,
and Peter Willett
CONTENTS
8.1 Introduction
8.1.1 Background
8.1.2 Computational Chemistry and Graph Theory
8.2 Methods
8.2.1 Program
8.2.2 Data
8.2.2.1 Geographical Area
8.2.2.2 Deprivation
8.2.2.3 Standardized Long-Term Limiting Illness for People Aged Less Than 75
8.2.2.4 Adjacency Information
8.2.3 Storage of Information
8.2.4 Queries
8.2.4.1 Query Patterns
8.2.4.2 Query Data File
8.3 Results
8.4 Discussion
Acknowledgments
References
Pattern identification is an important issue in public health, and current methods are not designed to deal with identifying complex geographical patterns of illness and disease. Graph theory has been used successfully within the field of chemoinformatics to identify complex user-defined patterns, or substructures, within molecules in databases of two-dimensional (2D) and three-dimensional (3D) chemical structures. In this paper we describe a study in which one graph-theoretical method, the maximum common substructure (MCS) algorithm, which has been successful in identifying such patterns, has been adapted for use in identifying geographical patterns in public health data. We describe how the RASCAL (RApid Similarity CALculator) program (Raymond and Willett, 2002; Raymond et al., 2002a,b), which uses the MCS method, was utilized for identifying user-specified geographical patterns of socioeconomic deprivation and long-term limiting illness. The paper illustrates the use of this method, presents the results from searches in a large database of public health data, and then discusses the potential of graph theory for use in searching for geographically based information.
8.1.1 Background
The need to identify patterns of illness and disease is common in public health. Examples include the identification of disease clusters and tendencies toward clustering, such as outbreaks of communicable disease (e.g., tuberculosis) and higher than expected prevalence or incidence of diseases (e.g., childhood leukemia). The basic building blocks or units for such patterns may be individuals or geographical units, but the key factor is the association between units in terms of time, space, or other complex links. Searching for patterns of disease using geographically based data can help not only to identify disease clusters in a geographical area but also to identify potential causes of such outbreaks, which may be geographical features themselves or characteristics of a geographical area.

Cluster detection, particularly the identification of geographical disease clusters, has been the subject of intensive research within public health and geographical information sciences (Openshaw et al., 1988; Knox, 1989; Besag and Newell, 1991; Alexander and Cuzick, 1992; Kulldorff, 1999). Within the domain of public health and spatial epidemiology, Besag and Newell (1991) classified tests for disease clustering into two groups. The first comprises general or nonspecific tests that examine the tendency for diseases to cluster. The second group comprises specific tests that assess clustering around predefined points, e.g., nuclear installations, or assess the locational structure of clusters. Among the better-known cluster detection methods are Openshaw's Geographical Analysis Machine (Openshaw et al., 1988), Kulldorff's spatial scan statistic (Kulldorff, 1999), Knox's test (Knox, 1989), and Besag and Newell's method (Besag and Newell, 1991). Issues related to clustering and cluster detection are discussed in detail in recent comprehensive publications in the subject area (Lawson et al., 1999; Elliott et al., 2000). The methods described, however, are all concerned with statistical probability and estimation of effect size. They were not designed to handle complex pattern-searching queries, and there are currently no satisfactory methods available for this purpose.
In the domain of geographical information science, the ability of current software systems to recognize the relationship between neighboring areas is determined by whether the software has the property of topology, and in particular the branch of topology called pointset topology. Pointset topology is concerned with the concepts of sets of points, their neighborhood, and nearness (Worboys, 1995). It is this concept that allows for the analysis of contiguous areas. Many current GIS, such as ArcView 3.2 (2002), do not have this property and so cannot deal with contiguity problems such as identifying complex geographical patterns involving neighboring areas. More sophisticated software such as ArcInfo, however, has topological properties and in theory can identify complex patterns of adjacent neighbors (ArcInfo 8.2, 2002). Nevertheless, three major difficulties are associated with this type of searching. The first problem is that any complex geographical pattern search must be programmed into the software separately, which is time-consuming and requires a high level of programming expertise. The other two problems are that the resulting programs are computationally very intensive and generate very large result files.

In this paper, we describe early work in developing and using techniques that are successfully used in computational chemistry for identifying geographical patterns in public health data.
8.1.2 Computational Chemistry and Graph Theory
In the field of computational chemistry, sophisticated techniques have been developed for the efficient storage and retrieval of various types of chemical information. Highly specified, sophisticated, and flexible searches can be carried out within large databases of molecular structures using techniques derived from graph theory, a branch of mathematics. Graph-theoretical methods of storing 2D and 3D chemical structures have been developed within the Chemoinformatics Research Group in the Department of Information Studies at the University of Sheffield (Willett, 1995, 1999).
Graph theory is used to describe a set of objects, or nodes, and the relationships, or edges, between the nodes. In computational chemistry, nodes are used to represent the atoms in chemical structures. The edges represent the bonds in 2D chemical structure representations and inter-atomic distances in 3D chemical structure representations of the molecule. The resulting graph is called a connection table and contains a list of all the (non-hydrogen) atoms within the structure and their relationships to each other, in terms of bonds (2D) or distances (3D) (Willett, 1995, 1999). Thus, information about molecules can be stored in databases and retrieved using algorithms developed to identify identical structures (called isomorphism). There are three types of isomorphism used to compare pairs of graphs:
• Graph isomorphism, used to check whether two graphs are identical
• Subgraph isomorphism, used to check whether one graph is completely contained within another graph
• Maximum common subgraph isomorphism, used to identify the largest subgraph common to a pair of graphs
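As an illustration only (this is not the RASCAL implementation), the second type of isomorphism can be sketched in Python by brute force on a very small labeled graph. The function name `is_subgraph_isomorphic`, the data layout, and the toy molecule are assumptions made for this sketch. The check is the non-induced form: every query edge must be present in the target, but extra target edges are allowed.

```python
from itertools import permutations

def is_subgraph_isomorphic(q_labels, q_edges, t_labels, t_edges):
    """Brute-force check: does the query graph occur inside the target graph?

    Labels are dicts {node: label}; edges are sets of frozenset({a, b}).
    Only suitable for tiny graphs; practical systems use much faster methods.
    """
    q_nodes = list(q_labels)
    # Try every assignment of query nodes to distinct target nodes.
    for image in permutations(t_labels, len(q_nodes)):
        mapping = dict(zip(q_nodes, image))
        # Node labels (e.g., atom types) must agree.
        if any(q_labels[q] != t_labels[mapping[q]] for q in q_nodes):
            continue
        # Every query edge must map onto an edge of the target.
        if all(frozenset(mapping[n] for n in edge) in t_edges for edge in q_edges):
            return True
    return False

# Toy connection table (hydrogens omitted): C-C-O
target_labels = {1: "C", 2: "C", 3: "O"}
target_edges = {frozenset({1, 2}), frozenset({2, 3})}

# Query substructure: a carbon bonded to an oxygen.
query_labels = {"a": "C", "b": "O"}
query_edges = {frozenset({"a", "b"})}

found = is_subgraph_isomorphic(query_labels, query_edges,
                               target_labels, target_edges)
# An O-O bond does not occur in the target.
absent = is_subgraph_isomorphic({"a": "O", "b": "O"}, {frozenset({"a", "b"})},
                                target_labels, target_edges)
```

Here `found` is true and `absent` is false. A maximum common subgraph search generalizes this idea by looking for the largest part of the query that can be mapped, rather than requiring the whole query to match.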
Algorithms using these types of isomorphism have been developed and used successfully within chemistry to represent and search large files of 2D and 3D structures. The principle of representing information in terms of nodes and edges is not, however, exclusive to computational chemistry and has been used in other areas. If one considers the map of the London Underground as an example of a geographical map, it can be regarded as a graph, with the nodes of the graph representing the stations, and edges representing connecting stations; for example, Russell Square and Covent Garden are on the same underground line, the Piccadilly line. Most other geographical maps or spatially distributed data could be represented in this way.

The aim of the study was to assess the ability of the graph-theoretical methods used in computational chemistry to identify a series of increasingly complex patterns of geographical areas that are of interest in public health. We were particularly interested in identifying areas of deprivation and areas of deprivation that have poor health. We briefly describe the MCS algorithm and the structure of the data files that were developed for searching the geographical data. After presenting the results of the searches, we discuss the utility of the method for identifying geographical patterns for public health.
8.2.1 Program
The RASCAL program, an example of a maximum common subgraph isomorphism method that has been used previously within chemoinformatics, was modified to enable it to be used with geographically based public health data, so that the nodes were geographical areas and the edges were the associations between these areas. Just as chemical structures can have information associated with them, such as atomic type, geographical areas can also have information associated with them, such as deprivation, census variables, and mortality and morbidity information. The modified program had previously been validated using a test data set (Bath et al., 2002a). The modified RASCAL program can identify all geographical patterns within the area of interest that match a predefined geographical pattern, in terms of variable criteria and area adjacency. The program requires two distinct pieces of information about each geographical area: variable information that will be used in the selection criteria and information about which areas are neighboring.
8.2.2 Data
8.2.2.1 Geographical Area
The geographical area used in the study was the area previously covered by the Trent Region Health Authority, which includes South Yorkshire, Derbyshire, Leicestershire, Nottinghamshire, Lincolnshire, and South Humberside (Figure 8.1). The areas of interest were the 10,665 enumeration districts (EDs) that make up Trent region. EDs are the lowest level of census geography in England and Wales, representing on average 200 households in 1991. Information on two census-derived variables was used in the study: deprivation and the standardized long-term limiting illness ratio for people aged under 75 years (SLTLI<75).
8.2.2.2 Deprivation
The Townsend Material Deprivation Index (Townsend et al., 1988) was calculated for each ED within the Trent region, and this index was used to assign each ED a deprivation quintile. The Townsend Material Deprivation Index is a composite score made up of the summation of four standardized variables taken from the 1991 Census small area statistics (SAS). The census variables are unemployment, overcrowding, lack of owner-occupied accommodation, and lack of car ownership. This index was chosen because previous studies have suggested that it is a reasonable measure for explaining material disadvantage (Morris and Carstairs, 1991). A high positive score indicates relatively high levels of deprivation within an area, whereas a high negative score indicates relatively high levels of affluence within an area.
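The "summation of four standardized variables" can be sketched as a sum of z-scores. This is a minimal sketch under stated assumptions: the input percentages are invented, the function names are hypothetical, the population standard deviation is an arbitrary choice, and any transformation of the raw census variables that a full implementation of the index may apply before standardizing is omitted.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize a list of values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]

def deprivation_scores(unemployment, overcrowding, non_owner, no_car):
    """Sum the four standardized variables for each area (one list per variable)."""
    columns = [z_scores(col)
               for col in (unemployment, overcrowding, non_owner, no_car)]
    return [sum(parts) for parts in zip(*columns)]

# Invented percentages for three areas, from least to most deprived.
scores = deprivation_scores(
    unemployment=[5.0, 10.0, 20.0],
    overcrowding=[1.0, 2.0, 6.0],
    non_owner=[20.0, 40.0, 70.0],
    no_car=[15.0, 30.0, 60.0],
)
```

Because each component is standardized to the study region, the scores sum to approximately zero across areas, which is why positive values indicate relative deprivation and negative values relative affluence.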
FIGURE 8.1
Map of Trent region showing the enumeration districts for the 1991 census. (From 1991 Census: Digitised Boundary Data (England and Wales).)

The Townsend Material Deprivation Index was calculated for each ED within Trent, standardized to Trent. In total, 195 EDs could not be allocated a deprivation score because of missing values in one or more of the census variables, generally as a result of low counts and the suppression thresholds built into the census tables (Dale and Marsh, 1993). These EDs were given a deprivation quintile value of 99. The remaining 10,470 EDs were divided equally into deprivation quintiles on the basis of their Townsend score. A quintile value of 5 indicated those EDs within the 20% most deprived areas, and a quintile value of 1 indicated those EDs within the 20% most affluent, relative to Trent.
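The quintile assignment, including the code 99 for EDs without a score, can be sketched as follows. The function name, the rank-based split, and the handling of tied scores (by sort order) are assumptions made for illustration; the chapter does not say how ties were resolved.

```python
MISSING = 99  # code given in the text to EDs with no deprivation score

def assign_quintiles(scores):
    """Map {ed: score or None} to {ed: quintile}; 5 = highest scores."""
    # Rank the EDs that have a score from lowest to highest.
    valid = sorted((ed for ed, s in scores.items() if s is not None),
                   key=lambda ed: scores[ed])
    quintiles = {ed: MISSING for ed in scores}
    n = len(valid)
    for rank, ed in enumerate(valid):
        # Ranks 0..n-1 are split into five equal-sized groups numbered 1..5.
        quintiles[ed] = rank * 5 // n + 1
    return quintiles

# Ten invented EDs with scores 0..9 plus one ED with a missing score.
example = {f"ED{i}": float(i) for i in range(10)}
example["EDx"] = None
result = assign_quintiles(example)
```

In this example `ED0` and `ED1` fall in quintile 1, `ED8` and `ED9` in quintile 5, and `EDx` receives 99. With the 10,470 scored EDs in the study, an equal split gives 2,094 EDs per quintile.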
Figure 8.2 shows the map of Trent region shaded into quintiles on the basis of the Townsend deprivation score. Because of their relatively small size and large number, individual EDs are difficult to distinguish for the whole of Trent. To show individual EDs more clearly, an area within the south/center of Sheffield has been selected.

The maps of Sheffield center show that the more deprived areas are predominantly to the northeast of the map, within the wards of Castle, Manor, Park, Sharrow, and Netherthorpe, which surround the south of the city center.
8.2.2.3 Standardized Long-Term Limiting Illness for People Aged Less Than 75
FIGURE 8.2
Maps showing the Townsend deprivation quintile for each ED within the Trent region and an inner-city area of Sheffield (striped areas signify missing data). (From 1991 Census: Digitised Boundary Data (England and Wales); 1991 Census: Small Area Statistics (England and Wales).)

Long-term limiting illness was also taken from the 1991 Census SAS. The indirect standardization method was used, standardizing each ED by age and sex to Trent region for all persons aged less than 75 years. The ED-based population estimates used in the standardization were taken from the Estimating with Confidence Project, which adjusted for the underenumeration that occurred in the 1991 Census (Simpson et al., 1995). A value of 100 signifies that the observed number of persons aged under 75 years with limiting long-term illness is equal to the number expected, taking into account the age-specific rates of Trent region overall. The resulting SLTLI<75 values were then assigned to quintiles, with the 20% lowest values assigned a quintile value of 1 and the highest 20% assigned a value of 5. The SLTLI<75 for 194 EDs could not be calculated because of confidentiality issues in the Census SAS tables (Dale and Marsh, 1993). These EDs were given a value of 99.
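The indirect standardization described above amounts to dividing the observed count by the count expected under the reference (Trent) rates and multiplying by 100. The sketch below uses invented age and sex bands, rates, and populations; the function name and the stratum keys are assumptions, not the strata used in the study.

```python
def standardized_ratio(observed, population, reference_rates):
    """Indirectly standardized ratio: 100 * observed / expected.

    population:      {stratum: number of people in the area}
    reference_rates: {stratum: illness rate in the reference region}
    """
    expected = sum(reference_rates[stratum] * count
                   for stratum, count in population.items())
    return 100.0 * observed / expected

# Hypothetical strata and reference rates (not taken from the census).
rates = {("M", "0-44"): 0.05, ("M", "45-74"): 0.20,
         ("F", "0-44"): 0.04, ("F", "45-74"): 0.18}
ed_population = {("M", "0-44"): 100, ("M", "45-74"): 50,
                 ("F", "0-44"): 100, ("F", "45-74"): 50}

# Expected cases in this ED: 5 + 10 + 4 + 9 = 28.
as_expected = standardized_ratio(28, ed_population, rates)
above_expected = standardized_ratio(42, ed_population, rates)
```

With 28 observed cases the ratio is approximately 100 (observed equals expected), and with 42 observed cases it is approximately 150, i.e., half as many cases again as the reference rates would predict for an ED with this age and sex structure.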
Figure 8.3 shows the SLTLI<75 quintiles for Trent region and for the selected area within Sheffield. The higher SLTLI<75 scores can again be seen predominantly within the northeast of the map, surrounding the city center to the south.
8.2.2.4 Adjacency Information
As well as each ED having a deprivation quintile and an SLTLI<75 value, each ED also has information about its neighboring EDs. The EDs were each assigned a number between 1 and 10,665. For each ED a list of neighboring ED numbers was recorded.
8.2.3 Storage of Information
All the information relating to each ED was stored in one space-separated text file. The file contained three parts. Part 1 held, on one line, the total number of EDs, the maximum number of neighboring EDs, and the number of variables. Part 2 held, for each ED, one line containing the ED number, ED name, the deprivation quintile, and the SLTLI<75 value. Part 3 held, for each ED, one line containing its ED number and the ED number for each neighboring (or adjacent) ED.
FIGURE 8.3
Maps showing SLTLI<75 quintiles for the EDs in the Trent region and an inner-city area of Sheffield (striped areas signify missing data). Legend, standardized to Trent: quintile 2, 66.13 to <84.14; quintile 3, 84.14 to <103.3; quintile 4, 103.3 to <131; quintile 5, 131+. (From 1991 Census: Digitised Boundary Data (England and Wales); 1991 Census: Small Area Statistics (England and Wales).)

Table 8.1 shows an extract from the data file, showing part 1 and parts 2 and 3 for the ED 38PMFF03.
Part 1 in Table 8.1 shows that there were 10,665 EDs within the data file, a maximum of 22 neighboring EDs to any one central ED, and two variables. Part 2 shows that the ED 38PMFF03 was numbered 10,000 and had a deprivation quintile of 4 and an SLTLI<75 quintile of 4. Part 3 shows the numbers of the six neighboring EDs. Because the maximum number of neighboring EDs was 22, the modified RASCAL program expected 22 numbers to follow each ED number in part 3. The ED 38PMFF03 had only six neighboring EDs, so 16 zeroes are included to ensure that the ED had the 22 expected values.
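To make the three-part layout concrete, the sketch below reads a file of this shape into Python dictionaries. It is not the RASCAL reader: the function names and the miniature three-ED file are invented, and stripping thousands separators is an assumption made in case numbers are written as in the printed extract.

```python
def to_int(token):
    """Parse an integer, tolerating thousands separators such as 10,000."""
    return int(token.replace(",", ""))

def parse_ed_file(text):
    """Return (attributes, neighbors) from the three-part ED data file."""
    lines = [line.split() for line in text.strip().splitlines()]
    # Part 1: number of EDs, maximum neighbor count, number of variables.
    n_eds, max_neighbors, n_vars = (to_int(t) for t in lines[0])
    attributes, neighbors = {}, {}
    # Part 2: ED number, ED name, then one value per variable.
    for fields in lines[1:1 + n_eds]:
        ed = to_int(fields[0])
        attributes[ed] = (fields[1], [to_int(t) for t in fields[2:2 + n_vars]])
    # Part 3: ED number, then max_neighbors numbers padded with zeroes.
    for fields in lines[1 + n_eds:1 + 2 * n_eds]:
        ed = to_int(fields[0])
        padded = [to_int(t) for t in fields[1:1 + max_neighbors]]
        neighbors[ed] = [n for n in padded if n != 0]  # drop the padding
    return attributes, neighbors

# Invented miniature file: 3 EDs, at most 2 neighbors each, 2 variables.
sample = """
3 2 2
1 EDA 5 4
2 EDB 5 5
3 EDC 1 99
1 2 0
2 1 3
3 2 0
"""
attributes, neighbors = parse_ed_file(sample)
```

In this toy file ED 2 is adjacent to EDs 1 and 3, while EDs 1 and 3 each have a single neighbor and one padding zero. Fixed-width neighbor lines keep the file simple to read at the cost of storing many zeroes when, as in the study, most EDs have far fewer than the maximum of 22 neighbors.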
8.2.4 Queries
8.2.4.1 Query Patterns
Figure 8.4 shows the query patterns that were used to identify geographical patterns within the Trent region. These queries were developed to provide a range of pattern sizes and arrangements of deprived EDs of potential interest within the query pattern.

Query 1 is a fairly simple pattern looking for a central ED adjacent to three EDs, all with a deprivation quintile within the top 20% most deprived. Query 2 has a central ED adjacent to four EDs, all with deprivation quintiles within the top 20% most deprived and with the top 20% highest levels of SLTLI<75. Query 3 is looking for a pattern of EDs forming a chain of five, all with deprivation quintiles within the top 20% most deprived and with SLTLI<75 within the top 20% highest scores. Thus, although queries 2 and 3 both contain the same number of EDs, i.e., five, they represent very different shapes of patterns. For example, Query 2 could represent a tight cluster of deprived EDs, with deprivation and poor health concentrated in a given area, whereas Query 3 could represent a chain of deprived EDs alongside, or bordering, a geographical feature, such as a road or river. Differentiating between clusters of deprivation and chains of deprivation in relation to geographical features in this way could be of value in understanding the local impact of deprivation and health for planning health-care and social-care services.

Query 4 is similar to Query 3 but seeks to identify chains of nine EDs. Query 5 is looking for a more complicated pattern of nine EDs, all with deprivation quintiles within the top 20% most deprived and with the top 20% highest levels of SLTLI<75. Thus, similar to queries 2 and 3, queries 4 and 5 had the same number of nodes, i.e., nine, but represented different shapes of patterns that could be linked with geographical features.
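The difference between a cluster-shaped query and a chain-shaped query can be illustrated on a toy map. The sketch below uses plain backtracking search, which is a simple stand-in and not the MCS algorithm used by RASCAL; the area names, quintiles, and function names are all invented for the example.

```python
def find_matches(query_adj, area_adj, qualifies):
    """List mappings of query nodes onto distinct qualifying areas such that
    every pair of connected query nodes lands on adjacent areas."""
    q_nodes = list(query_adj)
    results = []

    def extend(mapping):
        if len(mapping) == len(q_nodes):
            results.append(dict(mapping))
            return
        q = q_nodes[len(mapping)]
        for area in area_adj:
            if area in mapping.values() or not qualifies(area):
                continue
            # Query neighbors that are already placed must be adjacent on the map.
            if all(mapping[p] in area_adj[area]
                   for p in query_adj[q] if p in mapping):
                mapping[q] = area
                extend(mapping)
                del mapping[q]

    extend({})
    return results

# Toy map: A-B-C-D-E form a chain (e.g., along a river); F touches only C.
area_adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D", "F"},
            "D": {"C", "E"}, "E": {"D"}, "F": {"C"}}
quintile = {"A": 5, "B": 5, "C": 5, "D": 5, "E": 5, "F": 2}

def most_deprived(area):
    return quintile[area] == 5

# Query 1 shape: a central node adjacent to three others.
star = {1: [2, 3, 4], 2: [1], 3: [1], 4: [1]}
# Query 3 shape: a chain of five nodes.
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

star_sets = {frozenset(m.values())
             for m in find_matches(star, area_adj, most_deprived)}
chain_sets = {frozenset(m.values())
              for m in find_matches(chain, area_adj, most_deprived)}
```

On this map the chain query finds the single set of areas A to E (reported twice by the search, once per direction, hence the de-duplication into sets), while the star query finds nothing because no area has three neighbors in the most deprived quintile. Symmetric patterns produce repeated mappings of the same areas, which is one reason result files for such searches can grow large.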
TABLE 8.1
Extract from the ED Information Data File
10,665 22 2 (part1)
10,000 38PMFF03 4 4 (part2)
10,000 9,998 9,999 10,001 10,002 10,003 10,004 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (part3)
8.2.4.2 Query Data File
FIGURE 8.4
Diagrams showing query patterns and selection criteria.

The data files for each of the queries were set up in a similar way to that of the ED data file but with two extra parts. Part 1 held, on one line, the total number of query nodes, the maximum number of neighboring query nodes, and the number of variables. Part 2 held, for each query node, one line containing the query node number, query node name, and deprivation quintile. Part 3 held, for each query node, one line containing the query node number and the query node number for each neighboring query node. Parts 4 and 5 allowed queries to be set up with ranges rather than absolute numbers. Part 4 held, for each query node, one line containing its query node number and a tolerance value percentage for the deprivation quintile. Part 5 held, for each query node, one line containing its query node number and a tolerance direction for the deprivation tolerance value, which allowed the tolerance to be set on either side of the deprivation quintile value or in one direction only, i.e., greater than or less than. The query data file for Query 1 is displayed in Table 8.2.
Part 1 of Table 8.2 states that there were four query nodes, a maximum of three connections, and one variable. Part 2 states that the four query nodes are called Q1, Q2, Q3, and Q4, with the query node numbers 1, 2, 3, and 4, respectively. All the query nodes have a deprivation quintile of 5. Part 3 shows the connections within the pattern. It states that query node 1 is connected to query nodes 2–4, while query nodes 2–4 are only connected to query node 1. Part 4 states that all the query node deprivation values have a tolerance of 1%. In Part 5, all the query nodes have a tolerance direction of 0, indicating that the tolerance is either side of the deprivation quintile; that is, the deprivation quintile for each query node can be between 4.95 and 5.05. The query data files for query numbers 2–5 follow a similar pattern to the data file for Query 1.

The modified RASCAL program was used to run each of these queries against the Trent ED data file.
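The tolerance arithmetic for Query 1 can be checked with a short sketch. Only the direction code 0 (either side) is documented in the text; the use of positive and negative codes for "greater than" and "less than" below is a hypothetical convention, as are the function names.

```python
def tolerance_range(value, tolerance_pct, direction=0):
    """Return the (low, high) range accepted around a query value."""
    delta = value * tolerance_pct / 100.0
    if direction == 0:
        # Either side of the value, as in the Query 1 data file.
        return (value - delta, value + delta)
    if direction > 0:
        # Hypothetical code: tolerance applied upward only.
        return (value, value + delta)
    # Hypothetical code: tolerance applied downward only.
    return (value - delta, value)

def node_matches(ed_value, query_value, tolerance_pct, direction=0):
    """True if an ED's value falls within the query node's accepted range."""
    low, high = tolerance_range(query_value, tolerance_pct, direction)
    return low <= ed_value <= high

low, high = tolerance_range(5, 1)        # quintile 5 with 1% tolerance
accepts_q5 = node_matches(5, 5, 1)
accepts_q4 = node_matches(4, 5, 1)
```

For a quintile of 5 and a tolerance of 1%, the range is approximately 4.95 to 5.05, so only EDs in quintile 5 match and EDs in quintile 4 (or coded 99) do not. Because quintiles are whole numbers, a small tolerance behaves like an exact match, whereas a larger one, e.g., 20% around 5, would also admit quintile 4.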
TABLE 8.2 Data File for Query 1