Loading the driver class with Class.forName().newInstance() can throw ClassNotFoundException, InstantiationException, and IllegalAccessException. The exceptions must be handled or declared by the method that loads the driver. A simple way to deal with all of them is to catch them all at once:
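A minimal sketch, assuming the Connector/J driver class name com.mysql.jdbc.Driver:

try
{
    Class.forName("com.mysql.jdbc.Driver").newInstance();
}
catch (Exception e)
{
    /* Exception covers all three possible exceptions at once */
    System.err.println("Could not load the driver: " + e.getMessage());
    System.exit(1);
}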
"/" + db + "?username=" + user + "&password=" + password);
The argument to getConnection() is a JDBC URL (Uniform Resource Locator) string. The concept of a URL is used not only with JDBC, but in many other client-server protocols. When used with JDBC, the format of the URL is

jdbc:mysql://host[:port]/db[?arg1=value1&arg2=value2...]
host is the name or IP address of the database server. The optional port argument defines the TCP/IP port to connect to. When this is omitted, the default is 3306. There is no need to specify it explicitly unless MySQL is running on a non-standard port, but it does not hurt if you do. The db argument specifies the name of the initial database. The URL can optionally specify additional arguments after the ? delimiter. Each argument setting is separated by the & delimiter. For example:
jdbc:mysql://localhost/products?user=test&password=test
The above URL defines a JDBC connection to a MySQL server running on localhost on the default TCP/IP port, to the database products, with the username set to test and the password set to test.
The most common URL arguments for MySQL Connector/J are:

■ user: the username for authentication.
■ password: the password to use for authentication.
■ autoReconnect: if set to true, reconnect if the connection dies. By default, it is set to false.

■ maxReconnects: if autoReconnect is enabled, specifies the maximum number of reconnection attempts.

■ initialTimeout: if autoReconnect is enabled, specifies the time to wait between reconnection attempts.
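For example, a URL that enables automatic reconnection might look like this (the host, database, credentials, and argument values here are placeholders):

jdbc:mysql://localhost/test?user=test&password=test&autoReconnect=true&maxReconnects=5&initialTimeout=2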
The URL arguments for MySQL Connector/J are fully documented in the README file in its distribution archive. The file also discusses special features, known issues, and tips for this driver. All users of MySQL Connector/J should check out this file.
Note that getConnection() throws SQLException, which you will need to deal with. If you do not want to type java.sql all the time in front of java.sql class names, you can put import java.sql.*; at the top of your code. We assume in the future examples that you have done this.
To be able to send a query, you first must instantiate a Statement object:
Statement st = con.createStatement();
Once a statement is created, you can use it to run two types of queries: the ones that return a result set, and the ones that do not. To run a query that returns a result set, use executeQuery(); for a query that does not return a result set, use executeUpdate().

executeQuery() returns an object of type ResultSet. You can iterate through the result set using the method next() and then access individual fields of the current row with the getXXX() methods, such as getString(), getFloat(), and getInt(). When the result set has been fully traversed, next() returns false.
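As a minimal sketch (the table and column names here are hypothetical):

ResultSet rs = st.executeQuery("SELECT id, name FROM employee");
while (rs.next())
{
    /* fields can be fetched by 1-based index or by column name */
    System.out.println(rs.getInt(1) + ": " + rs.getString("name"));
}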
Another way to execute a query is to use a PreparedStatement object, which extends the Statement interface. It can be obtained in the following way:
PreparedStatement st = con.prepareStatement(query);
In the query passed to prepareStatement(), you can use placeholders (?) in place of column values. Each placeholder should be set to the actual value through a call to the appropriate setXXX() method of the PreparedStatement object. Note that the placeholder indexes start with 1, not with 0. After the placeholder values have been set, you can call executeQuery() or executeUpdate(), just as in the case of Statement. The advantage of using prepared statements is that the strings will be properly quoted and escaped by the driver.
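A short sketch, again with a hypothetical table; note that the first placeholder has index 1:

PreparedStatement ps = con.prepareStatement(
    "INSERT INTO employee VALUES(?, ?)");
ps.setInt(1, 15);           /* placeholder indexes start with 1 */
ps.setString(2, "O'Brien"); /* the driver quotes and escapes this value */
ps.executeUpdate();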
If information such as the number of columns in the result set or their names is not known in advance, you can obtain it by retrieving the ResultSetMetaData object associated with the ResultSet. To retrieve the object, use the getMetaData() method of the ResultSet. You can then use getColumnCount(), getColumnName(), and other methods to access a wide variety of information. Note that all of the database access methods we have discussed can throw SQLException.
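For example, to print the name of every column in a result set rs:

ResultSetMetaData md = rs.getMetaData();
for (int i = 1; i <= md.getColumnCount(); i++)
    System.out.println(md.getColumnName(i)); /* column indexes are 1-based */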
Sample Code
Listing 11.1 is a simple command-line benchmark that illustrates the basic elements of interacting with MySQL in Java, and as a bonus gives you some idea of the performance of the MySQL server in combination with the Java client. We create a table with one integer column, which is also going to be the primary key, plus the specified number of string columns of type CHAR(10). We populate it with dummy data. Then we perform the specified number of SELECT queries that will select one record based on a random value of the primary key. To reduce input/output, we select only one column from the entire record, which is physically located right in the middle of it.
You can download the sample code from the book's Web site. The source is in Benchmark.java. The class files are also provided: Benchmark.class and MySQLClient.class. To compile, execute

javac Benchmark.java
class MySQLClient
{
/*
  We encapsulate Connection, Statement, PreparedStatement, and
  ResultSet under one hood. ResultSet and PreparedStatement are
  made public as we will want to give the caller a lot of
  flexibility in operating on them.
*/
/* Constructor. Accepts the JDBC URL string. */
public MySQLClient(String url)
/*
  We have two sister methods: safeReadQuery() and
  safeWriteQuery(). They both execute a query and handle errors by
  aborting with a message. The former executes a read query that
  produces a result set, and stores the result set in the rs class
  member. The latter executes a write query, which produces no
  result set.
*/
/*
  A prepared query is run with a call to safePrepareQuery(), followed by
  calls to the setXXX() methods of PreparedStatement, and then a call to
  safeRunPrepared(). In our convention, the prefix "safe" in the method
  name indicates that we catch the exception ourselves and deal with it.
  The caller does not need to worry about handling any exceptions.
*/
public void safeRunPrepared()
/* Get number of columns in the result set from the last query */
public int getNumCols() throws SQLException
{
return rs.getMetaData().getColumnCount();
}
/*
  Get the name of the column in the result set at the given
  sequential order index.
*/
/*
  Hardcoded values for the connectivity arguments. In a real-life
  application those would be read from a configuration file or from
  the user.
*/
private static String user = "root", password = "", host = "localhost",
db = "test";
/* Convenience emergency exit method */
public static void die(String msg)
/*
  Create a table named t1 with the first column as an integer and
  a primary key, and the rest of the columns strings of type
  CHAR(10). The number of additional columns is determined by the
  numCols argument. Populate the table with numRows rows of
  generated data. When finished, run SHOW TABLE STATUS LIKE 't1'
  and print the results.
*/
c.safeWriteQuery("DROP TABLE IF EXISTS t1");
/* Initializations to prepare for constructing the queries */
String query = "CREATE TABLE t1(id INT NOT NULL PRIMARY KEY";
String endInsert = "", startInsert = "INSERT INTO t1 VALUES(?";
int i;
/* Start the timer */
long start = System.currentTimeMillis();
/* Prepare the query before the insert loop */
c.safePrepareQuery(startInsert + endInsert);
/* Set the constant string values */
for (i = 0; i < numCols; i++)
/* Stop the timer */
long runTime = System.currentTimeMillis() - start;
/* Compute and print out performance data */
System.out.println(String.valueOf(numRows) +
  " rows inserted one at a time in " + String.valueOf(runTime/1000) +
  "." + String.valueOf(runTime%1000) + " s, " +
  (runTime > 0 ? String.valueOf((numRows*1000)/runTime) +
  " rows per second" : "") + "\n");
/*
  Now we examine the table with SHOW TABLE STATUS. This
  serves several purposes. It allows us to check if the rows
  we have inserted are really there. We can see how many bytes
  the table and each row are taking. And we can provide an
  example of how to read the results of a query.
*/
int numStatusCols = c.getNumCols();
String line = "TABLE STATUS:\n";
/*
  Iterate through the result set.
  MySQLClient.safeReadQuery() stores the result set in the
  rs member. There is actually only one result row. But for
  the sake of example, we do a loop anyway.
*/
}
}
/*
  Run numQueries randomized selects of the type SELECT sN FROM t1
  WHERE id = M, with N being the number of columns divided in half
  to hit the middle column, and M a random number between 0 and
  numRows-1. Time the process and compute performance data.
*/
/* Initialize the common query prefix */
String queryStart = "SELECT s" + String.valueOf(numCols/2) +
" FROM t1 WHERE id = ";
/* Instantiate the random number generator object */
Random r = new Random();
/* Start the timer*/
long start = System.currentTimeMillis();
/* Now run generated queries in a loop, randomizing the key */
/* Stop the timer */
long runTime = System.currentTimeMillis() - start;
/* Compute and print performance data */
System.out.println(String.valueOf(numQueries) +
  " selects in " + String.valueOf(runTime/1000) + "." +
  String.valueOf(runTime%1000) + " s, " +
  (runTime > 0 ? String.valueOf((numQueries*1000)/runTime) +
  " queries per second" : "") + "\n");
/* Construct the JDBC URL */
String url = "jdbc:mysql://" + host + "/" + db + "?user=" + user + "&password=" + password;
/* Parse the command-line arguments */
int numRows = 0, numCols = 0, numQueries = 0;
Listing 11.1 Source code of Benchmark.java
A run of the sample application with 1000 rows, 10 columns, and 1000 selects (java Benchmark 1000 10 1000) with Sun JDK 1.3.1 produces the following output on my desktop (a Pentium III 500 with 256MB of RAM, running Linux 2.4.19):
1000 rows inserted one at a time in 1.696 s, 589 rows per second
CHAPTER 12

Writing the Client for Optimal Performance

In this chapter, we discuss how to write MySQL client code efficiently. We have examined some related topics in Chapters 6 and 7; in this chapter, we look at caching techniques, replication-aware code, ways to improve write-dominant applications, methods of reducing network I/O, and query optimization. In the query optimization section, we learn how the MySQL optimizer works.
Query Caching
A frequent cause of inefficiency in database applications is that the same SELECT query is being run multiple times, and each time it produces the same result because the tables have not changed. One easy way to address this problem is to use server query caching. The server will cache the results of all SELECT queries. The next time the same query is run, if the tables involved in the query have not been modified, the result is read from the cache and not from the tables. When at least one of the tables involved in the query is modified, the cached result is invalidated. Thus, performance is improved, and the optimization is transparent to the user. This feature is available in the 4.0 MySQL branch starting in version 4.0.3.
To enable the query cache, you need to set query_cache_size to a non-zero value (set-variable query_cache_size=16M, for example). This does mean, however, that you would have to use a relatively new feature along with quite a bit of new code that is in the 4.0 branch. The code in the query cache has proven itself reasonably stable, though, so you may consider this an option as long as you test your application thoroughly to make sure it does not hit any unknown bugs (something you should do anyway, even with a stable release of MySQL). By the time you read this book, it is very likely that the 4.0 branch will have reached stable status.
However, having to depend on the server query cache is not the best option for improving performance, although it may save you some development time, especially if you have already written the application. The problem is that you still have to contact the server and have it do work that it really does not have to do, such as reading network packets, parsing the query, checking if it is stored in the query cache, checking if the tables have been modified, and sending the stored result back to the client. A better alternative is to perform caching on the client inside your application logic. There are several caching techniques you can use in your code:
■ When retrieving query results, ask yourself if you will need them again in another routine. If you will, save the result for later access. An elegant way to do it in C is to not call mysql_free_result() on the pointer returned from mysql_store_result() immediately, but to save it in a global variable for later access; then when it is needed again, call mysql_data_seek() to rewind to the first row, and iterate through it another time. A similar method will work in PHP. In Perl, you can use $sth->fetchrow_arrayref(), although it is not as elegant as with C or PHP because it will make an unneeded copy of the result.
■ If you are generating an HTML page dynamically, you can have a static page instead that is regenerated whenever the tables it depends on are updated. The advantage of this approach is that even if the database server goes down, you will still have something to show, and it will be current, too!
■ For non-Web clients, you can identify logical blocks of information that depend on a set of tables, and cache/regenerate those blocks only if the tables have been updated. To check the timestamp on a table, you can use SHOW TABLE STATUS LIKE 'tbl_name' if the updating agent is not otherwise capable of notifying clients that the update has happened.
If the application performs intensive frequent reads and relatively infrequent writes, proper query caching can significantly improve performance and help greatly reduce hardware costs.
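To make the idea of client-side caching concrete in Java, here is a minimal sketch; the class name and the coarse invalidate-everything policy are illustrative assumptions, and a real application would track which cached queries depend on which tables:

import java.util.HashMap;

public class QueryCache
{
    private HashMap cache = new HashMap();

    /* Return the saved result for a query string, or null on a miss */
    public synchronized Object get(String query)
    {
        return cache.get(query);
    }

    /* Save a query result for later reuse */
    public synchronized void put(String query, Object result)
    {
        cache.put(query, result);
    }

    /* Call when a table the cached queries depend on is modified */
    public synchronized void invalidateAll()
    {
        cache.clear();
    }
}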
In a replicated setup, all updates must go to the master, while non-time-critical selects can go to slaves. This method allows you to scale your application with a cluster of inexpensive servers to the kind of performance that might be difficult to achieve even with an expensive mainframe. However, to take advantage of this setup in MySQL 3.23, the client has to be aware of the set of replication servers. The connect function must be able to establish two connections: one to the master and the other to one of the slaves, based on some evenly distributed algorithm (random pick, hash on process id, etc.). All queries will go through three query functions: safe_query_write(), safe_query_read(), and safe_query_critical_read(). You call safe_query_write() for any queries that modify the data, safe_query_read() for any query that can read data that is slightly behind (a slave may be a few queries behind the master), and safe_query_critical_read() for reads that need the most current data.

MySQL AB plans to add a proxy in 4.0 or 4.1 that will look at the query and intelligently decide if it should go to the master or to the slave. This allows old code to take advantage of the replicated setup. For more information on replication, see Chapter 16.
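As an illustration of this convention, here is a minimal Java sketch; only the three query-function names come from the text above, while the class structure and the random slave pick are assumptions made for the example:

import java.sql.*;

public class ReplicatedClient
{
    private Connection master;
    private Connection slave;

    /* Connect to the master and to one randomly picked slave */
    public ReplicatedClient(String masterUrl, String[] slaveUrls)
    {
        try
        {
            master = DriverManager.getConnection(masterUrl);
            slave = DriverManager.getConnection(
                slaveUrls[(int)(Math.random() * slaveUrls.length)]);
        }
        catch (SQLException e)
        {
            System.err.println("Connect failed: " + e.getMessage());
            System.exit(1);
        }
    }

    /* All queries that modify data go to the master */
    public void safe_query_write(String q)
    {
        try { master.createStatement().executeUpdate(q); }
        catch (SQLException e) { die(e); }
    }

    /* Reads that may be slightly behind go to the slave */
    public ResultSet safe_query_read(String q)
    {
        try { return slave.createStatement().executeQuery(q); }
        catch (SQLException e) { die(e); return null; }
    }

    /* Reads that need the most current data go to the master */
    public ResultSet safe_query_critical_read(String q)
    {
        try { return master.createStatement().executeQuery(q); }
        catch (SQLException e) { die(e); return null; }
    }

    /* Abort with a message, following the "safe" convention */
    private static void die(SQLException e)
    {
        System.err.println("Query failed: " + e.getMessage());
        System.exit(1);
    }
}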
Improving Write-Dominant Applications
One of the challenges in scaling a MySQL application is that while it is relatively easy to scale read performance through caching and replication, those techniques will not improve performance when writes are dominant. Let's discuss a couple of ways to speed up write-intensive applications:
■ Combine several inserts into one. For example, instead of INSERT INTO tbl VALUES (1); INSERT INTO tbl VALUES (2); INSERT INTO tbl VALUES (3), you can do INSERT INTO tbl VALUES (1),(2),(3); (see the sketch after this list).
■ If the number of records you insert at once is significant (more than 100 or so), it is advisable to save the data in a file (e.g., tab-delimited columns) and then use LOAD DATA INFILE (if the file can be placed directly on the server) or LOAD DATA LOCAL INFILE otherwise to load the data. LOAD DATA INFILE is the fastest possible way to get a large chunk of data into a table, and is worth the overhead of creating a temporary file.
■ Do all you can to reduce the length of a record by choosing efficient data types for each field. If you are using sizeable BLOBs (over 1K), consider whether it would be possible to store them separately on a regular file system and store references to them in the database instead. MySQL does not perform very well with BLOBs; it does not have a separate BLOB space for storage, and the network and storage code in the server is written with the assumption that the inserted record is short and that making one extra copy of a field does not hurt.
Trang 17■■ Eliminate unnecessary keys, and reduce the size of the ones you are going
to keep if possible (e.g., by indexing only a prefix of a string instead of thewhole string)
■ If possible, write your application with the assumption that the data can be logically divided into groups. For example, if you are logging some information based on an IP address, you can hash on the last 4 bits of the address and create 16 different groups. Then you can put each group of records on its own server. The statistical and other queries that need to examine the entire data set will poll multiple servers. This will add some development overhead for your application because you will need to merge the data coming back. On the other hand, it will give you the advantage of being able to run the query in parallel on several servers.
■ For applications that collect randomly arriving data one piece at a time (e.g., if you are logging IP packets, ISP dial-up information, or cell-phone calls), you can greatly improve performance by pooling the data into batches and then using LOAD DATA INFILE. This can be done by having the clients connect to a simple proxy server instead of MySQL directly, which will read the data from the client, add it to the queue, and periodically flush the queue, sending it to the MySQL server in a batch.
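As a sketch of the first suggestion in this list, the following hypothetical helper builds one multi-row INSERT statement from a batch of integer values:

/* Build a multi-row INSERT such as "INSERT INTO tbl VALUES (1),(2),(3)" */
public static String buildBatchInsert(String table, int[] values)
{
    StringBuffer q = new StringBuffer("INSERT INTO " + table + " VALUES ");
    for (int i = 0; i < values.length; i++)
    {
        if (i > 0)
            q.append(',');
        q.append('(').append(values[i]).append(')');
    }
    return q.toString();
}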
Reducing Network I/O
Unnecessary network I/O overhead is obviously detrimental to application performance. Not only does it tie up network resources and cause delays on the client, it also increases the amount of time locks are being held on the server. In addition, it ties up the server CPU since it pipes data to the client that it should not have to pipe. Two basic concepts are behind achieving efficient network I/O:

■ Read only what you need (in other words, if you read it, use it; if you do not use it, do not read it).

■ Do not retrieve data into the client just to process it. Let the server do the work as much as possible.
Excessive network I/O is often caused by the client requesting too many rows or columns of data back from a query. Let's look at an example where the client is reading from a table with 100 columns, each an average of 10 bytes in length. Once it receives the data, it accesses only 5 of those 100 columns. Yet the client performs SELECT * FROM tbl instead of SELECT c1,c2,c3,c4,c5 FROM tbl. Or, on the other hand, the client executes SELECT * FROM tbl WHERE c1 > 'start' AND c2 < 'end' and reads the first record of the result, discarding the rest of the records. A more efficient approach would be to use SELECT * FROM tbl WHERE c1 > 'start' AND c2 < 'end' LIMIT 1, which will give the client the record it needs.
Other times, the client actually uses what it reads, but it is still not efficient because the server is capable of doing the work to get the end result the client is actually after. Let's consider a simple example in PHP (see Listing 12.1).
($res = mysql_query("SELECT c1,c2,c3,c4,c5 FROM tbl WHERE $constr"))
  || die("Error running query");
while ($row = mysql_fetch_array($res))   /* add the columns on the client */
  $total = $row["c2"] + $row["c3"] + $row["c4"] + $row["c5"];
Listing 12.1 Inefficient code can lead to higher network I/O.
The code in Listing 12.1 can be improved by moving the addition to the server, as shown in Listing 12.2. We now read only two columns from the server instead of five, which reduces network traffic.
($res = mysql_query("SELECT c1,(c2+c3+c4+c5) as total FROM tbl WHERE
  $constr")) || die("Error running query");
Listing 12.2 Example of reducing network I/O by letting the server do the work.
Understanding the Optimizer
To master the art of writing efficient queries, you must learn how the MySQL optimizer thinks. In a way, you must become one with it, just as experienced cyclists become one with their bike or experienced musicians become one with their instrument. Getting there takes a lot of time, practice, research, and more practice. In the next few pages, we provide an introduction that we hope will get you started on the right track.
The MySQL optimizer was written mostly by one person: Monty Widenius. It is an expression of Monty's programming genius, and as such is quite efficient and capable. At the same time, however, it inherited some of Monty's quirks and prejudices, and at times exhibits rather odd behavior that some programmers may find annoying. Despite these quirks, the optimizer gets the job done very well if you take the time to understand how it works.
The goal of any optimizer is to improve application performance by examining the minimum possible amount of data necessary to return accurate results for any query. The most common way to accomplish this is to perform key reads as opposed to table scans. When a key read is being performed, we want to use the one that will require the least amount of I/O to retrieve the data we need.

The MySQL optimizer can use only one key per table instance in a join (even if you are selecting from just one table, the optimizer still considers this a join, albeit a simplified one). It distinguishes among four methods of key lookup: const, ref, range, and fulltext. A const lookup involves reading one record based on a known value of a unique key; it is the most efficient lookup method. ref covers two situations: a lookup on a key when only some prefix parts are known but not the entire value of the key, and lookups on a non-unique key. In a const lookup, we always read only one record, while a ref lookup may need to read additional ones. The efficiency of a ref lookup greatly depends on key distribution. A range lookup is used when we know the maximum and minimum possible values of a key (the range of values) or a set of such ranges. fulltext lookups are used when you have a full-text key on a column and are using the MATCH ... AGAINST() operator.
The optimizer is capable of examining different possibilities for keys and will choose the method that requires, in its judgment, the least number of records. In some cases, it can make a mistake in estimating how many rows a certain key would require it to examine, and for that reason it can choose the wrong key. If that happens, you can tell it to use a different key by including the USE KEY(key_name) syntax in the query.
If the estimated number of records in the key interval exceeds 30 percent of the table, a full scan will be preferred for MyISAM tables. The reason for this decision is that it takes longer to read one key than one record from the data file. This is not the case for InnoDB tables, however, because a full scan is really just a traversal of the primary key (if no primary key or unique key is specified when the table is created, a surrogate key will be created). A full traversal of the primary key would definitely not be faster than a partial one, and is not likely to be faster than traversing a portion of a secondary key.
MyISAM and InnoDB tables use B-tree indexes. HEAP (in-memory) tables use hash indexes. Therefore, HEAP tables are very efficient at looking up records based on the value of the entire key, especially if the key is unique. However, they are not able to read the next key in sequence, and this makes it impossible for them to use the range optimization or to retrieve records based on a key prefix.
The optimizer is capable of figuring out from the WHERE clause that it can use a key if the columns comprising it are compared directly with a constant expression or with the value from another table in a join. For example, if a query has WHERE a = 4+9, the optimizer will be able to use a key on a, but when the same condition is written as WHERE a-4 = 9, the key would not be used. It is, therefore, important to have a habit of writing comparison expressions with a simple left side. The optimizer looks at =, >, >=, <=, LIKE, AND, OR, IN, and MATCH ... AGAINST operators in the WHERE clause when deciding which keys could possibly be used to resolve the query.
In some cases, MySQL will "cheat" by noticing something special in the WHERE clause. For example, MyISAM and HEAP tables keep track of the total number of records. Therefore, SELECT count(*) FROM tbl can be answered by simply looking up the record count in the table descriptor structure. A query like SELECT * FROM tbl WHERE a = 4 AND a = 5 can be answered immediately without ever looking at the table data: a cannot be 4 and 5 at the same time; therefore, the result is always an empty set. Suppose we have a key on a in a MyISAM or InnoDB table. During initialization, the optimizer will look up the highest and the lowest values of the key. If it notices that the key value requested is out of range, it will immediately return an empty set.
Remembering that only one key can be used per table instance is important. A common mistake that can significantly affect performance is repeatedly executing something like SELECT * FROM tbl WHERE key1 = 'val1' OR key2 = 'val2', assuming that key1 and key2 are keyed columns in the table tbl. In this situation, MySQL will not be able to use either one of the keys. It is much better in this case to split the query into two: SELECT * FROM tbl WHERE key1 = 'val1', followed by SELECT * FROM tbl WHERE key2 = 'val2'.
The optimizer does a decent job with joins, but it does require you to think carefully when writing them. You have to have the proper keys in place. No temporary keys will be created to optimize a join; only the existing ones will be used. MySQL does not perform a hash-join. It simply examines each table to be joined to see how many records it would have to retrieve from each, sequences the tables in smaller-first order, and then begins to read records according to that order. The first record is read from the first table, then the first from the second, and so on up until the last table. All records that match the conditions are read from the last table. Then the second record is read from the next-to-last table, and all the matching records are read from the last table again.
Each time the optimizer finishes the iteration through a table, it backtracks to the previous one in the join order, moves on to the next record, and repeats the iterative process. These steps are repeated until we have iterated through the first table. Thus, we end up examining the product of the number of candidate rows for each table.

As you can see, this process will be very fast if a key can always be used in all tables to look up a record and there are not very many possibilities to examine in each table. However, it could be a performance disaster if all the tables are large and have to be fully scanned.
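A toy Java model of that iteration order, with arrays standing in for three tables already sequenced in smaller-first order; the data and the equality join condition are made up for illustration, and the point is that without usable keys the work grows with the product of the candidate rows in each table:

int[] t1 = {1, 2};          /* smallest table, outermost loop */
int[] t2 = {1, 2, 3};
int[] t3 = {1, 2, 3, 4};    /* last table, fully iterated for each match */

for (int i = 0; i < t1.length; i++)
    for (int j = 0; j < t2.length; j++)      /* backtrack point */
        for (int k = 0; k < t3.length; k++)
            if (t1[i] == t2[j] && t2[j] == t3[k])   /* join condition */
                System.out.println(t1[i] + " " + t2[j] + " " + t3[k]);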
To see what the optimizer is doing, you can look at the output of EXPLAIN; you can also run a query on the server (when it has no other activity) and use the output from SHOW STATUS to track the differences in the optimizer statistics variables. Let's examine a few examples that will illustrate how this can be done to understand how efficient a certain query might be. For more in-depth information on EXPLAIN and SHOW STATUS, please refer to Chapter 15, where we document the output of both in detail.
To facilitate our optimizer strategy research, I have written a special command-line utility called query-wizard. Both the source and the executables are available on this book's Web site (www.wiley.com/compbooks/pachev). When performing your own studies, you could either use query-wizard as is without modifications, extend or customize it to fit your needs, or perhaps use it as an inspiration to write your own tool. The essence of the research is to run EXPLAIN on your query to have the optimizer tell you what it is going to do; then run the query and see how the optimizer statistics have changed to see what it has actually done.

Here is a listing of the program's options (obtained by running ./query-wizard -?):
Usage: query-wizard [options] file
Options:
-x EXPLAIN
-s report Handler_ variable changes
-r run the query N times and time it, requires an argument
-o output file name
-q maximum query length, default 65535