memory that they never actually use, and they may run successfully if this setting is enabled.
Changing parameter values is accomplished by modifying the values stored in these files. For example, the following command changes the settings related to the buffer cache:
# echo "5 33 80" > /proc/sys/vm/buffermem
15.4.2.5 Solaris
On Solaris systems, you can view the values of system parameters via the kstat command. For example, the following command displays system parameters related to paging behavior, including their default values
on a system with 1 GB of physical memory:
# kstat -m unix -n system_pages | grep 'free '
cachefree 1966 Units are pages.
lotsfree 1966
desfree 983
minfree 491
Figure 15-4 illustrates the meanings and interrelationships of these memory levels.
Figure 15-4. Solaris paging and swapping memory levels
As the figure indicates, setting cachefree to a value greater than lotsfree provides a way of favoring
processes' memory over the buffer cache (by default, no distinction is made between them because lotsfree
is equal to cachefree). In order to do so, you should decrease lotsfree to some point between its current level and desfree (rather than increasing cachefree).
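Such Solaris paging parameters are ordinarily made persistent by adding set entries to /etc/system and rebooting; the value shown here is purely illustrative, not a recommendation:

```shell
# /etc/system entry: lower lotsfree toward desfree (value is in pages; example only)
set lotsfree=1200
```

You can confirm the resulting values after a reboot with the kstat command shown above.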
Solaris 9 has changed its virtual memory manager and has eliminated the cachefree variable.
15.4.2.6 Tru64
Tru64 memory management is controlled by parameters in the sysconfig vm subsystem. These are the most
useful parameters:
vm_aggressive_swap: Enable/disable aggressive swapping out of idle processes (0 by default).
Enabling this can provide some memory management improvements on heavily loaded systems, but it
is not a substitute for reducing excess consumption.
There are several parameters that control the conditions under which the memory manager steals pages from active processes and/or swaps out idle processes in an effort to maintain sufficient free memory. They are listed in Figure 15-5 along with their interrelationships and effects.
Figure 15-5. Tru64 paging and swapping memory levels
The default for vm_page_free_min is 20 pages. The value of vm_page_free_target varies with the
memory size; for a system with 1 GB of physical memory, it defaults to 512 pages. The reserved value
is always 10 pages.
The other variables are computed from these values. vm_page_free_swap (and the equivalent
vm_page_free_optimal) is set to the point halfway between the minimum and the target, and
vm_page_free_hardswap is set to about 16 times the target value.
Several parameters relate to the size of the buffer cache. vm_minpercent specifies the percentage of
memory initially used for the buffer cache (the default is 10%). The buffer cache size will increase if memory is available. The parameter ubc_maxpercent specifies the maximum amount of memory that it may use (the default is 100%). When memory is short and the size of the cache corresponds to
ubc_borrowpercent or larger, pages will be returned to the general pool until the cache drops below
this level (and process memory page stealing does not occur). The default for the borrow level is 20%
of physical memory.
On file servers, it will often make sense to increase one or both of the minimum and borrow
percentages (to favor the cache over local processes in memory allocation). On a database server, though, you will probably want to reduce these sizes.
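These Tru64 settings are inspected and adjusted with the sysconfig command; the specific value below is an illustrative assumption, not a recommendation:

```shell
# Query the current values of selected vm-subsystem parameters
sysconfig -q vm ubc_maxpercent ubc_borrowpercent

# Change a value in the running kernel (example value only)
sysconfig -r vm ubc_borrowpercent=30
```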
15.4.3 Managing Paging Space
Specially designated areas of disk are used for paging. On most Unix systems, distinct, dedicated disk
partitions—called swap partitions—are used to hold pages written out from memory. In some recent Unix implementations, paging can also go to special page files stored in a regular Unix filesystem.[26]
[26] Despite their names, both swap partitions and page files can be used for paging and for swapping (on systems supporting virtual memory).
Many discussions of setting up paging space advise using multiple paging areas, spread across different physical disk drives. Paging I/O performance will generally improve the closer you come to this ideal.
However, regular disk I/O also benefits from careful disk placement. It is not always possible to separate both paging space and important filesystems. Before you decide which to do, you must determine which kind of I/O you want to favor and then provide the improvements appropriate for that kind.
In my experience, paging I/O is best avoided rather than optimized, and other kinds of disk I/O deserve far more attention than paging space placement.
15.4.3.1 How much paging space?
There are as many answers to this question as there are people to ask. The correct answer is, of course, "It depends." What it depends on is the type of jobs your system typically executes. A single-user workstation might find a paging area of one to two times the size of physical memory adequate if all the system is used for is editing and small compilations. On the other hand, real production environments running programs with very large memory requirements might need two or even three times the amount of physical memory. Keep in mind that some processes will be killed if all available paging space is ever exhausted (and new processes will not be able to start).
One factor that can have a large effect on paging space requirements is the way that the operating system assigns paging space to virtual memory pages implicitly created when programs allocate large amounts of memory (which may not all be needed in any individual run). Many recent systems don't allocate paging space for such pages until each page is actually accessed; this practice tends to minimize per-process memory requirements and stretch a given amount of physical memory as far as possible. However, other systems assign paging space to the entire block of memory as soon as it is allocated. Obviously, under the latter scheme, the system will need more page file space than under the former.
Other factors that will tend to increase your page file space needs include:
Jobs requiring large amounts of memory, especially if the system must run more than one at a time
Jobs with virtual address spaces significantly larger than the amount of physical memory
Programs that are themselves very large (i.e., have large executables). This often implies the item above, but not vice versa.
A very, very large number of simultaneously running jobs, even if each individual job is fairly small
15.4.3.2 Listing paging areas
Most systems provide commands to determine the locations of paging areas and how much of the total space is currently in use:
(Table: per-version commands to list paging areas and to show current usage)
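As a rough per-version guide (command names and options vary slightly between releases; verify against your system's manual pages):

```shell
lsps -a            # AIX: list paging spaces and their usage
pstat -s           # FreeBSD: show swap device usage
swapinfo -t        # HP-UX: report swap configuration and usage
cat /proc/swaps    # Linux: list active swap areas and usage
swap -l; swap -s   # Solaris: list swap areas; summarize usage
```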
Here is some output from a Solaris system:
swapfile dev swaplo blocks free
Here is some output from an AIX system (from lsps -a):
Page Space Phys Volume Volume Group Size %Used Active Auto
hd6 hdisk0 rootvg 200MB 76 yes yes
paging00 hdisk3 uservg 128MB 34 yes yes
The output lists the paging space name, the physical disk it resides on, the volume group it is part of, its size, how much of it is currently in use, whether it is currently active, and whether it is activated
automatically at boot time. This system has two paging spaces totaling about 328 MB; total system swap space is currently about 60% full.
Here is some output from an HP-UX system:
The first three lines of the output provide details about the system swap configuration. The first line (dev)
shows that 34 MB is currently in use within the paging area at /dev/vg00/lvol2 (its total size is 192 MB). The
next line indicates that another 98 MB has been reserved within this paging area but is not yet in use.
The third line of the display is present when pseudo-swap has been enabled on the system. This is
accomplished by setting the swapmem_on kernel variable to 1 (in fact, this is the default). Pseudo-swap
allows applications to reserve more swap space than physically exists on the system, up to a limit of seven-eighths of physical memory. It is important to emphasize that pseudo-swap does not itself take up any memory. Line 3 indicates that there is 164 MB of memory overcommitment capacity remaining for
applications to use (32 MB is in use).
The final line (total) is a summary line. In this case, it indicates that there is 257 MB of total swap space on this system. 164 MB of it is currently either reserved or allocated: the 34 MB allocated from the paging area plus 98 MB reserved in the paging area plus 32 MB of the pseudo-swap capacity.
15.4.3.3 Activating paging areas
Normally, paging areas are activated automatically at boot time. On many systems, swap partitions are
listed in the filesystem configuration file, usually /etc/fstab. The format of the filesystem configuration file is
discussed in detail in Section 10.2, although some example entries will be given here:
/dev/ad0s2b none swap sw 0 0 FreeBSD
/dev/vg01/swap swap pri=0 0 0 HP-UX
/dev/hda1 swap swap defaults 0 0 Linux
This entry says that the first partition on disk 1 is a swap partition. This basic form is used for all swap partitions.
Solaris systems similarly place swap areas into /etc/vfstab.
The areas listed in these configuration files are activated at boot time by the boot scripts, via a command like:
swapon -a > /dev/console 2>&1
The same command can be run manually when adding a new partition. Solaris provides the swapadd tool to perform the same function during boots.
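A vfstab swap entry has this general shape (the device name here is an illustrative assumption):

```shell
# /etc/vfstab fields: device to mount, device to fsck, mount point,
# FS type, fsck pass, mount at boot, mount options
/dev/dsk/c0t0d0s1   -   -   swap   -   no   -
```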
Under AIX, paging areas are listed in the file /etc/swapspaces:
hd6:
dev = /dev/hd6
paging00:
dev = /dev/paging00
Each stanza lists the name of the paging space and its associated special file (the stanza name and the
filename in /dev are always the same). All paging logical volumes listed in /etc/swapspaces are activated at
boot time by a swapon -a command in /etc/rc. Paging logical volumes can also be activated when they are
created or by manually executing the swapon -a command.
15.4.3.4 Creating new paging areas
As we've noted, paging requires dedicated disk space, which is used to store paged-out data. Making a new swap partition on an existing disk without free space is a painful process, involving these steps:
Performing a full backup of all filesystems currently on the device and verifying that the tapes are readable
Restructuring the physical disk organization (partition sizes and layout), if necessary
Creating new filesystems on the disk. At this point, you are treating the old disk as if it were a brand new one.
Restoring files to the new filesystems
Activating the new swapping area and adding it to the appropriate configuration files
Most of these steps are covered in detail in other chapters. A better approach is the subject of the next subsection.
15.4.3.5 Filesystem paging
Many modern Unix operating systems offer a great deal more flexibility by supporting filesystem paging—paging to designated files within normal filesystems. Page files can be created or deleted as needs change, albeit at a modest increase in operating system paging overhead.
Under Solaris, the mkfile command creates new page files. For example, the following command will create
the file /chem/page_1 as a 50 MB file:
# mkfile 50m /chem/page_1
# swap -a /chem/page_1 0 102400
The size of the file is interpreted as bytes unless a k (KB) or m (MB) suffix is appended to it. The regular swap
command is then used to designate an existing file as a page file by substituting its pathname for the special filename.
On HP-UX systems, filesystem paging is initiated by designating a directory as the swap device to the
swapon command. In this mode, it has the following basic syntax:
swapon [-m min] [-l limit] [-r reserve] dir
min is the minimum number of filesystem blocks to be used for paging (the block size is as defined when the filesystem was created: 4096 or 8192), limit is the maximum number of filesystem blocks to be used for paging space, and reserve is the amount of space reserved for files, beyond that currently in use, which may never be used for paging space. For example, the following command initiates paging to the /chem
filesystem, limiting the size of the page file to 5000 blocks and reserving 10000 blocks for future filesystem expansion:
# swapon -l 5000 -r 10000 /chem
You can also create a new logical volume as an additional paging space under HP-UX For example, the
following commands create and activate a 125 MB swap logical volume named swap2:
# lvcreate -l 125 -n swap2 -C y -r n /dev/vg01
# swapon /dev/vg01/swap2
The logical volume uses a contiguous allocation policy and has bad block relocation disabled (-C and -r, respectively). Note that no filesystem is built on the logical volume.
On Linux systems, a page file may be created with commands like these:
# dd if=/dev/zero of=/swap1 bs=1024 count=8192 Create 8MB file.
# mkswap /swap1 8192 Make file a swap device.
# sync; sync
# swapon /swap1 Activate page file.
On FreeBSD systems, a page file is created as follows:
# dd if=/dev/zero of=/swap1 bs=1024 count=8192 Create 8MB file.
# vnconfig -e vn0c /swap1 swap Create pseudo disk /dev/vn0c
and enable swapping.
The vnconfig command configures the paging area and activates it.
Under AIX, paging space is organized as special paging logical volumes. Like normal logical volumes, paging spaces may be increased in size as desired as long as there are unallocated logical partitions in their volume group.
You can use the mkps command to create a new paging space or the chps command to enlarge an existing
one. For example, the following command creates a 200 MB paging space in the volume group chemvg:
# mkps -a -n -s 50 chemvg
The paging space will be assigned a name like pagingnn, where nn is a number: paging01, for example. The
-a option says to activate the paging space automatically on system boots (its name is entered into
/etc/swapspaces). The -n option says to activate the paging space immediately after it is created. The -s
option specifies the paging space's size, in logical partitions (whose default size is 4 MB). The volume group name appears as the final item on the command line.
The size of an existing paging space may be increased with the chps command. Here the -s option specifies the number of additional logical partitions to be added:
# chps -s 10 paging01
This command adds 40 MB to the size of paging space paging01.
FreeBSD does not support filesystem paging, although you can use a logical volume for swapping in either environment. The latter makes it much easier to add an additional paging space without adding a new disk.
15.4.3.6 Linux and HP-UX paging space priorities
HP-UX and Linux allow you to specify a preferred usage order for multiple paging spaces via a priority system. The -p option to swapon may be used to assign a priority number to a swap partition or other
paging area when it is activated. Priority numbers run from 0 to 10 under HP-UX, with lower-numbered areas being used first; the default value is 1.
On Linux systems, priorities go from 0 to 32767, with higher-numbered areas being used first, and they default to 0. It is usually preferable to give dedicated swap partitions a higher usage priority than filesystem paging areas.
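On Linux, for example, a priority can be assigned when the area is activated, or via the pri= mount option in /etc/fstab (the device names and priority values here are illustrative):

```shell
# Activate a dedicated swap partition at a high priority
swapon -p 100 /dev/sda2
# Activate a filesystem page file at a lower priority
swapon -p 10 /swap1

# Equivalent /etc/fstab entries:
#   /dev/sda2  none  swap  sw,pri=100  0 0
#   /swap1     none  swap  sw,pri=10   0 0
```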
15.4.3.7 Removing paging areas
Paging spaces may be removed if they are no longer needed, unless they're on the root disk. To remove a swap partition or filesystem page file in a BSD-style implementation—FreeBSD, Linux, HP-UX, and
Tru64—remove the corresponding line from the appropriate system configuration file. Once the system is rebooted, the swap partition will be deactivated (rebooting is necessary to ensure that there are no active references to the partition or page file). Page files may then be removed normally with rm.
Under Solaris, the -d option to the swap command deactivates a swap area. Here are some examples:
# swap -d /dev/dsk/c1d1s1 0
# swap -d /chem/page_1 0
Once the swap -d command is executed, no new paging will be done to that area, and the kernel will
attempt to free areas in it that are still in use, if possible. However, the file will not actually be removed until
no processes are using it
Under AIX, paging spaces may be removed with rmps once they are deactivated:
# chps -a n paging01
# rmps paging01
The chps command removes paging01 from the list to be activated at boot time (in /etc/swapspaces). The
rmps command actually removes the paging space.
Administrative Virtues: Persistence
Monitoring system activity levels and tuning system performance both rely on the same system
administrative virtue: persistence. These tasks naturally must be performed over an extended
period of time, and they are also inherently cyclical (or even recursive). You'll need persistence
most at two points:
When you are just getting started and don't have any idea what is wrong with the system
and what to try to improve the situation
After the euphoria from your early successes has worn off and you have to spend more
time to achieve smaller improvements
System performance tuning—and system performance itself—both follow the 80/20 rule: getting
the last 20% done takes 80% of the time. (System administration itself often follows another
variation of the rule: 20% of the people do 80% of the work.) Keep in mind the law of
diminishing returns, and don't waste any time trying to eke out that last 5% or 10%.
15.5 Disk I/O Performance Issues
Disk I/O is the third major performance bottleneck that can affect a system or individual job. This section will look first at the tools for monitoring disk I/O and then consider some of the factors that can affect disk I/O performance.
15.5.1 Monitoring Disk I/O Performance
Unfortunately, Unix tools for monitoring disk I/O data are few and rather poor. BSD-like systems provide the
iostat command (all but Linux have some version of it). Here is an example of its output from a FreeBSD system experiencing moderate usage on one of its two disks:
The command parameter specifies the interval between reports (and we've omitted the first, summary one,
as usual). The columns headed by disk names are the most useful for our present purposes. They show current disk usage as the number of transfers/sec (tps) and MB/sec.
System V-based systems offer the sar command, and it can be used to monitor disk I/O. Its syntax in this mode is:
$ sar -d interval [count]
interval is the number of seconds between reports, and count is the total number of reports to produce (the
default is one). In general, sar's options specify what data to include in its report. sar is available for AIX, HP-UX, Linux, and Solaris. However, it requires that process accounting be set up before it will return any data.
This report shows the current disk usage on a Linux system:
15.5.2 Getting the Most From the Disk Subsystem
Disk performance is something that more effectively results from installation-time planning and configuration than from after-the-fact tuning. Different techniques are most effective for optimizing different kinds of I/O. This means that you'll need to understand the I/O performed by the applications/typical workload on the system.
There are two sorts of disk I/O:
Sequential access
Data from disk is read in disk block order, one block after another. After the initial seek (head
movement) to the starting point, the speed of this sort of I/O is limited by disk transfer rates.
Random access
Data is read in no particular order. This means that the disk head will have to move frequently to reach the proper data. In this case, seek time is an important factor in overall I/O performance, and you will want to minimize it to the extent possible.
Three major factors affect disk I/O performance in general:
Disk hardware
Data distribution across the system's disks
Data placement on the physical disk
15.5.2.1 Disk hardware
In general, the best advice is to choose the best hardware you can afford when disk I/O performance is an important consideration. Remember that the best SCSI disks are many times faster than the fastest EIDE ones, and also many times more expensive.
These are some other points to keep in mind:
When evaluating the performance of individual disks, consider factors such as the local cache in addition
to quoted peak transfer rates.
Be aware that actual disk throughput will seldom if ever achieve the advertised peak transfer rates. Consider the latter merely as relative numbers useful in comparing different disks.
Musumeci and Loukides suggest using the following formula to estimate actual disk speeds: (sectors-per-track * RPM * 512)/60,000,000. This yields an estimate of the disk's internal transfer rate in MB/sec. However, even this rate will only be achievable via sequential access (and rarely even then).
When random access performance is important, you can estimate the number of I/O operations per second as 1000/(average-seek-time-in-ms + 30000/RPM).
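The two estimates are easy to work through for a hypothetical disk; the figures below (7,200 RPM, 400 sectors per track, 8 ms average seek) are invented for illustration:

```shell
awk 'BEGIN {
  rpm = 7200; seek_ms = 8; sectors_per_track = 400
  # Internal transfer rate: (sectors-per-track * RPM * 512) / 60,000,000 MB/sec
  xfer = (sectors_per_track * rpm * 512) / 60000000
  # Random I/O rate: 1000 / (average seek time + rotational delay), both in ms
  iops = 1000 / (seek_ms + 30000 / rpm)
  printf "%.1f MB/sec sequential, %.0f random ops/sec\n", xfer, iops
}'
```

This prints 24.6 MB/sec sequential, 82 random ops/sec: respectable sequential throughput, but fewer than a hundred random operations per second.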
Don't neglect to consider the disk controller speed and other characteristics when choosing hardware. Fast disks won't perform as well on a mediocre controller.
Don't overload disk controllers. Placing disks on multiple disk controllers is one way to improve I/O throughput rates. In configuring a system, be sure to compare the maximum transfer rate for each disk adapter with the sum of the maximum transfer rates for all the disks it will control; obviously, placing too large a load on a disk controller will do nothing but degrade performance. A more
conservative view states that you should limit total maximum disk transfer rates to 85%-90% of the top controller speed.
Similarly, don't overload system busses. For example, a 32-bit/33 MHz PCI bus has a peak transfer rate
of 132 MB/sec, less than what an Ultra3 SCSI controller is capable of
15.5.2.2 Distributing the data among the available disks
The next issue to consider after a system's hardware configuration is planning data distribution among the available disks: in other words, what files will go on which disk. The basic principle to take into account in such planning is to distribute the anticipated disk I/O across controllers and disks as evenly as possible (in
an attempt to prevent any one resource from becoming a performance bottleneck). In its simplest form, this means spreading the files with the highest activity across two or more disks.
Here are some example scenarios that illustrate this principle:
If you expect most of a system's I/O to come from user processes, distributing the files they are likely
to use across multiple disks usually works better than putting everything on a single disk.
A system intended to support multiple processes with large I/O requirements will benefit from placing the data for different programs or jobs on different disks (and ideally on separate controllers). This minimizes the extent to which the jobs interfere with one another.
For a system running a large transaction-oriented database, ideally you will want to place each of the following item pairs on different disks:
Tables and their indexes
Database data and transaction logs
Large, heavily used tables accessed simultaneously
Given the constraints of an actual system, you may have to decide which of these separations is the
most important.
Of course, placing heavily accessed files on network rather than local drives is almost always a guarantee of poor performance. Finally, it is also almost always a good idea to use a separate disk for the operating system filesystem(s) (provided you can afford to do so) to isolate the effects of the operating system's own I/O operations from user processes.
15.5.2.3 Data placement on disk
The final disk I/O performance factor that we will consider is the physical placement of files on disk. The following general considerations apply to the relationship between file access patterns, physical disk location, and disk I/O performance:
Sequential access of large files (i.e., reading or writing, starting at the beginning and moving steadily toward the end) is most efficient when the files are contiguous: made up of a single, continuous chunk
of space on disk. Again, it may be necessary to rebuild a filesystem to create a large amount of
contiguous disk space.[27] Sequential access performance is highest at the outer edge of the disk (i.e., beginning at cylinder 0) because the platter is the widest at that point (head movement is minimized).
[27] Unfortunately, some disks are too smart for their own good. Disks are free to do all kinds of remapping to improve their concept of disk organization and to mask bad blocks. Thus, there is
no guarantee that what look like sequential blocks to the operating system are actually
sequential on the disk.
Disk I/O to large sequential files also benefits from software disk striping, provided an appropriate stripe size is selected (see Section 10.3). Ideally, each read should result in one I/O operation (or less)
to the striped disk.
Placing large, randomly accessed files in the center portions of disk drives (rather than out at the edges) will yield the best performance. Random data access is dominated by seek times—the time taken to move the disk heads to the correct location—and seek times are minimized when the data is
in the middle of the disk and increase toward the inner and outer edges. AIX allows you to specify the preferred on-disk location when you create a logical volume (see Section 10.3). With other Unix
versions, you accomplish this by defining physical disk partitions appropriately.
Disk striping is also effective for processes performing a large number of I/O operations.
Filesystem fragmentation degrades I/O performance. Fragmentation results when the free space within
a filesystem is made up of many small chunks of space (rather than fewer large ones of the same
aggregate size). This means that files themselves become fragmented (noncontiguous), and access times to reach them become correspondingly longer. If you observe degrading I/O performance on a very full filesystem, fragmentation may be the cause.
Filesystem fragmentation tends to increase over time. Eventually, it may be necessary or desirable to use a defragmenting utility. If none is available, you will need to rebuild the filesystem to reduce fragmentation; the procedure for doing so is discussed in Section 10.3.
15.5.3 Tuning Disk I/O Performance
Some systems offer a few hooks for tuning disk I/O performance. We'll look at the most useful of them in
this subsection.
15.5.3.1 Sequential read-ahead
Some operating systems attempt to determine when a process is accessing data files in a sequential
manner. When it decides that this is the access pattern being used, it attempts to aid the process by
performing read-ahead operations: reading more pages from the file than the process has actually
requested. For example, it might begin by retrieving two pages instead of one. As long as sequential access
of the file continues, the operating system might double the number of pages read with each operation before settling at some maximum value.
The advantage of this heuristic is that data has often already been read in from disk at the time the process asks for it, so much of the process's I/O wait time is eliminated because no physical disk operation need take place.
Maximum number of pages to read ahead. You will want to increase this parameter for striped
filesystems. Good values to try are 8-16 times the number of component drives.
Both parameters must be a power of 2.
15.5.3.1.2 Linux
Linux provides some kernel parameters related to read-ahead behavior. They may be accessed via these
files in /proc/sys/vm:
page-cluster
Determines the number of pages read in by a single read operation. The actual number is computed
as 2 raised to this power. The default setting is 4, resulting in a page cluster size of 16. Large
sequential I/O operations may benefit from increasing this value.
min-readahead and max-readahead
Specify the minimum and maximum pages used for read-ahead. They default to 3 and 31,
respectively.
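These files can be read and written directly; the value written below is an arbitrary example (and writing requires root privileges):

```shell
# Display the current page-cluster setting
cat /proc/sys/vm/page-cluster
# Read ahead 2^5 = 32 pages per operation
echo 5 > /proc/sys/vm/page-cluster
```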
Finally, the Linux Logical Volume Manager allows you to specify the read-ahead size when you create a logical volume with lvcreate, via its -r option. For example, this command specifies a read-ahead size of 8 sectors and also creates a contiguous logical volume:
# lvcreate -L 800M -n bio_lv -r 8 -C y vg1
The valid range for -r is 2 to 120.
15.5.3.2 Disk I/O pacing
AIX also provides a facility designed to prevent general system interactive performance from being adversely affected by large I/O operations. By default, write requests are serviced by the operating system in the order
in which they are made (queued). A very large I/O operation can generate many pending I/O requests, and users needing disk access can be forced to wait for them to complete. This occurs most frequently when an application computes a large amount of new data to be written to disk (rather than processing a data set by reading it in and then writing it back out).
You can experience this effect by copying a large file—32 MB or more—in the background and then running
an ls command on any random directory you have not accessed recently on the same physical disk. You'll notice an appreciable wait time before the ls output appears.
Disk I/O pacing is designed to prevent large I/O operations from degrading interactive performance. It is disabled by default. Consider enabling it only under circumstances like those described.
This feature may be activated by changing the values of the minpout and maxpout system parameters using
the chdev command. When these parameters are nonzero, if a process tries to write to a file for which there
are already maxpout or more pending write operations, the process is suspended until the number of
pending requests falls below minpout.
maxpout must be one more than a multiple of 4: 5, 9, 13, and so on (i.e., of the form 4x+1). minpout must
be a multiple of 4 and at least 4 less than maxpout. The AIX documentation suggests starting with values of
33 and 16, respectively, and observing the effects. The following command will set them to these values:
# chdev -l sys0 -a maxpout=33 -a minpout=16
If interactive performance is still not as rapid as you want it to be, try decreasing these parameters; on the other hand, if the performance of the job doing the large write operation suffers more than you want it to, increase them. Note that their values persist across boots because they are stored in the ODM.
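The current settings can be verified with the standard AIX lsattr command:

```shell
# Display the I/O pacing thresholds recorded in the ODM
lsattr -E -l sys0 -a maxpout -a minpout
```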
15.6 Monitoring and Managing Disk Space Usage
This section looks at the tools available to monitor and track disk space usage. It then goes on to discuss ways of approaching a perennial administrative challenge: getting users to reduce their disk use.
15.6.1 Where Did It All Go?
The df -k command produces a report that describes all the filesystems, their total capacities, and the amount of free space available on each one (reporting sizes in KB). Here is the output from a Linux system:
File system Kbytes used avail capacity Mounted on
/dev/sd0a 7608 6369 478 93% /
/dev/sd0g 49155 45224 0 102% /corp
This output reports the status of two filesystems: /dev/sd0a, the root disk, and /dev/sd0g, the disk mounted
at /corp (containing all files and subdirectories underneath /corp). Each line of the report shows the
filesystem's name, the total number of kilobytes on the disk, the number of kilobytes in use, the number of kilobytes available, and the percentage of the filesystem's storage that is in use. It is evident that both
filesystems are heavily used. In fact, the /corp filesystem appears to be overfull.
As we've noted earlier, the operating system generally holds back some amount of space in each filesystem, allocatable only by the superuser (usually 10%, although Linux uses 5% by default). A filesystem may appear to use over 100% of the available space when it has tapped into this reserve.
The du -k command reports the amount of disk space used by all files and subdirectories underneath one or more specified directories, listed on a per-subdirectory basis (amounts are given in KB).
A typical du report looks like this:
$ du -k -s /home/chavez
34823 /home/chavez
In many cases, this may be all the information you care about.
To generate a list of the system's directories in order of size, execute the command:
$ du -k / | sort -rn
This command starts at the root filesystem, lists the storage required for each directory, and pipes its output to sort. With the -rn options (reverse sort order, sort by numeric first field), sort orders these directories according to the amount of storage they occupy, placing the largest first.
If the directory specified as its parameter is large or has a large number of subdirectories, du can take quite a while to execute. It is thus a prime candidate for automation via scripts and after-hours execution via cron.
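Run as a nightly cron job, such a report is ready for review each morning. Here is a minimal sketch; the schedule, report location, and starting directory are illustrative choices, not from the original text:

```shell
# Illustrative root crontab entry: at 2:00 A.M. every day, list all
# directories under / in descending order of disk usage, saving the
# report for later review. Error output is discarded.
0 2 * * *  du -k / 2>/dev/null | sort -rn > /var/adm/du.report
```

The saved report can then be examined with head to see the largest directories first.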
quot reports the number of kilobytes used by each user in the specified filesystem. It is run as root (to access the disk special files).
15.6.2 Handling Disk Shortage Problems
The commands and scripts we've just looked at will let you know when you have a disk space shortage and where the available space went, but you'll still have to solve the problem and free up the needed space somehow. There is a large range of approaches to solving disk space problems, including the following:
Buy another disk. This is the ideal solution, but it's not always practical.
Mount a remote disk that has some free space on it. This solution assumes that such a disk is available, that mounting it on your system presents no security problems, and that adding additional data to it won't cause problems on its home system.
Eliminate unnecessary files. For example, in a pinch, you can remove the preformatted versions of the manual pages, provided that the source files are also available on your system.
Compress large, infrequently accessed files.
Convince or cajole users into deleting unneeded files and backing up and then deleting old files they are no longer using. If you are successful, a great deal of free disk space usually results. At the same time, you should check the system for log files that can be reduced in size (discussed later in this section).
When gentle pressure on users doesn't work, sometimes peer pressure will. The system administrator on one system I worked on used to mail a list of the top five "disk hogs" (essentially the output of the quot command) whenever disk space was short. I recommend this approach only if you have both a thick skin and a good-natured user community.
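A report of this sort can be approximated without quot's root requirement by summarizing home directories with du. A minimal sketch follows; the scratch directory layout and the mail step are illustrative assumptions, not from the original text:

```shell
# Build a "top five disk hogs" list from per-directory du totals.
# /tmp/homes-demo stands in for /home so the sketch can run anywhere.
HOMES=/tmp/homes-demo
mkdir -p "$HOMES/chavez" "$HOMES/wang"
dd if=/dev/zero of="$HOMES/chavez/data" bs=1024 count=64 2>/dev/null
dd if=/dev/zero of="$HOMES/wang/data" bs=1024 count=8 2>/dev/null

# Largest directories first. In practice the output would be piped to
# mail (e.g., "| mail -s 'disk hogs' staff", a hypothetical alias).
du -sk "$HOMES"/* | sort -rn | head -5
```

On a real system you would point the pipeline at /home (or wherever home directories live) instead of the demonstration directory.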
Some sites automatically archive and then delete user files that haven't been accessed in a certain period of time (often two or three months). If a user wants a file back, he can send a message to the system administration staff, who will restore it. This approach is the most brutal and should be taken only when absolutely necessary. It is fairly common in university environments, but rarely used elsewhere. It's also easy to circumvent by touching all your files every month, and performing system backups may also reset access times on inactive files.
These, then, are some of the alternatives.[29] In most cases, though, when you can't add any disks to the system, the most effective way to solve a disk space problem is to convince users to reduce their storage requirements by deleting old, useless, and seldom (if ever) used files (after backing them up first). Junk files abound on all systems. For example, many text editors create checkpoint and backup files as protection against a user error or a system failure. If these accumulate, they can consume a lot of disk space. In addition, users often keep many versions of files around (noticed most often in the case of program source files), frequently not even remembering what the differences are between them.
[29] There is another way to limit users' disk usage on some systems: disk quotas (discussed later in this section). However, quotas won't help you once the disks are already too full.
The system scratch directory /tmp also needs to be cleared out periodically (as well as any other directories serving a similar function). If your system doesn't get rebooted very often, you'll need to do this by hand. You should also keep an eye on the various system spooling directories under /usr/spool or /var/spool, because files can often become stagnant there.
Unix itself has a number of accounting and logging files that, if left unattended, will grow without bound. As administrator, you are responsible for extracting the relevant data from these files periodically and then truncating them. We'll look at dealing with these sources of wasted space in the following sections.
Under some circumstances, a filesystem's performance can begin to degrade when the filesystem is more than 80%-90% full. Therefore, it is a good idea to take any corrective action before your filesystems reach this level, rather than waiting until they are completely full.
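One way to stay ahead of this is a quick check that flags filesystems above a chosen threshold. A sketch follows; the 85% cutoff is an arbitrary example, and df's column layout varies slightly between systems:

```shell
# Print the mount point and usage of any filesystem more than 85% full.
# Assumes df -k output with the capacity percentage in column 5 and the
# mount point in column 6, as on Linux and BSD-style systems.
df -k | awk 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 > 85) print $6, $5 "%" }'
```

Run from cron, a script like this can mail a warning before a filesystem fills completely.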
15.6.2.1 Using find to locate or remove wasted space
The find command may be used to locate potential candidates for archival and deletion (or just deletion) in the event of a disk space shortage. For example, the following command prints all files with names beginning with .BAK or ending with a tilde, the formats for backup files from two popular text editors:
$ find / -name ".BAK.*" -o -name "*~" -print
As we've seen, find can also delete files automatically. For example, the following command deletes all editor backup files over one week old:
# find / /bio /corp -atime +7 \( -name ".BAK.*" \
-o -name "*~" \) -type f -xdev -exec rm -f {} \;
When using find for automatic deletion, it pays to be cautious. That is why the previous command includes the -type and -xdev options and lists each filesystem separately. With the cron facility, you can use find to produce a list of files subject to deletion nightly (or to delete them automatically).
Another tactic is to search the filesystem for duplicate files. This will require writing a script, but you'll be amazed at how many you'll find.
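Such a script can be sketched with standard tools: files whose checksum and size both match are almost certainly identical. The demonstration directory and files below are illustrative, and this simple version does not handle filenames containing spaces:

```shell
# Report pairs of files with identical contents under a directory tree.
# cksum prints "checksum size filename"; sorting brings identical files
# together so awk can compare each line with its predecessor.
DIR=/tmp/dup-demo
mkdir -p "$DIR"
echo "identical content" > "$DIR/a"
echo "identical content" > "$DIR/b"
echo "unique content"    > "$DIR/c"

find "$DIR" -type f -exec cksum {} \; |
  sort -n |
  awk '$1 == prev1 && $2 == prev2 { print last, $3 }
       { prev1 = $1; prev2 = $2; last = $3 }'
```

Pointed at a home-directory tree, a script along these lines turns up surprising amounts of redundant data.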
15.6.2.2 Limiting the growth of log files
The system administrator is responsible for reaping any data needed from log files and keeping them to a reasonable size. The major offenders include these files:
The various system log files in /usr/adm or /var/adm, which may include sulog, messages, and other files set up via /etc/syslog.conf.
Accounting files in /usr/adm or /var/adm, especially wtmp and acct (BSD) or pacct (System V). Also, under System V, the space consumed by the cumulative summary files and ASCII reports in /var/adm/acct/sum and /var/adm/acct/fiscal is worth monitoring.
Subsystem log files: many Unix facilities, such as cron, the mail system, and the printing system, keep their own log files.
Under AIX, the files smit.log and smit.script in users' home directories are appended to every time someone runs SMIT. They become large very quickly. You should watch the ones in your own and root's home directories (if you su to root, the files still go into your own home directory). Alternatively, you could run the smit command with the -l and -s options (which specify the log and script filenames, respectively) and set both filenames to /dev/null. Defining an alias is the easy way to do so:
alias smit="smit -l /dev/null -s /dev/null" bash/ksh
alias smit "smit -l /dev/null -s /dev/null" csh/tcsh
There are several approaches to controlling the growth of system log files. The easiest is to truncate them by hand when they become large. This is advisable only for ASCII (text) log files. To reduce a file to zero length, use a command such as:
# cat /dev/null > /var/adm/sulog
Copying from the null device into the file is preferable to deleting the file, because in some cases the subsystem won't recreate the log file if it doesn't exist. It's also preferable to rm followed by touch, because the file ownerships and permissions remain correct and because it releases the disk space immediately.
To retain a small part of the current logging information, use tail to save the end of the file and then copy the result back over the original.
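For instance, the following sketch keeps only the most recent 100 lines. The log pathname here is a scratch demonstration file; substitute the real log (e.g., /var/adm/sulog) in practice:

```shell
# Trim a log file to its last 100 lines in place. Copying back through
# cat (rather than mv) preserves the file's inode, ownership, and
# permissions. LOG is a scratch demo file standing in for a real log.
LOG=/tmp/sulog.demo
seq 1 500 | sed 's/^/log entry /' > "$LOG"   # fabricate a 500-line log

tail -n 100 "$LOG" > "$LOG.keep"
cat "$LOG.keep" > "$LOG"
rm -f "$LOG.keep"
```
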
AIX provides the skulker script (stored in /usr/sbin) to perform some of these filesystem cleanup operations, including the following:
Clearing the queueing system spooling areas of old, junk files
Clearing /tmp and /var/tmp of all files over one day old.
Deleting old news files (over 45 days old)
Deleting a variety of editor backup files, core dump files, and random executables (named a.out). You may want to add to the list of file types.
The system comes set up to run skulker every day at 3 A.M. via cron, but the crontab entry is commented out. If you want to run skulker, you'll need to remove the comment character from the skulker line in root's crontab file.
15.6.3 Controlling Disk Usage with Disk Quotas
Disk space shortages are a perennial problem on all computers. For systems where direct control over how much disk space each user consumes is essential, disk quotas may provide a solution.
The disk quota system allows an administrator to limit the amount of filesystem storage that any user can consume. If quotas are enabled, the operating system will maintain separate quotas for each user's disk space and inode consumption (equivalent to the total number of files he owns) on each filesystem.
There are two distinct kinds of quota: a hard limit and a soft limit. A user is never allowed to exceed his hard limit, under any circumstances. When a user reaches his hard limit, he'll get a message that he has exceeded his quota, and the operating system will refuse to allocate any more storage. A user may exceed the soft limit for a limited period of time; in such cases, he gets a warning message, and the operating system grants the request for additional storage. If his disk usage still exceeds the soft limit at the next login, the message will be repeated. He'll continue to receive warnings at each successive login until either:
He reduces his disk usage to below the soft limit, or
He's been warned a fixed number of times (or for a specified period of time, depending on the implementation). At this point, the operating system will refuse to allocate any more storage until the user deletes enough files that his disk usage again falls below his soft limit.
The disk quota system has been designed to let users have large temporary files, provided that in the long term they obey a much stricter limit. For example, consider a user with a hard limit of 15,000 blocks and a soft limit of 10,000 blocks. If this user's storage ever exceeds 15,000 blocks, the operating system will refuse to allocate any more storage immediately; he will need to free some storage before he can save any more files. If this user's storage exceeds 10,000 blocks, he'll get a warning, but requests for more disk space will still be honored. However, if this user does not reduce his storage below 10,000 blocks, the operating system will eventually refuse to allocate any additional storage until it does fall below 10,000 blocks.
If you decide to implement a quota system, you must determine which filesystems need quotas. In most situations, the filesystems containing user home directories are appropriate candidates for quotas. Filesystems that are reserved for public files (for example, the root filesystem) probably shouldn't use quotas. The /tmp filesystem doesn't usually have quotas because it's designed to provide temporary scratch space.
Many operating systems require quotas to be enabled in the kernel, and many kernels do not include them by default. Check your kernel configuration before attempting to use quotas.
15.6.3.1 Preparing filesystems for quotas
After deciding which filesystems will have quotas, you'll need to edit the corresponding entries in the filesystem configuration file (usually /etc/fstab) to indicate that quotas are in use, by editing the options field as in this example:
/dev/dsk/c0t3d0s0 /1 ufs 2 yes rw,logging,quota
See Section 10.2 for full details on the filesystem configuration file on the various systems.
On AIX systems, add a line like the following to the filesystem's stanza in /etc/filesystems:
quota = userquota,groupquota
Include the userquota keyword for standard disk quotas and the groupquota keyword for group-based disk quotas (described in the final part of this section).
Next, make sure that there is a file named quotas in the top-level directory of each filesystem for which you want to establish quotas. If the file does not exist, create it with the touch command:[31]
[31] This is not always required by recent quota system implementations, but it won't hurt either.
# cd /chem
# touch quotas
# chmod 600 quotas
The file must be writable by root and no one else.
15.6.3.2 Setting users' quota limits
Use the edquota command to establish filesystem quotas for individual users. The command is invoked with one or more usernames:
# edquota username(s)
When you execute this command, edquota creates a temporary file containing the hard and soft limits on each filesystem for each user. After creating the file, edquota invokes an editor so you can modify it (by default, vi; you can use the environment variable EDITOR to specify your favorite editor). Each line in this file describes one filesystem. The format varies somewhat; here is an example:
/chem: blocks in use: 13420, limits (soft=20000, hard=30000)
inodes in use: 824, limits (soft=0, hard=0)
This entry specifies quotas for the /chem filesystem; by editing it, you can add hard and soft limits for this user's total disk space and inode usage (total number of files). Setting a quota to 0 disables that quota. The example specifies a soft quota of 20,000 disk blocks, a hard quota of 30,000 disk blocks, and no quotas on inodes. Note that the entry in the temporary file does not indicate anything about the user(s) to which these quotas apply; quotas apply to the user specified when you executed the edquota command. When you list more than one user on the command line, you will edit a file for each one of them in turn.
After you save the temporary quota file and exit the editor (using whatever commands are appropriate for the editor you are using), edquota modifies the quotas files themselves. These files cannot be edited directly.
The -p option to edquota lets you copy quota settings between users. For example, the following command applies chavez's quota settings to users wang and harvey:
# edquota -p chavez wang harvey
15.6.3.3 Setting the soft limit expiration period
edquota's -t option is used to specify the system-wide time limit for soft quotas. Executing edquota -t starts an editor session something like this one:
Time units may be: days, hours, minutes, or seconds
Grace period before enforcing soft limits for groups:
/chem: block grace period: 3 days, file grace period: 0 days
A value of zero days indicates that the default value is in effect (usually seven days). You can specify the time period in other units by changing days to one of the other listed keywords. Some implementations also allow the grace period to be specified in months, although a grace period that long rather defeats the purpose of using disk quotas in the first place.
15.6.3.4 Enabling quota checking
The quotaon command activates the quota system:
# quotaon filesystem
# quotaon -a
The first command enables the quota system for the specified filesystem. The latter enables quotas on all filesystems listed with quotas in the filesystem configuration file. For example, the following command enables quotas for the /chem filesystem:
enables quotas for the /chem filesystem:
# quotaon /chem
Similarly, the quotaoff command disables quotas. It can be used with the -a option to disable all quotas, or with a list of filesystem names.
15.6.3.5 Quota consistency checking
The quotacheck command checks the consistency of the quotas file for the filesystem specified as its argument. It verifies that the quota files are consistent with current actual disk usage. This command should be executed after you install or modify the quota system. If used with the option -a, quotacheck checks all filesystems designated as using quotas in the filesystem configuration file.
To have these commands run at boot time on AIX systems, you must add them to one of the system boot scripts. The other Unix versions run them automatically, via these boot scripts:
FreeBSD: /etc/rc (if check_quotas="yes" in /etc/rc.conf)
HP-UX: /sbin/init.d/localmount
Linux: /etc/init.d/quota (SuSE 7: if START_QUOTA="yes" in /etc/rc.config)
Solaris: /etc/init.d/MOUNTFS and ufs_quota
Tru64: /sbin/init.d/quota (if QUOTA_CONFIG="yes" in /etc/rc.config)
15.6.3.6 Disk quota reports
The repquota command reports the current quotas for one or more specified filesystems. Here is an example of the reports generated by repquota:
# repquota -v /chem
*** Report for user quotas on /chem (/dev/sd1d)
                 Block limits                  File limits
User        used   soft   hard  grace     used  soft  hard  grace
chavez     13420  20000  25000             824     0     0
chen    +-  2436   2000   3000  2days        8     0     0
The plus sign in the entry for user chen indicates that he has exceeded his disk quota.
Users can use the quota command to determine where their current disk usage falls with respect to their disk quotas.
15.6.3.7 Group-based quotas (AIX, FreeBSD, Tru64 and Linux)
AIX, FreeBSD, Tru64, and Linux extend standard disk quotas to Unix groups as well as individual users. Specifying the -g option to edquota causes names on the command line to be interpreted as group names rather than as usernames. Similarly, edquota -t -g allows you to specify the soft limit timeout period for group quotas.
By default, the quotaon, quotaoff, quotacheck, and repquota commands operate on both user and group quotas. You can specify the -u and -g options to limit their scope to only user quotas or only group quotas, respectively. Users must use the following form of the quota command to determine the current status of group quotas:
$ quota -g chem
For example, this command will report the disk quota status for the group chem. Users may query the disk quota status only for groups of which they are a member.
15.7 Network Performance
This section concludes our look at performance monitoring and tuning on Unix systems. It contains a brief introduction to network performance, a very large topic whose full treatment is beyond the scope of this book. Consult the work by Musumeci and Loukides for further information.
15.7.1 Basic Network Performance Monitoring
The netstat -s command displays cumulative network statistics. You can limit the display to a single network protocol via the -p option, as in this example from an HP-UX system:
$ netstat -s -p tcp Output shortened.
tcp:
178182 packets sent
111822 data packets (35681757 bytes)
30 data packets (3836 bytes) retransmitted
66363 ack-only packets (4332 delayed)
337753 packets received
89709 acks (for 35680557 bytes)
349 duplicate acks
0 acks for unsent data
284726 packets (287618947 bytes) received in-sequence
0 completely duplicate packets (0 bytes)
3 packets with some dup, data (832 bytes duped)
11 out of order packets (544 bytes)
5 packets received after close
The output gives statistics since the last boot.[32]
[32] Or most recent counter reset, if supported
Network operations are proceeding nicely on this system. Lines such as those reporting retransmitted packets, duplicate acks, and out-of-order packets are among those that would indicate transmission problems if their values rose to appreciable percentages of the total network traffic.
More detailed network performance data can be determined via the various network monitoring tools we considered in Section 8.6.
15.7.2 General TCP/IP Network Performance Principles
Good network performance depends on a combination of several components working properly and efficiently. Performance problems can arise in many places and take many forms. These are among the most common:
Network interface problems, including insufficient speed and high error rates due to failing or misconfigured hardware. This sort of problem shows up as poor performance and/or many errors on a particular host.
Network adapters, hubs, switches, and network devices in general seldom fail all at once, but rather produce increasing error rates and/or degrading performance over time. These metrics should be monitored regularly to spot problems before they become severe. Degradation can also occur due to aging drop cables.
Hardware device setup errors, including half/full duplex mismatches, cause high error and collision rates and result in hideous performance.
Overloaded servers can also produce poor network response. Servers can have several kinds of shortfalls: too much traffic for the interface to handle, too little memory for the network workload (or an incorrect configuration), and insufficient disk I/O bandwidth. The server's performance will need to be investigated to determine which of these are relevant (and hence where the most attention to the problem should be paid).
Insufficient network bandwidth for the workload. You can recognize such situations by the presence of slow response and/or significant timeouts on systems throughout the local network, which is not alleviated by the addition of another server system. The best solution to such problems is to use high-performance switches. If this is not possible, another, much less desirable, solution is to divide the network into multiple subnets that separate systems requiring distinct network resources from one another.
All of these problem types are best addressed by correcting or replacing hardware and/or reallocating resources, rather than by configuration-level tuning.
15.7.2.1 Two TCP parameters
TCP operations are controlled by a very large number of parameters. Most of them should not be modified by nonexperts. In this subsection, we'll consider two that are most likely to produce significant improvements with little risk.
The maximum segment size (MSS) determines the largest "packet" size that the TCP protocol will transmit across the network. (The actual size will be 40 bytes larger due to the IP and TCP headers.) Larger segments result in fewer transmissions to transfer a given amount of data and usually provide correspondingly better performance on Ethernet networks.[33] For Ethernet networks, the maximum allowed size, 1460 bytes (1500 minus 40), is usually appropriate.[34]
[33] Note that this will often not be the case for slow network links, especially for applications that are very sensitive to network transmission latencies.
[34] When is it inappropriate? When the headers are larger than the minimum and using a size this large causes packet fragmentation and its resultant overhead. For example, a value of 1200-1300 is more appropriate when, say, the PPP over Ethernet protocol is used, as would be the case on a web server accessed by cable modem users.
Socket buffer sizes: When an application sends data across the network via the TCP protocol, it is first placed in a buffer. From there, the protocol divides it as needed and creates segments for transmission. Once the buffer is full, the application generally must wait for the entire buffer to be transmitted and acknowledged before it is allowed to queue additional data.
On faster networks, a larger buffer size can improve application performance. The tradeoff is that each buffer consumes memory, so the system must have sufficient available memory resources to accommodate all of the buffers for (at least) the usual network load. For example, using read and write socket buffers of 32 KB for each of 500 network connections would require approximately 32 MB of memory on the network server (32 x 2 x 500 KB). This would not be a problem on a dedicated network server but might be an issue on busy, general-purpose systems.
On current systems with reasonable memory sizes and no other applications with significant memory requirements, socket buffer sizes of 48 to 64 KB are usually reasonable.
Table 15-7 lists the relevant parameters for each of our Unix versions, along with the commands that may be used to modify them.
Table 15-7. Important TCP parameters (socket buffer defaults in KB, MSS defaults in bytes)

AIX: no -o param=value
    Socket buffers: tcp_sendspace [16], tcp_recvspace [16]
    MSS: tcp_mssdflt [512]
FreeBSD: sysctl param=value (also /etc/sysctl.conf)
    Socket buffers: net.inet.tcp.sendspace [32], net.inet.tcp.recvspace [64]
    MSS: net.inet.tcp.mssdflt [512]
HP-UX: ndd -set /dev/tcp param value (also /etc/rc.config.d/nddconf)
    Socket buffers: tcp_recv_hiwater_def [32], tcp_xmit_hiwater_def [32]
Linux: sysctl param=value (also /proc/sys/net)
    Socket buffers: rmem_max [64], wmem_max [64], tcp_rmem (min, default, max)
Tru64: sysconfig -r inet param=value (also /etc/sysconfigtab)
    Socket buffers: tcp_sendspace [60], tcp_recvspace [60]
    MSS: tcp_mssdflt [536]
The remaining sections will consider performance issues associated with two important network subsystems: DNS and NFS.
15.7.3 DNS Performance
DNS performance is another item that is easiest to affect at the planning stage. The key issues with DNS are:
Sufficient server capacity to service all of the clients
Balancing the load among the available servers
At the moment, the latter is best accomplished by specifying different name server orderings within the /etc/resolv.conf files on groups of client systems. It is also helpful to provide at least one DNS server on each side of slow links.
Careful placement of forwarders can also be beneficial. At larger sites, a two-tiered forwarding hierarchy may help to channel external queries through specific hosts and reduce the load on other internal servers.
Finally, use separate servers for handling internal and external DNS queries. Not only will there be performance benefits for internal users, it is also the best security practice.
DNS itself can also provide a very crude sort of load balancing via the use of multiple A records in a zone file, as in this example:
docsrv IN A 192.168.10.1
IN A 192.168.10.2
IN A 192.168.10.3
These records define three servers with the hostname docsrv. Successive queries for this name will receive each IP address in turn.[35]
[35] Actually, each query will receive each IP address as the first entry in the list that is returned. Most clients pay attention only to the top entry.
This technique is most effective when the operations requested from the servers are all essentially equivalent, so that a simple round-robin distribution of them is appropriate. It will be less successful when requests can vary greatly in size or resource requirements. In such cases, manually assigning servers to the various clients will work better. You can do so by editing the nameserver entries in /etc/resolv.conf.
15.7.4 NFS Performance
The Network File System is a very important Unix network service, so we'll complete our discussion of performance by considering some of its performance issues.
Monitoring NFS-specific network traffic and performance is done via the nfsstat command. For example, nfsstat -c lists NFS client statistics, including the badxids and timeouts counters. If either of these values is appreciable, there is probably an NFS bottleneck somewhere. If badxids is within a factor of, say, 6-7 of timeouts, the responsiveness of the remote NFS server is the source of the client's performance problems. On the other hand, if there are many more timeouts than badxids, then general network congestion is to blame.
The nfsstat command's -s option is used to obtain NFS server statistics:
Server nfs V2: (54231 out of 59077 calls)
null getattr setattr root lookup readlink read
Server nfs V3: (4846 out of 59077 calls)
null getattr setattr lookup access readlink read
15.7.4.1 NFS Version 3 performance improvements
Many Unix systems are now providingNFS Version 3 instead of or in addition to Version 2 NFS Version 3 hasmany benefits in several areas; reliability, security, performance are among them The following are themost important improvements provided by NFS Version 3:
TCP versus UDP: Traditionally, NFS uses the UDP transport protocol. NFS Version 3 uses TCP as its default transport protocol.[36] Doing so provides NFS operations with both flow control and packet-level retransmission; by contrast, when using UDP, any network failure requires that the entire operation be repeated. Thus, using TCP often results in smaller performance hits when there are problems.
[36] Some NFS Version 2 implementations can also optionally use TCP instead of UDP.
Two-phase writes: Previously, NFS write operations were performed synchronously, meaning that a client had to wait for each write operation to be completed before starting another one. Under NFS Version 3, write operations are performed in two parts:
The client queues a write request, which the server acknowledges immediately. Additional write operations can be queued once the acknowledgement is received.
The client commits the write operation (possibly after some intermediate modifications), and the server commits it to disk (or requests its retransmission if the data is no longer available, e.g., if there was an intervening system crash).
Larger data blocks: The maximum data block size is increased (the previous limit was 8 KB). The actual maximum value is determined by the transport protocol; for TCP, it is 32 KB. In addition to reducing the number of packets, a larger block size can result in fewer disk seeks and faster sequential file access. The effect is especially noticeable with high-speed networks.
15.7.4.2 NFS performance principles
The following points are important to keep in mind with respect to NFS server performance, especially in the planning stages:
Mounting NFS filesystems in the background (i.e., with the bg option) will speed up boots.
Use an appropriate number of NFS daemon processes. The rule of thumb is 2 per expected simultaneous client process. Conversely, if there are idle NFS daemons on a server, you can reduce their number and release their (albeit small) memory resources.
Very busy NFS servers will benefit from a multiprocessor computer. CPU resources are almost never an issue for NFS, but the context switches generated by very large numbers of clients can be significant.
Don't neglect the usual system memory and disk I/O performance considerations, including the size of the buffer cache, filesystem fragmentation, and data distribution across disks.
NFS searches remote directories sequentially, entry by entry, so avoid remote directories with large numbers of files.
Remember that not every task is appropriate for remote files. For example, compiling a program such that the object files are written to a remote filesystem will run very slowly indeed. In general, source files may be remote, but object files and executables should be created on the local system. More generally, for best network performance, avoid writing large amounts of data to remote files (although you may want to sacrifice disk and network I/O performance in order to use the CPU resources of a fast remote system).
Resources for You
After all of this discussion of system resources, it's worth spending a little time considering some for yourself. Resources for system administrators come in many varieties: books and magazines, web sites and newsgroups, conferences and professional organizations, and humor and fun (all work and no play won't do anything positive for your performance).
Here are some of my favorites:
An excellent Unix internals book: UNIX Internals: The New Frontier by Uresh Vahalia (Prentice-Hall).
Sys Admin magazine, http://www.sysadminmag.com
Useful web sites: http://www.ugu.com, http://www.lwn.net, http://www.slashdot.com (thelast for news and rumors)
LISA: an annual conference for system administrators run by Usenix and SAGE (see http://www.usenix.org/events).
The UNIX-HATERS Handbook, ed. Simson Garfinkel, Daniel Weise, and Steve Strassmann (IDG Books). This is still the funniest book I've read in a long time. You can expect to waste a few hours at work if you start reading it there, because you won't be able to put it down.
Chapter 16 Configuring and Building Kernels
As we've noted many times before, the kernel is the heart of the Unix operating system. It is the core program, always running while the operating system is up, providing and overseeing the system environment. The kernel is responsible for all aspects of system functioning, including:
Process creation, termination and scheduling
Virtual memory management (including paging)
Device I/O (via interfaces with device drivers: modules that perform the actual low-level
communication with physical devices such as disk controllers, serial ports, and network adapters)
Interprocess communication (both local and network)
Enforcing access control and other security mechanisms
Traditionally, the Unix kernel is a single, monolithic program. On more recent systems, however, the trend has been toward modularized kernels: small core executable programs to which additional, separate object or executable files (modules) can be loaded and/or unloaded as needed. Modules provide a convenient way to support a new device type or add specific new functionality to an existing kernel.
In many instances, the standard kernel program provided with the operating system works perfectly well for the system's needs. There are a few circumstances, however, where it is necessary to create a custom kernel (or perform equivalent customization activities) to meet the special needs of a particular system or environment. Some of the most common are:
To add capabilities to the kernel (e.g., support for disk quotas or a new filesystem type)
To add support for new devices
To remove unwanted capabilities/features from the kernel to reduce its size and resource consumption (mostly memory) and thereby presumably improve system performance
To change the values of hardwired kernel parameters that cannot be modified dynamically
How often you have to build a new kernel depends greatly on which system you are administering. On some older systems (mid-1990s versions of SCO Unix come to mind), you had to build a new kernel any time you added even the smallest, most insignificant new device or capability to the system. On most current systems, such as FreeBSD and Tru64, you build a kernel only when you want to significantly alter the system configuration. And on a few systems, like Solaris and especially AIX, you may never have to do so.
In this chapter, we'll look at the process of building a customized kernel, and we'll also examine administering kernel modules. There are many reasons you might want to alter the standard kernel: addressing performance issues, supporting a new device or subsystem, removing features the system doesn't use (in an effort to make the kernel smaller), adjusting the operating system's behavior and resource limits, and so on. We won't be able to go into every possible change you might make on each of the systems we are considering. Instead, we'll look at the general process you go through to make a kernel, including how to install it and boot from it and how to back out your changes should they prove unsatisfactory.
NOTE
Custom kernel building and reconfiguration is not for the faint-hearted, the careless, or the ignorant. Know what you're doing, and why, to avoid inadvertently making your system unusable.
In general, building a custom kernel consists of these steps:
Installing the kernel source code package (if necessary)
Applying any patches, adding new device driver code, and/or making any other source code changes you may require
Saving the current kernel and its associated configuration files
Modifying the current system configuration as needed
Building a new kernel executable image
Building any associated kernel modules (if applicable)
Installing and testing the new kernel
Table 16-1 lists the kernel locations and kernel build directories for the operating systems we are considering. (Only the rows recoverable from this chapter's text are shown here with certainty; the AIX and Linux entries reflect the standard locations for those systems in this era.)

Table 16-1. Standard kernel image and build directory locations

Unix version    Kernel image                    Build directory
AIX             /unix                           none
FreeBSD         /kernel                         /usr/src/sys/i386[1]/conf
HP-UX           /stand/vmunix                   /stand/build
Linux           /boot/vmlinuz                   /usr/src/linux
Solaris         /kernel/unix (or genunix[2])    none
Tru64           /vmunix (or genvmunix[2])       /usr/sys/conf

[1] This component is architecture-specific; i386 is the generic subdirectory for Intel-based PCs. If you're running on a more recent CPU type, building a kernel for that specific processor may improve the operating system's performance.

[2] The gen forms are the generic, hardware-independent versions of the kernel.
We'll begin with the kernel build process on FreeBSD and Tru64 systems (which are very similar) and then consider each of the other environments in turn. In each case, we will also consider other mechanisms that are available for configuring the kernel and/or kernel modules.
16.1 FreeBSD and Tru64
Tru64 and FreeBSD use an almost identical process for building a customized kernel. They rely on a configuration file for specifying which capabilities to include within the kernel and for setting the values of various system parameters. The configuration file is located in /usr/sys/conf on Tru64 systems and in /usr/src/sys/arch/conf under FreeBSD, where arch is an architecture-specific subdirectory (we'll use i386 as an example).
Configuration filenames are conventionally all uppercase, and the directory typically contains several different configuration files. The one used to build the current kernel is usually indicated in the /etc/motd file. For example, the GENERIC file was used to build the kernel on this FreeBSD system:
FreeBSD 4.3-RELEASE (GENERIC) #0: Sat Apr 21 10:54:49 GMT 2001
Default Tru64 configuration files are often named GENERIC or sometimes ALPHA.
On FreeBSD systems, you will first need to install the kernel sources if you have not already done so:
FreeBSD
# cd /
# mkdir -p /usr/src/sys If not already present.
# mount /cdrom
# cat /cdrom/src/ssys.[ad]* | tar xzvf -
To add a device to a Tru64 system, you must boot the generic kernel, /genvmunix, to force the system to
recognize and create configuration information for the new device:
# emacs NEWKERN [/tmp/NEWDEVS]
The GENERIC configuration file is the standard, hardware-independent version provided with the operating system. If you have already customized the kernel, you would start with the corresponding configuration file.
While editing the new configuration file, add (or activate) lines for new devices or features, disable or comment out lines for services you don't want to include, and specify the values for any applicable kernel parameters. In general, it's unlikely that you'll need to modify the contents of hardware device-related entries. The one exception is the ident entry, which assigns a name to the configuration. You should change its value to correspond to the name you have selected:
ident NEWKERN
You may also occasionally remove unneeded subsystems by commenting out the corresponding option's entry, as in this example, which disables disk quotas:
#options QUOTA Tru64
On Tru64 systems, you will need to merge in any new device lines from the file created by the sizer command (placed into /tmp), indicated by the optional second parameter to the Tru64 emacs command above. One way to locate these device lines is to diff that file against your current kernel configuration file or the GENERIC file.
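For instance, assuming the illustrative /tmp/NEWDEVS filename from the emacs command shown earlier, a comparison like this flags the new device lines:

```shell
# diff /tmp/NEWDEVS /usr/sys/conf/GENERIC
```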
The FreeBSD configuration file contains a large number of settings, most of them corresponding to hardware devices and their characteristics. In addition, there are several entries specifying the values of various kernel parameters that might need to be altered in some circumstances.[3]
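Entries like the following set two such parameters; the values shown are purely illustrative (the LINT or NOTES file documents the available options and their defaults):

```
maxusers        64
options         NMBCLUSTERS=4096
```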
[3] Many kernel parameters can also be modified via the sysctl command and its initialization file (see Section 15.4).
You can examine the LINT or NOTES configuration file for documentation on most available parameters.
The next step in the kernel build process is to run the command that creates a custom build area for the new configuration:
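On FreeBSD the command is config and on Tru64 it is doconfig; here is a sketch of typical invocations, assuming the NEWKERN configuration name used earlier (paths are illustrative):

```shell
# /usr/sbin/config NEWKERN FreeBSD: run in the conf directory.
# cd ../../compile/NEWKERN
# make depend && make

# doconfig -c NEWKERN Tru64.
```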
doconfig and config create the NEWKERN subdirectory, where the new kernel is actually built. Once the make commands complete, the new kernel may be installed in the root directory and tested.
If there are problems building the new kernel, you can boot the saved version with these commands:
disk1s1a:> unload FreeBSD boot loader.
disk1s1a:> load kernel.save
disk1s1a:> boot

>>> boot -fi vmunix.save Tru64 console.
16.1.1 Changing FreeBSD Kernel Parameters
FreeBSD also allows many kernel parameters to be changed dynamically. The sysctl command can be used to list all kernel parameters along with their current values, and a variant of the same command modifies a parameter value.
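For example (the parameter name and value here are only illustrative; on FreeBSD 4.x, setting a value requires the -w option):

```shell
# sysctl -a | more List all parameters and their current values.
# sysctl kern.maxfiles Display a single parameter.
# sysctl -w kern.maxfiles=16384 Set a new value.
```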
16.1.2 FreeBSD Kernel Modules
FreeBSD also provides support for kernel modules; you can compile them via the corresponding subdirectories in /usr/src/sys/modules. The kldstat -v command displays a list of currently loaded kernel modules. Virtually all are used for supporting devices or filesystem types. You can load and unload kernel modules manually with the kldload and kldunload commands.
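For example (ipfw is just an illustrative module name; any module built under /usr/src/sys/modules works the same way):

```shell
# kldstat -v | head Show currently loaded modules.
# kldload ipfw Load a module.
# kldunload ipfw Unload it again.
```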
The file /boot/loader.conf specifies modules that should be loaded at boot time:
userconfig_script_load="YES" Line created by sysinstall.
usb_load="YES" Load USB modules.
ums_load="YES"
umass_load="YES"
Of course, you need to create the required modules before they can be autoloaded.
16.1.3 Installing the FreeBSD Boot Loader
Generally, the FreeBSD boot loader is installed by default in the Master Boot Record (MBR) of the system disk. However, should you ever need to, you can install it manually with this command:
# boot0cfg -B /dev/ad0
The -B option says to leave the partition table unaltered.
You can also use this command's -m option to prevent certain partitions from appearing in the boot menu. This option takes a hexadecimal integer as its argument. The value is interpreted as a bit mask that includes (bit is on) or excludes (bit is off) each partition from the menu (provided that it is a BSD partition in the first place). The ones bit in the mask corresponds to the first partition, and so on.
For example, the following command enables only partition 3 to be listed in the menu:
# boot0cfg -B -m 0x4 /dev/ad0
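The mask arithmetic is easy to check with ordinary shell arithmetic. This sketch is independent of boot0cfg itself; it simply decodes the mask 0x4, whose only set bit is the third, so only partition 3 is listed:

```shell
#!/bin/sh
# Decode a boot-menu mask: bit N set => partition N is listed.
mask=0x4
shown=""
for part in 1 2 3 4; do
  # Shift the mask right so the bit for this partition is in the ones place.
  if [ $(( (mask >> (part - 1)) & 1 )) -eq 1 ]; then
    shown="$shown$part "
    echo "partition $part: listed in boot menu"
  else
    echo "partition $part: hidden"
  fi
done
```

A mask of 0x5 (binary 101) would instead list partitions 1 and 3.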
You can use the disklabel command with its -B option to install the boot program into a subpartition within a physical disk partition, as in this example, which installs it into the first subpartition in the first partition:
# disklabel -B /dev/ad0s1
16.1.4 Tru64 Dynamic Kernel Configuration
Tru64 also supports two sorts of kernel reconfiguration without needing to build a new kernel: subsystem loading and unloading, and kernel parameter modification.
A very few subsystems may be dynamically loaded into and unloaded from the Tru64 kernel. You can list all configured subsystems using the sysconfig command:
# sysconfig -s
cm: loaded and configured
hs: loaded and configured
ksm: loaded and configured
Subsystems can be loaded or unloaded. The -m option displays whether each one is dynamic (loadable and unloadable with a running kernel) or static:
# sysconfig -m | grep dynamic
hwautoconfig: dynamic
envmon: dynamic
lat: dynamic
On this system, only three subsystems are dynamic. For these modules, you can use the sysconfig -c and -u options to load and unload them, respectively.
Static and dynamic subsystems can also have settable kernel parameters associated with them. You can view the list of available parameters with a command like this one:
# sysconfig -Q lsm Parameters for the Logical Storage Manager
lsm:
Module_Name - type=STRING op=Q min_len=3 max_len=30
lsm_rootdev_is_volume - type=INT op=CQ min_val=0 max_val=2
Enable_LSM_Stats - type=INT op=CRQ min_val=0 max_val=1
The display lists the parameter name, its data type, allowed operations, and valid range of values. The operations are specified via a series of code letters: Q means the parameter can be queried, C means a change takes effect after reboot, and R means a change takes effect on a running system.
In our example, the first parameter (the name of the module) can be queried but not modified; the second parameter (whether the root filesystem is a logical volume) can be modified, but the new value won't take effect until the system reboots; and the third parameter (whether subsystem statistics are recorded) takes effect as soon as it is changed.
You use the -q option to display the current value of a parameter and the -r option to change its value:
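For example, using the Enable_LSM_Stats attribute shown above (the assigned value is purely illustrative; this attribute's R code means the change takes effect immediately):

```shell
# sysconfig -q lsm Enable_LSM_Stats Display the current value.
# sysconfig -r lsm Enable_LSM_Stats=1 Change it on the running system.
```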
The /etc/sysconfigtab file can be used to set kernel parameters at boot time (see Section 15.4).
If you prefer a graphical interface, the dxkerneltuner utility can also be used to view and modify the values of kernel parameters. The sys_attrs manual page provides descriptions of kernel parameters and their meanings.
16.2 HP-UX
SAM is still the easiest way to build a new kernel under HP-UX. However, you can build one manually if you prefer:[4]
[4] This command is also useful for simply listing the modified variables in the current kernel.
# cd /stand Move to kernel directory.
# mv vmunix vmunix.save Save current kernel.
# cd build Move to build subdirectory.
# /usr/lbin/sysadm/system_prep -v -s system Extract system file.
# kmtune -s var=value -S /stand/build/system Modify kernel parameters.
... (repeat for each variable)
# mk_kernel -s /system -o /vmunix_new Build new kernel.
# kmupdate /stand/build/vmunix_new Schedule kernel install.
# mv /stand/system /stand/system.prev Save old system file.
# mv /stand/build/system /stand/system Install new system file.
The system_prep script creates a new system configuration file by extracting the information from the running kernel. The kmtune command(s) specify the values of kernel variables for the new kernel.
The mk_kernel script calls the config command and initiates the make process automatically. Once the kernel is built, you use the kmupdate command to schedule its installation at the next reboot. You can then reboot to activate it.
If there is a problem with the new kernel, you can boot the saved kernel with a command like the following:
ISL> hpux /stand/vmunix.save
To determine what kernel object files are available, use the following command to list the contents of the
/stand directory:
ISL> hpux ll /stand
The system file contains information about system devices and settings for various kernel parameters. Here are some examples of the latter:
maxfiles_lim 1024 Maximum open files per process.
maxusers 250 Number of users/processes to assume when sizing kernel data structures.
nproc 512
You can also use SAM to configure these parameters and then rebuild the kernel. Figure 16-1 illustrates using SAM to modify a kernel parameter (in this case, the length of the time slice: the maximum period for which a process can execute before being interrupted by the scheduler).