memory that they never actually use, and they may run successfully if this setting is enabled.
Changing parameter values is accomplished by modifying the values stored in these files. For example, the following command changes the settings related to the buffer cache:
# echo "5 33 80" > /proc/sys/vm/buffermem
15.4.2.5 Solaris
On Solaris systems, you can view the values of system parameters via the kstat command. For example, the following command displays system parameters related to paging behavior, including their default values
on a system with 1 GB of physical memory:
# kstat -m unix -n system_pages | grep 'free '
cachefree 1966 Units are pages.
lotsfree 1966
desfree 983
minfree 491
Figure 15-4 illustrates the meanings and interrelationships of these memory levels.
Figure 15-4. Solaris paging and swapping memory levels
As the figure indicates, setting cachefree to a value greater than lotsfree provides a way of favoring
processes' memory over the buffer cache (by default, no distinction is made between them because lotsfree
is equal to cachefree). In order to do so, you should decrease lotsfree to some point between its current level and desfree (rather than increasing cachefree).
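Such Solaris paging parameters are ordinarily made persistent by adding set entries to /etc/system and rebooting; the value shown here is purely illustrative, not a recommendation:

```shell
# /etc/system entry: lower lotsfree toward desfree (value is in pages; example only)
set lotsfree=1200
```

You can confirm the resulting values after a reboot with the kstat command shown above.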
Solaris 9 has changed its virtual memory manager and has eliminated the cachefree variable.
15.4.2.6 Tru64
Tru64 memory management is controlled by parameters in the sysconfig vm subsystem. These are the most
useful parameters:
vm_aggressive_swap: Enable/disable aggressive swapping out of idle processes (0 by default).
Enabling this can provide some memory management improvements on heavily loaded systems, but it
is not a substitute for reducing excess consumption.
There are several parameters that control the conditions under which the memory manager steals pages from active processes and/or swaps out idle processes in an effort to maintain sufficient free memory. They are listed in Figure 15-5 along with their interrelationships and effects.
Figure 15-5. Tru64 paging and swapping memory levels
The default for vm_page_free_min is 20 pages. The value of vm_page_free_target varies with the
memory size; for a system with 1 GB of physical memory, it defaults to 512 pages. The reserved value
is always 10 pages.
The other variables are computed from these values. vm_page_free_swap (and the equivalent
vm_page_free_optimal) is set to the point halfway between the minimum and the target, and
vm_page_free_hardswap is set to about 16 times the target value.
Several parameters relate to the size of the buffer cache. vm_minpercent specifies the percentage of
memory initially used for the buffer cache (the default is 10%). The buffer cache size will increase if memory is available. The parameter ubc_maxpercent specifies the maximum amount of memory that it may use (the default is 100%). When memory is short and the size of the cache corresponds to
ubc_borrowpercent or larger, pages will be returned to the general pool until the cache drops below
this level (and process memory page stealing does not occur). The default for the borrow level is 20%
of physical memory.
On file servers, it will often make sense to increase one or both of the minimum and borrow
percentages (to favor the cache over local processes in memory allocation). On a database server, though, you will probably want to reduce these sizes.
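These Tru64 settings are inspected and adjusted with the sysconfig command; the specific value below is an illustrative assumption, not a recommendation:

```shell
# Query the current values of selected vm-subsystem parameters
sysconfig -q vm ubc_maxpercent ubc_borrowpercent

# Change a value in the running kernel (example value only)
sysconfig -r vm ubc_borrowpercent=30
```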
15.4.3 Managing Paging Space
Specially designated areas of disk are used for paging. On most Unix systems, distinct, dedicated disk
partitions—called swap partitions—are used to hold pages written out from memory. In some recent Unix implementations, paging can also go to special page files stored in a regular Unix filesystem.[26]
[26] Despite their names, both swap partitions and page files can be used for paging and for swapping (on systems supporting virtual memory).
Many discussions of setting up paging space advise using multiple paging areas, spread across different physical disk drives. Paging I/O performance will generally improve the closer you come to this ideal.
However, regular disk I/O also benefits from careful disk placement. It is not always possible to separate both paging space and important filesystems. Before you decide which to do, you must determine which kind of I/O you want to favor and then provide the improvements appropriate for that kind.
In my experience, paging I/O is best avoided rather than optimized, and other kinds of disk I/O deserve far more attention than paging space placement.
15.4.3.1 How much paging space?
There are as many answers to this question as there are people to ask. The correct answer is, of course, "It depends." What it depends on is the type of jobs your system typically executes. A single-user workstation might find a paging area of one to two times the size of physical memory adequate if all the system is used for is editing and small compilations. On the other hand, real production environments running programs with very large memory requirements might need two or even three times the amount of physical memory. Keep in mind that some processes will be killed if all available paging space is ever exhausted (and new processes will not be able to start).
One factor that can have a large effect on paging space requirements is the way that the operating system assigns paging space to virtual memory pages implicitly created when programs allocate large amounts of memory (which may not all be needed in any individual run). Many recent systems don't allocate paging space for such pages until each page is actually accessed; this practice tends to minimize per-process memory requirements and stretch a given amount of physical memory as far as possible. However, other systems assign paging space to the entire block of memory as soon as it is allocated. Obviously, under the latter scheme, the system will need more page file space than under the former.
Other factors that will tend to increase your page file space needs include:
Jobs requiring large amounts of memory, especially if the system must run more than one at a time
Jobs with virtual address spaces significantly larger than the amount of physical memory
Programs that are themselves very large (i.e., have large executables). This often implies the item above, but not vice versa.
A very, very large number of simultaneously running jobs, even if each individual job is fairly small
15.4.3.2 Listing paging areas
Most systems provide commands to determine the locations of paging areas and how much of the total space is currently in use:
(Table: per-version commands to list paging areas and to show current usage)
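As a rough per-version guide (command names and options vary slightly between releases; verify against your system's manual pages):

```shell
lsps -a            # AIX: list paging spaces and their usage
pstat -s           # FreeBSD: show swap device usage
swapinfo -t        # HP-UX: report swap configuration and usage
cat /proc/swaps    # Linux: list active swap areas and usage
swap -l; swap -s   # Solaris: list swap areas; summarize usage
```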
Here is some output from a Solaris system:
swapfile dev swaplo blocks free
Here is some output from an AIX system (from lsps -a):
Page Space Phys Volume Volume Group Size %Used Active Auto
hd6 hdisk0 rootvg 200MB 76 yes yes
paging00 hdisk3 uservg 128MB 34 yes yes
The output lists the paging space name, the physical disk it resides on, the volume group it is part of, its size, how much of it is currently in use, whether it is currently active, and whether it is activated
automatically at boot time. This system has two paging spaces totaling about 328 MB; total system swap space is currently about 60% full.
Here is some output from an HP-UX system:
The first three lines of the output provide details about the system swap configuration. The first line (dev)
shows that 34 MB is currently in use within the paging area at /dev/vg00/lvol2 (its total size is 192 MB). The
next line indicates that another 98 MB has been reserved within this paging area but is not yet in use.
The third line of the display is present when pseudo-swap has been enabled on the system. This is
accomplished by setting the swapmem_on kernel variable to 1 (in fact, this is the default). Pseudo-swap
allows applications to reserve more swap space than physically exists on the system, up to a limit of seven-eighths of physical memory. It is important to emphasize that pseudo-swap does not itself take up any memory. Line 3 indicates that there is 164 MB of memory overcommitment capacity remaining for
applications to use (32 MB is in use).
The final line (total) is a summary line. In this case, it indicates that there is 257 MB of total swap space on this system. 164 MB of it is currently either reserved or allocated: the 34 MB allocated from the paging area plus 98 MB reserved in the paging area plus 32 MB of the pseudo-swap capacity.
15.4.3.3 Activating paging areas
Normally, paging areas are activated automatically at boot time. On many systems, swap partitions are
listed in the filesystem configuration file, usually /etc/fstab. The format of the filesystem configuration file is
discussed in detail in Section 10.2, although some example entries will be given here:
/dev/ad0s2b none swap sw 0 0 FreeBSD
/dev/vg01/swap swap pri=0 0 0 HP-UX
/dev/hda1 swap swap defaults 0 0 Linux
This entry says that the first partition on disk 1 is a swap partition. This basic form is used for all swap partitions.
Solaris systems similarly place swap areas into /etc/vfstab.
The areas listed in these configuration files are activated at boot time by the boot scripts, via a command like:
swapon -a > /dev/console 2>&1
The same command can be run manually when adding a new partition. Solaris provides the swapadd tool to perform the same function during boots.
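A vfstab swap entry has this general shape (the device name here is an illustrative assumption):

```shell
# /etc/vfstab fields: device to mount, device to fsck, mount point,
# FS type, fsck pass, mount at boot, mount options
/dev/dsk/c0t0d0s1   -   -   swap   -   no   -
```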
Under AIX, paging areas are listed in the file /etc/swapspaces:
hd6:
dev = /dev/hd6
paging00:
dev = /dev/paging00
Each stanza lists the name of the paging space and its associated special file (the stanza name and the
filename in /dev are always the same). All paging logical volumes listed in /etc/swapspaces are activated at
boot time by a swapon -a command in /etc/rc. Paging logical volumes can also be activated when they are
created or by manually executing the swapon -a command.
15.4.3.4 Creating new paging areas
As we've noted, paging requires dedicated disk space, which is used to store paged-out data. Making a new swap partition on an existing disk without free space is a painful process, involving these steps:
Performing a full backup of all filesystems currently on the device and verifying that the tapes are readable
Restructuring the physical disk organization (partition sizes and layout), if necessary
Creating new filesystems on the disk. At this point, you are treating the old disk as if it were a brand new one.
Restoring files to the new filesystems
Activating the new swapping area and adding it to the appropriate configuration files
Most of these steps are covered in detail in other chapters. A better approach is the subject of the next subsection.
15.4.3.5 Filesystem paging
Many modern Unix operating systems offer a great deal more flexibility by supporting filesystem paging—paging to designated files within normal filesystems. Page files can be created or deleted as needs change, albeit at a modest increase in operating system paging overhead.
Under Solaris, the mkfile command creates new page files. For example, the following command will create
the file /chem/page_1 as a 50 MB file:
# mkfile 50m /chem/page_1
# swap -a /chem/page_1 0 102400
The size of the file is interpreted as bytes unless a k (KB) or m (MB) suffix is appended to it. The regular swap
command is then used to designate an existing file as a page file by substituting its pathname for the special filename.
On HP-UX systems, filesystem paging is initiated by designating a directory as the swap device to the
swapon command. In this mode, it has the following basic syntax:
swapon [-m min] [-l limit] [-r reserve] dir
min is the minimum number of filesystem blocks to be used for paging (the block size is as defined when the filesystem was created: 4096 or 8192), limit is the maximum number of filesystem blocks to be used for paging space, and reserve is the amount of space reserved for files, beyond that currently in use, which may never be used for paging space. For example, the following command initiates paging to the /chem
filesystem, limiting the size of the page file to 5000 blocks and reserving 10000 blocks for future filesystem expansion:
# swapon -l 5000 -r 10000 /chem
You can also create a new logical volume as an additional paging space under HP-UX For example, the
following commands create and activate a 125 MB swap logical volume named swap2:
# lvcreate -l 125 -n swap2 -C y -r n /dev/vg01
# swapon /dev/vg01/swap2
The logical volume uses a contiguous allocation policy and has bad block relocation disabled (-C and -r, respectively). Note that no filesystem is built on the logical volume.
On Linux systems, a page file may be created with commands like these:
# dd if=/dev/zero of=/swap1 bs=1024 count=8192 Create 8MB file.
# mkswap /swap1 8192 Make file a swap device.
# sync; sync
# swapon /swap1 Activate page file.
On FreeBSD systems, a page file is created as follows:
# dd if=/dev/zero of=/swap1 bs=1024 count=8192 Create 8MB file.
# vnconfig -e vn0c /swap1 swap Create pseudo disk /dev/vn0c
and enable swapping.
The vnconfig command configures the paging area and activates it.
Under AIX, paging space is organized as special paging logical volumes. Like normal logical volumes, paging spaces may be increased in size as desired as long as there are unallocated logical partitions in their volume group.
You can use the mkps command to create a new paging space or the chps command to enlarge an existing
one. For example, the following command creates a 200 MB paging space in the volume group chemvg:
# mkps -a -n -s 50 chemvg
The paging space will be assigned a name like pagingnn, where nn is a number: paging01, for example. The
-a option says to activate the paging space automatically on system boots (its name is entered into
/etc/swapspaces). The -n option says to activate the paging space immediately after it is created. The -s
option specifies the paging space's size, in logical partitions (whose default size is 4 MB). The volume group name appears as the final item on the command line.
The size of an existing paging space may be increased with the chps command. Here the -s option specifies the number of additional logical partitions to be added:
# chps -s 10 paging01
This command adds 40 MB to the size of paging space paging01.
FreeBSD does not support filesystem paging, although you can use a logical volume for swapping in either environment. The latter makes it much easier to add an additional paging space without adding a new disk.
15.4.3.6 Linux and HP-UX paging space priorities
HP-UX and Linux allow you to specify a preferred usage order for multiple paging spaces via a priority system. The -p option to swapon may be used to assign a priority number to a swap partition or other
paging area when it is activated. Priority numbers run from 0 to 10 under HP-UX, with lower-numbered areas being used first; the default value is 1.
On Linux systems, priorities go from 0 to 32767, with higher-numbered areas being used first, and they default to 0. It is usually preferable to give dedicated swap partitions a higher usage priority than filesystem paging areas.
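On Linux, for example, a priority can be assigned when the area is activated, or via the pri= mount option in /etc/fstab (the device names and priority values here are illustrative):

```shell
# Activate a dedicated swap partition at a high priority
swapon -p 100 /dev/sda2
# Activate a filesystem page file at a lower priority
swapon -p 10 /swap1

# Equivalent /etc/fstab entries:
#   /dev/sda2  none  swap  sw,pri=100  0 0
#   /swap1     none  swap  sw,pri=10   0 0
```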
15.4.3.7 Removing paging areas
Paging spaces may be removed if they are no longer needed, unless they're on the root disk. To remove a swap partition or filesystem page file in a BSD-style implementation—FreeBSD, Linux, HP-UX, and
Tru64—remove the corresponding line from the appropriate system configuration file. Once the system is rebooted, the swap partition will be deactivated (rebooting is necessary to ensure that there are no active references to the partition or page file). Page files may then be removed normally with rm.
Under Solaris, the -d option to the swap command deactivates a swap area. Here are some examples:
# swap -d /dev/dsk/c1d1s1 0
# swap -d /chem/page_1 0
Once the swap -d command is executed, no new paging will be done to that area, and the kernel will
attempt to free areas in it that are still in use, if possible. However, the file will not actually be removed until
no processes are using it
Under AIX, paging spaces may be removed with rmps once they are deactivated:
# chps -a n paging01
# rmps paging01
The chps command removes paging01 from the list to be activated at boot time (in /etc/swapspaces). The
rmps command actually removes the paging space.
Administrative Virtues: Persistence
Monitoring system activity levels and tuning system performance both rely on the same system
administrative virtue: persistence. These tasks naturally must be performed over an extended
period of time, and they are also inherently cyclical (or even recursive). You'll need persistence
most at two points:
When you are just getting started and don't have any idea what is wrong with the system
and what to try to improve the situation
After the euphoria from your early successes has worn off and you have to spend more
time to achieve smaller improvements
System performance tuning—and system performance itself—both follow the 80/20 rule: getting
the last 20% done takes 80% of the time. (System administration itself often follows another
variation of the rule: 20% of the people do 80% of the work.) Keep in mind the law of
diminishing returns, and don't waste any time trying to eke out that last 5% or 10%.
15.5 Disk I/O Performance Issues
Disk I/O is the third major performance bottleneck that can affect a system or individual job. This section will look first at the tools for monitoring disk I/O and then consider some of the factors that can affect disk I/O performance.
15.5.1 Monitoring Disk I/O Performance
Unfortunately, Unix tools for monitoring disk I/O data are few and rather poor. BSD-like systems provide the
iostat command (all but Linux have some version of it). Here is an example of its output from a FreeBSD system experiencing moderate usage on one of its two disks:
The command parameter specifies the interval between reports (and we've omitted the first, summary one,
as usual). The columns headed by disk names are the most useful for our present purposes. They show current disk usage as the number of transfers/sec (tps) and MB/sec.
System V-based systems offer the sar command, and it can be used to monitor disk I/O. Its syntax in this mode is:
$ sar -d interval [count]
interval is the number of seconds between reports, and count is the total number of reports to produce (the
default is one). In general, sar's options specify what data to include in its report. sar is available for AIX, HP-UX, Linux, and Solaris. However, it requires that process accounting be set up before it will return any data.
This report shows the current disk usage on a Linux system:
15.5.2 Getting the Most From the Disk Subsystem
Disk performance is something that more effectively results from installation-time planning and configuration than from after-the-fact tuning. Different techniques are most effective for optimizing different kinds of I/O. This means that you'll need to understand the I/O performed by the applications/typical workload on the system.
There are two sorts of disk I/O:
Sequential access
Data from disk is read in disk block order, one block after another. After the initial seek (head
movement) to the starting point, the speed of this sort of I/O is limited by disk transfer rates.
Random access
Data is read in no particular order. This means that the disk head will have to move frequently to reach the proper data. In this case, seek time is an important factor in overall I/O performance, and you will want to minimize it to the extent possible.
Three major factors affect disk I/O performance in general:
Disk hardware
Data distribution across the system's disks
Data placement on the physical disk
15.5.2.1 Disk hardware
In general, the best advice is to choose the best hardware you can afford when disk I/O performance is an important consideration. Remember that the best SCSI disks are many times faster than the fastest EIDE ones, and also many times more expensive.
These are some other points to keep in mind:
When evaluating the performance of individual disks, consider factors such as the local cache in addition
to quoted peak transfer rates.
Be aware that actual disk throughput will seldom if ever achieve the advertised peak transfer rates. Consider the latter merely as relative numbers useful in comparing different disks.
Musumeci and Loukides suggest using the following formula to estimate actual disk speeds: (sectors-per-track * RPM * 512)/60,000,000. This yields an estimate of the disk's internal transfer rate in MB/sec. However, even this rate will only be achievable via sequential access (and rarely even then).
When random access performance is important, you can estimate the number of I/O operations per second as 1000/(average-seek-time-in-ms + 30000/RPM).
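The two estimates are easy to work through for a hypothetical disk; the figures below (7,200 RPM, 400 sectors per track, 8 ms average seek) are invented for illustration:

```shell
awk 'BEGIN {
  rpm = 7200; seek_ms = 8; sectors_per_track = 400
  # Internal transfer rate: (sectors-per-track * RPM * 512) / 60,000,000 MB/sec
  xfer = (sectors_per_track * rpm * 512) / 60000000
  # Random I/O rate: 1000 / (average seek time + rotational delay), both in ms
  iops = 1000 / (seek_ms + 30000 / rpm)
  printf "%.1f MB/sec sequential, %.0f random ops/sec\n", xfer, iops
}'
```

This prints 24.6 MB/sec sequential, 82 random ops/sec: respectable sequential throughput, but fewer than a hundred random operations per second.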
Don't neglect to consider the disk controller speed and other characteristics when choosing hardware. Fast disks won't perform as well on a mediocre controller.
Don't overload disk controllers. Placing disks on multiple disk controllers is one way to improve I/O throughput rates. In configuring a system, be sure to compare the maximum transfer rate for each disk adapter with the sum of the maximum transfer rates for all the disks it will control; obviously, placing too large a load on a disk controller will do nothing but degrade performance. A more
conservative view states that you should limit total maximum disk transfer rates to 85%-90% of the top controller speed.
Similarly, don't overload system busses. For example, a 32-bit/33 MHz PCI bus has a peak transfer rate
of 132 MB/sec, less than what an Ultra3 SCSI controller is capable of
15.5.2.2 Distributing the data among the available disks
The next issue to consider after a system's hardware configuration is planning data distribution among the available disks: in other words, what files will go on which disk. The basic principle to take into account in such planning is to distribute the anticipated disk I/O across controllers and disks as evenly as possible (in
an attempt to prevent any one resource from becoming a performance bottleneck). In its simplest form, this means spreading the files with the highest activity across two or more disks.
Here are some example scenarios that illustrate this principle:
If you expect most of a system's I/O to come from user processes, distributing the files they are likely
to use across multiple disks usually works better than putting everything on a single disk.
A system intended to support multiple processes with large I/O requirements will benefit from placing the data for different programs or jobs on different disks (and ideally on separate controllers). This minimizes the extent to which the jobs interfere with one another.
For a system running a large transaction-oriented database, ideally you will want to place each of the following item pairs on different disks:
Tables and their indexes
Database data and transaction logs
Large, heavily used tables accessed simultaneously
Given the constraints of an actual system, you may have to decide which of these separations is the
most important.
Of course, placing heavily accessed files on network rather than local drives is almost always a guarantee of poor performance. Finally, it is also almost always a good idea to use a separate disk for the operating system filesystem(s) (provided you can afford to do so) to isolate the effects of the operating system's own I/O operations from user processes.
15.5.2.3 Data placement on disk
The final disk I/O performance factor that we will consider is the physical placement of files on disk. The following general considerations apply to the relationship between file access patterns, physical disk location, and disk I/O performance:
Sequential access of large files (i.e., reading or writing, starting at the beginning and moving steadily toward the end) is most efficient when the files are contiguous: made up of a single, continuous chunk
of space on disk. Again, it may be necessary to rebuild a filesystem to create a large amount of
contiguous disk space.[27] Sequential access performance is highest at the outer edge of the disk (i.e., beginning at cylinder 0) because the platter is the widest at that point (head movement is minimized).
[27] Unfortunately, some disks are too smart for their own good. Disks are free to do all kinds of remapping to improve their concept of disk organization and to mask bad blocks. Thus, there is
no guarantee that what look like sequential blocks to the operating system are actually
sequential on the disk.
Disk I/O to large sequential files also benefits from software disk striping, provided an appropriate stripe size is selected (see Section 10.3). Ideally, each read should result in one I/O operation (or less)
to the striped disk.
Placing large, randomly accessed files in the center portions of disk drives (rather than out at the edges) will yield the best performance. Random data access is dominated by seek times—the time taken to move the disk heads to the correct location—and seek times are minimized when the data is
in the middle of the disk and increase toward the inner and outer edges. AIX allows you to specify the preferred on-disk location when you create a logical volume (see Section 10.3). With other Unix
versions, you accomplish this by defining physical disk partitions appropriately.
Disk striping is also effective for processes performing a large number of I/O operations.
Filesystem fragmentation degrades I/O performance. Fragmentation results when the free space within
a filesystem is made up of many small chunks of space (rather than fewer large ones of the same
aggregate size). This means that files themselves become fragmented (noncontiguous), and access times to reach them become correspondingly longer. If you observe degrading I/O performance on a very full filesystem, fragmentation may be the cause.
Filesystem fragmentation tends to increase over time. Eventually, it may be necessary or desirable to use a defragmenting utility. If none is available, you will need to rebuild the filesystem to reduce fragmentation; the procedure for doing so is discussed in Section 10.3.
15.5.3 Tuning Disk I/O Performance
Some systems offer a few hooks for tuning disk I/O performance. We'll look at the most useful of them in
this subsection.
15.5.3.1 Sequential read-ahead
Some operating systems attempt to determine when a process is accessing data files in a sequential
manner. When it decides that this is the access pattern being used, it attempts to aid the process by
performing read-ahead operations: reading more pages from the file than the process has actually
requested. For example, it might begin by retrieving two pages instead of one. As long as sequential access
of the file continues, the operating system might double the number of pages read with each operation before settling at some maximum value.
The advantage of this heuristic is that data has often already been read in from disk at the time the process asks for it, so much of the process's I/O wait time is eliminated because no physical disk operation need take place.
Maximum number of pages to read ahead. You will want to increase this parameter for striped
filesystems. Good values to try are 8-16 times the number of component drives.
Both parameters must be a power of 2.
15.5.3.1.2 Linux
Linux provides some kernel parameters related to read-ahead behavior. They may be accessed via these
files in /proc/sys/vm:
page-cluster
Determines the number of pages read in by a single read operation. The actual number is computed
as 2 raised to this power. The default setting is 4, resulting in a page cluster size of 16. Large
sequential I/O operations may benefit from increasing this value.
min-readahead and max-readahead
Specify the minimum and maximum pages used for read-ahead. They default to 3 and 31,
respectively.
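These files can be read and written directly; the value written below is an arbitrary example (and writing requires root privileges):

```shell
# Display the current page-cluster setting
cat /proc/sys/vm/page-cluster
# Read ahead 2^5 = 32 pages per operation
echo 5 > /proc/sys/vm/page-cluster
```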
Finally, the Linux Logical Volume Manager allows you to specify the read-ahead size when you create a logical volume with lvcreate, via its -r option. For example, this command specifies a read-ahead size of 8 sectors and also creates a contiguous logical volume:
# lvcreate -L 800M -n bio_lv -r 8 -C y vg1
The valid range for -r is 2 to 120.
15.5.3.2 Disk I/O pacing
AIX also provides a facility designed to prevent general system interactive performance from being adversely affected by large I/O operations. By default, write requests are serviced by the operating system in the order
in which they are made (queued). A very large I/O operation can generate many pending I/O requests, and users needing disk access can be forced to wait for them to complete. This occurs most frequently when an application computes a large amount of new data to be written to disk (rather than processing a data set by reading it in and then writing it back out).
You can experience this effect by copying a large file—32 MB or more—in the background and then running
an ls command on any random directory you have not accessed recently on the same physical disk. You'll notice an appreciable wait time before the ls output appears.
Disk I/O pacing is designed to prevent large I/O operations from degrading interactive performance. It is disabled by default. Consider enabling it only under circumstances like those described.
This feature may be activated by changing the values of the minpout and maxpout system parameters using
the chdev command. When these parameters are nonzero, if a process tries to write to a file for which there
are already maxpout or more pending write operations, the process is suspended until the number of
pending requests falls below minpout.
maxpout must be one more than a multiple of 4: 5, 9, 13, and so on (i.e., of the form 4x+1). minpout must
be a multiple of 4 and at least 4 less than maxpout. The AIX documentation suggests starting with values of
33 and 16, respectively, and observing the effects. The following command will set them to these values:
# chdev -l sys0 -a maxpout=33 -a minpout=16
If interactive performance is still not as rapid as you want it to be, try decreasing these parameters; on the other hand, if the performance of the job doing the large write operation suffers more than you want it to, increase them. Note that their values persist across boots because they are stored in the ODM.
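The current settings can be verified with the standard AIX lsattr command:

```shell
# Display the I/O pacing thresholds recorded in the ODM
lsattr -E -l sys0 -a maxpout -a minpout
```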
15.6 Monitoring and Managing Disk Space Usage
This section looks at the tools available to monitor and track disk space usage. It then goes on to discuss ways of approaching a perennial administrative challenge: getting users to reduce their disk use.
15.6.1 Where Did It All Go?
The df -k command produces a report that describes all the filesystems, their total capacities, and the amount of free space available on each one (reporting sizes in KB). Here is the output from a Linux system:
File system Kbytes used avail capacity Mounted on
/dev/sd0a 7608 6369 478 93% /
/dev/sd0g 49155 45224 0 102% /corp
This output reports the status of two filesystems: /dev/sd0a, the root disk, and /dev/sd0g, the disk mounted
at /corp (containing all files and subdirectories underneath /corp). Each line of the report shows the
filesystem's name, the total number of kilobytes on the disk, the number of kilobytes in use, the number of kilobytes available, and the percentage of the filesystem's storage that is in use. It is evident that both
filesystems are heavily used. In fact, the /corp filesystem appears to be overfull.
As we've noted earlier, the operating system generally holds back some amount of space in each filesystem, allocatable only by the superuser (usually 10%, although Linux uses 5% by default). A filesystem may appear to use over 100% of the available space when it has tapped into this reserve.
The du -k command reports the amount of disk space used by all files and subdirectories underneath one or more specified directories, listed on a per-subdirectory basis (amounts are given in KB).
A typical du report looks like this:
$ du -k -s /home/chavez
34823 /home/chavez
In many cases, this may be all the information you care about.
To generate a list of the system's directories in order of size, execute the command:
$ du -k / | sort -rn
This command starts at the root filesystem, lists the storage required for each directory, and pipes its output to sort. With the -rn options (reverse sort order, sort by numeric first field), sort orders these directories according to the amount of storage they occupy, placing the largest first.
If the directory specified as its parameter is large or has a large number of subdirectories, du can take quite a while to execute. It is thus a prime candidate for automation via scripts and after-hours execution via cron.
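Run as a nightly cron job, such a report is ready for review each morning. Here is a minimal sketch; the schedule, report location, and starting directory are illustrative choices, not from the original text:

```shell
# Illustrative root crontab entry: at 2:00 A.M. every day, list all
# directories under / in descending order of disk usage, saving the
# report for later review. Error output is discarded.
0 2 * * *  du -k / 2>/dev/null | sort -rn > /var/adm/du.report
```

The saved report can then be examined with head to see the largest directories first.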
quot reports the number of kilobytes used by each user in the specified filesystem. It is run as root (to access the disk special files).
15.6.2 Handling Disk Shortage Problems
The commands and scripts we've just looked at will let you know when you have a disk space shortage and where the available space went, but you'll still have to solve the problem and free up the needed space somehow. There is a large range of approaches to solving disk space problems, including the following:
Buy another disk. This is the ideal solution, but it's not always practical.
Mount a remote disk that has some free space on it. This solution assumes that such a disk is available, that mounting it on your system presents no security problems, and that adding additional data to it won't cause problems on its home system.
Eliminate unnecessary files. For example, in a pinch, you can remove the preformatted versions of the manual pages, provided that the source files are also available on your system.
Compress large, infrequently accessed files.
Convince or cajole users into deleting unneeded files and backing up and then deleting old files they are no longer using. If you are successful, a great deal of free disk space usually results. At the same time, you should check the system for log files that can be reduced in size (discussed later in this section).
When gentle pressure on users doesn't work, sometimes peer pressure will. The system administrator on one system I worked on used to mail a list of the top five "disk hogs" (essentially the output of the quot command) whenever disk space was short. I recommend this approach only if you have both a thick skin and a good-natured user community.
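A report of this sort can be approximated without quot's root requirement by summarizing home directories with du. A minimal sketch follows; the scratch directory layout and the mail step are illustrative assumptions, not from the original text:

```shell
# Build a "top five disk hogs" list from per-directory du totals.
# /tmp/homes-demo stands in for /home so the sketch can run anywhere.
HOMES=/tmp/homes-demo
mkdir -p "$HOMES/chavez" "$HOMES/wang"
dd if=/dev/zero of="$HOMES/chavez/data" bs=1024 count=64 2>/dev/null
dd if=/dev/zero of="$HOMES/wang/data" bs=1024 count=8 2>/dev/null

# Largest directories first. In practice the output would be piped to
# mail (e.g., "| mail -s 'disk hogs' staff", a hypothetical alias).
du -sk "$HOMES"/* | sort -rn | head -5
```

On a real system you would point the pipeline at /home (or wherever home directories live) instead of the demonstration directory.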
Some sites automatically archive and then delete user files that haven't been accessed in a certain period of time (often two or three months). If a user wants a file back, he can send a message to the system administration staff, who will restore it. This approach is the most brutal and should be taken only when absolutely necessary. It is fairly common in university environments, but rarely used elsewhere. It's also easy to circumvent by touching all your files every month, and performing system backups may also reset access times on inactive files.
These, then, are some of the alternatives.[29] In most cases, though, when you can't add any disks to the system, the most effective way to solve a disk space problem is to convince users to reduce their storage requirements by deleting old, useless, and seldom (if ever) used files (after backing them up first). Junk files abound on all systems. For example, many text editors create checkpoint and backup files as protection against a user error or a system failure. If these accumulate, they can consume a lot of disk space. In addition, users often keep many versions of files around (noticed most often in the case of program source files), frequently not even remembering what the differences are between them.
[29] There is another way to limit users' disk usage on some systems: disk quotas (discussed later in this section). However, quotas won't help you once the disks are already too full.
The system scratch directory /tmp also needs to be cleared out periodically (as well as any other directories serving a similar function). If your system doesn't get rebooted very often, you'll need to do this by hand. You should also keep an eye on the various system spooling directories under /usr/spool or /var/spool, because files can often become stagnant there.
Unix itself has a number of accounting and logging files that, if left unattended, will grow without bound. As administrator, you are responsible for extracting the relevant data from these files periodically and then truncating them. We'll look at dealing with these sources of wasted space in the following sections.
Under some circumstances, a filesystem's performance can begin to degrade when the filesystem is more than 80%-90% full. Therefore, it is a good idea to take any corrective action before your filesystems reach this level, rather than waiting until they are completely full.
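One way to stay ahead of this is a quick check that flags filesystems above a chosen threshold. A sketch follows; the 85% cutoff is an arbitrary example, and df's column layout varies slightly between systems:

```shell
# Print the mount point and usage of any filesystem more than 85% full.
# Assumes df -k output with the capacity percentage in column 5 and the
# mount point in column 6, as on Linux and BSD-style systems.
df -k | awk 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 > 85) print $6, $5 "%" }'
```

Run from cron, a script like this can mail a warning before a filesystem fills completely.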
15.6.2.1 Using find to locate or remove wasted space
The find command may be used to locate potential candidates for archival and deletion (or just deletion) in the event of a disk space shortage. For example, the following command prints all files with names beginning with .BAK or ending with a tilde, the formats for backup files from two popular text editors:
$ find / -name ".BAK.*" -o -name "*~" -print
As we've seen, find can also delete files automatically. For example, the following command deletes all editor backup files over one week old:
# find / /bio /corp -atime +7 \( -name ".BAK.*" \
-o -name "*~" \) -type f -xdev -exec rm -f {} \;
When using find for automatic deletion, it pays to be cautious. That is why the previous command includes the -type and -xdev options and lists each filesystem separately. With the cron facility, you can use find to produce a list of files subject to deletion nightly (or to delete them automatically).
Another tactic is to search the filesystem for duplicate files. This will require writing a script, but you'll be amazed at how many you'll find.
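Such a script can be sketched with standard tools: files whose checksum and size both match are almost certainly identical. The demonstration directory and files below are illustrative, and this simple version does not handle filenames containing spaces:

```shell
# Report pairs of files with identical contents under a directory tree.
# cksum prints "checksum size filename"; sorting brings identical files
# together so awk can compare each line with its predecessor.
DIR=/tmp/dup-demo
mkdir -p "$DIR"
echo "identical content" > "$DIR/a"
echo "identical content" > "$DIR/b"
echo "unique content"    > "$DIR/c"

find "$DIR" -type f -exec cksum {} \; |
  sort -n |
  awk '$1 == prev1 && $2 == prev2 { print last, $3 }
       { prev1 = $1; prev2 = $2; last = $3 }'
```

Pointed at a home-directory tree, a script along these lines turns up surprising amounts of redundant data.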
15.6.2.2 Limiting the growth of log files
The system administrator is responsible for reaping any data needed from log files and keeping them to a reasonable size. The major offenders include these files:
The various system log files in /usr/adm or /var/adm, which may include sulog, messages, and other files set up via /etc/syslog.conf.
Accounting files in /usr/adm or /var/adm, especially wtmp and acct (BSD) or pacct (System V). Also, under System V, the space consumed by the cumulative summary files and ASCII reports in /var/adm/acct/sum and /var/adm/acct/fiscal is worth monitoring.
Subsystem log files: many Unix facilities, such as cron, the mail system, and the printing system, keep their own log files.
Under AIX, the files smit.log and smit.script in users' home directories are appended to every time someone runs SMIT. They become large very quickly. You should watch the ones in your own and root's home directories (if you su to root, the files still go into your own home directory). Alternatively, you could run the smit command with the -l and -s options (which specify the log and script filenames, respectively) and set both filenames to /dev/null. Defining an alias is the easy way to do so:
alias smit="smit -l /dev/null -s /dev/null" bash/ksh
alias smit "smit -l /dev/null -s /dev/null" csh/tcsh
There are several approaches to controlling the growth of system log files. The easiest is to truncate them by hand when they become large. This is advisable only for ASCII (text) log files. To reduce a file to zero length, use a command such as:
# cat /dev/null > /var/adm/sulog
Copying from the null device into the file is preferable to deleting the file, because in some cases the subsystem won't recreate the log file if it doesn't exist. It's also preferable to rm followed by touch, because the file ownerships and permissions remain correct and because it releases the disk space immediately.
To retain a small part of the current logging information, use tail to save the end of the file and then copy the result back over the original.
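For instance, the following sketch keeps only the most recent 100 lines. The log pathname here is a scratch demonstration file; substitute the real log (e.g., /var/adm/sulog) in practice:

```shell
# Trim a log file to its last 100 lines in place. Copying back through
# cat (rather than mv) preserves the file's inode, ownership, and
# permissions. LOG is a scratch demo file standing in for a real log.
LOG=/tmp/sulog.demo
seq 1 500 | sed 's/^/log entry /' > "$LOG"   # fabricate a 500-line log

tail -n 100 "$LOG" > "$LOG.keep"
cat "$LOG.keep" > "$LOG"
rm -f "$LOG.keep"
```
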
AIX provides the skulker script (stored in /usr/sbin) to perform some of these filesystem cleanup operations, including the following:
Clearing the queueing system spooling areas of old, junk files
Clearing /tmp and /var/tmp of all files over one day old.
Deleting old news files (over 45 days old)
Deleting a variety of editor backup files, core dump files, and random executables (named a.out). You may want to add to the list of file types.
The system comes set up to run skulker every day at 3 A.M. via cron, but the crontab entry is commented out. If you want to run skulker, you'll need to remove the comment character from the skulker line in root's crontab file.
15.6.3 Controlling Disk Usage with Disk Quotas
Disk space shortages are a perennial problem on all computers. For systems where direct control over how much disk space each user consumes is essential, disk quotas may provide a solution.
The disk quota system allows an administrator to limit the amount of filesystem storage that any user can consume. If quotas are enabled, the operating system will maintain separate quotas for each user's disk space and inode consumption (equivalent to the total number of files he owns) on each filesystem.
There are two distinct kinds of quota: a hard limit and a soft limit. A user is never allowed to exceed his hard limit, under any circumstances. When a user reaches his hard limit, he'll get a message that he has exceeded his quota, and the operating system will refuse to allocate any more storage. A user may exceed the soft limit for a limited period of time; in such cases, he gets a warning message, and the operating system grants the request for additional storage. If his disk usage still exceeds the soft limit at the next login, the message will be repeated. He'll continue to receive warnings at each successive login until either:
He reduces his disk usage to below the soft limit, or
He's been warned a fixed number of times (or for a specified period of time, depending on the implementation). At this point, the operating system will refuse to allocate any more storage until the user deletes enough files that his disk usage again falls below his soft limit.
The disk quota system has been designed to let users have large temporary files, provided that in the long term they obey a much stricter limit. For example, consider a user with a hard limit of 15,000 blocks and a soft limit of 10,000 blocks. If this user's storage ever exceeds 15,000 blocks, the operating system will refuse to allocate any more storage immediately; he will need to free some storage before he can save any more files. If this user's storage exceeds 10,000 blocks, he'll get a warning, but requests for more disk space will still be honored. However, if this user does not reduce his storage below 10,000 blocks, the operating system will eventually refuse to allocate any additional storage until it does fall below 10,000 blocks.
If you decide to implement a quota system, you must determine which filesystems need quotas. In most situations, the filesystems containing user home directories are appropriate candidates for quotas. Filesystems that are reserved for public files (for example, the root filesystem) probably shouldn't use quotas. The /tmp filesystem doesn't usually have quotas because it's designed to provide temporary scratch space.
Many operating systems require quotas to be enabled in the kernel, and many kernels do not include them by default. Check your kernel configuration before attempting to use quotas.
15.6.3.1 Preparing filesystems for quotas
After deciding which filesystems will have quotas, you'll need to edit the corresponding entries in the filesystem configuration file (usually /etc/fstab) to indicate that quotas are in use, by editing the options field as in this example:
/dev/dsk/c0t3d0s0 /1 ufs 2 yes rw,logging,quota
See Section 10.2 for full details on the filesystem configuration file on the various systems.
On AIX systems, add a line like the following to the filesystem's stanza in /etc/filesystems:
quota = userquota,groupquota
Include the userquota keyword for standard disk quotas and the groupquota keyword for group-based disk quotas (described in the final part of this section).
Next, make sure that there is a file named quotas in the top-level directory of each filesystem for which you want to establish quotas. If the file does not exist, create it with the touch command:[31]
[31] This is not always required by recent quota system implementations, but it won't hurt either.
# cd /chem
# touch quotas
# chmod 600 quotas
The file must be writable by root and no one else.
15.6.3.2 Setting users' quota limits
Use the edquota command to establish filesystem quotas for individual users. The command is invoked with one or more usernames:
# edquota username(s)
When you execute this command, edquota creates a temporary file containing the hard and soft limits on each filesystem for each user. After creating the file, edquota invokes an editor so you can modify it (by default, vi; you can use the environment variable EDITOR to specify your favorite editor). Each line in this file describes one filesystem. The format varies somewhat; here is an example:
/chem: blocks in use: 13420, limits (soft=20000, hard=30000)
inodes in use: 824, limits (soft=0, hard=0)
This entry specifies quotas for the /chem filesystem; by editing it, you can add hard and soft limits for this user's total disk space and inode usage (total number of files). Setting a quota to 0 disables that quota. The example specifies a soft quota of 20,000 disk blocks, a hard quota of 30,000 disk blocks, and no quotas on inodes. Note that the entry in the temporary file does not indicate anything about the user(s) to which these quotas apply; quotas apply to the user specified when you executed the edquota command. When you list more than one user on the command line, you will edit a file for each one of them in turn.
After you save the temporary quota file and exit the editor (using whatever commands are appropriate for the editor you are using), edquota modifies the quotas files themselves. These files cannot be edited directly.
The -p option to edquota lets you copy quota settings between users. For example, the following command applies chavez's quota settings to users wang and harvey:
# edquota -p chavez wang harvey
15.6.3.3 Setting the soft limit expiration period
edquota's -t option is used to specify the system-wide time limit for soft quotas. Executing edquota -t starts an editor session something like this one:
Time units may be: days, hours, minutes, or seconds
Grace period before enforcing soft limits for groups:
/chem: block grace period: 3 days, file grace period: 0 days
A value of zero days indicates that the default value is in effect (usually seven days). You can specify the time period in other units by changing days to one of the other listed keywords. Some implementations also allow the grace period to be specified in months, although a grace period that long rather defeats the purpose of using disk quotas in the first place.
15.6.3.4 Enabling quota checking
The quotaon command activates the quota system:
# quotaon filesystem
# quotaon -a
The first command enables the quota system for the specified filesystem. The latter enables quotas on all filesystems listed with quotas in the filesystem configuration file. For example, the following command enables quotas for the /chem filesystem:
enables quotas for the /chem filesystem:
# quotaon /chem
Similarly, the quotaoff command disables quotas. It can be used with the -a option to disable all quotas, or with a list of filesystem names.
15.6.3.5 Quota consistency checking
The quotacheck command checks the consistency of the quotas file for the filesystem specified as its argument. It verifies that the quota files are consistent with current actual disk usage. This command should be executed after you install or modify the quota system. If used with the option -a, quotacheck checks all filesystems designated as using quotas in the filesystem configuration file.
To have these commands run at boot time on AIX systems, you must add them to one of the system boot scripts. The other Unix versions run them automatically, via these boot scripts:
FreeBSD: /etc/rc (if check_quotas="yes" in /etc/rc.conf)
HP-UX: /sbin/init.d/localmount
Linux: /etc/init.d/quota (SuSE 7: if START_QUOTA="yes" in /etc/rc.config)
Solaris: /etc/init.d/MOUNTFS and ufs_quota
Tru64: /sbin/init.d/quota (if QUOTA_CONFIG="yes" in /etc/rc.config)
15.6.3.6 Disk quota reports
The repquota command reports the current quotas for one or more specified filesystems. Here is an example of the reports generated by repquota:
# repquota -v /chem
*** Report for user quotas on /chem (/dev/sd1d)
                 Block limits                  File limits
User        used   soft   hard  grace     used  soft  hard  grace
chavez     13420  20000  25000             824     0     0
chen    +-  2436   2000   3000  2days        8     0     0
The plus sign in the entry for user chen indicates that he has exceeded his disk quota.
Users can use the quota command to determine where their current disk usage falls with respect to their disk quotas.
15.6.3.7 Group-based quotas (AIX, FreeBSD, Tru64 and Linux)
AIX, FreeBSD, Tru64, and Linux extend standard disk quotas to Unix groups as well as individual users. Specifying the -g option to edquota causes names on the command line to be interpreted as group names rather than as usernames. Similarly, edquota -t -g allows you to specify the soft limit timeout period for group quotas.
By default, the quotaon, quotaoff, quotacheck, and repquota commands operate on both user and group quotas. You can specify the -u and -g options to limit their scope to only user quotas or only group quotas, respectively. Users must use the following form of the quota command to determine the current status of group quotas:
$ quota -g chem
For example, this command will report the disk quota status for the group chem. Users may query the disk quota status only for groups of which they are a member.
15.7 Network Performance
This section concludes our look at performance monitoring and tuning on Unix systems. It contains a brief introduction to network performance, a very large topic whose full treatment is beyond the scope of this book. Consult the work by Musumeci and Loukides for further information.
15.7.1 Basic Network Performance Monitoring
The netstat -s command displays cumulative network statistics. You can limit the display to a single network protocol via the -p option, as in this example from an HP-UX system:
$ netstat -s -p tcp Output shortened.
tcp:
178182 packets sent
111822 data packets (35681757 bytes)
30 data packets (3836 bytes) retransmitted
66363 ack-only packets (4332 delayed)
337753 packets received
89709 acks (for 35680557 bytes)
349 duplicate acks
0 acks for unsent data
284726 packets (287618947 bytes) received in-sequence
0 completely duplicate packets (0 bytes)
3 packets with some dup, data (832 bytes duped)
11 out of order packets (544 bytes)
5 packets received after close
The output gives statistics since the last boot.[32]
[32] Or most recent counter reset, if supported
Network operations are proceeding nicely on this system. Lines such as those reporting retransmitted packets, duplicate acks, and out-of-order packets are among those that would indicate transmission problems if their values rose to appreciable percentages of the total network traffic.
More detailed network performance data can be determined via the various network monitoring tools we considered in Section 8.6.
15.7.2 General TCP/IP Network Performance Principles
Good network performance depends on a combination of several components working properly and efficiently. Performance problems can arise in many places and take many forms. These are among the most common:
Network interface problems, including insufficient speed and high error rates due to failing or misconfigured hardware. This sort of problem shows up as poor performance and/or many errors on a particular host.
Network adapters, hubs, switches, and network devices in general seldom fail all at once, but rather produce increasing error rates and/or degrading performance over time. These metrics should be monitored regularly to spot problems before they become severe. Degradation can also occur due to aging drop cables.
Hardware device setup errors, including half/full duplex mismatches, cause high error and collision rates and result in hideous performance.
Overloaded servers can also produce poor network response. Servers can have several kinds of shortfalls: too much traffic for the interface to handle, too little memory for the network workload (or an incorrect configuration), and insufficient disk I/O bandwidth. The server's performance will need to be investigated to determine which of these are relevant (and hence where the most attention to the problem should be paid).
Insufficient network bandwidth for the workload. You can recognize such situations by the presence of slow response and/or significant timeouts on systems throughout the local network, which is not alleviated by the addition of another server system. The best solution to such problems is to use high-performance switches. If this is not possible, another, much less desirable, solution is to divide the network into multiple subnets that separate systems requiring distinct network resources from one another.
All of these problem types are best addressed by correcting or replacing hardware and/or reallocating resources, rather than by configuration-level tuning.
15.7.2.1 Two TCP parameters
TCP operations are controlled by a very large number of parameters. Most of them should not be modified by nonexperts. In this subsection, we'll consider two that are most likely to produce significant improvements with little risk.
The maximum segment size (MSS) determines the largest "packet" size that the TCP protocol will transmit across the network. (The actual size will be 40 bytes larger due to the IP and TCP headers.) Larger segments result in fewer transmissions to transfer a given amount of data and usually provide correspondingly better performance on Ethernet networks.[33] For Ethernet networks, the maximum allowed size, 1460 bytes (1500 minus 40), is usually appropriate.[34]
[33] Note that this will often not be the case for slow network links, especially for applications that are very sensitive to network transmission latencies.
[34] When is it inappropriate? When the headers are larger than the minimum and using a size this large causes packet fragmentation and its resultant overhead. For example, a value of 1200-1300 is more appropriate when, say, the PPP over Ethernet protocol is used, as would be the case on a web server accessed by cable modem users.
Socket buffer sizes: When an application sends data across the network via the TCP protocol, it is first placed in a buffer. From there, the protocol divides it as needed and creates segments for transmission. Once the buffer is full, the application generally must wait for the entire buffer to be transmitted and acknowledged before it is allowed to queue additional data.
On faster networks, a larger buffer size can improve application performance. The tradeoff is that each buffer consumes memory, so the system must have sufficient available memory resources to accommodate all of the buffers for (at least) the usual network load. For example, using read and write socket buffers of 32 KB for each of 500 network connections would require approximately 32 MB of memory on the network server (32 x 2 x 500 KB). This would not be a problem on a dedicated network server but might be an issue on busy, general-purpose systems.
On current systems with reasonable memory sizes and no other applications with significant memory requirements, socket buffer sizes of 48 to 64 KB are usually reasonable.
Table 15-7 lists the relevant parameters for each of our Unix versions, along with the commands that may be used to modify them.
Table 15-7. Important TCP parameters (socket buffer defaults in KB, MSS defaults in bytes)

AIX: no -o param=value
    Socket buffers: tcp_sendspace [16], tcp_recvspace [16]
    MSS: tcp_mssdflt [512]
FreeBSD: sysctl param=value (also /etc/sysctl.conf)
    Socket buffers: net.inet.tcp.sendspace [32], net.inet.tcp.recvspace [64]
    MSS: net.inet.tcp.mssdflt [512]
HP-UX: ndd -set /dev/tcp param value (also /etc/rc.config.d/nddconf)
    Socket buffers: tcp_recv_hiwater_def [32], tcp_xmit_hiwater_def [32]
Linux: sysctl param=value (also /proc/sys/net)
    Socket buffers: rmem_max [64], wmem_max [64], tcp_rmem (min, default, max)
Tru64: sysconfig -r inet param=value (also /etc/sysconfigtab)
    Socket buffers: tcp_sendspace [60], tcp_recvspace [60]
    MSS: tcp_mssdflt [536]
The remaining sections will consider performance issues associated with two important network subsystems: DNS and NFS.
15.7.3 DNS Performance
DNS performance is another item that is easiest to affect at the planning stage. The key issues with DNS are:
Sufficient server capacity to service all of the clients
Balancing the load among the available servers
At the moment, the latter is best accomplished by specifying different name server orderings within the /etc/resolv.conf files on groups of client systems. It is also helpful to provide at least one DNS server on each side of slow links.
Careful placement of forwarders can also be beneficial. At larger sites, a two-tiered forwarding hierarchy may help to channel external queries through specific hosts and reduce the load on other internal servers.
Finally, use separate servers for handling internal and external DNS queries. Not only will there be performance benefits for internal users, it is also the best security practice.
DNS itself can also provide a very crude sort of load balancing via the use of multiple A records in a zone file, as in this example:
docsrv IN A 192.168.10.1
IN A 192.168.10.2
IN A 192.168.10.3
These records define three servers with the hostname docsrv. Successive queries for this name will receive each IP address in turn.[35]
[35] Actually, each query will receive each IP address as the first entry in the list that is returned. Most clients pay attention only to the top entry.
This technique is most effective when the operations requested from the servers are all essentially equivalent, so that a simple round-robin distribution of them is appropriate. It will be less successful when requests can vary greatly in size or resource requirements. In such cases, manually assigning servers to the various clients will work better. You can do so by editing the nameserver entries in /etc/resolv.conf.
15.7.4 NFS Performance
The Network File System is a very important Unix network service, so we'll complete our discussion of performance by considering some of its performance issues.
Monitoring NFS-specific network traffic and performance is done via the nfsstat command. For example, nfsstat -c lists NFS client statistics, including the badxids and timeouts counters. If either of these values is appreciable, there is probably an NFS bottleneck somewhere. If badxids is within a factor of, say, 6-7 of timeouts, the responsiveness of the remote NFS server is the source of the client's performance problems. On the other hand, if there are many more timeouts than badxids, then general network congestion is to blame.
The nfsstat command's -s option is used to obtain NFS server statistics:
Server nfs V2: (54231 out of 59077 calls)
null getattr setattr root lookup readlink read
Server nfs V3: (4846 out of 59077 calls)
null getattr setattr lookup access readlink read
15.7.4.1 NFS Version 3 performance improvements
Many Unix systems are now providingNFS Version 3 instead of or in addition to Version 2 NFS Version 3 hasmany benefits in several areas; reliability, security, performance are among them The following are themost important improvements provided by NFS Version 3:
TCP versus UDP: Traditionally, NFS uses the UDP transport protocol. NFS Version 3 uses TCP as its default transport protocol.[36] Doing so provides NFS operations with both flow control and packet-level retransmission; by contrast, when using UDP, any network failure requires that the entire operation be repeated. Thus, using TCP often results in smaller performance hits when there are problems.
[36] Some NFS Version 2 implementations can also optionally use TCP instead of UDP.
Two-phase writes: Previously, NFS write operations were performed synchronously, meaning that a client had to wait for each write operation to be completed before starting another one. Under NFS Version 3, write operations are performed in two parts:
The client queues a write request, which the server acknowledges immediately. Additional write operations can be queued once the acknowledgement is received.
The client commits the write operation (possibly after some intermediate modifications), and the server commits it to disk (or requests its retransmission if the data is no longer available, e.g., if there was an intervening system crash).
Larger data blocks: The maximum data block size is increased (the previous limit was 8 KB). The actual maximum value is determined by the transport protocol; for TCP, it is 32 KB. In addition to reducing the number of packets, a larger block size can result in fewer disk seeks and faster sequential file access. The effect is especially noticeable with high-speed networks.
15.7.4.2 NFS performance principles
The following points are important to keep in mind with respect to NFS server performance, especially in the planning stages:
Mounting NFS filesystems in the background (i.e., with the bg option) will speed up boots.
Use an appropriate number of NFS daemon processes. The rule of thumb is 2 per expected simultaneous client process. Conversely, if there are idle NFS daemons on a server, you can reduce their number and release their (albeit small) memory resources.
Very busy NFS servers will benefit from a multiprocessor computer. CPU resources are almost never an issue for NFS, but the context switches generated by very large numbers of clients can be significant.
Don't neglect the usual system memory and disk I/O performance considerations, including the size of the buffer cache, filesystem fragmentation, and data distribution across disks.
NFS searches remote directories sequentially, entry by entry, so avoid remote directories with large numbers of files.
Remember that not every task is appropriate for remote files. For example, compiling a program such that the object files are written to a remote filesystem will run very slowly indeed. In general, source files may be remote, but object files and executables should be created on the local system. More generally, for best network performance, avoid writing large amounts of data to remote files (although you may want to sacrifice disk and network I/O performance in order to use the CPU resources of a fast remote system).
Resources for You
After all of this discussion of system resources, it's worth spending a little time considering some for yourself. Resources for system administrators come in many varieties: books and magazines, web sites and newsgroups, conferences and professional organizations, and humor and fun (all work and no play won't do anything positive for your performance).
Here are some of my favorites:
An excellent Unix internals book: UNIX Internals: The New Frontier by Uresh Vahalia (Prentice-Hall).
Sys Admin magazine, http://www.sysadminmag.com
Useful web sites: http://www.ugu.com, http://www.lwn.net, http://www.slashdot.com (thelast for news and rumors)
LISA: an annual conference for system administrators run by Usenix and SAGE (see http://www.usenix.org/events).
The UNIX-HATERS Handbook, ed. Simson Garfinkel, Daniel Weise, and Steve Strassmann (IDG Books). This is still the funniest book I've read in a long time. You can expect to waste a few hours at work if you start reading it there, because you won't be able to put it down.
Chapter 16 Configuring and Building Kernels
As we've noted many times before, the kernel is the heart of the Unix operating system. It is the core program, always running while the operating system is up, providing and overseeing the system environment. The kernel is responsible for all aspects of system functioning, including:
Process creation, termination and scheduling
Virtual memory management (including paging)
Device I/O (via interfaces with device drivers: modules that perform the actual low-level
communication with physical devices such as disk controllers, serial ports, and network adapters)
Interprocess communication (both local and network)
Enforcing access control and other security mechanisms
Traditionally, the Unix kernel is a single, monolithic program. On more recent systems, however, the trend has been toward modularized kernels: small core executable programs to which additional, separate object or executable files (modules) can be loaded and/or unloaded as needed. Modules provide a convenient way to support a new device type or add specific new functionality to an existing kernel.
In many instances, the standard kernel program provided with the operating system works perfectly well for the system's needs. There are a few circumstances, however, where it is necessary to create a custom kernel (or perform equivalent customization activities) to meet the special needs of a particular system or environment. Some of the most common are:
To add capabilities to the kernel (e.g., support for disk quotas or a new filesystem type)
To add support for new devices
To remove unwanted capabilities/features from the kernel to reduce its size and resource consumption (mostly memory) and thereby presumably improve system performance
To change the values of hardwired kernel parameters that cannot be modified dynamically
How often you have to build a new kernel depends greatly on which system you are administering. On some older systems (mid-1990s versions of SCO Unix come to mind), you had to build a new kernel any time you added even the smallest, most insignificant new device or capability to the system. On most current systems, such as FreeBSD and Tru64, you build a kernel only when you want to significantly alter the system configuration. And on a few systems, like Solaris and especially AIX, you may never have to do so.
In this chapter, we'll look at the process of building a customized kernel, and we'll also examine administering kernel modules. There are many reasons you might want to alter the standard kernel: addressing performance issues, supporting a new device or subsystem, removing features the system doesn't use (in an effort to make the kernel smaller), adjusting the operating system's behavior and resource limits, and so on. We won't be able to go into every possible change you might make on each of the systems we are considering. Instead, we'll look at the general process you go through to make a kernel, including how to install it and boot from it and how to back out your changes should they prove unsatisfactory.
NOTE
Custom kernel building and reconfiguration is not for the faint-hearted, the careless, or the ignorant. Know what you're doing, and why, to avoid inadvertently making your system unusable.
In general, building a custom kernel consists of these steps:
Installing the kernel source code package (if necessary)
Applying any patches, adding new device driver code, and/or making any other source code changes you may require
Saving the current kernel and its associated configuration files
Modifying the current system configuration as needed
Building a new kernel executable image
Building any associated kernel modules (if applicable)
Installing and testing the new kernel
Table 16-1 lists the kernel locations and kernel build directories for the operating systems we are considering. (Only the rows recoverable from this chapter's text are shown here with certainty; the AIX and Linux entries reflect the standard locations for those systems in this era.)

Table 16-1. Standard kernel image and build directory locations

Unix version    Kernel image                    Build directory
AIX             /unix                           none
FreeBSD         /kernel                         /usr/src/sys/i386[1]/conf
HP-UX           /stand/vmunix                   /stand/build
Linux           /boot/vmlinuz                   /usr/src/linux
Solaris         /kernel/unix (or genunix[2])    none
Tru64           /vmunix (or genvmunix[2])       /usr/sys/conf

[1] This component is architecture-specific; i386 is the generic subdirectory for Intel-based PCs. If you're running on a more recent CPU type, building a kernel for that specific processor may improve the operating system's performance.

[2] The gen forms are the generic, hardware-independent versions of the kernel.
We'll begin with the kernel build process on FreeBSD and Tru64 systems (which are very similar) and then consider each of the other environments in turn. In each case, we will also consider other mechanisms that are available for configuring the kernel and/or kernel modules.
16.1 FreeBSD and Tru64
Tru64 and FreeBSD use an almost identical process for building a customized kernel. They rely on a configuration file for specifying which capabilities to include within the kernel and for setting the values of various system parameters. The configuration file is located in /usr/sys/conf on Tru64 systems and in /usr/src/sys/arch/conf under FreeBSD, where arch is an architecture-specific subdirectory (we'll use i386 as an example).
Configuration filenames are conventionally all uppercase, and the directory typically contains several different configuration files. The one used to build the current kernel is usually indicated in the /etc/motd file. For example, the GENERIC file was used to build the kernel on this FreeBSD system:
FreeBSD 4.3-RELEASE (GENERIC) #0: Sat Apr 21 10:54:49 GMT 2001
Default Tru64 configuration files are often named GENERIC or sometimes ALPHA.
On FreeBSD systems, you will first need to install the kernel sources if you have not already done so:
FreeBSD
# cd /
# mkdir -p /usr/src/sys If not already present.
# mount /cdrom
# cat /cdrom/src/ssys.[ad]* | tar xzvf -
To add a device to a Tru64 system, you must boot the generic kernel, /genvmunix, to force the system to
recognize and create configuration information for the new device:
# emacs NEWKERN [/tmp/NEWDEVS]
The GENERIC configuration file is the standard, hardware-independent version provided with the operating system. If you have already customized the kernel, you would start with the corresponding configuration file.
While editing the new configuration file, add (or activate) lines for new devices or features, disable or comment out lines for services you don't want to include, and specify the values for any applicable kernel parameters. In general, it's unlikely that you'll need to modify the contents of hardware device-related entries. The one exception is the ident entry, which assigns a name to the configuration. You should change its value to correspond to the name you have selected:
ident NEWKERN
You may also occasionally remove unneeded subsystems by commenting out the corresponding option's entry, as in this example, which disables disk quotas:
#options QUOTA Tru64
On Tru64 systems, you will need to merge in any new device lines from the file created by the sizer command (placed into /tmp), indicated by the optional second parameter to the Tru64 emacs command above. One way to locate these device lines is to diff that file against your current kernel configuration file or the GENERIC file.
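For instance, assuming the illustrative /tmp/NEWDEVS filename from the emacs command shown earlier, a comparison like this flags the new device lines:

```shell
# diff /tmp/NEWDEVS /usr/sys/conf/GENERIC
```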
The FreeBSD configuration file contains a large number of settings, most of them corresponding to hardware devices and their characteristics. In addition, there are several entries specifying the values of various kernel parameters that might need to be altered in some circumstances.[3]
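Entries like the following set two such parameters; the values shown are purely illustrative (the LINT or NOTES file documents the available options and their defaults):

```
maxusers        64
options         NMBCLUSTERS=4096
```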
[3] Many kernel parameters can also be modified via the sysctl command and its initialization file (see Section 15.4).
You can examine the LINT or NOTES configuration file for documentation on most available parameters.
The next step in the kernel build process is to run the command that creates a custom build area for the new configuration:
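On FreeBSD the command is config and on Tru64 it is doconfig; here is a sketch of typical invocations, assuming the NEWKERN configuration name used earlier (paths are illustrative):

```shell
# /usr/sbin/config NEWKERN FreeBSD: run in the conf directory.
# cd ../../compile/NEWKERN
# make depend && make

# doconfig -c NEWKERN Tru64.
```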
doconfig and config create the NEWKERN subdirectory, where the new kernel is actually built. Once the make commands complete, the new kernel may be installed in the root directory and tested.
If there are problems building the new kernel, you can boot the saved version with these commands:
disk1s1a:> unload FreeBSD boot loader.
disk1s1a:> load kernel.save
disk1s1a:> boot

>>> boot -fi vmunix.save Tru64 console.
16.1.1 Changing FreeBSD Kernel Parameters
FreeBSD also allows many kernel parameters to be changed dynamically. The sysctl command can be used to list all kernel parameters along with their current values, and a variant of the same command modifies a parameter value.
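For example (the parameter name and value here are only illustrative; on FreeBSD 4.x, setting a value requires the -w option):

```shell
# sysctl -a | more List all parameters and their current values.
# sysctl kern.maxfiles Display a single parameter.
# sysctl -w kern.maxfiles=16384 Set a new value.
```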
16.1.2 FreeBSD Kernel Modules
FreeBSD also provides support for kernel modules; you can compile them via the corresponding subdirectories in /usr/src/sys/modules. The kldstat -v command displays a list of currently loaded kernel modules. Virtually all are used for supporting devices or filesystem types. You can load and unload kernel modules manually with the kldload and kldunload commands.
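For example (ipfw is just an illustrative module name; any module built under /usr/src/sys/modules works the same way):

```shell
# kldstat -v | head Show currently loaded modules.
# kldload ipfw Load a module.
# kldunload ipfw Unload it again.
```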
The file /boot/loader.conf specifies modules that should be loaded at boot time:
userconfig_script_load="YES" Line created by sysinstall.
usb_load="YES" Load USB modules.
ums_load="YES"
umass_load="YES"
Of course, you need to create the required modules before they can be autoloaded.
16.1.3 Installing the FreeBSD Boot Loader
Generally, the FreeBSD boot loader is installed by default in the Master Boot Record (MBR) of the system disk. However, should you ever need to, you can install it manually with this command:
# boot0cfg -B /dev/ad0
The -B option says to leave the partition table unaltered.
You can also use this command's -m option to prevent certain partitions from appearing in the boot menu. This option takes a hexadecimal integer as its argument. The value is interpreted as a bit mask that includes (bit is on) or excludes (bit is off) each partition from the menu (provided that it is a BSD partition in the first place). The ones bit in the mask corresponds to the first partition, and so on.
For example, the following command enables only partition 3 to be listed in the menu:
# boot0cfg -B -m 0x4 /dev/ad0
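The mask arithmetic is easy to check with ordinary shell arithmetic. This sketch is independent of boot0cfg itself; it simply decodes the mask 0x4, whose only set bit is the third, so only partition 3 is listed:

```shell
#!/bin/sh
# Decode a boot-menu mask: bit N set => partition N is listed.
mask=0x4
shown=""
for part in 1 2 3 4; do
  # Shift the mask right so the bit for this partition is in the ones place.
  if [ $(( (mask >> (part - 1)) & 1 )) -eq 1 ]; then
    shown="$shown$part "
    echo "partition $part: listed in boot menu"
  else
    echo "partition $part: hidden"
  fi
done
```

A mask of 0x5 (binary 101) would instead list partitions 1 and 3.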
You can use the disklabel command with its -B option to install the boot program into a subpartition within a physical disk partition, as in this example, which installs it into the first subpartition in the first partition:
# disklabel -B /dev/ad0s1
16.1.4 Tru64 Dynamic Kernel Configuration
Tru64 also supports two sorts of kernel reconfiguration without needing to build a new kernel: subsystem loading and unloading, and kernel parameter modification.
A very few subsystems may be dynamically loaded into and unloaded from the Tru64 kernel. You can list all configured subsystems using the sysconfig command:
# sysconfig -s
cm: loaded and configured
hs: loaded and configured
ksm: loaded and configured
Subsystems can be loaded or unloaded. The -m option displays whether each one is dynamic (loadable and unloadable with a running kernel) or static:
# sysconfig -m | grep dynamic
hwautoconfig: dynamic
envmon: dynamic
lat: dynamic
On this system, only three subsystems are dynamic. For these modules, you can use the sysconfig -c and -u options to load and unload them, respectively.
Static and dynamic subsystems can also have settable kernel parameters associated with them. You can view the list of available parameters with a command like this one:
# sysconfig -Q lsm Parameters for the Logical Storage Manager
lsm:
Module_Name - type=STRING op=Q min_len=3 max_len=30
lsm_rootdev_is_volume - type=INT op=CQ min_val=0 max_val=2
Enable_LSM_Stats - type=INT op=CRQ min_val=0 max_val=1
The display lists the parameter name, its data type, allowed operations, and valid range of values. The operations are specified via a series of code letters: Q means the parameter can be queried, C means a change takes effect after reboot, and R means a change takes effect on a running system.
In our example, the first parameter (the name of the module) can be queried but not modified; the second parameter (whether the root filesystem is a logical volume) can be modified, but the new value won't take effect until the system reboots; and the third parameter (whether subsystem statistics are recorded) takes effect as soon as it is changed.
You use the -q option to display the current value of a parameter and the -r option to change its value:
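For example, using the Enable_LSM_Stats attribute shown above (the assigned value is purely illustrative; this attribute's R code means the change takes effect immediately):

```shell
# sysconfig -q lsm Enable_LSM_Stats Display the current value.
# sysconfig -r lsm Enable_LSM_Stats=1 Change it on the running system.
```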
The /etc/sysconfigtab file can be used to set kernel parameters at boot time (see Section 15.4).
If you prefer a graphical interface, the dxkerneltuner utility can also be used to view and modify the values of kernel parameters. The sys_attrs manual page provides descriptions of kernel parameters and their meanings.
16.2 HP-UX
SAM is still the easiest way to build a new kernel under HP-UX. However, you can build one manually if you prefer:[4]
[4] This command is also useful for simply listing the modified variables in the current kernel.
# cd /stand Move to kernel directory.
# mv vmunix vmunix.save Save current kernel.
# cd build Move to build subdirectory.
# /usr/lbin/sysadm/system_prep -v -s system Extract system file.
# kmtune -s var=value -S /stand/build/system Modify kernel parameters.
... (repeat for each variable)
# mk_kernel -s /system -o /vmunix_new Build new kernel.
# kmupdate /stand/build/vmunix_new Schedule kernel install.
# mv /stand/system /stand/system.prev Save old system file.
# mv /stand/build/system /stand/system Install new system file.
The system_prep script creates a new system configuration file by extracting the information from the running kernel. The kmtune command(s) specify the values of kernel variables for the new kernel.
The mk_kernel script calls the config command and initiates the make process automatically. Once the kernel is built, you use the kmupdate command to schedule its installation at the next reboot. You can then reboot to activate it.
If there is a problem with the new kernel, you can boot the saved kernel with a command like the following:
ISL> hpux /stand/vmunix.save
To determine what kernel object files are available, use the following command to list the contents of the
/stand directory:
ISL> hpux ll /stand
The system file contains information about system devices and settings for various kernel parameters. Here are some examples of the latter:
maxfiles_lim 1024 Maximum open files per process.
maxusers 250 Number of users/processes to assume when sizing kernel data structures.
nproc 512
You can also use SAM to configure these parameters and then rebuild the kernel. Figure 16-1 illustrates using SAM to modify a kernel parameter (in this case, the length of the time slice: the maximum period for which a process can execute before being interrupted by the scheduler).