$ ls                        Show that we have an empty directory
$ ln -s one two             Create a soft (symbolic) link to a nonexistent file
$ file two                  Diagnose this file
two: broken symbolic link to one
$ find                      Find all files
.
./two
$ find -type l              Find soft links only
./two
$ find -type l -follow      Find soft links and try to follow them

The -links option requires a following integer number. If it is unsigned, it selects only files having that many hard links. If it is negative, only files with fewer than that many (in absolute value) links are selected. If the number has a plus sign, then only files with more than that many links are selected. Thus, the usual way to find files with hard links is find -links +1.

The -atime (access time), -ctime (inode-change time), and -mtime (modification time) options require a following integer number, measured in days. If unsigned, it means exactly that many days old. If negative, it means less than that absolute value. With a plus sign, it means more than that value. A common idiom is find -mtime -7 to find files modified in the last week.

It is regrettable that find does not allow the number to have a fractional part or a units suffix: we've often wanted to specify units of years, months, weeks, hours, minutes, or seconds with these options. GNU find provides the -amin, -cmin, and -mmin options, which take values in minutes, but units suffixes on the original timestamp selection options would have been more general.

A related option, -newer filename, selects only files modified more recently than the specified file. If you need finer granularity than a day, you can create an empty file with touch -t date_time timestampfile, and then use that file with the -newer option. If you want to find files older than that file, negate the selector: ! -newer timestampfile.

The find command selector options can be combined: all must match for the action to be taken. They can be optionally separated with the -a (AND) option if you wish. There is also a -o (OR) option that specifies that at least one selector of the surrounding pair must match. Here are two simple examples of the use of these Boolean operators.

The -a and -o operators, together with the grouping options \( and \), can be used to create complex Boolean selectors. You'll rarely need them, and when you do, you'll find them easier to manage in a small shell script: once they are debugged, you can just use that script happily ever after.
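A sketch of the two kinds of Boolean combinations just described, assuming GNU find; the demonstration tree and the size/age thresholds here are our own illustrations, not the book's originals:

```shell
#!/bin/sh
# Sketch: combining find selectors with -a (AND), -o (OR), and \( \) grouping.
# The demo directory, filenames, and thresholds are illustrative.
d=/tmp/booldemo.$$
mkdir -p "$d" && cd "$d"
: > empty.txt                   # zero-length file
echo "small" > small.txt        # nonempty, far smaller than 10 blocks

# AND: nonempty files smaller than 10 (512-byte) blocks
find . -type f -size +0 -a -size -10

# OR with grouping: files that are empty or unmodified for over a year
find . -type f \( -size 0 -o -mtime +365 \)

cd / && rm -rf "$d"
```

The first command selects only small.txt (empty.txt fails -size +0); the second selects only empty.txt, since both files were just created.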
10.4.3.2 A simple find script

So far, we have used find to produce lists of files matching our selection criteria, sometimes feeding them into a simple pipeline. Now let's look at a slightly more complex example. Earlier, we presented a simple sed script to (begin to) convert HTML to XHTML, an XML-based version of HTML. Combining sed with find and a simple loop accomplishes the conversion in just a few lines of code:

cd top level web site directory
find -name '*.html' -type f |               Find all HTML files
    while read file                         Read filename into variable
    do
        echo $file                          Print progress
        mv $file $file.save                 Save a backup copy
        sed -f $HOME/html2xhtml.sed < $file.save > $file    Make the change
    done
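The find-plus-while-read pattern can be exercised on a throwaway tree. In this sketch, the directory, the file, and the one-line sed program are all illustrative stand-ins for the html2xhtml.sed conversion:

```shell
#!/bin/sh
# Sketch of the find | while read loop: back up each matching file,
# then rewrite it with sed. Names and the sed command are illustrative.
d=/tmp/loopdemo.$$
mkdir -p "$d" && cd "$d"
printf '<B>bold</B>\n' > index.html

find . -name '*.html' -type f |
    while read file
    do
        echo "$file"                                  # print progress
        mv "$file" "$file.save"                       # save a backup copy
        sed -e 's/B>/b>/g' < "$file.save" > "$file"   # make the change
    done

cat index.html        # now reads <b>bold</b>
cd / && rm -rf "$d"
```

Note that read splits on whitespace, so this simple form assumes filenames without embedded blanks; the -print0 technique discussed later handles arbitrary names.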
10.4.3.3 A complex find script

In this section, we develop a real working example of find's virtuosity.[8] It is a shell script named filesdirectories that some of our local users with large home-directory trees run nightly via the crontab system (see Section 13.6.4) to create several lists of files and directories, grouped by the number of days within which they have been changed. This helps remind them of their recent activities, and provides a much faster way to search their trees for particular files: searching a single list file rather than the filesystem itself.

[8] Our thanks go to Pieter J. Bowman at the University of Utah for this example.

filesdirectories requires GNU find for access to the -fprint option, which permits multiple output files to be created in one pass through the directory tree, producing a tenfold speedup for this script over a version that used multiple invocations of the original Unix find.
The script begins in the usual way:

#! /bin/sh -

It next sets the IFS variable to newline-space-tab:

IFS='
 	'

and sets the PATH variable to ensure that GNU find is found first:

PATH=/usr/local/bin:/bin:/usr/bin       # need GNU find

It then checks for the expected single argument, and otherwise prints a brief error message on standard error and exits with a nonzero status value.

As an additional security feature, the script invokes umask to limit access to the owner of the output files:

umask 077                               # ensure file privacy

It then initializes TMPFILES to a long list of temporary files that collect the output:

TMPFILES="
        $TMP/DIRECTORIES.all.$$ $TMP/DIRECTORIES.all.$$.tmp
        $TMP/DIRECTORIES.last01.$$ $TMP/DIRECTORIES.last01.$$.tmp
        $TMP/DIRECTORIES.last02.$$ $TMP/DIRECTORIES.last02.$$.tmp
        $TMP/DIRECTORIES.last07.$$ $TMP/DIRECTORIES.last07.$$.tmp
        $TMP/DIRECTORIES.last14.$$ $TMP/DIRECTORIES.last14.$$.tmp
        $TMP/DIRECTORIES.last31.$$ $TMP/DIRECTORIES.last31.$$.tmp
        $TMP/FILES.all.$$ $TMP/FILES.all.$$.tmp
        $TMP/FILES.last01.$$ $TMP/FILES.last01.$$.tmp
        $TMP/FILES.last02.$$ $TMP/FILES.last02.$$.tmp
        $TMP/FILES.last07.$$ $TMP/FILES.last07.$$.tmp
        $TMP/FILES.last14.$$ $TMP/FILES.last14.$$.tmp
        $TMP/FILES.last31.$$ $TMP/FILES.last31.$$.tmp
"

These contain the names of directories and files in the entire tree (*.all.*), as well as the names of those modified in the last day (*.last01.*), last two days (*.last02.*), and so on.

The WD variable saves the argument directory name for later use, and then the script changes to that directory:

WD=$1
cd $WD || exit 1

Changing the working directory before running find solves two problems:

• If the argument is not a directory, or is but lacks the needed permissions, then the cd command fails, and the script terminates immediately with a nonzero exit value.

• If the argument is a symbolic link, cd follows the link to the real location. find does not follow symbolic links unless given extra options, but there is no way to tell it to do so only for the top-level directory. In practice, we do not want filesdirectories to follow links in the directory tree, although it is straightforward to add an option to do so.

The trap commands ensure that the temporary files are removed when the script terminates:

trap 'exit 1' HUP INT PIPE QUIT TERM
trap 'rm -f $TMPFILES' EXIT

The exit status value is preserved across the trap (see Section 13.3.2).
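The trap arrangement can be seen in isolation in this sketch (the scratch-file name is illustrative): the signal traps force an exit status of 1, and the EXIT trap then runs on any termination at all, so cleanup happens exactly once, in one place.

```shell
#!/bin/sh
# Sketch: signal traps force exit status 1; the EXIT trap removes
# the scratch file no matter how the script terminates.
t=/tmp/trapdemo.$$                      # illustrative scratch file
trap 'exit 1' HUP INT PIPE QUIT TERM
trap 'rm -f $t' EXIT

echo scratch > $t
# ... real work would go here ...
exit 0                                  # EXIT trap still runs, removing $t
```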
The lines with the -name options match the names of the output files from a previous run, and the -true option causes them to be ignored so that they do not clutter the output reports:

find . \
        -name DIRECTORIES.all -true \
        -o -name 'DIRECTORIES.last[0-9][0-9]' -true \
        -o -name FILES.all -true \
        -o -name 'FILES.last[0-9][0-9]' -true \

The next line matches all ordinary files, and the -fprint option writes their names to $TMP/FILES.all.$$:

        -o -type f -fprint $TMP/FILES.all.$$ \

The next five lines select files modified in the last 31, 14, 7, 2, and 1 days (the -type f selector is still in effect), and the -fprint option writes their names to the indicated temporary files:

        -a -mtime -31 -fprint $TMP/FILES.last31.$$ \
        -a -mtime -14 -fprint $TMP/FILES.last14.$$ \
        -a -mtime -7 -fprint $TMP/FILES.last07.$$ \
        -a -mtime -2 -fprint $TMP/FILES.last02.$$ \
        -a -mtime -1 -fprint $TMP/FILES.last01.$$ \

The tests are made in order from oldest to newest because each set of files is a subset of the previous one, reducing the work at each step. Thus, a ten-day-old file will pass the first two -mtime tests, but will fail the next three, so it will be included only in the FILES.last31.$$ and FILES.last14.$$ files.

The next line matches directories, and the -fprint option writes their names to $TMP/DIRECTORIES.all.$$:

        -o -type d -fprint $TMP/DIRECTORIES.all.$$ \

The final five lines of the find command match subsets of directories (the -type d selector is still in effect), and write their names, just as for files earlier in the command:

        -a -mtime -31 -fprint $TMP/DIRECTORIES.last31.$$ \
        -a -mtime -14 -fprint $TMP/DIRECTORIES.last14.$$ \
        -a -mtime -7 -fprint $TMP/DIRECTORIES.last07.$$ \
        -a -mtime -2 -fprint $TMP/DIRECTORIES.last02.$$ \
        -a -mtime -1 -fprint $TMP/DIRECTORIES.last01.$$

When the find command finishes, its preliminary reports are available in the temporary files.
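A miniature of the one-pass, many-reports technique: this sketch (GNU find and GNU touch assumed; all paths illustrative) writes three reports from a single traversal, keeping the report files outside the searched tree so that they do not appear in their own listings.

```shell
#!/bin/sh
# Sketch: GNU find's -fprint writes several reports in one pass.
# Tree and report paths are illustrative.
tree=/tmp/fpd.tree.$$  rpt=/tmp/fpd.rpt.$$
mkdir -p $tree $rpt
cd $tree
touch new.txt                       # modified now
touch -d '10 days ago' old.txt      # GNU touch: back-date the mtime

find . -type f -fprint $rpt/all \
       -a -mtime -31 -fprint $rpt/last31 \
       -a -mtime -7  -fprint $rpt/last07

wc -l < $rpt/all                    # 2: both files
wc -l < $rpt/last31                 # 2: both changed within 31 days
wc -l < $rpt/last07                 # 1: only new.txt
cd / && rm -rf $tree $rpt
```

The ten-day-old file passes the -mtime -31 test but fails -mtime -7, so it reaches the first two reports and not the third, exactly the subset behavior described above.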
Next comes the loop over the report files:

for i in FILES.all FILES.last31 FILES.last14 FILES.last07 \
        FILES.last02 FILES.last01 DIRECTORIES.all \
        DIRECTORIES.last31 DIRECTORIES.last14 \
        DIRECTORIES.last07 DIRECTORIES.last02 DIRECTORIES.last01
do

sed replaces the prefix ./ in each report line with the user-specified directory name so that the output files contain full, rather than relative, pathnames:

        sed -e "s=^[.]/=$WD/=" -e "s=^[.]$=$WD=" $TMP/$i.$$ |

sort orders the results from sed into a temporary file named by the input filename suffixed with .tmp:

                LC_ALL=C sort > $TMP/$i.$$.tmp

Setting LC_ALL to C produces the traditional Unix sort order that we have long been used to, and avoids surprises and confusion when more modern locales are set. Using the traditional order is particularly helpful in our diverse environments, because our systems differ in their default locales.

The cmp command silently checks whether the report file differs from that of a previous run, and if so, replaces the old one:

        cmp -s $TMP/$i.$$.tmp $i || mv $TMP/$i.$$.tmp $i

Otherwise, the temporary file is left for cleanup by the trap handler.

The final statement of the script completes the loop over the report files:

done

At runtime, the script terminates via the EXIT trap set earlier.

The complete filesdirectories script is collected in Example 10-1. Its structure should be clear enough that you can easily modify it to add other report files, such as for files and directories modified in the last quarter, half year, and year. By changing the sign of the -mtime values, you can get reports of files that have not been recently modified, which might be helpful in tracking down obsolete files.
Example 10-1. A complex shell script for find

#! /bin/sh -
# Find files and directories, and groups of recently modified ones,
# in a directory tree, creating lists in FILES.* and DIRECTORIES.*
# at top level.
#
# Usage:
#       filesdirectories directory

IFS='
 	'

PATH=/usr/local/bin:/bin:/usr/bin       # need GNU find

if [ $# -ne 1 ]
then
        echo "Usage: $0 directory" >&2
        exit 1
fi

umask 077                               # ensure file privacy

TMP=${TMPDIR:-/tmp}                     # allow alternate temporary directory

TMPFILES="
        $TMP/DIRECTORIES.all.$$ $TMP/DIRECTORIES.all.$$.tmp
        $TMP/DIRECTORIES.last01.$$ $TMP/DIRECTORIES.last01.$$.tmp
        $TMP/DIRECTORIES.last02.$$ $TMP/DIRECTORIES.last02.$$.tmp
        $TMP/DIRECTORIES.last07.$$ $TMP/DIRECTORIES.last07.$$.tmp
        $TMP/DIRECTORIES.last14.$$ $TMP/DIRECTORIES.last14.$$.tmp
        $TMP/DIRECTORIES.last31.$$ $TMP/DIRECTORIES.last31.$$.tmp
        $TMP/FILES.all.$$ $TMP/FILES.all.$$.tmp
        $TMP/FILES.last01.$$ $TMP/FILES.last01.$$.tmp
        $TMP/FILES.last02.$$ $TMP/FILES.last02.$$.tmp
        $TMP/FILES.last07.$$ $TMP/FILES.last07.$$.tmp
        $TMP/FILES.last14.$$ $TMP/FILES.last14.$$.tmp
        $TMP/FILES.last31.$$ $TMP/FILES.last31.$$.tmp
"

WD=$1
cd $WD || exit 1

trap 'exit 1' HUP INT PIPE QUIT TERM
trap 'rm -f $TMPFILES' EXIT

find . \
        -name DIRECTORIES.all -true \
        -o -name 'DIRECTORIES.last[0-9][0-9]' -true \
        -o -name FILES.all -true \
        -o -name 'FILES.last[0-9][0-9]' -true \
        -o -type f -fprint $TMP/FILES.all.$$ \
        -a -mtime -31 -fprint $TMP/FILES.last31.$$ \
        -a -mtime -14 -fprint $TMP/FILES.last14.$$ \
        -a -mtime -7 -fprint $TMP/FILES.last07.$$ \
        -a -mtime -2 -fprint $TMP/FILES.last02.$$ \
        -a -mtime -1 -fprint $TMP/FILES.last01.$$ \
        -o -type d -fprint $TMP/DIRECTORIES.all.$$ \
        -a -mtime -31 -fprint $TMP/DIRECTORIES.last31.$$ \
        -a -mtime -14 -fprint $TMP/DIRECTORIES.last14.$$ \
        -a -mtime -7 -fprint $TMP/DIRECTORIES.last07.$$ \
        -a -mtime -2 -fprint $TMP/DIRECTORIES.last02.$$ \
        -a -mtime -1 -fprint $TMP/DIRECTORIES.last01.$$

for i in FILES.all FILES.last31 FILES.last14 FILES.last07 \
        FILES.last02 FILES.last01 DIRECTORIES.all \
        DIRECTORIES.last31 DIRECTORIES.last14 \
        DIRECTORIES.last07 DIRECTORIES.last02 DIRECTORIES.last01
do
        sed -e "s=^[.]/=$WD/=" -e "s=^[.]$=$WD=" $TMP/$i.$$ |
                LC_ALL=C sort > $TMP/$i.$$.tmp
        cmp -s $TMP/$i.$$.tmp $i || mv $TMP/$i.$$.tmp $i
done

10.4.4 Finding Problem Files
In Section 10.1, we noted the difficulties presented by filenames containing special characters, such as newlines. GNU find has the -print0 option to display filenames as NUL-terminated strings. Since pathnames can legally contain any character except NUL, this option provides a way to produce lists of filenames that can be parsed unambiguously.

It is hard to parse such lists with typical Unix tools, most of which assume line-oriented text input. However, in a compiled language with byte-at-a-time input, such as C, C++, or Java, it is straightforward to write a program to process them.

find also helps to reveal the presence of problematic filenames in your filesystem. Sometimes they get there by simple accident; other times they are deliberately disguised. For example, suppose that you did a directory listing and the output appeared to show only the two special hidden dotted files for the current and parent directory. However, notice that we did not use the -a option, so we should not have seen any hidden files at all; also, there appears to be a space before the first dot in the output. Something is just not right! Let's apply find and od to investigate further:

$ find -print0 | od -ab         Convert NUL-terminated filenames
                                to octal and ASCII

Now we can see what is going on: we have the normal dot directory, then a file named space-dot, another named dot-dot-space-dot-dot-space-dot-dot-space-dot-space-newline-newline-newline-space-space. Unless someone was practicing Morse code in your filesystem, these files look awfully suspicious, and you should investigate them further before you get rid of them.
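Even without writing a compiled program, you can at least count such names reliably: the NUL terminators from -print0 survive where newlines mislead. A sketch (GNU find assumed; the filename is deliberately malicious and illustrative):

```shell
#!/bin/sh
# Sketch: a filename containing a newline fools line-oriented counting,
# but counting -print0's NUL terminators gives the true number of files.
d=/tmp/nuldemo.$$
mkdir -p "$d" && cd "$d"
printf 'x' > 'bad
name'                                   # one file, name contains a newline

find . -type f | wc -l                  # reports 2 "lines"
find . -type f -print0 | tr -cd '\000' | wc -c   # reports 1 file
cd / && rm -rf "$d"
```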
10.5 Running Commands: xargs

When find produces a list of files, it is often useful to be able to supply that list as arguments to another command. Normally, this is done with the shell's command substitution feature, as in this example of searching for the symbol POSIX_OPEN_MAX in system header files:

$ grep POSIX_OPEN_MAX /dev/null $(find /usr/include -type f | sort)
/usr/include/limits.h:#define _POSIX_OPEN_MAX 16

Whenever you write a program or a command that deals with a list of objects, you should make sure that it behaves properly if the list is empty. Because grep reads standard input when it is given no file arguments, we also supplied an argument of /dev/null to ensure that it does not hang waiting for terminal input if find produces no output: that will not happen here, but it is good to develop defensive programming habits.

The output from the substituted command can sometimes be lengthy, with the result that the combined length of a command line and its environment variables is exceeded. When that happens, you'll see this instead:

$ grep POSIX_OPEN_MAX /dev/null $(find /usr/include -type f | sort)
/usr/local/bin/grep: Argument list too long

That limit can be found with getconf:

$ getconf ARG_MAX               Get system configuration value of ARG_MAX
131072

On the systems that we tested, the reported values ranged from a low of 24,576 (IBM AIX) to a high of 1,048,320 (Sun Solaris).

The solution to the ARG_MAX problem is provided by xargs: it takes a list of arguments on standard input, one per line, and feeds them in suitably sized groups (determined by the host's value of ARG_MAX) to another command given as arguments to xargs. Here is an example that eliminates the obnoxious Argument list too long error:

$ find /usr/include -type f | xargs grep POSIX_OPEN_MAX /dev/null
/usr/include/bits/posix1_lim.h:#define _POSIX_OPEN_MAX 16
/usr/include/bits/posix1_lim.h:#define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX

Here, the /dev/null argument ensures that grep always sees at least two file arguments, causing it to print the filename at the start of each reported match. If xargs gets no input filenames, it terminates silently without even invoking its argument program.

GNU xargs has the --null option to handle the NUL-terminated filename lists produced by GNU find's -print0 option. xargs passes each such filename as a complete argument to the command that it runs, without danger of shell (mis)interpretation or newline confusion; it is then up to that command to handle its arguments sensibly.

xargs has options to control where the arguments are substituted, and to limit the number of arguments passed to one invocation of the argument command. The GNU version can even run multiple argument processes in parallel. However, the simple form shown here suffices most of the time. Consult the xargs(1) manual pages for further details, and for examples of some of the wizardry possible with its fancier features.

10.6 Filesystem Space Information

With suitable options, the find and ls commands report file sizes, so with the help of a short awk program, you can report how many bytes your files occupy:

$ find -ls | awk '{Sum += $7} END {printf("Total: %.0f bytes\n", Sum)}'
Total: 23079017 bytes

However, that report underestimates the space used, because files are allocated in fixed-size blocks, and it tells us nothing about the used and available space in the entire filesystem. Two other useful tools provide better solutions: df and du.
10.6.1 The df Command

df (disk free) gives a one-line summary of used and available space on each mounted filesystem. The units are system-dependent blocks on some systems, and kilobytes on others. Most modern implementations support the -k option to force kilobyte units, and the -l (lowercase L) option to include only local filesystems, excluding network-mounted ones. Here is a typical example from one of our web servers:

Filesystem           1K-blocks      Used Available Use% Mounted on
...                        38M      7.9M       29M  22% /boot
...                       9.7G      6.2G      3.0G  68% /export
none                      502M         0      502M   0% /dev/shm
/dev/sda8                  99M      4.4M       90M   5% /tmp
...                                                  3% /var
...                                                  4% /ww

The order of the entries is arbitrary, but the presence of the one-line header makes it harder to apply sort to the output. Fortunately, on most systems, the output is only a few lines long.

df
Usage
        df [ options ] [ files-or-directories ]
Major options
        -k      Show space in kilobytes.
        -l      Lowercase L. Show only local filesystems.
Behavior
        For each file or directory argument, or for all filesystems if there are no such arguments, df produces a one-line header that identifies the output columns, followed by a usage report for the filesystem containing that file or directory.
Caveats
        The output of df varies considerably between systems, making it hard to use reliably in portable shell scripts.
        df's output is not sorted.
        Space reports for remote filesystems may be inaccurate.
        Reports represent only a single snapshot that might be quite different a short time later in an active multiuser system.

You can supply a list of one or more filesystem names or mount points to limit the output to just those:

$ df -lk /dev/sda6 /var
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda6              4032092   1684660   2142608  45% /ww
/dev/sda9             13432904    269704  12480844   3% /var

df's reports about the free space on remote filesystems may be inaccurate, because of software implementation inconsistencies in accounting for the space reserved for emergency use.

For network-mounted filesystems, entries in the Filesystem column are prefixed by hostname:, making the lines long enough that some df implementations split the display into two lines, which is a nuisance for software that parses the output. Here's an example from a Sun Solaris system:

$ df
Filesystem           1k-blocks      Used Available Use% Mounted on
fs:/export/home/0075
                      35197586  33528481   1317130  97% /a/fs/export/home/0075

In Section B.4.3 in Appendix B, we discuss the issue that the inode table in a filesystem has an immutable size that is set when the filesystem is created. The -i (inode units) option provides a way to assess inode usage. Here is an example:

$ df -i
Filesystem             Inodes     IUsed     IFree IUse% Mounted on
...

The /ww filesystem is in excellent shape, since its inode use and filesystem space are both just over 40 percent of capacity. For a healthy computing system, system managers should routinely monitor inode usage on all local filesystems.

df is one of those commands where there is wide variation in the options and output appearance, which again is a nuisance for portable programs that want to parse its output. Hewlett-Packard's implementation on HP-UX is radically different, but fortunately, HP provides a Berkeley-style equivalent, bdf, that produces output that is similar to our example. To deal with this variation, we recommend that you install the GNU version everywhere at your site; it is part of the coreutils package cited in Section 4.1.5.
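Because of the unsorted output and the one-line header noted in the caveats, scripted uses of df usually lead with a little awk. This sketch is our own illustration: it skips the header and totals the Available column of df -k.

```shell
#!/bin/sh
# Sketch: total the Available column (field 4) of df -k output,
# skipping the header line that would otherwise confuse sort or awk.
df -k | awk 'NR > 1 { sum += $4 }
             END    { printf("Total available: %.0f KB\n", sum) }'
```

Note that two-line entries, as in the Solaris network-mount example above, would need extra handling before a field-based tool like this can trust column positions.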
10.6.2 The du Command
df summarizes free space by filesystem, but does not tell you how much space a particular directory tree requires. That job is done by du (disk usage). Like its companion, df, du's options tend to vary substantially between systems, and its space units also may vary. Two important options are widely implemented: -k (kilobyte units) and -s (summarize). Here are examples from our web server system:

$ du -s /tmp

The GNU version provides the -h (human-readable) option:

$ du -h -s /var/log /var/spool /var/tmp

du does not count extra hard links to the same file, and normally ignores soft links. However, some implementations provide options to force soft links to be followed; the option names vary, so consult the manual pages for your system.

du
Usage
        du [ options ] [ files-or-directories ]
Major options
        -k      Show space in kilobytes.
        -s      Show only a one-line summary for each argument.
Behavior
        For each file or directory argument, or for the current directory if no such arguments are given, du normally produces one output line containing an integer representing the usage, followed by the name of the file or directory. Unless the -s option is given, each directory argument is searched recursively, with one report line for each nested directory.
Caveats
        du's output is not sorted.

One common problem that du helps to solve is finding out who the big filesystem users are. Assuming that user home-directory trees reside in /home/users, root can do this:

# du -s -k /home/users/* | sort -k1nr | less    Find large home directory trees

This produces a list of the top space consumers, from largest to smallest. A find dirs -size +10000 command in a few of the largest directory trees can quickly locate files that might be candidates for compression or deletion, and the du output can identify user directory trees that might better be moved to larger quarters.

Some managers automate the regular processing of du reports, sending warning mail to users with unexpectedly large directory trees, such as with the script in Example 7-1 in Chapter 7. In our experience, this is much better than using the filesystem quota system (see the manual pages for quota(1)), since it avoids assigning magic numbers (filesystem-space limits) to users; those numbers are invariably wrong, and they inevitably prevent people from getting legitimate work done.

There is nothing magic about how du works: like any other program, it has to descend through the filesystem and total up the space used by every file. Thus, it can be slow on large filesystems, and it can be locked out of directory trees by strict permissions; if its output contains Permission denied messages, its report undercounts the space usage. Generally, only root has sufficient privileges to use du everywhere in the local system.
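A self-contained version of the big-user hunt, with an illustrative throwaway tree standing in for /home/users:

```shell
#!/bin/sh
# Sketch: du -s -k per directory, sorted largest-first.
# The demo tree and user names are illustrative.
d=/tmp/dudemo.$$
mkdir -p $d/alice $d/bob
dd if=/dev/zero of=$d/alice/big  bs=1024 count=64 2>/dev/null
dd if=/dev/zero of=$d/bob/small  bs=1024 count=4  2>/dev/null

du -s -k $d/* | sort -k1nr      # largest consumer first
rm -rf $d
```

Here sort -k1nr orders on the first field numerically, in reverse, so the alice directory (64 KB of data) tops the list.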
10.7 Comparing Files
In this section, we look at four related topics that involve comparing files:
• Checking whether two files are the same, and if not, finding how they differ
• Applying the differences between two files to recover one from the other
• Using checksums to find identical files
• Using digital signatures for file verification
10.7.1 The cmp and diff Utilities

A problem that frequently arises in text processing is determining whether the contents of two or more files are the same, even if their names differ.

If you have just two candidates, then the file comparison utility, cmp, readily provides the answer:

$ cp /bin/ls /tmp               Make a private copy of /bin/ls
$ cmp /bin/ls /tmp/ls           Compare the original with the copy

No output means that the files are identical.

$ cmp /bin/cp /bin/ls           Compare different files
/bin/cp /bin/ls differ: char 27, line 1   Output identifies the location
                                          of the first difference

cmp is silent when its two argument files are identical. If you are interested only in its exit status, you can suppress the warning message with the -s option:

$ cmp -s /bin/cp /bin/ls        Compare different files silently
$ echo $?                       Display the exit code
1                               Nonzero value means that the files differ

If you want to know the differences between two similar files, diff does the job:

$ echo Test 1 > test.1          Create first test file
$ echo Test 2 > test.2          Create second test file
$ diff test.[12]                Compare the two files
1c1
< Test 1
---
> Test 2

It is conventional in using diff to supply the older file as the first argument.

Difference lines prefixed by a left angle bracket correspond to the left (first) file, and those prefixed by a right angle bracket come from the right (second) file. The 1c1 preceding the differences is a compact representation of the input-file line numbers where the difference occurred, and the operation needed to make the edit: here, c means change. In larger examples, you will usually also find a for add and d for delete.

diff's output is carefully designed so that it can be used by other programs. For example, revision control systems use diff to manage the differences between successive versions of files under their management.

There is an occasionally useful companion to diff that does a slightly different job. diff3 compares three files, such as a base version and modified files produced by two different people, and produces an ed-command script that can be used to merge both sets of modifications back into the base version. We do not illustrate it here, but you can find examples in the diff3(1) manual pages.

10.7.2 The patch Utility

The patch utility uses the output of diff and either of the original files to reconstruct the other one. Because the differences are generally much smaller than the original files, software developers often exchange difference listings via email, and use patch to apply them. Here is how patch can convert the contents of test.1 to match those of test.2:

$ diff -c test.[12] > test.dif  Save a context difference in test.dif
$ patch < test.dif              Apply the differences
patching file test.1
$ cat test.1                    Show the patched test.1 file
Test 2

patch applies as many of the differences as it can; it reports any failures for you to handle manually.

Although patch can use the ordinary output of diff, it is more common to use diff's -c option to get a context difference. That more verbose report tells patch the filenames, and allows it to verify the change location and to recover from mismatches. Context differences are not essential if neither of the two files has been changed since the differences were recorded, but in software development, quite often one or the other will have evolved.

10.7.3 File Checksum Matching

If you have lots of files that you suspect have identical contents, using cmp or diff would require comparing all pairs of them, leading to an execution time that grows quadratically in the number of files, which is soon intolerable.
You can get nearly linear performance by using file checksums. There are several utilities for computing checksums of files and strings, including sum, cksum, and checksum,[9] the message-digest tools[10] md5 and md5sum, and the secure-hash algorithm[11] tools sha, sha1sum, sha256, and sha384. Regrettably, implementations of sum differ across platforms, making its output useless for comparisons of checksums of files on different systems. Except for the sum command, only a few of these programs are likely to be preinstalled on your system, but all are easy to build and install. Their output formats differ; md5sum, for example, reports a long hexadecimal signature followed by the filename.

[10] R. Rivest, RFC 1321: The MD5 Message-Digest Algorithm, available at ftp://ftp.internic.net/rfc/rfc1321.txt. md5sum is part of the GNU coreutils package.

[11] NIST, FIPS PUB 180-1: Secure Hash Standard, April 1995, available at http://www.cerberussystems.com/INFOSEC/stds/fip180-1.htm, and implemented in the GNU coreutils package.

The long hexadecimal signature string is just a many-digit integer that is computed from all of the bytes of the file in such a way as to make it unlikely that any other byte stream could produce the same value. With good algorithms, longer signatures in general mean greater likelihood of uniqueness. The md5sum output has 32 hexadecimal digits, equivalent to 128 bits. Thus, the chance[12] of having two different files with identical signatures is only about one in 2^64 = 1.84 x 10^19, which is probably negligible. Recent cryptographic research has demonstrated that it is possible to create families of pairs of files with the same MD5 checksum. However, creating a file with similar, but not identical, contents as an existing file, both with the same checksum, is likely to remain a difficult problem.

[12] If you randomly select an item from a collection of N items, each has a 1/N chance of being chosen. If you select M items, then of the M(M-1)/2 possible pairs, the chance of finding a pair with identical elements is (M(M-1)/2)/N. That value reaches probability 1/2 for M about the square root of N. This is called the birthday paradox; you can find discussions of it in books on cryptography, number theory, and probability, as well as at numerous web sites. Its glossary entry includes a short proof and numerical examples.

To find matches in a set of signatures, use them as indices into a table of signature counts, and report just those cases where the counts exceed one. awk is just the tool that we need, and the program in Example 10-2 is short and clear.

Example 10-2. Finding matching file contents

We can conclude, for example, that ed and red are identical programs on this system, although they may vary their behavior according to the name that they are invoked with.

Files with identical contents are often links to each other, especially when found in system directories. The identical-files script provides more useful information when applied to user directories, where it is less likely that files are links and more likely that they're unintended copies.
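In the same spirit as Example 10-2, here is our own minimal sketch of the signature-counting idea, assuming GNU md5sum; the awk filter is our illustration, not necessarily the book's exact listing. It reports each checksum that belongs to more than one file:

```shell
#!/bin/sh
# Sketch: find files with identical contents by counting checksum matches.
# Assumes GNU md5sum; the awk filter and demo files are illustrative.
d=/tmp/ckdemo.$$
mkdir -p "$d"
echo "alpha" > $d/a
echo "beta"  > $d/b
echo "alpha" > $d/c          # same contents as $d/a

md5sum $d/* |
    awk '{ count[$1]++; files[$1] = files[$1] " " $2 }
         END { for (sig in count)
                   if (count[sig] > 1)
                       print sig ":" files[sig] }'
rm -rf "$d"
```

Each input file costs one hash computation and one table update, so the run time grows linearly with the number of files rather than quadratically with the number of pairs.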
Simpo PDF Merge and Split Unregistered Version - http://www.simpopdf.com
Trang 16e the checksum of a file with different contents Software announcements often include checksums of the
e However, checksums alone do not provide verification: if the checksum were recorded in another file
at you downloaded with the software, an attacker could have maliciously changed the software and simply
ate key, known only to its owner, and a public key, potentially known to
t is decryptable with that key,
Alice
be confident that only
[13]
10.7.4 Digital Signature Verification
The various checksum utilities provide a single number that is characteristic of the file, and is unlikely to bsame as the
distribution files so that you have an easy way to tell whether the copy that you just downloaded matches thoriginal
th
revised the checksum accordingly
The solution to this problem comes from public-key cryptography, where data security is obtained from the
existence of two related keys: a priv
anyone Either key may be used for encryption; the other is then used for decryption The security of public-key cryptography lies in the belief that knowledge of the public key, and text tha
provides no practical information that can be used to recover the private key The great breakthrough of this invention was that it solved the biggest problem in historical cryptography: secure exchange of encryption keys among the parties needing to communicate
re is how the private and public keys are used If Alice wants to sign an open letter, she uses her private
to encrypt it Bob uses Alice's public key to decrypt the signed letter, and can then be confident that only
could have signed it, provided that she is trusted not to divulge her private key
If Alice wants to send a letter to Bob that only he can read, she encrypts it with Bob's public key, and he then uses his private key to decrypt it As long as Bob keeps his private key secret, Alice can
Bob can read her letter
It isn't necessary to encrypt the entire message: instead, if just a file checksum is encrypted, then one has a digital signature This is useful if the message itself can be public, but a way is needed to verify its authenticity Several tools for public-key cryptography are implemented in the GNU Privacy Guard (GnuPG) and Pretty Good Privacy[14] (PGP) utilities A complete description of these packages requires an entire book; see the
Chapter 16 However, it is straightforward to use them for one important task: verification of digital sign
We illustrate only GnuPG here, since it is under active development and it builds more
$ ls -l coreutils-5.0.tar*                   Show the distribution files
-rw-rw-r--  1 jones devel 6020616 Apr  2  2003 coreutils-5.0.tar.gz
-rw-rw-r--  1 jones devel      65 Apr  2  2003 coreutils-5.0.tar.gz.sig

$ gpg coreutils-5.0.tar.gz.sig               Try to verify the signature
gpg: Signature made Wed Apr  2 14:26:58 2003 MST using DSA key ID D333CBA1
gpg: Can't check signature: public key not found

The signature verification failed because we have not added the signer's public key to the gpg key ring. If we knew who signed the file, then we might be able to find the public key at the signer's personal web site or ask the signer for a copy via email. However, the only information that we have here is the key ID. Fortunately, people who use digital signatures generally register their public keys with a third-party public-key server, and that registration is automatically shared with other key servers. Some of the major ones are listed in Table 10-2, and more can be found by web search engines. Replicated copies of public keys enhance security: if one key server is unavailable or compromised, you can easily switch to another one.

Table 10-2. Major public-key servers

Country    URL

Use a web browser to visit the key server, type the key ID 0xD333CBA1 into a search box (the leading 0x is mandatory), and get a report like this:

Public Key Server -- Index ''0xD333CBA1 ''

Type bits /keyID     Date        User ID
pub  1024D/D333CBA1  1999/09/26  Jim Meyering <meyering@ascend.com>

Follow the link on the key ID (shown in the preceding code snippet in bold) to get a web page containing the public key itself:

Public Key Server -- Get ''0xD333CBA1 ''

Version: PGP Key Server 0.9.6

Finally, save the key text in a temporary file—say, temp.key—and add it to your key ring:

$ gpg --import temp.key                      Add the public key to your key ring
gpg: key D333CBA1: public key "Jim Meyering <jim@meyering.net>" imported
gpg: Total number processed: 1
gpg:               imported: 1

Now you can verify the signature successfully:

$ gpg coreutils-5.0.tar.gz.sig               Verify the digital signature
gpg: Signature made Wed Apr  2 14:26:58 2003 MST using DSA key ID D333CBA1
gpg: Good signature from "Jim Meyering <jim@meyering.net>"
gpg:                 aka "Jim Meyering <meyering@lucent.com>"
gpg:                 aka "Jim Meyering <meyering@na-net.ornl.gov>"
gpg:                 aka "Jim Meyering <meyering@pobox.com>"
gpg:                 aka "Jim Meyering <meyering@ascend.com>"
gpg: checking the trustdb
gpg: checking at depth 0 signed=0 ot(-/q/n/m/f/u)=0/0/0/0/0/0
gpg: next trustdb check due at ????-??-??
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner
Primary key fingerprint: D70D 9D25 AF38 37A5 909A 4683 FDD2 DEAC D333 CBA1

The warning in the successful verification simply means that you have not certified that the signer's key really does belong to him. Unless you personally know the signer and have good reason to believe that the key is valid, you should not certify keys.

An attacker could modify and repackage the distribution, but without knowledge of the signer's (secret) private key, the digital signature cannot be reproduced, and gpg detects the attack:

$ ls -l coreutils-5.0.tar.gz                 List the maliciously modified archive file
-rw-rw-r--  1 jones devel 6074205 Apr  2  2003 coreutils-5.0.tar.gz

$ gpg coreutils-5.0.tar.gz.sig               Try to verify the digital signature
gpg: Signature made Wed Apr  2 14:26:58 2003 MST using DSA key ID D333CBA1
gpg: BAD signature from "Jim Meyering <jim@meyering.net>"

Even tampering with the signature file itself would be revealed when the signature was verified. Security is never perfect.

You do not need to use a web browser to retrieve a public key: the GNU wget utility[15] can do the job:
echo "Try:    pgp -ka $tmpfile"
echo "     pgpgpg -ka $tmpfile"
echo ""
done

Here is an example of its use:

$ getpubkey D333CBA1                         Get the public key for key ID D333CBA1
...
Try:    pgp -ka /tmp/pgp-0xD333CBA1.tmp.21643
     pgpgpg -ka /tmp/pgp-0xD333CBA1.tmp.21643
rm -f /tmp/pgp-0xD333CBA1.tmp.21643

Some keys can be used with both PGP and GnuPG, but others cannot, so the reminder covers both. Because the command-line options for gpg and pgp differ, and pgp was developed first, gpg comes with a wrapper program, pgpgpg, that takes the same options as pgp, but calls gpg to do the work. Here, pgpgpg -ka is the same as gpg --import.

getpubkey allows you to add retrieved keys to either, or both, of your GnuPG and PGP key rings, at the expense of a bit of cut-and-paste. gpg provides a one-step solution, but only updates your GnuPG key ring:

$ gpg --keyserver pgp.mit.edu --search-keys 0xD333CBA1
gpg: searching for "0xD333CBA1" from HKP server pgp.mit.edu
Keys 1-6 of 6 for "0xD333CBA1"
(1) Jim Meyering <meyering@ascend.com>
      1024 bit DSA key D333CBA1, created 1999-09-26
Enter number(s), N)ext, or Q)uit > 1
gpg: key D333CBA1: public key "Jim Meyering <jim@meyering.net>" imported
gpg: Total number processed: 1
The locate command searches a database constructed by complete scans of the filesystem. When you know part or all of a filename and just want to find where it is in the filesystem, locate is generally the best way to track it down, unless it was created after the database was constructed.

The type command is a good way to find out information about shell commands, and our pathfind script from Chapter 8 provides a more general solution for locating files in a specified directory path.

We took several pages to explore the powerful find command, which uses brute-force filesystem traversal to find files that match user-specified criteria. Nevertheless, we still had to leave many of its facilities for you to discover on your own from its manual pages and the extensive manual for GNU find.

We gave a brief treatment of xargs, another powerful command for doing operations on lists of files, often produced upstream in a pipeline by find. Not only does this overcome command-line length restrictions on many systems, but it also gives you the opportunity to insert additional filters in the pipeline to further control what files are ultimately processed.

The df and du commands report the space used in filesystems and directory trees. Learn them well, because you may use them often. Along the way, we revealed information about the time-of-day clock and its limited range in many current systems.

We wrapped up with a description of commands for comparing files, applying patches, generating file checksums, and validating digital signatures.
Chapter 11. Extended Example: Merging User Databases

The Unix password file, /etc/passwd, has shown up in several places throughout the book. System administration tasks often revolve around manipulation of the password file (and the corresponding group file, /etc/group).[1] The format is well known:

[1] BSD systems maintain an additional file, /etc/master.passwd, which has three additional fields: the user's login class, password change time, and account expiration time. These fields are placed between the GID field and the field for the full name.

tolstoy:x:2076:10:Leo Tolstoy:/home/tolstoy:/bin/bash

There are seven fields: username, encrypted password, user ID number (UID), group ID number (GID), full name, home directory, and login shell. It's a bad idea to leave the password field empty: if it is empty, the user can log in without a password, and anyone with access to the system or a terminal on it can log in as that user. If the seventh field (the shell) is left empty, Unix defaults to the Bourne shell, /bin/sh.
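The colon-separated fields are easy to pick apart with standard tools; a small sketch with awk, using the sample entry above:

```shell
# Extract named fields from a passwd-format line with awk's -F option,
# which sets the field separator to a colon.
line='tolstoy:x:2076:10:Leo Tolstoy:/home/tolstoy:/bin/bash'
printf '%s\n' "$line" |
    awk -F: '{ printf "user=%s uid=%s gid=%s shell=%s\n", $1, $3, $4, $7 }'
# prints: user=tolstoy uid=2076 gid=10 shell=/bin/bash
```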
As is discussed in detail in Appendix B, it is the user and group ID numbers that Unix uses for permission checking when accessing files. If two users have different names but the same UID number, then as far as Unix knows, they are identical. There are rare occasions when you want such a situation, but usually having two accounts with the same UID number is a mistake. In particular, NFS requires a uniform UID space; user number 2076 on all systems accessing each other via NFS had better be the same user (tolstoy), or else there will be serious security problems.

Now, return with us for a moment to yesteryear (around 1986), when Sun's NFS was just beginning to become popular and available on non-Sun systems. At the time, one of us was a system administrator of two separate 4.2 BSD Unix minicomputers. These systems communicated via TCP/IP, but did not have NFS. However, a new OS vendor was scheduled to make 4.3 BSD + NFS available for these systems. There were a number of users with accounts on both systems; typically the username was the same, but the UID wasn't! These systems were soon to be sharing filesystems via NFS; it was imperative that their UID spaces be merged. The task was to write a series of scripts that would:

• Merge the /etc/passwd files of the two systems. This entailed ensuring that all users from both systems had unique UID numbers.

• Change the ownership of all files to the correct users in the case where an existing UID was to be used for a different user.

It is this task that we recreate in this chapter, from scratch. (The original scripts are long gone, and it's occasionally interesting and instructive to reinvent a useful wheel.) This problem isn't just academic, either: consider two departments in a company that have been separate but that now must merge. It's possible for there to be users with accounts on systems in multiple departments. If you're a system administrator, you may one day face this very task. In any case, we think it is an interesting problem to solve.
11.2 The Password Files
Let's call our two hypothetical Unix systems u1 and u2. Example 11-1 presents the /etc/passwd file from u1.[2]

[2] Any resemblance to actual users, living or dead, is purely coincidental.

Example 11-1. u1 /etc/passwd file

root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
tolstoy:x:2076:10:Leo Tolstoy:/home/tolstoy:/bin/bash
camus:x:…:10:Albert Camus:/home/camus:/bin/bash
jhancock:x:200:10:John Hancock:/home/jhancock:/bin/bash
ben:x:…:…:…:/home/ben:/bin/bash
abe:x:…:…:…:/home/abe:/bin/bash
tj:x:…:…:…:/home/tj:/bin/bash

Example 11-2 presents the /etc/passwd file from u2.

Example 11-2. u2 /etc/passwd file

root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
george:x:1100:10:George Washington:/home/george:/bin/bash
betsy:x:1110:10:Betsy Ross:/home/betsy:/bin/bash
jhancock:x:300:10:John Hancock:/home/jhancock:/bin/bash

If you examine these files carefully, you'll see that they represent the various possibilities that our program has to handle:

• Users for whom the username and UID are the same on both systems. This happens most typically with administrative accounts such as root and bin.

• Users for whom the username and UID exist only on one system but not the other. In this case, when the files are merged, there is no problem.

• Users for whom the username is the same on both systems, but the UIDs are different.

• Users for whom the username is different on both systems, but the UIDs are the same.
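One way to start sorting the users into these cases, though not the chapter's eventual solution, is to compare the username columns of the two files with cut, sort, and comm. The two files below are small stand-ins built from entries in the examples:

```shell
# Build abbreviated copies of the two password files.
cat > /tmp/u1.passwd <<'EOF'
root:x:0:0:root:/root:/bin/bash
tolstoy:x:2076:10:Leo Tolstoy:/home/tolstoy:/bin/bash
jhancock:x:200:10:John Hancock:/home/jhancock:/bin/bash
EOF
cat > /tmp/u2.passwd <<'EOF'
root:x:0:0:root:/root:/bin/bash
betsy:x:1110:10:Betsy Ross:/home/betsy:/bin/bash
jhancock:x:300:10:John Hancock:/home/jhancock:/bin/bash
EOF

# Extract the username column, sort it (comm requires sorted input),
# and show only the names common to both systems (-12 suppresses the
# columns of names unique to each file).
cut -d: -f1 /tmp/u1.passwd | sort > /tmp/u1.names
cut -d: -f1 /tmp/u2.passwd | sort > /tmp/u2.names
comm -12 /tmp/u1.names /tmp/u2.names
# prints: jhancock
#         root
```

The names that comm reports in common (here, jhancock with differing UIDs, and root with identical ones) are exactly the entries that need individual attention during the merge.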
11.3 Merging Password Files