Network debugging
In this chapter:
network problems
problems
application layers
The chances are quite good that you’ll have some problems somewhere when you set up your network. FreeBSD gives you a large number of tools with which to find and solve the problem. In this chapter, we’ll consider a methodology for debugging network problems. In the process, we’ll look at the programs that help with debugging. It will help to keep a finger in Chapter 16 while reading this section.
How to approach network problems
Recall from Chapter 16 that network software and hardware operate on at least four layers. If one layer doesn’t work, the ones above won’t either. When solving problems, it obviously makes sense to start at the bottom and work up.
Most people understand this up to a point. Nobody expects a PPP connection to the Internet to work if the modem can’t dial the ISP. On the other hand, a large number of messages to the FreeBSD-questions mailing list show that many people seem to think that once this connection has been established, everything else will work automatically. If it doesn’t, they’re puzzled.
Unfortunately, the Net isn’t that simple. In fact, it’s too complicated to give a hard-and-fast methodology at all. Much network debugging can look more like magic than anything rational. Nevertheless, a surprising number of network problems can be solved by using the steps below. Even if they don’t solve your problem, read through them. They might give you some ideas about where to look.
netdebug.mm,v v4.15 (2003/04/02 03:23:15) 401
Link layer problems
To test your link layer, start with ping. ping is a relatively simple program that sends an ICMP echo packet to a specific IP address and checks the reply. ICMP is the Internet Control Message Protocol. See TCP/IP Illustrated, by Richard Stevens, for more information.
A typical ping output might look like:
$ ping bumble
PING bumble.example.org (223.147.37.156): 56 data bytes
64 bytes from 223.147.37.156: icmp_seq=0 ttl=255 time=1.137 ms
64 bytes from 223.147.37.156: icmp_seq=1 ttl=255 time=0.640 ms
64 bytes from 223.147.37.156: icmp_seq=2 ttl=255 time=0.671 ms
64 bytes from 223.147.37.156: icmp_seq=3 ttl=255 time=0.612 ms
^C
--- bumble.example.org ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.612/0.765/1.137/0.216 ms
In this case, we are sending the messages to the system bumble.example.org. By default, ping sends messages with 56 bytes of data. With the 8-byte ICMP header, this makes the 64 bytes reported for each reply. By default, ping continues until you stop it: notice the ^C indicating that this invocation was stopped by pressing Ctrl-C.
The information that ping gives you isn’t much, but it’s useful:
• It tells you how long it takes for each packet to get to its destination and back.
• It tells you how many packets didn’t make it.
• It also prints a summary of packet statistics.
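The summary line can be recomputed from the individual replies, which is a handy cross-check when you only have a captured log. The following is a minimal sketch in Python, not part of ping itself; the function name summarize is made up for illustration, and it assumes that the deviation is the population standard deviation, which reproduces the figure in the output above.

```python
import math
import re

# Matches the round-trip time in a reply line such as
# "64 bytes from 223.147.37.156: icmp_seq=0 ttl=255 time=1.137 ms"
TIME = re.compile(r"time=([0-9.]+) ms")


def summarize(lines):
    """Return (min, avg, max, stddev) of the round-trip times in ms, or None."""
    times = [float(m.group(1)) for line in lines if (m := TIME.search(line))]
    if not times:
        return None
    avg = sum(times) / len(times)
    # Population standard deviation: divide by n, not by n - 1 (assumption).
    stddev = math.sqrt(sum((t - avg) ** 2 for t in times) / len(times))
    return min(times), avg, max(times), stddev


# The four replies from the example above.
replies = [
    "64 bytes from 223.147.37.156: icmp_seq=0 ttl=255 time=1.137 ms",
    "64 bytes from 223.147.37.156: icmp_seq=1 ttl=255 time=0.640 ms",
    "64 bytes from 223.147.37.156: icmp_seq=2 ttl=255 time=0.671 ms",
    "64 bytes from 223.147.37.156: icmp_seq=3 ttl=255 time=0.612 ms",
]

print("round-trip min/avg/max/stddev = %.3f/%.3f/%.3f/%.3f ms" % summarize(replies))
# prints: round-trip min/avg/max/stddev = 0.612/0.765/1.137/0.216 ms
```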
But what if this doesn’t work? You enter your ping command, and all you get is:
$ ping wait
PING wait.example.org (223.147.37.4): 56 data bytes
^C
--- wait.example.org ping statistics ---
5 packets transmitted, 0 packets received, 100% packet loss
Obviously, something’s wrong here. We’ll look at it in more detail below. This is very different, however, from this situation:
$ ping presto
^C
In the second case, even after waiting a reasonable amount of time, nothing happened at all. ping didn’t print the PING message, and when we hit Ctrl-C there was no further output. This is indicative of a name resolution problem: ping can’t print the first line (PING presto ...) until it has found the IP address of the system, in other words, until it has performed a DNS lookup. If we wait long enough, it will time out, and we get the message ping: cannot resolve presto: Unknown host. If this happens, use the IP address instead of the name. DNS is an application, so we won’t even try to debug it until we’ve debugged the link and network layers.
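To separate a name resolution problem from a connectivity problem, it can help to perform the lookup on its own. The following Python sketch does only that; the function name resolve is made up for illustration, and the lookup may block for as long as the system resolver’s timeout, just as ping does.

```python
import socket


def resolve(name):
    """Return the IPv4 address for name, or None if the lookup fails."""
    try:
        return socket.gethostbyname(name)
    except OSError as err:
        # The lookup failed: this is the case in which ping never gets
        # as far as printing its first line.
        print(f"cannot resolve {name}: {err}")
        return None


# An address literal needs no DNS lookup, so this succeeds even when
# DNS is unreachable. That is why pinging by IP address is the next step.
print(resolve("127.0.0.1"))
```

If resolve returns None for a name but a ping to the corresponding address works, the link and network layers are fine and the problem is in name resolution.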
If things don’t work out, there are two possibilities:
• If both systems are on the same network, it’s a link layer problem. We’ll look at that first.
• If the systems are on two different networks, it might be a network layer problem. That’s more complicated: we don’t know which network to look at. It could be either of the networks on which the systems are located, or it could also be a problem with one of the networks on the way. How do you find out where your packets get lost? First you check the link layer. If it checks out OK, and the problem still exists, continue with the network layer on page 405.
So what can cause link layer problems? There are a number of possibilities:
• One of the interfaces (source or destination) could be misconfigured. They should both be in the same range of network addresses. For example, the following two interface configurations cannot talk to each other directly, even if they’re on the same physical network:
machine 1
dc0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
inet 223.147.37.81 netmask 0xffffff00 broadcast 223.147.37.255
machine 2
xl0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
options=3<RXCSUM,TXCSUM>
inet 192.168.27.1 netmask 0xffffff00 broadcast 192.168.27.255
• If you see something like this on an Ethernet interface, it’s pretty clear that it has a cabling problem:
xl0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
options=3<RXCSUM,TXCSUM>
inet 192.168.27.1 netmask 0xffffff00 broadcast 192.168.27.255
media: Ethernet autoselect (none)
status: no carrier
In this case, check the physical connections. If you’re using UTP, check that you have the right kind of cable, normally a ‘‘straight-through’’ cable. If you accidentally use a crossover cable where you need a straight-through cable, or vice versa, you will not get any connection. Also, many hubs and switches have a ‘‘crossover’’ switch that achieves the same result.
• If you’re on an RG-58 thin Ethernet, the most likely problem is a break in the cabling. You can check the static resistance between the central pin and the external part of the connector with a multimeter. It should be approximately 25 Ω. If it’s 50 Ω, it indicates that there is a break in the cable, or that one of the terminators has been disconnected.
• If your interface is configured correctly, and you’re using a 10 Mb/s card, check whether you are using the correct connection to the network. Some older Ethernet boards support multiple physical connections (for example, both BNC and UTP). For example, if your network runs on RG-58 thin Ethernet, and your interface is set to AUI, you may still be able to send data on the RG-58, but you won’t be able to receive any.
The method of setting the connection depends on the board you are using. PCI boards are not normally a problem, because the driver can set the parameters directly, but ISA boards can drive you crazy. In the case of very old boards, such as the Western Digital 8003, you may need to set jumpers. In others, you may need to run the setup utility under DOS, and with others you can set it with the link flags to ifconfig. For example, on a 3Com 3c509 ‘‘combo’’ board, you can set the connection like this:
# ifconfig ep0 link0 -link1 set AUI
This example is correct for the ep driver, but not necessarily for other Ethernet boards: each board has its own flags. Read the man page for the board for the correct flags.
• If your interface looks OK, the next thing to do is to see whether you can send data to other machines on the network. If so, of course, you should continue your search on the machine that isn’t responding. If none are working, you probably have a cabling problem.
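The first item in the list above, interfaces configured for different networks, can be checked mechanically from the inet lines that ifconfig prints. This is a minimal sketch using Python’s standard ipaddress module; the helper names network_of and same_network are made up for illustration, and the netmask is taken in the hexadecimal form shown in the ifconfig output.

```python
import ipaddress


def network_of(addr, hex_netmask):
    """Return the network that addr belongs to, given a mask such as 0xffffff00."""
    mask = ipaddress.IPv4Address(int(hex_netmask, 16))
    # strict=False lets us pass a host address rather than a network address.
    return ipaddress.IPv4Network(f"{addr}/{mask}", strict=False)


def same_network(a, b):
    """a and b are (address, hex netmask) pairs taken from ifconfig output."""
    return network_of(*a) == network_of(*b)


machine1 = ("223.147.37.81", "0xffffff00")
machine2 = ("192.168.27.1", "0xffffff00")

print(network_of(*machine1))             # 223.147.37.0/24
print(network_of(*machine2))             # 192.168.27.0/24
print(same_network(machine1, machine2))  # False
```

The two interfaces end up in different /24 networks, so neither will send packets directly to the other, regardless of the cabling.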
On a wireless network, you need to check for a number of additional problems. ifconfig should show something like this:
wi0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
inet6 fe80::202:2dff:fe04:93a%wi0 prefixlen 64 scopeid 0x3
inet 192.168.27.17 netmask 0xffffff00 broadcast 192.168.27.255
ether 00:02:2d:21:54:4c
media: IEEE 802.11 Wireless Ethernet autoselect (DS/11Mbps)
status: associated
ssid "FreeBSD IBSS" 1:""
stationname "FreeBSD WaveLAN/IEEE node"
channel 3 authmode OPEN powersavemode OFF powersavesleep 100
wepmode OFF weptxkey 1
wepkey 2:64-bit 0x123456789a 3:128-bit 0x123456789abcdef123456789ab
There are many things to check here:
• Do you have the same operating mode? This example shows a card operating in BSS or IBSS mode. By contrast, you might see this:
media: IEEE 802.11 Wireless Ethernet autoselect (DS/11Mbps <adhoc, flag0>)
In this case, the interface is operating in so-called ‘‘Lucent demo ad-hoc’’ mode, which is not the same thing as ‘‘ad-hoc’’ mode (which in turn is better called IBSS mode). IBSS mode (‘‘ad-hoc’’) and BSS mode are compatible. IBSS mode and ‘‘Lucent demo ad-hoc’’ mode are not. See Chapter 17, page 306 for further details.
• Is the status associated? The alternative is no carrier. Some cards, including this one, show no carrier when communicating with a station operating in IBSS mode, but they never show associated unless they are really associated.
• If the card is not associated, check the frequencies and the network name.
• Check the WEP (encryption) parameters to ensure that they match. Note that ifconfig does not display the WEP key unless you are root.
Your card may show associated even if the WEP key doesn’t match. In such a case, it knows about the network, but it can’t communicate with it.
After checking all these things, you should have a connection But you may not be home yet:
• If you have a connection, check if all packets got there. Lost packets could mean line quality problems. That’s not very likely on an Ethernet, but it’s very possible on a PPP or DSL link. There’s an uncertainty about dropped packets: you might hit Ctrl-C after the last packet went out, but before it came back. If the line is very slow, you might lose multiple packets. Compare the sequence number of the last packet that returns with the total number returned. If it’s one less, all the packets except the ones at the end made it.
• Check that each packet comes back only once. If not, there’s definitely something wrong, or you have been pinging a broadcast address. That looks like this:
$ ping 223.147.37.255
PING 223.147.37.255 (223.147.37.255): 56 data bytes
64 bytes from 223.147.37.1: icmp_seq=0 ttl=255 time=0.428 ms
64 bytes from 223.147.37.88: icmp_seq=0 ttl=255 time=0.785 ms (DUP!)
64 bytes from 223.147.37.65: icmp_seq=0 ttl=64 time=1.818 ms (DUP!)
64 bytes from 223.147.37.1: icmp_seq=1 ttl=255 time=0.426 ms
64 bytes from 223.147.37.88: icmp_seq=1 ttl=255 time=0.442 ms (DUP!)
64 bytes from 223.147.37.65: icmp_seq=1 ttl=64 time=1.099 ms (DUP!)
64 bytes from 223.147.37.126: icmp_seq=1 ttl=255 time=45.781 ms (DUP!)
FreeBSD systems do not respond to broadcast pings, but most other systems do, so this effectively counts the number of non-BSD machines on a network.
• Check the times. A ping across an Ethernet should take between about 0.2 and 2 ms, a ping across a wireless connection should take between 2 and 12 ms, a ping across an ISDN connection should take about 30 ms, a ping across a 56 kb/s analogue connection should take about 100 ms, and a ping across a satellite connection should take about 250 ms in each direction. All of these times are for idle lines, and the time can go up to over 5 seconds for a slow serial line transferring large blocks of data (for example, ftping a file). In this example, some line traffic delayed the response to individual pings.
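The first two checks in the list above, missing replies and replies that arrive more than once, both come down to looking at the icmp_seq values. The following Python sketch scans saved ping output for gaps and repeats. The function name check_sequence is made up for illustration, and the sample lines are invented in the format of the output shown above, not taken from a real session.

```python
import re
from collections import Counter

# Picks the sequence number out of a reply line.
SEQ = re.compile(r"icmp_seq=(\d+)")


def check_sequence(lines):
    """Return (missing, duplicated) lists of sequence numbers."""
    seen = Counter()
    for line in lines:
        m = SEQ.search(line)
        if m:
            seen[int(m.group(1))] += 1
    if not seen:
        return [], []
    # Sequence numbers start at 0, so anything absent below the highest
    # number seen was lost. Packets lost after the last reply cannot be
    # detected this way; compare with the "packets transmitted" count.
    missing = [s for s in range(max(seen) + 1) if s not in seen]
    duplicated = sorted(s for s, count in seen.items() if count > 1)
    return missing, duplicated


sample = [
    "64 bytes from 223.147.37.1: icmp_seq=0 ttl=255 time=0.428 ms",
    "64 bytes from 223.147.37.88: icmp_seq=0 ttl=255 time=0.785 ms (DUP!)",
    "64 bytes from 223.147.37.1: icmp_seq=1 ttl=255 time=0.426 ms",
    "64 bytes from 223.147.37.1: icmp_seq=3 ttl=255 time=0.431 ms",
]

print(check_sequence(sample))  # ([2], [0])
```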
Network layer problems
Once we know the link layer is working correctly, we can turn our attention to the next layer up, the network layer. Well, first we should check if the problem is still with us.
We need additional tools for the network layer. ping is a useful tool for telling you whether data is getting through to the destination, and if so, how much is getting through. But what if your local network checks out just fine, and you can’t reach a remote network? Or if you’re losing 40% of your packets to foo.bar.org, and the remaining ones are taking up to 5 seconds to get through? Where’s the problem? Based on the recent ‘‘upgrade’’ your ISP performed, and the fact that you’ve had trouble getting to other sites, you suspect that the performance problems might be occurring in the ISP’s net. How can you find out?
As we saw while investigating the link layer, a complete failure is often easier to fix than a partial failure. If nothing at all is getting through, you probably have a routing problem. Check the routing table with netstat. On bumble, you might see:
$ netstat -r
Routing tables
Internet:
The default route is via gw, which is correct. The first thing is to ensure that you can ping gw; that’s a link level issue, so we’ll assume that you can. But what if you try to ping a remote system and you see something like this?
# ping rider.fc.net
PING rider.fc.net (207.170.123.194): 56 data bytes
36 bytes from gw.example.org (223.147.37.5): Destination Host Unreachable
4 5 00 6800 c5da 0 0000 fe 01 246d 223.147.37.2 207.170.123.194
36 bytes from gw.example.org (223.147.37.5): Destination Host Unreachable
4 5 00 6800 c5e7 0 0000 fe 01 2460 223.147.37.2 207.170.123.194
^C
--- rider.fc.net ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss
These are ICMP messages from gw indicating that it does not know where to send the data. This is almost certainly a routing problem; on gw you might see something like:
$ netstat -r
Routing tables
Internet:
The problem here is that there is no default route. Add it with the route command:
# route add default free-gw.example.net
# netstat -r
Routing tables
Internet:
default free-gw.example.ne UGSc 24 5724 ppp0
etc
See Chapter 17, page 310, for more details, including how to ensure that the routes will be added automatically at boot time.
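To see why a missing default route produces ‘‘Destination Host Unreachable’’, it helps to look at how a routing table is consulted: the most specific matching entry wins, and the default route (0.0.0.0/0) matches everything. The following Python sketch is a simplified model, not the kernel’s implementation. The function name lookup and the table contents are assumptions made for illustration; in particular, the local network entry is hypothetical.

```python
import ipaddress


def lookup(routes, destination):
    """Return the gateway of the most specific matching route, or None."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, gateway in routes:
        net = ipaddress.ip_network(prefix)
        # Longest prefix match: prefer the route with the longest netmask.
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway)
    return best[1] if best else None


# Hypothetical table on gw with only the local network configured.
routes = [("223.147.37.0/24", "link#1")]
print(lookup(routes, "207.170.123.194"))  # None: no route, so gw reports unreachable

# After "route add default free-gw.example.net":
routes.append(("0.0.0.0/0", "free-gw.example.net"))
print(lookup(routes, "207.170.123.194"))  # free-gw.example.net
print(lookup(routes, "223.147.37.2"))     # link#1: the more specific route still wins
```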
But what if the routes look right, you don’t get any ICMP messages, and no data gets through? You don’t always get ICMP messages when the data can’t get through. The logical next place to look is free-gw.example.net, but there’s a problem with that: as the administrator of example.org, you don’t have access to example.net’s machines. You can call them up, of course, but before you do you should be reasonably sure it’s their problem. You can find out more information with traceroute.
traceroute
traceroute sends UDP packets to the destination, but it modifies the time-to-live field in the IP header (see page 280) so that, initially at any rate, they don’t get there. As we saw there, the time-to-live field specifies the number of hops that a packet can go before it is discarded. When it is, the system that discards it should send back an ICMP time exceeded message. traceroute uses this feature and sends out packets with time-to-live set first to one, then to two, and so on. It prints the IP address of the system that sends the ‘‘time exceeded’’ message and the time it took, thus giving something like a two-dimensional ping. Here’s an example to hub.FreeBSD.org:
$ traceroute hub.freebsd.org
traceroute to hub.freebsd.org (204.216.27.18), 30 hops max, 40 byte packets
1 gw (223.147.37.5) 1.138 ms 0.811 ms 0.800 ms
2 free-gw.example.net (139.130.136.129) 131.913 ms 122.231 ms 134.694 ms
3 Ethernet1-0.way1.Adelaide.example.net (139.130.237.65) 118.229 ms 120.040 ms 118.723 ms
4 Fddi0-0.way-core1.Adelaide.example.net (139.130.237.226) 171.590 ms 117.911 ms 123.513 ms
5 Serial5-0.lon-core1.Melbourne.example.net (139.130.239.21) 129.267 ms 226.927 ms 125.547 ms
6 Fddi0-0.lon5.Melbourne.example.net (139.130.239.231) 144.372 ms 133.998 ms 136.699 ms
7 borderx2-hssi3-0.Bloomington.mci.net (204.70.208.121) 962.258 ms 482.393 ms 754.989 ms
8 core2-fddi-1.Bloomington.mci.net (204.70.208.65) 821.636 ms * 701.920 ms
9 bordercore3-loopback.SanFrancisco.mci.net (166.48.16.1) 424.254 ms 884.033 ms 645.302 ms
10 pb-nap.crl.net (198.32.128.20) 435.907 ms 438.933 ms 451.173 ms
11 E0-CRL-SFO-02-E0X0.US.CRL.NET (165.113.55.2) 440.425 ms 430.049 ms 447.340 ms
12 T1-CDROM-00-EX.US.CRL.NET (165.113.118.2) 553.624 ms 460.116 ms *
13 hub.FreeBSD.ORG (204.216.27.18) 642.032 ms 463.661 ms 432.976 ms
By default, traceroute tries each hop three times and prints out the times as they happen, so if the response time is more than about 300 ms, you’ll notice it as it happens. If there is no reply after a timeout period (default 5 seconds), traceroute prints an asterisk (*). You’ll also occasionally notice a significant delay at the beginning of a line, although the response time seems reasonable. In this case, the delay is probably caused by a DNS reverse lookup for the name of the system. If this becomes a problem (maybe because the global DNS servers aren’t reachable), you can turn off DNS reverse lookup using the -n flag.
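The time-to-live mechanism that traceroute relies on can be modelled without touching the network. The following Python sketch is a pure simulation under simplifying assumptions: the path is a fixed list of names, every router answers, and no packets are lost. The function name trace is made up for illustration. Routers along the way answer with an ICMP time exceeded message when the time-to-live runs out, and the destination itself answers with a port unreachable message because nothing is listening on the UDP port that traceroute chooses.

```python
def trace(path, max_hops=30):
    """Simulate traceroute over path, a list of hop names ending in the destination.

    Returns a list of (ttl, responder, icmp_message) tuples.
    """
    report = []
    for ttl in range(1, min(max_hops, len(path)) + 1):
        # Each hop decrements the time-to-live. The hop that brings it
        # to zero discards the packet and reports back to the sender.
        responder = path[ttl - 1]
        if ttl == len(path):
            # The packet finally reached the destination.
            report.append((ttl, responder, "port unreachable"))
        else:
            report.append((ttl, responder, "time exceeded"))
    return report


# A shortened, hypothetical path using names from the example above.
for hop in trace(["gw", "free-gw.example.net", "hub.FreeBSD.ORG"]):
    print(hop)
```

Each line of real traceroute output corresponds to one tuple here, with the three probe times added.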
If you look more carefully at the times in the example above, you’ll see three groups of times:
1. The times to gw are around 1 ms. This is typical of an Ethernet network.
2. The times for hops 2 to 6 are in the order of 100 to 150 ms. This indicates that the link between gw.example.org and free-gw.example.net is running PPP over a telephone line. The delay between free-gw.example.net and Fddi0-0.lon5.Melbourne.example.net is negligible compared to the delay across the PPP link, so you don’t see much difference.
3. The times from borderx2-hssi3-0.Bloomington.mci.net to hub.FreeBSD.ORG are significantly higher, between 400 and 1000 ms. We also note a couple of dropped packets. This indicates that the line between Fddi0-0.lon5.Melbourne.example.net and borderx2-hssi3-0.Bloomington.mci.net is overloaded. The length of the link (about 13,000 km) also plays a role: that’s a total distance of 26,000 km, which takes about 85 ms to cover. If this were a satellite connection, things would be much slower: the total distance from ground station to satellite and back to the ground is 72,000 km, which takes a total of 240 ms to propagate.
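The propagation figures in the last item are simply distance divided by speed. The following Python sketch reproduces them, assuming that the signal travels at the speed of light in a vacuum; real cables are slower, so treat the results as lower bounds. The function name propagation_ms is made up for illustration.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # in a vacuum; signals in cable are slower


def propagation_ms(distance_km):
    """Return the minimum time in milliseconds to cover distance_km."""
    return distance_km / SPEED_OF_LIGHT_KM_S * 1000.0


# Round trip over the 13,000 km link: 26,000 km in total.
print(f"{propagation_ms(26_000):.0f} ms")  # 87 ms

# Ground station to satellite and back to the ground: 72,000 km.
print(f"{propagation_ms(72_000):.0f} ms")  # 240 ms
```

The first result is close to the figure of about 85 ms quoted above, and the second matches the 240 ms given for the satellite connection.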
Back to our problem. If we see something like the output in the previous example, we know that there’s no reason to call up the people at example.net: it’s not their problem. This might just be overloading on the global Internet. On the other hand, what about this?
$ traceroute hub.freebsd.org
traceroute to hub.freebsd.org (204.216.27.18), 30 hops max, 40 byte packets
1 gw (223.147.37.5) 1.138 ms 0.811 ms 0.800 ms
2 * * *
3 * * *
^C
You’ve fixed your routing problems, but you still can’t get data off the system. There are a number of possibilities here:
• The link to the next system may be down. The solution’s obvious: bring it up and try again.
• gw may not be configured as a gateway You can check this with:
$ sysctl net.inet.ip.forwarding
net.inet.ip.forwarding: 1
For a router, this value should be 1. If it’s 0, change it with:
# sysctl -w net.inet.ip.forwarding=1
net.inet.ip.forwarding: 0 -> 1
See page 313 for further details, including how to ensure that this sysctl is set correctly when the system starts.
• You may be trying to use a non-routable IP address such as those in the range 192.168.x.x. You can’t do that. If you don’t have enough globally visible IP addresses, you’ll need to run some kind of aliasing package, such as NAT. See Chapter 22, page 393, for further details.
• Maybe there is something wrong with routing to your network. This is a difficult one to check, but in the case of the reference network, one possibility is to repeat the traceroute from the machine gw: gw’s external address on the tun0 interface is 139.130.136.133, which is on the ISP’s network. As a result, packets from gw are not affected by a routing problem for network 223.147.37.x. If this proves to be the case, contact your ISP to solve it.
• Maybe there is something wrong with the other end; if everything else fails, you may have to call the admins at example.net even if you have no hard evidence that it’s their problem.
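The non-routable addresses mentioned in the list above are the private ranges set aside by RFC 1918, of which 192.168.x.x is one. The following Python sketch checks an address against those three ranges explicitly; the function name is_rfc1918 is made up for illustration.

```python
import ipaddress

# The three private address blocks defined in RFC 1918.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(addr):
    """Return True if addr is in a private range and so needs NAT to reach the Internet."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918)


print(is_rfc1918("192.168.27.1"))   # True: not routable on the Internet
print(is_rfc1918("223.147.37.81"))  # False
```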
But maybe the data gets one hop further:
$ traceroute hub.freebsd.org
traceroute to hub.freebsd.org (204.216.27.18), 30 hops max, 40 byte packets
1 gw (223.147.37.5) 1.138 ms 0.811 ms 0.800 ms
2 free-gw.example.net (139.130.136.129) 131.913 ms 122.231 ms 134.694 ms
3 * * *
4 * * *
^C
In this case, there is almost certainly a problem at example.net. This would be the correct time to use the telephone.
High packet loss
But maybe data is getting through. Well, some data, anyway. Consider this ping session:
$ ping freefall.FreeBSD.org
PING freefall.FreeBSD.org (216.136.204.21): 56 data bytes
64 bytes from 216.136.204.21: icmp_seq=0 ttl=244 time=496.426 ms
64 bytes from 216.136.204.21: icmp_seq=1 ttl=244 time=491.334 ms
64 bytes from 216.136.204.21: icmp_seq=2 ttl=244 time=479.077 ms
64 bytes from 216.136.204.21: icmp_seq=3 ttl=244 time=473.774 ms
64 bytes from 216.136.204.21: icmp_seq=4 ttl=244 time=733.429 ms
64 bytes from 216.136.204.21: icmp_seq=5 ttl=244 time=644.726 ms
64 bytes from 216.136.204.21: icmp_seq=7 ttl=244 time=490.331 ms
64 bytes from 216.136.204.21: icmp_seq=8 ttl=244 time=839.671 ms
64 bytes from 216.136.204.21: icmp_seq=9 ttl=244 time=773.764 ms
64 bytes from 216.136.204.21: icmp_seq=10 ttl=244 time=553.067 ms
64 bytes from 216.136.204.21: icmp_seq=11 ttl=244 time=454.707 ms
64 bytes from 216.136.204.21: icmp_seq=12 ttl=244 time=472.212 ms
64 bytes from 216.136.204.21: icmp_seq=13 ttl=244 time=448.322 ms
64 bytes from 216.136.204.21: icmp_seq=14 ttl=244 time=441.352 ms
64 bytes from 216.136.204.21: icmp_seq=15 ttl=244 time=455.595 ms
64 bytes from 216.136.204.21: icmp_seq=16 ttl=244 time=460.040 ms
64 bytes from 216.136.204.21: icmp_seq=17 ttl=244 time=476.943 ms
64 bytes from 216.136.204.21: icmp_seq=18 ttl=244 time=514.615 ms
64 bytes from 216.136.204.21: icmp_seq=23 ttl=244 time=538.232 ms
64 bytes from 216.136.204.21: icmp_seq=24 ttl=244 time=444.123 ms
64 bytes from 216.136.204.21: icmp_seq=25 ttl=244 time=449.075 ms
^C
--- 216.136.204.21 ping statistics ---
27 packets transmitted, 21 packets received, 22% packet loss
round-trip min/avg/max/stddev = 441.352/530.039/839.671/113.674 ms
In this case, we have a connection. But look carefully at those sequence numbers. At one point, four packets in a row (sequence 19 to 22) get lost. How high a packet drop rate is still acceptable? 1% or 2% is probably still (barely) acceptable. By the time you get to 10%, though, things look a lot worse. A 10% packet drop rate doesn’t mean that your connection slows down by 10%. For every dropped packet, you have a minimum delay of one second until TCP retries it. If that retried packet gets dropped too, which will happen once in every 10 dropped packets if you have a 10% drop rate, the second retry takes another three seconds. If you’re transmitting packets of 64 bytes over a 33.6 kb/s link, you can normally get about 60 packets through per second. With 10% packet loss, the time to get these packets through is about eight seconds, a throughput loss of 87.5%.
With 20% packet loss, the results are even more dramatic. Now 12 of the 60 packets have to be retried, and 2.4 of them will be retried a second time (for three seconds delay), and 0.48 of them will be retried a third time (six seconds delay). This makes a total of 22 seconds delay, a throughput degradation of nearly 96%.
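The delay figures in the last two paragraphs follow from a simple expected-value calculation. The following Python sketch reproduces it under the same simplifying assumptions as the text: 60 packets, each retry delayed by a fixed interval (one, three and six seconds), and the delays simply added up. It is not a model of real TCP behaviour, and the function name retry_delay is made up for illustration.

```python
def retry_delay(packets, loss, backoff=(1.0, 3.0, 6.0)):
    """Return the expected total retry delay in seconds.

    backoff lists the delay before the first, second, ... retry.
    """
    delay = 0.0
    dropped = float(packets)
    for wait in backoff:
        # Expected number of packets that are lost again and need this retry.
        dropped *= loss
        delay += dropped * wait
    return delay


# 10% loss, counting the first two retries as in the text:
# 6 packets wait 1 s and 0.6 packets wait 3 s.
print(round(retry_delay(60, 0.10, backoff=(1.0, 3.0)), 2))  # 7.8

# 20% loss with three retries: 12 * 1 s + 2.4 * 3 s + 0.48 * 6 s.
print(round(retry_delay(60, 0.20), 2))  # 22.08
```

The results, about eight seconds and about 22 seconds, match the figures above for data that would otherwise take one second to transfer.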
Theoretically, you might think that the degradation would not be as bad for big packets, such as you might have with file transfers with ftp. In fact, the situation is worse then: in most cases the packet drop rate rises sharply with the packet size, and it’s common enough that ftp times out completely before it can transfer a file.
To get a better overview of what’s going on, let’s look at another program, tcpdump.