Number of transmitted multicast packets. Not used by IPv4 at the moment.
Fields related to defragmentation
IPSTATS_MIB_REASMTIMEOUT
Number of packets that failed defragmentation because some of the fragments were not received in time. The value reflects the number of complete packets, not the number of fragments. This field is updated in ip_expire, which is the timer function executed when an IP fragment list is dropped due to a timeout. Note that this counter is not used as defined in the two RFCs mentioned at the beginning of this section.
Number of packets successfully defragmented. This field is updated in ip_frag_reasm.
Fields related to fragmentation
IPSTATS_MIB_FRAGFAILS
Number of failed fragmentation efforts. This field is updated in ip_fragment (and in ipmr_queue_xmit for multicast).
IPSTATS_MIB_FRAGOKS
Number of fragments transmitted. This field is updated in ip_fragment.
IPSTATS_MIB_FRAGCREATES
Number of fragments created. This field is updated in ip_fragment.
The values of these counters are exported in the /proc/net/snmp file.
Each CPU keeps its own accounting information about the packets it processes. Furthermore, it keeps two counters: one for events in interrupt context and the other for events outside interrupt context. Therefore, the ip_statistics array includes two elements per CPU, one for interrupt context and one for noninterrupt context. Not all of the events can happen in both contexts, but to keep things simple and clear, the vector has been defined with double the size; those elements that do not make sense in one of the two contexts are simply not used.
Because some pieces of code can be executed both in interrupt context and outside interrupt context, the kernel provides three different macros to add an event to the IP statistics vector:
#define IP_INC_STATS(field)      SNMP_INC_STATS(ip_statistics, field)
#define IP_INC_STATS_BH(field)   SNMP_INC_STATS_BH(ip_statistics, field)
#define IP_INC_STATS_USER(field) SNMP_INC_STATS_USER(ip_statistics, field)
The first can be used in either context, because it checks internally whether it was called in interrupt context and updates the right element accordingly. The second and the third macros are to be used for events that happened in and outside interrupt context, respectively. The macros IP_INC_STATS, IP_INC_STATS_BH, and IP_INC_STATS_USER are defined in include/net/ip.h, and the three associated SNMP_INC_XXX macros are defined in include/net/snmp.h.
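The layout described above can be pictured with a small user-space model. This is only a sketch under stated assumptions: the names ip_stats, stat_inc, stat_inc_any, and stat_fold, the CPU count, and the three-entry field list are all made up for the illustration, and the in_irq argument stands in for the kernel's own interrupt-context check. The real macros also rely on per-CPU primitives that are not modeled here.

```c
/* Hypothetical sizes and names, chosen only for this illustration. */
#define NR_CPUS_SKETCH 4

enum { MIB_REASMTIMEOUT, MIB_REASMOKS, MIB_FRAGOKS, MIB_MAX };
enum stat_ctx { CTX_BH = 0, CTX_USER = 1 };  /* interrupt vs. noninterrupt */

struct ip_mib_sketch {
    unsigned long mibs[MIB_MAX];
};

/* Two elements per CPU, one per context, as the text describes. */
struct ip_mib_sketch ip_stats[2][NR_CPUS_SKETCH];

/* Counterpart of the _BH and _USER variants: the caller names the context. */
void stat_inc(enum stat_ctx ctx, int cpu, int field)
{
    ip_stats[ctx][cpu].mibs[field]++;
}

/* Counterpart of the generic variant: the context is decided at run time. */
void stat_inc_any(int in_irq, int cpu, int field)
{
    stat_inc(in_irq ? CTX_BH : CTX_USER, cpu, field);
}

/* A reader of the counter wants one number: sum over CPUs and contexts. */
unsigned long stat_fold(int field)
{
    unsigned long sum = 0;

    for (int ctx = 0; ctx < 2; ctx++)
        for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
            sum += ip_stats[ctx][cpu].mibs[field];
    return sum;
}
```

Because each CPU touches only its own pair of elements, increments need no shared lock in this model; the cost is paid by the reader, which has to add the elements up.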
23.4 IP Configuration
The Linux IP protocol can be tuned and configured manually by a system administrator in different ways. This tuning includes both changes to the protocol itself and to device configuration. The four main interfaces are:
ioctl calls made via ifconfig
ifconfig is the older Unix-legacy tool for configuring IP on network devices.
These three protocols can be used to dynamically assign an IP configuration to a host and its interfaces.
The last set of protocols in the preceding list has an interesting twist. They are normally implemented in user space, but Linux also has a simple kernel-space implementation that is useful when used together with the nfsroot boot option. The latter allows the kernel to mount the root directory (/) via NFS. To do that, it needs an IP configuration at boot time before the system is able to initialize the IP configuration from user space (which, by the way, could be stored in a remote partition and not even be available to the system when it mounts the root directory). Via kernel boot options, it is possible to give nfsroot a static configuration, or specify what protocols (yes, more than one can be used concurrently) to use to obtain the configuration. The IP configuration code is in net/ipv4/ipconfig.c, and the one used by nfsroot is in fs/nfs/nfsroot.c. The two files cross-reference variables and functions, but they are actually simple to read. We will not cover them, because network filesystems and user-space clients are outside the scope of this book. Once you know how to read __setup macros (described in Chapter 7), reading the code should become a piece of cake. It is clear and well commented.
The third item in the list, /proc, is covered later in the section "Tuning via /proc Filesystem."
In this section, I will say a bit about the kernel interfaces that support the behavior of the first two items, ifconfig and ip. The purpose here is not to cover the internals of the user-space commands or the associated kernel counterparts that handle configuration requests. It is to show how user space and kernel space communicate, and the kernel functions that are invoked in response to a user-space command.
23.4.1 Main Functions That Manipulate IP Addresses and Configuration
Before reading these descriptions of functions, it would be worthwhile reviewing the key data structures used by the IP layer, introduced in Chapter 19 and described in detail later in this chapter. For instance, a single IP address is represented by an in_ifaddr structure and the complete IPv4 configuration of a device by an in_device structure.
inetdev_init and inetdev_destroy
inetdev_init is invoked when the first IP configuration is applied to a device. It allocates the in_device structure and links it to the associated net_device instance. It also creates a directory in /proc/sys/net/ipv4/conf/ (see the section "Tuning via /proc Filesystem").
The IP configuration can be removed with inetdev_destroy, which simply undoes whatever was done in inetdev_init, plus removes all of the linked in_ifaddr structures. The latter are removed with inet_free_ifa, which also decrements the reference count on the in_device structure with in_dev_put. When the last reference is released, probably with the last call to inet_free_ifa, the in_device instance is freed with in_dev_finish_destroy.
inet_alloc_ifa and inet_free_ifa
These two functions allocate and free, respectively, an in_ifaddr data structure. A new one is allocated when a user adds a new address to an interface. A deletion can be triggered by the removal of a single address, or by the removal of a device's entire IP configuration at once. Both routines use the read-copy update (RCU) mechanism as a means to enforce mutual exclusion.
inet_insert_ifa and inet_del_ifa
inet_insert_ifa adds a new in_ifaddr structure to the list within in_device. It detects duplicates and marks the address as secondary if it finds out that it falls within another address's subnet. Suppose, for instance, that eth0 already had the address 10.0.0.1/24. When a new 10.0.0.2/24 address is added, it will be recognized as secondary with respect to the first. Primary addresses are also used to feed the entropy of the kernel random number generator with net_srandom. More information on primary and secondary addresses can be found in Chapter 30.
inet_del_ifa simply removes an in_ifaddr structure from the associated in_device instance, making sure that, if the address is primary, all of the associated secondary addresses are removed too, unless the administrator has explicitly configured the device via its /proc/sys/net/ipv4/conf/dev_name/promote_secondaries file not to remove secondary addresses. In that case, a secondary address is instead promoted to primary when the associated primary address is removed. Given the in_device instance, this configuration can be accessed with the IN_DEV_PROMOTE_SECONDARIES macro. The inet_del_ifa function accepts an extra input parameter that can be used to tell whether the in_device structure should be freed when the last in_ifaddr instance has been removed. While it is normal to remove the empty in_device structure, sometimes a caller might not do it, such as when it knows it is going to add a new in_ifaddr soon.
In both cases, addition and deletion, successful completion leads to a Netlink broadcast notification with rtmsg_ifa (see the section "Change Notification: rtmsg_ifa") and a notification to the other kernel subsystems via the inetaddr_chain notification chain (see Chapter 4).
inet_set_ifa
This is a wrapper for inet_insert_ifa that creates an in_device structure if none exists for the associated device, and sets the scope of the address to local (RT_SCOPE_HOST) for addresses like 127.x.x.x. Refer to the section "Scope" in Chapter 30 for more details on scopes.
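The duplicate check and the primary/secondary classification performed on insertion can be sketched in user space as follows. Everything here is an assumption made for the illustration: the structure, the flag value, and the function name are hypothetical, addresses are kept in host byte order, and a plain array replaces the kernel's linked list. The rule modeled is the one from the 10.0.0.1/24 example: a new address that shares the netmask and the subnet of an address already on the device is marked secondary.

```c
#include <stddef.h>
#include <stdint.h>

#define IFA_F_SECONDARY_SKETCH 0x01  /* hypothetical flag value */

struct ifaddr_sketch {
    uint32_t addr;   /* IPv4 address, host byte order */
    uint32_t mask;   /* netmask, host byte order */
    unsigned flags;
};

/* Append addr/mask to list[0..*count). Returns 0 on success, or -1 when the
 * address is already configured or the array is full. */
int insert_ifa_sketch(struct ifaddr_sketch *list, size_t *count, size_t cap,
                      uint32_t addr, uint32_t mask)
{
    unsigned flags = 0;

    if (*count >= cap)
        return -1;
    for (size_t i = 0; i < *count; i++) {
        const struct ifaddr_sketch *cur = &list[i];

        /* Same netmask and same subnet as an existing address? */
        if (cur->mask == mask && ((cur->addr ^ addr) & mask) == 0) {
            if (cur->addr == addr)
                return -1;                  /* exact duplicate */
            flags |= IFA_F_SECONDARY_SKETCH;
        }
    }
    list[*count] = (struct ifaddr_sketch){ addr, mask, flags };
    (*count)++;
    return 0;
}
```

With 10.0.0.1/24 already present, inserting 10.0.0.2/24 yields an entry with the secondary flag set, while 192.168.0.1/24 stays primary because it belongs to a different subnet.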
Many other, smaller functions can be used to make the code more readable. Here are a few of them:
inet_select_addr
This function is used to select an IP address among the ones configured on a given device. The function accepts an optional scope as a parameter, which can be used to narrow down the lookup domain. We will see where this function is useful in Chapter 35.
inet_make_mask and inet_mask_len
Given the number of 1s the netmask is composed of, inet_make_mask creates the associated netmask. For example, an input of 24 would generate the netmask with the decimal representation 255.255.255.0.
inet_mask_len is the converse, returning the number of 1s in a decimal netmask. For instance, 255.255.0.0 would return 16.
inet_ifa_match
Given an IP address and a netmask, inet_ifa_match checks whether a given second IP address falls within the same subnet. This function is often used to classify secondary addresses and to check whether a given IP address belongs to one of the locally configured subnets. See, for instance, inet_del_ifa.
for_primary_ifa and for_ifa
These two macros can be used to browse all of the in_ifaddr instances associated with a given in_device structure. for_primary_ifa considers only primary addresses, and for_ifa goes through all of them.
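The arithmetic behind the netmask helpers is short enough to write out. The three functions below are a user-space sketch with made-up names (the _sketch suffix marks them as hypothetical); they work on host-byte-order values for readability, whereas kernel code handles addresses in network byte order.

```c
#include <stdint.h>

/* Prefix length (0..32) to netmask, host byte order. The zero case is
 * handled separately because shifting a 32-bit value by 32 is undefined. */
uint32_t make_mask_sketch(int logmask)
{
    return logmask ? ~((1u << (32 - logmask)) - 1) : 0;
}

/* Netmask to prefix length: count the leading 1 bits. */
int mask_len_sketch(uint32_t mask)
{
    int len = 0;

    while (mask & 0x80000000u) {
        len++;
        mask <<= 1;
    }
    return len;
}

/* Nonzero when addr lies in the subnet defined by ifa_addr and ifa_mask:
 * the two addresses may differ only in bits that the mask clears. */
int ifa_match_sketch(uint32_t addr, uint32_t ifa_addr, uint32_t ifa_mask)
{
    return ((addr ^ ifa_addr) & ifa_mask) == 0;
}
```

For the examples used in the text, make_mask_sketch(24) gives 0xFFFFFF00 (255.255.255.0) and mask_len_sketch(0xFFFF0000) gives 16.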
23.4.2 Change Notification: rtmsg_ifa
Netlink provides the RTMGRP_IPV4_IFADDR multicast group to user-space applications interested in changes to the locally configured IP addresses. The kernel uses the rtmsg_ifa function to notify those applications that registered to the group when any change takes place on the local IP addresses. The function can be called when two types of events occur:
RTM_NEWADDR
A new address has been configured on a device.
RTM_DELADDR
An address has been removed from a device.
The generated message is initialized with inet_fill_ifaddr, the same function used to handle dump requests from user space (with
commands such as ip addr list) The message includes the address being added or removed, and the device associated with it.
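From the user-space side, subscribing to this group takes a Netlink socket bound with RTMGRP_IPV4_IFADDR in nl_groups. The sketch below shows one way to do it with the standard rtnetlink headers; the two function names and the "add/del ifindex address" output format are my own choices for the illustration, not part of any existing tool.

```c
#include <arpa/inet.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open an rtnetlink socket subscribed to IPv4 address changes.
 * Returns the descriptor, or -1 on error. */
int open_ifaddr_listener(void)
{
    struct sockaddr_nl sa;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0)
        return -1;
    memset(&sa, 0, sizeof(sa));
    sa.nl_family = AF_NETLINK;
    sa.nl_groups = RTMGRP_IPV4_IFADDR;
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Format one RTM_NEWADDR or RTM_DELADDR message as "<add|del> <ifindex> <addr>".
 * Returns 0 on success, -1 for other message types or when no IPv4 address
 * attribute is found. */
int format_addr_event(struct nlmsghdr *nh, char *out, size_t outlen)
{
    struct ifaddrmsg *ifa;
    struct rtattr *rta;
    int len;

    if (nh->nlmsg_type != RTM_NEWADDR && nh->nlmsg_type != RTM_DELADDR)
        return -1;
    if (nh->nlmsg_len < NLMSG_LENGTH(sizeof(*ifa)))
        return -1;
    ifa = NLMSG_DATA(nh);
    len = (int)IFA_PAYLOAD(nh);
    for (rta = IFA_RTA(ifa); RTA_OK(rta, len); rta = RTA_NEXT(rta, len)) {
        char addr[INET_ADDRSTRLEN];

        if (rta->rta_type != IFA_LOCAL && rta->rta_type != IFA_ADDRESS)
            continue;
        if (RTA_PAYLOAD(rta) != 4)
            continue;
        if (!inet_ntop(AF_INET, RTA_DATA(rta), addr, sizeof(addr)))
            continue;
        snprintf(out, outlen, "%s %u %s",
                 nh->nlmsg_type == RTM_NEWADDR ? "add" : "del",
                 ifa->ifa_index, addr);
        return 0;
    }
    return -1;
}
```

A listener would call recv on the descriptor returned by open_ifaddr_listener, walk the received buffer with NLMSG_OK and NLMSG_NEXT, and pass each message to format_addr_event.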
So, who is interested in this kind of notification? Routing protocols are a major example. If you are using Zebra, the routing protocols you
23.4.3 inetaddr_chain Notification Chain
The IP subsystem uses the inetaddr_chain notification chain to notify other kernel subsystems about changes to the IP configuration of the local devices. A kernel subsystem can register and unregister itself with inetaddr_chain by means of the register_inetaddr_notifier and unregister_inetaddr_notifier functions. Here are two examples of users for this notification chain:
Routing
See the section "External Events" in Chapter 32.
Netfilter masquerading
When a local IP address is used by Netfilter's masquerading feature, and that address disappears, all of the connections that are using that address must be dropped (see net/ipv4/netfilter/ipt_MASQUERADE.c).
The NETDEV_DOWN and NETDEV_UP events are sent when an IP address is removed from a local device and when one is added to it, respectively. Such notifications are generated by the inet_del_ifa and inet_insert_ifa routines introduced in the section "Main Functions That Manipulate IP Addresses and Configuration."
23.4.4 IP Configuration via ip
Traditionally, Unix system administrators configured interfaces and routes manually using ifconfig, route, and other commands. Currently Linux provides an umbrella ip command to handle IP configuration, with a number of subcommands.
In this section we will see how IPROUTE2 handles the main addressing operations, such as adding and removing an address. Once you are familiar with these operations, you can easily understand and read through the code for the others.
Figure 23-2 shows the files and the main functions of the IPROUTE2 package that are involved with IP address configuration activities. The labels on the lines are ip keywords, and the nodes show the function invoked and the file the latter belongs to. For instance, the command ip address add would be handled by ipaddr_modify.
Figure 23-2 IPROUTE2 files and functions for address configuration
Table 23-1 shows the association between the operation specified with a command-line keyword (e.g., add) and the handler run by the kernel. For instance, when the kernel receives a request for an RTM_NEWADDR operation, it knows it is associated with an add command and therefore invokes inet_rtm_newaddr. Some kernel operations are overloaded, and for these, the kernel needs extra flags to figure out exactly what the user-space command is asking for. See Chapter 36 for an example. This association is defined in net/ipv4/devinet.c in the inet_rtnetlink_table structure. For an introduction to RTNetlink, refer to Chapter 3.
Table 23-1 ip route commands and associated kernel operations
The list and flush commands need some explanation. list is simply a request to the kernel to dump information, for instance, about a given device, and flush is a request to clear the entire IP configuration on the device.
The two functions inet_rtm_newaddr and inet_rtm_deladdr are wrappers for the generic functions inet_insert_ifa and inet_del_ifa that we introduced in the section "Main Functions That Manipulate IP Addresses and Configuration." All the wrappers do is translate the request that comes from user space into an input understandable by the two more-general functions. They also filter bad requests that are associated with nonexistent devices.
23.4.5 IP Configuration via ifconfig
ifconfig is implemented in the ifconfig.c user-space file (part of the net-tools package). Unlike ip, ifconfig uses ioctl calls to interface to the kernel. However, a set of functions are used by both the ip and ifconfig handlers. In Chapter 3, we had an overview of how ioctl calls are handled by the kernel. Here all we need to know is that the requests related to IPv4 configuration are handled by the inet_ioctl function in net/ipv4/af_inet.c. Based on the ioctl code you can see what helper functions inet_ioctl uses to process the user-space commands (e.g., devinet_ioctl).
23.5 IP-over-IP
IP-over-IP, also called IP tunneling (or IPIP), consists of transmitting IP packets inside other IP packets. This protocol is useful in some very interesting cases, including in a Virtual Private Network (VPN). Of course, nothing comes for free; you can well imagine the extra weight of the doubling of the protocol: because each IP packet has two IP headers, the overhead becomes huge for small packets. There are subtle complexities in implementation, too. For instance, what is the relationship between the IP options of the two headers?
If you consider just the IPv4 and IPv6 protocols, you already have four possible combinations of tunneling. But not all of these combinations are likely to be used.
To make things more complex (I should actually say "flexible"), keep in mind that there is no limit to the number of recursions in tunneling.[*]
[*] IPv6 defines the "tunnel encapsulation limit" as the maximum number of nested encapsulations. See section 6.6 of RFC 2473.
The different tunnel interfaces that can be created in Linux are not covered in this book. However, given the background on the IP implementation in this part of the book, you can study the code in net/ipv4/ipip.c and include/net/ipip.h to derive the implementation details.
23.6 IPv4: What's Wrong with It?
We saw in the section "IP Protocol: The Big Picture" in Chapter 18 what the main tasks of the IP protocol are. IPv4 was designed almost 25 years ago (in 1981), and given the speed with which the Internet and network services have evolved since then, the protocol is showing its age. Because IPv4 was not originally designed with today's big network topologies and commercial uses in mind, it has shown several limitations over the years. These have been only partially solved, sometimes with special extensions to the protocol (e.g., classless interdomain routing, DiffServ Code Point (DSCP) replacement of ToS, congestion notification, etc.), and other times by defining specialized external protocols such as IPsec.
Thanks to the experience gained with IPv4, the new IPv6 version of the protocol has been designed to address the known shortcomings of IPv4, taking into consideration such aspects as:
When analyzing IPv4 packet transmission, we saw that fragmentation and options processing were the two most expensive tasks. It should not come as a surprise, therefore, that IPv6 addressed both points:
Fragmentation has been limited in IPv6: an IP packet can be fragmented only at the source.
The presence of IP options may sometimes inhibit the fast processing path: this is true for both software routers like Linux on a PC and commercial hardware IP implementations. For a commercial implementation, it could mean that IP packets without options can be forwarded in hardware at much higher speed, and the ones with options have to be handled in software. The way options are handled by IPv6 is also different: IPv6 uses the concept of extensions, whose main advantage is that not all of the routers have to process them.
One other big limitation of IPv4 is the 32-bit size of its addresses and the limited hierarchy they come with. Network Address Translation (NAT) is only a short-term solution that partially solves the problem. NAT comes with some limitations:
Each protocol has to be treated specially, so some protocols don't always work when passing through a NAT router (e.g., H323).
The NAT router becomes a single point of failure. Because it needs to keep state information for all the connections passing through it, designing a network with redundancy or security in mind is not easy.
Its tasks are complex and computationally heavy when there is a need to support those complex protocols that have not been designed with NAT support in mind (these are considered to be "not NAT-friendly"[*]).
[*] You can read RFC 3235 if you would like to see what is considered a NAT-friendly protocol or
The limited number of addresses in IPv4 also contributes (because of its limited hierarchy) to the creation of huge routing tables. A core router can have up to hundreds of thousands of routes. This trend is bad, for a couple of reasons:
The routes require lots of memory.
Lookups are slower.
Classless interdomain routing helps in reducing the size of the routing tables, but cannot solve the limited address space problem of IPv4.
In IPv6, the address has been made four times bigger in size (128 bits instead of 32), which does not mean four times as many addresses, but rather 2^96 times as many! This potentially brings systems outside the NAT router and makes them full-fledged citizens of the Internet, with implications for new types of applications.
IPv4 was not designed with security in mind. Because of this, several approaches of different granularity have been developed: application end-to-end solutions such as Secure Sockets Layer (SSL), host end-to-end solutions such as IPsec, etc. Each has its own pros and cons. SSL requires the applications to be written to use that security layer (which sits on top of TCP), whereas IPsec (which is what most people identify VPNs with) does not: IPsec sits at the L3 layer and therefore is transparent to applications. IPsec can be used by both IPv4 and IPv6, but it fits better with IPv6.
With IPv6, the neighboring system has changed as well. It is called neighbor discovery, and represents the counterpart to ARP for IPv4.
The QoS component is also expanded.
With IPv4 networks, it is already possible to carry out automatic host configuration, thanks to protocols such as DHCP; however, some constraints make that solution less Plug and Play (PnP) than it should be. This issue has been solved by IPv6 too, with the so-called autoconfiguration feature.
23.7 Tuning via /proc Filesystem
The /proc filesystem was introduced in Chapter 3; it provides a simple interface for users to view and change kernel parameters and is the model for the newer sysfs directory. It contains a huge number of files (or rather, virtual data structures that look to the user just like files) that map to variables and functions inside the kernel and that can be used to tune the behavior of the networking component of the kernel as well.
The files used for IPv4 tuning are located mainly in two directories:
/proc/sys/net/ipv4/
Table 23-2 shows some of the files in this directory that are used by IPv4. The kernel variables associated with those files are declared in net/ipv4/sysctl_net_ipv4.c and are statically registered at boot time (see Chapter 3). Note that the directory contains many more files than the ones in Table 23-2. Most of the extra files are associated with L4 protocols, especially TCP.
/proc/sys/net/ipv4/conf/
This directory contains a subdirectory for each network device recognized by the kernel, plus other special directories (see Figure 36-4 in Chapter 36). Those subdirectories include configuration parameters that are device specific; among them are accept_redirects, send_redirects, accept_source_route, and forwarding. These will be covered in Chapter 36, with the exception of promote_secondaries, which is described in the section "Main Functions That Manipulate IP Addresses and Configuration."
Table 23-2 IPv4-related files in /proc/sys/net/ipv4
a. These values are updated by tcp_init at boot time based on the amount of memory available in the system. Even if they are updated by TCP, they are used by any L4 protocol that uses ports.
b. This value is updated by inet_initpeers at boot time based on the amount of memory available in the system.
The first three elements in Table 23-2 are members of two data structures of type ipv4_devconf and ipv4_config, located, respectively, in include/linux/inetdevice.h and include/net/ip.h and described later in this chapter. The other elements of those structures are either exported elsewhere or not exported at all (we will cover them in the associated chapters). The meaning of the files and kernel variables
When 0, path MTU discovery is enabled.
Maximum number of inet_peer structures that can be allocated.
inet_peer_gc_mintime
Amount of time between regular garbage collection passes. Since the amount of memory usable by the inet_peer structures is limited (by inet_peer_threshold), there is a regular timer that expires unused entries based on these two variables. inet_peer_gc_maxtime is used when the system is not heavily loaded, and inet_peer_gc_mintime is used in the opposite case. Thus, the more entries there are, the more frequently the timer expires.
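One way to picture how the two bounds interact is a linear interpolation driven by how full the pool is. The function below is only a model built from the description above, under the assumption that the interval shrinks in proportion to the number of allocated entries; its name and parameters are hypothetical, and the exact expression used by the kernel's timer code may differ.

```c
/* Interval until the next garbage collection pass, in the same time unit as
 * mintime and maxtime. An empty pool yields maxtime; a pool that has reached
 * threshold yields mintime; values in between are interpolated linearly.
 * The caller is assumed to pass maxtime >= mintime. */
unsigned long gc_interval_sketch(unsigned long mintime, unsigned long maxtime,
                                 unsigned long total, unsigned long threshold)
{
    if (threshold == 0 || total >= threshold)
        return mintime;
    return maxtime - (maxtime - mintime) * total / threshold;
}
```

With a minimum of 10, a maximum of 120, and a threshold of 100 entries, a half-full pool gives an interval of 65.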
23.8 Data Structures Featured in This Part of the Book
The section "Main IPv4 Data Structures" in Chapter 19 gave a brief overview of the main data structures. This section has a detailed description of each data structure type. Figure 23-3 shows the file that defines each data structure.
23.8.1 iphdr Structure
The meaning of its fields has already been covered in the section "IP Header" in Chapter 18.
23.8.2 ip_options Structure
This structure represents the options for a packet that needs to be transmitted or forwarded. The options are stored in this structure because it is easier to read than the corresponding portion of the IP header itself.
Figure 23-3 Distribution of data structures in kernel files
Let's go field by field. They should be fairly simple to understand if you have read the section "IP Options" in Chapter 18. After this description, you will be able to understand more easily how the parsing is done and how its results are used by the IP layer subsystems, such as the code that processes incoming IP packets. Some of the bit fields are grouped together into an unsigned char; the declarations of these end with :1.
unsigned char optlen
Length of the set of options. As explained in Chapter 18, this is limited to a maximum of 40 bytes by the definition of the IP header.
unsigned char is_changed:1
Set if the IP header has been modified (such as an IP address or a timestamp). This is useful to know because if the packet has to be forwarded, this field indicates that the IP checksum has to be recomputed.
__u32 faddr
unsigned char is_strictroute:1
unsigned char srr
unsigned char srr_is_hit:1
faddr is meaningful only for transmitted packets (that is, those generated locally) and only for those using source routing. The value of faddr is set to the first of the IP addresses provided for source routing. See the section "Option: Strict and Loose Source Routing" in Chapter 19.
unsigned char rr
When rr is nonzero, Record Route is one of the IP options and the value of this field represents the offset inside the IP header where the option starts. This field is used together with rr_needaddr.
unsigned char rr_needaddr:1
When rr_needaddr is true, Record Route is one of the IP options and there is still room in the header for another route; therefore, the current node should copy the IP address of the outgoing interface into the IP header at the offset specified by rr.
unsigned char ts
When ts is nonzero, Timestamp is one of the IP options and this field represents the offset inside the IP header where the option starts. This field is used together with ts_needaddr and ts_needtime.
unsigned char is_setbyuser:1
This field makes sense only for transmitted packets and is set when the options were passed from user space with the system call setsockopt. Currently, however, it is never used.
unsigned char is_data:1
unsigned char __data[0]
These fields are used in two situations: when the local node transmits a locally generated packet, and when the local node replies to an ICMP echo request. In these cases, is_data is true and __data points to an area containing the options to append to the IP header. The [0] definition is a common convention for a placeholder that marks where variable-length data begins, without reserving any space in the structure itself.
When forwarding a packet, the options are in the associated skb buffer (see the ip_options_get function in the net/ipv4/ip_options.c file).
unsigned char ts_needtime:1
When this option is true, Timestamp is one of the IP options and there is still room in the header for another timestamp; therefore, the current node should add the time of transmission into the IP header at the offset specified by ts.
unsigned char ts_needaddr:1
Used with ts and ts_needtime to indicate that the IP address of the egress device should also be copied into the IP header.
unsigned char router_alert
When this option is true, Router Alert is one of the IP options.
unsigned char __pad1, __pad2
Because memory accesses are faster when the location is aligned to a 32-bit boundary, the Linux kernel data structures are often padded out with unused fields called __padn in order to make their sizes a multiple of 32 bits. This is the only purpose of __pad1 and __pad2; they are not used otherwise.
The flags srr, rr, and ts also are useful when parsing the options in order to detect the ones that are present more than once, which is illegal (see the section "Option Parsing" in Chapter 19).
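The way offset fields double as "already seen" markers can be shown with a simplified scanner. This is a user-space sketch, not the kernel's parser: the structure and function names are hypothetical, only the duplicate check and basic length validation are modeled, and offsets are counted from the start of a 20-byte IP header so that a valid offset is never zero. The option type values are the ones assigned by RFC 791.

```c
#include <stddef.h>
#include <stdint.h>

/* Option type values from the IPv4 specification (RFC 791). */
#define OPT_END       0
#define OPT_NOOP      1
#define OPT_RR        7
#define OPT_TIMESTAMP 68
#define OPT_LSRR      131
#define OPT_SSRR      137

#define IP_HDR_MIN    20  /* options start right after the fixed header */

struct opt_offsets {
    uint8_t srr, rr, ts;  /* 0 means "option not present" */
};

/* Walk the options area and record where source routing, Record Route and
 * Timestamp start. Returns 0 on success, -1 on a malformed or repeated
 * option. */
int scan_options_sketch(const uint8_t *opts, size_t len, struct opt_offsets *o)
{
    size_t i = 0;

    o->srr = o->rr = o->ts = 0;
    if (len > 40)
        return -1;                 /* options cannot exceed 40 bytes */
    while (i < len) {
        uint8_t type = opts[i];
        uint8_t *slot = NULL;
        size_t optlen;

        if (type == OPT_END)
            break;
        if (type == OPT_NOOP) {    /* single-byte option, no length field */
            i++;
            continue;
        }
        if (i + 1 >= len)
            return -1;
        optlen = opts[i + 1];
        if (optlen < 2 || optlen > len - i)
            return -1;
        switch (type) {
        case OPT_LSRR:
        case OPT_SSRR:
            slot = &o->srr;
            break;
        case OPT_RR:
            slot = &o->rr;
            break;
        case OPT_TIMESTAMP:
            slot = &o->ts;
            break;
        }
        if (slot) {
            if (*slot)
                return -1;         /* same option present more than once */
            *slot = (uint8_t)(IP_HDR_MIN + i);
        }
        i += optlen;
    }
    return 0;
}
```

Because loose and strict source routing share the srr slot in this sketch, a packet carrying both is rejected in the same way as one carrying either of them twice.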
Here is the description of the fields of the ipq structure. For the sake of simplicity, not all fields are shown in Figure 22-1 in Chapter 22.
struct ipq *next
When the fragments are put into the ipq_hash hash table, conflicting elements (elements with the same hash value) are linked together with this field. Note that this field does not indicate the order of fragments within the packet; it is used simply as a standard way to organize the hash table. The order of fragments within the packet is controlled by the fragments field (see Figure 22-1 in Chapter 22).
struct ipq **pprev
All of the ipq structures are kept sorted in a global list, ipq_lru_list, based on a least-recently-used criterion. This list is useful when performing garbage collection. This field is used to link the ipq structure to such a list.
u32 user
The reason why an IP packet is to be defragmented, which indirectly says what kernel subsystem asked for the defragmentation. The list of allowed values for IP_DEFRAG_XXX is in include/net/ip.h. The most common one is IP_DEFRAG_LOCAL_DELIVER, which is used when defragmenting ingress packets that are to be delivered locally.
The first of the fragments (the one with offset=0) has been received. The first fragment is the only one carrying all of the options that were in the original IP packet.
struct sk_buff *fragments
List of fragments received so far.
struct timer_list timer
Chapter 18 explained why IP fragments cannot stay forever in memory and should be removed after some time if defragmentation is not possible. This field is the timer that takes care of that.
int iif
ID of the device from which the last fragment was received. When a list of fragments expires, this field is used to decide which device to use to transmit the FRAGMENTATION REASSEMBLY TIMEOUT ICMP message (see ip_expire in the net/ipv4/ip_fragment.c file).
struct timeval stamp
Time when the last fragment was received (see ip_frag_queue in net/ipv4/ip_fragment.c).
The ipq_hash table is protected by ipfrag_lock, which can be taken in either shared (read-only) or exclusive (read-write) mode. Do not confuse this lock with the one embedded in each ipq element.
23.8.5 inet_peer Structure
struct inet_peer *avl_left
struct inet_peer *avl_right
Left and right pointers to the two subtrees.
__u16 avl_height
Height of the AVL tree.
struct inet_peer *unused_next
struct inet_peer **unused_prevp
Used to link the node into a list that contains elements that expired. unused_prevp is used to check whether the node is in that list.
A node can be put into that list and then taken back out of it several times without ever being removed completely. See the section "Garbage Collection."
unsigned long dtime
Time when this element was added to the unused list inet_peer_unused_head via inet_putpeer.
unsigned long tcp_ts_stamp
Used by TCP to manage timestamps.
The in_device structure stores all of the IPv4-related configuration for a network device, such as changes made by a user with the ifconfig or ip command. This structure is linked to the net_device structure via net_device->ip_ptr and can be retrieved with in_dev_get and __in_dev_get. The difference between those two functions is that the first one takes care of all of the necessary locking, and the second one assumes the caller has taken care of it already.
Since in_dev_get internally increases a reference count on the in_device structure when it succeeds (i.e., when a device is configured to support IPv4), its caller is supposed to decrement the reference count with in_dev_put when it is done with the structure.
The structure is allocated and linked to the device with inetdev_init, which is called when the first IPv4 address is configured on the device Here are the meanings of its fields:
struct net_device *dev
Pointer back to the associated net_device structure.
atomic_t refcnt
Reference count. The structure cannot be freed until this field is 0.
int dead
This field is set to mark the device as dead. This is useful to detect those cases where the entry cannot be destroyed because it has a nonzero reference count, but a destroy action has been initiated. The two most common events that trigger the removal of an in_device structure are:
Unregistration of the device (see Chapter 8)
Removal of the last configured IP address from the device (see inet_del_ifa in net/ipv4/devinet.c)
struct in_ifaddr *ifa_list
List of IPv4 addresses configured on the device. The in_ifaddr instances are kept sorted by scope (bigger scope first), and elements with the same scope are kept sorted by address type (primary first). The in_ifaddr data structure is further described in the section "in_ifaddr Structure."
struct ipv4_devconf cnf
See the section "ipv4_devconf Structure."
struct rcu_head rcu_head
Used by the RCU mechanism to enforce mutual exclusion. It accomplishes the same job as a lock.
The rest of the fields are used by the multicast code. For instance, mc_list stores the device's multicast configuration and it is the multicast counterpart of ifa_list. mr_v1_seen and mr_v2_seen are timestamps used by the IGMP protocol to keep track of the reception of versions 1 and 2 IGMP packets.
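The get/put discipline around refcnt and the role of the dead flag can be illustrated with a minimal user-space sketch. Every name below (demo_in_device, demo_dev_get, demo_dev_put, demo_dev_destroy) is hypothetical and is not the kernel's code; the sketch assumes C11 atomics in place of atomic_t and leaves out all locking and RCU handling.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for in_device: only the fields discussed above. */
struct demo_in_device {
    atomic_int refcnt;  /* the structure may be released only at 0 */
    bool dead;          /* set once a destroy action has been initiated */
    bool freed;         /* demo only: records that the release happened */
};

static void demo_dev_init(struct demo_in_device *d)
{
    atomic_init(&d->refcnt, 1);  /* reference held by the owning device */
    d->dead = false;
    d->freed = false;
}

/* Take a reference. Returns NULL when there is no structure to return. */
static struct demo_in_device *demo_dev_get(struct demo_in_device *d)
{
    if (d == NULL)
        return NULL;
    atomic_fetch_add(&d->refcnt, 1);
    return d;
}

/* Drop a reference. The caller that drops the last one releases the
 * structure; in this demo the release is only recorded in 'freed'. */
static void demo_dev_put(struct demo_in_device *d)
{
    if (atomic_fetch_sub(&d->refcnt, 1) == 1) {
        assert(d->dead);  /* demo invariant: only dead entries are freed */
        d->freed = true;
    }
}

/* Initiate destruction: mark the entry dead and drop the owner's
 * reference. Outstanding users keep the structure alive until they
 * call demo_dev_put. */
static void demo_dev_destroy(struct demo_in_device *d)
{
    d->dead = true;
    demo_dev_put(d);
}
```

In this sketch, a caller that obtained the structure with demo_dev_get before demo_dev_destroy ran still sees valid memory with dead set, and the release happens only on its matching demo_dev_put.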
23.8.8 in_ifaddr Structure
When configuring an IPv4 address on an interface, the kernel creates an in_ifaddr structure that includes the 4-byte address along with several other fields. Here are their meanings:
struct in_ifaddr *ifa_next
Pointer to the next element in the list. The list contains all of the addresses configured on the device.
struct in_device *ifa_dev
Pointer back to the associated in_device structure.
u32 ifa_local
u32 ifa_address
The values of these two fields depend on whether the address is assigned to a tunnel interface. If so, ifa_local and ifa_address are the local and remote addresses of the tunnel, respectively. If not, both contain the address of the local interface.
u32 ifa_mask
unsigned char ifa_prefixlen
ifa_mask is the netmask associated with the address. ifa_prefixlen is the number of 1s that compose the netmask. Since they are different ways of representing the same information, one of the two is normally computed from the other. This is done, for instance, by the ip and ifconfig user-space configuration tools described in the section "IP Configuration." ip passes the kernel ifa_prefixlen and lets the latter compute ifa_mask, whereas ifconfig does the opposite. The kernel provides some functions to convert a netmask into a prefix length, and vice versa.
u32 ifa_broadcast
Broadcast address
u32 ifa_anycast
Anycast address
unsigned char ifa_scope
Scope of the address. The default is RT_SCOPE_UNIVERSE (which corresponds to the value 0) and the field is usually set to that value by ifconfig/ip, although a different value can be chosen. The main exception is an address in the range 127.x.x.x, which is given the RT_SCOPE_HOST scope. See Chapter 30 for more details.
unsigned char ifa_flags
The possible IFA_F_XXX bit flags are listed in include/linux/rtnetlink.h. Here is the one used by IPv4:
A string used mostly for backward compatibility with 2.0.x kernels that allowed aliased interfaces with names such as eth0:1.
struct rcu_head rcu_head
Used by the RCU mechanism to enforce mutual exclusion. It accomplishes the same job as a lock.
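The relationship between ifa_mask and ifa_prefixlen described earlier in this section can be made concrete with a small user-space sketch. The two helpers below (demo_prefix_to_mask and demo_mask_to_prefix) are hypothetical names, not the kernel's conversion functions, and they work on values in host byte order for readability; a real caller would have to convert to and from network byte order where needed.

```c
#include <assert.h>
#include <stdint.h>

/* Prefix length (0..32) -> netmask, host byte order. */
static uint32_t demo_prefix_to_mask(unsigned int prefixlen)
{
    assert(prefixlen <= 32);
    /* Shifting a 32-bit value by 32 bits is undefined in C, so a
     * prefix length of 0 is handled separately. */
    return prefixlen == 0 ? 0 : ~(uint32_t)0 << (32 - prefixlen);
}

/* Netmask, host byte order -> number of leading 1 bits. */
static unsigned int demo_mask_to_prefix(uint32_t mask)
{
    unsigned int len = 0;

    while (mask & 0x80000000u) {
        len++;
        mask <<= 1;
    }
    return len;
}
```

For a contiguous netmask the two helpers are inverses of each other, which is why a configuration tool only needs to pass one of the two representations.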
23.8.9 ipv4_devconf Structure
The ipv4_devconf data structure, whose fields are exported via /proc in /proc/sys/net/ipv4/conf/, is used to tune the behavior of a network device. There is an instance for each device, plus one that stores the default values (ipv4_devconf_dflt). The meanings of its fields are covered in Chapters 29 and 36, with the exception of promote_secondaries, which is described in the section "Main Functions That Manipulate IP Addresses and Configuration."
23.8.10 ipv4_config Structure
While ipv4_devconf structures are used to store per-device configuration, ipv4_config stores configuration that applies to the host.
Here is a brief description of its fields:
int log_martians
This parameter is also present in the ipv4_devconf structure. It is used to decide whether to print warning messages to the console when specific errors occur. Its value is not checked directly, but via the macro IN_DEV_LOG_MARTIANS, which gives higher priority to the per-device instance.
[*] IPv6 defines its own version of cork in include/linux/ipv6.h.
Here is a brief description of its fields:
unsigned int flags
Currently only one flag used by IPv4 can be set: IPCORK_OPT. When this flag is set, it means there are options in opt.
unsigned int fragsize
Size of the data fragments generated. This includes both payload and L3 header and is normally the PMTU.
struct ip_options *opt
struct page *page
Pointer to the memory page. On i386, the page size is 4 KB. To find the size of a page on any given architecture xxx, look for PAGE_SIZE in include/asm-xxx/page.h.
__u16 page_offset
Offset, relative to the beginning of the page, where the fragment starts.
__u16 size
Size of the fragment.
23.9 Functions and Variables Featured in This Part of the Book
Table 23-3 summarizes the main functions, variables, and data structures introduced or referenced in the chapters of this book covering the IPv4 protocol.
Table 23-3 Functions, variables, and data structures in the IPv4 subsystem
ip_init Initializes the IPv4 protocol. See the section "IP Options" in Chapter 19.
ip_rcv Processes ingress IP packets. See the section "Processing Input IP Packets" in Chapter 19.
Deliver an ingress IP packet to the local host. See the section "Local Delivery" in Chapter 20.
ipfrag_init Initializes the IP Fragmentation/Defragmentation subsystem.
inet_initpeers Initializes the IP peer subsystem.
secure_ip_id
ip_call_ra_chain Hands ingress IP packets that carry the Router Alert option to the interested local Raw sockets.
See the section "ip_forward function" in Chapter 20.
Add, remove, and manipulate the IP addresses configured on the local devices. See the section "Main functions that manipulate IP addresses and configuration."
for_primary_ifa
for_ifa
Browse the IP addresses configured on a network device.
rtmsg_ifa Generates notifications about changes to the IP address configuration of local devices. See the section "Change notification: rtmsg_ifa."
Variables
ipv4_devconf
ipv4_devconf_dflt
Store a set of parameters that can be tuned on a per-device basis via the /proc filesystem. See the section "Tuning via /proc filesystem."
ip_frag_mem Amount of memory held by ingress IP fragments. See the section "Garbage Collection" in Chapter
/proc filename Associated kernel variable
peer_pool_lock Lock used for the AVL tree where inet_peer structures are inserted.
inet_peer_unused_lock Lock used for the list where unused inet_peer structures are inserted.
ip_statistics Stores statistics about IP traffic. See the section "IP Statistics."
23.10 Files and Directories Featured in This Part of the Book
The net/ipv4 directory contains more files than the ones listed in Figure 23-4, but they are covered in other chapters, including the chapters comprising Parts VI and VII.
Figure 23-4 Files and directories featured in this part of the book
Chapter 24 Layer Four Protocol and Raw IP Handling
This chapter describes the interface between L3 and L4 protocols. The only L3 protocol considered here is IP. The L4 protocols include the familiar TCP, UDP, and ICMP, along with several other ones. The L4 protocols are not covered in this book for reasons of space and complexity. However, this chapter explains what happens when applications handle their own L4 (and sometimes L3) processing through raw IP.
In particular, this chapter explains:
How L4 protocols register with the kernel and tell the kernel what kind of traffic they are interested in
How ingress packets are passed to the correct L4 protocol handler
How applications tell the kernel to let the application process headers
We saw in Chapter 21 the functions that L4 protocols use to transmit an IP datagram. Since this book focuses on IP, this chapter covers only those L4 protocols that sit on top of IP. The chapter describes the IPv4 interface and then briefly shows where IPv6 differs.
Version 3: 3376 (2002)
Protocol Independent Multicast, version 1 (PIMv1) and version 2 (PIMv2) 2362(1998)
IPsec suite: IP Authentication Header Protocol (AH), IP Encapsulating Security Payload Protocol (ESP), IP Payload Compression Protocol (IPcomp)
AH: 2402(1998)
ESP: 2406(1998)
IPcomp: 3173(2001)
Other protocols are available for the Linux kernel but are either implemented in user space (routing protocols are an example) or are available as kernel patches because they are not yet integrated into the core kernel.
Figure 24-1 shows how the L4 protocols rest on L3 protocols. The three main protocols (ICMP, UDP, and TCP), as well as the IPsec suite, have IPv6 counterparts. There is no IGMPv6 in Figure 24-1 because its functionality is implemented as part of ICMPv6.
Figure 24-1 L4 protocols on top of IPv4 and IPv6 that are implemented in the Linux kernel
Note that the last four items in Table 24-2 are tunneling protocols. Their IDs identify an L3 protocol. For example, the IPIP protocol is used to transport IPv4 datagrams inside IPv4 datagrams. Note that the value assigned to the protocol field of the IPv4 header when it encapsulates an IP datagram has nothing to do with the value used to initialize the protocol field of an Ethernet header when the Ethernet payload is an IP datagram. Even though the two fields refer to the same protocol (IPv4), they belong to two different domains: one is an L3 protocol identifier, whereas the other is an L4 protocol identifier.
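To make the two identifier domains concrete, the following sketch pairs the value an Ethernet header uses when it carries IPv4 with the value an outer IPv4 header uses when it carries an encapsulated IPv4 datagram (IPIP). The structure and function names are hypothetical; only ETHERTYPE_IP and IPPROTO_IPIP come from the standard user-space headers, and the sketch assumes a system where <net/ethernet.h> provides ETHERTYPE_IP.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <net/ethernet.h>  /* ETHERTYPE_IP: Ethernet payload is IPv4 */
#include <netinet/in.h>    /* IPPROTO_IPIP: IPv4 payload is IPv4 */

/* Hypothetical pair of identifiers, one per domain. */
struct demo_proto_ids {
    uint16_t ether_type;   /* protocol field of the Ethernet header */
    uint8_t  ip_protocol;  /* protocol field of the outer IPv4 header */
};

/* Both fields below say "the payload is IPv4", but each one uses the
 * numbering of its own domain, so the numeric values differ. */
static struct demo_proto_ids demo_ids_for_encapsulated_ipv4(void)
{
    struct demo_proto_ids ids = { ETHERTYPE_IP, IPPROTO_IPIP };
    return ids;
}
```

Comparing the two members of the returned structure shows that the same payload protocol is identified by different numbers depending on which header carries the identifier.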
24.2 L4 Protocol Registration
The L4 protocols that rest on IPv4 are defined by net_protocol data structures, defined in include/net/protocol.h, which consist of the following three fields:
int (*handler)(struct sk_buff *skb)
Function registered by the protocol as the handler for incoming packets. This is discussed further in the section "L3 to L4 Delivery: ip_local_deliver_finish." It is possible to have protocols that share the same handler for both IPv4 and IPv6 (e.g., SCTP).
void (*err_handler)(struct sk_buff *skb, u32 info)
Function used by the ICMP protocol handler to inform the L4 protocol about the reception of an ICMP UNREACHABLE message. We will see in Chapter 35 when a Linux system generates ICMP UNREACHABLE messages, and we will see in Chapter 25 how the ICMP protocol uses err_handler.
int no_policy
This field is consulted at certain key points in the network stack and is used to exempt protocols from IPsec policy checks: 1 means that there is no need to check the IPsec policies for the protocol. Do not confuse the no_policy field of the net_protocol structure with the field bearing the same name in the ipv4_devconf structure: the former applies to a protocol; the latter applies to a device. See the sections "L3 to L4 Delivery: ip_local_deliver_finish" and "IPsec" for how no_policy is used.
The include/linux/in.h file contains a list of L4 protocols defined as IPPROTO_XXX symbols. (For a more complete list, see the /etc/protocols file, or RFC 1700 and its successor RFCs.) The maximum value for an L4 protocol identifier is 2^8 - 1, or 255, because the field in the IP header allocated to specify the L4 protocol is 8 bits. The highest number, 255, is reserved for Raw IP, IPPROTO_RAW.
Not all of the protocols defined in the list of symbols are handled at the kernel layer; some of them (notably Resource Reservation Protocol, or RSVP, and the various routing protocols) are usually handled in user space. This is, for example, why RSVP and routing protocols like OSPF are not included in the list of L4 protocols supported by the kernel that is in the previous section.
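The 8-bit limit on L4 protocol identifiers can be checked from user space with a short sketch. The helper demo_is_valid_l4_proto is a hypothetical name introduced only for illustration; the IPPROTO_XXX symbols are taken from the user-space header <netinet/in.h>, which mirrors the kernel list mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <netinet/in.h>  /* IPPROTO_ICMP, IPPROTO_TCP, IPPROTO_UDP, IPPROTO_RAW */

/* An L4 protocol identifier must fit in the 8-bit protocol field of the
 * IPv4 header, so valid values range from 0 to UINT8_MAX (2^8 - 1). */
static bool demo_is_valid_l4_proto(int proto)
{
    return proto >= 0 && proto <= UINT8_MAX;
}
```

Under this check IPPROTO_RAW, the highest identifier, is still valid, while any value above it cannot be represented in the header field at all.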
24.2.1 Registration: inet_add_protocol and inet_del_protocol
currently it is a simple flat array with one item for each of the possible 256 protocols. The protocol number from /etc/protocols is the slot in the table where the protocol is inserted. If you'd like to see how the table was handled as a hash table in the 2.4 kernel, look in the 2.4 sources at the ip_run_ipprot function. Figure 24-2 shows the numbers and initials of the most common protocols; for instance, ICMP is protocol 1 and occupies slot 1 in the inet_protos table.
Figure 24-2 IPv4 protocol table
Concurrent accesses to the inet_protos table are managed in this way:
Read-write accesses (i.e., inet_add_protocol and inet_del_protocol) are serialized with the inet_proto_lock spin lock.
Read-only accesses (i.e., ip_local_deliver_finish; see the next section) are protected with rcu_read_lock/rcu_read_unlock.
inet_del_protocol, which may remove an entry of the table currently held by an RCU reader, calls synchronize_net to wait for all the currently executing RCU readers to complete their critical section before returning. There is another hash table used by protocols that rest on IPv6. Note that IPv6 appears in the IPv4 inet_protos table as well: the kernel can tunnel IPv6 over IPv4 (also called SIT, for Simple Internet Transition). See the section "IPv6 Versus IPv4."
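The flat protocol table and its two registration functions can be approximated in user space as follows. This is a simplified sketch under stated assumptions, not the kernel code: all demo_ names are hypothetical, the handler takes an opaque pointer instead of an sk_buff, and there is no locking at all, whereas the kernel serializes writers with a spin lock and protects readers with RCU as described above.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_PROTOS 256  /* one slot per possible 8-bit protocol number */

/* Hypothetical, reduced analogue of net_protocol. */
struct demo_protocol {
    int (*handler)(const void *pkt);  /* handler for incoming packets */
    int no_policy;                    /* 1: skip IPsec policy checks */
};

static const struct demo_protocol *demo_protos[DEMO_MAX_PROTOS];

/* Register a protocol in the slot given by its number.
 * Returns 0 on success, -1 if the slot is already taken. */
static int demo_add_protocol(const struct demo_protocol *prot, unsigned char num)
{
    if (demo_protos[num] != NULL)
        return -1;
    demo_protos[num] = prot;
    return 0;
}

/* Unregister a protocol. Returns 0 on success, -1 if the slot does not
 * currently hold this protocol. */
static int demo_del_protocol(const struct demo_protocol *prot, unsigned char num)
{
    if (demo_protos[num] != prot)
        return -1;
    demo_protos[num] = NULL;
    return 0;
}

/* Hand a packet to the handler registered for protocol 'num'.
 * Returns the handler's result, or -1 when nothing is registered. */
static int demo_deliver(unsigned char num, const void *pkt)
{
    const struct demo_protocol *prot = demo_protos[num];

    return prot != NULL ? prot->handler(pkt) : -1;
}

/* Example handler and protocol descriptor used for illustration. */
static int demo_icmp_rcv(const void *pkt)
{
    (void)pkt;
    return 1;
}

static const struct demo_protocol demo_icmp = { demo_icmp_rcv, 0 };
```

Because the protocol number is used directly as the array index, lookup at delivery time is a single array access, which is the main appeal of a flat table over a hash table with collision lists.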
As mentioned in the previous section, the ICMP, UDP, and TCP protocols are always part of the kernel and therefore are statically added to the hash table at boot time by inet_init in net/ipv4/af_inet.c. The following excerpts show the definitions of their structures and the actual inet_add_protocol calls that register them:
The IGMP handler is registered only when the kernel is compiled with support for IP multicast.
As an example of how other protocols are dynamically registered, the following snapshot is taken from the Zebra user-space routing daemon's implementation of the Open Shortest Path First IGP (OSPFIGP) protocol. The code is taken from the ospfd/ospf_network.c file in the Zebra package. The socket call effectively registers the user-space daemon with the kernel, giving the kernel a place to send ingress packets that use the protocol specified in the third argument. This protocol is IPPROTO_OSPFIGP, a symbol equal to 89, the