
Understanding Linux Network Internals 2005, part 6 (ppt)


DOCUMENT INFORMATION

Basic information

Title: Understanding Linux Network Internals 2005, part 6 (ppt)
School: Institute of Post and Telecommunications (IPSTATS)
Major: Networking / Computer Science
Type: slide presentation
Year of publication: 2005
Format:
Pages: 128
Size: 6.25 MB


Number of transmitted multicast packets. Not used by IPv4 at the moment.

Fields related to defragmentation

IPSTATS_MIB_REASMTIMEOUT

Number of packets that failed defragmentation because some of the fragments were not received in time. The value reflects the number of complete packets, not the number of fragments. This field is updated in ip_expire, which is the timer function executed when an IP fragment list is dropped due to a timeout. Note that this counter is not used as defined in the two RFCs mentioned at the beginning of this section.

Number of packets successfully defragmented. This field is updated in ip_frag_reasm.

Fields related to fragmentation

IPSTATS_MIB_FRAGFAILS

Number of failed fragmentation efforts. This field is updated in ip_fragment (and in ipmr_queue_xmit for multicast).

IPSTATS_MIB_FRAGOKS

Number of fragments transmitted. This field is updated in ip_fragment.

IPSTATS_MIB_FRAGCREATES

Number of fragments created. This field is updated in ip_fragment.

The values of these counters are exported in the /proc/net/snmp file.
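As a rough illustration of how these exported counters can be read back, the following user-space sketch pairs the "Ip:" header line of /proc/net/snmp with the "Ip:" value line that follows it and looks a counter up by name. The helper names (next_token, snmp_ip_field, read_ip_counter) are hypothetical, and the mapping from a kernel field such as IPSTATS_MIB_REASMTIMEOUT to a column name such as ReasmTimeout is an assumption to check against the file on your own system.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the next whitespace-delimited token in s and its length,
 * or NULL when no token is left. (Hypothetical helper.) */
static const char *next_token(const char *s, size_t *len)
{
    s += strspn(s, " \t\n");
    *len = strcspn(s, " \t\n");
    return *len ? s : NULL;
}

/* Given the "Ip:" header line and the "Ip:" value line, store the value
 * found in the column called `field` in *out.
 * Returns 0 on success, -1 if the column does not exist. */
int snmp_ip_field(const char *names, const char *values,
                  const char *field, unsigned long *out)
{
    size_t nlen, vlen, flen = strlen(field);
    const char *n = names, *v = values;

    for (;;) {
        n = next_token(n, &nlen);
        v = next_token(v, &vlen);
        if (!n || !v)
            return -1;
        if (nlen == flen && memcmp(n, field, flen) == 0) {
            *out = strtoul(v, NULL, 10);
            return 0;
        }
        n += nlen;
        v += vlen;
    }
}

/* Read one IP counter from a file laid out like /proc/net/snmp.
 * Returns 0 on success, -1 on any failure. */
int read_ip_counter(const char *path, const char *field, unsigned long *out)
{
    char names[2048], values[2048];
    FILE *f = fopen(path, "r");
    int rc = -1;

    if (!f)
        return -1;
    while (fgets(names, sizeof(names), f)) {
        if (strncmp(names, "Ip:", 3) != 0)
            continue;
        /* The line right after the header carries the values. */
        if (fgets(values, sizeof(values), f))
            rc = snmp_ip_field(names, values, field, out);
        break;
    }
    fclose(f);
    return rc;
}
```

A caller would typically use read_ip_counter("/proc/net/snmp", "ReasmTimeout", &value) and treat a return of -1 as "counter not available".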

Each CPU keeps its own accounting information about the packets it processes. Furthermore, it keeps two counters: one for events in interrupt context and the other for events outside interrupt context. Therefore, the ip_statistics array includes two elements per CPU, one for interrupt context and one for noninterrupt context. Not all of the events can happen in both contexts, but to make things easier and clearer, the vector has simply been defined with double the size; those elements that do not make sense in one of the two contexts are simply not used.

Because some pieces of code can be executed both in interrupt context and outside interrupt context, the kernel provides three different macros to add an event to the IP statistics vector:

#define IP_INC_STATS(field) SNMP_INC_STATS(ip_statistics, field)

#define IP_INC_STATS_BH(field) SNMP_INC_STATS_BH(ip_statistics, field)

#define IP_INC_STATS_USER(field) SNMP_INC_STATS_USER(ip_statistics, field)

The first can be used in either context, because it checks internally whether it was called in interrupt context and updates the right element accordingly. The second and the third macros are to be used for events that happened in and outside interrupt context, respectively. The macros IP_INC_STATS, IP_INC_STATS_BH, and IP_INC_STATS_USER are defined in include/net/ip.h, and the three associated SNMP_INC_XXX macros are defined in include/net/snmp.h.
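The layout described above (two elements per CPU, one per context, summed when the counters are reported) can be modeled in user space as follows. This is only a sketch under stated assumptions: the CPU count, the field subset, and all names ending in _sketch are hypothetical, and the real macros determine the current CPU and context themselves rather than taking them as arguments.

```c
#include <stddef.h>

#define NR_CPUS_SKETCH 4            /* hypothetical CPU count */

/* A small subset of counters, standing in for the IPSTATS_MIB_XXX fields. */
enum { MIB_REASMTIMEOUT, MIB_REASMOKS, MIB_FRAGOKS, MIB_MAX };

/* Index 0: interrupt (bottom half) context, index 1: everything else. */
enum { CTX_BH = 0, CTX_USER = 1 };

struct mib_sketch {
    unsigned long mibs[MIB_MAX];
};

/* Two elements per CPU, mirroring the doubled vector described in the text. */
static struct mib_sketch ip_stats_sketch[2][NR_CPUS_SKETCH];

/* Counterpart of the _BH variant: the caller knows it is in interrupt context. */
void inc_stats_bh_sketch(int cpu, int field)
{
    ip_stats_sketch[CTX_BH][cpu].mibs[field]++;
}

/* Counterpart of the _USER variant: the caller knows it is not. */
void inc_stats_user_sketch(int cpu, int field)
{
    ip_stats_sketch[CTX_USER][cpu].mibs[field]++;
}

/* Counterpart of the generic variant: picks the element from the context. */
void inc_stats_sketch(int cpu, int in_interrupt, int field)
{
    ip_stats_sketch[in_interrupt ? CTX_BH : CTX_USER][cpu].mibs[field]++;
}

/* Total for one field: the sum over both contexts and all CPUs. */
unsigned long fold_field_sketch(int field)
{
    unsigned long sum = 0;
    int ctx, cpu;

    for (ctx = 0; ctx < 2; ctx++)
        for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
            sum += ip_stats_sketch[ctx][cpu].mibs[field];
    return sum;
}
```

Because each CPU only ever touches its own elements, no lock is needed on the update path; the cost is paid once, by the reader that folds the elements together.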


23.4 IP Configuration

The Linux IP protocol can be tuned and configured manually by a system administrator in different ways. This tuning includes changes both to the protocol itself and to device configuration. The four main interfaces are:

ioctl calls made via ifconfig

ifconfig is the older Unix-legacy tool for configuring IP on network devices.

These three protocols can be used to dynamically assign an IP configuration to a host and its interfaces.

The last set of protocols in the preceding list have an interesting twist. They are normally implemented in user space, but Linux also has a simple kernel-space implementation that is useful when used together with the nfsroot boot option. The latter allows the kernel to mount the root directory (/) via NFS. To do that, it needs an IP configuration at boot time before the system is able to initialize the IP configuration from user space (which, by the way, could be stored in a remote partition and not even be available to the system when it mounts the root directory). Via kernel boot options, it is possible to give nfsroot a static configuration, or specify what protocols (yes, more than one can be used concurrently) to use to obtain the configuration. The IP configuration code is in net/ipv4/ipconfig.c, and the one used by nfsroot is in fs/nfs/nfsroot.c. The two files cross-reference variables and functions, but they are actually simple to read. We will not cover them, because network filesystems and user-space clients are outside the scope of this book. Once you know how to read __setup macros (described in Chapter 7), reading the code should become a piece of cake. It is clear and well commented.

The third item in the list, /proc, is covered later in the section "Tuning via /proc Filesystem."

In this section, I will say a bit about the kernel interfaces that support the behavior of the first two items, ifconfig and ip. The purpose here is not to cover the internals of the user-space commands or the associated kernel counterparts that handle configuration requests. It is to show how user space and kernel space communicate, and the kernel functions that are invoked in response to a user-space command.

23.4.1 Main Functions That Manipulate IP Addresses and Configuration


Before reading these descriptions of functions, it would be worthwhile reviewing the key data structures used by the IP layer, introduced in Chapter 19 and described in detail later in this chapter. For instance, a single IP address is represented by an in_ifaddr structure and the complete IPv4 configuration of a device by an in_device structure.

inetdev_init and inetdev_destroy

inetdev_init is invoked when the first IP configuration is applied to a device. It allocates the in_device structure and links it to the associated net_device instance. It also creates a directory in /proc/sys/net/ipv4/conf/ (see the section "Tuning via /proc Filesystem").

The IP configuration can be removed with inetdev_destroy, which simply undoes whatever was done in inetdev_init, plus removes all of the linked in_ifaddr structures. The latter are removed with inet_free_ifa, which also decrements the reference count on the in_device structure with in_dev_put. When the last reference is released, probably with the last call to inet_free_ifa, the in_device instance is freed with in_dev_finish_destroy.

inet_alloc_ifa and inet_free_ifa

These two functions allocate and free, respectively, an in_ifaddr data structure. A new one is allocated when a user adds a new address to an interface. A deletion can be triggered by the removal of a single address, or by the removal of a device's entire IP configuration at once. Both routines use the read-copy update (RCU) mechanism as a means to enforce mutual exclusion.

inet_insert_ifa and inet_del_ifa

inet_insert_ifa adds a new in_ifaddr structure to the list within in_device. It detects duplicates and marks the address as secondary if it finds out that it falls within another address's subnet. Suppose, for instance, that eth0 already had the address 10.0.0.1/24. When a new 10.0.0.2/24 address is added, it will be recognized as secondary with respect to the first. Primary addresses are also used to feed the entropy of the kernel random number generator with net_srandom. More information on primary and secondary addresses can be found in Chapter 30.

inet_del_ifa simply removes an in_ifaddr structure from the associated in_device instance, making sure that, if the address is primary, all of the associated secondary addresses are removed too, unless the administrator has explicitly configured the device, via its /proc/sys/net/ipv4/conf/dev_name/promote_secondaries file, not to remove secondary addresses. In that case, a secondary address is promoted to a primary one when the associated primary address is removed. Given the in_device instance, this configuration can be accessed with the IN_DEV_PROMOTE_SECONDARIES macro. The inet_del_ifa function accepts an extra input parameter that can be used to tell whether the in_device structure should be freed when the last in_ifaddr instance has been removed. While it is normal to remove the empty in_device structure, sometimes a caller might not do it, such as when it knows it is going to add a new in_ifaddr soon.

In both cases, addition and deletion, successful completion leads to a Netlink broadcast notification with rtmsg_ifa (see the section "Change Notification: rtmsg_ifa") and a notification to the other kernel subsystems via the inetaddr_chain notification chain (see Chapter 4).

inet_set_ifa

This is a wrapper for inet_insert_ifa that creates an in_device structure if none exists for the associated device, and sets the scope of the address to local (RT_SCOPE_HOST) for addresses like 127.x.x.x. Refer to the section "Scope" in Chapter 30 for more details on scopes.


Many other, smaller functions can be used to make the code more readable. Here are a few of them:

inet_select_addr

This function is used to select an IP address among the ones configured on a given device. The function accepts an optional scope as a parameter, which can be used to narrow down the lookup domain. We will see where this function is useful in Chapter 35.

inet_make_mask and inet_mask_len

Given the number of 1s the netmask is composed of, inet_make_mask creates the associated netmask. For example, an input of 24 would generate the netmask with the decimal representation 255.255.255.0.

inet_mask_len is the converse, returning the number of 1s in a decimal netmask. For instance, 255.255.0.0 would return 16.

inet_ifa_match

Given an IP address and a netmask, inet_ifa_match checks whether a given second IP address falls within the same subnet. This function is often used to classify secondary addresses and to check whether a given IP address belongs to one of the locally configured subnets. See, for instance, inet_del_ifa.

for_primary_ifa and for_ifa

These two macros can be used to browse all of the in_ifaddr instances associated with a given in_device structure. for_primary_ifa considers only primary addresses, and for_ifa goes through all of them.
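The mask helpers and the subnet test described in this list are small enough to sketch in user space, together with the primary/secondary classification that inet_insert_ifa is said to perform. Everything here is an illustration, not the kernel code: the names with a _sketch suffix and the IP4 macro are hypothetical, addresses are kept in host byte order for readability, and the rule "same mask and same subnet as an existing primary" is an assumption about how the classification is decided.

```c
#include <stddef.h>
#include <stdint.h>

/* Build a host-byte-order IPv4 address from four octets (hypothetical helper). */
#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

struct ifa_sketch {
    uint32_t addr;      /* address, host byte order */
    uint32_t mask;      /* netmask, host byte order */
    int secondary;      /* 1 if classified as secondary */
};

/* Netmask with `len` leading 1 bits: 24 gives 255.255.255.0. */
uint32_t make_mask_sketch(int len)
{
    if (len <= 0)
        return 0;
    if (len >= 32)
        return 0xFFFFFFFFu;
    return ~(((uint32_t)1 << (32 - len)) - 1);
}

/* Number of leading 1 bits in a contiguous netmask: 255.255.0.0 gives 16. */
int mask_len_sketch(uint32_t mask)
{
    int n = 0;

    while (mask & 0x80000000u) {
        n++;
        mask <<= 1;
    }
    return n;
}

/* True when addr falls within the subnet defined by ifa_addr and ifa_mask. */
int ifa_match_sketch(uint32_t addr, uint32_t ifa_addr, uint32_t ifa_mask)
{
    return ((addr ^ ifa_addr) & ifa_mask) == 0;
}

/* Mark new_ifa as secondary if an existing primary address on the same
 * device has the same mask and covers the new address. */
void classify_new_sketch(const struct ifa_sketch *list, size_t n,
                         struct ifa_sketch *new_ifa)
{
    size_t i;

    new_ifa->secondary = 0;
    for (i = 0; i < n; i++) {
        if (!list[i].secondary && list[i].mask == new_ifa->mask &&
            ifa_match_sketch(new_ifa->addr, list[i].addr, list[i].mask)) {
            new_ifa->secondary = 1;
            return;
        }
    }
}
```

With 10.0.0.1/24 already configured, adding 10.0.0.2/24 is classified as secondary, while 10.0.1.1/24 starts a new subnet and stays primary.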

23.4.2 Change Notification: rtmsg_ifa

Netlink provides the RTMGRP_IPV4_IFADDR multicast group to user-space applications interested in changes to the locally configured IP addresses. The kernel uses the rtmsg_ifa function to notify those applications that registered to the group when any change takes place on the local IP addresses. The function can be called when two types of events occur:

RTM_NEWADDR

A new address has been configured on a device.

RTM_DELADDR

An address has been removed from a device.

The generated message is initialized with inet_fill_ifaddr, the same function used to handle dump requests from user space (with commands such as ip addr list). The message includes the address being added or removed, and the device associated with it.
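From the user-space side, joining this multicast group takes a Netlink socket bound with RTMGRP_IPV4_IFADDR in its group mask. The sketch below shows one way to do that and to classify the messages received; the function names are hypothetical, error handling is minimal, and the attribute payload (the address itself) is deliberately not decoded.

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Open a Netlink socket subscribed to IPv4 address changes.
 * Returns the descriptor, or -1 on failure. */
int open_ifaddr_listener(void)
{
    struct sockaddr_nl sa;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0)
        return -1;
    memset(&sa, 0, sizeof(sa));
    sa.nl_family = AF_NETLINK;
    sa.nl_groups = RTMGRP_IPV4_IFADDR;      /* join the multicast group */
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Walk a buffer filled by recv() on such a socket and count how many
 * messages announce a new address and how many a removed one. */
void count_addr_events(void *buf, size_t len, int *added, int *removed)
{
    struct nlmsghdr *nh = buf;
    int remaining = (int)len;   /* NLMSG_NEXT decrements this in place */

    for (; NLMSG_OK(nh, remaining); nh = NLMSG_NEXT(nh, remaining)) {
        if (nh->nlmsg_type == RTM_NEWADDR)
            (*added)++;
        else if (nh->nlmsg_type == RTM_DELADDR)
            (*removed)++;
    }
}
```

A monitoring loop would call recv() on the descriptor returned by open_ifaddr_listener and hand each filled buffer to count_addr_events, or to a richer handler that also parses the ifaddrmsg payload.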

So, who is interested in this kind of notification? Routing protocols are a major example. If you are using Zebra, the routing protocols you

23.4.3 inetaddr_chain Notification Chain

The IP subsystem uses the inetaddr_chain notification chain to notify other kernel subsystems about changes to the IP configuration of the local devices. A kernel subsystem can register and unregister itself with inetaddr_chain by means of the register_inetaddr_notifier and unregister_inetaddr_notifier functions. Here are two examples of users for this notification chain:

Routing

See the section "External Events" in Chapter 32.

Netfilter masquerading

When a local IP address is used by Netfilter's masquerading feature, and that address disappears, all of the connections that are using that address must be dropped (see net/ipv4/netfilter/ipt_MASQUERADE.c).

The NETDEV_DOWN event is sent when an IP address is removed from a local device, and the NETDEV_UP event when one is added. Such notifications are generated by the inet_del_ifa and inet_insert_ifa routines introduced in the section "Main Functions That Manipulate IP Addresses and Configuration."
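The register, unregister, and notify pattern behind such a chain can be modeled in plain user-space C. The sketch below is not the kernel's notifier API: every name is hypothetical, there is no locking or priority ordering, and the two event constants merely stand in for NETDEV_UP and NETDEV_DOWN. The sample subscriber imitates the masquerading case by counting address removals.

```c
#include <stddef.h>

/* Stand-ins for the NETDEV_UP and NETDEV_DOWN events. */
enum { EV_ADDR_UP = 1, EV_ADDR_DOWN = 2 };

struct notifier_sketch {
    int (*call)(struct notifier_sketch *self, int event, void *data);
    struct notifier_sketch *next;
};

static struct notifier_sketch *chain_head;

/* Add a subscriber at the head of the chain. */
void chain_register_sketch(struct notifier_sketch *nb)
{
    nb->next = chain_head;
    chain_head = nb;
}

/* Remove a subscriber. Returns 0 if it was found, -1 otherwise. */
int chain_unregister_sketch(struct notifier_sketch *nb)
{
    struct notifier_sketch **pp;

    for (pp = &chain_head; *pp; pp = &(*pp)->next) {
        if (*pp == nb) {
            *pp = nb->next;
            return 0;
        }
    }
    return -1;
}

/* Deliver an event to every subscriber; returns how many were called. */
int chain_notify_sketch(int event, void *data)
{
    struct notifier_sketch *nb;
    int called = 0;

    for (nb = chain_head; nb; nb = nb->next) {
        nb->call(nb, event, data);
        called++;
    }
    return called;
}

/* Sample subscriber: reacts only to address removal, the way a
 * masquerading module would drop the connections using that address. */
int dropped_connections_sketch;

int masq_cb_sketch(struct notifier_sketch *self, int event, void *data)
{
    (void)self;
    (void)data;
    if (event == EV_ADDR_DOWN)
        dropped_connections_sketch++;
    return 0;
}
```

The publisher does not need to know who is listening, which is why the address-handling routines can serve routing and masquerading with a single call.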

23.4.4 IP Configuration via ip

Traditionally, Unix system administrators configured interfaces and routes manually using ifconfig, route, and other commands. Currently Linux provides an umbrella ip command to handle IP configuration, with a number of subcommands.

In this section we will see how IPROUTE2 handles the main addressing operations, such as adding and removing an address. Once you are familiar with these operations, you can easily understand and read through the code for the others.

Figure 23-2 shows the files and the main functions of the IPROUTE2 package that are involved with IP address configuration activities. The labels on the lines are ip keywords, and the nodes show the function invoked and the file the latter belongs to. For instance, the command ip address add would be handled by ipaddr_modify.

Figure 23-2 IPROUTE2 files and functions for address configuration


Table 23-1 shows the association between the operation specified with a command-line keyword (e.g., add) and the handler run by the kernel. For instance, when the kernel receives a request for an RTM_NEWADDR operation, it knows it is associated with an add command and therefore invokes inet_rtm_newaddr. Some kernel operations are overloaded, and for these, the kernel needs extra flags to figure out exactly what the user-space command is asking for. See Chapter 36 for an example. This association is defined in net/ipv4/devinet.c in the inet_rtnetlink_table structure. For an introduction to RTNetlink, refer to Chapter 3.

Table 23-1 ip route commands and associated kernel operations

The list and flush commands need some explanation. list is simply a request to the kernel to dump information, for instance, about a given device, and flush is a request to clear the entire IP configuration on the device.

The two functions inet_rtm_newaddr and inet_rtm_deladdr are wrappers for the generic functions inet_insert_ifa and inet_del_ifa that we introduced in the section "Main Functions That Manipulate IP Addresses and Configuration." All the wrappers do is translate the request that comes from user space into an input understandable by the two more-general functions. They also filter bad requests that are associated with nonexistent devices.

23.4.5 IP Configuration via ifconfig

ifconfig is implemented in the ifconfig.c user-space file (part of the net-tools package). Unlike ip, ifconfig uses ioctl calls to interface to the kernel. However, a set of functions are used by both the ip and ifconfig handlers. In Chapter 3, we had an overview of how ioctl calls are handled by the kernel. Here all we need to know is that the requests related to IPv4 configuration are handled by the inet_ioctl function in net/ipv4/af_inet.c. Based on the ioctl code you can see what helper functions inet_ioctl uses to process the user-space commands (e.g., devinet_ioctl).


23.5 IP-over-IP

IP-over-IP, also called IP tunneling (or IPIP), consists of transmitting IP packets inside other IP packets. This protocol is useful in some very interesting cases, including in a Virtual Private Network (VPN). Of course, nothing comes for free; you can well imagine the extra weight of the doubling of the protocol: because each IP packet has two IP headers, the overhead becomes huge for small packets. There are subtle complexities in implementation, too. For instance, what is the relationship between the IP options of the two headers?
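The cost of the second header is easy to quantify. The sketch below computes how many bytes out of every 1,000 on the wire are IPv4 header bytes, assuming minimal 20-byte headers without options; the function name and the per-mille unit are choices made only for this illustration.

```c
/* Minimum IPv4 header length in bytes (a header without options). */
#define IPV4_MIN_HDR 20u

/* Header bytes per 1000 bytes on the wire for a payload of the given
 * size carried inside `nested` IPv4 headers (1 = plain IP, 2 = IPIP). */
unsigned header_permille_sketch(unsigned payload, unsigned nested)
{
    unsigned hdr = nested * IPV4_MIN_HDR;

    if (hdr + payload == 0)
        return 0;
    return hdr * 1000u / (hdr + payload);
}
```

For a 20-byte payload, the header share rises from 500 to 666 per 1,000 bytes when the packet is tunneled, while for a 1,460-byte payload it only goes from 13 to 26, which is why the overhead matters mostly for small packets.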

If you consider just the IPv4 and IPv6 protocols, you already have four possible combinations of tunneling. But not all of these combinations are likely to be used.

To make things more complex (I should actually say "flexible"), keep in mind that there is no limit to the number of recursions in tunneling.[*]

[*] IPv6 defines the "tunnel encapsulation limit" as the maximum number of nested encapsulations. See section 6.6 of RFC 2473.

The different tunnel interfaces that can be created in Linux are not covered in this book. However, given the background on the IP implementation in this part of the book, you can study the code in net/ipv4/ipip.c and include/net/ipip.h to derive the implementation details.


23.6 IPv4: What's Wrong with It?

We saw in the section "IP Protocol: The Big Picture" in Chapter 18 what the main tasks of the IP protocol are. IPv4 was designed almost 25 years ago (in 1981), and given the speed with which the Internet and network services have evolved since then, the protocol is showing its age. Because IPv4 was not originally designed with today's big network topologies and commercial uses in mind, it has shown several limitations over the years. These have been only partially solved, sometimes with special extensions to the protocol (e.g., classless interdomain routing, DiffServ Code Point (DSCP) replacement of ToS, congestion notification, etc.), and other times by defining specialized external protocols such as IPsec.

Thanks to the experience gained with IPv4, the new IPv6 version of the protocol has been designed to address the known shortcomings of IPv4, taking into consideration such aspects as:

When analyzing IPv4 packet transmission, we saw that fragmentation and options processing were the two most expensive tasks. It should not come as a surprise, therefore, that IPv6 addressed both points:

Fragmentation has been limited in IPv6: an IP packet can be fragmented only at the source.

The presence of IP options may sometimes inhibit the fast processing path: this is true for both software routers like Linux on a PC and commercial hardware IP implementations. For a commercial implementation, it could mean that IP packets without options can be forwarded in hardware at much higher speed, and the ones with options have to be handled in software. The way options are handled by IPv6 is also different: IPv6 uses the concept of extensions, whose main advantage is that not all of the routers have to process them.

One other big limitation of IPv4 is the 32-bit size of its addresses and the limited hierarchy they come with. Network Address Translation (NAT) is only a short-term solution that partially solves the problem. NAT comes with some limitations, including the following:

Each protocol has to be treated specially, so some protocols don't always work when passing through a NAT router (e.g., H323).

The NAT router becomes a single point of failure. Because it needs to keep state information for all the connections passing through it, designing a network with redundancy or security in mind is not easy.

Its tasks are complex and computationally heavy when there is a need to support those complex protocols that have not been designed with NAT support in mind (these are considered to be "not NAT-friendly"[*]).

[*] You can read RFC 3235 if you would like to see what is considered a NAT-friendly protocol or

The limited number of addresses in IPv4 also contributes (because of its limited hierarchy) to the creation of huge routing tables. A core router can have up to hundreds of thousands of routes. This trend is bad, for a couple of reasons:

The routes require lots of memory.

Lookups are slower.

Classless interdomain routing helps in reducing the size of the routing tables, but cannot solve the limited address space problem of IPv4.

In IPv6, the address has been made four times bigger in size (128 bits instead of 32), which does not mean four times as many addresses, but rather 2^96 times as many! This potentially brings systems outside the NAT router and makes them full-fledged citizens of the Internet, with implications for new types of applications.

IPv4 was not designed with security in mind. Because of this, several approaches of different granularity have been developed: application end-to-end solutions such as Secure Sockets Layer (SSL), host end-to-end solutions such as IPsec, etc. Each has its own pros and cons. SSL requires the applications to be written to use that security layer (which sits on top of TCP), whereas IPsec (which is what most people identify VPNs with) does not: IPsec sits at the L3 layer and therefore is transparent to applications. IPsec can be used by both IPv4 and IPv6, but it fits better with IPv6.

With IPv6, the neighboring system has changed as well. It is called neighbor discovery, and represents the counterpart to ARP for IPv4.

The QoS component is also expanded.

With IPv4 networks, it is already possible to carry out automatic host configuration, thanks to protocols such as DHCP; however, some constraints make that solution less Plug and Play (PnP) than it should be. This issue has been solved by IPv6 too, with the so-called autoconfiguration feature.


23.7 Tuning via /proc Filesystem

The /proc filesystem was introduced in Chapter 3; it provides a simple interface for users to view and change kernel parameters and is the model for the newer sysfs directory. It contains a huge number of files (or rather, virtual data structures that look to the user just like files) that map to variables and functions inside the kernel and that can be used to tune the behavior of the networking component of the kernel as well.

The files used for IPv4 tuning are located mainly in two directories:

/proc/sys/net/ipv4/

Table 23-2 shows some of the files in this directory that are used by IPv4. The kernel variables associated with those files are declared in net/ipv4/sysctl_net_ipv4.c and are statically registered at boot time (see Chapter 3). Note that the directory contains many more files than the ones in Table 23-2. Most of the extra files are associated with L4 protocols, especially TCP.

/proc/sys/net/ipv4/conf/

This directory contains a subdirectory for each network device recognized by the kernel, plus other special directories (see Figure 36-4 in Chapter 36). Those subdirectories include configuration parameters that are device specific; among them are accept_redirects, send_redirects, accept_source_route, and forwarding. These will be covered in Chapter 36, with the exception of promote_secondaries, which is described in the section "Main Functions That Manipulate IP Addresses and Configuration."
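Because these tunables appear as ordinary files, a program can read them with standard file I/O. The following sketch builds the path of a per-device parameter and reads its integer value; both helper names are hypothetical, and the device name eth0 used in the usage note is only an example.

```c
#include <stdio.h>

/* Build "/proc/sys/net/ipv4/conf/<dev>/<name>" into buf.
 * Returns 0 on success, -1 if the path does not fit. */
int conf_path_sketch(char *buf, size_t len, const char *dev, const char *name)
{
    int n = snprintf(buf, len, "/proc/sys/net/ipv4/conf/%s/%s", dev, name);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read a single integer from a file such as the ones under /proc/sys.
 * Returns 0 on success, -1 if the file is missing or not numeric. */
int read_int_file_sketch(const char *path, int *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = fscanf(f, "%d", out) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

For example, building the path for device eth0 and parameter forwarding and passing it to read_int_file_sketch yields the current value of that switch; writing a new value works the same way with fopen(path, "w"), given sufficient privileges.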


Table 23-2 IPv4-related files in /proc/sys/net/ipv4

These values are updated by tcp_init at boot time based on the amount of memory available in the system. Even if they are updated by TCP, they are used by any L4 protocol that uses ports.

b

This value is updated by inet_initpeers at boot time based on the amount of memory available in the system.

The first three elements in Table 23-2 are members of two data structures of type ipv4_devconf and ipv4_config, located, respectively, in include/linux/inetdevice.h and include/net/ip.h and described later in this chapter. The other elements of those structures are either exported elsewhere or not exported at all (we will cover them in the associated chapters). The meaning of the files and kernel variables

When 0, path MTU discovery is enabled.


Maximum number of inet_peer structures that can be allocated.


inet_peer_gc_mintime

Amount of time between regular garbage collection passes. Since the amount of memory usable by the inet_peer structures is limited (by inet_peer_threshold), there is a regular timer that expires unused entries based on these two variables. inet_peer_gc_maxtime is used when the system is not heavily loaded, and inet_peer_gc_mintime is used in the opposite case. Thus, the more entries there are, the more frequently the timer expires.

23.8 Data Structures Featured in This Part of the Book

The section "Main IPv4 Data Structures" in Chapter 19 gave a brief overview of the main data structures. This section has a detailed description of each data structure type. Figure 23-3 shows the file that defines each data structure.

23.8.1 iphdr Structure

The meaning of its fields has already been covered in the section "IP Header" in Chapter 18.

23.8.2 ip_options Structure

This structure represents the options for a packet that needs to be transmitted or forwarded. The options are stored in this structure because it is easier to read than the corresponding portion of the IP header itself.

Figure 23-3 Distribution of data structures in kernel files


Let's go field by field. They should be fairly simple to understand if you have read the section "IP Options" in Chapter 18. After this description, you will be able to understand more easily how the parsing is done and how its results are used by the IP layer subsystems, such as the code that processes incoming IP packets. Some of the bit fields are grouped together into an unsigned char; the declarations of these end with :1.

unsigned char optlen

Length of the set of options. As explained in Chapter 18, this is limited to a maximum of 40 bytes by the definition of the IP header.

unsigned char is_changed:1

Set if the IP header has been modified (such as an IP address or a timestamp). This is useful to know because if the packet has to be forwarded, this field indicates that the IP checksum has to be recomputed.

__u32 faddr

unsigned char is_strictroute:1

unsigned char srr

unsigned char srr_is_hit:1

faddr is meaningful only for transmitted packets (that is, those generated locally) and only for those using source routing. The value of faddr is set to the first of the IP addresses provided for source routing. See the section "Option: Strict and Loose Source Routing" in Chapter 19.


unsigned char rr

When rr is nonzero, Record Route is one of the IP options and the value of this field represents the offset inside the IP header where the option starts. This field is used together with rr_needaddr.

unsigned char rr_needaddr:1

When rr_needaddr is true, Record Route is one of the IP options and there is still room in the header for another route; therefore, the current node should copy the IP address of the outgoing interface into the IP header at the offset specified by rr.

unsigned char ts

When ts is nonzero, Timestamp is one of the IP options and this field represents the offset inside the IP header where the option starts. This field is used together with ts_needaddr and ts_needtime.

unsigned char is_setbyuser:1

This field makes sense only for transmitted packets and is set when the options were passed from user space with the system call setsockopt. Currently, however, it is never used.

unsigned char is_data:1

unsigned char __data[0]

These fields are used in two situations: when the local node transmits a locally generated packet, and when the local node replies to an ICMP echo request. In these cases, is_data is true and __data points to an area containing the options to append to the IP header. The [0] definition is a common convention for naming a variable-size area that is allocated right after the structure, without reserving any space in the structure itself.

When forwarding a packet, the options are in the associated skb buffer (see the ip_options_get function in the net/ipv4/ip_options.c file).

unsigned char ts_needtime:1

When this option is true, Timestamp is one of the IP options and there is still room in the header for another timestamp; therefore, the current node should add the time of transmission into the IP header at the offset specified by ts.

unsigned char ts_needaddr:1

Used with ts and ts_needtime to indicate that the IP address of the egress device should also be copied into the IP header.


unsigned char router_alert

When this option is true, Router Alert is one of the IP options.

unsigned char __pad1, __pad2

Because memory accesses are faster when the location is aligned to a 32-bit boundary, the Linux kernel data structures are often padded out with unused fields called __padn in order to make their sizes a multiple of 32 bits. This is the only purpose of __pad1 and __pad2; they are not used otherwise.

The flags srr, rr, and ts also are useful when parsing the options in order to detect the ones that are present more than once, which is illegal (see the section "Option Parsing" in Chapter 19).
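The role of srr, rr, and ts as both offsets and "already seen" flags can be illustrated with a small option walker. This is a user-space sketch, not the kernel parser: the structure and function names are hypothetical, only a handful of fields are modeled, the option payloads are not validated, and offsets are computed on the assumption that the options start right after a 20-byte base header. The option type values (7, 68, 131, 137, 148) are the standard IPv4 ones.

```c
#include <stddef.h>
#include <string.h>

#define OPT_END   0     /* End of Option List */
#define OPT_NOOP  1     /* No Operation (padding) */
#define OPT_RR    7     /* Record Route */
#define OPT_TS    68    /* Timestamp */
#define OPT_LSRR  131   /* Loose Source Routing */
#define OPT_SSRR  137   /* Strict Source Routing */
#define OPT_RA    148   /* Router Alert */

#define BASE_HDR  20    /* options start after the fixed part of the header */

struct opts_sketch {
    unsigned char optlen;
    unsigned char srr, rr, ts, router_alert;    /* offsets; 0 = absent */
    unsigned char is_strictroute:1;
};

/* Record where each known option starts.
 * Returns 0 on success, -1 on a malformed or duplicated option. */
int parse_options_sketch(const unsigned char *opt, size_t len,
                         struct opts_sketch *o)
{
    size_t i = 0;

    memset(o, 0, sizeof(*o));
    if (len > 40)                       /* the header cannot hold more */
        return -1;
    o->optlen = (unsigned char)len;

    while (i < len) {
        unsigned char type = opt[i], olen;
        unsigned char off = (unsigned char)(BASE_HDR + i);

        if (type == OPT_END)
            break;
        if (type == OPT_NOOP) {         /* single-byte option */
            i++;
            continue;
        }
        if (i + 1 >= len)
            return -1;
        olen = opt[i + 1];
        if (olen < 2 || i + olen > len)
            return -1;

        switch (type) {
        case OPT_SSRR:
        case OPT_LSRR:
            if (o->srr)                 /* nonzero offset: seen before */
                return -1;
            o->srr = off;
            o->is_strictroute = (type == OPT_SSRR);
            break;
        case OPT_RR:
            if (o->rr)
                return -1;
            o->rr = off;
            break;
        case OPT_TS:
            if (o->ts)
                return -1;
            o->ts = off;
            break;
        case OPT_RA:
            o->router_alert = off;
            break;
        default:
            break;                      /* unknown options are skipped */
        }
        i += olen;
    }
    return 0;
}
```

Since a valid option can never start at offset 0 (the fixed header comes first), a zero offset doubles as "not present", which is what makes the duplicate check a one-line test.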

Here is the description of the fields of the ipq structure. For the sake of simplicity, not all fields are shown in Figure 22-1 in Chapter 22.

struct ipq *next

When the fragments are put into the ipq_hash hash table, conflicting elements (elements with the same hash value) are linked together with this field. Note that this field does not indicate the order of fragments within the packet; it is used simply as a standard way to organize the hash table. The order of fragments within the packet is controlled by the fragments field (see Figure 22-1 in Chapter 22).

struct ipq **pprev


All of the ipq structures are kept sorted in a global list, ipq_lru_list, based on a least-recently-used criterion. This list is useful when performing garbage collection. This field is used to link the ipq structure to such a list.

u32 user

The reason why an IP packet is to be defragmented, which indirectly says what kernel subsystem asked for the defragmentation. The list of allowed values for IP_DEFRAG_XXX is in include/net/ip.h. The most common one is IP_DEFRAG_LOCAL_DELIVER, which is used when defragmenting ingress packets that are to be delivered locally.

The first of the fragments (the one with offset=0) has been received. The first fragment is the only one carrying all of the options that were in the original IP packet.


struct sk_buff *fragments

List of fragments received so far.

struct timer_list timer

Chapter 18 explained why IP fragments cannot stay forever in memory and should be removed after some time if defragmentation is not possible. This field is the timer that takes care of that.

int iif

ID of the device from which the last fragment was received. When a list of fragments expires, this field is used to decide which device to use to transmit the FRAGMENTATION REASSEMBLY TIMEOUT ICMP message (see ip_expire in the net/ipv4/ip_fragment.c file).

struct timeval stamp

Time when the last fragment was received (see ip_frag_queue in net/ipv4/ip_fragment.c).

The ipq_hash table is protected by ipfrag_lock, which can be taken in either shared (read-only) or exclusive (read-write) mode. Do not confuse this lock with the one embedded in each ipq element.

23.8.5 inet_peer Structure


struct inet_peer *avl_left

struct inet_peer *avl_right

Left and right pointers to the two subtrees.

__u16 avl_height

Height of the AVL tree.

struct inet_peer *unused_next

struct inet_peer **unused_prevp

Used to link the node into a list that contains elements that expired. unused_prevp is used to check whether the node is in that list.

A node can be put into that list and then taken back out of it several times without ever being removed completely. See the section "Garbage Collection."

unsigned long dtime

Time when this element was added to the unused list inet_peer_unused_head via inet_putpeer.

unsigned long tcp_ts_stamp

Used by TCP to manage timestamps.

The in_device structure stores all of the IPv4-related configuration for a network device, such as changes made by a user with the ifconfig or ip command. This structure is linked to the net_device structure via net_device->ip_ptr and can be retrieved with in_dev_get and __in_dev_get. The difference between those two functions is that the first one takes care of all of the necessary locking, and the second one assumes the caller has taken care of it already.

Since in_dev_get internally increases a reference count on the in_device structure when it succeeds (i.e., when a device is configured to support IPv4), its caller is supposed to decrement the reference count with in_dev_put when it is done with the structure.
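
The get/put discipline just described can be sketched in user-space C. This is only an illustration under stated assumptions: demo_in_device, demo_net_device, demo_dev_get, and demo_dev_put are hypothetical stand-ins rather than the kernel's own definitions, and a plain integer takes the place of the kernel's atomic counter and locking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-device IPv4 configuration block. */
struct demo_in_device {
    int refcnt;   /* number of outstanding references */
    int freed;    /* set when the last reference is dropped (demo only) */
};

/* Hypothetical stand-in for a network device. */
struct demo_net_device {
    struct demo_in_device *ip_ptr;  /* NULL when IPv4 is not configured */
};

/* Return the IPv4 block with its reference count raised, or NULL. */
struct demo_in_device *demo_dev_get(struct demo_net_device *dev)
{
    struct demo_in_device *in_dev = dev->ip_ptr;

    if (in_dev != NULL)
        in_dev->refcnt++;
    return in_dev;
}

/* Drop one reference; mark the block as freed when the count reaches 0. */
void demo_dev_put(struct demo_in_device *in_dev)
{
    assert(in_dev->refcnt > 0);
    if (--in_dev->refcnt == 0)
        in_dev->freed = 1;
}
```

A caller in this sketch pairs every successful demo_dev_get with exactly one demo_dev_put, and must not touch the structure after the put.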

The structure is allocated and linked to the device with inetdev_init, which is called when the first IPv4 address is configured on the device. Here are the meanings of its fields:

struct net_device *dev

Pointer back to the associated net_device structure.

atomic_t refcnt

Reference count. The structure cannot be freed until this field is 0.

int dead

This field is set to mark the device as dead. This is useful to detect those cases where the entry cannot be destroyed because it has a nonzero reference count, but a destroy action has been initiated. The two most common events that trigger the removal of an in_device structure are:

Unregistration of the device (see Chapter 8).

Removal of the last configured IP address from the device (see inet_del_ifa in net/ipv4/devinet.c).

struct in_ifaddr *ifa_list

List of IPv4 addresses configured on the device. The in_ifaddr instances are kept sorted by scope (bigger scope first), and elements with the same scope are kept sorted by address type (primary first). The in_ifaddr data structure is further described in the section "in_ifaddr Structure."

struct ipv4_devconf cnf

See the section "ipv4_devconf Structure."

struct rcu_head rcu_head

Used by the RCU mechanism to enforce mutual exclusion. It accomplishes the same job as a lock.

The rest of the fields are used by the multicast code. For instance, mc_list stores the device's multicast configuration and it is the multicast counterpart of ifa_list. mr_v1_seen and mr_v2_seen are timestamps used by the IGMP protocol to keep track of the reception of versions 1 and 2 IGMP packets.

23.8.8 in_ifaddr Structure

When configuring an IPv4 address on an interface, the kernel creates an in_ifaddr structure that includes the 4-byte address along with several other fields. Here are their meanings:

struct in_ifaddr *ifa_next

Pointer to the next element in the list. The list contains all of the addresses configured on the device.

struct in_device *ifa_dev

Pointer back to the associated in_device structure.

u32 ifa_local

u32 ifa_address

The values of these two fields depend on whether the address is assigned to a tunnel interface. If so, ifa_local and ifa_address are the local and remote addresses of the tunnel, respectively. If not, both contain the address of the local interface.

u32 ifa_mask

unsigned char ifa_prefixlen

ifa_mask is the netmask associated with the address. ifa_prefixlen is the number of 1s that compose the netmask. Since they are different ways of representing the same information, one of the two is normally computed from the other. This is done, for instance, by the ip and ifconfig user-space configuration tools described in the section "IP Configuration." ip passes the kernel ifa_prefixlen and lets the latter compute ifa_mask, whereas ifconfig does the opposite. The kernel provides some functions to convert a netmask into a prefix length, and vice versa.
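
The relationship between the two representations can be shown with a minimal user-space sketch. The function names demo_prefixlen_to_mask and demo_mask_to_prefixlen are hypothetical and are not the kernel's own helpers; the sketch also assumes masks in host byte order, whereas the kernel stores them in network byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Build a netmask (host byte order) from a prefix length in the range 0..32. */
uint32_t demo_prefixlen_to_mask(unsigned int prefixlen)
{
    assert(prefixlen <= 32);
    if (prefixlen == 0)
        return 0;               /* a 32-bit shift by 32 would be undefined */
    return (uint32_t)(~(uint32_t)0 << (32 - prefixlen));
}

/* Count the leading 1 bits of a contiguous netmask (host byte order). */
unsigned int demo_mask_to_prefixlen(uint32_t mask)
{
    unsigned int len = 0;

    while (mask & 0x80000000u) {
        len++;
        mask <<= 1;
    }
    return len;
}
```

For example, a prefix length of 24 corresponds to the mask 255.255.255.0 (0xFFFFFF00), and converting that mask back yields 24.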

u32 ifa_broadcast

Broadcast address.

u32 ifa_anycast

Anycast address.

unsigned char ifa_scope

Scope of the address. The default is RT_SCOPE_UNIVERSE (which corresponds to the value 0) and the field is usually set to that value by ifconfig/ip, although a different value can be chosen. The main exception is an address in the range 127.x.x.x, which is given the RT_SCOPE_HOST scope. See Chapter 30 for more details.
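
As a rough illustration of the default just described, the following sketch picks a scope from the address alone. The function demo_default_scope and the DEMO_ constants are hypothetical names; the value 254 for the host scope is taken from the kernel's rtnetlink.h header and is not stated in the text above. The address is assumed to be in host byte order.

```c
#include <stdint.h>

/* Scope values mirroring RT_SCOPE_UNIVERSE and RT_SCOPE_HOST. */
#define DEMO_RT_SCOPE_UNIVERSE 0
#define DEMO_RT_SCOPE_HOST     254

/* Pick the default scope for an IPv4 address given in host byte order. */
unsigned char demo_default_scope(uint32_t addr)
{
    if ((addr >> 24) == 127)          /* 127.x.x.x loopback range */
        return DEMO_RT_SCOPE_HOST;
    return DEMO_RT_SCOPE_UNIVERSE;    /* default for every other address */
}
```

This covers only the default; as noted above, a configuration tool may request a different scope explicitly.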

unsigned char ifa_flags

The possible IFA_F_XXX bit flags are listed in include/linux/rtnetlink.h. Here is the one used by IPv4:

A string used mostly for backward compatibility with 2.0.x kernels that allowed aliased interfaces with names such as eth0:1.

struct rcu_head rcu_head

Used by the RCU mechanism to enforce mutual exclusion. It accomplishes the same job as a lock.

23.8.9 ipv4_devconf Structure

The ipv4_devconf data structure, whose fields are exported via /proc in /proc/sys/net/ipv4/conf/, is used to tune the behavior of a network device. There is an instance for each device, plus one that stores the default values (ipv4_devconf_dflt). The meanings of its fields are covered in Chapters 29 and 36, with the exception of promote_secondaries, which is described in the section "Main Functions That Manipulate IP Addresses and Configuration."

23.8.10 ipv4_config Structure

While ipv4_devconf structures are used to store per-device configuration, ipv4_config stores configuration that applies to the host. Here is a brief description of its fields:

int log_martians

This parameter is also present in the ipv4_devconf structure. It is used to decide whether to print warning messages to the console when specific errors occur. Its value is not checked directly, but via the macro IN_DEV_LOG_MARTIANS, which gives higher priority to the per-device instance.

[*] IPv6 defines its own version of cork in include/linux/ipv6.h.

Here is a brief description of its fields:

unsigned int flags

Currently only one flag used by IPv4 can be set: IPCORK_OPT. When this flag is set, it means there are options in opt.

unsigned int fragsize

Size of the data fragments generated. This includes both payload and L3 header and is normally the PMTU.

struct ip_options *opt

struct page *page

Pointer to the memory page. On i386, the page size is 4 KB. To find the size of a page on any given architecture xxx, look for PAGE_SIZE in include/asm-xxx/page.h.

__u16 page_offset

Offset, relative to the beginning of the page, where the fragment starts.

__u16 size

Size of the fragment.

23.9 Functions and Variables Featured in This Part of the Book

Table 23-3 summarizes the main functions, variables, and data structures introduced or referenced in the chapters of this book covering the IPv4 protocol.

Table 23-3. Functions, variables, and data structures in the IPv4 subsystem

ip_init: Initializes the IPv4 protocol. See the section "IP Options" in Chapter 19.

ip_rcv: Processes ingress IP packets. See the section "Processing Input IP Packets" in Chapter 19.

Deliver an ingress IP packet to the local host. See the section "Local Delivery" in Chapter 20.

ipfrag_init: Initializes the IP Fragmentation/Defragmentation subsystem.

inet_initpeers: Initializes the IP peer subsystem.

secure_ip_id

ip_call_ra_chain: Hands ingress IP packets that carry the Router Alert option to the interested local Raw sockets. See the section "ip_forward function" in Chapter 20.

Add, remove, and manipulate the IP addresses configured on the local devices. See the section "Main functions that manipulate IP addresses and configuration."

for_primary_ifa

for_ifa

Browse the IP addresses configured on a network device.

rtmsg_ifa: Generates notifications about changes to the IP address configuration of local devices. See the section "Change notification: rtmsg_ifa."

Variables

ipv4_devconf

ipv4_devconf_dflt

Store a set of parameters that can be tuned on a per-device basis via the /proc filesystem. See the section "Tuning via /proc filesystem."

ip_frag_mem: Amount of memory held by ingress IP fragments. See the section "Garbage Collection" in Chapter

peer_pool_lock: Lock used for the AVL tree where inet_peer structures are inserted.

inet_peer_unused_lock: Lock used for the list where unused inet_peer structures are inserted.

ip_statistics: Stores statistics about IP traffic. See the section "IP Statistics."

23.10 Files and Directories Featured in This Part of the Book

The net/ipv4 directory contains more files than the ones listed in Figure 23-4, but they are covered in other chapters, including the chapters comprising Parts VI and VII.

Figure 23-4. Files and directories featured in this part of the book

Chapter 24. Layer Four Protocol and Raw IP Handling

This chapter describes the interface between L3 and L4 protocols. The only L3 protocol considered here is IP. The L4 protocols include the familiar TCP, UDP, and ICMP, along with several other ones. The L4 protocols are not covered in this book for reasons of space and complexity. However, this chapter explains what happens when applications handle their own L4 (and sometimes L3) processing through raw IP.

In particular, this chapter explains:

How L4 protocols register with the kernel and tell the kernel what kind of traffic they are interested in.

How ingress packets are passed to the correct L4 protocol handler.

How applications tell the kernel to let the application process headers.

We saw in Chapter 21 the functions that L4 protocols use to transmit an IP datagram. Since this book focuses on IP, this chapter covers only those L4 protocols that sit on top of IP. The chapter describes the IPv4 interface and then briefly shows where IPv6 differs.

Version 3: RFC 3376 (2002)

Protocol Independent Multicast, version 1 (PIMv1) and version 2 (PIMv2): RFC 2362 (1998)

IPsec suite: IP Authentication Header Protocol (AH), IP Encapsulating Security Payload Protocol (ESP), IP Payload Compression Protocol (IPcomp)

AH: RFC 2402 (1998)

ESP: RFC 2406 (1998)

IPcomp: RFC 3173 (2001)

Other protocols are available for the Linux kernel but are either implemented in user space (routing protocols are an example) or are available as kernel patches because they are not yet integrated into the core kernel.

Figure 24-1 shows how the L4 protocols rest on L3 protocols. The three main protocols (ICMP, UDP, and TCP), as well as the IPsec suite, have IPv6 counterparts. There is no IGMPv6 in Figure 24-1 because its functionality is implemented as part of ICMPv6.

Figure 24-1. L4 protocols on top of IPv4 and IPv6 that are implemented in the Linux kernel

Note that the last four items in Table 24-2 are tunneling protocols. Their IDs identify an L3 protocol. For example, the IPIP protocol is used to transport IPv4 datagrams inside IPv4 datagrams. Note that the value assigned to the protocol field of the IPv4 header when it encapsulates an IP datagram has nothing to do with the value used to initialize the protocol field of an Ethernet header when the Ethernet payload is an IP datagram. Even though the two fields refer to the same protocol (IPv4), they belong to two different domains: one is an L3 protocol identifier, whereas the other is an L4 protocol identifier.
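
The separation between the two identifier domains can be made concrete with a small sketch. The numeric values are the ones defined in the kernel headers (ETH_P_IP is 0x0800, IPPROTO_IPIP is 4); the DEMO_ constant names and the two helper functions are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* The same payload type (an IPv4 datagram) is identified by a different
 * number depending on which header announces it. */
#define DEMO_ETH_P_IP      0x0800  /* Ethernet protocol field: payload is IPv4 */
#define DEMO_IPPROTO_IPIP  4       /* IPv4 protocol field: payload is IPv4 (IPIP) */

/* Does an Ethernet frame with this protocol field carry an IPv4 datagram? */
int demo_eth_payload_is_ipv4(uint16_t eth_proto)
{
    return eth_proto == DEMO_ETH_P_IP;
}

/* Does an IPv4 header with this protocol field carry an IPv4 datagram? */
int demo_ip_payload_is_ipv4(uint8_t ip_proto)
{
    return ip_proto == DEMO_IPPROTO_IPIP;
}
```

Passing a value from one domain to the check for the other gives the wrong answer, which is the point of keeping the two number spaces apart.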

24.2 L4 Protocol Registration

The L4 protocols that rest on IPv4 are defined by net_protocol data structures, defined in include/net/protocol.h, which consist of the following three fields:

int (*handler)(struct sk_buff *skb)

Function registered by the protocol as the handler for incoming packets. This is discussed further in the section "L3 to L4 Delivery: ip_local_deliver_finish." It is possible to have protocols that share the same handler for both IPv4 and IPv6 (e.g., SCTP).

void (*err_handler)(struct sk_buff *skb, u32 info)

Function used by the ICMP protocol handler to inform the L4 protocol about the reception of an ICMP UNREACHABLE message. We will see in Chapter 35 when a Linux system generates ICMP UNREACHABLE messages, and we will see in Chapter 25 how the ICMP protocol uses err_handler.

int no_policy

This field is consulted at certain key points in the network stack and is used to exempt protocols from IPsec policy checks: 1 means that there is no need to check the IPsec policies for the protocol. Do not confuse the no_policy field of the net_protocol structure with the field bearing the same name in the ipv4_devconf structure: the former applies to a protocol; the latter applies to a device. See the sections "L3 to L4 Delivery: ip_local_deliver_finish" and "IPsec" for how no_policy is used.

The include/linux/in.h file contains a list of L4 protocols defined as IPPROTO_XXX symbols. (For a more complete list, see the /etc/protocols file, or RFC 1700 and its successor RFCs.) The maximum value for an L4 protocol identifier is 2^8 - 1, or 255, because the field in the IP header allocated to specify the L4 protocol is 8 bits. The highest number, 255, is reserved for Raw IP, IPPROTO_RAW.

Not all of the protocols defined in the list of symbols are handled at the kernel layer; some of them (notably Resource Reservation Protocol, or RSVP, and the various routing protocols) are usually handled in user space. This is, for example, why RSVP and routing protocols like OSPF are not included in the list of L4 protocols supported by the kernel that is in the previous section.

24.2.1 Registration: inet_add_protocol and inet_del_protocol

currently it is a simple flat array with one item for each of the possible 256 protocols. The protocol number from /etc/protocols is the slot in the table where the protocol is inserted. If you'd like to see how the table was handled as a hash table in the 2.4 kernel, look in the 2.4 sources at the ip_run_ipprot function. Figure 24-2 shows the numbers and initials of the most common protocols; for instance, ICMP is protocol 1 and occupies slot 1 in the inet_protos table.
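
The flat-array organization can be sketched in user-space C as follows. This is a simplified illustration, not the kernel code: struct demo_protocol, demo_protos, and the demo_ functions are hypothetical names, the handler takes a raw buffer instead of an sk_buff, and the locking discussed later in this section is omitted entirely.

```c
#include <stddef.h>

#define DEMO_MAX_PROTOS 256   /* one slot per possible 8-bit protocol number */

/* Hypothetical, simplified stand-in for a protocol descriptor. */
struct demo_protocol {
    int (*handler)(const unsigned char *payload, size_t len);
};

/* The table: the protocol number is used directly as the array index. */
static const struct demo_protocol *demo_protos[DEMO_MAX_PROTOS];

/* Register prot in the slot given by its protocol number; -1 if taken. */
int demo_add_protocol(const struct demo_protocol *prot, unsigned char num)
{
    if (demo_protos[num] != NULL)
        return -1;
    demo_protos[num] = prot;
    return 0;
}

/* Unregister prot; -1 if the slot does not currently hold it. */
int demo_del_protocol(const struct demo_protocol *prot, unsigned char num)
{
    if (demo_protos[num] != prot)
        return -1;
    demo_protos[num] = NULL;
    return 0;
}

/* Look up the handler for a protocol number and run it; -1 if none. */
int demo_deliver(unsigned char num, const unsigned char *payload, size_t len)
{
    const struct demo_protocol *prot = demo_protos[num];

    if (prot == NULL)
        return -1;
    return prot->handler(payload, len);
}

/* Example handler for the sketch: reports how many bytes it was given. */
int demo_icmp_handler(const unsigned char *payload, size_t len)
{
    (void)payload;
    return (int)len;
}
```

Because the index type is an 8-bit unsigned value, every possible protocol number maps to a valid slot and no bounds check is needed in this sketch.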

Figure 24-2. IPv4 protocol table

Concurrent accesses to the inet_protos table are managed in this way:

Read-write accesses (i.e., inet_add_protocol and inet_del_protocol) are serialized with the inet_proto_lock spin lock.

Read-only accesses (i.e., ip_local_deliver_finish; see the next section) are protected with rcu_read_lock/rcu_read_unlock.

inet_del_protocol, which may remove an entry of the table currently held by an RCU reader, calls synchronize_net to wait for all the currently executing RCU readers to complete their critical section before returning. There is another hash table used by protocols that rest on IPv6. Note that IPv6 appears in the IPv4 inet_protos table as well: the kernel can tunnel IPv6 over IPv4 (also called SIT, for Simple Internet Transition). See the section "IPv6 Versus IPv4."

As mentioned in the previous section, the ICMP, UDP, and TCP protocols are always part of the kernel and therefore are statically added to the hash table at boot time by inet_init in net/ipv4/af_inet.c. The following excerpts show the definitions of their structures and the actual inet_add_protocol calls that register them:

The IGMP handler is registered only when the kernel is compiled with support for IP multicast.

As an example of how other protocols are dynamically registered, the following snapshot is taken from the Zebra user-space routing daemon's implementation of the Open Shortest Path First IGP (OSPFIGP) protocol. The code is taken from the ospfd/ospf_network.c file in the Zebra package. The socket call effectively registers the user-space daemon with the kernel, giving the kernel a place to send ingress packets that use the protocol specified in the third argument. This protocol is IPPROTO_OSPFIGP, a symbol equal to 89, the
