Understanding Linux Network Internals (2005), Part 7


Marks a neighbor as unreachable because of a failed solicitation request, either the one generated when the entry was created or the one triggered by the NUD_PROBE state.

NUD_PERMANENT

The L2 address of the neighbor has been statically configured (i.e., with user-space commands) and therefore there is no need to use any neighboring protocol to take care of it. See the section "System Administration of Neighbors" in Chapter 29.


NUD_PERMANENT NUD_NOARP NUD_REACHABLE NUD_PROBE NUD_STALE NUD_DELAY

NUD_CONNECTED

This is used for the subset of NUD_VALID states that do not have a confirmation process pending:

NUD_PERMANENT NUD_NOARP NUD_REACHABLE

NUD_IN_TIMER

The neighboring subsystem is running a timer for this entry, which happens when the status is unclear. The basic states that correspond to this are:

NUD_INCOMPLETE NUD_DELAY NUD_PROBE

Let's look at an example of why a derived state is useful in kernel code. When a neighbor instance is removed, the host needs to stop all the pending timers associated with that data structure. Instead of comparing the neighbor's state to the three states known to have a pending timer associated with them, it is just cleaner to define NUD_IN_TIMER and compare the neighbor's state against it using the bitwise operator &.
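The derived-state test can be sketched in a few lines of C. This is a minimal user-space illustration, not kernel code: the numeric flag values are assumptions chosen so that each basic state occupies one bit, the derived states are built from the lists given in the text, and nud_has_pending_timer is a hypothetical helper name.

```c
#include <stdbool.h>

/* Basic NUD states, one bit each. The numeric values are assumed for
 * illustration; only the one-bit-per-state layout matters here. */
enum {
    NUD_NONE       = 0x00,
    NUD_INCOMPLETE = 0x01,
    NUD_REACHABLE  = 0x02,
    NUD_STALE      = 0x04,
    NUD_DELAY      = 0x08,
    NUD_PROBE      = 0x10,
    NUD_FAILED     = 0x20,
    NUD_NOARP      = 0x40,
    NUD_PERMANENT  = 0x80
};

/* Derived states, following the lists in the text. */
#define NUD_IN_TIMER  (NUD_INCOMPLETE | NUD_DELAY | NUD_PROBE)
#define NUD_CONNECTED (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)

/* Hypothetical helper: one bitwise AND replaces three comparisons. */
bool nud_has_pending_timer(unsigned int nud_state)
{
    return (nud_state & NUD_IN_TIMER) != 0;
}
```

Because every basic state is a distinct bit, adding a state to a derived group later only changes the macro, not the callers.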

time. This is called reachability confirmation.

Note that a change in reachability status is not necessarily due to the reasons listed in the section "Reasons That Neighboring Protocols Are Needed"; a router, bridge, or other network device may just be experiencing some problems. While the reachability confirmation is in progress, the cached information is temporarily used under the assumption that it is most likely still valid.

The three NUD states NUD_STALE, NUD_DELAY, and NUD_PROBE support the task of reachability confirmation. The key reason for the use of these states is that there is no need to start a reachability confirmation process until a packet needs to be sent to the associated neighbor.

Let's define once again the exact meaning of these three NUD states, and then look at the two ways a mapping can be confirmed:

NUD_STALE

The cache contains the address of the neighbor, but the latter has not been confirmed for a certain amount of time (see the discussion of reachable_time in the section "neigh_parms Structure" in Chapter 29). The next time a packet is sent to the neighbor, the reachability verification process will be started.

This state gives some time to the upper network layers to provide a reachability confirmation, which may relieve the kernel from sending a solicitation request and thus save both bandwidth and CPU usage. This state may look like a small optimization, but if you think in terms of big networks, you can imagine the gain it can provide.

If no confirmation is received, the entry is put into the next state, NUD_PROBE, which resolves the status of the neighbor through explicit solicitation requests or whatever other mechanism a protocol might use.

Confirmation from a unicast solicitation's reply

When your host receives a solicitation reply in answer to a solicitation request it previously sent out, it means that the neighbor received the request and was able to send back a reply; this in turn means that either it already had your L2 address or it learned your address from your request (see the section "Creating a neighbour Entry" in Chapter 27). It also means that there is a working path in both directions. Note, however, that this is true only when the solicitation's reply is sent as a unicast packet. The reception of a broadcast reply would move the state to NUD_STALE rather than NUD_REACHABLE. (You can find more discussion of this from the standpoint of ARP in the section "Processing Ingress ARP Packets" in Chapter 28.)

External confirmation

If your host is sure it received a packet from the neighbor in response to something previously sent, it can assume the neighbor is still reachable. Figure 26-14 shows an example, where the TCP layer of Host A confirms the reachability of Host B when it receives a SYN/ACK in reply to its SYN. Note that if Host B was not a neighbor of Host A, the reception of the SYN/ACK from Host B would confirm the reachability of the next hop gateway used by Host A to reach Host B.

Figure 26-14 Example of external neighbor reachability confirmation


Confirmation is done via dst_confirm, which confirms the validity of the routing table cache entry used to route the SYN packet toward Host B. dst_confirm is a simple wrapper around neigh_confirm, which accomplishes the task we described earlier: it confirms the reachability of the neighbor and therefore the L3-to-L2 mapping. Note that neigh_confirm only updates the neigh->confirmed timestamp; it will be the neigh_timer_handler function (which is executed by the expiration of the timer started when the neighbor entered the NUD_DELAY state) that actually upgrades the neighbor entry's state to NUD_REACHABLE.[*]

[*] The delay between the reception of the confirmation from the L4 layer and the setting of the state to NUD_REACHABLE does not affect traffic in any way.
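The split between recording a confirmation and acting on it can be modeled as follows. This is a simplified sketch under stated assumptions, not the kernel's implementation: the struct, the delay_start field, and both function names are hypothetical, and the timer is reduced to a single function called when the delay period ends.

```c
/* Simplified states for the model; names are hypothetical. */
enum model_state { ST_STALE, ST_DELAY, ST_PROBE, ST_REACHABLE };

struct neigh_model {
    enum model_state state;
    unsigned long confirmed;    /* time of the last confirmation */
    unsigned long delay_start;  /* assumed field: when the delay state began */
};

/* Plays the role the text gives to neigh_confirm: it records the time of
 * the confirmation and deliberately leaves the state untouched. */
void model_confirm(struct neigh_model *n, unsigned long now)
{
    n->confirmed = now;
}

/* Plays the role of the timer handler: when the delay timer expires, a
 * confirmation received since the delay began upgrades the entry;
 * otherwise explicit probing starts. */
void model_delay_timer_expired(struct neigh_model *n)
{
    if (n->state != ST_DELAY)
        return;
    if ((long)(n->confirmed - n->delay_start) >= 0)
        n->state = ST_REACHABLE;
    else
        n->state = ST_PROBE;
}
```

Keeping the confirmation path down to a single timestamp store makes it cheap enough to call for every confirming packet; the state change is deferred to the timer.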

Note that the correlation between the two packets in Figure 26-14 could not be performed at the IP layer because the latter doesn't have any knowledge of data streams. This is why the L4 layer takes care of the confirmation. TCP SYN/ACK exchanges are only one example of an L4 protocol providing external confirmation. Given a socket, and therefore the associated routing cache entry and its next-hop gateway, a user-space application can confirm the reachability of the gateway by using the MSG_CONFIRM option with transmission calls such as send and sendmsg.
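From user space, the confirmation is just a flag on the transmission call. The sketch below assumes a Linux system and a datagram socket; send_confirmed is a hypothetical wrapper name, and the application is assumed to set the flag only when it has actually received a reply from the peer.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical wrapper: send one datagram and pass MSG_CONFIRM, telling
 * the kernel that forward progress happened (a reply from the peer was
 * seen), so the neighbor used to reach dst can be considered confirmed.
 * Returns the number of bytes sent, or -1 with errno set. */
ssize_t send_confirmed(int fd, const void *buf, size_t len,
                       const struct sockaddr_in *dst)
{
    return sendto(fd, buf, len, MSG_CONFIRM,
                  (const struct sockaddr *)dst, sizeof(*dst));
}
```

A request/response protocol over UDP could, for example, use this wrapper for every request that follows a successfully received response, and plain sendto otherwise.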

While the reception of a solicitation's reply can move the state to NUD_REACHABLE regardless of the current state, external confirmations can be used only when the current state is NUD_STALE. This means that if the entry had just been created and it was in the NUD_INCOMPLETE state, external confirmations would not be allowed to confirm the reachability of the neighbor (see Figure 26-13).

Note that NUD_DELAY/NUD_PROBE and NUD_NONE can lead to NUD_REACHABLE, as shown in Figure 26-13; however, to get from NUD_NONE to NUD_REACHABLE, you need full proof of reachability, while from NUD_DELAY/NUD_PROBE, any kind of confirmation is sufficient.


Chapter 27 Neighboring Subsystem: Infrastructure

In Chapter 26, we saw the main problems that the neighboring protocols are asked to solve. You also learned that the Linux kernel abstracted out parts of the solution into a common infrastructure shared by various neighboring protocols. In this chapter, we will see how the infrastructure is designed. In particular, we will see how protocols interface to the common infrastructure, how caching and proxying are implemented, and how external subsystems such as higher-layer protocols notify the neighboring protocols about interesting events. We will conclude the chapter with a description of how L3 protocols such as IPv4 actually interface with their neighboring protocols, and how queuing is implemented for buffers awaiting address resolution.


27.1 Main Data Structures

struct neigh_ops

A set of functions that represents the interface between the L3 protocols such as IP and dev_queue_xmit, the API introduced in Chapter 11 and described briefly in the upcoming section "Common Interface Between L3 Protocols and Neighboring Protocols." The virtual functions can change based on the context in which they are used (that is, on the status of the neighbor, as described


When a host needs to route a packet, it first consults its cache and then, in the case of a cache miss, it queries the routing table. Every time the host queries the routing table, the result is saved into the cache. The IPv4 routing cache is composed of rtable structures. Each instance is associated with a different destination IP address. Among the fields of the rtable structure are the destination address, the next hop (router), and a structure of type dst_entry that is used to store the protocol-independent information. dst_entry includes a pointer to the neighbour structure associated with the next hop. I cover the dst_entry data structure in detail in Chapter 36. In the rest of this chapter, I will often refer to dst_entry structures as elements of the routing table cache, even though dst_entry is actually only a field of the rtable structure.

Figure 27-1 shows how dst_entry structures are linked to hh_cache and neighbour structures.

The neighboring code also uses some other small data structures. For instance, struct pneigh_entry is used by destination-based proxying, and struct neigh_statistics is used to collect statistics about neighboring protocols. The first structure is described in the section "Acting As a Proxy," and the second one is described in the section "Statistics" in Chapter 29. Figure 27-2 also includes the following data structure types, described in greater detail in Chapters 22 and 23:

Figure 27-1 Relationship among dst_entry, neighbour, and hh_cache structures

in_device, inet6_dev

Used to store the IPv4 and IPv6 configurations of a device, respectively.

net_device

There is one net_device structure for each network device recognized by the kernel. See Chapter 8.

Figure 27-2 shows the relationships between the most important data structures. Right now it might seem a big mess, but it will make much more sense by the end of this chapter.

Here are the main points shown in Figure 27-2:

In the central part of the figure, you can see that each network device has a pointer to a data structure that holds the configuration for each L3 protocol configured on the device. In the example shown in the figure, IPv6 is configured on one device and IPv4 is configured on both. Both the in_device structure (IPv4 configuration) and inet6_dev structure (IPv6 configuration) include a pointer to the configuration used by their neighboring protocols, respectively ARP and ND.


All of the neigh_parms structures used by any given protocol are linked together in a unidirectional list whose root is stored in the protocol's neigh_table structure.

The top and bottom of the figure show that each protocol keeps two hash tables. The first one, hash_buckets, caches the L3-to-L2 mappings resolved by the protocol or statically configured. The second one, phash_buckets, stores those IP addresses that are proxied, as described in the section "Per-Device Proxying and Per-Destination Proxying." Note that phash_buckets is not a cache, so its elements do not expire and don't need confirmation. Each pneigh_entry structure includes a pointer (not depicted in Figure 27-2) to its associated net_device structure. Figure 27-6 gives more detail on the structure of the cache hash_buckets.

Figure 27-2 Data structures' relationships

Each neighbour instance is associated with one or more hh_cache structures, if the device supports header caching. The section "L2 Header Caching," and Figures 27-1 and 27-10, give more details about the relationship between neighbour and hh_cache structures.


27.2 Common Interface Between L3 Protocols and Neighboring Protocols

The Linux kernel has a generic neighboring layer that connects L3 protocols to the main L2 transmit function (dev_queue_xmit) via a virtual function table (VFT). A VFT is the mechanism frequently used in the Linux kernel for allowing subsystems to use different functions at different times. The VFT for the neighboring subsystem is implemented as a data structure named neigh_ops. A pointer to one of these structures is embedded as a field named ops in each neighbour structure.

The flexibility of the VFT interface allows different L3 protocols to use different neighboring protocols. This in turn allows different neighboring protocols to behave quite differently while allowing the neighboring subsystem to provide a common generic interface between the neighboring protocols and the L3 protocols.

In this section, we examine the VFT-based interface between the L3 protocols and the neighboring protocols, the advantages of using the VFT, when it is first initialized, and how it is updated during the lifetime of a neighbor. The section concludes with a brief overview of the functions used to control the initialization of the VFT. To better understand this section, you are invited to first read the section "neigh_ops Structure" in Chapter 29.

Let's start with an overview of how the routines in the VFT are invoked. Given a neighbour instance and its embedded VFT neighbour->ops, the function to which the output field points could in theory be invoked directly like this:

neigh->ops->output

But this construct is not found in the Linux code because even this is not general enough. The function in the output field of the neigh_ops structure is only one of four functions that perform similar tasks, each function having its own field in neigh_ops. The individual protocol has to decide which of the four functions to use. The proper function depends on events, the context, and the configuration of the interface and device. So, to leave the neighboring infrastructure protocol-independent, the neighbour structure contains its own output field. The individual protocol assigns the proper function from one of the fields in neigh->ops to neigh->output. This allows the code to be simpler and clearer. For instance, instead of doing:

if (neighbour is not reachable)

as long as neigh->output has been initialized by the protocol to the right neigh_ops method. Of course, each neighboring protocol uses its own logic to initialize neigh->output; it does not necessarily have to follow the rules in this snapshot.
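The idea of caching the chosen VFT method in the neighbor itself can be sketched as follows. All type and function names here are hypothetical stand-ins (the real structures carry many more fields), and the two transmit routines only return a marker value so that the dispatch is observable.

```c
struct skb_model { int len; };          /* stand-in for a packet buffer */
struct vneigh;                          /* stand-in for struct neighbour */

typedef int (*xmit_fn)(struct vneigh *, struct skb_model *);

struct vneigh_ops {                     /* stand-in for struct neigh_ops */
    xmit_fn output;                     /* slower path: resolve or verify first */
    xmit_fn connected_output;           /* faster path: address known to be good */
};

struct vneigh {
    const struct vneigh_ops *ops;       /* fixed for the neighbor's lifetime */
    xmit_fn output;                     /* currently selected method */
};

int slow_output(struct vneigh *n, struct skb_model *skb)
{
    (void)n; (void)skb;
    return 1;                           /* marker: slow path taken */
}

int fast_output(struct vneigh *n, struct skb_model *skb)
{
    (void)n; (void)skb;
    return 2;                           /* marker: fast path taken */
}

const struct vneigh_ops generic_ops_model = { slow_output, fast_output };

/* The caller never tests the neighbor's state: it makes one indirect
 * call through whatever the protocol last stored in n->output. */
int neigh_xmit_model(struct vneigh *n, struct skb_model *skb)
{
    return n->output(n, skb);
}
```

The state test is paid once, when the protocol reassigns the pointer, rather than on every transmitted packet.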

When a neighbor is created, its neighbour->ops field is initialized to the proper neigh_ops structure, as shown in Figure 27-3(a). This assignment does not change during the neighbor's lifetime. However, as depicted in Figure 27-3(b), neigh->output can be changed to different functions many times during the lifetime of the neighbor structure, driven both by events that take place during protocol operation, and (much less often) by user commands. The following sections will go into detail on both initializations shown in Figure 27-3.


Figure 27-3 (a) Initialization of neigh->ops; (b) initialization of neigh->output

27.2.1 Initialization of neigh->ops

On certain types of devices, the initialization of the functions listed in Figure 27-3(b) could be further optimized to speed up transmissions. These include, for instance, the situations described in the section "Special Cases" in Chapter 26, where there is no need to map an L3 address to an L2 address. In those cases, the neighboring subsystem can almost be bypassed altogether and only the queue_xmit function described in Chapter 11 is needed. The protocol code needs to know this kind of detail, but the general neighboring infrastructure does not, so the protocol can just initialize neigh->output to neigh->ops->queue_xmit and everything remains transparent to the upper layers. Simple!

For this reason, each protocol provides for three different instances of the neigh_ops VFT:

A generic table that can be used in any context (xxx_generic_ops). This is the one that is normally used to handle neighbors whose L2 addresses need to be resolved.

An optimized set of functions that can be used when the device driver provides its own set of functions to manipulate L2 headers and thus take advantage of the speedup coming from the use of cached headers (xxx_hh_ops).

A table that can be used when the device does not need to map L3 addresses to L2 addresses (xxx_direct_ops). An example is the use of ISDN with raw IP encapsulation.

When the neighbor instance is created, the protocol initializes the neigh_ops VFT to the right instance depending on several factors. See the section "neigh_ops Structure" in Chapter 29.

In the specific case of IPv4/ARP, a fourth instance of neigh_ops called arp_broken_ops is used to initialize those neighbour instances associated with old devices that have not been adapted to the new neighboring infrastructure and therefore would not work otherwise. This once again shows how generic the neighboring infrastructure is: by initializing the neigh_ops VFT in the right way, the kernel is even able to use the old ARP code.

27.2.2 Initialization of neigh->output and neigh->nud_state

The state of a neighbor (neigh->nud_state) and the neigh->output function depend on each other. When nud_state changes, output often has to be updated accordingly. As a simple example, if the state becomes stale, confirmation of reachability is required. But the neighboring infrastructure doesn't waste time confirming reachability right away; there might be no further traffic and the effort might be wasted. Instead, the neighboring infrastructure stops using the optimized output function that blindly plugs in the current address, and switches to the slower output function that checks the address. In the example in Figure 27-3(a), we would change connected_output from c1 to o1.

For help in understanding this section, check Figure 26-13 in Chapter 26 for the possible states that neigh->nud_state can assume, based on device type and protocol events.

The neighboring subsystem provides a generic routine, neigh_update, that moves a neighbor to the state provided as an input argument. A later section in this chapter describes neigh_update in detail, but let's first look at the most common changes of state and the helper routines that can be called, either directly or via neigh_update, to take care of them.

Let's start with the most common case: a device that needs a neighboring protocol, an address that does not belong to any of the special cases described in Chapter 26, and a change of state caused by a transition (that is, we exclude creation and deletion).[*] Figure 26-12 in Chapter 26 can then be simplified to produce Figure 27-4. The figure also shows the kernel functions where the transitions are handled. However, not all of the transitions made by calls to neigh_update are shown, because most are too generic to add any value to the figure; only the transition triggered by the reception of a solicitation reply is shown.

[*] For the first initialization of neigh->output, check the source code of the constructor routines (e.g., arp_constructor/ndisc_constructor for ARP/ND). For ARP, see the section "Initialization of a neighbour Structure" in Chapter 28.

Figure 27-4 Possible state transitions for a neighbor that has been resolved at least once


Note that some of the transitions in Figure 27-4 are asynchronous: they are taken care of by a timer and are therefore triggered by timestamp comparisons.[*] Other transitions are taken care of synchronously by the protocols (e.g., neigh_event_send[†]).

[*] The routines used to compare timestamps, such as time_after_eq and time_before_eq, are defined in include/linux/jiffies.h.

[†] Part of neigh_event_send is also depicted in Figure 27-13 as part of the expanded neigh_resolve_output flowchart.

27.2.2.1 Common state changes: neigh_connect and neigh_suspect

The main ways a neighbor can enter the NUD_REACHABLE state (all described in Chapter 26) are:

Reception of a solicitation reply

When a solicitation reply is received, either to resolve a mapping for the first time or to confirm a neighbor in the NUD_PROBE state, the protocol updates neigh->nud_state via neigh_update. This update is synchronous and happens right away.

L4 confirmation

The first time neigh_timer_handler is executed after the reception of an L4 reachability confirmation, the state is changed to NUD_REACHABLE (see the section "Reachability Confirmation" in Chapter 26). An L4 confirmation is asynchronous and may be slightly delayed.

Manual configuration

When a new neighbour structure is created by the user through a system administration command, this command can specify the state, and NUD_REACHABLE is a valid state. In this case, neigh_connect is invoked via neigh_update.

Whenever the NUD_REACHABLE state is entered, the neighboring infrastructure calls the neigh_connect function to make the neigh->output function point to neigh_ops->connected_output.

When a neighbor in the NUD_REACHABLE state moves to NUD_STALE or NUD_DELAY, or is simply initialized to a state different from one of the states in NUD_CONNECTED (for example, by a call to neigh_update), the kernel invokes neigh_suspect to enforce confirmation of reachability (see the section "Reachability Confirmation" in Chapter 26). neigh_suspect does this by setting neighbour->output to neigh_ops->output.

Both neigh_connect and neigh_suspect also update the neighbour->output and neighbour->hh_output functions of all of the hh_cache structures linked to the input neighbour instance (see Figure 27-1). Neither function, however, updates the NUD state of a neighbour instance, because that is already taken care of by their callers. Later in this chapter I'll use the forms "connect the neighbor" and "suspect the neighbor" to refer to the invocation of neigh_connect and neigh_suspect, respectively, for that neighbor.
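A minimal model of "connecting" and "suspecting" a neighbor is sketched below. It is an illustration under assumptions, not the kernel source: the structures are reduced to the pointers involved, the model_ function names are hypothetical, and the choice of which method each cached header receives (the hh_output method when connected, the generic output method when suspected) is an assumption about how the header-cache update works.

```c
#include <stddef.h>

struct nb;                               /* stand-in for struct neighbour */
typedef int (*out_fn)(struct nb *);

struct nb_ops {                          /* stand-in for struct neigh_ops */
    out_fn output;
    out_fn connected_output;
    out_fn hh_output;
};

struct hh {                              /* stand-in for struct hh_cache */
    struct hh *next;
    out_fn hh_output;
};

struct nb {
    const struct nb_ops *ops;
    out_fn output;
    struct hh *hh_list;                  /* cached headers linked to this neighbor */
};

/* Marker routines so the selected path is observable. */
int slow_path(struct nb *n)   { (void)n; return 1; }
int fast_path(struct nb *n)   { (void)n; return 2; }
int cached_path(struct nb *n) { (void)n; return 3; }

/* Switch to the checking path; the NUD state itself is left to the caller. */
void model_neigh_suspect(struct nb *n)
{
    struct hh *h;

    n->output = n->ops->output;
    for (h = n->hh_list; h != NULL; h = h->next)
        h->hh_output = n->ops->output;
}

/* Switch to the optimized path; again, no state change here. */
void model_neigh_connect(struct nb *n)
{
    struct hh *h;

    n->output = n->ops->connected_output;
    for (h = n->hh_list; h != NULL; h = h->next)
        h->hh_output = n->ops->hh_output;
}
```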

Some transitions (changes of NUD state) can happen at any time and more than once during the lifetime of a neighbour instance. Others can take place only once. With some knowledge of networking, it is not hard to look at Figure 26-13 in Chapter 26 and identify the transitions that belong to each of the two categories. For those neighbour instances initialized to permanent states (for instance, NUD_NOARP), neigh->output can be initialized to neigh_ops->connected_output right away and it will never change.

27.2.2.2 Routines used for neigh->output

As explained in the previous section, neigh->output is initialized by the neighbor's constructor function, and later is manipulated as a consequence of protocol events via the two routines neigh_connect and neigh_suspect. neigh->output is always set to one of the virtual functions of neigh_ops. This section lists the functions that can be assigned to the neigh_ops virtual functions. The dev_queue_xmit function, which is not really part of the neighboring subsystem, is defined in net/core/dev.c. The other routines are defined in net/core/neighbour.c.

dev_queue_xmit

The L3 layer always calls this function when transmitting a packet, regardless of the kind of device or L2 and L3 protocols used.

A neighboring protocol initializes the function pointers of neigh_ops to dev_queue_xmit when all the information needed to transmit on the egress device is present and there is no extra work for the neighboring subsystem to do. If you look at arp_direct_ops in Chapter 28, you can see that all four transmission virtual functions are set to dev_queue_xmit. That function is described in Chapter 11.

neigh_connected_output

This function just fills in the L2 header and then calls neigh_ops->queue_xmit. Therefore, it expects the L2 address to be resolved. It is used by neighbour structures in the NUD_CONNECTED state.

neigh_resolve_output

This function resolves the L3 address to the L2 address before transmitting, so it is used when that association is not ready yet or needs to be confirmed. Except for the situations in the section "Special Cases" in Chapter 26, neigh_resolve_output is usually the default routine used when a new neighbour structure is created and its L3 address needs to be resolved.

neigh_compat_output

This function is present for backward compatibility. Before the neighboring infrastructure was introduced, it was possible to call dev_queue_xmit even if the L2 address was not ready yet.

neigh_blackhole

This function is used to handle the temporary case where a neighbour structure cannot be removed because someone is still holding a reference to it. neigh_blackhole discards any packet received in input. This is necessary to ensure that no attempt to transmit a packet to the neighbor will take place, because the neighbor's data structures are about to be removed. See the section "Neighbor Deletion."

The section "Initialization of a neighbour Structure" in Chapter 28 shows how ARP uses these functions to initialize the different instances of the neigh_ops VFT. The choices made by the functions are also shown in the flowchart in Figure 27-13.

27.2.3 Updating a Neighbor's Information: neigh_update

neigh_update, defined in net/core/neighbour.c, is a generic function that can be used to update the link layer address of a neighbour structure. This is its prototype, with a brief description of the input parameters:

int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new, u32 flags)

neigh

Pointer to the neighbour structure to update.

lladdr

New link layer (L2) address. lladdr may not always be initialized to a new value. For instance, when neigh_update is called to delete a neighbour structure (by setting its state to NUD_FAILED, as described in the section "Neighbor Deletion"), it is passed a NULL value for

The current L2 address can be overridden by lladdr. Administrative changes use this flag to distinguish between replace and add commands, among other things (see Table 29-1 in Chapter 29). Protocol code can use this flag to enforce a minimum lifetime for an L2 address (see, for example, the section "Final Common Processing" in Chapter 28).

The next three flags are used only by IPv6 code:

If the link layer address lladdr supplied in input differs from the current known link layer address of the neighbor neigh->ha, the address is suspected (i.e., its state is moved to NUD_STALE so that reachability confirmation is triggered).

IPv6's ND protocol uses flags in the protocol header that can influence the setting of the NEIGH_UPDATE_F_XXX flags just listed. The discussion that follows skips over the parts of neigh_update that deal with the IPv6-only flags.

neigh_update is used by all of the administrative interfaces to change the link layer address of a neighbour structure, as shown in Figure 29-1 in Chapter 29. The function can also be used by the neighboring protocols themselves, but it is not the only function that changes state.

Figures 27-5(a) and 27-5(b) show a high-level description of neigh_update's internals. The flowchart is divided into different areas, each area taking care of a different task:

Sanity checks

Changes applied to a neighbor whose current state is not NUD_VALID

Selection of the L2 address to use for a change applied to a neighbor whose current state is NUD_VALID

Setting a new link layer address

Change of NUD state

Handling an arp_queue queue

The following subsections explain the code in detail.

27.2.3.1 neigh_update optimization

Before changing the state of a neighbor, neigh_update first checks to see whether it is possible to avoid the change. An optimization discards the change of state if both of the following conditions are met (see (c)):

The link layer address has not been modified (that is, the input lladdr is the same as the current neigh->ha).

The new state is NUD_STALE and the current one is NUD_CONNECTED, which means that the current state is actually better than the new one.
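The two conditions above can be expressed as a single predicate. This is a sketch, not the code of neigh_update: the flag values are assumed, NUD_CONNECTED is built from the list given in Chapter 26, and update_is_redundant is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed one-bit-per-state values, for illustration only. */
enum {
    NUD_REACHABLE = 0x02,
    NUD_STALE     = 0x04,
    NUD_NOARP     = 0x40,
    NUD_PERMANENT = 0x80
};
#define NUD_CONNECTED (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)

/* Hypothetical predicate: true when the requested change can be dropped
 * because the address is unchanged and the entry would only be
 * downgraded from a connected state to NUD_STALE. */
bool update_is_redundant(const unsigned char *cur_ha,
                         const unsigned char *lladdr, size_t addr_len,
                         unsigned int old_state, unsigned int new_state)
{
    bool same_address = memcmp(cur_ha, lladdr, addr_len) == 0;

    return same_address &&
           new_state == NUD_STALE &&
           (old_state & NUD_CONNECTED) != 0;
}
```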


Figure 27-5a neigh_update function

27.2.3.2 Initial neigh_update operations

In this section, we trace the decisions made by neigh_update as it handles various values for the current state (neighbour->nud_state) and the requested state (the new parameter).

Figure 27-5b neigh_update function


Only administrative commands (NEIGH_UPDATE_F_ADMIN) can change the state of a neighbor that is currently in the NUD_NOARP or NUD_PERMANENT state. A sanity check at the beginning of neigh_update causes it to exit right away if these constraints are violated.
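The sanity check amounts to one test on the current state and the flags. In this sketch the state values and the MODEL_UPDATE_F_ADMIN value are assumptions for illustration, and update_allowed is a hypothetical name; the real function reports the violation through its return value in its own way.

```c
/* Assumed one-bit-per-state values, for illustration only. */
enum {
    NUD_REACHABLE = 0x02,
    NUD_NOARP     = 0x40,
    NUD_PERMANENT = 0x80
};

/* Hypothetical stand-in for the administrative flag. */
#define MODEL_UPDATE_F_ADMIN 0x80000000u

/* Returns 1 when the update may proceed, 0 when it must be rejected
 * because a non-administrative caller tried to modify an entry in the
 * NUD_NOARP or NUD_PERMANENT state. */
int update_allowed(unsigned int old_state, unsigned int flags)
{
    if ((old_state & (NUD_NOARP | NUD_PERMANENT)) != 0 &&
        (flags & MODEL_UPDATE_F_ADMIN) == 0)
        return 0;
    return 1;
}
```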

When the new state new is not a valid one (that is, if it is NUD_NONE or NUD_INCOMPLETE), the neighbor timer is stopped if it is running, and the entry is marked suspect (that is, requiring reachability confirmation) through neigh_suspect if the old state was NUD_CONNECTED. See the section "Initialization of neigh->output and neigh->nud_state." When the new state is a valid one, the neighbor timer is restarted if the new state requires it (NUD_IN_TIMER).

When neigh_update is asked to change the NUD state to a value different from the current one, which is normally the case, it needs to check whether the state is changing from a value included in NUD_VALID to another value not in NUD_VALID (remember that NUD_VALID is a derived state that includes multiple NUD_XXX values). In particular, when the old state was not NUD_VALID and the new one is NUD_VALID, the host has to transmit all of the packets that are waiting in the neighbor's arp_queue queue. Since the state of the neighbor could change while doing this (because the host may be a symmetric multiprocessing, or SMP, system), the state of the neighbor is rechecked before sending each packet.

27.2.3.3 Changes of link layer address

The reason for calling neigh_update is to change the NUD state, but it can also change the destination link layer address by which a neighbor is reached. The function will do this if a new link layer address is provided (that is, if the lladdr parameter is not NULL) and if the input parameter flags allows it. When the link layer address is changed, all of the cached headers need to be updated accordingly. This is taken care of by neigh_update_hhs.

When no link layer address is supplied to neigh_update (i.e., lladdr is NULL), and the current NUD state is not a valid one, neigh_update discards the input frame skb and returns with an error (no change of state is applied if there is no valid link layer address for the neighbor).

27.2.3.4 Notifications to arpd

Some sites with large networks choose to manage ARP requests through a user-space daemon called arpd instead of making the kernel do it. When the kernel is compiled with support for arpd, and its use is configured (that is, app_probes > 0), neigh_update notifies the daemon about the following events:[*]

[*] See the section "ARPD" in Chapter 28, and the section "neigh_parms Structure" in Chapter 29.

When a state is modified from NUD_VALID to a state that is not valid

When the link layer address is changed


27.3 General Tasks of the Neighboring Infrastructure

This section describes a few general concepts that you should be familiar with before delving into specific functions within the neighboring infrastructure: caching, reference counting, and timers.

27.3.1 Caching

The neighboring layer implements two kinds of caching:

Neighbor mappings

As with any other kind of data that can be used multiple times, it makes sense to cache the results of the L3-to-L2 mappings. Negative results (where an attempt to resolve the address failed) are not cached. But the neighbour structures associated with failed mappings are set to the NUD_FAILED state so that the garbage collection timer can clean them up (see the section "Garbage Collection").

L2 headers

The neighboring infrastructure caches L2 headers to speed up the encapsulation of an L3 packet into an L2 frame. Otherwise, the infrastructure would have to initialize each field of the L2 header one by one.

Because the caching of neighbor mappings is central to the operation of the neighboring subsystem, this section describes it in detail. (The later section "L2 Header Caching" describes L2 header caching.) The contents of a neighbour structure are described in the section "neighbour Structure" in Chapter 29, and the structure's creation and deletion are described in later sections in this chapter. Here we will stay at a higher level, describing how those structures are organized and accessed by the neighboring infrastructure.

The neighboring infrastructure places neighbour structures into caches, one per protocol, which are implemented as typical hash tables where elements that collide into the same bucket are linked into a singly linked list. New elements are added at the head of the lists (see the function neigh_create in the section "The neigh_create Function's Parameters"). The inputs to the hash function that distributes elements into buckets are the L3 address, the associated device, and a random value that is recomputed regularly to reduce the effectiveness of a hypothetical Denial of Service (DoS) attack. Figure 27-6 shows the structure of the cache. In Figure 27-2, you can see its relationship to other key data structures, such as the per-protocol neigh_table structure.
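As a rough user-space illustration of the bucket selection just described, the following sketch mixes an L3 address, a device index, and a per-table random seed, then masks the result down to a bucket index. The names toy_mix and toy_bucket and the mixing constants are hypothetical; the kernel delegates the real computation to the hash function supplied by each neighboring protocol.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a protocol hash function: mixes the L3
 * address, the device index, and a per-table random seed.
 * This is not the kernel's actual function. */
static uint32_t toy_mix(uint32_t l3_addr, uint32_t ifindex, uint32_t seed)
{
    uint32_t h = l3_addr ^ seed;

    h ^= ifindex * 0x9e3779b1u;   /* spread the device index over the word */
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    return h;
}

/* nbuckets must be a power of 2 so that the mask yields a valid index. */
static uint32_t toy_bucket(uint32_t l3_addr, uint32_t ifindex,
                           uint32_t seed, uint32_t nbuckets)
{
    uint32_t mask = nbuckets - 1;

    assert(nbuckets != 0 && (nbuckets & mask) == 0);
    return toy_mix(l3_addr, ifindex, seed) & mask;
}
```

Because the seed takes part in the hash, someone who can choose the addresses but does not know the seed cannot easily predict which entries will collide.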

Hash tables are allocated and freed with neigh_hash_alloc and neigh_hash_free, respectively. Each hash table is created with an initial size of two buckets at protocol initialization time (see neigh_table_init). When the number of elements in the table grows bigger than the number of buckets, the table is reorganized as follows. First, the size of the table is doubled (thus, the size of the hash table is always a power of 2).

Figure 27-6 neighbour's cache


Next, the random value used for hashing is recalculated. Finally, the elements are redistributed throughout the table using the same previously mentioned variables: L3 address, device, and random number. This extension of the hash table is performed by neigh_hash_grow, which is called by neigh_create when necessary.
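The three steps of the reorganization (double the table, pick a new random value, redistribute the elements) can be sketched in user space as follows. This is a minimal model under stated assumptions: toy_table, toy_entry, toy_hash, and toy_grow are hypothetical names, the caller supplies the new seed, and no locking is shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_entry {
    uint32_t key;              /* stands in for the L3 address */
    uint32_t ifindex;          /* stands in for the device */
    struct toy_entry *next;    /* singly linked bucket chain */
};

struct toy_table {
    struct toy_entry **buckets;
    uint32_t nbuckets;         /* always a power of 2 */
    uint32_t seed;             /* random value mixed into the hash */
};

static uint32_t toy_hash(const struct toy_table *t, uint32_t key,
                         uint32_t ifindex)
{
    uint32_t h = (key ^ t->seed) + ifindex * 0x9e3779b1u;

    h ^= h >> 15;
    return h & (t->nbuckets - 1);
}

/* Double the table, install a new seed, and relink every entry.
 * Returns 0 on success, or -1 (table untouched) if allocation fails. */
static int toy_grow(struct toy_table *t, uint32_t new_seed)
{
    uint32_t old_n = t->nbuckets, i;
    struct toy_entry **old = t->buckets;
    struct toy_entry **fresh = calloc((size_t)old_n * 2, sizeof(*fresh));

    if (!fresh)
        return -1;
    t->buckets = fresh;
    t->nbuckets = old_n * 2;
    t->seed = new_seed;
    for (i = 0; i < old_n; i++) {
        struct toy_entry *e = old[i];

        while (e) {
            struct toy_entry *next = e->next;
            uint32_t b = toy_hash(t, e->key, e->ifindex);

            e->next = t->buckets[b];   /* insert at the head of the chain */
            t->buckets[b] = e;
            e = next;
        }
    }
    free(old);
    return 0;
}
```

Changing the seed during the same pass costs nothing extra, since every entry has to be rehashed anyway once the number of buckets changes.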

Note that extension of the hash table is easily triggered. Therefore, a bucket rarely holds more than one or two structures.

The maximum number of elements in a table is controlled by the gc_threshX variables described in the section "Garbage Collection." These limits are needed to prevent possible DoS attacks.

When the neighboring subsystem needs to search a hash table for a neighbor, the search key is the destination L3 address (primary_key) together with the device (dev) through which the neighbor can be reached. Because different protocols may use keys of different lengths, the common lookup APIs need to take into account the key length. Therefore, the key length is stored in the neigh_table structure.

The main function used to query a neighbor protocol's cache is neigh_lookup. There are two others, both wrappers around neigh_lookup, that can either force the creation of a neighbour entry if the lookup fails or decide whether to create one according to an input parameter. Here is a brief description of the three routines:

neigh_lookup

Checks whether the element being searched for exists, and returns a pointer to it when successful.

struct neighbour *neigh_lookup(struct neigh_table *tbl, const void *pkey,
                               struct net_device *dev)
{
    struct neighbour *n;
    int key_len = tbl->key_len;
    u32 hash_val = tbl->hash(pkey, dev) & tbl->hash_mask;

    read_lock_bh(&tbl->lock);
    for (n = tbl->hash_buckets[hash_val]; n; n = n->next) {
        if (dev == n->dev &&
            !memcmp(n->primary_key, pkey, key_len)) {
            neigh_hold(n);                      /* caller gets a reference */
            NEIGH_CACHE_STAT_INC(tbl, hits);
            break;
        }
    }
    read_unlock_bh(&tbl->lock);
    return n;                                   /* NULL when nothing matched */
}

_ _neigh_lookup

A wrapper around neigh_lookup that creates the neighbour entry by means of neigh_create when the lookup fails and _ _neigh_lookup was invoked with the creat input flag set.

_ _neigh_lookup_errno

Uses neigh_lookup to see whether the entry exists, and always creates a new neighbour instance when the lookup fails. This function is basically the same as _ _neigh_lookup without the input creat flag.
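The relationship among the three routines can be modeled with a small user-space sketch: one plain lookup, one wrapper that creates an entry on a miss only when asked to, and one that always creates. All names here (toy_cache, toy_lookup, toy_lookup_creat, toy_lookup_always) are hypothetical, the cache is a single list rather than a hash table, and the initial reference count of 2 (one for the cache, one for the caller) is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_neigh {
    uint32_t key;             /* stands in for primary_key */
    int ifindex;              /* stands in for dev */
    int refcnt;
    struct toy_neigh *next;
};

struct toy_cache {
    struct toy_neigh *head;
};

/* Plain lookup: returns the entry with an extra reference, or NULL. */
static struct toy_neigh *toy_lookup(struct toy_cache *c, uint32_t key,
                                    int ifindex)
{
    struct toy_neigh *n;

    for (n = c->head; n; n = n->next) {
        if (n->ifindex == ifindex && n->key == key) {
            n->refcnt++;
            return n;
        }
    }
    return NULL;
}

/* Creates an entry at the head of the list. */
static struct toy_neigh *toy_create(struct toy_cache *c, uint32_t key,
                                    int ifindex)
{
    struct toy_neigh *n = calloc(1, sizeof(*n));

    if (!n)
        return NULL;
    n->key = key;
    n->ifindex = ifindex;
    n->refcnt = 2;            /* assumption: one for the cache, one for the caller */
    n->next = c->head;
    c->head = n;
    return n;
}

/* Creates on a miss only when creat is nonzero. */
static struct toy_neigh *toy_lookup_creat(struct toy_cache *c, uint32_t key,
                                          int ifindex, int creat)
{
    struct toy_neigh *n = toy_lookup(c, key, ifindex);

    if (n || !creat)
        return n;
    return toy_create(c, key, ifindex);
}

/* Always creates on a miss. */
static struct toy_neigh *toy_lookup_always(struct toy_cache *c, uint32_t key,
                                           int ifindex)
{
    return toy_lookup_creat(c, key, ifindex, 1);
}
```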

Chapter 28 describes another function, arp_find, which is a wrapper around _ _neigh_lookup and is kept for backward compatibility, for use by legacy code. Another function, neigh_lookup_nodev, is currently used only by DECnet.

Each protocol also maintains a separate cache and an associated set of lookup APIs used for destination proxying. You can find more details about them in the section "Acting As a Proxy."

27.3.2 Timers

The neighboring subsystem uses several timers. Some are global, whereas others are created on a one-per-neighbor basis. Some run periodically, and others are started only when needed. The following is a brief overview of the timers we will see in more detail in later sections:

Transitions between states (neighbour->timer)

Some transitions between NUD states are driven by the passage of time rather than by events in the system. These transitions include:

From NUD_REACHABLE to NUD_DELAY or NUD_STALE

This transition takes place when a certain amount of time goes by without traffic being sent to or received from a neighbor; the neighboring subsystem automatically suspects that the neighbor may not be reachable.

From NUD_DELAY to NUD_PROBE or NUD_REACHABLE

This is the next state after the neighbor's reachability is suspected; either it must be confirmed by an external event or the neighboring subsystem must launch an explicit probe. The timer simply detects the condition required to change state and takes care of it. For example, we saw in Figure 26-14 in Chapter 26 how neigh_confirm may be called when TCP provides confirmation of reachability. neigh_confirm updates a timestamp in the neighbour structure but does not change the state. Instead, when this timer detects the new timestamp, it changes the neighbor's state.

A timer in each neighbour structure controls both of these transitions. Its callback is initialized to neigh_timer_handler when the neighbour entry is created with neigh_alloc. You can find more information on this in Figure 27-4, and in the section "Reachability Confirmation" in Chapter 26.
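The decision the per-neighbor timer makes for these two transitions can be approximated with the sketch below. It is only a model: the toy_ names, the parameters reachable_time and delay_probe_time, and the exact comparisons are assumptions, and timestamp wraparound is ignored. The point it illustrates is that the handler inspects timestamps updated elsewhere (such as confirmed) and derives the next state from them.

```c
#include <assert.h>

enum toy_nud { TOY_REACHABLE, TOY_STALE, TOY_DELAY, TOY_PROBE };

struct toy_neigh {
    enum toy_nud state;
    unsigned long confirmed;   /* last proof of reachability */
    unsigned long used;        /* last recorded use of the entry */
};

/* Called when the per-neighbor timer expires; returns the new state. */
static enum toy_nud toy_timer_fire(struct toy_neigh *n, unsigned long now,
                                   unsigned long reachable_time,
                                   unsigned long delay_probe_time)
{
    if (n->state == TOY_REACHABLE) {
        /* A recent confirmation keeps the entry REACHABLE. */
        if (now > n->confirmed + reachable_time) {
            if (now <= n->used + delay_probe_time)
                n->state = TOY_DELAY;   /* used recently: wait for a hint */
            else
                n->state = TOY_STALE;   /* idle: nothing to do until next use */
        }
    } else if (n->state == TOY_DELAY) {
        if (now <= n->confirmed + delay_probe_time)
            n->state = TOY_REACHABLE;   /* an external confirmation arrived */
        else
            n->state = TOY_PROBE;       /* no hint: probe explicitly */
    }
    return n->state;
}
```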

Failed solicitation requests

If no answer to a solicitation request is received within a given amount of time, a new solicitation is sent. The maximum number of solicitation requests that can be sent is given by the XXX_probes fields of the neigh_parms structure, described in the section "neigh_parms Structure" in Chapter 29.

After the final failed attempt, the neighbor entry is moved to the NUD_FAILED state (see Figure 27-13). After the state becomes NUD_FAILED, it is up to the garbage collection timer to remove the entry.
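The retry-then-fail behavior can be sketched as a small counter check. The names are hypothetical, and max_probes simply stands in for the limit derived from the XXX_probes fields.

```c
#include <assert.h>

enum toy_result { TOY_RETRY, TOY_FAILED };

struct toy_probe_state {
    unsigned probes_sent;      /* solicitations sent so far */
};

/* Called when a solicitation timed out without an answer. */
static enum toy_result toy_solicit_timeout(struct toy_probe_state *s,
                                           unsigned max_probes)
{
    if (s->probes_sent >= max_probes)
        return TOY_FAILED;     /* the entry would move to NUD_FAILED */
    s->probes_sent++;          /* send one more solicitation */
    return TOY_RETRY;
}
```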

Garbage collection (neigh_table->gc_timer)

A periodic timer is used to make sure that no memory is wasted by unused data structures. The callback handler is neigh_periodic_timer. The section "Garbage Collection" describes the garbage collection mechanism in detail.

neigh_periodic_timer also updates the value of reachable_time in the neighbour structure to a random value[*] every 300 seconds. The value is random rather than fixed because you want to avoid having too many entries expiring at the same time: in a large network, that could create a burst of traffic and CPU usage.

[*] To be more exact, it is a random value in the range base_reachable_time/2 to (3 × base_reachable_time)/2, as computed by the neigh_rand_reach_time routine.
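One simple way to obtain a value in that range is to add a random offset in [0, base) to base/2. The sketch below does this with the C library's rand(), which is only a stand-in for the kernel's random number source; toy_rand_reach_time is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns a value in [base/2, base/2 + base), that is, roughly between
 * base/2 and 3*base/2. Returns 0 when base is 0. */
static unsigned long toy_rand_reach_time(unsigned long base)
{
    if (base == 0)
        return 0;
    return ((unsigned long)rand() % base) + (base >> 1);
}
```

With a base of 30,000 time units, for instance, every result falls between 15,000 and 44,999, so entries created at the same moment are unlikely to need confirmation at the same moment.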


27.4 Reference Counts on neighbour Structures

Many kernel subsystems involved in the creation of neighbors keep a reference to the neighbour structure in some data structure; the routing subsystem does so, for instance. Therefore, the neighbour structure includes a reference count named refcnt, which is incremented and decremented with neigh_hold and neigh_release, respectively.

The most common event that increments a neighbor reference count is a packet transmission. Whenever a packet is sent out, the associated sk_buff buffer holds a reference to a neighbour structure, so neighbour->refcnt is incremented to make sure that the transmission can complete without problems. Once the packet has been transmitted, the count is decremented again.

This was an example of a short-term reference; others can last significantly longer. One example is the reference kept by the routing table cache (under both IPv4 and IPv6[*]), as depicted in Figure 27-10.

[*] Both IPv4's rt_intern_hash (described in Chapter 33) and IPv6's ip6_route_add end up calling _ _neigh_lookup_errno.

The reference count is also incremented every time a per-neighbor timer is fired up, as shown in the following snapshot taken from neigh_update:

if (new & NUD_IN_TIMER) {


27.5 Creating a neighbour Entry

Like most cached items, the creation of neighbour entries is event driven: an instance is created when the system needs a neighbor and there is a cache miss. Specifically, a new instance is created when one of the following takes place:

Transmission request

When there is a transmission request toward a host whose L2 address is not known, the address needs to be resolved. This is the most common case and is depicted in Figure 27-13(a). When the target host is not directly connected to the sender, the L2 address to resolve will be that of the next hop gateway, not that of the target host.

Reception of a solicitation request

Because the host sending the request identifies itself in that request, the recipient automatically creates a cache entry on the assumption that communication between the two systems is imminent. (For details involving ARP, see Figure 28-2 in Chapter 28.) However, information learned in this way (passively) is not considered as authoritative as information learned with an explicit solicitation request and reply (see the section "Transitions Between NUD States" in Chapter 26 for more details).

Manual coding

An administrator can create a cache entry through an ip neigh add command, as described in the section "System Administration of Neighbors" in Chapter 29.

When one of these events happens, and a query to the neighboring subsystem cache returns a miss, the neighboring protocol tries to resolve the association (normally by sending a solicitation request) and stores the resulting neighbour entry in the per-protocol cache.

27.5.1 The neigh_create Function's Parameters

Now that we know what triggers the creation of a neighbour structure, we can look at the main functions involved with its creation.

The data structure itself is created with neigh_create, whose return value is a pointer to the neighbour data structure. Here is the prototype and a description of the three input parameters:

struct neighbour *neigh_create(struct neigh_table *tbl, const void *pkey,
                               struct net_device *dev)

tbl

Identifies the neighboring protocol used. The way this parameter is set is simple: if it is being called from IPv4 code (i.e., from arp_rcv) it is set to arp_tbl, etc.

neigh_alloc uses a memory pool created at subsystem initialization time (see the section "Protocol Initialization and Cleanup"). The function fails only if the number of structures currently allocated is greater than some configurable threshold and, on top of that, an attempt by the garbage collector (via neigh_forced_gc) to free some memory failed (see the section "Synchronous cleanup: the neigh_forced_gc function").

pkey is copied into the data structure with the help of key_len, which provides the size of the data to be copied. This is necessary because the neighbour structures are used by protocol-independent cache lookup routines and the various neighboring protocols use addresses of different sizes.

memcpy(n->primary_key, pkey, key_len);

Also, because the neighbour entry holds a reference to the net_device structure dev, the kernel increases the reference count on the latter with dev_hold to make sure the device will not be removed until the neighbour structure ceases to exist.
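The two points made above (a key whose size depends on the protocol, and a device reference taken on behalf of the entry) can be sketched in user space as follows. The toy_ names are hypothetical, and the use of a flexible array member for the key is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct toy_dev {
    int refcnt;
};

struct toy_neigh {
    struct toy_dev *dev;
    size_t key_len;
    unsigned char primary_key[];   /* size chosen per protocol at allocation */
};

/* Allocates an entry with room for key_len bytes of key and pins the device. */
static struct toy_neigh *toy_neigh_new(const void *pkey, size_t key_len,
                                       struct toy_dev *dev)
{
    struct toy_neigh *n = malloc(sizeof(*n) + key_len);

    if (!n)
        return NULL;
    n->key_len = key_len;
    memcpy(n->primary_key, pkey, key_len);
    n->dev = dev;
    dev->refcnt++;                 /* plays the role of dev_hold */
    return n;
}

/* Drops the device reference and frees the entry. */
static void toy_neigh_free(struct toy_neigh *n)
{
    n->dev->refcnt--;
    free(n);
}
```

The same allocation routine then serves a 4-byte IPv4 key and a 16-byte IPv6 key without any protocol-specific code.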

27.5.2 Neighbor Initialization

There are two kinds of initialization for a neighbour structure: one done by the neighboring protocol and one done by the device.

if (tbl->constructor && (error = tbl->constructor(n)) < 0) {
    rc = ERR_PTR(error);
    goto out_neigh_release;
}

The protocol's initialization is carried out by the neigh_table->constructor function, invoked, as shown here, through the function's tbl parameter. Chapter 28 explains how the ARP constructor does the job.

Device initialization is done through the neigh_setup virtual function:

if (n->parms->neigh_setup &&
    (error = n->parms->neigh_setup(n)) < 0) {
    rc = ERR_PTR(error);
    goto out_neigh_release;
}

This function is actually defined by only a few devices. For instance, the shaper virtual device (an old piece of code in drivers/net/shaper.c that has been rendered obsolete by the Traffic Control subsystem but is needed for backward compatibility) uses the setup function to make sure the device is associated with a specific instance of the neigh_ops structures provided by ARP (see the section "Initialization of neigh->ops"). Some WAN devices use a setup function for similar reasons.

The neigh_create function ends by setting the entry's confirmed field to indicate that the neighbor is reachable. Normally, this field is updated by a proof of reachability and is set to the current time expressed in jiffies. But here, at the point of creation, the function subtracts an amount of time (twice base_reachable_time, going by the left shift in the statement that follows) to make the state move to NUD_STALE a little faster than usual and to require proof of reachability.

n->confirmed = jiffies - (n->parms->base_reachable_time<<1);

Once the entry has been initialized, it is added to the main cache using the hash function provided by the neighboring protocol.

27.6 Neighbor Deletion

A neighbour data structure can be removed for three main reasons:

The kernel tries to send a packet to a host that is not reachable. There are many reasons this could happen: the host went down, its cable came unplugged, it was a wireless device that moved out of range, its network configuration got corrupted, or somebody manually created an entry for a nonexistent host. Whatever the cause, the neighboring subsystem notices the failure and puts the associated neighbour structure into the NUD_FAILED state so that it is cleaned up by asynchronous garbage collection, described in the section "Asynchronous cleanup: the neigh_periodic_timer function."

The host associated with the neighbour structure has changed its L2 address (perhaps because its NIC was replaced) but still has the same L3 configuration. Thus, the neighbour structure has an outdated L2 address. A host with an outdated neighbor entry has to put it into the NUD_FAILED state and create a new one.[*]

[*] Some device drivers let the administrator change the MAC address either temporarily (i.e., it returns to its original value after a power cycle) or permanently. This operation is limited to special scenarios and is not needed by the average user.

The structure gets old and the kernel needs its memory. It is therefore removed by garbage collection, described in the section "Synchronous cleanup: the neigh_forced_gc function."

The transition to NUD_FAILED is taken care of by the NUD algorithm introduced in the section "Transitions Between NUD States" in Chapter 26. Asynchronous garbage collection is performed by the neigh_periodic_timer function, which is associated with the neigh_table->gc_timer timer (see the sections "Timers" and "Garbage Collection" for more details).

A structure is removed only when its reference count goes to zero. Thus, the function that carries out the deletion, neigh_destroy, is called only from neigh_release, which is called every time a reference to a structure is released. neigh_release decrements the structure's reference count and calls neigh_destroy to actually remove the structure when the count goes down to zero:

static inline void neigh_release(struct neighbour *neigh)
{
    if (atomic_dec_and_test(&neigh->refcnt))
        neigh_destroy(neigh);
}
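The interplay between references, timers, and destruction can be modeled with a small single-threaded sketch. The toy_ names are hypothetical, and plain integer arithmetic replaces the atomic operations the kernel needs. The sketch shows that an armed timer owns a reference, so destruction cannot happen until the timer has been stopped and its reference released.

```c
#include <assert.h>

struct toy_neigh {
    int refcnt;
    int timer_pending;
    int destroyed;
};

static void toy_hold(struct toy_neigh *n)
{
    n->refcnt++;
}

/* Runs only when the last reference is gone. */
static void toy_destroy(struct toy_neigh *n)
{
    assert(n->refcnt == 0);
    n->timer_pending = 0;      /* defensive: no timer should be pending here */
    n->destroyed = 1;
}

static void toy_release(struct toy_neigh *n)
{
    if (--n->refcnt == 0)
        toy_destroy(n);
}

/* Arming the timer takes a reference on behalf of the timer. */
static void toy_timer_start(struct toy_neigh *n)
{
    toy_hold(n);
    n->timer_pending = 1;
}

/* Stopping a pending timer gives that reference back. */
static void toy_timer_stop(struct toy_neigh *n)
{
    if (n->timer_pending) {
        n->timer_pending = 0;
        toy_release(n);
    }
}
```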

neigh_destroy carries out the following tasks:

Stops any pending timer. This is a belt-and-suspenders precaution. In theory, no timer should be pending when executing neigh_destroy because the condition required by neigh_release to invoke neigh_destroy is a reference count value of 0, and timers always hold a reference when running.

Releases any references to external data structures, such as the associated device and cached L2 headers. See Figures 27-1 and 27-10.

The section "L2 Header Caching," later in this chapter, explains the purpose of the cache and shows the relationship between the neighbour structure and the hh_cache structures that contain the headers. Each hh_cache structure is strictly coupled with a neighbour entry and therefore should not be used once the neighbour entry has been removed or marked NUD_FAILED. Thus, when a neighbour entry is deleted, any hh_cache structures to which it refers are unlinked from the cache and freed if their reference counts allow it, and neigh_destroy sets the hh_cache->hh_output field in the cached header to neigh_blackhole (for that function, see the section "Routines used for neigh->output"). After this, any transmission attempt using the neighbour entry will silently fail and the packet will be dropped. At the L3 layer, the results of dropping the packet can be seen in the section "Interaction Between Neighboring Protocols and L3 Transmission Functions."

If a destructor method has been provided by the neighboring protocol, executes it to give the protocol a chance to do its own cleanup.

If the arp_queue queue is not empty, purges it (i.e., removes all of its elements). arp_queue is described in the section "Egress Queuing."

Decrements the global counter indicating the number of neighbour entries used by the host.

Frees the neighbour data structure (i.e., gives it back to its memory pool).

27.6.1 Garbage Collection

Garbage collection refers to the process of eliminating resources that are not in use anymore. Like many Linux kernel subsystems (networking and others), the neighboring subsystem maintains a timer that runs periodically and executes a function whenever the timer expires, to clean up the unused data structures.

The garbage collection algorithm used by the neighboring infrastructure has two main components:

Synchronous cleanup, performed by neigh_forced_gc

Asynchronous cleanup, performed by neigh_periodic_timer

This relatively complex system was chosen because, in the case of the neighboring subsystem, the designers thought it would be more efficient than simpler designs such as deleting a structure the moment its reference count went down to zero. While the asynchronous cleanup tries to free structures that have no further value, the synchronous cleanup tries to sacrifice some of the less-needed entries to free some memory. Therefore, the criteria used to select the eligible structures are different in the two types of cleanup.

It is interesting to note that an asynchronous cleanup can be triggered by an external subsystem, too. For instance, when the routing subsystem cannot insert a new routing entry into its cache, it tries to remove unused cache entries (see the description of the rt_intern_hash function in Chapter 33), which indirectly causes neighbour structures to be freed, too.

The parameters that tune garbage collection behavior are:


The following two sections explain their meaning and use. Also consult the section "neigh_table structure," the section "neigh_parms structure," and Table 29-3 in Chapter 29 for information on these variables.

Figures 27-7 and 27-8 show the behavior of neigh_periodic_timer and neigh_forced_gc, the two routines described in the next two sections.

27.6.1.1 Synchronous cleanup: the neigh_forced_gc function

Figure 27-7 shows the internals of neigh_forced_gc.

Figure 27-7 neigh_forced_gc function

If there is no memory to allocate a new neighbour instance, the host cannot transmit any packet to neighbors for which there is not already a neighbour structure in the cache. Without a policy to handle this case, the consequences would be pretty bad: no communication could take place with a new host until another neighbour structure happened to be removed for some reason.

The neigh_alloc function, which we have seen is responsible for allocating memory in the neighboring subsystem, is the natural place to kick off synchronous garbage collection. To detect a dangerous situation and do garbage collection before memory is actually exhausted, neigh_alloc checks two variables named gc_thresh2 and gc_thresh3. (Another variable, gc_thresh1, is currently declared in the kernel but is not used.)

When the number of neighbour instances is greater than gc_thresh3, the neigh_alloc function forces garbage collection. When the number of instances is between gc_thresh2 and gc_thresh3, garbage collection is forced if the previous garbage collection took place at least 5 seconds earlier. The reason for the second check is to rate limit the time spent doing garbage collection.
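The two-threshold rule can be condensed into a single predicate. The sketch below follows the wording of the text; the function name, the parameter names, and the treatment of the exact boundary values are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* now, last_flush, and min_gap are expressed in the same time unit. */
static bool toy_should_force_gc(unsigned entries,
                                unsigned thresh2, unsigned thresh3,
                                unsigned long now, unsigned long last_flush,
                                unsigned long min_gap)
{
    if (entries > thresh3)
        return true;           /* hard limit: always collect */
    if (entries > thresh2 && now - last_flush >= min_gap)
        return true;           /* soft limit: collect, but rate limited */
    return false;
}
```

With the default thresholds of 512 and 1,024 and a 5-second gap, a table holding 600 entries triggers a forced collection only if the previous one ran at least 5 seconds ago.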

The default values for gc_thresh2 and gc_thresh3 are 512 and 1,024, respectively. These look like big numbers, but are designed to support proxy ARP. Without a proxy ARP server, each host usually creates ARP entries for only a few local machines and the router, so it would never get near those thresholds. But when proxy ARP is in use, hosts request more L3 addresses because they rely less on the default gateway. The reception of a solicitation request by the proxy ARP server leads to the indirect creation of a neighbour entry for the sender's address. See the earlier section "Creating a neighbour Entry," and the description of arp_process in Chapter 28. In a medium-size network, the thresholds are pretty safe and the cache is not likely to overflow.

The routine invoked to do synchronous cleanup is neigh_forced_gc, which is depicted in Figure 27-7. neigh_forced_gc removes all of the eligible elements from the hash table. Eligible elements are the ones that meet both of the following requirements:

The reference count is 1, meaning that nobody is using the element, and the subsystem holding the remaining reference is free to delete the element.

The element is not in the NUD_PERMANENT state. Elements in that state have been statically configured and therefore do not expire.

Elements are added by neigh_create at the head of the bucket's lists in the hash table.
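The eligibility test for synchronous cleanup reduces to two conditions, as in the following sketch. The names are hypothetical; the value chosen for TOY_NUD_PERMANENT is only illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUD_PERMANENT 0x80u   /* illustrative flag value */

struct toy_neigh {
    int refcnt;
    unsigned state;
};

/* Eligible: only the cache's own reference is left, and the entry was
 * not statically configured. */
static bool toy_forced_gc_eligible(const struct toy_neigh *n)
{
    return n->refcnt == 1 && !(n->state & TOY_NUD_PERMANENT);
}

/* Counts how many entries a forced sweep over v[0..len-1] would remove. */
static size_t toy_forced_gc_count(const struct toy_neigh *v, size_t len)
{
    size_t i, removed = 0;

    for (i = 0; i < len; i++)
        if (toy_forced_gc_eligible(&v[i]))
            removed++;
    return removed;
}
```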

27.6.1.2 Asynchronous cleanup: the neigh_periodic_timer function

Figure 27-8 shows the internals of neigh_periodic_timer.

Figure 27-8 neigh_periodic_timer function

gc_timer is a per-protocol timer that expires periodically. When the timer expires, it invokes the garbage collection routine neigh_periodic_timer. The kernel actually invokes a function specified in a field of the neigh_table structure (one of which exists for each neighboring protocol), so each protocol could theoretically have its own implementation of the garbage collection handler, but in practice the field is initialized to the same routine across all the protocols in the neigh_table_init function.

How often gc_timer expires depends on the size of the hash_buckets table: because neigh_periodic_timer scans only one bucket of the table every time it is called, and because the whole table is scanned (by design choice) once every base_reachable_time/2 seconds, it follows that the timer must be set to expire every (base_reachable_time/2)/number_of_buckets seconds.

Every time neigh_periodic_timer is called, it remembers the last bucket scanned, thanks to neigh_table's field, hash_chain_gc, and scans the following one.
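The timer arithmetic and the round-robin bucket selection can be written out as two small helpers. The names are hypothetical, and the clamp to a minimum interval of 1 is an assumption added so that the sketch never returns a zero-length interval.

```c
#include <assert.h>

/* Interval between two runs, chosen so that every bucket is visited once
 * per base_reachable_time/2. Units are whatever base_reachable_time uses. */
static unsigned long toy_gc_interval(unsigned long base_reachable_time,
                                     unsigned long nbuckets)
{
    unsigned long iv;

    assert(nbuckets > 0);
    iv = (base_reachable_time / 2) / nbuckets;
    return iv ? iv : 1;        /* assumption: never return a zero interval */
}

/* Bucket to scan on the next run, wrapping around at the end of the table.
 * nbuckets must be a power of 2. */
static unsigned long toy_next_bucket(unsigned long last,
                                     unsigned long nbuckets)
{
    return (last + 1) & (nbuckets - 1);
}
```

For example, with base_reachable_time equal to 30,000 time units and 8 buckets, the timer fires every 1,875 units, and after bucket 7 the scan wraps around to bucket 0.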

The neigh->confirmed timestamp is updated every time the reachability of the neighbor is confirmed, for example, by calling neigh_confirm, as we saw in the section "Reachability Confirmation" in Chapter 26. Even though its name suggests it, the neigh->used timestamp is not updated every time the neighbour structure is used (i.e., with the transmission of each packet to the neighbor). Because of this, it is possible that at some point, neigh->confirmed represents a more updated timestamp marking the last use of the neighbour structure. For this reason, neigh_periodic_timer updates neigh->used if that is needed (i.e., if neigh->confirmed is greater than neigh->used). It is important to keep neigh->used updated because that's the timestamp used by neigh_periodic_timer to eliminate old entries.

As Figure 27-8 shows, eligible elements marked for deletion by neigh_periodic_timer meet both of the following criteria:

The reference count is 1, meaning it is no longer used.

The entry either is in the NUD_FAILED state, which means that resolution failed, or has simply not been used for more than the configurable gc_staletime time.
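Putting the timestamp refresh and the two criteria together gives a predicate like the one below. It is a sketch that follows the text only: the names are hypothetical, the flag value is illustrative, timestamp wraparound is ignored, and any additional checks the kernel may perform are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUD_FAILED 0x20u   /* illustrative flag value */

struct toy_neigh {
    int refcnt;
    unsigned state;
    unsigned long used;        /* last recorded use */
    unsigned long confirmed;   /* last proof of reachability */
};

/* Refreshes used from confirmed when needed, then applies both criteria. */
static bool toy_periodic_eligible(struct toy_neigh *n, unsigned long now,
                                  unsigned long gc_staletime)
{
    if (n->confirmed > n->used)
        n->used = n->confirmed;

    if (n->refcnt != 1)
        return false;          /* somebody else still uses the entry */
    return (n->state & TOY_NUD_FAILED) || now - n->used > gc_staletime;
}
```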


27.7 Acting As a Proxy

The section "Proxying the Neighboring Protocol" in Chapter 26 described why proxies are useful and gave a few examples of their use. It also showed the criteria by which neighboring protocols decide whether a given solicitation request is taken care of by the proxy. This section goes into detail on the implementation of proxying.

We saw in the section "Conditions Required by the Proxy" in Chapter 26 that two kinds of proxying can be configured: a host either can proxy all requests received on a particular NIC (per-device proxying) or, more selectively, can proxy requests for a particular address received on a particular NIC (per-destination proxying).

The precedence shown in Figure 26-8 in Chapter 26 is enforced in protocol-specific code. ARP's implementation is shown in Chapter 28, and you can look at the routine neigh_recv_ns for IPv6's implementation. The section "Per-Device Proxying and Per-Destination Proxying" also goes into more detail about these two types of proxying.

Before digging into the code, let me introduce a naming convention used extensively there. The neighboring subsystem contains pairs of functions and data structures whose names differ only in the presence or absence of an initial p (e.g., neigh_lookup versus pneigh_lookup). The p stands for proxy. Because addresses intercepted by proxies are handled differently, there is a dedicated set of functions to manipulate

The delay applied is a random value between 0 and the configured value proxy_delay (see the function pneigh_enqueue). The use of a random value reduces the likelihood of synchronized requests by multiple hosts, and the congestion that could result. For example, if a power failure occurs at a site, and upon recovery it powers up hundreds of hosts at the same time, all of the hosts probably solicit the same set of servers or default gateways. A random delay smooths out the spike in traffic that would result.

To apply a delay, the neighboring subsystem creates a queue storing ingress solicitation requests, and a timer. The timer expires after the configured delay has passed and triggers the execution of a special handler that dequeues the elements from the queue. They are then processed as if they had just been received from the network.

Figure 27-9 depicts the model just described.

The major variables and virtual functions involved in handling the proxy delay are:

From neigh_table (per-protocol parameters)

proxy_queue

Queue where the ingress solicitation requests are temporarily buffered. Elements are added to the end of the list. When the proxy_queue list has reached the maximum length specified in proxy_qlen (discussed later), new elements are dropped; they do not replace the oldest ones.

proxy_timer

Timer used to enforce the delay. The timer is initialized by neigh_table_init and the default handler is neigh_proxy_process.

proxy_redo

Function that processes the dequeued requests. As shown in Figure 27-9, it consists of just a call to the same function that processes freshly received packets.
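The queueing policy described above (append at the tail, drop new arrivals when the queue is full, stamp each request with a randomized due time) can be sketched with a fixed-size ring buffer. Everything here is hypothetical: the toy_ names, the TOY_QCAP capacity, and the use of rand() as the random source; qlen_max plays the role of proxy_qlen. The timer and the handler that processes due requests are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_QCAP 64            /* storage capacity of the sketch */

struct toy_req {
    int id;                    /* identifies the buffered request */
    unsigned long due;         /* time at which it should be processed */
};

struct toy_proxy_queue {
    struct toy_req slots[TOY_QCAP];
    unsigned head, len;
    unsigned qlen_max;         /* configured limit, must be <= TOY_QCAP */
};

/* Appends a request at the tail; a full queue drops the new request. */
static bool toy_enqueue(struct toy_proxy_queue *q, int id,
                        unsigned long now, unsigned long proxy_delay)
{
    struct toy_req *r;

    if (q->len >= q->qlen_max || q->len >= TOY_QCAP)
        return false;
    r = &q->slots[(q->head + q->len) % TOY_QCAP];
    r->id = id;
    /* random delay in [0, proxy_delay); no delay when proxy_delay is 0 */
    r->due = now + (proxy_delay ? (unsigned long)rand() % proxy_delay : 0);
    q->len++;
    return true;
}

/* Removes the oldest request; returns false when the queue is empty. */
static bool toy_dequeue(struct toy_proxy_queue *q, struct toy_req *out)
{
    if (q->len == 0)
        return false;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % TOY_QCAP;
    q->len--;
    return true;
}
```

Dropping the newest request rather than the oldest means that a burst cannot push out requests that are already waiting for their delay to elapse.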

Figure 27-9 Generic model of a protocol proxy handler
