
LINUX NETWORK INTERNALS


Understanding LINUX

NETWORK INTERNALS

Christian Benvenuti


Understanding Linux Network Internals

by Christian Benvenuti

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (safari.oreilly.com). For more information, contact our institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Production Editor: Philip Dangler

Cover Designer: Karen Montgomery

Interior Designer: David Futato

Printing History:

December 2005: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. The Linux series designations, Understanding Linux Network Internals, images of the American West, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.


When a Feature Is Offered as a Patch

2 Critical Data Structures
The Socket Buffer: sk_buff Structure

Part II System Initialization

4 Notification Chains
Reasons for Notification Chains
Notification Chains for the Networking Subsystems
Functions and Variables Featured in This Chapter
Files and Directories Featured in This Chapter

5 Network Device Initialization
System Initialization Overview
Device Registration and Initialization
Basic Goals of NIC Initialization
Interaction Between Devices and Kernel

6 The PCI Layer and Network Interface Cards
Data Structures Featured in This Chapter
Registering a PCI NIC Device Driver
Power Management and Wake-on-LAN
Example of PCI NIC Driver Registration
Tuning via /proc Filesystem
Functions and Variables Featured in This Chapter
Files and Directories Featured in This Chapter

7 Kernel Infrastructure for Component Initialization
Optimized Macro-Based Tagging
Boot-Time Initialization Routines

8 Device Registration and Initialization
When a Device Is Registered
When a Device Is Unregistered
Allocating net_device Structures
Skeleton of NIC Registration and Unregistration

Part III Transmission and Reception

9 Interrupts and Network Drivers
Decisions and Traffic Direction
Notifying Drivers When Frames Are Received

Enabling and Disabling Transmissions

12 General and Reference Material About Interrupts
Tuning via /proc and sysfs Filesystems
Functions and Variables Featured in This Part of the Book
Files and Directories Featured in This Part of the Book

13 Protocol Handlers
Executing the Right Protocol Handler
Protocol Handler Organization
Protocol Handler Registration
Ethernet Versus IEEE 802.3 Frames
Tuning via /proc Filesystem
Functions and Variables Featured in This Chapter
Files and Directories Featured in This Chapter

Part IV Bridging

14 Bridging: Concepts
Repeaters, Bridges, and Routers
Bridging Different LAN Technologies

15 Bridging: The Spanning Tree Protocol
Example of Hierarchical Switched L2 Topology
Basic Elements of the Spanning Tree Protocol
Bridge Protocol Data Units (BPDUs)
Defining the Active Topology
Transmitting Configuration BPDUs
Overview of Newer Spanning Tree Protocols

16 Bridging: Linux Implementation
Initialization of Bridging Code
Creating Bridge Devices and Bridge Ports
Creating a New Bridge Device
Bridge Device Setup Routine
Enabling and Disabling a Bridge Device
Enabling and Disabling a Bridge Port
Changing State on a Bridge Port
Transmitting on a Bridge Device
Spanning Tree Protocol (STP)
netdevice Notification Chain

17 Bridging: Miscellaneous Topics
User-Space Configuration Tools
Tuning via /proc Filesystem
Data Structures Featured in This Part of the Book
Functions and Variables Featured in This Part of the Book
Files and Directories Featured in This Part of the Book

18 Internet Protocol Version 4 (IPv4): Concepts
IP Protocol: The Big Picture
Packet Fragmentation/Defragmentation

21 Internet Protocol Version 4 (IPv4): Transmission
Key Functions That Perform Transmission
Interface to the Neighboring Subsystem

23 Internet Protocol Version 4 (IPv4): Miscellaneous Topics
Long-Living IP Peer Information
Selecting the IP Header’s ID Field
Functions and Variables Featured in This Part of the Book
Files and Directories Featured in This Part of the Book

24 Layer Four Protocol and Raw IP Handling
L3 to L4 Delivery: ip_local_deliver_finish

25 Internet Control Message Protocol (ICMPv4)
Applications of the ICMP Protocol
Data Structures Featured in This Chapter
Passing Error Notifications to the Transport Layer
Tuning via /proc Filesystem
Functions and Variables Featured in This Chapter
Files and Directories Featured in This Chapter

Part VI Neighboring Subsystem

26 Neighboring Subsystem: Concepts

27 Neighboring Subsystem: Infrastructure
Common Interface Between L3 Protocols and Neighboring Protocols
General Tasks of the Neighboring Infrastructure
Reference Counts on neighbour Structures

Example of an ARP Transaction
Gratuitous ARP
Responding from Multiple Interfaces
ARP Protocol Initialization
Initialization of a neighbour Structure
Transmitting and Receiving ARP Packets
Processing Ingress ARP Packets

29 Neighboring Subsystem: Miscellaneous Topics
System Administration of Neighbors
Tuning via /proc Filesystem
Data Structures Featured in This Part of the Book
Files and Directories Featured in This Part of the Book

Part VII Routing

32 Routing: Linux Implementation
Primary and Secondary IP Addresses
Generic Helper Routines and Macros
Routing Subsystem Initialization
Interactions with Other Subsystems

33 Routing: The Routing Cache
Routing Cache Initialization
Interface Between the DST and Calling Protocols
Egress ICMP REDIRECT Rate Limiting

34 Routing: Routing Tables
Organization of Routing Hash Tables
Routing Table Initialization
Policy Routing and Its Effects on Routing Table Definitions

Policy Routing and Routing Table Based Classifier

36 Routing: Miscellaneous Topics
User-Space Configuration Tools
Tuning via /proc Filesystem
Enabling and Disabling Forwarding
Data Structures Featured in This Part of the Book
Functions and Variables Featured in This Part of the Book
Files and Directories Featured in This Part of the Book

Index

Preface

Today more than ever before, networking is a hot topic. Any electronic gadget in its latest generation embeds some kind of networking capability. The Internet continues to broaden in its population and opportunities. It should not come as a surprise that a robust, freely available, and feature-rich operating system like Linux is well accepted by many producers of embedded devices. Its networking capabilities make it an optimal operating system for networking devices of any kind. The features it already has are well implemented, and new ones can be added easily. If you are a developer for embedded devices or a student who would like to experiment with Linux, this book will provide you with good fodder.

The performance of a pure software-based product that uses Linux cannot compete with commercial products that can count on the help of specialized hardware. This of course is not a criticism of software; it is a simple recognition of the consequence of the speed difference between dedicated hardware and general-purpose CPUs. However, Linux can definitely compete with low-end commercial products that are entirely software-based. Of course, simple extensions to the Linux kernel allow vendors to use Linux on hybrid systems as well (software and hardware); it is only a matter of writing the necessary device drivers.

Linux is also often used as the operating system of choice for the implementation of university projects and theses. Not all of them make it to the official kernel (not right away, at least). A few do, and others are simply made available online as patches to the official kernel. Isn’t it a great satisfaction and reward to see your contribution to the Linux kernel being used by potentially millions of users? There is only one drawback: if your contribution is really appreciated, you may not be able to cope with the numerous emails of thanks or requests for help.

The momentum for Linux has been growing continually over the past years, and apparently it can only keep growing.

I first encountered Linux at the University of Bologna, where I was a grad student in computer science around 10 years ago. What a wonderful piece of software! I could work on my image processing projects at home on an i286/486 computer without having to compete with other students for access to the few Sun stations available at the university labs.

Since then, my marriage to Linux has never seen a gray day. It has even started to displace my fond memories of the glorious C64 generation, when I was first introduced to programming with Assembly language and the various dialects of BASIC. Yes, I belong to the C64 generation, and to some extent I can compare the joy of my first programming experiences with the C64 to my first journeys into the Linux kernel.

When I was first introduced to the beautiful world of networking, I started playing with the tools available on Linux. I also had the fortune to work for a UNESCO center in Italy where I helped develop their networking courses, based entirely on Linux boxes. That gave me access to a good lab equipped with all sorts of network devices and documentation, plus plenty of Linux enthusiasts to learn from and to collaborate with.

Unfortunately for my own peace of mind (but fortunately, I hope, for the reader of this book who benefits from the results), I am the kind of person that likes to understand everything and takes very little for granted. So at UNESCO, I started looking into the kernel code. This not only proved to be a good way to burn in my knowledge, but it also gave me more confidence in making use of user-space configuration tools: whenever a configuration tool did not provide a specific option, I usually knew whether it would be possible to add it or whether it would have required significant changes to the kernel. This kind of study turns into a path without an end: you always want more.

After developing a few tools as extensions to the Linux kernel (some revision of versions 2.0 and 2.2), my love for operating systems and networking led me to the Silicon Valley (Cisco Systems). When you learn a language, be it a human language or a computer programming language, a rule emerges: the more languages you know, the easier it becomes to learn new ones. You can identify each one’s strengths and weaknesses, see the reasons behind design compromises, etc. The same applies to operating systems.

When I noticed the lack of good documentation about the networking code of the Linux kernel and the availability of good books for other parts of the kernel, I decided to try filling in the gap, or at least part of it. I hope this book will give you the starting documentation that I would have loved to have had years ago.

I believe that this book, together with O’Reilly’s other two kernel books (Understanding the Linux Kernel and Linux Device Drivers), represents a good starting point for anyone willing to learn more about the Linux kernel internals. They complement each other and, when they do not address a given feature, point the reader to external documentation sources (when available).

However, I still suggest you make some coffee, turn on the music, and spend some time on the source code trying to understand how a given feature is implemented. I believe the knowledge you build in this way lasts longer than that built in any other way. Shortcuts are good, but sometimes the long way has its advantages, too.

The Audience for This Book

This book can help those who already have some knowledge of networking and would like to see how the engine of the Internet, that is, the Internet Protocol (IP) and its friends, is implemented on a first-class operating system. However, there is a theoretical introduction for each topic, so newcomers will be able to get up to speed quickly, too. Complex topics are accompanied by enough examples to make them easier to follow.

Linux doesn’t just support basic IP; it also has quite a few advanced features. More important, its implementation must be sophisticated enough to play nicely with other kernel features such as symmetric multiprocessing (SMP) and kernel preemption. This makes the networking code of the Linux kernel a very good gym in which to train and keep your networking knowledge in shape.

Moreover, if you are like me and want to learn everything, you will find enough details in this book to keep you satisfied for quite a while.

Background Information

Some knowledge of operating systems would help. The networking code, like any other component of the operating system, must follow both common sense and implicit rules for coexistence with the rest of the kernel, including proper use of locking; fair use of memory and CPU; and an eye toward modularity, code cleanliness, and good performance. Even though I occasionally spend time on those aspects, I refer you to the other two O’Reilly kernel books mentioned earlier for a deeper and detailed discussion on generic operating system services and design.

Some knowledge of networking, and especially IP, would also help. However, I think the theory overview that precedes each implementation description in this book is sufficient to make the book self-contained for both newcomers and experienced readers.

The theoretical description of the topics covered in the book does not require any programming experience. However, the descriptions of the associated implementations require an intermediate knowledge of the C language. Chapter 1 will go through a series of coding conventions and tricks that are often used in the code, which should help especially those with less experience with C and kernel programming.

Organization of the Material

Some aspects of networking code require as many as seven chapters, while for other aspects one chapter is sufficient. When the topic is complex or big enough to span different chapters, the part of the book devoted to that topic always starts with a concept chapter that covers the theory necessary to understand the implementation, which is described in another chapter. All of the reference and secondary material is usually located in one miscellaneous chapter at the end of the part. No matter how big the topic is, the same scheme is used to organize its presentation.

For each topic, the implementation description includes:

• The big picture, which shows where the described kernel component falls in the network stack.

• A brief description of the main data structures and a figure that shows how they relate to each other.

• A description of which other kernel features the component interfaces with, for example by means of notification chains or data structure cross-references. The firewall is an example of such a kernel feature, given the numerous hooks it has all over the networking code.

• Extensive use of flow charts and figures to make it easier to go through the code and extract the logic from big and seemingly complex functions.

The reference material always includes:

• A detailed description of the most important data structures, field by field.

• A table with a brief description of all functions, macros, and data structures, which you can use as a quick reference.

• A list of the files mentioned in the chapter, with their location in the kernel source tree.

• A description of the interface between the most common user-space tools used to configure the topic of the chapter and the kernel.

• A description of any file in /proc that is exported.

The Linux kernel’s networking code is not just a moving target, but a fast runner. The book does not cover all of the networking features. New ones are probably being added right now while you are reading. Many new features are driven by the needs of single users or organizations, or as university projects, but they find their way into the official kernel when they’re considered useful for a large audience. Besides detailing the implementation of a subset of those features, I try to give you an idea of what the generic implementation of a feature might look like. This will help you greatly in understanding changes to the code and learning how new features are implemented. For example, given any feature, you need to take the following points into consideration:

• How do you design the data structures and the locking semantics?

• Is there a need for a user-space configuration tool? If so, is it going to interact with the kernel via an existing system call, an ioctl command, a /proc file, or the Netlink socket?

• Is there any need for a new notification chain, and is there a need to register to an already existing chain?

• What is the relationship with the firewall?

• Is there any need for a cache, a garbage collection mechanism, statistics, etc.?

Here is the list of topics covered in the book:

Interface between user space and kernel

In Chapter 3, you will get a brief overview of the mechanisms that networking configuration tools use to interact with their counterparts inside the kernel. It will not be a detailed discussion, but it will help you to understand certain parts of the kernel code.

System initialization

Part II describes the initialization of key components of the networking code, and how network devices are registered and initialized.

Interface between device drivers and protocol handlers

Part III offers a detailed description of how ingress (incoming or received) packets are handed by the device drivers to the upper-layer protocols, and vice versa.

Bridging

Part IV describes transparent bridging and the Spanning Tree Protocol, the L2 (Layer two) counterpart of routing at L3 (Layer three).

Internet Protocol Version 4 (IPv4)

Part V describes how packets are received, transmitted, forwarded, and delivered locally at the IPv4 layer.

Interface between IPv4 and the transport layer (L4) protocols

Chapter 20 shows how IPv4 packets addressed to the local host are delivered to the transport layer (L4) protocols (TCP, UDP, etc.).

Internet Control Message Protocol (ICMP)

Chapter 25 describes the implementation of ICMP, the only transport layer (L4) protocol covered in the book.

Neighboring protocols

These find local network addresses, given their IP addresses. Part VI describes both the common infrastructure of the various protocols and the details of the ARP neighboring protocol used by IPv4.

Routing

Part VII, the biggest one of the book, describes the routing cache and tables. Advanced features such as Policy Routing and Multipath are also covered.

What Is Not Covered

For lack of space, I had to select a subset of the Linux networking features to cover. No selection would make everyone happy, but I think I covered the core of the networking code, and with the knowledge you can gain with this book, you will find it easier to study on your own any other networking feature of the kernel.

In this book, I decided to focus on the networking code, from the interface between device drivers and the protocol handlers, up to the interface between the IPv4 and L4 protocols. Instead of covering all of the features with a compromise on quality, I preferred to keep quality as the first goal, and to select the subset of features that would represent the best start for a journey into the kernel networking implementation.

Here is a partial list of the features I could not cover for lack of space:

Internet Protocol Version 6 (IPv6)

Even though I do not cover IPv6 in the book, the description of IPv4 can help you a lot in understanding the IPv6 implementation. The two protocols share naming conventions for functions and often for variables. Their interface to Netfilter is also similar.

IP Security protocol

The kernel provides a generic infrastructure for cryptography along with a collection of both ciphers and digest algorithms. The first interface to the cryptographic layer was synchronous, but the latest improvements are adding an asynchronous interface to allow Linux to take advantage of hardware cards that can offload the work from the CPU.

The protocols of the IPsec suite are implemented in the kernel and make use of the cryptographic layer: Authentication Header (AH), Encapsulating Security Payload (ESP), and IP Compression (IPcomp).

IP multicast and IP multicast routing

Multicast functionality was implemented to conform to versions 2 and 3 of the Internet Group Management Protocol (IGMP). Multicast routing support is also present, conforming to versions 1 and 2 of Protocol Independent Multicast (PIM).

Transport layer (L4) protocols

Several L4 protocols are implemented in the Linux kernel. Besides the two well-known ones, UDP and TCP, Linux has the newer Stream Control Transmission Protocol (SCTP). A good description of the implementation of those protocols would require a new book of this size, all on its own.

Traffic Control

This is the Quality of Service (QoS) layer of Linux, another interesting and powerful component of the kernel’s networking code. Traffic control is implemented as a general infrastructure and as a collection of traffic classifiers and queuing disciplines. I briefly describe it and the interface it provides to the main transmission routine in Chapter 11. A great deal of documentation is available at http://lartc.org.

Netfilter

The firewall code infrastructure and its extensions (including the various NAT flavors) is not covered in the book, but I describe its interaction with most of the networking features I cover. At the Netfilter home page, http://www.netfilter.org, you can find some interesting documentation about its kernel internals.

Network filesystems

Several network filesystems are implemented in the kernel, among them NFS (versions 2, 3, and 4), SMB, Coda, and Andrew. You can read a detailed description of the Virtual File System layer in Understanding the Linux Kernel, and then delve into the source code to see how those network filesystems interface with it.

Virtual devices

The use of a dedicated virtual device underlies the implementation of networking features. Examples include 802.1Q, bonding, and the various tunneling protocols, such as IP-over-IP (IPIP) and Generalized Routing Encapsulation (GRE). Virtual devices need to follow the same guidelines as real devices and provide the same interface to other kernel components. In different chapters, where needed, I compare real and virtual device behaviors. The only virtual device that is described in detail is the bridge interface, which is covered in Part IV.

DECnet, IPX, AppleTalk, etc.

These have historical roots and are still in use, but are much less commonly used than IP. I left them out to give more space to topics that affect more users.

IP virtual server

This is another interesting piece of the networking code, described at http://www.linuxvirtualserver.org/. This feature can be used to build clusters of servers using different scheduling algorithms.

Simple Network Management Protocol (SNMP)

No chapter in this book is dedicated to SNMP, but for each feature, I give a description of all the counters and statistics kept by the kernel, the routines used to manipulate them, and the /proc files used to export them, when available.

Frame Diverter

This feature allows the kernel to kidnap ingress frames not addressed to the local host. I will briefly mention it in Part III. Its home page is http://diverter.sourceforge.net.

Plenty of other network projects are available as separate patches to the kernel, and I can’t list them all here. One that I find particularly fascinating and promising, especially in relation to the Linux routing code, is the highly configurable Click router, currently offered at http://pdos.csail.mit.edu/click/.

Because this is a book about the kernel, I do not cover user-space configuration tools. However, for each topic, I describe the interface between the most common user-space configuration tools and the kernel.

Conventions Used in This Book

The following is a list of the typographical conventions used in this book:

Constant Width Italic

Used to indicate text within commands that the user replaces with an actual value.

Constant Width Bold

Used in examples to show commands or other text that should be typed literally by the user.

Pay special attention to notes set apart from the text with the following icons:

This is a tip. It contains useful supplementary information about the topic at hand.

This is a warning. It helps you solve and avoid annoying problems.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. The code samples are covered by a dual BSD/GPL license.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Understanding Linux Network

0-596-00255-6.”


We’d Like to Hear from You

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North

Safari offers a solution that’s better than e-books. It’s a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://safari.oreilly.com.

Acknowledgments

This book would not have been possible without an interesting topic to talk about, and an audience. The interesting topic is Linux, this modern operating system that anyone has an opportunity to be part of, and the audience is the incredible number of users that often decide not only to take advantage of the good work of others, but also to contribute to its success by getting involved in its development. I have always loved sharing knowledge and passion for the things I like, and with this book, I have tried my best to add a lane or two to the highway that takes interested people into the wonderful world of the Linux kernel.

Of course, I did not do everything while lying in a hammock by the beach, with an ice cream in one hand and a mouse in the other. It took quite a lot of work to investigate the reasons behind some of the implementation choices. It is incredible how much information you can dig out of the development mailing lists, and how much people are willing to share their knowledge when you show genuine interest in their work.

For sure, this book would not be what it is without the great help and suggestions of my editor, Andy Oram. Due to the frequent changes that the networking code experiences, a few chapters had to undergo substantial updates during the writing of the book, but Andy understood this and helped me get to the finish line.

I also would like to thank all of those people that supported me in this effort, and Cisco Systems for giving me the flexibility I needed to work on this book.

A special thanks also goes to the technical reviewers for being able to review a book of this size in a short amount of time, still providing useful comments that allowed me to catch errors and improve the quality of the material. The book was reviewed by Jerry Cooperstein, Michael Boerner, and Paul Kinzelman (in alphabetical order, by first name). I also would like to thank Francois Tallet for reviewing Part IV and Andi Kleen for his feedback on Part V.

PART I

The information in this part of the book represents the basic knowledge you need to understand the rest of the book comfortably. If you are already familiar with the Linux kernel, or you are an experienced software engineer, you will be able to go pretty quickly through these chapters. For other readers, I suggest getting familiar with this material before proceeding with the following parts of the book:

Chapter 1, Introduction

The bulk of this chapter is devoted to introducing a few of the common programming patterns and tricks that you’ll often meet in the networking code.

Chapter 2, Critical Data Structures

In this chapter, you can find a detailed description of two of the most important data structures used by the networking code: the socket buffer sk_buff and the network device net_device.

Chapter 3, User-Space-to-Kernel Interface

The discussion of each feature in this book ends with a set of sections that shows how user-space configuration tools and the kernel communicate. The information in this chapter can help you understand those sections better.

CHAPTER 1

Introduction

To do research in the source code of a large project is to enter a strange, new land with its own customs and unspoken expectations. It is useful to learn some of the major conventions up front, and to try interacting with the inhabitants instead of merely standing back and observing.

The bulk of this chapter is devoted to introducing you to a few of the common programming patterns and tricks that you’ll often meet in the networking code.

I encourage you, when possible, to try interacting with a given part of the kernel networking code by means of user-space tools. So in this chapter, I’ll give you a few pointers as to where you can download those tools if they’re not already installed on your preferred Linux distribution, or if you simply want to upgrade them to the latest versions.

I’ll also describe some tools that let you find your way gracefully through the enormous kernel code. Finally, I’ll explain briefly why a kernel feature may not be integrated into the official kernel releases, even if it is widely used in the Linux community.

The terms vector and array will be used interchangeably.

When referring to the layers of the TCP/IP network stack, I will use the abbreviations L2, L3, and L4 to refer to the link, network, and transport layers, respectively. The numbers are based on the famous (if not exactly current) seven-layer OSI model.

In most cases, L2 will be a synonym for Ethernet, L3 for IP Version 4 or 6, and L4 for UDP, TCP, or ICMP. When I need to refer to a specific protocol, I’ll use its name (e.g., TCP) rather than the generic Ln protocol term.

In different chapters, we will see how data units are received and transmitted by the protocols that sit at a given layer in the network stack. In those contexts, the terms ingress and input will be used interchangeably. The same applies to egress and output. The action of receiving or transmitting a data unit may be referred to with the abbreviations RX and TX, respectively.

A data unit is given different names, such as frame, packet, segment, and message, depending on the layer where it is used (see Chapter 13 for more details). Table 1-1 summarizes the major abbreviations you’ll see in the book.

Common Coding Patterns

Each networking feature, like any other kernel feature, is just one of the citizens inside the kernel. As such, it must make proper and fair use of memory, CPU, and all other shared resources. Most features are not written as standalone pieces of kernel code, but interact with other kernel components more or less heavily depending on the feature. They therefore try, as much as possible, to follow similar mechanisms to implement similar functionalities (there is no need to reinvent the wheel every time).

Some requirements are common to several kernel components, such as the need to allocate several instances of the same data structure type, the need to keep track of references to an instance of a data structure to avoid unsafe memory deallocations, etc. In the following subsections, we will view common ways in Linux to handle such requirements. I will also talk about common coding tricks that you may come across while browsing the kernel’s code.

This book uses subsystem as a loose term to describe a collection of files that implement a major set of features, such as IP or routing, and that tend to be maintained by the same people and to change in lockstep. In the rest of the chapter, I’ll also use the term kernel component to refer to these subsystems, because the conventions discussed here apply to most parts of the kernel, not just those involved in networking.

Table 1-1. Abbreviations used frequently in this book

Memory Caches

The kernel uses the kmalloc and kfree functions to allocate and free a memory block, respectively. The syntax of those two functions is similar to that of the two sister calls, malloc and free, from the libc user-space library. For more details on kmalloc

It is common for a kernel component to allocate several instances of the same data structure type. When allocation and deallocation are expected to happen often, the associated kernel component initialization routine (for example, fib_hash_init for the routing table) usually allocates a special memory cache that will be used for the allocations. When a memory block is freed, it is actually returned to the same cache from which it was allocated.

Some examples of network data structures for which the kernel maintains dedicated memory caches include:

Socket buffer descriptors

This cache, allocated by skb_init in net/core/skbuff.c, is used for the allocation of the sk_buff buffer descriptors. The sk_buff structure is probably the one that registers the highest number of allocations and deallocations in the networking subsystem.

Neighboring protocol mappings

Each neighboring protocol uses a memory cache to allocate the data structures that store L3-to-L2 address mappings. See Chapter 27.

A request to free a buffer ends up calling kmem_cache_free only when all the references to the buffer have been released and all the necessary cleanup has been done by the interested subsystems (for instance, the firewall).

The limit on the number of instances that can be allocated from a given cache (when present) is usually enforced by the wrappers around kmem_cache_alloc, and is sometimes configurable with a parameter in /proc.

For more details on how memory caches are implemented and how they interface to the slab allocator, please refer to Understanding the Linux Kernel (O’Reilly).
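The idea of a dedicated cache can be sketched in user space. What follows is only an illustration under stated assumptions, not kernel code: obj_cache, cache_create, cache_alloc, cache_free, and cache_destroy are hypothetical names, and a plain free list stands in for the slab allocator. It shows a freed object going back to the cache it came from, and a limit on live instances enforced at allocation time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for a kernel memory cache. */
struct obj_cache {
    size_t obj_size;   /* size of each object handed out */
    size_t in_use;     /* objects currently allocated */
    size_t limit;      /* maximum live objects; 0 means no limit */
    void  *free_list;  /* singly linked list of returned objects */
};

struct obj_cache *cache_create(size_t obj_size, size_t limit)
{
    struct obj_cache *c = malloc(sizeof(*c));

    if (!c)
        return NULL;
    /* A free object stores the "next" pointer in its own memory. */
    c->obj_size = obj_size < sizeof(void *) ? sizeof(void *) : obj_size;
    c->in_use = 0;
    c->limit = limit;
    c->free_list = NULL;
    return c;
}

void *cache_alloc(struct obj_cache *c)
{
    void *obj;

    if (c->limit && c->in_use >= c->limit)
        return NULL;                  /* the limit is enforced here */
    if (c->free_list) {
        obj = c->free_list;           /* reuse a returned object */
        c->free_list = *(void **)obj;
    } else {
        obj = malloc(c->obj_size);
        if (!obj)
            return NULL;
    }
    c->in_use++;
    return obj;
}

void cache_free(struct obj_cache *c, void *obj)
{
    assert(c->in_use > 0);
    *(void **)obj = c->free_list;     /* back to the same cache */
    c->free_list = obj;
    c->in_use--;
}

/* Release the objects still sitting on the free list, then the cache. */
void cache_destroy(struct obj_cache *c)
{
    while (c->free_list) {
        void *next = *(void **)c->free_list;

        free(c->free_list);
        c->free_list = next;
    }
    free(c);
}
```

Because objects of one type always have the same size, a returned object can be handed out again without any search, which is the main reason frequent allocators prefer a dedicated cache over a general-purpose allocator.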

Caching and Hash Tables

It is pretty common to use a cache to increase performance. In the networking code, there are caches for L3-to-L2 mappings (such as the ARP cache used by IPv4), for the routing table cache, etc.

Cache lookup routines often take an input parameter that says whether a cache miss should or should not create a new element and add it to the cache. Other lookup routines simply add missing elements all the time.

Caches are often implemented with hash tables. The kernel provides a set of data types, such as one-way and bidirectional lists, that can be used as building blocks for simple hash tables.

The standard way to handle inputs that hash to the same value is to put them in a list. Traversing this list takes substantially longer than using the hash key to do a lookup. Therefore, it is always important to minimize the number of inputs that hash to the same value.

When the lookup time on a hash table (whether it uses a cache or not) is a critical parameter for the owner subsystem, it may implement a mechanism to increase the size of the hash table so that the average length of the collision lists goes down and the average lookup time improves. See the section “Dynamic resizing of per-netmask hash tables” in Chapter 34 for an example.

You may also find subsystems, such as the neighboring layer, that add a random component (regularly changed) to the key used to distribute elements in the cache’s buckets. This is used to reduce the damage of Denial of Service (DoS) attacks aimed at concentrating the elements of a hash table into a single bucket. See the section “Caching” in Chapter 27 for an example.
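The points above (collision lists, a lookup flag that controls creation on a miss, and a random component in the hash) can be put together in a small user-space sketch. Everything here is an assumption made for illustration: the names cache, entry, cache_hash, and cache_lookup, the table size, and the multiplicative mixing step, which is not the hash used by any kernel subsystem.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 8              /* hypothetical table size, a power of two */

struct entry {
    uint32_t key;
    struct entry *next;         /* collision list for this bucket */
};

struct cache {
    uint32_t seed;              /* random component mixed into the hash */
    struct entry *bucket[NBUCKETS];
};

/* Simple keyed mix, chosen only for this sketch. */
unsigned int cache_hash(const struct cache *c, uint32_t key)
{
    uint32_t h = (key ^ c->seed) * 2654435761u;

    return (h >> 16) & (NBUCKETS - 1);
}

/* Look up key; on a miss, add a new element only if create is nonzero. */
struct entry *cache_lookup(struct cache *c, uint32_t key, int create)
{
    unsigned int i;
    struct entry *e;

    assert(c != NULL);
    i = cache_hash(c, key);
    for (e = c->bucket[i]; e; e = e->next)   /* walk the collision list */
        if (e->key == key)
            return e;
    if (!create)
        return NULL;
    e = malloc(sizeof(*e));
    if (!e)
        return NULL;
    e->key = key;
    e->next = c->bucket[i];                  /* insert at the list head */
    c->bucket[i] = e;
    return e;
}
```

Changing the seed changes which keys collide, so an attacker who does not know it cannot easily pick inputs that all land in one bucket. A real implementation that changes the seed also has to redistribute the existing entries, which this sketch leaves out.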

Reference Counts

When a piece of code tries to access a data structure that has already been freed, the kernel is not very happy, and the user is rarely happy with the kernel’s reaction. To avoid those nasty problems, and to make garbage collection mechanisms easier and more effective (see the section “Garbage Collection” later in this chapter), most data structures keep a reference count. Good kernel citizens increment and decrement the reference count of every data structure every time they save and release a reference, respectively, to the structure. For any data structure type that requires a reference count, the kernel component that owns the structure usually exports two functions that can be used to increment and decrement the reference count. Such functions are usually called xxx_hold and xxx_release, respectively. Sometimes the release function is called xxx_put instead (e.g., dev_put for net_device structures).

While we like to assume there are no bad citizens in the kernel, developers are human, and as such they do not always write bug-free code. The use of the reference count is a simple but effective mechanism to avoid freeing still-referenced data structures. However, it does not always solve the problem completely. This is the consequence of forgetting to balance increments and decrements:

• If you release a reference to a data structure but forget to call the xxx_release function, the kernel will never allow the data structure to be freed (unless another buggy piece of code happens to call the release function an extra time by mistake!). This leads to gradual memory exhaustion.

• If you take a reference to a data structure but forget to call xxx_hold, and at some later point you happen to be the only reference holder, the structure will be prematurely freed because you are not accounted for. This case definitely can be more catastrophic than the previous one; your next attempt to access the structure can corrupt other data or cause a kernel panic that brings down the whole system instantly.

When a data structure is to be removed for some reason, the reference holders can be explicitly notified about its going away so that they can politely release their references. This is done through notification chains. See the section “Reference Counts” in Chapter 8 for an interesting example.

The reference count on a data structure typically can be incremented when:

• There is a close relationship between two data structure types. In this case, one of the two often maintains a pointer initialized to the address of the second one.

• A timer is started whose handler is going to access the data structure. When the timer is started, the reference count on the structure is incremented, because the last thing you want is for the data structure to be freed before the timer expires.

• A successful lookup on a list or a hash table returns a pointer to the matching element. In most cases, the returned result is used by the caller to carry out some task. Because of that, it is common for a lookup routine to increase the reference count on the matching element, and let the caller release it when necessary.

When the last reference to a data structure is released, it may be freed because it is not needed anymore, but not necessarily.

The introduction of the new sysfs filesystem has helped to make a good portion of the kernel code more aware of reference counts and consistent in its use of them.
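A hold/put pair in the style described above might look like the following user-space sketch. The names thing, thing_alloc, thing_hold, and thing_put are made up for the example, and the counter is a plain int; kernel code would use an atomic type so that concurrent holders cannot corrupt the count.

```c
#include <assert.h>
#include <stdlib.h>

struct thing {
    int refcnt;                 /* number of outstanding references */
    int payload;
};

/* The creator receives the first reference. */
struct thing *thing_alloc(int payload)
{
    struct thing *t = malloc(sizeof(*t));

    if (!t)
        return NULL;
    t->refcnt = 1;
    t->payload = payload;
    return t;
}

/* Call whenever a new pointer to the structure is saved somewhere. */
void thing_hold(struct thing *t)
{
    t->refcnt++;
}

/*
 * Call whenever a saved pointer is dropped. Returns 1 if this was the
 * last reference and the structure has been freed, 0 otherwise.
 */
int thing_put(struct thing *t)
{
    assert(t->refcnt > 0);      /* catches an unbalanced extra put */
    if (--t->refcnt == 0) {
        free(t);
        return 1;
    }
    return 0;
}
```

A lookup routine that returns a pointer would call thing_hold before returning, leaving the matching thing_put to the caller. Forgetting either call produces one of the two failure modes listed earlier.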

Garbage Collection

Memory is a shared and limited resource and should not be wasted, particularly in the kernel, because kernel memory cannot be swapped out. Most kernel subsystems implement some sort of garbage collection to reclaim the memory held by unused or stale data structure instances. Depending on the needs of any given feature, you will find two main kinds of garbage collection:

Asynchronous

This type of garbage collection is unrelated to particular events. A timer that expires regularly invokes a routine that scans a set of data structures and frees the ones considered eligible for deletion. The conditions that make a data structure eligible for deletion depend on the features and logic of the subsystem, but a common criterion is the presence of a null reference count.

Synchronous

There are cases where a shortage of memory, which cannot wait for the asynchronous garbage collection timer to kick in, triggers immediate garbage collection. The criteria used to select the data structures eligible for deletion are not necessarily the same ones used by asynchronous cleanup (for instance, they could be more aggressive). See Chapter 33 for an example.

In Chapter 7, you will see how the kernel manages to reclaim the memory used by initialization routines and that is no longer needed after they have been executed.
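The two kinds of garbage collection can share one scanning routine and differ only in how strict the eligibility test is. The sketch below is a user-space illustration under assumptions of my own: the item structure, the idle-time criterion, the 60-unit threshold, and the function names are all hypothetical, and no real timer is involved.

```c
#include <assert.h>
#include <stdlib.h>

struct item {
    int refcnt;                 /* holders; 0 means nobody uses it */
    long last_used;             /* timestamp in arbitrary units */
    struct item *next;
};

struct item *item_add(struct item **head, int refcnt, long last_used)
{
    struct item *it = malloc(sizeof(*it));

    if (!it)
        return NULL;
    it->refcnt = refcnt;
    it->last_used = last_used;
    it->next = *head;
    *head = it;
    return it;
}

/* Free unreferenced items idle for more than max_idle; return the count. */
int gc_run(struct item **head, long now, long max_idle)
{
    struct item **pp = head;
    int freed = 0;

    assert(head != NULL);
    while (*pp) {
        struct item *it = *pp;

        if (it->refcnt == 0 && now - it->last_used > max_idle) {
            *pp = it->next;     /* unlink before freeing */
            free(it);
            freed++;
        } else {
            pp = &it->next;
        }
    }
    return freed;
}

/* Asynchronous flavor: what a periodic timer handler would call. */
int gc_periodic(struct item **head, long now)
{
    return gc_run(head, now, 60);
}

/* Synchronous flavor: called on memory shortage, with no grace period. */
int gc_forced(struct item **head, long now)
{
    return gc_run(head, now, -1);
}
```

In both flavors an item that still has a holder is never freed; only the treatment of recently used, unreferenced items changes.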

Function Pointers and Virtual Function Tables (VFTs)

Function pointers are a convenient way to write clean C code while getting some of the benefits of the object-oriented languages. In the definition of a data structure type (the object), you include a set of function pointers (the methods). Some or all manipulations of the structure are then done through the embedded functions. C-language function pointers in data structures look like this:

struct sock {
    void (*sk_state_change)(struct sock *sk);
    void (*sk_data_ready)(struct sock *sk, int bytes);
};

A key advantage to using function pointers is that they can be initialized differently depending on various criteria and the role played by the object. Thus, invoking sk_

Function pointers are used extensively in the networking code. The following are only a few examples:

• When an ingress or egress packet is processed by the routing subsystem, it initializes two routines in the buffer data structure. You will see this in Chapter 35. Refer to Chapter 2 for a complete list of function pointers included in the sk_buff data structure.

• When a packet is ready for transmission on the networking hardware, it is handed to the hard_start_xmit function pointer of the net_device data structure. That routine is initialized by the device driver associated with the device.

• When an L3 protocol wants to transmit a packet, it invokes one of a set of function pointers. These have been initialized to a set of routines by the address resolution protocol associated with the L3 protocol. Depending on the actual routine to which the function pointer is initialized, a transparent L3-to-L2 address resolution may take place (for example, IPv4 packets go through ARP). When the address resolution is unnecessary, a different routine is used. See Part VI for a detailed discussion on this interface.

We see in the preceding examples how function pointers can be employed as interfaces between kernel components or as generic mechanisms to invoke the right function handler at the right time based on the result of something done by a different subsystem. There are cases where function pointers are also used as a simple way to allow protocols, device drivers, or any other feature to personalize an action.

Let’s look at an example. When a device driver registers a network device with the kernel, the registration goes through a series of steps that are needed regardless of the device type. At some point, the kernel invokes a function pointer on the net_device data structure to let the device driver do something extra if needed. The device driver could either initialize that function pointer to a function of its own, or leave the pointer NULL because the default steps performed by the kernel are sufficient.
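The registration pattern just described can be sketched in user space as follows. This is not the kernel's code: the device structure, its fields, and the functions register_device, generic_xmit, and fancy_init are hypothetical names chosen for the illustration. The point is the shape of the pattern, where one method is mandatory and another is an optional hook that is checked before being called.

```c
#include <assert.h>
#include <stddef.h>

struct device {
    const char *name;
    int (*init)(struct device *dev);            /* optional hook, may be NULL */
    int (*xmit)(struct device *dev, int len);   /* mandatory method */
    int tx_bytes;
    int initialized;
};

/* A transmit routine that several "drivers" could share. */
int generic_xmit(struct device *dev, int len)
{
    dev->tx_bytes += len;
    return 0;
}

/* Extra setup that only one kind of device needs. */
int fancy_init(struct device *dev)
{
    dev->initialized = 1;
    return 0;
}

/* Common registration steps, plus the driver-specific hook if provided. */
int register_device(struct device *dev)
{
    assert(dev != NULL);
    if (!dev->xmit)
        return -1;              /* a mandatory method is missing */
    if (dev->init) {            /* check the pointer before calling it */
        int err = dev->init(dev);

        if (err)
            return err;
    }
    return 0;
}
```

A device that needs nothing special simply leaves init unset, for example `struct device plain = { .name = "plain0", .xmit = generic_xmit };`, and the common steps run unchanged.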

A check on the value of a function pointer is always necessary before executing it to avoid NULL pointer dereferences, as shown in this snapshot from register_

When you come across a call through a function pointer while reading the code, you need to find out how the function pointer has been initialized. It could depend on different factors:

• When the selection of the routine to assign to a function pointer is based on a particular piece of data, such as the protocol handling the data or the device driver a given packet is received from, it is easier to derive the routine. For example, if a given device is managed by the drivers/net/3c59x.c device driver, you can derive the routine to which a given function pointer of the net_device data structure is initialized by reading the device initialization routine provided by the device driver.

• When the selection of the routine is based instead on more complex logic, such as the state of the resolution of an L3-to-L2 address mapping, the routine used at any time depends on external events that cannot be predicted.

A set of function pointers grouped into a data structure is often referred to as a virtual function table (VFT). When a VFT is used as the interface between two major subsystems, such as the L3 and L4 protocol layers, or when the VFT is simply exported as an interface to a generic kernel component (set of objects), the number of function pointers in it may swell to include many different pointers that accommodate a wide range of protocols or other features. Each feature may end up using only a few of the many functions provided. You will see an example in Part VI. Of course, if this use of a VFT is taken too far, it becomes cumbersome and a major redesign is needed.
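A VFT shared by two features, each filling in only the entries it needs, could be sketched like this. The proto_ops structure, the stream and dgram tables, and the 1472-byte limit are assumptions invented for the example; they do not reproduce any kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* One table type serves every protocol; entries may be left NULL. */
struct proto_ops {
    const char *name;
    int (*connect)(int id);
    int (*send)(int id, const char *buf, int len);
    int (*set_option)(int id, int opt);   /* unused by both tables below */
};

int stream_connect(int id)
{
    return id;
}

int stream_send(int id, const char *buf, int len)
{
    (void)id;
    (void)buf;
    return len;
}

int dgram_send(int id, const char *buf, int len)
{
    (void)id;
    (void)buf;
    return len > 1472 ? -1 : len;   /* hypothetical size limit */
}

const struct proto_ops stream_ops = {
    .name = "stream", .connect = stream_connect, .send = stream_send,
};

const struct proto_ops dgram_ops = {
    .name = "dgram", .send = dgram_send,      /* no connect method */
};

/* Generic callers dispatch through the table, whatever the protocol. */
int proto_connect(const struct proto_ops *ops, int id)
{
    assert(ops != NULL);
    return ops->connect ? ops->connect(id) : -1;
}

int proto_send(const struct proto_ops *ops, int id, const char *buf, int len)
{
    assert(ops != NULL);
    return ops->send ? ops->send(id, buf, len) : -1;
}
```

The set_option entry, which neither table fills in, shows how a table can grow to serve features that most of its users never touch.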

goto Statements

Few C programmers like the goto statement. Without getting into the history of the goto (one of the longest and most famous controversies in computer programming), I’ll summarize some of the reasons the goto is usually deprecated, but why the Linux kernel uses it anyway.

Any piece of code that uses goto can be rewritten without it. The use of goto statements can reduce the readability of the code, and make debugging harder, because at any position following a goto you can no longer derive unequivocally the conditions that led the execution to that point.

Let me make this analogy: given any node in a tree, you know what the path from the root to the node is. But if you add vines that entwine around branches randomly, you do not always have a unique path between the root and the other nodes anymore.

However, because the C language does not provide explicit exceptions (and they are often avoided in other languages as well because of the performance hit and coding complexity), carefully placed goto statements can make it easier to jump to code that handles undesired or peculiar events. In kernel programming, and particularly in networking, such events are very common, so goto becomes a convenient tool.

I must defend the kernel’s use of goto by pointing out that developers have by no means gone wild with it. Even though there are more than 30,000 instances, they are mainly used to handle different return codes within a function, or to jump out of more than one level of nesting.
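The typical error-handling use of goto looks like the following sketch, where each failure jumps to a label that undoes only the steps already completed. The conn structure and its functions are hypothetical, and treating a zero length as a failure is an assumption made so the error paths are easy to exercise.

```c
#include <assert.h>
#include <stdlib.h>

struct conn {
    char *rx_buf;
    char *tx_buf;
};

/* Returns a fully set up connection, or NULL with nothing leaked. */
struct conn *conn_create(size_t rx_len, size_t tx_len)
{
    struct conn *c = malloc(sizeof(*c));

    if (!c)
        goto err;
    c->rx_buf = rx_len ? malloc(rx_len) : NULL;
    if (!c->rx_buf)
        goto err_free_conn;
    c->tx_buf = tx_len ? malloc(tx_len) : NULL;
    if (!c->tx_buf)
        goto err_free_rx;
    return c;

    /* Cleanup runs in reverse order of setup; labels fall through. */
err_free_rx:
    free(c->rx_buf);
err_free_conn:
    free(c);
err:
    return NULL;
}

void conn_destroy(struct conn *c)
{
    assert(c != NULL);
    free(c->tx_buf);
    free(c->rx_buf);
    free(c);
}
```

Without goto, each failure branch would have to repeat the cleanup of every earlier step, and the copies tend to drift apart as the function grows.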

required, placeholder is just a pointer to the end of the structure; it does not consume any space.

Thus, if abc is used by several pieces of code, each one can use the same basic definition (avoiding the confusion of doing the same thing in slightly different ways) while extending abc differently to personalize its definition according to its needs.

We will see this kind of data structure definition a few times in the book. One example is in Chapter 19.
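A structure with a trailing placeholder can be sketched as below. The names abc and placeholder come from the text; the other fields and the abc_alloc helper are assumptions for the illustration. The sketch uses the C99 flexible array member syntax, whereas older code often declares such a field as an array of size 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct abc {
    int age;
    char name[20];
    char placeholder[];         /* start of the optional block, if any */
};

/* Allocate an abc followed by extra bytes of caller-defined data. */
struct abc *abc_alloc(size_t extra)
{
    /* The placeholder itself adds nothing to the size of the structure. */
    assert(offsetof(struct abc, placeholder) <= sizeof(struct abc));
    return calloc(1, sizeof(struct abc) + extra);
}
```

A user that needs private data calls abc_alloc with the size of that data and reaches it through placeholder; a user that needs none passes 0 and pays for no extra memory.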

Conditional Directives (#ifdef and family)

Conditional directives to the compiler are sometimes necessary. An excessive use of them can reduce the readability of the code, but I can state that Linux does not abuse them. They appear for different reasons, but the ones we are interested in are those used to check whether a given feature is supported by the kernel. Configuration tools such as make xconfig determine whether the feature is compiled in, not supported at all, or loadable as a module.

Examples of feature checks by #ifdef or #if defined C preprocessor directives are:

To include or exclude fields from a data structure definition

In this example, the Netfilter debugging feature requires an nf_debug field in the sk_buff structure. When the kernel is built without Netfilter debugging (a feature needed by only a handful of developers), there is no need to include the field, which would just take up more memory for every network packet.

To include or exclude pieces of code from a function

To select the right prototype for a function

#endif

Note that this case differs from the previous one. In the previous case, the function body lies outside the #ifdef/#endif blocks, whereas in this case, each block contains a complete definition of the function.

The definition or initialization of variables and macros can also use conditional compilation.

It is important to know about the existence of multiple definitions of certain functions or macros, whose selection at compile time is based on a preprocessor macro as in the preceding examples. Otherwise, when you look for a function, variable, or macro definition, you may be looking at the wrong one.

See Chapter 7 for a discussion of how the introduction of special macros has reduced, in some cases, the use of conditional compiler directives.
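The three uses of conditional directives discussed in this section can be seen side by side in a short sketch. CONFIG_DEMO_DEBUG is a hypothetical option invented for the example, as are the packet structure and the two functions; the sketch is written to compile with the option left undefined.

```c
#include <assert.h>
#include <stddef.h>

/* Uncomment to build the hypothetical debugging feature in. */
/* #define CONFIG_DEMO_DEBUG */

/* Use 1: include or exclude a field from a data structure. */
struct packet {
    int len;
#ifdef CONFIG_DEMO_DEBUG
    unsigned int debug_mask;    /* costs memory only in debug builds */
#endif
};

/* Use 2: include or exclude a piece of code inside one function body. */
int packet_check(struct packet *p)
{
    assert(p != NULL);
    if (p->len < 0)
        return -1;
#ifdef CONFIG_DEMO_DEBUG
    p->debug_mask |= 1u;
#endif
    return 0;
}

/* Use 3: each block holds a complete definition of the function. */
#ifdef CONFIG_DEMO_DEBUG
int packet_debug_enabled(void)
{
    return 1;
}
#else
int packet_debug_enabled(void)
{
    return 0;
}
#endif
```

Searching this file for packet_debug_enabled finds two definitions, which is the situation the text warns about: only the configuration tells you which one is compiled.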

Compile-Time Optimization for Condition Checks

Most of the time, when the kernel compares a variable against some external value to see whether a given condition is met, the result is extremely likely to be predictable. This is pretty common, for example, with code that enforces sanity checks. The kernel uses the likely and unlikely macros, respectively, to wrap comparisons that are likely to return a true (1) or false (0) result. Those macros take advantage of a feature of the gcc compiler that can optimize the compilation of the code based on that information.

An example of the optimization made possible by the likely and unlikely macros is in handling options in the IP header. The use of IP options is limited to very specific cases, and the kernel can safely assume that most IP packets do not carry IP options. When the kernel forwards an IP packet, it needs to take care of options according to the rules described in Chapter 18. The last stage of forwarding an IP packet is taken care of by ip_forward_finish. This function uses the unlikely macro to wrap the condition that checks whether there is any IP option to take care of. See the section “ip_forward_finish Function” in Chapter 20.
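In user space, macros of this kind can be approximated with the __builtin_expect builtin provided by gcc (and by compatible compilers). The sketch below is modeled on the pattern described above rather than copied from the kernel: the pkt structure and forward_finish are hypothetical, and the fallback definitions for other compilers are an assumption of mine.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)     /* no hint available: plain condition */
#define unlikely(x) (x)
#endif

struct pkt {
    int optlen;             /* length of the options; usually 0 */
    int forwarded;
    int opts_handled;
};

int forward_finish(struct pkt *p)
{
    assert(p != NULL);
    if (unlikely(p->optlen > 0))
        p->opts_handled = 1;        /* rare path: options are present */
    p->forwarded = 1;               /* common path */
    return 0;
}
```

The hint never changes the result of the condition. It only tells the compiler which branch to favor when laying out the generated code.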

Mutual Exclusion

Locking is used extensively in the networking code, and you are likely to see it come up as an issue under every topic in this book. Mutual exclusion, locking mechanisms, and synchronization are a general topic (and a highly interesting and complex one) for many types of programming, especially kernel programming. Linux has seen the introduction and optimization of several approaches to mutual exclusion over the years. Thus, this section merely summarizes the locking mechanisms seen in networking code; I refer you to the high-quality, detailed discussions available in O’Reilly’s Understanding the Linux Kernel and Linux Device Drivers.

Each mutual exclusion mechanism is the best choice for particular circumstances. Here is a brief summary of the alternative mutual exclusion approaches you will see often in the networking code:

Spin locks

This is a lock that can be held by only one thread of execution at a time. An attempt to acquire the lock by another thread of execution makes the latter loop until the lock is released. Because of the waste caused by looping, spin locks are used only on multiprocessor systems, and generally are used only when the developer expects the lock to be held for short intervals. Also because of the waste caused to other threads, a thread of execution must not sleep while holding a spin lock.

Read-write spin locks

When the uses of a given lock can be clearly classified as read-only and read-write, the use of read-write spin locks is preferred. The difference between spin locks and read-write spin locks is that in the latter, multiple readers can hold the lock at the same time. However, only one writer at a time can hold the lock, and no reader can acquire it when it is already held by a writer. Because readers are given higher priority over writers, this type of lock performs well when the number of readers (or the number of read-only lock acquisitions) is a good deal bigger than the number of writers (or the number of read-write lock acquisitions). When the lock is acquired in read-only mode, it cannot be promoted to read-write mode directly: the lock must be released and reacquired in read-write mode.

Read-Copy-Update (RCU)

RCU is one of the latest mechanisms made available in Linux to provide mutual exclusion. It performs quite well under the following specific conditions:

• Read-write lock requests are rare compared to read-only lock requests

• The code that holds the lock is executed atomically and does not sleep

• The data structures protected by the lock are accessed via pointers

The first condition concerns performance, and the other two are at the base of the RCU working principle.
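Of the three mechanisms, the plain spin lock is the easiest to sketch outside the kernel. The version below uses the C11 atomic_flag type and is only an illustration of the busy-waiting idea: the structure and function names are mine, and the kernel's own spin locks are implemented differently and also deal with interrupts and preemption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct spinlock {
    atomic_flag locked;         /* set while some thread holds the lock */
};

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once; returns true if the lock was free and is now held. */
bool spin_trylock(struct spinlock *l)
{
    assert(l != NULL);
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Loop until the lock is released by its current holder. */
void spin_lock(struct spinlock *l)
{
    while (!spin_trylock(l))
        ;                       /* busy-wait: burns CPU, never sleeps */
}

void spin_unlock(struct spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The empty loop in spin_lock is the waste the text refers to: a waiting thread does no useful work, which is why the lock should be held only for short intervals and never across a sleep.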
