LINUX NETWORK INTERNALS
Understanding LINUX NETWORK INTERNALS
Christian Benvenuti
Understanding Linux Network Internals
by Christian Benvenuti
Copyright © 2006 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.
Production Editor: Philip Dangler
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Printing History:
December 2005: First Edition.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. The Linux series designations, Understanding Linux Network Internals, images of the American West, and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
    When a Feature Is Offered as a Patch
2. Critical Data Structures
    The Socket Buffer: sk_buff Structure
Part II. System Initialization
4. Notification Chains
    Reasons for Notification Chains
    Notification Chains for the Networking Subsystems
    Functions and Variables Featured in This Chapter
    Files and Directories Featured in This Chapter
5. Network Device Initialization
    System Initialization Overview
    Device Registration and Initialization
    Basic Goals of NIC Initialization
    Interaction Between Devices and Kernel
6. The PCI Layer and Network Interface Cards
    Data Structures Featured in This Chapter
    Registering a PCI NIC Device Driver
    Power Management and Wake-on-LAN
    Example of PCI NIC Driver Registration
    Tuning via /proc Filesystem
    Functions and Variables Featured in This Chapter
    Files and Directories Featured in This Chapter
7. Kernel Infrastructure for Component Initialization
    Optimized Macro-Based Tagging
    Boot-Time Initialization Routines
    Tuning via /proc Filesystem
    Functions and Variables Featured in This Chapter
    Files and Directories Featured in This Chapter
8. Device Registration and Initialization
    When a Device Is Registered
    When a Device Is Unregistered
    Allocating net_device Structures
    Skeleton of NIC Registration and Unregistration
    Tuning via /proc Filesystem
    Functions and Variables Featured in This Chapter
    Files and Directories Featured in This Chapter
Part III. Transmission and Reception
9. Interrupts and Network Drivers
    Decisions and Traffic Direction
    Notifying Drivers When Frames Are Received
    Enabling and Disabling Transmissions
12. General and Reference Material About Interrupts
    Tuning via /proc and sysfs Filesystems
    Functions and Variables Featured in This Part of the Book
    Files and Directories Featured in This Part of the Book
13. Protocol Handlers
    Executing the Right Protocol Handler
    Protocol Handler Organization
    Protocol Handler Registration
    Ethernet Versus IEEE 802.3 Frames
    Tuning via /proc Filesystem
    Functions and Variables Featured in This Chapter
    Files and Directories Featured in This Chapter
Part IV. Bridging
14. Bridging: Concepts
    Repeaters, Bridges, and Routers
    Bridging Different LAN Technologies
15. Bridging: The Spanning Tree Protocol
    Example of Hierarchical Switched L2 Topology
    Basic Elements of the Spanning Tree Protocol
    Bridge Protocol Data Units (BPDUs)
    Defining the Active Topology
    Transmitting Configuration BPDUs
    Overview of Newer Spanning Tree Protocols
16. Bridging: Linux Implementation
    Initialization of Bridging Code
    Creating Bridge Devices and Bridge Ports
    Creating a New Bridge Device
    Bridge Device Setup Routine
    Enabling and Disabling a Bridge Device
    Enabling and Disabling a Bridge Port
    Changing State on a Bridge Port
    Transmitting on a Bridge Device
    Spanning Tree Protocol (STP)
    netdevice Notification Chain
17. Bridging: Miscellaneous Topics
    User-Space Configuration Tools
    Tuning via /proc Filesystem
    Data Structures Featured in This Part of the Book
    Functions and Variables Featured in This Part of the Book
    Files and Directories Featured in This Part of the Book
18. Internet Protocol Version 4 (IPv4): Concepts
    IP Protocol: The Big Picture
    Packet Fragmentation/Defragmentation
21. Internet Protocol Version 4 (IPv4): Transmission
    Key Functions That Perform Transmission
    Interface to the Neighboring Subsystem
23. Internet Protocol Version 4 (IPv4): Miscellaneous Topics
    Long-Living IP Peer Information
    Selecting the IP Header’s ID Field
    Functions and Variables Featured in This Part of the Book
    Files and Directories Featured in This Part of the Book
24. Layer Four Protocol and Raw IP Handling
    L3 to L4 Delivery: ip_local_deliver_finish
    Tuning via /proc Filesystem
    Functions and Variables Featured in This Chapter
    Files and Directories Featured in This Chapter
25. Internet Control Message Protocol (ICMPv4)
    Applications of the ICMP Protocol
    Data Structures Featured in This Chapter
    Passing Error Notifications to the Transport Layer
    Tuning via /proc Filesystem
    Functions and Variables Featured in This Chapter
    Files and Directories Featured in This Chapter
Part VI. Neighboring Subsystem
26. Neighboring Subsystem: Concepts
27. Neighboring Subsystem: Infrastructure
    Common Interface Between L3 Protocols and Neighboring Protocols
    General Tasks of the Neighboring Infrastructure
    Reference Counts on neighbour Structures
    Example of an ARP Transaction
    Gratuitous ARP
    Responding from Multiple Interfaces
    ARP Protocol Initialization
    Initialization of a neighbour Structure
    Transmitting and Receiving ARP Packets
    Processing Ingress ARP Packets
29. Neighboring Subsystem: Miscellaneous Topics
    System Administration of Neighbors
    Tuning via /proc Filesystem
    Data Structures Featured in This Part of the Book
    Files and Directories Featured in This Part of the Book
Part VII. Routing
32. Routing: Linux Implementation
    Primary and Secondary IP Addresses
    Generic Helper Routines and Macros
    Routing Subsystem Initialization
    Interactions with Other Subsystems
33. Routing: The Routing Cache
    Routing Cache Initialization
    Interface Between the DST and Calling Protocols
    Egress ICMP REDIRECT Rate Limiting
34. Routing: Routing Tables
    Organization of Routing Hash Tables
    Routing Table Initialization
    Policy Routing and Its Effects on Routing Table Definitions
    Policy Routing and Routing Table Based Classifier
36. Routing: Miscellaneous Topics
    User-Space Configuration Tools
    Tuning via /proc Filesystem
    Enabling and Disabling Forwarding
    Data Structures Featured in This Part of the Book
    Functions and Variables Featured in This Part of the Book
    Files and Directories Featured in This Part of the Book
Index
Preface
Today more than ever before, networking is a hot topic. Any electronic gadget in its latest generation embeds some kind of networking capability. The Internet continues to broaden in its population and opportunities. It should not come as a surprise that a robust, freely available, and feature-rich operating system like Linux is well accepted by many producers of embedded devices. Its networking capabilities make it an optimal operating system for networking devices of any kind. The features it already has are well implemented, and new ones can be added easily. If you are a developer for embedded devices or a student who would like to experiment with Linux, this book will provide you with good fodder.
The performance of a pure software-based product that uses Linux cannot compete with commercial products that can count on the help of specialized hardware. This of course is not a criticism of software; it is a simple recognition of the consequence of the speed difference between dedicated hardware and general-purpose CPUs. However, Linux can definitely compete with low-end commercial products that are entirely software-based. Of course, simple extensions to the Linux kernel allow vendors to use Linux on hybrid systems as well (software and hardware); it is only a matter of writing the necessary device drivers.
Linux is also often used as the operating system of choice for the implementation of university projects and theses. Not all of them make it to the official kernel (not right away, at least). A few do, and others are simply made available online as patches to the official kernel. Isn’t it a great satisfaction and reward to see your contribution to the Linux kernel being used by potentially millions of users? There is only one drawback: if your contribution is really appreciated, you may not be able to cope with the numerous emails of thanks or requests for help.
The momentum for Linux has been growing continually over the past years, and apparently it can only keep growing.
I first encountered Linux at the University of Bologna, where I was a grad student in computer science around 10 years ago. What a wonderful piece of software! I could work on my image processing projects at home on an i286/486 computer without having to compete with other students for access to the few Sun stations available at the university labs.
Since then, my marriage to Linux has never seen a gray day. It has even started to displace my fond memories of the glorious C64 generation, when I was first introduced to programming with Assembly language and the various dialects of BASIC. Yes, I belong to the C64 generation, and to some extent I can compare the joy of my first programming experiences with the C64 to my first journeys into the Linux kernel.
When I was first introduced to the beautiful world of networking, I started playing with the tools available on Linux. I also had the fortune to work for a UNESCO center in Italy where I helped develop their networking courses, based entirely on Linux boxes. That gave me access to a good lab equipped with all sorts of network devices and documentation, plus plenty of Linux enthusiasts to learn from and to collaborate with.
Unfortunately for my own peace of mind (but fortunately, I hope, for the reader of this book who benefits from the results), I am the kind of person that likes to understand everything and takes very little for granted. So at UNESCO, I started looking into the kernel code. This not only proved to be a good way to burn in my knowledge, but it also gave me more confidence in making use of user-space configuration tools: whenever a configuration tool did not provide a specific option, I usually knew whether it would be possible to add it or whether it would have required significant changes to the kernel. This kind of study turns into a path without an end: you always want more.
After developing a few tools as extensions to the Linux kernel (some revision of versions 2.0 and 2.2), my love for operating systems and networking led me to the Silicon Valley (Cisco Systems). When you learn a language, be it a human language or a computer programming language, a rule emerges: the more languages you know, the easier it becomes to learn new ones. You can identify each one’s strengths and weaknesses, see the reasons behind design compromises, etc. The same applies to operating systems.
When I noticed the lack of good documentation about the networking code of the Linux kernel and the availability of good books for other parts of the kernel, I decided to try filling in the gap, or at least part of it. I hope this book will give you the starting documentation that I would have loved to have had years ago.
I believe that this book, together with O’Reilly’s other two kernel books (Understanding the Linux Kernel and Linux Device Drivers), represents a good starting point for anyone willing to learn more about the Linux kernel internals. They complement each other and, when they do not address a given feature, point the reader to external documentation sources (when available).
However, I still suggest you make some coffee, turn on the music, and spend some time on the source code trying to understand how a given feature is implemented. I believe the knowledge you build in this way lasts longer than that built in any other way. Shortcuts are good, but sometimes the long way has its advantages, too.
The Audience for This Book
This book can help those who already have some knowledge of networking and would like to see how the engine of the Internet, that is, the Internet Protocol (IP) and its friends, is implemented on a first-class operating system. However, there is a theoretical introduction for each topic, so newcomers will be able to get up to speed quickly, too. Complex topics are accompanied by enough examples to make them easier to follow.
Linux doesn’t just support basic IP; it also has quite a few advanced features. More important, its implementation must be sophisticated enough to play nicely with other kernel features such as symmetric multiprocessing (SMP) and kernel preemption. This makes the networking code of the Linux kernel a very good gym in which to train and keep your networking knowledge in shape.
Moreover, if you are like me and want to learn everything, you will find enough details in this book to keep you satisfied for quite a while.
Background Information
Some knowledge of operating systems would help. The networking code, like any other component of the operating system, must follow both common sense and implicit rules for coexistence with the rest of the kernel, including proper use of locking; fair use of memory and CPU; and an eye toward modularity, code cleanliness, and good performance. Even though I occasionally spend time on those aspects, I refer you to the other two O’Reilly kernel books mentioned earlier for a deeper and detailed discussion on generic operating system services and design.
Some knowledge of networking, and especially IP, would also help. However, I think the theory overview that precedes each implementation description in this book is sufficient to make the book self-contained for both newcomers and experienced readers.
The theoretical description of the topics covered in the book does not require any programming experience. However, the descriptions of the associated implementations require an intermediate knowledge of the C language. Chapter 1 will go through a series of coding conventions and tricks that are often used in the code, which should help especially those with less experience with C and kernel programming.
Organization of the Material
Some aspects of networking code require as many as seven chapters, while for other aspects one chapter is sufficient. When the topic is complex or big enough to span different chapters, the part of the book devoted to that topic always starts with a concept chapter that covers the theory necessary to understand the implementation, which is described in another chapter. All of the reference and secondary material is usually located in one miscellaneous chapter at the end of the part. No matter how big the topic is, the same scheme is used to organize its presentation.
For each topic, the implementation description includes:
• The big picture, which shows where the described kernel component falls in the network stack
• A brief description of the main data structures and a figure that shows how they relate to each other
• A description of which other kernel features the component interfaces with, for example, by means of notification chains or data structure cross-references. The firewall is an example of such a kernel feature, given the numerous hooks it has all over the networking code
• Extensive use of flow charts and figures to make it easier to go through the code and extract the logic from big and seemingly complex functions
The reference material always includes:
• A detailed description of the most important data structures, field by field
• A table with a brief description of all functions, macros, and data structures, which you can use as a quick reference
• A list of the files mentioned in the chapter, with their location in the kernel source tree
• A description of the interface between the most common user-space tools used to configure the topic of the chapter and the kernel
• A description of any file in /proc that is exported
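As a rough illustration of the last item, the sketch below shows how a user-space program might read one of the files the kernel exports under /proc. It assumes the layout of /proc/net/dev (header lines that contain a '|' character, followed by one "name: counters" row per interface). The function names count_interfaces and count_interfaces_in are hypothetical helpers made up for this example; they are not taken from the kernel or from any tool discussed in the book.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Count the interface rows in a stream laid out like /proc/net/dev.
 * Assumption: header lines contain '|', interface rows contain ':'.
 */
int count_interfaces(FILE *fp)
{
    char line[512];
    int count = 0;

    assert(fp != NULL);
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (strchr(line, '|') != NULL)
            continue;               /* skip the header lines */
        if (strchr(line, ':') != NULL)
            count++;                /* one row per network device */
    }
    return count;
}

/* Open the given path and count its interface rows; -1 if it cannot be opened. */
int count_interfaces_in(const char *path)
{
    FILE *fp;
    int n;

    assert(path != NULL);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    n = count_interfaces(fp);
    fclose(fp);
    return n;
}
```

On a Linux system, calling count_interfaces_in("/proc/net/dev") should return the number of network devices visible to the calling process. Splitting the parser from the file handling keeps the parsing logic testable against any stream.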
The Linux kernel’s networking code is not just a moving target, but a fast runner. The book does not cover all of the networking features. New ones are probably being added right now while you are reading. Many new features are driven by the needs of single users or organizations, or as university projects, but they find their way into the official kernel when they’re considered useful for a large audience. Besides detailing the implementation of a subset of those features, I try to give you an idea of what the generic implementation of a feature might look like. This will help you greatly in understanding changes to the code and learning how new features are implemented. For example, given any feature, you need to take the following points into consideration:
• How do you design the data structures and the locking semantics?
• Is there a need for a user-space configuration tool? If so, is it going to interact with the kernel via an existing system call, an ioctl command, a /proc file, or the Netlink socket?
• Is there any need for a new notification chain, and is there a need to register to an already existing chain?
• What is the relationship with the firewall?
• Is there any need for a cache, a garbage collection mechanism, statistics, etc.?
Here is the list of topics covered in the book:
Interface between user space and kernel
In Chapter 3, you will get a brief overview of the mechanisms that networking configuration tools use to interact with their counterparts inside the kernel. It will not be a detailed discussion, but it will help you to understand certain parts of the kernel code.
System initialization
Part II describes the initialization of key components of the networking code, and how network devices are registered and initialized.
Interface between device drivers and protocol handlers
Part III offers a detailed description of how ingress (incoming or received) packets are handed by the device drivers to the upper-layer protocols, and vice versa.
Bridging
Part IV describes transparent bridging and the Spanning Tree Protocol, the L2 (Layer two) counterpart of routing at L3 (Layer three).
Internet Protocol Version 4 (IPv4)
Part V describes how packets are received, transmitted, forwarded, and delivered locally at the IPv4 layer.
Interface between IPv4 and the transport layer (L4) protocols
Chapter 20 shows how IPv4 packets addressed to the local host are delivered to the transport layer (L4) protocols (TCP, UDP, etc.).
Internet Control Message Protocol (ICMP)
Chapter 25 describes the implementation of ICMP, the only transport layer (L4) protocol covered in the book.
Neighboring protocols
These find local network addresses, given their IP addresses. Part VI describes both the common infrastructure of the various protocols and the details of the ARP neighboring protocol used by IPv4.
Routing
Part VII, the biggest one of the book, describes the routing cache and tables. Advanced features such as Policy Routing and Multipath are also covered.
What Is Not Covered
For lack of space, I had to select a subset of the Linux networking features to cover. No selection would make everyone happy, but I think I covered the core of the networking code, and with the knowledge you can gain with this book, you will find it easier to study on your own any other networking feature of the kernel.
In this book, I decided to focus on the networking code, from the interface between device drivers and the protocol handlers, up to the interface between the IPv4 and L4 protocols. Instead of covering all of the features with a compromise on quality, I preferred to keep quality as the first goal, and to select the subset of features that would represent the best start for a journey into the kernel networking implementation.
Here is a partial list of the features I could not cover for lack of space:
Internet Protocol Version 6 (IPv6)
Even though I do not cover IPv6 in the book, the description of IPv4 can help you a lot in understanding the IPv6 implementation. The two protocols share naming conventions for functions and often for variables. Their interface to Netfilter is also similar.
IP Security protocol
The kernel provides a generic infrastructure for cryptography along with a collection of both ciphers and digest algorithms. The first interface to the cryptographic layer was synchronous, but the latest improvements are adding an asynchronous interface to allow Linux to take advantage of hardware cards that can offload the work from the CPU.
The protocols of the IPsec suite, namely Authentication Header (AH), Encapsulating Security Payload (ESP), and IP Compression (IPcomp), are implemented in the kernel and make use of the cryptographic layer.
IP multicast and IP multicast routing
Multicast functionality was implemented to conform to versions 2 and 3 of the Internet Group Management Protocol (IGMP). Multicast routing support is also present, conforming to versions 1 and 2 of Protocol Independent Multicast (PIM).
Transport layer (L4) protocols
Several L4 protocols are implemented in the Linux kernel. Besides the two well-known ones, UDP and TCP, Linux has the newer Stream Control Transmission Protocol (SCTP). A good description of the implementation of those protocols would require a new book of this size, all on its own.
Traffic Control
This is the Quality of Service (QoS) layer of Linux, another interesting and powerful component of the kernel’s networking code. Traffic control is implemented as a general infrastructure and as a collection of traffic classifiers and queuing disciplines. I briefly describe it and the interface it provides to the main transmission routine in Chapter 11. A great deal of documentation is available at http://lartc.org.
Netfilter
The firewall code infrastructure and its extensions (including the various NAT flavors) are not covered in the book, but I describe its interaction with most of the networking features I cover. At the Netfilter home page, http://www.netfilter.org, you can find some interesting documentation about its kernel internals.
Network filesystems
Several network filesystems are implemented in the kernel, among them NFS (versions 2, 3, and 4), SMB, Coda, and Andrew. You can read a detailed description of the Virtual File System layer in Understanding the Linux Kernel, and then delve into the source code to see how those network filesystems interface with it.
Virtual devices
The use of a dedicated virtual device underlies the implementation of networking features. Examples include 802.1Q, bonding, and the various tunneling protocols, such as IP-over-IP (IPIP) and Generalized Routing Encapsulation (GRE). Virtual devices need to follow the same guidelines as real devices and provide the same interface to other kernel components. In different chapters, where needed, I compare real and virtual device behaviors. The only virtual device that is described in detail is the bridge interface, which is covered in Part IV.
DECnet, IPX, AppleTalk, etc.
These have historical roots and are still in use, but are much less commonly used than IP. I left them out to give more space to topics that affect more users.
IP virtual server
This is another interesting piece of the networking code, described at http://www.linuxvirtualserver.org/. This feature can be used to build clusters of servers using different scheduling algorithms.
Simple Network Management Protocol (SNMP)
No chapter in this book is dedicated to SNMP, but for each feature, I give a description of all the counters and statistics kept by the kernel, the routines used to manipulate them, and the /proc files used to export them, when available.
Frame Diverter
This feature allows the kernel to kidnap ingress frames not addressed to the local host. I will briefly mention it in Part III. Its home page is http://diverter.sourceforge.net.
Plenty of other network projects are available as separate patches to the kernel, and I can’t list them all here. One that I find particularly fascinating and promising, especially in relation to the Linux routing code, is the highly configurable Click router, currently offered at http://pdos.csail.mit.edu/click/.
Because this is a book about the kernel, I do not cover user-space configuration tools. However, for each topic, I describe the interface between the most common user-space configuration tools and the kernel.
Conventions Used in This Book
The following is a list of the typographical conventions used in this book:
Constant Width Italic
Used to indicate text within commands that the user replaces with an actual value.
Constant Width Bold
Used in examples to show commands or other text that should be typed literally by the user.
Pay special attention to notes set apart from the text with the following icons:
This is a tip. It contains useful supplementary information about the topic at hand.
This is a warning. It helps you solve and avoid annoying problems.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. The code samples are covered by a dual BSD/GPL license.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Understanding Linux Network Internals, by Christian Benvenuti. Copyright 2006 O’Reilly Media, Inc., 0-596-00255-6.”
We’d Like to Hear from You
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Safari offers a solution that’s better than e-books. It’s a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://safari.oreilly.com.
Acknowledgments
This book would not have been possible without an interesting topic to talk about, and an audience. The interesting topic is Linux, this modern operating system that anyone has an opportunity to be part of, and the audience is the incredible number of users that often decide not only to take advantage of the good work of others, but also to contribute to its success by getting involved in its development. I have always loved sharing knowledge and passion for the things I like, and with this book, I have tried my best to add a lane or two to the highway that takes interested people into the wonderful world of the Linux kernel.
Of course, I did not do everything while lying in a hammock by the beach, with an ice cream in one hand and a mouse in the other. It took quite a lot of work to investigate the reasons behind some of the implementation choices. It is incredible how much information you can dig out of the development mailing lists, and how much people are willing to share their knowledge when you show genuine interest in their work.
For sure, this book would not be what it is without the great help and suggestions of my editor, Andy Oram. Due to the frequent changes that the networking code experiences, a few chapters had to undergo substantial updates during the writing of the book, but Andy understood this and helped me get to the finish line.
I also would like to thank all of those people that supported me in this effort, and Cisco Systems for giving me the flexibility I needed to work on this book.
A special thanks also goes to the technical reviewers for being able to review a book of this size in a short amount of time, still providing useful comments that allowed me to catch errors and improve the quality of the material. The book was reviewed by Jerry Cooperstein, Michael Boerner, and Paul Kinzelman (in alphabetical order, by first name). I also would like to thank Francois Tallet for reviewing Part IV and Andi Kleen for his feedback on Part V.
PART I
The information in this part of the book represents the basic knowledge you need to understand the rest of the book comfortably. If you are already familiar with the Linux kernel, or you are an experienced software engineer, you will be able to go pretty quickly through these chapters. For other readers, I suggest getting familiar with this material before proceeding with the following parts of the book:
Chapter 1, Introduction
The bulk of this chapter is devoted to introducing a few of the common programming patterns and tricks that you’ll often meet in the networking code.
Chapter 2, Critical Data Structures
In this chapter, you can find a detailed description of two of the most important data structures used by the networking code: the socket buffer sk_buff and the network device net_device.
Chapter 3, User-Space-to-Kernel Interface
The discussion of each feature in this book ends with a set of sections that shows how user-space configuration tools and the kernel communicate. The information in this chapter can help you understand those sections better.
CHAPTER 1
Introduction
To do research in the source code of a large project is to enter a strange, new land with its own customs and unspoken expectations. It is useful to learn some of the major conventions up front, and to try interacting with the inhabitants instead of merely standing back and observing.
The bulk of this chapter is devoted to introducing you to a few of the common programming patterns and tricks that you’ll often meet in the networking code.
I encourage you, when possible, to try interacting with a given part of the kernel networking code by means of user-space tools. So in this chapter, I’ll give you a few pointers as to where you can download those tools if they’re not already installed on your preferred Linux distribution, or if you simply want to upgrade them to the latest versions.
I’ll also describe some tools that let you find your way gracefully through the enormous kernel code. Finally, I’ll explain briefly why a kernel feature may not be integrated into the official kernel releases, even if it is widely used in the Linux community.
The terms vector and array will be used interchangeably.
When referring to the layers of the TCP/IP network stack, I will use the abbreviations L2, L3, and L4 to refer to the link, network, and transport layers, respectively. The numbers are based on the famous (if not exactly current) seven-layer OSI model. In most cases, L2 will be a synonym for Ethernet, L3 for IP Version 4 or 6, and L4 for UDP, TCP, or ICMP. When I need to refer to a specific protocol, I’ll use its name (e.g., TCP) rather than the generic Ln protocol term.
In different chapters, we will see how data units are received and transmitted by the protocols that sit at a given layer in the network stack. In those contexts, the terms ingress and input will be used interchangeably. The same applies to egress and output. The action of receiving or transmitting a data unit may be referred to with the abbreviations RX and TX, respectively.
A data unit is given different names, such as frame, packet, segment, and message, depending on the layer where it is used (see Chapter 13 for more details). Table 1-1 summarizes the major abbreviations you’ll see in the book.
Table 1-1. Abbreviations used frequently in this book
Abbreviation Meaning
Common Coding Patterns
Each networking feature, like any other kernel feature, is just one of the citizens inside the kernel. As such, it must make proper and fair use of memory, CPU, and all other shared resources. Most features are not written as standalone pieces of kernel code, but interact with other kernel components more or less heavily depending on the feature. They therefore try, as much as possible, to follow similar mechanisms to implement similar functionalities (there is no need to reinvent the wheel every time).
Some requirements are common to several kernel components, such as the need to allocate several instances of the same data structure type, the need to keep track of references to an instance of a data structure to avoid unsafe memory deallocations, etc. In the following subsections, we will view common ways in Linux to handle such requirements. I will also talk about common coding tricks that you may come across while browsing the kernel’s code.
This book uses subsystem as a loose term to describe a collection of files that implement a major set of features (such as IP or routing) and that tend to be maintained by the same people and to change in lockstep. In the rest of the chapter, I’ll also use the term kernel component to refer to these subsystems, because the conventions discussed here apply to most parts of the kernel, not just those involved in networking.
Memory Caches
The kernel uses the kmalloc and kfree functions to allocate and free a memory block, respectively. The syntax of those two functions is similar to that of the two sister calls, malloc and free, from the libc user-space library. For more details on kmalloc
It is common for a kernel component to allocate several instances of the same data structure type. When allocation and deallocation are expected to happen often, the associated kernel component initialization routine (for example, fib_hash_init for the routing table) usually allocates a special memory cache that will be used for the allocations. When a memory block is freed, it is actually returned to the same cache from which it was allocated.
Some examples of network data structures for which the kernel maintains dedicated memory caches include:
Socket buffer descriptors
This cache, allocated by skb_init in net/core/skbuff.c, is used for the allocation
registers the highest number of allocations and deallocations in the networking subsystem.
Neighboring protocol mappings
Each neighboring protocol uses a memory cache to allocate the data structures that store L3-to-L2 address mappings. See Chapter 27.
up calling kmem_cache_free only when all the references to the buffer have been released and all the necessary cleanup has been done by the interested subsystems (for instance, the firewall).
The limit on the number of instances that can be allocated from a given cache (when present) is usually enforced by the wrappers around kmem_cache_alloc, and is sometimes configurable with a parameter in /proc.
For more details on how memory caches are implemented and how they interface to
the slab allocator, please refer to Understanding the Linux Kernel (O’Reilly).
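The idea of a dedicated cache can be illustrated with a small user-space sketch. This is not the kernel's slab allocator: the names obj_cache, cache_init, cache_alloc, and cache_free are hypothetical stand-ins, and the fallback to malloc is an assumption made only to keep the example self-contained. What it shows is the behavior described above: a freed block goes back to the cache it came from and is handed out again by the next allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size object cache (user-space illustration only). */
struct obj_cache {
    size_t obj_size;    /* size of every object served by this cache */
    void  *free_list;   /* singly linked list of returned objects */
};

static void cache_init(struct obj_cache *c, size_t obj_size)
{
    /* A free object stores the "next" pointer inside its own memory. */
    c->obj_size = obj_size < sizeof(void *) ? sizeof(void *) : obj_size;
    c->free_list = NULL;
}

static void *cache_alloc(struct obj_cache *c)
{
    void *obj = c->free_list;

    if (obj != NULL) {                  /* reuse a previously freed object */
        c->free_list = *(void **)obj;
        return obj;
    }
    return malloc(c->obj_size);         /* cache empty: get fresh memory */
}

static void cache_free(struct obj_cache *c, void *obj)
{
    assert(obj != NULL);
    *(void **)obj = c->free_list;       /* return the block to its cache */
    c->free_list = obj;
}
```

A subsystem would call cache_init once from its initialization routine and then use cache_alloc and cache_free wherever it would otherwise call the general-purpose allocator.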
Caching and Hash Tables
It is pretty common to use a cache to increase performance. In the networking code, there are caches for L3-to-L2 mappings (such as the ARP cache used by IPv4), for the routing table, etc.
Cache lookup routines often take an input parameter that says whether a cache miss should or should not create a new element and add it to the cache. Other lookup routines simply add missing elements all the time.
Caches are often implemented with hash tables. The kernel provides a set of data types, such as one-way and bidirectional lists, that can be used as building blocks for simple hash tables.
The standard way to handle inputs that hash to the same value is to put them in a list. Traversing this list takes substantially longer than using the hash key to do a lookup. Therefore, it is always important to minimize the number of inputs that hash to the same value.
When the lookup time on a hash table (whether it uses a cache or not) is a critical parameter for the owner subsystem, it may implement a mechanism to increase the size of the hash table so that the average length of the collision lists goes down and the average lookup time improves. See the section “Dynamic resizing of per-netmask hash tables” in Chapter 34 for an example.
You may also find subsystems, such as the neighboring layer, that add a random component (regularly changed) to the key used to distribute elements in the cache’s buckets. This is used to reduce the damage of Denial of Service (DoS) attacks aimed at concentrating the elements of a hash table into a single bucket. See the section “Caching” in Chapter 27 for an example.
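The following is a minimal sketch of such a cache, assuming a tiny table with eight buckets. The names table, entry, hash_key, and table_lookup are hypothetical, as is the particular multiplicative hash. It shows collision lists, a lookup routine whose create parameter decides what happens on a miss, and a seed mixed into the key so that bucket placement is hard to predict from outside.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 8                  /* hypothetical table size */

struct entry {
    uint32_t      key;
    struct entry *next;             /* collision list for this bucket */
};

struct table {
    struct entry *bucket[NBUCKETS];
    uint32_t      seed;             /* random component mixed into the hash */
};

static unsigned int hash_key(const struct table *t, uint32_t key)
{
    uint32_t h;

    assert(t != NULL);
    h = (key ^ t->seed) * 2654435761u;      /* simple multiplicative hash */
    return (h >> 16) % NBUCKETS;
}

/* Look up key; on a miss, create and insert a new element only if asked. */
static struct entry *table_lookup(struct table *t, uint32_t key, int create)
{
    unsigned int b = hash_key(t, key);
    struct entry *e;

    for (e = t->bucket[b]; e != NULL; e = e->next)
        if (e->key == key)
            return e;                       /* cache hit */

    if (!create)
        return NULL;                        /* miss, and no insert requested */

    e = malloc(sizeof(*e));
    if (e == NULL)
        return NULL;
    e->key = key;
    e->next = t->bucket[b];                 /* push on the collision list */
    t->bucket[b] = e;
    return e;
}
```

Note that in this sketch changing the seed after elements have been inserted would make them unreachable; a real implementation has to redistribute the existing elements whenever the random component changes.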
Reference Counts
When a piece of code tries to access a data structure that has already been freed, the kernel is not very happy, and the user is rarely happy with the kernel’s reaction. To avoid those nasty problems, and to make garbage collection mechanisms easier and more effective (see the section “Garbage Collection” later in this chapter), most data structures keep a reference count. Good kernel citizens increment and decrement the reference count of every data structure every time they save and release a reference, respectively, to the structure. For any data structure type that requires a reference count, the kernel component that owns the structure usually exports two functions that can be used to increment and decrement the reference count. Such functions are usually called xxx_hold and xxx_release, respectively. Sometimes the release function is called xxx_put instead (e.g., dev_put for net_device structures).
While we like to assume there are no bad citizens in the kernel, developers are human, and as such they do not always write bug-free code. The use of the reference count is a simple but effective mechanism to avoid freeing still-referenced data structures. However, it does not always solve the problem completely. This is the consequence of forgetting to balance increments and decrements:
• If you release a reference to a data structure but forget to call the xxx_release function, the kernel will never allow the data structure to be freed (unless another buggy piece of code happens to call the release function an extra time by mistake!). This leads to gradual memory exhaustion.
• If you take a reference to a data structure but forget to call xxx_hold, and at some later point you happen to be the only reference holder, the structure will be prematurely freed because you are not accounted for. This case definitely can be more catastrophic than the previous one; your next attempt to access the structure can corrupt other data or cause a kernel panic that brings down the whole system instantly.
When a data structure is to be removed for some reason, the reference holders can be explicitly notified about its going away so that they can politely release their references. This is done through notification chains. See the section “Reference Counts” in Chapter 8 for an interesting example.
The reference count on a data structure typically can be incremented when:
• There is a close relationship between two data structure types In this case, one
of the two often maintains a pointer initialized to the address of the second one
• A timer is started whose handler is going to access the data structure. When the timer is armed, the reference count on the structure is incremented, because the last thing you want is for the data structure to be freed before the timer expires.
• A successful lookup on a list or a hash table returns a pointer to the matching element. In most cases, the returned result is used by the caller to carry out some task. Because of that, it is common for a lookup routine to increase the reference count on the matching element, and let the caller release it when necessary.
When the last reference to a data structure is released, it may be freed because it is not needed anymore, but not necessarily.
The introduction of the new sysfs filesystem has helped to make a good portion of
the kernel code more aware of reference counts and consistent in its use of them
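The hold/put convention can be sketched in user space as follows. The structure name xxx and the helper xxx_alloc are hypothetical, and the counter is a plain int, which is an assumption acceptable only for a single-threaded illustration; kernel code uses atomic operations for this purpose.

```c
#include <assert.h>
#include <stdlib.h>

struct xxx {
    int refcnt;                 /* number of outstanding references */
};

/* Hypothetical constructor: the creator holds the first reference. */
static struct xxx *xxx_alloc(void)
{
    struct xxx *x = malloc(sizeof(*x));

    if (x != NULL)
        x->refcnt = 1;
    return x;
}

/* Take a reference: call this whenever a pointer to x is saved. */
static void xxx_hold(struct xxx *x)
{
    assert(x != NULL && x->refcnt > 0);
    x->refcnt++;
}

/* Drop a reference; returns 1 if this was the last one and x was freed. */
static int xxx_put(struct xxx *x)
{
    assert(x != NULL && x->refcnt > 0);
    if (--x->refcnt == 0) {
        free(x);
        return 1;
    }
    return 0;
}
```

The two bugs listed earlier map directly onto this sketch: a missing xxx_put leaves refcnt above zero forever, and a missing xxx_hold lets someone else's xxx_put free the structure while you still use it.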
Garbage Collection
Memory is a shared and limited resource and should not be wasted, particularly in the kernel because it does not use virtual memory. Most kernel subsystems implement some sort of garbage collection to reclaim the memory held by unused or stale data structure instances. Depending on the needs of any given feature, you will find two main kinds of garbage collection:
Asynchronous
This type of garbage collection is unrelated to particular events. A timer that expires regularly invokes a routine that scans a set of data structures and frees the ones considered eligible for deletion. The conditions that make a data structure eligible for deletion depend on the features and logic of the subsystem, but a common criterion is a reference count of zero.
Synchronous
There are cases where a shortage of memory, which cannot wait for the asynchronous garbage collection timer to kick in, triggers immediate garbage collection. The criteria used to select the data structures eligible for deletion are not necessarily the same ones used by asynchronous cleanup (for instance, they could be more aggressive). See Chapter 33 for an example.
In Chapter 7, you will see how the kernel manages to reclaim the memory that is used by initialization routines and is no longer needed after they have been executed.
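As a rough sketch of the scan performed by either kind of garbage collection, the routine below walks a list and frees the elements that are unreferenced and old enough. The names node, node_add, and gc_sweep, and the use of an age threshold as the eligibility criterion, are assumptions made for illustration. A periodic timer would call it with a generous max_age, while a caller under memory pressure could pass 0 to be more aggressive.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    int          refcnt;        /* 0 means nobody holds a reference */
    long         last_used;     /* hypothetical timestamp, arbitrary ticks */
    struct node *next;
};

/* Hypothetical helper: push a new node on the front of the list. */
static struct node *node_add(struct node **head, int refcnt, long last_used)
{
    struct node *n = malloc(sizeof(*n));

    if (n == NULL)
        return NULL;
    n->refcnt = refcnt;
    n->last_used = last_used;
    n->next = *head;
    *head = n;
    return n;
}

/* Free unreferenced nodes at least max_age ticks old; return the count. */
static int gc_sweep(struct node **head, long now, long max_age)
{
    struct node **pp = head;
    int freed = 0;

    assert(head != NULL);
    while (*pp != NULL) {
        struct node *n = *pp;

        if (n->refcnt == 0 && now - n->last_used >= max_age) {
            *pp = n->next;      /* unlink first, then free */
            free(n);
            freed++;
        } else {
            pp = &n->next;
        }
    }
    return freed;
}
```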
Function Pointers and Virtual Function Tables (VFTs)
Function pointers are a convenient way to write clean C code while getting some of the benefits of the object-oriented languages. In the definition of a data structure type (the object), you include a set of function pointers (the methods). Some or all manipulations of the structure are then done through the embedded functions. C-language function pointers in data structures look like this:
struct sock {
    void (*sk_state_change)(struct sock *sk);           /* called when the socket state changes */
    void (*sk_data_ready)(struct sock *sk, int bytes);  /* called when data is ready to be read */
};
A key advantage to using function pointers is that they can be initialized differently depending on various criteria and the role played by the object. Thus, invoking sk_
Function pointers are used extensively in the networking code. The following are only a few examples:
• When an ingress or egress packet is processed by the routing subsystem, the subsystem initializes two routines in the buffer data structure. You will see this in Chapter 35. Refer to Chapter 2 for a complete list of function pointers included in the sk_buff data structure.
• When a packet is ready for transmission on the networking hardware, it is handed to the hard_start_xmit function pointer of the net_device data structure. That routine is initialized by the device driver associated with the device.
• When an L3 protocol wants to transmit a packet, it invokes one of a set of function pointers. These have been initialized to a set of routines by the address resolution protocol associated with the L3 protocol. Depending on the actual routine to which the function pointer is initialized, a transparent L3-to-L2 address resolution may take place (for example, IPv4 packets go through ARP). When the address resolution is unnecessary, a different routine is used. See Part VI for a detailed discussion on this interface.
We see in the preceding examples how function pointers can be employed as interfaces between kernel components or as generic mechanisms to invoke the right function handler at the right time based on the result of something done by a different subsystem. There are cases where function pointers are also used as a simple way to allow protocols, device drivers, or any other feature to personalize an action.
Let’s look at an example. When a device driver registers a network device with the kernel, it goes through a series of steps that are needed regardless of the device type. At some point, it invokes a function pointer on the net_device data structure to let the device driver do something extra if needed. The device driver could either initialize that function pointer to a function of its own, or leave the pointer NULL because the default steps performed by the kernel are sufficient.
A check on the value of a function pointer is always necessary before executing it to avoid NULL pointer dereferences, as shown in this snapshot from register_
to find out how the function pointer has been initialized. It could depend on different factors:
• When the selection of the routine to assign to a function pointer is based on a particular piece of data, such as the protocol handling the data or the device driver a given packet is received from, it is easier to derive the routine. For example, if a given device is managed by the drivers/net/3c59x.c device driver, you can derive the routine to which a given function pointer of the net_device data structure is initialized by reading the device initialization routine provided by the device driver.
• When the selection of the routine is based instead on more complex logic, such as the state of the resolution of an L3-to-L2 address mapping, the routine used at any time depends on external events that cannot be predicted.
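The pattern of an optional hook that is checked before being called can be sketched as follows. This is not the kernel's device registration code: demo_dev, demo_register, jumbo_init, and the MTU values are hypothetical names and numbers chosen only to show generic steps followed by an optional, driver-supplied routine.

```c
#include <assert.h>
#include <stddef.h>

struct demo_dev {
    int mtu;
    int (*init)(struct demo_dev *dev);  /* optional driver hook, may be NULL */
};

/* Generic registration: common steps first, then the optional hook. */
static int demo_register(struct demo_dev *dev)
{
    assert(dev != NULL);
    dev->mtu = 1500;                    /* default set by the generic code */

    if (dev->init != NULL) {            /* never call through a NULL pointer */
        int err = dev->init(dev);

        if (err != 0)
            return err;                 /* the driver vetoed the registration */
    }
    return 0;
}

/* A hypothetical driver that personalizes the device during registration. */
static int jumbo_init(struct demo_dev *dev)
{
    dev->mtu = 9000;
    return 0;
}
```

A driver satisfied with the defaults leaves init set to NULL; one that needs something extra points it at its own function, such as jumbo_init here. To find out which routine actually runs for a given device, you read the code that fills in the structure.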
A set of function pointers grouped into a data structure is often referred to as a virtual function table (VFT). When a VFT is used as the interface between two major subsystems, such as the L3 and L4 protocol layers, or when the VFT is simply exported as an interface to a generic kernel component (set of objects), the number of function pointers in it may swell to include many different pointers that accommodate a wide range of protocols or other features. Each feature may end up using only a few of the many functions provided. You will see an example in Part VI. Of course, if this use of a VFT is taken too far, it becomes cumbersome and a major redesign is needed.
goto Statements
Few C programmers like the goto statement. Without getting into the history of the goto (one of the longest and most famous controversies in computer programming), I’ll summarize some of the reasons the goto is usually discouraged, and why the Linux kernel uses it anyway.
Any piece of code that uses goto can be rewritten without it. The use of goto statements can reduce the readability of the code, and make debugging harder, because at any position following a goto you can no longer derive unequivocally the conditions that led the execution to that point.
Let me make this analogy: given any node in a tree, you know what the path from the root to the node is. But if you add vines that entwine around branches randomly, you do not always have a unique path between the root and the other nodes anymore.
However, because the C language does not provide explicit exceptions (and they are often avoided in other languages as well because of the performance hit and coding complexity), carefully placed goto statements can make it easier to jump to code that handles undesired or peculiar events. In kernel programming, and particularly in networking, such events are very common, so goto becomes a convenient tool.
I must defend the kernel’s use of goto by pointing out that developers have by no means gone wild with it. Even though there are more than 30,000 instances, they are mainly used to handle different return codes within a function, or to jump out of more than one level of nesting.
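A typical use is a function that acquires several resources and must undo the earlier steps when a later one fails. The sketch below uses hypothetical names (conn, conn_create, conn_destroy) and a hypothetical allocator wrapper, demo_alloc, that treats a zero-length request as a failure so that the error paths are easy to exercise.

```c
#include <assert.h>
#include <stdlib.h>

struct conn {
    char *rx_buf;
    char *tx_buf;
};

/* Hypothetical allocator wrapper: a zero-length request counts as a failure. */
static void *demo_alloc(size_t len)
{
    return len == 0 ? NULL : malloc(len);
}

static struct conn *conn_create(size_t rx_len, size_t tx_len)
{
    struct conn *c = malloc(sizeof(*c));

    if (c == NULL)
        goto err;
    c->rx_buf = demo_alloc(rx_len);
    if (c->rx_buf == NULL)
        goto err_free_conn;
    c->tx_buf = demo_alloc(tx_len);
    if (c->tx_buf == NULL)
        goto err_free_rx;
    return c;                   /* success: nothing to undo */

err_free_rx:                    /* undo in reverse order of acquisition */
    free(c->rx_buf);
err_free_conn:
    free(c);
err:
    return NULL;
}

static void conn_destroy(struct conn *c)
{
    assert(c != NULL);
    free(c->tx_buf);
    free(c->rx_buf);
    free(c);
}
```

Each label releases exactly what had been acquired before the jump, so the cleanup code is written once instead of being repeated in every error branch.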
required, placeholder is just a pointer to the end of the structure; it does not consume any space.
Thus, if abc is used by several pieces of code, each one can use the same basic definition (avoiding the confusion of doing the same thing in slightly different ways) while extending abc differently to personalize its definition according to its needs.
We will see this kind of data structure definition a few times in the book. One example is in Chapter 19.
Conditional Directives (#ifdef and family)
Conditional directives to the compiler are sometimes necessary. An excessive use of them can reduce the readability of the code, but I can state that Linux does not abuse them. They appear for different reasons, but the ones we are interested in are those used to check whether a given feature is supported by the kernel. Configuration tools such as make xconfig determine whether the feature is compiled in, not supported at all, or loadable as a module.
Examples of feature checks by #ifdef or #if defined C preprocessor directives are:
To include or exclude fields from a data structure definition
In this example, the Netfilter debugging feature requires an nf_debug field in the
debugging (a feature needed by only a handful of developers), there is no need to include the field, which would just take up more memory for every network packet.
To include or exclude pieces of code from a function
To select the right prototype for a function
#endif
Note that this case differs from the previous one. In the previous case, the function body lies outside the #ifdef/#endif blocks, whereas in this case, each block contains a complete definition of the function.
The definition or initialization of variables and macros can also use conditional compilation.
It is important to know about the existence of multiple definitions of certain functions or macros, whose selection at compile time is based on a preprocessor macro as in the preceding examples. Otherwise, when you look for a function, variable, or macro definition, you may be looking at the wrong one.
See Chapter 7 for a discussion of how the introduction of special macros has reduced, in some cases, the use of conditional compiler directives.
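The three uses listed above can be put side by side in a short sketch. It is not taken from the kernel: CONFIG_DEMO_DEBUG is a hypothetical option that a build would define (for example with -DCONFIG_DEMO_DEBUG), and demo_pkt, demo_debug_flags, and demo_process are made-up names.

```c
#include <assert.h>
#include <stddef.h>

struct demo_pkt {
    unsigned int len;
#ifdef CONFIG_DEMO_DEBUG
    unsigned int debug_flags;   /* field exists only in debug builds */
#endif
};

/* Two complete definitions; the preprocessor selects one of them. */
#ifdef CONFIG_DEMO_DEBUG
static unsigned int demo_debug_flags(const struct demo_pkt *p)
{
    return p->debug_flags;
}
#else
static unsigned int demo_debug_flags(const struct demo_pkt *p)
{
    (void)p;                    /* nothing to report without the option */
    return 0;
}
#endif

static unsigned int demo_process(struct demo_pkt *p)
{
    assert(p != NULL);
#ifdef CONFIG_DEMO_DEBUG
    p->debug_flags |= 1u;       /* statement compiled in only with the option */
#endif
    return p->len + demo_debug_flags(p);
}
```

When the option is not defined, the structure carries no extra field and demo_process reduces to returning the length. A reader searching for demo_debug_flags will find two definitions and has to know which configuration applies.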
Compile-Time Optimization for Condition Checks
Most of the time, when the kernel compares a variable against some external value to see whether a given condition is met, the result is extremely likely to be predictable. This is pretty common, for example, with code that enforces sanity checks. The kernel uses the likely and unlikely macros, respectively, to wrap comparisons that are likely to return a true (1) or false (0) result. Those macros take advantage of a feature of the gcc compiler that can optimize the compilation of the code based on that
An example of the optimization made possible by the likely and unlikely macros is in handling options in the IP header. The use of IP options is limited to very specific cases, and the kernel can safely assume that most IP packets do not carry IP options. When the kernel forwards an IP packet, it needs to take care of options according to the rules described in Chapter 18. The last stage of forwarding an IP packet is taken care of by ip_forward_finish. This function uses the unlikely macro to wrap the condition that checks whether there is any IP option to take care of. See the section “ip_forward_finish Function” in Chapter 20.
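A user-space sketch of the idea follows, assuming a compiler that provides the gcc builtin __builtin_expect; on other compilers the macros fall back to the plain condition. The structure demo_hdr and the function demo_forward are hypothetical and are not the kernel's ip_forward_finish; the only fact relied on is that an IPv4 header without options is 20 bytes long.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)  /* condition is usually true */
#define unlikely(x) __builtin_expect(!!(x), 0)  /* condition is usually false */
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

#define DEMO_MIN_HDR 20         /* IPv4 header length without options */

struct demo_hdr {
    unsigned int hdr_len;       /* header length in bytes */
};

/* Returns 1 if the slow path (options present) was taken, 0 otherwise. */
static int demo_forward(const struct demo_hdr *h)
{
    assert(h != NULL);
    if (unlikely(h->hdr_len > DEMO_MIN_HDR))
        return 1;               /* rare case: options need processing */
    return 0;                   /* common case: plain header */
}
```

The macros do not change the result of the test; they only give the compiler a hint about which branch to lay out as the fall-through path.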
Mutual Exclusion
Locking is used extensively in the networking code, and you are likely to see it come up as an issue under every topic in this book. Mutual exclusion, locking mechanisms, and synchronization are a general topic, and a highly interesting and complex one, for many types of programming, especially kernel programming. Linux has seen the introduction and optimization of several approaches to mutual exclusion over the years. Thus, this section merely summarizes the locking mechanisms seen in networking code; I refer you to the high-quality, detailed discussions available in O’Reilly’s Understanding the Linux Kernel and Linux Device Drivers.
Each mutual exclusion mechanism is the best choice for particular circumstances. Here is a brief summary of the alternative mutual exclusion approaches you will see often in the networking code:
Spin locks
This is a lock that can be held by only one thread of execution at a time. An attempt to acquire the lock by another thread of execution makes the latter loop until the lock is released. Because of the waste caused by looping, spin locks are used only on multiprocessor systems, and generally are used only when the developer expects the lock to be held for short intervals. Also because of the waste caused to other threads, a thread of execution must not sleep while holding a spin lock.
Read-write spin locks
When the uses of a given lock can be clearly classified as read-only and read-write, the use of read-write spin locks is preferred. The difference between spin locks and read-write spin locks is that in the latter, multiple readers can hold the lock at the same time. However, only one writer at a time can hold the lock, and no reader can acquire it when it is already held by a writer. Because readers are given higher priority over writers, this type of lock performs well when the number of readers (or the number of read-only lock acquisitions) is a good deal bigger than the number of writers (or the number of read-write lock acquisitions). When the lock is acquired in read-only mode, it cannot be promoted to read-write mode directly: the lock must be released and reacquired in read-write mode.
Read-Copy-Update (RCU)
RCU is one of the latest mechanisms made available in Linux to provide mutual exclusion. It performs quite well under the following specific conditions:
• Read-write lock requests are rare compared to read-only lock requests
• The code that holds the lock is executed atomically and does not sleep
• The data structures protected by the lock are accessed via pointers
The first condition concerns performance, and the other two are at the base of the RCU working principle.