The Linux Process Manager: The Internals of Scheduling, Interrupts and Signals



John O’Gorman, University of Limerick, Limerick, Republic of Ireland


Phone (+44) 1243 779777

E-mail (for orders and customer service enquiries): cs-books@wiley.co.uk

Visit our Home Page on www.wiley.co.uk or www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London, W1P 0LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, England, or e-mailed to permreq@wiley.co.uk, or faxed to (+44) 1243 770620.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Other Wiley Editorial Offices

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA

Wiley-VCH Verlag GmbH, Pappelallee 3, D-69469 Weinheim, Germany

John Wiley & Sons Australia, Ltd, 33 Park Road, Milton, Queensland, 4064, Australia

John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809

John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada, M9W 1L1

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

The Linux kernel source code reproduced in this book is covered by the GNU General Public Licence.

Library of Congress Cataloguing-in-Publication Data

O’Gorman, John, 1945-

The Linux process manager : the internals of scheduling, interrupts and signals / John O’Gorman.

p. cm.

ISBN 0-470-84771-9 (Paper : alk. paper)

1. Linux. 2. Operating systems (Computers) I. Title.

QA76.76.O63 O34354 2003

005.49469 — dc21

British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library

ISBN 0 470 84771 9

Typeset in 10½/12½ pt Sabon by Keytec Typesetting, Bridport, Dorset

Printed and bound in Great Britain by Biddles Ltd., Guildford and Kings Lynn

This book is printed on acid-free paper responsibly manufactured from sustainable forestry,


4 Wait queues

10.4 System call entry

11.10 Auxiliary functions for co-processor error handling

12.5 Hardware interrupts and the interrupt descriptor table

12.9 Functions for reading and writing input–output ports

13 Advanced programmable interrupt controllers

14 The input–output advanced programmable interrupt

23 Process accounting

Linux is growing in popularity. Because it is open source, more and more people are looking into the internals. There are a number of books that undertake to explain the internals of the Linux kernel, but none of them is really satisfactory. The reason for this is that they all attempt to do too much. They set out to give an overview of the whole of the kernel and consequently are unable to go into detail on any part of it. This book deals with only a subset of the kernel code, loosely described as the process manager. It would roughly correspond to the architecture-independent code in the /kernel directory and the machine-specific code in the /arch/i386/kernel directories, as well as their associated header files. It is based on version 2.4.18.

When it comes to architecture-dependent code, it deals only with the i386 or PC version. The main reason for this is size. Even with this restriction, the book is still quite sizeable. Another reason is that most Linux users use PCs – the interest in other versions would be quite small.

Because it deals only with a subset of the code, it can afford to go into it in much greater detail than any other commentary currently available. First of all, given the self-imposed limitation of dealing only with the process manager, it is complete. Every function, every macro, every data structure used by the process manager is dealt with and its role in the overall picture explained. But it also goes into more detail in the sense that it is a line-by-line explanation of this subset of the Linux source code.

It is intended as a reference book. Although it is unlikely to be used as a textbook, I could see it being specified as background reading in an advanced course on operating systems. There is a logical progression through the chapters and within each chapter. Yet it is doubtful if many people would read it from beginning to end. It is far more likely to be a book that one turns to when information is needed on some specific aspect of Linux. The format used is to take one function at a time, give the code for it, explain what it is doing, and show where it fits into the overall picture.

Like any medium-to-large sized piece of software, an operating system has an internal structure that is technically categorised as a graph. This is sometimes dismissed as ‘spaghetti code’. When attempting to describe such a complex structure in an essentially linear medium such as a book, it is necessary to flatten it out. I have chosen to order the material in a way I think makes it most intelligible to the reader.

Although each function is dealt with in isolation, much use is made of cross-references to retain the sense of just where it fits into the overall structure. It is hoped that this will add to the reader’s confidence in the completeness of the book. Even if you are not going to look it up right away, the fact that a reference is provided for each of the subroutines being called gives assurance that the explanation will be there if you do want to cross-check it in the future.

The Linux source code is optimised for efficiency, and in places this can make it very convoluted. No attempt is made to straighten out the code, to make it read better. After all, the raw Linux code is what readers want to understand, and they expect this book to help them in that. So every effort is concentrated in making the explanation of the code as clear as possible.

At the lower levels, there is a significant amount of inline assembler code used in the Linux kernel. The AT&T syntax used for this is quite abstruse, even for a reader familiar with the standard Intel i386 assembler syntax. There is no skimping on the explanation here. When such code is encountered, it is explained line by line, instruction by instruction.

John O’Gorman


His work involved constructing an interpreter for a proprietary language designed for developing and delivering Computer Aided Learning material, which generated Intermediate (what we would now call Byte) Code. John’s research was essentially an exercise in systematic reverse engineering, and demonstrated that it was relatively easy to develop alternative interpreters, which could deliver the same lessons on other hardware, such as PCs.

After lecturing on Computer Science in Maynooth he started research for his Doctorate on Systematic Decompilation Techniques, which he completed in 1991. This was based on the thesis that a non-optimizing code generator for Context Free Languages should probably produce code which itself constituted a sentence of another Context Free Language, which should then be describable by means of a Context Free Grammar, which John called a Decompilation Grammar. The purpose of Decompilation was to discover this grammar, and therefore the code-generation practices of the particular compiler. This latter grammar could be used to regenerate source code semantically equivalent to the original source code.

This kind of research was meat and drink to John. He was interested in what was actually the case rather than what should be the case. Possibly because of his training in philosophy, he distrusted grand theories, and like the great political thinker, Edmund Burke, he gloried in the richness of the specific details of each situation. This meant that he carried out some very sophisticated experiments to check the actual structure of software artifacts rather than work from a preconceived theoretically satisfying structure.

He was not interested in fashions in research, although many of the areas in which he supervised research students subsequently became very fashionable. Yet because of the quality of his teaching, he found it very easy to induce students to undertake postgraduate research, despite the (then) substantial rewards available in industry.

John carried out research to satisfy his own personal curiosity, and published very little of his work. Indeed, to the best of my knowledge he never attended a technical conference outside of Ireland.

This may have been because John led a double life, as a member of the Dominican Order of Preachers and as a member of the Academic Community. He fully lived each life, but kept each life tightly compartmentalized from the other, which must have involved considerable discipline in how he managed his time.

John combined his research and teaching interests by writing and publishing several undergraduate textbooks on operating systems. This book represents the last few years of John’s life reading and analysing the source code of the Linux operating system. It again reflects his passionate interest in how things actually work rather than how they should work.

Tony Cahill

University of Limerick, Republic of Ireland


One omission that may bother some readers is that it does not cover the implementation of any of the system services. The sheer size of the project forced this decision, and maybe the system services are not the worst choice if something has to be left out. Many of them are covered reasonably well in existing books on Linux kernel internals.

Every effort has been made to ensure that the index is as complete as possible. As well as the usual topic entries, each function, macro and data structure discussed is also referenced there. It is envisaged that this index will be the first port of call for many users of the book.

Overview of contents

The introduction lives up to its name. It zooms in from operating systems, through Linux, to introduce the Linux process manager. It introduces the GCC compiler as well as giving the absolute ‘minimum necessary’ on initialisation of the process manager.

Linux represents processes by means of a data structure called a task_struct. This is at the heart of the whole process manager, indeed the whole operating system. It is quite a sizeable structure and takes a whole chapter to describe properly. This gives an indication of the level of detail of the book.

As there are many of these structures in existence at the same time, their organisation and manipulation is the subject matter of Chapter 3.


One fundamental aspect of process management is the ability to put a process to sleep and arrange for it to be woken and run again when some specific event occurs. This, of course, is implemented by manipulating the data structures that represent these processes. The various facilities that the process manager makes available for this are described in Chapter 4, on wait queues.

Moving data structures on and off queues and linked lists as discussed in the previous chapter introduced the vexed question of mutual exclusion. It must be possible to guarantee that such manipulations are atomic and uninterruptible. Linux has a whole range of mechanisms available for this purpose. Those that rely on busy waiting are introduced in Chapter 5, whereas mechanisms that put waiting processes to sleep, such as semaphores, are covered in Chapter 6.

If the task_struct is the most important data structure for the process manager, the most fundamental function must be the scheduler. With the background provided in the previous six chapters, this scheduler can now be introduced, and dissected, in Chapter 7.

These first seven chapters between them describe the process manager in a steady state. Two further aspects must be added to this to get the overall picture. Chapter 8 deals with the rather joyful side: the creation of new processes. But sadly, processes have to terminate sometime. Chapter 9 explains all that is involved in this, including the sending of death notices to their parents, how they become zombies for a while and, finally, how all trace of them is removed from the system.

The next seven chapters form a unit, which might be subtitled ‘Interrupting Linux’. There is a whole range of agents outside the running process that may interrupt it at unpredictable times, and the process manager must be able to handle these interruptions. The need for seven chapters to cover this is due mainly to the complexity of the interrupt hardware on the i386 architecture.

Chapter 10 introduces the topic of interrupts and exceptions and gives an overview of how the process manager handles them. How control transfers from user mode to the kernel is examined in detail, including system call entry.

There are a number of interrupts generated by the i386 processor itself, in response to various conditions encountered in the program it is executing. These are collectively known as exceptions, and the handlers for them are the subject matter of Chapter 11.

The other main source of interrupts is hardware devices attached to the computer. Handlers for these are device-specific and are not covered in this book, but the process manager must provide some generic code for hardware interrupts, and Chapter 12 describes that.

More recent models of the i386 architecture include an advanced programmable interrupt controller (APIC) as part of the CPU (central processing unit) chip. This routes interrupts from external devices to the processor core. The interrupt manager has to interface with this controller, and the various functions provided for that are described in Chapter 13.

On a multiprocessor system, the routing of interrupts between different CPUs is handled by yet another specialist chip, the input–output APIC. Chapter 14 describes how Linux interfaces with this controller.

The hardware timer interrupt is the heartbeat of the whole operating system and, of course, all the handler code for that interrupt is supplied by Linux. This is described in detail in Chapter 15.

All operating systems divide the handling of interrupts into two levels. There is urgent processing that must be done immediately, and there is less urgent processing that can be delayed slightly. In Linux, the latter processing is known collectively as a software interrupt and is the subject matter of Chapter 16.

The signal mechanism is a venerable part of Unix. It takes three chapters to describe all of the code in the Linux kernel that deals with signals. Chapter 17 introduces the data structures involved, including the alternate stack for handling signals. The functions used to post signals to a process are introduced in Chapter 18, as well as the kernel’s role in delivering these signals to the process. The signal handler itself is user-level code, and so to actually run it the kernel must temporarily drop into user mode and then return to kernel mode when it finishes. The description of the mechanics of this takes up the whole of Chapter 19.

The remainder of the book deals with a number of miscellaneous topics that come under the heading of the process manager. Certainly, the code discussed here is in the /kernel directory.

Linux, like other flavours of Unix, is moving from the all-or-nothing privilege-and-protection scheme provided by the superuser mechanism to a more discriminated system of capabilities. This is still rather rudimentary, but the elements of it are described in the short Chapter 20.

A nice feature of Linux is its ability to execute binary programs that were compiled to run under other operating systems on an i386. It recognises executables as having different personalities and emulates the execution domains in which they expect to run. Chapter 21 deals with this aspect of the kernel.

Another long-standing feature of Unix is the ability of one process to trace the execution of another. This is typically used by a debugger. Chapter 22 examines the facilities provided by the kernel to allow this level of interprocess communication.

The BSD version of Unix introduced the process accounting file, and Linux also provides this. Chapter 23 examines when and how records are written to this file.

In order to provide backward compatibility with 16-bit programs written for the 8086, modern 32-bit Intel processors have a special mode of executing in which they emulate an 8086 processor. This is known as vm86 mode. The software provided by the kernel for running in vm86 mode is described in full in Chapter 24.



Introduction

An operating system is a large – very large – piece of software, typically consisting of hundreds of thousands of lines of source code. It is questionable if any one person can understand all the ramifications of such a large construct, so we apply a divide-and-conquer approach. This involves breaking it down into smaller pieces and attempting to understand one piece at a time.

It is generally accepted that in order to write a large software system with any hope of success, there must be some overall design structure. This design structure can also be used when trying to understand and explain the system after it is built. Some designers suggest a very rigid layered structure in which modules in one layer only ever interface with the layers immediately above or below them. In reality, it is difficult if not impossible to construct systems in this way, so most tend to end up without such clearcut distinctions. Linux is no exception.

Another possible approach is to divide the operating system up on the basis of functionality. This means identifying the basic functions of an operating system and separating out the code that implements each of these functions. A very coarse-grained division would be to divide it into a number of different managers. These would include the process manager, memory manager, input–output (I/O) manager, file manager, and network manager. This latter approach fits in well with the structure of Linux. In fact, the source code tree has top-level directories entitled kernel, mm, drivers, fs, and net. So there is at least a suggestion that we are on the right track when taking this approach to breaking up the operating system for ease of handling. Figure 1.1 shows how the process manager fits into the overall architecture of a Linux system.

Figure 1.1 Outline architecture of a Linux system (layers shown, from top to bottom: C library and Pthreads library; system services; process manager, memory manager and input–output manager; filesystems and network)


1.1 Overview of the process manager

A process is the unit of work in a computer system. As a first cut, a process can be described as ‘a program in action’. When the instructions that make up a program are actually being carried out, there is a process.

The process manager is the part of the operating system that is responsible for handling processes, and there is quite a lot involved in that, as can be seen from the size of the book, or from the brief résumé given here.

It must keep track of which processes actually exist in the system. There is a significant amount of information that needs to be recorded about each process or job: not only obvious, if trivial, items such as its name and owner but also such matters as the areas of memory that have been allocated to it, the files and other I/O streams it has open and the state of the process at any given time.

Sometimes a process is running; other times it is ready and able to run but is waiting its turn in a queue; other times again, it is not in a position to run. This can be because it is waiting for some resources it needs, such as data from a disk drive. The process manager has to record the fact that it is waiting, and exactly what it is waiting for. Then when the resource becomes available, it knows which of the many processes to wake up.
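The bookkeeping just described can be sketched as a small table of process records. This is only an illustration under stated assumptions: the names proc_state, proc_record, proc_wait_on, wake_waiters and NO_RESOURCE are hypothetical, and the record is far simpler than the kernel's own task_struct.

```c
#include <stddef.h>

/* Hypothetical states, mirroring the three situations described in the text. */
enum proc_state {
    PROC_RUNNING,   /* currently executing on a CPU */
    PROC_READY,     /* able to run, waiting its turn in a queue */
    PROC_WAITING    /* cannot run until some resource becomes available */
};

#define NO_RESOURCE (-1)

/* Hypothetical per-process record; NOT the kernel's task_struct. */
struct proc_record {
    int pid;
    const char *name;
    int owner_uid;
    enum proc_state state;
    int waiting_for;            /* resource id; only meaningful when waiting */
};

/* Record the fact that a process is waiting, and what it is waiting for. */
static void proc_wait_on(struct proc_record *p, int resource)
{
    p->state = PROC_WAITING;
    p->waiting_for = resource;
}

/* A resource has become available: make every process waiting on it ready.
 * Returns the number of processes woken. */
static size_t wake_waiters(struct proc_record *table, size_t n, int resource)
{
    size_t woken = 0;

    for (size_t i = 0; i < n; i++) {
        if (table[i].state == PROC_WAITING &&
            table[i].waiting_for == resource) {
            table[i].state = PROC_READY;
            table[i].waiting_for = NO_RESOURCE;
            woken++;
        }
    }
    return woken;
}
```

A linear scan is used here purely for brevity; it says nothing about how the kernel organises its own waiting processes, which is the subject of the chapter on wait queues.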

Recording this information is obviously very important; but we are still only talking about passive data structures. The active part of the process manager is called the scheduler. This is the code that shares the central processing unit (CPU) among the many contending processes, deciding when to take a CPU from a process, and which process to give it to next.

The concept of an interrupt is fundamental to any modern computer system. Numerous events occur at unpredictable times within the running system, which need to be handled immediately. The CPU suspends whatever it is doing, switches control to a block of code specially tailored to deal with the specific event, and then returns seamlessly to take up whatever it was doing beforehand. The whole procedure is known as interrupt handling and is the responsibility of the process manager.

Interrupts originate from many sources, and the most important of these is the timer interrupt, the heartbeat of the whole system. Closely allied to this is the timing of other events and maintaining the time of day.

As well as its own internal work, the process manager also provides services to user processes, on request. Foremost among these is the facility to create new processes and to terminate them when no longer required by the user. Other services allow processes to communicate among themselves, such as signals, or System V IPC.

When more than one process is running on the one machine there is a need to control process access to the various resources available. Requests for some services may be legitimate from one process but not from another. There is also a need to record what resources any particular process has used. All of this falls on the process manager.

Within the Unix world, quite a number of different flavours have evolved over the years. Linux attempts to cover as many of these as possible. The process manager attempts to support processes executing programs that were compiled on one of a range of other Unix systems. Specific to i386 Linux is the ability to run programs written for earlier 16-bit processors.

A final service, typically required by a debugger, is the ability for one process to control another completely, maybe executing it one instruction at a time, and allowing the program variables to be inspected between each instruction. This is known as tracing a process.


1.1.1 Operating system processes

The operating system itself is one large job, and it would seem reasonable for it to be a process in its own right. In fact, an operating system is frequently designed as a set of cooperating processes. For example, each of the managers identified above might be a process in its own right. This means that the process manager we are considering here could be a process itself and be managed by itself. Although this sounds confusing, it can work in practice.

With some operating systems, each source of asynchronous events is given a process of its own. Such an asynchronous event is anything that can happen independently of the running program, and so at unpredictable times. This generally results in a process being dedicated to each hardware device.

For example, there would be a process created for the network device, which would spend most of its time inactive, asleep. But when a message arrives over the network, this operating system process is there waiting to respond. It is woken up (by an interrupt), deals with the network device, passes the message on to the appropriate user process, and goes to sleep again. The user process cannot be expected to do this. It cannot know in advance when a message is going to arrive, and it cannot just sit there waiting for a message – it has to get on with its own processing.

Unix, and hence Linux, has traditionally taken a different approach from the one outlined above. It has very few system processes. Most of the code in the Unix kernel is executed as part of the current user’s process, or in the context of the user’s process. But while executing kernel code the process has extra privileges – it has access to any internal data structures it requires in the operating system, and it can execute privileged instructions on the CPU.

All modern CPUs operate in at least two different modes. In one of these, called user mode, the CPU can execute only a subset of its instructions – the more common ones, such as add, subtract, load and store. In the other mode, called kernel mode, the CPU can execute all of its instructions, including extra privileged instructions. These typically access special registers that control protection on the machine. Normally, the machine runs in user mode. When it wants to do something special, it has to change into kernel mode.

Obviously, this changing between modes is very dependent on the underlying hardware, but, in general, there is a special machine instruction provided that both changes the CPU to privileged mode and transfers control to a fixed location. Any user process can call this instruction, but it cannot decide where it goes after that. It cannot execute its own instructions with extra privilege; as it is forced to jump to a fixed location, it can execute only those instructions that the designer has placed at that location. In this way, a user process can be allowed strictly controlled access to the kernel of the operating system.
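From the user side, this controlled entry can be seen with the C library's generic syscall() wrapper. A minimal sketch, assuming a Linux system with glibc: the process supplies only a request number (here SYS_getpid) and cannot choose which kernel code runs. The helper name pid_via_raw_syscall is hypothetical.

```c
#include <sys/syscall.h>   /* SYS_getpid: the request number */
#include <sys/types.h>
#include <unistd.h>        /* syscall(), getpid() */

/* Ask the kernel for our process ID through the generic entry point.
 * The wrapper executes the architecture's mode-switching instruction;
 * the kernel, not the caller, decides what code handles the request. */
static long pid_via_raw_syscall(void)
{
    return syscall(SYS_getpid);
}
```

The value returned should match what the ordinary getpid() library call reports, since both end up making the same request of the kernel.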


uniprocessor are no longer valid in the multiprocessor case. The process manager has to be explicitly involved in taking out and releasing locks on various parts of the kernel.

Two different approaches are possible when designing an operating system for a multiprocessor computer. One CPU could be dedicated to running the operating system, leaving the others for user processes. This is known as asymmetric multiprocessing. It is a nice clean design, but in most cases it would probably result in the CPU dedicated to the operating system being idle for some or much of the time.

Another possibility is to treat all CPUs equally, so that when operating system code needs to execute, it uses whichever CPU is available. This is known as symmetric multiprocessing (SMP). It improves overall utilisation of the CPUs, although it does result in some cache inefficiency if the operating system is continually migrating from one CPU to another.

For a system such as Unix, where operating system code does not run in a process of its own but in the context of the calling process, the choice is fairly heavily weighted in favour of SMP, so it is no surprise that this is the way multiprocessors are handled in Linux. But even after settling for the SMP strategy, there are two ways of implementing it. It would be possible to produce a generic kernel, which would handle both the uniprocessor and multiprocessor cases. Decisions on which code branches to execute are then taken at runtime.

This decision could also be taken at compile time, and in fact that is how it is done with Linux. So, effectively two different kernels can be produced. The SMP version is larger, but it has the ability to control a multiprocessor computer. The uniprocessor version omits all of the SMP code, so it is smaller. In line with our stated aim of completeness, this book will consider not just the uniprocessor code but all the SMP code as well.
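The compile-time choice can be sketched with the preprocessor. This is an illustration only: CONFIG_SMP_SKETCH is a hypothetical stand-in for a build option, and sketch_lock_t, sketch_lock(), sketch_unlock() and increment_counter() are invented names, not kernel interfaces. When the option is defined, a busy-waiting lock built on C11 atomics is compiled in; when it is not, the lock operations compile to nothing.

```c
#ifdef CONFIG_SMP_SKETCH
/* Multiprocessor build: a real busy-waiting lock is compiled in. */
#include <stdatomic.h>

typedef struct { atomic_flag held; } sketch_lock_t;
#define SKETCH_LOCK_INIT { ATOMIC_FLAG_INIT }

static inline void sketch_lock(sketch_lock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;   /* spin until the current holder releases the lock */
}

static inline void sketch_unlock(sketch_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
#else
/* Uniprocessor build: the locking code is omitted entirely. */
typedef struct { int unused; } sketch_lock_t;
#define SKETCH_LOCK_INIT { 0 }

static inline void sketch_lock(sketch_lock_t *l)   { (void)l; }
static inline void sketch_unlock(sketch_lock_t *l) { (void)l; }
#endif

static int counter;
static sketch_lock_t counter_lock = SKETCH_LOCK_INIT;

/* The calling code is identical in both builds; only the lock differs. */
static int increment_counter(void)
{
    sketch_lock(&counter_lock);
    int value = ++counter;
    sketch_unlock(&counter_lock);
    return value;
}
```

Building the same file with and without -DCONFIG_SMP_SKETCH yields two different binaries from one source, which is the essence of the approach described above.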

1.1.3 Threads

Sometimes a process is running on the processor; at other times it is idle, for one reason or another. The overhead involved in moving from running to idle can be considerable. It happens frequently that a process begins running and almost immediately stops again to wait for some input to become available. The operating system has to save the whole state of the machine, as it was at the moment when the process stopped running. There is a similar overhead involved when a process begins running again. All the saved state has to be restored and the machine set up exactly as it was when the process last ran. And this overhead is on the increase, as the number and size of CPU registers grows and as operating systems become more complex, so requiring ever more state to be remembered.

This has led to a distinction being made between the unit of resource allocation and the unit of execution. Traditionally, these have been the same. One process involved one path of execution through one program, using one block of allocated resources, especially memory. Now the trend is to have one unit of resource allocation, one executable program, with many paths of execution through it at the same time. The terminology that is evolving tends to refer to the old system as heavyweight processes. The new styles are called lightweight processes, or threads. But there is no consistency. A thread has access to all the resources assigned to the process. It can be defined as one sequential flow of control within a program. A process begins life with a single thread, but it can then go on to create further threads of control within itself.

Now that one process can have many threads of control, if one thread is blocked, another can execute. It is not necessary to save and restore the full state of the machine for this, as it is using the same memory, program, files, devices – it is just jumping to another location in the program code. But each thread must maintain some state information of its own, for example the program counter, stack pointer, and general purpose registers. This is so that when it regains control, it may continue from the point it was at before it lost control.

Sometimes a thread package is implemented as a set of library routines, running entirely at the user level. This saves on the overhead of involving the kernel each time control changes from one thread to another. The kernel gives control of the CPU to a process. The program that process is executing calls library functions that divide up the time among a number of threads.

This approach has the serious drawback that if one thread calls a system service such as read(), and it has to wait for a key to be pressed, the kernel will block the whole process and give the CPU to another process. The kernel just does not know about threads. It might be viewed as a cheap way to implement threads on top of an existing system.
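A user-level thread switch of this kind can be sketched with the ucontext routines that glibc provides (getcontext(), makecontext() and swapcontext()). This is an assumption-laden illustration rather than a real thread package: the two "threads", their fixed-size stacks, the trace buffer and the function run_two_threads() are all hypothetical. Each context holds its own program counter, stack pointer and registers, and the kernel is never told that a switch has taken place.

```c
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, thread_ctx[2];
static char stacks[2][STACK_SIZE];      /* one private stack per thread */
static char trace[16];                  /* records the order of execution */
static int trace_len;

static void record(char c)
{
    if (trace_len < (int)sizeof trace - 1)
        trace[trace_len++] = c;
}

static void thread_a(void)
{
    record('a');
    swapcontext(&thread_ctx[0], &thread_ctx[1]);   /* yield to thread B */
    record('A');
}   /* returning resumes the context named in uc_link */

static void thread_b(void)
{
    record('b');
    swapcontext(&thread_ctx[1], &thread_ctx[0]);   /* yield back to thread A */
    record('B');
}

/* Run both threads to completion and return the recorded trace. */
static const char *run_two_threads(void)
{
    trace_len = 0;
    for (int i = 0; i < 2; i++) {
        getcontext(&thread_ctx[i]);
        thread_ctx[i].uc_stack.ss_sp = stacks[i];
        thread_ctx[i].uc_stack.ss_size = STACK_SIZE;
        thread_ctx[i].uc_link = &main_ctx;   /* where to go when it returns */
    }
    makecontext(&thread_ctx[0], thread_a, 0);
    makecontext(&thread_ctx[1], thread_b, 0);

    swapcontext(&main_ctx, &thread_ctx[0]);  /* start A; A and B interleave */
    swapcontext(&main_ctx, &thread_ctx[1]);  /* let B run to completion */

    trace[trace_len] = '\0';
    return trace;
}
```

The sketch also exhibits the drawback noted above: if thread_a() made a blocking call such as read(), thread_b() could not run in the meantime, because the kernel sees only a single process.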

The other possibility is for threads to be implemented in the kernel itself, which then handles all the switching between threads. This provides all the benefits of multi-threading. Obviously, if threads are to be provided by the operating system, the responsibility for this will fall on the process manager, but it implies such a radical rewrite of the process manager and other parts of the kernel as well that it is normally implemented only in new operating systems.

Linux has a unique way of providing threads within the kernel. It really creates a new process, but it specifies that the original process and new process are to share the same memory, program, and open files. So it is essentially a thread in its parent process. Each such thread has its own ID number.

There is very little extra code in the process manager to deal with threads. Certainly, there is no suggestion of a thread manager, but the concept of a thread group is introduced. This means that all the processes that represent a group of threads sharing the same code and data are linked together on a list, which is headed from their parent process.

1.2 The GCC compiler

The Linux kernel is written to be compiled by the GCC compiler. There are a number of features of this compiler that impinge on how the code is written. Such features come up frequently throughout the book, so rather than repeating them each time they are gathered together here in the introduction.


be overridden by specifying that the parameters are to be passed in registers. The FASTCALL macro is defined in <linux/kernel.h> as:

This means that the function x() has the regparm(3) attribute: that is, the compiler will pass it a maximum of three integer parameters in the EAX, EDX, and ECX registers, instead of on the stack, and the function itself will be compiled so as to expect them there. Although this has the drawback that the number and size of the parameters are limited, it does have the advantage of being more efficient. The called function does not have to load its parameters into registers from the stack – they are there already. It is particularly efficient when the function is called from an assembly language routine that already has the parameters in the appropriate registers. The FASTCALL macro is specific to the i386.
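The register-passing idea can be tried out in user space. The following is only a sketch under stated assumptions: DEMO_REGPARM3, DEMO_FASTCALL and sum3() are hypothetical names invented for this illustration, not the kernel's definitions, and because the regparm attribute exists only on 32-bit x86, the macro expands to nothing on every other target.

```c
#include <assert.h>

/* Assumption: regparm is honoured only by GCC-style compilers on 32-bit x86,
 * so on any other target the attribute is simply left out. */
#if defined(__GNUC__) && defined(__i386__)
#define DEMO_REGPARM3 __attribute__((regparm(3)))
#else
#define DEMO_REGPARM3
#endif

/* Hypothetical wrapper written in the style of FASTCALL(x). */
#define DEMO_FASTCALL(x) x DEMO_REGPARM3

/* Declaration: on i386 the three arguments are expected in EAX, EDX, ECX. */
int DEMO_FASTCALL(sum3(int a, int b, int c));

/* The definition carries the same attribute, so caller and callee agree. */
DEMO_REGPARM3 int sum3(int a, int b, int c)
{
    return a + b + c;
}
```

A call such as sum3(1, 2, 3) looks identical to an ordinary call in the C source; only the generated parameter-passing code differs.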

1.2.1.2 Suggest branch directions for compiler

If a compiler can know in advance which branch of an if statement is more likely to be taken, then when laying out the code it can give a tiny margin of advantage to one rather than the other. The macros likely() and unlikely() have been provided to facilitate this (see Figure 1.2, from <linux/compiler.h>). An example of its use would be, instead of writing if (x == y), to write if (likely(x == y)).

9 #if __GNUC__ == 2 && __GNUC_MINOR__ < 96

12

Figure 1.2 Suggest branch direction for compiler

9–11 since version 2.96 of GCC, there is a built-in function called __builtin_expect(). This is not available for earlier compilers, so this macro substitutes a dummy for it.

13–14 these lines pass the appropriate second parameter to __builtin_expect(). In the likely() case this is TRUE; for unlikely() it is FALSE. In all cases x is a boolean expression.
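As a rough user-space illustration of the same idea, the sketch below wraps __builtin_expect() in two macros. The names my_likely(), my_unlikely() and count_matches() are invented for this example, the !!(x) normalisation is an addition that is not described in the text, and the fallback branch stands in for the dummy substitution mentioned above.

```c
#include <assert.h>

/* Hypothetical names, chosen so as not to clash with any real header. */
#if defined(__GNUC__)
#define my_likely(x)   __builtin_expect(!!(x), 1)
#define my_unlikely(x) __builtin_expect(!!(x), 0)
#else
/* Dummy fallback for compilers without the built-in. */
#define my_likely(x)   (!!(x))
#define my_unlikely(x) (!!(x))
#endif

/* Counts the elements of v equal to key; a match is hinted to be rare. */
int count_matches(const int *v, int n, int key)
{
    int i, hits = 0;

    assert(v != 0 || n == 0);
    for (i = 0; i < n; i++) {
        if (my_unlikely(v[i] == key))
            hits++;
    }
    return hits;
}
```

The hint never changes the result of the test; it only influences how the compiler lays out the two branches.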

1.2.1.3 do { } while(0)

Macros are widely used in the Linux kernel code. Although some are only ‘one-liners’, others are fairly lengthy and may include their own if statements within them.

Remember, macros can be inserted anywhere in C code, even inside an if else statement. In that case, the if in the macro, and the if else in the main code can confuse the compiler, leading to syntax errors, or incorrect nesting of if else constructs.

To solve this problem, the definition of many macros is wrapped by a do { } while(0) construct. As it stands, it is saying ‘do this once’, which is just what is required. Also, the compiler will see this and will not generate any looping code. So why is it there at all?

The do { } while(0) construct is a way of instructing the compiler that the macro is one block of code and is to be treated as such, no matter where it is inserted in the main code.
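A small example may make the nesting problem concrete. The macros CLAMP_NEG_BAD and CLAMP_NEG and the function clamp_or_flag() below are hypothetical, written only for this illustration. With the bare-if version, the else in the caller would attach to the if hidden inside the macro; the wrapped version behaves as a single statement.

```c
#include <assert.h>

/* Hypothetical macros, for illustration only. */

/* Unsafe form: a bare if. Placed before an else, it captures that else.
 * It is defined here for comparison and deliberately left unused. */
#define CLAMP_NEG_BAD(x)  if ((x) < 0) (x) = 0

/* Safe form: the body is one statement, ended by the caller's semicolon. */
#define CLAMP_NEG(x)  do { if ((x) < 0) (x) = 0; } while (0)

/* Returns v with negative values clamped to 0 when enabled, otherwise -1. */
int clamp_or_flag(int enabled, int v)
{
    if (enabled)
        CLAMP_NEG(v);
    else
        v = -1;     /* pairs with if (enabled), as intended */
    return v;
}
```

Substituting CLAMP_NEG_BAD in clamp_or_flag() would still compile, but the else would then belong to the inner test, so clamp_or_flag(1, 5) would return -1 instead of 5.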

1.2.2 Assembly language features

There is a small amount of assembly language code included in the kernel, and some generic features of it that are not specific to operating systems are described here.

1.2.2.1 Aligning code and data

For efficiency of access to main memory, and to cache, it is frequently desirable to have machine code aligned on even boundaries. A number of macros are provided for this (see Figure 1.3, from <linux/linkage.h>).

40 #if defined(__i386__) && defined(CONFIG_X86_ALIGNMENT_16)

42 #define __ALIGN_STR ".align 16,0x90"

45 #define __ALIGN_STR ".align 4,0x90"

Figure 1.3 Alignment macros

40–42 this is the version used for 16-byte alignment, when optimizing use of the cache.

41 this tells the assembler to align the next machine instruction on a 16-byte boundary and to fill any padding bytes with 0x90, the NOP instruction.

42 this is the string version of the same macro, for use with inline assembler (see Subsection 1.2.2.3).

44–45 these are the 4-byte versions of the macros, used when optimising access to main memory.

53–54 in both cases, ALIGN is an alias for __ALIGN, and ALIGN_STR is an alias for __ALIGN_STR.

1.2.2.2 Visibility

When mixing C code with assembler, it is a common requirement that C identifiers should be visible to the assembler and also that the C code be able to jump to specific routines within the assembler. The macros shown in Figure 1.4, from <linux/linkage.h>, facilitate this.


21 #define SYMBOL_NAME(X) X

22

Figure 1.4 Macros for mixing C and assembler code

21 this macro is used in assembler code to make the C identifier X visible to the assembler.

23 this macro converts the C identifier X into a label in assembler code. Note the colon (:) at the end of the line.

56–59 this macro sets up a label name: in assembly code, that can be called from C code.

57 declare name as a global symbol.

59 use the macro from line 23 to create a label in the assembly code.

1.2.2.3 Inline assembler

A very small part of Linux is written in pure assembly language code; the most obvious example for the process manager is the file arch/i386/kernel/entry.S, dealt with in Chapter 10. However, frequently in the middle of a unit of C code there is a requirement to carry out some operation that can only be done in assembler. The most common example is to read or write a specific hardware register. For this, inline assembler is used. This is done using the asm() built-in function.

The parameter to this function is a string representing the assembler mnemonic, in AT&T syntax. A very simple example would be asm("nop"). For more complex situations, it is passed a concatenation of strings, each representing one or more assembler instructions, interspersed with formatting instructions such as \t and \n.

It is also possible to pass instructions to the compiler about the location of each of the operands used in the assembly code. Each one is described by an operand constraint string, followed by the C expression representing that operand. A colon separates the assembler template from the first output operand, and another separates the last output operand from the first input operand. An operand constraint string specifies whether the operand is read, written, or both. It also specifies the location of the operand (e.g. in a register, or in memory). Some machine instructions have side-effects, altering specific registers or memory in an unspecified way. The compiler can be warned of this by information placed after a third colon.

The GCC compiler reads and interprets only whatever comes after the first colon. It uses this information to generate assembler code that puts the operands in the correct registers beforehand, and it heeds warnings about not caching memory values in registers across the group of assembler instructions in the template. The actual assembler template itself it passes on directly to the gas assembler. Any errors in that string will only be picked up at that stage.


Each time such a portion of inline assembly code is encountered in the book, it is described in some detail. The outline given here will only become fully clear when read again in such a context.
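In the meantime, the three colon-separated parts described above can be seen in a minimal user-space sketch. It is not taken from the kernel: add_asm() is a hypothetical function, the inline assembler path is assumed to be available only on x86 with a GCC-style compiler, and plain C is used as a fallback everywhere else.

```c
#include <assert.h>

/* Adds b to a. On x86 the addition is a single AT&T-syntax instruction;
 * on any other target an ordinary C statement is used instead. */
int add_asm(int a, int b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("addl %1, %0"   /* template: %0 = %0 + %1 */
            : "+r" (a)      /* output: read and written, in a register */
            : "r" (b)       /* input: read only, in a register */
            : "cc");        /* side-effect: condition codes are altered */
#else
    a += b;                 /* portable fallback */
#endif
    return a;
}
```

The compiler picks the registers and substitutes them for %0 and %1; the template string itself is handed to the assembler unchanged.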

1.3 Initialising the process manager

The material in the remainder of this chapter is difficult to place. Although it logically belongs here at the beginning, it is dealing with the initialisation of data structures and subsystems that have not yet been encountered, so it is most likely to be read in conjunction with later parts of the book, when checking how particular structures got their original values.

When a CPU is powered up, it starts with defined values in all of its registers, including the program counter, so it always looks for its first instruction at a predefined location in memory. A system designer will arrange that some instructions be available in a ROM at that location.

The startup instructions in ROM will generally test the hardware, possibly including identifying peripherals. They then try to read the boot sector of whichever disk drive is configured as the boot device. Typically, this boot block will contain only sufficient code to identify where exactly on the disk the operating system is located.

The ‘gory details’ of how Linux is booted are very architecture-dependent. They are certainly not the subject matter of this book. In general, there would be machine-specific code involved in the first stages of booting any operating system. This is sometimes known as the basic input–output system (BIOS). Then there would be architecture-specific Linux code, which initialises the hardware, including memory. Eventually, this transfers control to the architecture-independent routine start_kernel() in init/main.c.

1.3.1 Starting the kernel

Even the start_kernel() function is not really our concern here. It is relevant, however, because it calls a number of other functions that are responsible for starting different parts of the process manager.

Figure 1.5, extracted from init/main.c, shows the relevant lines. Each of these functions is discussed further on in the book, as indicated in the comments.

Figure 1.5 Starting the process manager

551 this function, to be described in Section 5.5, takes out a lock on the whole kernel while all of this initialisation is going on. It will be released at line 533 of rest_init() (see Section 1.3.3).


556 this function sets up the table of interrupt handlers (see Section 10.2.2).

557 the default handlers for the hardware interrupts are set up by this function (see Section 12.5.1)

558 this function is described in Section 1.3.2. It is somewhat misnamed, as the only thing it has to do with scheduling is to initialise a hash table for process structures, but it also initialises the timer subsystem, as well as doing some work for the memory manager.

559 this function sets up the software interrupt mechanism (see Section 16.1.2)

596 this function (see Section 8.1.2) calculates and records the maximum number of processes allowed in the system at any one time, based on the amount of memory available.

620 this function (see Section 1.3.3) does some miscellaneous initialisation. It creates the init process and then metamorphoses into the idle process.

1.3.2 Scheduler initialisation

One of the functions discussed in Section 1.3.1 was sched_init(), as shown in Figure 1.6, from kernel/sched.c. It is called from main.c, line 536, and it initialises scheduling.

1304 void __init sched_init(void)

Figure 1.6 Function to initialize the scheduler

1310 this finds the ID of the current processor, in this case the boot processor. The smp_processor_id() macro is architecture-specific (see Section 7.2.1.4).

1313 this puts the ID value into the processor field of the task_struct for init_task. This structure has already been statically set up by the compiler (see Section 3.3.2).


1315–1316 the hash table, as described in Section 3.2, is initialised to NULL values.

1318 the init_timervecs() function, which sets up the internal timer subsystem, will be described in Section 15.3.1.3.

1320–1322 finally, the bottom halves of the interrupt handlers are set up. Most operating systems divide the processing of an interrupt into two parts. There is some urgent work that must be done immediately, and often there is less urgent work that can be done at a later time. For example, when a network card interrupts, it is essential to get the data off the card and into memory as soon as possible. This is so that the card is ready to receive the next message coming over the network. Once this has been done, the operating system can deliver the message to the appropriate user in its own time. To handle this situation, Linux uses ‘bottom halves’ (see Section 16.3 for full details). The init_bh() function is architecture-specific (see Section 16.3.1.3).

1327–1328 these two lines relate to the memory management fields of init_task and are not relevant in the context of this book.
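The division of interrupt handling into an urgent part and a deferred part, as outlined for lines 1320–1322, can be modelled in a few lines of user-space C. This is a simplified sketch and not the kernel mechanism: every name beginning with demo_ is hypothetical, and the fixed-size table and pending bitmask are assumptions made only for the illustration.

```c
#include <assert.h>

#define DEMO_NR_BH 8                       /* illustrative table size */

static void (*demo_bh[DEMO_NR_BH])(void);  /* registered deferred handlers */
static unsigned long demo_pending;         /* one bit per handler awaiting a run */
static int demo_delivered;                 /* work done by the sample handler */

/* Register the deferred part of a handler in slot nr. */
void demo_init_bh(int nr, void (*fn)(void))
{
    assert(nr >= 0 && nr < DEMO_NR_BH);
    demo_bh[nr] = fn;
}

/* Urgent part: do the minimum, then flag the deferred part as pending. */
void demo_top_half(int nr)
{
    assert(nr >= 0 && nr < DEMO_NR_BH);
    demo_pending |= 1UL << nr;
}

/* Later, at a convenient time: run and clear every pending handler.
 * Returns the number of handlers that were run. */
int demo_run_pending(void)
{
    int nr, ran = 0;

    for (nr = 0; nr < DEMO_NR_BH; nr++) {
        if ((demo_pending & (1UL << nr)) && demo_bh[nr]) {
            demo_pending &= ~(1UL << nr);
            demo_bh[nr]();
            ran++;
        }
    }
    return ran;
}

/* Sample deferred handler: pretend to deliver one network message. */
void demo_net_bh(void) { demo_delivered++; }

int demo_delivered_count(void) { return demo_delivered; }
```

Registering demo_net_bh in a slot, calling demo_top_half() for that slot, and only later calling demo_run_pending() mirrors the network card example: the message is delivered in the deferred pass, not at interrupt time.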

1.3.3 Starting init and the idle process

The remainder of the initialisation has been broken out into a separate function, rest_init(), for memory-management reasons, which need not concern us (see Figure 1.7, from init/main.c).

Figure 1.7 Setting up init and the idle process

532 this architecture-specific function is described in Section 8.5. It creates a new thread and arranges for it to run the init() function. This starts the background processes running, including login processes on each connected terminal. The original process continues at the next line.

533 the startup process had exclusive control of the kernel up to this point; it now relinquishes that control (see Section 5.5 for the function).

534 by setting this flag in the task_struct, a process signals to the scheduler that it is willing to give up the CPU.

535 this function, described in Section 3.4.1, goes into an infinite loop. The process executing the initialisation code (process 0) thus becomes the idle process, but, because it has set its need_resched flag at line 534, the scheduler can preempt it at any stage.



in <linux/sched.h>. One task_struct is allocated per process. It contains all the data needed to control the process, whether it is running or not. The operating system guarantees to keep this information in memory at all times. This is probably the single most important data structure in the Linux kernel, so it will be examined in some detail in this chapter. There is a sense in which it is a sort of table of contents for the remainder of the book.

Many of the fields in this task_struct are concerned with process management, and so each will be dealt with in its own place in the appropriate chapter; their purpose will become clearer when we later examine what use the kernel makes of them. Other fields do not fall within the scope of this book. Those dealing with memory management or input–output would be obvious examples. Because this is the only place where they will be considered, they are covered in a little more detail.

The fields are certainly not arranged in logical order – they seem to have just grown up where they are. The comments in the source file break this large structure up into a number of more manageable sections, and the same divisions will be followed here.

2.1 Important fields hard coded at the beginning

Figure 2.1 shows the first few fields in the task_struct. These are accessed from assembler routines, such as those in arch/i386/kernel/entry.S, by offset, not by name. This means that no matter what else changes, they must always be at the same relative position at the beginning of the structure.


281 struct task_struct {

Any particular process can be in one of several states, defined as shown in Figure 2.2, from <linux/sched.h>. A positive value means that the process is not running, for some reason.

Figure 2.2 Possible values for the state field

86 TASK_RUNNING is applied both to the currently running process and to any processes that are ready to run. Such processes are maintained on a linked list called the runqueue (see Section 4.8).

87 when in the TASK_INTERRUPTIBLE state, the process is sleeping. It is maintained on a linked list called a wait queue (see Chapter 4), but it can be woken by a signal. Signals are covered in Chapters 17–19.

88 in the TASK_UNINTERRUPTIBLE state, the process is also sleeping on a wait queue, but it is waiting for some event that is likely to occur very soon, so any signals will be held until it is woken.

89 the TASK_ZOMBIE state refers to a process that has terminated but that has not been fully removed from the system as yet. Such a process is known as a zombie in Unix terminology. Each process in Unix has a parent process: the one that created it. When a process terminates, it can pass back information to its parent, usually signifying the reason why it terminated, whether this was normal or abnormal. Such information is stored in the exit_code field of the task_struct, which will be seen in Section 2.3. So, the task_struct of this process continues to exist, after the process has terminated, until the parent collects the status information, using the wait() system service. After that, all trace of the process disappears. Chapter 9 deals with process termination.

90 sometimes the process manager stops a process completely. This is not termination; it is, rather, halting it for an indeterminate time. The two common occasions on which this occurs are when a process loses contact with its controlling terminal, or when its execution is being traced by another process (a debugger), and it is stopped after executing each instruction, so that its state can be examined. Stopping and starting processes is done by means of signals, which will be examined in Chapters 17–19. When a process has been stopped like this, then its state is TASK_STOPPED.

Note that changing the state of a process, and moving it from one queue to another, are not necessarily atomic operations.
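The zombie state described for line 89 can be observed from user space. The sketch below is an illustration rather than kernel code: reap_child() is a hypothetical helper, and it uses the POSIX waitpid() call, a close relative of wait(), to collect the exit status that the terminated child has left behind.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Creates a child that exits with the given code (0..255), then collects
 * its status. Returns the exit code seen by the parent, or -1 on error. */
int reap_child(int code)
{
    int status;
    pid_t pid;

    assert(code >= 0 && code <= 255);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);            /* child: terminate at once */

    /* From the child's exit until waitpid() returns, the child is a zombie. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Once waitpid() has returned, the status has been handed over and the system has no further reason to remember the child.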

2.1.2 Line 286: flags

This is a bit field, recording various items of interest about the status of the process. Individual bits in this field denote different milestones in the lifecycle of a process. These bits are defined as shown in Figure 2.3, from <linux/sched.h>. They are not mutually exclusive; more than one can be set at the same time.

Figure 2.3 Values for the flags field

418 alignment warning messages should be printed. No example of its use has been found anywhere in the code.

420 the process is being created, and so should be left alone until it is fully started. Again, no place has been found in the code where it is actually used, certainly not by fork(), where it might be expected.

421 the process is being shut down, and should also be left alone. This bit is set by the code that implements exit() (see Section 9.1). It is checked by different parts of the I/O manager.


422 the process has never called exec(), so it is still running the same program as its parent. This bit is set by the code that implements fork() (see Section 8.3.2) and is cleared by functions that implement exec(). It only ever seems to be tested when doing BSD style accounting (see Section 23.3.2).

423 the process has used superuser privileges. Note it is not necessarily indicating that the process has superuser privileges now; only that it requested them at some stage. It is set by functions that give such privilege, such as capable(), as well as the older style suser() and fsuser() (see Chapter 20). It is used by BSD accounting. This bit is cleared when a new process is created.

424 this bit indicates that if the process terminates abnormally, it is to produce a core image on disk. It is set by functions that implement exec().

425 the process has been terminated by the arrival of a signal (see Section 18.4.4).

426–428 these bits are set and tested by various routines within the memory manager. They will not be considered any further here.

429 when this bit is set, the process does not generate any further I/O. As this is the province of the I/O manager, it will not be considered further in this book.

431 this bit is cleared when a new process is created (see Section 8.3.2), and is afterwards manipulated only by architecture-specific code. It signifies that the task used the hardware floating point unit (FPU) during the current time-slice.

be considered further
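The way such a bit field is set, cleared and tested can be shown with a short sketch. The names and values below are hypothetical and are not the definitions from <linux/sched.h>; only the behaviour follows the descriptions above, namely that the ‘never called exec()’ bit is set on fork() and cleared on exec(), and that the ‘used superuser privileges’ bit is cleared when a new process is created.

```c
#include <assert.h>

/* Hypothetical flag bits; names and values are illustrative only. */
#define DEMO_EXITING    0x01UL   /* being shut down */
#define DEMO_FORKNOEXEC 0x02UL   /* forked but has not called exec() */
#define DEMO_SUPERPRIV  0x04UL   /* used superuser privileges at some stage */

struct flag_task {
    unsigned long flags;         /* several bits may be set at once */
};

/* What a fork()-like routine might do to the new process. */
void demo_on_fork(struct flag_task *t)
{
    assert(t != 0);
    t->flags |= DEMO_FORKNOEXEC;     /* set one bit */
    t->flags &= ~DEMO_SUPERPRIV;     /* clear another */
}

/* What an exec()-like routine might do. */
void demo_on_exec(struct flag_task *t)
{
    assert(t != 0);
    t->flags &= ~DEMO_FORKNOEXEC;
}

/* Returns 1 if the given bit is set, 0 otherwise. */
int demo_test_flag(const struct flag_task *t, unsigned long bit)
{
    return (t->flags & bit) != 0;
}
```

Because each flag occupies its own bit, setting or clearing one leaves all the others untouched.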

Line 292: exec_domain

Linux can run programs compiled for different versions of UNIX. This pointer is to a struct exec_domain, which identifies the particular UNIX execution domain associated with the process. Note that it is a pointer; the information is not maintained on a process-by-process basis, rather on a systemwide basis. There is a linked list of these set up at boot time, and all processes point to a particular entry on the list. There is no use count, for even when no process is pointing to a particular entry, it still remains in existence, as Linux is still able to support it. Execution domains are dealt with in detail in Chapter 21.


Line 293: need_resched

This flag is set to 1 to inform the scheduler that this process is willing to yield the CPU. Its use will be examined in detail when dealing with the scheduler, in Chapter 7.

Line 294: ptrace

This is a flag field used only when a process is being traced. The possible values are shown in Figure 2.4, from <linux/sched.h>.

Figure 2.4 Values for the ptrace field

436 this bit is set if the ptrace() system service has been called by another process, to trace this process.

437 this process is being traced by another process, but it is only to be interrupted at each system call.

438 this bit notes that the TRAP flag is set in EFLAGS, meaning that the process is to be interrupted after each machine instruction. A process may do this itself, even if not being traced by another.

439 this bit allows a tracing parent to tell whether a child has stopped (in system service code) because of a system call, or because of a SIGTRAP signal, initiated by hardware (see Section 22.5).

440 this bit indicates that tracing can continue even across the exec() of a set UID (user identifier) program. Such a program runs with different privileges from those of the owner of the process.

Process tracing will be dealt with in detail in Chapter 22.

2.1.3 Line 296: lock_depth

When Linux is running on a uniprocessor, once the current process enters the kernel to execute a system service it cannot be preempted until it leaves the kernel again. This design feature guarantees it mutual exclusion on all the kernel data structures. However, on a multiprocessor machine it is possible for processes executing on two or more CPUs to be active in the kernel at the same time. So, it became necessary to introduce a mutual exclusion lock on the kernel. Because it is possible for a process to recursively acquire this lock, the lock_depth field is incremented each time the lock is acquired, and decremented when it is released. So, it is possible to avoid the situation where a process leaves the kernel, still holding the lock. The implementation of this lock is architecture-dependent and will be dealt with in Section 5.5. Note that this field is not the kernel lock itself; it is only an indication of how many times this process has recursively acquired the kernel lock (if at all).
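The role of a per-process depth counter alongside a single shared lock can be modelled as follows. This is a hypothetical single-threaded sketch, not the kernel implementation covered in Section 5.5: the names are invented, a plain integer stands in for the real lock, and starting the depth at 0 is an assumption made for the illustration (the kernel may well use a different base value).

```c
#include <assert.h>

/* Hypothetical user-space model; not the kernel's implementation. */
struct lock_task {
    int lock_depth;             /* times this task currently holds the lock */
};

static int demo_big_lock_held;  /* stands in for the real kernel lock */

/* Acquire: only the outermost acquisition touches the lock itself. */
void demo_lock(struct lock_task *t)
{
    if (t->lock_depth++ == 0)
        demo_big_lock_held = 1;
}

/* Release: only the matching outermost release drops the lock. */
void demo_unlock(struct lock_task *t)
{
    assert(t->lock_depth > 0);
    if (--t->lock_depth == 0)
        demo_big_lock_held = 0;
}

int demo_is_locked(void) { return demo_big_lock_held; }
```

A check that the depth is back at its base value on leaving the kernel is what would catch a process still holding the lock.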


2.2 Scheduling fields

Figure 2.5 shows a group of fields in the task_struct, which are used by the scheduler. Grouping them like this is an attempt to keep all fields that are needed for one of the innermost loops in the scheduler (goodness(); see Section 7.4.2) in a single cache line of 32 bytes. The struct list_head (line 321) consists of two 32-bit pointer fields, and only the first one, the next field, fits in the cache line. But that is the only one needed, as reading this list is always done from the beginning forwards. Not all of the scheduling information is included here – other fields will be encountered later on.

323

Figure 2.5 Fields in the task_struct used by the scheduler

Line 303: counter

The time left (in clock interrupts or ticks) from the quantum for this process is maintained in counter. Various functions modify it. It is decremented on each clock tick while the process is running. When it gets to 0, then the need_resched flag for this process is set. It is checked when a process wakes up; if it is greater than that of the current process, then the need_resched field of the current process is set to 1. All of this is considered in great detail in Chapter 7, on scheduling.
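The interplay between counter and need_resched can be summarised in a small model. It is a sketch only: struct sched_task, demo_tick() and demo_wakeup_check() are hypothetical names, and the real scheduler, covered in Chapter 7, takes more into account than these two fields.

```c
#include <assert.h>

/* Hypothetical model of the two fields; not the kernel's code. */
struct sched_task {
    long counter;        /* ticks left in the quantum */
    int  need_resched;   /* set to 1 to request a reschedule */
};

/* Called once per clock tick for the running task. */
void demo_tick(struct sched_task *cur)
{
    assert(cur != 0);
    if (cur->counter > 0)
        cur->counter--;
    if (cur->counter == 0)
        cur->need_resched = 1;      /* quantum used up */
}

/* Called when woken becomes runnable while cur is running. */
void demo_wakeup_check(const struct sched_task *woken, struct sched_task *cur)
{
    if (woken->counter > cur->counter)
        cur->need_resched = 1;      /* woken task has more quantum left */
}
```

In both cases the flag is only a request; the actual switch happens when the scheduler next runs.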


115 #define SCHED_OTHER 0

Figure 2.6 Possible values for the policy field

115 this denotes the default time-sharing policy.

116 this is first in, first out; such a process has no quantum. It runs until it blocks or is preempted.

117 this is round robin; it runs until it blocks, is preempted, or uses up its quantum.

123 a process may decide to give up the CPU for one pass over the runqueue by the scheduler. In that case, every other process on the runqueue would be given an opportunity to run, before this one runs again. An example would be a real-time process, no matter how high its priority, yielding to an interactive process.

How these flags are used by the scheduler will be discussed in Chapter 7

Line 306: mm

This is a pointer to the data structures used by the memory manager. Figure 2.7 shows the mm_struct, from <linux/sched.h>. Most of these fields are not relevant to the process manager, but it does operate on this structure when creating (fork()) and deleting (exit()) processes. The following brief description of the fields in mm_struct assumes some knowledge of how a memory manager works. It is included here for completeness.

start_data, end_data;

env_end;



224 unsigned long def_flags;

227

Figure 2.7 Memory management control information

205 this is a header for a linked list of structures representing the regions in the address space of the process.

206 the structures representing the regions in a process are also arranged in a tree, for speed of lookup. This field is the root of that tree.

207 this is a pointer to the most recently accessed region.

208 this is a pointer to the physical page table.

209 this is a reference count of the number of different user space threads sharing this memory map.

210 the mm_count field contains the total number of kernel thread references to this structure. All user space threads count as a single reference.

211 the number of regions in the memory map is in map_count. This is the number of entries in the list headed from mmap.

212 to protect the whole linked list headed from mmap, there is a semaphore mmap_sem.

213 the spinlock page_table_lock is used to guarantee mutual exclusion on the physical page table pointed to by pgd.

215 all the mm_struct structures in the system are linked together, through this field.

220 the virtual addresses of the beginning and end of the code segment are in start_code and end_code, respectively. Likewise start_data and end_data are, respectively, the virtual addresses of the beginning and end of the initialised data segment.

221 the address of the beginning of the heap is in start_brk. The address of the highest memory allocated dynamically is in brk. The start of the stack is in start_stack.

222 the address of the beginning and end of the area containing the command line arguments passed to the current program are in arg_start and arg_end, respectively. Likewise env_start and env_end contain, respectively, the address of the beginning and end of the area containing the environment.

223 the number of pages actually resident in memory is in rss. The total amount of virtual memory in bytes is in total_vm, and locked_vm contains the bytes of virtual memory which are locked in
