
File System Forensic Analysis



Copyright

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The author and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

The publisher offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales, which may include electronic versions and/or custom covers and content particular to your business, training goals, marketing focus, and branding interests. For more information, please contact:

U.S. Corporate and Government Sales

Visit us on the Web: www.awprofessional.com

Library of Congress Catalog Number: 2004116962

Copyright © 2005 Pearson Education, Inc.

All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, write to:

Pearson Education, Inc.

Rights and Contracts Department

One Lake Street

Upper Saddle River, NJ 07458


Foreword

Computer forensics is a relatively new field, and over the years it has been called many things: "computer forensics," "digital forensics," and "media analysis," to name a few. It has only been in the past few years that we have begun to recognize that all of our digital devices leave digital breadcrumbs and that these breadcrumbs are valuable evidence in a wide range of inquiries. While criminal justice professionals were some of the first to take an interest in this digital evidence, the intelligence, information security, and civil law fields have enthusiastically adopted this new source of information.

Digital forensics has joined the mainstream. In 2003, the American Society of Crime Laboratory Directors–Laboratory Accreditation Board (ASCLD–LAB) recognized digital evidence as a full-fledged forensic discipline. Along with this acceptance came increased interest in training and education in this field. The Computer Forensic Educator's Working Group (now known as the Digital Forensic Working Group) was formed to assist educators in developing programs in this field. There are now over three dozen colleges and universities that have, or are developing, programs in this field. More join their ranks each month.

I have had the pleasure of working with many law enforcement agencies, training organizations, colleges, and universities to develop digital forensic programs. One of the first questions that I am asked is whether I can recommend a good textbook for their course or courses. There have been many books written about this field. Most take a targeted approach to a particular type of investigation, such as incident response or criminal investigation. Some tend to be how-to manuals for specific tools. It has been hard to find a book that provides a solid technical and process foundation for the field. That is, until now.

This book is the foundational book for file system analysis. It is thorough, complete, and well organized. Brian Carrier has done what needed to be done for this field. This book provides a solid understanding of both the structures that make up different file systems and how these structures work. Carrier has written this book in such a way that readers can use what they know about one file system to learn another. This book will be invaluable as a textbook and as a reference and needs to be on the shelf of every digital forensic practitioner and educator. It will also provide accessible reading for those who want to understand subjects such as data recovery.

When I was first approached about writing this Foreword, I was excited! I have known Brian Carrier for a number of years, and I have always been impressed with his wonderful balance of incredible technical expertise and his ability to clearly explain not just what he knows but, more importantly, what you need to know. Brian's work on Autopsy and The Sleuth Kit (TSK) has demonstrated his command of this field; his name is a household name in the digital forensic community. I have been privileged to work with Brian in his current role at Purdue University, and he is helping to do for the academic community what he did for the commercial sector: he has set a high standard.

So, it is without reservation that I recommend this book to you. It will provide you with a solid foundation in digital media.

Mark M. Pollitt

President, Digital Evidence Professional Services, Inc.

Retired Director of the FBI's Regional Computer Forensic Laboratory Program

Preface

Existing resources describe many of these techniques at a high level, but source code is typically needed to learn the details. My goal for this book is to fill the void and describe how data are stored on disk and where and how digital evidence can be found.

There are two target audiences for this book. One is the experienced investigator who has learned about digital investigations from real cases and analysis tools. The other is someone who is new to the field and interested in learning about the general theory of an investigation and where digital evidence may exist, but who is not yet looking for a book with a tutorial on how to use a specific tool.

The value of the material in this book is that it helps to provide an education rather than training on a specific tool. Consider some of the more formal sciences or engineering disciplines. All undergraduates are required to take a couple of semesters of physics, chemistry, or biology. These courses are not required because the students will be using all the material for the rest of their careers. In fact, software and equipment exist to perform many of the calculations students are forced to memorize. The point of the classes is to provide students with insight about how things work so that they are not constrained by their tools.

The goal of this book is to provide an investigator with an education similar to what Chemistry 101 is to a chemist in a forensics lab. The majority of digital evidence is found on a disk, and knowing how and why the evidence exists can help an investigator to better testify about it. It also will help an investigator find errors and bugs in his analysis tools because he can conduct sanity checks on the tool output.

The recent trends in digital investigations have shown that more education is needed. Forensic labs are being accredited for digital evidence, and there are debates about the required education and certification levels. Numerous universities offer courses and even Master's degrees in computer forensics.

Roadmap

This book is organized into three parts. Part 1 provides the basic foundations, and Parts 2 and 3 provide the technical meat of the book. The book is organized so that we move up the layers of abstraction in a computer. We start by discussing hard disks and then discuss how disks are organized into partitions. After we discuss partitions, we discuss the contents of partitions, which are typically a file system.

Part 1, "Foundations," starts with Chapter 1, "Digital Investigation Foundations," and discusses the approach I take to a digital investigation. The different phases and guidelines are presented so that you know where I use the techniques described in this book. This book does not require that you use the same approach that I do. Chapter 2, "Computer Foundations," provides the computer foundations and describes data structures, data encoding, the boot process, and hard disk technology. Chapter 3, "Hard Disk Data Acquisition," provides the theory and a case study of hard disk acquisition so that we have data to analyze in Parts 2 and 3.

Part 2, "Volume Analysis," of the book is about the analysis of data structures that partition

Trang 9

of the volume analysis techniques, and Chapter 5, "PC-based Partitions," examines the common DOS and Apple partitions Chapter 6, "Server-based Partitions," covers the partitions found in BSD, Sun Solaris, and Itanium-based systems Chapter 7, "Multiple Disk Volumes," covers RAID and volume spanning

Part 3, "File System Analysis," of the book is about the analysis of data structures in a volume that are used to store and retrieve files Chapter 8, "File System Analysis," covers the general theory of file system analysis and defines terminology for the rest of Part 3 Each file system has at least two chapters dedicated to it where the first chapter discusses the basic concepts and investigation techniques and the second chapter includes the data structures and manual analysis of example disk images You have a choice of reading the two chapters in parallel, reading one after the other, or skipping the data structures chapter altogether

The designs of the file systems are very different, so they are described using a general file system model. The general model organizes the data in a file system into one of five categories: file system, content, metadata, file name, and application. This general model is used to describe each of the file systems so that it is easier to compare them.

Chapters 9, "FAT Concepts and Analysis," and 10, "FAT Data Structures," detail the FAT file system, and Chapters 11, "NTFS Concepts," 12, "NTFS Analysis," and 13, "NTFS Data Structures," cover NTFS. Next, we skip to the Unix file systems with Chapters 14, "Ext2 and Ext3 Concepts and Analysis," and 15, "Ext2 and Ext3 Data Structures," on the Linux Ext2 and Ext3 file systems. Lastly, Chapters 16, "UFS1 and UFS2 Concepts and Analysis," and 17, "UFS1 and UFS2 Data Structures," examine UFS1 and UFS2, which are found in FreeBSD, NetBSD, OpenBSD, and Sun Solaris.

After Part 3 of this book, you will know where a file existed on disk and the various data structures that need to be in sync for you to view it. This book does not discuss how to analyze the file's contents.

Scope of Book

Now that you know what is included in this book, I will tell you what is not. This book stops at the file system level and does not look at the application level. Therefore, we do not look at how to analyze various file formats. We also do not look at what files a specific OS or application creates. If you are interested in a step-by-step guide to investigating a Windows 98 computer that has been used to download suspect files, then you will be disappointed with this book. If you want a guide to investigating a compromised Linux server, then you may learn a few tricks in this book, but it is not what you are looking for. Those topics fall into the application analysis realm and require another book to do them justice. If you are interested in more than just a step-by-step guide, then this book is probably for you.

Resources

As I mentioned in the beginning, the target audience of this book is not someone who is new to the field and looking for a book that shows the basic investigation concepts or how to use a specific tool. There are several quality books that are breadth-based, including:

Casey, Eoghan. Digital Evidence and Computer Crime. 2nd ed. London: Academic Press, 2004.

Kruse, Warren, and Jay Heiser. Computer Forensics. Boston: Addison-Wesley, 2002.

Mandia, Kevin, Chris Prosise, and Matt Pepe. Incident Response and Computer Forensics. Emeryville: McGraw-Hill/Osborne, 2003.


Throughout this book, I will be using The Sleuth Kit (TSK) on example disk images so that both the raw data and formatted data can be shown. That is not to say that this is a tutorial on using TSK. To learn only about using TSK, refer to the previous books or the computer forensics chapters in Know Your Enemy, 2nd Edition. The appendix in this book describes TSK and Autopsy (a graphical interface for TSK). TSK and additional documentation can be downloaded from http://www.sleuthkit.org.

The URLs of other tools used throughout the book will be given as needed. Additional resources, links, and corrections will be available from http://www.digital-evidence.org/fsfa/.

Any corrections can be e-mailed to me at fsfa@digital-evidence.org.


Acknowledgments

I would like to thank many people for helping me with digital forensics. First, thanks go out to those who have helped me in general over the years. My appreciation goes to Eoghan Casey, Dave Dittrich, Dan Farmer, Dan Geer, Dan Kalil, Warren Kruse, Gary Palmer, Eugene Spafford, Lance Spitzner, and Wietse Venema for various forms of guidance, knowledge, and opportunities.

I would also like to thank Cory Altheide, Eoghan Casey, Knut Eckstein, and Jim Lyle for reviewing the entire book. Special thanks go to Knut, who went through every hexdump dissection of the example disk images and verified each hexadecimal to decimal conversion (and found several typos), and to Eoghan for reminding me when the content needed more practical applications. Christopher Brown, Simson Garfinkel, Christophe Grenier, Barry Grundy, Gord Hama, Jesse Kornblum, Troy Larson, Mark Menz, Richard Russon, and Chris Sanft all reviewed and improved one or more chapters in their areas of expertise.

Many folks at Addison-Wesley and Pearson helped to make this book possible. Jessica Goldstein guided and encouraged me through the process, Christy Hackerd made sure the editing and production process went smoothly, and Chanda Leary-Coutu provided her marketing expertise. Thanks to Elise Walter for her copyediting, Christal Andry for her proofreading, Eric Schroeder for his indexing, Jake McFarland for his composition work, and Chuti Prasertsith for his cover design work.

Finally, many thanks to my family and especially to my best friend (and Mrs.-to-be) Jenny, who helped me find balance in life despite the nights and weekends that I spent hunched over a keyboard (and went as far as buying me an X-Box as a distraction from data structures and abstraction layers). Also, thanks to our cat, Achoo, for reminding me each day that playing with hair elastics and laser pointers is almost as fun as playing with ones and zeros.

Chapter 1 Digital Investigation Foundations

You are not required to take the same approach, but this chapter shows where I think the contents of this book fit into the bigger picture.

Digital Investigations and Evidence

There is an abundance of digital forensic and investigation definitions, and this section gives the definitions that I use and a justification for them. The focus of a digital investigation is going to be some type of digital device that has been involved in an incident or crime. The digital device was either used to commit a physical crime, or it executed a digital event that violated a policy or law. An example of the first case is a suspect who used the Internet to conduct research about a physical crime. Examples of the latter case are when an attacker gains unauthorized access to a computer, a user downloads contraband material, or a user sends a threatening e-mail. When the violation is detected, an investigation is started to answer questions such as why the violation occurred and who or what caused it to occur.

A digital investigation is a process where we develop and test hypotheses that answer questions about digital events. This is done using the scientific method, where we develop a hypothesis using evidence that we find and then test the hypothesis by looking for additional evidence that shows the hypothesis is impossible. Digital evidence is a digital object that contains reliable information that supports or refutes a hypothesis.

Consider a server that has been compromised. We start an investigation to determine how it occurred and who did it. During the investigation, we find data that were created by events related to the incident. We recover deleted log entries from the server, find attack tools, and find numerous vulnerabilities that existed on the server. Using this data, and more, we develop hypotheses about which vulnerability the attacker used to gain access and what she did afterwards. Later, we examine the firewall configuration and logs and determine that some of the scenarios in our hypotheses are impossible because that type of network traffic could not have existed, and we do not find the necessary log entries. Therefore, we have found evidence that refutes one or more hypotheses.

In this book, I use the term evidence in the investigative context. Evidence has both legal and investigative uses. The definition that I previously gave was for the investigative uses of evidence, and there could be situations where not all of it can be entered into a court of law. Because the legal admissibility requirements vary by country and state and because I do not have a legal background, I am going to focus on the general concept of evidence, and you can make the adjustments needed in your jurisdiction.[1] In fact, there are no legal requirements that are specific to file systems, so the general digital investigation books listed in the Preface can provide the needed information.

So far, you may have noticed that I have not used the term "forensic" during the discussion about a digital investigation. The American Heritage Dictionary defines forensic as an adjective meaning "relating to the use of science or technology in the investigation and establishment of facts or evidence in a court of law" [Houghton Mifflin Company 2000]. The nature of digital evidence requires us to use technology during an investigation, so the main difference between a digital investigation and a digital forensic investigation is the introduction of legal requirements. A digital forensic investigation is a process that uses science and technology to analyze digital objects and that develops and tests theories, which can be entered into a court of law, to answer questions about events that occurred. In other words, a digital forensic investigation is a more restricted form of digital investigation. I will be using the term digital investigation in this book because the focus is on the technology and not on specific legal requirements.

Digital Crime Scene Investigation Process

There is no single way to conduct an investigation. If you ask five people to find the person who drank the last cup of coffee without starting a new pot, you will probably see five different approaches. One person may dust the pot for fingerprints, another may ask for security camera tapes of the break room, and another may look for the person with the hottest cup of coffee. As long as we find the right person and do not break any laws in the process, it does not matter which process is used, although some are more efficient than others.

The approach that I use for a digital investigation is based on the physical crime scene investigation process [Carrier and Spafford 2003]. In this case, we have a digital crime scene that includes the digital environment created by software and hardware. The process has three major phases: system preservation, evidence searching, and event reconstruction. These phases do not need to occur one after another, and the flow is shown in Figure 1.1.

Figure 1.1 The three major phases of a digital crime scene investigation

This process can be used when investigating both live and dead systems. A live analysis occurs when you use the operating system or other resources of the system being investigated to find evidence. A dead analysis occurs when you are running trusted applications in a trusted operating system to find evidence. With a live analysis, you risk getting false information because the software could maliciously hide or falsify data. A dead analysis is more ideal but is not possible in all circumstances.

System Preservation Phase

The first phase in the investigation process is the System Preservation Phase, where we try to preserve the state of the digital crime scene. The actions that are taken in this phase vary depending on the legal, business, or operational requirements of the investigation. For example, legal requirements may cause you to unplug the system and make a full copy of all data. On the other extreme could be a case involving a spyware infection or a honeypot[2] where no preservation is performed. Most investigations in a corporate or military setting that will not go to court use techniques in between these two extremes.

[2] A honeypot is "an information resource whose value lies in unauthorized or illicit use of that resource" [Honeynet Project 2004].

The purpose of this phase is to reduce the amount of evidence that may be overwritten. This process continues after data has been acquired from the system because we need to preserve the data for future analysis. In Chapter 3, "Hard Disk Data Acquisition," we will look at how to make a full copy of a hard disk, and the remainder of the book will cover how to analyze the data and search for evidence.

Preservation Techniques

The goal of this phase is to reduce the amount of evidence that is overwritten, so we want to limit the number of processes that can write to our storage devices. For a dead analysis, we will terminate all processes by turning the system off, and we will make duplicate copies of all data. As will be discussed in Chapter 3, write blockers can be used to prevent evidence from being overwritten.

For a live analysis, suspect processes can be killed or suspended. The network connection can be unplugged (plug the system into an empty hub or switch to prevent log messages about a dead link), or network filters can be applied so that the perpetrator cannot connect from a remote system and delete data. Important data should be copied from the system in case it is overwritten while searching for evidence. For example, if you are going to be reading files, then you can save the temporal data for each file so that you have a copy of the last access times before you cause them to be updated.
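To make that concrete, here is a minimal Python sketch of saving the temporal data before examination. It assumes a Unix-like system with the suspect volume mounted at the hypothetical path /mnt/suspect; the snapshot file must be written somewhere other than the suspect volume.

# Sketch: record last-access and modification times before reading files,
# so a copy of the original temporal data is preserved.
import csv
import os

def snapshot_times(root, out_path):
    # out_path must NOT be on the suspect volume
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "atime", "mtime"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.stat(path)  # reads metadata; does not update atime
                except OSError:
                    continue            # unreadable entry; skip rather than fail
                writer.writerow([path, st.st_atime, st.st_mtime])

snapshot_times("/mnt/suspect", "/mnt/analyst/atimes-before.csv")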

When important data are saved during a dead or live analysis, a cryptographic hash should be calculated to later show that the data have not changed. A cryptographic hash, such as MD5, SHA-1, or SHA-256, is a mathematical formula that generates a very large number based on input data. If any bit of the input data changes, the output number changes dramatically. (A more detailed description can be found in Applied Cryptography, 2nd Edition [Schneier 1995].) The algorithms are designed such that it is extremely difficult to find two inputs that generate the same output. Therefore, if the hash value of your important data changes, then you know that the data has been modified.
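As a simple illustration, the following Python sketch streams a file (the image name is a placeholder) through one of the hash algorithms named above; re-running it later and comparing digests shows whether the data changed.

# Sketch: hash acquired data so a later re-hash can demonstrate integrity.
import hashlib

def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)  # stream in chunks; disk images are large
    return h.hexdigest()

baseline = hash_file("disk-image.dd")  # placeholder file name
print("sha256:", baseline)
# Later: hash_file("disk-image.dd") == baseline shows the data is unchanged.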

Evidence Searching Phase

After we have taken steps to preserve the data, we need to search them for evidence. Recall that we are looking for data that support or refute hypotheses about the incident. This process typically starts with a survey of common locations based on the type of incident, if one is known. For example, if we are investigating Web-browsing habits, we will look at the Web browser cache, history file, and bookmarks. If we are investigating a Linux intrusion, we may look for signs of a rootkit or new user accounts. As the investigation proceeds and we develop hypotheses, we will search for evidence that will refute or support them. It is important to look for evidence that refutes your hypothesis instead of only looking for evidence that supports it.

The theory behind the searching process is fairly simple. We define the general characteristics of the object for which we are searching and then look for that object in a collection of data. For example, if we want all files with the JPG extension, we will look at each file name and identify the ones that end with the characters ".JPG". The two key steps are determining what we are looking for and where we expect to find it.

Part 2, "Volume Analysis," and Part 3, "File System Analysis," of this book are about searching for evidence in a volume and file system In fact, the file system analysis chapters are organized so that you can focus on a specific category of data that may contain your evidence The end of this chapter contains a summary of the popular investigation toolkits,

Trang 15

and they all allow you to view, search, and sort the data from a suspect system so that you can find evidence

Search Techniques

Most searching for evidence is done in a file system and inside files. A common search technique is to search for files based on their names or patterns in their names. Another common search technique is to search for files based on a keyword in their content. We can also search for files based on their temporal data, such as the last accessed or written time.

We can search for known files by comparing the MD5 or SHA-1 hash of a file's content with a hash database such as the National Software Reference Library (NSRL) (http://www.nsrl.nist.gov). Hash databases can be used to find files that are known to be bad or good. Another common method of searching is to search for files based on signatures in their content. This allows us to find all files of a given type even if someone has changed their name.
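A rough Python sketch of both ideas follows. The hash set and paths are placeholder examples, not real NSRL data; a real lookup would use the NSRL distribution itself.

# Sketch: compare file hashes against a known-file set and check a content
# signature so renamed JPEGs are still found.
import hashlib
import os

KNOWN_HASHES = {"5d41402abc4b2a76b9719d911017c592"}  # example MD5 values
JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG content starts with these bytes

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

for dirpath, _dirs, files in os.walk("/mnt/evidence"):  # placeholder mount
    for name in files:
        path = os.path.join(dirpath, name)
        try:
            if md5_of(path) in KNOWN_HASHES:
                print("known file:", path)
            with open(path, "rb") as f:
                header = f.read(3)
            if header == JPEG_MAGIC and not name.lower().endswith((".jpg", ".jpeg")):
                print("possible renamed JPEG:", path)
        except OSError:
            continue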

When analyzing network data, we may search for all packets from a specific source address or all packets going to a specific port. We also may want to find packets that have a certain keyword in them.

Event Reconstruction Phase

The last phase of the investigation is to use the evidence that we found and determine what events occurred in the system. Our definition of an investigation was that we are trying to answer questions about digital events in the system. During the Evidence Searching Phase, we might have found several files that violate a corporate policy or law, but that does not answer questions about events. One of the files may have been the effect of an event that downloaded it, but we should also try to determine which application downloaded it. Is there evidence that a Web browser downloaded them, or could it be from malware? (Several cases have used malware as a defense when contraband or other digital evidence has been found [George 2004; Brenner, Carrier, and Henninger 2004].) After the digital event reconstruction phase, we may be able to correlate the digital events with physical events.

Event reconstruction requires knowledge about the applications and the OS that are installed on the system so that you can create hypotheses based on their capabilities. For example, different events can occur in Windows 95 than in Windows XP, and different versions of the Mozilla Web browser can cause different events. This type of analysis is out of the scope of this book, but general guidelines can be found in Casey [2004].

General Guidelines

Not every investigation will use the same procedures, and there could be situations where you need to develop a new procedure. This book might be considered a little academic because it does not cover only what exists in current tools. There are some techniques that have not been implemented, so you may have to improvise to find the evidence. Here are my PICL guidelines, which will hopefully keep you out of one when you are developing new procedures. PICL stands for preservation, isolation, correlation, and logging.

The first guideline is preservation of the system being investigated. The motivation behind this guideline is that you do not want to modify any data that could have been evidence, and you do not want to be in a courtroom where the other side tries to convince the jury that you may have overwritten exculpatory evidence. This is what we saw in the Preservation Phase of the investigation process. Some examples of how the preservation guideline is implemented are:

• Copy important data, put the original in a safe place, and analyze the copy so that you can restore the original if the data is modified.


• Calculate MD5 or SHA hashes of important data so that you can later prove that the data has not changed.

• Use a write-blocking device during procedures that could write to the suspect data.

• Minimize the number of files created during a live analysis because they could overwrite evidence in unallocated space.

• Be careful when opening files on the suspect system during a live analysis because you could be modifying data, such as the last access time.

The second guideline is to isolate the analysis environment from both the suspect data and the outside world. You want to isolate yourself from the suspect data because you do not know what it might do. Running an executable from the suspect system could delete all files on your computer, or it could communicate with a remote system. Opening an HTML file from the suspect system could cause your Web browser to execute scripts and download files from a remote server. Both of these are potentially dangerous, and caution should be taken. Isolation from the suspect data is implemented by viewing data in applications that have limited functionality or in a virtual environment, such as VMWare (http://www.vmware.com), that can be easily rebuilt if it is destroyed.

You should isolate yourself from the outside world so that no tampering can occur and so that you do not transmit anything that you did not want to. For example, the previous paragraph described how something as simple as an HTML page could cause you to connect to a remote server. Isolation from the outside world is typically implemented using an analysis network that is not connected to the outside world or that is connected using a firewall that allows only limited connectivity.

Note that isolation is difficult with live analysis. By definition, you are not isolated from the suspect data because you are analyzing a system using its OS, which is suspect code. Every action you take involves suspect data. Further, it is difficult to isolate the system from the outside world because that requires removing network connectivity, and live analysis typically occurs because the system must remain active.

The third guideline is to correlate data with other independent sources. This helps reduce the risk of forged data. For example, we will later see that timestamps can be easily changed in most systems. Therefore, if time is very important in your investigation, you should try to find log entries, network traffic, or other events that can confirm the file activity times.

The final guideline is to log and document your actions. This helps identify what searches you have not yet conducted and what your results were. When doing a live analysis or performing techniques that will modify data, it is important to document what you do so that you can later document what changes in the system were because of your actions.

Data Analysis

In the previous section, I said we were going to search for digital evidence, which is a rather general statement because evidence can be found almost anywhere. In this section, I am going to narrow down the different places where we can search for digital evidence and identify which will be discussed later in this book. We will also discuss which data we can trust more than others.

Analysis Types

When analyzing digital data, we are looking at an object that has been designed by people. Further, the storage systems of most digital devices have been designed to be scalable and flexible, and they have a layered design. I will use this layered design to define the different analysis types [Carrier 2003a].


If we start at the bottom of the design layers, there are two independent analysis areas. One is based on storage devices, and the other is based on communication devices. This book is going to focus on the analysis of storage devices, specifically non-volatile devices such as hard disks. The analysis of communication systems, such as IP networks, is not covered in this book, but is covered elsewhere [Bejtlich 2005; Casey 2004; Mandia et al. 2003].

Figure 1.2 shows the different analysis areas. The bottom layer is Physical Storage Media Analysis and involves the analysis of the physical storage medium. Examples of physical storage media include hard disks, memory chips, and CD-ROMs. Analysis of this area might involve reading magnetic data from in between tracks or other techniques that require a clean room. For this book, we are going to assume that we have a reliable method of reading data from the physical storage medium, so we have a stream of 1s and 0s that were previously written to the storage device.

Figure 1.2 Layers of analysis based on the design of digital data. The bold boxes are covered in this book

We now analyze the 1s and 0s from the physical medium. Memory is typically organized by processes and is out of the scope of this book. We will focus on non-volatile storage, such as hard disks and flash cards.

Storage devices that are used for non-volatile storage are typically organized into volumes. A volume is a collection of storage locations that a user or application can write to and read from. We will discuss volume analysis in Part 2 of the book, but there are two major concepts in this layer. One is partitioning, where we divide a single volume into multiple smaller volumes, and the other is assembly, where we combine multiple volumes into one larger volume, which may later be partitioned. Examples of this category include DOS partition tables, Apple partitions, and RAID arrays. Some media, such as floppy disks, do not have any data in this layer, and the entire disk is a volume. We will need to analyze data at the volume level to determine where the file system or other data are located and to determine where we may find hidden data.

Inside each volume can be any type of data, but the most common contents are file systems. Other volumes may contain a database or be used as a temporary swap space (similar to the Windows pagefile). Part 3 of the book focuses on file systems, which are collections of data structures that allow an application to create, read, and write files. We analyze a file system to find files, to recover deleted files, and to find hidden data. The result of file system analysis could be file content, data fragments, and metadata associated with files.

To understand what is inside of a file, we need to jump to the application layer. The structure of each file is based on the application or OS that created the file. For example, from the file system perspective, a Windows registry file is no different from an HTML page because they are both files. Internally, they have very different structures, and different tools are needed to analyze each. Application analysis is very important, and it is here where we would analyze configuration files to determine what programs were running or determine what a JPEG picture is of. I do not discuss application analysis in this book because it requires multiple books of its own to cover in the same detail with which file systems and volumes are covered. Refer to the general digital investigation books listed in the Preface for more information.

We can see the analysis process in Figure 1.3. This shows a disk that is analyzed to produce a stream of bytes, which are analyzed at the volume layer to produce volumes. The volumes are analyzed at the file system layer to produce a file. The file is then analyzed at the application layer.

Figure 1.3 Process of analyzing data at the physical level to the application level

Essential and Nonessential Data

All data in the layers previously discussed have some structure, but not all structure is necessary for the layer to serve its core purpose. For example, the purpose of the file system layer is to organize an empty volume so that we can store data and later retrieve them. The file system is required to correlate a file name with file content. Therefore, the name is essential, and the on-disk location of the file content is essential. We can see this in Figure 1.4, where we have a file named miracle.txt whose content is located at address 345. If either the name or the address were incorrect or missing, then the file content could not be read. For example, if the address were set to 344, then the file would have different content.


Figure 1.4 To find and read this file, it is essential for the name, size, and content location to be accurate, but it is not essential for the last accessed time to be accurate

Figure 1.4 also shows that the file has a last accessed time. This value is not essential to the purpose of the file system, and if it were changed, missing, or incorrectly set, it would not affect the process of reading or writing file content.

In this book, I introduce the concept of essential and nonessential data because we can trust essential data, but we may not be able to trust nonessential data. We can trust that the file content address in a file is accurate because otherwise the person who used the system would not have been able to read the data. The last access time may or may not be accurate. The OS may not have updated it after the last access, the user may have changed the time, or the OS clock could have been off by three hours, and the wrong time was stored.

Note that just because we trust the number for the content address does not mean that we trust the actual content at that address. For example, the address value in a deleted file may be accurate, but the data unit could have been reallocated, and the content at that address is for a new file. Nonessential data may be correct most of the time, but you should try to find additional data sources to support them when they are used in an incident hypothesis (i.e., the correlation in the PICL guidelines). In Parts 2 and 3 of the book, I will identify which data are essential and which are not.

Overview of Toolkits

There are many tools that can help an investigator analyze a digital system. Most tools focus on the preservation and searching phases of the investigation. For the rest of this book, I will be showing examples using The Sleuth Kit (TSK), which I develop and which is described later in this section. TSK is free, which means that any reader can try the examples in this book without having to spend more money.

This book is not intended to be a TSK tutorial, and not everyone wants to use Unix-based, non-commercial tools. Therefore, I am including a list of the most common analysis tools. Most of the techniques described in this book can be performed using these tools. Tools that are restricted to law enforcement are not listed here. The descriptions are not an exhaustive list of features and are based on the content of the vendors' Web sites. I have not confirmed or used every feature, but each of the vendors has reviewed these descriptions.

If you are interested in a more extensive list of tools, refer to Christine Siedsma's Electronic Evidence Information site (http://www.e-evidence.info) or Jacco Tunnissen's Computer Forensics, Cybercrime and Steganography site (http://www.forensics.nl). I also maintain a list of open source forensics tools that are both commercial and non-commercial (http://www.opensourceforensics.org). This book helps show the theory of how a tool is analyzing a file system, but I think open source tools are useful for investigations because they allow an investigator or a trusted party to read the source code and verify how a tool has implemented the theory. This allows an investigator to better testify about the digital evidence [Carrier 2003b].

EnCase by Guidance Software

There are no official numbers on the topic, but it is generally accepted that EnCase (http://www.encase.com) is the most widely used computer investigation software. EnCase is Windows-based and can acquire and analyze data using the local or network-based versions of the tool. EnCase can analyze many file system formats, including FAT, NTFS, HFS+, UFS, Ext2/3, Reiser, JFS, CD-ROMs, and DVDs. EnCase also supports Microsoft Windows dynamic disks and AIX LVM.

EnCase allows you to list the files and directories, recover deleted files, conduct keyword searches, view all graphic images, make timelines of file activity, and use hash databases to identify known files. It also has its own scripting language, called EnScript, which allows you to automate many tasks. Add-on modules support the decryption of NTFS encrypted files and allow you to mount the suspect data as though it were a local disk.

Forensic Toolkit by AccessData

The Forensic Toolkit (FTK) is Windows-based and can acquire and analyze disk, file system, and application data (http://www.accessdata.com). FTK supports FAT, NTFS, and Ext2/3 file systems but is best known for its searching abilities and application-level analysis support. FTK creates a sorted index of the words in a file system so that individual searches are much faster. FTK also has many viewers for different file formats and supports many e-mail formats.

FTK allows you to view the files and directories in the file system, recover deleted files, conduct keyword searches, view all graphic images, search on various file characteristics, and use hash databases to identify known files. AccessData also has tools for decrypting files and recovering passwords.

ProDiscover by Technology Pathways

ProDiscover (http://www.techpathways.com) is a Windows-based acquisition and analysis tool that comes in both local and network-based versions. ProDiscover can analyze FAT, NTFS, Ext2/3, and UFS file systems and Windows dynamic disks. When searching, it provides the basic options to list the files and directories, recover deleted files, search for keywords, and use hash databases to identify known files. ProDiscover is available with a license that includes the source code so that an investigator or lab can verify the tool's actions.

SMART by ASR Data

SMART (http://www.asrdata.com) is a Linux-based acquisition and analysis tool. Andy Rosen, the original developer of Expert Witness (which is now called EnCase), developed SMART. SMART takes advantage of the large number of file systems that Linux supports and can analyze FAT, NTFS, Ext2/3, UFS, HFS+, JFS, Reiser, CD-ROMs, and more. To search for evidence, it allows you to list and filter the files and directories in the image, recover deleted files, conduct keyword searches, view all graphic images, and use hash databases to identify known files.

The Sleuth Kit / Autopsy

The Sleuth Kit (TSK) is a collection of Unix-based command line analysis tools, and Autopsy is a graphical interface for TSK (http://www.sleuthkit.org). The file system tools in TSK are based on The Coroner's Toolkit (TCT), which was written by Dan Farmer and Wietse Venema. TSK and Autopsy can analyze FAT, NTFS, Ext2/3, and UFS file systems and can list files and directories, recover deleted files, make timelines of file activity, perform keyword searches, and use hash databases. We will be using TSK throughout this book, and Appendix A, "The Sleuth Kit and Autopsy," provides a description of how it can be used.

Summary

There is no single way to conduct an investigation, and I have given a brief overview of one approach that I take. It has only three major phases and is based on a physical crime scene investigation procedure. We have also looked at the major analysis types and a summary of the available toolkits. In the next two chapters, we will look at the computer fundamentals and how to acquire data during the Preservation Phase of an investigation.

Bibliography

Brenner, Susan, Brian Carrier, and Jef Henninger. "The Trojan Defense in Cybercrime Cases." Santa Clara Computer and High Technology Law Journal, 21(1), 2004.

Bejtlich, Richard. The Tao of Network Security Monitoring: Beyond Intrusion Detection. Boston: Addison-Wesley, 2005.

Carrier, Brian. "Defining Digital Forensic Examination and Analysis Tools Using Abstraction Layers." International Journal of Digital Evidence, Winter 2003a. http://www.ijde.org.

Carrier, Brian. "Open Source Digital Forensic Tools: The Legal Argument." Fall 2003b. http://www.digital-evidence.org.

Carrier, Brian, and Eugene H. Spafford. "Getting Physical with the Digital Investigation Process." International Journal of Digital Evidence, Fall 2003. http://www.ijde.org.

Casey, Eoghan. Digital Evidence and Computer Crime. 2nd ed. London: Academic Press, 2004.

The Honeynet Project. Know Your Enemy. 2nd ed. Boston: Addison-Wesley, 2004.

Houghton Mifflin Company. The American Heritage Dictionary. 4th ed. Boston: Houghton Mifflin, 2000.

Mandia, Kevin, Chris Prosise, and Matt Pepe. Incident Response and Computer Forensics. 2nd ed. Emeryville: McGraw-Hill/Osborne, 2003.

Schneier, Bruce. Applied Cryptography. 2nd ed. New York: Wiley Publishing, 1995.


Chapter 2 Computer Foundations

The goal of this chapter is to cover the low-level basics of how computers operate. In the following chapters of this book, we examine, in detail, how data are stored, and this chapter provides background information for those who do not have programming or operating system design experience. This chapter starts with a discussion about data and how they are organized on disk. We discuss binary versus hexadecimal values and little- and big-endian ordering. Next, we examine the boot process and code required to start a computer. Lastly, we examine hard disks and discuss their geometry, ATA commands, host protected areas, and SCSI.

Data Organization

The purpose of the devices we investigate is to process digital data, so we will cover some of the basics of data in this section. We will look at binary and hexadecimal numbers, data sizes, endian ordering, and data structures. These concepts are essential to how data are stored. If you have done programming before, this should be a review.

Binary, Decimal, and Hexadecimal

First, let's look at number formats. Humans are used to working with decimal numbers, but computers use binary, which means that there are only 0s and 1s. Each 0 or 1 is called a bit, and bits are organized into groups of 8 called bytes. Binary numbers are similar to decimal numbers except that decimal numbers have 10 different symbols (0 to 9) instead of only 2. Before we dive into binary, we need to consider what a decimal number is. A decimal number is a series of symbols, and each symbol has a value. The symbol in the right-most column has a value of 1, and the next column to the left has a value of 10. Each column has a value that is 10 times as much as the previous column. For example, the second column from the right has a value of 10, the third has 100, the fourth has 1,000, and so on. Consider the decimal number 35,812. We can calculate the decimal value of this number by multiplying the symbol in each column with the column's value and adding the products. We can see this in Figure 2.1. The result is not surprising because we are converting a decimal number to its decimal value. We will use this general process, though, to determine the decimal value of non-decimal numbers.

Figure 2.1 The values of each symbol in a decimal number

The right-most column is called the least significant symbol, and the left-most column is called the most significant symbol. With the number 35,812, the 3 is the most significant symbol, and the 2 is the least significant symbol.

Now let's look at binary numbers. A binary number has only two symbols (0 and 1), and each column has a decimal value that is two times as much as the previous column. Therefore, the right-most column has a decimal value of 1, the second column from the right has a decimal value of 2, the third column's decimal value is 4, the fourth column's decimal value is 8, and so on. To calculate the decimal value of a binary number, we simply add the value of each column multiplied by the value in it. We can see this in Figure 2.2 for the binary number 1001 0011. We see that its decimal value is 147.

Figure 2.2 Converting a binary number to its decimal value

For reference, Table 2.1 shows the decimal value of the first 16 binary numbers. It also shows the hexadecimal values, which we will examine next.

Table 2.1 Binary, decimal, and hexadecimal conversion table

Binary   Decimal   Hexadecimal
0000     0         0
0001     1         1
0010     2         2
0011     3         3
0100     4         4
0101     5         5
0110     6         6
0111     7         7
1000     8         8
1001     9         9
1010     10        A
1011     11        B
1100     12        C
1101     13        D
1110     14        E
1111     15        F

Now let's look at hexadecimal numbers, which have 16 symbols (the numbers 0 to 9 followed by the letters A to F). Refer back to Table 2.1 to see the conversion between the base hexadecimal symbols and decimal symbols. We care about hexadecimal numbers because it is easy to convert between binary and hexadecimal, and they are frequently used when looking at raw data. I will precede a hexadecimal number with '0x' to differentiate it from a decimal number.

We rarely need to convert a hexadecimal number to its decimal value by hand, but I will go through the process once. The decimal value of each column in a hexadecimal number increases by a factor of 16. Therefore, the decimal value of the first column is 1, the second column has a decimal value of 16, and the third column has a decimal value of 256. To convert, we simply add the result from multiplying the column's value with the symbol in it. Figure 2.3 shows the conversion of the hexadecimal number 0x8BE4 to a decimal number.


Figure 2.3 Converting a hexadecimal value to its decimal value

Lastly, let's convert between hexadecimal and binary. This is much easier because it requires only lookups. If we have a hexadecimal number and want the binary value, we look up each hexadecimal symbol in Table 2.1 and replace it with the equivalent 4 bits. Similarly, to convert a binary value to a hexadecimal value, we organize the bits into groups of 4 and then look up the equivalent hexadecimal symbol. That is all it takes. We can see this in Figure 2.4, where we convert a binary number to hexadecimal and the other way around.

Figure 2.4 Converting between binary and hexadecimal requires only lookups from Table 2.1

Sometimes, we want to know the maximum value that can be represented with a certain number of columns. We do this by raising the number of symbols per column to the power of the number of columns and subtracting 1. We subtract 1 because we need to take the 0 value into account. For example, with a binary number we raise 2 to the number of bits in the value and subtract 1. Therefore, a 32-bit value has a maximum decimal value of 2^32 - 1 = 4,294,967,295.
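The conversions worked through in Figures 2.2 and 2.3, and the maximum-value rule, can be checked with a few lines of Python:

print(int("10010011", 2))  # binary 1001 0011 -> 147
print(0x8BE4)              # hexadecimal 0x8BE4 -> 35812
print(format(147, "08b"))  # decimal 147 -> '10010011'
print(format(147, "X"))    # decimal 147 -> '93', i.e., 0x93
print(2**32 - 1)           # maximum 32-bit value -> 4294967295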

A byte is the smallest amount of space that is typically allocated to data. A byte can hold only 256 values, so bytes are grouped together to store larger numbers. Typical sizes include 2, 4, or 8 bytes. Computers differ in how they organize multiple-byte values. Some of them use big-endian ordering and put the most significant byte of the number in the first storage byte, and others use little-endian ordering and put the least significant byte of the number in the first storage byte. Recall that the most significant byte is the byte with the most value (the left-most byte), and the least significant byte is the byte with the least value (the right-most byte).

Figure 2.5 shows a 4-byte value that is stored in both little- and big-endian ordering. The value has been allocated a 4-byte slot that starts in byte 80 and ends in byte 83. When we examine the disk and file system data in this book, we need to keep the endian ordering of the original system in mind. Otherwise, we will calculate the incorrect value.

Figure 2.5 A 4-byte value stored in both big- and little-endian ordering

IA32-based systems (i.e., Intel Pentium) and their 64-bit counterparts use little-endian ordering, so we need to "rearrange" the bytes if we want the most significant byte to be the left-most number. Sun SPARC and Motorola PowerPC (i.e., Apple computers) systems use big-endian ordering.
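Python's struct module makes the two orderings easy to see. Here an example value, 0x12345678 (chosen for illustration; not necessarily the value in Figure 2.5), is packed both ways:

import struct

value = 0x12345678
print(struct.pack("<I", value).hex())  # little-endian: 78563412
print(struct.pack(">I", value).hex())  # big-endian:    12345678

# Interpreting 4 raw bytes that came from a little-endian (IA32) system:
raw = bytes.fromhex("78563412")
print(hex(struct.unpack("<I", raw)[0]))  # 0x12345678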

Strings and Character Encoding

The previous section examined how a computer stores numbers, but we must now consider how it stores letters and sentences. The most common technique is to encode the characters using ASCII or Unicode. ASCII is simpler, so we will start there. ASCII assigns a numerical value to the characters in American English. For example, the letter 'A' is equal to 0x41, and '&' is equal to 0x26. The largest defined value is 0x7E, which means that 1 byte can be used to store each character. There are many values that are defined as control characters and are not printable, such as the 0x07 bell sound. Table 2.2 shows the hexadecimal number to ASCII character conversion table. A more detailed ASCII table can be found online.

To store a word or sentence, the value of each character is written to consecutive bytes, starting with the first allocated byte. The series of bytes in a word or sentence is called a string. Many times, the string ends with the NULL symbol, which is 0x00. Figure 2.6 shows an example string stored in ASCII. The string has 10 symbols in it and is NULL terminated, so it has allocated 11 bytes starting at byte 64.

Figure 2.6 An address that is represented in ASCII starting at memory address 64

ASCII is nice and simple if you use American English, but it is quite limited for the rest of the world because native symbols in other languages cannot be represented. Unicode helps solve this problem by using more than 1 byte to store the numerical version of a symbol. (More information can be found at www.unicode.org.) The version 4.0 Unicode standard supports over 96,000 characters, which requires 4 bytes per character instead of the 1 byte that ASCII requires.

There are three ways of storing a Unicode character. The first method, UTF-32, uses a 4-byte value for each character, which might waste a lot of space. The second method, UTF-16, stores the most heavily used characters in a 2-byte value and the lesser-used characters in a 4-byte value. Therefore, on average, this uses less space than UTF-32. The third method is called UTF-8, and it uses 1, 2, or 4 bytes to store a character. Each character can require a different number of bytes, and the most frequently used characters use only 1 byte.

UTF-8 and UTF-16 use a variable number of bytes to store each character and, therefore, make processing the data more difficult. UTF-8 is frequently used because it has the least amount of wasted space and because ASCII is a subset of it. A UTF-8 string that contains only ASCII characters uses only 1 byte per character and has the same values as the equivalent ASCII string.
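The size differences are easy to demonstrate; this short Python snippet encodes the same characters under each scheme:

text = "A"  # a plain ASCII character
print(len(text.encode("utf-8")))      # 1 byte: ASCII is a subset of UTF-8
print(len(text.encode("utf-16-le")))  # 2 bytes
print(len(text.encode("utf-32-le")))  # 4 bytes

print("Main St.".encode("ascii").hex())  # 4d61696e2053742e, as seen in the
                                         # hexdumps in the next section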

Data Structures

Before we can look at how data are stored in specific file systems, we need to look at the general concept of data organization. Let's return to the previous example where we compared digital data sizes to boxes on a paper form. With a paper form, a label precedes the boxes and tells you that the boxes are for the name or address. Computers do not, generally, precede file system data with a label. Instead, they simply know that, for example, the first 32 bytes are for a person's name and the next 32 bytes are for the street name.

Computers know the layout of the data because of data structures. A data structure describes how data are laid out; it is organized into fields, and each field has a size and name, although this information is not saved with the data. For example, our data structure could define the first field to be called 'number' and have a length of 2 bytes; it is used to store the house number in our address. Immediately after the 'number' field is the 'street' field, with a length of 30 bytes. We can see this layout in Table 2.3.

Table 2.3 A basic data structure for the house number and street name

Byte Range   Description
0-1          House number (2-byte integer)
2-31         ASCII name of the street (30 bytes)

If we want to write data to a storage device, we refer to the appropriate data structure to determine where each value should be written. For example, if we want to store the address 1 Main St., we would first break the address up into the number and the name. We would write the number 1 to bytes 0 to 1 of our storage space and then write "Main St." in bytes 2 to 9 by determining the ASCII value for each character. The remaining bytes can be set to 0 since we do not need them. In this case, we allocated 32 bytes of storage space, and it can be anywhere in the device. The byte offsets are relative to the start of the space we were allocated. Keep in mind that the order of the bytes in the house number depends on the endian ordering of the computer.
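A sketch of the write side in Python: struct.pack lays the two fields out exactly as Table 2.3 describes ('<' selects little-endian, 'H' a 2-byte number, '30s' a 30-byte, zero-padded string):

import struct

record = struct.pack("<H30s", 1, b"Main St.")  # unused street bytes become 0x00
print(len(record))   # 32
print(record.hex())  # 01004d61696e2053742e0000... as in the dump below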

When we want to read data from the storage device, we determine where the data start and then refer to the data structure to find out where the needed values are. For example, let's read the data we just wrote. We learn where they start on the storage device and then apply our data structure template. Here is the output of a tool that reads the raw data:

0000000: 0100 4d61 696e 2053 742e 0000 0000 0000 Main St

We look up the layout of our data structure and see that each address is 32 bytes, so the first address is in bytes 0 to 31. Bytes 0 to 1 should be the 2-byte number field, and bytes 2 to 31 should be the street name. Bytes 0 to 1 show us the value 0x0100. The data are from an Intel system, which is little-endian, and we will therefore have to switch the order of the 0x01 and the 0x00 to produce 0x0001. When we convert this to decimal, we get the number 1.

The second field in the data structure is in bytes 2 to 31 and is an ASCII string, which is not affected by the endian ordering of the system, so we do not have to reorder the bytes. We can either convert each byte to its ASCII equivalent or, in this case, cheat and look at the right column to see "Main St." This is the value we previously wrote. We see that another address data structure starts in byte 32 and extends until byte 63. You can process it as an exercise (it is for 25 South St).
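A minimal Python sketch of the same parsing steps, assuming the 32-byte layout of Table 2.3; the raw bytes below are typed in by hand to mirror the tool output above rather than read from a real device.

import struct

raw = b"\x01\x00Main St." + b"\x00" * 22   # one 32-byte address record

# '<H' interprets bytes 0-1 as an unsigned 16-bit little-endian value,
# performing the 0x0100 -> 0x0001 byte swap described above.
(number,) = struct.unpack("<H", raw[0:2])

# Bytes 2-31 are a NULL-padded ASCII string; byte order does not matter
# because each character is a single byte.
street = raw[2:32].rstrip(b"\x00").decode("ascii")

print(number, street)                       # 1 Main St.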

Obviously, the data structures used by a file system will not be storing street addresses, but they rely on the same basic concepts. For example, the first sector of the file system typically contains a large data structure that has dozens of fields in it, and we need to read it and know, for example, that the size of the file system is given in bytes 32 to 35. Many file systems have several large data structures that are used in multiple places.


Flag Values

There is one last data type that I want to discuss before we look at actual data structures, and it is a flag. Some data are used to identify whether something exists, which can be represented with either a 1 or a 0. An example could be whether a partition is bootable or not. One method of storing this information is to allocate a full byte for it and save the 0 or 1 value. This wastes a lot of space, though, because only 1 bit is needed, yet 8 bits are allocated. A more efficient method is to pack several of these binary conditions into one value. Each bit in the value corresponds to a feature or option. These are frequently called flags because each bit flags whether a condition is true. To read a flag value, we need to convert the number to binary and then examine each bit. If a bit is 1, the flag is set.

Let's look at an example by making our previous street address data structure a little more complex. The original data structure had a field for the house number and a field for the street name. Now we will add an optional 16-byte city name after the street field. Because the city name is optional, we need a flag to identify whether it exists. The flag is in byte 31, and bit 0 is set when the city exists (i.e., 0000 0001). When the city exists, the data structure is 48 bytes instead of 32. The new data structure is shown in Table 2.4.

Table 2.4 A data structure with a flag value

Byte Range    Description
0-1           2-byte house number
2-30          ASCII name of street
31            Flags (bit 0 is set when a city name exists)
32-47         ASCII name of city (only when the city flag is set)

Here is sample data that was written to disk using this data structure:

0000000: 0100 4d61 696e 2053 742e 0000 0000 0000 Main St

"Boston." The next data structure starts at byte 48, and its flag field is in byte 79 Its value is 0x60, and the city flag is not set Therefore, the third data structure would start at byte 80

We will see flag values throughout file system data structures. They are used to show which features are enabled, which permissions are in effect, and whether the file system is in a clean state.

Booting Process

In the following chapters of this book, we are going to discuss where data reside on a disk and which data are essential for the operation of the computer. Many times, I will refer to boot code, which consists of machine instructions used by the computer when it is starting. This section describes the boot process and where boot code can be found. Many disks reserve space for boot code but do not use it. This section will help you to identify which boot code is being used.


Central Processing Units and Machine Code

The heart of a modern computer is one or more Central Processing Units (CPUs). Example CPUs are the Intel Pentium and Itanium, AMD Athlon, Motorola PowerPC, and Sun UltraSPARC. CPUs by themselves are not very useful because they do only what they are told. They are similar to a calculator: a calculator can do amazing things, but a human needs to be sitting in front of it and entering numbers.

CPUs get their instructions from memory. CPU instructions are written in machine code, which is difficult to read and not user-friendly. It is, in general, two levels below the C or Perl programming languages that many people have seen. The level in between is an assembly language, which is readable by humans but still not very user-friendly.

I will briefly describe machine code so that you know what you are looking at when you see machine code on a disk. Each machine code instruction is several bytes long, and the first couple of bytes identify the type of instruction, called the opcode. For example, the value 3 could be for an addition instruction. Following the opcode are the arguments to the instruction; for example, the arguments for the addition instruction would be the two numbers to add.

We do not really need much more detail than that for this book, but I will finish with a basic example. One of the machine instructions is to move values into registers of the CPU. Registers are places where CPUs store data. An assembly instruction to do this is MOV AH,00, where the value 0 is moved into the AH register. The machine code equivalent is the hexadecimal value 0xB400, where B4 is the opcode for MOV AH and 00 is the value, in hexadecimal, to move in. There are tools that will translate the machine code to the assembly code for you, but as you can see, it is not always obvious whether you are looking at machine code or some other random data.

Boot Code Locations

We just discussed that the CPU is the heart of the computer and needs to be fed instructions. Therefore, to start a computer, we need to have a device that feeds the CPU instructions, also known as boot code. In most systems, this is a two-step process where the first step involves getting all the hardware up and running, and the second step involves getting the OS or other software up and running. We will briefly look into boot code because all volume and file systems have a specific location where boot code is stored, and it is not always needed.

When power is applied to a CPU, it knows to read instructions from a specific location in memory, which is typically Read Only Memory (ROM). The instructions in ROM force the system to probe for and configure hardware. After the hardware is configured, the CPU searches for a device that may contain additional boot code. If it finds such a device, its boot code is executed, and the code attempts to locate and load a specific operating system. The process after the bootable disk is found is platform-specific, and I will cover it in more detail in the following chapters.

As an example, though, we will take a brief look at the boot process of a Microsoft Windows system. When the system is powered on, the CPU reads instructions from the Basic Input/Output System (BIOS), and it searches for the hard disks, CD drives, and other hardware devices that it has been configured to support. After the hardware has been located, the BIOS examines the floppy disks, hard disks, and CDs in some configured order and looks at the first sector of each for boot code. The code in the first sector of a bootable disk causes the CPU to process the partition table and locate the bootable partition where the Windows operating system is located. In the first sector of the partition is more boot code, which locates and loads the actual operating system. We can see how the various components refer to each other in Figure 2.7.


Figure 2.7 The relationship among the various boot code locations in an IA32 system

In the Windows example, if the boot code on the disk were missing, the BIOS would not find a bootable device and would generate an error. If the boot code on the disk could not find boot code in one of the partitions, it would generate an error. We will examine each of these boot code locations in the following chapters.

Hard Disk Technology

If a digital investigator can learn about only one piece of hardware in a computer, hard disks are probably his best choice because they are one of the most common sources of digital evidence. This section covers hard disk basics and discusses topics that are of interest to an investigator, such as access methods, write blocking, and locations where data can be hidden. The first section is an overview of how a disk works, and the next two sections cover AT Attachment (ATA/IDE) disks and Small Computer Systems Interface (SCSI) disks, respectively.

Hard Disk Geometry and Internals

Let's start with the internals of all modern hard disks. This information is useful for a basic understanding of how data are stored and because older file systems and partitioning schemes use disk geometry and other internal values that are hidden with modern disks. Therefore, knowing about disk geometry will help you to understand some of the values in a file system. The goal of this section is not to enable you to fix hard disks. Instead, the goal is to obtain a conceptual understanding of what is going on inside.

Hard disks contain one or more circular platters that are stacked on top of each other and spin at the same time. A picture of the inside of a disk can be found in Figure 2.8. The bottom and top of each platter are coated with magnetic media, and when the disk is manufactured, the platters are uniform and empty.


Figure 2.8 The inside of an ATA disk where we see the platters on the right and an arm on the left that reads from and writes to the platters

Inside the disk is an arm that moves back and forth, and it has a head on the top and bottom of each platter that can read and write data, although only one head can read or write at a time.

A low-level format is performed on the blank platters to create data structures for tracks and sectors. A track is a circular ring that goes around the platter. It is similar to a lane on a running track: if you go around the entire circle, you will end in the same location that you started. Each track on the hard disk is given an address from the outside inward, starting with 0. For example, if there were 10,000 tracks on each platter, the outside track of each platter would be 0, and the inside track (nearest the center of the circle) would be 9,999. Because the layout of each platter is the same and the tracks on each platter are given the same address, the term cylinder is used to describe all tracks at a given address on all platters. For example, cylinder 0 is track 0 on the bottom and top of all platters in the hard disk. The heads in the disk are given addresses so that we can uniquely identify which platter, and which side of the platter, we want to read from or write to.

Each track is divided into sectors, which are the smallest addressable storage units in the hard disk and are typically 512 bytes. Each sector is given an address, starting at 1 for each track. Therefore, we can address a specific sector by using the cylinder address (C) to get the track, the head number (H) to get the platter and side, and the sector address (S) to get the sector within the track. We can see this in Figure 2.9.


Figure 2.9 Disk geometry of one platter showing the track (or cylinder) and sector addresses (not even close to scale)

We will discuss in the "Types of Sector Addresses" section that the CHS address is no longer used as the primary addressing method. The Logical Block Address (LBA) is used instead, and it assigns sequential addresses to each sector. The LBA address may not be related to the sector's physical location.

A sector can become defective and should therefore no longer be used to store user data. With older disks, it was the responsibility of the operating system to know where the bad sectors were and to not allocate them for files. Users could also manually tell the disk which sectors to ignore because they were bad. In fact, many file systems still provide the option to mark sectors as bad. This is typically not needed, though, because modern disks can identify a bad sector and remap the address to a good location somewhere else on the disk. The user never knows that this has happened.

The previous description of the layout is overly simplified. In reality, the disk arranges the location of the sectors to obtain the best performance, so sectors and tracks may be offset to take advantage of the seek times and speeds of the drive. For the needs of many investigators, this simplistic view is good enough because most of us do not have clean rooms or the equipment to locate where a specific sector is located on a platter. A more detailed discussion of drive internals can be found in Forensic Computing [Sammes and Jenkinson 2000].

ATA / IDE Interface

The AT Attachment (ATA) interface is the most popular hard disk interface. Disks that use this interface are frequently referred to as IDE disks, but IDE simply stands for Integrated Disk Electronics and identifies a hard disk that has the logic board built into it, which older disks did not. The actual interface that the "IDE" disks use is ATA. This section goes into some of the relevant details of the ATA specification so that we can discuss technologies such as hardware write blockers and host protected areas.


The ATA specifications are developed by the T13 Technical Committee (http://www.t13.org), which is a committee for the International Committee on Information Technology Standards (INCITS). The final version of each specification is available for a fee, but draft versions are freely available on the INCITS Web site. For the purposes of learning about hard disks, the draft versions are sufficient.

ATA disks require a controller, which is built into the motherboard of modern systems. The controller issues commands to one or two ATA disks using a ribbon cable. The cable has a maximum length of 18 inches and has 40 pins, but newer cables have an extra 40 wires that are not connected to any pins. The interface can be seen in Figure 2.10. The extra wires are there to prevent interference between the wires. Laptops frequently have a smaller disk and use a 44-pin interface, which includes pins for power. Adaptors can be used to convert between the two interfaces, as can be seen in Figure 2.11. There is also a 44-pin high-density interface that is used in portable devices, such as Apple iPods.

Figure 2.10 An ATA disk with the 40-pin connector, jumpers, and power connector

Figure 2.11 A 44-pin ATA laptop drive connected to a 40-pin ATA ribbon cable using an adaptor (Photo courtesy of Eoghan Casey)

The interface data path between the controller and disks is called a channel. Each channel can have two disks, and the terms "master" and "slave" were given to them, even though neither has control over the other. ATA disks can be configured as master or slave using a physical jumper on the disk. Some can also be configured to use "Cable Select," where they will be assigned as master or slave based on which plug they have on the ribbon cable. Most consumer computers have two channels and can support four ATA disks.

Types of Sector Addresses

To read or write data from the disk, we need to be able to address the sectors. As we will see later in the book, a single sector will be assigned a new address each time a partition, file system, or file uses it. The address that we are referring to in this section is its physical address. The physical address of a sector is its address relative to the start of the physical media.

There are two different physical addressing methods. Older hard disks used the disk geometry and the CHS method, which we already discussed. Refer to Figure 2.9 for a simplistic example of how the cylinder and head addresses are organized.

The CHS addressing scheme sounds good, but it has proven to be too limiting and is not used much anymore. The original ATA specification used a 16-bit cylinder value, a 4-bit head value, and an 8-bit sector value, but older BIOSes used a 10-bit cylinder value, an 8-bit head value, and a 6-bit sector value. Therefore, to communicate with the hard disk through the BIOS, the smallest size for each value had to be used, which allowed only a 504MB disk.
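The 504MB figure can be checked by multiplying the smaller of each pair of limits by the 512-byte sector size; a quick Python calculation:

# Smallest of each pair: 10-bit cylinder (1,024), 4-bit head (16), and
# 6-bit sector (63; sector numbers start at 1, so address 0 is unused).
cylinders, heads, sectors = 1024, 16, 63

total_bytes = cylinders * heads * sectors * 512
print(total_bytes)                 # 528482304
print(total_bytes // (1024 ** 2))  # 504 MB, using 1MB = 2**20 bytes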

To work around the 504MB limit, new BIOSes were developed that would translate the address ranges that they liked to the addresses that the ATA specification liked. For example, if the application requested data from cylinder 8, head 4, and sector 32, the BIOS might translate that and request cylinder 26, head 2, sector 32 from the disk. For translation to work, the BIOS will report a hard disk geometry that is different from what actually exists on the disk. The translation process does not work for disks that are larger than 8.1GB.

BIOSes that perform address translation are not as common anymore, but an investigator may run into difficulties if he encounters such a system. If he pulls the disk out of the system and puts it into one of his systems, the translation might not exist or might be different, and an acquisition or dead analysis cannot be performed because the wrong sectors will be returned. To solve this problem, an investigator needs to use the original system or find a similar system that performs the same translation. An investigator can determine whether a system is doing translation by looking up the BIOS version on the vendor's Web site or by looking for references in the BIOS.

To overcome the 8.1GB limit associated with translation, the CHS addresses were abandoned, and Logical Block Addresses (LBA) became standard. LBA uses a single number, starting at 0, to address each sector and has been supported since the first formal ATA specification. With LBA, the software does not need to know anything about the geometry; it needs to know only a single number. Support for CHS addresses was removed from the ATA specification in ATA-6.

Unfortunately, some of the file system and other data structures still use CHS addresses, so we need to be able to convert from CHS to LBA throughout this book. LBA address 0 is CHS address 0,0,1, and LBA address 1 is CHS address 0,0,2. When all the sectors in the track have been used, the first sector at the next head in the same cylinder is used, which is CHS address 0,1,1. You can visualize this as filling the outer ring of the bottom platter, then moving up the platters until the top platter is reached; then the second ring on the bottom platter is used. The conversion algorithm is

LBA = (((CYLINDER * heads_per_cylinder) + HEAD) * sectors_per_track) + SECTOR - 1

where you replace CYLINDER, HEAD, and SECTOR with the respective CHS address values. For example, consider a disk that reported 16 heads per cylinder and 63 sectors per track. If we had a CHS address of cylinder 2, head 3, and sector 4, its conversion to LBA would be as follows:

2208 = (((2 * 16) + 3) * 63) + 4 - 1
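The same conversion as a small Python function; the default geometry values are the ones from this example, not constants that hold for every disk:

def chs_to_lba(cylinder, head, sector, heads_per_cylinder=16, sectors_per_track=63):
    # Sector numbers start at 1 within each track, hence the trailing -1.
    return ((cylinder * heads_per_cylinder + head) * sectors_per_track) + sector - 1

print(chs_to_lba(0, 0, 1))   # 0: the first sector on the disk
print(chs_to_lba(2, 3, 4))   # 2208, matching the worked example above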

Interface Standards

There are a lot of interface terms in the consumer hard disk arena, which can be confusing. Some of the terms mean the same thing, where a standards committee chose one and a hard disk company chose another. A description of the unofficial terms, such as "Enhanced IDE" and "Ultra ATA," can be found in the PC Guide's "Unofficial IDE/ATA Standards and Marketing Programs" (http://www.pcguide.com/ref/hdd/if/ide/unstd.htm). In general, each new standard adds a faster method of reading and writing data or fixes a size limitation of a previous standard.

Note that ATA specifications are applicable to only hard disks. Removable media, such as CD-ROMs and ZIP disks, need to use a special specification, called the AT Attachment Packet Interface (ATAPI). ATAPI devices typically use the same cables and controller, but they require special drivers.

Here are some of the highlights of the specifications that are of interest to an investigation:

• ATA-1: Originally published in 1994. This specification had support for CHS and 28-bit LBA addresses [T13 1994].

• ATA-3: This specification was published in 1997 and added reliability and security features. Self-Monitoring Analysis and Reporting Technology (SMART) was introduced, which attempts to improve reliability by monitoring several parts of the disk. Passwords were also introduced in this specification [T13 1997].

• ATA / ATAPI-4: ATAPI, the specification for removable media, was integrated into the ATA specification in ATA-4, which was published in 1998. The 80-wire cable was introduced to help decrease interference. ATA-4 added the HPA, which will be discussed later [T13 1998].

• ATA / ATAPI-6: This specification was published in 2002; it added 48-bit LBA addresses, removed support for CHS addresses, and added the DCO [T13 2002].

• ATA / ATAPI-7: This specification is still in draft form at the time of this writing. The drafts include Serial ATA, which will be discussed later.

Disk Commands

This section provides an overview of how the controller and hard disk communicate, which will help when we discuss hardware write protectors and the host protected area. This section does not apply to ATAPI devices, such as CD-ROMs.

The controller issues commands to the hard disk over the ribbon cable. The commands are issued to both disks on the cable, but part of the command identifies whether it is for the master or the slave. The controller communicates with the hard disk by writing to its registers, which are small pieces of memory. The registers work like an online order form: the controller writes data into specific registers like you would write data into specific fields of the form. When all the necessary data have been written to the registers, the controller writes to the command register, and the hard disk processes the command. This is like hitting the submit button of an HTML form. In theory, the disk should not do anything until the command register is written to.

For example, consider a case where the controller wants to read a sector of the disk. It would need to write the sector address and the number of sectors to read in the appropriate registers. After the command details have been written to the registers, the controller would instruct the hard disk to perform the read action by writing to the command register.
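A rough sketch of that sequence, assuming the legacy primary-channel register layout at I/O ports 0x1F0 to 0x1F7 and the 28-bit READ SECTORS opcode (0x20); the outb helper below just logs the writes and stands in for a real port-write primitive.

def outb(port, value):
    # Stand-in for an x86 port write; a real driver would touch hardware here.
    print("out 0x%03X <- 0x%02X" % (port, value))

lba = 2208                                # read 1 sector at the LBA computed earlier
outb(0x1F2, 1)                            # sector count register: 1 sector
outb(0x1F3, lba & 0xFF)                   # LBA bits 0-7
outb(0x1F4, (lba >> 8) & 0xFF)            # LBA bits 8-15
outb(0x1F5, (lba >> 16) & 0xFF)           # LBA bits 16-23
outb(0x1F6, 0xE0 | ((lba >> 24) & 0x0F))  # master disk, LBA mode, LBA bits 24-27
outb(0x1F7, 0x20)                         # command register: READ SECTORS
# Only this final write makes the disk act; the earlier writes just fill
# in the "form" described above.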

Hard Disk Passwords

The ATA-3 specification introduced new optional security features, including passwords that can be set through the BIOS or various software applications. If implemented, there are two passwords in hard disks: the user password and the master password. The master password was designed so that a company administrator could gain access to the computer in case the user password was lost. If passwords are being used, there are two modes that the disk can operate in: high and maximum. In the high security mode, either the user or the master password can unlock the disk. In the maximum security mode, the user password can unlock the disk, but the master password can unlock the disk only after the disk contents have been wiped. After a certain number of failed password attempts, the disk will freeze, and the system will need to be rebooted.

The hard disk will require the SECURITY_UNLOCK command to be executed with the correct password before many of the other ATA commands can be executed. After the correct password has been entered, the disk works normally until the disk is powered down.

Some ATA commands are still enabled on the hard disk when it is locked, so it may show up as a valid disk when inserted into a computer. However, when you try to read actual user data from a locked disk, it will either produce an error or require a password. There are several free programs on the Internet that will tell you if the disk is locked and will allow you to unlock it with the password. Two such programs are atapwd and hdunlock.[1] Some data-recovery companies may be able to bypass the password by opening the disk.

Host Protected Area

The Host Protected Area (HPA) is a special area of the disk that can be used to save data, and a casual observer might not see it. The size of this area is configurable using ATA commands, and many disks have a size of 0 by default.

The HPA was added in ATA-4, and the motivation was to provide a location where computer vendors could store data that would not be erased when a user formats and erases the hard disk contents. The HPA is at the end of the disk and, when used, can be accessed only by reconfiguring the hard disk.

Let's go over the process in more detail using the ATA commands. Some of the commands I will use have two versions depending on the size of the disk, but we will use only one of them. There are two commands that return maximum addressable sector values, and if an HPA exists, their return values will be different. The READ_NATIVE_MAX_ADDRESS command will return the maximum physical address, but the IDENTIFY_DEVICE command will return only the number of sectors that a user can access. Therefore, if an HPA exists, READ_NATIVE_MAX_ADDRESS will return the actual end of the disk, and IDENTIFY_DEVICE will return the end of the user area (and the start of the HPA). Note that the next section will show that READ_NATIVE_MAX_ADDRESS is not always the last physical address of the disk.

To create an HPA, the SET_MAX_ADDRESS command is used to set the maximum address to which the user should have access. To remove an HPA, the SET_MAX_ADDRESS command must be executed again with the actual maximum size of the disk, which can be found with READ_NATIVE_MAX_ADDRESS.

For example, if the disk is 20GB, READ_NATIVE_MAX_ADDRESS will return a sector count of 20GB (41,943,040, for example). To create a 1GB host protected area, we execute SET_MAX_ADDRESS with an address of 39,845,888. Any attempt to read from or write to the final 2,097,152 sectors (1GB) will generate an error, and the IDENTIFY_DEVICE command will return a maximum address of 39,845,888. We can see this in Figure 2.12. To remove the HPA, we would execute SET_MAX_ADDRESS with the full sector count.


Figure 2.12 A 1GB Host Protected Area (HPA) in a 20GB disk
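A minimal sketch of the detection logic, assuming the two sector counts have already been read from the disk; the numbers are the ones from the example above, and real tools obtain them by issuing the ATA commands through an OS-specific interface.

native_max = 41943040    # READ_NATIVE_MAX_ADDRESS: true end of the disk
identify_max = 39845888  # IDENTIFY_DEVICE: end of the user-accessible area

hpa_sectors = native_max - identify_max
if hpa_sectors > 0:
    # 2,097,152 sectors * 512 bytes = 1GB hidden at the end of the disk
    print("HPA detected: %d sectors (%d bytes)" % (hpa_sectors, hpa_sectors * 512))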

One of the settings for the SET_MAX_ADDRESS command is a 'volatility bit' that, when set, causes the HPA to exist after the hard disk is reset or power cycled. There is also a set of locking commands in the ATA specification that prevents modifications to the maximum address until the next reset. This allows the BIOS to read or write some data in the host protected area when the system is powering up, set the host protected area so that the user cannot see the data, and then lock the area so that it cannot be changed. A password can even be used (which is a different password than is used for accessing the disk).

In summary, a hard disk may have an HPA containing system files, hidden information, or maybe both. It can be detected by comparing the output of two ATA commands. To remove it, the maximum address of the hard disk must be reset, but the volatility setting allows the change to be temporary.

Device Configuration Overlay

In addition to data being hidden in an HPA, data can be hidden using a Device Configuration Overlay (DCO). The DCO was added in ATA-6 and allows the apparent capabilities of a hard disk to be limited. Each ATA specification has optional features that a disk may or may not implement. The computer uses the IDENTIFY_DEVICE command to determine which features a hard disk supports. A DCO can cause the IDENTIFY_DEVICE command to show that supported features are not supported and to show a smaller disk size than actually exists.

Let's look at some of the DCO commands. The DEVICE_CONFIGURATION_IDENTIFY command returns the actual features and size of a disk. Therefore, we can detect a DCO by comparing the outputs of DEVICE_CONFIGURATION_IDENTIFY and IDENTIFY_DEVICE. Further, recall that the READ_NATIVE_MAX_ADDRESS command returns the size of the disk after an HPA. We can detect a DCO that hides sectors by comparing the output of READ_NATIVE_MAX_ADDRESS with that of DEVICE_CONFIGURATION_IDENTIFY.

For example, consider a 20GB disk where a DCO has set the maximum address to 19GB. READ_NATIVE_MAX_ADDRESS and IDENTIFY_DEVICE show that the disk is only 19GB. If a 1GB HPA is also created, the IDENTIFY_DEVICE command shows that the size of the disk is 18GB. We can see this in Figure 2.13.


Figure 2.13 A DCO can hide sectors at the end of the disk, in addition to sectors hidden by an HPA
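Extending the earlier sketch to the three-value comparison; the sector counts below are illustrative stand-ins for the 20GB/19GB/18GB example, with 1GB treated as 2,097,152 sectors of 512 bytes.

dco_max = 41943040       # DEVICE_CONFIGURATION_IDENTIFY: true size of the disk
native_max = 39845888    # READ_NATIVE_MAX_ADDRESS: 19GB after the DCO
identify_max = 37748736  # IDENTIFY_DEVICE: 18GB after the DCO plus a 1GB HPA

print("DCO hides %d sectors" % (dco_max - native_max))      # 2097152 (1GB)
print("HPA hides %d sectors" % (native_max - identify_max)) # 2097152 (1GB)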

To create or change a DCO, the DEVICE_CONFIGURATION_SET command is used. To remove a DCO, the DEVICE_CONFIGURATION_RESET command is used. Unlike the HPA, there is no volatile option that allows the device to change the settings for only one session. All DCO changes are permanent across resets and power cycles.

Serial ATA

Working with ATA devices has its drawbacks. The cables are big, not flexible, and have the connectors in places that are frequently not where you want them. The hard disks also can be difficult to configure with master and slave jumpers. The cable and speed of the interface were some of the motivations behind the development of Serial ATA, which is included in the ATA-7 specification.

The interface is called serial because only one bit of information is transmitted between the controller and the disk at a time, as compared to 16 bits at a time with the original interface, or parallel ATA. The serial ATA connectors are about one-fourth the size of the parallel ATA connectors, and they have only seven contacts. Each serial ATA device is connected directly to the controller, and there is no chaining of multiple devices.

Serial ATA has been designed so that a new controller can be placed in a computer, and the computer does not know the difference between the original ATA (parallel ATA) and the new serial ATA. In fact, the serial ATA controller has registers that make the computer think that it is talking to a parallel ATA disk. The host computer sees each disk that is connected to the serial ATA controller as the master disk on its own channel.

BIOS versus Direct Access

Now that we know how ATA hard drives work and how they are controlled, we need to discuss how software interfaces with them, because this can cause problems when acquiring the contents of a disk. There are two methods that software can use to access the disk: directly through the hard disk controller or through the BIOS.

Direct Access to Controller

We saw in a previous section that the hard disk is connected to a hard disk controller, which issues commands to the hard disk using the ribbon cable. One technique for reading and writing data is for software to communicate directly with the hard disk controller, which then communicates with the hard disk.

To communicate this way, the software needs to know how to address the controller and how to issue commands to it. For example, the software needs to know what the command code for the read operation is, and it needs to know how to identify which sectors to read. The software also has to be able to query the hard disk for details such as its type and size.


BIOS Access to Controller

Accessing the hard disk directly is the fastest way to get data to and from the disk, but it requires the software to know quite a bit about the hardware. One of the jobs of the BIOS is to prevent software from having to know those details. The BIOS knows about the hardware, and it provides services to the software so that the software can more easily communicate with the hardware.

Recall from the "Boot Code Locations" section that the BIOS is used when the computer starts The BIOS performs many tasks during the boot process, but there are two that we are interested in for this discussion The first relevant task is that it determines the details of the currently installed disks The second relevant task is that it loads the interrupt table, which will be used to provide services to the operating system and software

To use the BIOS hard disk services, the software must load data, such as the sector address and sizes, into the CPU registers and execute the software interrupt command 0x13 (commonly called INT13h). The software interrupt command causes the processor to look at the interrupt table and locate the code that will process the service request. Typically, the table entry for interrupt 0x13 contains the address of the BIOS code that will use its knowledge of the hard disk to communicate with the controller. In essence, the BIOS works as a middleman between the software and the hard disk.

INT13h is actually a category of disk functions and includes functions that write to the disk, read from the disk, format tracks on the disk, and query the disk for information. The original INT13h functions for reading and writing used CHS addresses and allowed the software to access a disk that was only 8.1GB or smaller. To overcome this limitation, new functions, called the "extended INT13h" functions, were added to the BIOS.

The extended INT13h functions required new BIOS code and used a 64-bit LBA address. For backward compatibility reasons, the old CHS functions remained, and software had to be rewritten to take advantage of the new LBA INT13h functions.
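For a sense of the difference, here is a sketch of the 16-byte "disk address packet" that the extended INT13h read function (AH=0x42) takes, built with Python's struct module; the buffer address and LBA values are purely illustrative.

import struct

# Packet layout: size, reserved, sector count, buffer offset:segment,
# and a 64-bit starting LBA: no cylinder/head/sector geometry at all.
dap = struct.pack("<BBHHHQ",
                  0x10,    # packet size: 16 bytes
                  0,       # reserved
                  1,       # number of sectors to read
                  0x0000,  # buffer offset  (illustrative)
                  0x07E0,  # buffer segment (illustrative)
                  2208)    # 64-bit starting LBA
print(dap.hex())           # 100001000000e007a008000000000000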

SCSI Drives

When building a portable incident response kit, some of the more difficult decisions may include identifying what types of Small Computer Systems Interface (SCSI) cables, drives, and connectors should be included. This section gives an overview of SCSI, focuses on the different types, and describes how it is different from ATA. SCSI hard disks are not as common as ATA hard disks for consumer PCs, but they are standard on most servers.

Like ATA, there are many specifications of SCSI, which are published by the T10 Technical Committee for INCITS (http://www.t10.org). There are three SCSI specifications: SCSI-1, SCSI-2, and SCSI-3. SCSI-3 actually includes many smaller specifications, but covering all the details is out of the scope of this book.

SCSI versus ATA

There are both high-level and low-level differences between SCSI and ATA. The most obvious high-level difference is the numerous connector types. With ATA, there were only 40- and 44-pin connectors, but SCSI has many shapes and styles. The SCSI cables can be much longer than ATA cables, and there can be more than two devices on the same cable. Each device on the SCSI cable needs a unique numerical ID, which can be configured with jumpers on the disk or with software. Many SCSI disks also have a jumper to make the disk read only, which provides a similar function to an ATA write blocker. ATA write blockers are external devices that block write commands, and they will be discussed in Chapter 3, "Hard Disk Data Acquisition."

The first low-level difference between ATA and SCSI is that SCSI does not have a controller. The ATA interface was designed for a single controller to tell one or two hard disks what to do. SCSI was designed as a bus where different devices communicate with each other, and the devices are not limited to hard disks. With a SCSI configuration, the card that plugs into the computer is not a controller, because each device on the SCSI cable is essentially an equal and can make requests of each other.

Like ATA, standard SCSI is parallel, and data transfers occur in 8-bit or 16-bit chunks. Also like ATA, there is a serial version of the specification, which is the Serial Attached SCSI specification.

Types of SCSI

The differences in SCSI versions boil down to how many bits are transferred at a time, the frequency of the signals on the cable (the speed of the transfer), and what types of signals are used. Older types of SCSI had a normal version and a wide version, where the normal version transferred 8 bits at a time and the wide version transferred 16 bits at a time. For example, an Ultra SCSI device performs an 8-bit transfer, and a Wide Ultra SCSI device performs a 16-bit transfer. All newer systems use 16-bit transfers, and there is no need to differentiate between normal and wide.

The second difference in SCSI versions is the speed of the signals in the cable. Table 2.5 shows the names of the SCSI types, the speed, and the transfer rates for an 8-bit normal bus and a 16-bit wide bus.

Table 2.5 Speed differences among the different types of SCSI

Type Frequency 8-bit Transfer Rate 16-bit (wide) Transfer Rate

Within each of these types, there are different ways that the data are represented on the wire. The obvious method is single ended (SE), where a high voltage is placed on the wire if a 1 is being transmitted and no voltage is placed on the wire if a 0 is transmitted. This method runs into problems at higher speeds and with longer cables because the electric signal cannot stabilize at the high clock rate and the wires cause interference with each other.

The second method of transmitting the data is called differential voltage, and each bit actually requires two wires. If a 0 is being transmitted, no voltage is applied to either wire. If a 1 is being transmitted, a positive voltage is applied to one wire, and the opposite voltage is applied to the second wire. When a device reads the signals from the cable, it takes the difference between the two wires. A high voltage differential (HVD) signal option has existed in SCSI since the first version. A low voltage differential (LVD) signal option, which uses a smaller signal, has existed since Ultra2 SCSI and is the primary signal type for new disks. Table 2.6 shows the types of SCSI that use the different signal types.

Table 2.6 Signal types that can be found in each type of SCSI

Signal Type SCSI Types
