Chapter 5
Google’s Part in an Information Collection Framework
Solutions in this chapter:
■ The Principles of Automating Searches
■ Applications of Data Mining
■ Collecting Search Terms
There are various reasons for hacking. When most of us hear the word hacker we think about computer and network security, but lawyers, salesmen, and policemen are also hackers at heart. It’s really a state of mind and a way of thinking rather than a physical attribute. Why do people hack? There are a couple of motivators, but one specific reason is to be able to know things that the ordinary man on the street doesn’t. From this flow many of the other motivators. Knowledge is power: there’s a rush to seeing what others are doing without them knowing it. Understanding that the thirst for knowledge is central to hacking, consider Google, a massively distributed supercomputer, with access to all known information and with a deceptively simple user interface, just waiting to answer any query within seconds. It is almost as if Google was made for hackers.
The first edition of this book brought to light many techniques that a hacker (or penetration tester) might use to obtain information that would help him or her in conventional security assessments (e.g., finding networks, domains, e-mail addresses, and so on). During such a conventional security test (or pen test) the aim is almost always to breach security measures and get access to information that is restricted. However, this information can sometimes be reached simply by assembling related pieces of information together to form a bigger picture. This, of course, is not true for all information. The chances that I will find your super secret double encrypted document on Google are extremely slim, but you can bet that the way to get to it will eventually involve a lot of information gathering from public sources like Google.
If you are reading this book you are probably already interested in information mining: getting the most from search engines by using them in interesting ways. In this chapter I hope to show interesting and clever ways to do just that.
The Principles of Automating Searches
Computers help automate tedious tasks. Clever automation can accomplish what a thousand disparate people working simultaneously cannot. But it’s impossible to automate something that cannot be done manually. If you want to write a program to perform something, you need to have done the entire process by hand, and have that process work every time. It makes little sense to automate a flawed process. Once the manual process is ironed out, an algorithm is used to translate that process into a computer program.
Let’s look at an example. A user is interested in finding out which Web sites contain the e-mail address andrew@syngress.com. As a start, the user opens Google and types the e-mail address in the input box. The results are shown in Figure 5.1.
Figure 5.1 A Simple Search for an E-mail Address
The user sees that there are three different sites with that e-mail address listed: g.bookpool.com, www.networksecurityarchive.org, and book.google.com. In the back of his or her mind is the feeling that these are not the only sites where the e-mail address appears, and he or she remembers seeing places where e-mail addresses are written as andrew at syngress dot com. When the user puts this search into Google, he or she gets different results, as shown in Figure 5.2.
Clearly the lack of quotes around the query gave incorrect results. The user adds the quotes and gets the results shown in Figure 5.3.
Figure 5.2 Expanding the Search
Figure 5.3 Expansion with Quotes
By formulating the query differently, the user now has a new result: taosecurity.blogspot.com. The manipulation of the search query worked, and the user has found another site reference.
If we break this process down into logical parts, we see that there are actually many different steps that were followed. Almost all searches follow these steps:
■ Define an original search term
■ Expand the search term
■ Get data from the data source
■ Parse the data
■ Post-process the data into information
Let’s look at these in more detail.
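The five steps above can be sketched as a small pipeline. This is only an illustrative sketch, not a tool from this chapter: the function names, the pluggable `fetch` callable, and the canned result text are all assumptions, and no real search engine is contacted.

```python
import re
from typing import Callable, Iterable


def expand(term: str) -> list[str]:
    """Step 2: turn one e-mail address into several quoted query variants."""
    user, domain = term.split("@", 1)
    spelled = f"{user} at {domain.replace('.', ' dot ')}"
    # Quote each variant so the words are matched as a phrase.
    return [f'"{term}"', f'"{spelled}"']


def parse(raw: str) -> list[str]:
    """Step 4: pull host names out of the raw result text."""
    return re.findall(r"https?://([^/\s]+)", raw)


def post_process(hosts: Iterable[str]) -> list[str]:
    """Step 5: reduce parsed data to information (unique, sorted sites)."""
    return sorted(set(hosts))


def run_search(term: str, fetch: Callable[[str], str]) -> list[str]:
    """Run steps 2-5 for one original search term (step 1 is the argument)."""
    hosts: list[str] = []
    for query in expand(term):      # step 2: expand
        raw = fetch(query)          # step 3: get data from the data source
        hosts.extend(parse(raw))    # step 4: parse
    return post_process(hosts)      # step 5: post-process


# Hypothetical stand-in for a data source: canned text keyed by query.
CANNED = {
    '"andrew@syngress.com"':
        "http://g.bookpool.com/x http://book.google.com/y",
    '"andrew at syngress dot com"':
        "http://taosecurity.blogspot.com/z http://book.google.com/y",
}


def fake_fetch(query: str) -> str:
    return CANNED.get(query, "")


if __name__ == "__main__":
    print(run_search("andrew@syngress.com", fake_fetch))
```

Keeping the data source behind a single callable means the expansion, parsing, and post-processing steps can be worked out and checked by hand before any real source is plugged in.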
The Original Search Term
The goal of the previous example was to find Web pages that reference a specific e-mail address. This seems rather straightforward, but clearly defining a goal is probably the most difficult part of any search. Brilliant searching won’t help attain an unclear goal. When automating a search, the same principles apply as when doing a manual search: garbage in, garbage out.
Tools & Traps…
Garbage in, garbage out
Computers are bad at “thinking” and good at “number crunching.” Don’t try to make
a computer think for you, because you will be bitterly disappointed with the results.
The principle of garbage in, garbage out simply states that if you enter bad information into a computer from the start, you will only get garbage (or bad information) out. Inexperienced search engine users often wrestle with this basic principle.
In some cases, goals may need to be broken down. This is especially true of broad goals, like trying to find e-mail addresses of people that work in cheese factories in the Netherlands. In this case, at least one sub-goal exists: you’ll need to define the cheese factories first. Be sure your goals are clearly defined, then work your way to a set of core search terms. In some cases, you’ll need to play around with the results of a single query in order to work your way towards a decent starting search term. I have often seen results of a query and thought, “Wow, I never thought that my query would return these results. If I shape the query a little differently each time with automation, I can get loads of interesting information.”
In the end the only real limit to what you can get from search engines is your own imagination, and experimentation is the best way to discover what types of queries work well.
Expanding Search Terms
In our example, the user quickly figured out that they could get more results by changing the original query into a set of slightly different queries. Expanding search terms is fairly natural for humans, and the real power of search automation lies in thinking about that human process and translating it into some form of algorithm. By programmatically changing the standard form of a search into many different searches, we save ourselves from manual repetition, and more importantly, from having to remember all of the expansion tricks. Let’s take a look at a few of these expansion techniques.
E-mail Addresses
Many sites try to obscure e-mail addresses in order to fool data mining programs. This is done for a good reason: the majority of the data mining programs troll sites to collect e-mail addresses for spammers. If you want a surefire way to receive a lot of spam, post to a mailing list that does not obscure your e-mail address. While it’s a good thing that sites automatically obscure the e-mail address, it also makes our lives as Web searchers difficult. Luckily, there are ways to beat this; however, these techniques are also not unknown to spammers.
When searching for an e-mail address we can use the following expansions. The e-mail address andrew@syngress.com could be expanded as follows:
■ andrew at syngress.com
■ andrew at syngress dot com
■ andrew@syngress dot com
■ andrew_at_syngress.com
■ andrew_at_syngress dot com
■ andrew_at_syngress_dot_com
■ andrew@syngress.remove.com
■ andrew@_removethis_syngress.com
Note that the “@” sign can be written in many forms (e.g., (at), _at_, or -at-). The same goes for the dot (“.”). You can also see that many people add “remove” or “removethis” in an e-mail address. In the end it becomes an 80/20 thing: you will find 80 percent of addresses when implementing the top 20 percent of these expansions.
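The expansions described above lend themselves to a small generator. The sketch below is an assumption about how one might code it, not the author's implementation: the substitution lists `AT_FORMS` and `DOT_FORMS` and the function name `expand_email` are made up for illustration, and only the last dot of the domain is rewritten.

```python
from itertools import product

# Assumed substitution lists; extend them with forms you meet in practice.
AT_FORMS = ["@", " at ", "_at_"]
DOT_FORMS = [".", " dot ", "_dot_"]


def expand_email(address: str) -> list[str]:
    """Return obfuscated spellings of an address, original spelling first."""
    user, domain = address.split("@", 1)
    # Only the final dot is rewritten, so "example.co.uk" keeps its first dot.
    name, tld = domain.rsplit(".", 1)
    variants = [
        f"{user}{at}{name}{dot}{tld}"
        for at, dot in product(AT_FORMS, DOT_FORMS)
    ]
    # "remove"-style decoys that people insert by hand.
    variants.append(f"{user}@{name}.remove.{tld}")
    variants.append(f"{user}@_removethis_{name}.{tld}")
    return variants


if __name__ == "__main__":
    for variant in expand_email("andrew@syngress.com"):
        print(variant)
```

Three spellings of the "@" sign and three of the dot give nine combinations, plus the two decoy forms. In the spirit of the 80/20 rule, it is probably better to keep these lists short than to chase every variation.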
At this stage you might feel that you’ll never find every instance of the address (and you may be right). But there is a tiny light at the end of the tunnel. Google ignores certain characters in a search. A search for andrew@syngress.com and “andrew syngress com” returns the same results. The @ sign and the dot are simply ignored. So when expanding search terms, don’t include both, because you are simply wasting a search.
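One way to avoid wasting searches is to reduce every candidate query to a key before sending it and to skip any query whose key has already been seen. This is a minimal sketch: it treats only the “@” sign and the dot as ignored, because those are the two characters named above, and the function names are hypothetical.

```python
def google_key(query: str) -> str:
    """Replace the characters the text says are ignored ("@" and ".") with spaces."""
    for ch in "@.":
        query = query.replace(ch, " ")
    # Collapse runs of whitespace so equivalent queries compare equal.
    return " ".join(query.split())


def drop_equivalent(queries: list[str]) -> list[str]:
    """Keep the first query for each distinct key, preserving input order."""
    seen: set[str] = set()
    kept: list[str] = []
    for query in queries:
        key = google_key(query)
        if key not in seen:
            seen.add(key)
            kept.append(query)
    return kept


if __name__ == "__main__":
    candidates = [
        "andrew@syngress.com",
        "andrew syngress com",      # same key as the line above
        "andrew at syngress.com",
    ]
    print(drop_equivalent(candidates))
```

If other characters turn out to be ignored as well, they can be added to the string in `google_key`; that would be an extension beyond what the text states.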
Tools & Traps…
Verifying an e-mail address
Here’s a quick hack to verify if an e-mail address exists. While this might not work on all mail servers, it works on the majority of them, including Gmail. Have a look:
■ Step 1: Find the mail server:
$ host -t mx gmail.com
gmail.com mail is handled by 5 gmail-smtp-in.l.google.com.
gmail.com mail is handled by 10 alt1.gmail-smtp-in.l.google.com.
gmail.com mail is handled by 10 alt2.gmail-smtp-in.l.google.com.
gmail.com mail is handled by 50 gsmtp163.google.com.
gmail.com mail is handled by 50 gsmtp183.google.com.
■ Step 2: Pick one and Telnet to port 25:
$ telnet gmail-smtp-in.l.google.com 25
Trying 64.233.183.27
Connected to gmail-smtp-in.l.google.com.
Escape character is '^]'.
220 mx.google.com ESMTP d26si15626330nfh
■ Step 3: Mimic the Simple Mail Transfer Protocol (SMTP):
HELO test
250 mx.google.com at your service
MAIL FROM: <test@test.com>
250 2.1.0 OK
■ Step 4a: Positive test:
RCPT TO: <roelof.temmingh@gmail.com>
250 2.1.5 OK
■ Step 4b: Negative test:
RCPT TO: <kosie.kramer@gmail.com>
550 5.1.1 No such user d26si15626330nfh
■ Step 5: Say goodbye:
quit
221 2.0.0 mx.google.com closing connection d26si15626330nfh
By inspecting the responses from the mail server we have now verified that roelof.temmingh@gmail.com exists, while kosie.kramer@gmail.com does not. In the same way, we can verify the existence of other e-mail addresses.
NOTE
On Windows platforms you will need to use the nslookup command to find the e-mail servers for a domain. You can do this as follows:
nslookup -qtype=mx gmail.com
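The manual SMTP conversation in the sidebar (steps 2 through 5) can be replayed with Python's standard smtplib module. This is a hedged sketch rather than a reliable verifier: the function names and the three-way verdict are assumptions, the MX host still has to be looked up separately as in step 1, and many servers accept or defer every recipient, which is why anything other than a clear acceptance or rejection is reported as "unknown".

```python
import smtplib


def classify_rcpt(code: int) -> str:
    """Map the SMTP reply code for RCPT TO onto a coarse verdict."""
    if code in (250, 251):
        return "accepted"
    if code in (550, 551, 553):
        return "rejected"
    # Temporary failures, greylisting, rate limits, and so on.
    return "unknown"


def check_address(mx_host: str, address: str,
                  sender: str = "test@test.com",
                  timeout: float = 10.0) -> str:
    """Connect to mx_host on port 25 and classify the reply to RCPT TO."""
    # Leaving the "with" block sends QUIT (step 5) and closes the connection.
    with smtplib.SMTP(mx_host, 25, timeout=timeout) as smtp:
        smtp.helo("test")                     # step 3: HELO
        smtp.mail(sender)                     # step 3: MAIL FROM
        code, _message = smtp.rcpt(address)   # step 4: RCPT TO
    return classify_rcpt(code)


if __name__ == "__main__":
    # Host name taken from the sidebar output; replies may differ today.
    print(check_address("gmail-smtp-in.l.google.com",
                        "roelof.temmingh@gmail.com"))
```

Separating the reply-code classification from the network call keeps the decision logic easy to check without touching a mail server.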
Telephone Numbers
While e-mail addresses have a set format, telephone numbers are a different kettle of fish. It appears that there is no standard way of writing down a phone number. Let’s assume you have a number that is in South Africa and the number itself is 012 555 1234. The number can appear on the Internet in many different forms:
■ 012 555 1234 (local)
■ 012 5551234 (local)
■ 0125551234 (local)
■ +27 12 555 1234 (with the country code)
■ +27 12 5551234 (with the country code)
■ +27 (0)12 555 1234 (with the country code)
■ 0027 (0)12 555 1234 (with the country code)
One way of catching all of the results would be to look for the most significant part of the number, “555 1234” and “5551234.” However, this has a drawback, as you might find that the same number exists in a totally different country, giving you a false positive.
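Phone number forms like those listed above can be generated from the number's parts. The sketch below assumes the number is supplied already split into area code, prefix, and line number, and that the leading zero of the area code is the trunk digit dropped after the country code. The function name and parameters are hypothetical.

```python
def expand_phone(area: str, prefix: str, line: str,
                 country: str = "27") -> list[str]:
    """Spell one phone number in several common written forms.

    area is the local area code including its leading zero, e.g. "012".
    """
    trunkless = area.lstrip("0")  # "012" -> "12"
    return [
        f"{area} {prefix} {line}",                      # local, spaced
        f"{area} {prefix}{line}",                       # local, partly joined
        f"{area}{prefix}{line}",                        # local, fully joined
        f"+{country} {trunkless} {prefix} {line}",      # with country code
        f"+{country} {trunkless} {prefix}{line}",
        f"+{country} (0){trunkless} {prefix} {line}",
        f"00{country} (0){trunkless} {prefix} {line}",
    ]


if __name__ == "__main__":
    for form in expand_phone("012", "555", "1234"):
        print(form)
```

Searching on the full forms, rather than only on “555 1234”, trades some recall for fewer false positives from other countries.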
An interesting way to look for results that contain telephone numbers within a certain range is by using Google’s numrange operator. A shortcut for this is to specify the start number, then “..” followed by the end number. Let’s see how this works in real life. Imagine I want to see what results I can find on the area code +1 252 793. I can use the numrange operator to specify the query as shown in Figure 5.4.
Figure 5.4 Searching for Telephone Number Ranges
We can clearly see that the results all contain numbers located in the specified range in North Carolina. We will see how this ability to restrict results to a certain area is very useful later in this chapter.
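A number-range query of the start..end form can be built from a digit prefix by padding it with zeros for the start and nines for the end. This sketch assumes the numbers appear on pages as one unbroken run of digits; the function name, the ten-digit default, and the optional extra terms are assumptions, and the exact query used for Figure 5.4 is not reproduced here.

```python
def numrange_query(prefix_digits: str, total_digits: int = 10,
                   extra: str = "") -> str:
    """Build a 'start..end' query covering all numbers with the given prefix."""
    missing = total_digits - len(prefix_digits)
    if missing < 0:
        raise ValueError("prefix is longer than the full number")
    start = prefix_digits + "0" * missing
    end = prefix_digits + "9" * missing
    # strip() removes the trailing space when no extra terms are given.
    return f"{start}..{end} {extra}".strip()


if __name__ == "__main__":
    print(numrange_query("252793"))
```

For the 252 793 prefix this yields a range spanning the ten thousand numbers that share those first six digits.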
People
One of the best ways to find information about someone is to Google them. If you haven’t Googled for yourself, you are the odd one out. There are many ways to search for a person and most of them are straightforward. If you don’t get results straight away don’t worry, there are numerous options. Assuming you are looking for Andrew Williams you might search for:
■ “Andrew Williams”
■ “Williams Andrew”
■ “A Williams”
■ “Andrew W”
■ Andrew Williams
■ Williams Andrew
Note that the last two searches do not have quotes around them. This is to find phrases like “Andrew is part of the Williams family.”
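The name variations above follow a fixed pattern, so they are easy to generate. The sketch below is an illustrative assumption (the function name is made up) that produces the four quoted phrases followed by the two unquoted searches, using plain ASCII quotes as they would be typed into the search box.

```python
def expand_name(first: str, last: str) -> list[str]:
    """Return quoted and unquoted search variants for a person's name."""
    phrases = [
        f"{first} {last}",       # Andrew Williams
        f"{last} {first}",       # Williams Andrew
        f"{first[0]} {last}",    # A Williams
        f"{first} {last[0]}",    # Andrew W
    ]
    quoted = [f'"{phrase}"' for phrase in phrases]
    # Unquoted forms also match pages where other words separate the names.
    unquoted = [f"{first} {last}", f"{last} {first}"]
    return quoted + unquoted


if __name__ == "__main__":
    for query in expand_name("Andrew", "Williams"):
        print(query)
```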
With a name like Andrew Williams you can be sure to get a lot of false positives, as there are probably many people named Andrew Williams on the Internet. As such, you need to add as many additional search terms to your search as possible. For example, you may try something like “Andrew Williams” Syngress publishing security. Another tip to reduce false positives is to restrict the search to sites in a particular country. If Andrew stayed in England, adding the site:uk operator would help limit the results. But keep in mind that your searches are then limited to sites in the UK. If Andrew is indeed from the UK but posts on sites that end in any other top-level domain (TLD), this search won’t return hits from those sites.
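Adding context terms and an optional site restriction to a quoted name is simple string assembly. A minimal sketch follows, with a hypothetical function name and ASCII quotes in place of the typeset ones.

```python
def narrow_query(phrase: str, extra_terms: list[str], site: str = "") -> str:
    """Combine a quoted phrase, extra terms, and an optional site: restriction."""
    parts = [f'"{phrase}"', *extra_terms]
    if site:
        parts.append(f"site:{site}")
    return " ".join(parts)


if __name__ == "__main__":
    terms = ["Syngress", "publishing", "security"]
    print(narrow_query("Andrew Williams", terms))
    print(narrow_query("Andrew Williams", terms, site="uk"))
```

Running the same query both with and without the site restriction is one way to see how many hits the restriction is costing you.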
Getting Lots of Results
In some cases you’d be interested in getting a lot of results, not just specific results. For instance, you want to find all Web sites or e-mail addresses within a certain TLD. Here you want to combine your searches with keywords that do two things: get past the 1,000 result restriction and increase your yield per search. As an example, consider finding Web sites in the ****.gov domain, as shown in Figure 5.5.
Figure 5.5 Searching for a Domain
You will get at most 1,000 results from the query, and probably far fewer distinct sites, because you will most likely get more than one result from a single site. In other words, if 500 pages are located on one server and 500 pages are located on another server you will only get two site results.
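The gap between results and sites is easy to measure once result URLs have been collected. The sketch below counts results per host with the standard library; the function name and the example host names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def results_per_site(urls: list[str]) -> Counter:
    """Count how many result URLs fall on each host."""
    return Counter(urlsplit(url).hostname for url in urls)


if __name__ == "__main__":
    # 1,000 made-up results spread over only two servers.
    urls = [f"http://a.example.gov/page{i}" for i in range(500)]
    urls += [f"http://b.example.gov/page{i}" for i in range(500)]
    counts = results_per_site(urls)
    print(len(urls), "results on", len(counts), "sites")
```

A low ratio of sites to results suggests that the query should be varied rather than paged further.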