Figure 5.20 The LinkedIn Profile of the Author of a Government Document
Can this process of grabbing documents and analyzing them be automated? Of course!
As a start we can build a scraper that will find the URLs of Office documents (.doc, .ppt, .xls, .pps). We then need to download each document and push it through the meta information parser. Finally, we can extract the interesting bits and do some post-processing on them. We already have a scraper (see the previous section), so all we need is something that will extract the meta information from the file. Thomas Springer at ServerSniff.net was kind enough to provide me with the source of his document information script. After some slight changes it looks like this:
#!/usr/bin/perl
# File-analyzer 0.1, 07/08/2007, thomas springer
# stripped-down version
# slightly modified by roelof temmingh @ paterva.com
# this code is public domain - use at own risk
# this code is using phil harveys ExifTool - THANK YOU, PHIL!!!!
# http://www.ebv4linux.de/images/articles/Phil1.jpg
use strict;
use Image::ExifTool;
#passed parameter is a URL
my ($url)=@ARGV;
# get file and make a nice filename
my $file=get_page($url);
my $time=time;
my $frand=rand(10000);
my $fname="/tmp/".$time.$frand;
# write stuff to a file
open(FL, ">$fname");
print FL $file;
close(FL);
# Get EXIF-INFO
my $exifTool=new Image::ExifTool;
$exifTool->Options(FastScan => '1');
$exifTool->Options(Binary => '1');
$exifTool->Options(Unknown => '2');
$exifTool->Options(IgnoreMinorErrors => '1');
my $info = $exifTool->ImageInfo($fname); # feed standard info into a hash
# delete tempfile
unlink ("$fname");
my @names;
print "Author:".$$info{"Author"}."\n";
print "LastSaved:".$$info{"LastSavedBy"}."\n";
print "Creator:".$$info{"creator"}."\n";
print "Company:".$$info{"Company"}."\n";
print "Email:".$$info{"AuthorEmail"}."\n";
exit; # comment out this line to see all the fields
foreach (keys %$info){
print "$_ = $$info{$_}\n";
}
sub get_page{
my ($url)=@_;
#use curl to get it - you might want change this
# 25 second timeout - also modify as you see fit
my $res=`curl -s -m 25 $url`;
return $res;
}
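The get_page routine simply shells out to curl, and the comment in it invites you to change that. If you would rather stay inside Perl, a rough sketch of an equivalent routine using the LWP::UserAgent module (assuming it is installed) could look like this:
# sketch of a pure-Perl alternative to get_page - assumes LWP::UserAgent
# the 25-second timeout mirrors the curl version above
use LWP::UserAgent;
sub get_page_lwp{
  my ($url)=@_;
  my $ua=LWP::UserAgent->new(agent=>'moo', timeout=>25);
  my $res=$ua->get($url);
  return $res->is_success ? $res->content : "";
}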
Save this script as docinfo.pl. You will notice that you'll need some Perl libraries to use this, specifically the Image::ExifTool library, which is used to get the metadata from the files. The script uses curl to download the documents from the server, so you'll need that as well. Curl is set to a 25-second timeout; on a slow link you might want to increase that. Let's see how this script works:
$ perl docinfo.pl http://www.elsevier.com/framework_support/permreq.doc
Author:Catherine Nielsen
LastSaved:Administrator
Creator:
Company:Elsevier Science
Email:
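To poke at a single downloaded file and see every tag ExifTool recovers (the same output you would get by removing the exit line in docinfo.pl), a minimal standalone sketch could look like this; the file name is just an example:
#!/usr/bin/perl
# sketch: dump every metadata tag ExifTool finds in a local file
# usage: perl dumpinfo.pl /tmp/somefile.doc
use strict;
use Image::ExifTool;
my ($fname)=@ARGV;
my $exifTool=new Image::ExifTool;
my $info=$exifTool->ImageInfo($fname);
foreach my $tag (sort keys %$info){
  print "$tag = $$info{$tag}\n";
}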
The script looks for five fields in a document: Author, LastSavedBy, Creator, Company, and AuthorEmail. There are many other fields that might be of interest (like the software used to create the document). On its own this script is only mildly interesting, but it really starts to become powerful when you combine it with a scraper and do some post-processing on the results. Let's modify the existing scraper a bit to look like this:
#!/usr/bin/perl
use strict;
my ($domain,$num)=@ARGV;
my @types=("doc","xls","ppt","pps");
my $result;
foreach my $type (@types){
$result=`curl -s -A moo "http://www.google.com/search?q=filetype:$type+site:$domain&hl=en&num=$num&filter=0"`;
parse($result);
}
sub parse {
my $start;
my $end;
my $token="<div class=g>";
my $count=1;
while (1){
$start=index($result,$token,$start);
$end=index($result,$token,$start+1);
if ($start == -1 || $end == -1 || $start == $end){
last;
}
my $snippet=substr($result,$start,$end-$start);
my ($pos,$url) = cutter("<a href=\"","\"",0,$snippet);
my ($pos,$heading) = cutter(">","</a>",$pos,$snippet);
my ($pos,$summary) = cutter("<font size=-1>","<br>",$pos,$snippet);
# remove <b> and </b>
$heading=cleanB($heading);
$url=cleanB($url);
$summary=cleanB($summary);
print $url."\n";
$start=$end;
$count++;
}
}
sub cutter{
my ($starttok,$endtok,$where,$str)=@_;
my $startcut=index($str,$starttok,$where)+length($starttok);
my $endcut=index($str,$endtok,$startcut+1);
my $returner=substr($str,$startcut,$endcut-$startcut);
my @res;
push @res,$endcut;
push @res,$returner;
return @res;
}
sub cleanB{
my ($str)=@_;
$str=~s/<b>//g;
$str=~s/<\/b>//g;
return $str;
}
Save this script as scraper.pl. The scraper takes a domain and a number as parameters. The number is the number of results to return, but multiple-page support is not included in the code. However, it's child's play to modify the script to scrape multiple pages from Google (a rough sketch of that change follows).
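Using Google's start parameter to page through results, the main loop could be replaced with something like this; the $pages count here is just an example:
# sketch: walk $pages result pages per file type instead of one request
my $pages=3;
foreach my $type (@types){
  for (my $page=0; $page<$pages; $page++){
    my $start=$page*100;
    $result=`curl -s -A moo "http://www.google.com/search?q=filetype:$type+site:$domain&hl=en&num=100&start=$start&filter=0"`;
    parse($result);
  }
}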
Note that the scraper has been modified to look for some common Microsoft Office formats and will loop through them with a filetype:XX site:domain search term. Now all that is needed is something that will put everything together and do some post-processing on the results. The code could look like this:
#!/usr/bin/perl
use strict;
my ($domain,$num)=@ARGV;
my %ALLEMAIL=(); my %ALLNAMES=();
my %ALLUNAME=(); my %ALLCOMP=();
my $scraper="scrape.pl";
my $docinfo="docinfo.pl";
print "Scraping please wait \n";
my @all_urls=`perl $scraper $domain $num`;
if ($#all_urls == -1 ){
print "Sorry - no results!\n";
exit;
}
my $count=0;
foreach my $url (@all_urls){
print "$count / $#all_urls : Fetching $url";
my @meta=`perl $docinfo $url`;
foreach my $item (@meta){
process($item);
}
$count++;
}
#show results
print "\nEmails:\n -\n";
foreach my $item (keys %ALLEMAIL){
print "$ALLEMAIL{$item}:\t$item";
}
print "\nNames (Person):\n -\n";
foreach my $item (keys %ALLNAMES){
print "$ALLNAMES{$item}:\t$item";
}
print "\nUsernames:\n -\n";
foreach my $item (keys %ALLUNAME){
print "$ALLUNAME{$item}:\t$item";
}
print "\nCompanies:\n -\n";
foreach my $item (keys %ALLCOMP){
print "$ALLCOMP{$item}:\t$item";
}
sub process {
my ($passed)=@_;
my ($type,$value)=split(/:/,$passed);
$value=~tr/A-Z/a-z/;
if (length($value)<=1) {return;}
if ($value =~ /[a-zA-Z0-9]/){
if ($type eq "Company"){$ALLCOMP{$value}++;}
else {
if (index($value,"\@")>2){$ALLEMAIL{$value}++;}
elsif (index($value," ")>0){$ALLNAMES{$value}++;}
else {$ALLUNAME{$value}++;}
}
}
}
This script first kicks off scraper.pl with the domain and the number of results that were passed to it as parameters. It captures the output (a list of URLs) of that process in an array, and then runs the docinfo.pl script against every URL. The output of docinfo.pl is then sent for further processing, where some basic checking is done to see whether each value is a company name, an e-mail address, a user name, or a person's name. These are stored in separate hash tables for later use. When everything is done, the script displays each collected piece of information and the number of times it occurred across all pages. Does it actually work? Have a look:
# perl combined.pl xxx.gov 10
Scraping please wait
0 / 35 : Fetching http://www.xxx.gov/8878main_C_PDP03.DOC
1 / 35 : Fetching http://***.xxx.gov/1329NEW.doc
2 / 35 : Fetching http://***.xxx.gov/LP_Evaluation.doc
3 / 35 : Fetching http://*******.xxx.gov/305.doc
<cut>
Emails:
-
1: ***zgpt@***.ksc.xxx.gov
1: ***ikrb@kscems.ksc.xxx.gov
1: ***ald.l.***mack@xxx.gov
1: ****ie.king@****.xxx.gov
Names (Person):
-
1: audrey sch***
1: corina mo****
1: frank ma****
2: eileen wa****
2: saic-odin-**** hq
1: chris wil****
1: nand lal****
1: susan ho****
2: john jaa****
1: dr paul a cu****
1: *** project/code 470
1: bill mah****
1: goddard, pwdo - bernadette fo****
1: joanne wo****
2: tom naro****
1: lucero ja****
1: jenny rumb****
1: blade ru****
1: lmit odi****
2: **** odin/osf seat
1: scott w mci****
2: philip t me****
1: annie ki****
Usernames:
-
1: cgro****
1: gidel****
1: rdcho****
1: fbuchan****
2: sst****
1: rbene****
1: rpan****
2: l.j.klau****
1: gane****h
1: amh****
1: caroles****
2: mic****e
1: baltn****r
3: pcu****
1: md****
1: ****wxpadmin
1: mabis****
1: ebo****
2: grid****
1: bkst****
1: ***(at&l)
Companies:
-
1: shadow conservatory
[SNIP]
The list of companies has been chopped way down to protect the identity of the government agency in question, but the script seems to work well. The script can easily be modified to scrape many more results (across many pages), extract more fields, and get other file types. By the way, what the heck is the one unedited company known as the "Shadow Conservatory"?
Figure 5.21 Zero Results for "Shadow Conservatory"
The tool also works well for finding out what user name format is used (and whether one is used at all). Consider this list of user names mined from somewhere:
Usernames:
-
1: 79241234
1: 78610276
1: 98229941
1: 86232477
2: 82733791
2: 02000537
1: 79704862
1: 73641355
2: 85700136
From the list it is clear that an eight-digit number is used as the user name. This information might be very useful in later stages of an attack.
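Once you suspect a convention like this, a quick pass over the mined user names confirms how consistently it holds. A small sketch, reusing the %ALLUNAME hash from the combined script and the eight-digit pattern spotted above:
# sketch: count how many mined user names match the suspected format
my ($matches,$total)=(0,0);
foreach my $uname (keys %ALLUNAME){
  $total++;
  $matches++ if ($uname =~ /^\d{8}\s*$/);
}
print "$matches of $total user names are eight-digit numbers\n";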
Taking It One Step Further
Sometimes you end up in a situation where you want to hook the output of one search up as the input for another process. This process might be another search, or it might be something like looking up an e-mail address on a social network, converting a DNS name to a domain, resolving a DNS name, or verifying the existence of an e-mail account. How do I link two e-mail addresses together? Consider Johnny's e-mail address johnny@ihackstuff.com and my previous e-mail address at SensePost, roelof@sensepost.com. To link these two addresses together we can start by searching for one of the e-mail addresses and extracting sites, e-mail addresses, and phone numbers. Once we have these results we can do the same for the other e-mail address and then compare them to see if there are any common results (or nodes). In this case there are common nodes (see Figure 5.22).
Figure 5.22 Relating Two E-mail Addresses from Common Data Sources
If there are no matches, we can loop through all of the results of the first e-mail address, again extracting e-mail addresses, sites, and telephone numbers, and then repeat it for the second address in the hope that there are common nodes.
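The comparison step itself is trivial once the mining is done. A short sketch, with placeholder node lists standing in for whatever sites, e-mail addresses, and phone numbers were actually extracted for each address:
# sketch: find the common nodes between two mined result lists
# the values below are placeholders, not real mined data
use strict;
my @nodes_a=('ihackstuff.com','+1 555 0100','someone@example.com');
my @nodes_b=('sensepost.com','+1 555 0100','other@example.com');
# build a lookup from the first list, keep anything from the second
# list that also appears in it (case-insensitive)
my %seen=map { lc($_) => 1 } @nodes_a;
my @common=grep { $seen{lc($_)} } @nodes_b;
print "Common node: $_\n" foreach @common;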
What about more complex sequences that involve more than searching? Can you get locations of the Pentagon data centers by simply looking at public information? Consider Figure 5.23.
What's happening here? While it looks seriously complex, it really isn't. The procedure to get to the locations shown in this figure is as follows: