DOI 10.1007/s11042-013-1826-9
Building 3D event logs for video investigation
Trung Kien Dang · Marcel Worring · The Duy Bui
© Springer Science+Business Media New York 2014
Abstract In scene investigation, creating a video log captured using a handheld camera is more convenient and more complete than taking photos and notes. By introducing video analysis and computer vision techniques, it is possible to build a spatio-temporal representation of the investigation. Such a representation gives a better overview than a set of photos and makes an investigation more accessible. We develop such methods and present an interface for navigating the result. The processing includes (i) segmenting a log into events using novel structure and motion features, making the log easier to access in the time dimension, and (ii) mapping video frames to a 3D model of the scene so the log can be navigated in space. Our results show that, using our proposed features, we can recognize more than 70 percent of all frames correctly and, more importantly, find all the events. From there we provide a method to semi-interactively map those events to a 3D model of the scene. With this we can map more than 80 percent of the events. The result is a 3D event log that captures the investigation and supports applications such as revisiting the scene, examining the investigation itself, or hypothesis testing.
Keywords Scene investigation · Video analysis · Story navigation · 3D model
1 Introduction
The increasing availability of cameras and the reduced cost of storage have encouraged people to use images and videos in many aspects of their lives. Instead of writing a diary, nowadays many people capture their daily activities in video logs. When such capturing is continuous, this is known as "life logging". This idea goes back to Vannevar Bush's Memex device [5] and is still a topic of active research [2,8,33]. Similarly, professional activities can be recorded with video to create logs. For example, in home safety assessment, an investigator can walk around, examine a house, and record speech notes when finding a construction issue. Another interesting professional application is crime scene investigation. Instead of purposely taking photos and writing notes while looking for evidence, investigators wear a head-mounted camera and focus on finding the evidence, while everything is automatically recorded in a log. These professional applications all share a similar setup, namely a first-person-view video log recorded in a typically static scene. In this paper, we focus on this group of professional logging applications, which we denote by scene investigation. Our proposed scene investigation framework includes three phases: capturing, processing, and reviewing.
In the capturing phase, an investigator records the scene and all objects of interest it contains using various media including photos, videos, and speech. The capturing is a complex process in which the investigator performs several actions to record different aspects of the scene. In particular, the investigator records the overall scene to get an overview, walks around to search for objects of interest, and then examines those objects in detail. Together these actions form the events of the capturing process. In the processing phase, the system analyzes all data to get information about the scene, the objects, as well as the events. Later, in the reviewing phase, an investigator uses the collected information to perform various tasks: assessing the evidence, getting an overview of the case, measuring specific scene characteristics, or evaluating hypotheses.
In common investigation practice, experts take photos of the scene and the objects they find important and add hand-written notes to them. This standard way of recording does not provide a sufficient basis for processing and reviewing for a number of reasons. A collection of photos cannot give a good overview of the scene. Thus, it is hard to understand the relation between objects. In some cases, investigators use the pictures to create a panorama to get a better overview [4]. But since the viewpoint is fixed for each panorama, it does not give a good spatial impression. Measuring characteristics of the scene is also not easy with photos if the investigator has not planned for it in advance. More complicated tasks, like making a hypothesis on how the suspect moved, are very difficult to perform using a collection of photos due to the lack of a sense of space. Finally, a collection of photos and notes hardly captures investigation events, which are important for understanding the investigation process itself.
The scene can also be captured with the aim to create 3D models. 3D models make discussion easier, hypothesis assessment more accurate, and court presentation much clearer [11,14]. When 3D models are combined with video logs, they could enhance scene investigation further.
Using a video log to capture the investigation is straightforward in the capturing phase. Instead of taking photos and notes, investigators film the scene with a camera. All moves and observations of investigators are thus recorded in video logs. However, in order to reap the benefits, it is crucial to have a method to extract information and events from the logs for reviewing (Fig. 1). For example, a 3D model helps visualize the spatial relation between events, while details of certain parts of the scene can be checked by reviewing events captured in these parts. Together, a 3D model of the scene and log events form what we call a 3D event log of the case. Such a log enables event-based navigation in 3D, and other applications such as knowledge mining of expert moves, or finding correlations among cases. The question is how to get the required information and how to combine it.
3D models of a scene can be built in various ways [24,28,30,35]. In this work, we use 3D models reconstructed using a semi-interactive method that builds the model from panoramas [6]. These models are constructed prior to the work presented in this paper. Here we focus on analyzing an investigation log to find events and connecting them to the 3D model.

Fig. 1 An investigation log is a series of investigation events (like (a) taking an overview, (b) searching, (c) getting details, or (d) examining) within a scene. When reviewing the investigation, it is helpful if an analysis can point out which events happened and where in the scene they took place
Analyzing an investigation log is different from regular video analysis. The common target in video analysis is to determine the content of a shot, while in investigation log analysis we already know the content (the scene) and the purpose (investigation). The investigation events we want to detect arise from both the content and the intentions of the cameraman. For example, when an investigator captures an object of interest she will walk around the object and zoom in on it. While the general context and purpose of the log are known, these events are not easy to recognize, as the mapping from intention to scene movement is not well defined and to some extent depends on the investigator. To identify events we need features that consider both content and camera movements.
Once we have the events identified in the logs, we need to map them to the 3D model of the scene. If the data were high-quality imagery, accurate matching would be possible [19,31]. Video frames of investigation logs, however, are not optimal for matching, as they suffer from intensity noise and motion blur. This hinders the performance of familiar automatic matching methods. Different mapping approaches, capable of dealing with lower-quality data, need to be considered.
In the next section we review the related work. Then the two main components of our system are presented in subsequent sections: (i) analyzing an investigation log to segment it into events, and (ii) mapping those events to a 3D model for reviewing (Fig. 2). In order to segment an investigation log into events, we introduce in Section 3 our novel features to classify frames into classes of events. This turns a log into a story of the investigation, making it more accessible in the time dimension. Section 4 presents our semi-interactive approach to map events to a reconstructed 3D model. Together with the log segmentation step, this builds a 3D event log containing investigation events and their spatial and temporal relations.
Fig. 2 Overview of the framework to build a 3D event log of an investigation: automatic segmentation turns the video log into investigation events, which are semi-interactively matched to the 3D reconstructed model, yielding the 3D event log
Section 5 evaluates the results of the proposed solution at the two analysis steps: segmenting a log into events, and mapping the events to a 3D model. Finally, in Section 6 we present our interface, which allows for navigating the 3D events.
2 Related work
2.1 Video analysis and segmentation
Video analysis often starts with segmenting a video into units for easier management and processing. The commonly used unit is a shot, a series of frames captured by a camera in an uninterrupted period of time. Shot boundaries can be detected quite reliably, e.g. based on motion [23]. After shot segmentation, various low-level features and machine learning techniques are used to get more high-level information, such as whether a specific concept is present [32]. Instead of content descriptors, we want to get information on the movements and actions of the investigator. Thus attention and intention are two important aspects. Attention analysis tries to capture the passive reaction of the viewer. To that end, the attention model defines what elements in the video are most likely to get the viewer's attention. Many works are based on the visual saliency of regions in frames, which are then used as criteria to select key frames [1,20,22]. When analyzing video logs, intention analysis tries to find the motivation of the cameraman while capturing the scene. This information leads to one more browsing dimension [21], or another way of summarization [1].
Tasks such as summarization are very difficult to tackle as a general problem. Indeed, existing systems have been built to handle data in specific domains, such as news [3] and sports [27]. In all examples mentioned, we see that domain-specific summarization methods perform better than generic schemes. New domains bring new demands. Social networking video sites, like YouTube, urge for research on the analysis of user generated videos (UGV) [1] as well as life logs [2], a significant sub-class of UGVs. Indeed, we have seen research in both hardware [2,7,10] and algorithms [8,9] to meet that need. Research on summarizing investigation logs is very limited.

Domains dictate requirements and limit the techniques applicable for analysis. For example, life logs, and in general many UGVs, are one-shot. This means that the familiar unit of video analysis (the shot) is no longer suitable, and new units as well as new segmentation methods must be developed [1]. The quality of those videos is also lower than that of professionally produced videos. Unstable motion and varying types of scenes violate the common assumptions on the motion model. A more difficult issue is that those videos are less structured, making it harder to analyze contextual information. In this work we consider professional logs of scene investigation, which share many challenges with general UGVs but also have their own domain-specific characteristics.
2.2 Video navigation
The simplest video navigation scheme, as seen on every DVD, divides a video into tracks and presents them with a representative frame and a description. If we apply multimedia analysis and know more about the purpose of navigation, there are many alternative ways to navigate a video. For example, in [16] the track and representative frame scheme is enhanced using an interactive mosaic as a customized interface. The method to create the video mosaic takes into account various features, including color distribution, existence of human faces, and time, to select and pack key frames into a mosaic template. Apart from the familiar time dimension, we can also navigate in space. Tour into video [15] shows the potential of spatial navigation in video by decomposing an object into different depth layers, allowing users to watch the video from new perspectives. The navigation can be object-based or frame-based. In [12], object tracking enables an object-based video navigation scheme. For example, users can navigate video by dragging an object from one frame to a new location. The system then automatically navigates to the frame in which the object location is closest to that expectation.

What the novel navigation schemes described above have in common is that they depend on video analysis to get the information required for navigation. That is also the way we approach the problem. We first analyze video logs and then use the result as a basis for navigation in a 3D model.
3 Analyzing investigation logs
In this section we first discuss investigation events and their characteristics. Based on that we motivate our solution for segmenting investigation logs, which is described subsequently.

3.1 Investigation events
Watching logs produced by professionals (policemen), we identify four types of events: search, overview, detail, and examination. In a search segment investigators look around the scene for interesting objects. An overview segment is taken with the intention to capture spatial relations between objects and to position oneself in the room. In a detail segment the investigator is interested in a specific object, e.g. an important trace, and moves closer or zooms in to capture it. Finally, in examination segments, investigators carefully look at every side of an important object. The different situations lead to four different types of segments in an investigation log. As a basis for video navigation, our aim is to automatically segment an investigation log into these four classes of events.
There are several clues for segmentation, namely structure and motion, visual content, and voice. Voice is an accurate clue; however, as users usually add voice notes at a few important points only, it does not cover all the frames of the video. Since in investigation the objects of interest vary greatly and are unpredictable, both in type and appearance, the visual content approach is infeasible. So understanding the movement of the cameraman is the most reliable clue. The class of events can be predicted by studying the trajectory of the cameraman's movement and his relative position to the objects. In computer vision terms this represents the structure of the scene and the motion of the camera. We observe that the four types of events have different structure and motion patterns. For example, an overview has moderate pan and tilt camera motion and the camera is far from the objects. Table 1 summarizes the different types of events and the characteristics of their motion patterns.

Though the description in Table 1 looks simple, performing the log segmentation is not. Some terms, such as "go around objects", are at a conceptual level. These are not considered in standard camera motion analysis, which usually classifies video motion into pan, tilt, and zoom. Also, it is not just camera motion. For example, the term "close" implies that we also need features representing the structure (depth) of the scene. Thus, to segment investigation logs, we need features containing both camera motion and structure information.
Table 1 Types of investigation events and characteristics of their motion patterns
3.2 Segmentation using structure-motion features
As discussed, in order to find investigation events, we need features capturing patterns of motion and structure. We propose such features below. These features, employed in a three-step framework (Fig. 3), help to segment a log into investigation events despite varying content.
3.2.1 Extracting structure and motion features
From the definition of the investigation event classes, it follows that to segment a log into these classes, we need both structure and motion information.

While it is possible to estimate the structure and motion even from an uncalibrated sequence [24], that approach is not robust and efficient enough for freely captured investigation videos. To come to a solution, we look at geometric models capturing structure and motion. We need models striking a balance between simple models, which are robust to noise, and detailed models, which capture the full geometry and motion but are easily affected by noise.
In our case of investigation, the scene can be assumed static. The most general model capturing structure and motion in this case is the fundamental matrix [13]. In practice, many applications, taking advantage of domain knowledge, use more specific models. Table 2 shows those models from the most general to the most specific. In studio shots where cameras are mounted on tripods, i.e. no translation is present, the structure and motion are well captured by the homography model [13]. If the motion between two consecutive frames is small, it can be approximated by the affine model. This fact is well exploited in video analysis [26]. When the only information required is whether a shot is a pan, tilt, or zoom, the three-parameter model is enough [17,23]. In that case, the structure of the scene, i.e. the variety in 3D depth, is ignored.

We base our method on the models in Table 2. We first find the correspondences between frames to derive the motion and structure information. How well those correspondences fit the above models tells us something about the structure of the scene as well as the camera motion. Such a measurement is called an information criterion (IC) [34].

An IC measures the likelihood of a model being the correct one, taking into account both fitting errors and model complexity. The lower the IC value, the more likely the model is correct. Vice versa, the higher the value, the more likely the structure and motion possess properties not captured by the model. A series of IC values computed on the four models represented in Table 2 characterizes the scene and the camera motion within the scene. Based on those IC values we can build features capturing structure and motion information in video. Figure 4 summarizes the proposed features and their meaning, derived from Table 2.
Fig. 3 Substeps to automatically segment an investigation log into events: extracting structure and motion features, classifying frames, and merging labeled frames
Table 2 Models commonly used in video analysis, their degrees of freedom (in parentheses), and the structure and motion conditions under which they hold

Homography              H_P (8)   Flat scene, or no translation in motion
Affine model            H_A (6)   Far flat scene
Similarity model        H_S (4)   Far flat scene, image plane parallel to scene plane
Three-parameter model   H_R (3)   Same as H_S, no rotation around the principal ray

The degrees of freedom are required to compute the proposed features
The IC we use here is the Geometric Robust Information Criterion (GRIC) [34]. GRIC, as reflected in its name, is robust against outliers. It has been successfully used in 3D reconstruction from images [25]. The main purpose of GRIC is to find the least complex model capable of describing the data.

To introduce GRIC let us first define some parameters. Let d denote the dimension of the model; r the input dimension; k the model's degrees of freedom; and E = [e_1, e_2, ..., e_n] the set of residuals resulting from fitting n corresponding data points to the model. The GRIC is now formulated as:

$$g(d, r, k, E) = \sum_{e_i \in E} \rho(e_i^2) + \lambda_2\, d\, n + \lambda_3\, k, \qquad \rho(e_i^2) = \min\!\left(\frac{e_i^2}{\sigma^2},\ \lambda_1 (r - d)\right) \tag{1}$$

where σ is the standard deviation of the residuals.

The left term of (1), derived from the fitting residuals, is the model fitting error. The minimum function used in ρ is meant to threshold outliers. The right term, consisting of model parameters, is the model complexity. λ_1, λ_2, and λ_3 are parameters steering the influence of the fitting error and the model complexity on the criterion. Their suggested values are 2, log(r), and log(rn) respectively [34].
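For concreteness, here is a minimal sketch of (1) in Python, using the suggested parameter values; the function name and interface are ours, not from [34]:

```python
import numpy as np

def gric(residuals, sigma, d, r, k):
    """GRIC of one model fit, following (1) with the suggested
    values lambda1 = 2, lambda2 = log(r), lambda3 = log(r*n)."""
    e = np.asarray(residuals, dtype=float)
    n = len(e)
    lam1, lam2, lam3 = 2.0, np.log(r), np.log(r * n)
    # Fitting-error term: residuals are clipped at lam1*(r - d) so that
    # outliers cannot dominate the criterion (the "robust" part of GRIC).
    rho = np.minimum(e**2 / sigma**2, lam1 * (r - d))
    # Complexity term: penalizes structure (d*n) and model (k) parameters.
    return rho.sum() + lam2 * d * n + lam3 * k
```

The lower g, the better the model explains the correspondences given its complexity.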
In our case, we consider a two-dimensional problem, i.e. d = 2, and the dimension of the input data is r = 4 (two 2D points). The degrees of freedom k for the models are given in Table 2, and n is the number of correspondences. The GRIC equation, adding explicitly the dependence on the models in Table 2, is simplified to:

$$g_H(k, E) = \sum_{e_i \in E} \min\!\left(\frac{e_i^2}{\sigma^2},\ 4\right) + 2n \log 4 + k \log(4n) \tag{2}$$

In order to make the criteria comparable over frames, the number of correspondences n must be the same. To enforce this condition, we compute motion fields by computing correspondences on a fixed sampling grid. As mentioned, GRIC is robust against outliers, so the outliers often present in motion fields should not be a problem. For a pair of consecutive frames, we compute the GRIC for each of the four models listed in Table 2 and Fig. 4. For example, c_P = g_H(8, E), given that E is the set of residuals of fitting correspondences to the H_P model.

Fig. 4 The proposed structure and motion features: c_P, c_A, c_S, and c_R are the GRIC values of the models H_P (flat scene, or no translation), H_A (far flat scene), H_S (far flat scene, image plane parallel to scene plane), and H_R (same as H_S, no rotation around the principal ray); c_R additionally captures camera rotation around the principal ray
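The four criteria for one frame pair could be computed along the following lines, reusing the gric sketch above. This is an assumption-laden illustration: OpenCV's robust estimators stand in for whatever fitting the original implementation used, OpenCV has no three-parameter (dilation plus translation) estimator so H_R is fitted by linear least squares, and error handling is omitted:

```python
import cv2
import numpy as np

def residuals(src, dst, H):
    """Transfer error of a 3x3 mapping on point correspondences."""
    proj = cv2.perspectiveTransform(src.reshape(-1, 1, 2), H).reshape(-1, 2)
    return np.linalg.norm(proj - dst, axis=1)

def fit_hr(src, dst):
    """H_R: dilation o plus translation (x, y), by linear least squares."""
    A = np.zeros((2 * len(src), 3))
    A[0::2, 0], A[0::2, 1] = src[:, 0], 1.0   # x' = o*x + tx
    A[1::2, 0], A[1::2, 2] = src[:, 1], 1.0   # y' = o*y + ty
    (o, tx, ty), *_ = np.linalg.lstsq(A, dst.reshape(-1), rcond=None)
    return o, tx, ty, np.array([[o, 0, tx], [0, o, ty], [0, 0, 1.0]])

def frame_pair_features(src, dst, sigma=1.0):
    """o, x, y and the GRIC values c_P, c_A, c_S, c_R for one frame pair.
    src, dst: float32 (n, 2) correspondences from the fixed sampling grid."""
    lift = lambda M: np.vstack([M, [0.0, 0.0, 1.0]])        # 2x3 -> 3x3
    Hp, _ = cv2.findHomography(src, dst, cv2.RANSAC)        # H_P, k = 8
    Ha, _ = cv2.estimateAffine2D(src, dst)                  # H_A, k = 6
    Hs, _ = cv2.estimateAffinePartial2D(src, dst)           # H_S, k = 4
    o, x, y, Hr = fit_hr(src, dst)                          # H_R, k = 3
    cP = gric(residuals(src, dst, Hp), sigma, d=2, r=4, k=8)
    cA = gric(residuals(src, dst, lift(Ha)), sigma, d=2, r=4, k=6)
    cS = gric(residuals(src, dst, lift(Hs)), sigma, d=2, r=4, k=4)
    cR = gric(residuals(src, dst, Hr), sigma, d=2, r=4, k=3)
    return o, x, y, cP, cA, cS, cR
```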
Our features include estimations of the three 2D frame motion parameters of the H_R model, and the GRIC values of the four motion models (Fig. 4). The frame motion parameters (namely the dilation factor o, the horizontal movement x, and the vertical movement y) have been used in video analysis before [17,23], e.g. to recognize detail segments [17]. We consider them as the baseline features. Our proposed measurements (c_P, c_A, c_S, and c_R) add more 3D structure and motion information to those baseline features. To make it robust against noisy measurements and to capture the trend in the structure and the motion, we use the mean and variance of the criteria/parameters over a window of frames. This yields a 14-element feature vector for each frame:

$$F = \left[\bar{o}, \bar{x}, \bar{y}, \bar{c}_P, \bar{c}_A, \bar{c}_S, \bar{c}_R, \tilde{o}, \tilde{x}, \tilde{y}, \tilde{c}_P, \tilde{c}_A, \tilde{c}_S, \tilde{c}_R\right] \tag{3}$$

where $\bar{\cdot}$ is the mean and $\tilde{\cdot}$ is the variance of the value over the feature window w_f.
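As a small illustration of (3), assuming the feature window is centered on each frame (the alignment of the window is not specified above):

```python
import numpy as np

def build_feature_vectors(per_frame, wf):
    """per_frame: (N, 7) array of [o, x, y, c_P, c_A, c_S, c_R] per frame;
    returns the (N, 14) feature vectors of (3) over a window of size wf."""
    N, h = len(per_frame), wf // 2
    F = np.zeros((N, 14))
    for t in range(N):
        win = per_frame[max(0, t - h):t + h + 1]   # truncated at the ends
        F[t, :7] = win.mean(axis=0)                # means over the window
        F[t, 7:] = win.var(axis=0)                 # variances over the window
    return F
```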
In summary, computing the features includes the following steps:

1. Compute optical flow throughout the video and sample it using a fixed grid to obtain correspondences between consecutive frames.
2. For each pair of consecutive frames, fit the four models of Table 2 to the correspondences and compute the GRIC values c_P, c_A, c_S, and c_R, together with the H_R motion parameters o, x, and y.
3. Compute the mean and variance of these values over the feature window w_f to form the feature vector F of (3).

3.2.2 Classifying frames

Given the features, we classify each frame into one of the four event classes, or into an "others" class, containing frames that do not have a clear intention.
Since logs are captured by handheld or head-mounted cameras, the motion in the logs is unstable. Consequently, the input features are noisy. It is even hard for humans to classify every class correctly. While the detail class is quite recognizable from its zooming motion, it is hard to distinguish the search and examination classes. Therefore, we expect that the boundary between classes is not well defined by traditional motion features. While the proposed features are expected to distinguish the classes, we do not know which features are best at recognizing them. Feature selection is needed here. Two popular choices with implicit feature selection are support vector machines and random forest classifiers. We have carried out experiments with both, and the random forest classifier gave better results. Hence, in Section 5, we only present results with the random forest classifier.
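A minimal sketch of the labeling step with scikit-learn's random forest; the class list follows Section 3.1, but the hyperparameters are illustrative, not the values used in our experiments:

```python
from sklearn.ensemble import RandomForestClassifier

CLASSES = ["search", "overview", "detail", "examination", "others"]

def train_frame_classifier(train_features, train_labels):
    """train_features: (N, 14) vectors as in (3);
    train_labels: one of CLASSES per frame, from annotated logs."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(train_features, train_labels)
    # clf.feature_importances_ gives one weight per element of F,
    # providing the implicit feature selection mentioned above.
    return clf
```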
3.2.3 Merging labeled frames
As mentioned, the captured motion is unstable and the input data for classification is dirty. We thus expect many frames of other classes to be misclassified as search frames. To improve the result of the labeling step, we first apply a voting technique over a window of frames of length w_v, the voting window. Within the voting window, each type of label is counted. Then the center frame is relabeled using the label with the highest vote count. Finally, we group consecutive frames having the same label into events.
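Both steps fit in a few lines; in this sketch, ties within the voting window are broken arbitrarily:

```python
from collections import Counter

def merge_labeled_frames(labels, wv):
    """Majority-vote relabeling over a window of length wv, then grouping
    consecutive identical labels into (label, first, last) events."""
    n, h = len(labels), wv // 2
    voted = [Counter(labels[max(0, i - h):i + h + 1]).most_common(1)[0][0]
             for i in range(n)]
    events, start = [], 0
    for i in range(1, n + 1):
        if i == n or voted[i] != voted[start]:
            events.append((voted[start], start, i - 1))
            start = i
    return events
```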
4 Mapping investigation events to a 3D model
The video is now represented in the time dimension as a series of events. In this section, we present the method to enhance the comprehensiveness of the investigation in the space dimension by connecting events to a 3D model of the scene. In this way we enable interaction with a log in 3D.

For each type of event, we need one or more representative frames. These frames give the user a hint which part of the scene is covered by an event, as well as a rough indication of the camera motion. Overview and detail events are represented by the middle frames of the events. For this we make the assumption that the middle frame of an overview or a detail event is close to the average pose of all the frames in the event. Another way to create a representative frame for an overview event would be a large virtual frame, a panorama, that approximately covers the space captured by that overview. It is, however, costly and not always feasible, and thus is not implemented in this work. As one frame is not sufficient to represent the motion of search and examination events, these are represented by three frames: the first, the middle, and the last. To visualize the events in space, we have to match those representative frames to the 3D model, as sketched below.
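This selection rule is simple enough to state as code (event labels as in Section 3.1):

```python
def representative_frames(label, first, last):
    """Representative frame indices for an event spanning [first, last]."""
    mid = (first + last) // 2
    if label in ("overview", "detail"):
        return [mid]                 # one frame near the average pose
    return [first, mid, last]        # search and examination events
```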
Matching frames is a non-trivial operation. Logs are captured at varying locations in the scene and with different poses, and video frames are not as clear as high-resolution images. Also, the number of images calibrated to the 3D model is limited. Thus we expect that some representative frames may be poorly matched or cannot be matched at all. To overcome those problems, we propose a semi-interactive solution containing two steps (Fig. 5): (i) automatically map as many representative frames as possible to the 3D model, and then (ii) let users interactively adjust the predicted camera poses of the remaining representative frames.

4.1 Automatic mapping of events
Since our 3D model is built using an image-based method [6], the frame-to-model mapping is formulated as image-to-image matching. Note that color laser scanners also use images, calibrated to the scanning points, to capture color information, so our solution is also applicable to laser-scanning-based systems.

Let I denote the set of images from which the model is built, or more generally a set which is calibrated to the 3D model. Matching a representative frame i to one of the images in I enables us to recover the camera pose. To do the matching, we use the well-known SIFT detector and descriptor [19]. First, SIFT keypoints and descriptors are computed for representative frame i and every image in I. Keypoints of frame i are initially matched to keypoints of every image in I based on comparing descriptors [19] only. Correctly matched keypoints are then found by robustly estimating the geometric constraints between the two images [13]. When doing so, there might be more than one image in I matched to frame i.
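A sketch of this matching step with OpenCV; the ratio-test threshold and minimum match count are illustrative choices, and the fundamental matrix stands in for the geometric constraints of [13]:

```python
import cv2
import numpy as np

def match_frame_to_image(frame_gray, image_gray, ratio=0.75, min_matches=8):
    """Match a representative frame to one calibrated image in I.
    Returns the geometrically verified matches (possibly empty)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(frame_gray, None)
    kp2, des2 = sift.detectAndCompute(image_gray, None)
    # Descriptor matching with Lowe's ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < ratio * n.distance]
    if len(good) < min_matches:
        return []   # leave this frame for the interactive step
    src = np.float32([kp1[m.queryIdx].pt for m in good])
    dst = np.float32([kp2[m.trainIdx].pt for m in good])
    # Robust geometric verification: keep inliers of the epipolar geometry.
    _, mask = cv2.findFundamentalMat(src, dst, cv2.FM_RANSAC)
    if mask is None:
        return []
    return [m for m, ok in zip(good, mask.ravel()) if ok]
```

Running this against every image in I and keeping the images with the most verified matches gives the candidates from which the camera pose of frame i is recovered.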