Dynamic Vision for Perception and Control of Motion – Ernst D. Dickmanns (Part 14)


11.3 Detecting and Tracking Moving Obstacles on Roads

… time. Most of these activities are based on two different types of radar (long- and short-range, different frequencies) and on various types of laser range finders (LRF). Multiple planes in LRF with both scanning and multiple-beam designs are under consideration. Typical angular resolutions for modern LRF designs go down to about 0.1° (≈ 2 mrad). This means that at 50 m distance, the resolution is about 10 cm, a reasonable value for slow speeds driven. If interference problems with active sensing can be excluded, these modern LRF sensors just being developed and tested may be sufficient to solve the problem of obstacle recognition.
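The quoted figure is just the small-angle relation between range and angular resolution; checking with the values from the text:

\[ s \approx r \,\Delta\theta = 50\,\mathrm{m} \times 2\,\mathrm{mrad} = 0.10\,\mathrm{m}. \]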

However, human vision and multifocal, active technical vision can easily exploit ten times this resolution with systems available today. It will be interesting to observe which type of technical vision system will win the race for industrial implementation in the long run. In the past and still now, the computing power and knowledge bases needed for reliable visual perception of complex scenes have been marginal.

11.3.5 Outlook on Object Recognition

With several orders of magnitude in computing power per processor becoming available in the next one or two decades (as in the past according to “Moore’s law”), the prospects are bright for high-resolution vision as developed by vertebrates. Multifocal eyes and special “glasses”, under favorable atmospheric conditions, will allow passive viewing ranges up to several kilometers. High optical resolution in connection with “passive” perception of colors and textures will allow understanding of complex scenes much more easily than with devices relying on electromagnetic radiation sent out and reflected at far distances.

Generations of researchers and students will compile and structure the knowledge base needed for passive vision based on spatiotemporal models of motion processes in the world. Probably, other physical properties of light, like direction of polarization or other spectral ranges, may become available to technical vision systems, as for some animal species. This would favor passive vision in the sense of no active emission of rays by the sensor. Active gaze control is considered a “must” for certain (if not most) application areas. Near (NIR) or far infrared radiation are such fields of practical importance for night vision and night driving.

In the approach developed, bifocal vision has become the standard for low to medium speeds; differences in focal length from three to about ten have been investigated. It seems that trifocal vision with focal lengths separated by a factor of 3 to 5 is a good way to go for fast driving on highways. If an object has been detected in a wide-angle image and is too small for reliable recognition, attention focusing by turning the camera with a larger focal length onto the object will yield the improved resolution required. Special knowledge-based algorithms (rules and inference schemes) are required for recognizing the type of object discovered. These object recognition specialists may work at lower cycle times and analyze shape details, while relative motion estimation may continue to be done in parallel at high frequency with low spatial resolution, exploiting the “encasing box” model. This corresponds to two separate paths to the solution of the “where” problem and of the “what” problem.


Systematic simultaneous interpretation of image sequences on different pyramid levels of images has not been achieved in our group up to now, though data processing for correlation uses this approach successfully, e.g., [Burt 1981; Mandelbaum et al. 1998]. This approach may be promising for robust blob and corner tracking and for spatiotemporal interpretation in complex scenes.

For object detection in the wide-angle camera, characteristic features of minimum size are required. Ten to twenty pixels on an object seem to be a good compromise between efficiency and accuracy to start with. Control of gaze and attention can then turn the high-resolution camera to this region, yielding one to two orders of magnitude more pixels on this object depending on the ratio of focal lengths in use. This is especially important for objects with large relative speed, such as vehicles in the opposite traffic direction on bidirectional high-speed roads. Another point needing special attention is the discovery of perturbations: Sudden disappearance of features predicted to be very visible usually is an indication of occlusion by another object. If this occurs for several neighboring features at the same time, this is a good hint to start looking for another object which has newly appeared at a shorter range. It has to be moving in the opposite direction relative to the side where the features started disappearing. If just one feature has not been measured once, this may be due to noise effects. If measurements fail to be successful at one location over several cycles, there may be some systematic discrepancy between model and reality, and therefore this region has to be scrutinized by allocating more attention to it (more and different feature extractors for discovering the reason). This will be done with a new estimation process (new object hypothesis) so that tracking and state estimation of the known object is not hampered. First results of systematic investigations for situations with occlusions were obtained in the late 1980s by M. Schmid and are documented in [Schmid 1992]. This area needs further attention for the general case.
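The decision logic just described can be summarized compactly. The following is a sketch with hypothetical names and thresholds, not code from the system; the neighborhood test is reduced to a simple count for brevity:

    def classify_feature_loss(lost_now: int, max_consecutive_misses: int,
                              min_group: int = 3, max_cycles: int = 4) -> str:
        """Interpret missing feature measurements for one tracked object.

        lost_now: number of (neighboring) predicted features not found
                  in the current cycle
        max_consecutive_misses: longest run of cycles any single feature
                  has gone unmeasured
        """
        if lost_now >= min_group:
            # several features vanished at once: likely occlusion by an
            # object newly appeared at shorter range -> new object hypothesis
            return "hypothesize_occluding_object"
        if max_consecutive_misses > max_cycles:
            # persistent failure at one location: systematic discrepancy
            # between model and reality -> allocate more attention there
            return "scrutinize_region"
        return "noise"  # a single missed measurement may just be noise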


12 Sensor Requirements for Road Scenes

In the previous chapters, it has been shown that vision systems have to satisfy certain lower bounds on requirements to cover all aspects of interest for safe driving if they shall come close to human visual capabilities on all types of roadways in a complex network of roads existing in civilized countries.

Based on experience in equipping seven autonomous road vehicles with dynamic machine vision systems, an arrangement of miniature TV cameras on a pointing platform was proposed in the mid-1990s which will satisfy all major requirements for driving on all types of roads. It has been dubbed the multifocal, active, reflex-like reacting vehicle eye (MarVEye). It encompasses the following properties:

1. large binocular horizontal field of view (e.g., ≥ 110°),
2. bifocal or multifocal design for region analysis with different resolution in parallel,
3. ability of view fixation on moving objects while the platform base also is moving; this includes high-frequency, inertial stabilization (f > 200 Hz),
4. saccadic control of region of interest with stabilization of spatial perception in the interpretation algorithms,
5. capability of binocular (trinocular) stereovision in the near range (stereo base similar to the human one, which is 6 – 7 cm),
6. large potential field of view in horizontal range (e.g., 200° with sufficient resolution) such that the two eyes for the front and the rear hemisphere can cover the full azimuth range (360°); stereovision to the side with a large stereo base becomes an option (longitudinal distance between the “vehicle eye” looking forward and backward, both panned by ~ 90° to the same side),
7. high dynamic performance (e.g., a saccade of ≈ 20° in a tenth of a second).

In cars, the typical dimension of this “vehicle eye” should not be larger than about 10 cm; two of these units are proposed for road vehicles, one looking forward, located in front of the inner rearview mirror (similar to Figure 1.3), the other one backward; they shall feed a 4-D perception system capable of assessing the situation around the vehicle by attention control up to several hundred meters in range. This specification is based on experience from over 5000 km of fully autonomous driving of both partners (Daimler-Benz and UniBwM) in normal traffic on German and French freeways as well as state and country roads since 1992. A human safety pilot – attentively watching and registering vehicle behavior but otherwise passive – was always in the driver’s seat, and at least one of the developing engineers (Ph.D. students with experience) checked the interpretations of the vision system on computer displays.


Based on this rich experience in combination with results from aeronautical applications (onboard autonomous visual landing approaches till touchdown with the same underlying 4-D approach), the design of MarVEye resulted. This chapter first discusses the requirements underlying the solution proposed; then the basic design is presented and sensible design parameters are discussed. Finally, steps towards first realizations are reviewed. Most experimental results are given in Chapter 14.

12.1 Structural Decomposition of the Vision Task

The performance level of the human eye has to be the reference, since most of the competing vision systems in road vehicle guidance will be human ones. The design of cars and other vehicles is oriented toward pleasing human users, but also toward exploiting their capabilities, for example, look-ahead range, reaction times, and fast dynamic scene understanding.

12.1.1 Hardware Base

The first design decision may answer the following: Is the human eye with its characteristics also a good guideline for designing technical imaging sensors, or are the material substrates and the data processing techniques so different that completely new ways for solving the vision task should be sought? The human eye contains about 120 million light-sensitive elements, but two orders of magnitude fewer fibers run from one eye to the brain, separated for the left and the right halves of the field of view. The sensitive elements are not homogeneously distributed in the eye; the fovea is much more densely packed with sensor elements than the rest of the retina. The fibers running via the “lateral geniculate” (an older cerebral structure) to the neocortex in the back of the head obtain their signals from “receptive fields” of different types and sizes depending on their location in the retina; so preprocessing for feature extraction is already performed in retinal layers [Handbook of Physiology: Darian-Smith 1984].

Technical imaging sensors with some of the properties observed in biological vision have been tried [Debusschere et al. 1990; Koch 1995], but have not gained ground. Homogeneous matrix arrangements over a very wide range of sizes are state of the art in microelectronic technology; the video standard for a long time has been about 640 × 480 ≈ 307,000 pixels; with 1 byte/pixel resolution and a 25 Hz frame rate, this results in a data rate of ≈ 7.7 MB/s. (Old analogue technology could be digitized to about 770 × 510 pixels, corresponding to a data rate of about 10 MB/s.) Future high-definition TV intends to move up to 1920 × 1200 pixels with more than 8-bit intensity coding and a 75 Hz image rate; data rates in the gigabit/second range will be possible. In the beginning of real-time machine vision (mid-1980s), there was much discussion whether there should be preprocessing steps near the imaging sensors as in biological vision systems; “massively parallel processors” with hundreds of thousands of simple computing elements have been proposed (DARPA: “On Strategic Computing”) [Klass 1985; Roland, Shiman 2002].
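These rates follow directly from pixels per frame, bytes per pixel, and frame rate; checking with the numbers given (the HDTV figure assumes at least 1 byte per pixel, as implied by the more-than-8-bit coding):

\[ 640 \times 480 \times 1\,\mathrm{B} \times 25\,\mathrm{Hz} \approx 7.7\,\mathrm{MB/s}, \qquad 1920 \times 1200 \times 1\,\mathrm{B} \times 75\,\mathrm{Hz} \approx 173\,\mathrm{MB/s} \approx 1.4\,\mathrm{Gbit/s}. \]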


With the fast advancement of general-purpose microprocessors (clock rates moving from MHz to GHz) and communication bandwidths (from MB/s to hundreds of MB/s), the need for mimicking carbon-based data processing structures (as in biology) disappeared for silicon-based technical systems.

With the advent of high-bandwidth communication networks between multiple general-purpose processors in the 1990s, high-performance, real-time vision systems became possible without special developments for vision except frame grabbers. The move to digital cameras simplified this step considerably. To develop methods and software for real-time vision, relatively inexpensive systems are sufficient. The lower-end video cameras cost a few dollars nowadays, but reasonably good cameras for automotive applications with increased dynamic intensity range have also come down in price and do have advantages over the cheap devices. For later applications with much more emphasis on reliability in harsh environments, special “vision hardware” on different levels may be advantageous.

12.1.2 Functional Structure

Contrary to the hardware base, the functional processing steps selected in biological evolution have shown big advantages: (1) Gaze control with small units having low inertia is superior to turning the whole body. (2) Peripheral-foveal differentiation allows reducing maximal data rates by orders of magnitude without sacrificing much of the basic transducer-based perception capabilities if time delays due to saccadic gaze control are small. (Eigenfrequencies of eyes are at least one order of magnitude higher than those for control of body movements.) (3) Inertial gaze stabilization by negative feedback of angular rates, independent of image evaluation, reduces motion blur and extends the usability of vision from quasi-static applications for observation to really dynamic performance during perturbed egomotion. (4) The construction of internal representations of 3-D space over time, based on previous experience (models of motion processes for object classes) triggered by visual features and their flow over time, allows stabilizing perception of “the world” despite the very complex data input resulting from saccadic gaze control: Several frames may be completely noninterpretable during saccades.

Note that controllable focal length on one camera is not equivalent to two or more cameras with different focal lengths: In the latter case, the images with different resolution are available in parallel at the same time, so that interpretation can rely on features observed simultaneously on different levels of resolution. On the contrary, changing focal length with a single camera takes time, during which the gaze direction in dynamic vision may have changed. For easy recognition of the same groups of features in images with different resolution, a focal length ratio of three to four experimentally yields the best results; for larger factors, the effort of searching in a high-resolution image becomes excessive.

The basic functional structure developed for dynamic real-time vision has been shown in Figure 5.1. On level 1 (bottom), there are feature extraction algorithms working fully bottom-up without any reference to spatiotemporal models. Features may be associated over time (for feature flow) or between cameras (for stereo interpretation).


On level 2, single objects are hypothesized and tracked by prediction-error feedback; there are parallel data paths for different objects at different ranges, looked at with cameras and lenses of different focal lengths. But the same object may also be observed by two cameras with different focal lengths. Staging focal lengths by a factor of exactly 4 allows easy transformation of image data by pyramid methods.

On all of these levels, physical objects are tracked “here and now”; the results on the object level (with data volume reduced by several orders of magnitude compared to image pixels and features) are stored in the DOB. Using ring buffers for several variables of special interest, their recent time history can be stored for analysis on the third level, which does not need access to image data any longer but looks at objects on larger spatial and temporal scales for recognition of maneuvers and possibly cues for hypothesizing intentions of subjects. Knowledge about a subject’s behavioral capabilities and mission performance need be available only here. The physical state of the subject body and the environmental conditions are also monitored here on the third level. Together they provide the background for judging the quality and trustworthiness of sensor data and interpretations on the lower levels. Therefore, the lower levels may receive inputs for adapting parameters or for controlling gaze and attention. (In the long run, maybe this is the starting point for developing some kind of self-awareness or even consciousness.)

12.2 Vision under Conditions of Perturbations

It is not sufficient to design a vision system for clean conditions and later on take care of steps for dealing with perturbations. In vision, the perturbation levels tolerable have to be taken into account in designing the basic structure of the vision system from the beginning. One essential point is that due to the large data rates and the hierarchical processing steps, the interpretation result for complex scenes becomes available only after a few hundred milliseconds of delay time. For high-frequency perturbations, this means that reasonable visual feedback for counteraction is nearly impossible.

12.2.1 Delay Time and High-frequency Perturbation

For a time delay of 300 ms (typical of inattentive humans), the resulting phase shift for an oscillatory 2-Hz motion (typical for arms, legs) is more than 200°; that means that in a simple feedback loop, there is a sign change in the signal (cos(180°) = −1). Only through compensation from higher levels with corresponding methods is this type of motion controllable. In closed-loop technical vision systems onboard a vehicle with several consecutive processing stages, 3 to ≈ 10 video cycles (of 40 or 33 ms duration) may elapse until the control output derived from visual features hits the physical device effecting the command. This is especially true if a perturbation induces motion blur in some images.
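The phase-shift figure follows directly from the delay time and the oscillation frequency; as a check with the values from the text:

\[ \varphi = 360^\circ \cdot f \cdot T_d = 360^\circ \times 2\,\mathrm{Hz} \times 0.3\,\mathrm{s} = 216^\circ > 180^\circ, \]

so the fed-back signal arrives with inverted sign, which is exactly the sign change mentioned above.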


This is the reason that direct angular rate feedback in pitch and yaw from sensors on the same platform as the cameras is used to command the opposite rate for the corresponding platform component. Reductions of perturbation amplitudes by more than a factor of 10 have been achieved with a 2 ms cycle time for this inner loop (500 Hz). Figure 12.1 shows the block diagram containing this loop: Rotational rates around the y- and z-axes of the gaze platform (center left) are directly fed back to the corresponding torque motors of the platform at a rate of 500 Hz if no external commands from active gaze control are received. The other data paths for determining the inertial egostate of the vehicle body in connection with vision will be discussed below. The direct inertial feedback loop of the platform guarantees that the signals from the cameras are freed from motion blur due to perturbations. Without this inertial stabilization loop, visual perception capability would be deteriorated or even lost on rough ground.
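A minimal sketch of this inner loop for one platform axis (the gain and all names are illustrative, not the controller actually used): at 500 Hz, the rate sensed on the camera platform is fed back with negative sign to the torque motor unless active gaze control overrides it.

    # 500 Hz inner-loop sketch: negative feedback of the angular rate
    # measured on the camera platform (pitch or yaw axis).
    K_RATE = 0.8  # feedback gain, illustrative value

    def stabilization_step(measured_rate: float,
                           gaze_rate_cmd: float | None) -> float:
        """Return the rate command sent to the platform torque motor.

        measured_rate: gyro rate sensed on the platform [rad/s]
        gaze_rate_cmd: rate requested by active gaze control, or None
        """
        if gaze_rate_cmd is not None:
            # external gaze command: stabilization must not counteract it
            return gaze_rate_cmd
        # counteract the sensed perturbation by commanding the opposite rate
        return -K_RATE * measured_rate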

If gaze commands are received from the vision system, of course, counteraction by the stabilization loop has to be suppressed. There have to be specific modes available for different types of gaze commands (smooth pursuit or saccades); this will not be treated here. The beneficial effect of gaze stabilization for a braking maneuver with 3° of perturbation amplitude (min to max) in vehicle pitch angle is shown in Figure 12.2.

Figure 12.1 Block diagram for joint visual/inertial data collection (stabilized gaze, center left) and interpretation; the high-frequency component of the rotational ego-state is determined from integration of angular rates (upper center), while long-term stability is derived from visual information (with time delay, see lower center) from objects further away (e.g., the horizon). Gravity direction and ground slope are derived from x- and y-accelerations together with speed measured conventionally.

[Diagram content: camera images pass through frame grabber and feature extraction (with typical time delays of several TV cycles) into estimates of the ego-state (own body), the states of other objects, and the environment (static objects); the three orthogonal angular rates enter with almost no time delay and, combined with the low-frequency (time-delayed) visual estimates and predictions for the inertial measurements, yield the best low-frequency inertial ego-state estimate – 4-D visual/inertial joint data interpretation for dynamic ground vehicle guidance.]


The corresponding reduction in amplitude on the stabilized platform experienced by the cameras is more than a factor of 10. The strong deviation of the platform base from level, which is identical with vehicle body motion and can be seen as the lower curve, is hardly reflected in the motion of the camera sitting on the platform head (upper, almost constant curve).

Figure 12.2 Gaze stabilization in pitch by negative feedback of angular rate for the test vehicle VaMoRs (4-ton van) during a braking maneuver

The most essential state components of the body to be determined by integration of angular rate signals with almost no delay time are the angular orientations of the vehicle. For this purpose, the signals from the inertial rate sensors mounted on the vehicle body are integrated, shown in the upper left of Figure 12.1; the higher frequency components yield especially good estimates of the angular pose of the body. Due to low-frequency drift errors of inertial signals, longer-term stability in orientation has to be derived from visual interpretation of (low-pass-filtered) features of objects further away; in this data path, the time delay of vision does no harm. [It is interesting to note that some physiologists claim that sea-sickness of humans (nausea) occurs when the data from both paths are strongly contradicting.] Joint inertial/visual interpretation also allows disambiguating relative motion when only parts of the subject body and a second moving object are in the fields of view; there have to be accelerations above a certain threshold to be reliable, however.
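In filter terms, this fusion amounts to a complementary filter: the integrated gyro rate supplies the high-frequency part of the orientation, and the delayed but drift-free visual reference supplies the low-frequency part. A minimal sketch under these assumptions follows (gain and names are illustrative, not from the book; the delay compensation indicated in Figure 12.1 by the prediction block is omitted here):

    # Complementary-filter sketch for one orientation angle (e.g., pitch).
    DT = 0.002      # 2 ms inertial cycle (500 Hz)
    ALPHA = 0.999   # close to 1: trust the integrated gyro at high frequencies

    def fuse(theta: float, gyro_rate: float, visual_theta: float | None) -> float:
        """One filter step for the fused orientation estimate theta [rad].

        gyro_rate:    angular rate from the body-mounted sensor [rad/s]
        visual_theta: orientation from visual interpretation of distant
                      features (e.g., the horizon), or None when no fresh
                      (time-delayed) visual result is available this cycle.
        """
        theta += gyro_rate * DT  # high-frequency path: pure integration
        if visual_theta is not None:
            # low-frequency path: slowly pull the drifting integral toward
            # the drift-free visual reference
            theta = ALPHA * theta + (1.0 - ALPHA) * visual_theta
        return theta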

12.2.2 Visual Complexity and the Idea of Gestalt

When objects in the scene have to be recognized in environments with strong visual perturbations, like driving through an alley with many shadow boundaries from branches and twigs, picking “the right” features for detection and tracking is essential. On large objects such as trucks, coarse-scale features averaging away the fine details may serve the purpose of tracking better than fine-grained ones. On cars with polished surfaces, disregarding the upper part and mildly inclined surface elements of the body altogether may be the best way to go; sometimes single highlights or bright spots are good for tracking over some period of time with given aspect conditions. When the aspect or the lighting conditions change drastically, other combinations of features may be well suited for tracking.

This is to say that image evaluation should be quickly adaptable to situations, both with respect to single features extracted and to the knowledge base establishing correspondence between groups of features in the images and the internal representation of 3-D objects moving over time through an environment affecting the lighting conditions. This challenge has hardly been tackled in the past but has to be solved in the future to obtain reliable technical vision systems approaching the performance level of trained humans. The scale of visual features has to be expanded considerably, including color and texture as well as transparency; partial mirroring mixed with transparency will pose demanding challenges.

12.3 Visual Range and Resolution Required for Road Traffic Applications

The human eyes have a simultaneous field of view of more than 180°, with coarse resolution toward the periphery and very high resolution in the foveal central part of about 1 to 2° aperture; in this region, the grating resolution is about 40 to 60 seconds of arc, or about 0.25 mrad. The latter metric is a nice measure for practical applications since it can be interpreted as the length dimension normal to the optical axis per pixel at 1000 times its distance (width in meters at 1 km, in decimeters at 100 m, or in millimeters at 1 m, depending on the problem at hand). Without going into details about the capability of subpixel resolution with sets of properly arranged sensor elements and corresponding data processing, let us take 1 mrad as the human reference value for comparisons.

Both the human eye and head can be turned rapidly to direct the foveal region of the eye onto the object of interest (attention control). Despite the fast and frequent viewing direction changes (saccades), which allocate the valuable high-resolution region of the eye to several objects of interest in a time-slicing multiplex procedure, the world perceived looks stable in a large viewing range. This biological system evolved over millennia under real-world environmental conditions; the technical counterpart to be developed has to face these standards.

It is assumed that the functional design of the biological system is a good starting point for a technical system, too; however, technical realizations have to start from a hardware base (silicon) quite different from biological wetware. Therefore, with the excellent experience from the conventional engineering approach to dynamic machine vision, our development of a technical eye continued on the well-proven base underlying conventional video sensor arrays and the dynamic systems theory of the engineering community.

The seven properties mentioned in the introduction to this chapter are detailed here into precise specifications.

Trang 10

12.3.1 Large Simultaneous Field of View

There are several situations when this is important. First, when starting from a stop, any object or subject within or moving into the area directly ahead of the vehicle should be detectable; this is also a requirement for stop-and-go traffic or for very slow motion in urban areas. A horizontal slice of a complete hemisphere should be covered with gaze changes in azimuth (yaw) of about ± 35°. Second, when looking tangentially to the road (straight ahead) at high speeds, passing vehicles should be detected sufficiently early for prompt reaction when they start moving into the subject lane directly in front. Third, when a lane change or a turnoff is intended, simultaneous observation and tracking of objects straight ahead and about 90° to the side are advantageous; with nominal gaze at 45° and a field of view (f.o.v.) > 100°, this is achievable.

In the nearby range, a resolution of about 5 mm per pixel at 2.5 m, or 2 cm at 10 m distance, is sufficient for recognizing and tracking larger subobjects on vehicles or persons (about 2 mrad/pixel); however, this does not allow reading license plates at 10 m range. With 640 pixels per row, a single standard camera can cover about a 70° horizontal f.o.v. at this resolution (≈ 55° vertically). Mounting two of these (wide-angle) cameras on a platform with optical axes in the same plane but turned in yaw to each side by ψobl ~ 20° (oblique views), a total f.o.v. for both cameras of 110° results; the difference between half the f.o.v. of a single camera and the yaw angle ψobl provides an angular region of central overlap (± 15° in the example). Separating the two cameras laterally generates a base for binocular stereo evaluation (Section 12.3.4).
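The angular bookkeeping behind these numbers, with the values from the text:

\[ \mathrm{f.o.v.}_{\mathrm{total}} = \mathrm{f.o.v.}_{\mathrm{cam}} + 2\,\psi_{obl} = 70^\circ + 2 \times 20^\circ = 110^\circ, \qquad \text{overlap} = \pm\Big(\tfrac{70^\circ}{2} - 20^\circ\Big) = \pm 15^\circ. \]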

The resolution of these cameras is so low that pitch perturbations of about 3° (accelerations/decelerations) shift features by about 5% of the image vertically. This means that these cameras need not be vertically stabilized and do not induce excessively large search ranges; this simplifies platform design considerably. The numerical values given are just examples; they may be adapted to the focal lengths available for the cameras used. Smaller yaw angles ψobl yield a larger stereo f.o.v. and lower distortions from lens design in the central region.

12.3.2 Multifocal Design

The region of interest does not grow with range beyond a certain limit value; for example, in road traffic with lane widths of 2.5 to 4 m, a region of simultaneous interest larger than about 30 to 40 m brings no advantage if good gaze control is available. With 640 pixels per row in standard cameras, this means that a resolution of 4 to 6 cm per pixel can be achieved in this region with proper focal lengths. Considering objects of 10 to 15 cm characteristic length as serious obstacles to be avoided, this resolution is just sufficient for detection under favorable conditions (2 to 3 pixels on this object with sufficient contrast). But what is the range that has to be covered? Table 11.1 contains braking distances as a function of speed driven for three values of deceleration.


About a 240-m look-ahead range should be available for stopping in front of an obstacle from V = 180 km/h (50 m/s) with an average deceleration of 0.6 Earth gravity (g), or from 130 km/h (36 m/s) with an average deceleration of 0.3 g. To be on the safe side, a 250 to 300 m look-ahead range is assumed desirable for high-speed driving. For the region of interest mentioned above, this requires a f.o.v. of 5 to 7°, or about 0.2 mrad resolution per pixel. This is one order of magnitude higher than that for the near range. With the side constraint mentioned for easy feature correspondence in images of different resolution (ratio of focal lengths no larger than 4), this means that a trifocal camera arrangement should be chosen. Figure 12.3 visualizes the geometric relations.

Figure 12.3 Fields of view and viewing ranges for observing a lateral range of 30 m normal to the road

If lane markings of 12 cm width shall be recognizable at 200 m distance, this requires about 2.5 pixels on the line, corresponding to an angular resolution of 0.25 mrad per pixel. For landmark recognition at far distances, this resolution is also desirable. For maximal speeds not exceeding 120 km/h, a bifocal camera system may be sufficient.
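A quick numerical check of these figures (a sketch; the constants are from the text, the helper names are mine):

    G = 9.81  # gravitational acceleration [m/s^2]

    def braking_distance(v_mps: float, decel_g: float) -> float:
        """Idealized braking distance v^2/(2a); reaction time not included."""
        return v_mps ** 2 / (2.0 * decel_g * G)

    def mrad_per_pixel(width_m: float, range_m: float, pixels: float) -> float:
        """Angular resolution putting `pixels` pixels on an object of
        `width_m` meters at `range_m` meters, in mrad per pixel."""
        return width_m / range_m / pixels * 1000.0

    print(braking_distance(50.0, 0.6))       # ~212 m from 180 km/h at 0.6 g
    print(braking_distance(36.0, 0.3))       # ~220 m from 130 km/h at 0.3 g
    print(mrad_per_pixel(0.12, 200.0, 2.5))  # ~0.24 mrad/px for lane markings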

12.3.3 View Fixation

Once gaze control is available in a vision system, it may be used for purposes other than the design goal. Designed initially for increasing the potential f.o.v. or for counteracting perturbations on the vehicle (inertial stabilization), it can in addition be used for gaze fixation onto moving objects to reduce motion blur and to keep one object centered in an image sequence. This is achieved by negative visual feedback of the deviation of the center of characteristic features from the center of the image; horizontal and vertical feature search may be done every second image if computing resources are low. Commanding the next orthogonal search around the column or row containing the last directional center position has shown good tracking properties, even without an object model installed. A second-order tracking model in the image plane may improve performance for smooth motion. However, if harsh directional changes occur in the motion pattern of the object, this approach may deteriorate the level of perturbation tolerable. For example, a ball or another object being reflected at a surface may be lost if delay times for visual interpretation are large and/or filter tuning is set to too strong low-pass filtering. Decreasing cycle time may help considerably: In conventional video with two consecutive fields (half frames), using the fields separately but doubling interpretation frequency from 25 to 50 Hz has brought about a surprising increase in tracking performance when preparing the grasping experiment in orbit onboard the Space Shuttle Columbia [Fagerer et al. 1994].
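A minimal sketch of the fixation law just described (gains and names are illustrative, not the implemented controller): the offset of the tracked feature center from the image center is fed back negatively as a gaze-rate command, with the horizontal and vertical searches alternating from image to image when computing resources are low.

    IMG_W, IMG_H = 640, 480
    K_FIX = 2.0  # gaze rate per unit of normalized image offset, illustrative

    def fixation_command(frame_idx: int, feat_u: float, feat_v: float):
        """Return (yaw_rate, pitch_rate) commands for the gaze platform.

        feat_u, feat_v: pixel coordinates of the center of the tracked
        characteristic features in the current image.
        """
        du = (feat_u - IMG_W / 2) / IMG_W  # normalized horizontal offset
        dv = (feat_v - IMG_H / 2) / IMG_H  # normalized vertical offset
        if frame_idx % 2 == 0:             # this image: horizontal search
            return (-K_FIX * du, 0.0)
        return (0.0, -K_FIX * dv)          # next image: vertical search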

View fixation need not always be done in both image dimensions; for motion along a surface as in ground traffic, just fixation in yaw may be sufficient for improved tracking. It has to be taken into account, however, that reducing motion blur by tracking one object may deteriorate observability for another object. This is the case, for example, when driving through a gap between two stationary obstacles (trees or posts of a gate). Fixation of the object on one side doubles motion blur on the other side; the solution is reducing speed and alternating gaze fixation to each side for a few cycles.

12.3.4 Saccadic Control

This alternating attention control with periods of smooth pursuit and fast gaze changes at high angular rates is called “saccadic” vision. During the periods of fast gaze changes, the entire images of all cameras are blurred; therefore, a logic bit is set indicating the periods when image evaluation does not make sense. These gaps in receiving new image data are bridged by extrapolation based on the spatiotemporal models for the motion processes observed. In this way, the 4-D approach quite naturally lends itself to algorithmic stabilization of space perception, despite the fast changing images on the sensor chip. Building internal representations in 3-D space and time allows easy fusion of inertial and other conventional data such as odometry and gaze angles relative to the vehicle body, measured mechanically. After a saccadic gaze change, the vision process has to be restarted with initial values derived from the spatiotemporal models installed and from the steps in gaze angles. Since gaze changes usually take 1 to 3 video cycles, uncertainty has increased and is reflected in the corresponding parameters of the recursive estimation process. If the goal of the saccade was to bring a certain region of the outside world into the f.o.v. of a camera with a different focal length (e.g., a telecamera), the measurement model and computation of the Jacobian elements have to be changed correspondingly. Since the region of special interest also remains in the f.o.v. of the wide-angle camera, tracking may be continued here, too, for redundancy until high-resolution interpretation has become stable.
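In recursive-estimation terms, bridging a saccade means running prediction-only cycles and letting the state uncertainty grow; a sketch of that logic in standard Kalman notation (matrices F, Q, H, R as usual; hypothetical helper, not the estimator actually implemented):

    import numpy as np

    def track_step(x, P, F, Q, z, H, R, saccade_active: bool):
        """One tracking cycle; measurements are skipped while the saccade
        bit is set, so the covariance P grows by Q every blurred frame."""
        x = F @ x                 # prediction with the spatiotemporal model
        P = F @ P @ F.T + Q
        if saccade_active or z is None:
            return x, P           # images blurred: extrapolate only
        # after the saccade, H and R may have changed (e.g., the object is
        # now measured in a camera with a different focal length)
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(len(x)) - K @ H) @ P
        return x, P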

According to the literature, human eyes can achieve turn rates up to several hundred degrees per second; up to five saccades per second have been observed. For technical systems in road traffic applications, maximum turn rates of a few hundred degrees per second and about two saccades per second may be sufficient.

A thorough study of controller design for these types of systems has been done in [Schiehlen 1995]. The interested reader is referred to this dissertation for all details in theoretical and practical results, including delay time observers. Figure 12.4 shows test results in saccadic gaze control based on this work; note the minimal overshoot at the goal position. A gaze change of 40° is finished within 350 ms (including 67 ms delay time from command till motion onset). Special controller design minimizes transient time and overshoot.


Figure 12.4 Saccadic gaze control in tilt (pitch) for the inner axis of a two-axis platform in the test vehicle VaMoRs (see Figure 14.16)

12.3.5 Stereovision

At medium distances, when the surface can be seen where the vehicle or object touches the ground, spatial interpretation may be achieved relatively easily by taking into account background knowledge about the scene and integration over time; in the very near range behind another vehicle, where the region in which the vehicle touches the ground is obscured by the subject’s own motor hood, monocular range estimation is impossible. Critical situations in traffic may occur when a passing vehicle cuts into the observer’s lane right in front; in the test vehicle VaMP, the ground ahead was visible at a range larger than about 6 m (long motor hood, camera behind center top of windshield).

Range estimation from a single, well-recognizable set of features is desirable in this case. A stereo base like the human one (of about 6 to 7 cm) seems sufficient; the camera arrangement as shown in Figure 1.4 (one to each side of the telecamera(s)) satisfies this requirement; multiocular stereo with improved performance may be achievable also by exploiting the tele-images for stereo interpretation. Using a stereo camera pair with nonparallel optical axes increases the computing load somewhat but poses no essential difficulties; epipolar lines have to be adjusted for efficient evaluation [Rieder 1996].

By its principle, stereovision deteriorates with range (inverse quadratic); so binocular stereo for the near range and intelligent scene interpretation for larger ranges are nicely complementary. Figure 12.5 shows images from a trinocular camera set (see Figure 14.16); the stereo viewing ranges, which might be used for understanding vertical structure on unpaved hilly roads without markings, are shown by dashed white lines. The stereo base was selected as 30 cm here. Slight misalignments of multiple cameras are typical; their effects have to be compensated by careful calibration, which is of special importance in stereo interpretation. The telecamera allows detecting a crossroad from the left here, which cannot be discovered in the wide-angle images; trinocular stereo is not possible with the pitch angle of the telecamera given here.
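The “inverse quadratic” deterioration follows from the pinhole stereo relation; with stereo base b, focal length f (in pixel units), disparity d, and disparity measurement error σd:

\[ r = \frac{f\,b}{d}, \qquad \sigma_r = \left|\frac{\partial r}{\partial d}\right| \sigma_d = \frac{f\,b}{d^{2}}\,\sigma_d = \frac{r^{2}}{f\,b}\,\sigma_d , \]

so the range error grows with the square of the range, while a larger stereo base b (30 cm here versus the human 6 – 7 cm) reduces it proportionally.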


Figure 12.5 Sample images from MarVEye in VaMoRs with three cameras: Two wide-angle cameras with relatively little central overlap for binocular stereo evaluation (bottom); image from mild telecamera (top). Trinocular stereo interpretation is possible in the region marked by the white rectangle with solid horizontal and dashed vertical lines.

12.3.6 Total Range of Fields of View

Large potential f.o.v. in azimuth are needed in traffic; for low-resolution imaging, about 200° are a must at a T-junction or a road crossing without a traffic light or roundabout. If the crossroad is of higher order and allows high-speed driving, the potential f.o.v. for high-resolution imaging should also be about 200° for checking oncoming traffic at larger ranges from both sides. Precise landmark recognition over time, under partial obscuration from objects being passed in the near vicinity, will also benefit from a large potential f.o.v. However, the simultaneous f.o.v. with high resolution need not be large, since objects to be viewed with this resolution are usually far away; if the bandwidth of gaze control is sufficiently high, the high-resolution viewing capability can always be turned to the object of interest before the object comes near and quick reactions are required.

The critical variable for gaze control, obstacle detection, and behavior decision is the “reaction time available” Trea for dealing with the newly detected object. The critical quantity is the radial speed component when the components normal to it are small; in this case, the object is moving almost on a direct collision path. If the object were a point and its environment showed no features, motion would be visually undetectable. It is the features off the optical line of sight that indicate range rate. An extended body, whose boundary features move away from each other with their center remaining constant, thus indicates decreasing range on a collision trajectory (“looming” effect). Shrinking feature distributions indicate increasing range (moving away). If the feature flow on the entire circumference is not radial and outward, but all features have a flow component to one side of the line of sight, the object will pass the camera on this side. With looming feature sets, the momentary time to collision (TTC) is (with r = range and the dot on top for the time derivative)

TTC = − r / ṙ .

Assuming a pinhole camera model and constant object width B, this value can be determined by measuring the object width in the image (bB1 and bB2) at two times t1 and t2, Δt = t2 − t1 apart. A simple derivation (see Figure 2.4) with derivatives approximated by differences and bBi = measured object width in the image at time ti yields

TTC = Δt · bB1 / (bB2 − bB1) .

The astonishing result is that this physically very meaningful term can be obtained without knowing either object size or actual range. Biological vision systems (have discovered and) use this phenomenon extensively (e.g., gannets stretching their wings at a proper time before hitting the water surface in a steep dash to catch fish).
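The result translates directly into a few lines (a sketch; function and variable names are mine): only two image widths of the same object and the time span between them are needed.

    def time_to_collision(b1: float, b2: float, dt: float) -> float:
        """Momentary TTC from looming: TTC = dt * b1 / (b2 - b1).

        b1, b2: object width in the image [pixels] at times t1 and t2;
        dt = t2 - t1 [s]. Neither true object size nor range is needed.
        """
        if b2 <= b1:
            raise ValueError("object is not looming (not approaching)")
        return dt * b1 / (b2 - b1)

    # Example: image width grows from 40 to 44 pixels within 0.4 s:
    # TTC = 0.4 * 40 / 4 = 4.0 s until projected collision.
    print(time_to_collision(40.0, 44.0, 0.4))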

To achieve accurate results with technical systems, the resolution of the camera has to be very high. If the objects approaching may come from any direction in the front hemisphere (like at road junctions or forks), this high resolution should be available in all directions. If one wants to cover a total viewing cone of 200° by 30° with telecameras having a simultaneous f.o.v. of about 5 to 7° horizontally, each with a side ratio of 4:3 (see Section 12.3.2), the total number of cameras required would be 150 to 200 on the vehicle periphery. Of course, this does not make sense.

Putting a single telecamera on a pan-and-tilt platform, the only requirement for achieving the same high resolution (with minor time delays, usually) is to allow a ±97° gaze change from straight ahead in the vehicle. To keep the inertial momentum of the platform small, tilt (pitch) changes can be effected by a mirror which is rotated around the horizontal axis in front of the telelens.

The other additional (mechanical) requirement is, of course, that the simultaneous f.o.v. can be directed to the region of actual interest in a time frame leaving sufficient time for proper behavior of the vehicle; as in humans, a fraction of a second for completing saccades is a reasonable compromise between mechanical and perceptual requirements. Figure 12.6 shows two of the first realizations of the “MarVEye” idea for the two test vehicles.

To the left is an arrangement with three cameras for the van VaMoRs; its maximal speed does not require very large look-ahead ranges. The stereo base is rather large (~ 30 cm); a color camera with a medium telelens sits at the center. To the right is the pan platform for VaMP with the camera set according to Figure 1.4. Since a sedan Mercedes 500 SEL is a comfortable vehicle with smooth riding qualities, gaze control in pitch has initially been left off for simplicity.
