Although it is common to observe occlusions for multi-object motion in single-view images, the objects may actually be physically distant from each other, and the occlusion may not exist at all when the observation is made in 3D space: an occlusion is caused by the observation perspective. Therefore, to resolve occlusions during tracking, three cameras are used to capture the objects' motion from different directions simultaneously (Fig 1), and the integrated multi-view tracking results yield accurate 3D motion trajectories. The proposed method is introduced in two parts: object detection and object tracking.

Fig 1. Experiment setup. Fish move in a rectangular container full of water. Three synchronized cameras capture swimming behavior from one top-view and two side-view directions; each pair of the three directions is perpendicular to the others.

All experimental procedures were in compliance with the Institutional Animal Care and Use Committee (IACUC) of Shanghai Research Center for Model Organisms (Shanghai, China), approval ID 2010-0010, and all efforts were made to minimize suffering.

In the laboratory environment, objects move in a stationary container, so the background is relatively stable. Motion regions can therefore be segmented effectively with a background subtraction method:

R_t(x,y) = \begin{cases} 1, & |f_t(x,y) - f_b(x,y)| > T_g \\ 0, & \text{otherwise} \end{cases}  (1)

where f_b(x,y) is the background, obtained by averaging N consecutive frames of the video sequence, and T_g is the segmentation threshold. Since rough edges in the motion region are not conducive to the extraction and analysis of the main skeleton, some preprocessing is necessary: isolated interior pixels of the moving region are first filled, small interfering blocks are then removed, and finally a median filter is used to smooth the boundary. Fig 2 shows the effects of preprocessing on the motion regions; preprocessing effectively reduces detail interference while preserving the main structure of the motion regions.

Fig 2. The main skeleton obtained using different thresholds T_u. As the threshold T_u increases, the obtained skeleton better represents the main structure of the motion region while ignoring more details. (a) Top view. (b) Side view.
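As a concrete illustration of the detection stage, the following Python sketch implements the background subtraction of Eq (1) and the three preprocessing steps. The OpenCV-based realization, the threshold values, and the minimum blob area are illustrative assumptions, not the authors' code.

```python
import cv2
import numpy as np

def build_background(frames):
    """Estimate the background f_b as the average of N consecutive frames (Eq 1)."""
    return np.mean(np.stack(frames).astype(np.float32), axis=0)

def detect_motion_region(frame, background, T_g=30):
    """Background subtraction: R_t = 1 where |f_t - f_b| > T_g (Eq 1)."""
    diff = cv2.absdiff(frame.astype(np.float32), background)
    mask = (diff > T_g).astype(np.uint8) * 255

    # Preprocessing step 1: fill isolated interior pixels
    # (approximated here with a morphological closing).
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

    # Preprocessing step 2: remove small interfering blocks.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] < 50:  # assumed minimum blob area
            mask[labels == i] = 0

    # Preprocessing step 3: median filter to smooth the boundary.
    return cv2.medianBlur(mask, 5)
```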
Since the object has a belt-like appearance, its center curve can be obtained by extracting the main skeleton, which transforms the object's structure from a 2D region to a 1D curve and alleviates tracking difficulty by reducing the number of pixel points to be tracked. Among existing skeleton extraction methods, the augmented fast marching method (AFMM) [19] is adopted. First, an arrival time U is set for each point on the edge of the region; the value of U over the entire region is then obtained by iterating the fast marching method. Based on the distribution of U, skeleton points are defined as:

S = \{(i,j) \mid \max(|u_x|, |u_y|) > T_u\}, \quad \text{s.t.} \quad u_x = U(i+1,j) - U(i,j), \; u_y = U(i,j+1) - U(i,j)  (2)

i.e., a point (i,j) is regarded as a skeleton point when the difference in U between this point and its neighbors in the x and y directions is larger than the threshold T_u.

The reasons for choosing AFMM are as follows: (1) Speed of operation. Each frame contains many motion regions, so low skeleton extraction efficiency would significantly degrade tracking performance; as one of the fastest skeleton extraction methods, AFMM is particularly suitable for extracting skeletons of regions in video images. (2) Multi-scale skeleton representation. Although preprocessing is, to some extent, conducive to extraction of the main skeleton, it cannot completely remove the interference of small branches. A simple and effective way to accurately extract the main skeleton of a motion region is to analyze the region at different scales and find an optimal scale for skeleton extraction. The threshold T_u in AFMM acts like a scale factor: a smaller threshold preserves more skeleton details, and a larger threshold preserves fewer. Fig 2 shows skeleton extraction results for different threshold values; as T_u increases, skeleton details gradually decrease, but the skeleton's ability to describe the main structure of the object improves.

After obtaining the main skeleton, the points that best represent the object's shape are selected as feature points, further simplifying the object's structure. First, a point in the main skeleton is selected to represent the center position of the object. In the skeleton obtained through AFMM, the maximum U value is usually located at the center, which well reflects the midpoint of the motion region, so this point is defined as the central feature point of the object. Compared with the centroid of the motion region, the maximum-U point is more stable, since it is less susceptible to changes in the shape of the motion region. Then, considering the discrepancy in the objects' appearances from different views, two feature point models are used to represent objects in the top view and side views, respectively.

As shown in Fig 3(a), the object's appearance in the top view has the following characteristics: (1) it is composed of two parts, a rigid first-half region with less deformation and a non-rigid second-half region with greater deformation; and (2) the structure gradually narrows from the head to the tail. Based on these characteristics, two points can be selected from the object's main skeleton as its head and tail. First, the average width of the skeleton points on each side of the central feature point is obtained, using the shortest distance between a skeleton point and the region edge as the radius. The skeleton of the rigid region can then be distinguished from that of the non-rigid region based on the average width. Since the object's main skeleton has endpoints at the head and tail positions, the skeleton endpoints of the rigid and non-rigid regions are defined as the head feature point and tail feature point, respectively. A double feature point model (DFPM), comprised of the head feature point and the central feature point, is then used to represent the object, as shown in Fig 3(b). DFPM has the following advantages: (1) simplicity and accuracy: it is composed of two points in the rigid region of the object and can accurately reflect the object's spatial location; and (2) comprehensive information: it not only indicates the object's location but also denotes the motion direction, reducing the difficulty of object tracking.
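The DFPM construction can be sketched as follows, assuming the AFMM arrival-time field U and the main-skeleton mask have already been computed (e.g., with an implementation of [19]), and using the region's distance transform as the local width. Splitting the skeleton by the nearer endpoint is a simplification of the side-by-side width analysis described above, and all names are illustrative.

```python
import numpy as np
from scipy import ndimage

def skeleton_endpoints(skel):
    """Skeleton pixels with exactly one 8-connected skeleton neighbor."""
    counts = ndimage.convolve(skel.astype(int), np.ones((3, 3)), mode='constant')
    return np.argwhere((skel > 0) & (counts == 2))  # self + one neighbor

def build_dfpm(skel, U, width):
    """Double feature point model: (central point, head point).

    skel  : binary main skeleton of the motion region (2D array)
    U     : AFMM arrival-time field, assumed precomputed
    width : local radius, e.g. a distance transform of the region
            (shortest distance from each point to the region edge)
    """
    pts = np.argwhere(skel > 0)
    center = tuple(pts[np.argmax(U[pts[:, 0], pts[:, 1]])])  # max-U point

    ends = skeleton_endpoints(skel)
    assert len(ends) == 2, "a non-occluded fish skeleton has two endpoints"

    # Assign each skeleton point to its nearer endpoint and compare mean
    # widths: the rigid first half is wider on average, so its endpoint
    # is taken as the head feature point.
    dists = np.linalg.norm(pts[:, None, :] - ends[None, :, :], axis=2)
    side = np.argmin(dists, axis=1)
    mean_width = [width[pts[side == k, 0], pts[side == k, 1]].mean() for k in (0, 1)]
    head = tuple(ends[int(np.argmax(mean_width))])
    return center, head
```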
Fig 3. Feature point models. (a) Fish appearance model. The blue line represents the main skeleton of the fish; the endpoints of the skeleton are located at the head and tail, respectively. (b) Double feature point model (DFPM), consisting of the central feature point and the head feature point. (c) Three feature point model (TFPM), consisting of the central feature point and the two skeleton endpoints.

In the side-view direction, locating the head and tail of the object based on its shape or texture features is difficult. Therefore, unlike DFPM, a three feature point model (TFPM) consisting of the two skeleton endpoints and the central feature point is used, as shown in Fig 3(c). If feature point matching between the top and side views succeeds under the epipolar constraint (see the 'Stereo matching' section), TFPM can be simplified to DFPM. If the matching fails, TFPM is simplified based on the object's positional relationship between adjacent frames (see the 'Motion association' section).

To obtain the 3D motion trajectories of objects, a strategy is needed for analyzing objects across the multi-view images. Tracking objects in one view while using auxiliary stereo matching in the other two views is insufficient to handle motion occlusion, whereas tracking objects in all views simultaneously decreases tracking performance because side-view tracking is difficult. To solve this problem, an effective strategy is adopted that focuses on top-view tracking, where appearance changes of the object are relatively slight, while using side-view tracking as a supplement.

As shown in Fig 4, a point p in 3D space is projected onto views v1 and v2 as points p1 and p2. Let o1 and o2 denote the centers of the two cameras; then p, o1, and o2 form a plane s in 3D space. The intersection line l1 of s and v1 is called the epipolar line of p2, and the intersection line l2 of s and v2 is called the epipolar line of p1. The epipolar constraint can be formulated as follows: if the projected point in v1 is p1, then the corresponding projected point p2 in v2 lies on the epipolar line l2 of p1. Their relationship can be written as:

p_2^T F p_1 = 0  (3)

where F is the 3×3 fundamental matrix. By manually selecting eight pairs of matching points in a stereo image pair obtained from the top and side views, the fundamental matrix can be calculated using the eight-point method [20]. Based on the fundamental matrix, feature points that satisfy the epipolar constraint can be found from the top view in the side views.

Fig 4. Illustration of the epipolar constraint. For the projected point p1 in view v1, the corresponding projected point p2 in view v2 is located on the epipolar line l2 of p1.

Assume p_{i,t}^{top} is the central feature point of object i_t in the top view, p_{j,t}^{side_v} is the central feature point of object j_t in side view v, and l_{i,t}^{side_v} is the epipolar line corresponding to p_{i,t}^{top} in side view v. The association probability is inversely proportional to the Euclidean distance from p_{j,t}^{side_v} to l_{i,t}^{side_v}, and the result of stereo matching can be expressed as:

e_m(i_t, j_t) = \begin{cases} 1, & \text{if } \mathrm{distance}(p_{j,t}^{side_v}, l_{i,t}^{side_v}) \text{ is below the matching threshold} \\ 0, & \text{otherwise} \end{cases}  (4)
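A sketch of this stereo-matching step, using OpenCV's eight-point fundamental-matrix estimation for Eq (3) and the point-to-epipolar-line distance for Eq (4). The pixel tolerance dist_thresh stands in for the matching threshold, whose value is not given in this excerpt; function and variable names are illustrative.

```python
import cv2
import numpy as np

def fundamental_from_clicks(pts_top, pts_side):
    """Estimate F from >= 8 manually selected point pairs
    (eight-point method, Eq 3: p2^T F p1 = 0). pts_* are (N, 2) float32 arrays."""
    F, _ = cv2.findFundamentalMat(pts_top, pts_side, cv2.FM_8POINT)
    return F

def stereo_match(p_top, candidates_side, F, dist_thresh=5.0):
    """Match a top-view central feature point against side-view central
    feature points via the epipolar constraint (Eq 4)."""
    # Epipolar line l = F @ p1 (homogeneous coordinates) in the side view.
    a, b, c = (F @ np.array([p_top[0], p_top[1], 1.0])).ravel()
    norm = np.hypot(a, b)
    best, best_d = None, np.inf
    for j, (x, y) in enumerate(candidates_side):
        d = abs(a * x + b * y + c) / norm  # point-to-epipolar-line distance
        if d < best_d:
            best, best_d = j, d
    return best if best_d < dist_thresh else None  # index, or no match (e_m = 0)
```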
(1) Occlusion detection

Occlusions are detected directly from the extracted skeleton:

oc(i_t) = \begin{cases} 1, & \text{if } s_p > MAL \text{ or } e_p > 2 \\ 0, & \text{otherwise} \end{cases}  (6)

where i_t denotes an arbitrary object in a view, s_p represents the skeleton length, and e_p represents the number of skeleton endpoints. If the number of skeleton endpoints is greater than 2, or if the skeleton length exceeds the maximum length (MAL), an occlusion has occurred in the region represented by the skeleton.

(2) Motion association

The motion states of non-occluded objects between adjacent frames are highly consistent: changes in position and direction for the same object are small, while changes for different objects are large. Accordingly, the position and direction of an object can be used to construct an association cost function, based on our method proposed in [7]. Let i_{t-1} and i_t denote arbitrary objects in frames t-1 and t of a view, and let pc(i_{t-1}, i_t) and dc(i_{t-1}, i_t) represent the changes in position and direction between i_{t-1} and i_t, respectively. The cost function can then be expressed as:

c_v(i_{t-1}, i_t) = \omega \frac{dc(i_{t-1}, i_t)}{dc_{max}} + (1 - \omega) \frac{pc(i_{t-1}, i_t)}{pc_{max}}  (7)

where pc_max and dc_max denote the maximum moving distance and the maximum deflection angle of an object between adjacent frames, respectively.

Based on this cost function, a locally optimized object association is realized using a greedy algorithm, which always makes the locally optimal choice at each step rather than a globally optimal one. When associating objects in neighboring frames, the cost function is first computed for each pair of objects; the pairs are sorted, and the two objects producing the smallest cost are associated. This process is repeated to associate as many objects as possible. In the top view, since DFPM carries directional information, direct association is possible. In the side views, TFPM must first be simplified to DFPM according to the relative positional relationship between objects before proceeding with the association. Fig 6 shows the motion association process for the feature point models. To improve association efficiency, if the distance between two objects is larger than pc_max, the association is abandoned.

Fig 6. Illustration of motion association. (a) Motion association for DFPM. (b) Model simplification for TFPM.
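The cost function of Eq (7) and the greedy association it drives could be sketched as follows. Here pc and dc are assumed callables giving the position and direction change between two objects, and omega = 0.5 is an assumed weight; the paper's value is not given in this excerpt.

```python
def greedy_associate(prev_objs, curr_objs, pc, dc, pc_max, dc_max, w=0.5):
    """Greedy frame-to-frame association using the cost of Eq (7).

    prev_objs, curr_objs : object descriptors in frames t-1 and t
    pc(a, b), dc(a, b)   : position / direction change between two objects
    w                    : weight omega (assumed value)
    """
    pairs = []
    for i, a in enumerate(prev_objs):
        for j, b in enumerate(curr_objs):
            if pc(a, b) > pc_max:  # too far apart: abandon, for efficiency
                continue
            cost = w * dc(a, b) / dc_max + (1 - w) * pc(a, b) / pc_max
            pairs.append((cost, i, j))
    pairs.sort()  # smallest cost first

    assoc, used_i, used_j = {}, set(), set()
    for cost, i, j in pairs:  # locally optimal choice at each step
        if i not in used_i and j not in used_j:
            assoc[i] = j
            used_i.add(i)
            used_j.add(j)
    return assoc  # maps frame t-1 index -> frame t index
```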
(3) Matching association

Since occluded objects are not associated, the tracking results contain many trajectory fragments. To link these fragments into complete trajectories, it is essential to match the objects before and after an occlusion. Because occlusion of objects in 2D space is complex, accurate matching of trajectory fragments based on information from a single view can hardly be realized. Fortunately, object motions with frequent occlusions in a single view are less likely to collide in 3D space, so trajectory information integrated from multiple views effectively solves the occlusion problem. First, a state flag is set for each object in the top view:

f(i_t) = \begin{cases} -1, & \text{if } i_t \text{ is the end of a trajectory} \\ +1, & \text{if } i_t \text{ is the start of a trajectory} \\ 0, & \text{otherwise} \end{cases}  (8)

If an object with f(i_t) = +1 exists in frame t, it is matched against all objects with state flag -1 from before the occlusion, based on the tracking results of the two side views:

ma(i_{t-n}, i_t) = \begin{cases} 1, & \text{if } \overrightarrow{i_{t-n}} \in T_j^{side_v} \text{ and } \overrightarrow{i_t} \in T_j^{side_v} \\ 0, & \text{otherwise} \end{cases} \quad \text{s.t.} \; f(i_t) = +1, \; f(i_{t-n}) = -1  (9)

where n represents the number of frames the occlusion lasts, and \overrightarrow{i_{t-n}} and \overrightarrow{i_t} denote the objects matching i_{t-n} and i_t in side view v, respectively.

Eq (9) indicates that matching association succeeds when the objects in the top view before and after the occlusion are matched, under the epipolar constraint, with the same trajectory fragment in a side view. Fig 7 shows an example of matching association. If matching association succeeds, the objects occluded in the top view were not occluded in the side view. If matching association fails, the objects collided in 3D space, and an optimized association is performed between these objects and all other objects with state flag -1 in frame t-n according to Eq (7).

Fig 7. Example of matching association.

Since the proposed method is based on top-view tracking, only occluded objects in the top view need to be associated. As long as objects can be accurately tracked in the top view, their 3D motion trajectories can be obtained by stereo matching with the tracking results from the other two views.
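To make the matching-association logic of Eqs (8)-(9) concrete, here is a minimal Python sketch. The side_traj_of helper, which returns the id of the side-view trajectory fragment that a top-view object stereo-matches to (e.g., built on the epipolar matching sketched earlier) or None, is a hypothetical interface, as are the other names.

```python
def matching_association(starts, ends, side_traj_of):
    """Link top-view trajectory fragments across an occlusion (Eqs 8-9).

    starts       : top-view objects with state flag +1 (fragment starts)
    ends         : top-view objects with state flag -1 (fragment ends)
    side_traj_of : maps a top-view object to the id of the side-view
                   trajectory it stereo-matches to, or None
    """
    links = []
    for s in starts:
        t_s = side_traj_of(s)
        if t_s is None:
            continue
        for e in ends:
            # ma = 1 when both fragments match the same side-view trajectory.
            if side_traj_of(e) == t_s:
                links.append((e, s))  # end before occlusion -> start after it
                break
    return links
```

Fragments left unlinked here would fall back to the optimized association of Eq (7), as described above.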