Assuming your cars are moving, you could try to estimate the ground plane (road).
You may get a decent ground-plane estimate by extracting features (SURF rather than SIFT, for speed), matching them over frame pairs, and solving for a homography using RANSAC, since a plane in 3D maps between two camera views according to a homography.
Once you have your ground plane you can identify the cars by looking at clusters of pixels that don't move according to the estimated homography.
A more sophisticated approach would be to do Structure from Motion on the terrain. This only presupposes that it is rigid, not that it is planar.
Update
> I was wondering if you could expand on how you would go about looking for clusters of pixels that don't move according to the estimated homography?
Sure. Say `I` and `K` are two video frames and `H` is the homography mapping features in `I` to features in `K`. First you warp `I` onto `K` according to `H`, i.e. you compute the warped image `Iw` as `Iw([x y]') = I(inv(H) * [x y]')` (roughly Matlab notation). Then you look at the squared or absolute difference image `Diff = (Iw - K) .* (Iw - K)`. Image content that moves according to the homography `H` should give small differences (assuming constant illumination and exposure between the images). Image content that violates `H`, such as moving cars, should stand out.
For clustering high-error pixel groups in `Diff`, I would start with simple thresholding ("every pixel difference in `Diff` larger than X is relevant", maybe using an adaptive threshold). The thresholded image can be cleaned up with morphological operations (dilation, erosion) and clustered with connected components. This may be too simplistic, but it's easy to implement for a first try, and it should be fast. For something more fancy, look at Clustering on Wikipedia. A 2D Gaussian mixture model may be interesting; when you initialize it with the detection result from the previous frame, it should be pretty fast.
I did a little experiment with the two frames you provided, and I have to say I am somewhat surprised myself at how well it works. :-) Left image: difference (color coded) between the two frames you posted. Right image: difference between the frames after matching them with a homography. The remaining differences clearly are the moving cars, and they are sufficiently strong for simple thresholding.
Thinking of the approach you currently use, it may be interesting to combine it with my proposal:
- You could try to learn and classify the cars in the difference image `Diff` instead of the original image. This would amount to learning what a car motion pattern looks like rather than what a car looks like, which could be more reliable.
- You could get rid of the expensive window search and run the classifier only on regions of `Diff` with sufficiently high values.
Some additional remarks:
- In theory, the cars should stand out even if they are not moving, since they are not flat, but given your distance to the scene and the camera resolution this effect may be too subtle.
- You can replace the feature extraction/matching part of my proposal with optical flow, if you like. This amounts to identifying flow vectors that "stick out" from a consistent frame-to-frame motion of the ground. It may be prone to outliers in the optical flow, however. You can also try to obtain the homography from the flow vectors.
- This is important: regardless of which method you use, once you have found cars in one frame you should use this information to robustify your search for these cars in consecutive frames, giving a higher likelihood to detections close to the old ones (Kalman filter, etc.). That's what tracking is all about!