If I understand correctly, you have two images taken from a smartphone camera, for which you know (at least approximately) the intrinsics matrix and the relative 3D rotation between the poses where the two images were taken. You also say there is a small translation between the two images, which is good, since you would not be able to compute depth otherwise.
Unfortunately, you do not have enough information to estimate depth directly. Basically, estimating depth from two images requires you to:
1. Find point correspondences between the two images
Depending on what you want to do, this can be done either for all points in the images (i.e. in a dense way) or only for a few points (i.e. in a sparse way). Of course the latter is less computationally expensive, hence more appropriate for smartphones.
Dense matching requires rectifying the images first, in order to make the computation tractable; even so, it will probably take a long time on a smartphone. Image rectification can be achieved either with a calibrated method (which requires knowing the rotation and translation between the two image poses, the intrinsics camera matrix and the distortion coefficients of the camera) or with a non-calibrated method (which requires sparse point matches between the two images and the fundamental matrix, which can itself be estimated from those matches).
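As a rough illustration, here is a minimal sketch of the non-calibrated rectification route in Python/OpenCV (the Android API mirrors it); `img1`, `img2`, `pts1` and `pts2` are assumed inputs, the last two being Nx2 arrays of sparse matches (see the matching sketch further below):

```python
import numpy as np
import cv2

# Assumed inputs: img1, img2 (the two images) and pts1, pts2 (Nx2 float32
# arrays of matched pixel coordinates between them).
h, w = img1.shape[:2]

# Estimate the fundamental matrix from the matches (RANSAC rejects outliers).
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)

# Compute the rectifying homographies without knowing the calibration.
ok, H1, H2 = cv2.stereoRectifyUncalibrated(pts1, pts2, F, (w, h))

# Warp both images so corresponding points end up on the same scanline,
# which is what makes dense matching tractable.
rect1 = cv2.warpPerspective(img1, H1, (w, h))
rect2 = cv2.warpPerspective(img2, H2, (w, h))
```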
Sparse matching requires matching salient features (e.g. SURF or SIFT, or more efficient ones) between the two images. This has the advantage of being both more efficient and more accurate than dense matching.
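For instance, a minimal sparse-matching sketch using ORB (a free and efficient alternative to SURF/SIFT available in OpenCV), assuming `img1` and `img2` are the two grayscale images:

```python
import numpy as np
import cv2

# Assumed inputs: img1, img2 (grayscale images).
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching; cross-check keeps only mutual best matches.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)

# Pixel coordinates of the matched points, as Nx2 float32 arrays.
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
```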
2. Triangulate the corresponding points to estimate depth
Triangulation requires knowing the intrinsics parameters (camera matrix and distortion coefficients) and the extrinsics parameters (relative rotation and translation between the poses from which the images were taken).
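In OpenCV terms, once you have all of these, triangulation looks roughly like the following sketch (all variable names are assumptions, not a drop-in solution):

```python
import numpy as np
import cv2

# Assumed inputs: K (3x3 intrinsics), dist (distortion coefficients),
# R (3x3 relative rotation), t (3x1 relative translation),
# pts1, pts2 (Nx2 arrays of matched pixel coordinates).

# Undistort and normalize the points; afterwards the projection matrices
# no longer need to include K or the distortion model.
n1 = cv2.undistortPoints(pts1.reshape(-1, 1, 2).astype(np.float64), K, dist)
n2 = cv2.undistortPoints(pts2.reshape(-1, 1, 2).astype(np.float64), K, dist)

# Projection matrices in normalized coordinates; the first pose is the origin.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([R, t.reshape(3, 1)])

# Returns 4xN homogeneous coordinates; divide by the last row for 3D points.
pts4d = cv2.triangulatePoints(P1, P2, n1.reshape(-1, 2).T, n2.reshape(-1, 2).T)
pts3d = (pts4d[:3] / pts4d[3]).T  # Nx3; depth is the Z component
```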
In your case, assuming your relative rotation and intrinsics camera matrix are accurate enough (which I doubt), you still lack the translation and the distortion coefficients.
However, you can still apply the classical approach for stereo triangulation, which requires an accurate calibration of your camera and an estimation of the full relative pose (i.e. rotation + translation).
Calibrating your camera will enable you to estimate an accurate intrinsics matrix and the associated distortion coefficients. Doing this is recommended because your camera will not be exactly the same as the cameras in other phones (even if it is the same phone model). See e.g. this tutorial, which shows the methodology even though the code samples are in C++ (the equivalent must exist for Android).
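In Python/OpenCV, the chessboard-based calibration from that kind of tutorial boils down to something like this (the image path and board size are placeholders):

```python
import glob
import numpy as np
import cv2

# Chessboard with 9x6 inner corners; "calib/*.jpg" is a placeholder path
# to several photos of the board taken from different angles.
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points, size = [], [], None
for path in glob.glob("calib/*.jpg"):
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K is the intrinsics matrix, dist the distortion coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, size, None, None)
```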
Once you have accurately estimated the intrinsics parameters, one way to estimate the full relative pose (i.e. rotation and translation) is to compute the fundamental matrix (using feature matches found between the two images), then to infer the essential matrix using the camera matrix, and finally to decompose the essential matrix into the relative rotation and translation. See this link, which gives the formula to infer the essential matrix from the fundamental matrix, and this link, which explains how to compute the rotation and translation from the essential matrix.
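A sketch of that chain in Python/OpenCV, assuming `pts1`/`pts2` are matched points as above and `K` the calibrated intrinsics matrix:

```python
import numpy as np
import cv2

# 1. Fundamental matrix from the matches (RANSAC to reject outliers).
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)

# 2. Essential matrix from the fundamental matrix: E = K^T . F . K
E = K.T @ F @ K

# 3. Decompose E into the relative rotation and translation; recoverPose
#    resolves the four-fold decomposition ambiguity by keeping the solution
#    that puts the triangulated points in front of both cameras.
retval, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, K)
```

Keep in mind that the translation recovered this way is only known up to scale, so the triangulated depths will be in arbitrary units unless you know the real baseline between the two poses.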
Also, to answer your other question related to `warpPerspective`: you would need to use `K.R.inv(K)` or `K.inv(R).inv(K)`, depending on which image you are warping. This is because `R` is a 3D rotation, which by itself has nothing to do with pixel coordinates: you have to back-project the pixels with `inv(K)`, rotate, and re-project with `K` to obtain a homography acting on pixel coordinates.
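Concretely, a sketch of that pixel-space homography (with `K`, `R` and `img` as assumed inputs):

```python
import numpy as np
import cv2

# H maps pixels of one image to the other under a pure rotation R:
# back-project with inv(K), rotate, re-project with K.
H = K @ R @ np.linalg.inv(K)   # use K . inv(R) . inv(K) for the other image
h, w = img.shape[:2]
warped = cv2.warpPerspective(img, H, (w, h))
```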