###########################################################################
#                      The INV-Flow2PoseNet Datasets                      #
#             Torben Fetzer     Gerd Reis     Didier Stricker             #
#                  Technische Universität Kaiserslautern                  #
#   Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI GmbH)    #
###########################################################################

This file describes the different datasets used for training and
evaluating the networks in INV-Flow2PoseNet. It contains two different
subsets:

- SyntheticData
    - Train_ConsistentLight
    - Test_ConsistentLight
    - Train_InConsistentLight
    - Test_InConsistentLight
- BuddhaBirdRealData

The SyntheticData has been rendered using the Unity game engine. The
training sets use 22 randomly chosen, randomly rotated and translated
objects to create the scenes; the test sets use 8 different objects in the
same way. Between the two views the light source either stays consistent
or moves in order to create inconsistently illuminated and shaded scenes,
as may happen for rotating objects. The training sets consist of 20,000
randomly created scenes, while the test sets contain only 1,000 scenes
each.

The BuddhaBirdRealData is the real-world counterpart to the synthetic
data. It consists of 5 different objects that have been captured by a
structured light scanner from 8 different perspectives each, using a
stereo scan head. The projector used for illuminating the scene has been
calibrated, so that the light source is also known. The partial
reconstructions have been aligned carefully, so that optical flows and
relative poses between adjacent scans are also available. Within each scan
head the image pairs represent the consistent light case (same projector);
between neighboring views the inconsistent light case appears. The first
40 pairs represent the scans within one scan head (consistent light) with
8 reconstructions per object. The last 160 pairs represent the
inconsistent light case with combinations of camera views between adjacent
scans (that use different projectors).


Data Format Description
=======================

Every scene consists of the following data parts:

- image0 and image1 contain the 8 bit integer grayscale images of the two
  camera views.

- data0 and data1 are .json files that contain the intrinsic calibration
  matrices K, the camera rotation R and translation t, the minimal and
  maximal depth values minDepth and maxDepth, the minimal and maximal
  values of the horizontal and vertical optical flows minFlowX, maxFlowX,
  minFlowY and maxFlowY, and the coordinates of the light source lightPos.

- depth0 and depth1 are 16 bit integer grayscale images that need to be
  scaled after loading using the minimal and maximal depth values from the
  data files:

      D = D * (maxDepth - minDepth) / 65535 + minDepth

- normal0 and normal1 are 24 bit integer RGB images in tangent space that
  can be re-transformed to spatial space by:

      n = (2/255 * n1 - 1, 2/255 * n2 - 1, 1 - 2/255 * n3)

- flow0 and flow1 contain the horizontal and vertical displacements of the
  respective flow fields between the views. The flows are stored as 16 bit
  integers in three channel images (flowX, flowY, zeros) and are scaled
  analogously to the depth files.

Note that missing / masked pixels for which no depth information is
available contain zeros in the depth, flow and normal files. After
rescaling and shifting these files, the mask should be applied again to
keep the masking information with values of zero.
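For convenience, the following Python snippet sketches how a single view
of a scene could be loaded and rescaled according to the rules above. It
is only an illustration and not part of the dataset: the file names, the
.png extension and the use of imageio / NumPy are assumptions, and the
JSON key names are taken from the description above and may need to be
adapted.

    # Minimal loading sketch for one camera view of a scene (illustration
    # only). Assumptions: files are stored as PNGs named image0.png,
    # depth0.png, ..., and imageio preserves the stored channel order
    # (flowX, flowY, zeros).
    import json
    import imageio.v3 as iio
    import numpy as np

    def load_view(scene_dir, idx):
        # Calibration and scaling information (key names as listed above)
        with open(f"{scene_dir}/data{idx}.json") as f:
            data = json.load(f)

        # 8 bit grayscale image
        image = iio.imread(f"{scene_dir}/image{idx}.png")

        # 16 bit depth; zeros mark missing / masked pixels
        depth_raw = iio.imread(f"{scene_dir}/depth{idx}.png").astype(np.float64)
        mask = depth_raw > 0
        depth = depth_raw * (data["maxDepth"] - data["minDepth"]) / 65535 + data["minDepth"]
        depth[~mask] = 0                  # re-apply the mask after shifting

        # 24 bit tangent space normals -> spatial space
        n = iio.imread(f"{scene_dir}/normal{idx}.png").astype(np.float64)
        normal = np.stack([2 / 255 * n[..., 0] - 1,
                           2 / 255 * n[..., 1] - 1,
                           1 - 2 / 255 * n[..., 2]], axis=-1)
        normal[~mask] = 0

        # 16 bit flow (flowX, flowY, zeros), scaled like the depth
        f_raw = iio.imread(f"{scene_dir}/flow{idx}.png").astype(np.float64)
        flow_x = f_raw[..., 0] * (data["maxFlowX"] - data["minFlowX"]) / 65535 + data["minFlowX"]
        flow_y = f_raw[..., 1] * (data["maxFlowY"] - data["minFlowY"]) / 65535 + data["minFlowY"]
        flow = np.stack([flow_x, flow_y], axis=-1)
        flow[~mask] = 0

        return image, depth, normal, flow, mask, data

The calibration entries (K, R, t, lightPos) are returned unchanged in data
and can be used for the vertex map computation described below.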
The presented network uses vertex maps instead of depth maps. These can be
computed from the depth data and the given calibration information by
applying the following operation to each image pixel (x,y):

    V(x,y) = inv(K) * (x; y; 1) / norm(inv(K) * (x; y; 1)) * D(x,y)

The given depth, vertex and normal maps are independent of any camera
pose, as the poses are usually not available beforehand and need to be
computed by the procedure. In order to use them to triangulate point
clouds with respect to the given pose, the vertex maps (or point clouds)
and normal maps can be transformed in the following way. Given a camera
pose P = (R, t), the 3D point with respect to the complete camera matrix
P = K[R|t] is given by:

    V(x,y) = -R^T * t + R^T * V(x,y)

and the normals of the respective 3D points are given by:

    N(x,y) = R^T * N(x,y)

A sketch of both operations is given at the end of this file.


Visualizing the Data
=======================

In addition, there is a Matlab script testData.m that loads the data of a
scene and visualizes the information:

- It writes pointcloudCalib0.ply and pointcloudCalib1.ply to the specified
  path, which contain the triangulated point clouds and a visualization of
  the cameras and light source (see visualization_pointcloud_calib.png).

- It computes warped images using the given optical flow that show the
  transfer from one view to the other (see visualization_flow_warp.png).
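The following Python sketch illustrates the vertex map computation and the
pose transform described above. It is only an illustration: NumPy and
zero-based pixel coordinates (x, y) = (column, row) are assumptions, not
part of the dataset specification.

    # Sketch of the vertex map computation and the pose transform
    # (illustration only; assumes NumPy and zero-based pixel coordinates).
    import numpy as np

    def vertex_map(depth, K):
        # Back-project every pixel onto its viewing ray and scale it to the
        # distance D(x,y); masked pixels (depth == 0) stay at zero.
        h, w = depth.shape
        xs, ys = np.meshgrid(np.arange(w), np.arange(h))
        rays = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)
        rays = rays @ np.linalg.inv(K).T                      # inv(K) * (x; y; 1)
        rays /= np.linalg.norm(rays, axis=-1, keepdims=True)  # normalize the ray
        return rays * depth[..., None]                        # V(x,y)

    def apply_pose(V, N, R, t):
        # Transform camera-space vertex and normal maps with the pose
        # P = (R, t):  V <- -R^T * t + R^T * V   and   N <- R^T * N
        R = np.asarray(R, dtype=np.float64)
        t = np.asarray(t, dtype=np.float64).reshape(3)
        V_posed = V @ R - R.T @ t        # row-vector form of R^T * V - R^T * t
        N_posed = N @ R                  # row-vector form of R^T * N
        return V_posed, N_posed

Applied to a view loaded as sketched above, using K, R and t from the data
file, this should yield point clouds comparable to those written by
testData.m.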