###########################################################################
#                      The INV-Flow2PoseNet Datasets                      #
#             Torben Fetzer     Gerd Reis     Didier Stricker             #
#                  Technische Universität Kaiserslautern                  #
#   Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI GmbH)    #
###########################################################################

This file describes the different datasets used for training and
evaluating the networks in INV-Flow2PoseNet. It contains two different
subsets:

- SyntheticData
    - Train_ConsistentLight
    - Test_ConsistentLight
    - Train_InConsistentLight
    - Test_InConsistentLight
- BuddhaBirdRealData

The SyntheticData has been rendered using the Unity game engine. The
training sets use 22 randomly chosen, randomly rotated and translated
objects to create the scenes; the test sets use 8 different objects in the
same way. Between the two views the light source either stays consistent
or moves in order to create inconsistently illuminated and shaded scenes,
as may happen for rotating objects. The training sets consist of 20,000
randomly created scenes, while the test sets contain only 1,000 scenes
each.

The BuddhaBirdRealData is the real-world counterpart to the synthetic
data. It consists of 5 different objects that have been captured by a
structured light scanner from 8 different perspectives each, using a
stereo scan head. The projector used for illuminating the scene has been
calibrated, so that the light source is also known. The partial
reconstructions have been aligned carefully, so that optical flows and
relative poses between adjacent scans are also available. Within each scan
head the image pairs represent the consistent light case (same projector);
between neighboring views the inconsistent light case appears. The first
40 pairs represent the scans within one scan head (consistent light) with
8 reconstructions per object. The last 160 pairs represent the
inconsistent light case with combinations of camera views between adjacent
scans (that use different projectors).


Data Format Description
=======================

Every scene consists of the following data parts:

- image0 and image1 contain the 8 bit integer grayscale images of the two
  camera views.

- data0 and data1 are .json files that contain the intrinsic calibration
  matrices K, the camera rotation R and translation t, the minimal and
  maximal depth values minDepth and maxDepth, the minimal and maximal
  values of the horizontal and vertical optical flows minFlowX, maxFlowX,
  minFlowY and maxFlowY, and the coordinates of the light source lightPos.

- depth0 and depth1 are 16 bit integer grayscale images that need to be
  scaled after loading using the minimal and maximal depth values from the
  data files:

      D = D * (maxDepth - minDepth) / 65535 + minDepth

- normal0 and normal1 are 24 bit integer RGB images in tangent space that
  can be re-transformed to spatial space by:

      n = (2/255 * n1 - 1, 2/255 * n2 - 1, 1 - 2/255 * n3)

- flow0 and flow1 contain the horizontal and vertical displacements of the
  respective flow fields between the views. The flows are stored as 16 bit
  integers in three channel images (flowX, flowY, zeros) and are scaled
  analogously to the depth files.

Note that missing / masked pixels for which no depth information is
available contain zeros in the depth, flow and normal files. After
rescaling and shifting these files, the mask should be applied again to
keep the masking information with values of zero.
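For convenience, the following Python snippet sketches how a single view
of a scene could be loaded and rescaled according to the rules above. It
is only an illustration and not part of the dataset: the file names, the
.png extension and the use of imageio / NumPy are assumptions, and the
JSON key names are taken from the description above and may need to be
adapted.

    # Minimal loading sketch for one camera view of a scene (illustration
    # only). Assumptions: files are stored as PNGs named image0.png,
    # depth0.png, ..., and imageio preserves the stored channel order
    # (flowX, flowY, zeros).
    import json
    import imageio.v3 as iio
    import numpy as np

    def load_view(scene_dir, idx):
        # Calibration and scaling information (key names as listed above)
        with open(f"{scene_dir}/data{idx}.json") as f:
            data = json.load(f)

        # 8 bit grayscale image
        image = iio.imread(f"{scene_dir}/image{idx}.png")

        # 16 bit depth; zeros mark missing / masked pixels
        depth_raw = iio.imread(f"{scene_dir}/depth{idx}.png").astype(np.float64)
        mask = depth_raw > 0
        depth = depth_raw * (data["maxDepth"] - data["minDepth"]) / 65535 + data["minDepth"]
        depth[~mask] = 0                  # re-apply the mask after shifting

        # 24 bit tangent space normals -> spatial space
        n = iio.imread(f"{scene_dir}/normal{idx}.png").astype(np.float64)
        normal = np.stack([2 / 255 * n[..., 0] - 1,
                           2 / 255 * n[..., 1] - 1,
                           1 - 2 / 255 * n[..., 2]], axis=-1)
        normal[~mask] = 0

        # 16 bit flow (flowX, flowY, zeros), scaled like the depth
        f_raw = iio.imread(f"{scene_dir}/flow{idx}.png").astype(np.float64)
        flow_x = f_raw[..., 0] * (data["maxFlowX"] - data["minFlowX"]) / 65535 + data["minFlowX"]
        flow_y = f_raw[..., 1] * (data["maxFlowY"] - data["minFlowY"]) / 65535 + data["minFlowY"]
        flow = np.stack([flow_x, flow_y], axis=-1)
        flow[~mask] = 0

        return image, depth, normal, flow, mask, data

The calibration entries (K, R, t, lightPos) are returned unchanged in data
and can be used for the vertex map computation described below.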
The presented network uses vertex maps instead of depth maps. These can be
computed from the depth data and the given calibration information by
applying the following operation to each image pixel (x,y):

    V(x,y) = inv(K) * (x; y; 1) / norm(inv(K) * (x; y; 1)) * D(x,y)

The given depth, vertex and normal maps are independent of any camera
pose, as the poses are usually not available beforehand and need to be
computed by the procedure. In order to use them to triangulate point
clouds with respect to the given pose, the vertex maps (or point clouds)
and normal maps can be transformed in the following way. Given a camera
pose P = (R, t), the 3D point with respect to the complete camera matrix
P = K[R|t] is given by:

    V(x,y) = -R^T * t + R^T * V(x,y)

and the normals of the respective 3D points are given by:

    N(x,y) = R^T * N(x,y)

A sketch of both operations is given at the end of this file.


Visualizing the Data
=======================

In addition, there is a Matlab script testData.m that loads the data of a
scene and visualizes the information:

- It writes pointcloudCalib0.ply and pointcloudCalib1.ply to the specified
  path, which contain the triangulated point clouds and a visualization of
  the cameras and light source (see visualization_pointcloud_calib.png).

- It computes warped images using the given optical flow that show the
  transfer from one view to the other (see visualization_flow_warp.png).
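The following Python sketch illustrates the vertex map computation and the
pose transform described above. It is only an illustration: NumPy and
zero-based pixel coordinates (x, y) = (column, row) are assumptions, not
part of the dataset specification.

    # Sketch of the vertex map computation and the pose transform
    # (illustration only; assumes NumPy and zero-based pixel coordinates).
    import numpy as np

    def vertex_map(depth, K):
        # Back-project every pixel onto its viewing ray and scale it to the
        # distance D(x,y); masked pixels (depth == 0) stay at zero.
        h, w = depth.shape
        xs, ys = np.meshgrid(np.arange(w), np.arange(h))
        rays = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)
        rays = rays @ np.linalg.inv(K).T                      # inv(K) * (x; y; 1)
        rays /= np.linalg.norm(rays, axis=-1, keepdims=True)  # normalize the ray
        return rays * depth[..., None]                        # V(x,y)

    def apply_pose(V, N, R, t):
        # Transform camera-space vertex and normal maps with the pose
        # P = (R, t):  V <- -R^T * t + R^T * V   and   N <- R^T * N
        R = np.asarray(R, dtype=np.float64)
        t = np.asarray(t, dtype=np.float64).reshape(3)
        V_posed = V @ R - R.T @ t        # row-vector form of R^T * V - R^T * t
        N_posed = N @ R                  # row-vector form of R^T * N
        return V_posed, N_posed

Applied to a view loaded as sketched above, using K, R and t from the data
file, this should yield point clouds comparable to those written by
testData.m.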