Coll_Vis_Touch

Using Collocated Vision and Tactile Sensors
for Visual Servoing and Localization

Abstract


Coordinating proximity and tactile imaging by collocating cameras with tactile sensors can 1) provide useful information before contact, such as object pose estimates, and visually servo a robot to a target with less occlusion and higher resolution than head-mounted or external depth cameras, 2) simplify the contact point and pose estimation problems and help tactile sensing avoid erroneous matches when a surface lacks significant texture or has repetitive texture with many possible matches, and 3) use tactile imaging to further refine contact point and object pose estimates. We demonstrate our results on objects that have more surface texture than most objects in standard manipulation datasets. We find that optic flow needs to be integrated over a substantial amount of camera travel to be useful in predicting movement direction. Most importantly, we also find that state-of-the-art vision algorithms do not localize tactile images well on object models unless a reasonable prior can be provided by collocated cameras.

Video

Slides

Visual Servoing with Collocated Cameras

Here, we demonstrate how a pair of collocated cameras can be used to correct trajectory errors on the fly using the optic flow observed by the cameras. As a use case, we start with an initial, erroneous estimate of the workspace goal (in world coordinates) and use the optical flow observed by the cameras en route to correct against refined estimates of the goal (specified in pixel space by an external agent such as a separate algorithm or a human annotator). The key idea is that if the current heading direction is correct, the point of expansion corresponding to the current motion of the cameras should coincide with the workspace goal (in 3D coordinates) imaged by the cameras. We first compute the point of expansion and then the error (in pixel space) between the current heading direction and the imaged true workspace goal. This pixel-space error is then lifted to a trajectory error using our knowledge of the camera intrinsic and extrinsic parameters (with respect to the robot), and the trajectory error is corrected.

Below, we provide the algorithm we used to identify the point in image space corresponding to the momentary heading direction of the robot. We call this point the Point of Expansion (POE).
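To make the computation concrete, the sketch below shows one way to estimate the POE from dense optical flow by intersecting the flow lines in a least-squares sense. The use of OpenCV's Farneback flow, the function name, and the magnitude threshold are illustrative assumptions for this write-up rather than the exact implementation in our system.

```python
import numpy as np
import cv2

def estimate_poe(prev_gray, curr_gray, flow_thresh=0.5):
    """Estimate the Point of Expansion (focus of expansion) from dense optic flow.

    Each flow vector, extended backwards, should pass through the POE, so we
    solve the over-determined system v*(X - x) - u*(Y - y) = 0 for (X, Y) in a
    least-squares sense. Illustrative sketch, not the exact pipeline used here.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=21,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    h, w = prev_gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    u, v = flow[..., 0], flow[..., 1]

    # Keep only pixels with significant motion to reduce noise.
    mask = np.hypot(u, v) > flow_thresh
    xs, ys, u, v = xs[mask], ys[mask], u[mask], v[mask]

    # One line constraint per pixel: v * X - u * Y = v * x - u * y
    A = np.stack([v, -u], axis=1)
    b = v * xs - u * ys
    poe, *_ = np.linalg.lstsq(A, b, rcond=None)
    return poe  # (u, v) in pixel coordinates
```

In practice the flow vectors near the POE are short and noisy, which is one reason the instantaneous estimates need to be averaged over a substantial amount of camera travel, as discussed further below.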

Correcting trajectory errors with POE estimates

Let us denote the true workspace goal as \(\mathbf{X}_G^W \) and its estimate as \(\widehat{\mathbf{X}_G^W} \) in homogeneous 3-space. The robot is initially planned to move to \(\widehat{\mathbf{X}_G^W}\); let \(\mathbf{p}\) denote the POE obtained for each subsequent frame, in pixel space. As the camera is registered to the robot, at any given point in time we can compute the world-to-camera projection matrix \(\mathbf{P} = \mathbf{K}[\mathbf{R}_{world}^{cam}|\mathbf{t}_{world}^{cam}]\) and project \(\mathbf{X}_G^W\) and its estimate \(\widehat{\mathbf{X}_G^W}\) to \(\mathbf{x}_G\) and \(\widehat{\mathbf{x}_G}\) in homogeneous 2-space. If the motion between two frames captured by the camera-in-hand is mostly perpendicular to the imaging plane, we can approximate \(\widehat{\mathbf{x}_G}\) with the point of expansion \(\mathbf{p}\) for each subsequent frame. We also note that, since \(\mathbf{p}\) lies in pixel space, this approximation discards the "z-buffer" of the projective transformation associated with \(\mathbf{x}_G\); as a consequence, we cannot correct for an error along the camera's projective axis (the camera's Z axis). We use the camera projection Jacobian to compute the task-space error as follows: $$ \Delta\mathbf{X} = \left[\begin{array}{ccc} \frac{f_x}{Z^W_G} & 0 & -\frac{X^W_G f_x+Z^W_G c_x}{{Z^W_G}^2}+\frac{c_x}{Z^W_G} \\ 0 & \frac{f_y}{Z^W_G} & -\frac{Y^W_G f_y+Z^W_G c_y}{{Z^W_G}^2}+\frac{c_y}{Z^W_G} \end{array} \right]^+ \left[\begin{array}{c} p^u - \mathbf{x}_G^u \\ p^v - \mathbf{x}_G^v \end{array} \right] $$ In the equation above, \([f_x, f_y, c_x, c_y]\) are the camera intrinsics, \(\mathbf{X}_G^W = [X^W_G, Y^W_G, Z^W_G]\), and \([\cdot]^+\) denotes the Moore-Penrose pseudo-inverse. We use the following procedure to correct trajectory goals as the robot moves towards the true workspace goal \(\mathbf{X}_G^W\) from an erroneous initial estimate.
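A minimal sketch of this correction step is given below, assuming the goal coordinates \([X, Y, Z]\) that enter the Jacobian are available in the camera frame and writing the third column of the Jacobian in its algebraically simplified form; the function and variable names are hypothetical.

```python
import numpy as np

def task_space_correction(poe_px, goal_px, goal_xyz, fx, fy):
    """Lift the pixel-space error between the POE and the imaged goal into a
    task-space correction via the pseudo-inverse of the projection Jacobian.

    goal_xyz = [X, Y, Z] are the goal coordinates that enter the Jacobian
    (assumed here to be expressed in the camera frame); poe_px and goal_px
    are (u, v) pixel coordinates.
    """
    X, Y, Z = goal_xyz
    # Jacobian of the pinhole projection u = fx*X/Z + cx, v = fy*Y/Z + cy.
    # The third column is the simplified form of the expression in the
    # equation above: -fx*X/Z^2 and -fy*Y/Z^2 (the c_x, c_y terms cancel).
    J = np.array([[fx / Z, 0.0,    -fx * X / Z**2],
                  [0.0,    fy / Z, -fy * Y / Z**2]])
    pixel_err = np.array([poe_px[0] - goal_px[0],
                          poe_px[1] - goal_px[1]])
    # The pseudo-inverse maps the 2D pixel error back to 3D; the component
    # along the camera's optical (Z) axis remains unobservable.
    return np.linalg.pinv(J) @ pixel_err
```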

Empirical evidence of error averaging across 10 cm steps

In this experiment, we move the robot vertically down by 65 cm to a goal location slightly below the yellow cross mark on the handle of the glue gun. There are no errors in the goal location being tracked in this case. The experiment is repeated 10 times. The red dots are the predictions of the potential point of contact (computed as the instantaneous POE). We note that the predictions are centered about the actual point of contact -- a point slightly below the yellow cross mark on the glue gun. We report 6 cases in which we predict the potential point of contact from various intervals of the trajectory. We report the interval lengths (in cm) and the standard deviation of the predicted point of contact (in pixels) as the labels of the figures. We note that as we increase the length of the interval, the standard deviation of the prediction decreases (seen as "clumping" of the predicted points of contact), but the number of available predictions also decreases (seen as fewer red dots at larger interval sizes). This leads us to conclude that averaging across larger temporal (and spatial) windows produces smoother and more stable error signals (or correction signals in the case of visual servoing). For our use case, averaging across 10 cm intervals provided enough correction signals while keeping the variance reasonably low.
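The averaging itself is straightforward; a hypothetical helper along these lines groups the instantaneous POE estimates by intervals of camera travel and averages within each interval:

```python
import numpy as np

def interval_averaged_poes(poe_history, travel_history, interval_cm=10.0):
    """Group instantaneous POE estimates into consecutive intervals of camera
    travel and average within each interval, yielding one lower-variance
    prediction per interval. Hypothetical helper: poe_history is a list of
    (u, v) estimates and travel_history the cumulative travel (in cm) at
    which each estimate was made.
    """
    poes = np.asarray(poe_history, dtype=float)
    travel = np.asarray(travel_history, dtype=float)
    # Assign every estimate to a travel interval [0, 10), [10, 20), ...
    bins = np.floor(travel / interval_cm).astype(int)
    return np.array([poes[bins == b].mean(axis=0) for b in np.unique(bins)])
```

Larger intervals average more estimates per prediction (lower variance) but produce fewer predictions over a fixed approach distance, which is the trade-off visible in the figures above.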

Results

We used the following objects for our localization experiments. From top (L-R): a folding knife, a glue gun, a wooden clip, a monkey from the Barrel of Monkeys game, a circuit board, a box cutter and a textured metallic pin. A 10cm \(\times\) 1cm rectangle is inserted at the bottom for scale.

In the experiment above, we focus on the case where the contact point has a unique arrangement of useful tactile features. This is the best case for a combined approach. If the tactile sensor contacts a featureless area, that can be detected and the tactile sensor output can be ignored. For each of the 6 objects used in this work, we fix the object to the robot table. Assuming that there is zero error in the position control of our robot (we use a Universal Robots UR5E manipulator fixed to a vention.io table of the design recommended for this robot), we register the object with respect to the robot base and treat this pose as our ground truth. Next, we move the robot vertically to 1 m above the object, move it down to touch the object at a chosen point that will yield good tactile information, and localize the object with respect to the robot. We repeat this 3 times for the same object position and repeat the experiment for 2 more positions of the object with respect to the robot -- i.e. we localize each object 9 times with respect to the robot. Fixing the objects is a restrictive assumption in the context of localizing objects, especially with touch; however, to ensure repeatability of the experiments reported in this section, we had to fix the objects to a rigid base. Following recent literature, we report the repeatability of our pose estimation pipeline as the measure of its performance. Using tactile sensing, the localization errors were brought down to \(\pm 1.5\)mm in translation and \(\pm 0.5^o\) in rotation, from about 1.5cm in translation and \(2^o\) in rotation using only vision [3]. However, for cases where the tactile features were not unique, e.g. the box cutter teeth and the slender metallic pin, tactile sensing actually increased the localization errors in the horizontal directions. The order of these errors was equivalent to the scale of the repeated features -- 5mm for the box cutter (the teeth are about 3mm wide, placed at intervals of 5mm) and about 2mm for the long textured metallic pin (the embossed features are very similar at intervals of 3.5mm). This observation is consistent with the fact that the final gradient descent step, which refines the camera-based pose estimates, converges to a wrong local minimum if the tactile measurements are not distinctive enough.
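As a rough illustration of this refinement step and its sensitivity to repetitive texture, the sketch below refines an in-plane pose by minimizing the distance between tactile contact points and the object model, starting from the vision-based estimate. It is only a stand-in under stated assumptions: the objective, the 2D point-set representation, and the gradient-free optimizer used here (in place of the gradient descent in our pipeline) are illustrative, not our actual implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial import cKDTree

def refine_planar_pose(contact_pts, model_pts, x0=(0.0, 0.0, 0.0)):
    """Refine an in-plane pose (tx, ty, theta) by minimizing the distance
    between tactile contact points and the object model, starting from the
    vision-based estimate x0. Illustrative sketch only; not the exact
    objective or optimizer used in our pipeline.
    """
    tree = cKDTree(model_pts)           # nearest-neighbour queries on the model

    def cost(x):
        tx, ty, th = x
        R = np.array([[np.cos(th), -np.sin(th)],
                      [np.sin(th),  np.cos(th)]])
        transformed = contact_pts @ R.T + np.array([tx, ty])
        d, _ = tree.query(transformed)  # distance to the closest model point
        return np.sum(d**2)

    res = minimize(cost, x0, method="Nelder-Mead")
    return res.x                        # refined (tx, ty, theta)
```

With repetitive features such as the box cutter teeth, this objective has several nearly equivalent minima spaced at the feature pitch, which is exactly where a poor or featureless tactile signal lets the refinement lock onto the wrong one.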

In this experiment we present the effect of randomly selected contacts on localization. For this set of experiments, we fix each of the objects, together with the black background plate used in the experiments above, to a graduated compound slide capable of in-plane translation and rotation. We then moved the robot vertically down to make contact at the same location on the object used for the experiments reported in Table I to generate a starting pose. Next, we generated 5 random configurations per object in translation and orientation on the plane of the table, moved the robot vertically down to touch the object, and attempted to recover the randomly generated pose perturbations we introduced. For each of the objects, as expected, we observed localization errors using only vision similar to those reported in Table I. For the box cutter, most of the contacts yielded useful tactile signals, so the errors in recovering the pose perturbations were in the range reported in Table I -- i.e. \(\sim 8\)mm in translation and \(\sim 1.5^o\) in rotation. This observation was also consistent for the smaller textured objects (the monkey, the metallic pin and the circuit board). However, tactile sensing was not always helpful in localizing the objects -- for the glue gun and the folding knife, significant parts of the object were featureless, and the tactile signals obtained when touching these parts were unusable for localization: the final gradient descent step re-introduced localization errors of about \(3\)-\(4\) cm and \(15^o\) by converging to incorrect poses. In Table II, we provide a subset of the results of these experiments along with possible causes of the failures.

Some tactile signals captured by our system and a qualitative representation of the final result of our localization pipeline.

Citation

TBD

Acknowledgements

We thank Leonid Keselman, Oliver Kroemer, Ben Eisner, Arpit Agarwal and Rishi Veerapaneni for several constructive discussions and feedback on the manuscript. This research was partially funded by the NSF and the Toyota Research Institute. The website template has been borrowed from Michaël Gharbi.