Explorations in the Design of a Computer-Vision-based Electronic Travel Aid

By Andrew B. Raij (raij@cs.unc.edu)

Enabling Technology Final Report



Table of Contents


  1. Introduction
  2. Overview
  3. Previous Travel Aids
  4. Spatial Understanding
  5. Tactile Feedback
  6. Conclusion
  7. Appendix: Stereo Camera Depth Extraction
  8. References




Introduction

            The electronic travel aid (ETA), as imagined by numerous researchers ([Shoval et al 2000], [Ulrich and Borenstein 2001] and [Zelek et al 2000], to name a few), is a device that assists a blind user with everyday orientation, mobility and navigation.  In general, any ETA has three major components.  The first is the input device, such as a sonar, camera or GPS, that collects information like depth or world position from the environment.  The second is the transformation component, which transforms the gathered information into a format that is useful to a blind person.  Finally, the output component passes the transformed information on to the user.  The clear questions that arise from this description are:


        What is the appropriate information to gather from the world?

        How should this information be transformed and communicated to the user? 


            From the start, I have assumed that the most immediately useful piece of information that a person who is blind lacks is a sense of what obstacles, if any, are ahead.  For this reason, my entire project has focused on providing the depth of the environment to a user.  Furthermore, an interest in computer vision led to the early decision to use cameras and the mathematics of multiple view geometry to extract this depth from the world.  My project attempts to address mobility issues but does not address issues of orientation or navigation directly.  Given this decision, the important question left to answer was how best to transform and present depth information to a blind user so that it is useful.  I correctly anticipated that this would be a hard question to answer.





Overview

In the next few sections, I discuss the path I took as I attempted to answer how depth information can be passed to the user.  First, I discuss some travel aids that present depth to the user and the advantages and disadvantages of these devices.  I then discuss some recent research into the spatial understanding of the blind versus the sighted and the potential of tactile maps in bridging the gap between the two.  Finally, I look into different tactile modalities used to express information.  As I do so, I propose some of my own ideas for tactile-feedback-based ETAs.  Also, for lack of a better place to put it, I provide an appendix where I discuss issues related to the use of cameras and one possible way of extracting depth from the world using two cameras.


Previous Travel Aids

            In this section, I highlight a number of travel aids whose characteristics lead to advantages and disadvantages I want to point out.  There are certainly more travel aids in existence than are discussed here, but I think the set I present has certain qualities that make it stand out.


The Cane: Probably the most commonly used travel aid is the white cane.  The cane has been the trusty tool of people who are blind for quite some time.  It is actually much more than a travel aid, as a wonderful essay on the subject explains, but for the purposes of this report I focus on its use as a depth-sensing tool.  The cane allows a user to test for very local depth of the environment.  Under the rules of proper cane use, the user of the cane walks in the direction desired, sweeping the cane back and forth across the area directly in front of the body.  When a cane strikes an obstacle, the user learns where the obstacle is relative to the body.  The user might also learn what kind of object was struck based on the sound emitted or the forces and vibrations felt through the cane on impact.  After striking an obstacle, the cane user can then use the cane to further probe the obstacle and determine how to avoid it properly.

Although the cane is used extensively today, it has some drawbacks that should be pointed out.  The first major drawback is that proper caning technique only samples the depth of the world in the space directly in front of and below the waist of the user.  It does not provide depth information about, for example, a low-hanging tree branch or a smaller doorway.  Another major problem with the cane is that it requires actively, physically scanning the environment by sweeping the cane back and forth.  This physical work can not only be tiring but also introduces a temporal dependency into the depth information gathered from the world.  Due to the sweeping motion, depth information about a point in space is only guaranteed true at the time when the point is probed.  There is no guarantee that an empty point in space sampled at time t will not contain an object at time t+1.  However, it should be noted that it is somewhat rare for an object to move into the space tested at time t in the short amount of time between when the cane probes a point and when the cane user reaches that point.  In any case, this temporal dependency in depth probing is very unlike vision-based depth probing, where knowledge about the entire space in front of the user is available at every instant of time.  Finally, the last major problem with the cane is that its relatively short length prevents the user from gathering long-range depth information.  Such information might enable a person who is blind to make path planning decisions earlier.


NavBelt: The NavBelt by [Shoval et al 2000] is a project that attempts to apply obstacle avoidance algorithms from mobile robotics to the design of an ETA.  The NavBelt uses a set of mobile robot sonars worn at the user's belt level to probe the world for depth.  In what is called the guidance mode, the depth information detected by the sonars is used to build a very local two-dimensional bird's-eye map of the world near the user.  The map, known as a vector field histogram (VFH), is made up of a set of cells, and each cell contains a number indicating how likely it is that an object is located in the space the cell corresponds to.  As the user navigates an environment, the VFH constantly rebuilds and updates itself based on how the user moves and the information returned from the sonar sensors.  (It should be noted that this requires good information about how the user moves, a difficulty the designers of the NavBelt did not address in their work.)  The VFH is then processed and a directional tone is used to indicate which direction the user should walk in.  The frequency of the tone indicates how fast the user should walk in the direction indicated.  In what is called the image mode, the depth of the scene in front of the user is mapped to an acoustic image.  Sound sweeps from left to right through the head of the user, and sounds with a higher pitch and volume correspond to areas of low depth.  Note that these two modes represent two styles of providing information to the user.  In the guidance mode, the user is essentially told where to go and doesn't have to think much about what they are doing, whereas in the image mode, the raw depth is passed to the user and it is the user's job to decide which direction to go in.  Which approach is better may be an issue of personal preference.
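
To make the guidance-mode idea more concrete, here is a minimal sketch of a VFH-style computation.  It is not the NavBelt implementation: sonar echoes raise certainty counts in a small grid around the user, the grid is collapsed into a polar histogram of obstacle density, and the emptiest sector is taken as the direction to walk.  The grid size, cell size, sector count, distance weighting and all function names are assumptions made for illustration, and the sketch ignores the user's own motion entirely.

```python
import math

GRID = 21      # cells per side (assumed); the user stands in the centre cell
CELL = 0.25    # metres per cell (assumed)
SECTORS = 12   # 30-degree angular sectors (assumed); sector 0 is centred straight ahead


def empty_grid():
    """Certainty grid: each cell counts sonar hits reported for that patch of floor."""
    return [[0] * GRID for _ in range(GRID)]


def add_reading(grid, bearing_deg, range_m):
    """Record one sonar echo. bearing_deg is counter-clockwise from straight ahead."""
    centre = GRID // 2
    ahead = range_m * math.cos(math.radians(bearing_deg))
    left = range_m * math.sin(math.radians(bearing_deg))
    row = centre + int(round(ahead / CELL))
    col = centre + int(round(left / CELL))
    if 0 <= row < GRID and 0 <= col < GRID:
        grid[row][col] += 1


def polar_histogram(grid):
    """Collapse the grid into one obstacle-density value per angular sector."""
    centre = GRID // 2
    width = 360.0 / SECTORS
    hist = [0.0] * SECTORS
    for row in range(GRID):
        for col in range(GRID):
            hits = grid[row][col]
            if hits == 0 or (row == centre and col == centre):
                continue
            ahead = (row - centre) * CELL
            left = (col - centre) * CELL
            angle = math.degrees(math.atan2(left, ahead)) % 360.0
            sector = int(((angle + width / 2) % 360.0) // width)
            hist[sector] += hits / math.hypot(ahead, left)  # nearer cells weigh more
    return hist


def suggested_bearing(hist):
    """Signed bearing (degrees) of the emptiest sector; ties go to the one nearest straight ahead."""
    width = 360.0 / SECTORS

    def key(i):
        centre_angle = i * width
        deviation = min(centre_angle, 360.0 - centre_angle)
        return (hist[i], deviation, i)

    best = min(range(SECTORS), key=key)
    bearing = best * width
    return bearing - 360.0 if bearing > 180.0 else bearing
```

With an empty grid the sketch suggests walking straight ahead; once an echo is recorded directly in front of the user, it suggests the nearest free sector instead.  A real guidance mode would also have to shift the grid as the user moves, which is exactly the difficulty noted above.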

            One major advantage of the NavBelt over the cane is that it does not require active sweeping of an environment.  This is taken care of automatically by the array of sonars, and since there is no sweeping of the sensor involved, there is no time dependency in the depth information.  (Note, however, that in the image mode, this time dependency is reintroduced through the scanning mechanism.)  Also, depending on the quality of the sonars, the NavBelt can probe for long-range depth.  For the guidance mode, the VFH algorithm inherently remembers longer-range depth, and as the user gets closer to something once detected as farther away, the VFH algorithm will update the map using both the previous long-range depth measurements and the short-range depth measurements that are found as the user gets closer.  In other words, the confidence level of long-range depth will be very high if the obstacle in the distance is approached.  This will translate to better obstacle avoidance in the guidance mode or a higher pitch and volume in the image mode.  For the image mode, long-range depth will just be presented as a low-pitch, low-volume sound, or as nothing at all if the obstacles in the distance are too far away.

            However, there are some disadvantages to the NavBelt.  Just as in the case of the cane, the sonars might not be placed correctly to sample depth in front of the head or other areas of the body.  This depends very much on where the user's waist is in relation to the rest of her body.  Furthermore, the use of one horizontal array of sensors and a 2D bird's-eye map leads to very good horizontal resolution of depth but very poor vertical resolution of depth.  In image mode, a user could be told that an obstacle is immediately ahead, but it would not be clear if the obstacle is at floor level or head level.  Another potential drawback of the NavBelt is its use of auditory feedback.  For some, auditory feedback might make it harder to pay attention to important auditory information not conveyed by the NavBelt.  Everything from listening for an upcoming hallway to having a conversation could be made more difficult using the NavBelt.  This is even more the case in image mode.  When tested, the image mode was found to be too confusing and distracting for users.  It required lots of training and, even then, did not allow users to navigate the environment as quickly as those who used guidance mode.


GuideCane: Based on some of the flaws of the NavBelt, its designers built a new ETA called the GuideCane ([Shoval 2000][Ulrich and Borenstein 2001]).  The GuideCane goes one step further in integrating mobile robot technology into an ETA.  An array of sonars is mounted to a wheeled robot.  Connected to the top of the robot is a standard white cane that is held by the user.  As the user pushes the robot via the cane, the robot uses the sonars and a new version of the VFH algorithm, VFH+, to detect obstacles ahead.  If an obstacle is ahead, the robot will turn its wheels away from the obstacle.  The user will feel this turn through the torque applied to the cane as well as the force preventing the user from pushing the robot forward easily because the cane is not lined up with the robot's wheels.  The designers of the GuideCane found that the users quickly learned how to turn the cane when this occurs so that they could continue moving about the world.  After the obstacle is navigated, the GuideCane robot will turn back towards the original course chosen by the user.

In general, the GuideCane allowed fast navigation in urban and suburban environments where there is relatively flat ground for the robot to roll on.  The GuideCane gives the user control over global navigation but leaves local navigation to the robot.  The major advantage of the GuideCane is its intuitive force and torque-based feedback mechanism.  This interface allowed users to navigate environments faster than those using the NavBelt's auditory feedback.  The problem of auditory feedback mechanisms interfering with a user's ability to hear the world is avoided, and it is shown that there is a feedback mechanism appropriate for ETAs that can be better than audio.  The success of this tactile form of feedback is one of the reasons I decided to explore tactile feedback in later stages of this project.  Another important advantage of the GuideCane is that the robot is always in front of the user, so if the robot fails to turn before an obstacle is encountered, the robot collides with the obstacle, not the user.

However, there are a number of disadvantages to the GuideCane.  The two major disadvantages are again that depth is not sampled high enough with respect to the body (in this case, it's mostly sampled very low) and that the vertical resolution of this depth is, again, very low.  Furthermore, the simple two-wheeled robot is not adequate for environments where the ground is not very flat.  Finally, the bigger question with the GuideCane is how users feel about being led around by a robot.  Do users feel like they do not have enough control, or any control at all?  Do users feel awkward about the idea of being led around by a robot?  These issues likely vary from person to person.


The vOICe: The vOICe by [Meijer 1992] is not so much an ETA as it is a software package that could be integrated into an ETA.  The vOICe provides an interface for inputting either grayscale or red-green anaglyphic (for depth) images for processing into auditory feedback.  Like the NavBelt image mode, images are scanned column by column from left to right and corresponding sounds are played spatially from left to right.  The pitch of the sound corresponds to the height of patterns in the image and the loudness of the sound corresponds to the intensity or depth of patterns in the image.  The best way to get a sense of how this works is to visit the vOICe website and download the free software available there.
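
The column-by-column mapping described above can be sketched as follows.  This is only an illustration of the general scheme, not the vOICe's actual algorithm or parameters: the frequency range, the exponential spacing of pitches, the one-second sweep and the function names are all assumptions.

```python
def row_frequency(row, n_rows, f_low=500.0, f_high=5000.0):
    """Frequency in Hz for an image row; row 0 is the top of the image and gets the highest pitch."""
    if n_rows == 1:
        return f_low
    height = (n_rows - 1 - row) / (n_rows - 1)   # 0.0 at the bottom row, 1.0 at the top row
    return f_low * (f_high / f_low) ** height    # exponential pitch spacing (assumed)


def image_to_soundscape(image, sweep_seconds=1.0):
    """Turn a grayscale image (rows of 0..255 values) into a list of time slices.

    Each slice is (start_time, pan, tones), where pan runs from -1.0 (left) to
    1.0 (right) and tones is a list of (frequency_hz, loudness) pairs with
    loudness in 0..1.  Black pixels produce no tone.
    """
    n_rows = len(image)
    n_cols = len(image[0])
    slices = []
    for col in range(n_cols):
        start = sweep_seconds * col / n_cols
        pan = -1.0 + 2.0 * col / (n_cols - 1) if n_cols > 1 else 0.0
        tones = []
        for row in range(n_rows):
            value = image[row][col]
            if value > 0:
                tones.append((row_frequency(row, n_rows), value / 255.0))
        slices.append((start, pan, tones))
    return slices
```

For a depth image, the pixel value would encode nearness rather than brightness, so that closer surfaces sound louder.  Turning the slices into audible sound is left out of the sketch.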

The big advantage of the vOICe over some of the other ETA technologies discussed earlier is that the information it provides is sampled from the world in many places over a wide field of view.  Furthermore, these samples have high resolution in both the vertical and horizontal directions.  This level of information could allow a user to have a spatial awareness of the environment more similar to human vision that is not available in other travel aids.  In fact, some users have claimed in testimonials that the experience of using the system is akin to actually seeing. 

            However, the major disadvantage of the vOICe is that the complicated sound patterns produced do not intuitively map to an understanding of the information being presented.  Just as with the NavBelt image mode, lots of practice and training is needed to understand the images.  Another important disadvantage that should be noted is the temporal dependency that comes from sweeping through the information.  In this case, the information in one sweep through the image is guaranteed to be from the same moment in time, but this does not guarantee the information is still accurate at the time the information is acted on.


Synopsis: There are a few running themes in my analysis of travel aids.  The first is where depth is sampled and how high the resolution of the sampling is.  Clearly the use of cameras is a good choice for getting depth from the world because it automatically samples the world vertically and horizontally and with high resolution in both directions.  Furthermore, the field of view of cameras can be expanded with different lenses or through the use of several cameras.  The second major theme is that sound can be a distracting method of presenting information to a user and that there is some potential for tactile feedback to the user.  The third major theme is the issue of presenting depth data to the user with temporal dependencies.  Use of cameras removes the temporal dependencies from the data acquisition stage but does not necessarily remove the dependencies from the data presentation stage.  Hence, one question will be how to pass depth in the vertical and horizontal to the user without incorporating these temporal dependencies.  Of course, in many situations, this temporal dependency will not affect the quality of the depth data (e.g. trees don't move), making this question moot.  The fourth theme is related to control and whether the user wishes to be led or wishes to decide where to go herself.  The final theme is related to appearances and the question of designing an ETA that does not make the user appear different from others in the world.  These last two themes relate more than anything to individual preferences, and there is clearly no right or wrong answer in making decisions related to these themes.


Spatial Understanding

            Before exploring tactile feedback, I wanted to learn more about what is known about the understanding of space by people who are blind to see if there were any interesting insights that could be exploited in my design.  I turned to papers by [Kitchin et al 1997] and [Ungar 2000] for such insights.

In general, it is considered true that the sighted and people who were blinded later in life have significantly better spatial skills than those who become blind early in life or are born blind.  The big question is why.  There are three prevailing modes of thought within the psychology community regarding why this is the case.  The first theory, often called deficiency theory, is that the experience of vision is absolutely necessary to understand space and gain spatial skills.  Hence, under this theory, the early and congenitally blind cannot possibly have such skills.  The second theory says that the early and congenitally blind do possess spatial skills, but that these skills are inferior to the skills possessed by the sighted because they are based only on information gained through hearing and touch.  This theory is often called inefficiency theory.  The last theory, called difference theory, says that the early and congenitally blind possess different spatial skills than the sighted do but that these skills are functionally equivalent to that which the sighted possess.  These functionally equivalent spatial skills allow a blind person to understand space just as well as a sighted person.  Plenty of experiments show that the early and congenitally blind clearly understand spatial concepts so deficiency theory has fallen out of favor in the psychology community.  However, no consensus has been reached on whether inefficiency or difference theory is correct, although recent work does appear to favor difference theory.  This recent work hypothesizes that the difference manifests itself in different strategies for understanding and navigating space.

Most studies on spatial understanding separate the world into two spaces, haptic and locomotor.  Haptic space is the local space around them that humans can manipulate directly.  Reaching for the mouse on my desk and moving it is a haptic task.  Typically action in haptic space is processed mentally relative to the individual's personal coordinate system.  Locomotor space is the global space of the world.  It is the space that can only be interacted with through motion.  The major advantage the sighted have over the blind in navigating locomotor space is that vision automatically connects the two spaces together.  Vision allows a sighted person to connect their individual coordinate system with the coordinate system of the world.  Recent studies seem to indicate that this lack of connection between haptic and locomotor space, or more to the point, the lack of a distinction between locomotor and haptic space, forces the early and congenitally blind to develop strategies for understanding space relative to their own coordinate systems, in haptic space.  Typically people who are blind do not learn strategies for understanding space in locomotor terms.

Given these theories, Ungar proposes that the blind can certainly learn locomotor strategies for understanding space if a bridge, somewhat equivalent to vision in the sighted, can be built between haptic and locomotor space.  He proposes that tactile maps, maps that relate the location of a person who is blind to the space about them, be used to bridge this gap.  This endorsement of a tactile tool from within the psychology community bolstered my interest in the subject and led me to read a number of papers in this field that I discuss in the next section.


Tactile Feedback

            There are many different kinds of tactile feedback devices.  Due to time constraints, I was not able to review the whole field.  I chose to focus on vibration-based devices and thermal devices, but I will briefly discuss electrocutaneous devices and pin arrays since I've read a bit about them in the previous work sections of papers on tactile feedback.


Electrocutaneous Display: Electrocutaneous devices are essentially devices that stimulate the sense of touch using electricity.  Although such devices sound scary, they can be quite safe for low voltages and constant current.  These devices are not used very much, though, because electric stimulation is not a sensory experience typically experienced by people.  Its meaning is not easily understood or interpreted.


Pin Arrays: A pin array is a device that contains a number of electromechanically controlled pins that can move up and down.  One of the more relevant uses of pin arrays is the mapping of visual images to pins to generate a tactile image.  Pin arrays typically are expensive, heavy, and can break easily.  For this reason, they are not well suited to mobile applications like an electronic travel aid.  However, if they did not have these drawbacks, I believe pin arrays would be well suited to presenting real-time tactile maps to a blind user. 


Vibration Display: Vibration-based displays have been used in a number of contexts.  The most prevalent uses have been in the design of small form-factor devices like personal digital assistants and phones and in the design of wearable devices. 

In the Active Click project by [Fukumoto and Sugimura 2001], an actuator (a device that produces vibration) is attached to a touch panel and is vibrated at higher frequencies or for longer periods of time to express different kinds of feedback.  For instance, a short burst could be used to create the sensation of a click when a user touches the panel.  Furthermore, the placement of the actuator on the touch panel can significantly affect the sensation experienced by the user.  For instance, if the actuator is placed on the back of a PDA, vibrations will be sensed coarsely at the palm when the touch panel is touched.  However, if the actuator is placed just below the surface of the touch panel, the vibration will be experienced directly at the finger when the touch panel is touched.  The authors also propose using several actuators on larger touch panels to be able to express vibration across the whole panel.  One possible application of these ideas to the design of an ETA would be to use several actuators in a grid or at the 4 corners of a panel.  The panel could be mapped to the 2D depth image produced by the cameras.  Then, the vibration produced at each actuator could be made inversely proportional to the actuator's distance from the point or points of highest depth in its neighborhood.  I do not have any intuition for how well this would work.
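
As a rough sketch of the panel idea, the function below drives each actuator with a level inversely proportional to its distance from the nearest point of interest on the panel.  How those points are selected from the depth image is left open here; the gain, the clamping to a maximum level and the names used are assumptions for illustration only.

```python
import math


def vibration_levels(actuators, points, gain=0.1, max_level=1.0):
    """Return one drive level per actuator.

    actuators and points are lists of (x, y) positions in panel coordinates.
    Each level is gain / distance to the nearest point, clamped to max_level;
    an actuator sitting exactly on a point gets max_level, and with no points
    at all every actuator stays silent.
    """
    levels = []
    for ax, ay in actuators:
        if not points:
            levels.append(0.0)
            continue
        nearest = min(math.hypot(ax - px, ay - py) for px, py in points)
        if nearest == 0:
            levels.append(max_level)
        else:
            levels.append(min(max_level, gain / nearest))
    return levels


# Four actuators at the corners of a unit-square panel (hypothetical layout).
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
levels = vibration_levels(corners, [(0.0, 0.0)])
```

In this example the actuator at the point of interest vibrates at full strength, the two adjacent corners vibrate weakly, and the opposite corner vibrates weakest of all.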

The Ambient Touch project by [Poupyrev et al 2002] seeks to pass on information to users of mobile devices ambiently via touch.  Information is said to be communicated ambiently if the user learns the information without being interrupted from whatever task the user is undertaking while the information is communicated.  Ambient communication would be an ideal form of feedback in an ETA since it could potentially allow a user to learn about and respond to the environment as a secondary task while other tasks, like having a conversation, are accomplished.  To support ambient tactile communication, the authors designed a small piezo-electric actuator that is much more flexible than those designed before it.  It can produce a much larger range of frequencies and amplitudes of vibration, and it has approximately 5 ms latency, so sensations are experienced soon enough to correspond with the user's actions.  This is particularly important for the design of vibration feedback in an ETA since a fast response time would be needed for the user to understand and respond to the information presented quickly enough.  The authors also propose a design space in which tactile ambient feedback devices exist.  They present a two-dimensional graph, where one axis indicates the abstraction of the information being expressed with a tactile device and the other axis represents the cognitive load needed to understand the information presented.  When designing an application that uses tactile feedback, it is interesting to think about where the application fits within this design space.  For instance, the handheld vibration panel mentioned in the previous paragraph might fall somewhere in the middle of the abstraction dimension.  The mapping of the panel vibrations to the 2D space in front of the user is more abstract than actually being able to touch each obstacle ahead but is less abstract than, say, mapping the frequency of vibration to the spatial location of the depth.  I'd like to believe that the cognitive load of using such a device would be fairly low since I think the mapping of the 2D space in front of the user to the panel is intuitive.

Another recent project ([Nashel and Razzaque 2003]) in tactile feedback for mobile devices aims to provide a sense of the spatial location of buttons and whether the buttons have been activated.  Part of the goal is to use tactile feedback to make virtual buttons behave more like real buttons.  Probably the most interesting aspect of the project with regard to ETAs is the mapping of different rows of buttons to different frequencies of vibration.  This allows the user to get a sense of which row is being pushed.  One could imagine a similar mapping being used in an ETA, where a high frequency vibration corresponds to high objects and a low frequency vibration corresponds to low objects.
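
The row-to-frequency idea could carry over to an ETA along the following lines.  This is a minimal sketch: the three height bands and their vibration frequencies are invented for illustration and do not come from [Nashel and Razzaque 2003].

```python
# (upper bound of band in metres, vibration frequency in Hz); all values are assumed.
HEIGHT_BANDS = [
    (0.5, 60.0),            # low obstacles, e.g. a curb
    (1.2, 120.0),           # waist-level obstacles
    (float("inf"), 240.0),  # chest- and head-level obstacles
]


def frequency_for_height(height_m):
    """Return the vibration frequency of the first band that contains height_m."""
    for upper_bound, frequency_hz in HEIGHT_BANDS:
        if height_m < upper_bound:
            return frequency_hz
    return HEIGHT_BANDS[-1][1]
```

A small number of discrete bands mirrors the button rows in the original project; whether users could reliably tell the bands apart would have to be tested.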

[Gemperle et al 2001] have done much research into the design of wearable tactile displays and have come to a number of relevant conclusions.  The first is that vibration actuators show the most potential of any form of tactile feedback device.  The combination of low weight, low power, low noise, small inconspicuous size, and the ability to feel them through clothing and put them close to the body make vibration actuators ideal for mobile, wearable applications.  The authors show two different wearable packages they designed.  The first is a ring that encircles the arm and sits on the shoulder.  A number of modules with actuators are attached to the ring.  Another package designed is a vest with pockets designed to hold actuators close to the skin.  The vest is designed to be wearable from a utility and social standpoint.  The authors also conjecture on a possible wearable interface for navigation that uses several actuators.  If one actuator vibrates, it is to indicate the user should walk in the direction of the actuator.  To convey an instruction to rotate, several actuators may go off in a series that travels across the body in the direction of rotation.  Instructions to change velocity and acceleration could be mapped to changing vibration frequency and amplitude.  One concern with a design like this is the potential for information overload.  If several actuators are vibrating at once, possibly even at different frequencies and amplitudes, would it be possible to sense them all individually and react to them all properly?  Furthermore, would this be annoying?

Despite these open questions, I believe an interface like this one that takes advantage of a user's understanding of the space of their body might be very useful in helping a user navigate the world.  In fact, I think the idea could be extended further.  I imagine a distributed tactile feedback network laid out across the body.  When a part of the body approaches an obstacle, the actuators across the body could react to indicate how the user should move to avoid the collision.  Alternatively, the actuators near this body part could react in such a way as to indicate a collision will occur.  The depth gathered from the environment would be presented to the user by mapping each region of the 2D depth map directly to the body part located at that position in the map.  Of course, this mapping is difficult to do because body parts are in motion.  One possible approach would be to track body parts with some kind of on-body tracker or possibly with cameras.  However, placing the cameras for this could be exceedingly difficult and cumbersome since they would have to both be on the body and observe the body.  Another approach would be to use inverse kinematics and some probability information relating to how people move to help guess where different parts of the body are at different times.  Probably the easiest solution, though, would be to pair an individual depth sensor (whether it be a camera or something else) with each actuator.  When a sensor detects an approaching obstacle, the sensor's associated actuator would respond as described earlier.


Thermal Display: Very little research has been done on thermal displays.  Almost everything I know about the field comes from two papers, only one of which is about an actual thermal output device.  The other is a paper on temperature perception.  Almost all the research that has been done comes from virtual environments researchers looking to better approximate the real-world environments they are trying to mimic.  I was initially led to thermal display because I was inspired by the possibility of a handheld tactile map that indicated information about the world via temperature.  Hotter areas of the map could correspond to areas of low depth or alternatively the direction that should be walked in to avoid obstacles (much like the children's game "Hot or Cold").

            The first paper I read, by [Ottensmeyer and Salisbury 1997], discusses the addition of thermal output to a PHANToM force feedback device.  The major thing I learned from this paper is how little is understood about temperature perception in general.  For instance, it is known that there are two kinds of thermo-receptors, one for cold and one for hot, but it is not known where exactly these receptors are under the skin.  It is also not known whether humans detect temperature itself or changes in temperature.  I also got a sense of how temperature is generated for such devices.  In the case of the thermal PHANToM device and others, a Peltier cell is used to generate heat and the heat is dissipated quickly with cold water.

            The second paper I read, by [Jones and Berris 2002], discusses temperature perception in detail from the perspective of researchers trying to understand temperature perception so they could eventually build a thermal display.  I learned about a number of interesting properties of temperature perception.  For instance, there are many more cold receptors than warm receptors under the skin.  In the temperature range of 30-36°C both kinds of receptors will fire but neither cold nor warm sensation is detected.  I also learned that the spatial resolution of temperature is very poor.  It is very hard for a person to identify exactly where within an area of the skin a temperature change is occurring or even how much the temperature is changing in the area.  Only very rapid increases or decreases are noticeable.  Another interesting property of the skin is that it quickly adapts to changes in temperature within the 30-36°C range.  Even if something applied to the skin of a person is initially 33°C and does not change temperature, its temperature will slowly become imperceptible to the person.  Probably the most interesting thermal sensation property of the skin is the spatial summation property.  It turns out that changing the spatial pattern of temperatures applied to an area of the skin can actually change the sense of how much temperature is being applied.  It has been shown that if one doubles the spatial size of the thermal stimulus and halves the temperature of the stimulus, the perception of the temperature of the stimulus will be the same as the temperature perceived before making the change.  This property is absolutely amazing to me.  Unfortunately, I cannot think of a way to exploit it in tactile feedback for an ETA.  Furthermore, the poor spatial resolution of temperature and the slow response time for temperatures that change slowly lead me to think temperature is not well suited to providing spatial depth information to a blind user.

One possible way to use temperature would be to, again, build a distributed thermal display network across the body and rapidly change temperature when a body part approaches an obstacle.  If thermal displays are spread out enough, it's likely the user will be able to discern which ones are heating up.  However, it is not clear if a user will be able to distinguish between different thermal sensations occurring all over the body at the same time.  Furthermore, I could imagine the varying temperatures becoming annoying to a user.



            Despite all the reading and thinking I've done about ways to transform and present 2D spatial depth information to a user who is blind, I have not come to any conclusion as to what approach might work best.  I can only conjecture that there is some potential in the use of distributed sensors and/or tactile feedback devices that can be worn inconspicuously across the body.  I also believe there is potential in the use of small handheld pin arrays to build constantly updating tactile maps.  However, I haven't done any significant reading on pin arrays, so I'm not sure whether this would be too difficult to implement.  I am also ending this report with the realization that although I am interested in computer vision, I am not all that interested in implementing the portion of the ETA that extracts depth from the world.  As this report clearly shows, my interests lie much more in issues of perception and feedback to the user.  I certainly would like to read more about spatial perception, as well as try some of the forms of tactile feedback I looked at to see how users respond to them.


Appendix: Stereo Camera Depth Extraction

            Since I originally indicated this project would deal directly with depth extraction using cameras, I discuss the topic briefly in this appendix for the sake of completeness.  Much of what I discuss is from a book by [Hartley and Zisserman 2000]. 

            Given two images of a scene, we'd like to estimate the 3D locations of a set of interest points seen in both images.  Note that these two images may come from two separate cameras or possibly even from the same camera at two different positions.  The first thing that must be done is camera calibration.  The camera internal parameters, like focal length and radial distortion, should be found using typical checkerboard calibration techniques.  This should be done as a preprocess, before the user ever uses the ETA, and will probably never need to be done again if the cameras used have fixed focal length.  Despite their somewhat flimsy quality, cheap webcams could be good choices for this application because their internal parameters are usually constant.  Calibrating the external parameters, which describe the relationship in space between the two views, is more difficult because these parameters could change over time if the cameras can move.  Probably the best decision to make up front is to use two cameras and mount them rigidly on a board, so that they will not move very much relative to each other over time.  In the case where the views might move (this is certainly true if only one camera is in use), one possibility is to use robust automatic algorithms to recalculate the external calibration parameters.  If only one camera is in use, the external calibration will have to be recalculated at every frame, since the spatial relationship between the views changes at every frame.  There are a number of cases where robust calibration algorithms might fail, in particular if the camera is pointed at a plain wall with no interesting features.  Since a person who is blind might not be able to ensure that the camera is pointed at a scene with lots of features, the external calibration process may be a significant problem.  It seems the best thing to do is to build the most stable stereo camera rig possible and calibrate it before passing it on to the user.
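
            To make the role of the internal parameters concrete, here is a minimal sketch of what they describe: how a 3D point in the camera's own coordinate frame lands on a pixel.  It assumes a simple pinhole model with a single radial distortion coefficient.  The function name, the parameter names and the one-term distortion model are my own illustrative assumptions, not something taken from [Hartley and Zisserman 2000] or from any particular calibration package.

```python
def project_point(point_3d, focal_length, principal_point, k1=0.0):
    """Project a 3D point (camera coordinates) to pixel coordinates.

    Hypothetical pinhole model: focal_length is in pixels, principal_point
    is the (cx, cy) image centre, and k1 is a single radial distortion
    coefficient (k1 = 0 means an ideal, distortion-free lens).
    """
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")

    # Perspective division gives normalized image coordinates.
    xn, yn = x / z, y / z

    # One-term radial distortion: points are scaled by (1 + k1 * r^2),
    # where r is the distance from the optical axis.
    r2 = xn * xn + yn * yn
    scale = 1.0 + k1 * r2
    xd, yd = xn * scale, yn * scale

    # The focal length and principal point map to pixel coordinates.
    cx, cy = principal_point
    return (focal_length * xd + cx, focal_length * yd + cy)


if __name__ == "__main__":
    # A point one unit to the right and two units in front of the camera.
    print(project_point((1.0, 0.0, 2.0), 500.0, (320.0, 240.0)))
    print(project_point((1.0, 0.0, 2.0), 500.0, (320.0, 240.0), k1=0.1))
```

            Checkerboard calibration amounts to estimating values like focal_length, principal_point and k1 from many images of a known pattern; once they are fixed, they can be stored with the device and reused.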

            Once the spatial relationship between the two views is known, the projection matrices of the two cameras can be calculated easily.  These matrices can be used to back-project a point in a camera image to a line in 3D space.  If a point in each view corresponds to the same 3D point, one can determine the location of the 3D point by finding where the back-projected lines through the corresponding points in each view intersect.  Since noise is generally a problem with cameras, the two lines are unlikely to intersect exactly.  A good compromise then is to find the linear least squares answer to where the two lines intersect.
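
            As a rough illustration of the idea above, the following sketch finds the point closest to two 3D lines in the least squares sense.  It uses the midpoint of the shortest segment joining the two lines, which is one simple variant and not necessarily the linear method described by [Hartley and Zisserman 2000].  It assumes each line is already given as a camera center and a direction in a common world frame; the function and variable names are hypothetical.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(center1, dir1, center2, dir2, eps=1e-12):
    """Return the 3D point closest (in the least squares sense) to two lines.

    Each line is center + s * direction. The result is the midpoint of the
    shortest segment joining the two lines, which minimizes the sum of
    squared distances to both lines.
    """
    w = [p - q for p, q in zip(center1, center2)]
    a = _dot(dir1, dir1)
    b = _dot(dir1, dir2)
    c = _dot(dir2, dir2)
    d = _dot(dir1, w)
    e = _dot(dir2, w)

    denom = a * c - b * b
    if abs(denom) < eps:
        raise ValueError("lines are (nearly) parallel; depth is undefined")

    # Parameters of the closest points on line 1 and line 2.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom

    p1 = [p + s * v for p, v in zip(center1, dir1)]
    p2 = [p + t * v for p, v in zip(center2, dir2)]
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))


if __name__ == "__main__":
    # Two cameras one unit apart, both looking at a point 5 units ahead.
    # The second direction is perturbed slightly to mimic image noise.
    print(triangulate_midpoint((0, 0, 0), (0.5, 0.0, 5.0),
                               (1, 0, 0), (-0.5, 0.1, 5.0)))
```

            When the two lines do intersect, the function returns the intersection itself; with noisy directions it returns a point halfway between the two lines at their closest approach.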

            Of course, in order to do this calculation, you need to know which points correspond in the two images.  The spatial relationship between the two views leads to the result that a point in one view corresponds to a point along a particular line (the epipolar line) in the other view (assuming the second point is not occluded).  Given this, one should be able to find correspondences by searching along epipolar lines.  It could be very inefficient to search all epipolar lines in both views for all points, so one possibility is to search only around interesting features in the images, such as edges.  Another thing that can be done to make the process of scanning epipolar lines faster is to orient the cameras so that their image planes are parallel and their scanlines line up.  In this case, epipolar lines correspond to scanlines, and the traversal of the images is very efficient because moving from pixel to pixel across a scanline makes better use of memory coherency.
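
            The scanline search described above can be sketched as follows for the parallel-camera case.  The sketch assumes two already aligned grayscale scanlines stored as lists of intensities, compares small windows using the sum of absolute differences (my choice of matching cost, not one named in the text), and then converts the resulting pixel shift (the disparity) to depth with the standard relation depth = focal length x baseline / disparity for parallel cameras.  All names and parameter values are hypothetical.

```python
def match_along_scanline(left_row, right_row, x_left,
                         half_window=1, max_disparity=16):
    """Find the disparity of the pixel at x_left by searching one scanline.

    A small window around x_left in the left scanline is compared with
    windows shifted 0..max_disparity pixels to the left in the right
    scanline; the shift with the lowest sum of absolute differences wins.
    """
    if len(left_row) != len(right_row):
        raise ValueError("scanlines must have the same length")
    lo, hi = x_left - half_window, x_left + half_window + 1
    if lo < 0 or hi > len(left_row):
        raise ValueError("window falls outside the scanline")

    patch = left_row[lo:hi]
    best_disparity, best_cost = None, None
    for d in range(max_disparity + 1):
        start = lo - d  # the same point appears further left in the right view
        if start < 0:
            break
        candidate = right_row[start:start + len(patch)]
        cost = sum(abs(a - b) for a, b in zip(patch, candidate))
        if best_cost is None or cost < best_cost:
            best_disparity, best_cost = d, cost
    return best_disparity


def depth_from_disparity(disparity, focal_length_px, baseline):
    """Depth of a point seen by two parallel cameras, in units of baseline."""
    if disparity <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline / disparity


if __name__ == "__main__":
    left = [10, 10, 10, 10, 10, 50, 90, 50, 10, 10]
    right = [10, 10, 50, 90, 50, 10, 10, 10, 10, 10]
    d = match_along_scanline(left, right, x_left=6)
    print(d, depth_from_disparity(d, focal_length_px=500.0, baseline=0.1))
```

            Because each search touches only one row of each image, the inner loop walks through contiguous memory, which is the efficiency gain mentioned above.  A featureless row, such as one from a plain wall, would give nearly equal costs for every shift, so a real system would need some test for ambiguous matches.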


References

Fukumoto, M. and Sugimura, T. (2001).  Active Click: Tactile Feedback for Touch Panels, Proceedings of CHI 2001, 121-122.

Gemperle, F., Ota, N. and Siewiorek, D. (2001). Design of a Wearable Tactile Display, In Proceedings of the Fifth International Symposium on Wearable Computers, Zürich, Switzerland, Oct. 2001.

Hartley, R. and Zisserman, A. (2000).  Multiple View Geometry in Computer Vision, Cambridge University Press, 2000

Kitchin, R.M., Blades, M. and Golledge, R.G. (1997). Understanding spatial concepts at the geographic scale without the use of vision, Progress in Human Geography, 21, 2: 225-242

Nashel, A. and Razzaque, S. (2003).  Tactile Virtual Buttons for Mobile Devices, Proceedings of CHI 2003. 

Meijer, P.B.L. (1992). An Experimental System for Auditory Image Representations, IEEE Transactions on Biomedical Engineering, Vol. 39, No. 2, pp. 112-121, Feb 1992. Reprinted in the 1993 IMIA Yearbook of Medical Informatics, pp. 291-300.

Ottensmeyer, M. and Salisbury, J.K. (1997). Hot and Cold Running VR: adding thermal stimuli to the haptic experience, Proceedings of The Second PHANToM User's Group Workshop, October 1997.

Jones, L.A. and Berris, M. (2002). The Psychophysics of Temperature Perception and Thermal-Interface Design, Proceedings of the 10th Symposium On Haptic Interfaces for Virtual Environments and Teleoperator Systems, 2002.

Poupyrev, I., Maruyama, S. and Rekimoto, J. (2002).  Ambient Touch: Designing Tactile Interfaces for Handheld Devices, Proceedings of UIST 2002, 51-60.

Shoval, S., Ulrich, I. and Borenstein, J. (2000).  Computerized Obstacle Avoidance Systems for the Blind and Visually Impaired, In: Teodorescu, H.N.L. and Jain, L.C. (eds.) Intelligent Systems and Technologies in Rehabilitation Engineering. CRC Press, 414-448.

Ulrich, I. and Borenstein, J. (2001).  The GuideCane - Applying Mobile Robot Technologies to Assist the Visually Impaired, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, Vol. 31, No. 2, March 2001, 131-136.

Ungar, S. (2000). Cognitive Mapping without Visual Experience, In: Kitchin, R. and Freundschuh, S. (eds.) Cognitive Mapping: Past, Present and Future. London: Routledge.

Zelek, J., Audette, R., Balthazar, J. and Dunk, C. (2000). A Stereo-vision System for the Visually Impaired, Technical Report 2000-41x-1, School of Engineering, University of Guelph.