Computer Graphics &
Geometry
Interaction in Edutainment Applications Using
Monocular Posture Tracking
Z.
Černeková
Comenius University
Bratislava, Slovakia
cernekova@fmph.uniba.sk
C.
Malerczyk
ZGDV Computer Graphics Center, Germany
cmalerc@zgdv.de
N.
Nikolaidis
Aristotle University of Thessaloniki, Greece
nikolaid@aiia.csd.auth.gr
I.
Pitas
Aristotle University of Thessaloniki, Greece
pitas@aiia.csd.auth.gr
Contents
Abstracts:
In this paper, a method for
recognizing pointing gestures without markers is proposed. The
video-based system uses one camera only, which observes the user in
front of a large screen and identifies the 2D position pointed by
him/her on this screen, his/her arm being in the fully extended
position towards the screen. A GVF-snake is used in order to detect the
pointing hand of the user, which is tracked in the following frames
using the particle filters tracker. The center of gravity of the snake
is used as a feature point and is transformed using linear
transformation directly into the canvas coordinates. The method was
tested on a large screen using applications designed for a wide range
of different and even technically unversed users such as an image
exploration for a virtual museum exhibit or intuitive interaction
applications for gaming purposes. Experiments show very promising
results for recognizing the pointing gestures by using a single camera.
1.
Introduction
Human posture and activity
recognition from video is a very active research area in nowadays
because of its important applications in surveillance, human-computer
interaction and computer animation. Body posture recognition is one of
the most challenging problems in computer vision, because of the
articulated motion of human bodies and the large variations in the
appearance of clothing. Hand gesture recognition in particular,
is closely
related to video-based interaction which is one of the most intuitive
kinds of human-computer-interaction with mixed-reality applications
[1]. Users are not wired to a computer, as
it is necessary e.g. with electromagnetic sensors like data gloves, and
maintain mostly unrestricted freedom of interaction. As a consequence,
video-based interaction is the preferred kind of interaction especially
for technically unversed users. Examples of vision-based hand pointing
interface are the digital desk introduced in [2], the virtual
touchscreen described in [3] and the finger paint system proposed in
[4]. All these systems contain a video projector, a camera and a
planar surface (screen).
Several posture recognition systems
have been presented so far [5], [6]. Many of the
traditional systems are based on still cameras and background
subtraction [7], [8]; the
silhouettes of the subjects are then used in posture recognition. The
disadvantage of this scheme is that background subtraction is not
robust and not always possible, and the method cannot distinguish
postures when body parts are occluded by silhouettes. Using multiple
cameras and extracting
depth information for the persons in the scene was proposed Yamamoto et
al. in [9]. Four stereo cameras mounted in
the corners of a ceiling that look down at an oblique angle allow to
capture entire bodies and faces simultaneously. Thus arm pointing
gestures are recognized without imposing restrictions on the position
and orientation of the user.
In order to recognize pointing gestures many methods detect both hands
and use additional information like head orientation [10] or eye
position [11]. Littmann et al. in [12] show that a modular, neural
network based system
can achieve the visual recognition of human hand pointing gestures from
stereo camera pairs. A person is positioned at one side of a table that
is covered with a black 10x10 grid on a yellow surface. Several neural
networks account for image segmentation, estimation of hand location,
estimation of 3D-pointing direction, and necessary transforms from
image to world coordinates and vice versa. The functions of all network
modules can be learned from data examples only, by exploiting various
learning algorithms.
The above mentioned approaches, employing multiple cameras, are rather
expensive to deploy. Kolesnik and Kulessa in [13]
tested a vision system which consists of a single overhead view camera
and exploits a priori knowledge of the human body appearance,
interactive context and environment. The user controls the motion of
virtual objects by pointing with an arm extended towards the screen.
However, only the horizontal coordinate of the
location pointed on the screen is recognized by this method. In [14],
the authors present a neural
architecture that is capable of estimating a target point from a
pointing gesture, thus enabling a user to command a mobile robot by
means of pointing. They studied whether it is possible to implement a
target point estimator using only monocular images from low-cost
webcams. The results indicate that it is in fact possible to realize a
pointing estimator using monocular image data, but further efforts are
necessary to improve the accuracy and robustness of their approach.
In this paper, we focus on recognizing the cell of a grid on a screen
that is pointed by a user, his/her hand being in the fully extended
position towards the screen. The video-based module uses one camera,
which observes the user in front of the screen. A GVF-snake is used to
segment the pointing hand of the user in the first frame of the video.
The pointing hand is subsequently tracked over time.
The remainder of the paper is organized as follows: In Section 2, the
setup used in our method is described. In Section 3, a description of
the proposed method is provided. Implementation is addressed in Section
4 and the possible applications are presented in Section 5.
Experimental results on pointing gesture recognition are presented and
commented in Section 6 and conclusions are drawn in Section 7.
2.
System Setup
Our testing environment is equipped
with a single uncalibrated firewire camera connected to the computer
feeding the system with greyscaled images. The camera is placed on the
top of the screen at approximately 2 meters from ground, thus observing
the user from the front. The user stands in front of a screen of size
2,0m width and 1,5m height located at approximately two meters in front
of him. The position of the user is somewhat pre-defined with respect
to
the camera set-up. There is a marker on the floor indicating the most
suitable position of the user. It is extremely important that the user
be provided with a visual feedback of his own pointing action.
Different
kinds of feedback can be presented to the user at the interface level,
according to the specific application requirements. In our applications
we have used two basic feedbacks; a small red point (understandable as
a laser pointer metaphor) and a magnifying glass-based feedback, where
the area around the pointed point is zoomed, providing an effect of
holding a virtual magnifying glass with the pointing hand. In Figure 1
one can see the testing environment with the frontal camera and the
user pointing with fully extended arm towards the screen.

Figure.
1. The testing environment setup with the frontal
camera. Exploring Hieronymus Bosch's "The Haywain" triptych with a
virtual magnifying glass.
3. Proposed method
The first task to be solved is
segmentation of pointing hand in the first
frame. The static environment and the fixed camera at our setup allow
using background subtraction. However, due to non-uniform lighting, the
shadows cast by the user may cause problems. Therefore, in order to
properly detect the silhouette of the user, we decided to use an
active contour (snake). We have chosen a snake [15]
that uses the gradient vector flow (GVF) field, computed as a diffusion
of the gradient vectors of a gray-level or binary edge map derived from
the image, as its external force. Advantages of the GVF snake over a
traditional snake include its insensitivity to initialization and its
ability to move into boundary concavities.
The snake is applied in the first
frame in order to localize the
pointing hand. For the initialization of the snake a circle is used.
The user is asked to point at a predefined area at the start of the
session and the circle is centered at the area where the projection of
the hand resides in such pointing gesture. The center of gravity of the
snake is used as the hand (xhd=(xhd,
yhd)) position. In the rest of the
frames the
points are tracked by a particle filters tracker [16],
which showed the best performance during the testing of the method.
Since, the position of the user is
approximately set and the position
and dimensions of the area where the user is pointing (canvas) as well
as the camera position are fixed, his hand can appear only in a certain
sub-area of the camera image. To find this area the user is asked in
the
beginning of the session to point at the top right corner and the
bottom left corner of the canvas and the image coordinates of those two
points are recorded. Then the camera image coordinates of the hand xhd (for a certain frame), with respect to the predefined sub-area, are
transformed into normalized canvas coordinates Yhd in range
[0..1]
using and easily defined linear transformation Tlin.
Yhd = Tlin(xhd)
(1)
In Figure. 2 one can see an output of the particle filters tracker with
the pointing hand detected in a camera frame.

Figure. 2: Pointing hand detected and
tracked in a frame acquired in the frontal camera setup.
4. Implementation
The complete hand pointing recognizer was implemented in
C++. We have implemented the core of the method which was integrated
into the hand pointing recognition system developed in ZGDV Darmstadt,
Germany [1]. In order to speed up the
calculation of the GVF snake in the first frame, which is time
consuming, we calculate the gradient vector flow field only in a
predefined image window, where we expect the hand would appear and not
on the whole image. Four types of trackers were tested, namely:
Kanade-Lucas-Tomasi (KLT) tracker, elastic graph matching [17], modal
analysis tracker [18] and particle filters tracker [16], to obtain the
best tracking results. The best performance in terms of precision and
speed, was achieved by the particle filters tracker.
5.
Applications
The single camera pointing gestures
recognitions module can be used to provide input to a number of
different applications whose goal is to create an intuitively usable
experience for any (even technically unversed) user of the system, who
is curious enough to explore virtual worlds with a new interaction
paradigm like the pointing recognition system. For the creation of new
scenario content it is important to have standardized and easy to use
authoring tools and rendering components at hand. We use the
instantreality-framework [19], [20] for the rendering part of the
applications. The instantreality-framework is a high-performance Mixed
Reality (MR) system, which combines various components to provide a
single and consistent interface for AR/VR developers. The framework
provides a comprehensive set of features to support classic Virtual
Reality (VR) and advanced Augmented Reality (AR) equally well. The goal
is to provide a very simple application interface while still including
the latest research results in the fields of high-realistic rendering,
3D user interaction and total-immersive display technology [19]. The
instantreality-framework uses X3D/VRML as the programming language for
the virtual worlds the user interacts with. Like most traditional
toolkits, it uses a scene-graph to organize the data, as well as
spatial and logical relations. In addition to the scene description, VR
applications need to deal with dynamic behavior of objects, and the
user interaction via non-standard input devices. The use of X3D/VRML as
an application programming language leads to a number of advantages
over a proprietary language [20]:
- It is integral to an efficient development cycle to have access to
simple yet powerful scene development tools. With X3D/VRML, the
application developer can use a wide range of systems for modeling,
optimizing and converting.
- The interface is well defined by a company-independent ISO
standard.
- Due to platform independence, development and testing can even be
done on regular standard desktop computers.
- VRML and JavaScript are much easier to learn than the low-level
interfaces often provided by traditional VR toolkits.
- There are a great number of books and tutorials available.
5.1
Object Exploration Scenarios
As a consequence of using an open
X3D/VRML environment one may easily conceive of many different
immersive 2D and 3D applications (e.g. see Figures 1, 3 and 4) that
benefit from using a pointing based input device. Nevertheless, using a
static pointing posture for interaction is slightly different to the
use of traditional input devices like mouse or keyboard due to the fact
that obviously no selection events like a mouse click are possible
without the interpretation of dynamic hand movements. Therefore, status
event changes are simulated using a time driven position evaluation
within the scenario applications. Within a Java script node of the
application the continuous input stream of pointing positions is
observed. A selection event is automatically generated either if the
user is pointing at a predefined selection area on the screen for
longer than a given time span or if the speed of the users hand
movement drops below a given threshold.

Figure. 3: Exploring Hieronymus Bosch's
"The Haywain" triptych with a virtual magnifying glass.
A first chosen scenario application addresses the exploration of a
digitized painting using a virtual magnifying glass. For this scenario
the well known triptych “The Haywain” (Figure 1 and 3) by the
Netherlandish painter Hieronymus Bosch (c. 1450-1516) was used.
Paintings of Hieronymus Bosch perfectly fit for an exploration using a
virtual magnifying glass since Bosch is well known for his complex
painted panels featuring fantastic and very detailed portrayals of
demons, fools and other creatures from Eden to hell. The application
directly starts with the full screen exploration of the painting. While
no menu bars or other objects disturb the visual impression of the
digitized painting, the visitor is able to focus solely on the painting
and its details. An additional post processing step in the pointing
gesture tracking module allows an extremely stable position of the
magnifying glass, if the user is bringing an interesting detail into
focus.

Figure.
4: Interaction with a
three-dimensional object using buttons for object rotation and
acceleration-based event generation for transparency switching.
As an application dealing with the presentation of three-dimensional
objects we have chosen the exploration of a traditional
propeller-driven aircraft. Whereas the exploration of the paintings
described above provides permanent visual feedback (e.g. the
visualization of the magnifying glass) by the application idea itself,
here a small blue box is used as a 3d cursor that ensures permanent
feedback and therefore permits precise interaction by the user. During
interaction, the user is able to look upon the aircraft from every
angle by pivoting it on its physical center using four virtual buttons
arranged at the right hand side and the bottom of the screen (see
Figure 4). These buttons are 3d touch sensors fed by mouse over events
delivered by the pointing posture recognition and connected to a
JavaScript method driving the rotation of the plane accordingly.
Furthermore, the virtual model of the plane is build of different
geometrical layers for several parts of the aircraft such as wings,
engine, cockpit or fuselage. By selecting one of those parts (using a
mouse click when interacting in a traditional manner) the fuselage
exterior is switched to transparent or opaque (see Figure 4).
Nevertheless, interaction using the pointing posture tracking (without
events such as a mouse click at hand) is more difficult due to the fact
that because of the rotation of the object, parts of the plane alter
their position on the screen permanently and a simple surveillance of
predefined regions on the screen is insufficient. Therefore, the
pointing position and its speed and acceleration are constantly traced
and evaluated within the rendering application. If every time speed and
acceleration of the pointing position drop below given thresholds, this
is interpreted as a click event at the last recorded position. Thereby
it is possible to generate mouse click like events at an arbitrary
position on the screen or in 3d space accordingly.
5.2 Multimodal Interaction
The
recognition and tracking of a pointing gesture is obviously suitable to
be used for multimodal interaction
combining speech and gesture modalities. While the
instantreality-framework provides a sensor node for speech recognition, the fusion of spoken
input text and pointing based selection events is performed within a generic Java
script node of the application's scene graph. As a first proof-of-concept for the multimodal sensor
fusion we have developed an interactive version of the well known number placement puzzle Sudoku
(see Figure 5, for further information on Sudoku in general see e.g.
http://en.wikipedia.org/wiki/Sudoku). Integers can be entered into
empty cells of the
current game by selecting a cell by pointing at it and saying the word
"number" followed by an integer
between one and nine. To enable conversation and discussion between
users during the game play
and to avoid unwanted misunderstandings of the system the speech
recognition is grammar based.
In addition to spoken integer input like "number one", "number two",
etc. while pointing at an empty cell
of the Sudoku board, additional commands can be used like "delete that"
or "remove this" to
undo wrongly entered numbers and "start new game" or "reset game" to
control the status of
the puzzle.
Figure.5: Solving a Sudoku puzzle using
multimodal interaction like the fusion of pointing and speech
recognition within the rendering application.
6. Experimental results
Experiments were conducted to show
that the proposed pointing recognition system can indeed be used for
human computer interaction. We have tested the method on the scenario
application of the virtual exploration of the "The Haywain", using a
virtual magnifying glass, as described in Section 5. The users reported
that they were satisfied with the experience and the performance of the
system. In order to obtain accuracy of the system the user was asked to
hold a laser pointer while operating the pointing system. The distance
between the trace of the laser pointer on the canvas and the point
identified by the system (and presented to the user as feedback) was
used to measure the accuracy. The maximum distance between the two
points was about 10 cm. These experiments show that very good results
can be achieved for pointing gesture recognition using only a single
camera.
7. Conclusions and discussion
A method for the recognition of pointing gestures
without markers using only a single camera was presented in this paper.
The proposed method uses a GVF snake to detect the pointing hand and
subsequently tracks obtained features over the video. The purpose of
our method is to recognize 2D position pointed by the user on a screen
to enable intuitive video-based interaction in applications like gaming
(chess playing, puzzle solving, sudoku) or virtual museums (selecting a
part of painting in order to obtain information for this part). Very
good results were achieved by the proposed system. Future work includes
improving the performance of the method, providing automatic
initialization of the session and integrating the method to other
applications that have been already implemented, as described in
Section 5.
Acknowledgement
This work has been conducted in conjunction with the `SIMILAR' European
Network of Excellence on Multimodal Interfaces of the IST Programme of
the European Union (www.similar.cc).
This research was partially supported from Slovak Ministry of
Education grant No. VEGA 1/3083/0.
References
[1] Malerczyk C., Dähne P., and Schnaider, M. ,
Exploring
digitized artworks by pointing posture recognition. , In Proc.
2005 6th Int. Symposium on Virtual Reality, Archeology and Cultural
Heritage, Pisa, Italy.November 2005.
[2] P.Wellner. 1993. Interacting with paper on the
digitaldesk. Commun. ACM 36,
7, pp. 86-96.
[3] Maggioni, C., and Kämmerer, B. 1998. GestureComputer
- History, design and applications. Computer
Vision for Human-Machine
Interaction: Cambridge Univ. Press, pp. 23-51, ch. 2.
[4] Crowley, J. L. Mar. 1997. Vision for man-machine
interaction. Robot. Autonom. Syst. 19, 3 - 4, pp. 347-358.
[5] Heidemann, G., Bekel, H., Bax, I., and Saalbach, A.
Aug. 2004. Hand gesture recognition: self-organising maps as a
graphical user interface for the partitioning of large training data
sets. In Proc. 1996 17th Int. Conf.
on Pattern Recognition (ICPR 2004),
vol. 4, pp. 487-490.
[6] Colombo, C., Bimbo, A. D., and Valli, A. Aug. 2003.
Visual capture and understanding of hand pointing actions in a 3-d
environment. IEEE Trans. Systems,
Man and Cybernetics, Part B vol. 33, no. 4,
pp. 677-686.
[7] Kehl, R., and Gool, L. V. 2004. Real-time pointing
gesture recognition for an immersive environment. In Proc. 2004 IEEE
6th Int. Conf. on Automatic Face and Gesture Recognition (FGR04).
[8] Liu, X., and Fujimura, K. 2004. Hand gesture
recognition using depth data. In Proc.
2004 IEEE 6th Int. Conf. on
Automatic Face and Gesture Recognition (FGR04).
[9] Yamamoto, Y., Yoda, I., and Sakaue, K. August 2004.
Arm-pointing gesture interface using surrounded stereo cameras system.
In Proc. 2004 Int. Conf. Pattern
Recognition, Cambridge.
[10] Nickel, K., Seemann, E., and Stiefelhagen, R. 2004.
3d-tracking of head and hands for pointing gesture recognition in a
human-robot interaction scenario. In Proc.
2004 IEEE 6th Int. Conf. on
Automatic Face and Gesture Recognition (FGR04).
[11] Carbini, S., Viallet, J. E., and Bernier, O. August
2004. Pointing gesture visual recognition for large display. In Proc.
2004 Int. Conf. Pattern Recognition, Cambridge.
[12] Littmann, E., Drees, A., and Ritter, H. 1996. Visual
gesture recognition by a modular neural system. In Proc. 1996 Int.
Conf. on Artificial Neural Networks, pp. 317 - 322.
[13] Kolesnik, M., and Kulessa, T. 2001. Detecting,
tracking and interpretation of a pointing gesture by an overhead view
camera. In B.Radig, editor, LNCS:
Pattern Recognition.
[14] Richarz, J., Martin, C., Scheidig, A., and Gross, H.
M. Sept. 2006. There you go! - estimating pointing gestures in
monocular images for mobile robot instruction. In Proc. 2006 IEEE 15th
Int. Symposium on Robot and Human Interactive Communication (ROMAN
2006), pp. 546-551.
[15] Xu, C., and Prince, J. L. 1998. Snakes, shapes, and
gradient vector flow. IEEE Trans.
Image Processing vol. 7, no. 3, pp. 359-369.
[16] Zhou, S. K., Chellappa, R., and Moghaddam, B. 2004.
Visual tracking and recognition using appearance-adaptive models in
particle filters. IEEE Trans. Image
Processing vol. 13, no. 11, pp. 1491-1506.
[17] Stamou, G., Nikolaidis, N., and Pitas, I. 11-14
September, 2005. Object tracking based on morphological elastic graph
matching. In Proc. of 2005 IEEE Int.
Conf. on Image Processing (ICIP
2005), Genova, Italy.
[18] Krinidis, M., Nikolaidis, N., and Pitas, I. 2007. 2d
feature point selection and tracking using 3d physics-based deformable
surfaces. IEEE Trans. Circuits and
Systems for Video Technology, vol. 17, no. 8, pp. 876-888, 2007.
[19] instantreality - advanced mixed reality technology,
project homepage, "www.instant-reality.org", Retrieved September 2007
[20] Behr, J., Dähne, P., and Roth, M. Utilizing x3d for
immersive environments. In Proc. 2004
Web3D.