
Collaborative Virtual Environments (CVEs) are important components of the Human Computer Interaction (HCI) domain. The Second Life virtual 3\textsc{d} world is regarded as one of the most popular CVEs. Second Life imitates the real world, facilitating a social life with interpersonal interactions and social events. This CVE enables people to establish or maintain social relationships, to have exciting real world-like experiences, to exchange ideas, and to feel real-world emotions. In order to communicate emotionally in Second Life, a user must trigger the proper emotional gesture and facial expression of the respective avatar via a click event of the GUI. This is similar to the emoticon approach, and fake emotions can easily be sent. In this paper we introduce a new framework that triggers emotional gestures and facial expressions based on the voice and facial expressions of the user. This approach greatly enhances natural and effortless emotional communication. In our prototype, six primitive emotional states, namely anger, dislike, fear, happiness, sadness, and surprise, were considered. An emotion classification system, which uses short time log frequency power coefficients (LFPC) to represent the speech features and a hidden Markov model (HMM) as the classifier, was deployed as the voice-based emotion classification unit. An OpenCV-based facial point recognition system with an artificial neural network was deployed as the face-based emotion classification unit.

\end{summary}

\begin{keywords}

emotion characterization, Second Life, collaborative virtual environments, emotion classification

\end{keywords}

\section{Introduction}

Natural human communication is basically based on speech, facial expressions, body posture, and gestures. While speech is the dominant conveyor of our thoughts and ideas, actions, postures, and body movements play a major role in enriching interpersonal communication. Social psychologists believe that more than 65\% of the information conveyed in a person-to-person conversation is carried on the nonverbal band \cite{Knapp1978}. Nonverbal communication carries considerable communicative power and helps better understanding \cite{Allwood2002}. Most \textsc{cve}s support communication via chat but offer few other nonverbal cues. The motivation behind this paper is to address this issue and propose a framework in which people can feel real life-like emotions in \textsc{cve}s. Besides the content of the speech, the speech signal conveys messages through prosodic features such as stress, pitch, loudness, juncture, and rate of speech \cite{Fridlund1992}. Visual communication consists of contraction of facial muscles, tint of facial skin, eye movements, gestures, and body postures \cite{Reilly&Seibert2003}. In person-to-person communication, both auditory and visual cues act together to make an emotion-rich communication channel.

Advancement of the computer science and communication fields has encouraged scientists to create 3\textsc{d} \textsc{cve}s. As a result, very innovative 3\textsc{d} worlds such as Second Life and World of Warcraft have evolved. Such \textsc{cve}s have gained great popularity among Internet users and provide new opportunities for social relationships. Studies of Peris et al.\ \cite{Perisetal2002} revealed that relationships built online are healthy and that people consider those relationships as real-life face-to-face relationships. In avatar-based chat spaces, users follow various methods, such as emoticons and click events, to trigger emotions. A study conducted by Rivera et al.\ \cite{Riveraetal1996} revealed that users are more satisfied when conveying emotions with emoticons compared to plain chat. Similar research reveals that some chat space users emphasize their emotions by other approaches like capitalizing text \cite{Huetal2004}. From the results of the above-mentioned experiments, we can come to a reasonable conclusion that enhancing expressiveness greatly enhances the quality of such communication.

In this paper, we address the above issue by incorporating voice- and visual-based emotions into the virtual environment without user interaction. This approach has plenty of advantages over manually assigning emotions to one's own avatar. The manual emotion assigning process distracts the user from the communication process. It also places an extra burden on the user, who must select the proper gesture or emotion in a menu-driven \textsc{gui}. Accidental selection of an incorrect emotion may disturb the mood of both parties, and an incorrect emotion can also be selected deliberately to mislead others.

%Th remainder of the paper is structured as follows.In Section 2, XXXXXXXXX XXXXXXX. Section 3 XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX 4 XXXXXXXX XXXXXXXXXXXX XXXXXXX XXXXXXXXXXX XXXXXXXXXX. The application of the developed Affect Analysis Model in Second Life (EmoHeart) and analysis of the EmoHeart log data are described in Section 5 and Section 6, respectively. Finally, Section 7 concludes the paper.

%[Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis] can be used to add some more information to the introduction.

\begin{figure}[ht]

\centering

\includegraphics[width=0.5\textwidth]{faces.png}

\caption{Faces expressing various emotions in Second Life.}

\label{fig:emotional_faces}

\end{figure}

\begin{figure}[ht]

\centering

\includegraphics[width=0.5\textwidth]{posers.jpg}

\caption{Emotion representations: (a-c) represent anger, (i-k) dislike, (d-f) happiness, (l-n) sadness, (o-q) fear, and (r-t) surprise, each shown from front, side, and rear viewpoints respectively.}

\label{fig:posers}

\end{figure}

\begin{figure*}[ht]

\centering

\includegraphics[width=1\textwidth]{system.png}

\caption{Bimodal emotion representation process}

\label{fig:system_process}

\end{figure*}

\begin{figure}[ht]

\centering

\includegraphics[width=0.5\textwidth]{points.png}

\caption{Feature points of a human face}

\label{fig:senaka-point}

\end{figure}

\section{Related Work}

The field of affective computing has several research avenues for the recognition, interpretation, and representation of affect. Emotional information is conveyed across a wide range of modalities, including affect in written language, speech, facial display, posture, and physiological activity \cite{senakaetal2010}. Biological parameters such as facial electromyograms, the electrocardiogram, the respiration effort, and the electrodermal activity were analyzed for emotion recognition by Rigas et al.\ \cite{Rigasetal2007}. Recent studies on how emotions are conveyed through the vocal channel can be found in \cite{nweetal2003,Scherer2003,nicholson2000,Yacoub2003}. Visual information also carries valuable emotional content. Facial expressions \cite{Rosenblum1994,Pantie2000,pantic2007}, vocal features \cite{kwon2003,nweetal2003}, body movements and postures \cite{Zhao2007,Camurri2003213}, and physiological signals \cite{Picard2001} have been used as parameters for emotion recognition. Multimodal emotion recognition attempts are also recorded in \cite{Bussoetal2004,Kim2005,Sebe2005,Pantic2005,Zeng2009}. Nevertheless, most of the work has considered the integration of information from facial expressions and speech \cite{Zeng07}, and there have been relatively few attempts to combine information from body movement and gestures in a multimodal framework. Gunes and Piccardi \cite{Gunes2007}, for example, fused facial expressions and body gestures at different levels for bimodal emotion recognition. Further, Kaliouby and Robinson \cite{Kaliouby2005} proposed a vision-based computational model to infer acted mental states from head movements and facial expressions. Additionally, many psychological studies have highlighted the need to consider the integration of multiple modalities for a proper inference of emotions \cite{Scherer2007,Ambady1992}. Various kinds of machine learning techniques have been used in emotion recognition approaches \cite{Cowie2001,Pantie2000}.
In multimodal approaches, large numbers of feature parameters from audio, visual, and other biometric measurements have been used. Among the techniques employed, neural networks, hidden Markov models, and Bayesian networks are dominant.

\section{Voice-Based Emotion Classification}

The most widely used speech-based cues for emotion recognition are statistics of prosodic features, especially the pitch and intensity of the sound signal. We have selected \textsc{lfpc}, which is proven to be a good feature set for representing the characteristics of a voice signal \cite{nweetal2003}. A hidden Markov model was deployed as the statistical classifier.

A considerable body of past research on real-time emotion classification is recorded in the literature [citations]. With the exception of Vogt et al.\ \cite{Vogt2008}, none have explained the physical instrument setup used for statistical evaluation of real-time emotion classification. In our experiments, we used an instrument arrangement that mimics real speakers talking to a microphone, as shown in Fig.~\ref{fig:test_setup}. We selected the Berlin database \cite{berlindb2005} and the \textsc{esmbs} database \cite{nweetal2003} to train and test the system. We trained the system both for two emotions (anger and neutral) and for six emotions (anger, dislike, fear, happiness, sadness, and surprise). To test the unimodal accuracy, emotional utterances were played continuously using an audio player. The output was captured via a microphone and forwarded for real-time classification. This arrangement has unique advantages and disadvantages. The setup adds speaker noise to the input channel, and it is arguable whether a mechanical loudspeaker can imitate a real vocal tract. Nevertheless, the possibility of comparing widely used databases in the real-time domain can be considered an advantage.

Our approach windows the voice stream into two-second segments, which are then processed by the classifier. The sound stream is sampled as 16-bit \textsc{pcm} at a sampling frequency of 22.05 kHz. Each speech segment is further divided into 16 ms frames with an overlap of 9 ms between consecutive frames. To reduce spectral leakage, a Hamming window is applied to each frame. [HMM description]
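The framing parameters above can be made concrete with a short sketch. This is a minimal illustration only: the log-spaced band energies below are a crude stand-in for the actual \textsc{lfpc} computation of Nwe et al., and the band count and band edges are assumptions, not values from the paper.

```python
import numpy as np

FS = 22050                                      # 16-bit PCM, 22.05 kHz (per the paper)
FRAME_MS, OVERLAP_MS = 16, 9                    # 16 ms frames, 9 ms overlap
FRAME_LEN = int(FS * FRAME_MS / 1000)           # 352 samples per frame
HOP = int(FS * (FRAME_MS - OVERLAP_MS) / 1000)  # 154-sample frame shift

def frame_signal(segment):
    """Split a two-second segment into overlapping Hamming-windowed frames."""
    n_frames = 1 + (len(segment) - FRAME_LEN) // HOP
    window = np.hamming(FRAME_LEN)
    return np.stack([segment[i * HOP : i * HOP + FRAME_LEN] * window
                     for i in range(n_frames)])

def log_band_energies(frames, n_bands=12):
    """Crude LFPC-like features: log power summed over log-spaced bands."""
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    edges = np.unique(np.logspace(np.log10(2), np.log10(power.shape[1]),
                                  n_bands + 1).astype(int))
    feats = np.column_stack([power[:, lo:hi].sum(axis=1)
                             for lo, hi in zip(edges[:-1], edges[1:])])
    return np.log(feats + 1e-10)                # floor avoids log(0)

segment = np.random.default_rng(0).standard_normal(2 * FS)  # 2 s of noise
frames = frame_signal(segment)                  # (285, 352) frame matrix
features = log_band_energies(frames)            # one feature vector per frame
```

Each two-second segment thus yields a sequence of per-frame feature vectors, which is the kind of observation sequence an \textsc{hmm} classifier consumes.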

\begin{figure}[ht]

\centering

\includegraphics[width=0.5\textwidth]{Test_Setup.png}

\caption{Instrument arrangement to imitate talking to a microphone}

\label{fig:test_setup}

\end{figure}

\section{Facial Expression-Based Emotion Recognition}

\subsection{Feature point extraction}

Facial expressions are emotion-rich sources which can be used for emotion recognition. Therefore, several approaches have been proposed and tested for face-based emotion classification. The most popular features for this purpose are displacements of feature points and movements of regions of the face. [more..]

For our approach, we have selected facial feature point-based parameters to derive the features required for the classification process. These features are typically based on the corners of the eyes, corners of the eyebrows, corners and outer mid points of the lips, corners of the nostrils, tip of the nose, and the tip of the chin. The feature point detection is based on a previously published approach by Vukadinovic and Pantic \cite{Vukadinovic&patnic2005}. The feature point detection process consists of face detection, region of interest (\textsc{roi}) detection, feature extraction, and feature classification. Face detection is performed by Haar feature-based GentleBoost classifiers. The detected face is then divided into 20 regions, each of which is responsible for one point to be detected. This detection process yields 20 feature points, as shown in Fig.~\ref{fig:face_points}. A real example of detected facial points is shown in Fig.~\ref{fig:senaka-point}.
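The division of the detected face into 20 per-point regions can be sketched as follows. Note that Vukadinovic and Pantic define point-specific \textsc{roi}s; the uniform 4$\times$5 grid below is only an illustrative placeholder, and the bounding-box coordinates are hypothetical.

```python
def face_rois(x, y, w, h, rows=5, cols=4):
    """Split a detected face bounding box (x, y, w, h) into rows*cols
    regions of interest, one per feature point to be detected.
    A uniform grid is used here purely for illustration."""
    rois = []
    for r in range(rows):
        for c in range(cols):
            rois.append((x + c * w // cols,   # region origin, x
                         y + r * h // rows,   # region origin, y
                         w // cols,           # region width
                         h // rows))          # region height
    return rois

# Hypothetical face box returned by a Haar-cascade face detector.
rois = face_rois(40, 30, 200, 240)            # 20 regions, one per point
```

Each region would then be searched by its own point classifier, yielding the 20 feature points of the face model.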

\begin{figure}[ht]

\centering

\includegraphics[width=0.25\textwidth]{FrontalFaceModel.jpg}

\caption{Feature point map}

\label{fig:face_points}

\end{figure}

\subsection{Emotion Classification}

The parameters used for classification are the eyebrow and iris distance, height of eye, width of mouth, height of mouth, upper lip to nose distance, and chin to nose distance. The literature reveals that recurrent neural networks are good classifiers for time-varying pattern classification \cite{Medsker2000,Samarasinghe2006}. For prototyping, an Elman neural network is deployed. [description of Elman neural network] The architecture of the network is shown in Fig.~\ref{fig:elman_nn}.
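The defining feature of an Elman network is a context layer that feeds the previous hidden state back into the hidden layer at each time step. The forward pass can be sketched as below; the six inputs match the per-frame distance features and the six outputs the emotion classes, while the hidden-layer size and the random (untrained) weights are purely illustrative.

```python
import numpy as np

class ElmanNetwork:
    """Minimal Elman (simple recurrent) network forward pass."""
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.standard_normal((n_hidden, n_in)) * 0.1
        self.W_ctx = rng.standard_normal((n_hidden, n_hidden)) * 0.1
        self.W_out = rng.standard_normal((n_out, n_hidden)) * 0.1
        self.context = np.zeros(n_hidden)   # copy of previous hidden state

    def step(self, x):
        # New hidden state depends on the input AND the context layer.
        self.context = np.tanh(self.W_in @ x + self.W_ctx @ self.context)
        scores = self.W_out @ self.context
        return np.exp(scores) / np.exp(scores).sum()  # softmax over emotions

net = ElmanNetwork(n_in=6, n_hidden=12, n_out=6)
sequence = np.random.default_rng(1).standard_normal((30, 6))  # 30 frames
probs = [net.step(frame) for frame in sequence]  # one distribution per frame
```

Because the context layer persists across calls to `step`, the class probabilities for each frame depend on the whole preceding feature sequence, which is what makes the network suitable for time-varying facial data.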

\begin{figure}[ht]

\centering

\includegraphics[width=0.5\textwidth]{Elman_srnn.png}

\caption{Structure of an Elman artificial neural network}

\label{fig:elman_nn}

\end{figure}

\noindent
Eyebrow and iris distance:

\begin{math}((E_y - A_y) + (E1_y - A1_y))/2\end{math}

Height of eye:

\begin{math}((F_y - G_y) + (F1_y - G1_y))/2\end{math}

Width of mouth:

\begin{math}J_x - I_x\end{math}

Height of mouth:

\begin{math}K_y - L_y\end{math}

Upper lip to nose distance:

\begin{math}N_y - K_y\end{math}

Chin to nose distance:

\begin{math}N_y - M_y\end{math}

\noindent
where the subscripts \begin{math}x\end{math} and \begin{math}y\end{math} denote coordinates along the X-axis and Y-axis respectively, and the letter combinations refer to points of the face model in Fig.~\ref{fig:face_points}. For each video segment considered, a matrix is created. If a segment contains \begin{math}n\end{math} frames, we obtain a set of \begin{math}n\end{math} vectors, each holding the six values extracted from one frame. The dimension of the resultant matrix is \begin{math}6 \times n\end{math}.
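Computing the six per-frame distances and stacking them into the $6 \times n$ segment matrix can be sketched as follows. The point labels follow the face model of Fig.~\ref{fig:face_points}; treating the vertical distances along the y coordinate (image convention, y increasing downward) is our reading of the feature definitions, and the demo coordinates are synthetic.

```python
import numpy as np

def frame_features(p):
    """Six distance features from one frame; p maps point labels to (x, y)."""
    return np.array([
        ((p['E'][1] - p['A'][1]) + (p['E1'][1] - p['A1'][1])) / 2,  # eyebrow-iris
        ((p['F'][1] - p['G'][1]) + (p['F1'][1] - p['G1'][1])) / 2,  # eye height
        p['J'][0] - p['I'][0],                                      # mouth width
        p['K'][1] - p['L'][1],                                      # mouth height
        p['N'][1] - p['K'][1],                                      # upper lip to nose
        p['N'][1] - p['M'][1],                                      # chin to nose
    ])

def segment_matrix(frames):
    """Stack per-frame feature vectors column-wise into the 6 x n matrix."""
    return np.column_stack([frame_features(p) for p in frames])

# Synthetic frame: each labeled point gets arbitrary (x, y) coordinates.
demo = {k: (i * 3.0, i * 5.0) for i, k in enumerate(
    ['A', 'A1', 'E', 'E1', 'F', 'F1', 'G', 'G1',
     'I', 'J', 'K', 'L', 'M', 'N'])}
M = segment_matrix([demo] * 10)   # a 10-frame segment -> 6 x 10 matrix
```

The resulting matrix is the input presented to the recurrent classifier, one column per video frame.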

\section{Emotion Representation in Second Life}

\subsection{Second life}

Second Life (\textsc{sl}) is a virtual world developed by Linden Lab, launched on June 23, 2003, and accessible on the Internet. A free client program called the Viewer enables its users, called Residents, to interact with each other through avatars. Residents can explore, meet other residents, socialize, participate in individual and group activities, create and trade virtual property and services with one another, or travel throughout the world (which residents refer to as ``the grid''). Second Life is for people aged 16 and over.

\begin{comment}

\subsection{Nonverbal Behavior}

In virtual worlds applications the avatar is a projection of the user and as such should have the capability to reflect one's emotional state to other participants in a virtual space. Understanding how emotional expressions and body language'' relate to and reflect emotional state, and developing avatars capable of projecting that state, is an extremely important aspect of enabling rich social interaction in virtual environments. It is, perhaps, manifest that human beings are, by nature, sensitive to facial expressions, postures and gestures that reflect wide ranges of emotion along a large number of dimensions. On any encounter we can discern whether an individual standing before us is happy or sad, angry or calm, bored, inquisitive, frightened, curious, exasperated ... The question is what can be projected through an avatar to convey such a variety of emotional states. We focus on body language'' --- posture and gesture --- in addition to facial expression to convey emotional state. As Figure~\ref{fig:imgset} shows, avatars can be developed to project distinct human emotions and show clearly what a user is feeling (or wants to indicate they are feeling), irrespective of the presence of facial features.

\end{comment}

\subsection{Expressing Emotions Using Avatar Facial Expressions}

\subsection{Expressing Emotions Using Avatar Gestures}

Virtual characters are vital components of virtual environments, where avatars represent humans. However, creating an interactive, responsive, and expressive virtual character is difficult because of the complex nature of human nonverbal communication through facial expression, body posture, and gesture \cite{vinoetal2002}. Only limited research has been carried out on the representation of affective nonverbal communication through posture and gesture \cite{Coulson2004,Kelsmith2005,Vinayagamoorthy2006}. Coulson has explained the relationship between emotions and body postures in his research findings \cite{Coulson2004}, depicted in Fig.~\ref{fig:posers}. Elaborations of those relationships can be found in Table~\ref{various_poses}. For our prototype, existing postures and gestures of avatars were triggered according to the postures explained in \cite{Coulson2004}, using Linden scripting interfaces.
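On the classifier side, the dispatch from a recognized emotion to a gesture trigger can be sketched as below. This is not the Linden scripting interface itself: the gesture names, the `trigger_for` helper, and the confidence threshold are all hypothetical, and the in-world triggering is performed by Linden scripts as described above.

```python
# Hypothetical mapping from the six emotion classes to gesture identifiers.
GESTURES = {
    'anger': 'gesture_anger', 'dislike': 'gesture_dislike',
    'fear': 'gesture_fear', 'happiness': 'gesture_happiness',
    'sadness': 'gesture_sadness', 'surprise': 'gesture_surprise',
}

def trigger_for(emotion, confidence, threshold=0.6):
    """Return the gesture to trigger, or None below the confidence
    threshold so uncertain classifications leave the avatar unchanged."""
    if confidence < threshold or emotion not in GESTURES:
        return None
    return GESTURES[emotion]
```

Gating on classifier confidence is one way to avoid the accidental-emotion problem noted in the introduction: a low-confidence result simply triggers nothing.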

\section{Evaluation}

\subsection{Voice-Based Emotion Classification}

\subsection{Facial Feature-Based Emotion Classification}

\subsection{Multimodal Based Emotion Classification}

\begin{table*}[!t]

% increase table row spacing, adjust to taste

\renewcommand{\arraystretch}{1.3}

\caption{Emotions and their expression using body postures}

\label{various_poses}

\begin{tabular}{|p{3cm}|p{12cm}|}

\hline

\textbf{Emotion} & \textbf{Expressive means}\\

\hline

Anger & Backward head bend; absence of a backwards chest bend; no abdominal twist; arms raised forwards and upwards; weight transfer is either forwards or backwards\\

\hline

Dislike & Higher degree of abdominal twisting; weight transfer is either forwards or backwards; most features are not very predictive\\

\hline

Fear & Head backwards; no abdominal twist; no effect of chest bend or upper arm position; forearms are raised; weight transfer is either backwards or forwards\\

\hline

Happiness & Head backwards; no forwards movement of the chest; arms are raised above shoulder level and straight at the elbow; weight transfer is not predictive\\

\hline

Sadness & Forwards head bend; forwards chest bend; no twisting; arms at the side of the trunk; weight transfer is not predictive \\

\hline

Surprise & Backwards head and chest bends; any degree of abdominal twisting; arms raised with forearms straight; weight transfer is not predictive\\

\hline

\end{tabular}

\end{table*}

%\begin{table*}[tb]%

\begin{table*}[!t]%


\caption{Emotions and their expression using facial movements}

\label{facial_movements}

\begin{center}

\begin{tabular}{|p{3cm}|p{12cm}|}

\hline

\textbf{Emotion} & \textbf{Expressive means}\\

\hline


Anger & {Widely open eyes, fixated; pupils contracted; stare gaze; ajar mouth; teeth usually clenched tightly; rigidity of lips and jaw; lips may be tightly compressed, or may be drawn back to expose teeth}\\

\hline

Disgust & {Narrowed eyes, may be partially closed as result of nose being drawn upward; upper lip drawn up; pressed lips; wrinkled nose; turn of the head to the side quasi avoiding something}\\

\hline

Fear & {Widely open eyes; pupils dilated; raised eyebrows; open mouth with crooked

lips; trembling chin}\\

\hline

Happiness & {`Smiling' and bright eyes; genuinely smiling mouth}\\

\hline

Sadness & {Eyelids contracted; partially closed eyes; downturned mouth}\\

\hline

Surprise & {Widely open eyes; slightly raised upper eyelids and eyebrows; the mouth is opened by the jaw drop; the lips are relaxed}\\

\hline

\end{tabular}%

\end{center}

\end{table*}

\section{Conclusion}

% In conclusion we state that human perception of facial emotions can be divided into three categories such as visually dominant emotions, auditory dominant emotion and mixed dominant emotions. Clearly we found that some emotions are strongly visually dominant and some are strongly auditory dominant [DESILVA].

\bibliographystyle{ieicetr}% bib style

%\bibliographystyle{plain}

\bibliography{C:/Dropbox/PhD-Research/Bib-Database/bibliography}

%\begin{thebibliography}{99}% more than 9 --> 99 / less than 10 --> 9

%\bibitem{}

%\end{thebibliography}

%\newpage

%\tableofcontents

\profile{}{}

\profile{}{}

%\profile*{}{}% without picture of author's face

\end{document}