Automatic Scene Generation Using Voice And Text Computer Science Essay

Automatic scene generation using voice and text offers a distinctive approach to human-computer interaction with 3D graphics. It applies techniques for generating 3D scenes from informal input such as a natural language description. Because the scene is generated automatically, the approach avoids the main disadvantage of manual scene creation, which requires users to learn specific graphics tools and interfaces. Automated 3D scene generation systems let ordinary users quickly create 3D scenes without having to learn special software, acquire artistic skills, or even touch a desktop window-oriented interface. WordsEye is one such system for automatically converting text into representative 3D scenes (Coyne & Sproat, 2001).

Using natural language input in conjunction with automatic scene generation lets any field that relies on graphics benefit from 3D visualization while reducing the need for graphics-specific knowledge. The system integrates advances in text-to-visualization, voice-to-visualization and computer graphics technology to generate a 3D environment, which makes it easier and faster for non-professionals to create one.

A real-time automatic 3D scene generation system is valuable in any domain that can benefit from real-time 3D graphics and visualization. In education, for example, 3D scenes can reinforce concepts and ideas during a lesson, discussion, or storytelling. Other areas such as law enforcement and the movie industry can also benefit, because the system reduces manual work. The system acts as a bridge between natural language input and graphical representation. The graphic depiction of a spatial relationship depends on the objects and their geometric representation. This paper focuses on a real-time system for correctly depicting objects based on a given relationship. Figure 1.1 depicts the process of 3D scene generation using natural language input.

"Scene composition framework" (Realtime Automatic, 2007)

(Fig 1.1 taken from Seversky & Yin, 2007)

In the figure above, the natural language input given to the system is "The toy train is on the horse". The objects are positioned according to this description, and the 3D scene is generated.

History of 3D Graphics in Human Interactions

Computer graphics is the geometrical representation of objects using a computer. It includes modeling (the creation, manipulation, and storage of geometric objects) and rendering (converting a scene to an image through transformations, rasterization, shading, illumination, and animation).

Computer graphics has been widely used in graphics presentations, paint systems, computer-aided design (CAD), image processing, simulation and virtual reality, and entertainment. From the earliest text character images of non-graphic mainframe computers to the latest photographic-quality images of a high-resolution personal computer, from vector displays to raster displays, and from 2D input to 3D input and beyond, computer graphics has gone through a short, rapidly changing history.

In the 1980s, output was built on raster graphics, bitmap images and pixels. In the 1990s, following the introduction of VGA and SVGA, personal computers could easily display photo-realistic images and movies. 3D image rendering was the main advance of the period, and it stimulated cinematic graphics applications. 3D computer graphics involves the generation of images of 3D objects within a scene.

From games to virtual reality and 3D active desktops, and from immersive home environments to scientific and business applications, computer graphics technology has touched almost every facet of our lives.

Role of voice and text input for human-computer interaction

Since the beginning of the modern era, researchers have dreamed of conversational computers: machines that communicate with the user through natural language. Voice input has many advantages and applications, so it is worth classifying its role relative to other input methods for human-computer interaction and considering the obstacles to the successful development and use of natural language systems. Text input is useful because it is efficient even for users with little knowledge of the system. Voice input is useful in certain cases, for example when only a limited keyboard or screen is available or when the user has a disability. Voice interaction is also preferred when pronunciation is itself the subject matter of computer use.

Put System: Generating 3D Scenes

Manually creating 3D graphics on screen is difficult and time-consuming, and it involves using different software packages and tools. The task of generating 3D scenes can be made easier by using natural language as input. Natural language has been used as input in earlier systems as well. One of these, the Put system, deals with the spatial arrangement of objects. Its input takes the form Put (X P Y)+, where X and Y are objects, P is a spatial preposition, and the + indicates that one or more such groups may be given.
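The Put (X P Y)+ form can be illustrated with a small parser. This is only a sketch under stated assumptions: the concrete syntax (parenthesised groups after the word Put), the preposition list and the names parse_put and Placement are hypothetical and are not taken from the original Put system.

```python
import re
from typing import List, NamedTuple

# Hypothetical preposition vocabulary, listed longest first so that
# "in front of" is tried before the shorter "in".
PREPOSITIONS = ["in front of", "next to", "behind", "under", "on", "in"]

GROUP_PATTERN = re.compile(
    r"(?P<x>.+?)\s+(?P<p>" + "|".join(re.escape(p) for p in PREPOSITIONS) + r")\s+(?P<y>.+)"
)


class Placement(NamedTuple):
    x: str  # object to be placed
    p: str  # spatial preposition
    y: str  # reference object


def parse_put(command: str) -> List[Placement]:
    """Parse 'Put (X P Y)(X P Y)...' into a list of placements."""
    command = command.strip()
    if not command.lower().startswith("put"):
        raise ValueError("command must start with 'Put'")
    groups = re.findall(r"\(([^()]*)\)", command[3:])
    if not groups:
        raise ValueError("expected at least one (X P Y) group")
    placements = []
    for group in groups:
        text = " ".join(group.split())  # normalise whitespace
        match = GROUP_PATTERN.fullmatch(text)
        if match is None:
            raise ValueError(f"cannot read group: {group!r}")
        placements.append(Placement(match["x"], match["p"], match["y"]))
    return placements


if __name__ == "__main__":
    for placement in parse_put("Put (box on table) (cat in front of house)"):
        print(placement)
```

The rigid grammar keeps parsing trivial, which also shows why such input is restrictive: anything outside the X P Y pattern is rejected.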

For example, in Figure 1.2 the input given to the system is "John rides the bicycle. John plays the trumpet".

A character riding a bicycle (lower) while playing the trumpet (upper) (WordsEye, 2001)

(Fig 1.2 taken from Coyne & Sproat, 2001)

Although the Put system meets the goal of producing a 3D scene, its input is limited to a subset of English consisting of expressions of the form Put (X P Y)+. Another drawback is that it is limited to the spatial arrangement of existing objects; in other words, it can only work with objects already present in that particular environment.


With WordsEye, the user can not only generate a 3D scene automatically but also perform actions on the objects in the scene. Another advantage of WordsEye over the earlier Put system is that the user's input is not restricted to the objects present in the environment.

The use of natural language enables the user to create complex animated scenes with ease (Coyne & Sproat, 2001).

Working of WordsEye

In the first step, the user gives natural language input. The input is then parsed using a part-of-speech tagger and a statistical parser, which produces a tree that represents the sentence structure. A dependency representation of the tree is then generated: a list of the words in the input showing how they depend on each other. The dependency representation is converted into a semantic representation, which describes the objects used in scene generation and the relationships between them. This is done using semantic interpretation frames. Objects are related to one another through the WordNet database, which defines the category of each object. Personal names used in the input are mapped to male or female accordingly. Semantic functions handle the spatial prepositions (Coyne & Sproat, 2001).
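The step from a dependency representation to a semantic representation can be sketched as follows. This is a toy illustration, not WordsEye's implementation: the triple format, the tiny CATEGORY lexicon (standing in for a WordNet lookup and a name list) and the names Entity, SpatialRelation and interpret are all assumptions, and the dependencies are written by hand rather than produced by a tagger and parser.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (head, relation, dependent) triples, hand-written for this example.
Dependency = Tuple[str, str, str]

# Hypothetical stand-in for a category lookup; a real system would query
# a lexical database instead of a hard-coded dictionary.
CATEGORY: Dict[str, str] = {
    "bird": "animal",
    "cage": "container",
    "john": "male person",
    "mary": "female person",
}

SPATIAL_PREPOSITIONS = {"on", "in", "under", "behind"}


@dataclass(frozen=True)
class Entity:
    word: str
    category: str


@dataclass(frozen=True)
class SpatialRelation:
    kind: str    # the preposition
    figure: str  # object being located
    ground: str  # reference object


def interpret(deps: List[Dependency]) -> Tuple[List[Entity], List[SpatialRelation]]:
    """Collect the entities and spatial relations found in the dependencies."""
    entities: Dict[str, Entity] = {}
    relations: List[SpatialRelation] = []
    for head, relation, dependent in deps:
        if relation not in SPATIAL_PREPOSITIONS:
            continue  # determiners and other links carry no spatial meaning here
        for word in (head, dependent):
            entities.setdefault(word, Entity(word, CATEGORY.get(word.lower(), "unknown")))
        relations.append(SpatialRelation(relation, head, dependent))
    return list(entities.values()), relations


if __name__ == "__main__":
    # "The bird is in the cage."
    deps = [("bird", "det", "the"), ("bird", "in", "cage"), ("cage", "det", "the")]
    print(interpret(deps))
```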

Finally, graphical specifications called depictors are applied to the scene. Depictors control the size, transparency and position of each object. They can also modify the facial expressions and poses of human figures in the scene.

WordsEye uses an object database of approximately 2,000 3D polygonal objects, with another 10,000 still being added. WordsEye also lets users add their own objects to the database. Each 3D model is associated with additional information such as parts, default size, spatial tags, color parts and opacity parts. These are described as follows:

Default Size

All objects used to generate the scene are initialized with a default size, expressed in feet.

Color Parts

When the user's natural language input specifies a color, the corresponding object should appear in that color in the generated scene. For example, if the input is "a flower with a brown stem", the flower in the generated scene must have a brown stem, while the rest of the flower remains uncolored. If no part is specified, the part with the largest area is colored by default.

Shape Displacements

Shape displacements let objects, especially human figures, change shape in order to depict emotions.

Opacity Parts

Certain objects used in scene generation require transparency, such as doors, windows and other glass objects.


Skeletal control structures are used to define poses for animal and human characters.
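The per-model information described above could be held in a small record such as the one below. This is a minimal sketch, assuming a flat list of named parts with approximate areas; the class name ObjectModel, its fields and the apply_color method are hypothetical and do not reflect the actual WordsEye database layout.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ObjectModel:
    name: str
    default_size_ft: float                 # default size, in feet
    parts: Dict[str, float]                # part name -> approximate surface area
    opacity_parts: Dict[str, float] = field(default_factory=dict)  # part -> opacity, 0.0 to 1.0
    colors: Dict[str, str] = field(default_factory=dict)           # part -> assigned color

    def apply_color(self, color: str, part: Optional[str] = None) -> str:
        """Color the named part, or the largest part when none is named."""
        if part is None:
            part = max(self.parts, key=self.parts.get)
        elif part not in self.parts:
            raise KeyError(f"{self.name} has no part called {part!r}")
        self.colors[part] = color
        return part


if __name__ == "__main__":
    flower = ObjectModel("flower", 1.0, {"stem": 2.0, "petals": 5.0})
    flower.apply_color("brown", "stem")   # "a flower with a brown stem"
    print(flower.colors)

    window = ObjectModel("window", 4.0, {"frame": 3.0, "glass": 6.0},
                         opacity_parts={"glass": 0.2})
    window.apply_color("white")           # no part named: largest part is colored
    print(window.colors, window.opacity_parts)
```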

Spatial Relation

Prepositions are used to denote spatial relations; examples include on, beyond, below, under and above. However complex the layout of a scene may be, it is defined using spatial relations. The relative positions of the objects in the scene are also expressed through spatial relations, as in "the cat is under the table" or "the tree is behind the house". The exact position of an object in the scene depends on the structure and shape of the objects involved.

Some examples describing spatial relations

Example 1: The box is on the table

Here the top surface tag of the table and the bottom surface tag of the box are identified. The box is then repositioned so that its bottom surface rests on the top surface of the table.
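Example 1 can be approximated with axis-aligned bounding boxes, treating the highest face of the table as its top surface and the lowest face of the box as its bottom surface. This is a simplification for illustration only: real spatial tags mark specific regions of a model, and the names Box3D and rest_on are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned bounding box; y is the vertical axis."""
    min_x: float
    min_y: float
    min_z: float
    max_x: float
    max_y: float
    max_z: float

    def translated(self, dx: float, dy: float, dz: float) -> "Box3D":
        return Box3D(self.min_x + dx, self.min_y + dy, self.min_z + dz,
                     self.max_x + dx, self.max_y + dy, self.max_z + dz)


def rest_on(figure: Box3D, ground: Box3D) -> Box3D:
    """Move figure so its bottom face touches ground's top face, centred on it."""
    dx = (ground.min_x + ground.max_x) / 2 - (figure.min_x + figure.max_x) / 2
    dz = (ground.min_z + ground.max_z) / 2 - (figure.min_z + figure.max_z) / 2
    dy = ground.max_y - figure.min_y
    return figure.translated(dx, dy, dz)


if __name__ == "__main__":
    table = Box3D(0, 0, 0, 4, 3, 2)
    box = Box3D(10, 5, 10, 11, 6, 11)
    print(rest_on(box, table))
```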

Example 2: The bird is in the bird cage. The bird cage is on the chair.

Example for Spatial Relation

(Figure 1.3 taken from Coyne & Sproat, 2001)

Object Placement Algorithm

This section describes the algorithm for placing objects. A spatial relationship such as under, in front of, below or behind is applied to the objects. In addition to these relationships, modifiers such as to the left of, center, to the right of and next to can be used. These modifiers give finer control over the placement of the object.

Five stages of the placement algorithm (Seversky & Yin, 2006)

In the first stage the polygon mesh representations are converted into voxelized representations.

Voxelization is a technique for representing the volume of an object that was originally described by a polygon mesh (Chang, Huang, Kankanhalli & Xu, 2003). The voxelized representation consists of cubes that approximate the original shape of the object. The voxelized representations are generated using the Binvox 3D mesh voxelizer, which is used to create representations of 128x128x128 voxels. This stage is executed only once.
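The idea of a voxel grid can be sketched in a few lines. The example below is not how Binvox works (Binvox voxelizes triangle meshes); it assumes instead that the surface has already been sampled into points and simply marks the grid cell each point falls into. The function name voxelize is hypothetical.

```python
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize(points: Iterable[Point], resolution: int = 128) -> Set[Voxel]:
    """Map sampled surface points to occupied cells of a cubic voxel grid."""
    pts = list(points)
    if not pts:
        return set()
    lows = [min(p[axis] for p in pts) for axis in range(3)]
    highs = [max(p[axis] for p in pts) for axis in range(3)]
    # Use the largest extent for all axes so that voxels stay cubic.
    extent = max(high - low for high, low in zip(highs, lows)) or 1.0
    cell = extent / resolution
    voxels: Set[Voxel] = set()
    for p in pts:
        index = tuple(
            min(int((p[axis] - lows[axis]) / cell), resolution - 1)
            for axis in range(3)
        )
        voxels.add(index)
    return voxels


if __name__ == "__main__":
    samples = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]
    print(sorted(voxelize(samples, resolution=4)))
```

Because the grid is built once per model, later placement queries can work on integer cell indices instead of the original mesh.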

Spatial partitioning takes place in the second stage, where the region boundaries of the object are calculated. In the third stage, a region is selected based on the spatial relationship described. In the final stage, the object is added to the scene.
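The later stages can be sketched on top of a voxel set. The example below handles only the relation "on": it finds the top surface region of the reference object, rests the new object on the columns beneath it, and rejects the placement if it would overlap voxels that are already occupied. It is a rough illustration under these assumptions, not the algorithm of Seversky & Yin, and the function names are hypothetical.

```python
from typing import Dict, Set, Tuple

Voxel = Tuple[int, int, int]  # (x, y, z), with y as the vertical axis


def column_heights(voxels: Set[Voxel]) -> Dict[Tuple[int, int], int]:
    """Highest occupied y for every (x, z) column."""
    heights: Dict[Tuple[int, int], int] = {}
    for x, y, z in voxels:
        if y > heights.get((x, z), -1):
            heights[(x, z)] = y
    return heights


def top_surface(voxels: Set[Voxel]) -> Set[Voxel]:
    """Voxels that form the top region of an object."""
    return {(x, y, z) for (x, z), y in column_heights(voxels).items()}


def translate(voxels: Set[Voxel], dx: int, dy: int, dz: int) -> Set[Voxel]:
    return {(x + dx, y + dy, z + dz) for x, y, z in voxels}


def place_on_voxels(figure: Set[Voxel], ground: Set[Voxel],
                    occupied: Set[Voxel], dx: int = 0, dz: int = 0) -> Set[Voxel]:
    """Rest figure on top of ground at horizontal offset (dx, dz)."""
    heights = column_heights(ground)
    moved = translate(figure, dx, 0, dz)
    footprint = {(x, z) for x, _, z in moved}
    supported = [heights[col] for col in footprint if col in heights]
    if not supported:
        raise ValueError("figure is not above the reference object")
    # Lowest figure voxel sits one cell above the tallest supporting column.
    dy = max(supported) + 1 - min(y for _, y, _ in moved)
    placed = translate(moved, 0, dy, 0)
    if not placed.isdisjoint(occupied):
        raise ValueError("placement collides with existing objects")
    return placed


if __name__ == "__main__":
    slab = {(x, 0, z) for x in range(4) for z in range(4)} | {(1, 1, 1)}
    post = {(0, 0, 0), (0, 1, 0)}
    print(sorted(place_on_voxels(post, slab, occupied=slab)))
```

Representing occupied space as a set of integer cells makes the collision test a single set operation, at the cost of precision limited by the grid resolution.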

(Taken from Coyne & Sproat, 2001)

Figure a: Original representation of objects

Figure b: Voxel Representation

Figure c: Surface representation of top region

Figure d: Surface representation of bottom region

Knowledge based definition

The user interacts with the system using natural language. For the system to generate 3D scenes, the input must be converted into a low-level, machine-understandable form, that is, a programming language. The meaning of the input words is the main criterion for visualization, and knowledge is represented in a word-concept-visual format. Accurate interpretation of the user's input is the basis of an efficient user interface. To interpret spatial relationships properly, an XML file called Virtual Object Spatial-relation (VOS) is used (Zeng, Mehdi & Gough, 2005).
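How an XML file might carry spatial-relation knowledge can be sketched as below. The actual VOS schema is not reproduced in this essay, so every element and attribute name in the snippet is invented for illustration, as is the function load_relations; only the standard library XML parser is used.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical VOS-like content; the real file format may look quite different.
VOS_XML = """
<vos>
  <relation word="on" figureSurface="bottom" groundSurface="top" contact="true"/>
  <relation word="under" figureSurface="top" groundSurface="bottom" contact="false"/>
</vos>
"""


def load_relations(xml_text: str) -> Dict[str, dict]:
    """Read spatial-relation entries into a dictionary keyed by preposition."""
    root = ET.fromstring(xml_text.strip())
    table: Dict[str, dict] = {}
    for node in root.findall("relation"):
        table[node.get("word")] = {
            "figure_surface": node.get("figureSurface"),
            "ground_surface": node.get("groundSurface"),
            "contact": node.get("contact") == "true",
        }
    return table


if __name__ == "__main__":
    relations = load_relations(VOS_XML)
    print(relations["on"])
```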

Application of 3D Scene Generation

Application of 3D Scene Generation for Children with Autism

Children with autism or intellectual disabilities often experience difficulty with thinking in, and using, linguistic expressions. Generating 3D scenes from text and voice helps such children bridge the gap between linguistic expressions and the concepts they describe. It has been observed that people with intellectual disabilities have difficulty grasping abstract concepts and tend to think in concrete images rather than linguistic expressions. A real-time system called s2s, designed to benefit children with autism, converts Turkish sentences into appropriate scenes. It was tested with children with autism in a special health center and produced beneficial results, which encourages further work on automated 3D scene generation (Kilicaslan, Ucar & Guner, 2008).

Application of 3D Scene Generation in Educational Environment

Explanations that use 3D scenes have a powerful impact on students interacting with knowledge-based learning environments. 3D scenes can depict explanations in real time. They can also be used to teach students moral values: an automated 3D scene generation system can visualize stories in 3D, which is an engaging interaction technique and provides a good learning environment (Bares & Lester, 1997).

Application of 3D Scenes for Geographic Representation

Cartographers have shifted from pen to digital methods, and the use of three-dimensional (3D) scenes for geographic representation has interested them for some years. Two important methods that digital production makes possible are animation and 3D graphics. 3D scenes can be combined with animations in a visual programming environment. The software package GeoVISTA Studio enables data analysis and data sources to be linked along with user controls, which eases the creation of experimental three-dimensional animated maps (Hardisty, MacEachren & Takatsuka, n.d.).

Automated 3D Scene Generation of a Car Accident Description

Using an automated scene generator, a virtual scene can be visualized from the written description of an accident. A system called CarSim processes textual descriptions of accidents and produces a 3D representation of the accident scene. Insurance companies could use such a system in real time to generate the accident scene from the police report as text input (Nugues, Dupuy, Egges & Legendre, 2001).

The Performance Evaluation

To evaluate the performance of the system, factors such as the time taken to complete the task of generating a scene and the resources used are taken into consideration. Performance can also be evaluated by increasing the complexity of the virtual environment in which the scenes are generated. If the system completes the task in minimal time and with few problems, it can be considered an efficient human-computer interface.
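The factors named above (time, resources and problem count) could be collected with a small harness like the following. It is a sketch using only the Python standard library; the function name measure and the choice of peak traced memory as the resource measure are assumptions, and the task passed in stands for whatever scene generation routine is being evaluated.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict


def measure(task: Callable[..., Any], *args: Any, repeats: int = 5) -> Dict[str, float]:
    """Run task several times; report best time, peak memory and error count."""
    best_seconds = float("inf")
    peak_bytes = 0
    errors = 0
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        try:
            task(*args)
        except Exception:
            errors += 1  # count failed runs as problems
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        best_seconds = min(best_seconds, elapsed)
        peak_bytes = max(peak_bytes, peak)
    return {"best_seconds": best_seconds, "peak_bytes": peak_bytes, "errors": errors}


if __name__ == "__main__":
    # Stand-in workload; scene complexity is imitated by a larger argument.
    def build(n: int) -> list:
        return [i * i for i in range(n)]

    for size in (1_000, 100_000):
        print(size, measure(build, size))
```

Running the same task at increasing sizes gives a simple view of how time and memory grow as the virtual environment becomes more complex.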

Future Enhancements

Present systems generate scenes from natural language descriptions that contain spatial relationships. Many other linguistic challenges remain to be addressed. A future system could include action-based positioning in order to place objects more effectively; actions that could be included are rotating, moving and resizing an object. A new feature could let the user give suggestions in order to correct inappropriate outputs. The system could also be enhanced by using alternative collision detection techniques instead of the voxelization of the model mesh representations.


Rich 3D animated scenes can have a powerful impact on students in knowledge-based environments. Automated scene generation provides an efficient interface for communicating with the system and creating 3D scenes using spatial relationships. The system acts as a 3D graphical medium: it generates 3D images from the users' natural language input and positions the objects appropriately, using voxelization to detect collisions during placement. It automatically generates correct graphical representations of objects given spatial relations, which are taken from natural input that includes text and voice descriptions from the user.