Flexible Operations In Uncertain And Probabilistic Database Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Trio is a robust prototype build to store and retrieve uncertain and lineage data. It also supports some features of relational DBMS. ULDB is an extension of relational database with expressive construct for representing and manipulating both lineage and uncertainty. ULDB representation is complete and it permits straightforward implementation of many relational operation.Currently Trio performs only select-project-join queries and some set operations. Queries are expressed using TriQL query language. This paper highlights on how multiple aggregation can be handled in select clause in Trio system for uncertain and probabilistic data. It also highlights on how distinct clause can be used along with aggregation function. It also highlights on implementation of minus and intersect all clause in Trio system. These operations allows user to use Trio system in a more flexible way.

Index Terms- Aggregation, database management, query processing

1 introduction

In traditional database management systems

(dbms), we can store data item with exact

value. we can not store inexact (uncertain,

probabilistic, fuzzy, approximate, incomplete

and imprecise) data into dbms. Database re-

search has primarily concentrated on how to

store and query exact data. The development

of techniques allows user to express and e_-

ciently process complex queries over large col-

lections of data. Unfortunately, many real

world applications produce large amount of un-

certain data. In such cases, database need to

do more than simply store and retrieve. They

have to help the user shift through the uncer-

tainty and _nd the results most likely to be the


Probabilistic databases have received atten-

tion recently due to need for storing uncer-

tain data produced by many real world ap-

plications. In uncertain database, each data

item instance has multiple possible instances,

each corresponds to a single possible state of


Lineage identi_es a data items derivation, in

terms of other data in the database or outside

data sources. Lineage is also important for un-

certainty within a single database. When user

writes queries against uncertain data, the re-

sult is uncertain too. Lineage facilitated the

correlation and coordination of uncertainty in

query results with uncertainty in the input

data. Relation between uncertain database

and lineage is that lineage can be used for un-

derstanding and resolving uncertainty[2].

In the new Trio Project at Stanford, a pro-

totype management system is under develop-ment, in which data, uncertainty of the data

and data lineage are all _rst-class citizens.The

objective is to address the shortcomings of con-

ventional dbms's. By combining data, un-

certainty and lineage gives a data manage-

ment platform that is useful for data integra-

tion, data cleaning, information extraction sys-

tems, Scienti_c and Sensor data management,

approximate and hypothetical query process-

ing and other modern applications. Trio's

database is managed by uldb's. uldb extend

the standard relational model. Queries are ex-

pressed using TriQL. TriQL is query language

used in Trio for querying data[5][4].

2 literature survey

Trio is system for Integrated Management of

Data, Accuracy and Lineage. Trio is a new

database system that manages not only data,

but also the accuracy and lineage of the data.

The goals of Trio project are to combine and

distill previous work into a simple and usable

model, design a query language as an under-

standable extension to SQL and most impor-

tantly build a working System.

ULDB has been implemented in Trio project

wherein uncertain data is captured by tuples

that may include several alternatives and pos-

sible values for some (or all) of their attributes,

with optional con_dence values associated with

each alternative. The TriQL query language

speci_es a precise generic semantics for any re-

lational query over a uldb. The result of re-

lational query Q on a uldb U is a result R

whose possible instances corresponds to apply-

ing Q to each possible-instance of U. TriQL

includes number of new features speci_c to un-

certainty and lineage. TriQL allows construct

for querying lineage, uncertainty, lineage and

uncertainty together, special types of aggrega-

tion, extension to SQL's data modi_cation, re-

structuring a uldb relation. The Trio Proto-

type system is layered on top of conventional

relation dbms. It is implementation of uldb

model, TriQL query language and other fea-


Initially, Trio system support select-

project operations over uncertain database.

After successful implementation of these ba-

sic operations they implemented join opera-

tion as we perform in SQL over two tables. As

uncertain or probabilistic database based on

possible-instances, for aggregation query the

result size can grow exponentially with data

size. There can be exponential number of pos-

sible instances, with di_erent aggregation re-

sults in each one. To make computation fea-

sible, Trio o_er several variants for aggrega-

tion function. A function returning the lowest

possible value of the aggregate result(low), the

highiest possible value (high) or the expected

value (expected)[5]. Currently we can use only

one aggregate function in select clause. Some

set operations like minus, intersect all are

not implemented in Trio. It also does not

support distinct clause along with aggregate


Many certain databases allow users to use

multiple aggregate function in select clause.

Computing aggregate over uncertain and prob-

abilistic data is useful in situation where an-

alytical processing is required over uncertain

data. To make user friendly database, we need

exibility of queries over the uncertain and

probabilistic database. If user need to add,

count or perform basic statistical function, ag-

gregate functions are helpful. These functions

determine various statistics and values. Flex-

ible operations reduce the amount of coding

that user need to do in order to get infor-

mation. Some times user need aggregation of

distinct values. So to get aggregation of dis-

tinct values, uncertain database should sup-

port aggregation with distinct clause. For

some queries we need minus set operation. So

uncertain database should support use of mi-

nus set operator. These operations extend

exibility of Trio system[7].

3 the trio system

Figure 1 shows the basic three layer architec-

ture of the Trio system. The core system is

implemented in Python and it acts like medi-

ator between relational dbms and Trio inter-

faces and applications. The Trio API accepts

TriQL query and it modi_es into regular SQL

and query result may be uldb tuples or regu-

lar tuples. It provides command line interac-

tive client (TrioPlus) and TrioExplorer graph-

ical user interface.

Trio ddl commands are translated via

python to SQL ddl commands based on en-

coding. Processing of TriQL queries proceeds

in two phases. In the translation phase, a

TriQL parse tree is created and progressively

transformed into a tree representing one or

more standard SQL statements. In execution

phase, the sql statement are executed against

the relational database encoding. TriQL query

results can either be stored or transient. Stored

query results are placed in a new persistent

table. Transient query results are accessed

through the Trio api in a cursor-oriented


4 implementation

First we describe how relational tables are en-

coded in Trio system to e_ciently compute the

queries. Consider a Trio relation T(A1....An).

Relation T is stored in a conventional re-

lational table with four additional attribute:

T enc (xid, aid, conf, certain, A1....An). The

Addition attributes in T enc are as follows:

_ xid identi_es the x-tuple.

_ aid identi_es an alternative within the x-


_ conf contains the con_dence of the alter-


_ certain is ag to indicate whether the x-

tuple has a _ alternative.

For Example, the Trio relation Sightings(time,

color, length) is encoded Sighting enc(xid, aid,

conf, certain, time, color, length) as shown in

Table 1.

We have implemented the minus set opera-

tor in the Trio system. The minus operation

returns unique rows that are returned by the

_rst query but are not returned by the sec-

ond query. Usually, minus is used to compare

data in di_erent data sources (tables). For ex-

ample, di_erences in the same tables across

test and production and/or actual copy and

backup. Visually Query1 minus Query2 can

be expressed as shown in Figure 2.

In Figure 2 shaded portion is the result of

query. The query is executed in following way.

The Trio system Python layer translates the

TriQL query into the corresponding sql query,

sends it to the underlying dbms and opens

a cursor on the result. The translated query

refers to virtual views. To use minus operator

following conditions must be satis_ed.(a) The

result set of both the queries must have the

same number of columns.(b) The data type

of each column in the second result set must

match the data type of the _rst result set.

Let Tfetch denote a cursor call to the Trio

api for the original TriQL query and let Dfetch

denote a cursor call to the underlaying dbms

for the translated sql query. Each call to the

Tfetch must return a complete u-tuple, which

may contain several calls to Dfetch. Each tuple

returned from Dfetch on the sql query corre-

sponds to one alternative in the TriQL query


4.1 Running Example: Squirrel


Trio application was inspired by Christmas

Bird Count[5], an original motivating example

for Trio. Human volunteers observed, Squirrels

on the Stanford campus and recorded its obser-

vations. Volunteer recorded the color(species)

and length of each squirrel sighting along with

time of observation.

For minus operation, we consider volunteer observation of two days. Stored in relation

SightingDay1 (time, color, length) and Sight-

ingDay2 (time, color, length) in Table2 and

Table 3 respectively. We run following query

over two tables.

(select color from SightingDay1) minus

(select color from SightingDay2)

The result of the query is shown in Table 4.

5 conclusion and

future work

Trio supports select-project-join query. It

also supports aggregation functions with vari-

ants. Many database allows user to use multi-

ple aggregate functions in the select clause.

To make user friendly database, we need ex-

ibility of queries over the database. Flexible

operations reduce the amount of coding that

user need to do in order to get information.

We have built new operations in the Trio sys-

tem, which helps the user to use system e_ec-

tively. We have implemented the minus oper-

ation that used to calculate the di_erence be-

tween two resources.

In future, we are going to implement follow-

ing operations to make Trio system more ex-


_ Implementation of multiple aggregate

function in select clause.

_ Implementation of intersect all clause.

_ Working with the Lineage data.