Annotation For Pay As You Go System Computer Science Essay


The emergence of dataspaces has brought a fresh current into the data management field over the past six years. It comes with the promising vision of overcoming the shortcomings of classical data integration, in particular by reducing the up-front cost and supporting incremental refinement. However, the research results so far often apply only under specific assumptions or to particular domains. A few initial model managements for dataspace management systems (DSMS) have been proposed. This research paper considers an extended model management that uses feedback-based annotation. Changes in requirements, and their symptoms in user feedback, are also taken into account, and a new algorithm suited to this variability is described.

Keywords: dataspaces, model management

I. Introduction

The concept of dataspaces is gradually becoming familiar, and it is widely discussed as an inevitable trend. The interest stems not only from its novelty but also from its promising vision: it offers distinguishing features such as low or no initialization cost and incremental improvement. Through a long period of development, classical data integration has reached a high position on the data access spectrum [1]. The increase in structured data on the Internet and in heterogeneous sources creates opportunities for the further development of data integration. Schema mapping, however, is resource- and time-consuming, as noted in [2]. The verification of schema mappings normally happens before a data integration system is set up, which incurs the up-front cost. In this paper, a new approach is proposed: schema mappings are verified at the same time as the data integration system is set up. A schema mapping is used as input to the initial data integration and is annotated for reuse in subsequent increments. Schema mappings are generated automatically using mapping generation techniques [3].

In this research, user feedback is selected as one of several possible tools for annotating query results. Previous works have proposed annotation of stable results that remain unchanged across time and domains. Changes in requirements, however, significantly affect the query results. The user not only annotates the results but also takes these changes and their symptoms into account. The changes need to be reflected in the set of results and, in turn, in the set of mappings between the source schemas and the integration schema. Symptoms of change include conflicting feedback versions, or even contradictions between variations.

Up to now, research in dataspaces has focused mainly on solving problems under specific application assumptions or in particular domains. Existing applications such as iMeMex and SEMEX serve narrow tasks, such as finding authors or articles, and the final result must satisfy a set of constraints listed from the beginning. In this paper we consider model management for a dataspace management system (DSMS). Model management here is simply a framework for a DSMS, consisting of types and operations, that covers the entire dataspace life cycle: the initialization phase, the query usage phase, and the improvement and maintenance phase. Algorithms for particular cases using these types and operations are also defined. To make better use of user feedback for annotation, an extended algorithm is described, building on algorithms from previous works.

The rest of this paper is organized as follows. Section 2 reviews previous attempts to model the management of a DSMS. Section 3 goes over the data types and operations used in this paper. Section 4 describes a specific case and the corresponding algorithm, including our extension. Section 5 presents future work and conclusions.

II. Related Works

The dataspace life cycle and its phases are discussed in [4]. Operations and data types that address the manipulation of schemas are proposed in [5]. These proposals are based on the fact that we need to resolve differences between schemas and to translate a schema and its data from one data model to another. To overcome the problem of up-front cost and resources, schema mappings can be generated automatically using schema matching techniques [6]. Schema matching is the process of identifying whether two objects are semantically related. A match is a binary relationship that connects an element of one schema, e.g., a relation in a source schema, to a semantically equivalent element in another schema, e.g., a relation in an integration schema. Schema matchers can be classified as schema-level, instance-level, or hybrid. The schema mappings derived from these techniques are based on heuristics, and some of them may not satisfy the user's requirements. In [7], Clio is described as an application that can specify complex mappings involving multiple relations in the source schemas; this technique still cannot ensure that a mapping meets the user's needs, which leads to the question of how mappings can be verified. The verification of schema mappings is carried out in [8]: Spicy is a system that chooses, from a set of mappings, the one that best represents the transformation from a source schema into a target schema. In [9], a schema debugging tool is developed that can compute "routes" describing the relationships between source and target schemas.

Annotation with precision and recall is discussed in [10]. Incremental annotation based on user feedback is consistent with the dataspaces aim: with this technique, the benefits of classical data integration are provided while the up-front cost is still reduced.

III. Types and Operations

The types and operations used in this paper were previously proposed in [5]. Some additional types and operations will be added to extend user feedback-based annotation so that it can keep up with changes in user requirements. We also use uniform notation, following the flexible model management of [11]. The letter "C" denotes a "construct". A construct is simply an element of a schema, such as a relation, an attribute of a relation, or a relationship between two schemas. Capitalized letters denote sets; therefore, Csi denotes the set of constructs belonging to a schema si. We use the four following types in this paper.

Match Type: denoted mtsi−sj, is a match between two source schemas si and sj. It corresponds to a tuple of two constructs <Csi, Csj>. A matching algorithm can be used to generate the match. A set of matches is denoted MTsi−sj.

Correspondence Type: denoted crsi−sj, is an association returned by a matching algorithm. It relates two constructs Csi and Csj by a given kind, where the kind can be a missing attribute, a name conflict, or horizontal or vertical partitioning. A set of schematic correspondences is denoted CRsi−sj.

Mapping Type: denoted mpSs−si, is a mapping between a set of source schemas Ss and an integration schema si. It corresponds to a tuple of queries: a query qsi posed over the integration schema si, and a query qSs with the same number of parameters posed over the set of source schemas Ss. A set of mappings is denoted MPSs−si. In the case where a single construct in the integration schema is related to a query posed over the set of source schemas, the mapping corresponds to a tuple of a construct and a query qSs; this case is also called "Global as View".

Query Result Type: is the set of result tuples of a query posed over an integration schema si, denoted Rqsi. A single result, denoted rqsi, corresponds to AttV, a set of attribute-value pairs.
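As an illustration, the four types above could be rendered as plain data structures. The following Python sketch uses our own names (Construct, Match, Correspondence, Mapping); it is a hypothetical rendering, not an interface from [5] or [11]:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Construct:
    """An element of a schema: a relation, an attribute, or a relationship."""
    schema: str   # name of the schema the construct belongs to
    name: str     # e.g. "Student" or "Student.name"

@dataclass(frozen=True)
class Match:
    """mtsi-sj: a tuple of two constructs produced by a matching algorithm."""
    left: Construct
    right: Construct

@dataclass(frozen=True)
class Correspondence:
    """crsi-sj: an association of a given kind between two constructs."""
    left: Construct
    right: Construct
    kind: str     # e.g. "name conflict", "missing attribute", "horizontal partitioning"

@dataclass
class Mapping:
    """mpSs-si: a pair of queries over the integration schema and the sources."""
    q_integration: str   # query posed over the integration schema
    q_sources: str       # equivalent query posed over the set of source schemas

# A single query result rqsi is a set of attribute-value pairs (AttV),
# e.g. {"ID": "A123456", "name": "Bob"}; a Query Result is a set of such tuples.
```

Sets of these values (MTsi−sj, CRsi−sj, MPSs−si, Rqsi) are then ordinary collections of the corresponding type.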

By using different matching techniques, multiple candidate mappings can be returned. The candidate mappings are ranked by score, where the score is derived from the confidence of the matches. The higher the score of a mapping, the more likely it is to be used to answer the query. As a matter of fact, however, the highest score does not mean that the mapping will meet the user's needs. Therefore, we need another source of information with which to evaluate the query results. User feedback is chosen as one such source for selecting the most suitable mapping. The user does not manipulate the mapping algorithms; he only needs to provide an assessment of the set of query results. The user comments on a result with one of three notations: a given tuple was expected in the answer (true positive), a certain tuple was not expected in the answer (false positive), or an expected tuple was not retrieved (false negative). To make the improvement phase effective, the most important thing is to refine the mappings so as to reduce the number of false positives and increase the number of true positives.
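The three feedback notations amount to a simple partition of the result tuples, from which precision and recall follow directly. The sketch below is our own illustrative formulation (not an algorithm from the cited works), treating each tuple as an opaque value:

```python
def classify_feedback(expected, returned):
    """Partition result tuples into true positives, false positives and
    false negatives, given the tuples the user expected and the tuples
    the system actually returned."""
    expected, returned = set(expected), set(returned)
    tp = expected & returned      # expected and retrieved (true positives)
    fp = returned - expected      # retrieved but not expected (false positives)
    fn = expected - returned      # expected but not retrieved (false negatives)
    return tp, fp, fn

def precision_recall(tp, fp, fn):
    """Precision = |TP|/(|TP|+|FP|), recall = |TP|/(|TP|+|FN|)."""
    precision = len(tp) / (len(tp) + len(fp)) if (tp or fp) else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if (tp or fn) else 0.0
    return precision, recall
```

Refining the mappings then means preferring mappings whose results shrink the fp set and grow the tp set.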

To create the dataspace management system, a number of operators need to be defined to cover the operations occurring throughout the dataspace life cycle. Here we use the six following operations proposed in [5] and [11]:

MATCH: returns a set of matches between two schemas si and sj.

MERGE: the input parameters are two schemas and the set of correspondences CRsi−sj between them. A kind parameter specifies whether this operation is a "merge" or a "union". The result is the merged schema together with the sets of correspondences between the two source schemas and the merged schema (sm, CRsi−sm, CRsj−sm).

MAPPING: the input parameters are a set of source schemas Ss, an integration schema si, and a set of correspondences CRSs−si between the source schemas and the integration schema. The result is a set of mappings MPSs→si that describe how to transform elements in the source schemas into the corresponding elements in the integration schema.

INFERCORRESPONDENCE: automatically derives the set of correspondences CRsi−sj between two source schemas based on a given set of matches between elements of the two source schemas si and sj.

ANSWERQUERY: divides a query qsi posed over an integration schema into sub-queries over the set of source schemas Ss. These sub-queries are executed, their answers are combined, and the results are ranked.

ANNOTATE: annotates the results based on a set of annotations A provided through user feedback. The output is a set of annotated query results or schema mappings that can be used iteratively so that each round of results is better than the last.
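In Python terms, the six operations might be summarized by an interface along the following lines. This is purely our own hypothetical rendering of the signatures implied above:

```python
from typing import Any, Protocol

class DSMSOperations(Protocol):
    """Hypothetical interface for the six model-management operations."""

    def match(self, si: Any, sj: Any) -> list:
        """MATCH: set of matches MTsi-sj between two schemas."""
        ...

    def infer_correspondence(self, mt: list) -> list:
        """INFERCORRESPONDENCE: correspondences CRsi-sj derived from matches."""
        ...

    def merge(self, si: Any, sj: Any, cr: list, kind: str) -> tuple:
        """MERGE: merged schema sm plus (CRsi-sm, CRsj-sm)."""
        ...

    def mapping(self, s_int: Any, sources: list, crs: list) -> list:
        """MAPPING: set of mappings MPSs->si into the integration schema."""
        ...

    def answer_query(self, q: Any, mappings: list) -> list:
        """ANSWERQUERY: ranked results of sub-queries over the sources."""
        ...

    def annotate(self, results: list, annotations: list) -> list:
        """ANNOTATE: results or mappings annotated with user feedback."""
        ...
```

Any concrete DSMS component supplying these six methods would satisfy the interface structurally.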

We can also use a set of control parameters (CP), including thresholds. A threshold specifies the precision or recall that the user wishes to obtain and would find satisfactory. The user is not required to annotate all of the results or schema mappings he receives; he is only required to give feedback on the usefulness of the results related to his needs. A user feedback item is a tuple, as described in [12]: uf = (AttV, r, exists, provenance), where r is a relation in the integration schema, AttV is the set of attribute-value pairs in this relation, exists is the user's evaluation of the result, and provenance is the set of sources of the attribute-value pairs. For example, consider the following user feedback expression:

uf1 = (AttV1, Student, true, {m2, m3})

AttV1 = {(ID, 'A123456'), (name, 'Bob'), (kind, 'graduate'), (dob, '05/03/1985')}

The user feedback uf1 specifies that a tuple derived from the mappings m2 and m3 is a true positive (assuming that AttV1 meets the user's requirements). AttV1 is the set of attribute-value pairs ID-"A123456", name-"Bob", kind-"graduate", dob-"05/03/1985".
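The feedback tuple uf = (AttV, r, exists, provenance) maps directly onto a small record. The sketch below is our own rendering of the structure from [12], using the example values above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UserFeedback:
    """uf = (AttV, r, exists, provenance), following the tuple shape in [12]."""
    att_v: frozenset       # attribute-value pairs of the annotated tuple
    r: str                 # relation in the integration schema
    exists: bool           # True: expected (true positive); False: not expected
    provenance: frozenset  # mappings the tuple derives from

# The example uf1 from the text:
uf1 = UserFeedback(
    att_v=frozenset({("ID", "A123456"), ("name", "Bob"),
                     ("kind", "graduate"), ("dob", "05/03/1985")}),
    r="Student",
    exists=True,  # the tuple meets the user's requirements
    provenance=frozenset({"m2", "m3"}),
)
```

Frozen sets keep the feedback hashable, so collected feedback can itself be stored in sets and deduplicated across annotation rounds.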

IV. An Experimental Case and Corresponding Algorithm

In [11], three case studies are described using the operations and types presented in section 3. In this paper, we consider only the case in which schema matching, correspondence derivation, and schema mapping are done automatically in the initialization phase, the query usage phase, and the improvement phase. This case is compared to UDI in ref []. The algorithm for this case, as proposed in [11], is:

1: MTs1−s2 = MATCH(s1, s2)

2: CRs1−s2 = INFERCORRESPONDENCE(MTs1−s2)

3: <sm, CRs1−sm, CRs2−sm> = MERGE(s1, s2, CRs1−s2, merge)

4: MPSs→sm = MAPPING(sm, {s1, s2}, {CRs1−sm, CRs2−sm})

5: loop

6: MTsi−sm = MATCH(si, sm)

7: CRsi−sm = INFERCORRESPONDENCE(MTsi−sm)

8: <sm′, CRsi−sm′, CRsm−sm′> = MERGE(si, sm, CRsi−sm, merge)

9: MPSs→sm′ = MAPPING(sm′, {si, sm}, {CRsi−sm′, CRsm−sm′})

10: end loop

11: {Poses Query}

12: Rqsm′ = ANSWERQUERY(qsm′, MPSs→sm′)

13: {Improvement phase - user feedback is provided, and the results and mappings are annotated}

14: R = ANNOTATE(Rqsm′, A)

15: MP = ANNOTATE(MPSs→sm′, R)

Steps 1 to 4 are performed in the initialization phase. First, we match two source schemas using a matching tool such as COMA++. Then we infer the correspondences between the two source schemas based on the matches obtained in step 1. A merged schema is created from the two source schemas and the matches between them; simultaneously, the correspondences between the source schemas and the new merged schema are inferred. Finally, a set of mappings is generated between the set of source schemas and the merged schema, based on the correspondences inferred in step 3. Because the data sources are autonomous and heterogeneous and the needs of data integration change frequently, new sources are likely to be added to the existing set of sources. They can be added manually by an administrator or automatically by assistant tools; the ability to accumulate source integration iteratively is therefore indispensable. Steps 5 to 10 of the algorithm form a loop that grows the merged schema incrementally: matches and inferred correspondences are continually created in order to integrate each new source schema into the existing merged schema, and the corresponding set of mappings is generated between the new source schema and the existing integration schema using the newly inferred correspondences. A query is then posed over the integration schema and divided into sub-queries posed over the set of source schemas. These sub-queries are executed and combined, and the final results are ranked and displayed to the user. The results are then annotated for usefulness through user feedback, and these annotated results are in turn used to annotate the set of existing mappings. The annotations should help to select better mappings in the future when the same requirement recurs.
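Assuming the six operations are available as functions, the initialization phase plus the incremental loop described above can be wired together roughly as follows. This is a schematic sketch with the operations passed in as parameters, not an implementation of [11]:

```python
def build_dataspace(source_schemas, match, infer_correspondence, merge, mapping):
    """Initialization phase (steps 1-4) plus the incremental loop (steps 5-10):
    fold each new source schema into the growing merged schema."""
    s1, s2, *rest = source_schemas
    mt = match(s1, s2)                              # step 1: MATCH
    cr = infer_correspondence(mt)                   # step 2: INFERCORRESPONDENCE
    sm, cr1, cr2 = merge(s1, s2, cr, kind="merge")  # step 3: MERGE
    mappings = mapping(sm, [s1, s2], [cr1, cr2])    # step 4: MAPPING
    integrated = [s1, s2]
    for si in rest:                                 # steps 5-10: incremental loop
        mt = match(si, sm)
        cr = infer_correspondence(mt)
        sm, cr_i, cr_m = merge(si, sm, cr, kind="merge")
        mappings = mapping(sm, integrated + [si], [cr_i, cr_m])
        integrated.append(si)
    return sm, mappings
```

Queries (step 12) and annotation (steps 14-15) would then operate on the returned merged schema and mapping set.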

However, this algorithm does not take changes in the user's requirements and their symptoms into account. For example, in a first stage an administrator wants to retrieve all the graduate students who will graduate this Fall. She may use a dataspace tool such as SEMEX to get the results shown in Figure 1. She also needs to comment on the results in order to specify which tuples are expected results (true positives - tuples t1, t4, and t5), which tuples are unexpected results (false positive - tuple t2), and which tuples are expected but were not returned (false negative - tuple t3). In Figure 1, tuple t2 is an unexpected result because information about an undergraduate student was returned instead of a graduate student. Tuple t3 is expected because it gives information about a graduate student, but it does not appear in any mapping; thus, tuple t3 is expected but was not returned. These comments can actually be made automatically if we set a threshold in the dataspace tool that tells the program at what level a tuple is useful; the threshold can consist of precision and recall values. The annotated results and mappings are saved for later reuse. However, if in a second stage she needs results containing the list of both graduate and undergraduate students graduating in Fall, she cannot reuse the saved results, and the process has to be run again from scratch.

Graduate Students

[Figure body not reproduced: a table of query result tuples with their provenance mappings, such as {m2, m3} and {m1, m3}.]

Figure 1: Example of query results

To enhance the algorithm, we propose a type named Query Modify Type, denoted msi. This type provides an interface through which the user can modify the old query. We put all the steps into a loop that checks whether a change has been made. If a change is made, it is propagated to the MATCH operations in the loop from step 5 to step 10. The ANSWERQUERY operation then also uses msi to answer the new query, which reflects the updated requirement. A set of thresholds (SH) is also given to ease the annotation work for the user; the optional SH is used as an input parameter to the operations. Steps 5 to 12 can be rewritten as follows:

5: loop (check whether the requirement has changed; if so:)

6: MTsi−sm = MATCH(si, sm, msi, [SH])

7: CRsi−sm = INFERCORRESPONDENCE(MTsi−sm)

8: <sm′, CRsi−sm′, CRsm−sm′> = MERGE(si, sm, CRsi−sm, merge, [SH])

9: MPSs→sm′ = MAPPING(sm′, {si, sm}, {CRsi−sm′, CRsm−sm′}, [SH])

10: end loop

11: {Poses Query}

12: Rqsm′ = ANSWERQUERY(qsm′, MPSs→sm′, [SH])
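The extension amounts to wrapping the incremental steps in a change check and threading the optional thresholds through the operations. The sketch below is our own schematic rendering, with the change check, the msi interface, and the operations supplied as callables:

```python
def improvement_loop(state, requirement_changed, modify_query, answer_query,
                     annotate, thresholds=None):
    """Re-run the incremental phase only when the user's requirement changes;
    the Query Modify Type (msi) supplies the modified query, and `thresholds`
    plays the role of the optional SH parameter."""
    if requirement_changed(state):
        # msi interface: the user modifies the old query (step 5 check)
        state["query"] = modify_query(state["query"])
        # ... steps 6-10 (MATCH / INFERCORRESPONDENCE / MERGE / MAPPING)
        # would be re-run here, with [SH] passed through where applicable ...
    # steps 12, 14: answer the (possibly new) query and annotate the results
    results = answer_query(state["query"], state["mappings"], thresholds)
    annotated = annotate(results, state.get("feedback", []), thresholds)
    return annotated
```

When no change is detected, the saved mappings and annotations are reused directly, which is exactly the reuse that the unextended algorithm could not offer.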

V. Conclusion

This paper focused on model management for a dataspace management system. A generic framework for a DSMS was considered, with its types and operations; an algorithm describing the model is a combination of these operators. User feedback is used as the main method for annotating the results and propagating the annotations to the schema mappings.

Two main contributions are:

i. An existing algorithm describing model management for a DSMS is modified, and some operators are proposed to take changes in the user's requirements into account.

ii. A set of thresholds, including precision and recall values, is used as an input parameter for the model management operations.

Up to now, a model in which all phases are done automatically exists only in theory. The only such theoretical system is UDI [13], and no implementation of UDI is available so far. The proposed extension of the algorithm for UDI-like models therefore needs to be evaluated in practice, using tools to be developed in the future.