According to statistical reports from the World Health Organization, breast cancer is one of the cancers that threaten women today. (Parkin et al., 2005) With the technology and knowledge available today, many drugs and compounds have been discovered to treat breast cancer. The United States National Cancer Institute (NCI) conducts an anticancer drug discovery program in which around 10,000 compounds are screened in vitro every year against a panel of sixty different human cancer cell lines. (Monks et al., 1991; Boyd, 1997) Different compounds act on different cell types through different mechanisms.
1.2 Problem Statements
This project focuses on pure compounds that have cytotoxic effects against breast cancer cells, that is, compounds that inhibit the growth of those cells. Because the available data sets are very large, a knowledge discovery approach is needed to characterize and data mine the pure compounds that have cytotoxic effects against breast cancer cells.
1.3 Scope of Study
The scope of the study covers techniques for knowledge discovery from databases that store information about pure compounds with cytotoxic effects against breast cancer. In addition, the different categories and mechanisms involved in producing cytotoxic effects against breast cancer cells will be studied so that more informative and useful information can be organized and classified for users. Genes involved in the cytotoxic mechanisms will also be studied.
The purpose of this study is to develop a database that provides suitable information on pure compounds that have cytotoxic effects against breast cancer cells.
The objectives of the project are:
To identify pure compounds that have cytotoxic effects against breast cancer cells.
To provide the chemical structure and description of the relevant pure compounds.
To identify the genes that are affected and the mechanisms used by the pure compounds.
To store these data in a database that is connected to informative databases whose information is updated from time to time.
To develop an interface to the knowledge base that provides a Graphical User Interface (GUI) front end.
To develop efficient information retrieval (keyword-based retrieval and query-based retrieval) from the knowledge base.
Background and Literature Search
This chapter briefly reviews basic cytotoxicity information, drug measurement information, database applications and data mining techniques that will be used in developing a database for pure compounds that have cytotoxic effects on breast cancer.
Before proceeding to the steps and techniques needed to build the database for pure compounds that have cytotoxic effects against breast cancer, we need to understand what cytotoxicity is and the relevant terms that commonly appear when cytotoxicity is discussed.
Cytotoxicity is the quality of being toxic to cells. Cells exposed to cytotoxic compounds can die by either necrosis or apoptosis. The morphological criteria of apoptosis make it clearly distinct from necrosis. (Leist et al., 1998) Apoptosis, or programmed cell death, is a process that occurs during the development of multicellular organisms. After a cell receives specific signals that trigger apoptosis, it starts to shrink, the chromatin in the nucleus breaks down and causes nuclear condensation, and the DNA is cleaved into fragments of regular size. The cell is then packed into apoptotic bodies that await phagocytosis by macrophages. (Dash, n.d) Necrosis, on the other hand, is an uncontrolled form of cell death that leads to cell lysis. Cells undergoing necrosis may lose membrane integrity, swell rapidly, shut down their metabolism and die quickly. Finally, the cells release their contents into the extracellular matrix. (Golstein and Kroemer, 2006)
2.3 Drug Measurement Information
The therapeutic index, or therapeutic ratio, of a drug is a basic quantitative indicator of its safety. It is the ratio (LD50/ED50) of the lethal dose of a drug for 50% of the population (LD50) to its therapeutic or effective dose for 50% of the population (ED50). The effective dose is the amount of drug that produces a therapeutic response in 50% of the people who take it, while the lethal dose is the amount of drug that kills half of the population that takes it. A higher therapeutic index is preferable: it corresponds to a safer drug, one for which a much higher dose is needed to reach the lethal threshold than to elicit the therapeutic effect. (Cannon, 2007) Lethal concentration (LC) refers to the concentration of a chemical in air or in water, and LC50 is the concentration of a chemical that kills 50% of test animals in a given time. (Canadian Centre for Occupational Health and Safety, 2006)
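The ratio described above can be illustrated with a small calculation. This is only a sketch: the function name and the dose values are hypothetical and are not taken from any cited source.

```python
def therapeutic_index(ld50: float, ed50: float) -> float:
    """Return LD50 / ED50. Both doses must use the same unit (e.g. mg/kg)."""
    if ld50 <= 0 or ed50 <= 0:
        raise ValueError("doses must be positive")
    return ld50 / ed50


# Hypothetical doses in mg/kg, chosen only to illustrate the comparison.
drug_a = therapeutic_index(ld50=500.0, ed50=5.0)   # wide safety margin
drug_b = therapeutic_index(ld50=20.0, ed50=10.0)   # narrow safety margin
print(drug_a, drug_b)
```

Under these assumed values the first drug would be considered safer, because its lethal dose is much further from its effective dose.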
Two other measurements are frequently reported for therapeutic agents: the half maximal inhibitory concentration (IC50) and the half maximal effective concentration (EC50). IC50 measures effectiveness in competition binding assays and functional antagonist assays, as the concentration that inhibits a biological or biochemical function by 50%; EC50 is usually used for agonist or stimulator assays, as the concentration that produces 50% of the maximal response. (Eli Lilly and Company, 2008)
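As a rough illustration of how an IC50 might be read off dose-response measurements, the sketch below interpolates linearly on the logarithm of concentration between the two measured points that bracket 50% inhibition. The interpolation method, the function name and the data are assumptions made for illustration; assay guidelines typically fit a full dose-response curve instead.

```python
import math


def estimate_ic50(concentrations, inhibition_percent):
    """Estimate IC50 by log-linear interpolation.

    `concentrations` must be positive and ascending; `inhibition_percent`
    holds the measured inhibition (0 to 100) at each concentration.
    Returns None if the measurements never cross 50%.
    """
    points = list(zip(concentrations, inhibition_percent))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        if y_lo <= 50.0 <= y_hi and y_lo != y_hi:
            fraction = (50.0 - y_lo) / (y_hi - y_lo)
            log_lo, log_hi = math.log10(c_lo), math.log10(c_hi)
            return 10 ** (log_lo + fraction * (log_hi - log_lo))
    return None


# Hypothetical measurements (concentration in micromolar, inhibition in %).
ic50 = estimate_ic50([0.1, 1.0, 10.0, 100.0], [10.0, 30.0, 70.0, 95.0])
print(ic50)
```

The same routine could be applied to EC50 by supplying the percentage of maximal response instead of the percentage of inhibition.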
2.3 System Overview and Workflow
The figure below shows an overview of the system and its workflow.
The system will obtain data from public domain databases (PDD), namely the National Center for Biotechnology Information (NCBI), PubMed and Online Mendelian Inheritance in Man (OMIM). NCBI provides gene function and drug information, while PubMed and OMIM provide information about gene interactions. (Balajee and Dhanarajan, 2009)
NCBI is a center that works on "uncovering new knowledge". It houses genome sequences in GenBank (a genome database), a research article database (PubMed) and other information relevant to biotechnology. Its databases are available online and are searched with the Entrez search engine. (NCBI, 2004) PubMed is a free, premier search system for health information on the Internet, owned by the United States National Library of Medicine (NLM). PubMed contains MEDLINE, NLM's database, which stores an enormous number of fully indexed references to articles published in biomedical and related journals. PubMed also carries citations that are still being analyzed and indexed for MEDLINE, as well as citations that may not receive full indexing for MEDLINE. (NN/LM staff, 2010) OMIM is a database, updated daily, that contains information on all known human disorders and over 12,000 genes, and focuses on the relationship between phenotype and genotype. (OMIM, n.d)
After data of different kinds have been imported from the different databases, they will be preprocessed before any information is extracted. Users can enter the name or the 2-dimensional chemical structure of a compound as a query. The data will then be processed by the database and the data mining algorithm. The output will be the 2-dimensional chemical structure of pure compounds that have cytotoxic effects against breast cancer, together with a description of each compound and the genes involved in the cytotoxic mechanisms in the cell lines.
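One way a system could query PubMed programmatically is through NCBI's Entrez Programming Utilities (E-utilities), which is an assumption here since the text only mentions the Entrez search engine. The sketch below only builds an `esearch` request URL and does not send it; the function name and search term are illustrative.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities esearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term: str, max_results: int = 20) -> str:
    """Build (but do not send) an esearch request for PubMed record IDs."""
    params = {
        "db": "pubmed",         # Entrez database to search
        "term": term,           # search expression
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # response format
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


url = build_pubmed_search_url("cytotoxic AND breast cancer")
print(url)
```

A scheduled job could fetch such URLs to keep the local database in step with PubMed, subject to NCBI's usage policies.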
2.4 Database Application
An anticancer drug discovery program has been conducted by the United States National Cancer Institute (NCI). In that program, around 10,000 compounds are screened in vitro every year against a panel of 60 human cancer cell lines from different organs, and the results are stored in different databases. The purpose of this screening is to provide an initial evaluation of compounds for cytotoxic or growth inhibitory activity against a diverse panel of cancer cell types. (Shi et al., 1997) More than 62,000 synthetic and natural compounds have been tested. By applying several algorithms, information about the molecular pharmacology of cancer has been gained from the discovery of anticancer drugs. This information involves three kinds of databases: anticancer activity data for compounds across the 60 human tumor cell lines, chemical structure information for the tested compounds, and information on possible targets or modulators of activity in the 60 cell lines.
Several databases are therefore needed to store all the data required by this system. These data comprise pure or synthetic compounds that have cytotoxic effects on breast cancer cell lines, information or journal articles about each compound's chemical structure, the genes involved in the cytotoxic mechanisms, and cytotoxic profiles against breast cancer.
According to NCBI News (National Center for Biotechnology Information, 1994), users with a direct Internet connection can use Network Entrez free of charge. They can install the necessary network software (MacTCP for Macintosh or one of several TCP/IP software packages for Windows PCs) and the retrieval software, which are available via anonymous FTP, as long as a local network administrator takes responsibility for establishing and maintaining the Internet connection. Users who have Internet access but lack the local support necessary for Network Entrez can use a version of Entrez that has been adapted for use with Mosaic or World Wide Web hypertext-based information services.
To meet the needs of this system, the system database will be updated daily by connecting it to the relevant databases over the Internet.
2.5 Data Mining Background
This project will use large data sets about natural compounds that have cytotoxic effects against breast cancer cells, the genes involved and other relevant information about those compounds. All these data will be stored in various databases. Data mining techniques will be needed to retrieve useful information from the large volumes of data in these databases.
Data mining is often treated as a synonym for Knowledge Discovery from Data (KDD). Strictly speaking, data mining is one of the steps in achieving knowledge discovery in databases. It refers to extracting or "mining" knowledge from large amounts of data. (Han et al., 2006) As the amount and complexity of information in databases have increased, data mining techniques have gained more attention in various areas. Data mining is a process that identifies valid, potentially useful and ultimately understandable patterns from large collections of data. Fundamental techniques such as clustering and decision trees are now used in bioinformatics.
The intended use of a system usually defines its data mining goals. This system is intended to group tested and identified pure compounds against breast cancer into databases, and to enable researchers to analyze and compare them. It also gives researchers the opportunity to test the compounds further in order to study the underlying mechanisms against breast cancer. The goal of the system is therefore descriptive: to acquire data on pure compounds that have cytotoxic effects against breast cancer, their chemical structures, descriptions of those compounds and the genes involved in the cytotoxic mechanism.
2.5.2 Seven Steps of Data Mining
According to Han et al., data mining proceeds through seven basic steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge presentation.
2.5.2.1 Data Cleaning
Raw data that are noisy, missing or inconsistent are not suitable for data mining. According to Wang et al., the National Cancer Institute holds 257,547 compounds for the tumor cell lines, of which 44,653 have cell line screening data, and the cell lines have gene expression data consisting of 961 gene expression values for each cell line. Thus, after a massive amount of data has been retrieved from the various online databases, the data need to be cleaned or rearranged so that only the data needed by this system are kept.
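A minimal sketch of such a cleaning pass is given below, assuming records arrive as dictionaries. The field names (`compound`, `cell_line`, `gi50`) and the sample values are hypothetical; the rules shown (trim whitespace, drop records with missing required fields, drop duplicates) are only one possible choice.

```python
def clean_records(records, required_fields, key_field):
    """Normalize strings, drop incomplete records and drop duplicates."""
    cleaned, seen = [], set()
    for record in records:
        # Trim stray whitespace from every string value.
        normalized = {
            k: v.strip() if isinstance(v, str) else v for k, v in record.items()
        }
        # Skip records in which a required field is missing or empty.
        if any(normalized.get(field) in (None, "") for field in required_fields):
            continue
        # Treat names that differ only by letter case as duplicates.
        key = normalized[key_field].lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(normalized)
    return cleaned


# Hypothetical raw records; the values are placeholders, not real screening data.
raw = [
    {"compound": " Adriamycin ", "cell_line": "MCF7", "gi50": 0.05},
    {"compound": "adriamycin", "cell_line": "MCF7", "gi50": 0.05},  # duplicate
    {"compound": "Compound X", "cell_line": "", "gi50": 1.2},       # missing field
    {"compound": "Compound Y", "cell_line": "MCF7", "gi50": None},  # missing value
]
cleaned = clean_records(raw, ["compound", "cell_line", "gi50"], "compound")
print(cleaned)
```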
2.5.2.2 Data Integration
Data integration is the process of merging data from multiple data stores. In this project, data from OMIM, PubMed and NCBI need to be merged into forms appropriate for mining. (Han et al., 2006)
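The merge could look roughly like the sketch below, which assumes that records from each source have already been reduced to dictionaries sharing a `compound` name. All field names and sample values are hypothetical placeholders.

```python
from collections import defaultdict


def integrate(articles, structures, genes):
    """Merge per-source records into one entry per compound name."""
    merged = defaultdict(lambda: {"articles": [], "structure": None, "genes": []})
    for article in articles:
        merged[article["compound"]]["articles"].append(article["title"])
    for structure in structures:
        merged[structure["compound"]]["structure"] = structure["smiles"]
    for gene in genes:
        merged[gene["compound"]]["genes"].append(gene["gene"])
    return dict(merged)


# Placeholder inputs standing in for literature, structure and gene sources.
articles = [{"compound": "Compound A", "title": "Example article title"}]
structures = [{"compound": "Compound A", "smiles": "CCO"}]
genes = [
    {"compound": "Compound A", "gene": "GENE1"},
    {"compound": "Compound A", "gene": "GENE2"},
]
merged = integrate(articles, structures, genes)
print(merged["Compound A"])
```

In practice the join key would need more care than a plain name match, since the same compound can appear under several synonyms.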
2.5.2.3 Data Selection
Data selection, usually carried out through data reduction, is the process of obtaining a representation of the data that is much smaller but closely maintains the integrity of the original data. Data that have been integrated from the selected databases need to be selected in order to decrease redundancy. (Wang et al., 2007)
According to Wang et al. (2007), there are several well-established methods of characterizing compounds by chemical properties or structural features: calculating and profiling predicted property values and comparing them with two other well-established data sets, and comparing 2-dimensional fingerprints based on the structural features in one of the data sets.
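Fingerprint comparison is often done with the Tanimoto coefficient; the text does not say which measure Wang et al. used, so its use here is an assumption. In this sketch a fingerprint is modelled simply as the set of indices of the structural features present in a compound, and the example fingerprints are made up.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient: shared features divided by all features present."""
    union = fp_a | fp_b
    if not union:
        return 1.0  # two empty fingerprints are treated as identical
    return len(fp_a & fp_b) / len(union)


# Hypothetical fingerprints: indices of structural features that are present.
compound_1 = {1, 4, 7, 9}
compound_2 = {1, 4, 8, 9}
similarity = tanimoto(compound_1, compound_2)
print(similarity)
```

A value near 1 indicates that two compounds share most of their structural features, and a value near 0 indicates that they share few.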
2.5.2.4 Data Transformation
Data transformation is a technique that removes noise from data and replaces low-level data with higher-level concepts. Noise, such as data that are not related to pure compounds with cytotoxic effects against breast cancer cell lines, will be removed by smoothing, aggregation, generalization or attribute construction. (Wang et al., 2007)
2.5.2.5 Data Mining
Clustering, association rules, decision trees or other algorithms will be used in data mining to generate interesting and usable data for users. (Wang et al., 2007) In the NCI Developmental Therapeutics Program (DTP), back-propagation neural networks, Kohonen self-organizing maps, principal component analysis, hierarchical cluster analysis and multidimensional scaling have been used to predict mechanisms of action or to organize compounds into families on the basis of activity patterns. (Shi et al., 1997)
Another study, by Glover et al., reports that DTP has used a unique set of computational tools to analyze and sort the complete set of NCI's freely available antitumor drug-screening data, incorporating structural chemotypes as well as biological data into self-organizing maps (SOMs). The reorganized data revealed clusters of compounds that exhibit similar toxicity profiles toward the NCI's panel of around 80 cell lines.
2.5.2.6 Pattern Evaluation
Pattern evaluation is the process of building a proper measurement model that lets users test the accuracy of the system, prunes uninteresting patterns and speeds up the mining process.
2.5.2.7 Knowledge Presentation
Lastly, data mining results need to be presented to users, who are usually not experts in data mining but specialists in the application area, before the results can be deployed. Visualization and knowledge representation techniques will be used to present the results in a form users can understand. For instance, NCI's DTP uses the 2-dimensional structure of compounds as its visual presentation for users.
2.5.3 Similar System by NCI
The National Cancer Institute in the United States has tested compounds for their ability to inhibit the growth of human tumor cell lines in culture for ten years. Work has been done to demonstrate that compounds with a similar mechanism of growth inhibition show similar patterns of activity in the NCI screen. This observation was developed into an algorithm called COMPARE, which has been used successfully to predict mechanisms for a wide variety of compounds. More recently, the method has been extended to associate patterns of cell growth inhibition by compounds with measurements of molecular entities, such as gene expression, in the cell lines of the NCI screen. The COMPARE method and associated data are freely available on the Developmental Therapeutics Program (DTP) web site (http://dtp.nci.nih.gov/). (Zaharevitz et al., 2001)
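The idea of comparing activity patterns can be sketched as follows. This is not the NCI implementation: it assumes that similarity between two patterns is measured with the Pearson correlation coefficient, and the activity values (one number per cell line) and compound labels are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long activity patterns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat pattern carries no shape information
    return cov / (sd_x * sd_y)


def rank_by_similarity(seed_pattern, library):
    """Rank library compounds by correlation with the seed, best first."""
    scores = [(name, pearson(seed_pattern, p)) for name, p in library.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical activity patterns across four cell lines.
seed = [1.0, 2.0, 3.0, 4.0]
library = {
    "similar": [2.0, 4.0, 6.0, 8.0],
    "opposite": [4.0, 3.0, 2.0, 1.0],
    "flat": [1.0, 1.0, 1.0, 1.0],
}
ranking = rank_by_similarity(seed, library)
print(ranking)
```

Compounds at the top of such a ranking would be candidates for sharing a mechanism with the seed compound, which would then need experimental confirmation.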
Back-propagation neural networks, Kohonen self-organizing maps, principal component analysis, hierarchical cluster analysis and multidimensional scaling have been used to predict mechanisms of action or to organize compounds into families on the basis of activity patterns. This "information-intensive" approach to the molecular pharmacology of cancer and drug discovery has proved useful in identifying subgroups of compounds related to particular biological targets. (Shi et al., 1997)
The DTP web site is used to check chemical compounds that have been tested on different cancer cell lines. The screen of sixty human tumor cell lines can be found on this page.
On this page, any word or phrase describing synthetic or pure compounds that have effects on cancer cell lines can be entered into the query box to retrieve relevant information.
This is the results page for a search on one of the compounds, "Adriamycin", a topoisomerase II inhibitor. The search returns 37 pages of information about "Adriamycin".
On this page, users can search data by the different categories listed on the page.
This DTP-COMPARE page is the routine implementation of the NCI in vitro anticancer screen, used to evaluate the efficacy of synthetic compounds and natural products.
2.6 Problems on Similar System
The DTP system is not a fully developed web site; some of its functions, such as the Pathways to Development Services, are still under development. Information about drug pathways or mechanisms therefore cannot be obtained.
2.7 Proposed System
An online system will be developed that allows users to query pure compounds that have cytotoxic effects towards breast cancer cell lines by 2-dimensional structure or by name. Data mining will be used to cluster or group pure compounds that have similar cytotoxic profiles towards breast cancer cell lines. A web site interface will be built for information retrieval by users. Relevant data about pure compounds that have cytotoxic effects against breast cancer cell lines can then be retrieved from the system.
This part is fundamental to constructing the database for pure compounds that have cytotoxic effects on breast cancer. Following the review of cytotoxicity, database techniques and data mining techniques, this section describes the tools needed to design and develop the system, such as software, hardware and programming languages.
3.2 System Development Life Cycle (SDLC)
The System Development Life Cycle (SDLC) is a well-known process for creating systems in information technology (IT) projects. SDLC plays an important role in building the database for pure compounds that have cytotoxic effects on breast cancer. It is divided into six phases: planning, analysis, design, implementation, testing and maintenance (as shown in Figure 3.1). Each phase is connected to the next. Each phase has to be completed properly, with all aspects considered thoroughly, before the next phase begins. If users wish to add or change features of the system, the cycle starts again from the planning phase.
In the planning phase, the goals, objectives and scope of the online system are identified. Based on the literature review in Chapter 2, no model has been implemented specifically for pure compounds that have cytotoxic effects towards breast cancer. Only a few models provide both synthetic and pure anticancer compounds for all cancer cell lines. The goals of this system are therefore to group and analyze pure compounds that have cytotoxic effects against breast cancer, and to enable researchers to compare these compounds and test them further to study the underlying mechanisms against breast cancer.
The system architecture is planned as shown in the figure above: information is imported from different online databases and collected in the system database. Users can enter queries about pure compounds that have cytotoxic effects on breast cancer. After the data mining process, the 2D chemical structure and description of the pure compounds that have cytotoxic effects, and the affected breast cancer genes, will be generated as output for users.
In the analysis phase, a feasibility study covering technical, legal and economic feasibility will be conducted. Technical feasibility considers the evaluation of the software and hardware that will be used to build the system. Legal feasibility mainly considers the right to implement the system and the legal use of hardware and software. Most of the software chosen for this project is either open source or freeware, so no legal issues are expected in the software and hardware aspects. Economic feasibility covers economic and marketing aspects such as break-even calculation, cost-benefit, revenue, net profit, lost computing, advertisement cost and others. Economic feasibility is not assessed in this project because the project is not intended for marketing or business purposes.
System and database analysis will be depicted in a Data Flow Diagram (DFD) and an Entity Relationship (ER) diagram. A DFD represents the flow of data between a system and its environment and among system components. (Valacich et al., 2009a) An entity-relationship diagram is a graphical representation of an entity-relationship model, which logically represents the entities, associations and data elements of the system. (Valacich et al., 2009b)
A context diagram is an overview of a system that shows the system boundaries, the external entities that interact with the system and the major information flows between the entities and the system. (Valacich et al., 2009a)
The main process of this system is to data mine all information relevant to pure compounds that have cytotoxic effects on breast cancer. Information will be retrieved from the public databases PubMed, OMIM and NCBI, processed, and used to generate output for users.
First, literature articles, chemical structures of cancer drugs and cancer genes will be transferred from PubMed, NCBI and OMIM respectively. Each file will be updated, transformed into another file type and stored in the corresponding data store. Data mining will be performed on each data store to retrieve pure compounds that have cytotoxic effects on breast cancer. The results, comprising the description and 2D chemical structure of the pure compounds and the breast cancer genes involved in their cytotoxicity, will be generated in answer to users' queries.
After all information has been preprocessed and stored in each database, a main database of pure compounds that have cytotoxic effects on breast cancer (as shown in Figure 3-5) will be built.
Business Rules
Within the database, the pure compounds table, which stores pure compounds that have cytotoxic effects on breast cancer cells, is the main entity. It is linked to the articles table, the breast cancer genes table and the 2D chemical structure table. Each pure compound may have many articles about it and may exert cytotoxic effects through many breast cancer genes, but it has only one 2D chemical structure.
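These business rules could be expressed as a relational schema along the following lines. The sketch uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server; all table and column names are hypothetical, and the linking table `compound_gene` is an assumption that allows one gene to be associated with several compounds.

```python
import sqlite3

SCHEMA = """
CREATE TABLE pure_compound (
    compound_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
-- One compound may have many articles.
CREATE TABLE article (
    article_id  INTEGER PRIMARY KEY,
    compound_id INTEGER NOT NULL REFERENCES pure_compound(compound_id),
    title       TEXT NOT NULL
);
CREATE TABLE breast_cancer_gene (
    gene_id INTEGER PRIMARY KEY,
    symbol  TEXT NOT NULL UNIQUE
);
-- One compound may involve many genes (linking table).
CREATE TABLE compound_gene (
    compound_id INTEGER NOT NULL REFERENCES pure_compound(compound_id),
    gene_id     INTEGER NOT NULL REFERENCES breast_cancer_gene(gene_id),
    PRIMARY KEY (compound_id, gene_id)
);
-- The primary key on compound_id allows only one structure per compound.
CREATE TABLE chemical_structure (
    compound_id  INTEGER PRIMARY KEY REFERENCES pure_compound(compound_id),
    structure_2d TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
print(tables)
```

Making `compound_id` the primary key of `chemical_structure` is the design choice that enforces the rule that a compound has only one 2D structure.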
The system interface is designed to make the system user friendly, since users may not know how the system works internally. The interface will use buttons, hyperlinks, drop-down menus, pop-up menus, pop-up dialogues and query entry boxes as its main components to make the system easy to use. Buttons represent features such as the home, navigation and entry buttons; hyperlinks connect to other pages of the system or highlight important notes for users. Drop-down and pop-up menus give users a choice of specific commands in the system. Pop-up dialogues provide instant communication with users, either to get a response or to give information; for example, an error message pops up when a user enters wrong information. The query entry box is where users enter their query, such as a gene name, a pure compound's name or the 2-dimensional chemical structure of a pure compound.
Implementation will be carried out in the second phase of the project. During implementation, the design will be transformed into a working system using appropriate hardware, software and programming languages.
Testing can only be carried out after the system has been built. The system must be tested under many circumstances and conditions to ensure that there are no mistakes in the data mining technique. Unit testing will be done for each type of user query, and integration testing will be done for the different selection criteria available to users.
Maintenance is important once the system is complete. It is needed to enhance or improve the system based on future user requirements.
3.3 Justification of Tools and Techniques
3.3.1 XAMPP Apache Friends Server
XAMPP is an easy-to-install Apache distribution that bundles free software including the Apache web server, the MySQL database, the PHP scripting language and the Perl programming language. (Apache Friends) The Apache web server is used to set up a server that can host databases with different functions. It serves HyperText Markup Language (HTML) documents to users over the Hypertext Transfer Protocol (HTTP). MySQL will be used to set up the database of pure compounds that have cytotoxic effects against breast cancer; PHP code will be interpreted on the server and its output sent to the web page; and Perl will be used to write algorithms for data extraction.
3.3.2 FileZilla FTP Client
The File Transfer Protocol (FTP) is used to copy files from one host to another over a network such as the Internet. FileZilla Client is free, open source, cross-platform software that allows a client to connect to an FTP server run by others. (FileZilla, n.d)
3.3.3 Hyper Text Markup Language (HTML)
HTML is not a programming language but a markup language made up of a set of markup tags that are used to describe web pages. An HTML document, also known as a web page, contains HTML tags and plain text. It can be retrieved across the Internet with a web browser. (University College Cork, n.d)
3.3.4 Hyper Text Transfer Protocol (HTTP)
HTTP is the protocol for transferring, linking and browsing hypertext documents on the Web. (Boswell, n.d; UITS, 2010)
PROPOSED SOLUTION AND IMPLEMENTATION PLAN / DESIGN
This chapter focuses on the design of the project before the system is implemented and on the plan for carrying the system through until it is fully built. The system flow chart is shown in Figure 4-1 and discussed in Section 4.2. The data mining algorithm is discussed in Section 4.3, and the project targets and milestones are discussed in Section 4.4.
4.2 System Flow
The architecture of this system is depicted in a flow chart (as shown in Figure 4-1). The flow chart starts from the user's query, which can be the name of a target gene affected by pure compounds, a pure compound's name or a pure compound's 2-dimensional chemical structure. After the system has processed the query, it displays all the information matching the query: the cytotoxicity of the pure compound, its description and chemical structure, and the genes affected by it. In addition, a sketch of the system interface will be designed.
4.3 Proposed Data Mining Technique
Because this system will extract a variety of data from several databases, data mining techniques have to be applied to select and mine the data before they can be used for knowledge discovery. Clustering is chosen to group the data into categories for data mining. The figures below show clusters for different types of data: Figure 4-1 shows clusters grouped according to literature information, while Figure 4-2 shows clusters grouped according to 2-dimensional chemical structure information.
After clustering has been performed, knowledge can be extracted easily when users enter a query into the system.
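A minimal sketch of clustering by 2-dimensional structure is given below. It assumes that each compound is described by a set of structural feature indices, that similarity is the Tanimoto coefficient, and that single-linkage grouping with a fixed threshold is acceptable; none of these choices is fixed by the design above, and the fingerprints are invented.

```python
def tanimoto(fp_a, fp_b):
    """Shared features divided by all features present in either compound."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 1.0


def cluster_by_similarity(fingerprints, threshold):
    """Single-linkage grouping: a compound joins (and merges) every existing
    cluster that contains at least one member similar enough to it."""
    clusters = []
    for name, fp in fingerprints.items():
        linked = [
            c for c in clusters
            if any(tanimoto(fp, fingerprints[member]) >= threshold for member in c)
        ]
        merged = [name]
        for cluster in linked:
            merged.extend(cluster)
            clusters.remove(cluster)
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical fingerprints for four compounds.
fingerprints = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {10, 11, 12},
    "D": {10, 11, 13},
}
clusters = cluster_by_similarity(fingerprints, threshold=0.5)
print(clusters)
```

Once compounds are grouped in this way, a query for one compound could return the other members of its cluster as structurally related candidates.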
4.4 Project Milestones
The project milestones follow the System Development Life Cycle (SDLC). This project, Developing Database for Pure Compounds that have Cytotoxic Effects on Breast Cancer, is divided into two phases: Project I and Project II.
4.4.1 Project I
During Project I, the interim report and system design are discussed and prepared. The tasks in this phase are planning, analysis and design.
In planning, discussions are held and decisions are made on the requirements, objectives and scope of Developing Database for Pure Compounds that have Cytotoxic Effects on Breast Cancer.
During analysis, research and references on the cytotoxicity information to be considered when developing the system, on data mining algorithms and on database techniques are gathered and analyzed.
The system architecture, flow chart and data mining algorithm are constructed in this step.
Gantt Chart for Project I
According to Gantt, a Gantt chart is a graphical illustration of a project that shows each task as a horizontal bar whose length is proportional to the time needed for completion. The Gantt chart for the Project I milestones is shown below.
4.4.2 Project II
During the second phase of the project, a functional, well-designed system has to be implemented, followed by testing and maintenance of the system.
The code for the algorithms and the system will be written in this phase. Before the code is written, pseudocode for the algorithms will be constructed.
User queries have to be validated to avoid misclassified and misleading data about pure compounds that have cytotoxic effects on breast cancer.
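Query validation might be sketched as below. The three query types mirror the inputs named in the system flow (gene name, compound name, 2-dimensional structure); the allowed character set, the length limit and the messages are assumptions made for illustration.

```python
import re

ALLOWED_QUERY_TYPES = {"gene", "compound", "structure"}
# Assumed rule: names start with a letter or digit and are at most 100 characters.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9 ,()'\-]{0,99}$")


def validate_query(query_type: str, value: str):
    """Return (True, cleaned_value) or (False, error_message)."""
    if query_type not in ALLOWED_QUERY_TYPES:
        return False, "unknown query type"
    text = value.strip()
    if not text:
        return False, "query is empty"
    if query_type in {"gene", "compound"} and not NAME_PATTERN.match(text):
        return False, "name contains unsupported characters"
    return True, text


print(validate_query("compound", "  Adriamycin "))
print(validate_query("gene", "   "))
print(validate_query("compound", "x; DROP TABLE"))
```

Rejecting unexpected characters before the query reaches the database also reduces the risk of malformed or malicious input, although parameterized database queries would still be needed.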
The system will be maintained from time to time to improve the data mining algorithms used to search for information on pure compounds that have cytotoxic effects on breast cancer.
Gantt Chart for Project II
The Gantt chart for the Project II milestones is shown below.