# The Process Of Data Mining Biology Essay


In recent times, our capacity to generate and collect data has grown rapidly. The widespread use of bar codes on most commercial products, the computerization of many business and government transactions, and advances in data collection tools have provided us with enormous amounts of data. Millions of databases are now used in business management, government administration and many other kinds of applications, and their number is growing rapidly because of the availability of powerful and affordable database systems. This explosive growth in data and databases has created an urgent need for new techniques and tools that can transform the data into useful information and knowledge.

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Data mining is primarily used today by companies with a strong consumer focus - retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. And, it enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detail transactional data.

With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments.

For example, Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers. American Express can suggest products to its cardholders based on analysis of their monthly expenditures.

WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display level. They use this information to manage local store inventory and identify new merchandising opportunities. In 1995, WalMart computers processed over 1 million complex data queries.

The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four jump shots and made each one! Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game.

By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knick's defense and then finds Williams for an open jump shot.

The most commonly used techniques in data mining are:

Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID) .

Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.

Nearest neighbour method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbour technique.

Rule induction: The extraction of useful if-then rules from data based on statistical significance.
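As a concrete illustration of the nearest neighbour method listed above, here is a minimal sketch in Python; the tiny `history` dataset and the "casual"/"loyal" customer-segment labels are invented for the example:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among the k nearest
    training records (Euclidean distance). `train` is a list of
    (features, label) pairs."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical historical dataset: (annual spend, visits/month) -> segment
history = [((120, 2), "casual"), ((90, 1), "casual"),
           ((640, 9), "loyal"), ((580, 7), "loyal"),
           ((700, 11), "loyal"), ((60, 1), "casual")]
print(knn_classify(history, (600, 8), k=3))  # prints "loyal"
```

A new record is simply assigned the majority class of its k closest historical records, which is why the technique needs no training phase beyond storing the data.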

Q: 1

## Using the Laws of Set Algebra, Simplify the Following

## A ∩ (Ā ∪ B)

Sol:

A ∩ (Ā ∪ B)

= (A ∩ Ā) ∪ (A ∩ B) (distributive law)

= ∅ ∪ (A ∩ B) (complement law)

= A ∩ B (identity law)

## (Ā ∪ B̄) ∩ (A ∩ B)

Sol:

(Ā ∪ B̄) ∩ (A ∩ B)

= (A ∩ B)ᶜ ∩ (A ∩ B) (De Morgan's law)

= ∅ (complement law)

## (A ∪ B) ∩ (A ∪ B̄)

Sol:

(A ∪ B) ∩ (A ∪ B̄)

= A ∪ (B ∩ B̄) (distributive law)

= A ∪ ∅ (complement law)

= A (identity law)

## (Ā ∩ B̄) ∪ (A ∪ B)

Sol:

(Ā ∩ B̄) ∪ (A ∪ B)

= (A ∪ B)ᶜ ∪ (A ∪ B) (De Morgan's law)

= U (complement law)

## (A ∪ B ∪ C) ∩ (A ∪ B ∪ C̄) ∩ (A ∪ B̄)

Sol:

(A ∪ B ∪ C) ∩ (A ∪ B ∪ C̄) ∩ (A ∪ B̄)

= ((A ∪ B) ∪ (C ∩ C̄)) ∩ (A ∪ B̄) (distributive law)

= ((A ∪ B) ∪ ∅) ∩ (A ∪ B̄) (complement law)

= (A ∪ B) ∩ (A ∪ B̄) (identity law)

= A ∪ (B ∩ B̄) (distributive law)

= A ∪ ∅ (complement law)

= A (identity law)

## (A ∪ B ∪ C) ∩ (A ∪ (B ∩ C))

Sol:

(A ∪ B ∪ C) ∩ (A ∪ (B ∩ C))

= (A ∪ B ∪ C) ∩ ((A ∪ B) ∩ (A ∪ C)) (distributive law)

= (A ∪ B ∪ C) ∩ (A ∪ B) ∩ (A ∪ C) (associative law)

= ((A ∪ B) ∪ (C ∩ ∅)) ∩ (A ∪ C) (distributive law, writing the first two factors as ((A ∪ B) ∪ C) ∩ ((A ∪ B) ∪ ∅))

= ((A ∪ B) ∪ ∅) ∩ (A ∪ C) (domination law)

= (A ∪ B) ∩ (A ∪ C) (identity law)

= A ∪ (B ∩ C) (distributive law)
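The simplifications above can be sanity-checked by brute force, testing each identity over every pair (and triple) of subsets of a small universe; this is a supplementary check, not part of the original working:

```python
from itertools import chain, combinations

U = set(range(5))                       # a small universal set
subsets = [set(c) for c in chain.from_iterable(
    combinations(sorted(U), r) for r in range(len(U) + 1))]

def comp(s):
    """Complement relative to the universal set U."""
    return U - s

for A in subsets:
    for B in subsets:
        assert A & (comp(A) | B) == A & B                # A ∩ (Ā ∪ B)
        assert (comp(A) | comp(B)) & (A & B) == set()    # (Ā ∪ B̄) ∩ (A ∩ B)
        assert (A | B) & (A | comp(B)) == A              # (A ∪ B) ∩ (A ∪ B̄)
        assert (comp(A) & comp(B)) | (A | B) == U        # (Ā ∩ B̄) ∪ (A ∪ B)
        for C in subsets:
            assert (A | B | C) & (A | B | comp(C)) & (A | comp(B)) == A
            assert (A | B | C) & (A | (B & C)) == A | (B & C)
print("all identities hold")
```

Every identity holds for all 32 × 32 subset pairs of this universe, which is strong evidence the algebraic simplifications are correct.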


## Q: 2

## Two sets A and B belong to the same universal set U. The difference A − B between the two sets is a third set whose elements are those elements of A that are not in B.

## Satisfy yourself that A − B = A ∩ B̄, then verify the following using Venn diagrams:

## U − A = Ā

## (A − B) ∪ B = A ∪ B

## C ∩ (A − B) = (C ∩ A) − (C ∩ B)

## (A ∪ B) ∪ (B − A) = A ∪ B

Sol:

Let x be any arbitrary element of the set U − A. Then

x ∈ U − A

⇒ x ∈ U and x ∉ A

⇒ x ∈ Ā ………1

Conversely, let y be any arbitrary element of the set Ā. Then

y ∈ Ā

⇒ y ∈ U and y ∉ A

⇒ y ∈ U − A ………2

From 1 and 2, U − A = Ā.

(Venn diagrams: U − A and Ā shade the same region.)

(A − B) ∪ B = A ∪ B

Sol:

Let x be any arbitrary element of the set (A − B) ∪ B. Then

x ∈ (A − B) ∪ B

⇒ (x ∈ A and x ∉ B) or x ∈ B

⇒ (x ∈ A or x ∈ B) and (x ∉ B or x ∈ B)

⇒ x ∈ A or x ∈ B

⇒ x ∈ (A ∪ B) ………1

Conversely, let y be any arbitrary element of A ∪ B. If y ∈ B then y ∈ (A − B) ∪ B; if y ∈ A and y ∉ B then y ∈ A − B, so again y ∈ (A − B) ∪ B ………2

From 1 and 2 we get (A − B) ∪ B = A ∪ B.

(Venn diagrams: A − B, (A − B) ∪ B and A ∪ B shade matching regions.)

(A ∪ B) ∪ (B − A) = A ∪ B

Let x be any arbitrary element of the set (A ∪ B) ∪ (B − A). Then

x ∈ (A ∪ B) ∪ (B − A)

⇒ x ∈ A or x ∈ B or (x ∈ B and x ∉ A)

⇒ x ∈ A or x ∈ B (the third case already gives x ∈ B)

⇒ x ∈ (A ∪ B) ………1

Conversely, any y ∈ A ∪ B certainly belongs to (A ∪ B) ∪ (B − A) ………2

From 1 and 2 we get (A ∪ B) ∪ (B − A) = A ∪ B.

(Venn diagrams: A ∪ B, B − A and (A ∪ B) ∪ (B − A) shade matching regions.)

C ∩ (A − B) = (C ∩ A) − (C ∩ B)

Let x be any arbitrary element of the set C ∩ (A − B). Then

x ∈ C ∩ (A − B)

⇒ x ∈ C and (x ∈ A and x ∉ B)

⇒ (x ∈ C and x ∈ A) and x ∉ (C ∩ B)

⇒ x ∈ (C ∩ A) − (C ∩ B) ………1

Conversely, let y ∈ (C ∩ A) − (C ∩ B). Then y ∈ C and y ∈ A but y ∉ C ∩ B; since y ∈ C, this forces y ∉ B, so y ∈ C ∩ (A − B) ………2

From 1 and 2 we get C ∩ (A − B) = (C ∩ A) − (C ∩ B).

(Venn diagrams: A − B, C ∩ (A − B), C ∩ A, C ∩ B and (C ∩ A) − (C ∩ B) shade matching regions.)

Q: 3

## In carrying out a survey of the efficiency of the lights, brakes and steering of motor vehicles, 100 vehicles were found to be defective as follows:

## 35 had defective lights

## 40 had defective brakes

## 41 had defective steering

## 8 had defective lights and brakes

## 7 had defective lights and steering

## 6 had defective brakes and steering

## Use a Venn diagram to determine

## How many vehicles had defective lights, brakes and steering?

## How many vehicles had defective lights only?

Sol:

Let L, B and S represent defective Lights, Brakes and Steering respectively.

n(L ∪ B ∪ S) = n(L) + n(B) + n(S) − n(L ∩ B) − n(L ∩ S) − n(B ∩ S) + n(L ∩ B ∩ S)

100 = 35 + 40 + 41 − 8 − 7 − 6 + n(L ∩ B ∩ S)

n(L ∩ B ∩ S) = 5

The number of vehicles having defective lights, brakes and steering is 5.

Defective lights only:

= n(L) − n(L ∩ B) − n(L ∩ S) + n(L ∩ B ∩ S)

= 35 − 8 − 7 + 5

= 25

The number of vehicles having defective light only is 25.
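The inclusion-exclusion computation can be reproduced in a few lines of Python (a supplementary check):

```python
# Inclusion-exclusion check for the vehicle survey (Q3).
L, B, S = 35, 40, 41          # defective lights, brakes, steering
LB, LS, BS = 8, 7, 6          # pairwise overlaps
total = 100                   # every surveyed vehicle had some defect

# n(L∪B∪S) = n(L)+n(B)+n(S) − n(L∩B) − n(L∩S) − n(B∩S) + n(L∩B∩S)
triple = total - (L + B + S - LB - LS - BS)
lights_only = L - LB - LS + triple
print(triple, lights_only)    # -> 5 25
```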

Q: 4

## In a survey of 1000 households, each house had at least one of the appliances washing machine, vacuum cleaner or refrigerator. 400 had no refrigerator, 380 had no vacuum cleaner and 542 had no washing machine. 294 had both a vacuum cleaner and washing machine, 277 both a refrigerator and a vacuum cleaner, 190 both a refrigerator and a washing machine. How many households had all three appliances? How many had only a vacuum cleaner?

Sol :

Let R denote refrigerator, V vacuum cleaner and W washing machine.

Given that

n(R ∪ V ∪ W) = 1000

n(R̄) = 400 (no refrigerator)

n(V̄) = 380 (no vacuum cleaner)

n(W̄) = 542 (no washing machine) ……… 1

n(V ∩ W) = 294

n(R ∩ V) = 277

n(R ∩ W) = 190

To find: n(R ∩ V ∩ W), and how many households have only a vacuum cleaner.

Proof: Since every household has at least one appliance,

n(R) = 1000 − n(R̄)

n(V) = 1000 − n(V̄) ………2

n(W) = 1000 − n(W̄)

From equations 1 and 2 we have

n(R) = 1000 − 400 = 600

n(V) = 1000 − 380 = 620

n(W) = 1000 − 542 = 458

Also,

n(R ∪ V ∪ W) = n(R) + n(V) + n(W) − n(R ∩ V) − n(R ∩ W) − n(V ∩ W) + n(R ∩ V ∩ W)

1000 = 600 + 620 + 458 − 277 − 190 − 294 + n(R ∩ V ∩ W)

n(R ∩ V ∩ W) = 1000 − 1678 + 761 = 83

83 households had all three appliances.

(b)

Households with only a vacuum cleaner:

= n(V) − n(R ∩ V) − n(V ∩ W) + n(R ∩ V ∩ W)

= 620 − 277 − 294 + 83

= 132

The result shows that 132 households had only a vacuum cleaner.
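Again the arithmetic can be verified with a short script (a supplementary check):

```python
# Numerical check for the household survey (Q4).
total = 1000
R = total - 400        # households with a refrigerator
V = total - 380        # with a vacuum cleaner
W = total - 542        # with a washing machine
VW, RV, RW = 294, 277, 190

all_three = total - (R + V + W - RV - RW - VW)
vacuum_only = V - RV - VW + all_three
print(all_three, vacuum_only)   # -> 83 132
```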

Q: 5

## Verify, using Venn diagrams, De Morgan's laws

Sol:

Here we have to prove that

(A ∪ B)ᶜ = Aᶜ ∩ Bᶜ

Figure 1: (A ∪ B)ᶜ, shaded ==

Figure 2: Aᶜ, shaded ||

Figure 3: Bᶜ, shaded //

Figure 4: Aᶜ ∩ Bᶜ, shaded ////

From figures 1, 2, 3 and 4: the //// region of figure 4, where the || and // shadings overlap, is exactly the shaded region of figure 1. From the figures we conclude that

(A ∪ B)ᶜ = Aᶜ ∩ Bᶜ

The same construction verifies the second law,

(A ∩ B)ᶜ = Aᶜ ∪ Bᶜ

and the generalized forms

(A₁ ∪ A₂ ∪ … ∪ Aₙ)ᶜ = A₁ᶜ ∩ A₂ᶜ ∩ … ∩ Aₙᶜ

(A₁ ∩ A₂ ∩ … ∩ Aₙ)ᶜ = A₁ᶜ ∪ A₂ᶜ ∪ … ∪ Aₙᶜ

together with the idempotency laws A ∪ A = A and A ∩ A = A.

All four conditions are verified from the Venn diagrams; therefore De Morgan's laws are proved.

## Q: 6 of page 375

Formulae have been used to generate the tally chart.

(c) Relative frequency is defined as the ratio of the number of successful trials to the total number of trials. It is a vital concept in probability, particularly when predictions cannot be made just by looking at the situation: by using relative frequency, previous results can be used to make predictions.

(d) frequency chart

To find the relative frequencies, divide each frequency by the total number of data in the sample, in this case, 50. Relative frequencies can be written as fractions, percents, or decimals.

The relative frequency is obtained by using the formula

e) Relative frequency chart

f)

The distribution of results can be shown as a relative frequency curve. From the graph, the class ranges and the relative frequencies corresponding to them can be read off.

## FREQUENCY DISTRIBUTION

## Q: 7

## The efficiency of a new computer operating system is being tested on a mainframe computer. A total of 40 runs are carried out, and in each run the same number of jobs, each chosen to be representative of the particular computer environment, is submitted as a batch to the machine. For each run the throughput rate, measured in jobs per minute, is determined. The results of the 40 runs are as follows:

3.22 3.18 3.25 3.24 3.28 3.21

3.26 3.19 3.30 3.23 3.14 3.22

3.35 3.23 3.27 3.23 3.26 3.37

3.24 3.25 3.34 3.19 3.27 3.28

3.28 3.26 3.18 3.29 3.31 3.30

3.17 3.23 3.25 3.20 3.29 3.22

By constructing a tally chart, group the data into the classes 3.12-3.16, 3.16-3.20, 3.20-3.24, etc.

Calculate the mean and standard deviation of the grouped data, using the coding method.

Draw a cumulative frequency polygon and use it to estimate the median and semi interquartile range of the data.

Estimate the % of runs with throughput rates which lie outside the interval which extends from one standard deviation below the mean to one standard deviation above the mean.

Sol:

## Raw data

(as tabulated above)

N = 40

Min. value = 3.14

Max. value = 3.37

| No | Class L | Class U | Frequency | Class mid-point x | Class interval | xf | x²f | Cum. freq. |
|----|---------|---------|-----------|-------------------|----------------|--------|--------|------------|
| 1 | 3.12 | 3.16 | 2 | 3.14 | 0.04 | 6.28 | 19.72 | 2 |
| 2 | 3.16 | 3.20 | 5 | 3.18 | 0.04 | 15.90 | 50.56 | 7 |
| 3 | 3.20 | 3.24 | 10 | 3.22 | 0.04 | 32.20 | 103.68 | 17 |
| 4 | 3.24 | 3.28 | 11 | 3.26 | 0.04 | 35.86 | 116.90 | 28 |
| 5 | 3.28 | 3.32 | 8 | 3.30 | 0.04 | 26.40 | 87.12 | 36 |
| 6 | 3.32 | 3.36 | 3 | 3.34 | 0.04 | 10.02 | 33.47 | 39 |
| 7 | 3.36 | 3.40 | 1 | 3.38 | 0.04 | 3.38 | 11.42 | 40 |
| 8 | 3.40 | 3.44 | 0 | 3.42 | 0.04 | 0.00 | 0.00 | 40 |
| Total | | | 40 | | | 130.04 | 422.88 | |

Mean = 130.04/40 = 3.251

Variance = 422.88/40 − 3.251² = 0.0030

Standard deviation = √0.0030 = 0.0548

The median is the value corresponding to a cumulative frequency of N/2 = 20. The median class is the class containing the median, and is therefore the lowest class whose cumulative frequency exceeds N/2, i.e. class 4.

Fc = cumulative frequency of the classes below the median class = 17

fm = frequency of the median class = 11

L = upper class boundary of the class immediately below the median class = 3.24

c = class interval = 0.04

Median Q2 = L + c(N/2 − Fc)/fm = 3.24 + 0.04 × (20 − 17)/11 = 3.2509

The lower quartile Q1 is the value corresponding to a cumulative frequency of N/4 = 10. The lower quartile class is the lowest class whose cumulative frequency exceeds N/4, i.e. class 3.

Fc1 = cumulative frequency of the classes below the lower quartile class = 7

fq1 = frequency of the lower quartile class = 10

L1 = upper class boundary of the class immediately below the lower quartile class = 3.20

Lower quartile Q1 = L1 + c(N/4 − Fc1)/fq1 = 3.20 + 0.04 × (10 − 7)/10 = 3.212

The upper quartile Q3 is the value corresponding to a cumulative frequency of 3N/4 = 30. The upper quartile class is the lowest class whose cumulative frequency exceeds 3N/4, i.e. class 5.

Fc3 = cumulative frequency of the classes below the upper quartile class = 28

fq3 = frequency of the upper quartile class = 8

L3 = upper class boundary of the class immediately below the upper quartile class = 3.28

Upper quartile Q3 = L3 + c(3N/4 − Fc3)/fq3 = 3.28 + 0.04 × (30 − 28)/8 = 3.29

## Measures of scatter

Mean − 2s = 3.1415, Mean − s = 3.1962, Mean = 3.251, Mean + s = 3.3058, Mean + 2s = 3.3605, Mean + 3s = 3.4153

Interquartile range Q3 − Q1 = 0.078 (the range within which the middle 50% of readings lie)

Semi-interquartile range = 0.039

From the cumulative frequency polygon, roughly 6.5 runs lie below mean − s = 3.196 and roughly 6.8 lie above mean + s = 3.306, so about 13 of the 40 runs (approximately 33%) have throughput rates outside the interval from one standard deviation below the mean to one standard deviation above it.
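The grouped-data calculations above can be reproduced programmatically; this sketch recomputes the mean, standard deviation and quartiles from the frequency table (a supplementary check):

```python
# Grouped-frequency statistics for the 40 throughput readings (Q7).
lower = [3.12 + 0.04 * i for i in range(8)]   # class lower boundaries
freq  = [2, 5, 10, 11, 8, 3, 1, 0]
c = 0.04
N = sum(freq)
mid = [lo + c / 2 for lo in lower]            # class mid-points

mean = sum(f * m for f, m in zip(freq, mid)) / N
var  = sum(f * m * m for f, m in zip(freq, mid)) / N - mean ** 2
sd   = var ** 0.5

def quantile(p):
    """Linear interpolation within the class containing cum. freq. p*N."""
    target, cum = p * N, 0
    for lo, f in zip(lower, freq):
        if cum + f >= target:
            return lo + c * (target - cum) / f
        cum += f

print(round(mean, 3), round(sd, 4))            # -> 3.251 0.0548
print(round(quantile(0.5), 4))                 # median -> 3.2509
print(round(quantile(0.25), 3), round(quantile(0.75), 2))  # -> 3.212 3.29
```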

## Q : 8

## A new hybrid apple is developed with the aim of producing larger apples than a particular previous hybrid. In a sample of 1000 apples the distribution of weights was as follows.

| Weight (g) | Frequency |
|-----------|-----------|
| 0-50 | 20 |
| 50-100 | 42 |
| 100-150 | 106 |
| 150-200 | 227 |
| 200-250 | 205 |
| 250-300 | 241 |
| 300-350 | 106 |
| 350-400 | 53 |

## Apples can only be sold to a particular retail outlet with a weight greater than 218g. What proportion of the new hybrid would be rejected by this retail outlet?

## How many grammes, above this weight of 218g is the mean weight of apples?

## What is the difference in weights in units of the standard deviation of apple weights?

Sol:

| Weight (g) | Frequency f | cf | Class mid-point x | fx |
|-----------|-------------|------|-------------------|--------|
| 0-50 | 20 | 20 | 25 | 500 |
| 50-100 | 42 | 62 | 75 | 3150 |
| 100-150 | 106 | 168 | 125 | 13250 |
| 150-200 | 227 | 395 | 175 | 39725 |
| 200-250 | 205 | 600 | 225 | 46125 |
| 250-300 | 241 | 841 | 275 | 66275 |
| 300-350 | 106 | 947 | 325 | 34450 |
| 350-400 | 53 | 1000 | 375 | 19875 |

Σf = 1000, Σfx = 223350, Mean = 223350/1000 = 223.35 g

(i) Apples are rejected if they weigh 218 g or less. The cumulative frequency at 200 g is 395; interpolating within the 200-250 g class gives approximately 395 + 205 × (218 − 200)/50 ≈ 469 apples at or below 218 g. The proportion rejected is therefore about 469/1000 = 46.9%.

(ii) The mean weight is 223.35 g, which is 223.35 − 218 = 5.35 g above the 218 g threshold.

(iii) The standard deviation of the apple weights is 78.9 g, so the difference of 5.35 g corresponds to 5.35/78.9 ≈ 0.07 standard deviations.
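A short script reproduces the grouped mean, standard deviation and the interpolated count at or below the 218 g threshold (a supplementary check):

```python
# Grouped statistics for the 1000 apples (Q8).
lower = [50 * i for i in range(8)]            # 0, 50, ..., 350
freq  = [20, 42, 106, 227, 205, 241, 106, 53]
width, N = 50, sum(freq)
mid = [lo + 25 for lo in lower]               # class mid-points

mean = sum(f * m for f, m in zip(freq, mid)) / N
sd = (sum(f * m * m for f, m in zip(freq, mid)) / N - mean ** 2) ** 0.5

# Apples at or below 218 g, by linear interpolation in the 200-250 g class.
below = sum(freq[:4]) + freq[4] * (218 - 200) / width
print(mean, round(sd, 1), round(below))       # -> 223.35 78.9 469
```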

## Q:9

## The pH level in a river is monitored five times a day. The following twenty sets of five readings were obtained on 20 consecutive days.

| Day | Readings |
|-----|---------------------|
| 1 | 6.7 6.3 6.2 6.1 7.0 |
| 2 | 7.0 7.1 6.9 6.8 6.2 |
| 3 | 6.3 6.4 6.3 6.2 7.1 |
| 4 | 7.0 7.0 6.8 6.9 6.3 |
| 5 | 6.6 6.2 6.1 6.4 7.1 |
| 6 | 6.8 7.3 6.8 7.0 6.4 |
| 7 | 6.9 6.1 6.0 6.4 6.8 |
| 8 | 6.5 7.1 6.3 7.1 6.4 |
| 9 | 6.6 6.6 6.5 6.8 6.7 |
| 10 | 6.9 7.0 6.4 6.3 6.4 |
| 11 | 6.3 6.4 6.5 6.4 6.3 |
| 12 | 6.1 5.9 5.9 5.8 6.1 |
| 13 | 6.3 6.3 6.5 6.5 6.2 |
| 14 | 7.0 5.9 5.9 5.9 6.1 |
| 15 | 6.2 6.6 6.7 6.6 6.2 |
| 16 | 6.9 6.0 6.1 5.9 6.0 |
| 17 | 6.1 6.6 6.8 6.3 6.6 |
| 18 | 6.6 6.0 6.3 5.8 6.2 |
| 19 | 6.4 6.3 6.3 6.1 7.0 |
| 20 | 6.4 5.9 6.1 6.1 6.3 |

## Using your calculator obtain the mean and standard deviation of these 100 readings.

## Obtain the mean pH level for each of the 20 consecutive days and plot them on a chart showing pH as a function of day.

## A warning should be flagged if a mean pH level on any given day lies outside the range m ± 1.96sm, where m is the mean of the 100 readings and sm = s/√5, where s is the standard deviation of the 100 readings. Identify those days on which a warning would be flagged.

Sol:

Grouping the 100 readings and coding with assumed mean A = 6.25 and class width i = 0.5, so that u = (x − A)/i:

| Class interval | Mid-point x | Frequency f | u | fu | fu² |
|---------------|-------------|-------------|----|-----|------|
| 5.5-6.0 | 5.75 | 9 | −1 | −9 | 9 |
| 6.0-6.5 | 6.25 | 49 | 0 | 0 | 0 |
| 6.5-7.0 | 6.75 | 28 | 1 | 28 | 28 |
| 7.0-7.5 | 7.25 | 14 | 2 | 28 | 56 |
| Total | | 100 | | 47 | 93 |

Mean = A + (Σfu/Σf) × i = 6.25 + (47/100) × 0.5 = 6.25 + 0.235 = 6.485

Variance σ² = i² [Σfu²/N − (Σfu/N)²] = 0.25 × (93/100 − 0.47²) = 0.25 × 0.709 = 0.177

Standard deviation σ = √0.177 = 0.42

The mean pH level for each day is the mean of that day's five readings:

| Day | Mean | Day | Mean |
|-----|------|-----|------|
| 1 | 6.46 | 11 | 6.38 |
| 2 | 6.80 | 12 | 5.96 |
| 3 | 6.46 | 13 | 6.36 |
| 4 | 6.80 | 14 | 6.16 |
| 5 | 6.48 | 15 | 6.46 |
| 6 | 6.86 | 16 | 6.18 |
| 7 | 6.44 | 17 | 6.48 |
| 8 | 6.68 | 18 | 6.18 |
| 9 | 6.64 | 19 | 6.42 |
| 10 | 6.60 | 20 | 6.16 |

Range

m − 1.96sm ≤ X ≤ m + 1.96sm

= 6.485 − 1.96 × 0.42/√5 ≤ X ≤ 6.485 + 1.96 × 0.42/√5

= 6.485 − 0.368 ≤ X ≤ 6.485 + 0.368

= [6.12 ≤ X ≤ 6.85]

A warning would be flagged on the days whose mean lies outside this interval: days 6 (mean 6.86) and 12 (mean 5.96).
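The daily means and warning limits can also be computed directly from the raw readings. Note that the exact mean (6.448) and standard deviation (about 0.37) of the 100 readings differ slightly from the grouped-data estimates used above, so with exact limits the borderline days 2 and 4 (mean 6.80) may be flagged as well:

```python
# Daily pH means and warning limits for Q9, from the raw readings.
readings = [
    [6.7, 6.3, 6.2, 6.1, 7.0], [7.0, 7.1, 6.9, 6.8, 6.2],
    [6.3, 6.4, 6.3, 6.2, 7.1], [7.0, 7.0, 6.8, 6.9, 6.3],
    [6.6, 6.2, 6.1, 6.4, 7.1], [6.8, 7.3, 6.8, 7.0, 6.4],
    [6.9, 6.1, 6.0, 6.4, 6.8], [6.5, 7.1, 6.3, 7.1, 6.4],
    [6.6, 6.6, 6.5, 6.8, 6.7], [6.9, 7.0, 6.4, 6.3, 6.4],
    [6.3, 6.4, 6.5, 6.4, 6.3], [6.1, 5.9, 5.9, 5.8, 6.1],
    [6.3, 6.3, 6.5, 6.5, 6.2], [7.0, 5.9, 5.9, 5.9, 6.1],
    [6.2, 6.6, 6.7, 6.6, 6.2], [6.9, 6.0, 6.1, 5.9, 6.0],
    [6.1, 6.6, 6.8, 6.3, 6.6], [6.6, 6.0, 6.3, 5.8, 6.2],
    [6.4, 6.3, 6.3, 6.1, 7.0], [6.4, 5.9, 6.1, 6.1, 6.3],
]
flat = [x for day in readings for x in day]
m = sum(flat) / len(flat)
s = (sum((x - m) ** 2 for x in flat) / len(flat)) ** 0.5
sm = s / 5 ** 0.5
lo, hi = m - 1.96 * sm, m + 1.96 * sm

day_means = [round(sum(day) / 5, 2) for day in readings]
flagged = [i + 1 for i, dm in enumerate(day_means) if not lo <= dm <= hi]
print(round(m, 3), round(s, 2), flagged)
```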

Q : 10

## Standard amounts of five different insecticides are found to kill 30%, 45%, 65%, 85% and 90% respectively of a fixed-size insect population. If one of the insecticides is chosen at random, what is the probability that it will kill:

## At least 65% of the insect population?

## At most 45% of the insect population?

## Between 40% and 80% of the insect population?

## If two of the insecticides are chosen at random, what is the probability that at least one of the pair chosen will kill at least 85% of the insect population?

## Sol:

Let A, B, C, D and E denote the insecticides that kill 30%, 45%, 65%, 85% and 90% of the insect population respectively; the probability of selecting any one insecticide is 1/5.

At least 65% of the insect population: insecticides C, D and E qualify, so

P(C) + P(D) + P(E) = 1/5 + 1/5 + 1/5 = 3/5

At most 45% of the insect population: insecticides A and B qualify, so

P(A) + P(B) = 1/5 + 1/5 = 2/5

Between 40% and 80% of the insect population: insecticides B (45%) and C (65%) qualify, so

P(B) + P(C) = 1/5 + 1/5 = 2/5

If two of the insecticides are chosen at random, only D and E kill at least 85% of the insect population. The probability that the pair contains neither D nor E is C(3,2)/C(5,2) = 3/10, so

P(at least one of the pair kills at least 85%) = 1 − 3/10 = 7/10 = 0.7
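These probabilities can be checked by enumeration (a supplementary check):

```python
from itertools import combinations
from fractions import Fraction

kill = [30, 45, 65, 85, 90]   # kill rates (%) of the five insecticides
n = len(kill)

p_at_least_65 = Fraction(sum(k >= 65 for k in kill), n)
p_at_most_45  = Fraction(sum(k <= 45 for k in kill), n)
p_between     = Fraction(sum(40 < k < 80 for k in kill), n)

# Choose an unordered pair; success if either member kills >= 85%.
pairs = list(combinations(kill, 2))
p_pair_85 = Fraction(sum(any(k >= 85 for k in pair) for pair in pairs),
                     len(pairs))
print(p_at_least_65, p_at_most_45, p_between, p_pair_85)
# -> 3/5 2/5 2/5 7/10
```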

Q: 11

## As part of the safety procedure at an oil refinery, a fractionating column is monitored by three independent warning systems, A, B and C. The probabilities that on any given day the warning systems will fail are 0.1, 0.01 and 0.05 for A, B and C respectively. If P(n) denotes the probability that n warning systems fail on a given day, obtain P(0), P(1), P(2) and P(3). Hence obtain the expected number of warning systems which fail on any given day.

Sol :

Let Ā, B̄ and C̄ denote the events that warning systems A, B and C respectively fail on a given day. The systems are independent, with

P(Ā) = 0.1, P(B̄) = 0.01, P(C̄) = 0.05

and hence

P(A) = 0.9, P(B) = 0.99, P(C) = 0.95

P(n) = probability that exactly n warning systems fail.

P(0) = P(A) P(B) P(C) = 0.9 × 0.99 × 0.95 = 0.84645

P(1) = P(Ā) P(B) P(C) + P(A) P(B̄) P(C) + P(A) P(B) P(C̄) = 0.1 × 0.99 × 0.95 + 0.9 × 0.01 × 0.95 + 0.9 × 0.99 × 0.05 = 0.14715

P(2) = P(Ā) P(B̄) P(C) + P(Ā) P(B) P(C̄) + P(A) P(B̄) P(C̄) = 0.1 × 0.01 × 0.95 + 0.1 × 0.99 × 0.05 + 0.9 × 0.01 × 0.05 = 0.00635

P(3) = P(Ā) P(B̄) P(C̄) = 0.1 × 0.01 × 0.05 = 0.00005

| n | P(n) | n·P(n) |
|---|---------|---------|
| 0 | 0.84645 | 0 |
| 1 | 0.14715 | 0.14715 |
| 2 | 0.00635 | 0.01270 |
| 3 | 0.00005 | 0.00015 |

Expected number of failures E(X) = Σ n·P(n) = 0.14715 + 0.01270 + 0.00015 = 0.16

(As a check, E(X) equals the sum of the individual failure probabilities: 0.1 + 0.01 + 0.05 = 0.16.)
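Enumerating the eight possible fail/work outcomes confirms the distribution and the expected value (a supplementary check):

```python
from itertools import product

# Failure probabilities of the three independent warning systems (Q11).
p_fail = [0.1, 0.01, 0.05]

dist = [0.0] * 4
for outcome in product([0, 1], repeat=3):     # 1 = that system fails
    prob = 1.0
    for failed, p in zip(outcome, p_fail):
        prob *= p if failed else 1 - p
    dist[sum(outcome)] += prob

expected = sum(n * p for n, p in enumerate(dist))
print([round(p, 5) for p in dist], round(expected, 2))
# -> [0.84645, 0.14715, 0.00635, 5e-05] 0.16
```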

## Q: 12

## A particular scientific monitoring device is powered by three batteries of the same type. The device is designed so that it will continue to function provided at least two of the three batteries function normally. The three batteries are renewed at a particular time. The probability that a battery will fail within the first 50 hours of operation is 0.25. The probability that a battery will fail within the first 100 hours of operation is 0.70. Determine the probability that:

## The device fails due to battery failure within the first 50 hours of duration.

## The batteries allow the equipment to operate for longer than 100 hrs.

## The device fails due to battery failure within 50-100hr of duration.

## Sol :

Fail within first 50 hours = 0.25

Does not fail within first 50 hours = 0.75.

The device fails due to battery failure within the first 50 hours of duration.

P(The device fails due to battery failure within the first 50 hours of duration)

= (0.25*0.25*0.25) + 3 {(0.25*0.25)*(0.75)}

= {0.015625+3(0.0625*0.75)}

=0.015625+ 3(0.046875)

=0.015625+0.140625

=0.15625

The batteries allow the equipment to operate for longer than 100 hrs.

Fail within first 100 hours = 0.7

Does not fail within 100 hours = 0.3

P(The batteries allow the equipment to operate for longer than 100 hrs)

=(0.3*0.3*0.3) + 3 {(0.3)*(0.3) * (0.7)}

={0.027+ 3(0.09*0.7)}

=0.027 + 3(0.063)

=0.027+0.189

=0.216

The device fails due to battery failure within 50-100 hours of operation.

P(device has failed by 100 hours) = (0.7 × 0.7 × 0.7) + 3 × (0.7 × 0.7 × 0.3) = 0.343 + 0.441 = 0.784

P(device fails within 50-100 hours) = P(failed by 100 hours) − P(failed by 50 hours) = 0.784 − 0.15625 = 0.62775

(Check: the three answers 0.15625, 0.62775 and 0.216 sum to 1.)
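The three answers follow from the binomial distribution with n = 3 batteries; a short check:

```python
from math import comb

def p_at_least(k, n, p):
    """P(at least k of n independent batteries have failed),
    each failing with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

fail_by_50  = p_at_least(2, 3, 0.25)              # device dead within 50 h
ok_past_100 = 1 - p_at_least(2, 3, 0.70)          # still running after 100 h
fail_50_100 = p_at_least(2, 3, 0.70) - fail_by_50
print(round(fail_by_50, 5), round(ok_past_100, 3), round(fail_50_100, 5))
# -> 0.15625 0.216 0.62775
```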

## SAS ANSWERS

## Answer of Q.13

Screen Shot of the SAS Program before execution

Screen Shot of the SAS Program After Execution

## Number of Observation: 15

## Number of Variable: 3

After clearing the log by using the CTRL+E command

## Answer of Q. 14

## 4 variables

## 10 observations

## Answers of Q.15

```sas
proc gchart data=dfw;
   title 'Total Pounds of Mail by Date';
   vbar date / sumvar=mail;
run;

proc means data=dfw mean min max maxdec=2;
   title 'Average, Minimum & Maximum Pounds of Mail';
   var mail;
run;
quit;
```

## Answers of Q.16

b)

How many datasets are in the library? = 1

How many have been assigned in this SAS session? = 5

## Answers of Q.17.

Active data libraries are IA, Maps, Sashelp, Sasuser and work.

To print contents we used

```sas
proc contents data=ia._all_ nods;
run;
```

## Answers of Q.18.

```sas
proc contents data=ia.payroll2;
run;
```

The number of observations is not displayed because it is a data view.

There are 6 variables in the view.

There are 148 observations in the view.

## Answers of Q.19.

```sas
data oct_dates;
   set ia.october;
   date=mdy(month,day,year);
   days_gone_by=today()-date;
run;

proc print data=oct_dates;
run;
```

## Answers of Q.20.

```sas
data oct_seats;
   set oct_dates;
   empty_seats=capacity-(boarded+nonrev);
   percent_full=100-(empty_seats/capacity*100);
run;

proc print data=oct_seats;
run;
```

## Answers of Q.21.

```sas
data bonus;
   set ia.fltattnd;
   BonusAmt=8/100*salary;
   AnnivMo=month(HireDate);
   keep empid BonusAmt AnnivMo;
run;

proc print data=bonus;
run;
```

## Answer of Q.22.

```sas
data lowrev;
   set ia.lonpar;
   where revenue lt 180000;
   keep date dest delay revenue;
run;

proc print data=lowrev;
run;
```

## Answers of Q.23.

```sas
data mechs;
   set ia.payroll;
   length Manager $ 15;
   if upcase(jobcode)='ME1' then
      do;
         Manager='Miss Pearce';
         Raise=5/100*salary;
         PenRise=3.5/100*salary;
         Total=Raise+PenRise;
      end;
   else if upcase(jobcode)='ME2' then
      do;
         Manager='Mr Holt';
         Raise=7.5/100*salary;
         PenRise=5/100*salary;
         Total=Raise+PenRise;
      end;
   else if upcase(jobcode)='ME3' then
      do;
         Manager='Mr Fitz-William';
         Raise=10/100*salary;
         PenRise=8/100*salary;
         Total=Raise+PenRise;
      end;
   keep Jobcode Salary Manager Raise PenRise Total;
run;

proc print data=mechs;
run;
```

Q:24

## A pair of fair dice is thrown. Find the probability p that the sum is 10 or greater if (i) a 5 appears on the first die, (ii) a 5 appears on at least one of the dice.

Sol:

The sample space consists of 36 equally likely outcomes: {(1,1), (1,2), …, (6,6)}.

(i) Given that a 5 appears on the first die, the reduced sample space has 6 outcomes: (5,1), …, (5,6). The sum is 10 or greater for (5,5) and (5,6), so n(E) = 2 and

P(E) = 2/6 = 1/3

(ii) Given that a 5 appears on at least one die, the reduced sample space has 11 outcomes. The sum is 10 or greater for (5,5), (5,6) and (6,5), so n(E) = 3 and

P(E) = 3/11
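The two conditional probabilities can be confirmed by enumerating the 36 outcomes (a supplementary check):

```python
from itertools import product
from fractions import Fraction

space = list(product(range(1, 7), repeat=2))    # all 36 outcomes

first_is_5 = [d for d in space if d[0] == 5]
at_least_one_5 = [d for d in space if 5 in d]

p1 = Fraction(sum(a + b >= 10 for a, b in first_is_5), len(first_is_5))
p2 = Fraction(sum(a + b >= 10 for a, b in at_least_one_5),
              len(at_least_one_5))
print(p1, p2)    # -> 1/3 3/11
```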

Q: 25

The table shows the probabilities of a hard disc crash using a Brand X drive within one year.

## Table: Probabilities of hard disc crashes

| | Brand X | Not Brand X | Total |
|------------|---------|-------------|-------|
| Crash C | 0.6 | 0.1 | 0.7 |
| No crash C′ | 0.2 | 0.1 | 0.3 |
| Total | 0.8 | 0.2 | 1.0 |

Using the information in the table, state:

The probability of a crash, for Brand X and all other types of disc combined:

P(C) = 0.7

The probability of no crash:

P(C′) = 0.3

The probability of using a Brand X disc:

P(X) = 0.8

The probability of not using Brand X:

P(X′) = 0.2

The probability of a crash and using Brand X — the joint probability read directly from the table:

P(C ∩ X) = 0.6

The probability of a crash, given that Brand X is used:

P(C | X) = P(C ∩ X)/P(X) = 0.6/0.8 = 0.75

The probability of a crash, given that Brand X is not used:

P(C | X′) = P(C ∩ X′)/P(X′) = 0.1/0.2 = 0.5

The probability of the disc being Brand X given that it crashed:

P(X | C) = P(C ∩ X)/P(C) = 0.6/0.7 = 6/7

The probability of the disc being Brand X given that it did not crash:

P(X | C′) = P(C′ ∩ X)/P(C′) = 0.2/0.3 = 2/3

The probability of the disc not being Brand X given that it crashed:

P(X′ | C) = P(C ∩ X′)/P(C) = 0.1/0.7 = 1/7

The probability of the disc not being Brand X given that it did not crash:

P(X′ | C′) = P(C′ ∩ X′)/P(C′) = 0.1/0.3 = 1/3
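Deriving the conditional probabilities from the joint table in code, using exact fractions (a supplementary check):

```python
from fractions import Fraction as F

# Joint probabilities from the crash table (Q25), as exact fractions.
crash    = {"X": F(6, 10), "other": F(1, 10)}
no_crash = {"X": F(2, 10), "other": F(1, 10)}

p_crash = sum(crash.values())            # marginal P(C) = 7/10
p_X     = crash["X"] + no_crash["X"]     # marginal P(X) = 8/10

p_crash_given_X    = crash["X"] / p_X               # P(C | X)
p_X_given_crash    = crash["X"] / p_crash           # P(X | C)
p_X_given_no_crash = no_crash["X"] / (1 - p_crash)  # P(X | C′)
print(p_crash_given_X, p_X_given_crash, p_X_given_no_crash)
# -> 3/4 6/7 2/3
```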

## Q: 26

## Present a decision tree from some domain of interest to you, describing how it was constructed and its use for decision making and classification.

Sol:

In data mining, a decision tree is a predictive model which can be used to represent both classifiers and regression models. In operations research, on the other hand, decision trees refer to a hierarchical model of decisions and their consequences. The decision maker employs decision trees to identify the strategy most likely to reach her goal.

When a decision tree is used for classification tasks, it is more appropriately referred to as a classification tree. When it is used for regression tasks, it is called a regression tree.

Here, we concentrate mainly on classification trees. Classification trees are used to classify an object or an instance (such as an insurant) to a predefined set of classes (such as risky/non-risky) based on their attribute values (such as age or gender).

Classification trees are frequently used in applied fields such as finance, marketing, engineering and medicine. The classification tree is useful as an exploratory technique. However, it does not attempt to replace existing traditional statistical methods, and there are many other techniques that can be used to classify or predict the membership of instances to a predefined set of classes, such as artificial neural networks or support vector machines.

Figure 1.1 presents a typical decision tree classifier. This decision tree is used to facilitate the underwriting process of mortgage applications of a certain bank. As part of this process the applicant fills in an application form that includes the following data: number of dependents (DEPEND), loan-to-value ratio (LTV), marital status (MARST), payment-to-income ratio (PAYINC), interest rate (RATE), years at current address (YRSADD), and years at current job (YRSJOB).

Based on the above information, the underwriter will decide if the application should be approved for a mortgage. More specifically, this decision tree classifies mortgage applications into one of the following three classes:

Approved (denoted as "A") - the application should be approved.

Denied (denoted as "D") - the application should be denied.

Manual underwriting (denoted as "M") - an underwriter should manually examine the application and decide if it should be approved (in some cases after requesting additional information from the applicant).

The decision tree is based on the fields that appear in the mortgage applications forms.

The above example illustrates how a decision tree can be used to represent a classification model. In fact, it can be seen as an expert system which partially automates the underwriting process and which was built manually by a knowledge engineer after interrogating an experienced underwriter in the company. This sort of expert interrogation is called knowledge elicitation, namely obtaining knowledge from a human expert (or human experts) for use by an intelligent system. Knowledge elicitation is usually difficult because it is not easy to find an available expert who is able, has the time and is willing to provide the knowledge engineer with the information needed to create a reliable expert system. In fact, the difficulty inherent in the process is one of the main reasons why companies avoid intelligent systems. This phenomenon is known as the knowledge elicitation bottleneck.

A decision tree can be also used to analyze the payment ethics of customers who received a mortgage. In this case there are two classes:

[Figure omitted: a decision tree whose root tests YRSJOB (<2 / ≥2), with internal nodes testing LTV (<75% / ≥75%), MARST (DIVORCED / MARRIED / SINGLE), YRSADD (<1.5 / ≥1.5) and DEPEND (>0 / =0), and leaves labelled A, D and M.]

Fig 1.1 underwriting decision tree
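A hand-built tree like the one in Fig 1.1 is just a set of nested rules. The sketch below is a hypothetical simplification: the thresholds (YRSJOB < 2, LTV ≥ 75%, and so on) are read off the figure, but the exact branching structure of the bank's tree is assumed for illustration only.

```python
def underwrite(app):
    """Toy rule-based classifier in the style of Fig 1.1.

    `app` is a dict of application fields. Returns "A" (approved),
    "D" (denied) or "M" (manual underwriting). The rule ordering
    here is illustrative, not the bank's actual underwriting tree.
    """
    if app["YRSJOB"] < 2:
        # Short job tenure: deny unless the applicant is settled at
        # their current address.
        if app["YRSADD"] < 1.5:
            return "D"
        return "M"
    if app["LTV"] >= 0.75:
        # High loan-to-value ratio: send to a human underwriter.
        return "M"
    if app["MARST"] == "DIVORCED" and app["DEPEND"] > 0:
        return "M"
    return "A"
```

An application is classified by following exactly one root-to-leaf path, so only the fields along that path are ever inspected.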

Paid (denoted as "P") - the recipient has fully paid off his or her mortgage.

Not Paid (denoted as "N") - the recipient has not fully paid off his or her mortgage.

This new decision tree can be used to improve the underwriting decision model presented in Figure 1.1. It shows that relatively many customers who passed the underwriting process have not fully paid back their loans. Note that, as opposed to the decision tree presented in Figure 1.1, this decision tree is constructed from data accumulated in the database. Thus, there is no need to elicit knowledge manually; in fact, the tree can be grown automatically. This kind of knowledge acquisition is referred to as knowledge discovery from databases.
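The core of growing a tree automatically is choosing, at each node, the split that best separates the classes in the accumulated data. A minimal sketch of one such step, assuming a single numeric attribute and binary labels (1 = paid, 0 = not paid) and using the Gini impurity as the split criterion:

```python
def gini(labels):
    """Gini impurity of a set of binary labels (0/1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n          # fraction of class 1
    return 2 * p * (1 - p)       # 0 when pure, 0.5 when 50/50

def best_split(xs, ys):
    """Find the threshold t on a numeric attribute that minimizes
    the weighted Gini impurity of the children {x < t} and {x >= t}."""
    best_t, best_score = None, float("inf")
    n = len(ys)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical data: years at current job vs. whether the loan was repaid.
yrsjob = [1, 1, 2, 5, 6, 7]
paid   = [0, 0, 0, 1, 1, 1]
```

A full induction algorithm such as CART applies this search over every attribute and then recurses on each child; here `best_split(yrsjob, paid)` finds the threshold 5, which separates the two classes perfectly (weighted impurity 0).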

The use of a decision tree is a very popular technique in data mining. In the opinion of many researchers, decision trees are popular due to their simplicity and transparency. Decision trees are self-explanatory; there is no need to be a data mining expert in order to follow a certain decision tree. Classification trees are usually represented graphically as hierarchical structures, making them easier to interpret than other techniques. However, if the classification tree becomes complicated (i.e. has many nodes) then its straightforward graphical representation becomes useless. For complex trees, other graphical procedures should be developed to simplify interpretation.

[Figure omitted: a decision tree whose root tests YRSJOB (<3 / ≥3.5), with internal nodes testing the interest rate (RATE), payment-to-income ratio (PAYINC) and DEPEND (>0 / =0), and leaves labelled P (paid) and N (not paid).]

Fig: actual behaviour of customers

## Characteristics of Decision Trees

A decision tree is a classifier expressed as a recursive partition of the instance space. The decision tree consists of nodes that form a rooted tree, meaning it is a directed tree with a node called a "root" that has no incoming edges. All other nodes have exactly one incoming edge. A node with outgoing edges is referred to as an "internal" or "test" node. All other nodes are called "leaves" (also known as "terminal" or "decision" nodes).

In the decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attribute values. In the simplest and most frequent case, each test considers a single attribute, such that the instance space is partitioned according to that attribute's value. In the case of numeric attributes, the condition refers to a range.

Each leaf is assigned to one class representing the most appropriate target value. Alternatively, the leaf may hold a probability vector (affinity vector) indicating the probability of the target attribute having a certain value. Figure describes another example of a decision tree that reasons whether or not a potential customer will respond to a direct mailing.
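The probability vector held at a leaf is simply the class distribution of the training instances that reach it. A minimal sketch, assuming the leaf stores its training labels directly:

```python
from collections import Counter

def leaf_probabilities(labels):
    """Turn the training labels that reached a leaf into a probability
    vector over the target classes."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: count / n for cls, count in counts.items()}

# Hypothetical leaf: of the mortgage recipients routed here,
# three repaid ("P") and one did not ("N").
probs = leaf_probabilities(["P", "P", "P", "N"])
```

The majority class ("P", probability 0.75) would be the leaf's predicted label, while the full vector lets the analyst trade off precision against risk.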

Internal nodes are represented as circles, whereas leaves are denoted as triangles. Two or more branches may grow from each internal node (i.e. a node that is not a leaf). Each node corresponds to a certain characteristic and its branches correspond to ranges of values. These ranges of values must form a partition of the set of values of the given characteristic.

Instances are classified by navigating them from the root of the tree down to a leaf, according to the outcome of the tests along the path.

Specifically, we start at the root of the tree; we consider the characteristic associated with the root; and we determine to which branch the observed value of that characteristic corresponds. We then consider the node at the end of that branch and repeat the same operations until we reach a leaf. Note that this decision tree incorporates both nominal and numeric attributes. Given this classifier, the analyst can predict the response of a potential customer (by sorting it down the tree), and understand the behavioural characteristics of the entire potential customer population regarding direct mailing. Each node is labelled with the attribute it tests, and its branches are labelled with the corresponding values.
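The root-to-leaf navigation described above can be sketched directly. A minimal version, assuming the tree is stored as nested dicts (the attribute names echo Fig 1.1, but this particular tree structure is hypothetical); numeric nodes carry a threshold, nominal nodes branch per value:

```python
def classify(node, instance):
    """Walk from the root down to a leaf, following the branch whose
    test succeeds on the instance's attribute values."""
    while isinstance(node, dict):            # internal (test) node
        attr = node["attr"]
        if "threshold" in node:              # numeric attribute: range test
            branch = "lt" if instance[attr] < node["threshold"] else "ge"
        else:                                # nominal attribute: one branch per value
            branch = instance[attr]
        node = node["children"][branch]
    return node                              # leaf = predicted class label

# Hypothetical tree mixing a numeric test (YRSJOB) and a nominal one (MARST).
tree = {
    "attr": "YRSJOB", "threshold": 2,
    "children": {
        "lt": "M",
        "ge": {"attr": "MARST",
               "children": {"MARRIED": "A", "SINGLE": "A", "DIVORCED": "D"}},
    },
}
```

Because every internal node routes the instance to exactly one child, the loop always terminates at a single leaf, whose label is the prediction.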

In the case of numeric attributes, decision trees can be geometrically interpreted as a collection of hyperplanes, each orthogonal to one of the axes.