Implementation Of Id3 In Java Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Due to vast availability of data on web, the challenges in retrieving the web data is also increasing. In our paper we discussed about web and how to mould unstructured web documents into structural format. We have also developed a java program for ID3 to classify web documents.

Keywords: Java, ID3, Web documents, Title, classification, Moulding.

Web is popular due to the availability of vast amount of data. It contains several billions of HTML documents, pictures and other multi media files. These documents reside on Internet servers and the information-exchange can be done through HTTP protocols. Individual HTML files having unique address is called Web page and collection of such web pages having same electronic address is called as Web site. The quality of the web page is decided by its title. It is the name of a web page or web site. So giving an appropriate and good title to a web page is very important. To retrieve the required web pages based on the title of the web page, the title of the web page has to be parsed (i.e.) the key words has to be extracted from the given titles. The unstructured web documents have to be moulded into structural format for classifying them using these keywords so as to retrieve the documents efficiently.

2. Moulding Web Documents

We represent the Web data in the binary format where all of the keywords derived from the schema (Title of the web page) posited in the columns and some frequent schemas are posited in the rows. If a keyword is in a frequent schema, a 1 is stored in related cell and otherwise a 0 is stored in it. In the following you can see an example of this form. The attributes of frequent schemas are stated below.

QI1: Deep Web Content mining = {Deep, Web, content, mining}

QI2: Extracting structured data from Web pages = {Extracting, structured, data, Web, pages}

QI3: Page content Rank: An approach to the web content mining = {Page, content, rank, web, content, mining}

QI4: Web mining: Information and pattern discovery on the World Wide Web = {Mining, information, pattern, discovery, World Wide Web}

We represent the web data in binary format as in below table.

Table 1: A Sample Array with Input Information.

Deep

Web

Content

Mining

Extracting

Other Keywords

Category

QI1

1

1

1

1

0

0

1

QI2

0

1

0

0

1

1

2

QI3

0

1

1

1

0

1

3

QI4

0

0

0

1

0

1

1

3. Java Program

To attain knowledge out of retrieved data we classify the documents by converting the unstructured documents into structural format. For classification of web documents we have many methods like classification by Decision tree induction, Bayesian classification, Rule-Based classification, Support vector machine etc. During late 1970s and 1980s J.Ross Quinlan, a researcher in machine learning has developed a decision tree algorithm known as ID3 [1] (Iterative Dichotomiser). ID3 uses information gain as its attribute selection. Information gain Gain(A) is given as

Gain(A) = Info(D) -InfoA(D)

We felt very difficult to calculate this information gain manually. So we have developed a java program to calculate information gain. Though accuracy wise it is not encouraging, methodology wise it is promising. We divided the program into two parts. The first part calculates Info(D) and the second part calculates the Gain(A). The web documents which we represented in binary format, constitutes as input array. This program is described as below.

/*ID3 uses information gain as its attribute selection measure. Gain(A) = Info(D) -InfoA(D). (This part of the program deals with Info(D). */

import java.math.*;

class Entropy1

{

public static void main(String arg[])

{

int a[]={/* Enter values in class columns */};

int b[]= new int[10],count=1;

int n=a.length,temp=0,i ;

for( i=0; i<n;i++)

{

for(int j=i+1;j<n;j++)

{

if(a[i]>a[j])

{

temp=a[i];

a[i]=a[j];

a[j]=temp;

}

}

}

for( i=0;i<n;i++)

{

System.out.print(a[i]);

}

int k=0;

System.out.println(„n-1="+(n-1));

for( i=0;i<n-1;i++)

{

if (a[i]==a[i+1])

{

count=count+1;

}

else

{

b[k]=count;

k=k+1;

count=1;

}

}

if(a[i-1]!=a[i])

{

if(i+1==n)

b[k]=1;

}

else if((a[i-1]==a[i]) && (i+1==n))

{

b[k]=count;

}

System.out.println("k="+k);

for( i=0;i<=k;i++)

System.out.println (b[i]);

double sum=0;

for( i=0;i<=k;i++)

{

float x = (float)b[i]/a.length;

double x1 = Math.log10(x);

double ent =-x*(x1/0.301);

sum=sum+ent;

}

/* Value of Info(D)*/

System.out.println(" info(x) = "+sum);

}

}

Output:-

Info(x) = 2.8156. (When some input array is given (extension of above sample array))

/*Second part deals with Gain(A).*/

import java.util.*;

class Ent1

{

public static void main(String arg[])

{

int[][] a={/* Enter data item values*/};

int b[] = new int[20];

int c[] = new int[20];

int b1[] = new int[20];

int c1[] = new int[20];

double in[] = new double[50];

int sum=0,in1=0;

int count=0,len,k,k1,sum1,count1,k3=0,i3;

double info,info1,info2;

System.out.println(a.length);

len=(a[0].length)-1;

System.out.println(len+1);

for(int i=0;i<len;i++)

{

sum=0;

count=0;

k=0;k1=0;

for(int j=0;j<a.length;j++)

{

if( a[j][i]==1)

{

b[k]=a[j][len];

k++;

sum=sum+1;

}

else

{

c[k1]=a[j][len];

k1++;

count=count+1;

}

}

Arrays.sort(b,0,sum);

Arrays.sort(c,0,count);

/* For array b */

sum1=1;

count1=1;

k3=0;

b1[0]=1;

for( i3=0;i3<sum-1;i3++)

{

if (b[i3]==b[i3+1])

{

sum1=sum1+1;

}

else

{

b1[k3]=sum1;

k3=k3+1;

sum1=1;

}

}

if(i3!=0)

{

if(b[i3-1]!=b[i3])

{

if(i3+1==sum)

b1[k3]=1;

}

else

if((b[i3-1]==b[i3]) && (i3+1==sum))

{

b1[k3]=sum1;

}

}

info1=0;

for( i3=0;i3<=k3;i3++)

{

float x = (float)b1[i3]/sum;

double x1 = Math.log10(x);

double ent =-x*(x1/0.301);

info1=info1+ent;

}

info1=((float)sum/a.length)*info1;

/* For array c */

for( i3=0;i3<count-1;i3++)

{

if (c[i3]==c[i3+1])

{

count1=count1+1;

}

else

{

c1[k3]=count1;

k3=k3+1;

count1=1;

}

}

if(i3!=0)

{

if(c[i3-1]!=c[i3])

{

if(i3+1==count)

c1[k3]=1;

}

else

if((c[i3-1]==c[i3]) && (i3+1==count))

{

c1[k3]=count1;

}

}

info2=0;

for( i3=0;i3<=k3;i3++)

{

float x2 = (float)c1[i3]/count;

if(x2==0)

{

info2=0;

}

else

{

double x3 = Math.log10(x2);

double ent1 =-x2*(x3/0.301);

info2=info2+ent1;

}

}

info2=((float)count/a.length)*info2;

info=/*Info(x) (from Ist program)*/ -(info1+info2);

in[in1]=info;

in1++;

}

for(int i=0;i<len;i++)

{

System.out.println("i="+i);

System.out.println(in[i]);

}

double max=in[0];

int inx=0;

/* To find max data item in the array */

for (int i=1;i<len;i++)

{

if(in[i]>max)

{

max=in[i];

inx = i;

}

}

System.out.println("max="+max+"index="+inx);

}

}

Output:-

Max= 0.489

Index = 39 (which means the value at index 39 will become root value in the decision tree).

Conclusions

In our paper we discussed about web and how to mould web documents into a structural format. Out of many methods available in data mining we have used ID3 algorithm for classification and we implemented a java program for finding information gain. This program is for better understanding of the algorithm.