Effect of feature selection and dataset size on the accuracy of naïve Bayesian classifier and logistic regression

 

Table Of Contents


  • TITLE PAGE
  • DECLARATION
  • CERTIFICATION
  • DEDICATION
  • ACKNOWLEDGEMENT
  • ABSTRACT
  • TABLE OF CONTENTS
  • LIST OF TABLES
  • LIST OF FIGURES

Chapter ONE

INTRODUCTION

  • 1.1 Background to the Study
  • 1.2 Statement of the Problem
  • 1.3 Aim and Objectives of the Study
  • 1.4 Significance of the Study
  • 1.5 Motivation
  • 1.6 Scope and Limitation of the Study
  • 1.7 Definition of Terms

Chapter TWO

LITERATURE REVIEW

Chapter THREE

SYSTEM DESIGN AND IMPLEMENTATION

  • AND MATERIALS
  • 3.1 Introduction
  • 3.2 Source of Data
  • 3.3 Method of Data Analysis
  • 3.4 Principal Component Analysis (PCA)
  • 3.5 Logistic Regression
    • 3.5.1 Binary Logit Model from the Logistic Function
  • 3.6 Naïve Bayesian Classifier
    • 3.6.1 Formulation of the Model
    • 3.6.2 Learning the Model
  • 3.7 Classification of New Data
  • 3.8 Model's Performance Evaluation
    • 3.8.1 Confusion Matrix
    • 3.8.2 Comparing Multiple Models

Chapter FOUR

SYSTEM TESTING AND EVALUATION

  • ANALYSIS AND DISCUSSION
  • 4.1 Introduction
  • 4.2 Model Building and Evaluation for the Breast Cancer Dataset
    • 4.2.1 Building and evaluating NB on the larger dataset, with no PCA
    • 4.2.2 Building and evaluating NB on the smaller dataset, with no PCA
    • 4.2.3 Building and evaluating LR on the larger dataset, with no PCA
    • 4.2.4 Building and evaluating LR on the smaller dataset, with no PCA
    • 4.2.5 Building and evaluating NB on the larger dataset, with PCA
    • 4.2.6 Building and evaluating NB on the smaller dataset, with PCA
    • 4.2.7 Building and evaluating LR on the larger dataset, with PCA
    • 4.2.8 Building and evaluating LR on the smaller dataset, with PCA
  • 4.3 Model Building and Evaluation for the Heart Disease Dataset
    • 4.3.1 Building and evaluating NB on the larger dataset, with no PCA
    • 4.3.2 Building and evaluating NB on the smaller dataset, with no PCA
    • 4.3.3 Building and evaluating LR on the larger dataset, with no PCA
    • 4.3.4 Building and evaluating LR on the smaller dataset, with no PCA
    • 4.3.5 Building and evaluating NB on the larger dataset, with PCA
    • 4.3.6 Building and evaluating NB on the smaller dataset, with PCA
    • 4.3.7 Building and evaluating LR on the larger dataset, with PCA
    • 4.3.8 Building and evaluating LR on the smaller dataset, with PCA
  • 4.4 Summary of Results

Chapter FIVE

SUMMARY, CONCLUSION AND RECOMMENDATIONS

  • CONCLUSION AND RECOMMENDATION
  • 5.1 Summary
  • 5.2 Conclusion
  • 5.3 Recommendation
  • 5.4 Recommendation and Suggestion for Future Research
  • 5.5 Contribution to Knowledge
  • REFERENCES
  • APPENDIX I
  • APPENDIX II

Project Abstract

<p>Binary Logistic Regression and the Naïve Bayesian classifier are two common classification modelling techniques that predict the category to which a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. We studied the classification performance of these two linear classifiers under different feature (variable) selection criteria and dataset size conditions in the medical domain, using two datasets (breast cancer and heart disease) obtained from the University of California, Irvine, online repository. The results indicated that Logistic Regression on the relatively large datasets without the application of PCA (for variable selection) had the highest accuracy (91.4%), while the Naïve Bayesian classifier with PCA (for variable/feature selection) topped the smaller-dataset classification with an accuracy of 90.2%. The two accuracies are both high and close to each other, which indicates that both classifiers are well suited to classification problems on datasets from this kind of domain.</p>

Project Overview

<p> INTRODUCTION<br>1.1 Background to the Study<br>We start by considering the following problem: suppose you are a medical laboratory technologist with access to the health records of a patient who was admitted for heart disease diagnosis. The natural question that comes to mind is: does the patient have heart disease or not? Or this one: suppose you are a bank, and a person wants to take out a loan; will she default on the loan? Or this: how can your email server tell which emails are spam and which ones are genuine mail? Intuitively, all of the above situations can be resolved by examining empirical data and extracting the factors that are important in each. For example, in the heart disease case, one might look through the hospitalization records of patients who have had heart disease and see whether the new patient resembles them in age, blood pressure, body temperature, diet and exercise habits, family history and other clinical measurements. The above situations are examples of classification problems. Classification is a statistical method used to build predictive models to separate and classify new data points. In machine learning and statistics, classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Feature (variable) selection is the process of identifying and removing as many irrelevant and redundant features as possible from a dataset (Yu and Liu, 2004). This reduces the dimensionality of the data and enables data mining algorithms to operate effectively. The fact that many features depend on one another often unduly influences the accuracy of models, so classification models are affected by the choice of features (variables). 
Purkayastha et al. (2014) stated that selecting the relevant features for classification is significant for a variety of reasons, such as simplification of performance, computational efficiency, and feature interpretability.<br>A model which performs classification is known as a classifier. A classifier is a function which maps an input variable X to a class C. Classifiers are broadly divided into linear and non-linear classifiers. Linear classifiers build models based on a linear combination of the variables' values. They work well for practical problems such as medical diagnosis and document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to learn from the training dataset. There are numerous classifiers today, and the choice of which to use depends on a number of factors, for example simplicity, accuracy, and applicability to the domain and structure of the dataset under consideration (Kwon and Sim, 2013). In this research work, we focused on a discriminative-generative pair of binary linear classifiers, as typified by Binary Logistic Regression (LR) and the Naïve Bayes Classifier (NBC). Bayesian classification is a supervised learning method for classification. The Naïve Bayes classifier is a probabilistic classifier based on applying Bayes' theorem. It assumes that all features are independent of one another given the class. The Naïve Bayes classifier has the following three advantages. First, in some probability models, it can be trained effectively in a supervised learning environment. Second, the amount of training data needed to estimate the parameters necessary for classification need not be large. Third, despite its simple design, the Naïve Bayes classifier operates well in various complicated situations (Yoo and Yang, 2015). 
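As a worked form of the independence assumption just described, Bayes' theorem and the resulting naïve factorization can be sketched as below. The symbols $x_1, \dots, x_n$ for the individual features in X, and $\hat{C}$ for the predicted class, are notation assumed here for illustration and are not taken from the original text.

```latex
\begin{align}
P(C \mid X) &= \frac{P(C)\, P(X \mid C)}{P(X)}
  && \text{(Bayes' theorem)} \\
P(X \mid C) &= \prod_{i=1}^{n} P(x_i \mid C)
  && \text{(features independent given the class)} \\
\hat{C} &= \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
  && \text{(predicted class; } P(X) \text{ is constant across classes)}
\end{align}
```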
Given a class C and a variable vector X, we use the training data to estimate the probabilities P(X | C) and P(C) for all values of X and C. New examples of X can then be classified by applying Bayes' rule to these estimated probability distributions. This type of classifier is called a generative classifier, because we can view the distribution P(X | C) as describing how to generate random instances of X conditioned on the target class C.<br>Logistic Regression, in contrast to Naïve Bayes, is a model that uses training data to directly estimate P(C | X), the probability that an instance with a given set of features (a variable vector X) belongs to a class C. In this sense, Logistic Regression is often referred to as a discriminative classifier, because we can view the distribution P(C | X) as directly estimating the probability of the target C for any given instance of X. The success of classifiers depends on the nature of the relationship between feature selection and classification accuracy. Researchers such as Kwon and Sim (2013) have tried to understand the nature of this relationship using some selected classifier models (algorithms). However, their explanation is too general and therefore not very informative. Ultimately, we would like to understand the performance of the Naïve Bayesian Classifier and Logistic Regression when used as binary classifiers in the same domain (in our case, health problem diagnosis: the presence or absence of an ailment) under certain interactions of dataset size and variable selection method.<br>1.2 Statement of the Problem<br>As the need to analyze big datasets grows rapidly, the role that classification algorithms play in data mining also increases. Kwon and Sim (2013) noted that it is still a complex issue to determine which algorithm is strong or weak in relation to which dataset; they experimentally examined how dataset characteristics affect a model's performance. 
The key problem when dealing with classification is not whether one model is superior to others, but under which conditions a particular method can significantly outperform others on a given application problem. The Naïve Bayesian Classifier and Logistic Regression have been reported to do well on a variety of datasets. This research seeks to find the optimal choice between these two classifiers.<br>1.3 Aim and Objectives of the Study<br>The aim of this research work is to study the classification performance of the Naïve Bayesian Classifier and Logistic Regression under different feature (variable) selection criteria and dataset size conditions in a given domain. The aim shall be achieved through the following objectives:<br>i. building a Naïve Bayesian classifier model for each of the pre-determined conditions;<br>ii. building a Logistic Regression classifier model for each of the pre-determined conditions;<br>iii. testing the models in objectives (i) and (ii) on some datasets in order to measure their respective classification accuracies;<br>iv. performing a test of independence on the interaction of feature selection criteria, dataset size and choice of classifier model (algorithm).<br>1.4 Significance of the Study<br>This study helps to understand the optimal performance of Logistic Regression and the Naïve Bayesian Classifier, both of which are linear statistical classification models that are fast becoming the choice of many researchers. 
In particular, the results help in making optimal decisions on the choice of model, and consequently improve the performance of classification algorithms.<br>1.5 Motivation<br>Our motivation stems from Kwon and Sim (2013), who noted the complexity of determining which classifier model is strong or weak in relation to datasets from a specified domain of study, and concluded that the key to dealing with classification is identifying the conditions under which a particular method significantly outperforms the others on a given application problem. In view of that, we were motivated to study the effect of feature selection and dataset size on the accuracy of NBC and LR, limited to datasets from the medical domain.<br>1.6 Scope and Limitation of the Study<br>This research is limited to empirical data from medical records collected for breast cancer and heart disease, which are suitable for classification. Data from non-medical domains are not considered.<br>1.7 Definition of Terms<br>a) Classification<br>Classification is a statistical method used to build predictive models to separate and classify new data points.<br>b) Classifier<br>The predictive model built to separate and classify new data points is known as a classifier.<br>c) Naïve Bayes<br>Naïve Bayes is a classifier based on applying Bayes' theorem with the basic assumption of independence between every pair of features.<br>d) Logistic Regression (LR)<br>Logistic regression, or the logit model, is a regression model whose dependent variable is categorical and takes only two values, such as pass or fail, win or lose, alive or dead, presence of disease or absence of disease. 
Multinomial logistic regression handles cases in which the dependent variable has more than two categories.<br>e) Training set<br>A training set is a set of data used to discover potentially predictive relationships.<br>f) Testing set<br>A testing set is a set of data used to assess the strength and utility of a predictive relationship.<br>g) Confusion matrix<br>A confusion matrix is a table that is often used to describe the performance of a classification model on a set of data meant for testing (usually called the test dataset) for which the true values are known.<br>| Predicted (no disease) | Predicted (disease)<br>Actual (no disease) | TN | FP<br>Actual (disease) | FN | TP<br>Table 1.1: Sample of a Confusion Matrix<br>Here TN, FP, FN and TP are the numbers of true negatives, false positives, false negatives and true positives respectively.<br>h) Accuracy<br>Accuracy is the percentage of correct predictions made. In other words, accuracy is the proportion of true results (that is, both true positives and true negatives) among the total number of cases examined.<br>i) Precision<br>$\text{Precision} = \dfrac{TP}{TP + FP}$<br>Precision gives the proportion of patients diagnosed by the classifier as having a disease who actually had it. It can be defined as the proportion of true positives in the set of subjects diagnosed as positive for the condition being tested.<br>j) Sensitivity (Recall)<br>$\text{Sensitivity} = \dfrac{TP}{TP + FN}$<br>Sensitivity computes the proportion of patients that actually had the disease who were diagnosed as having it. One should be careful not to confuse sensitivity with precision. Sensitivity gives the proportion of true positives in the set of subjects who have the condition in reality, for example the proportion of patients who actually had breast cancer and were diagnosed as having it by the classifier model.<br>k) Specificity (True negative rate)<br>Specificity (SP) is calculated by dividing the number of correct negative predictions by the total number of negative subjects (patients). 
Specificity may appear in other texts as the true negative rate (TNR); both terms mean the same thing. A specificity of 1.0 is considered the best, whereas 0.0 is considered the worst.<br>In symbols, with TN the number of correct negative predictions and N = TN + FP the total number of negatives:<br>$SP = \dfrac{TN}{TN + FP} = \dfrac{TN}{N}$<br>Remember, TNR = SP.<br>l) Principal Component Analysis<br>Principal Component Analysis (PCA) is a statistical technique used to examine the relationships that exist among a set of variables in order to identify the structural pattern of those variables. PCA, which is closely related to (though distinct from) factor analysis, is a non-parametric analysis whose result is unique and does not depend on any hypothesis about the data distribution.<br>m) Multicollinearity<br>In statistics, multicollinearity (sometimes called collinearity) is a phenomenon whereby two or more predictor variables in a multiple regression model are highly correlated, which indicates that one may be linearly predicted from the others with a substantial degree of accuracy.</p>
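The evaluation measures defined in items (g) to (k) of Section 1.7 can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the study's own implementation: the names `ConfusionMatrix` and `confusion_from_labels` are hypothetical, the labels are coded as 1 = disease and 0 = no disease, and the ten example labels are made up purely for demonstration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary classifier, laid out as in Table 1.1 (hypothetical helper)."""
    tn: int  # actual no disease, predicted no disease
    fp: int  # actual no disease, predicted disease
    fn: int  # actual disease, predicted no disease
    tp: int  # actual disease, predicted disease

    @staticmethod
    def _ratio(numerator: int, denominator: int) -> float:
        # Return NaN instead of raising when a metric is undefined (zero denominator).
        return numerator / denominator if denominator else float("nan")

    def accuracy(self) -> float:
        # Correct predictions (TP + TN) over all cases examined.
        return self._ratio(self.tp + self.tn, self.tn + self.fp + self.fn + self.tp)

    def precision(self) -> float:
        # TP / (TP + FP): share of positive diagnoses that were correct.
        return self._ratio(self.tp, self.tp + self.fp)

    def sensitivity(self) -> float:
        # TP / (TP + FN): share of actual disease cases that were detected (recall).
        return self._ratio(self.tp, self.tp + self.fn)

    def specificity(self) -> float:
        # TN / (TN + FP): share of actual negatives that were predicted negative.
        return self._ratio(self.tn, self.tn + self.fp)


def confusion_from_labels(actual: Sequence[int], predicted: Sequence[int]) -> ConfusionMatrix:
    """Tally a confusion matrix from 0/1 labels (assumed coding: 1 = disease)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = list(zip(actual, predicted))
    return ConfusionMatrix(
        tn=pairs.count((0, 0)),
        fp=pairs.count((0, 1)),
        fn=pairs.count((1, 0)),
        tp=pairs.count((1, 1)),
    )


if __name__ == "__main__":
    # Made-up labels for ten patients, for illustration only.
    actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
    cm = confusion_from_labels(actual, predicted)
    print(cm)
    print("accuracy:", cm.accuracy())
    print("precision:", cm.precision())
    print("sensitivity:", cm.sensitivity())
    print("specificity:", cm.specificity())
```

Returning NaN for an undefined ratio (for example, precision when the classifier never predicts disease) is a design choice made for this sketch; raising an error would be an equally reasonable alternative.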
