This paper develops a methodology for enhancing regression models using cluster-based sampling techniques (CST) to achieve high predictive accuracy and to handle large datasets. Hard clustering (KMeans) or soft clustering (Fuzzy C-Means) is used to generate samples, called clusters, which in turn are used to build Local Regression Models (LRMs) for the given dataset. These LRMs are then combined into a Global Regression Model. This methodology is called the Enhanced Regression Model (ERM). The performance of the proposed approach is tested on five different datasets. The experimental results reveal that the proposed methodology yields better predictive accuracy than the non-hybrid MLR model, and that Fuzzy C-Means performs better than the KMeans clustering algorithm for sample selection. Thus, the ERM has the potential to handle data with uncertainty and complex patterns while producing a high prediction accuracy rate.
Keywords: Clustering; KMeans; fuzzy c-means; multiple linear regression; regression; sampling methods
A Regression Model (RM) is a common and widely used statistics-based predictive technique in many research fields. A Multiple Linear Regression (MLR) model is a statistical tool that models the linear relationship between a set of predictors and a dependent variable, and predicts the dependent variable from that fitted relationship [[
Data sampling is the process of using a small amount of data to obtain the overall characteristics of the whole dataset. This process allows a subgroup of individual items to be selected from a population in order to assess the characteristics of the entire population. The sampling techniques frequently used in research are sequential sampling, random sampling, cluster sampling, systematic sampling, and stratified sampling [[
In Sequential sampling, a sequence of one or more individuals is taken from some part of the dataset to form a sample for analysis [[
The simple random model is the most basic form of probability sampling. In this model, each individual item to be studied has an equal chance of being selected, and researchers use a random process to select members. It provides an unbiased and excellent estimate of the parameters [[
Cluster sampling [[
In the Systematic Sampling model [[
In the Stratified sampling method [[
The advantage of using sampling for regression modeling is that models can be developed from a subset of the data without affecting the characteristics and quality of the entire dataset. In this paper, a cluster-based sampling method is used to sample the data in order to achieve better prediction accuracy. KMeans and Fuzzy C-Means clustering techniques are used to create optimal subsets of the data (i.e., samples) for building an efficient regression model.
The main contributions of this paper can be summarized as follows:
- • An Enhanced Regression Model (ERM) using cluster-based sampling methods is introduced, built on hard and soft clustering techniques.
- • To improve the predictive accuracy of MLR models, a hybrid methodology is proposed that combines MLR with popular clustering techniques (KMeans, Fuzzy C-Means, or a combination of both). The proposed ERM is less prone to prediction error.
- • Experimental results of the proposed methodology are presented in this paper to demonstrate its effectiveness over existing techniques.
The remainder of this paper is structured as follows. Section 2 introduces the background needed to understand cluster-based sampling techniques and the metrics used to assess the performance of the proposed work. Section 3 gives a detailed description of the proposed regression approach using cluster-based sampling. Section 4 discusses the results for the various datasets. Finally, Section 5 summarizes the proposed work and outlines its possible future extensions.
This section provides a brief review of current works related to this research. When modeling MLR, the minimum required sample size is determined [[
Tabachnick and Fidell used the formula "50 + 8 × number of factors" to calculate the minimum required sample size [[
Albattah [[
The authors of [[
Josien K et al [[
In [[
In [[
The authors of [[
Following a review of the literature, it was revealed that there is no benchmark sampling technique for multiple linear regression (MLR) modeling. Furthermore, there is no rule of thumb for identifying the most appropriate sampling technique for developing an MLR model that achieves the best results. Therefore, the objective of this work is to investigate the feasibility of finding optimal data samples for generating MLR models using the KMeans and Fuzzy C-Means algorithms, or a combination of both, to reduce errors and improve the prediction accuracy rate.
It is a popular technique for prediction tasks. It uses several independent (explanatory) variables to predict the outcome of a dependent (response) variable, and its purpose is to model the linear relationship between the dependent and independent variables.
The matrix version of MLR is shown in Fig. 1. In Fig. 1, X is an n × 2 matrix of independent variables, Y is an n × 1 column vector of the dependent variable, β is a 2 × 1 column vector of coefficient values, and ɛ is an n × 1 column vector of error values. The pseudocode of MLR is given in Algorithm 1.
Graph: Fig. 1 Components of MLR.
Step 1. Calculate the coefficient vector β = (X′X)⁻¹X′Y, where X′ is the transpose of X. Step 2. The predicted value of Y for a given X is Ŷ = Xβ. Step 3. The residuals are defined as ɛ = Y − Ŷ. End
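The normal-equation steps of Algorithm 1 can be sketched in a few lines of NumPy; variable and function names here are illustrative, not from the paper:

```python
import numpy as np

# Minimal sketch of Algorithm 1: ordinary least squares via the
# normal equations. X already includes a leading column of ones
# for the intercept term.
def mlr_fit(X, Y):
    # Step 1: beta = (X'X)^(-1) X'Y, solved without an explicit inverse
    return np.linalg.solve(X.T @ X, X.T @ Y)

def mlr_predict(X, beta):
    # Step 2: Y_hat = X beta
    return X @ beta

# Example: recover y = 1 + 2x exactly from noiseless data
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
Y = 1.0 + 2.0 * x
beta = mlr_fit(X, Y)
residuals = Y - mlr_predict(X, beta)   # Step 3: e = Y - Y_hat
print(beta)   # approximately [1.0, 2.0]
```

On noiseless linear data the residual vector is (numerically) zero, which is a quick sanity check for the fitting step.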
It is a type of probability sampling in which the entire dataset is divided into several clusters (groups), known as sample units. The items within a cluster have homogeneous properties, and each cluster has an equal probability of being part of the sample.
The KMeans clustering technique [[
Input: 'D', dataset; 'k', number of clusters (samples) Output: 'k' clusters (samples) Step 1: Initialize the 'k' cluster centroids randomly. Step 2: Do • Assign each item to the cluster with the closest centroid • Update each cluster's centroid as the mean of all items in the cluster Until the centroids of the 'k' clusters converge Step 3: Output the non-overlapping clusters as sample units
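Algorithm 2 can be sketched directly in NumPy (this is a generic Lloyd-style KMeans, not the authors' implementation; the data and names are illustrative):

```python
import numpy as np

# Hedged sketch of Algorithm 2: KMeans producing k non-overlapping
# clusters that serve as sample units.
def kmeans(D, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = D[rng.choice(len(D), k, replace=False)]  # Step 1
    for _ in range(iters):
        # Assign each item to the closest centroid
        labels = np.argmin(
            np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2),
            axis=1)
        # Update each centroid as the mean of its assigned items
        new_centroids = np.array(
            [D[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence check
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs yield two clean sample units
D = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, _ = kmeans(D, 2)
print(labels)   # first two points share one label, last two the other
```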
In Fuzzy C-Means clustering [[
Input: 'D', dataset; 'k', number of clusters (samples) Output: 'k' clusters (samples) Step 1. Initialize the cluster centers, set the number of clusters as k, the fuzzy index as m, and the maximum number of iterations as t Step 2. For each iteration • Update the partition matrix using u_ij = 1 / Σ_l (d_ij / d_il)^(2/(m−1)) • Update the cluster centers using v_j = (Σ_i u_ij^m x_i) / (Σ_i u_ij^m) End for Step 3: Output the overlapping clusters as sample units
Table 1 shows the measures used to evaluate the performance of the regression model: the Root Mean Squared Error (RMSE) and the coefficient of determination (R²).
Table 1 The qualitative performance measures
Qualitative performance Formula Meaning
RMSE RMSE = √((1/n) Σ (Pᵢ − Aᵢ)²) It measures the error between a predicted value and an actual value; lower RMSE values indicate a better fit. Here n is the number of data items, P the predicted values, and A the actual values.
R2 R² = 1 − SS_res/SS_tot R-squared, also known as the coefficient of determination, ranges from 0 to 1. It measures the strength of the relationship between the model and the dependent variable on a 0–100% scale.
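The two measures in Table 1 can be computed as follows (the helper names are illustrative):

```python
import numpy as np

def rmse(actual, predicted):
    # Root Mean Squared Error: sqrt of the mean squared difference
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((predicted - actual) ** 2))

def r_squared(actual, predicted):
    # Coefficient of determination: 1 - SS_res / SS_tot
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

A = [1.0, 2.0, 3.0, 4.0]   # actual values
P = [1.1, 1.9, 3.2, 3.8]   # predicted values
print(rmse(A, P))          # about 0.158
print(r_squared(A, P))     # 0.98, i.e. a close fit
```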
Fuzzy C-Means clustering [[
The sampling technique with the minimal 'F' value is selected as the best sampling technique. The weights w1 and w2 are used to vary the relative importance of the RMSE and R² of the RM, respectively. Each weight lies in the range 0 to 1, and w1 + w2 must be less than or equal to 1.
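The exact form of the selection function F is not reproduced in this copy of the paper. The sketch below assumes the common weighted form F = w1·RMSE + w2·(1 − R²), which is minimized by low error and high R², consistent with the "minimal F is best" rule above; the candidate values are hypothetical:

```python
# ASSUMED form of the selection function: the paper's equation is not
# available here, so F = w1*RMSE + w2*(1 - R2) is used as a plausible
# stand-in that is minimal for low RMSE and high R2.
def selection_f(rmse_val, r2_val, w1=0.5, w2=0.5):
    assert 0 <= w1 <= 1 and 0 <= w2 <= 1 and w1 + w2 <= 1
    return w1 * rmse_val + w2 * (1.0 - r2_val)

# Hypothetical (RMSE, R2) pairs for two candidate sampling models
candidates = {
    "KMeans nc=2": (0.70, 0.60),
    "FCM nc=3": (0.55, 0.65),
}
best = min(candidates, key=lambda name: selection_f(*candidates[name]))
print(best)   # "FCM nc=3": lower RMSE and higher R2 give the smaller F
```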
In Equations (
Graph: Fig. 2 Flowchart for proposed methodology.
Algorithm 4: Enhanced Regression Model (ERM) Input: Data (x1, x2, x3, x4, ...), set of independent variables Output: Y, dependent variable; coefficients of the MLR model Begin Step 1: Data pre-processing Step 2: Divide the dataset D into training and testing sets Step 3: For the training dataset • Cluster1 = KMeans(D, number of clusters) • Cluster2 = FCM(D, number of clusters) • Local_Regression_Model (LRM) = MLR(clusters) • Global_Regression_Model (GRM) = Mean(coefficients of the Local_Regression_Models) End Step 4: Predict values using the trained GRM • Validate the GRM using the testing data • Evaluate the performance of the GRM using RMSE and R² End
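The GRM step of Algorithm 4 can be sketched end-to-end: fit one local MLR per cluster, then average the local coefficient vectors into a global model. In the paper the cluster labels come from KMeans or FCM; here they are supplied directly for brevity, and all names are illustrative:

```python
import numpy as np

def fit_local(X, Y):
    # One Local Regression Model: OLS with an intercept column added
    Xb = np.column_stack([np.ones(len(X)), X])
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ Y)

def fit_global(X, Y, labels):
    # Global Regression Model = mean of the local coefficient vectors
    local = [fit_local(X[labels == c], Y[labels == c])
             for c in np.unique(labels)]
    return np.mean(local, axis=0)

rng = np.random.default_rng(1)
X = rng.random((40, 2))
Y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]   # exact linear data
labels = np.array([0, 1] * 20)            # two toy clusters
beta = fit_global(X, Y, labels)
print(beta.round(2))   # each LRM recovers [1, 2, 3], so their mean does too
```

Averaging coefficients keeps the global model cheap to build, but it implicitly weights each cluster equally regardless of its size, which is one design choice a reader may wish to vary.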
The time complexity of the KMeans algorithm is O(ncdi) and that of FCM is O(ndc²i), where n is the number of items, c the number of clusters, d the number of dimensions, and i the number of iterations [[
The benchmark regression datasets are taken from the UCI Machine Learning Repository. Dataset 1 (affairs) contains 265 observations and 18 variables; Dataset 2 (bostonhousing) contains 506 observations and 14 variables; Dataset 3 (ailerons) contains 7154 observations and 40 variables; Dataset 4 (bostonhousingord) contains 506 observations and 14 variables; and Dataset 5 (abaloneord) contains 4177 observations and 11 variables. KMeans and Fuzzy C-Means are used to create non-overlapping and overlapping partitional clusters with two different k values for the five datasets. In FCM, a record (sample) can belong to more than one cluster; therefore, FCM clusters contain more samples than KMeans clusters. Table 2 shows the intra-cluster distance among the samples of each cluster for KMeans clustering with K = 2 and K = 3 for the five datasets.
Table 2 Intracluster distance of Clusters using KMeans
Datasets KMeans Cluster Distance nc = 2 KMeans Cluster Distance nc = 3 Cluster1 Cluster2 Cluster1 Cluster2 Cluster 3 Dataset1 2.0309 3.3818 1.9533 3.3975 2.0127 Dataset2 177.5425 99.6672 86.6228 95.0425 148.6258 Dataset3 186.2835 178.1267 183.6211 39.7705 175.3967 Dataset4 99.6400 177.5122 148.5934 86.5937 95.0340 Dataset5 1.0542 1.7706 1.4263 1.0590 1.4277
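The paper does not spell out its intra-cluster distance formula. One common choice, assumed here for illustration, is the mean distance of a cluster's points to the cluster centroid:

```python
import numpy as np

# ASSUMED definition: mean Euclidean distance from each point in the
# cluster to the cluster centroid. Other definitions (sum of pairwise
# distances, sum of squared distances) are also in common use.
def intra_cluster_distance(points):
    centroid = points.mean(axis=0)
    return np.linalg.norm(points - centroid, axis=1).mean()

cluster = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
print(round(intra_cluster_distance(cluster), 4))   # 0.925
```

Lower values indicate tighter, more homogeneous sample units, which is how Tables 2 and 3 should be read.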
Table 3 shows the intra-cluster distance among the samples of each cluster for FCM clustering with K = 2 and K = 3 for the five datasets. The number of samples in each cluster for the KMeans and FCM clustering techniques is shown in Tables 4 and 5, respectively. The intra-cluster distances of the FCM clusters are slightly higher, owing to the overlapping nature of the FCM clustering technique.
Table 3 Intracluster distance of Clusters using FCM
Datasets FCM Cluster Distancenc = 2 FCM Cluster Distance nc = 3 Cluster1 Cluster2 Cluster1 Cluster2 Cluster 3 Dataset1 2.3885 2.5780 2.2987 2.2987 2.3965 Dataset2 185.3292 190.6931 93.6841 75.1141 151.7577 Dataset3 107.7190 168.0013 153.0127 158.9691 118.5412 Dataset4 167.9695 107.6928 50.4372 102.5795 95.0340 Dataset5 1.4051 1.7409 0.9411 1.0726 1.6720
Table 4 Sample size for different dataset using KMeans
Datasets Sample size for nc = 2 Sample size for nc = 3 Cluster1 Cluster 2 Cluster1 Cluster 2 Cluster 3 Dataset1 204 61 50 155 60 Dataset2 149 357 313 38 155 Dataset3 3336 3818 3167 418 3569 Dataset4 357 149 155 313 38 Dataset5 1341 2836 1528 1342 1307
Table 5 Sample size for different dataset using FCM
Datasets Sample size for nc = 2 Sample size for nc = 3 Cluster1 Cluster 2 Cluster1 Cluster 2 Cluster 3 Dataset1 180 100 165 165 85 Dataset2 369 144 109 272 137 Dataset3 4244 3573 1985 1983 3621 Dataset4 144 369 137 109 272 Dataset5 2359 2509 1305 1345 1263
Once clustering is complete, the MLR technique is applied to each individual cluster of KMeans and FCM to create a Local Regression Model (LRM). Performance metrics such as the RMSE and R-Squared value of each cluster are tabulated. Table 6 presents the RMSE value for each LRM corresponding to the sample clusters of the KMeans clustering technique, and Table 7 presents the RMSE value for each LRM corresponding to the sample clusters of the FCM clustering technique.
Table 6 RMSE value of LRM with KMeans for different dataset
Datasets RMSE for nc = 2 Cluster1 Cluster2 Dataset1 1.7966 0.5748 Dataset2 0.6576 0.8661 Dataset3 1.1521 1.2768 Dataset4 0.3534 11.3010 Dataset5 0.6282 0.8784
Table 7 RMSE value of LRM with FCM for different dataset
Datasets RMSE for nc = 2 Cluster1 Cluster2 Dataset1 1.0444 0.6307 Dataset2 0.8739 0.6572 Dataset3 1.1587 1.2726 Dataset4 0.3738 11.5440 Dataset5 0.5526 0.8626
When comparing the RMSE values of the LRMs of the KMeans and FCM clustering techniques, the LRMs of the FCM clusters have lower RMSE values than those of the KMeans clusters. The mean RMSE values of the LRMs for each dataset using the KMeans and FCM clustering techniques are compared in Fig. 3, which clearly shows that FCM clustering combined with MLR has a lower RMSE value for almost all datasets.
Table 8 R-Squared Value of LRM with KMeans for different dataset
Datasets R-Squared for nc = 2 Cluster1 Cluster2 Dataset1 0.2528 0.1341 Dataset2 0.8035 0.6035 Dataset3 0.8053 0.7647 Dataset4 0.5948 0.2780 Dataset5 0.5071 0.3820
Table 8 displays the R-Squared value for each LRM corresponding to the sample clusters of the KMeans clustering technique with K = 2 and K = 3. Table 9 shows the R-Squared value for each LRM corresponding to the sample clusters of the FCM clustering technique with K = 2 and K = 3. When comparing the R-Squared values of the LRMs of the two techniques, the LRMs of the FCM clusters have lower values than those of the KMeans clusters. This means that the LRMs for the KMeans clusters have a higher correlation between the actual and fitted values than those of the FCM clusters.
Table 9 R-Squared Value of LRM with FCM for different dataset
Datasets R-Squared for nc = 2 Cluster1 Cluster2 Dataset1 0.0137 –0.0051 Dataset2 0.5922 0.8064 Dataset3 0.8021 0.8006 Dataset4 0.6581 0.2667 Dataset5 0.3326 0.3248
From the above tables, it is observed that neither the RMSE nor the R-Squared metric alone is sufficient to judge the best sampling technique to use with MLR. Therefore, the selection criterion function defined in Equation (1) is used in this work to identify the best cluster-based sampling technique, along with MLR, to generate the Global Regression Model (GRM) for each dataset.
Table 10 lists the F values of all datasets for weights w1 = 0.5 and w2 = 0.5. For datasets 1 and 5, KMeans-based sampling is the best sampling technique; for datasets 2, 3, and 4, FCM-based sampling is the best. Table 11 lists the best LRMs for weights w1 = 0.7 and w2 = 0.3; the results are unchanged, so KMeans-based sampling remains the best for datasets 1 and 5, and FCM-based sampling for datasets 2, 3, and 4. Table 12 lists the best LRMs for weights w1 = 0.3 and w2 = 0.7. In this case, FCM is the best sampling technique for datasets 1, 4, and 5, and KMeans is the best for datasets 2 and 3. The minimum F value for all datasets is reached when only the RMSE of the model is considered. Therefore, for datasets 1, 4, and 5, FCM-based LRMs are used for GRM development; similarly, for datasets 2 and 3, KMeans-based LRMs are used for GRM development. Table 14 compares the RMSE values of the proposed GRM and the existing MLR for all datasets.
Table 10 F values of each dataset for w1 = 0.5 &w2 = 0.5
Dataset KMeans Clustering FCM Clustering Best 'F' Value Best Model nc=2 nc=3 nc=2 nc=3 Dataset1 3.1775 2.9584 116.6978 5.1632 2.9584 KMeans with nc = 3 Dataset2 1.0917 1.0940 1.0978 1.0289 1.0289 FCM with nc = 3 Dataset3 1.2442 1.2459 1.2318 1.2494 1.2318 FCM with nc = 2 Dataset4 4.0593 4.7265 4.0608 3.1375 3.1375 FCM with nc = 3 Dataset5 1.5014 1.5686 1.8749 1.7758 1.5014 KMeans with nc = 2
Table 11 F values of each dataset for w1 = 0.7 &w2 = 0.3
Dataset KMeans Clustering FCM Clustering Best 'F' Value Best Model nc=2 nc=3 nc=2 nc=3 Dataset1 2.3808 2.1236 70.3537 3.3042 2.1236 KMeans with nc = 3 Dataset2 0.9597 0.9405 0.9649 0.9033 0.9033 FCM with nc = 3 Dataset3 1.2323 1.2377 1.2253 1.2379 1.2253 FCM with nc = 2 Dataset4 4.7665 5.7629 4.8200 3.5215 3.5215 FCM with nc = 3 Dataset5 1.2022 1.2588 1.4080 1.3146 1.2021 KMeans with nc = 2
Table 12 F values of each dataset for w1 = 0.3 &w2 = 0.7
Dataset KMeans Clustering FCM Clustering Best 'F' Value Best Model nc=2 nc=3 nc=2 nc=3 Dataset1 1.1857 0.8714 0.8376 0.5157 0.5157 FCM with nc = 3 Dataset2 0.7619 0.7103 0.7656 0.7150 0.7103 KMeans with nc = 3 Dataset3 1.2145 1.2253 1.2157 1.2206 1.2145 KMeans with nc = 2 Dataset4 5.8272 7.3175 5.9589 4.0975 4.0975 FCM with nc = 3 Dataset5 0.7533 0.7941 0.7076 0.6228 0.6228 FCM with nc = 3
Table 13 Characteristics for GRM of all Datasets
Dataset Best Sampling Model No. of Clusters Weightage F-Value RMSE R-Squared Dataset1 FCM 3 w1 = 0.7, w2 = 0.3 0.3915 0.5157 0.1019 Dataset2 KMeans 3 w1 = 0.7, w2 = 0.3 0.7002 0.7103 0.6767 Dataset3 FCM 3 w1 = 0.3, w2 = 0.7 1.0643 1.2206 0.7824 Dataset4 FCM 3 w1 = 0.3, w2 = 0.7 1.9293 4.0975 0.4592 Dataset5 FCM 3 w1 = 0.7, w2 = 0.3 0.5384 0.6228 0.3414
Figure 3 compares the mean RMSE values for each dataset using the KMeans and FCM clustering techniques; FCM clustering combined with MLR has a lower RMSE value for almost all datasets. Figure 4 compares the mean R-Squared values for each dataset; it is evident that KMeans clustering combined with MLR has a higher R-Squared value for almost all datasets. Table 14 compares the RMSE values of the GRM and the existing MLR model for all datasets.
Graph: Fig. 3 Comparison of RMSE value for all datasets.
Graph: Fig. 4 Comparison of R-Squared value for all datasets.
Table 14 Comparison of GRM and MLR RMSE values for all datasets
Dataset GRM RMSE MLR RMSE Dataset1 0.5157 2.7143 Dataset2 0.7103 1.2053 Dataset3 1.2145 2.5452 Dataset4 4.0975 5.0482 Dataset5 0.6228 6.0187
In this work, a clustering-based sampling technique has been hybridized with the MLR technique to build a Global Regression Model. For cluster-based sampling, the KMeans (hard) and Fuzzy C-Means (soft) clustering techniques have been used along with MLR to reduce errors and improve the prediction accuracy rate. The empirical study has shown that creating an optimal number of clusters with either KMeans or FCM before applying MLR effectively reduces prediction errors. FCM was found to be a more efficient sampling technique than KMeans for the MLR model because it further reduces the prediction error. However, it is not easy to judge in advance whether FCM or KMeans is the better cluster-based sampling technique to use with MLR; this depends purely on the nature of the dataset. Therefore, the proposed methodology combines the benefits of both cluster-based sampling techniques for building the GRM.
On the basis of the empirical study, the computation time of the GRM was found to be slightly higher than that of the KMeans and FCM algorithms, which is a limitation of the proposed method. However, the proposed GRM methodology achieves good RMSE on all datasets, which means that the GRM can select more representative samples to build the MLR model with higher accuracy. As future work, swarm-intelligence-based clustering techniques will be explored for sample selection in the regression model to reduce the computational complexity.
The first author acknowledges the UGC-Special Assistance Programme (SAP) for the financial support to her research under the UGC-SAP at the level of DRS-II (Ref.No.F.5-6/2018/DRS-II (SAP-II)), 26 July 2018 in the Department of Computer Science.
By S. Dhamodharavadhani and R. Rathipriya