Constructing an accurate prediction model from a small training data set is an important but difficult task in the field of forecasting. When the data size is small, the incomplete data may mean that the resulting model cannot sufficiently represent the true data structure, or may cause the model training to overfit. To address this issue, this paper presents an approach that combines multiple prediction models to extract data information in multiple facets. In the multi-model approach, a compromise weight method is proposed to determine the relative reliability of each prediction model. The methods used include multiple regression, an artificial neural network, and support vector machines for regression. A thin-film transistor liquid crystal display manufacturing case study is used to illustrate the details of this research. The empirical results show that the proposed multi-model approach not only reduces manufacturing variation and increases production yield, but also provides a robust and reliable parameter interval for the online engineers in the early manufacturing stage.
Keywords: manufacturing; forecasting; multi-model; small data set; TFT-LCD
In traditional pattern-recognition systems, a single (or particular) learning algorithm is used to predict the true pattern of a given training data set (Ho et al. [
In previous studies of manufacturing yield modelling, Kumar et al. ([
Process capability indices (PCIs) have been proposed for the manufacturing industry to control the process yield or line yield (defined as the percentage of processed product units passing inspection) using past in-control data. For a single characteristic, the production yield can be measured by several well-known PCIs, but the challenge is measuring the production yield with multiple characteristics (Pearn et al. [
A multiple-model approach has been proposed to take advantage of the many kinds of pattern analysis and different facets of data information (Kuncheva [
There are many studies that discuss combining multiple models. Todorovski and Džeroski ([
Small data sets are associated with incomplete data and insufficient information. In data analysis, incomplete data may mean that the model produced cannot sufficiently represent the true data structure, which increases the model variation. Management in the early stages of production is a difficult but critical task, because problems must be detected and policies corrected as early as possible. However, in the early manufacturing process, the online engineers usually determine the manufacturing parameters based on small pilot data sets and their own manufacturing experience. The rough parameters thus produced usually cause more variation in the production yield.
In this study, we propose a multi-model system to help the engineers determine the manufacturing parameters in a factory. Unlike past prediction methods that use a single prediction model, the proposed method combines several learning models to acquire the prediction pattern from the data. Moreover, some previous multi-model studies address only classification problems, so their weight-combination methods are designed for classification purposes. The main aim of this approach is to produce a robust forecasting model to determine the manufacturing parameters. We develop the multi-model system by combining various forecasting models, including multiple regression, a backpropagation (BP) neural network, and support vector machines for regression (SVR) with Gaussian and polynomial kernels. The system is built on model weights determined by the models' output accuracies (mean absolute errors). Note that a compromise weight determination method is developed here on the basis of the mega-trend diffusion (MTD) function proposed by Li et al. ([
The reasons for selecting these models are as follows. The multiple regression model is used to build up the linear relation between the input and output variables. A backpropagation neural network is a multiple-layer neural network that can forecast well using nonlinear data, such as with the XOR (exclusive OR) problem (Haykin [
A real problem from a thin-film transistor liquid crystal display (TFT-LCD) manufacturing factory in Taiwan is examined in this paper. Robustness and yield improvement in the factory are the performance indices used to show that the proposed multi-model system is effective when the training data set is small.
The remainder of this paper is organised as follows. In Section 2, a review of small-data-set analysis literature is presented, and several prediction models are introduced. Section 3 shows the detailed multi-model procedure. The empirical case study in TFT-LCD manufacturing and the multi-model system performance are presented in Section 4. Finally, conclusions are given in Section 5, along with a suggestion for future research.
In this section, we will review the literature on small-data-set analysis and several prediction models that include the backpropagation neural network and support-vector regression.
Small data sets usually suffer from incomplete data and insufficient information; in data analysis, this may mean that any model that is developed does not accurately represent the data structure. There are many studies that discuss small-data-set problems. One approach, virtual sample generation, is a data-preprocessing method proposed to enhance the prediction performance for small-data-set problems. The original idea of virtual data generation was proposed by Niyogi et al. ([
Another approach is usually named the wrapper method, in which artificial neural networks and kernel methods are used to deal with small-data-set problems. For example, Huang and Moraga ([
Since the first neural-network model, the perceptron, was proposed in 1957, many different neural-network models have been developed, such as back-propagation neural networks and self-organising map networks. Over the last two decades, artificial neural networks (ANN) (Haykin [
A back-propagation algorithm is also called a generalised delta rule; from the perspective of a learning algorithm, it can be seen as a nonlinear extension of the least-mean-square (LMS) approach. This kind of neural network is commonly known as a back-propagation neural network, with the structure shown in Figure 1.
Figure 1. Structure of a back-propagation neural network.
Backpropagation learning consists of two passes through the layers of the network: a forward pass and a backward pass. In the forward pass, the input vector propagates layer by layer through the network to the output. If $X$ is the input data set and $\mathbf{x}$ denotes an input signal with $M$ attributes, then in the $q$th training iteration the net input to neuron $j$ in layer $s$ of an $S$-layer network is

$$v_j^{(s)}(q) = \sum_i w_{ji}^{(s)}(q)\, y_i^{(s-1)}(q),$$

and the output of neuron $j$ is $y_j^{(s)}(q) = \varphi\big(v_j^{(s)}(q)\big)$, where $\varphi(\cdot)$ is the activation function.

The new weight, $w_{ji}(q+1)$, is then updated by $w_{ji}(q+1) = w_{ji}(q) + \Delta w_{ji}(q)$, where $\Delta w_{ji}(q) = \eta\,\delta_j(q)\, y_i(q)$ and $\eta$ is the learning rate.

The local gradient is $\delta_j(q) = e_j(q)\,\varphi'\big(v_j(q)\big)$ if neuron $j$ is in the output layer of the network, or $\delta_j(q) = \varphi'\big(v_j(q)\big)\sum_k \delta_k(q)\, w_{kj}(q)$ if neuron $j$ is in a hidden layer, where $\delta_k$ is the local gradient propagated back from neuron $k$ of the next layer.
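As an illustration, one forward/backward training step for a one-hidden-layer network can be sketched in pure Python. Sigmoid activations, the network size, and the learning rate of 0.3 are arbitrary choices for the example, not the paper's exact configuration:

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_step(x, t, w_hid, w_out, eta=0.3):
    """One forward/backward pass for a one-hidden-layer network.

    w_hid[j][i] weights input i to hidden node j; w_out[j] weights
    hidden node j to the single output node. Returns the squared error.
    """
    # Forward pass: propagate the input layer by layer to the output.
    hid = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    y = sigmoid(sum(w * h for w, h in zip(w_out, hid)))

    # Backward pass: local gradients (generalised delta rule).
    delta_out = (t - y) * y * (1.0 - y)             # output-layer delta
    delta_hid = [h * (1.0 - h) * delta_out * w      # hidden-layer deltas
                 for h, w in zip(hid, w_out)]

    # Weight updates: w(q+1) = w(q) + eta * delta * input signal.
    for j in range(len(w_out)):
        w_out[j] += eta * delta_out * hid[j]
    for j, row in enumerate(w_hid):
        for i in range(len(row)):
            row[i] += eta * delta_hid[j] * x[i]
    return (t - y) ** 2
```

Repeating `train_step` over the training data drives the squared error down by gradient descent, which is the behaviour the BPNN model in the later experiments relies on.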
A support-vector machine (SVM) is a promising pattern-recognition technique that was first proposed by Vapnik ([
$$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{n=1}^{N}\xi_{n} \quad \text{subject to} \quad y_{n}\big(\mathbf{w}^{\top}\mathbf{x}_{n} + b\big) \ge 1 - \xi_{n},\quad \xi_{n} \ge 0.$$

Here, the parameter C tunes the trade-off between the margin and the acceptable amount of error; the $\xi_n$ are slack variables that measure how far a point falls on the wrong side of its margin.
Vapnik extended the SVM to regression models, treating a regression problem in the same optimisation framework as classification. Given a training set of N samples $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$, where $\mathbf{x}_n$ is the input vector and $t_n$ is the corresponding target value, an $\varepsilon$-insensitive error function is proposed as a trade-off between the acceptable amounts of error.
$$E_{\varepsilon}\big(y(\mathbf{x}) - t\big) = \begin{cases} 0, & |y(\mathbf{x}) - t| < \varepsilon, \\ |y(\mathbf{x}) - t| - \varepsilon, & \text{otherwise.} \end{cases}$$
When minimising the objective function (also called the error function):
$$C\sum_{n=1}^{N} E_{\varepsilon}\big(y(\mathbf{x}_n) - t_n\big) + \frac{1}{2}\|\mathbf{w}\|^{2}.$$
We can re-express the optimisation problem by introducing slack variables into the model. For each data point $\mathbf{x}_n$, we use two slack variables, $\xi_n \ge 0$ and $\hat{\xi}_n \ge 0$, to describe the points that lie outside the tube $y(\mathbf{x}) \pm \varepsilon$, where $\xi_n > 0$ corresponds to a point for which $t_n > y(\mathbf{x}_n) + \varepsilon$, and $\hat{\xi}_n > 0$ corresponds to a point for which $t_n < y(\mathbf{x}_n) - \varepsilon$, as illustrated in Figure 2.
Figure 2. Illustration of SVM regression.
The condition for a target point to lie inside the $\varepsilon$-tube is that $y(\mathbf{x}_n) - \varepsilon \le t_n \le y(\mathbf{x}_n) + \varepsilon$. The purpose of introducing the slack variables is to allow points to lie outside the tube provided that the slack variables are non-zero, and the corresponding conditions are:

$$t_n \le y(\mathbf{x}_n) + \varepsilon + \xi_n, \qquad t_n \ge y(\mathbf{x}_n) - \varepsilon - \hat{\xi}_n.$$
Hence, the error function for support vector regression can then be written as:
$$C\sum_{n=1}^{N}\big(\xi_n + \hat{\xi}_n\big) + \frac{1}{2}\|\mathbf{w}\|^{2}.$$
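As a sanity check on the loss behaviour, the $\varepsilon$-insensitive error can be written as a small helper; the default ε = 0.05 matches the SVR settings used later in the case study:

```python
def eps_insensitive(y_pred, t, eps=0.05):
    """Epsilon-insensitive error: zero inside the eps-tube, linear outside."""
    r = abs(y_pred - t)
    return 0.0 if r < eps else r - eps
```

Residuals smaller than ε cost nothing, so the fitted function is not penalised for staying anywhere inside the tube; only excursions beyond it contribute linearly to the objective.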
Using the Lagrange multipliers to solve the quadratic programming problem, we can find the dual problem as:
$$\max_{a,\hat{a}}\ -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}(a_n - \hat{a}_n)(a_m - \hat{a}_m)\,k(\mathbf{x}_n, \mathbf{x}_m) - \varepsilon\sum_{n=1}^{N}(a_n + \hat{a}_n) + \sum_{n=1}^{N}(a_n - \hat{a}_n)\,t_n$$

subject to $0 \le a_n \le C$, $0 \le \hat{a}_n \le C$, and $\sum_{n=1}^{N}(a_n - \hat{a}_n) = 0$,
where $k(\mathbf{x}, \mathbf{x}')$ is called a positive semidefinite kernel (or Mercer kernel), which is symmetric (i.e. $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$) and satisfies the following equation:
$$\iint k(\mathbf{x}, \mathbf{x}')\, g(\mathbf{x})\, g(\mathbf{x}')\, d\mathbf{x}\, d\mathbf{x}' \ge 0 \quad \text{for all square-integrable functions } g.$$
Each Mercer kernel can be expressed as $k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$, where $\phi$ is a feature mapping and $\langle \cdot, \cdot \rangle$ is the inner product. Moreover, commonly used kernel functions are polynomial kernels, radial-basis-function kernels, and two-layer perceptron kernels (Yu et al. 2008):
$$k(\mathbf{x}, \mathbf{x}') = \big(\mathbf{x}^{\top}\mathbf{x}' + 1\big)^{d}, \qquad k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^{2}}{2\sigma^{2}}\right), \qquad k(\mathbf{x}, \mathbf{x}') = \tanh\!\big(\kappa\,\mathbf{x}^{\top}\mathbf{x}' + \theta\big),$$
where the kernel parameters ($d$, $\sigma$, $\kappa$, and $\theta$) are specified a priori by the user.
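The two kernels used later in the case study, polynomial and Gaussian, can be sketched directly. The defaults d = 4 and 0.5 mirror the tuned values in the case study, though whether the toolbox's Gaussian parameter 0.5 plays the role of σ or of another width parameterisation is an assumption here:

```python
import math

def poly_kernel(x, z, d=4):
    """Polynomial kernel (x . z + 1)^d."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** d

def gaussian_kernel(x, z, sigma=0.5):
    """Radial-basis-function kernel exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

Both functions are symmetric in their arguments, as a Mercer kernel must be, and the Gaussian kernel of any point with itself is exactly 1.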
There are many studies that discuss the performance of SVR for different kernels, such as Sánchez ([
In this section, we introduce the proposed multi-model for small data sets and the implementation procedure.
Different methods and techniques can be applied to analyse and process data, and each has its own unique approach to extracting valuable, method-oriented information. Our goal is therefore to develop a prediction model that combines multiple forecast models to extract information from multiple facets. Some of these prediction models forecast well in general, while others perform well specifically for minimum mean absolute (or squared) errors or minimum forecast variances on a small data set. Given that $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ is the training data set, where N is the size of the data set, $\mathbf{x}_i$ is the vector of M explanatory variables for the ith sample in S, and the response variable $y_i$ is a real number, the forecast model can be constructed by aggregating multiple forecast models over the small training data set S.
Two commonly used structures for combining multiple models are shown in Figure 3. On the left-hand side, the model is generated by applying k different learning algorithms to a single training data set S; on the right-hand side, the model is generated by applying a single learning algorithm A with d different model-parameter settings to the same training data set S.
Figure 3. Structures for combining multiple models.
The main aim of the two proposed structures is to extract a variety of relationships among the data attributes to construct a combined forecast model for the small data set. Figure 4 shows the multi-model system, which includes two main phases: multiple models and model combination. In the multiple-model phase, we not only use different learning algorithms, but also consider different parameter settings within a single learning machine.
Figure 4. Multi-model system for small data sets.
In the model-combination phase, deciding the combination weights is an important task in the multi-model system, whose main purpose is to minimise the forecast mean absolute errors. The forecast model, represented by the multiple models $f_1, \ldots, f_k$ with weights $w_1, \ldots, w_k$ drawn from the weight space $W$, is denoted $f(\mathbf{x}) = \sum_{k} w_k f_k(\mathbf{x})$. Thus, the prediction error (PE) is defined as in Byon et al. ([
$$\mathrm{PE} = E_{S}\Big[E_{(\mathbf{x}, y)}\,\big|\,y - f(\mathbf{x}; S)\,\big|\Big],$$
where the outer expectation of the absolute error is taken over the training set S, whose members are independently and identically distributed, and the inner expectation is taken over the test observations $(\mathbf{x}, y)$.
In the small-training-data problem, it is unlikely that minimum prediction errors can be achieved; the training model may suffer from overfitting, and so compromise weights should be considered. In this paper, we modify the model using the mega-trend diffusion (MTD) function, which was built to deal with the quantity problem of small data sets. The detailed concept of the MTD function is described below.
The mega-trend diffusion (MTD) function was first proposed by Li et al. ([
Figure 5. MTD function (Li et al. [
Suppose $X = \{x_1, \ldots, x_N\}$ is the given training sample; the boundaries a and b are then defined as follows:
$$a = u_{set} - Skew_L \times \sqrt{-2 \times \big(\hat{s}^2 / N_L\big) \times \ln\!\big(10^{-20}\big)},$$

$$b = u_{set} + Skew_U \times \sqrt{-2 \times \big(\hat{s}^2 / N_U\big) \times \ln\!\big(10^{-20}\big)},$$
where $u_{set} = \big(\min(X) + \max(X)\big)/2$ is the centre of the data range, $\hat{s}^2$ is the variance of X, $N_L$ and $N_U$ are the numbers of data points smaller and larger than $u_{set}$, respectively, and $Skew_L = N_L/(N_L + N_U)$ and $Skew_U = N_U/(N_L + N_U)$ are the skewness weights. The diffusion (membership) value of a point x is then

$$M(x) = \begin{cases} (x - a)/(u_{set} - a), & a \le x \le u_{set}, \\ (b - x)/(b - u_{set}), & u_{set} < x \le b. \end{cases}$$
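A minimal sketch of the MTD computation follows, assuming the common form of the boundaries from Li et al. (the 10⁻²⁰ constant is the usual negligible-membership threshold); the guard against an empty side of the range is an implementation choice of this sketch:

```python
import math

def mtd_bounds(x):
    """Mega-trend diffusion boundaries (a, b) and centre u_set for sample x."""
    u_set = (min(x) + max(x)) / 2.0
    mean = sum(x) / len(x)
    s2 = sum((v - mean) ** 2 for v in x) / (len(x) - 1)  # sample variance
    n_l = sum(1 for v in x if v < u_set) or 1            # guard empty side
    n_u = sum(1 for v in x if v > u_set) or 1
    skew_l = n_l / (n_l + n_u)
    skew_u = n_u / (n_l + n_u)
    spread = -2.0 * math.log(1e-20)
    a = u_set - skew_l * math.sqrt(spread * s2 / n_l)
    b = u_set + skew_u * math.sqrt(spread * s2 / n_u)
    return a, b, u_set

def mtd_height(v, a, b, u_set):
    """Triangular membership height of v within [a, b], peaking at u_set."""
    if v < a or v > b:
        return 0.0
    if v <= u_set:
        return (v - a) / (u_set - a)
    return (b - v) / (b - u_set)
```

Applied to a small sample, the boundaries extend beyond the observed minimum and maximum, which is exactly how MTD compensates for the information gaps of a small data set.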
In our model, the mean absolute errors, $e_1, \ldots, e_k$, of the multiple models are given as the input to the modified mega-trend diffusion (MMTD) function. Each mean absolute error $e_k$ has a corresponding sample variance for its forecast model. We collect the model variances and compute the expected model variance to replace the variance term used in MTD. Hence, the boundaries a and b in MMTD are defined as follows:
$$a = \bar{e}_{set} - Skew_L \times \sqrt{-2 \times \big(\bar{s}^2 / N_L\big) \times \ln\!\big(10^{-20}\big)},$$

$$b = \bar{e}_{set} + Skew_U \times \sqrt{-2 \times \big(\bar{s}^2 / N_U\big) \times \ln\!\big(10^{-20}\big)},$$
where $\bar{e}_{set}$ is the average of the minimum and maximum errors, $\bar{s}^2$ is the expected variance of the corresponding models, and $N_L$ and $N_U$ are the numbers of model errors smaller and larger than $\bar{e}_{set}$, respectively. The height of an error $e_k$ in the MMTD function is

$$h(e_k) = \begin{cases} (e_k - a)/(\bar{e}_{set} - a), & a \le e_k \le \bar{e}_{set}, \\ (b - e_k)/(b - \bar{e}_{set}), & \bar{e}_{set} < e_k \le b, \end{cases}$$

and these heights are used as the corresponding model weights.
Figure 6 shows the MMTD function; the height at each error point $e_k$ gives the compromise weight of the corresponding model.
Figure 6. MMTD function.
The proposed procedure can be summarised as follows.
Assume that S is a small training data set and T is the testing data set; each sample $\mathbf{x}_i$ (i = 1, ..., N in S, or i = 1, ..., q in T) has M attributes, and $y_i$ is the target value of $\mathbf{x}_i$, which is a continuous number.
Step 1 (data standardisation): Standardise the data set when the scales of the explanatory variables differ widely.
Step 2 (training models): Train the chosen models, including the single model with different parameter settings, using the training data set S.
Step 3 (collecting the mean absolute errors): Input the testing data into the models trained in Step 2 and collect the mean absolute errors $e_1, \ldots, e_k$.
Step 4 (weight determination): Construct the MMTD function and compute the height of each error in the MMTD function as the corresponding model weight.
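Steps 3 and 4, plus the final weighted forecast, can be condensed into two small helpers. The triangular-height form over user-supplied bounds [a, b] is an assumption following the MTD shape, and the normalisation of heights to sum to one is an implementation choice of this sketch:

```python
def mmtd_weights(errors, a, b):
    """Heights of each model error under a triangular MMTD profile on
    [a, b], peaking at the midpoint of the smallest and largest error;
    heights are normalised so that the weights sum to one."""
    centre = (min(errors) + max(errors)) / 2.0
    def height(e):
        if e <= centre:
            return (e - a) / (centre - a)
        return (b - e) / (b - centre)
    h = [height(e) for e in errors]
    total = sum(h)
    return [v / total for v in h]

def combine(forecasts, weights):
    """Weighted multi-model forecast: y_hat = sum_k w_k * f_k(x)."""
    return sum(w * f for w, f in zip(weights, forecasts))
```

With the weights in hand, each new input is pushed through every trained model and the individual forecasts are blended by `combine`, which is the multi-model system's final prediction.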
We use an empirical study in TFT-LCD manufacturing to illustrate our multi-model system. The details of problem definition, factor selection, data collection, model parameter setting, and the experiment design are described in the following subsections.
The process of producing a TFT-LCD is mainly separated into three steps: the TFT and Colour Filter (CF) Process, the Cell Process, and the Module Process. A colour filter is one of the important components of a TFT-LCD, and the key processes in manufacturing a CF include coating the black matrix (BM), green (G), red (R), blue (B), ITO (Indium Tin Oxide), and photo-spacer (PS) layers (Figure 7); these are flow-shop production-type processes. The coating of the photo-spacer layer is the most important step, because it is the last process in making a CF, in which the target PSH (photo-spacer height) value is determined by the height of the photo-spacer layer. If the PSH does not meet the specifications, the cost of reprocessing is much greater than that for the other layers. Therefore, cell-manufacturing engineers state that related panel data must be provided to calculate the cell-manufacturing parameters when CF base panels are shipped to the cell-making station.
Figure 7. Main processes in manufacturing CF.
The structure of PSH is shown in Figure 8, where PSth denotes the thickness of the photo-spacer layer, Rth the thickness of the R layer, and BMth the thickness of the BM layer. The thicknesses of the BM, R, and photo-spacer layers are mainly determined by the coating speed in the CF manufacturing process: when the coating speed is low, the thickness is high, and vice versa. The final index that the cell process requires is PSH, whose value is estimated as a function of the Rth, BMth, and PSth thicknesses.
Figure 8. PSH structure.
When PSH exceeds the specification in a CF, the cell has a defect known as 'Gravity Mura', and when PSH is below the specification, it has a defect known as 'Push Mura'. It is difficult to control the quality performance of PSH in CF manufacturing, because the PSH value accumulates the variance of the BM, R, and photo-spacer layers.
For cost control, a random sampling strategy is used to measure the PSH: the quality-control engineer can sample only one out of every 100 glass panels. In the SPC (statistical process control) database, only 13 PSH measurements are available for the previous 3 months.
In this case, only 13 measurement data points were collected from a TFT-LCD manufacturing company in Taiwan, and Table 1 shows only the first five data points for business reasons. In the multi-model system, three commonly used forecast models are applied: multiple regression, a backpropagation neural network (BPNN), and a support vector machine for regression (SVR). In the SVR model, two kinds of kernel, polynomial and Gaussian, are employed. The analysis tool is Matlab 7.0 with the statistics and neural-network toolboxes, plus an SVM toolbox, which is downloaded from
Table 1. First five raw data.
PSH     Rth     BMth    PSth
3.4879  2.0471  1.3472  3.8273
3.5525  2.0297  1.3930  3.8796
3.5999  2.0327  1.3936  3.9238
3.5350  2.0216  1.3792  3.8896
3.5178  2.0752  1.3590  3.8684
For the parameter sensitivity analysis, this research uses the 13 collected data points and leave-one-out cross-validation to ensure that the learning model is robust. The mean absolute error (MAE) is the performance index. In the model-parameter sensitivity experiment, the two SVM kernel parameters are based on the range of parameter values suggested by Ali and Smith-Miles ([
Table 2. Parameter sensitivity analysis results.
SVR learning model (Gaussian kernel)
  Parameter:  0.4     0.5     0.6     0.7     0.8     0.9     1
  MAE:        0.0288  0.0285  0.0289  0.0285  0.0287  0.0289  0.0287
SVR learning model (polynomial kernel)
  Parameter:  1       2       3       4       5
  MAE:        0.0289  0.0287  0.0286  0.0285  0.0285
BPNN learning model
  Parameter (LR/HNs):  (0.3/3) (0.3/4) (0.3/5) (0.4/3) (0.4/4) (0.4/5) (0.5/3) (0.5/4) (0.5/5)
  MAE:                 0.0210  0.0263  0.0206  0.0244  0.0261  0.0267  0.0245  0.0255  0.0318
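The leave-one-out MAE used throughout this sensitivity analysis can be sketched generically: hold out each sample in turn, train on the rest, and average the absolute prediction errors. The one-variable least-squares learner below is only a stand-in for the paper's actual models:

```python
def loocv_mae(xs, ys, fit, predict):
    """Leave-one-out cross-validation MAE for any fit/predict pair."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]   # all samples except the i-th
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        errors.append(abs(predict(model, xs[i]) - ys[i]))
    return sum(errors) / len(errors)

def fit_line(xs, ys):
    """One-variable least squares, a stand-in learner for the example."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept
```

With only 13 samples, leave-one-out is the natural choice: every model is validated on all 13 points while still never predicting a sample it was trained on.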
Table 3. Parameter settings for the corresponding training models.
Multiple regression: none
BPNN: learning rate = 0.3; one hidden layer; five hidden nodes; 8000 learning iterations
SVR (polynomial): kernel = polynomial; degree = 4; C = 1000; epsilon = 0.05
SVR (Gaussian): kernel = Gaussian; kernel parameter = 0.5; C = 1000; epsilon = 0.05
The attributes PSH, Rth, and BMth in Table 1 are the input features, and the prediction target is PSth, because we expect to find the best PSth parameter setting for the PS layer in CF manufacturing. In the model building, we train the prediction models with the 13 collected data points, and leave-one-out cross-validation is used to find the mean absolute errors of the prediction models. Table 4 shows the mean absolute errors and the corresponding standard deviations (STDs) for the prediction models. Finally, we use the mean absolute errors of the prediction models as the input to the MMTD function and compute the compromise weights for the four prediction models.
Table 4. Mean absolute errors and standard deviation of the prediction models.
        Multiple regression   BPNN     SVR (polynomial)   SVR (Gaussian)
Mean    0.0427                0.0206   0.0285             0.0285
SD      0.0276                0.0167   0.0155             0.0179
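For illustration, compromise weights can be recomputed from the Table 4 MAEs under the triangular-MMTD assumption described earlier. The diffusion bounds here are derived from the errors' own spread, so the resulting numbers are illustrative rather than the paper's exact weights:

```python
import math

# Table 4 mean absolute errors: regression, BPNN, SVR(poly), SVR(Gaussian)
maes = [0.0427, 0.0206, 0.0285, 0.0285]

centre = (min(maes) + max(maes)) / 2.0
mean = sum(maes) / len(maes)
s2 = sum((e - mean) ** 2 for e in maes) / (len(maes) - 1)
n_l = sum(1 for e in maes if e < centre) or 1
n_u = sum(1 for e in maes if e > centre) or 1
spread = -2.0 * math.log(1e-20)
a = centre - (n_l / (n_l + n_u)) * math.sqrt(spread * s2 / n_l)
b = centre + (n_u / (n_l + n_u)) * math.sqrt(spread * s2 / n_u)

def height(e):
    # triangular MMTD height, peaking at the centre of the error range
    return (e - a) / (centre - a) if e <= centre else (b - e) / (b - centre)

heights = [height(e) for e in maes]
weights = [h / sum(heights) for h in heights]
```

Note that, unlike an accuracy-proportional scheme, the compromise profile does not hand all the weight to the single best model; every model retains a positive weight, which is what stabilises the forecast on a small data set.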
This paper uses model robustness and the empirical results in the factory to verify whether the proposed multi-model system is effective for small-data-set prediction. To test model robustness, we use 30 experimental products as the testing data set. Table 5 shows the mean absolute errors and standard deviations for the prediction models, including the proposed multi-model system using the combination weights computed from the 13 training data points. The multi-model system has a smaller mean absolute error and STD than the other prediction models, except for the multiple regression model. Comparing the performance of the multiple regression model in Tables 4 and 5, we find that the STDs differ between the two cases. To verify the effectiveness of the proposed approach, we designed a further experiment for the small-data-set scenario by randomly sampling training data sets of sizes 5, 10, 15, 20, and 25 from the 30 experimental products; the experiment was repeated 30 times for each data-set size. The results in Table 6 clearly show that the STD of the proposed method is lower than that of the other models when the data size is small. For the case with 30 samples, although the MAE of the proposed method is not lower than that of linear regression, the difference is not statistically significant in a t-test. Hence, we conclude that the proposed method is effective and robust when the data set is small.
Table 5. Mean absolute errors and standard deviation of the prediction models and the multi-model for 30 data.
        Multiple regression   BPNN     SVR (polynomial)   SVR (Gaussian)   Multi-model
Mean    0.0132                0.0383   0.0315             0.0320           0.0261
SD      0.0118                0.0293   0.0210             0.0294           0.0185
Table 6. Experiment results for the small-data-set scenarios.
Size         Multiple regression   BPNN     SVR (polynomial)   SVR (Gaussian)   Multi-model
5    MAE     0.0683                0.0644   0.0715             0.0662           0.0652
     SD      0.0658                0.0390   0.0314             0.0546           0.0305
10   MAE     0.0305                0.0328   0.0345             0.0447           0.0312
     SD      0.0202                0.0276   0.0301             0.0326           0.0196
15   MAE     0.0287                0.0292   0.0286             0.0299           0.0287
     SD      0.0164                0.0314   0.0293             0.0192           0.0157
20   MAE     0.0292                0.0274   0.0330             0.0359           0.0286
     SD      0.0156                0.0138   0.0215             0.0253           0.0136
25   MAE     0.0291                0.0258   0.0123             0.0236           0.0209
     SD      0.0141                0.0121   0.0209             0.0232           0.0120
To verify the empirical results, we build upper and lower limits for the PSth parameter settings and require the online engineers to follow these specifications when setting the PSth parameters. The lower and upper limits of PSH, set by the module engineers, are 3.43 and 3.53, respectively, and the corresponding values of Rth and BMth are 2.026 and 1.358. The upper and lower limits of the PSth values obtained from the models are shown in Table 7.
Table 7. PSth values obtained from the models.
             PSH    Rth     BMth    Multiple regression   BPNN     SVR (poly)   SVR (Gaussian)   Multi-model
Lower limit  3.43   2.026   1.358   3.7862                3.8460   3.8365       3.8364           3.8261
Upper limit  3.53   2.026   1.358   3.8578                3.8852   3.8386       3.8587           3.8599
After 20 weeks of tracking in the factory, the production yield of the PS layer was indeed improved from 95.61% to 98.96%, and the yield variation decreased from 6.35 to 0.29. The PS yield chart is shown in Figure 9.
Figure 9. PS yield in CF.
As product life cycles have become shorter, management in the early stages of manufacturing systems has become more important. Unfortunately, the data collected in the early manufacturing stage are usually small and incomplete. This paper proposed a multi-model approach for small-data-set prediction in order to take advantage of various prediction-learning algorithms and achieve more robust and accurate prediction results. In addition, the MMTD method was developed to build the compromise weight system, and a TFT-LCD manufacturing case study was used to demonstrate the proposed model in detail. The empirical verification with 30 pilot experimental products and the resulting yield improvement show that the proposed method is both useful and effective when the data size is small. Regarding the model parameter settings, since the proposed model has been shown to be robust across various parameter values, an engineer who wants to apply this methodology can start with the parameter values listed in Table 3. However, parameter selection is a case-by-case problem, and trying several parameter values to confirm the model's correctness is a prudent way to find the final model. In this paper, we used only TFT-LCD data sets to verify the proposed method; in future work, other product data sets for production-yield prediction would also be valuable for verifying this approach.
By Der-Chiang Li; Chiao-Wen Liu and Wen-Chih Chen