Natural Computing Series Series Editors: G. Rozenberg Th. Bäck A.E. Eiben J.N. Kok H.P. Spaink Leiden Center for Natura...

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!

Natural Computing Series Series Editors: G. Rozenberg Th. Bäck A.E. Eiben J.N. Kok H.P. Spaink Leiden Center for Natural Computing

Advisory Board: S. Amari G. Brassard K.A. De Jong C.C.A.M. Gielen T. Head L. Kari L. Landweber T. Martinetz Z. Michalewicz M.C. Mozer E. Oja G. P˘aun J. Reif H. Rubin A. Salomaa M. Schoenauer H.-P. Schwefel C. Torras D. Whitley E. Winfree J.M. Zurada

For further volumes: http://www.springer.com/series/4190

Daniel S. Yeung · Ian Cloete · Daming Shi · Wing W.Y. Ng

Sensitivity Analysis for Neural Networks

123

Authors Prof. Daniel S. Yeung School of Computer Science and Engineering South China University of Technology Wushan Rd. TianHe District Guangzhou, China [email protected] Prof. Daming Shi School of Electrical Engineering and Computer Science Kyungpook National University Buk-gu, Daegu South Korea [email protected]

Prof. Ian Cloete President Campus 3 International University in Germany 76646 Bruchsal, Germany [email protected]; [email protected] Dr. Wing W.Y. Ng School of Computer Science and Engineering South China University of Technology Wushan Rd. TianHe District Guangzhou, China [email protected]

Series Editors G. Rozenberg (Managing Editor) [email protected] Th. Bäck, J.N. Kok, H.P. Spaink Leiden Center for Natural Computing Leiden University Niels Bohrweg 1 2333 CA Leiden, The Netherlands A.E. Eiben Vrije Universiteit Amsterdam The Netherlands

ISSN 1619-7127 ISBN 978-3-642-02531-0 e-ISBN 978-3-642-02532-7 DOI 10.1007/978-3-642-02532-7 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2009938187 ACM Computing Classification (1998): I.2.6, F.1.1 © Springer-Verlag Berlin Heidelberg 2010 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Cover design: KuenkelLopka GmbH Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Neural networks provide a way to realize one of our human dreams to make machines think like us. Artificial neural networks have been developed since Rosenblatt proposed the Perceptron in 1958. Today, many neural networks are not treated as black boxes any more. Issues such as robustness and generalization abilities have been brought to the fore. The advances in neural networks have led to more and more practical applications in pattern recognition, financial engineering, automatic control and medical diagnosis, to name just a few. Sensitivity analysis dates back to the 1960s, when Widrow investigated the probability of misclassification due to weight perturbations, which are caused by machine imprecision and noisy input. For the purpose of analysis, these perturbations can be simulated by embedding disturbance into the original inputs or connection weights. The initial idea of sensitivity analysis was then extended to optimization and to applications of neural networks, such as sample reduction, feature selection, active learning and critical vector learning. This text should primarily be of interest to graduate students, academics, and researchers in branches of neural networks, artificial intelligence, machine learning, applied mathematics and computer engineering where sensitivity analysis of neural networks and related concepts are used. We have made an effort to make the book accessible to such a cross-disciplinary audience. The book is organized into eight chapters, of which Chap. 1 gives an introduction to the various neural network structures and learning schemes. A literature review on the methodologies of sensitivity analysis is presented in Chap. 2. Different from the traditional hypersphere model, the hyper-rectangle model described in Chap. 3 is especially suitable for the most popular and general feedforward network: the multilayer Perceptron. In Chap. 4, the activation function is also involved in the calculation of the sensitivity analysis by parameterizing. The sensitivity analysis of radial basis function networks is discussed in Chaps. 5 and 6, with the former giving a generalization error model whereas the latter concerns optimizing the hidden neurons. In Chap. 7, sensitivity is measured in order to encode prior knowledge into a neural network. In Chap. 8, sensitivity analysis is applied in many applications, such as dimensionality reduction, network optimization and selective learning. We would like to express our thanks to many colleagues, friends and students who provided reviews of different chapters of this manuscript. They include Minh v

vi

Preface

Nhut Nguyen, Xiaoqin Zeng, Patrick Chan, Xizhao Wang, Fei Chen and Lu He. We often find ourselves struggling with many competing demands for our time and effort. As a result, our families, especially our beloved spouses, are the ones who suffer the most. We are delighted to dedicate this work to Foo-Lau Yeung, Wilma Cloete and Jian Liu. It is with great humility that we would like to acknowledge our Good Lord as the true creator of all knowledge. This work is the result of our borrowing a small piece of knowledge from Him. 8 July 2009

Daniel S. Yeung Ian Cloete Daming Shi Wing W.Y. Ng

Contents

1 Introduction to Neural Networks . . . . . . . . . . . . . . 1.1 Properties of Neural Networks . . . . . . . . . . . . . 1.2 Neural Network Learning . . . . . . . . . . . . . . . . 1.2.1 Supervised Learning . . . . . . . . . . . . . . 1.2.2 Unsupervised Learning . . . . . . . . . . . . . 1.3 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Adaline and Least Mean Square Algorithm . . . . . . . 1.5 Multilayer Perceptron and Backpropagation Algorithm 1.5.1 Output Layer Learning . . . . . . . . . . . . . 1.5.2 Hidden Layer Learning . . . . . . . . . . . . . 1.6 Radial Basis Function Networks . . . . . . . . . . . . 1.7 Support Vector Machines . . . . . . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

1 3 5 5 5 6 8 9 11 11 12 13

2 Principles of Sensitivity Analysis . . . . . . . . . 2.1 Perturbations in Neural Networks . . . . . . . 2.2 Neural Network Sensitivity Analysis . . . . . 2.3 Fundamental Methods of Sensitivity Analysis 2.3.1 Geometrical Approach . . . . . . . . 2.3.2 Statistical Approach . . . . . . . . . . 2.4 Summary . . . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

17 17 18 21 21 23 24

3 Hyper-Rectangle Model . . . . . . . . . . . . . . . . 3.1 Hyper-Rectangle Model for Input Space of MLP . 3.2 Sensitivity Measure of MLP . . . . . . . . . . . 3.3 Discussion . . . . . . . . . . . . . . . . . . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

25 25 26 27

4 Sensitivity Analysis with Parameterized Activation Function 4.1 Parameterized Antisymmetric Squashing Function . . . . . 4.2 Sensitivity Measure . . . . . . . . . . . . . . . . . . . . . 4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . .

. . . .

. . . .

. . . .

29 29 30 31

5 Localized Generalization Error Model . . . . 5.1 Introduction . . . . . . . . . . . . . . . . 5.2 The Localized Generalization Error Model 5.2.1 The Q-Neighborhood and Q-Union

. . . .

. . . .

. . . .

. . . .

33 33 35 36

. . . .

. . . .

. . . . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

vii

viii

Contents

5.2.2 The Localized Generalization Error Bound . . . . . . . 5.2.3 Stochastic Sensitivity Measure for RBFNN . . . . . . 5.2.4 Characteristics of the Error Bound . . . . . . . . . . . 5.2.5 Comparing Two Classifiers Using the Error Bound . . 5.3 Architecture Selection Using the Error Bound . . . . . . . . . 5.3.1 Parameters for MC2 SG . . . . . . . . . . . . . . . . . 5.3.2 RBFNN Architecture Selection Algorithm for MC2 SG 5.3.3 A Heuristic Method to Reduce the Computational Time for MC2 SG . . . . . . . . . . . . . . . . . . . . 5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . .

36 38 40 42 42 44 44

. . . .

45 45

6 Critical Vector Learning for RBF Networks . . . . . . . . . . . . . . 6.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Construction of RBF Networks with Sensitivity Analysis . . . . . 6.2.1 RBF Classifiers’ Sensitivity to the Kernel Function Centers 6.2.2 Orthogonal Least Square Transform . . . . . . . . . . . . 6.2.3 Critical Vector Selection . . . . . . . . . . . . . . . . . . 6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

47 47 48 49 51 52 52

7 Sensitivity Analysis of Prior Knowledge . . 7.1 KBANNs . . . . . . . . . . . . . . . . 7.2 Inductive Bias . . . . . . . . . . . . . . 7.3 Sensitivity Analysis and Measures . . . 7.3.1 Output-Pattern Sensitivity . . . . 7.3.2 Output-Weight Sensitivity . . . . 7.3.3 Output-H Sensitivity . . . . . . 7.3.4 Euclidean Distance . . . . . . . 7.4 Promoter Recognition . . . . . . . . . . 7.4.1 Data and Initial Domain Theory 7.4.2 Experimental Methodology . . . 7.5 Discussion and Conclusion . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

55 56 58 59 59 60 61 61 61 62 63 64

8 Applications . . . . . . . . . . . . . . . . . . . . . 8.1 Input Dimension Reduction . . . . . . . . . . . 8.1.1 Sensitivity Matrix . . . . . . . . . . . . 8.1.2 Criteria for Pruning Inputs . . . . . . . 8.2 Network Optimization . . . . . . . . . . . . . 8.3 Selective Learning . . . . . . . . . . . . . . . . 8.4 Hardware Robustness . . . . . . . . . . . . . . 8.5 Measure of Nonlinearity . . . . . . . . . . . . 8.6 Parameter Tuning for Neocognitron . . . . . . 8.6.1 Receptive Field . . . . . . . . . . . . . 8.6.2 Selectivity . . . . . . . . . . . . . . . . 8.6.3 Sensitivity Analysis of the Neocognitron

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

69 69 70 70 71 74 76 77 78 79 80 80

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

83

Chapter 1

Introduction to Neural Networks

The human brain consists of ten billion densely interconnected nerve cells, called neurons; each connected to about 10,000 other neurons, with 60 trillion connections, synapses, between them. By using multiple neurons simultaneously, the brain can perform its functions much faster than the fastest computers in existence today. On the other hand, a neuron can be considered as a basic information-processing unit, whereas our brain can be considered as a highly complex, nonlinear and parallel biological information-processing network, in which information is stored and processed simultaneously. Learning is a fundamental and essential characteristic of biological neural networks. The ease with which they can learn led to attempts to emulate a biological neural network in a computer. In the 1940s, McCulloch and Pitts proposed a model for biological neurons and biological neural networks. A stimulus is transmitted from dendrites to a soma via synapses, and axons transmit the response of one soma to another, as shown in Fig. 1.1. Inspired by the mechanism for learning in biological neurons, artificial neurons and artificial neural networks can perform arithmetic functions, with cells corresponding to neurons, activations corresponding to neuronal firing rates, connections corresponding to synapses, and connection weights corresponding to synaptic strengths, as shown in Fig. 1.1. The analogy between biological neurons and artificial neurons is made in Table 1.1. However, neural networks are far too simple to serve as realistic brain models on the cell level, but they might serve as very good models for the essential information processing tasks that organisms perform. This remains an open question because we have so little understanding of how the brain actually works (Gallant, 1993). In a neural network, neurons are joined by directed arcs – connections. The neurons and arcs constitute the network topology. Each arc has a numerical weight that specifies the influence between two neurons. Positive weights indicate reinforcement; negative weights represent inhibition. The weights determine the behavior of the network, playing somewhat the same role as in a conventional computer program. Typically, there are many inputs for a single neuron, and a subsequent output of an activation function (or transfer function). Some frequently used activation functions include:

D.S. Yeung et al., Sensitivity Analysis for Neural Networks, Natural Computing Series, C Springer-Verlag Berlin Heidelberg 2010 DOI 10.1007/978-3-642-02532-7_1,

1

2

1 Introduction to Neural Networks

a

b

Fig. 1.1 Biological motivations of neural networks. (a) Neuroanatomy of living animals. (b) Connections of an artificial neuron

• Linear Function: (u) = u

1 • Log-Sigmoid Function: (u) = −u 1+e 1 if u ≥ 0 • Hard Limit function: (u) = 0 otherwise Rosenblatt (1958) devised the Perceptron, which is now widely used. Widrow and Hoff (1960) proposed the Adaline at the same time. The Perceptron and Adaline were the first practical networks and learning rules that demonstrated the high potential of neural networks. Table 1.1 Analogy between biological and artificial neurons

Biological Neurons

Artificial Neurons

Soma Dendrite Axon Synapse

Sum + Activation Function Input Output Weight

1.1

Properties of Neural Networks

3

Minsky and Papert (1969) criticized the Perceptron, because they found it is not powerful enough to do tasks such as parity and connectedness. As long as a neural network consists of a linear combiner followed by a nonlinear element, a single-layer Perceptron can perform pattern classification only on linearly separable patterns, regardless of the form of nonlinearity used. Minsky and Papert demonstrated the limitations of the Perceptron with the simplest XOR problem, and as a consequence the research in the field was suspended due to the lack of funding. Because of Minsky and Papert’s conclusion, people lost confidence in neural networks. In the 1970s, the progress on neural networks continued at a much reduced pace, although some researchers, such as Amari, Anderson, Fukushima, Grossberg and Kohonen, kept working on it. In 1985, Ackley, Hinton and Sejnowski described the Boltzmann machine, which is more powerful than a Perceptron, and demonstrated a successful application – NETtalk. Rumelhart, Hinton and Williams (1986) are recognized for their milestone work – the Multilayer Perceptron (MLP) with backpropagation, which remained the dominant neural network architecture for more than ten years. A sufficiently big MLP has been proven to be able to learn any function, and many variants of MLP have been proposed since then. In 1988, Radial Basis Functions (RBFs), introduced as an alternative to MLPs, speeded up MLP training (fast-prop). From the 1990s to date, more and more research has been done to improve neural networks. The research agenda includes regularizers, probabilistic (Bayesian) inference, structural risk minimization and incorporation with evolutionary computation.

1.1 Properties of Neural Networks Neural networks can be described according to their network, cell, dynamic, and learning properties as follows: Network Properties. A neural network is an architecture consisting of many neurons, which work together to respond to the inputs. We sometimes consider a network as a black-box function. Here the external world presents inputs to the input cells and receives network outputs from the output cells. Intermediate cells are not seen externally, and for this reason they are often called hidden units. We classify networks as either feedforward networks if they do not contain directed cycles or recurrent networks if they do contain such cycles. It is often convenient to organize the cells of a network into layers. Cell Properties. Each cell computes a single (numerical) cell output or activation. Cell inputs and activations may be discrete, taking values {0, 1} or {−1, 0, 1}, or they may be continuous, assuming values in the interval [0, 1] or [−1, +1]. Typically, every cell uses the same algorithm for computing its activation. The activation for a cell must be computed from the activations of cells directly connected to it and the corresponding weights for those connections. Every cell (except for input cells) computes its new activation as a function of the weighted sum of the inputs from directly connected cells.

4

1 Introduction to Neural Networks

Dynamic Properties. A neural network model must specify the time when each cell computes its new activation value and when the change to that cell’s output actually takes place. Usually in feedforward models, cells are visited in a fixed order, and each cell reevaluates and changes its activation before the next one is visited. In this case the network achieves a steady state after one pass through the cells, provided the cells are correctly numbered. For recurrent models there are several possibilities. One possibility is to make one ordered pass through the cells, as with feedforward models; but we are no longer guaranteed that the network will reach steady state. Another possibility is to make repeated passes through the network. For discrete models the network will either reach a steady state or it will cycle; for continuous models the network will either reach a steady state, cycle, approach a steady state in the limit, blow up, or some combination of these things. A third possibility is to compute new activations for all cells simultaneously and then make changes to the cell outputs simultaneously. This is similar to the previous case. Still another possibility is to select a cell at random, compute its new activation, and then change its output before selecting the next cell. In this case we have no guarantee of any sort of limiting or cyclic behavior unless other constraints are placed upon the model. Learning Properties. Each neural network needs to be trained to respond to the inputs, so it must be associated with one or more algorithms for machine learning. Machine learning refers to computer models that improve neural network performance in significant ways based upon the input data. Machine learning techniques are usually divided into supervised and unsupervised models. In supervised learning a teacher or critic supplies additional input data that gives a measure of how well a program is performing during a training phase. The most common form of supervised learning is trying to duplicate behavior specified by a finite set of training examples, where each example consists of input data and the corresponding correct output. In unsupervised learning there is no performance evaluation available. Without any specific knowledge of what constitutes a correct answer and what constitutes an incorrect answer, the most that can be expected from these models is the construction of groups of similar input patterns. In the pattern recognition literature this is known as clustering. Neural networks have potential as intelligent control systems because they can learn and adapt, they can approximate nonlinear functions, they are suited for parallel and distributed processing, and they naturally model multivariable systems. If a physical model is unavailable or too expensive to develop, a neural network model might be an alternative. There are now many real-world applications ranging from finance to aerospace. There are also many advanced neural network architectures (Haykin, 1994). Some representative neural network models will be discussed in the remainder of this chapter. In terms of their architectures, neural networks can be categorized into feedforward and recurrent. The difference between these two is that there is at least one feedback loop in the latter. The focus of this book is on feedforward neural networks. Some representative feedforward models will be introduced in the following discussions.

1.2

Neural Network Learning

5

1.2 Neural Network Learning The primary significance of a neural network is its ability to learn from its environment. What is learning? In the context of neural networks, learning is defined as a process by which the free parameters of a neural network are adapted through a continuous process of stimulation by the environment. The type of learning is determined by the manner in which the parameter changes take place. The above definition implies that (1) the network is stimulated by the environment; (2) the network changes as a result of stimulation; and (3) the network responds to the environment in a new way after the occurrence of change.

1.2.1 Supervised Learning During the training session of a neural network, an input is applied to the network, and a response of the network is obtained. The response is compared with an a priori target response. If the actual response differs from the target response, the neural network generates an error signal, which is then used to compute the adjustment that should be made to the network’s synaptic weights so that the actual response matches the target output. In other words, the error is minimized, possibly to 0. Since the minimization process requires a teacher (supervisor), this kind of training is named supervised learning. A supervised learning model is illustrated in Fig. 1.2. The notion of teacher comes from biological observations. For example, when learning a language, we hear the sound of a word from a teacher. The sound is stored in the memory banks of our brain, and we try to reproduce the sound. When we hear our own sound, we mentally compare it (actual response) with the stored sound (desired response) and note the error. If the error is large, we try again and again until it becomes significantly small. An unsupervised learning model is illustrated in Fig. 1.3.

1.2.2 Unsupervised Learning In contrast to supervised learning, unsupervised learning does not require a teacher, i.e., there is no target response. During the training stage, the neural network receives its input patterns and it arbitrarily organizes them into categories. When an input is later applied, the neural network provides an output response to indicate the class to which the input pattern belongs. For example, show a person a set of different objects. Then ask him to separate them into different groups, such that objects in a group have one or more common features that distinguish them from other groups. When this (training) is done, show the same person an object that is unseen and ask him to place the object in one of the groups. He would then put it in the group with which the object shares the most common features. Even though unsupervised learning does not require a teacher, it requires guidelines to determine how it forms groups. Grouping may be based on shape, color, or material consistency or some other properties of the objects. If no guidelines have

6

1 Introduction to Neural Networks Vector describing state of environment Environment

Teacher desired response

Learning System

+ actual response _ Σ

error signal

Fig. 1.2 Supervised learning model Vector describing state of environment

Fig. 1.3 Unsupervised learning model Environment

Learning system

been given as to what type of features should be used for grouping the objects, the grouping may or may not be successful. Similarly, to classify patterns effectively using neural networks, some guidelines are needed.

1.3 Perceptron The Perceptron, the simplest form of a neural network, is able to classify data into two classes. Basically it consists of a single neuron with a number of adjustable weights. The neuron is the fundamental processor of a neural network (Fig. 1.4). The summation in the neuron also includes an offset for lowering or raising the net input to the activation function. Mathematically the input to the neuron is represented by a vector x =< 1, x1 , x2 · · · xn >T, and the output is a scalar y. The weights of the connections are represented by the vector w =< w0 , w1 , w2 · · · wn >T , where wo is the offset. The offset is often also called the bias and denoted with b. The output is calculated as y = f (wT x).

(1.1)

Fig. 1.4 is a Perceptron with two inputs and an offset. With a hard limiter as activation function, the neuron produces an output equal to 0 or +1 that we can associate with the two classes C1 and C2 respectively.

1.3

Perceptron

7 w1

x1

w2

x2

∑

net

...

xp

wp

Summing node

1 0

net

Output signals y

–1

Threshold unit logic

weights

Fig. 1.4 Perceptron consisting of a neuron with an offset and a hard limiter activation function

The weights w are adjusted using an adaptive learning rule. One such learning rule is the Perceptron Convergence Algorithm. If the two classes C1 and C2 are linearly separable (i.e., they lie on opposite sides of a straight line or, in general, a hyperplane), then there exists a weight vector w with the properties

wT x ≥ 0 for every input vector x belonging to class C1 wT x < 0 for every input vector x belonging to class C2

(1.2)

Assuming, to be general, that the Perceptron has m inputs, the equation wT x = 0 in an m-dimensional space with coordinates x1 , x2 · · · xm , defines a hyperplane as the switching surface between the two different classes of input. The training process adjusts the weights w to satisfy the two inequalities (1.2). A training set consists of, say, K samples of the input vector x together with each sample’s class membership (0 or 1). The learning is continued epoch by epoch until the weights stabilize. The core of the Perceptron convergence algorithm for adapting the weights of the elementary Perceptron has the following two steps. Step 1. If the kth member of the training set, xk (k = 1, 2. . . K), is correctly classified by the weight vector wk computed at the kth iteration of the algorithm, no correction is made to wk , i.e.,

wk+1 = wk if wT x ≥ 0 and xk belongs to class C1 wk+1 = wk if wT x < 0 and xk belongs to class C2

(1.3)

Step 2. Otherwise the Perceptron weights are updated according to the rule

wk+1 = wk − ηk xk if wT x ≥ 0 but xk belongs to class C2 wk+1 = wk + ηk xk if wT x < 0 but xk belongs to class C1

(1.4)

Notice the interchange of the class numbers from step 1 to step 2. The learningrate ηk controls the adjustment applied to the weight vector at iteration k. If ηk is a

8

1 Introduction to Neural Networks

constant, independent of the iteration number k, we have a fixed increment adaptation rule for the Perceptron. The algorithm has been proved to converge (Haykin, 1994).

1.4 Adaline and Least Mean Square Algorithm ADALINE is an acronym for ADAptive LINear Element (also known as linear threshold unit). It was developed by Bernard Widrow and Marcian Hoff (1960). As shown in Fig. 1.5, the architecture of the Adaline is the same as that of the Perceptron. The difference between these two models is the activation function. An Adaline has n variable inputs x1 , x2 · · · xn , which take on binary values of either +1 or −1. The bias input x0 is fixed at a value of +1. Associated with the Adaline are n+1 adjustable analog weights w0 , w1 , w2 · · · wn . The weights of the Adaline scale the corresponding inputs, the scaled inputs are summed, and the weighted sum is the input to a threshold device. The threshold device outputs a −1 for negative inputs and a +1 for positive inputs. The output of the threshold device is the Adaline output. The Adaline applies the Least Mean Square (LMS) algorithm, also known as Widrow-Hoff learning. The LMS algorithm is more powerful than the Perceptron learning rule. While the Perceptron rule is guaranteed to converge to a solution that correctly categorizes the training patterns, the resulting network can be sensitive to noise, since patterns often lie close to the decision boundaries. The LMS algorithm minimizes mean square error, and therefore tries to move the decision boundaries as far from the training patterns as possible. The LMS algorithm has found many more practical uses than the Perceptron learning rule. This is especially true in the area of digital signal processing (Hagan, Demuth and Beale, 1996). Given a training set {(x1 , d1 ), (x2 , d2 ). . . (xP , dP )}, where the pth training pair is given by (xp , dp ), with dp being the target output, the mean square error (MSE) between the real output and the target output is calculated by:

w1

x1

w2

x2

f(net)

... xp

Continuous perception

wp weights

Fig. 1.5 Adaptive linear element (Adaline)

Output signals y

1.5

Multilayer Perceptron and Backpropagation Algorithm 2 2 MSE = E[e − wT x)2 ] 2] = E[(dT− y) ] T= E[(d = E d − 2dw x + w xxT w = E[d 2 ] − 2wT E[dx] + wT E[xxT ]w

9

(1.5)

From Eq. (1.5), we can see that the MSE for the Adaline is a quadratic function: 1 J(w) = c + bT w + wT Aw (1.6) 2 with b = −2h = −2E [dx] and A = 2R = 2E xxT . The gradient of the MSE is given by: ∇J(w) = ∇ c + bT w + 12 wT Aw = b + Ax = −2h + 2Rx

(1.7)

To find the optimal weights w∗ , we need to find the stationary point of J(w). Let

∇J(w) = −2h + 2Rx = 0

(1.8)

w∗ = R−1 h

(1.9)

So we have

As the objective of neural network training is to find the optimal w∗ by iterative updating, the steepest gradient method can be applied: w(t + 1) = w(t) − η∇J(t)

(1.10)

where t refers to the tth iteration. However, since it is not desirable to calculate h and R, and not convenient to calculate R−1 , the MSE can be estimated iteratively by square error (Widrow and Hoff, 1960): Jˆ (w) =

1 1 [di (t) − yi (t)]2 = e2 (t) 2 2

(1.11)

and its gradient can be accordingly estimated by 1 2 ∂e (t) ∂ Jˆ (t) ∂e(t) ∂(d(t) − w(t) · x(t)) ∇ Jˆ (w) = = 2 = e(t) · = e(t) · ∂w(t) ∂w(t) ∂w(t) ∂w(t) = e(t) · ( − x(t))

(1.12)

1.5 Multilayer Perceptron and Backpropagation Algorithm As long as a neural network consists of a linear combiner followed by a nonlinear element, a single-layer Perceptron can perform pattern classification only on linearly separable patterns, regardless of the form of nonlinearity used. Linear separability

10

1 Introduction to Neural Networks

requires that the patterns to be classified be sufficiently separated from each other to ensure that the decision boundaries are hyperplanes. As shown in Fig. 1.6, MLP can address the above problem. An MLP consists of an input layer, one or more hidden layers and an output layer. Each neuron is fully connected to all the neurons in its preceding layer as well as all those in its next layer. Typically, the same nonlinear function, such as the log-sigmoid function, is applied to every neuron. The backpropagation algorithm was developed for training multilayer Perceptron networks. It was popularized by Rumelhart, Hinton and Williams (1986), although similar ideas had been developed previously by others. The idea is to train a network by propagating the output errors backward through the layers. The errors serve to evaluate the derivatives of the error function with respect to the weights, which can then be adjusted. The backpropagation algorithm consists of two phases, namely, a feedforward phase and a backpropagation phase. The former applies an input, evaluates the activations and stores the error. The latter computes the adjustments and updates the weights. However, only the errors in the output layer are visible, i.e., can be calculated directly, whereas those in the hidden layers will be propagated from their next layers. The error at the output neuron j at iteration t can be calculated by the difference between the desired output and the corresponding real output, i.e., ej (t) = dj (t) − 2 yj (t). Accordingly, the total error energy of all output neurons is ε(t) = 12 ej (t). j∈C

Referring to Fig. 1.6, the output of the kth neuron in the lth layer can be calculated by:

1

....

l

l–1 w l11 w l12

Fig. 1.6 Multilayer Perceptron

w l1n l−1

....

L

1.5

Multilayer Perceptron and Backpropagation Algorithm

⎛ ylk = f ⎝

⎞

l−1

n

11

⎠ wljk · yl−1 j

(1.13)

j=1

where 1 ≤ l ≤ L, nl refers to the number of neurons in layer l. For the input layer thus holds l = 1, y1j = xj ; for the output layer l = L, yLj = yj .

1.5.1 Output Layer Learning The mean square error (MSE) of the output can be computed by: ⎡ ⎞⎤ ⎛ L−1 nL nL n

2 1

1 ⎣ E= dj − yj = dj − f ⎝ wLij · yiL−1 ⎠⎦ 2 2 j=1

j=1

(1.14)

i=1

The steepest descent of MSE can be used to update the weights: wLij (t + 1) = wLij (t) − η

∂E ∂wLij

(1.15)

Since the output error E is an indirect function of the weights in the hidden layers, the chain rule of calculus is applied to calculate the derivatives. Let l−1 n l l netlj = wlij · yl−1 i , yj = f (netj ); then we have i=1

L ∂E ∂yj ∂E = · ∂wLij ∂yLj ∂wLij

[ − 12pt] =

∂yLj ∂netLj ∂E · · ∂yLj ∂netLj ∂wLij

(1.16)

[ − 12pt] = − dj − yj · f netLj · yiL−1 The error signal of the output can be backpropagated to layer L−1: δjL = −

∂yLj

L ∂E ∂E = − · = d − y · f netj j j ∂netLj ∂yLj ∂netLj

1.5.2 Hidden Layer Learning For any hidden layer l, the descent of the error is calculated by

(1.17)

12

1 Introduction to Neural Networks

∂E ∂wlij

=

∂E ∂ylj

·

∂ylj ∂wlij

=

∂E ∂ylj

·

∂ylj ∂netlj

·

∂netlj ∂wlij

=

∂E ∂ylj

· f netlj · yl−1 i

(1.18)

Since E is not a direct function of the output in the hidden layers, the steepest descent of the error can be calculated by: nl+1

∂yl+1 ∂E ∂E k = · ∂ylk ∂yl+1 ∂yli k=1 k l+1 n ∂yl+1 ∂netl+1 ∂E k k · · = l+1 ∂yli ∂netl+1 k=1 ∂yk k l+1 l+1 n n l+1 l+1 (netl+1 ) · wl+1 = el+1 δ · f · w = k k jk k jk k=1

(1.19)

k=1

where δkl+1 is the propagated error signal from the next layer, and the error signal in

the lth layer δkl = elk · f netlk will then be propagated to layer l−1.

1.6 Radial Basis Function Networks The design of a neural network can be viewed as a curve-fitting problem. According to this viewpoint, learning is equivalent to finding a surface in a multidimensional space that provides a best fit to the training data, whereas generalization is equivalent to the use of the trained multidimensional surface to interpolate the test data. There are two bases in support of Radial Basis Function networks: 1. Cover’s Theorem. A complex pattern classification problem cast in a highdimensional space nonlinearly is more likely to be linearly separable than in a low-dimensional space. Cover’s theorem tells us that we can map the input space to a high-dimensional space, in which a linear function will be found. 2. Tikhonov Regularization Theory. In the context of a hypersurface reconstruction problem, the basic idea of regularization is to stabilize the solution by means of some auxiliary nonnegative functional that embeds prior information about the solution. Tikhonov Regularization Theory tells us that we cannot purely rely on the training data, but must introduce some constraints, e.g., function smoothness. RBF networks are feedforward networks trained using a supervised training algorithm. They are typically configured with a single hidden layer of units whose activation function is selected from a class of functions called basis functions. Actually, RBF networks transform the input space into a high dimensional space in a nonlinear manner and then find the curve-fitting approximation in high-dimensional space. The activation function depends on the Euclidean distance between input and target vectors. An RBF network can be described by

1.7

Support Vector Machines

13

F(x) =

n

wi ϕ ( x − xi )

(1.20)

i=1

where ϕ is a nonlinear transformation from input space to hidden space of high dimensionality. The most common form of basis function is the Gaussian function

x − c 2 ϕ(x) = exp − 2σ 2

(1.21)

where c determines the center of the basis function, and σ is a width parameter that controls how spread out the curve is. The output layer performs a linear mapping from hidden space to output space. The output of the network is the weighted sum of the hidden layer neuron outputs. A set of training data with their corresponding target outputs are given for supervised learning: ⎡

ϕ11 ⎢ ϕ21 ⎢ ⎢ .. ⎣ .

ϕ12 ϕ22 .. .

··· ... .. .

⎤⎡ ⎤ ⎡ ⎤ ϕ1n d1 w1 ⎢ w2 ⎥ ⎢ d2 ⎥ ϕ2n ⎥ ⎥⎢ ⎥ ⎢ ⎥ .. ⎥ ⎢ .. ⎥ = ⎢ .. ⎥ . ⎦⎣ . ⎦ ⎣ . ⎦

ϕn1 ϕn2 · · · ϕnn

wn

(1.22)

dn

So, w = −1 d

(1.23)

In practice, we do not calculate −1 , but apply the following two steps to train RBF networks: (1) Select centers randomly or based on some criteria, (2) update weights, typically using a linear least estimation algorithm.

1.7 Support Vector Machines Given a number of n training samples for a two-class problem {(x1 ,y1 } · · · (xn ,yn }, where x ∈ Rm , y ∈ {−1,+1}, there may exist a nested structure of decision functions {fα (x):α ∈ } where fα :Rm → {−1, + 1}

(1.24)

However, the optimal function should be the one that can achieve the maximum 2 . The goal of support vector machines (SVMs) is to minimize w to get margin w the maximum margin, as shown in Fig. 1.7. The optimal solution fα is supposed to provide the smallest possible value for expected risk R(fα ):

14

1 Introduction to Neural Networks

Fig. 1.7 Illustration of maximum margin

R(fα ) =

|fα (x) − y|P(x,y)dxdy

(1.25)

As the probability distribution P(x,y) is unknown, the Empirical Risk, Remp (fα ) is computed in all the conventional neural networks: 1 1 |fα (xi − yi )| n 2 n

Remp (fα ) =

(1.26)

i=1

However, in terms of Vapnik-Chervonenkis (VC) Dimension theory (Vapnik, 1995), which indicates the maximum number of training points h that can be separated by set of linear functions fα , there is a bound to the expected risk:

R(fα ) ≤ Remp (fα ) +

h ln 2n + 1 − ln η h 4 n

(1.27)

with probability 1 − η. The second term in Eq. (1.27) is called the confidence term or confidence interval. For a fixed number of training examples n, the training error (empirical risk) decreases monotonically as the capacity or VC dimension h is increased, whereas the confidence interval increases monotonically. Accordingly, both the guaranteed risk and the generalization error go through a minimum. Hence, our goal is to find a network such that decreasing the VC dimension occurs at the expense of the smallest possible increase in training error. So the goal of SVMs is to minimize w 2 subject to yi (w · xi + b) ≥ 1, i = 1,...,n. This is equivalent to solving the following constrained Quadratic Programming (QP) problem to construct the Lagrangian:

1.7

Support Vector Machines

15

1 w 2 − αi yi (w · xi + b) − 1 2 n

L (w,b,) =

(1.28)

i=1

where = (α1 ,...,αn ) and ≥ 0. A solution to the QP problem is determined by the saddle point of the function. Differentiate L(w,b,)with respect to W and b.

∂L(w,b,) =w− αi yi xi = 0 ⇒ w = αi yi xi ∂w

(1.29)

∂L(w,b,)

αi y i = 0 = ∂b

(1.30)

n

n

i=1

i=1

n

i=1

So, the optimization problem is converted to maximizing F () =

n

i=1

αi −

n 1

αi αj yi yj xi · xj 2

(1.31)

i,j=1

subject to · y = 0 where ≥ 0. In most cases α = 0. Those input data points with α > 0 are called support vectors. Removing other data points except support vectors will not affect the solution of the classifier. Similarly to RBF, the input space is also mapped to a sufficiently large feature space in SVM, so that patterns become linearly separable, and so a simple Perceptron in feature space can do the classification. SVMs are radically different types of classifiers which have attracted a great deal of attention lately due to the novelty of the concepts that they bring to pattern recognition, their strong mathematical foundation, and their excellent results in practical problems.

Chapter 2

Principles of Sensitivity Analysis

Sensitivity refers to how a neural network output is influenced by its input and/or weight perturbations. Sensitivity analysis dates back to the 1960 s, when Widrow investigated the probability of misclassification caused by weight perturbations, which are caused by machine imprecision and noisy input (Widrow and Hoff, 1960). In network hardware realization, such perturbations must be analyzed prior to its design, since they significantly affect network training and generalization. The initial idea of sensitivity analysis has been extended to the optimization of neural networks, such as through sample reduction, feature selection, and critical vector learning.

2.1 Perturbations in Neural Networks Perturbations of neural networks are caused by machine imprecision and/or input noise. For the purpose of analysis, these perturbations can be simulated by embedding disturbance to the original inputs or connection weights. Perturbation analysis allows the study of the characteristics of a function under small perturbations of the function’s parameter (Holtzman, 1992; Zurada et al., 1997). Perturbation analysis is also important also when assessing network robustness against input noise to measure the uncertainty or fluctuations. In perturbation analysis we are interested in evaluating the disturbance in the function’s response to small perturbations in its parameters. Fig. 2.1 shows how to investigate the effects of perturbations to neural networks. Assuming that the performance function is differentiable, the relationship between the perturbed response of this function and parameter perturbations is expressed by a Taylor expansion of the function. For example, for a one-dimensional cost function g g ( + ) = g ( ) +

2 g ( ) + g ( ) + · · · 1! 2!

(2.1)

where is a parameter of the function and typically includes weight W and input X; is a small perturbation of . The performance cost function g is usually taken D.S. Yeung et al., Sensitivity Analysis for Neural Networks, Natural Computing Series, C Springer-Verlag Berlin Heidelberg 2010 DOI 10.1007/978-3-642-02532-7_2,

17

18

2

Fig. 2.1 General structure for investigating the effects of perturbations

Principles of Sensitivity Analysis

Weight W Reference Neural Network Input X comparison

output error

Perturbed Neural Network

ΔX

ΔW

by the noise-to-signal ratio (NSR) or expectation of decision errors. The Taylor expansion shows that the derivatives of the function with respect to the perturbed parameter encapsulate the characteristics of that function under the perturbations. 2 . Ideally, when → 0, 1! g ( ) + 2! g ( ) + · · · → 0 Eq. (2.1) shows that the derivatives play a very important role in determining the influence of parameter perturbations on the output of the performance function. Sensitivity analysis is applied to investigate how the derivatives can be used to quantify the response of the system to parameter perturbations, and how these derivatives can be calculated. Sensitivity analysis techniques differ mainly in the cost function used, the order of the derivatives that are considered, whether the analysis is in continuous time or for discrete time intervals, and the way in which the derivatives are calculated. Due to computational considerations, sensitivity analysis is based on approximations of Eq. (2.1), usually first-order or second-order approximations. The higher the order of the approximation, the more accurate but more complex and time consuming the process. Usually, sensitivity analysis is done at discrete time intervals for that time interval only. Sensitivity analysis can also be performed for continuous time models, referred to as stochastic analysis (Koda, 1995; 1997).

2.2 Neural Network Sensitivity Analysis Without loss of generality, let us consider a neural network performing a nonlinear, differentiable mapping : I → K , from input x = (x1 , x2 . . . xI ) to output o = (o1 , o2 . . . oK ). Suppose x(n) ∈ , where is an open set. Since o is differentiable at x(n) we have o(x + x) = o(x(n) ) + J(x(n) )x + g(x) where

(2.2)

2.2

Neural Network Sensitivity Analysis

⎡

19 ∂o1 ∂o1 ∂x1 ∂x2

⎢ ⎢ ∂o2 ⎢ J(x ) = ⎢ ∂x1 ⎢ .. ⎣ . (n)

∂oK ∂x1

···

∂o1 ∂xI

⎤

⎥ 2 ⎥ · · · ∂o ∂xI ⎥ .. .. .. ⎥ ⎥ . . . ⎦ ∂oK ∂oK ∂x2 · · · ∂xI ∂o2 ∂x2

(2.3)

is the Jacobian matrix and lim

x→0

g(x) =0 |x|

(2.4)

Fig. 2.2 illustrates geometrical interpretation of Eq. (2.2) in space K . Point o(x(n) ) represents the nominal response of the neural network for the nth element of the training set x(n) . The disturbance x of the input vector causes the perturbed response at o(x(n) + x). This response can be expressed as a combination of three vectors as indicated in Eq. (2.2). The sensitivity analysis can be used for different purposes in neural networks (Engelbrecht, 1999): Optimization. The calculation of the gradient of a function forms an important part of optimization. One of the first uses of sensitivity analysis is therefore in optimization problems (Cao, 1985; Holtzman, 1992). In neural networks, derivatives of the objective function with respect to the weights are computed to locate minima by driving these derivatives to 0. Second order derivatives have also been used to develop more sophisticated optimization techniques to improve convergence and accuracy. Koda (1995, 1997) employed stochastic sensitivity analysis to compute the gradient for time-dependent networks such as recurrent neural networks. Robustness. Neural network robustness and stability analysis is the study of the conditions under which the outcome of the neural network changes. This study is important for hardware implementation of neural networks to ensure stable

Fig. 2.2 Illustration of output disturbance caused by input disturbance

20

2

Principles of Sensitivity Analysis

networks that are not adversely affected by weight, external input and activation function perturbations (Alippi, Piuri and Sami, 1995; Oh and Lee, 1995). Instead of using derivatives to compute the gradient of the objective function with respect to the weights, Jabri and Flower (1991) use differences to approximate the gradient, thereby significantly reducing hardware complexity. Generalization. Fu and Chen (1993) state good generalization must imply insensitivity to small perturbations in inputs. They derive equations to compute the sensitivity of the neural network output vector to changes in input values, and show under what conditions global neural network sensitivity can be reduced. For example, using small slopes for the sigmoid activation function, using as small as possible weights, reducing the number of units, and ensuring activation levels close to 0 or 1 will reduce network sensitivity. Choi and Choi (1992) derive a neural network sensitivity norm which expresses the sensitivity of the neural network output with respect to input perturbations. This neural network norm is then used to select from sets of optimal weights the weight set with lowest neural network sensitivity, which results in the best generalization. Measure of nonlinearity. Lamers, Kok and Lebret (1998) use the variance of the sensitivity of the neural network output to input parameter perturbations as a measure of the nonlinearity of the data set. This measure of nonlinearity is then used to show that the higher the variance of noise injected to output values, the more linearized the problem. Causal inference. Sensitivity analysis has been used to assess the significance of model inputs. Engelbrecht, Cloete and Zurada (1995) use exact derivative calculations to compute the significance values which have a high influence on the neural network output. Goh (1993) derived a similar method using differences to approximate the gradient of the neural network output function with respect to inputs. Selective learning. Hunt and Deller (1995) use weight perturbation analysis to determine the inference each pattern has on weight changes during training. Only patterns that exhibit a high influence on weight changes are used for training. Engelbrecht (1998) presents new active learning models based on sensitivity analysis, which use a measure of pattern informativeness to dynamically select patterns during training. Decision boundary visualization. Goh (1993) uses an approximation to the derivatives of the neural network output function with respect to inputs to graphically visualize decision boundaries. Engelbrecht (1998) shows how exact derivative calculations can be used to locate and visualize decision boundaries. Victor (1998) uses the decision boundary algorithm to improve the accuracy of rules extracted from trained neural networks in a cooperative learning environment. Pruning. Sensitivity analysis has been applied extensively to neural network pruning. One technique is to compute the sensitivity of the objective function with respect to neural network parameters (Le Cun, 1990; Moody et al., 1995). Another method of sensitivity analysis pruning is to compute the sensitivity of the neural network output function to parameter perturbations (Zurada, 1994, 1997).

2.3

Fundamental Methods of Sensitivity Analysis

21

Learning derivatives. Basson and Engelbrecht (1999) developed a new learning algorithm for feedforward neural networks that also learns the first-order derivatives of the neural network output with respect to each input unit while learning the underlying function. The neural network consists of two parts, one representing the learned function, and the other representing the derivatives of the learned function. Concepts from sensitivity theory are used to create a training set for the training of the derivative part of the neural network using gradient descent.

2.3 Fundamental Methods of Sensitivity Analysis In system sensitivity theory, some sensitivity functions are introduced, such as the output sensitivity, the trajectory sensitivity and the performance-index sensitivity functions (Frank, 1978). All the methods of neural network sensitivity analysis can be divided into two categories, namely, geometrical approach and statistical approach. In 1962, Hoff used an n-dimensional hypersphere to model Adaline for sensitivity analysis (Hoff, 1962), which was further simplified in Glanz (1965). After two decades, Winter (1989) was the first one to derive an analytical expression for the probability of error in Madaline caused by weight perturbations. Stevenson continued Winter’s work and established the sensitivity of Madaline to weight error (Stevenson, 1990; Stevenson, Winter and Widrow, 1990). A milestone work was done by Piché, who used a statistical approach to relate the output error to the change of weights for an ensemble of Madalines, with several activation functions such as linear, sigmoid, and threshold (Piché, 1992; Piché, 1995). Figure 1.6 can be treated as a general neural network model. A neural network can have L layers, and each layer l (0 ≤ l ≤ L) has nl (nl ≥ 1) neurons. n0 stands for the input layer and nL for the output layer. Since the number of neurons in layer l −1 is equal to the output dimension of that layer, which is also equal to the input dimension of layer l, the input dimension of layer l is nl−1 . For a neuron i (1 ≤ T i ≤ nl ) in layer l, its input vector, weight vector and output are X l = x1l · · · xnl l−1 , T Wil = wli1 · · · wlinl−1 and yli = f (X l · W l ) respectively, where f(·) is an activation function. For each layer l, its input vector is Xl , its weight set is W l = {W1l · · · Wnl l }, T and its output vector is Y l = yl1 · · · ylnl . For the network, its input is the vector X1 T or Y0 , its weight is W, and its output is YL . Let X l = x1l · · · xnl l−1 and Y l = T yl1 · · · ylnl be the corresponding deviations for input and output, respectively.

2.3.1 Geometrical Approach The use of n-dimensional geometry has proven to be a valuable tool for understanding and analyzing the Adaline. The geometrical interpretation of the equation

22

2

Principles of Sensitivity Analysis

dictating the Adaline’s input-output map is the bias for most of the analysis presented in Stevenson (1990). The input vector X in a neural network can be treated as a vector from the origin to the point (x1 , x2 . . . xn ) in n-space. The point (x1 , x2 . . . xn ) will be referred to “the tip of X” in the following discussion. In n-space, points at a distance r from the point c form a hypersphere of radius r centered at c. The surface area of such a hypersphere An (r) is n √ An (r) = 2 π n · rn−1 2

(2.5)

where (•) is the Gamma function. As shown in Fig. 2.3, the connection weights, which are a vector at angle θ to the input vector X = (x1 , x2 . . . xn ) in n-space, satisfy

X·W=

n

xi wi = c

(2.6)

i=1

which is called a hyperplane for some scalar c. This hyperplane is perpendicular to the vector W and is at a distance c/|W| from the origin. Particularly, HPw is one of the hyperplanes passing through the origin with c = 0. Similarly to HPw , HPX denotes a hyperplane passing through the origin and perpendicular to the vector X. Hence, both HPw and HPX divide the hypersphere in two hemi hyperspheres, i.e., Hw + versus Hw − , and HX + versus HX − , respectively. There are four lunes intersected by + − + − these four hemi hyperspheres, HX+ ∩ HW , HX+ ∩ HW , HX− ∩ HW and HX− ∩ HW , as shown in Fig. 2.3. − + and HX− ∩HW , As the angle between X and W is θ , both the intersections, HX+ ∩HW + + − − describe lunes of angle θ whereas the intersections HX ∩ HW and HX ∩ HW both describe lunes of angle (π − θ ). The ratio of the surface content of a lune of angle θ to the surface content of the entire hypersphere is θ/2π . Assuming binary-valued inputs, there are 2n possible input patterns for an Adaline with n variable inputs. Each input pattern corresponds to a point in n√ space which lies on a hypersphere of radius n centered at the origin. The Hoff

W

X

+ + HX∩HW

θ HPX HPW + − HX∩HW

Fig. 2.3 Hypersphere approximation for sensitivity analysis

+

−

HX∩HX

π−θ

+

+

HX∩HX

2.3

Fundamental Methods of Sensitivity Analysis

23

hypersphere-area approximation states that as n gets large, the points corresponding to the n-dimensional input patterns are approximately uniformly distributed over the surface of a hypersphere in n-space. Consequently, the percentage of input patterns which correspond to points on a selected region of the hypersphere can be approximated as the ratio of the surface content of the selected region to the surface content of the entire hypersphere (Hoff, 1962).

2.3.2 Statistical Approach Let us consider the behavior of a multi-input/single-output mapping first. Let us assume that the connection weight vector changes from W∗ to W = W ∗ + W, where W indicates weight perturbations. Under the assumption of statistical weight perturbations, the statistical sensitivity can be defined as follows (Choi and Choi, 1992): ∗

S (W ) = lim p

σ →0

var[Xp (L)] σ

(2.7)

where Xp (L) is the output error, σ is the standard deviation of each component of deviation of each component of W, and var[...] is the variance. The output error vector Xp (L) in the lth layer arising from weight perturbations W is given by Xp (l) = Xp (l) − Xp∗ (l) ∼ =

l

Ck,l W ∗ , Xp∗ W(k)Xp∗ (k − 1)

(2.8)

k=1

where the Nl × Nk matrix Ck,l W ∗ , Xp∗ is defined as l ∗ ∗ ∇Tn Wn∗ ∇Tk = ∇Tl Wl∗ ∇Tl−1 Wl−1 . . . ∇Tk+1 Wk+1 (2.9) Ck,l W ∗ , Xp∗ = n=k+1

Here, Tk refers to the nonlinear transformation associated with the kth layer, and ∇Tk refers to its derivation with respect to the output of the kth layer. As in the case of the weight perturbation discussed above, the sensitivity to input perturbation can be also calculated by Eq. (2.7) with σ being the standard deviation of each component of the input perturbation Xp (0). In this respect, the output error Xp (L) in output layer L caused by input perturbation is given by: Xp (L) = CL,0 W ∗ , Xp∗ Xp (0)

(2.10)

To measure sensitivity for a multi-output neural network, typically, the sensitivity of each output neuron is calculated first, and then the final sensitivity is set as the maximum or the average value among all output neurons.

24

2

Principles of Sensitivity Analysis

Piché (1995) introduced a stochastic model for the statistical analysis of sensitivity. He assumed that (1) over the ensemble of networks, the weights of a particular layer all have the same variance, (2) the weights in the networks are statistically independent and (3) the mean value of each weight of the network over the ensemble is 0. Under these assumptions, it is possible to model the ensemble of weights associated with each node of the network as a vector of random variables. He discussed the selection of weight accuracies for Madaline using a statistical approach to sensitivity analysis. According to his stochastic model, all neurons have the same activation function. All network inputs, weights, input perturbations, and weight perturbations are random variables. The sensitivity of Madaline in this model is defined as the NSR of the output layer; the output NSR of a layer of threshold Adaline is: 2 2 σy σ2 4 σx + w (2.11) NSR = 2 = 2 σy π σx σw2 2 , σ 2 and σ 2 refer to the variances of output y, input x, where σy2 , σx2 , σw2 , σy x w weight w, output error y, input perturbation x and weight perturbation w, respectively.

2.4 Summary Sensitivity is initially investigated for the realization of a neural network to calculate its output perturbation caused by machine imprecision and noisy input. It is particularly useful and valuable to apply sensitivity analysis to the software design of networks, in which artificial perturbation is embedded in the training. In this paper, the principle of sensitivity analysis is discussed in detail, followed by systematic introduction to the advanced research in this area over the last two decades. All the existing techniques can be categorized into geometrical and statistical approaches. The former use hypersphere or hyper-rectangle model in n-dimensional space to analyze the deviation of the output vector caused by the perturbation of the input and weights. The latter calculate the sensitivity by noise-to-signal ratio or expectation of the output deviation.

Chapter 3

Hyper-Rectangle Model

In this chapter, we discuss a hyper rectangle model, instead of the traditional hypersphere, which is employed as the mathematical model to represent an MLP’s input space. The hyper-rectangle approach does not demand that the input deviation be very small as the derivative approach requires, and the mathematical expectation used in the hyper-rectangle model reflects the network’s output deviation more directly and exactly than the variance does. Moreover, this approach is applicable to the MLP that deals with infinite input patterns, which is an advantage of the MLP over other discrete feedforward networks like Madalines.

3.1 Hyper-Rectangle Model for Input Space of MLP Neurons in an MLP, like cells, are organized into layers by linking them together. Links exist only between neurons of two adjacent layers; there is no link between neurons in the same layer or in any two nonadjacent layers. All neurons in a layer are fully linked to all the neurons in the immediately preceding layer and to all the neurons in the immediately succeeding layer. At each layer except the input layer, the inputs of each neuron are the outputs of the neurons in the previous layer, X l = Y.l−1 (l ≥ 1). Through training, a hyperplane P is obtained for classification or regression

P:

n

xj wj + σ = 0 in input space

(3.1)

j=1

The computation of the sensitivity is essentially reduced to the calculation of the following integral (Zeng and Yeung, 2003): I (A,B,W,σ ) =

b1

a1

b2

a2

···

bn an

ϕ(X)

1 + exp −

n

dx1 dx2 · · · dxn . (3.2) xj wj − σ

j=1

D.S. Yeung et al., Sensitivity Analysis for Neural Networks, Natural Computing Series, C Springer-Verlag Berlin Heidelberg 2010 DOI 10.1007/978-3-642-02532-7_3,

25

26

3 Hyper-Rectangle Model P

a

c

b P

Q

P

Q

Q

Fig. 3.1 Spatial relationship between P and . (a) P does not cut . (b) P cuts the four parallel sides of . (c) P cuts a prism from

where A = {a1 , a2 . . . an } and B = {b1 , b2 . . . bn } are sets of lower and upper bounds of the integral, and W = {w1 , w2 . . . wn } is the set of the weights, with bj > aj and wj = 0 for 1 ≤ j ≤ n. It can be clearly seen that the input space can be regarded as a hyper-rectangle bounded by A and B in n-dimensional space From a geometric point of view, the hyperplane P can be regarded as a dividing plane. It divides the n-dimensional space into three parts: the points on P, the points below P, and the points above P. The possible spatial relationship between P and is illustrated in Fig. 3.1.

3.2 Sensitivity Measure of MLP Based on the above hype-rectangle model, the sensitivity measure of Eq. (3.2) can be rewritten as: ⎧ ⎪ ⎪ ⎪ ⎪ ··· ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ · · · I≈ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ · · · ⎪ ⎪ ⎩

⎛ ⎞⎞ ⎛ p n n

⎝ (− 1)k exp ⎝−k xj wj − kσ ⎠⎠d if xj wj + σ > 0 k=0

ip1 (X)d +

···

j=1

j=1

ip2 (X)d

if

n

xj wj + σ = 0

j=1

⎛ ⎛ ⎞⎞ p n n

⎝ (− 1)k exp ⎝−k xj wj + kσ ⎠⎠d if xj wj + σ < 0 k=0

j=1

j=1

(3.3)

where ip (X) denotes the first p-term Taylor expansion of the integrand. A solution to this problem is to locate the intersection points of P and those parallel lines that are the extensions of the edges of . Each line must have one and only one intersection point with P under the assumption wj = 0 for 1 ≤ j ≤n. By comparing the coordinates on a given axis of those intersection points with the lower bound and the upper bound of on that axis, it is easy to identify the intersection situations between P and . It is known that has 2n−1 edges parallel to a given coordinate axis. Take the xn -axis as an example. Each edge parallel to the xn -axis can be determined by assigning a total of n−1 coordinates (x1 . . . xn−1 ) with either xj = aj or xj = bj . Thus, the xn -coordinate for a given line’s intersection point can be calculated by

3.3

Discussion

27

⎛ ⎞ " n−1

xn = ⎝ xj wj + σ ⎠ wn

(3.4)

j=1

with its n−1 fixed coordinates, which consequently yield a set En = {en,1 , en,2 . . . en,2 n−1 } representing 2n−1 edges.

3.3 Discussion The sensitivity measure is something like an analog rule that network users can use as a tool to evaluate their network’s performance. For example, Lee and Oh (1994) indicated that in pattern classification applications, their sensitivity (misclassification probability) results could be applied to select the optimal weight set among many weight sets acquired through a number of learning trials. Among these weight sets, an MLP with the optimal set will have the best generalization capability and the best input noise immunity. Zurada, Malinowski, and Usui (1997) made use of analytical sensitivity to delete redundant inputs to an MLP for pruning its architecture. Engelbrecht and Cloete (1999) also used analytical sensitivity to dynamically select training patterns. In this chapter, sensitivity is analyzed through the hyper-rectangle model, which is especially suitable for the most popular and general feedforward network: multilayer Perceptron. The sensitivity measure is defined as the mathematical expectation of output deviation due to expected input deviation with respect to overall input patterns in a continuous interval. Based on the structural characteristics of the MLP, a bottom-up approach is adopted. A single neuron is considered first, and algorithms with approximately derived analytical expressions that are functions of expected input deviation are given for the computation of its sensitivity. Then another algorithm is given to compute the sensitivity of the entire MLP network.

Chapter 4

Sensitivity Analysis with Parameterized Activation Function

Among all the traditional methods introduced in Chap. 2, none has involved activation function in the calculation of sensitivity analysis. This chapter attempts to generalize Piché’s method by parameterizing antisymmetric squashing activation functions, through which a universal expression of MLP’s sensitivity will be derived without any restriction on input or output perturbations.

4.1 Parameterized Antisymmetric Squashing Function Any antisymmetric squashing activation function g(x) can be specified by three parameters, and the main idea of the function approximation approach is to use another function gA,B,C (x) to approximate g(x). The function has the following form: − gA,B,C (x) = g+ A,B,C (x) + gA,B,C (x)

(4.1)

where #

B − C −Ax2 , if x ≥ 0 e 2 0, if x < 0 # 0, if x ≥ 0 g− A,B,C (x) = C + B − C e−Ax2 , if x < 0 2 g+ A,B,C (x)

=

B−

(4.2)

(4.3)

The objective of function approximation is to determine the three factors A, B and C of g(x), so that the distance between the two functions gA,B,C (x) and g(x) is minimized. Here the distance is defined in terms of the L2 norm. The factors A, B and C are determined by $% ⎧ 2 +∞ ⎪ ⎪ g(x) − gA,B,C (x) dx ⎨ min 0 A

B = lim g(x) ⎪ x→+∞ ⎪ ⎩ C = −B D.S. Yeung et al., Sensitivity Analysis for Neural Networks, Natural Computing Series, C Springer-Verlag Berlin Heidelberg 2010 DOI 10.1007/978-3-642-02532-7_4,

(4.4)

29

30

4

Sensitivity Analysis with Parameterized Activation Function

1

A = +inf B=1 C = −1

0.8 0.6 0.4 g(x)

0.2

A = 0.446 B=1 C = −1

0 −0.2

A = 1.783 B=1 C = −1

−0.4 −0.6 −0.8 −1 −4

−3

−2

−1

0 x

1

2

3

4

Fig. 4.1 Approximation results of some antisymmetric squashing activation functions

Because both gA,B,C (x) and g(x) are antisymmetric about the point (0, 0), the factor A can be found by solving the minimization problem over (0, +∞) instead of (–∞, +∞,). Some commonly used antisymmetric squashing functions are g(x) = x ex −e−x or g(x) = eex −1 +1 . The approximated results are shown in Fig. 4.1. ex +e−x From Eqs. (4.1), (4.4) and Fig. 4.1, the following can be observed. (1) When a parameter changes from 0 to +∞, the function gA,B,C (x) becomes the threshold activation function, and we need not solve the threshold activation function and the sigmoidal function separately. (2) This function approximation approach still cannot avoid inaccuracy, but it helps us obtain an analytical expression of an MLP’s sensitivity for a class of activation functions. Then the effect of the activation function on the sensitivity of an MLP can be investigated. The function approximation approach is a tradeoff between accuracy and generalization.

4.2 Sensitivity Measure According to Piché (1995), the sensitivity of the jth neuron in the lth layer is calculated by the NSR of the output D(ylj ) l SR (yj ) = D(ylj ) where

(4.5)

4.3

Summary

31

D(ylj ) = D(ylj )

+∞ %

g2(αu) ·

−∞ +∞ % +∞ %

=

−∞ −∞

2 √1 e−(u /2) du − E 2 (yl ) j 2π

(g(αu + αβv) − g(αu))

2

(4.6)

2 2 · e−(u +v /2) dudv

To show the effects of this method, let us consider the sensitivity of a threshold function, i.e., A = +∞, B = 1 and C = –1. Eq. (4.5) can be calculated by (Yeung and Sun, 2002): 4 2 2 2 σw σx σw σ2 l + + · SR (yj ) = arctg x π σx2 σw2 σx2 σw2 σ2

(4.7) σ2

Comparing Eq. (4.5) with Eq. (4.7), we can see that, when σx2 and σw 2 are very x w small, they are nearly the same. But when they are large, the results are different. For the threshold activation function, the output error variance cannot exceed 2. The result derived from parameterized antisymmetric function satisfies this, but Piché’s fails. So the parameterized antisymmetric squashing function is more reasonable, and it is valid when the perturbation is either small or large.

4.3 Summary For each antisymmetric squashing activation function, there exist three factors A, B and C, so that gA,B,C (x) can approximate in the sense of the L2 norm. So we can use a vector (A, B, C) to represent an antisymmetric squashing activation function. That is to say that we can treat the antisymmetric squashing activation function as a parameter involved in the sensitivity of the MLP.

Chapter 5

Localized Generalization Error Model

The generalization error bounds found by current error models using the number of effective parameters of a classifier and the number of training samples are usually very loose. These bounds are intended for the entire input space. However, support vector machines, radial basis function neural networks and multilayer Perceptron neural networks are local learning machines for solving problems and thus treat unseen samples near the training samples as more important. In this chapter, we describe a localized generalization error model which bounds the generalization error from above within a neighborhood of the training samples using a stochastic sensitivity measure (Yeung et al., 2007 and Ng et al., 2007). This model is then used to develop an architecture selection technique for a classifier with maximal coverage of unseen samples by specifying a generalization error threshold.

5.1 Introduction For a pattern classification problem, one builds a classifier fθ to approximate or mimic the unknown input-output mapping function F, where θ is the set of parameters selected from a domain . For the model presented here, the mean-square-error (MSE) is used to measure the difference between fθ and F. The MSE is widely applied to training real-value output classifiers like neural networks, which classify a given sample by thresholding the real-value classifier output. The behavior of a classifier trained by minimizing an MSE, compared to the one trained by minimizing a classification error (0-1 loss function), is different. When two classifiers both yield the same, but very small percentage of training classification error, the one which yields a larger training MSE would produce indecisive outputs that deviate more from the target outputs. So a small change in the inputs may change the classification results (Anthony and Bartlett, 1999). This is not desirable and it indicates that the training classification error is not a good benchmark for the generalization capability of a classifier. Therefore, selecting a classifier using the training classification error or its bound may not be appropriate. The classification error for the entire input space is defined as (5.1) Rtrue = ( fθ (x) − F (x))2 p (x)dx T

D.S. Yeung et al., Sensitivity Analysis for Neural Networks, Natural Computing Series, C Springer-Verlag Berlin Heidelberg 2010 DOI 10.1007/978-3-642-02532-7_5,

33

34

5

Localized Generalization Error Model

where x denotes the input vector of a sample in the entire input space T, and p (x) denotes the true unknown probability density function of the input x. Given a training dataset D containing N training input-output pairs D = {(xb ,F (xb ))}N b=1 , a classifier fθ could be constructed by minimizing the empirical risk (Remp ) over D, where Remp =

N 1

(fθ (xb ) − F (xb ))2 N

(5.2)

b=1

The ultimate goal of solving a pattern classification problem is to find the fθ that is able to correctly classify future unseen samples (Haykin, 1999; Vapnik, 1998). The generalization error (Rgen ) is defined as: Rgen = T\D

(fθ (x) − F (x))2 p (x)dx

(5.3)

Since both target outputs and distributions of the unseen samples are unknown, it is impossible to compute the Rgen directly. There are two major approaches to estimate the Rgen , namely, an analytical model and cross-validation (CV). In general, analytical models cannot distinguish trained classifiers having the same number of effective parameters, but with different values of parameters. Thus they yield loose error bounds. The Akaike information criterion (AIC) (Akaike, 1974) only makes use of the number of effective parameters and the number of training samples. The Network Information Criterion (NIC) (Park et al., 2004) is an extension of the AIC for application in regularized neural networks. It defines classifier complexity by the trace of the second-order local derivatives of the network outputs with respect to connection weights. Unfortunately, current analytical models are only good for linear classifiers due to the singularity problem in their models. A major problem with these models is the difficulty of estimating the number of effective parameters of the classifier. This could be solved by using the VapnikChervonenkis (VC) dimensions (Vapnik, 1998). However, only loose bounds of VC dimensions could be found for nonlinear classifiers, thus severely limiting the applicability of analytical models to nonlinear classifiers, except for the support vector machine (SVM). Although CV uses true target outputs for unseen samples, it is time consuming for large datasets and, for k-fold CV and L choices of classifier parameters, kL classifiers must be trained. CV methods estimate the expected generalization error instead of its bound. Thus they cannot guarantee that the classifier finally constructed will have good generalization capability. Many classifiers, e.g., SVM, multilayer Perceptrons (MLPNN) and radial basis function neural networks (RBFNNs) are local learning machines. An RBFNN, by its nature, learns classification locally and every hidden neuron captures local information of a particular region in the input space defined by the center and width of its Gaussian activation function (Moody and Darken, 1989). A training sample

5.2

The Localized Generalization Error Model

35

located far away from a hidden neuron’s center does not affect the learning of this hidden neuron. An MLPNN learns decision boundaries for the classification problem in input space using the location of the training samples. However, as pointed out by Chakraborty and Pal (2001), the MLPNN output responses to unseen samples far away from the training samples are likely to be unreliable. So they proposed a learning algorithm which deactivates any MLPNN response to unseen samples much different from the training samples. This observation is further supported by the fact that, in many interesting industrial applications, such as aircraft detection in SAR images, character and fingerprint recognition (Jain and Venuri, 1999), and so on, the most significant unseen samples are expected to be similar to the training samples. On the other hand, an RBFNN is one of the most widely applied neural networks for pattern classification, with its performance primarily determined by its architecture selection. Haykin (1999) summarizes several training algorithms for an RBFNN. For instance, a two-stage learning algorithm may be a quick way to train an RBFNN. The first, unsupervised stage is to select the center positions and widths for the RBF using self-organizing map or k-means clustering. The second, supervised stage computes the connection weights using the least mean square method or pseudoinverse technique. Some have proposed that all training samples be selected as centers (Haykin, 1999). Mao (2002) based the selection of centers on the separability of the datasets. The experimental results of Mao (2002) indicate that the choice of the number of hidden neurons indeed affects the generalization capability of the RBFNNs and that an increase in the number of hidden neurons does not necessarily lead to a decrease in testing error. The selection of the number of hidden neurons affects the selection of an RBFNN architecture and ad hoc choices or sequential search are frequently used. In this chapter we describe a localized generalization error model RSM using the stochastic sensitivity measure (ST-SM), which bounds the generalization error from above for unseen samples within a predefined neighborhood of the training samples (Yeung et al., 2007 and Ng et al., 2007). In addition, an architecture selection method based on the RSM is proposed to find the maximal coverage classifier with its RSM bounded by a preselected threshold. An RBFNN will be used to demonstrate the use of the RSM and the architecture selection method. We introduce the localized generalization error model and its corresponding architecture selection method in Sects. 5.2 and 5.3 respectively.

5.2 The Localized Generalization Error Model Two major concepts of the RSM , the Q-neighborhood and the stochastic sensitivity measure, are introduced in Sects. 5.2.1 and 5.2.3 respectively. The derivation of the localized generalization error model is given in Sect. 5.2.2 and its characteristics are discussed in Sect. 5.2.4. Section 5.2.5 discusses the method to compare two classifiers using the localized generalization error model.

36

5

Localized Generalization Error Model

5.2.1 The Q-Neighborhood and Q-Union For every sample xb ∈ D, one finds a set of samples x which fulfills 0 < |xi | < Q ∀i = 1 · · · n, where n denotes the number of input features, x = (x1 · · · xn ) = x − xb and Q is a given real number. In a pattern classification problem, one usually does not have any knowledge about the distribution of the true input space. Therefore, without any prior knowledge, every unseen sample has the same chance to appear. So, x may be considered as input perturbations that are random variables having zero-mean uniform distributions. TQ (xb ) = {x | x = xb + x; |xi | ≤ Q

∀i = 1, · · · , n }

(5.4)

Then TQ (xb ) defines a Q-neighborhood of the training sample xb Let TQ be the union of all TQ (xb ) and call it the Q-Union. All samples in TQ (xb ), except xb , are considered as unseen samples (i.e., TQ (xb ) contains no training point other than xb ). For 0 ≤ Q1 ≤ · · · ≤ Qk ≤ ∞, the following relationship holds: D ⊆ TQ1 ⊆ · · · ⊆ TQk ⊆ T

(5.5)

One should note that the shape of the Q-neighborhood is chosen to be a hypercube for ease of computation, but it could also be a hypersphere or other shape. Moreover, in the localized generalization error model, the unseen samples could be selected from a distribution other than a uniform one. Only the derivation of the ST-SM needs to be modified and the rest of the model will remain the same.

5.2.2 The Localized Generalization Error Bound Instead of finding a bound for the generalization error for unseen samples in the entire input space T (Rtrue ), we find a bound on RSM , which is the error for unseen samples within TQ only, i.e., the shaded area in Fig. 5.1. We ignore the generalization error for unseen samples that are located far away from training samples (Rres in Eq. (5.6)). Note that Rres decreases when Q increases.

Fig. 5.1 An Illustration of Q-Union (TQ ) with 20 training samples. The Xs are training samples and any point in the shaded area is an unseen sample

5.2

The Localized Generalization Error Model

37

RSM (Q) = Rtrue − Rres (Q) = TQ

(fθ (x) − F (x))2 p (x) dx

(5.6)

errθ (xb )=fθ (xb )−F (xb ) ; then Let y = fθ (x)−fθ (xb ) , N % N

2 1 (errθ (xb ))2 , ETQ (y)2 = N1 Remp = (1/N) TQ (xb ) (y) (2Q)n dx, b=1 b=1 √ ε = B ln η/ (−2 N); and let A, B and N be the difference between the maximum and minimum values of the target outputs, the maximum possible value of the MSE and the number of training samples, respectively. In this work, we assume that the range of the desired output, i.e., the range of F, is either known or assigned to a preselected value. Moreover, B is computable because the range of the network outputs will be known after the classifier is trained. In general, one expects that the error of unseen samples will be larger than the training error, so we assume that the average of errors of unseen samples in TQ (xb ) will be larger than the training error of xb . By the Hoeffding inequality (Hoeffding, 1963), the average of the square errors of samples with the same population mean converges to the true mean with the following rate of convergence. With a probability of(1 − η), we have (Yeung et al., 2007 and Ng et al., 2007):

RSM (Q) ≤

N 1

1 dx + ε (fθ (x) − F (x))2 N (2Q)n TQ (xb ) b=1

N 1

1 = dx+ε (fθ (x)−fθ (xb ) + fθ (xb ) − F (xb ) + F (xb ) − F (x))2 N (2Q)n TQ (xb ) b=1

N 1 1

dx ≤ (y)2 N (2Q)n TQ (xb ) b=1

+

N N 1 1 1

1

2 dx+ )−F(x)) dx+ε (errθ (xb))2 (F(x b N N (2Q)n (2Q)n TQ (xb ) TQ (xb ) b=1

b=1

N N 1

1 1 1 dx dx (y)2 (errθ (xb ))2 + 2 N N (2Q)n (2Q)n TQ (xb ) TQ (xb ) b=1

b=1

N N 1

1

1 1

2 2 +2 dx dx (errθ (xb )) (F (xb )−F (x)) N N (2Q)n (2Q)n TQ (xb ) TQ (xb ) b=1

b=1

38

5

Localized Generalization Error Model

N N 1

1 1 1 2 2 +2 dx dx (y) (F (xb )−F (x)) N N (2Q)n (2Q)n TQ (xb ) TQ (xb ) b=1

≤

&

Remp +

b=1

$

'2

ETQ (y)2 + A + ε

= R∗SM (Q)

(5.7)

Both A and ε are constants for a given training dataset when an upper bound of the classifier output values is preselected. The R∗SM is an upper bound for the MSE of the trained classifier for unseen samples within the Q-union. This bound is better than those regression error bounds (based on AIC and VC-dimension) that are defined by using only the number of effective parameters and training samples, while ignoring statistical characteristics of a training dataset such as mean and variance. Moreover, those error bounds usually grow quickly with the increase of the number of effective parameters, e.g., number of hidden neurons in an RBFNN and VC-Dimension, while the R∗SM grows much slower. The term ETQ (y)2 will be discussed in Sect. 5.2.3. Further discussion on the characteristics of the R∗SM will be given in Sect. 5.2.4.

5.2.3 Stochastic Sensitivity Measure for RBFNN The output perturbation (y) measures the network output difference between the training sample (xb ∈ D) and the unseen sample in its Q-neighborhood ((xb + x) ∈ TQ (xb )). Thus, the ST-SM measures the expectation of the squares of network output perturbations (y) between training samples in D and unseen samples in TQ . The Sensitivity Measure (SM) of a neural network (Ng and Yeung, 2002; Ng et al., 2007) gives quantified data on the change of network outputs with respect to change of network inputs. Intuitively, it measures how sensitive the network output is to the input change. Ng et al. (2007) allowed every input or weight to have its own mean and variance, and the input and weight perturbations are allowed to be arbitrary. Hence the perturbed samples (x in Sect. 5.2.1) can be considered as unseen samples around the training samples (xb ). Ng and Yeung (2002) developed an analytical formula of the ST-SM for a Gaussian activation function RBFNN that is independent of the number of training samples. We assume the inputs are independent and not identically distributed and weight perturbations are not considered in this chapter. So, every input feature has its own expectation μxi and variance σx2i . The input perturbation of the ith input feature is a random variable having a uniform 2 . The centers and widths of the hidden distribution with mean 0 and variance σx i neurons are constant and the connection weights are fixed beforehand. An RBFNN could be described as

5.2

The Localized Generalization Error Model

fθ (x) =

M

wj exp

39

( ( (x − uj (2

j=1

−2v2j

=

M

wj φj (x)

(5.8)

j=1

where M, u and vj denote the number of hidden neurons, the center and width of the jth RBFNN hidden neuron respectively, and wj denotes the connection weight between jth hidden) neuron its corresponding output neuron. Let and the( ( (2 (2 ) 2 ( (2

2 2v4j − E (x − uj ( vj , E (x − uj ( = ϕj = wj exp Var (x − uj ( & n ' n

2 2 4 2 2 σxi + μxi − uji , υj = ϕj σxi + μxi − uji /vj , ζj = ϕj /v4j , i=1 i=1 ⎛ * 4 + 2 2

2 ⎞ ( n xi − μxi − σxi + 4σx2i μxi − uji ED (2 * ⎠, ⎝ Var (x − uj ( = 3 + xi − μxi μxi − uji +4ED i=1 p (x) denotes the probability density function of the input perturbations and p (x) = 1/ (2Q)n , n is the number of input features and uji denotes the ith input fea

ture of the jth center of the hidden RBF neuron (uj = uj1 , · · · ,ujn ). For uniformly 2 = (2Q)2/12 = Q2/3 . Theoretically, distributed input perturbations, we have σx i we do not restrict the distribution of the input perturbations as long as the variance 2 ) is finite. However, uniform distribution is assumed of the input perturbation (σx i here because without any prior knowledge on the distribution of unseen samples around the training samples, we assume that all of them have an equal chance of occurrence. By the law of large numbers, when the number of input features is not too small, φj (x) would have a lognormal distribution. So, the RBFNN ST-SM is given by: N 1

2 ETQ (y) = (fθ (xb + x) − fθ (xb ))2 p (x) dx N TQ (xb ) b=1

⎞⎤ ⎛ ) ) n n

2 4 − 2 σ2 2 2 + μ −u 2 +0.2σ 2 σ 2v σ − ⎟⎥ 2v exp 4 ⎢ ⎜ xi ji M ⎢ ⎜ xi j j xi xi xi

⎟⎥ ⎟⎥ ⎢ϕj ⎜ i=1 i=1 = ⎟⎥ ⎢ ⎜ ) ) n n

2 ⎠⎦ ⎣ ⎝ 2 2 2 j=1 2 exp 2v2j + 1 σx2i + μxi − uji + 0.2σx 2v4j − σx σx ⎡

i=1

≈

M

j=1

ϕj

i

i

n

2 σx i

σx2i

i=1

i

) 2

2 4 vj + μxi − uji + 0.2σxi

i=1

=

0.2 4

1 2

Q Q n υj + ζj 3 9 M

M

j=1

j=1

(5.9)

40

5

Localized Generalization Error Model

5.2.4 Characteristics of the Error Bound From Eq. (5.7), one may notice that the R∗SM consists of three major components:

training error (Remp ), ST-SM (ETQ (y)2 ) and the constants. The constants A and ε are preselected when the confidence of the bound (1 − η) and the training dataset is fixed. Moreover, the constant B in ε could be preselected when the classifier type is selected by fixing the maximum classifier output bound. ε is generally very small for large N. So, they will not affect the result of comparisons of the generalization capability between classifiers. In contrast, if the classifier could not generalize the training samples, one may not expect the classifier to have good generalization capability for future unseen samples. Thus the training error is one of the key components of the R∗SM . Furthermore, the ST-SM term measures the output fluctuations of the classifier. A classifier having high output fluctuations yields a high ST-SM because its output varies dramatically when the input value changes. So, due to the classifier Bias/Variance Dilemma, a classifier yielding a good generalization capability should minimize both terms or achieve a good balance between the two (Geman and Bienenstock, 1992). An interesting question is, can R∗SM be an effective mechanism for studying a classifier’s Bias/Variance Dilemma? Limiting cases of R∗SM (Q). Obviously, when Q → ∞, TQ → T and Q → 0, TQ →0. For0 ≤ Q1 ≤ · · · ≤ Qk ≤ (Q → ∞), the relationship D ⊆ TQ1 ⊆ · · · ⊆ TQk ⊆ T holds. We further extend this relationship to:

Remp the limiting case of R∗SM with Q → 0 ≤ R∗SM (Q1 )

≤ · · · ≤ R∗SM (Qk ) ≤ Rtrue R∗SM with Q → ∞

(5.10)

This relationship shows that the limiting case of R∗SM (Q) with Q → ∞ bounds from above the Rtrue . For Q → ∞, Rtrue = T

(fθ (x) − F (x))2 p (x) dx

= TQ

(fθ (x) − F (x))2 p (x) dx

= RSM (Q) ≤ R∗SM (Q)

(5.11)

Moreover, for the limiting case of Q → 0 , R∗SM (Q) bounds Remp from above. When

Q → 0, ETQ (y)2 vanishes, and we have Remp ≤

2 Remp + A + ε ≤ R∗SM (Q)

(5.12)

5.2

The Localized Generalization Error Model

41

R∗SM for other classifiers. The R∗SM as well as the RSM , could be defined for any classifier trained with MSE. Examples include feedforward neural networks like multilayer Perceptron neural networks, support vector machines and recurrent neural networks such as Hopfield networks. The R∗SM for other types of classifier could be defined by rederiving the ST-SM term for the particular type of classifier concerned. Independence of training method. The R∗SM is determined without regard of the training methods being used. Only the parameters of the finally trained classifier are used in the model. Hence the R∗SM model could also be used to compare different training methods in terms of the generalization capability of the classifiers being built. Time complexity. The ST-SM has a time complexity of O(Mn). The computational complexities of both the ST-SM and the R∗SM are low and they do not depend on the number of training samples (N). However, similarly to all other architecture selection methods, R∗SM requires a trained classifier and the time for architecture selections is dominated by the classifier training time. Therefore, the proposed architecture selection method may not have a large advantage in terms of speed in comparison to other architecture selection methods, except the two cross-validation-based methods. Limitations of the localized generalization error model. The major limitation of the localized generalization error model is that it requires a function (i.e., fθ ) for the derivation of the ST-SM. Classifiers such as rule-based systems and decision trees may not be able to make use of this concept. It will be a challenging task to find ways to determine the R∗SM for these classifiers. Another limitation of the present localized generalization error model is due to the assumption that unseen samples are uniformly distributed. This assumption is reasonable when there is no a priori knowledge of the true distribution of the input space and hence every sample may have the same probability of occurrence. One would need to re-derive a new R∗SM when a different distribution of the input space is assumed. We also remark that even if the distribution of the unseen samples is known, their true target outputs are still unknown and hence it will be difficult to judge how good the bound is. On the other hand, if both input and Q-Union distributions are the same, the localized generalization error model is expected to be a good estimate of the generalization error for the unseen samples, but this needs further investigation. R∗SM and regularization. From Eq. (5.9), one may notice that the connection weight magnitude (w2j in the ϕj ) is directly proportional to the ST-SM and thus the R∗SM . This provides a theoretical justification that, by controlling the magnitude of the connection weights between hidden and output layers, one could reduce the generalization error which is essential to the regularization of neural network learning (Haykin, 1999). Predicting unseen samples outside the Q-Union. In practice, some unseen samples may be located outside the Q-Union. This may be due to the fact that the Q value is too small and thus the Q-Union covers only very few unseen samples. However, expanding the Q value will lead to a larger R∗SM , because more dissimilar unseen

42

5

Localized Generalization Error Model

samples are included in the Q-Union, and a classifier with a very large R∗SM upper bound may not be meaningful. Furthermore, the R∗SM bounds from above the MSE of the unseen samples, and the MSE is an average of the errors. Thus, the R∗SM bound may still work well even if some of the unseen samples are located outside the Q-Union. When splitting the training and testing datasets, naturally some unseen testing samples may fall outside the Q-Union. However, our experiments show that the generalization capability of the RBFNNs selected by using the R∗SM is still the best in terms of testing accuracy when compared with other methods. On the other hand, if there is a large portion of unseen samples located outside the Q-Union, i.e., dissimilar to the training samples, one may consider revising the training dataset to include more such samples and retrain the classifier. As mentioned before, classifiers may not be expected to classify unseen samples that are totally different from the training set.

5.2.5 Comparing Two Classifiers Using the Error Bound One way to compare two classifiers is to fix the R∗SM (Q) value and compare the difference in their Q values. The other way is to fix the Q value and compare the R∗SM (Q) of the two classifiers. Assume there are two classifiers, f1 and f2 . There exists a Q1 for f1 yielding R∗SM (Q1 ) = a and a Q2 for f2 yielding R∗SM (Q2 ) = a. If Q1 < Q2 , then f2 has a better generalization capability because f2 covers more unseen samples, but still has the same generalization error upper bound. In other words, the architecture selection could be done by searching the functional space θ ∈ and seeking the fθ with the largest Q producing R∗SM (Q) = a. This is an important property of R∗SM (Q) and we will make use of it to find the optimal classifier with the largest coverage in the next section. On the other hand, one could compare the two classifiers, f1 and f2 , based on their R∗SM (Q) computed using the same Q value. The classifier with the lower R∗SM (Q) is expected to have a better generalization capability. All other comparison methods in which neither Q nor R∗SM (Q) is fixed may not be meaningful.

5.3 Architecture Selection Using the Error Bound The selection of the number of hidden neurons in the RBFNN is usually done by Sequential Learning (Huang et al., 2005) or by ad hoc choice. The Sequential Learning technique only makes use of the training error to determine the number of hidden neurons, without any reference to the generalization capability. Moreover, Huang et al. (2005) and Liang et al. (2006) assume the classifier does not have prior knowledge about the number of training samples while Kaminski and Strumillo (1997) and Gomm and Yu (2000) assume that it does. For ease of comparison with other architecture selection methods, we assume that the number of training samples

5.3

Architecture Selection Using the Error Bound

43

in our experiments is known to the classifier. In this section, we describe a new technique based on R∗SM to find the optimal number of hidden neurons that makes use of the generalization capability of the RBFNN. For any given threshold a on the generalization error bound (R∗SM ), the localized generalization error model allows us to find the best classifier by maximizing Q, assuming that the MSE of all samples within the Q-Union is smaller than a. One can formulate the architecture selection problem as a Maximal Coverage Classification problem with Selected Generalization error bound (MC2 SG), i.e., Q

max θ∈

subject to R∗SM (Q) ≤ a

(5.13)

In the RBFNN training algorithm presented in Sect. 5.3.2, once the number of hidden neurons is fixed, the center positions and widths could be estimated by any automatic clustering algorithm such as k-means clustering, a self-organizing map or hierarchical clustering. So we only need to concentrate on the problem of determining the number of hidden neurons. This means that θ = M and = {1, 2. . . N} because it is not reasonable to have the number of hidden neurons larger than the number of training samples. Eq. (5.13) represents a two-dimensional optimization problem. The first dimension is the number of hidden neurons (θ ) and the second dimension is the Q for a fixed θ . For every fixed θ and Q, we can determine an R∗SM (Q). The two parameters, θ and Q, are independent. Furthermore, by substituting Eq. (5.9) into Eq. (5.7), with probability (1 − η) we have the R∗SM for an RBFNN as follows, ⎞2 ⎛ M M

1 0.2 υj + ζj + Remp + A⎠ + ε Q4 n R∗SM (Q) ≈ ⎝ Q2 3 9 j=1

(5.14)

j=1

Let R∗SM (Q) = a. For every θ , let the Q that satisfies R∗SM (Q) = a be Q∗ , where a is a constant real number. R∗SM (Q) = a exists because the second-order derivative of Eq. (5.14) is positive. We could solve Eq. (5.14) as follows, Q4

M M

√ 2 0.2

ζj + Q2 υj − 3 a − ε − Remp − A = 0 n 3 j=1

(5.15)

j=1

Equation (5.15) could be solved by the quadratic equation and two solutions will be found for Q2 . For a ≥ ε (ε is usually a very small constant when the number of samples is large), there will be one positive and one negative real solution for Q2 because in Eq. (5.15), the coefficients for the terms Q4 and Q2 are positive, but the constant term is negative. Also, there will be two real and two imaginary solutions for Q. Let Q∗ be the only positive real solution among the four. Note that Q is defined to be the width of the Q-neighborhood and as such it must be a nonnegative real number.

44

5

∗

h M,Q

=

Localized Generalization Error Model

0 Remp ≥ a Q∗ else

(5.16)

So, for RBFNN architecture selection, Eq. (5.13) is equivalent to: max

M∈

h M,Q∗

(5.17)

5.3.1 Parameters for MC2 SG In Eq. (5.7), the differences between the maximum and minimum values of target outputs (A) and the number of training samples (N) are fixed for a given training dataset, and the maximum possible value of the MSE and the confidence level of the R∗SM bound, namely, B and η, could also be selected before any classifier training. In a K-class classification problem, one may select F (x) ∈ {(k1 ,k2 · · · kK )} where K ki = 1. ki = 1 if the sample belongs to the ith class, and one ki ∈ {0,1} and i=1

minimizes the MSE of all the RBFNN outputs simultaneously. All the F (x), fθ (x) and R∗SM are vectors and thus the sum of R∗SM values of all the K RBFNN outputs are minimized in the MC2 SG. One may notice that the minimization of the sum of the R∗SM values of all the outputs is equivalent to the minimization of the average of them. However, the average of the R∗SM values may provide a better interpretation and its range is not affected by the value K. The determination of the constant a is made according to the classifier’s output schemes for classification. For instance, if class outputs are different by 1, then a may be selected as 0.25 because a sample is misclassified if the square of its deviation from the target output is larger than 0.25. From Eq. (5.16), one may notice that the smaller the values of a is, the larger the number of hidden neurons selected by the MC2 SG. This is because a larger number of hidden neurons is needed to reduce the training error. On the other hand, if the a value is selected to be larger than 0.25, the effect on the architecture selection will be insignificant. Experimental results show that the RBFNNs yielding training error larger than 0.25 will not yield a good generalization capability, i.e., will yield poor testing accuracies (Yeung et al., 2007).

5.3.2 RBFNN Architecture Selection Algorithm for MC2 SG The solution of Eq. (5.17) is realized using the following selection algorithm. The MC2 SG is independent of any training algorithm of the RBFNN. Steps 2 and 3 represent unsupervised learning to find the centers and widths of the RBFs in the hidden neurons and Step 4 finds the least-square solution of the connection weights by making use of the linear relationship, shown in Eq. (5.8), between the hidden

5.4

Summary

45

neuron outputs and RBFNN outputs. Other training algorithms could be adopted in Steps 2, 3 and 4 in the following Architecture Selection Algorithm. The Architecture Selection Algorithm of the MC2 SG: 1. Start with M = 1 (M denotes the number of hidden neurons), 2. Execute k-means clustering algorithm to find the centers for the M hidden neurons, 3. For each of the M RBF hidden neurons, select its width value to be the distance between that center and its nearest hidden neuron, 4. Compute the connection weights using a pseudoinverse method, 5. Compute the Q-value for the current RBFNN using Eq. (5.16), 6. If the stopping criterion is not fulfilled, M = M + 1 and go to Step (2). The stopping criterion could be selected as “M is equal to the number of training samples” and this will allow the MC2 SG to search for all possible hidden neurons. However, it is computationally prohibitive for large datasets and thus we will discuss a heuristic stopping criterion in Sect. 5.3.3. Moreover, a constructive approach is employed here because it is more efficient to start the search with one hidden neuron, and with each iteration add one hidden neuron.

5.3.3 A Heuristic Method to Reduce the Computational Time for MC2 SG As in the other methods, h(M, Q∗ ) is generally not differentiable with respect to M (not a smooth function). One must try out all possible M values in order to find the optimal solution. Our experimental results show that h(M, Q∗ ) drops to 0 when the classifier becomes too complex, i.e., M is too large. Heuristically an early stop could be made to reduce the number of classifier training iterations when Q approaches 0. In our experiments, we stop the search when the Q values drop below a threshold. In fact, Q does not increase significantly after it drops below 10% of the maximum value of Q being found, and thus this is used as the threshold to speed up the MC2 SG.

5.4 Summary In this chapter, we described a new generalization error model based on the localized generalization error. R∗SM bounds the generalization error from above for unseen samples within the Q-neighborhoods of the training samples. Moreover, an architecture selection method, namely MC2 SG based on the R∗SM , is proposed to find the RBFNN classifier that has the largest coverage of unseen samples, while it’s R∗SM is still less than a preselected threshold. The R∗SM was shown to be a generalization of Rtrue .

46

5

Localized Generalization Error Model

This chapter has demonstrated the use of the MC2 SG to find the number of hidden neurons for an RBFNN, while the values of other parameters were found using existing methods. A possible extension of our result is to find the values of other RBFNN parameters, e.g., center positions, widths and connection weights, via an optimization of R∗SM because these parameters are also coefficients of the R∗SM . However, the trade-off between the optimality of the solution and time complexity will be an important consideration.

Chapter 6

Critical Vector Learning for RBF Networks

One of the most popular neural network models, the radial basis function (RBF) network attracts a lot of attention due to its improved approximation ability as well as the construction of its architecture. Bishop (1991) concluded that an RBF network can provide a fast, linear algorithm capable of representing complex nonlinear mappings. Park and Sandberg (1993) further showed that an RBF network can approximate any regular function. In a statistical sense, the approximation ability is a special case of statistical consistency. Hence, Xu et al. (1994) presented upper bounds for the convergence rates of the approximation error of RBF networks, and constructively proved the existence of a consistent point-wise estimator for RBF networks. Their results can be a guide to optimize the construction of an RBF network, which includes the determination of the total number of radial basis functions along with their centers and widths. This is an important problem to address because the performance and training of an RBF network depend very much on these parameters.

6.1 Related Work There are three ways to construct an RBF network, namely, clustering, pruning and critical vector learning. Bishop (1991) and Xu (1998) followed the clustering method, in which the training examples are grouped and then each neuron is assigned to a cluster. The pruning method, such as Chen et al. (1991) and Mao (2002), creates a neuron for each training example and then prunes the hidden neurons by example selection. The critical vector learning method, outlined by Schölkopf et al. (1997) constructs an RBF with the critical vectors, rather than cluster centers. Moody and Darken (1989) located an optimal set of centers using both the k-means clustering algorithm and learning vector quantization. The drawback of this method is that it considers only the distribution of the training inputs, yet the output values influence the positioning of the centers. Bishop (1991) introduced the Expectation-Maximization (EM) algorithm to optimize the cluster centers in two steps: obtaining of initial centers by clustering and optimization of the basis D.S. Yeung et al., Sensitivity Analysis for Neural Networks, Natural Computing Series, C Springer-Verlag Berlin Heidelberg 2010 DOI 10.1007/978-3-642-02532-7_6,

47

48

6 Critical Vector Learning for RBF Networks

functions by applying the EM algorithm. Such a treatment actually does not perform maximum likelihood learning but a suboptimal approximation. Xu (1998) extended the model for a mixture of experts to estimate basis functions, output neurons and the number of basis functions all together. The maximum likelihood learning and regularization mechanism can be further unified to his established Bayesian Ying Yang (BYY) learning framework (Xu, 2004a, 2004b, 2004c), in which any problem can be decomposed into Ying space or an invisible domain (e.g., the hidden neurons in RBFs), and Yang space or a visible domain (e.g., the training examples in RBFs), and the invisible/unknown parameters can be estimated through harmony learning between these two domains. Chen et al. (1991) proposed orthogonal least square (OLS) learning to determine the optimal centers. The OLS combines the orthogonal transform with the forward regression procedure to select model terms from a large candidate term set. The advantage of employing an orthogonal transform is that the responses of the hidden layer neurons are decorrelated so that the contribution of individual candidate neurons to the approximation error reduction can be evaluated independently. However, the original OLS learning algorithm lacks generalization and global optimization abilities. Mao (2002) employed OLS to decouple the correlations among the responses of the hidden units so that the class separability provided by individual RBF neurons can be evaluated independently. This method can select a parsimonious network architecture as well as centers providing large class separation. The common feature of all the above methods is that the radial basis function centers are a set of the optimal cluster centers of the training examples. Schölkopf et al. (1997) calculated support vectors using a support vector machine (SVM), and then used these support vectors as radial basis function centers. Their experimental results showed that the support-vector-based RBF outperforms conventional RBFs. Although the motivation of these researchers was to demonstrate the superior performance of a full support vector machine over either conventional or support-vector-based RBFs, their idea of critical vector learning is worth borrowing. This chapter describes a novel approach to determining the centers of RBF networks based on sensitivity analysis.

6.2 Construction of RBF Networks with Sensitivity Analysis An RBF classifier is a three-layer neural network model, in which an N-dimensional input vector x= (x1 x2 . . . xN ) is broadcast to each of K neurons in the hidden layer. Each hidden neuron produces an activation function, typically a Gaussian kernel: hi = exp −

x − ci 2 2σi2

, i = 1,2, . . . ,K

(6.1)

6.2

Construction of RBF Networks with Sensitivity Analysis

49

where ci and σi2 are the center and width of the Gaussian basis function of the i th hidden unit, respectively. The units in the output layer have interconnections with all the hidden units. The j th output neuron has the form: K

fj (x) = wj h =

wij exp −

i=1

x − ci 2

2σi2

(6.2)

where h = (h1 h2 . . . hK ) is the input vector from the hidden layer, and the wij is the interconnection weight between the j th output neuron and the i th hidden neuron. In this section, the RBF classifier’s sensitivity is defined as the mathematical expectation of the square of output deviations caused by the perturbation of RBF centers. An algorithm will be given that can be used to select critical vectors.

6.2.1 RBF Classifiers’ Sensitivity to the Kernel Function Centers We use symbols cˆ i and σˆ i to denote the values of center and width of the i th hidden neuron under a perturbation. Then the deviation resulting from this perturbation is: ˆ j hˆ − wj h = yj = w

( ( K (x − cˆ i (2

x − ci 2 − (6.3) wˆ ij exp − wij exp − 2σˆ i2 2σi2 i=1 i=1

K

Here, cˆ i = ci +ci are the centers deviated from the centers under the perturbations, ˆ j = wj + wj , where and the interconnection weights under the perturbations are w wj can be calculated using a pseudo-matrix inversion, or data training. Although there are ways to specify RBF widths, such as the method of Xu et al. (2004), the most common method for selecting RBF widths is to make all of them equal to a constant value depending on the prior knowledge of the given application. With predefined RBF widths, we just focus on the perturbations on the centers and their interconnection weights in this chapter. The perturbation on the i th RBF center and the weights connected to the j th output, ci and wj , can be generated following a Gaussian distribution with 0 means, variances σci and σwj , respectively: p(ci ) = √

1

2π σci

p(wj ) = √

N

1 2π σwj

' & cT c exp − 2σi 2 i

K

ci

' & wT wj exp − 2σj 2

(6.4)

wj

where N is the dimension of the input x, and K is the number of RBF centers. The RBF centers will be selected recursively in the next subsection. To make the sensitivity analysis cater to the construction of RBF networks, a recursive definition of sensitivity is given below. At the K th time, suppose there are a number K–1 of RBF centers fixed already, and the newcomer ci is observed. Hence, the j th

50

6 Critical Vector Learning for RBF Networks

output neuron’s sensitivity to the current number K of RBF centers is defined as the mathematical expectation of (yj )2 (square of output deviations caused by the perturbations of RBF centers) with respect to all ci and the training example set D = {xl }Ll=1 , which is expressed as (K)

Sj %

= E[(yj )2 ] = K 2 m w ˆ mj w ˆ nj exp − xl −cm2σ−c − p(w)p(c) 2 xl ∈D m,n=1

% −2 p(w)p(c)

m

K

xl ∈D m,n=1

%

+ p(w)p(c) =I1 − 2I2 + I3

K

xl ∈D m,n=1

xl −cn −cn 2 2σn2

2 m wˆ mj wnj exp − xl −cm2σ−c − 2 m

2 m wmj wnj exp − xl −cm2σ−c − 2 m

dcdw

xl −cn −cn 2 2σn2

dcdw xl −cn −cn 2 2σn2

dcdw (6.5)

where I1 , I2 and I3 are figured out by integrating over c and w, so I1 =

K

√ N σm2 σn2

x −cm 2 N exp − l2 2(σm +σc2m ) 2 2 2 2 (σm +σcm )(σn +σcn ) xl ∈D m,n=1;m=n √ N K σm2 2 x −cm 2 2 + (wmj + σwj ) $ N exp − 2l 2 (σm +2σcm ) (σm2 +2σc2m ) xl ∈D m=1

wmj wnj $

−

xl −cn 2 2(σn2 +σc2n )

,

and similarly, we have

Fig. 6.1 Illustration of the difference between sensitivity-based and conventional RBF classifiers. The circles and balls represent data points of two classes respectively, and RBF centers are indicated by extra circles. (a) Sensitivity-based RBF, centers being the vectors sensitive to the classification. (b) Conventional RBF, centers being the cluster centroids

6.2

Construction of RBF Networks with Sensitivity Analysis

I2 = I3 =

K

xl ∈D m,n=1 K xl ∈D m,n=1

√

wmj wnj $

σm2

N

N (σm2 +σc2m )

51

xl −cm 2 exp − 2(σ 2 +σ 2 ) − m

−cm 2 wmj wnj exp − xl2σ − 2 m

xl −cn 2 2σn2

cm

xl −cn 2 2σn2

,

.

The difference between the sensitivity-based and the conventional RBF networks can be illustrated as in Fig. 6.1.

6.2.2 Orthogonal Least Square Transform The most critical vectors can be found by Eq. (6.5). However, the RBF centers cannot be determined by a sensitivity measure only, because some of the critical vectors may be correlated. The OLS (Chen et al., 1991) can alleviate this problem of redundant or correlated centers. Let Y= (y1 , y2 . . . yL )T be the output matrix corresponding to the number L of all training examples, yi (i= 1, 2, . . ., L) an M-dimensional vector denoting the number M of output units. We have Y = HW = (QA) W

(6.6)

where Y, H, W are L × M, L × L, and L × M matrices, respectively. The selection of RBF centers is equivalent to the selection of the most critical columns of H. The matrix H can be decomposed into QA, where Q is an L × L matrix with orthogonal columns [q1 , q2 . . .qL ], and A is an L × L upper triangular matrix as follows: ⎡ ⎤ 1 h11 h12 · · · h1L ⎢ ⎢ .. ⎥ ⎢0 ⎢ h12 h12 . ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ .. . . .. ⎥ , A = ⎢ H=⎢ . ⎢ .. ⎢ ⎥ ⎢ ⎢. ⎥ ⎢ . ⎣ .. ⎦ ⎣ .. 0 h1L · · · · · · hLL ⎡

a12 · · · · · ·

a1L .. . .. .

⎤

⎥ ⎥ ⎥ ⎥ ⎥. ⎥ ⎥ 0 1 a(L−1)L ⎦ ··· ··· 0 1 1 a23 . 0 ..

Only one column of H is orthogonalized in each iteration. At the K th iteration, one column is made orthogonal to each of the K–1 previously orthogonalized columns. The computational procedure can be represented as follows (Chen et al. 1991): ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨

q1 = h1 , aik =

qTi hK , qi

⎪ ⎪ ⎪ K−1 ⎪ ⎪ ⎪ ⎩ q K = hK − aiK hi

1≤i

Advisory Board: S. Amari G. Brassard K.A. De Jong C.C.A.M. Gielen T. Head L. Kari L. Landweber T. Martinetz Z. Michalewicz M.C. Mozer E. Oja G. P˘aun J. Reif H. Rubin A. Salomaa M. Schoenauer H.-P. Schwefel C. Torras D. Whitley E. Winfree J.M. Zurada

For further volumes: http://www.springer.com/series/4190

Daniel S. Yeung · Ian Cloete · Daming Shi · Wing W.Y. Ng

Sensitivity Analysis for Neural Networks

123

Authors Prof. Daniel S. Yeung School of Computer Science and Engineering South China University of Technology Wushan Rd. TianHe District Guangzhou, China [email protected] Prof. Daming Shi School of Electrical Engineering and Computer Science Kyungpook National University Buk-gu, Daegu South Korea [email protected]

Prof. Ian Cloete President Campus 3 International University in Germany 76646 Bruchsal, Germany [email protected]; [email protected] Dr. Wing W.Y. Ng School of Computer Science and Engineering South China University of Technology Wushan Rd. TianHe District Guangzhou, China [email protected]

Series Editors G. Rozenberg (Managing Editor) [email protected] Th. Bäck, J.N. Kok, H.P. Spaink Leiden Center for Natural Computing Leiden University Niels Bohrweg 1 2333 CA Leiden, The Netherlands A.E. Eiben Vrije Universiteit Amsterdam The Netherlands

ISSN 1619-7127 ISBN 978-3-642-02531-0 e-ISBN 978-3-642-02532-7 DOI 10.1007/978-3-642-02532-7 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2009938187 ACM Computing Classification (1998): I.2.6, F.1.1 © Springer-Verlag Berlin Heidelberg 2010 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Cover design: KuenkelLopka GmbH Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Neural networks provide a way to realize one of our human dreams to make machines think like us. Artificial neural networks have been developed since Rosenblatt proposed the Perceptron in 1958. Today, many neural networks are not treated as black boxes any more. Issues such as robustness and generalization abilities have been brought to the fore. The advances in neural networks have led to more and more practical applications in pattern recognition, financial engineering, automatic control and medical diagnosis, to name just a few. Sensitivity analysis dates back to the 1960s, when Widrow investigated the probability of misclassification due to weight perturbations, which are caused by machine imprecision and noisy input. For the purpose of analysis, these perturbations can be simulated by embedding disturbance into the original inputs or connection weights. The initial idea of sensitivity analysis was then extended to optimization and to applications of neural networks, such as sample reduction, feature selection, active learning and critical vector learning. This text should primarily be of interest to graduate students, academics, and researchers in branches of neural networks, artificial intelligence, machine learning, applied mathematics and computer engineering where sensitivity analysis of neural networks and related concepts are used. We have made an effort to make the book accessible to such a cross-disciplinary audience. The book is organized into eight chapters, of which Chap. 1 gives an introduction to the various neural network structures and learning schemes. A literature review on the methodologies of sensitivity analysis is presented in Chap. 2. Different from the traditional hypersphere model, the hyper-rectangle model described in Chap. 3 is especially suitable for the most popular and general feedforward network: the multilayer Perceptron. In Chap. 4, the activation function is also involved in the calculation of the sensitivity analysis by parameterizing. The sensitivity analysis of radial basis function networks is discussed in Chaps. 5 and 6, with the former giving a generalization error model whereas the latter concerns optimizing the hidden neurons. In Chap. 7, sensitivity is measured in order to encode prior knowledge into a neural network. In Chap. 8, sensitivity analysis is applied in many applications, such as dimensionality reduction, network optimization and selective learning. We would like to express our thanks to many colleagues, friends and students who provided reviews of different chapters of this manuscript. They include Minh v

vi

Preface

Nhut Nguyen, Xiaoqin Zeng, Patrick Chan, Xizhao Wang, Fei Chen and Lu He. We often find ourselves struggling with many competing demands for our time and effort. As a result, our families, especially our beloved spouses, are the ones who suffer the most. We are delighted to dedicate this work to Foo-Lau Yeung, Wilma Cloete and Jian Liu. It is with great humility that we would like to acknowledge our Good Lord as the true creator of all knowledge. This work is the result of our borrowing a small piece of knowledge from Him. 8 July 2009

Daniel S. Yeung Ian Cloete Daming Shi Wing W.Y. Ng

Contents

1 Introduction to Neural Networks . . . . . . . . . . . . . . 1.1 Properties of Neural Networks . . . . . . . . . . . . . 1.2 Neural Network Learning . . . . . . . . . . . . . . . . 1.2.1 Supervised Learning . . . . . . . . . . . . . . 1.2.2 Unsupervised Learning . . . . . . . . . . . . . 1.3 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Adaline and Least Mean Square Algorithm . . . . . . . 1.5 Multilayer Perceptron and Backpropagation Algorithm 1.5.1 Output Layer Learning . . . . . . . . . . . . . 1.5.2 Hidden Layer Learning . . . . . . . . . . . . . 1.6 Radial Basis Function Networks . . . . . . . . . . . . 1.7 Support Vector Machines . . . . . . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

1 3 5 5 5 6 8 9 11 11 12 13

2 Principles of Sensitivity Analysis . . . . . . . . . 2.1 Perturbations in Neural Networks . . . . . . . 2.2 Neural Network Sensitivity Analysis . . . . . 2.3 Fundamental Methods of Sensitivity Analysis 2.3.1 Geometrical Approach . . . . . . . . 2.3.2 Statistical Approach . . . . . . . . . . 2.4 Summary . . . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

17 17 18 21 21 23 24

3 Hyper-Rectangle Model . . . . . . . . . . . . . . . . 3.1 Hyper-Rectangle Model for Input Space of MLP . 3.2 Sensitivity Measure of MLP . . . . . . . . . . . 3.3 Discussion . . . . . . . . . . . . . . . . . . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

25 25 26 27

4 Sensitivity Analysis with Parameterized Activation Function 4.1 Parameterized Antisymmetric Squashing Function . . . . . 4.2 Sensitivity Measure . . . . . . . . . . . . . . . . . . . . . 4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . .

. . . .

. . . .

. . . .

29 29 30 31

5 Localized Generalization Error Model . . . . 5.1 Introduction . . . . . . . . . . . . . . . . 5.2 The Localized Generalization Error Model 5.2.1 The Q-Neighborhood and Q-Union

. . . .

. . . .

. . . .

. . . .

33 33 35 36

. . . .

. . . .

. . . . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

vii

viii

Contents

5.2.2 The Localized Generalization Error Bound . . . . . . . 5.2.3 Stochastic Sensitivity Measure for RBFNN . . . . . . 5.2.4 Characteristics of the Error Bound . . . . . . . . . . . 5.2.5 Comparing Two Classifiers Using the Error Bound . . 5.3 Architecture Selection Using the Error Bound . . . . . . . . . 5.3.1 Parameters for MC2 SG . . . . . . . . . . . . . . . . . 5.3.2 RBFNN Architecture Selection Algorithm for MC2 SG 5.3.3 A Heuristic Method to Reduce the Computational Time for MC2 SG . . . . . . . . . . . . . . . . . . . . 5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . .

36 38 40 42 42 44 44

. . . .

45 45

6 Critical Vector Learning for RBF Networks . . . . . . . . . . . . . . 6.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Construction of RBF Networks with Sensitivity Analysis . . . . . 6.2.1 RBF Classifiers’ Sensitivity to the Kernel Function Centers 6.2.2 Orthogonal Least Square Transform . . . . . . . . . . . . 6.2.3 Critical Vector Selection . . . . . . . . . . . . . . . . . . 6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

47 47 48 49 51 52 52

7 Sensitivity Analysis of Prior Knowledge . . 7.1 KBANNs . . . . . . . . . . . . . . . . 7.2 Inductive Bias . . . . . . . . . . . . . . 7.3 Sensitivity Analysis and Measures . . . 7.3.1 Output-Pattern Sensitivity . . . . 7.3.2 Output-Weight Sensitivity . . . . 7.3.3 Output-H Sensitivity . . . . . . 7.3.4 Euclidean Distance . . . . . . . 7.4 Promoter Recognition . . . . . . . . . . 7.4.1 Data and Initial Domain Theory 7.4.2 Experimental Methodology . . . 7.5 Discussion and Conclusion . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

55 56 58 59 59 60 61 61 61 62 63 64

8 Applications . . . . . . . . . . . . . . . . . . . . . 8.1 Input Dimension Reduction . . . . . . . . . . . 8.1.1 Sensitivity Matrix . . . . . . . . . . . . 8.1.2 Criteria for Pruning Inputs . . . . . . . 8.2 Network Optimization . . . . . . . . . . . . . 8.3 Selective Learning . . . . . . . . . . . . . . . . 8.4 Hardware Robustness . . . . . . . . . . . . . . 8.5 Measure of Nonlinearity . . . . . . . . . . . . 8.6 Parameter Tuning for Neocognitron . . . . . . 8.6.1 Receptive Field . . . . . . . . . . . . . 8.6.2 Selectivity . . . . . . . . . . . . . . . . 8.6.3 Sensitivity Analysis of the Neocognitron

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

69 69 70 70 71 74 76 77 78 79 80 80

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

83

Chapter 1

Introduction to Neural Networks

The human brain consists of ten billion densely interconnected nerve cells, called neurons; each connected to about 10,000 other neurons, with 60 trillion connections, synapses, between them. By using multiple neurons simultaneously, the brain can perform its functions much faster than the fastest computers in existence today. On the other hand, a neuron can be considered as a basic information-processing unit, whereas our brain can be considered as a highly complex, nonlinear and parallel biological information-processing network, in which information is stored and processed simultaneously. Learning is a fundamental and essential characteristic of biological neural networks. The ease with which they can learn led to attempts to emulate a biological neural network in a computer. In the 1940s, McCulloch and Pitts proposed a model for biological neurons and biological neural networks. A stimulus is transmitted from dendrites to a soma via synapses, and axons transmit the response of one soma to another, as shown in Fig. 1.1. Inspired by the mechanism for learning in biological neurons, artificial neurons and artificial neural networks can perform arithmetic functions, with cells corresponding to neurons, activations corresponding to neuronal firing rates, connections corresponding to synapses, and connection weights corresponding to synaptic strengths, as shown in Fig. 1.1. The analogy between biological neurons and artificial neurons is made in Table 1.1. However, neural networks are far too simple to serve as realistic brain models on the cell level, but they might serve as very good models for the essential information processing tasks that organisms perform. This remains an open question because we have so little understanding of how the brain actually works (Gallant, 1993). In a neural network, neurons are joined by directed arcs – connections. The neurons and arcs constitute the network topology. Each arc has a numerical weight that specifies the influence between two neurons. Positive weights indicate reinforcement; negative weights represent inhibition. The weights determine the behavior of the network, playing somewhat the same role as in a conventional computer program. Typically, there are many inputs for a single neuron, and a subsequent output of an activation function (or transfer function). Some frequently used activation functions include:

D.S. Yeung et al., Sensitivity Analysis for Neural Networks, Natural Computing Series, C Springer-Verlag Berlin Heidelberg 2010 DOI 10.1007/978-3-642-02532-7_1,

1

2

1 Introduction to Neural Networks

a

b

Fig. 1.1 Biological motivations of neural networks. (a) Neuroanatomy of living animals. (b) Connections of an artificial neuron

• Linear Function: (u) = u

1 • Log-Sigmoid Function: (u) = −u 1+e 1 if u ≥ 0 • Hard Limit function: (u) = 0 otherwise Rosenblatt (1958) devised the Perceptron, which is now widely used. Widrow and Hoff (1960) proposed the Adaline at the same time. The Perceptron and Adaline were the first practical networks and learning rules that demonstrated the high potential of neural networks. Table 1.1 Analogy between biological and artificial neurons

Biological Neurons

Artificial Neurons

Soma Dendrite Axon Synapse

Sum + Activation Function Input Output Weight

1.1

Properties of Neural Networks

3

Minsky and Papert (1969) criticized the Perceptron, because they found it is not powerful enough to do tasks such as parity and connectedness. As long as a neural network consists of a linear combiner followed by a nonlinear element, a single-layer Perceptron can perform pattern classification only on linearly separable patterns, regardless of the form of nonlinearity used. Minsky and Papert demonstrated the limitations of the Perceptron with the simplest XOR problem, and as a consequence the research in the field was suspended due to the lack of funding. Because of Minsky and Papert’s conclusion, people lost confidence in neural networks. In the 1970s, the progress on neural networks continued at a much reduced pace, although some researchers, such as Amari, Anderson, Fukushima, Grossberg and Kohonen, kept working on it. In 1985, Ackley, Hinton and Sejnowski described the Boltzmann machine, which is more powerful than a Perceptron, and demonstrated a successful application – NETtalk. Rumelhart, Hinton and Williams (1986) are recognized for their milestone work – the Multilayer Perceptron (MLP) with backpropagation, which remained the dominant neural network architecture for more than ten years. A sufficiently big MLP has been proven to be able to learn any function, and many variants of MLP have been proposed since then. In 1988, Radial Basis Functions (RBFs), introduced as an alternative to MLPs, speeded up MLP training (fast-prop). From the 1990s to date, more and more research has been done to improve neural networks. The research agenda includes regularizers, probabilistic (Bayesian) inference, structural risk minimization and incorporation with evolutionary computation.

1.1 Properties of Neural Networks Neural networks can be described according to their network, cell, dynamic, and learning properties as follows: Network Properties. A neural network is an architecture consisting of many neurons, which work together to respond to the inputs. We sometimes consider a network as a black-box function. Here the external world presents inputs to the input cells and receives network outputs from the output cells. Intermediate cells are not seen externally, and for this reason they are often called hidden units. We classify networks as either feedforward networks if they do not contain directed cycles or recurrent networks if they do contain such cycles. It is often convenient to organize the cells of a network into layers. Cell Properties. Each cell computes a single (numerical) cell output or activation. Cell inputs and activations may be discrete, taking values {0, 1} or {−1, 0, 1}, or they may be continuous, assuming values in the interval [0, 1] or [−1, +1]. Typically, every cell uses the same algorithm for computing its activation. The activation for a cell must be computed from the activations of cells directly connected to it and the corresponding weights for those connections. Every cell (except for input cells) computes its new activation as a function of the weighted sum of the inputs from directly connected cells.

4

1 Introduction to Neural Networks

Dynamic Properties. A neural network model must specify the time when each cell computes its new activation value and when the change to that cell’s output actually takes place. Usually in feedforward models, cells are visited in a fixed order, and each cell reevaluates and changes its activation before the next one is visited. In this case the network achieves a steady state after one pass through the cells, provided the cells are correctly numbered. For recurrent models there are several possibilities. One possibility is to make one ordered pass through the cells, as with feedforward models; but we are no longer guaranteed that the network will reach steady state. Another possibility is to make repeated passes through the network. For discrete models the network will either reach a steady state or it will cycle; for continuous models the network will either reach a steady state, cycle, approach a steady state in the limit, blow up, or some combination of these things. A third possibility is to compute new activations for all cells simultaneously and then make changes to the cell outputs simultaneously. This is similar to the previous case. Still another possibility is to select a cell at random, compute its new activation, and then change its output before selecting the next cell. In this case we have no guarantee of any sort of limiting or cyclic behavior unless other constraints are placed upon the model. Learning Properties. Each neural network needs to be trained to respond to the inputs, so it must be associated with one or more algorithms for machine learning. Machine learning refers to computer models that improve neural network performance in significant ways based upon the input data. Machine learning techniques are usually divided into supervised and unsupervised models. In supervised learning a teacher or critic supplies additional input data that gives a measure of how well a program is performing during a training phase. The most common form of supervised learning is trying to duplicate behavior specified by a finite set of training examples, where each example consists of input data and the corresponding correct output. In unsupervised learning there is no performance evaluation available. Without any specific knowledge of what constitutes a correct answer and what constitutes an incorrect answer, the most that can be expected from these models is the construction of groups of similar input patterns. In the pattern recognition literature this is known as clustering. Neural networks have potential as intelligent control systems because they can learn and adapt, they can approximate nonlinear functions, they are suited for parallel and distributed processing, and they naturally model multivariable systems. If a physical model is unavailable or too expensive to develop, a neural network model might be an alternative. There are now many real-world applications ranging from finance to aerospace. There are also many advanced neural network architectures (Haykin, 1994). Some representative neural network models will be discussed in the remainder of this chapter. In terms of their architectures, neural networks can be categorized into feedforward and recurrent. The difference between these two is that there is at least one feedback loop in the latter. The focus of this book is on feedforward neural networks. Some representative feedforward models will be introduced in the following discussions.

1.2

Neural Network Learning

5

1.2 Neural Network Learning The primary significance of a neural network is its ability to learn from its environment. What is learning? In the context of neural networks, learning is defined as a process by which the free parameters of a neural network are adapted through a continuous process of stimulation by the environment. The type of learning is determined by the manner in which the parameter changes take place. The above definition implies that (1) the network is stimulated by the environment; (2) the network changes as a result of stimulation; and (3) the network responds to the environment in a new way after the occurrence of change.

1.2.1 Supervised Learning During the training session of a neural network, an input is applied to the network, and a response of the network is obtained. The response is compared with an a priori target response. If the actual response differs from the target response, the neural network generates an error signal, which is then used to compute the adjustment that should be made to the network’s synaptic weights so that the actual response matches the target output. In other words, the error is minimized, possibly to 0. Since the minimization process requires a teacher (supervisor), this kind of training is named supervised learning. A supervised learning model is illustrated in Fig. 1.2. The notion of teacher comes from biological observations. For example, when learning a language, we hear the sound of a word from a teacher. The sound is stored in the memory banks of our brain, and we try to reproduce the sound. When we hear our own sound, we mentally compare it (actual response) with the stored sound (desired response) and note the error. If the error is large, we try again and again until it becomes significantly small. An unsupervised learning model is illustrated in Fig. 1.3.

1.2.2 Unsupervised Learning In contrast to supervised learning, unsupervised learning does not require a teacher, i.e., there is no target response. During the training stage, the neural network receives its input patterns and it arbitrarily organizes them into categories. When an input is later applied, the neural network provides an output response to indicate the class to which the input pattern belongs. For example, show a person a set of different objects. Then ask him to separate them into different groups, such that objects in a group have one or more common features that distinguish them from other groups. When this (training) is done, show the same person an object that is unseen and ask him to place the object in one of the groups. He would then put it in the group with which the object shares the most common features. Even though unsupervised learning does not require a teacher, it requires guidelines to determine how it forms groups. Grouping may be based on shape, color, or material consistency or some other properties of the objects. If no guidelines have

6

1 Introduction to Neural Networks Vector describing state of environment Environment

Teacher desired response

Learning System

+ actual response _ Σ

error signal

Fig. 1.2 Supervised learning model Vector describing state of environment

Fig. 1.3 Unsupervised learning model Environment

Learning system

been given as to what type of features should be used for grouping the objects, the grouping may or may not be successful. Similarly, to classify patterns effectively using neural networks, some guidelines are needed.

1.3 Perceptron The Perceptron, the simplest form of a neural network, is able to classify data into two classes. Basically it consists of a single neuron with a number of adjustable weights. The neuron is the fundamental processor of a neural network (Fig. 1.4). The summation in the neuron also includes an offset for lowering or raising the net input to the activation function. Mathematically the input to the neuron is represented by a vector x =< 1, x1 , x2 · · · xn >T, and the output is a scalar y. The weights of the connections are represented by the vector w =< w0 , w1 , w2 · · · wn >T , where wo is the offset. The offset is often also called the bias and denoted with b. The output is calculated as y = f (wT x).

(1.1)

Fig. 1.4 is a Perceptron with two inputs and an offset. With a hard limiter as activation function, the neuron produces an output equal to 0 or +1 that we can associate with the two classes C1 and C2 respectively.

1.3

Perceptron

7 w1

x1

w2

x2

∑

net

...

xp

wp

Summing node

1 0

net

Output signals y

–1

Threshold unit logic

weights

Fig. 1.4 Perceptron consisting of a neuron with an offset and a hard limiter activation function

The weights w are adjusted using an adaptive learning rule. One such learning rule is the Perceptron Convergence Algorithm. If the two classes C1 and C2 are linearly separable (i.e., they lie on opposite sides of a straight line or, in general, a hyperplane), then there exists a weight vector w with the properties

wT x ≥ 0 for every input vector x belonging to class C1 wT x < 0 for every input vector x belonging to class C2

(1.2)

Assuming, to be general, that the Perceptron has m inputs, the equation wT x = 0 in an m-dimensional space with coordinates x1 , x2 · · · xm , defines a hyperplane as the switching surface between the two different classes of input. The training process adjusts the weights w to satisfy the two inequalities (1.2). A training set consists of, say, K samples of the input vector x together with each sample’s class membership (0 or 1). The learning is continued epoch by epoch until the weights stabilize. The core of the Perceptron convergence algorithm for adapting the weights of the elementary Perceptron has the following two steps. Step 1. If the kth member of the training set, xk (k = 1, 2. . . K), is correctly classified by the weight vector wk computed at the kth iteration of the algorithm, no correction is made to wk , i.e.,

wk+1 = wk if wT x ≥ 0 and xk belongs to class C1 wk+1 = wk if wT x < 0 and xk belongs to class C2

(1.3)

Step 2. Otherwise the Perceptron weights are updated according to the rule

wk+1 = wk − ηk xk if wT x ≥ 0 but xk belongs to class C2 wk+1 = wk + ηk xk if wT x < 0 but xk belongs to class C1

(1.4)

Notice the interchange of the class numbers from step 1 to step 2. The learningrate ηk controls the adjustment applied to the weight vector at iteration k. If ηk is a

8

1 Introduction to Neural Networks

constant, independent of the iteration number k, we have a fixed increment adaptation rule for the Perceptron. The algorithm has been proved to converge (Haykin, 1994).

1.4 Adaline and Least Mean Square Algorithm ADALINE is an acronym for ADAptive LINear Element (also known as linear threshold unit). It was developed by Bernard Widrow and Marcian Hoff (1960). As shown in Fig. 1.5, the architecture of the Adaline is the same as that of the Perceptron. The difference between these two models is the activation function. An Adaline has n variable inputs x1 , x2 · · · xn , which take on binary values of either +1 or −1. The bias input x0 is fixed at a value of +1. Associated with the Adaline are n+1 adjustable analog weights w0 , w1 , w2 · · · wn . The weights of the Adaline scale the corresponding inputs, the scaled inputs are summed, and the weighted sum is the input to a threshold device. The threshold device outputs a −1 for negative inputs and a +1 for positive inputs. The output of the threshold device is the Adaline output. The Adaline applies the Least Mean Square (LMS) algorithm, also known as Widrow-Hoff learning. The LMS algorithm is more powerful than the Perceptron learning rule. While the Perceptron rule is guaranteed to converge to a solution that correctly categorizes the training patterns, the resulting network can be sensitive to noise, since patterns often lie close to the decision boundaries. The LMS algorithm minimizes mean square error, and therefore tries to move the decision boundaries as far from the training patterns as possible. The LMS algorithm has found many more practical uses than the Perceptron learning rule. This is especially true in the area of digital signal processing (Hagan, Demuth and Beale, 1996). Given a training set {(x1 , d1 ), (x2 , d2 ). . . (xP , dP )}, where the pth training pair is given by (xp , dp ), with dp being the target output, the mean square error (MSE) between the real output and the target output is calculated by:

w1

x1

w2

x2

f(net)

... xp

Continuous perception

wp weights

Fig. 1.5 Adaptive linear element (Adaline)

Output signals y

1.5

Multilayer Perceptron and Backpropagation Algorithm 2 2 MSE = E[e − wT x)2 ] 2] = E[(dT− y) ] T= E[(d = E d − 2dw x + w xxT w = E[d 2 ] − 2wT E[dx] + wT E[xxT ]w

9

(1.5)

From Eq. (1.5), we can see that the MSE for the Adaline is a quadratic function: 1 J(w) = c + bT w + wT Aw (1.6) 2 with b = −2h = −2E [dx] and A = 2R = 2E xxT . The gradient of the MSE is given by: ∇J(w) = ∇ c + bT w + 12 wT Aw = b + Ax = −2h + 2Rx

(1.7)

To find the optimal weights w∗ , we need to find the stationary point of J(w). Let

∇J(w) = −2h + 2Rx = 0

(1.8)

w∗ = R−1 h

(1.9)

So we have

As the objective of neural network training is to find the optimal w∗ by iterative updating, the steepest gradient method can be applied: w(t + 1) = w(t) − η∇J(t)

(1.10)

where t refers to the tth iteration. However, since it is not desirable to calculate h and R, and not convenient to calculate R−1 , the MSE can be estimated iteratively by square error (Widrow and Hoff, 1960): Jˆ (w) =

1 1 [di (t) − yi (t)]2 = e2 (t) 2 2

(1.11)

and its gradient can be accordingly estimated by 1 2 ∂e (t) ∂ Jˆ (t) ∂e(t) ∂(d(t) − w(t) · x(t)) ∇ Jˆ (w) = = 2 = e(t) · = e(t) · ∂w(t) ∂w(t) ∂w(t) ∂w(t) = e(t) · ( − x(t))

(1.12)

1.5 Multilayer Perceptron and Backpropagation Algorithm As long as a neural network consists of a linear combiner followed by a nonlinear element, a single-layer Perceptron can perform pattern classification only on linearly separable patterns, regardless of the form of nonlinearity used. Linear separability

10

1 Introduction to Neural Networks

requires that the patterns to be classified be sufficiently separated from each other to ensure that the decision boundaries are hyperplanes. As shown in Fig. 1.6, MLP can address the above problem. An MLP consists of an input layer, one or more hidden layers and an output layer. Each neuron is fully connected to all the neurons in its preceding layer as well as all those in its next layer. Typically, the same nonlinear function, such as the log-sigmoid function, is applied to every neuron. The backpropagation algorithm was developed for training multilayer Perceptron networks. It was popularized by Rumelhart, Hinton and Williams (1986), although similar ideas had been developed previously by others. The idea is to train a network by propagating the output errors backward through the layers. The errors serve to evaluate the derivatives of the error function with respect to the weights, which can then be adjusted. The backpropagation algorithm consists of two phases, namely, a feedforward phase and a backpropagation phase. The former applies an input, evaluates the activations and stores the error. The latter computes the adjustments and updates the weights. However, only the errors in the output layer are visible, i.e., can be calculated directly, whereas those in the hidden layers will be propagated from their next layers. The error at the output neuron j at iteration t can be calculated by the difference between the desired output and the corresponding real output, i.e., ej (t) = dj (t) − 2 yj (t). Accordingly, the total error energy of all output neurons is ε(t) = 12 ej (t). j∈C

Referring to Fig. 1.6, the output of the kth neuron in the lth layer can be calculated by:

1

....

l

l–1 w l11 w l12

Fig. 1.6 Multilayer Perceptron

w l1n l−1

....

L

1.5

Multilayer Perceptron and Backpropagation Algorithm

⎛ ylk = f ⎝

⎞

l−1

n

11

⎠ wljk · yl−1 j

(1.13)

j=1

where 1 ≤ l ≤ L, nl refers to the number of neurons in layer l. For the input layer thus holds l = 1, y1j = xj ; for the output layer l = L, yLj = yj .

1.5.1 Output Layer Learning The mean square error (MSE) of the output can be computed by: ⎡ ⎞⎤ ⎛ L−1 nL nL n

2 1

1 ⎣ E= dj − yj = dj − f ⎝ wLij · yiL−1 ⎠⎦ 2 2 j=1

j=1

(1.14)

i=1

The steepest descent of MSE can be used to update the weights: wLij (t + 1) = wLij (t) − η

∂E ∂wLij

(1.15)

Since the output error E is an indirect function of the weights in the hidden layers, the chain rule of calculus is applied to calculate the derivatives. Let l−1 n l l netlj = wlij · yl−1 i , yj = f (netj ); then we have i=1

L ∂E ∂yj ∂E = · ∂wLij ∂yLj ∂wLij

[ − 12pt] =

∂yLj ∂netLj ∂E · · ∂yLj ∂netLj ∂wLij

(1.16)

[ − 12pt] = − dj − yj · f netLj · yiL−1 The error signal of the output can be backpropagated to layer L−1: δjL = −

∂yLj

L ∂E ∂E = − · = d − y · f netj j j ∂netLj ∂yLj ∂netLj

1.5.2 Hidden Layer Learning For any hidden layer l, the descent of the error is calculated by

(1.17)

12

1 Introduction to Neural Networks

∂E ∂wlij

=

∂E ∂ylj

·

∂ylj ∂wlij

=

∂E ∂ylj

·

∂ylj ∂netlj

·

∂netlj ∂wlij

=

∂E ∂ylj

· f netlj · yl−1 i

(1.18)

Since E is not a direct function of the output in the hidden layers, the steepest descent of the error can be calculated by: nl+1

∂yl+1 ∂E ∂E k = · ∂ylk ∂yl+1 ∂yli k=1 k l+1 n ∂yl+1 ∂netl+1 ∂E k k · · = l+1 ∂yli ∂netl+1 k=1 ∂yk k l+1 l+1 n n l+1 l+1 (netl+1 ) · wl+1 = el+1 δ · f · w = k k jk k jk k=1

(1.19)

k=1

where δkl+1 is the propagated error signal from the next layer, and the error signal in

the lth layer δkl = elk · f netlk will then be propagated to layer l−1.

1.6 Radial Basis Function Networks The design of a neural network can be viewed as a curve-fitting problem. According to this viewpoint, learning is equivalent to finding a surface in a multidimensional space that provides a best fit to the training data, whereas generalization is equivalent to the use of the trained multidimensional surface to interpolate the test data. There are two bases in support of Radial Basis Function networks: 1. Cover’s Theorem. A complex pattern classification problem cast in a highdimensional space nonlinearly is more likely to be linearly separable than in a low-dimensional space. Cover’s theorem tells us that we can map the input space to a high-dimensional space, in which a linear function will be found. 2. Tikhonov Regularization Theory. In the context of a hypersurface reconstruction problem, the basic idea of regularization is to stabilize the solution by means of some auxiliary nonnegative functional that embeds prior information about the solution. Tikhonov Regularization Theory tells us that we cannot purely rely on the training data, but must introduce some constraints, e.g., function smoothness. RBF networks are feedforward networks trained using a supervised training algorithm. They are typically configured with a single hidden layer of units whose activation function is selected from a class of functions called basis functions. Actually, RBF networks transform the input space into a high dimensional space in a nonlinear manner and then find the curve-fitting approximation in high-dimensional space. The activation function depends on the Euclidean distance between input and target vectors. An RBF network can be described by

1.7

Support Vector Machines

13

F(x) =

n

wi ϕ ( x − xi )

(1.20)

i=1

where ϕ is a nonlinear transformation from input space to hidden space of high dimensionality. The most common form of basis function is the Gaussian function

x − c 2 ϕ(x) = exp − 2σ 2

(1.21)

where c determines the center of the basis function, and σ is a width parameter that controls how spread out the curve is. The output layer performs a linear mapping from hidden space to output space. The output of the network is the weighted sum of the hidden layer neuron outputs. A set of training data with their corresponding target outputs are given for supervised learning: ⎡

ϕ11 ⎢ ϕ21 ⎢ ⎢ .. ⎣ .

ϕ12 ϕ22 .. .

··· ... .. .

⎤⎡ ⎤ ⎡ ⎤ ϕ1n d1 w1 ⎢ w2 ⎥ ⎢ d2 ⎥ ϕ2n ⎥ ⎥⎢ ⎥ ⎢ ⎥ .. ⎥ ⎢ .. ⎥ = ⎢ .. ⎥ . ⎦⎣ . ⎦ ⎣ . ⎦

ϕn1 ϕn2 · · · ϕnn

wn

(1.22)

dn

So, w = −1 d

(1.23)

In practice, we do not calculate −1 , but apply the following two steps to train RBF networks: (1) Select centers randomly or based on some criteria, (2) update weights, typically using a linear least estimation algorithm.

1.7 Support Vector Machines Given a number of n training samples for a two-class problem {(x1 ,y1 } · · · (xn ,yn }, where x ∈ Rm , y ∈ {−1,+1}, there may exist a nested structure of decision functions {fα (x):α ∈ } where fα :Rm → {−1, + 1}

(1.24)

However, the optimal function should be the one that can achieve the maximum 2 . The goal of support vector machines (SVMs) is to minimize w to get margin w the maximum margin, as shown in Fig. 1.7. The optimal solution fα is supposed to provide the smallest possible value for expected risk R(fα ):

14

1 Introduction to Neural Networks

Fig. 1.7 Illustration of maximum margin

R(fα ) =

|fα (x) − y|P(x,y)dxdy

(1.25)

As the probability distribution P(x,y) is unknown, the Empirical Risk, Remp (fα ) is computed in all the conventional neural networks: 1 1 |fα (xi − yi )| n 2 n

Remp (fα ) =

(1.26)

i=1

However, in terms of Vapnik-Chervonenkis (VC) Dimension theory (Vapnik, 1995), which indicates the maximum number of training points h that can be separated by set of linear functions fα , there is a bound to the expected risk:

R(fα ) ≤ Remp (fα ) +

h ln 2n + 1 − ln η h 4 n

(1.27)

with probability 1 − η. The second term in Eq. (1.27) is called the confidence term or confidence interval. For a fixed number of training examples n, the training error (empirical risk) decreases monotonically as the capacity or VC dimension h is increased, whereas the confidence interval increases monotonically. Accordingly, both the guaranteed risk and the generalization error go through a minimum. Hence, our goal is to find a network such that decreasing the VC dimension occurs at the expense of the smallest possible increase in training error. So the goal of SVMs is to minimize w 2 subject to yi (w · xi + b) ≥ 1, i = 1,...,n. This is equivalent to solving the following constrained Quadratic Programming (QP) problem to construct the Lagrangian:

1.7

Support Vector Machines

15

1 w 2 − αi yi (w · xi + b) − 1 2 n

L (w,b,) =

(1.28)

i=1

where = (α1 ,...,αn ) and ≥ 0. A solution to the QP problem is determined by the saddle point of the function. Differentiate L(w,b,)with respect to W and b.

∂L(w,b,) =w− αi yi xi = 0 ⇒ w = αi yi xi ∂w

(1.29)

∂L(w,b,)

αi y i = 0 = ∂b

(1.30)

n

n

i=1

i=1

n

i=1

So, the optimization problem is converted to maximizing F () =

n

i=1

αi −

n 1

αi αj yi yj xi · xj 2

(1.31)

i,j=1

subject to · y = 0 where ≥ 0. In most cases α = 0. Those input data points with α > 0 are called support vectors. Removing other data points except support vectors will not affect the solution of the classifier. Similarly to RBF, the input space is also mapped to a sufficiently large feature space in SVM, so that patterns become linearly separable, and so a simple Perceptron in feature space can do the classification. SVMs are radically different types of classifiers which have attracted a great deal of attention lately due to the novelty of the concepts that they bring to pattern recognition, their strong mathematical foundation, and their excellent results in practical problems.

Chapter 2

Principles of Sensitivity Analysis

Sensitivity refers to how a neural network output is influenced by its input and/or weight perturbations. Sensitivity analysis dates back to the 1960 s, when Widrow investigated the probability of misclassification caused by weight perturbations, which are caused by machine imprecision and noisy input (Widrow and Hoff, 1960). In network hardware realization, such perturbations must be analyzed prior to its design, since they significantly affect network training and generalization. The initial idea of sensitivity analysis has been extended to the optimization of neural networks, such as through sample reduction, feature selection, and critical vector learning.

2.1 Perturbations in Neural Networks Perturbations of neural networks are caused by machine imprecision and/or input noise. For the purpose of analysis, these perturbations can be simulated by embedding disturbance to the original inputs or connection weights. Perturbation analysis allows the study of the characteristics of a function under small perturbations of the function’s parameter (Holtzman, 1992; Zurada et al., 1997). Perturbation analysis is also important also when assessing network robustness against input noise to measure the uncertainty or fluctuations. In perturbation analysis we are interested in evaluating the disturbance in the function’s response to small perturbations in its parameters. Fig. 2.1 shows how to investigate the effects of perturbations to neural networks. Assuming that the performance function is differentiable, the relationship between the perturbed response of this function and parameter perturbations is expressed by a Taylor expansion of the function. For example, for a one-dimensional cost function g g ( + ) = g ( ) +

2 g ( ) + g ( ) + · · · 1! 2!

(2.1)

where is a parameter of the function and typically includes weight W and input X; is a small perturbation of . The performance cost function g is usually taken D.S. Yeung et al., Sensitivity Analysis for Neural Networks, Natural Computing Series, C Springer-Verlag Berlin Heidelberg 2010 DOI 10.1007/978-3-642-02532-7_2,

17

18

2

Fig. 2.1 General structure for investigating the effects of perturbations

Principles of Sensitivity Analysis

Weight W Reference Neural Network Input X comparison

output error

Perturbed Neural Network

ΔX

ΔW

by the noise-to-signal ratio (NSR) or expectation of decision errors. The Taylor expansion shows that the derivatives of the function with respect to the perturbed parameter encapsulate the characteristics of that function under the perturbations. 2 . Ideally, when → 0, 1! g ( ) + 2! g ( ) + · · · → 0 Eq. (2.1) shows that the derivatives play a very important role in determining the influence of parameter perturbations on the output of the performance function. Sensitivity analysis is applied to investigate how the derivatives can be used to quantify the response of the system to parameter perturbations, and how these derivatives can be calculated. Sensitivity analysis techniques differ mainly in the cost function used, the order of the derivatives that are considered, whether the analysis is in continuous time or for discrete time intervals, and the way in which the derivatives are calculated. Due to computational considerations, sensitivity analysis is based on approximations of Eq. (2.1), usually first-order or second-order approximations. The higher the order of the approximation, the more accurate but more complex and time consuming the process. Usually, sensitivity analysis is done at discrete time intervals for that time interval only. Sensitivity analysis can also be performed for continuous time models, referred to as stochastic analysis (Koda, 1995; 1997).

2.2 Neural Network Sensitivity Analysis Without loss of generality, let us consider a neural network performing a nonlinear, differentiable mapping : I → K , from input x = (x1 , x2 . . . xI ) to output o = (o1 , o2 . . . oK ). Suppose x(n) ∈ , where is an open set. Since o is differentiable at x(n) we have o(x + x) = o(x(n) ) + J(x(n) )x + g(x) where

(2.2)

2.2

Neural Network Sensitivity Analysis

⎡

19 ∂o1 ∂o1 ∂x1 ∂x2

⎢ ⎢ ∂o2 ⎢ J(x ) = ⎢ ∂x1 ⎢ .. ⎣ . (n)

∂oK ∂x1

···

∂o1 ∂xI

⎤

⎥ 2 ⎥ · · · ∂o ∂xI ⎥ .. .. .. ⎥ ⎥ . . . ⎦ ∂oK ∂oK ∂x2 · · · ∂xI ∂o2 ∂x2

(2.3)

is the Jacobian matrix and lim

x→0

g(x) =0 |x|

(2.4)

Fig. 2.2 illustrates geometrical interpretation of Eq. (2.2) in space K . Point o(x(n) ) represents the nominal response of the neural network for the nth element of the training set x(n) . The disturbance x of the input vector causes the perturbed response at o(x(n) + x). This response can be expressed as a combination of three vectors as indicated in Eq. (2.2). The sensitivity analysis can be used for different purposes in neural networks (Engelbrecht, 1999): Optimization. The calculation of the gradient of a function forms an important part of optimization. One of the first uses of sensitivity analysis is therefore in optimization problems (Cao, 1985; Holtzman, 1992). In neural networks, derivatives of the objective function with respect to the weights are computed to locate minima by driving these derivatives to 0. Second order derivatives have also been used to develop more sophisticated optimization techniques to improve convergence and accuracy. Koda (1995, 1997) employed stochastic sensitivity analysis to compute the gradient for time-dependent networks such as recurrent neural networks. Robustness. Neural network robustness and stability analysis is the study of the conditions under which the outcome of the neural network changes. This study is important for hardware implementation of neural networks to ensure stable

Fig. 2.2 Illustration of output disturbance caused by input disturbance

20

2

Principles of Sensitivity Analysis

networks that are not adversely affected by weight, external input and activation function perturbations (Alippi, Piuri and Sami, 1995; Oh and Lee, 1995). Instead of using derivatives to compute the gradient of the objective function with respect to the weights, Jabri and Flower (1991) use differences to approximate the gradient, thereby significantly reducing hardware complexity. Generalization. Fu and Chen (1993) state good generalization must imply insensitivity to small perturbations in inputs. They derive equations to compute the sensitivity of the neural network output vector to changes in input values, and show under what conditions global neural network sensitivity can be reduced. For example, using small slopes for the sigmoid activation function, using as small as possible weights, reducing the number of units, and ensuring activation levels close to 0 or 1 will reduce network sensitivity. Choi and Choi (1992) derive a neural network sensitivity norm which expresses the sensitivity of the neural network output with respect to input perturbations. This neural network norm is then used to select from sets of optimal weights the weight set with lowest neural network sensitivity, which results in the best generalization. Measure of nonlinearity. Lamers, Kok and Lebret (1998) use the variance of the sensitivity of the neural network output to input parameter perturbations as a measure of the nonlinearity of the data set. This measure of nonlinearity is then used to show that the higher the variance of noise injected to output values, the more linearized the problem. Causal inference. Sensitivity analysis has been used to assess the significance of model inputs. Engelbrecht, Cloete and Zurada (1995) use exact derivative calculations to compute the significance values which have a high influence on the neural network output. Goh (1993) derived a similar method using differences to approximate the gradient of the neural network output function with respect to inputs. Selective learning. Hunt and Deller (1995) use weight perturbation analysis to determine the inference each pattern has on weight changes during training. Only patterns that exhibit a high influence on weight changes are used for training. Engelbrecht (1998) presents new active learning models based on sensitivity analysis, which use a measure of pattern informativeness to dynamically select patterns during training. Decision boundary visualization. Goh (1993) uses an approximation to the derivatives of the neural network output function with respect to inputs to graphically visualize decision boundaries. Engelbrecht (1998) shows how exact derivative calculations can be used to locate and visualize decision boundaries. Victor (1998) uses the decision boundary algorithm to improve the accuracy of rules extracted from trained neural networks in a cooperative learning environment. Pruning. Sensitivity analysis has been applied extensively to neural network pruning. One technique is to compute the sensitivity of the objective function with respect to neural network parameters (Le Cun, 1990; Moody et al., 1995). Another method of sensitivity analysis pruning is to compute the sensitivity of the neural network output function to parameter perturbations (Zurada, 1994, 1997).

2.3

Fundamental Methods of Sensitivity Analysis

21

Learning derivatives. Basson and Engelbrecht (1999) developed a new learning algorithm for feedforward neural networks that also learns the first-order derivatives of the neural network output with respect to each input unit while learning the underlying function. The neural network consists of two parts, one representing the learned function, and the other representing the derivatives of the learned function. Concepts from sensitivity theory are used to create a training set for the training of the derivative part of the neural network using gradient descent.

2.3 Fundamental Methods of Sensitivity Analysis In system sensitivity theory, some sensitivity functions are introduced, such as the output sensitivity, the trajectory sensitivity and the performance-index sensitivity functions (Frank, 1978). All the methods of neural network sensitivity analysis can be divided into two categories, namely, geometrical approach and statistical approach. In 1962, Hoff used an n-dimensional hypersphere to model Adaline for sensitivity analysis (Hoff, 1962), which was further simplified in Glanz (1965). After two decades, Winter (1989) was the first one to derive an analytical expression for the probability of error in Madaline caused by weight perturbations. Stevenson continued Winter’s work and established the sensitivity of Madaline to weight error (Stevenson, 1990; Stevenson, Winter and Widrow, 1990). A milestone work was done by Piché, who used a statistical approach to relate the output error to the change of weights for an ensemble of Madalines, with several activation functions such as linear, sigmoid, and threshold (Piché, 1992; Piché, 1995). Figure 1.6 can be treated as a general neural network model. A neural network can have L layers, and each layer l (0 ≤ l ≤ L) has nl (nl ≥ 1) neurons. n0 stands for the input layer and nL for the output layer. Since the number of neurons in layer l −1 is equal to the output dimension of that layer, which is also equal to the input dimension of layer l, the input dimension of layer l is nl−1 . For a neuron i (1 ≤ T i ≤ nl ) in layer l, its input vector, weight vector and output are X l = x1l · · · xnl l−1 , T Wil = wli1 · · · wlinl−1 and yli = f (X l · W l ) respectively, where f(·) is an activation function. For each layer l, its input vector is Xl , its weight set is W l = {W1l · · · Wnl l }, T and its output vector is Y l = yl1 · · · ylnl . For the network, its input is the vector X1 T or Y0 , its weight is W, and its output is YL . Let X l = x1l · · · xnl l−1 and Y l = T yl1 · · · ylnl be the corresponding deviations for input and output, respectively.

2.3.1 Geometrical Approach The use of n-dimensional geometry has proven to be a valuable tool for understanding and analyzing the Adaline. The geometrical interpretation of the equation

22

2

Principles of Sensitivity Analysis

dictating the Adaline’s input-output map is the bias for most of the analysis presented in Stevenson (1990). The input vector X in a neural network can be treated as a vector from the origin to the point (x1 , x2 . . . xn ) in n-space. The point (x1 , x2 . . . xn ) will be referred to “the tip of X” in the following discussion. In n-space, points at a distance r from the point c form a hypersphere of radius r centered at c. The surface area of such a hypersphere An (r) is n √ An (r) = 2 π n · rn−1 2

(2.5)

where (•) is the Gamma function. As shown in Fig. 2.3, the connection weights, which are a vector at angle θ to the input vector X = (x1 , x2 . . . xn ) in n-space, satisfy

X·W=

n

xi wi = c

(2.6)

i=1

which is called a hyperplane for some scalar c. This hyperplane is perpendicular to the vector W and is at a distance c/|W| from the origin. Particularly, HPw is one of the hyperplanes passing through the origin with c = 0. Similarly to HPw , HPX denotes a hyperplane passing through the origin and perpendicular to the vector X. Hence, both HPw and HPX divide the hypersphere in two hemi hyperspheres, i.e., Hw + versus Hw − , and HX + versus HX − , respectively. There are four lunes intersected by + − + − these four hemi hyperspheres, HX+ ∩ HW , HX+ ∩ HW , HX− ∩ HW and HX− ∩ HW , as shown in Fig. 2.3. − + and HX− ∩HW , As the angle between X and W is θ , both the intersections, HX+ ∩HW + + − − describe lunes of angle θ whereas the intersections HX ∩ HW and HX ∩ HW both describe lunes of angle (π − θ ). The ratio of the surface content of a lune of angle θ to the surface content of the entire hypersphere is θ/2π . Assuming binary-valued inputs, there are 2n possible input patterns for an Adaline with n variable inputs. Each input pattern corresponds to a point in n√ space which lies on a hypersphere of radius n centered at the origin. The Hoff

W

X

+ + HX∩HW

θ HPX HPW + − HX∩HW

Fig. 2.3 Hypersphere approximation for sensitivity analysis

+

−

HX∩HX

π−θ

+

+

HX∩HX

2.3

Fundamental Methods of Sensitivity Analysis

23

hypersphere-area approximation states that as n gets large, the points corresponding to the n-dimensional input patterns are approximately uniformly distributed over the surface of a hypersphere in n-space. Consequently, the percentage of input patterns which correspond to points on a selected region of the hypersphere can be approximated as the ratio of the surface content of the selected region to the surface content of the entire hypersphere (Hoff, 1962).

2.3.2 Statistical Approach Let us consider the behavior of a multi-input/single-output mapping first. Let us assume that the connection weight vector changes from W∗ to W = W ∗ + W, where W indicates weight perturbations. Under the assumption of statistical weight perturbations, the statistical sensitivity can be defined as follows (Choi and Choi, 1992): ∗

S (W ) = lim p

σ →0

var[Xp (L)] σ

(2.7)

where Xp (L) is the output error, σ is the standard deviation of each component of deviation of each component of W, and var[...] is the variance. The output error vector Xp (L) in the lth layer arising from weight perturbations W is given by Xp (l) = Xp (l) − Xp∗ (l) ∼ =

l

Ck,l W ∗ , Xp∗ W(k)Xp∗ (k − 1)

(2.8)

k=1

where the Nl × Nk matrix Ck,l W ∗ , Xp∗ is defined as l ∗ ∗ ∇Tn Wn∗ ∇Tk = ∇Tl Wl∗ ∇Tl−1 Wl−1 . . . ∇Tk+1 Wk+1 (2.9) Ck,l W ∗ , Xp∗ = n=k+1

Here, Tk refers to the nonlinear transformation associated with the kth layer, and ∇Tk refers to its derivation with respect to the output of the kth layer. As in the case of the weight perturbation discussed above, the sensitivity to input perturbation can be also calculated by Eq. (2.7) with σ being the standard deviation of each component of the input perturbation Xp (0). In this respect, the output error Xp (L) in output layer L caused by input perturbation is given by: Xp (L) = CL,0 W ∗ , Xp∗ Xp (0)

(2.10)

To measure sensitivity for a multi-output neural network, typically, the sensitivity of each output neuron is calculated first, and then the final sensitivity is set as the maximum or the average value among all output neurons.

24

2

Principles of Sensitivity Analysis

Piché (1995) introduced a stochastic model for the statistical analysis of sensitivity. He assumed that (1) over the ensemble of networks, the weights of a particular layer all have the same variance, (2) the weights in the networks are statistically independent and (3) the mean value of each weight of the network over the ensemble is 0. Under these assumptions, it is possible to model the ensemble of weights associated with each node of the network as a vector of random variables. He discussed the selection of weight accuracies for Madaline using a statistical approach to sensitivity analysis. According to his stochastic model, all neurons have the same activation function. All network inputs, weights, input perturbations, and weight perturbations are random variables. The sensitivity of Madaline in this model is defined as the NSR of the output layer; the output NSR of a layer of threshold Adaline is: 2 2 σy σ2 4 σx + w (2.11) NSR = 2 = 2 σy π σx σw2 2 , σ 2 and σ 2 refer to the variances of output y, input x, where σy2 , σx2 , σw2 , σy x w weight w, output error y, input perturbation x and weight perturbation w, respectively.

2.4 Summary Sensitivity is initially investigated for the realization of a neural network to calculate its output perturbation caused by machine imprecision and noisy input. It is particularly useful and valuable to apply sensitivity analysis to the software design of networks, in which artificial perturbation is embedded in the training. In this paper, the principle of sensitivity analysis is discussed in detail, followed by systematic introduction to the advanced research in this area over the last two decades. All the existing techniques can be categorized into geometrical and statistical approaches. The former use hypersphere or hyper-rectangle model in n-dimensional space to analyze the deviation of the output vector caused by the perturbation of the input and weights. The latter calculate the sensitivity by noise-to-signal ratio or expectation of the output deviation.

Chapter 3

Hyper-Rectangle Model

In this chapter, we discuss a hyper rectangle model, instead of the traditional hypersphere, which is employed as the mathematical model to represent an MLP’s input space. The hyper-rectangle approach does not demand that the input deviation be very small as the derivative approach requires, and the mathematical expectation used in the hyper-rectangle model reflects the network’s output deviation more directly and exactly than the variance does. Moreover, this approach is applicable to the MLP that deals with infinite input patterns, which is an advantage of the MLP over other discrete feedforward networks like Madalines.

3.1 Hyper-Rectangle Model for Input Space of MLP Neurons in an MLP, like cells, are organized into layers by linking them together. Links exist only between neurons of two adjacent layers; there is no link between neurons in the same layer or in any two nonadjacent layers. All neurons in a layer are fully linked to all the neurons in the immediately preceding layer and to all the neurons in the immediately succeeding layer. At each layer except the input layer, the inputs of each neuron are the outputs of the neurons in the previous layer, X l = Y.l−1 (l ≥ 1). Through training, a hyperplane P is obtained for classification or regression

P:

n

xj wj + σ = 0 in input space

(3.1)

j=1

The computation of the sensitivity is essentially reduced to the calculation of the following integral (Zeng and Yeung, 2003): I (A,B,W,σ ) =

b1

a1

b2

a2

···

bn an

ϕ(X)

1 + exp −

n

dx1 dx2 · · · dxn . (3.2) xj wj − σ

j=1

D.S. Yeung et al., Sensitivity Analysis for Neural Networks, Natural Computing Series, C Springer-Verlag Berlin Heidelberg 2010 DOI 10.1007/978-3-642-02532-7_3,

25

26

3 Hyper-Rectangle Model P

a

c

b P

Q

P

Q

Q

Fig. 3.1 Spatial relationship between P and . (a) P does not cut . (b) P cuts the four parallel sides of . (c) P cuts a prism from

where A = {a1 , a2 . . . an } and B = {b1 , b2 . . . bn } are sets of lower and upper bounds of the integral, and W = {w1 , w2 . . . wn } is the set of the weights, with bj > aj and wj = 0 for 1 ≤ j ≤ n. It can be clearly seen that the input space can be regarded as a hyper-rectangle bounded by A and B in n-dimensional space From a geometric point of view, the hyperplane P can be regarded as a dividing plane. It divides the n-dimensional space into three parts: the points on P, the points below P, and the points above P. The possible spatial relationship between P and is illustrated in Fig. 3.1.

3.2 Sensitivity Measure of MLP Based on the above hype-rectangle model, the sensitivity measure of Eq. (3.2) can be rewritten as: ⎧ ⎪ ⎪ ⎪ ⎪ ··· ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ · · · I≈ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ · · · ⎪ ⎪ ⎩

⎛ ⎞⎞ ⎛ p n n

⎝ (− 1)k exp ⎝−k xj wj − kσ ⎠⎠d if xj wj + σ > 0 k=0

ip1 (X)d +

···

j=1

j=1

ip2 (X)d

if

n

xj wj + σ = 0

j=1

⎛ ⎛ ⎞⎞ p n n

⎝ (− 1)k exp ⎝−k xj wj + kσ ⎠⎠d if xj wj + σ < 0 k=0

j=1

j=1

(3.3)

where ip (X) denotes the first p-term Taylor expansion of the integrand. A solution to this problem is to locate the intersection points of P and those parallel lines that are the extensions of the edges of . Each line must have one and only one intersection point with P under the assumption wj = 0 for 1 ≤ j ≤n. By comparing the coordinates on a given axis of those intersection points with the lower bound and the upper bound of on that axis, it is easy to identify the intersection situations between P and . It is known that has 2n−1 edges parallel to a given coordinate axis. Take the xn -axis as an example. Each edge parallel to the xn -axis can be determined by assigning a total of n−1 coordinates (x1 . . . xn−1 ) with either xj = aj or xj = bj . Thus, the xn -coordinate for a given line’s intersection point can be calculated by

3.3

Discussion

27

⎛ ⎞ " n−1

xn = ⎝ xj wj + σ ⎠ wn

(3.4)

j=1

with its n−1 fixed coordinates, which consequently yield a set En = {en,1 , en,2 . . . en,2 n−1 } representing 2n−1 edges.

3.3 Discussion The sensitivity measure is something like an analog rule that network users can use as a tool to evaluate their network’s performance. For example, Lee and Oh (1994) indicated that in pattern classification applications, their sensitivity (misclassification probability) results could be applied to select the optimal weight set among many weight sets acquired through a number of learning trials. Among these weight sets, an MLP with the optimal set will have the best generalization capability and the best input noise immunity. Zurada, Malinowski, and Usui (1997) made use of analytical sensitivity to delete redundant inputs to an MLP for pruning its architecture. Engelbrecht and Cloete (1999) also used analytical sensitivity to dynamically select training patterns. In this chapter, sensitivity is analyzed through the hyper-rectangle model, which is especially suitable for the most popular and general feedforward network: multilayer Perceptron. The sensitivity measure is defined as the mathematical expectation of output deviation due to expected input deviation with respect to overall input patterns in a continuous interval. Based on the structural characteristics of the MLP, a bottom-up approach is adopted. A single neuron is considered first, and algorithms with approximately derived analytical expressions that are functions of expected input deviation are given for the computation of its sensitivity. Then another algorithm is given to compute the sensitivity of the entire MLP network.

Chapter 4

Sensitivity Analysis with Parameterized Activation Function

Among all the traditional methods introduced in Chap. 2, none has involved activation function in the calculation of sensitivity analysis. This chapter attempts to generalize Piché’s method by parameterizing antisymmetric squashing activation functions, through which a universal expression of MLP’s sensitivity will be derived without any restriction on input or output perturbations.

4.1 Parameterized Antisymmetric Squashing Function Any antisymmetric squashing activation function g(x) can be specified by three parameters, and the main idea of the function approximation approach is to use another function gA,B,C (x) to approximate g(x). The function has the following form: − gA,B,C (x) = g+ A,B,C (x) + gA,B,C (x)

(4.1)

where #

B − C −Ax2 , if x ≥ 0 e 2 0, if x < 0 # 0, if x ≥ 0 g− A,B,C (x) = C + B − C e−Ax2 , if x < 0 2 g+ A,B,C (x)

=

B−

(4.2)

(4.3)

The objective of function approximation is to determine the three factors A, B and C of g(x), so that the distance between the two functions gA,B,C (x) and g(x) is minimized. Here the distance is defined in terms of the L2 norm. The factors A, B and C are determined by $% ⎧ 2 +∞ ⎪ ⎪ g(x) − gA,B,C (x) dx ⎨ min 0 A

B = lim g(x) ⎪ x→+∞ ⎪ ⎩ C = −B D.S. Yeung et al., Sensitivity Analysis for Neural Networks, Natural Computing Series, C Springer-Verlag Berlin Heidelberg 2010 DOI 10.1007/978-3-642-02532-7_4,

(4.4)

29

30

4

Sensitivity Analysis with Parameterized Activation Function

1

A = +inf B=1 C = −1

0.8 0.6 0.4 g(x)

0.2

A = 0.446 B=1 C = −1

0 −0.2

A = 1.783 B=1 C = −1

−0.4 −0.6 −0.8 −1 −4

−3

−2

−1

0 x

1

2

3

4

Fig. 4.1 Approximation results of some antisymmetric squashing activation functions

Because both gA,B,C (x) and g(x) are antisymmetric about the point (0, 0), the factor A can be found by solving the minimization problem over (0, +∞) instead of (–∞, +∞,). Some commonly used antisymmetric squashing functions are g(x) = x ex −e−x or g(x) = eex −1 +1 . The approximated results are shown in Fig. 4.1. ex +e−x From Eqs. (4.1), (4.4) and Fig. 4.1, the following can be observed. (1) When a parameter changes from 0 to +∞, the function gA,B,C (x) becomes the threshold activation function, and we need not solve the threshold activation function and the sigmoidal function separately. (2) This function approximation approach still cannot avoid inaccuracy, but it helps us obtain an analytical expression of an MLP’s sensitivity for a class of activation functions. Then the effect of the activation function on the sensitivity of an MLP can be investigated. The function approximation approach is a tradeoff between accuracy and generalization.

4.2 Sensitivity Measure According to Piché (1995), the sensitivity of the jth neuron in the lth layer is calculated by the NSR of the output D(ylj ) l SR (yj ) = D(ylj ) where

(4.5)

4.3

Summary

31

D(ylj ) = D(ylj )

+∞ %

g2(αu) ·

−∞ +∞ % +∞ %

=

−∞ −∞

2 √1 e−(u /2) du − E 2 (yl ) j 2π

(g(αu + αβv) − g(αu))

2

(4.6)

2 2 · e−(u +v /2) dudv

To show the effects of this method, let us consider the sensitivity of a threshold function, i.e., A = +∞, B = 1 and C = –1. Eq. (4.5) can be calculated by (Yeung and Sun, 2002): 4 2 2 2 σw σx σw σ2 l + + · SR (yj ) = arctg x π σx2 σw2 σx2 σw2 σ2

(4.7) σ2

Comparing Eq. (4.5) with Eq. (4.7), we can see that, when σx2 and σw 2 are very x w small, they are nearly the same. But when they are large, the results are different. For the threshold activation function, the output error variance cannot exceed 2. The result derived from parameterized antisymmetric function satisfies this, but Piché’s fails. So the parameterized antisymmetric squashing function is more reasonable, and it is valid when the perturbation is either small or large.

4.3 Summary For each antisymmetric squashing activation function, there exist three factors A, B and C, so that gA,B,C (x) can approximate in the sense of the L2 norm. So we can use a vector (A, B, C) to represent an antisymmetric squashing activation function. That is to say that we can treat the antisymmetric squashing activation function as a parameter involved in the sensitivity of the MLP.

Chapter 5

Localized Generalization Error Model

The generalization error bounds found by current error models using the number of effective parameters of a classifier and the number of training samples are usually very loose. These bounds are intended for the entire input space. However, support vector machines, radial basis function neural networks and multilayer Perceptron neural networks are local learning machines for solving problems and thus treat unseen samples near the training samples as more important. In this chapter, we describe a localized generalization error model which bounds the generalization error from above within a neighborhood of the training samples using a stochastic sensitivity measure (Yeung et al., 2007 and Ng et al., 2007). This model is then used to develop an architecture selection technique for a classifier with maximal coverage of unseen samples by specifying a generalization error threshold.

5.1 Introduction For a pattern classification problem, one builds a classifier fθ to approximate or mimic the unknown input-output mapping function F, where θ is the set of parameters selected from a domain . For the model presented here, the mean-square-error (MSE) is used to measure the difference between fθ and F. The MSE is widely applied to training real-value output classifiers like neural networks, which classify a given sample by thresholding the real-value classifier output. The behavior of a classifier trained by minimizing an MSE, compared to the one trained by minimizing a classification error (0-1 loss function), is different. When two classifiers both yield the same, but very small percentage of training classification error, the one which yields a larger training MSE would produce indecisive outputs that deviate more from the target outputs. So a small change in the inputs may change the classification results (Anthony and Bartlett, 1999). This is not desirable and it indicates that the training classification error is not a good benchmark for the generalization capability of a classifier. Therefore, selecting a classifier using the training classification error or its bound may not be appropriate. The classification error for the entire input space is defined as (5.1) Rtrue = ( fθ (x) − F (x))2 p (x)dx T

D.S. Yeung et al., Sensitivity Analysis for Neural Networks, Natural Computing Series, C Springer-Verlag Berlin Heidelberg 2010 DOI 10.1007/978-3-642-02532-7_5,

33

34

5

Localized Generalization Error Model

where x denotes the input vector of a sample in the entire input space T, and p (x) denotes the true unknown probability density function of the input x. Given a training dataset D containing N training input-output pairs D = {(xb ,F (xb ))}N b=1 , a classifier fθ could be constructed by minimizing the empirical risk (Remp ) over D, where Remp =

N 1

(fθ (xb ) − F (xb ))2 N

(5.2)

b=1

The ultimate goal of solving a pattern classification problem is to find the fθ that is able to correctly classify future unseen samples (Haykin, 1999; Vapnik, 1998). The generalization error (Rgen ) is defined as: Rgen = T\D

(fθ (x) − F (x))2 p (x)dx

(5.3)

Since both target outputs and distributions of the unseen samples are unknown, it is impossible to compute the Rgen directly. There are two major approaches to estimate the Rgen , namely, an analytical model and cross-validation (CV). In general, analytical models cannot distinguish trained classifiers having the same number of effective parameters, but with different values of parameters. Thus they yield loose error bounds. The Akaike information criterion (AIC) (Akaike, 1974) only makes use of the number of effective parameters and the number of training samples. The Network Information Criterion (NIC) (Park et al., 2004) is an extension of the AIC for application in regularized neural networks. It defines classifier complexity by the trace of the second-order local derivatives of the network outputs with respect to connection weights. Unfortunately, current analytical models are only good for linear classifiers due to the singularity problem in their models. A major problem with these models is the difficulty of estimating the number of effective parameters of the classifier. This could be solved by using the VapnikChervonenkis (VC) dimensions (Vapnik, 1998). However, only loose bounds of VC dimensions could be found for nonlinear classifiers, thus severely limiting the applicability of analytical models to nonlinear classifiers, except for the support vector machine (SVM). Although CV uses true target outputs for unseen samples, it is time consuming for large datasets and, for k-fold CV and L choices of classifier parameters, kL classifiers must be trained. CV methods estimate the expected generalization error instead of its bound. Thus they cannot guarantee that the classifier finally constructed will have good generalization capability. Many classifiers, e.g., SVM, multilayer Perceptrons (MLPNN) and radial basis function neural networks (RBFNNs) are local learning machines. An RBFNN, by its nature, learns classification locally and every hidden neuron captures local information of a particular region in the input space defined by the center and width of its Gaussian activation function (Moody and Darken, 1989). A training sample

5.2

The Localized Generalization Error Model

35

located far away from a hidden neuron’s center does not affect the learning of this hidden neuron. An MLPNN learns decision boundaries for the classification problem in input space using the location of the training samples. However, as pointed out by Chakraborty and Pal (2001), the MLPNN output responses to unseen samples far away from the training samples are likely to be unreliable. So they proposed a learning algorithm which deactivates any MLPNN response to unseen samples much different from the training samples. This observation is further supported by the fact that, in many interesting industrial applications, such as aircraft detection in SAR images, character and fingerprint recognition (Jain and Venuri, 1999), and so on, the most significant unseen samples are expected to be similar to the training samples. On the other hand, an RBFNN is one of the most widely applied neural networks for pattern classification, with its performance primarily determined by its architecture selection. Haykin (1999) summarizes several training algorithms for an RBFNN. For instance, a two-stage learning algorithm may be a quick way to train an RBFNN. The first, unsupervised stage is to select the center positions and widths for the RBF using self-organizing map or k-means clustering. The second, supervised stage computes the connection weights using the least mean square method or pseudoinverse technique. Some have proposed that all training samples be selected as centers (Haykin, 1999). Mao (2002) based the selection of centers on the separability of the datasets. The experimental results of Mao (2002) indicate that the choice of the number of hidden neurons indeed affects the generalization capability of the RBFNNs and that an increase in the number of hidden neurons does not necessarily lead to a decrease in testing error. The selection of the number of hidden neurons affects the selection of an RBFNN architecture and ad hoc choices or sequential search are frequently used. In this chapter we describe a localized generalization error model RSM using the stochastic sensitivity measure (ST-SM), which bounds the generalization error from above for unseen samples within a predefined neighborhood of the training samples (Yeung et al., 2007 and Ng et al., 2007). In addition, an architecture selection method based on the RSM is proposed to find the maximal coverage classifier with its RSM bounded by a preselected threshold. An RBFNN will be used to demonstrate the use of the RSM and the architecture selection method. We introduce the localized generalization error model and its corresponding architecture selection method in Sects. 5.2 and 5.3 respectively.

5.2 The Localized Generalization Error Model Two major concepts of the RSM , the Q-neighborhood and the stochastic sensitivity measure, are introduced in Sects. 5.2.1 and 5.2.3 respectively. The derivation of the localized generalization error model is given in Sect. 5.2.2 and its characteristics are discussed in Sect. 5.2.4. Section 5.2.5 discusses the method to compare two classifiers using the localized generalization error model.

36

5

Localized Generalization Error Model

5.2.1 The Q-Neighborhood and Q-Union For every sample xb ∈ D, one finds a set of samples x which fulfills 0 < |xi | < Q ∀i = 1 · · · n, where n denotes the number of input features, x = (x1 · · · xn ) = x − xb and Q is a given real number. In a pattern classification problem, one usually does not have any knowledge about the distribution of the true input space. Therefore, without any prior knowledge, every unseen sample has the same chance to appear. So, x may be considered as input perturbations that are random variables having zero-mean uniform distributions. TQ (xb ) = {x | x = xb + x; |xi | ≤ Q

∀i = 1, · · · , n }

(5.4)

Then TQ (xb ) defines a Q-neighborhood of the training sample xb Let TQ be the union of all TQ (xb ) and call it the Q-Union. All samples in TQ (xb ), except xb , are considered as unseen samples (i.e., TQ (xb ) contains no training point other than xb ). For 0 ≤ Q1 ≤ · · · ≤ Qk ≤ ∞, the following relationship holds: D ⊆ TQ1 ⊆ · · · ⊆ TQk ⊆ T

(5.5)

One should note that the shape of the Q-neighborhood is chosen to be a hypercube for ease of computation, but it could also be a hypersphere or other shape. Moreover, in the localized generalization error model, the unseen samples could be selected from a distribution other than a uniform one. Only the derivation of the ST-SM needs to be modified and the rest of the model will remain the same.

5.2.2 The Localized Generalization Error Bound Instead of finding a bound for the generalization error for unseen samples in the entire input space T (Rtrue ), we find a bound on RSM , which is the error for unseen samples within TQ only, i.e., the shaded area in Fig. 5.1. We ignore the generalization error for unseen samples that are located far away from training samples (Rres in Eq. (5.6)). Note that Rres decreases when Q increases.

Fig. 5.1 An Illustration of Q-Union (TQ ) with 20 training samples. The Xs are training samples and any point in the shaded area is an unseen sample

5.2

The Localized Generalization Error Model

37

RSM (Q) = Rtrue − Rres (Q) = TQ

(fθ (x) − F (x))2 p (x) dx

(5.6)

errθ (xb )=fθ (xb )−F (xb ) ; then Let y = fθ (x)−fθ (xb ) , N % N

2 1 (errθ (xb ))2 , ETQ (y)2 = N1 Remp = (1/N) TQ (xb ) (y) (2Q)n dx, b=1 b=1 √ ε = B ln η/ (−2 N); and let A, B and N be the difference between the maximum and minimum values of the target outputs, the maximum possible value of the MSE and the number of training samples, respectively. In this work, we assume that the range of the desired output, i.e., the range of F, is either known or assigned to a preselected value. Moreover, B is computable because the range of the network outputs will be known after the classifier is trained. In general, one expects that the error of unseen samples will be larger than the training error, so we assume that the average of errors of unseen samples in TQ (xb ) will be larger than the training error of xb . By the Hoeffding inequality (Hoeffding, 1963), the average of the square errors of samples with the same population mean converges to the true mean with the following rate of convergence. With a probability of(1 − η), we have (Yeung et al., 2007 and Ng et al., 2007):

RSM (Q) ≤

N 1

1 dx + ε (fθ (x) − F (x))2 N (2Q)n TQ (xb ) b=1

N 1

1 = dx+ε (fθ (x)−fθ (xb ) + fθ (xb ) − F (xb ) + F (xb ) − F (x))2 N (2Q)n TQ (xb ) b=1

N 1 1

dx ≤ (y)2 N (2Q)n TQ (xb ) b=1

+

N N 1 1 1

1

2 dx+ )−F(x)) dx+ε (errθ (xb))2 (F(x b N N (2Q)n (2Q)n TQ (xb ) TQ (xb ) b=1

b=1

N N 1

1 1 1 dx dx (y)2 (errθ (xb ))2 + 2 N N (2Q)n (2Q)n TQ (xb ) TQ (xb ) b=1

b=1

N N 1

1

1 1

2 2 +2 dx dx (errθ (xb )) (F (xb )−F (x)) N N (2Q)n (2Q)n TQ (xb ) TQ (xb ) b=1

b=1

38

5

Localized Generalization Error Model

N N 1

1 1 1 2 2 +2 dx dx (y) (F (xb )−F (x)) N N (2Q)n (2Q)n TQ (xb ) TQ (xb ) b=1

≤

&

Remp +

b=1

$

'2

ETQ (y)2 + A + ε

= R∗SM (Q)

(5.7)

Both A and ε are constants for a given training dataset when an upper bound of the classifier output values is preselected. The R∗SM is an upper bound for the MSE of the trained classifier for unseen samples within the Q-union. This bound is better than those regression error bounds (based on AIC and VC-dimension) that are defined by using only the number of effective parameters and training samples, while ignoring statistical characteristics of a training dataset such as mean and variance. Moreover, those error bounds usually grow quickly with the increase of the number of effective parameters, e.g., number of hidden neurons in an RBFNN and VC-Dimension, while the R∗SM grows much slower. The term ETQ (y)2 will be discussed in Sect. 5.2.3. Further discussion on the characteristics of the R∗SM will be given in Sect. 5.2.4.

5.2.3 Stochastic Sensitivity Measure for RBFNN The output perturbation (y) measures the network output difference between the training sample (xb ∈ D) and the unseen sample in its Q-neighborhood ((xb + x) ∈ TQ (xb )). Thus, the ST-SM measures the expectation of the squares of network output perturbations (y) between training samples in D and unseen samples in TQ . The Sensitivity Measure (SM) of a neural network (Ng and Yeung, 2002; Ng et al., 2007) gives quantified data on the change of network outputs with respect to change of network inputs. Intuitively, it measures how sensitive the network output is to the input change. Ng et al. (2007) allowed every input or weight to have its own mean and variance, and the input and weight perturbations are allowed to be arbitrary. Hence the perturbed samples (x in Sect. 5.2.1) can be considered as unseen samples around the training samples (xb ). Ng and Yeung (2002) developed an analytical formula of the ST-SM for a Gaussian activation function RBFNN that is independent of the number of training samples. We assume the inputs are independent and not identically distributed and weight perturbations are not considered in this chapter. So, every input feature has its own expectation μxi and variance σx2i . The input perturbation of the ith input feature is a random variable having a uniform 2 . The centers and widths of the hidden distribution with mean 0 and variance σx i neurons are constant and the connection weights are fixed beforehand. An RBFNN could be described as

5.2

The Localized Generalization Error Model

fθ (x) =

M

wj exp

39

( ( (x − uj (2

j=1

−2v2j

=

M

wj φj (x)

(5.8)

j=1

where M, u and vj denote the number of hidden neurons, the center and width of the jth RBFNN hidden neuron respectively, and wj denotes the connection weight between jth hidden) neuron its corresponding output neuron. Let and the( ( (2 (2 ) 2 ( (2

2 2v4j − E (x − uj ( vj , E (x − uj ( = ϕj = wj exp Var (x − uj ( & n ' n

2 2 4 2 2 σxi + μxi − uji , υj = ϕj σxi + μxi − uji /vj , ζj = ϕj /v4j , i=1 i=1 ⎛ * 4 + 2 2

2 ⎞ ( n xi − μxi − σxi + 4σx2i μxi − uji ED (2 * ⎠, ⎝ Var (x − uj ( = 3 + xi − μxi μxi − uji +4ED i=1 p (x) denotes the probability density function of the input perturbations and p (x) = 1/ (2Q)n , n is the number of input features and uji denotes the ith input fea

ture of the jth center of the hidden RBF neuron (uj = uj1 , · · · ,ujn ). For uniformly 2 = (2Q)2/12 = Q2/3 . Theoretically, distributed input perturbations, we have σx i we do not restrict the distribution of the input perturbations as long as the variance 2 ) is finite. However, uniform distribution is assumed of the input perturbation (σx i here because without any prior knowledge on the distribution of unseen samples around the training samples, we assume that all of them have an equal chance of occurrence. By the law of large numbers, when the number of input features is not too small, φj (x) would have a lognormal distribution. So, the RBFNN ST-SM is given by: N 1

2 ETQ (y) = (fθ (xb + x) − fθ (xb ))2 p (x) dx N TQ (xb ) b=1

⎞⎤ ⎛ ) ) n n

2 4 − 2 σ2 2 2 + μ −u 2 +0.2σ 2 σ 2v σ − ⎟⎥ 2v exp 4 ⎢ ⎜ xi ji M ⎢ ⎜ xi j j xi xi xi

⎟⎥ ⎟⎥ ⎢ϕj ⎜ i=1 i=1 = ⎟⎥ ⎢ ⎜ ) ) n n

2 ⎠⎦ ⎣ ⎝ 2 2 2 j=1 2 exp 2v2j + 1 σx2i + μxi − uji + 0.2σx 2v4j − σx σx ⎡

i=1

≈

M

j=1

ϕj

i

i

n

2 σx i

σx2i

i=1

i

) 2

2 4 vj + μxi − uji + 0.2σxi

i=1

=

0.2 4

1 2

Q Q n υj + ζj 3 9 M

M

j=1

j=1

(5.9)

40

5

Localized Generalization Error Model

5.2.4 Characteristics of the Error Bound From Eq. (5.7), one may notice that the R∗SM consists of three major components:

training error (Remp ), ST-SM (ETQ (y)2 ) and the constants. The constants A and ε are preselected when the confidence of the bound (1 − η) and the training dataset is fixed. Moreover, the constant B in ε could be preselected when the classifier type is selected by fixing the maximum classifier output bound. ε is generally very small for large N. So, they will not affect the result of comparisons of the generalization capability between classifiers. In contrast, if the classifier could not generalize the training samples, one may not expect the classifier to have good generalization capability for future unseen samples. Thus the training error is one of the key components of the R∗SM . Furthermore, the ST-SM term measures the output fluctuations of the classifier. A classifier having high output fluctuations yields a high ST-SM because its output varies dramatically when the input value changes. So, due to the classifier Bias/Variance Dilemma, a classifier yielding a good generalization capability should minimize both terms or achieve a good balance between the two (Geman and Bienenstock, 1992). An interesting question is, can R∗SM be an effective mechanism for studying a classifier’s Bias/Variance Dilemma? Limiting cases of R∗SM (Q). Obviously, when Q → ∞, TQ → T and Q → 0, TQ →0. For0 ≤ Q1 ≤ · · · ≤ Qk ≤ (Q → ∞), the relationship D ⊆ TQ1 ⊆ · · · ⊆ TQk ⊆ T holds. We further extend this relationship to:

Remp the limiting case of R∗SM with Q → 0 ≤ R∗SM (Q1 )

≤ · · · ≤ R∗SM (Qk ) ≤ Rtrue R∗SM with Q → ∞

(5.10)

This relationship shows that the limiting case of R∗SM (Q) with Q → ∞ bounds from above the Rtrue . For Q → ∞, Rtrue = T

(fθ (x) − F (x))2 p (x) dx

= TQ

(fθ (x) − F (x))2 p (x) dx

= RSM (Q) ≤ R∗SM (Q)

(5.11)

Moreover, for the limiting case of Q → 0 , R∗SM (Q) bounds Remp from above. When

Q → 0, ETQ (y)2 vanishes, and we have Remp ≤

2 Remp + A + ε ≤ R∗SM (Q)

(5.12)

5.2

The Localized Generalization Error Model

41

R∗SM for other classifiers. The R∗SM as well as the RSM , could be defined for any classifier trained with MSE. Examples include feedforward neural networks like multilayer Perceptron neural networks, support vector machines and recurrent neural networks such as Hopfield networks. The R∗SM for other types of classifier could be defined by rederiving the ST-SM term for the particular type of classifier concerned. Independence of training method. The R∗SM is determined without regard of the training methods being used. Only the parameters of the finally trained classifier are used in the model. Hence the R∗SM model could also be used to compare different training methods in terms of the generalization capability of the classifiers being built. Time complexity. The ST-SM has a time complexity of O(Mn). The computational complexities of both the ST-SM and the R∗SM are low and they do not depend on the number of training samples (N). However, similarly to all other architecture selection methods, R∗SM requires a trained classifier and the time for architecture selections is dominated by the classifier training time. Therefore, the proposed architecture selection method may not have a large advantage in terms of speed in comparison to other architecture selection methods, except the two cross-validation-based methods. Limitations of the localized generalization error model. The major limitation of the localized generalization error model is that it requires a function (i.e., fθ ) for the derivation of the ST-SM. Classifiers such as rule-based systems and decision trees may not be able to make use of this concept. It will be a challenging task to find ways to determine the R∗SM for these classifiers. Another limitation of the present localized generalization error model is due to the assumption that unseen samples are uniformly distributed. This assumption is reasonable when there is no a priori knowledge of the true distribution of the input space and hence every sample may have the same probability of occurrence. One would need to re-derive a new R∗SM when a different distribution of the input space is assumed. We also remark that even if the distribution of the unseen samples is known, their true target outputs are still unknown and hence it will be difficult to judge how good the bound is. On the other hand, if both input and Q-Union distributions are the same, the localized generalization error model is expected to be a good estimate of the generalization error for the unseen samples, but this needs further investigation. R∗SM and regularization. From Eq. (5.9), one may notice that the connection weight magnitude (w2j in the ϕj ) is directly proportional to the ST-SM and thus the R∗SM . This provides a theoretical justification that, by controlling the magnitude of the connection weights between hidden and output layers, one could reduce the generalization error which is essential to the regularization of neural network learning (Haykin, 1999). Predicting unseen samples outside the Q-Union. In practice, some unseen samples may be located outside the Q-Union. This may be due to the fact that the Q value is too small and thus the Q-Union covers only very few unseen samples. However, expanding the Q value will lead to a larger R∗SM , because more dissimilar unseen

42

5

Localized Generalization Error Model

samples are included in the Q-Union, and a classifier with a very large R∗SM upper bound may not be meaningful. Furthermore, the R∗SM bounds from above the MSE of the unseen samples, and the MSE is an average of the errors. Thus, the R∗SM bound may still work well even if some of the unseen samples are located outside the Q-Union. When splitting the training and testing datasets, naturally some unseen testing samples may fall outside the Q-Union. However, our experiments show that the generalization capability of the RBFNNs selected by using the R∗SM is still the best in terms of testing accuracy when compared with other methods. On the other hand, if there is a large portion of unseen samples located outside the Q-Union, i.e., dissimilar to the training samples, one may consider revising the training dataset to include more such samples and retrain the classifier. As mentioned before, classifiers may not be expected to classify unseen samples that are totally different from the training set.

5.2.5 Comparing Two Classifiers Using the Error Bound One way to compare two classifiers is to fix the R∗SM (Q) value and compare the difference in their Q values. The other way is to fix the Q value and compare the R∗SM (Q) of the two classifiers. Assume there are two classifiers, f1 and f2 . There exists a Q1 for f1 yielding R∗SM (Q1 ) = a and a Q2 for f2 yielding R∗SM (Q2 ) = a. If Q1 < Q2 , then f2 has a better generalization capability because f2 covers more unseen samples, but still has the same generalization error upper bound. In other words, the architecture selection could be done by searching the functional space θ ∈ and seeking the fθ with the largest Q producing R∗SM (Q) = a. This is an important property of R∗SM (Q) and we will make use of it to find the optimal classifier with the largest coverage in the next section. On the other hand, one could compare the two classifiers, f1 and f2 , based on their R∗SM (Q) computed using the same Q value. The classifier with the lower R∗SM (Q) is expected to have a better generalization capability. All other comparison methods in which neither Q nor R∗SM (Q) is fixed may not be meaningful.

5.3 Architecture Selection Using the Error Bound The selection of the number of hidden neurons in the RBFNN is usually done by Sequential Learning (Huang et al., 2005) or by ad hoc choice. The Sequential Learning technique only makes use of the training error to determine the number of hidden neurons, without any reference to the generalization capability. Moreover, Huang et al. (2005) and Liang et al. (2006) assume the classifier does not have prior knowledge about the number of training samples while Kaminski and Strumillo (1997) and Gomm and Yu (2000) assume that it does. For ease of comparison with other architecture selection methods, we assume that the number of training samples

5.3

Architecture Selection Using the Error Bound

43

in our experiments is known to the classifier. In this section, we describe a new technique based on R∗SM to find the optimal number of hidden neurons that makes use of the generalization capability of the RBFNN. For any given threshold a on the generalization error bound (R∗SM ), the localized generalization error model allows us to find the best classifier by maximizing Q, assuming that the MSE of all samples within the Q-Union is smaller than a. One can formulate the architecture selection problem as a Maximal Coverage Classification problem with Selected Generalization error bound (MC2 SG), i.e., Q

max θ∈

subject to R∗SM (Q) ≤ a

(5.13)

In the RBFNN training algorithm presented in Sect. 5.3.2, once the number of hidden neurons is fixed, the center positions and widths could be estimated by any automatic clustering algorithm such as k-means clustering, a self-organizing map or hierarchical clustering. So we only need to concentrate on the problem of determining the number of hidden neurons. This means that θ = M and = {1, 2. . . N} because it is not reasonable to have the number of hidden neurons larger than the number of training samples. Eq. (5.13) represents a two-dimensional optimization problem. The first dimension is the number of hidden neurons (θ ) and the second dimension is the Q for a fixed θ . For every fixed θ and Q, we can determine an R∗SM (Q). The two parameters, θ and Q, are independent. Furthermore, by substituting Eq. (5.9) into Eq. (5.7), with probability (1 − η) we have the R∗SM for an RBFNN as follows, ⎞2 ⎛ M M

1 0.2 υj + ζj + Remp + A⎠ + ε Q4 n R∗SM (Q) ≈ ⎝ Q2 3 9 j=1

(5.14)

j=1

Let R∗SM (Q) = a. For every θ , let the Q that satisfies R∗SM (Q) = a be Q∗ , where a is a constant real number. R∗SM (Q) = a exists because the second-order derivative of Eq. (5.14) is positive. We could solve Eq. (5.14) as follows, Q4

M M

√ 2 0.2

ζj + Q2 υj − 3 a − ε − Remp − A = 0 n 3 j=1

(5.15)

j=1

Equation (5.15) could be solved by the quadratic equation and two solutions will be found for Q2 . For a ≥ ε (ε is usually a very small constant when the number of samples is large), there will be one positive and one negative real solution for Q2 because in Eq. (5.15), the coefficients for the terms Q4 and Q2 are positive, but the constant term is negative. Also, there will be two real and two imaginary solutions for Q. Let Q∗ be the only positive real solution among the four. Note that Q is defined to be the width of the Q-neighborhood and as such it must be a nonnegative real number.

44

5

∗

h M,Q

=

Localized Generalization Error Model

0 Remp ≥ a Q∗ else

(5.16)

So, for RBFNN architecture selection, Eq. (5.13) is equivalent to: max

M∈

h M,Q∗

(5.17)

5.3.1 Parameters for MC2 SG In Eq. (5.7), the differences between the maximum and minimum values of target outputs (A) and the number of training samples (N) are fixed for a given training dataset, and the maximum possible value of the MSE and the confidence level of the R∗SM bound, namely, B and η, could also be selected before any classifier training. In a K-class classification problem, one may select F (x) ∈ {(k1 ,k2 · · · kK )} where K ki = 1. ki = 1 if the sample belongs to the ith class, and one ki ∈ {0,1} and i=1

minimizes the MSE of all the RBFNN outputs simultaneously. All the F (x), fθ (x) and R∗SM are vectors and thus the sum of R∗SM values of all the K RBFNN outputs are minimized in the MC2 SG. One may notice that the minimization of the sum of the R∗SM values of all the outputs is equivalent to the minimization of the average of them. However, the average of the R∗SM values may provide a better interpretation and its range is not affected by the value K. The determination of the constant a is made according to the classifier’s output schemes for classification. For instance, if class outputs are different by 1, then a may be selected as 0.25 because a sample is misclassified if the square of its deviation from the target output is larger than 0.25. From Eq. (5.16), one may notice that the smaller the values of a is, the larger the number of hidden neurons selected by the MC2 SG. This is because a larger number of hidden neurons is needed to reduce the training error. On the other hand, if the a value is selected to be larger than 0.25, the effect on the architecture selection will be insignificant. Experimental results show that the RBFNNs yielding training error larger than 0.25 will not yield a good generalization capability, i.e., will yield poor testing accuracies (Yeung et al., 2007).

5.3.2 RBFNN Architecture Selection Algorithm for MC2 SG The solution of Eq. (5.17) is realized using the following selection algorithm. The MC2 SG is independent of any training algorithm of the RBFNN. Steps 2 and 3 represent unsupervised learning to find the centers and widths of the RBFs in the hidden neurons and Step 4 finds the least-square solution of the connection weights by making use of the linear relationship, shown in Eq. (5.8), between the hidden

5.4

Summary

45

neuron outputs and RBFNN outputs. Other training algorithms could be adopted in Steps 2, 3 and 4 in the following Architecture Selection Algorithm. The Architecture Selection Algorithm of the MC2 SG: 1. Start with M = 1 (M denotes the number of hidden neurons), 2. Execute k-means clustering algorithm to find the centers for the M hidden neurons, 3. For each of the M RBF hidden neurons, select its width value to be the distance between that center and its nearest hidden neuron, 4. Compute the connection weights using a pseudoinverse method, 5. Compute the Q-value for the current RBFNN using Eq. (5.16), 6. If the stopping criterion is not fulfilled, M = M + 1 and go to Step (2). The stopping criterion could be selected as “M is equal to the number of training samples” and this will allow the MC2 SG to search for all possible hidden neurons. However, it is computationally prohibitive for large datasets and thus we will discuss a heuristic stopping criterion in Sect. 5.3.3. Moreover, a constructive approach is employed here because it is more efficient to start the search with one hidden neuron, and with each iteration add one hidden neuron.

5.3.3 A Heuristic Method to Reduce the Computational Time for MC2 SG As in the other methods, h(M, Q∗ ) is generally not differentiable with respect to M (not a smooth function). One must try out all possible M values in order to find the optimal solution. Our experimental results show that h(M, Q∗ ) drops to 0 when the classifier becomes too complex, i.e., M is too large. Heuristically an early stop could be made to reduce the number of classifier training iterations when Q approaches 0. In our experiments, we stop the search when the Q values drop below a threshold. In fact, Q does not increase significantly after it drops below 10% of the maximum value of Q being found, and thus this is used as the threshold to speed up the MC2 SG.

5.4 Summary In this chapter, we described a new generalization error model based on the localized generalization error. R∗SM bounds the generalization error from above for unseen samples within the Q-neighborhoods of the training samples. Moreover, an architecture selection method, namely MC2 SG based on the R∗SM , is proposed to find the RBFNN classifier that has the largest coverage of unseen samples, while it’s R∗SM is still less than a preselected threshold. The R∗SM was shown to be a generalization of Rtrue .

46

5

Localized Generalization Error Model

This chapter has demonstrated the use of the MC2 SG to find the number of hidden neurons for an RBFNN, while the values of other parameters were found using existing methods. A possible extension of our result is to find the values of other RBFNN parameters, e.g., center positions, widths and connection weights, via an optimization of R∗SM because these parameters are also coefficients of the R∗SM . However, the trade-off between the optimality of the solution and time complexity will be an important consideration.

Chapter 6

Critical Vector Learning for RBF Networks

One of the most popular neural network models, the radial basis function (RBF) network attracts a lot of attention due to its improved approximation ability as well as the construction of its architecture. Bishop (1991) concluded that an RBF network can provide a fast, linear algorithm capable of representing complex nonlinear mappings. Park and Sandberg (1993) further showed that an RBF network can approximate any regular function. In a statistical sense, the approximation ability is a special case of statistical consistency. Hence, Xu et al. (1994) presented upper bounds for the convergence rates of the approximation error of RBF networks, and constructively proved the existence of a consistent point-wise estimator for RBF networks. Their results can be a guide to optimize the construction of an RBF network, which includes the determination of the total number of radial basis functions along with their centers and widths. This is an important problem to address because the performance and training of an RBF network depend very much on these parameters.

6.1 Related Work There are three ways to construct an RBF network, namely, clustering, pruning and critical vector learning. Bishop (1991) and Xu (1998) followed the clustering method, in which the training examples are grouped and then each neuron is assigned to a cluster. The pruning method, such as Chen et al. (1991) and Mao (2002), creates a neuron for each training example and then prunes the hidden neurons by example selection. The critical vector learning method, outlined by Schölkopf et al. (1997) constructs an RBF with the critical vectors, rather than cluster centers. Moody and Darken (1989) located an optimal set of centers using both the k-means clustering algorithm and learning vector quantization. The drawback of this method is that it considers only the distribution of the training inputs, yet the output values influence the positioning of the centers. Bishop (1991) introduced the Expectation-Maximization (EM) algorithm to optimize the cluster centers in two steps: obtaining of initial centers by clustering and optimization of the basis D.S. Yeung et al., Sensitivity Analysis for Neural Networks, Natural Computing Series, C Springer-Verlag Berlin Heidelberg 2010 DOI 10.1007/978-3-642-02532-7_6,

47

48

6 Critical Vector Learning for RBF Networks

functions by applying the EM algorithm. Such a treatment actually does not perform maximum likelihood learning but a suboptimal approximation. Xu (1998) extended the model for a mixture of experts to estimate basis functions, output neurons and the number of basis functions all together. The maximum likelihood learning and regularization mechanism can be further unified to his established Bayesian Ying Yang (BYY) learning framework (Xu, 2004a, 2004b, 2004c), in which any problem can be decomposed into Ying space or an invisible domain (e.g., the hidden neurons in RBFs), and Yang space or a visible domain (e.g., the training examples in RBFs), and the invisible/unknown parameters can be estimated through harmony learning between these two domains. Chen et al. (1991) proposed orthogonal least square (OLS) learning to determine the optimal centers. The OLS combines the orthogonal transform with the forward regression procedure to select model terms from a large candidate term set. The advantage of employing an orthogonal transform is that the responses of the hidden layer neurons are decorrelated so that the contribution of individual candidate neurons to the approximation error reduction can be evaluated independently. However, the original OLS learning algorithm lacks generalization and global optimization abilities. Mao (2002) employed OLS to decouple the correlations among the responses of the hidden units so that the class separability provided by individual RBF neurons can be evaluated independently. This method can select a parsimonious network architecture as well as centers providing large class separation. The common feature of all the above methods is that the radial basis function centers are a set of the optimal cluster centers of the training examples. Schölkopf et al. (1997) calculated support vectors using a support vector machine (SVM), and then used these support vectors as radial basis function centers. Their experimental results showed that the support-vector-based RBF outperforms conventional RBFs. Although the motivation of these researchers was to demonstrate the superior performance of a full support vector machine over either conventional or support-vector-based RBFs, their idea of critical vector learning is worth borrowing. This chapter describes a novel approach to determining the centers of RBF networks based on sensitivity analysis.

6.2 Construction of RBF Networks with Sensitivity Analysis An RBF classifier is a three-layer neural network model, in which an N-dimensional input vector x= (x1 x2 . . . xN ) is broadcast to each of K neurons in the hidden layer. Each hidden neuron produces an activation function, typically a Gaussian kernel: hi = exp −

x − ci 2 2σi2

, i = 1,2, . . . ,K

(6.1)

6.2

Construction of RBF Networks with Sensitivity Analysis

49

where ci and σi2 are the center and width of the Gaussian basis function of the i th hidden unit, respectively. The units in the output layer have interconnections with all the hidden units. The j th output neuron has the form: K

fj (x) = wj h =

wij exp −

i=1

x − ci 2

2σi2

(6.2)

where h = (h1 h2 . . . hK ) is the input vector from the hidden layer, and the wij is the interconnection weight between the j th output neuron and the i th hidden neuron. In this section, the RBF classifier’s sensitivity is defined as the mathematical expectation of the square of output deviations caused by the perturbation of RBF centers. An algorithm will be given that can be used to select critical vectors.

6.2.1 RBF Classifiers’ Sensitivity to the Kernel Function Centers We use symbols cˆ i and σˆ i to denote the values of center and width of the i th hidden neuron under a perturbation. Then the deviation resulting from this perturbation is: ˆ j hˆ − wj h = yj = w

( ( K (x − cˆ i (2

x − ci 2 − (6.3) wˆ ij exp − wij exp − 2σˆ i2 2σi2 i=1 i=1

K

Here, cˆ i = ci +ci are the centers deviated from the centers under the perturbations, ˆ j = wj + wj , where and the interconnection weights under the perturbations are w wj can be calculated using a pseudo-matrix inversion, or data training. Although there are ways to specify RBF widths, such as the method of Xu et al. (2004), the most common method for selecting RBF widths is to make all of them equal to a constant value depending on the prior knowledge of the given application. With predefined RBF widths, we just focus on the perturbations on the centers and their interconnection weights in this chapter. The perturbation on the i th RBF center and the weights connected to the j th output, ci and wj , can be generated following a Gaussian distribution with 0 means, variances σci and σwj , respectively: p(ci ) = √

1

2π σci

p(wj ) = √

N

1 2π σwj

' & cT c exp − 2σi 2 i

K

ci

' & wT wj exp − 2σj 2

(6.4)

wj

where N is the dimension of the input x, and K is the number of RBF centers. The RBF centers will be selected recursively in the next subsection. To make the sensitivity analysis cater to the construction of RBF networks, a recursive definition of sensitivity is given below. At the K th time, suppose there are a number K–1 of RBF centers fixed already, and the newcomer ci is observed. Hence, the j th

50

6 Critical Vector Learning for RBF Networks

output neuron’s sensitivity to the current number K of RBF centers is defined as the mathematical expectation of (yj )2 (square of output deviations caused by the perturbations of RBF centers) with respect to all ci and the training example set D = {xl }Ll=1 , which is expressed as (K)

Sj %

= E[(yj )2 ] = K 2 m w ˆ mj w ˆ nj exp − xl −cm2σ−c − p(w)p(c) 2 xl ∈D m,n=1

% −2 p(w)p(c)

m

K

xl ∈D m,n=1

%

+ p(w)p(c) =I1 − 2I2 + I3

K

xl ∈D m,n=1

xl −cn −cn 2 2σn2

2 m wˆ mj wnj exp − xl −cm2σ−c − 2 m

2 m wmj wnj exp − xl −cm2σ−c − 2 m

dcdw

xl −cn −cn 2 2σn2

dcdw xl −cn −cn 2 2σn2

dcdw (6.5)

where I1 , I2 and I3 are figured out by integrating over c and w, so I1 =

K

√ N σm2 σn2

x −cm 2 N exp − l2 2(σm +σc2m ) 2 2 2 2 (σm +σcm )(σn +σcn ) xl ∈D m,n=1;m=n √ N K σm2 2 x −cm 2 2 + (wmj + σwj ) $ N exp − 2l 2 (σm +2σcm ) (σm2 +2σc2m ) xl ∈D m=1

wmj wnj $

−

xl −cn 2 2(σn2 +σc2n )

,

and similarly, we have

Fig. 6.1 Illustration of the difference between sensitivity-based and conventional RBF classifiers. The circles and balls represent data points of two classes respectively, and RBF centers are indicated by extra circles. (a) Sensitivity-based RBF, centers being the vectors sensitive to the classification. (b) Conventional RBF, centers being the cluster centroids

6.2

Construction of RBF Networks with Sensitivity Analysis

I2 = I3 =

K

xl ∈D m,n=1 K xl ∈D m,n=1

√

wmj wnj $

σm2

N

N (σm2 +σc2m )

51

xl −cm 2 exp − 2(σ 2 +σ 2 ) − m

−cm 2 wmj wnj exp − xl2σ − 2 m

xl −cn 2 2σn2

cm

xl −cn 2 2σn2

,

.

The difference between the sensitivity-based and the conventional RBF networks can be illustrated as in Fig. 6.1.

6.2.2 Orthogonal Least Square Transform The most critical vectors can be found by Eq. (6.5). However, the RBF centers cannot be determined by a sensitivity measure only, because some of the critical vectors may be correlated. The OLS (Chen et al., 1991) can alleviate this problem of redundant or correlated centers. Let Y= (y1 , y2 . . . yL )T be the output matrix corresponding to the number L of all training examples, yi (i= 1, 2, . . ., L) an M-dimensional vector denoting the number M of output units. We have Y = HW = (QA) W

(6.6)

where Y, H, W are L × M, L × L, and L × M matrices, respectively. The selection of RBF centers is equivalent to the selection of the most critical columns of H. The matrix H can be decomposed into QA, where Q is an L × L matrix with orthogonal columns [q1 , q2 . . .qL ], and A is an L × L upper triangular matrix as follows: ⎡ ⎤ 1 h11 h12 · · · h1L ⎢ ⎢ .. ⎥ ⎢0 ⎢ h12 h12 . ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ .. . . .. ⎥ , A = ⎢ H=⎢ . ⎢ .. ⎢ ⎥ ⎢ ⎢. ⎥ ⎢ . ⎣ .. ⎦ ⎣ .. 0 h1L · · · · · · hLL ⎡

a12 · · · · · ·

a1L .. . .. .

⎤

⎥ ⎥ ⎥ ⎥ ⎥. ⎥ ⎥ 0 1 a(L−1)L ⎦ ··· ··· 0 1 1 a23 . 0 ..

Only one column of H is orthogonalized in each iteration. At the K th iteration, one column is made orthogonal to each of the K–1 previously orthogonalized columns. The computational procedure can be represented as follows (Chen et al. 1991): ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨

q1 = h1 , aik =

qTi hK , qi

⎪ ⎪ ⎪ K−1 ⎪ ⎪ ⎪ ⎩ q K = hK − aiK hi

1≤i

Our partners will collect data and use cookies for ad personalization and measurement. Learn how we and our ad partner Google, collect and use data. Agree & close