{"title": "The Bias-Variance Tradeoff and the Randomized GACV", "book": "Advances in Neural Information Processing Systems", "page_first": 620, "page_last": 626, "abstract": null, "full_text": "The Bias-Variance Tradeoff and the Randomized GACV \n\nGrace Wahba, Xiwu Lin and Fangyu Gao \nDept of Statistics \nUniv of Wisconsin \n1210 W Dayton Street \nMadison, WI 53706 \nwahba,xiwu,fgao@stat.wisc.edu \n\nDong Xiang \nSAS Institute, Inc. \nSAS Campus Drive \nCary, NC 27513 \nsasdxx@unx.sas.com \n\nRonald Klein, MD and Barbara Klein, MD \nDept of Ophthalmology \n610 North Walnut Street \nMadison, WI 53706 \nkleinr,kleinb@epi.ophth.wisc.edu \n\nAbstract \n\nWe propose a new in-sample cross validation based method (randomized GACV) for choosing smoothing or bandwidth parameters that govern the bias-variance or fit-complexity tradeoff in 'soft' classification. Soft classification refers to a learning procedure which estimates the probability that an example with a given attribute vector is in class 1 vs class 0. The target for optimizing the tradeoff is the Kullback-Leibler distance between the estimated probability distribution and the 'true' probability distribution, representing knowledge of an infinite population. The method uses a randomized estimate of the trace of a Hessian and mimics cross validation at the cost of a single relearning with perturbed outcome data. \n\n1 INTRODUCTION \n\nWe propose and test a new in-sample cross-validation based method for optimizing the bias-variance tradeoff in 'soft classification' (Wahba et al 1994), called ranGACV (randomized Generalized Approximate Cross Validation). Summarizing from Wahba et al (1994), we are given a training set consisting of n examples, where for each example we have a vector t in T of attribute values and an outcome y, which is either 0 or 1. 
Based on the training data it is desired to estimate the probability p of the outcome 1 for any new examples in the future. In 'soft' classification the estimate p_hat(t) of p(t) is of particular interest, and might be used by a physician to tell patients how they might modify their risk p by changing (some component of) t, for example, cholesterol as a risk factor for heart attack. Penalized likelihood estimates are obtained for p by assuming that the logit f(t), t in T, which satisfies p(t) = e^{f(t)}/(1 + e^{f(t)}), is in some space H of functions. Technically H is a reproducing kernel Hilbert space, but you don't need to know what that is to read on. Let the training set be {y_i, t_i, i = 1, ..., n}. Letting f_i = f(t_i), the negative log likelihood L{y_i, t_i, f_i} of the observations, given f, is \n\nL{y_i, t_i, f_i} = sum_{i=1}^n [-y_i f_i + b(f_i)],     (1) \n\nwhere b(f) = log(1 + e^f). The penalized likelihood estimate of the function f is the solution to: Find f in H to minimize I_lambda(f): \n\nI_lambda(f) = sum_{i=1}^n [-y_i f_i + b(f_i)] + J_lambda(f),     (2) \n\nwhere J_lambda(f) is a quadratic penalty functional depending on parameter(s) lambda = (lambda_1, ..., lambda_q) which govern the so-called bias-variance tradeoff. Equivalently, the components of lambda control the tradeoff between the complexity of f and the fit to the training data. In this paper we sketch the derivation of the ranGACV method for choosing lambda, and present some preliminary but favorable simulation results demonstrating its efficacy. This method is designed for use with penalized likelihood estimates, but it is clear that it can be used with a variety of other methods which contain bias-variance parameters to be chosen, and for which minimizing the Kullback-Leibler (KL) distance is the target. In the work of which this is a part, we are concerned with lambda having multiple components. 
Thus, it will be highly convenient to have an in-sample method for selecting lambda, if one that is accurate and computationally convenient can be found. \n\nLet p_lambda be the estimate and p be the 'true' but unknown probability function, and let p_i = p(t_i), p_{lambda,i} = p_lambda(t_i). For in-sample tuning, our criterion for a good choice of lambda is the KL distance KL(p, p_lambda) = (1/n) sum_{i=1}^n [p_i log(p_i/p_{lambda,i}) + (1 - p_i) log((1 - p_i)/(1 - p_{lambda,i}))]. We may replace KL(p, p_lambda) by the comparative KL distance (CKL), which differs from KL by a quantity which does not depend on lambda. Letting f_{lambda,i} = f_lambda(t_i), the CKL is given by \n\nCKL(p, p_lambda) == CKL(lambda) = (1/n) sum_{i=1}^n [-p_i f_{lambda,i} + b(f_{lambda,i})].     (3) \n\nCKL(lambda) depends on the unknown p, and it is desired to have a good estimate or proxy for it, which can then be minimized with respect to lambda. \n\nIt is known (Wong 1992) that no exact unbiased estimate of CKL(lambda) exists in this case, so that only approximate methods are possible. A number of authors have tackled this problem, including Utans and Moody (1993), Liu (1993) and Gu (1992). The iterative UBR method of Gu (1992) is included in GRKPACK (Wang 1997), which implements general smoothing spline ANOVA penalized likelihood estimates with multiple smoothing parameters. It has been successfully used in a number of practical problems; see, for example, Wahba et al (1994, 1995). The present work represents an approach in the spirit of GRKPACK but which employs several approximations, and may be used with any data set, no matter how large, provided that an algorithm for solving the penalized likelihood equations, either exactly or approximately, can be implemented. 
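As a concrete illustration of the target criterion (3), the following minimal numpy sketch (our own; the function name `ckl` and the use of log1p for numerical stability are not from the paper) evaluates the comparative KL distance for a known true p:

```python
import numpy as np

def ckl(p_true, f_lam):
    # Comparative KL distance (3): CKL(lambda) = (1/n) sum_i [-p_i f_{lambda,i} + b(f_{lambda,i})],
    # with b(f) = log(1 + e^f); log1p(exp(f)) is stable for moderate |f|.
    return np.mean(-p_true * f_lam + np.log1p(np.exp(f_lam)))

# Since CKL differs from KL by a constant in lambda, it is minimized when the
# estimated logit equals the true logit log(p/(1-p)):
p = np.array([0.2, 0.5, 0.8])
true_logit = np.log(p / (1 - p))
print(ckl(p, true_logit) <= ckl(p, true_logit + 0.5))  # True
```

In the simulations of Section 4 the true p is known by construction, which is what makes this quantity computable there.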
\n\n2 THE GACV ESTIMATE \n\nIn the general penalized likelihood problem the minimizer f_lambda(.) of (2) has a representation \n\nf_lambda(t) = sum_{nu=1}^M d_nu phi_nu(t) + sum_{i=1}^n c_i Q_lambda(t_i, t),     (4) \n\nwhere the phi_nu span the null space of J_lambda, Q_lambda(s, t) is a reproducing kernel (positive definite function) for the penalized part of H, and c = (c_1, ..., c_n)' satisfies M linear conditions, so that there are (at most) n free parameters in f_lambda. Typically the unpenalized functions phi_nu are low degree polynomials. If f_lambda(.) is of the form (4) then J_lambda(f_lambda) is a quadratic form in c. Substituting (4) into (2) results in I_lambda a convex functional in c and d, and c and d are obtained numerically via a Newton-Raphson iteration, subject to the conditions on c. For large n, the second sum on the right of (4) may be replaced by sum_{k=1}^K c_{i_k} Q_lambda(t_{i_k}, t), where the t_{i_k} are chosen via one of several principled methods. \n\nTo obtain the GACV we begin with the ordinary leaving-out-one cross validation function CV(lambda) for the CKL: \n\nCV(lambda) = (1/n) sum_{i=1}^n [-y_i f_{lambda,i}^{[-i]} + b(f_{lambda,i})],     (5) \n\nwhere f_lambda^{[-i]} is the solution to the variational problem of (2) with the ith data point left out and f_{lambda,i}^{[-i]} is the value of f_lambda^{[-i]} at t_i. Although f_lambda(.) is computed by solving for c and d, the GACV is derived in terms of the values (f_1, ..., f_n)' of f at the t_i. Where there is no confusion between functions f(.) and vectors (f_1, ..., f_n)' of values of f at t_1, ..., t_n, we let f = (f_1, ..., f_n)'. For any f(.) of the form (4), J_lambda(f) also has a representation as a non-negative definite quadratic form in (f_1, ..., f_n)'. Letting Sigma_lambda be twice the matrix of this quadratic form, we can rewrite (2) as \n\nI_lambda(f, y) = sum_{i=1}^n [-y_i f_i + b(f_i)] + (1/2) f' Sigma_lambda f.     (6) \n\nLet W = W(f) be the n x n diagonal matrix with sigma_ii = p_i(1 - p_i) in the iith position. Using the fact that sigma_ii is the second derivative of b(f_i), we have that H = [W + Sigma_lambda]^{-1} is the inverse Hessian of the variational problem (6). 
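For small n, the finite-dimensional problem (6) and its inverse Hessian H = [W + Sigma_lambda]^{-1} can be computed directly. The numpy sketch below is our own illustration, not the paper's basis-function implementation: it solves (6) for the vector f of fitted values by Newton-Raphson, assuming Sigma_lambda is supplied and W + Sigma_lambda is invertible.

```python
import numpy as np

def fit_penalized_logit(y, Sigma_lam, tol=1e-10, max_iter=100):
    # Minimize (6): sum_i [-y_i f_i + b(f_i)] + (1/2) f' Sigma_lam f by Newton-Raphson.
    # Returns the minimizer f and the inverse Hessian H = [W + Sigma_lam]^{-1}.
    n = len(y)
    f = np.zeros(n)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-f))        # p_i = e^{f_i}/(1 + e^{f_i})
        W = np.diag(p * (1.0 - p))          # sigma_ii = p_i(1 - p_i) = b''(f_i)
        H = np.linalg.inv(W + Sigma_lam)    # inverse Hessian of (6)
        step = H @ (-(y - p) + Sigma_lam @ f)   # Newton step: H times gradient
        f = f - step
        if np.max(np.abs(step)) < tol:
            break
    return f, H
```

At convergence the gradient -(y - p) + Sigma_lambda f vanishes, and H is exactly the matrix whose diagonal entries h_ii enter the cross-validation approximations below; forming it explicitly is what limits exact calculation to small n.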
In Xiang and Wahba (1996), several Taylor series approximations, along with a generalization of the leaving-out-one lemma (see Wahba 1990), are applied to (5) to obtain an approximate cross validation function ACV(lambda), which is a second order approximation to CV(lambda). Letting h_ii be the iith entry of H, the result is \n\nCV(lambda) ~= ACV(lambda) = (1/n) sum_{i=1}^n [-y_i f_{lambda,i} + b(f_{lambda,i})] + (1/n) sum_{i=1}^n h_ii y_i (y_i - p_{lambda,i}) / [1 - h_ii sigma_ii].     (7) \n\nThen the GACV is obtained from the ACV by replacing h_ii by (1/n) sum_{i=1}^n h_ii == (1/n) tr(H) and replacing 1 - h_ii sigma_ii by (1/n) tr[I - W^{1/2} H W^{1/2}], giving \n\nGACV(lambda) = (1/n) sum_{i=1}^n [-y_i f_{lambda,i} + b(f_{lambda,i})] + (tr(H)/n) sum_{i=1}^n y_i (y_i - p_{lambda,i}) / tr[I - W^{1/2} H W^{1/2}],     (8) \n\nwhere W is evaluated at f_lambda. Numerical results based on an exact calculation of (8) appear in Xiang and Wahba (1996). The exact calculation is limited to small n, however. \n\n3 THE RANDOMIZED GACV ESTIMATE \n\nGiven any 'black box' which, given lambda and a training set {y_i, t_i}, produces f_lambda(.) as the minimizer of (2), and thence f_lambda = (f_{lambda,1}, ..., f_{lambda,n})', we can produce randomized estimates of tr(H) and tr[I - W^{1/2} H W^{1/2}] without any explicit calculation of these matrices. This is done by running the 'black box' on perturbed data {y_i + delta_i, t_i}. For Gaussian y_i, randomized trace estimates of the Hessian of the variational problem (the 'influence matrix') have been studied extensively and shown to be essentially as good as exact calculations for large n; see for example Girard (1998). Randomized trace estimates are based on the fact that if A is any square matrix and delta is a zero mean random n-vector with independent components with variance sigma_delta^2, then E delta' A delta = sigma_delta^2 tr(A). See Gong et al (1998) and references cited there for experimental results with multiple regularization parameters. 
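The identity E delta' A delta = sigma_delta^2 tr(A) is easy to check numerically. The sketch below is our own (the name `randomized_trace` is hypothetical); it averages R independent replicate vectors, just as the R-replicated ranGACV of this section does.

```python
import numpy as np

def randomized_trace(A, sigma=1.0, R=5, seed=0):
    # Estimate tr(A) via E[delta' A delta] = sigma_delta^2 tr(A), where delta has
    # zero-mean independent components with variance sigma^2; average R replicates.
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    total = 0.0
    for _ in range(R):
        delta = rng.normal(0.0, sigma, n)
        total += delta @ A @ delta / sigma**2
    return total / R

A = np.diag(np.arange(1.0, 51.0))   # tr(A) = 1275
print(randomized_trace(A, R=400))   # close to 1275
```

The point of the construction below is that for A = [W + Sigma_lambda]^{-1} the matrix is never formed: the 'black box' run on perturbed data supplies delta'(f_lambda^{y+delta,1} - f_lambda^y), which plays the role of delta' A delta.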
Returning to the 0-1 data case, it is easy to see that the minimizer f_lambda(.) of I_lambda is continuous in y, notwithstanding the fact that in our training set the y_i take on only the values 0 or 1. Letting f_lambda^y = (f_{lambda,1}, ..., f_{lambda,n})' be the minimizer of (6) given y = (y_1, ..., y_n)', and f_lambda^{y+delta} be the minimizer given data y + delta = (y_1 + delta_1, ..., y_n + delta_n)' (the t_i remain fixed), Xiang and Wahba (1997) show, again using Taylor series expansions, that f_lambda^{y+delta} - f_lambda^y ~= [W(f_lambda^y) + Sigma_lambda]^{-1} delta. This suggests that (1/sigma_delta^2) delta'(f_lambda^{y+delta} - f_lambda^y) provides an estimate of tr[W(f_lambda^y) + Sigma_lambda]^{-1}. However, if we take the solution f_lambda^y to the nonlinear system for the original data y as the initial value for a Newton-Raphson calculation of f_lambda^{y+delta}, things become even simpler. Applying a one step Newton-Raphson iteration gives \n\nf_lambda^{y+delta,1} = f_lambda^y - [d^2 I_lambda/df^2 (f_lambda^y, y + delta)]^{-1} dI_lambda/df (f_lambda^y, y + delta).     (9) \n\nSince dI_lambda/df (f_lambda^y, y + delta) = -delta + dI_lambda/df (f_lambda^y, y) = -delta, and [d^2 I_lambda/df^2 (f_lambda^y, y + delta)]^{-1} = [W(f_lambda^y) + Sigma_lambda]^{-1}, we have f_lambda^{y+delta,1} - f_lambda^y = [W(f_lambda^y) + Sigma_lambda]^{-1} delta. The result is the following ranGACV function: \n\nranGACV(lambda) = (1/n) sum_{i=1}^n [-y_i f_{lambda,i} + b(f_{lambda,i})] + (delta'(f_lambda^{y+delta,1} - f_lambda^y)/n) sum_{i=1}^n y_i (y_i - p_{lambda,i}) / [delta' delta - delta' W(f_lambda^y)(f_lambda^{y+delta,1} - f_lambda^y)].     (10) \n\nTo reduce the variance in (10), we may draw R independent replicate vectors delta_1, ..., delta_R, and replace the term after the '+' in (10) by \n\n(1/R) sum_{r=1}^R (delta_r'(f_lambda^{y+delta_r,1} - f_lambda^y)/n) sum_{i=1}^n y_i (y_i - p_{lambda,i}) / [delta_r' delta_r - delta_r' W(f_lambda^y)(f_lambda^{y+delta_r,1} - f_lambda^y)] \n\nto obtain an R-replicated ranGACV(lambda) function. \n\n4 NUMERICAL RESULTS \n\nIn this section we present simulation results which are representative of more extensive simulations to appear elsewhere. In each case, K << n basis functions were chosen by a sequential clustering algorithm: the t_i were grouped into K clusters and one member of each cluster selected at random. 
The model is fit, then the number of clusters is doubled and the model is fit again; this procedure continues until the fit does not change. In the randomized trace estimates the random variates were Gaussian. Penalty functionals were (multivariate generalizations of) the cubic spline penalty functional lambda int_0^1 (f''(x))^2 dx, and smoothing spline ANOVA models were fit. \n\n4.1 EXPERIMENT 1. SINGLE SMOOTHING PARAMETER \n\nIn this experiment t in [0,1], f(t) = 2 sin(10t), t_i = (i - .5)/500, i = 1, ..., 500. A random number generator produced 'observations' y_i = 1 with probability p_i = e^{f_i}/(1 + e^{f_i}), to get the training set. Q_lambda is given in Wahba (1990) for this cubic spline case; K = 50. Since the true p is known, the true CKL can be computed. Fig. 1(a) gives a plot of CKL(lambda) and 10 replicates of ranGACV(lambda). In each replicate R was taken as 1, and delta was generated anew as a Gaussian random vector with sigma_delta^2 = .001. Extensive simulations with different sigma_delta^2 showed that the results were insensitive to sigma_delta^2 from 1.0 to 10^{-6}. The minimizer of CKL is at the filled-in circle and the 10 minimizers of the 10 replicates of ranGACV are the open circles. Any one of these 10 provides a rather good estimate of the lambda that goes with the filled-in circle. Fig. 1(b) gives the same experiment, except that this time R = 5. It can be seen that the minimizers of ranGACV become even more reliable estimates of the minimizer of CKL, and the CKL at all of the ranGACV estimates is actually quite close to its minimum value. \n\n4.2 EXPERIMENT 2. ADDITIVE MODEL WITH lambda = (lambda_1, lambda_2) \n\nHere t in [0,1] x [0,1]. n = 500 values of t_i were generated randomly according to a uniform distribution on the unit square and the y_i were generated according to p_i = e^{f_i}/(1 + e^{f_i}) with t = (x_1, x_2) and f(t) = 5 sin 2 pi x_1 - 3 sin 2 pi x_2. 
An additive model, a special case of the smoothing spline ANOVA model (see Wahba et al 1995), of the form f(t) = mu + f_1(x_1) + f_2(x_2) with cubic spline penalties on f_1 and f_2, was used. K = 50, sigma_delta^2 = .001, R = 5. Figure 1(c) gives a plot of CKL(lambda_1, lambda_2) and Figure 1(d) gives a plot of ranGACV(lambda_1, lambda_2). The open circles mark the minimizer of ranGACV in both plots and the filled-in circle marks the minimizer of CKL. The inefficiency, as measured by CKL(lambda_hat)/min_lambda CKL(lambda), is 1.01. Inefficiencies near 1 are typical of our other similar simulations. \n\n4.3 EXPERIMENT 3. COMPARISON OF ranGACV AND UBR \n\nThis experiment used a model similar to the model fit by GRKPACK for the risk of progression of diabetic retinopathy given t = (x_1, x_2, x_3) = (duration, glycosylated hemoglobin, body mass index) in Wahba et al (1995) as 'truth'. A training set of 669 examples was generated according to that model, which had the structure f(x_1, x_2, x_3) = mu + f_1(x_1) + f_2(x_2) + f_3(x_3) + f_{1,3}(x_1, x_3). This (synthetic) training set was fit by GRKPACK and also using K = 50 basis functions with ranGACV. Here there are p = 6 smoothing parameters (there are 3 smoothing parameters in f_{1,3}) and the ranGACV function was searched by a downhill simplex method to find its minimizer. Since the 'truth' is known, the CKL for lambda_hat and for the GRKPACK fit using the iterative UBR method were computed. This was repeated 100 times, and the 100 pairs of CKL values appear in Figure 1(e). It can be seen that UBR and ranGACV give similar CKL values about 90% of the time, while ranGACV has lower CKL for most of the remaining cases. \n\n4.4 DATA ANALYSIS: AN APPLICATION \n\nFigure 1(f) represents part of the results of a study of association at baseline of pigmentary abnormalities with various risk factors in 2585 women between the ages of 43 and 86 in the Beaver Dam Eye Study, R. Klein et al (1995). 
The attributes are: x_1 = age, x_2 = body mass index, x_3 = systolic blood pressure, x_4 = cholesterol. x_5 and x_6 are indicator variables for taking hormones and history of drinking. The smoothing spline ANOVA model fitted was f(t) = mu + d_1 x_1 + d_2 x_2 + f_3(x_3) + f_4(x_4) + f_{3,4}(x_3, x_4) + d_5 I(x_5) + d_6 I(x_6), where I is the indicator function. Figure 1(f) represents a cross section of the fit for x_5 = no, x_6 = no, x_2, x_3 fixed at their medians and x_1 fixed at the 75th percentile. The dotted lines are the Bayesian confidence intervals; see Wahba et al (1995). There is a suggestion of a borderline inverse association of cholesterol. The reason for this association is uncertain. More details will appear elsewhere. \n\nPrincipled soft classification procedures can now be implemented in much larger data sets than previously possible, and the ranGACV should be applicable in general learning. \n\nReferences \n\nGirard, D. (1998), 'Asymptotic comparison of (partial) cross-validation, GCV and randomized GCV in nonparametric regression', Ann. Statist. 26, 315-334. \nGirosi, F., Jones, M. & Poggio, T. (1995), 'Regularization theory and neural networks architectures', Neural Computation 7, 219-269. \nGong, J., Wahba, G., Johnson, D. & Tribbia, J. (1998), 'Adaptive tuning of numerical weather prediction models: simultaneous estimation of weighting, smoothing and physical parameters', Monthly Weather Review 125, 210-231. \nGu, C. (1992), 'Penalized likelihood regression: a Bayesian analysis', Statistica Sinica 2, 255-264. \nKlein, R., Klein, B. & Moss, S. (1995), 'Age-related eye disease and survival. The Beaver Dam Eye Study', Arch Ophthalmol 113, 1995. \nLiu, Y. (1993), Unbiased estimate of generalization error and model selection in neural network, manuscript, Department of Physics, Institute of Brain and Neural Systems, Brown University. \nUtans, J. & Moody, J. 
(1993), Selecting neural network architectures via the prediction risk: application to corporate bond rating prediction, in 'Proc. First Int'l Conf. on Artificial Intelligence Applications on Wall Street', IEEE Computer Society Press. \nWahba, G. (1990), Spline Models for Observational Data, SIAM. CBMS-NSF Regional Conference Series in Applied Mathematics, v. 59. \nWahba, G. (1995), Generalization and regularization in nonlinear learning systems, in M. Arbib, ed., 'Handbook of Brain Theory and Neural Networks', MIT Press, pp. 426-430. \nWahba, G., Wang, Y., Gu, C., Klein, R. & Klein, B. (1994), Structured machine learning for 'soft' classification with smoothing spline ANOVA and stacked tuning, testing and evaluation, in J. Cowan, G. Tesauro & J. Alspector, eds, 'Advances in Neural Information Processing Systems 6', Morgan Kaufmann, pp. 415-422. \nWahba, G., Wang, Y., Gu, C., Klein, R. & Klein, B. (1995), 'Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy', Ann. Statist. 23, 1865-1895. \nWang, Y. (1997), 'GRKPACK: Fitting smoothing spline analysis of variance models to data from exponential families', Commun. Statist. Sim. Comp. 26, 765-782. \nWong, W. (1992), Estimation of the loss of an estimate, Technical Report 356, Dept. of Statistics, University of Chicago, Chicago, IL. \nXiang, D. & Wahba, G. (1996), 'A generalized approximate cross validation for smoothing splines with non-Gaussian data', Statistica Sinica 6, 675-692, preprint TR 930 available via www.stat.wisc.edu/~wahba -> TRLIST. \nXiang, D. & Wahba, G. (1997), Approximate smoothing spline methods for large data sets in the binary case, Technical Report 982, Department of Statistics, University of Wisconsin, Madison WI. To appear in the Proceedings of the 1997 ASA Joint Statistical Meetings, Biometrics Section, pp 94-98 (1998). Also in TRLIST as above. 
\n\nFigure 1: (a) and (b): Single smoothing parameter comparison of ranGACV and CKL. (c) and (d): Two smoothing parameter comparison of ranGACV and CKL. (e): Comparison of ranGACV and UBR. (f): Probability estimate from Beaver Dam Study. \n", "award": [], "sourceid": 1483, "authors": [{"given_name": "Grace", "family_name": "Wahba", "institution": null}, {"given_name": "Xiwu", "family_name": "Lin", "institution": null}, {"given_name": "Fangyu", "family_name": "Gao", "institution": null}, {"given_name": "Dong", "family_name": "Xiang", "institution": null}, {"given_name": "Ronald", "family_name": "Klein", "institution": null}, {"given_name": "Barbara", "family_name": "Klein", "institution": null}]}