Random Feature Expansions for Deep Gaussian Processes
Kurt Cutajar (1), Edwin V. Bonilla (2), Pietro Michiardi (1), Maurizio Filippone (1)
1 - EURECOM, Sophia Antipolis, France; 2 - University of New South Wales, Australia

Deep Gaussian Processes

Deep probabilistic models: composition of functions
$$f(\mathbf{x}) = \left(h^{(N_h-1)}\!\left(\theta^{(N_h-1)}\right) \circ \dots \circ h^{(0)}\!\left(\theta^{(0)}\right)\right)(\mathbf{x}).$$

Fig. 1: Illustration of how stochastic processes may be composed: $h^{(0)}(\mathbf{x})$, $h^{(1)}(\mathbf{x})$, $(h^{(1)} \circ h^{(0)})(\mathbf{x})$.

Inference requires calculating the marginal likelihood - extremely challenging!
$$p(Y \mid X, \theta) = \int p\!\left(Y \mid F^{(N_h)}, \theta^{(N_h)}\right) p\!\left(F^{(N_h)} \mid F^{(N_h-1)}, \theta^{(N_h-1)}\right) \cdots p\!\left(F^{(1)} \mid X, \theta^{(0)}\right) dF^{(N_h)} \cdots dF^{(1)}.$$
DGPs with Random Features

GPs are single-layered neural nets with an infinite number of hidden units. Taking a weight-space view of a GP: $F = \Phi W$, where the priors over the weights are $p(W_{\cdot i}) = \mathcal{N}(0, I)$.

Fig. 2: Single-layered GP (nodes $X$, $\Omega$, $\Phi$, $W$).

The RBF kernel
$$k_{\mathrm{rbf}}(\mathbf{x}, \mathbf{x}') = \exp\left[-\tfrac{1}{2}(\mathbf{x} - \mathbf{x}')^\top (\mathbf{x} - \mathbf{x}')\right] = \int p(\omega) \exp\left(\iota (\mathbf{x} - \mathbf{x}')^\top \omega\right) d\omega$$
can be approximated using trigonometric functions:
$$\Phi_{\mathrm{rbf}} = \sqrt{\frac{\sigma^2}{N_{\mathrm{RF}}}}\left[\cos(F\Omega), \sin(F\Omega)\right], \quad \text{with } p(\Omega_{\cdot j} \mid \theta) = \mathcal{N}\!\left(0, \Lambda^{-1}\right),$$
allowing for scaling factors $\sigma^2$ and $\Lambda = \mathrm{diag}(l_1^2, \dots, l_d^2)$ for the kernel and the features.

Meanwhile, the first-order arc-cosine kernel
$$k_{\mathrm{arc}}(\mathbf{x}, \mathbf{x}') = \frac{1}{\pi}\lVert \mathbf{x} \rVert\, \lVert \mathbf{x}' \rVert\, J(\alpha) = 2 \int \max(0, \omega^\top \mathbf{x}) \max(0, \omega^\top \mathbf{x}')\, \mathcal{N}(\omega \mid 0, I)\, d\omega,$$
with $J(\alpha) = \sin\alpha + (\pi - \alpha)\cos\alpha$ and $\alpha = \cos^{-1}\!\left(\frac{\mathbf{x}^\top \mathbf{x}'}{\lVert \mathbf{x} \rVert\, \lVert \mathbf{x}' \rVert}\right)$, can be approximated using rectified linear units (ReLU):
$$\Phi_{\mathrm{arc}} = \sqrt{\frac{2\sigma^2}{N_{\mathrm{RF}}}} \max(0, F\Omega), \quad \text{with } p(\Omega_{\cdot j} \mid \theta) = \mathcal{N}\!\left(0, \Lambda^{-1}\right).$$

A minimal sketch of both feature maps follows below.
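The following NumPy sketch implements the two feature maps under the definitions above. It is illustrative code, not the authors' released implementation; names such as make_phi_rbf, make_phi_arc, and n_rf are assumptions of this example.

```python
# Illustrative sketch of the two random-feature maps (not the released code).
import numpy as np

rng = np.random.default_rng(0)

def make_phi_rbf(F, Omega, sigma2):
    # Phi_rbf = sqrt(sigma2 / n_rf) * [cos(F Omega), sin(F Omega)]
    n_rf = Omega.shape[1]
    FO = F @ Omega
    return np.sqrt(sigma2 / n_rf) * np.concatenate([np.cos(FO), np.sin(FO)], axis=1)

def make_phi_arc(F, Omega, sigma2):
    # Phi_arc = sqrt(2 sigma2 / n_rf) * max(0, F Omega)  (ReLU features)
    n_rf = Omega.shape[1]
    return np.sqrt(2.0 * sigma2 / n_rf) * np.maximum(0.0, F @ Omega)

# Spectral frequencies from the prior p(Omega_.j | theta) = N(0, Lambda^{-1})
# with Lambda = diag(l_1^2, ..., l_d^2): standard normals scaled by 1 / l_i.
d, n_rf = 4, 100
lengthscales = np.ones(d)
Omega = rng.standard_normal((d, n_rf)) / lengthscales[:, None]

X = rng.standard_normal((8, d))
Phi = make_phi_rbf(X, Omega, sigma2=1.0)   # shape (8, 2 * n_rf)
```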
Model Architecture

DGPs with RFs become DNNs with low-rank weight matrices!

Fig. 3: Diagram of the proposed DGP with random features: $X \to \Phi^{(0)} \to F^{(1)} \to \Phi^{(1)} \to F^{(2)} \to Y$, with random features governed by $\Omega^{(0)}, \Omega^{(1)}$, weights $W^{(0)}, W^{(1)}$, and kernel parameters $\theta^{(0)}, \theta^{(1)}$.

A minimal forward-pass sketch follows below.
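Continuing the snippet above, the composition reduces to alternating random projections and learned linear maps; dgp_forward and the shape conventions are assumptions of this illustration.

```python
# Sketch of the resulting low-rank-DNN forward pass (continues the snippet above).
def dgp_forward(F, Omegas, Ws, sigma2s, phi=make_phi_rbf):
    # Each layer alternates a random projection (Omega) with a learned linear
    # map (W): F <- Phi(F Omega) W. Only matrix products -- no inverses.
    for Omega, W, sigma2 in zip(Omegas, Ws, sigma2s):
        F = phi(F, Omega, sigma2) @ W
    return F

# Two-layer example: trigonometric features have 2 * n_rf columns, so each W
# has 2 * n_rf rows.
Omegas = [Omega, rng.standard_normal((d, n_rf)) / lengthscales[:, None]]
Ws = [rng.standard_normal((2 * n_rf, d)), rng.standard_normal((2 * n_rf, 1))]
F_out = dgp_forward(X, Omegas, Ws, sigma2s=[1.0, 1.0])   # shape (8, 1)
```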
Stochastic Variational Inference

Define $\Psi = \left(\Omega^{(0)}, \dots, \Omega^{(N_h-1)}, W^{(0)}, \dots, W^{(N_h-1)}\right)$. Lower bound on the marginal likelihood:
$$\log[p(Y \mid \theta)] \geq E_{q(\Psi)}\left(\log[p(Y \mid \Psi)]\right) - D_{\mathrm{KL}}\left[q(\Psi) \,\|\, p(\Psi \mid \theta)\right],$$
where $q(\Psi)$ approximates $p(\Psi \mid Y, \theta)$.

Factorized approximate posterior:
$$q(\Psi) = \prod_{ijl} q\!\left(W^{(l)}_{ij}\right) q\!\left(\Omega^{(l)}_{ij}\right), \quad \text{with } q\!\left(W^{(l)}_{ij}\right) = \mathcal{N}\!\left(m^{(l)}_{ij}, (s^2)^{(l)}_{ij}\right) \text{ and } q\!\left(\Omega^{(l)}_{ij}\right) = \mathcal{N}\!\left(\mu^{(l)}_{ij}, (\sigma^2)^{(l)}_{ij}\right).$$

Assuming a factorized likelihood, we can use mini-batch stochastic gradient optimization:
$$\frac{n}{m} \sum_{k \in \mathcal{I}_m} E_{q(\Psi)}\left(\log[p(y_k \mid \Psi)]\right) - D_{\mathrm{KL}}\left[q(\Psi) \,\|\, p(\Psi \mid \theta)\right].$$

The expectation can be estimated using Monte Carlo:
$$E_{q(\Psi)}\left(\log[p(y_k \mid \Psi)]\right) \approx \frac{1}{N_{\mathrm{MC}}} \sum_{r=1}^{N_{\mathrm{MC}}} \log[p(y_k \mid \tilde{\Psi}_r)], \quad \text{with } \tilde{\Psi}_r \sim q(\Psi).$$

Computational cost is dominated by low-rank matrix multiplication - no inverses required. Various optimization strategies for $\Omega$ are available; see the sketches below and Fig. 4.
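A hedged sketch of the mini-batch Monte Carlo estimator follows, using the reparameterization mean + std * eps to sample from $q(\Psi)$. The unit-noise Gaussian likelihood, mc_loglik_term, qW, and qOmega are assumptions of this illustration; the KL term (closed-form for factorized Gaussians) is omitted.

```python
# Sketch of the mini-batch Monte Carlo estimate of the expected log-likelihood
# term; continues the snippets above. The Gaussian observation model with unit
# noise is an assumption of this illustration, not the paper's general setup.
def mc_loglik_term(X_batch, y_batch, qW, qOmega, sigma2s, n, n_mc=10):
    m = X_batch.shape[0]
    est = 0.0
    for _ in range(n_mc):
        # Reparameterized sample Psi_r ~ q(Psi): each entry is mean + std * eps.
        Ws = [mu + s * rng.standard_normal(mu.shape) for mu, s in qW]
        Omegas = [mu + s * rng.standard_normal(mu.shape) for mu, s in qOmega]
        F = dgp_forward(X_batch, Omegas, Ws, sigma2s)
        est += -0.5 * np.sum((y_batch - F) ** 2 + np.log(2.0 * np.pi))
    return (n / m) * (est / n_mc)   # rescale the mini-batch sum to n points
```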
Fig. 4: Performance (RMSE and MNLL vs. $\log_{10}(N_{\mathrm{RF}})$) of different strategies for dealing with $\Omega$ as a function of the number of random features: these can be fixed (PRIOR-FIXED), or treated variationally with fixed randomness (VAR-FIXED) or resampled at each iteration (VAR-RESAMPLED).
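The three strategies compared in Fig. 4 differ only in how the $\Omega$ sample is produced; a minimal sketch under the definitions above, where mu_Om and s_Om are illustrative variational parameters:

```python
# The three sampling strategies for Omega (continues the snippets above).
mu_Om = np.zeros((d, n_rf))
s_Om = np.ones((d, n_rf))
eps_fixed = rng.standard_normal((d, n_rf))

# PRIOR-FIXED: draw Omega once from the prior and never update it.
Omega_prior_fixed = eps_fixed / lengthscales[:, None]

# VAR-FIXED: Omega = mu + s * eps with eps fixed across iterations, so only
# the variational parameters (mu, s) change during optimization.
Omega_var_fixed = mu_Om + s_Om * eps_fixed

# VAR-RESAMPLED: same variational treatment, but eps is redrawn each iteration.
Omega_var_resampled = mu_Om + s_Om * rng.standard_normal((d, n_rf))
```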
Experimental Setup and Results

Competing methods: DGP with RBF kernel (DGP-RBF); DGP with first-order arc-cosine kernel (DGP-ARC); DGP with expectation propagation (DGP-EP); sparse GP with variational inference (VAR-GP); and a DNN baseline.

Datasets: Powerplant (n = 9568, d = 4) and Protein (n = 45730, d = 9) for regression; Spam (n = 4061, d = 57), EEG (n = 14979, d = 14), and MNIST (n = 60000, d = 784) for classification.

Fig. 5: Progression of RMSE and MNLL on test data over time for competing DGP models.

Easily extendable architecture with option to feed-forward inputs. Large-scale results:

Dataset     Accuracy (RBF)   Accuracy (ARC)   MNLL (RBF)   MNLL (ARC)
MNIST8M     99.14%           99.04%           0.0454       0.0465
AIRLINE     78.55%           72.76%           0.4583       0.5335

Fig. 6: Left and center - performance (error rate and MNLL) of our model on the AIRLINE dataset as a function of time for different depths (2, 10, 20, 30 layers), compared against SV-DKL. Right - the box plot of the negative lower bound (NELBO) confirms this is a suitable objective for model selection.
Conclusions

Our contributions: complete specification and evaluation of DGPs based on random features; scalable and practical DGP inference - no inverses; synchronous/asynchronous distributed implementation available. Ongoing work: Fastfood and other kernels; convolutional DGPs for image processing.