DSVDD: Deep Support Vector Data Description
1. Background
1.1 Kernel-based One-Class Classification.
Let \(\mathcal{X} \subset \mathbb{R}^d\) be the data space. Let \( k : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty)\) be a PSD kernel, \(\mathcal{F}_{k}\) its associated RKHS, and \(\phi_{k} : \mathcal{X} \rightarrow \mathcal{F}_{k}\) its associated feature mapping. Then \(k(\boldsymbol{x}, \tilde{\boldsymbol{x}})=\left< \phi_{k}(\boldsymbol{x}), \phi_{k}(\tilde{\boldsymbol{x}})\right>_{\mathcal{F}_{k}}\) for all \(\boldsymbol{x}, \tilde{\boldsymbol{x}}\in \mathcal{X}\), where \(\left<\cdot, \ \cdot\right>_{\mathcal{F}_{k}}\) is the dot product in Hilbert space \(\mathcal{F}_{k}\).
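As a concrete illustration (not part of the definitions above), the Gaussian/RBF kernel is a standard choice of PSD kernel; its feature map is never computed explicitly, only inner products in \(\mathcal{F}_k\) are evaluated. A minimal NumPy sketch, with an illustrative bandwidth \(\sigma\):

```python
import numpy as np

def rbf_kernel(x, x_tilde, sigma=1.0):
    """Gaussian (RBF) kernel k(x, x~) = exp(-||x - x~||^2 / (2 sigma^2)).

    The associated feature map phi_k is infinite-dimensional and implicit;
    the kernel value equals the inner product of the mapped points in F_k.
    """
    diff = np.asarray(x, dtype=float) - np.asarray(x_tilde, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# k(x, x) = 1 for the RBF kernel, i.e. every mapped point has unit norm in F_k.
print(rbf_kernel([0.0, 1.0], [0.0, 1.0]))  # 1.0
```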
OC-SVM. Probably the most prominent example of a kernel-based method for one-class classification is the One-Class SVM (OC-SVM). The OC-SVM finds a maximum-margin hyperplane in feature space, \( \boldsymbol{w}\in\mathcal{F}_k\), that best separates the mapped data from the origin. Given a dataset \(\mathcal{D}_{n} = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\}\) with \(\boldsymbol{x}_i \in \mathcal{X}\), the OC-SVM solves the primal problem:
$$ \min_{\boldsymbol{w},\ \rho,\ \xi} \frac{1}{2}\|\boldsymbol{w}\|^{2}_{\mathcal{F}_{k}}-\rho+\frac{1}{\nu n}\sum^{n}_{i=1}\xi_{i} $$
$$ \textbf{s.t.}\ \left<\boldsymbol{w}, \phi_{k}(\boldsymbol{x}_i)\right>_{\mathcal{F}_k} \ge \rho-\xi_{i}, \ \xi_{i}\ge 0, \ \forall i.$$
Here \(\rho\) is the offset of the hyperplane \(\left<\boldsymbol{w}, \phi_k(\boldsymbol{x})\right>_{\mathcal{F}_k} = \rho\) from the origin. Nonnegative slack variables \(\boldsymbol{\xi}=(\xi_1, \ldots, \xi_n)^\top\) allow the margin to be soft, but violations \(\xi_i\) are penalized. \(\|\boldsymbol{w}\|^{2}_{\mathcal{F}_k}\) is a regularizer on the hyperplane \(\boldsymbol{w}\), where \( \|\cdot\|_{\mathcal{F}_{k}} \) is the norm induced by \(\left<\cdot,\ \cdot\right>_{\mathcal{F}_k}\). The hyperparameter \(\nu \in (0, 1]\) controls the trade-off in the objective. Separating the data from the origin in feature space translates into finding a halfspace in which most of the data lie; points lying outside this halfspace, i.e. those with \(\left< \boldsymbol{w}, \phi_{k}(\boldsymbol{x})\right>_{\mathcal{F}_k} < \rho\), are deemed anomalous.
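A minimal usage sketch with scikit-learn's `OneClassSVM`, which implements this primal problem; the synthetic data and hyperparameter values here are purely illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # "normal" training data

# nu plays the role of the trade-off hyperparameter in the primal objective.
oc_svm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1)
oc_svm.fit(X_train)

# Negative values of decision_function mean the point lies on the anomalous
# side of the hyperplane, i.e. <w, phi_k(x)> < rho.
X_test = np.array([[0.1, -0.2],    # close to the training distribution
                   [6.0, 6.0]])    # far away, should be flagged
print(oc_svm.decision_function(X_test))   # e.g. [positive, negative]
print(oc_svm.predict(X_test))             # +1 = inlier, -1 = outlier
```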
SVDD. Support Vector Data Description (SVDD) is a technique related to the OC-SVM where a hypersphere is used to describe the data instead of a hyperplane. The objective of SVDD is to find the smallest hypersphere with center \(\boldsymbol{c} \in \mathcal{F}_{k}\) and radius \(R > 0\) that encloses the majority of the data in feature space \(\mathcal{F}_k\). The SVDD primal problem is given by
$$ \min_{R,\ \boldsymbol{c},\ \boldsymbol{\xi}} R^2 + \frac{1}{\nu n}\sum^{n}_{i=1}\xi_i$$
$$ \textbf{s.t.} \ \|\phi_k(\boldsymbol{x}_i)-\boldsymbol{c}\|^{2}_{\mathcal{F}_k} \le R^2+\xi_i,\ \xi_i \ge 0,\ \forall i.$$
Again, slack variables \(\xi_i \ge 0\) allow a soft boundary and the hyperparameter \(\nu \in (0,1]\) controls the trade-off between the penalties \(\xi_i\) and the volume of the sphere. Points that fall outside the sphere, i.e. \(\|\phi_k(\boldsymbol{x})-\boldsymbol{c}\|^2_{\mathcal{F}_k} > R^2\), are deemed anomalous.
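Since the center is a kernel expansion over the training points, \(\boldsymbol{c} = \sum_i \alpha_i \phi_k(\boldsymbol{x}_i)\), the squared distance to the center can be evaluated using kernel values only. A minimal NumPy sketch, assuming dual coefficients `alpha` obtained from some SVDD solver (the solver itself is not shown and the coefficients here are hypothetical):

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """Gaussian kernel on two batches of points: K[i, j] = k(a_i, b_j)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def svdd_score(x_new, X_sv, alpha, gamma=0.5):
    """||phi_k(x) - c||^2 with c = sum_i alpha_i phi_k(x_i), expanded in kernels:
    k(x, x) - 2 sum_i alpha_i k(x, x_i) + sum_{i,j} alpha_i alpha_j k(x_i, x_j)."""
    k_xx = rbf(x_new, x_new, gamma).diagonal()
    k_xs = rbf(x_new, X_sv, gamma)
    k_ss = rbf(X_sv, X_sv, gamma)
    const = alpha @ k_ss @ alpha          # ||c||^2, independent of x
    return k_xx - 2.0 * k_xs @ alpha + const
```

A test point is then flagged as anomalous if `svdd_score` exceeds \(R^2\).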
Formulating the primal problems with hyperparameter \(\nu \in (0,1]\) as in OC-SVM or SVDD is a handy parameterization, since \(\nu\) is an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors (points that are either on or outside the boundary). This result is known as the \(\nu\)-property and allows one to incorporate a prior belief about the fraction of outliers present in the training data into the model.
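The \(\nu\)-property can be checked empirically (it holds approximately on finite samples). A short sketch with scikit-learn's `OneClassSVM` on synthetic data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

for nu in (0.05, 0.1, 0.2):
    model = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)
    outlier_frac = np.mean(model.predict(X) == -1)   # fraction of training outliers
    sv_frac = len(model.support_) / len(X)           # fraction of support vectors
    print(f"nu={nu:.2f}  outliers={outlier_frac:.3f} <= nu <= SVs={sv_frac:.3f}")
```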
Drawbacks. Apart from the necessity to perform explicit feature engineering, another drawback of the aforementioned methods is their poor computational scaling due to the construction and manipulation of the kernel matrix. Kernel-based methods scale at least quadratically in the number of samples unless some approximation technique is used. Moreover, prediction with kernel methods requires storing the support vectors, which can require large amounts of memory.
1.2 Deep Approaches to Anomaly Detection
Deep learning approaches to AD can be categorized into 'mixed' and 'fully deep' approaches. In mixed approaches, representations are learned separately in a preceding step and are then fed into classical (shallow) AD methods such as the OC-SVM. Fully deep approaches, in contrast, employ the representation learning objective directly for detecting anomalies.
Deep Auto-Encoder. Deep auto-encoders are the predominant approach used for deep AD. Auto-encoders are typically trained to minimize the reconstruction error, i.e. \(\|\boldsymbol{x}-\tilde{\boldsymbol{x}}\|^2\). Such networks should therefore be able to extract the common factors of variation from normal samples and reconstruct them accurately, whereas anomalous samples do not contain these common factors of variation and thus cannot be reconstructed accurately. This allows auto-encoders to be used in mixed approaches, by plugging the learned embeddings into classical AD methods, but also in fully deep approaches, by directly employing the reconstruction error as an anomaly score.
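A minimal PyTorch sketch of the fully deep variant (reconstruction error as the anomaly score); the architecture, input dimension, and hyperparameters are illustrative only:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Small fully connected auto-encoder; layer sizes are illustrative."""
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                     nn.Linear(128, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(),
                                     nn.Linear(128, d_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_score(model, x):
    """Anomaly score = squared reconstruction error per sample."""
    with torch.no_grad():
        x_hat = model(x)
    return ((x - x_hat) ** 2).sum(dim=1)

# Training minimizes the same reconstruction error on (assumed) normal data.
model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_batch = torch.randn(64, 784)                 # placeholder for real data
loss = ((x_batch - model(x_batch)) ** 2).sum(dim=1).mean()
opt.zero_grad(); loss.backward(); opt.step()
```

In the mixed variant, `model.encoder(x)` would instead be fed into a shallow detector such as the OC-SVM.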
AnoGAN. In this method one first trains a GAN to generate samples according to the training data. Given a test point, AnoGAN tries to find the point in the generator's latent space that generates the sample closest to the test input. Intuitively, if the GAN has captured the distribution of the training data, then normal samples, i.e. samples from the distribution, should have a good representation in the latent space, while anomalous samples will not. To find this point, gradient descent is performed in latent space while keeping the learned weights of the generator fixed. Finally, AnoGAN also defines its anomaly score via the reconstruction error.
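A minimal sketch of the latent-space search using only the reconstruction term described above; the generator `G` here is a placeholder module standing in for a trained GAN generator, and the latent dimension and step counts are assumptions:

```python
import torch
import torch.nn as nn

def anogan_score(G, x, z_dim=64, steps=200, lr=1e-2):
    """Find z minimizing ||G(z) - x||^2 with G frozen; the final
    reconstruction error serves as the anomaly score."""
    for p in G.parameters():
        p.requires_grad_(False)               # keep the generator weights fixed
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = ((G(z) - x) ** 2).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return ((G(z) - x) ** 2).sum().item()     # reconstruction error at the found z

# Placeholder generator standing in for a trained GAN generator.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784))
x_test = torch.randn(1, 784)
print(anogan_score(G, x_test))
```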
2. Methodology
2.1 The Deep SVDD Objective
Deep SVDD builds on kernel-based SVDD and performs minimum volume estimation by finding a data-enclosing hypersphere of smallest size. In contrast to kernel-based SVDD, however, Deep SVDD learns useful feature representations of the data together with the one-class classification objective. To do this, Deep SVDD employs a neural network that is jointly trained to map the data into a hypersphere of minimum volume.
For some input space \(\mathcal{X} \subset \mathbb{R}^d\) and output space \(\mathcal{F} \subset \mathbb{R}^p\), let \(\phi(\cdot;\ \mathcal{W}) : \mathcal{X} \rightarrow \mathcal{F}\) be a neural network with \(L \in \mathbb{N}\) hidden layers and set of weights \(\mathcal{W} = \{\boldsymbol{W}^1, \ldots, \boldsymbol{W}^L\}\), where \(\boldsymbol{W}^l\) are the weights of layer \(l \in \{1,\ldots,L\}\). That is, \(\phi(\boldsymbol{x};\ \mathcal{W}) \in \mathcal{F}\) is the feature representation of \(\boldsymbol{x} \in \mathcal{X}\) given by network \(\phi\) with parameters \(\mathcal{W}\). The aim of Deep SVDD then is to jointly learn the network parameters \(\mathcal{W}\) together with minimizing the volume of a data-enclosing hypersphere in output space \(\mathcal{F}\) that is characterized by radius \(R > 0\) and center \(\boldsymbol{c} \in \mathcal{F}\), which we assume to be given for now. Given some training data \(\mathcal{D}_n = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\}\) on \(\mathcal{X}\), the soft-boundary Deep SVDD objective is defined as:
$$\min_{R,\ \mathcal{W}} R^2+\frac{1}{\nu n}\sum^{n}_{i=1}\max\{0,\ \| \phi(\boldsymbol{x}_i;\ \mathcal{W}) - \boldsymbol{c}\|^2-R^2\} + \frac{\lambda}{2}\sum^{L}_{l=1}\|\boldsymbol{W}^l\|^2_F.$$
As in kernel SVDD, minimizing \(R^2\) minimizes the volume of the hypersphere. The second term is a penalty for points lying outside the sphere after being passed through the network, i.e. points whose distance to the center \(\|\phi(\boldsymbol{x}_i;\ \mathcal{W})-\boldsymbol{c}\|\) is greater than the radius \(R\). The hyperparameter \(\nu \in (0,1]\) controls the trade-off between the volume of the sphere and violations of the boundary, i.e. allowing some points to be mapped outside the sphere. The last term is a weight decay regularizer on the network parameters \(\mathcal{W}\) with hyperparameter \(\lambda > 0\), where \(\|\cdot\|_F\) denotes the Frobenius norm.
Optimizing this objective lets the network learn parameters \(\mathcal{W}\) such that data points are mapped close to the center \(\boldsymbol{c}\) of the hypersphere. As a result, normal examples are mapped close to the center \(\boldsymbol{c}\), whereas anomalous examples are mapped farther away from the center or outside of the hypersphere.
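A minimal PyTorch sketch of one training step for the soft-boundary objective; the network architecture, fixed center, batch, and hyperparameter values are illustrative assumptions, and the weight decay applied by the optimizer plays the role of the Frobenius-norm term:

```python
import torch
import torch.nn as nn

phi = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))  # phi(.; W)
c = torch.randn(32)                        # hypersphere center, assumed given
R = torch.tensor(0.0, requires_grad=True)  # radius, trained jointly with W here
nu, lam = 0.1, 1e-6

opt = torch.optim.SGD(
    [{"params": phi.parameters(), "weight_decay": lam},  # lambda/2 * ||W||_F^2 term
     {"params": [R], "weight_decay": 0.0}],              # no decay on the radius
    lr=1e-3)

x = torch.randn(64, 784)                   # placeholder batch of training data
dist2 = ((phi(x) - c) ** 2).sum(dim=1)     # ||phi(x_i; W) - c||^2
loss = R ** 2 + (1.0 / nu) * torch.clamp(dist2 - R ** 2, min=0).mean()
opt.zero_grad(); loss.backward(); opt.step()
```

Since `.mean()` divides by the batch size, the penalty term matches the \(\frac{1}{\nu n}\sum_i \max\{0, \cdot\}\) form of the objective on that batch.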
2.2 One-Class Deep SVDD
In many one-class classification tasks, most of the training data \(\mathcal{D}_n\) can be assumed to be normal. For this case, soft-boundary Deep SVDD can be simplified to:
$$\min_{\mathcal{W}} \frac{1}{n}\sum^{n}_{i=1} \|\phi(\boldsymbol{x}_i;\ \mathcal{W})-\boldsymbol{c}\|^2 + \frac{\lambda}{2}\sum^{L}_{l=1}\|\boldsymbol{W}^l\|^2_F.$$
This is the One-Class Deep SVDD objective. One-Class Deep SVDD simply employs a quadratic loss to penalize the distance of every network representation \(\phi(\boldsymbol{x}_i;\ \mathcal{W})\) to \(\boldsymbol{c} \in \mathcal{F}\). The second term again is a network weight decay regularizer with hyperparameter \(\lambda > 0\). One-Class Deep SVDD can also be seen as finding a hypersphere of minimum volume with center \(\boldsymbol{c}\). But unlike in soft-boundary Deep SVDD, where the hypersphere is contracted by penalizing the radius directly together with the data representations that fall outside the sphere, One-Class Deep SVDD contracts the sphere by minimizing the mean distance of all data representations to the center.
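A minimal PyTorch sketch of training with the One-Class objective; the architecture, data, and hyperparameters are illustrative, the center is assumed given (here it is simply set once from an initial forward pass, which is one possible choice), and the optimizer's weight decay again stands in for the Frobenius-norm term:

```python
import torch
import torch.nn as nn

phi = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))  # phi(.; W)
lam = 1e-6
opt = torch.optim.Adam(phi.parameters(), lr=1e-4, weight_decay=lam)

X_train = torch.randn(512, 784)            # placeholder for normal training data
with torch.no_grad():
    c = phi(X_train).mean(dim=0)           # one option: fix c from an initial pass

for epoch in range(10):
    dist2 = ((phi(X_train) - c) ** 2).sum(dim=1)
    loss = dist2.mean()                    # mean squared distance to the center
    opt.zero_grad(); loss.backward(); opt.step()
```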
For a given test point \(\boldsymbol{x} \in \mathcal{X}\), an anomaly score \(s\) can be defined for both variants of Deep SVDD by the distance of the point to the center of the hypersphere, i.e.
$$s(\boldsymbol{x}) = \|\phi(\boldsymbol{x};\ \mathcal{W}^*)-\boldsymbol{c}\|^2$$
where \(\mathcal{W}^*\) are the network parameters of a trained model. For soft-boundary Deep SVDD, the score can be adjusted by subtracting the squared radius \({R^*}^2\) of the trained model, such that anomalies have positive scores, whereas inliers have negative scores.
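A short sketch of this scoring rule, covering both variants (the trained network `phi`, center `c`, and radius `R` are assumed to come from the training sketches above):

```python
import torch

def deep_svdd_score(phi, x, c, R=None):
    """Anomaly score s(x) = ||phi(x; W*) - c||^2; for soft-boundary Deep SVDD,
    subtracting R*^2 makes anomalies positive and inliers negative."""
    with torch.no_grad():
        s = ((phi(x) - c) ** 2).sum(dim=1)
    return s if R is None else s - R ** 2
```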