DSVDD: Deep Support Vector Data Description
3. Properties of Deep SVDD
Proposition 1: All-zero-weights solution. Let \(\mathcal{W}_{0}\) be the set of all-zero network weights, i.e., \(\boldsymbol{W}^{l} = \boldsymbol{0}\) for every \(\boldsymbol{W}^{l} \in \mathcal{W}_{0}\). For this choice of parameters, the network maps any input to the same output, i.e., \(\phi(\boldsymbol{x};\mathcal{W}_{0}) = \phi(\boldsymbol{\tilde{x}}; \mathcal{W}_{0}) =: \boldsymbol{c}_{0} \in \mathcal{F}\) for any \(\boldsymbol{x}, \boldsymbol{\tilde{x}} \in \mathcal{X}\). Then, if \(\boldsymbol{c} = \boldsymbol{c}_{0} \), the optimal solution of Deep SVDD is given by \( \mathcal{W}^{*} = \mathcal{W}_{0}\) and \(R^{*}=0\).
For every configuration \((R, \mathcal{W})\) we have \(J_{soft}(R,\mathcal{W}) \ge 0\) and \(J_{OC}(\mathcal{W}) \ge 0\). As the output of the all-zero-weights network \(\phi(\boldsymbol{x}; \mathcal{W}_{0})\) is constant for every input \(\boldsymbol{x} \in \mathcal{X}\), all errors in the empirical sums of the objectives become zero. Thus, \(R^{*} = 0\) and \( \mathcal{W}^{*} = \mathcal{W}_{0}\) are optimal solutions since \(J_{soft}(R^{*}, \mathcal{W}^{*}) = 0\) and \(J_{OC}(\mathcal{W}^{*}) = 0\) in this case.
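For reference, below is a minimal PyTorch-style sketch of the two objectives referred to in the proof, the One-Class objective \(J_{OC}\) and the soft-boundary objective \(J_{soft}\); the function names and the handling of the weight-decay term (left to the optimizer) are assumptions for illustration, not the authors' implementation.

```python
import torch

def one_class_loss(outputs, c):
    # J_OC: mean squared distance of the mapped points to the fixed center c.
    dist = torch.sum((outputs - c) ** 2, dim=1)
    return torch.mean(dist)

def soft_boundary_loss(outputs, c, R, nu):
    # J_soft: R^2 plus the average hinge penalty (scaled by 1/nu) for points
    # falling outside the hypersphere of radius R around c.
    dist = torch.sum((outputs - c) ** 2, dim=1)
    return R ** 2 + (1.0 / nu) * torch.mean(torch.clamp(dist - R ** 2, min=0))
```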
Proposition 1 implies that if the hypersphere center \(\boldsymbol{c}\) is treated as a free variable in the SGD optimization, Deep SVDD would likely converge to the trivial solution \( (\mathcal{W}^{*}, R^{*}, \boldsymbol{c}^{*}) = (\mathcal{W}_{0}, 0, \boldsymbol{c}_{0})\). This phenomenon is called hypersphere collapse: the network learns weights such that it produces a constant function mapping to the hypersphere center. Proposition 1 also implies that Deep SVDD requires \( \boldsymbol{c} \neq \boldsymbol{c}_{0}\) when fixing \(\boldsymbol{c}\) in output space \(\mathcal{F}\), because otherwise a hypersphere collapse would again be possible. For a CNN with ReLU activation functions, for example, this would require \(\boldsymbol{c} \neq \boldsymbol{0}\). Empirically, the fixed center \(\boldsymbol{c}\) is set to the mean of the network representations obtained from an initial forward pass on some training data sample. Fixing \(\boldsymbol{c}\) in the neighborhood of the initial network outputs also made SGD convergence faster and more robust.
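A minimal sketch of this center initialization, assuming a PyTorch network `net` that maps inputs to the output space \(\mathcal{F}\) and a data loader yielding `(x, y)` pairs; pushing near-zero coordinates away from zero is an additional safeguard (an assumption here) so that \(\boldsymbol{c}\) stays away from the trivial center \(\boldsymbol{c}_{0} = \boldsymbol{0}\) of a bias-free ReLU network.

```python
import torch

@torch.no_grad()
def init_center(net, loader, eps=0.1):
    # Average the network outputs over one initial forward pass on the training data.
    total, n = None, 0
    for x, _ in loader:
        out = net(x)
        total = out.sum(dim=0) if total is None else total + out.sum(dim=0)
        n += out.shape[0]
    c = total / n
    # Nudge coordinates that are too close to zero away from zero (so c != 0).
    c[(c.abs() < eps) & (c < 0)] = -eps
    c[(c.abs() < eps) & (c >= 0)] = eps
    return c
```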
Proposition 2: Bias terms. Let \(\boldsymbol{c} \in \mathcal{F}\) be any fixed hypersphere center. If there is any hidden layer in the network \(\phi(\cdot; \mathcal{W}) : \mathcal{X} \to \mathcal{F}\) having a bias term, there exists an optimal solution \( (R^{*}, \mathcal{W}^{*}) \) of the Deep SVDD objectives with \( R^{*} = 0 \) and \( \phi(\boldsymbol{x}; \mathcal{W}^{*}) = \boldsymbol{c} \) for every \(\boldsymbol{x} \in \mathcal{X} \).
Assume layer \(l \in \{1,\ldots,L\}\) with weights \(\boldsymbol{W}^{l}\) also has a bias term \(\boldsymbol{b}^{l}\). For any input \(\boldsymbol{x} \in \mathcal{X}\), the output of layer \(l\) is then given by \( \boldsymbol{z}^{l}(\boldsymbol{x}) = \sigma^{l}(\boldsymbol{W}^{l}\cdot\boldsymbol{z}^{l-1}(\boldsymbol{x})+\boldsymbol{b}^{l}) \), where "\( \cdot \)" denotes a linear operator, \( \sigma^{l}(\cdot) \) is the activation of layer \(l\), and the output \(\boldsymbol{z}^{l-1}\) of the previous layer \(l-1\) depends on the input \(\boldsymbol{x}\) through the composition of the previous layers. Then, for \(\boldsymbol{W}^{l} = \boldsymbol{0}\), we have \( \boldsymbol{z}^{l}(\boldsymbol{x}) = \sigma^{l}(\boldsymbol{b}^{l})\), i.e., the output of layer \(l\) is constant for every input \( \boldsymbol{x} \in \mathcal{X} \). Therefore, the bias term \(\boldsymbol{b}^{l}\) (and the weights of the subsequent layers) can be chosen such that \(\phi(\boldsymbol{x}; \mathcal{W}^{*}) = \boldsymbol{c}\) for every \(\boldsymbol{x} \in \mathcal{X}\). Hence, selecting \(\mathcal{W}^{*}\) in this way results in an empirical term of zero, and choosing \(R^{*} = 0\) gives the optimal solution (ignoring the weight decay regularization term for simplicity).
Proposition 2 implies that networks with bias terms can easily learn any constant function, which is independent of the input \(\boldsymbol{x} \in \mathcal{X}\). It follows that bias terms should not be used in neural networks with Deep SVDD, since the network can learn the constant function mapping directly to the hypersphere center, leading to hypersphere collapse.
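A tiny numerical illustration of Proposition 2 (a hypothetical sketch, not taken from the text): once the weights of a layer with a bias term are set to zero, its output is the same for any input, so the subsequent layers can map everything onto the chosen center.

```python
import torch
import torch.nn as nn

layer = nn.Linear(10, 2, bias=True)
nn.init.zeros_(layer.weight)           # W^l = 0
nn.init.constant_(layer.bias, 0.5)     # arbitrary bias b^l

x1, x2 = torch.randn(10), torch.randn(10)
print(layer(x1), layer(x2))            # identical constant outputs for different inputs
```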
Proposition 3: Bounded activation functions. Consider a network unit having a monotonic activation function \(\sigma(\cdot)\) that has an upper (or lower) bound with \(\sup_{z} \sigma(z) \neq 0\) (or \(\inf_{z} \sigma(z) \neq 0\)). Then, for a set of unit inputs \(\{ \boldsymbol{z}_1, \ldots, \boldsymbol{z}_n \}\) that have at least one feature that is positive (or negative) for all inputs, the non-zero supremum (or infimum) can be uniformly approximated on the set of inputs.
Without loss of generality, consider the case of \( \sigma \) being upper bounded by \( B := \sup_{z} \sigma(z) \neq 0 \) and feature \( k \) being positive for all inputs, i.e. \( z^{(k)}_{i} > 0\) for every \( i = 1, \ldots, n \). Then, for every \( \epsilon > 0 \), one can always choose the weight \( w_k \) of the \( k \)-th element large enough (setting all other network unit weights to zero) such that \( \sup_{i} | \sigma(w_{k}z^{(k)}_{i}) - B| < \epsilon \).
Proposition 3 simply says that a network unit with a bounded activation function can be saturated for all inputs having at least one feature with common sign, thereby emulating a bias term in the subsequent layer, which again leads to a hypersphere collapse. Therefore, unbounded activation functions (or functions bounded only by 0) such as the ReLU should be preferred in Deep SVDD to avoid a hypersphere collapse due to learned bias terms.
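A quick numerical check of the saturation argument with a sigmoid unit (made-up inputs, purely illustrative): since every input has a positive feature, increasing the weight on that feature drives all outputs arbitrarily close to the supremum \(B = 1\).

```python
import numpy as np

z = np.array([0.3, 1.2, 0.05, 4.0])        # feature k, positive for all inputs
for w in [1.0, 10.0, 100.0]:
    out = 1.0 / (1.0 + np.exp(-w * z))     # sigmoid(w * z_k), sup sigma = 1
    print(w, np.max(np.abs(out - 1.0)))    # max deviation from B shrinks as w grows
```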
Proposition 4: \(\nu\)-property. Hyperparameter \(\nu \in (0, 1]\) in the soft-boundary Deep SVDD objective \(J_{soft}(R, \mathcal{W})\) is an upper bound on the fraction of outliers and a lower bound on the fraction of samples being outside or on the boundary of the hypersphere.
Define \( d_{i} = \| \phi(\boldsymbol{x}_i; \mathcal{W}) - \boldsymbol{c} \|^{2} \) for \( i = 1, \ldots, n\) and assume without loss of generality that \( d_{1} \ge \cdots \ge d_{n}\). Points with \( d_{i} > R^{2} \) are outliers; let \( n_{out} \) denote their number. Decreasing \( R^{2} \) by some amount reduces the first term of \(J_{soft}\) by that amount, but increases each of the \( n_{out} \) penalty terms by the same amount scaled with \( \frac{1}{\nu n} \), so decreasing \( R \) pays off only as long as \( \frac{n_{out}}{n} \le \nu \). Thus, \( \frac{n_{out}}{n} \le \nu \) must hold in the optimum, i.e. \( \nu \) is an upper bound on the fraction of outliers, and the optimal radius \( R^{*} \) is attained for the largest \( n_{out} \) for which this inequality still holds. Finally, \( R^{*2} = d_{i} \) holds for \( i = n_{out} + 1 \), since radius \( R \) is minimal in this case and points on the boundary do not increase the objective. Hence \( |\{ i : d_{i} \ge R^{*2} \}| \ge n_{out}+1 \ge \nu n \), i.e. \( \nu \) is also a lower bound on the fraction of samples being outside or on the boundary of the hypersphere.
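This characterization of \( R^{*} \) suggests a simple way to set the radius during training: take \( R^{2} \) to be the \( (1-\nu) \)-quantile of the squared distances. The helper below is a sketch of that idea and an assumption about the implementation, not a quote of the original code.

```python
import numpy as np

def update_radius(dists_sq, nu):
    # dists_sq: squared distances ||phi(x_i; W) - c||^2 over the training set.
    # The (1 - nu)-quantile leaves at most a nu-fraction of points strictly
    # outside the sphere, in line with the nu-property above.
    return float(np.quantile(dists_sq, 1.0 - nu))
```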
Summary of properties.
- The hypersphere center \( \boldsymbol{c} \) must be fixed to something other than \( \boldsymbol{c}_{0} \), the constant output of the all-zero-weights network (e.g., \( \boldsymbol{c} \neq \boldsymbol{0} \) for ReLU networks).
- Only neural networks without bias terms and without bounded activation functions should be used in Deep SVDD.
- The \( \nu \)-property holds for soft-boundary Deep SVDD, which allows one to include a prior assumption on the fraction of anomalies assumed to be present in the training data.
4. Experiments
Setting. For Deep SVDD, the bias terms are removed in all network units to prevent a hypersphere collapse. In soft-boundary Deep SVDD, \( R \) is found via line search every \( k=5 \) epochs. The hyperparameter \( \nu \) is chosen from \( \nu \in \{ 0.01, 0.1 \} \). The hypersphere center \( \boldsymbol{c} \) is set to the mean of the mapped data after performing an initial forward pass. For optimization, the Adam optimizer is used together with Batch Normalization. Leaky ReLU activations are used with leakiness \( \alpha = 0.1\).
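Putting the setting together, here is a hedged sketch of soft-boundary Deep SVDD training under these choices; the learning rate, epoch count, and the quantile-based radius update are assumptions for illustration (the weight-decay term \(\lambda\) is passed to Adam).

```python
import numpy as np
import torch

def train_soft_boundary(net, loader, c, nu, n_epochs=100, k=5, lr=1e-4, weight_decay=1e-6):
    opt = torch.optim.Adam(net.parameters(), lr=lr, weight_decay=weight_decay)
    R = torch.tensor(0.0)
    for epoch in range(n_epochs):
        dists = []
        for x, _ in loader:
            opt.zero_grad()
            dist = torch.sum((net(x) - c) ** 2, dim=1)
            # Soft-boundary objective (weight decay handled by the optimizer).
            loss = R ** 2 + (1.0 / nu) * torch.mean(torch.clamp(dist - R ** 2, min=0))
            loss.backward()
            opt.step()
            dists.append(dist.detach())
        if (epoch + 1) % k == 0:
            # Radius update every k epochs: (1 - nu)-quantile of the distances.
            R = torch.tensor(np.quantile(torch.cat(dists).numpy(), 1.0 - nu))
    return net, R
```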
4.1 One-class classification on MNIST and CIFAR-10.
In each setup, one of the classes is the normal class and samples from the remaining classes are used to represent anomalies. All images are pre-processed with global contrast normalization using the \( L^{1} \)-norm and finally rescaled to \( [0, 1] \) via min-max scaling.
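A small sketch of this preprocessing; the exact normalization constants of global contrast normalization with the \(L^{1}\)-norm (subtracting the per-image mean, then dividing by the mean absolute deviation) are an assumption here.

```python
import numpy as np

def global_contrast_normalization_l1(x):
    # x: a single image as a float array.
    x = x - x.mean()
    return x / (np.mean(np.abs(x)) + 1e-12)

def min_max_scale(x):
    # Rescale the image to the unit interval [0, 1].
    return (x - x.min()) / (x.max() - x.min() + 1e-12)
```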
Network architectures. For both datasets, LeNet-type CNNs are used, where each convolutional module consists of a convolutional layer followed by leaky ReLU activations and \( 2 \times 2\) max-pooling. On MNIST, the CNN consists of two modules, \( 8 \times (5 \times 5 \times 1) \)-filters followed by \( 4 \times (5 \times 5 \times 1) \)-filters, and a final dense layer of 32 units. On CIFAR-10, the CNN consists of three modules, \( 32 \times (5 \times 5 \times 3) \)-filters, \( 64 \times (5 \times 5 \times 3) \)-filters, and \( 128 \times (5 \times 5 \times 3) \)-filters, followed by a final dense layer of 128 units. The batch size is set to 200, and the weight-decay hyperparameter \(\lambda\) is set to \(10^{-6}\).
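Under the bias-free constraint from Section 3, the MNIST network could look roughly as follows; the use of BatchNorm without affine (bias) parameters and the exact padding are assumptions made to match the layer sizes described above.

```python
import torch.nn as nn

class MNISTDeepSVDDNet(nn.Module):
    def __init__(self, rep_dim=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 5, padding=2, bias=False),    # module 1: 8 filters, 5x5
            nn.BatchNorm2d(8, affine=False),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(8, 4, 5, padding=2, bias=False),    # module 2: 4 filters, 5x5
            nn.BatchNorm2d(4, affine=False),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.fc = nn.Linear(4 * 7 * 7, rep_dim, bias=False)  # final dense layer, 32 units

    def forward(self, x):
        return self.fc(self.features(x).flatten(start_dim=1))
```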
4.1.1 Quantitative results. Deep SVDD clearly outperforms both its shallow and deep competitors on MNIST. On CIFAR-10 the picture is mixed, but Deep SVDD shows an overall strong performance. It is interesting to note that shallow SVDD and KDE perform better than the deep methods on three of the ten CIFAR-10 classes. Notably, One-Class Deep SVDD performs slightly better than its soft-boundary counterpart on both datasets. This may be because the assumption of no anomalies being present in the training data is valid in this experimental scenario.
4.1.2 Qualitative results. The normal examples of the classes on which KDE performs best seem to have strong global structures. For example, TRUCK images are mostly divided horizontally into street and sky, and DEER as well as FROG have globally similar colors. For these classes, choosing local CNN features can be questioned. These cases underline the importance of network architecture choice.
4.2 Adversarial attacks on GTSRB stop signs.
Detecting adversarial attacks is vital in many applications such as autonomous driving. The "stop sign" class of the German Traffic Sign Recognition Benchmark (GTSRB) dataset is considered, and adversarial examples are generated from randomly drawn stop sign images of the test set using Boundary Attack. Models are again trained only on normal stop sign samples, and at test time it is checked whether adversarial examples are correctly detected. The training set contains \(n=780\) stop signs. The test set is composed of 270 normal examples and 20 adversarial examples. Data is pre-processed by removing the 10% border around each sign and then resizing every image to \(32 \times 32\) pixels. After that, global contrast normalization using the \(L^{1}\)-norm and rescaling to the unit interval \([0, 1]\) are applied.
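A sketch of the cropping and resizing step (using PIL; interpreting "removing the 10% border" as cropping 10% from each side is an assumption):

```python
from PIL import Image

def crop_and_resize_sign(img: Image.Image, out_size=32):
    w, h = img.size
    # Remove a 10% border on each side, then resize to 32 x 32 pixels.
    img = img.crop((int(0.1 * w), int(0.1 * h), int(0.9 * w), int(0.9 * h)))
    return img.resize((out_size, out_size), Image.BILINEAR)
```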
Network architecture. The CNN with LeNet architecture consists of three convolutional modules, \(16 \times (5 \times 5 \times 3)\)-filters, \(32 \times (5 \times 5 \times 3)\)-filters, and \(64 \times (5 \times 5 \times 3)\)-filters, followed by a final dense layer of 32 units. The batch size is set to 64, which is smaller than in the previous case due to the dataset size. Hyperparameter \(\lambda\) is set to \(10^{-6}\), the same as in the one-class classification experiments.
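Analogously to the MNIST sketch above, a bias-free version of this architecture could look as follows (again illustrative, with bias-free BatchNorm assumed):

```python
import torch.nn as nn

class GTSRBDeepSVDDNet(nn.Module):
    def __init__(self, rep_dim=32):
        super().__init__()
        def block(c_in, c_out):
            # Convolutional module: conv (no bias) -> BN (no affine) -> LeakyReLU -> 2x2 max-pool.
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 5, padding=2, bias=False),
                nn.BatchNorm2d(c_out, affine=False),
                nn.LeakyReLU(0.1),
                nn.MaxPool2d(2),
            )
        self.features = nn.Sequential(block(3, 16), block(16, 32), block(32, 64))
        self.fc = nn.Linear(64 * 4 * 4, rep_dim, bias=False)  # final dense layer, 32 units

    def forward(self, x):
        return self.fc(self.features(x).flatten(start_dim=1))
```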
4.2.1 Quantitative results. One-Class Deep SVDD again shows the best performance. Generally, the deep methods perform better.
4.2.2 Qualitative results. The most anomalous samples detected by One-Class Deep SVDD are either adversarial attacks or images taken from odd perspectives or cropped incorrectly.
5. Conclusion
Deep SVDD jointly trains a deep neural network while optimizing a data-enclosing hypersphere in output space. In doing so, Deep SVDD extracts the common factors of variation from the data. Theoretical properties of Deep SVDD were demonstrated, such as the \(\nu\)-property, which allows one to incorporate a prior assumption on the number of anomalies present in the training data.