异常值检测

Mar 24, 2024 · Xiebro

library(tidyverse)
library(hrbrthemes)
theme_set(theme_ipsum(base_family = "Kai",
                      base_size = 8))

Dummy Data with Outliers

set.seed(1234)
N = 1e3
x = c(rnorm(N, 0, 0.5), rnorm(N * 0.05, -1, 1))
y = c(rnorm(N, 0, 0.5), rnorm(N * 0.05,  1, 1))
o = c(rep(0L, N), rep(1L, (N * 0.05)))
mm  <- matrix(x, y, nrow = N * 1.05, ncol = 2)
dat <- tibble(x, y, outlier = o)

dat |>
    ggplot(aes(x, y, col = factor(o))) +
    geom_vline(xintercept = 0) + 
    geom_hline(yintercept = 0) + 
    geom_point(alpha = 0.7) +
    scale_color_manual(values = c("0" = "blue", "1" = "red")) +
    guides(color = "none") +
    labs(title = "Dummy Data with Outliers")

Kmeans-Clustering

既然我们知道了我们有两类数据组，离群值和非离群值，我们能不能利用聚类区分他们？

km <- kmeans(mm, centers = 2)

dat |> 
    ggplot(aes(x, y, fill = factor(km$cluster))) + 
    geom_vline(xintercept = 0) + 
    geom_hline(yintercept = 0) + 
    geom_point(alpha = 0.7, shape = 21, col = "black") + 
    scale_fill_brewer(guide = "none", palette = "Dark2")

DBSCAN

DBSCAN 算法可以抽象为以下步骤：

找到每个点的 ε（eps）邻域中的点，并识别具有 m 个 (minPts) 邻居的核心点 (core points);
在邻居图上查找核心点的连通组件，忽略所有非核心点;
如果群集是 ε（eps）邻居，则将每个非核心点分配给附近的群集，否则将其分配给噪声;

library(dbscan)
# do density-based clustering
dbs <- dbscan(mm, eps = .3, minPts = 5, borderPoints = F)

dat |> 
    ggplot(aes(x, y, fill = factor(dbs$cluster))) + 
    geom_vline(xintercept = 0) + 
    geom_hline(yintercept = 0) + 
    geom_point(alpha = 0.7, shape = 21, col = "black") + 
    scale_fill_manual(values = c("red", "lightblue")) +
    guides(fill = "none")

One-Class SVM

基本的原理就是利用 svm 进行单分类（正常类）

# One Class SVM ----------------------------------------------

library(e1071)

# nu is determined with hindsight over here
svmfit <- svm(mm, type = "one-classification", nu = 0.1)

dat |> 
    mutate(is_outlier = predict(svmfit, mm)) %>% 
    ggplot(aes(x, y, fill = factor(is_outlier))) + 
    geom_vline(xintercept = 0) + 
    geom_hline(yintercept = 0) + 
    geom_point(alpha = 0.7, shape = 21, col = "black") + 
    scale_fill_manual(values = c("red", "lightblue")) +
    guides(fill = "none")

Isolation Forest

孤立森林的核心假设是，异常数据只占很少量、且特征值和正常数据差别很大。因此如果使用递归地随机分割数据集，直到所有的样本点都是孤立的。在这种随机分割的策略下，异常点通常具有较短的路径。

library(solitude)

iso = isolationForest$new()
isofit <- dat |> select(-outlier) |> iso$fit()

## INFO  [21:35:06.549] Building Isolation Forest ...
## INFO  [21:35:06.981] done
## INFO  [21:35:06.984] Computing depth of terminal nodes ...
## INFO  [21:35:07.224] done
## INFO  [21:35:07.234] Completed growing isolation forest

dat |> 
    bind_cols(iso$predict(dat)) |> 
    mutate(is_outlier = anomaly_score > 0.65) |> 
    ggplot(aes(
        x    = x,
        y    = y,
        fill = factor(-is_outlier)
    )) + 
    geom_vline(xintercept = 0) + 
    geom_hline(yintercept = 0) + 
    geom_point(alpha = 0.7, shape = 21, col = "black") + 
    scale_fill_manual(values = c("red", "lightblue")) +
    guides(fill = "none")