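Conversion-rate forecasting: segmentation methods, feature screening, and explainable attribution
In subscription, renewal, and repeat-purchase businesses, the question operations and marketing teams ask most often is:
"Of the batch of users who just came in, how many will eventually renew?"
The hard part of forecasting is not producing a number. It is two other things:
- Explainability: the forecast has to tell the business side why the number is what it is;
- Attributability: when the forecast deviates from the actual, the error can be traced to a specific part of the user base or a specific part of the service process.
This post lays out a forecasting and attribution method for the current-period conversion rate. It covers a comparison of three approaches, feature-screening criteria, model evaluation, and how to break down the causes when a deviation appears.
Simulated data
We construct a sample of subscription users spanning several periods (cohorts); each user has a few observable features and a conversion label: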
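library(tidyverse)
library(glmnet)
set.seed(42)
# Assumed stand-ins for the blog's own plotting helpers, which the plots below
# use but this post never defines; the colour values are placeholders.
palette_blog <- c(ink = "#222222", ash = "#8a8a8a", blue = "#3b6ea5",
                  rose = "#c4587a", sage = "#7a9e7e")
theme_blog <- function() theme_minimal(base_size = 11)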
simulate_users <- function(cohort_id, n = 1000) {
city_tier <- sample(1:5, n, replace = TRUE,
prob = c(0.15, 0.25, 0.25, 0.20, 0.15))
age_group <- sample(1:6, n, replace = TRUE)
channel <- sample(c("paid", "organic", "referral", "live"),
n, replace = TRUE)
active_days <- pmax(0, round(rnorm(n, mean = 15 + 0.3 * cohort_id, sd = 5)))
task_rate <- pmin(1, pmax(0, rnorm(n, mean = 0.55, sd = 0.2)))
# Latent conversion probability: the hidden true causal structure
logit_p <- -2 +
0.30 * (city_tier <= 2) +
0.20 * (channel == "referral") -
0.15 * (channel == "paid") +
0.04 * active_days +
1.20 * task_rate
p <- 1 / (1 + exp(-logit_p))
tibble(
cohort_id, city_tier, age_group, channel,
active_days, task_rate,
converted = rbinom(n, 1, p)
)
}
df <- map_dfr(1:12, simulate_users)
The full dataset covers 12 cohorts of 1,000 users each. First, a look at the conversion-rate trend by cohort:
trend_df <- df |>
group_by(cohort_id) |>
summarise(rate = mean(converted), .groups = "drop")
grand_mean <- mean(trend_df$rate)
ggplot(trend_df, aes(cohort_id, rate)) +
geom_hline(yintercept = grand_mean,
color = palette_blog["ash"], linetype = 2, linewidth = 0.4) +
annotate("text", x = 12.3, y = grand_mean,
label = sprintf("avg %.1f%%", grand_mean * 100),
hjust = 0, vjust = 0.5,
color = palette_blog["ash"], size = 3.2) +
geom_area(alpha = 0.10, fill = palette_blog["blue"]) +
geom_line(color = palette_blog["blue"], linewidth = 1) +
geom_point(color = palette_blog["blue"], size = 2.4) +
geom_text(aes(label = sprintf("%.1f%%", rate * 100)),
vjust = -1.1, size = 2.9,
color = palette_blog["ink"]) +
scale_x_continuous(breaks = 1:12,
expand = expansion(mult = c(0.02, 0.08))) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1),
expand = expansion(mult = c(0.05, 0.18))) +
labs(
x = "Cohort",
y = NULL,
title = "Overall conversion rate by cohort",
subtitle = "Aggregated rate over 12 simulated cohorts"
) +
theme_blog()
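The overall rate drifts upward slightly along with activity (the mean of active_days rises with cohort_id), but a single time series shows nothing about structural differences between segments.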
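Three forecasting approaches
Let $R_t$ be the current-period conversion rate at time $t$. The three approaches:
| Method | Formula | Characteristics |
|---|---|---|
| Global forecast | $R_t = c + \sum_{i=1}^{p} \phi_i R_{t-i} + \epsilon_t$ | Univariate time series; captures the overall cycle |
| Segment forecast | $R_t = \sum_{g=1}^{k} p_g R_{tg}$ | Multivariate; explainable and attributable |
| Individual forecast | $R_t = \frac{1}{N}\sum_{i=1}^{N} R_{ti},\ R_{ti}=P(y_i=1 \mid X_i)$ | Multivariate; scores each user |
Here $p_g$ is the share of segment $g$, $R_{tg}$ is the conversion rate of segment $g$, and $R_{ti}$ is the predicted conversion probability of user $i$.
Intuitively, each approach has its own strength:
- time-series models are good at seasonality;
- individual models are good at scoring and ranking;
- segment models are good at explaining structure.
Which one suits an aggregate metric best depends on the data. Below, each method is run over a rolling window: the 3 cohorts before $t$ are used for training and cohort $t$ is predicted. The global method is implemented as a 3-cohort rolling mean, which is the autoregressive form in the table with $c = 0$, $p = 3$, and $\phi_i = 1/3$.
Global forecast (rolling mean)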
hist_rate <- df |>
group_by(cohort_id) |>
summarise(rate = mean(converted), .groups = "drop")
trend_predict <- function(rates, t, window = 3) {
mean(rates[(t - window):(t - 1)])
}
trend_forecast <- map_dbl(4:12, \(t) trend_predict(hist_rate$rate, t))
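Segment forecast
Each user falls into one segment cell (the Cartesian product of the segmenting features, here city_tier × channel). The forecast for a cohort is the historical conversion rate of each cell weighted by that cell's share of the current cohort: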
group_predict <- function(df, target_t, hist_window = 3) {
hist <- df |>
filter(cohort_id >= target_t - hist_window, cohort_id < target_t) |>
group_by(city_tier, channel) |>
summarise(hist_rate = mean(converted), .groups = "drop")
curr <- df |>
filter(cohort_id == target_t) |>
group_by(city_tier, channel) |>
summarise(n_user = n(), .groups = "drop") |>
mutate(curr_prop = n_user / sum(n_user))
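# Segments with no history fall back to the pooled historical rate, so the
# weights still sum to 1 (dropping them via na.rm would shrink the forecast).
pooled_rate <- df |>
  filter(cohort_id >= target_t - hist_window, cohort_id < target_t) |>
  summarise(rate = mean(converted)) |>
  pull(rate)
curr |>
  left_join(hist, by = c("city_tier", "channel")) |>
  mutate(hist_rate = coalesce(hist_rate, pooled_rate)) |>
  summarise(pred = sum(hist_rate * curr_prop)) |>
  pull(pred)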
}
group_forecast <- map_dbl(4:12, \(t) group_predict(df, t))
Individual forecast
individual_predict <- function(df, target_t, hist_window = 3) {
train <- df |>
filter(cohort_id >= target_t - hist_window, cohort_id < target_t)
test <- df |>
filter(cohort_id == target_t)
fm <- converted ~ city_tier + age_group + factor(channel) +
active_days + task_rate - 1
x_train <- model.matrix(fm, train)
y_train <- train$converted
x_test <- model.matrix(fm, test)
fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0.5)
mean(as.numeric(predict(fit, x_test, type = "response", s = "lambda.min")))
}
indiv_forecast <- map_dbl(4:12, \(t) individual_predict(df, t))
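Feature engineering
Encoding: cutting features into three interpretable tiers
To give each feature a built-in business meaning, every candidate feature is binned by its historical conversion propensity into three tiers: 0 = high, 1 = medium, 2 = low. A production version could use chi-square binning (ChiMerge). The hand-written simplification below does not run a chi-square test: it ranks the levels of a feature by historical conversion rate and cuts the ranks into equal-width bins, so a continuous feature has to be discretized first (for example with ntile()).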
encode_by_target <- function(feature, target, n_bins = 3) {
rate_by_level <- tapply(target, feature, mean)
ranks <- rank(-rate_by_level, ties.method = "first")
bin <- cut(ranks, breaks = n_bins, labels = 0:(n_bins - 1)) |>
as.character() |> as.integer()
setNames(bin, names(rate_by_level))
}
encode_by_target(df$city_tier, df$converted)
## 1 2 3 4 5
## 0 0 2 1 2
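The benefit: whichever model comes next, the feature already explains itself. An encoded city_tier of 0 simply means "the city tier with the highest historical conversion rate".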
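To show how such a mapping might be applied to new data, here is a minimal base-R sketch. It assumes the mapping is a named integer vector like the one printed above; `city_map`, `new_city`, and the unseen level 6 are hypothetical names and values introduced only for illustration.

```r
# Mapping copied from the printed output above: names are city_tier levels,
# values are bin codes (0 = tier with the highest historical conversion).
city_map <- c("1" = 0L, "2" = 0L, "3" = 2L, "4" = 1L, "5" = 2L)

# Hypothetical new cohort; level 6 never appeared in the history.
new_city <- c(2, 5, 4, 1, 6)

# Look up each user's level by name; unname() drops the lookup keys.
encoded <- unname(city_map[as.character(new_city)])
encoded          # 0 2 1 0 NA

# Unseen levels come back as NA and need an explicit fallback decision.
is.na(encoded)   # FALSE FALSE FALSE FALSE TRUE
```

In a rolling setup the mapping would presumably be fitted on the training cohorts only, so that the labels of the forecast cohort do not leak into its own features.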
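Screening: PSI and IV together
Before a feature goes live, two questions need answers: is it stable enough, and is it useful enough?
Stability: PSI (Population Stability Index)
$$ \text{PSI} = \sum_{i=1}^{n} (A_i - E_i)\,\ln\!\left(\frac{A_i}{E_i}\right) $$
where $E_i$ and $A_i$ are the shares of bin $i$ in the expected (baseline) sample and the actual (current) sample.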
psi <- function(expected, actual, eps = 1e-6) {
e_dist <- prop.table(table(expected))
a_dist <- prop.table(table(actual))
k <- union(names(e_dist), names(a_dist))
e <- as.numeric(e_dist[k]); e[is.na(e)] <- eps
a <- as.numeric(a_dist[k]); a[is.na(a)] <- eps
sum((a - e) * log(a / e))
}
Rule-of-thumb thresholds:
- PSI < 0.1: stable;
- 0.1 ≤ PSI < 0.25: needs attention;
- PSI ≥ 0.25: recalibration recommended.
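A small worked example may make the formula and the bands concrete. The sketch below is base R only; the three-bin shares and the helper names `psi_from_shares` and `psi_band` are hypothetical and not part of the original pipeline.

```r
# PSI from two vectors of bin shares (each vector should sum to 1).
psi_from_shares <- function(expected, actual) {
  sum((actual - expected) * log(actual / expected))
}

# Hypothetical three-bin feature: baseline cohort vs current cohort.
expected <- c(low = 0.5, mid = 0.3, high = 0.2)
actual   <- c(low = 0.4, mid = 0.3, high = 0.3)

psi_value <- psi_from_shares(expected, actual)
round(psi_value, 4)  # 0.0629

# Map PSI values to the bands listed above. right = FALSE closes each
# interval on the left, so 0.10 is "watch" and 0.25 is "recalibrate".
psi_band <- function(x) {
  cut(x, breaks = c(-Inf, 0.10, 0.25, Inf),
      labels = c("stable", "watch", "recalibrate"), right = FALSE)
}
as.character(psi_band(c(psi_value, 0.10, 0.586)))
# "stable" "watch" "recalibrate"
```

The unchanged middle bin contributes zero, so the whole PSI here comes from the ten-point move of share from the low bin to the high bin.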
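Predictive power: IV (Information Value)
$$ \text{IV} = \sum_{i=1}^{n} (P_i - Q_i)\,\ln\!\left(\frac{P_i}{Q_i}\right) $$
where $P_i$ and $Q_i$ are the shares of converters and of non-converters that fall in bin $i$.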
iv <- function(feature, target, eps = 1e-6) {
tbl <- table(feature, target)
pos <- tbl[, "1"] / sum(tbl[, "1"])
neg <- tbl[, "0"] / sum(tbl[, "0"])
pos[pos == 0] <- eps; neg[neg == 0] <- eps
sum((pos - neg) * log(pos / neg))
}
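Rule-of-thumb thresholds:
- IV < 0.02: almost no predictive power;
- 0.02 ≤ IV < 0.1: weak;
- 0.1 ≤ IV < 0.3: medium;
- 0.3 ≤ IV < 0.5: strong;
- IV ≥ 0.5: very strong.
The screening rule used in practice: PSI < 0.1 AND IV > 0.02.
Compute PSI (cohort 1 vs cohort 12) and IV (full sample) for every feature and visualize the screening result: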
df_bin <- df |>
mutate(active_days_bin = ntile(active_days, 5),
task_rate_bin = ntile(task_rate, 5))
features <- c("city_tier", "age_group", "channel",
"active_days_bin", "task_rate_bin")
feature_table <- map_dfr(features, \(f) {
tibble(
feature = f,
psi_val = psi(df_bin[[f]][df_bin$cohort_id == 1],
df_bin[[f]][df_bin$cohort_id == 12]),
iv_val = iv(df_bin[[f]], factor(df_bin$converted))
)
})
feature_table
## # A tibble: 5 × 3
## feature psi_val iv_val
## <chr> <dbl> <dbl>
## 1 city_tier 0.00125 0.0174
## 2 age_group 0.0124 0.00111
## 3 channel 0.00461 0.0155
## 4 active_days_bin 0.586 0.0380
## 5 task_rate_bin 0.00452 0.0548
thresholds <- tibble(
metric = factor(c("PSI (stability)", "IV (predictive power)"),
levels = c("PSI (stability)", "IV (predictive power)")),
y = c(0.10, 0.02),
lbl = c("threshold = 0.10", "threshold = 0.02")
)
plot_df <- feature_table |>
pivot_longer(c(psi_val, iv_val),
names_to = "metric", values_to = "value") |>
mutate(metric = factor(recode(metric,
psi_val = "PSI (stability)",
iv_val = "IV (predictive power)"),
levels = c("PSI (stability)",
"IV (predictive power)"))) |>
group_by(metric) |>
arrange(value, .by_group = TRUE) |>
mutate(feature_label = factor(paste0(feature, "__", metric),
levels = paste0(feature, "__", metric))) |>
ungroup()
ggplot(plot_df,
aes(feature_label, value, fill = metric)) +
geom_col(width = 0.7) +
geom_text(aes(label = sprintf("%.3f", value)),
hjust = -0.15, size = 3, color = palette_blog["ink"]) +
geom_hline(data = thresholds,
aes(yintercept = y),
linetype = 2, color = palette_blog["ash"], linewidth = 0.4) +
geom_text(data = thresholds,
aes(x = 0.6, y = y, label = lbl),
hjust = 0, vjust = -0.4, size = 3,
color = palette_blog["ash"],
inherit.aes = FALSE) +
facet_wrap(~ metric, scales = "free") +
coord_flip(clip = "off") +
scale_x_discrete(labels = \(x) sub("__.*$", "", x)) +
scale_y_continuous(expand = expansion(mult = c(0.02, 0.22))) +
scale_fill_manual(values = c(
"PSI (stability)" = palette_blog[["rose"]],
"IV (predictive power)" = palette_blog[["blue"]]
)) +
labs(
x = NULL, y = NULL,
title = "Feature screening: PSI vs IV",
subtitle = "Keep features with PSI < 0.10 (stable) and IV > 0.02 (informative)"
) +
theme_blog() +
theme(legend.position = "none")
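The dashed lines mark the screening thresholds. In the left panel (PSI) smaller means more stable; in the right panel (IV) larger means more predictive. With the values in the table above, only task_rate_bin clears both thresholds: active_days_bin fails on PSI (0.586, because its mean rises with cohort_id by construction), age_group carries almost no signal, and city_tier and channel fall just short of the IV threshold (0.0174 and 0.0155).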
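The screening rule can also be applied mechanically rather than read off the chart. A minimal base-R sketch, using the values printed in the feature_table output above; the column names `stable`, `informative`, and `keep` are introduced here for illustration.

```r
# Values copied from the feature_table output printed above.
feature_table <- data.frame(
  feature = c("city_tier", "age_group", "channel",
              "active_days_bin", "task_rate_bin"),
  psi_val = c(0.00125, 0.0124, 0.00461, 0.586, 0.00452),
  iv_val  = c(0.0174, 0.00111, 0.0155, 0.0380, 0.0548),
  stringsAsFactors = FALSE
)

# Flag each criterion separately so a rejected feature shows why it failed.
feature_table$stable      <- feature_table$psi_val < 0.10
feature_table$informative <- feature_table$iv_val > 0.02
feature_table$keep        <- feature_table$stable & feature_table$informative

feature_table[, c("feature", "stable", "informative", "keep")]

kept <- feature_table$feature[feature_table$keep]
kept  # "task_rate_bin"
```

Keeping the two flags apart makes it easy to tell a drifting feature (one that might be re-binned) from an uninformative one (one that can simply be dropped).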
Model evaluation: segment versus trend and individual
actual <- hist_rate$rate[4:12]
mae <- function(pred, actual) mean(abs(pred - actual))
mae_tbl <- tibble(
method = c("trend", "group", "individual"),
mae = c(
mae(trend_forecast, actual),
mae(group_forecast, actual),
mae(indiv_forecast, actual)
)
) |> arrange(mae)
mae_tbl
## # A tibble: 3 × 2
## method mae
## <chr> <dbl>
## 1 individual 0.0144
## 2 group 0.0146
## 3 trend 0.0152
eval_long <- tibble(
cohort_id = 4:12,
trend = trend_forecast,
group = group_forecast,
individual = indiv_forecast,
actual = actual
) |>
pivot_longer(c(trend, group, individual),
names_to = "method", values_to = "pred") |>
left_join(mae_tbl, by = "method") |>
mutate(facet_label = sprintf("%s · MAE = %.3f", method, mae),
facet_label = factor(facet_label,
levels = unique(facet_label[order(mae)])))
method_colors <- c(
group = palette_blog[["blue"]],
trend = palette_blog[["sage"]],
individual = palette_blog[["rose"]]
)
ggplot(eval_long, aes(cohort_id)) +
geom_line(aes(y = actual),
color = palette_blog["ink"], linewidth = 0.9, alpha = 0.85) +
geom_point(aes(y = actual),
color = palette_blog["ink"], size = 1.8, alpha = 0.85) +
geom_line(aes(y = pred, color = method), linewidth = 1) +
geom_point(aes(y = pred, color = method), size = 2.2) +
facet_wrap(~ facet_label, ncol = 3) +
scale_x_continuous(breaks = c(4, 6, 8, 10, 12)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_color_manual(values = method_colors) +
labs(
x = "Cohort", y = NULL,
title = "Predicted vs actual conversion rate",
subtitle = "Dark line = actual; colored line = predicted. Methods ordered by MAE."
) +
theme_blog() +
theme(legend.position = "none")
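Ordered by accuracy, the ranking commonly seen in practice is:
$$ \text{segment} > \text{trend} > \text{individual} $$
The simulation above does not reproduce this ranking: the three MAEs (0.0144, 0.0146, 0.0152) lie within 0.1 percentage points of each other, with the individual model marginally ahead. That is unsurprising here, because the data is generated by a logistic model on the same features the individual model sees, and the city_tier × channel mix is stationary across cohorts. The intuition behind the usual ranking:
- Individual models carry calibration bias into the aggregate. Averaging scores cancels independent per-user noise, but a systematic miscalibration of 1 to 2 percentage points per user (for example from regularization shrinkage or a shift in the feature distribution) passes through to the mean unchanged, so the aggregate can be further off than a model fitted directly on the macro quantity.
- Trend models lack structure. When the composition of the user base changes (for example, one channel's share suddenly jumps or drops), a univariate time series cannot absorb it.
- Segment models are the most stable. The forecast is essentially "known segment conversion rate × known segment share", so the structure is clear and deviations are easy to trace to their source.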
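The point about composition changes can be illustrated with a deterministic toy case. In the base-R sketch below, segment rates never change and only the mix moves in the fourth cohort; the segment names, rates, and shares are all hypothetical.

```r
# Two hypothetical segments whose conversion rates never change.
seg_rate <- c(referral = 0.30, paid = 0.10)

# Segment shares per cohort: stable for three cohorts, then referral surges.
mix <- rbind(
  c1 = c(referral = 0.5, paid = 0.5),
  c2 = c(referral = 0.5, paid = 0.5),
  c3 = c(referral = 0.5, paid = 0.5),
  c4 = c(referral = 0.8, paid = 0.2)
)

# Aggregate rate per cohort = sum over segments of share x rate.
agg_rate <- as.numeric(mix %*% seg_rate)

actual_c4  <- agg_rate[4]                  # 0.26
trend_c4   <- mean(agg_rate[1:3])          # 0.20, rolling mean of c1..c3
segment_c4 <- sum(mix["c4", ] * seg_rate)  # 0.26, historical rates x current mix

c(actual = actual_c4, trend = trend_c4, segment = segment_c4)
```

The rolling mean misses by six percentage points because it only sees the aggregate, while the segment forecast is exact in this idealized case since the within-segment rates did not move.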
For aggregate metrics, a finer-grained model is not always a more accurate one.
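Explainability: decision path and attribution
Decision path
The decision path of the segment forecast is very shallow, with only two steps:
- compute the table of segment conversion rates over the last N cohorts;
- take the sum of those historical rates weighted by the current segment shares.
Nothing in the model is a black box.
Attribution formula
When the forecast deviates from the actual, look at "user structure" and "service quality" separately:
$$ R_t = \sum_{g=1}^{k} k_{gn}\, p_g\, R_{tg} $$
- $p_g$: the share of segment $g$, determined by the users' own attributes (city, age, activity, and so on);
- $R_{tg}$: the historical conversion rate of segment $g$, used as the baseline;
- $k_{gn}$: the scaling coefficient applied to the conversion rate of segment $g$ at service level $n$, determined by service quality.
This splits "why was the forecast off" into two independent lines of explanation.
Structural attribution (user side)
Locate the segments whose conversion rate shifted the most and whose share is largest: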
attribution_structural <- function(df, target_t, hist_window = 3) {
hist <- df |>
filter(cohort_id >= target_t - hist_window, cohort_id < target_t) |>
group_by(city_tier, channel) |>
summarise(hist_rate = mean(converted), .groups = "drop")
curr <- df |>
filter(cohort_id == target_t) |>
group_by(city_tier, channel) |>
summarise(
curr_rate = mean(converted),
n_user = n(),
.groups = "drop"
) |>
mutate(curr_prop = n_user / sum(n_user))
curr |>
left_join(hist, by = c("city_tier", "channel")) |>
mutate(
rate_delta = curr_rate - hist_rate,
contrib = rate_delta * curr_prop
)
}
attr_t12 <- attribution_structural(df, target_t = 12)
attr_top <- attr_t12 |>
arrange(desc(abs(contrib))) |>
head(10) |>
mutate(
label = sprintf("city = %d · %s", city_tier, channel),
sign = ifelse(contrib > 0, "lift", "drag"),
hjust_v = ifelse(contrib > 0, -0.15, 1.15)
)
ggplot(attr_top,
aes(reorder(label, contrib), contrib, fill = sign)) +
geom_col(width = 0.7) +
geom_text(aes(label = sprintf("%+.2f pp", contrib * 100),
hjust = hjust_v),
size = 3, color = palette_blog["ink"]) +
geom_hline(yintercept = 0,
color = palette_blog["ash"], linewidth = 0.3) +
coord_flip(clip = "off") +
scale_y_continuous(
labels = scales::percent_format(accuracy = 0.1),
expand = expansion(mult = c(0.18, 0.18))
) +
scale_fill_manual(
values = c(lift = palette_blog[["blue"]],
drag = palette_blog[["rose"]]),
breaks = c("lift", "drag"),
labels = c("Positive contribution", "Negative contribution")
) +
labs(
x = NULL, y = "Contribution to overall delta (percentage points)",
title = "Top contributors to cohort-12 conversion-rate delta",
subtitle = "contrib = (curr_rate − hist_rate) × curr_prop",
fill = NULL
) +
theme_blog()
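contrib is the contribution of each segment cell to the change in the overall conversion rate. Because the segment forecast already uses the current segment shares, the contrib values sum to the forecast error (actual minus segment forecast) whenever every current segment has history. This "segment × share × rate shift" output can be handed directly to business stakeholders for discussion.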
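contrib isolates the rate shift inside each segment at the current mix. When the mix itself moves between the history window and the current cohort, a standard mix/rate decomposition splits the total change into two additive terms. This decomposition is not part of the original method; it is added here as a sketch, and every number and column name below is hypothetical.

```r
# Hypothetical two-segment example: historical vs current share and rate.
seg <- data.frame(
  segment   = c("referral", "paid"),
  hist_prop = c(0.5, 0.5),
  curr_prop = c(0.8, 0.2),
  hist_rate = c(0.30, 0.10),
  curr_rate = c(0.28, 0.12),
  stringsAsFactors = FALSE
)

# Mix effect: the share moved, evaluated at the historical rate.
seg$mix_effect  <- (seg$curr_prop - seg$hist_prop) * seg$hist_rate
# Rate effect: the rate moved, evaluated at the current share
# (the same quantity as contrib above).
seg$rate_effect <- (seg$curr_rate - seg$hist_rate) * seg$curr_prop

total_delta <- sum(seg$curr_prop * seg$curr_rate) -
  sum(seg$hist_prop * seg$hist_rate)

# The two effects add up to the total change exactly.
c(mix = sum(seg$mix_effect), rate = sum(seg$rate_effect), total = total_delta)
# mix = 0.060, rate = -0.012, total = 0.048
```

In this toy case the aggregate rose by 4.8 points even though the rate term is negative: the shift toward the higher-converting segment more than offsets the slight decline within it.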
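Service-quality attribution (Spearman correlation)
Service-quality metrics are usually volatile and sparse, which makes them unsuitable as direct model inputs, but the Spearman rank correlation coefficient supports qualitative, after-the-fact attribution:
$$ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$
Here $d_i$ is the difference between the two ranks of observation $i$ and $n$ is the number of observations; the formula is exact when there are no ties. It suits non-normal distributions, ordinal data, and cases where the relationship of interest is monotonic rather than linear.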
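As a quick check of the formula, the sketch below computes ρ by hand on five hypothetical, tie-free observations and compares it with `cor(..., method = "spearman")`. The vectors `qa` and `rate` are made up for illustration. The `cor.test()` call is a suggestion rather than part of the original workflow: with so few cohorts, a significance check seems prudent before reading much into ρ.

```r
# Hypothetical service-quality scores and conversion rates for 5 cohorts.
qa   <- c(0.61, 0.66, 0.64, 0.70, 0.75)
rate <- c(0.21, 0.24, 0.22, 0.23, 0.27)

n <- length(qa)
d <- rank(qa) - rank(rate)                 # rank differences: 0 -1 0 1 0

rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
rho_builtin <- cor(qa, rate, method = "spearman")
c(formula = rho_formula, builtin = rho_builtin)  # both 0.9

# p-value for the rank correlation at this sample size.
p_value <- cor.test(qa, rate, method = "spearman")$p.value
```

With ties the closed-form expression no longer matches exactly, and `cor()` (which correlates the averaged ranks) is the safer choice.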
service_quality <- tibble(
cohort_id = 1:12,
qa_score = 0.6 + 0.02 * (1:12) + rnorm(12, 0, 0.05)
)
period_summary <- df |>
group_by(cohort_id) |>
summarise(rate = mean(converted), .groups = "drop") |>
left_join(service_quality, by = "cohort_id")
rho <- cor(period_summary$rate, period_summary$qa_score,
method = "spearman")
ggplot(period_summary, aes(qa_score, rate)) +
geom_smooth(method = "lm", se = TRUE,
color = palette_blog["ash"],
fill = palette_blog["ash"],
alpha = 0.10, linetype = 2, linewidth = 0.5) +
geom_point(size = 3.6, color = palette_blog["blue"], alpha = 0.85) +
geom_text(aes(label = sprintf("c%d", cohort_id)),
vjust = -1.15, size = 3,
color = palette_blog["ash"]) +
annotate(
"label",
x = -Inf, y = Inf,
label = sprintf("Spearman ρ = %.2f", rho),
hjust = -0.1, vjust = 1.4,
label.size = 0,
fill = "grey96",
color = palette_blog["ink"],
size = 3.6,
fontface = "bold"
) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(
x = "Service-quality score", y = NULL,
title = "Conversion rate vs service quality",
subtitle = "Each point is one cohort; dashed line = OLS fit"
) +
theme_blog()
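Note: the correlation between service quality and conversion often differs sharply across subgroups; head, mid-tier, and tail users may even be affected in opposite directions. Feeding service quality directly into the forecasting model is not recommended (the sample is too small and overfits easily); use it only for qualitative, after-the-fact attribution.
Summary
For an aggregate forecasting target such as conversion rate, the problem to solve was never "one more percent of accuracy". It is:
- Explainable features: encoding features by historical conversion-rate bins makes them readable outside the model;
- A transparent model: segment forecast = historical segment conversion rate × current segment share, with no black box;
- Automatic attribution: when a deviation appears, it can be split into two kinds of cause, a shift in user structure versus a fluctuation in service quality.
In practice the accuracy ranking is often segment > trend > individual, although in the simulation here the three methods are nearly tied. More importantly, the segment forecast puts forecasting, explanation, and attribution into a single formula, which keeps the cost of communicating with the business side lowest.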