Data Analysis Practice of Regional Water Environment Pollution
2024-04-09
# A tibble: 10,000 × 7
tip distance company local dow month hour
<fct> <dbl> <fct> <fct> <fct> <fct> <int>
1 yes 17.2 Chicago Independents no Thu Feb 16
2 yes 0.88 City Service yes Thu Mar 8
3 yes 18.1 other no Mon Feb 18
4 yes 20.7 Chicago Independents no Mon Apr 8
5 yes 12.2 Chicago Independents no Sun Mar 21
6 yes 0.94 Sun Taxi yes Sat Apr 23
7 yes 17.5 Flash Cab no Fri Mar 12
8 yes 17.7 other no Sun Jan 6
9 yes 1.85 Taxicab Insurance Agency Llc no Fri Apr 12
10 yes 1.47 City Service no Tue Mar 14
# ℹ 9,990 more rows
For machine learning, we usually split the data into a training set and a test set:
Never use the test set during training.
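A minimal sketch of such a split with rsample, assuming the 10,000-row data frame above is named taxi (the seed is illustrative; strata = tip follows the advice below):

library(tidymodels)

set.seed(123)                       # make the split reproducible
taxi_split <- initial_split(taxi, prop = 0.75, strata = tip)
taxi_train <- training(taxi_split)  # 7,500 rows, as printed below
taxi_test  <- testing(taxi_split)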
# A tibble: 7,500 × 7
tip distance company local dow month hour
<fct> <dbl> <fct> <fct> <fct> <fct> <int>
1 yes 0.7 Taxi Affiliation Services yes Tue Mar 18
2 yes 0.99 Sun Taxi yes Tue Jan 8
3 yes 1.78 other no Sat Mar 22
4 yes 0 Taxi Affiliation Services yes Wed Apr 15
5 yes 0 Taxi Affiliation Services no Sun Jan 21
6 yes 2.3 other no Sat Apr 21
7 yes 6.35 Sun Taxi no Wed Mar 16
8 yes 2.79 other no Sun Feb 14
9 yes 16.6 other no Sun Apr 18
10 yes 0.02 Chicago Independents yes Sun Apr 15
# ℹ 7,490 more rows
Use strata = tip
Stratification often helps, with very little downside
Models come in many varieties:
lm: for linear models
glm: for generalized linear models (e.g. logistic regression)
glmnet: for regularized regression
keras: for regression using TensorFlow
stan: for Bayesian regression
spark: for large data sets
Logistic Regression Model Specification (classification)
Computational engine: glmnet
Decision Tree Model Specification (classification)
Computational engine: rpart
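A minimal sketch of how these two specifications are created with parsnip:

library(parsnip)

logistic_reg() %>%
  set_engine("glmnet")         # regularized logistic regression

decision_tree() %>%
  set_mode("classification")   # rpart is the default engine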
All available models are listed at https://www.tidymodels.org/find/parsnip/
With workflow(), fit() and predict() apply not only to the actual model fit but also to the preprocessing steps.
tree_spec <-
  decision_tree(cost_complexity = 0.002) %>%
  set_mode("classification")
tree_spec %>%
fit(tip ~ ., data = taxi_train)
parsnip model object
n= 8000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 8000 616 yes (0.92300000 0.07700000)
2) distance>=14.12 2041 68 yes (0.96668300 0.03331700) *
3) distance< 14.12 5959 548 yes (0.90803826 0.09196174)
6) distance< 5.275 5419 450 yes (0.91695885 0.08304115) *
7) distance>=5.275 540 98 yes (0.81851852 0.18148148)
14) company=Chicago Independents,City Service,Sun Taxi,Taxi Affiliation Services,Taxicab Insurance Agency Llc,other 478 68 yes (0.85774059 0.14225941) *
15) company=Flash Cab 62 30 yes (0.51612903 0.48387097)
30) dow=Thu 12 2 yes (0.83333333 0.16666667) *
31) dow=Sun,Mon,Tue,Wed,Fri,Sat 50 22 no (0.44000000 0.56000000)
62) distance>=11.77 14 4 yes (0.71428571 0.28571429) *
63) distance< 11.77 36 12 no (0.33333333 0.66666667) *
tree_spec <-
decision_tree(cost_complexity = 0.002) %>%
set_mode("classification")
workflow() %>%
add_formula(tip ~ .) %>%
add_model(tree_spec) %>%
fit(data = taxi_train)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
tip ~ .
── Model ───────────────────────────────────────────────────────────────────────
n= 8000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 8000 616 yes (0.92300000 0.07700000)
2) distance>=14.12 2041 68 yes (0.96668300 0.03331700) *
3) distance< 14.12 5959 548 yes (0.90803826 0.09196174)
6) distance< 5.275 5419 450 yes (0.91695885 0.08304115) *
7) distance>=5.275 540 98 yes (0.81851852 0.18148148)
14) company=Chicago Independents,City Service,Sun Taxi,Taxi Affiliation Services,Taxicab Insurance Agency Llc,other 478 68 yes (0.85774059 0.14225941) *
15) company=Flash Cab 62 30 yes (0.51612903 0.48387097)
30) dow=Thu 12 2 yes (0.83333333 0.16666667) *
31) dow=Sun,Mon,Tue,Wed,Fri,Sat 50 22 no (0.44000000 0.56000000)
62) distance>=11.77 14 4 yes (0.71428571 0.28571429) *
63) distance< 11.77 36 12 no (0.33333333 0.66666667) *
tree_spec <-
decision_tree(cost_complexity = 0.002) %>%
set_mode("classification")
workflow(tip ~ ., tree_spec) %>%
fit(data = taxi_train)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
tip ~ .
── Model ───────────────────────────────────────────────────────────────────────
n= 8000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 8000 616 yes (0.92300000 0.07700000)
2) distance>=14.12 2041 68 yes (0.96668300 0.03331700) *
3) distance< 14.12 5959 548 yes (0.90803826 0.09196174)
6) distance< 5.275 5419 450 yes (0.91695885 0.08304115) *
7) distance>=5.275 540 98 yes (0.81851852 0.18148148)
14) company=Chicago Independents,City Service,Sun Taxi,Taxi Affiliation Services,Taxicab Insurance Agency Llc,other 478 68 yes (0.85774059 0.14225941) *
15) company=Flash Cab 62 30 yes (0.51612903 0.48387097)
30) dow=Thu 12 2 yes (0.83333333 0.16666667) *
31) dow=Sun,Mon,Tue,Wed,Fri,Sat 50 22 no (0.44000000 0.56000000)
62) distance>=11.77 14 4 yes (0.71428571 0.28571429) *
63) distance< 11.77 36 12 no (0.33333333 0.66666667) *
How do you use your new tree_fit model?
Run:
predict(tree_fit, new_data = taxi_test)
Run:
augment(tree_fit, new_data = taxi_test)
What do you get?
The output has the same number of rows as new_data.
How do you make sense of the tree_fit model?
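One way to make sense of the fitted tree is to plot it; a sketch assuming the rpart.plot package is available:

library(rpart.plot)

tree_fit %>%
  extract_fit_engine() %>%       # pull out the underlying rpart object
  rpart.plot(roundint = FALSE)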
# A tibble: 8,000 × 10
tip .pred_class .pred_yes .pred_no distance company local dow month hour
<fct> <fct> <dbl> <dbl> <dbl> <fct> <fct> <fct> <fct> <int>
1 yes yes 0.967 0.0333 17.2 Chicag… no Thu Feb 16
2 yes yes 0.935 0.0646 0.88 City S… yes Thu Mar 8
3 yes yes 0.967 0.0333 18.1 other no Mon Feb 18
4 yes yes 0.949 0.0507 12.2 Chicag… no Sun Mar 21
5 yes yes 0.821 0.179 0.94 Sun Ta… yes Sat Apr 23
6 yes yes 0.967 0.0333 17.5 Flash … no Fri Mar 12
7 yes yes 0.967 0.0333 17.7 other no Sun Jan 6
8 yes yes 0.938 0.0616 1.85 Taxica… no Fri Apr 12
9 yes yes 0.938 0.0616 0.53 Sun Ta… no Tue Mar 18
10 yes yes 0.931 0.0694 6.65 Taxica… no Sun Apr 11
# ℹ 7,990 more rows
Sensitivity and specificity are key metrics for evaluating a binary classification model:
\[ \text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]
\[ \text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}} \]
When evaluating a model, we want both to be high: high sensitivity means the model captures the truly positive cases, and high specificity means it correctly rules out the negative cases.
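With yardstick, these metrics can be computed from augmented predictions; a sketch assuming taxi_fit is the fitted workflow:

augment(taxi_fit, new_data = taxi_train) %>%
  sensitivity(truth = tip, estimate = .pred_class)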
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sensitivity binary 0.994
We can use metric_set() to combine multiple calculations into one:
taxi_metrics <- metric_set(accuracy, specificity, sensitivity)
augment(taxi_fit, new_data = taxi_train) %>%
taxi_metrics(truth = tip, estimate = .pred_class)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.928
2 specificity binary 0.130
3 sensitivity binary 0.994
taxi_metrics <- metric_set(accuracy, specificity, sensitivity)
augment(taxi_fit, new_data = taxi_train) %>%
group_by(local) %>%
taxi_metrics(truth = tip, estimate = .pred_class)
# A tibble: 6 × 4
local .metric .estimator .estimate
<fct> <chr> <chr> <dbl>
1 yes accuracy binary 0.898
2 no accuracy binary 0.935
3 yes specificity binary 0.169
4 no specificity binary 0.116
5 yes sensitivity binary 0.987
6 no sensitivity binary 0.996
# 10-fold cross-validation
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [7200/800]> Fold01
2 <split [7200/800]> Fold02
3 <split [7200/800]> Fold03
4 <split [7200/800]> Fold04
5 <split [7200/800]> Fold05
6 <split [7200/800]> Fold06
7 <split [7200/800]> Fold07
8 <split [7200/800]> Fold08
9 <split [7200/800]> Fold09
10 <split [7200/800]> Fold10
What is in this?
# 10-fold cross-validation using stratification
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [7200/800]> Fold01
2 <split [7200/800]> Fold02
3 <split [7200/800]> Fold03
4 <split [7200/800]> Fold04
5 <split [7200/800]> Fold05
6 <split [7200/800]> Fold06
7 <split [7200/800]> Fold07
8 <split [7200/800]> Fold08
9 <split [7200/800]> Fold09
10 <split [7200/800]> Fold10
Stratification often helps, with very little downside
We’ll use this setup:
# 10-fold cross-validation using stratification
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [7200/800]> Fold01
2 <split [7200/800]> Fold02
3 <split [7200/800]> Fold03
4 <split [7200/800]> Fold04
5 <split [7200/800]> Fold05
6 <split [7200/800]> Fold06
7 <split [7200/800]> Fold07
8 <split [7200/800]> Fold08
9 <split [7200/800]> Fold09
10 <split [7200/800]> Fold10
Set the seed when creating resamples
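A sketch of creating these folds with rsample (the seed value is illustrative):

set.seed(123)
taxi_folds <- vfold_cv(taxi_train, v = 10, strata = tip)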
# Resampling results
# 10-fold cross-validation using stratification
# A tibble: 10 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [7200/800]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]>
2 <split [7200/800]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]>
3 <split [7200/800]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]>
4 <split [7200/800]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]>
5 <split [7200/800]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]>
6 <split [7200/800]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]>
7 <split [7200/800]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]>
8 <split [7200/800]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]>
9 <split [7200/800]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]>
10 <split [7200/800]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]>
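The aggregated metrics below are obtained with collect_metrics(), assuming taxi_res holds the resampling results above:

collect_metrics(taxi_res)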
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.915 10 0.00309 Preprocessor1_Model1
2 roc_auc binary 0.624 10 0.0105 Preprocessor1_Model1
We can reliably measure performance using only the training data 🎉
How do the metrics from resampling compare to the metrics from training and testing?
Recall the ROC AUC values computed earlier on the training and test sets.
Remember that:
⚠️ the training set gives you overly optimistic metrics
⚠️ the test set is precious
# Save the assessment set results
ctrl_taxi <- control_resamples(save_pred = TRUE)
taxi_res <- fit_resamples(taxi_wflow, taxi_folds, control = ctrl_taxi)
taxi_res
# Resampling results
# 10-fold cross-validation using stratification
# A tibble: 10 × 5
splits id .metrics .notes .predictions
<list> <chr> <list> <list> <list>
1 <split [7200/800]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
2 <split [7200/800]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
3 <split [7200/800]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
4 <split [7200/800]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
5 <split [7200/800]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
6 <split [7200/800]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
7 <split [7200/800]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
8 <split [7200/800]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
9 <split [7200/800]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
10 <split [7200/800]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
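The saved out-of-sample predictions can then be retrieved with collect_predictions():

collect_predictions(taxi_res)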
# A tibble: 8,000 × 7
id .pred_yes .pred_no .row .pred_class tip .config
<chr> <dbl> <dbl> <int> <fct> <fct> <chr>
1 Fold01 0.938 0.0615 14 yes yes Preprocessor1_Model1
2 Fold01 0.946 0.0544 19 yes yes Preprocessor1_Model1
3 Fold01 0.973 0.0269 33 yes yes Preprocessor1_Model1
4 Fold01 0.903 0.0971 43 yes yes Preprocessor1_Model1
5 Fold01 0.973 0.0269 74 yes yes Preprocessor1_Model1
6 Fold01 0.903 0.0971 103 yes yes Preprocessor1_Model1
7 Fold01 0.915 0.0851 104 yes no Preprocessor1_Model1
8 Fold01 0.903 0.0971 124 yes yes Preprocessor1_Model1
9 Fold01 0.667 0.333 126 yes yes Preprocessor1_Model1
10 Fold01 0.949 0.0510 128 yes yes Preprocessor1_Model1
# ℹ 7,990 more rows
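Unsummarized, per-fold metrics like those below can be obtained by turning off the aggregation (the exact columns may differ from this table):

collect_metrics(taxi_res, summarize = FALSE)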
# A tibble: 30 × 4
id .metric .estimator .estimate
<chr> <chr> <chr> <dbl>
1 Fold01 accuracy binary 0.905
2 Fold02 accuracy binary 0.925
3 Fold03 accuracy binary 0.926
4 Fold04 accuracy binary 0.915
5 Fold05 accuracy binary 0.902
6 Fold06 accuracy binary 0.912
7 Fold07 accuracy binary 0.906
8 Fold08 accuracy binary 0.91
9 Fold09 accuracy binary 0.918
10 Fold10 accuracy binary 0.931
# ℹ 20 more rows
# Bootstrap sampling
# A tibble: 25 × 2
splits id
<list> <chr>
1 <split [8000/2902]> Bootstrap01
2 <split [8000/2916]> Bootstrap02
3 <split [8000/3004]> Bootstrap03
4 <split [8000/2979]> Bootstrap04
5 <split [8000/2961]> Bootstrap05
6 <split [8000/2962]> Bootstrap06
7 <split [8000/3026]> Bootstrap07
8 <split [8000/2926]> Bootstrap08
9 <split [8000/2972]> Bootstrap09
10 <split [8000/2972]> Bootstrap10
# ℹ 15 more rows
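A sketch of the corresponding rsample call (25 is the default number of bootstrap resamples; the seed is illustrative):

set.seed(3214)
bootstraps(taxi_train)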
# Monte Carlo cross-validation (0.75/0.25) with 10 resamples
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [6000/2000]> Resample01
2 <split [6000/2000]> Resample02
3 <split [6000/2000]> Resample03
4 <split [6000/2000]> Resample04
5 <split [6000/2000]> Resample05
6 <split [6000/2000]> Resample06
7 <split [6000/2000]> Resample07
8 <split [6000/2000]> Resample08
9 <split [6000/2000]> Resample09
10 <split [6000/2000]> Resample10
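A sketch of the corresponding Monte Carlo CV call (seed illustrative):

set.seed(322)
mc_cv(taxi_train, prop = 0.75, times = 10)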
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
tip ~ .
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
trees = 1000
Computational engine: ranger
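A sketch of building this workflow with a parsnip random forest specification:

rf_spec <- rand_forest(trees = 1000) %>%
  set_mode("classification")   # ranger is the default engine

rf_wflow <- workflow(tip ~ ., rf_spec)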
ctrl_taxi <- control_resamples(save_pred = TRUE)
# Random forest uses random numbers so set the seed first
set.seed(2)
rf_res <- fit_resamples(rf_wflow, taxi_folds, control = ctrl_taxi)
collect_metrics(rf_res)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.923 10 0.00317 Preprocessor1_Model1
2 roc_auc binary 0.616 10 0.0147 Preprocessor1_Model1
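The one-shot fit on the full training set, evaluated once on the test set, is typically produced with last_fit(); a sketch assuming taxi_split from the beginning:

final_fit <- last_fit(rf_wflow, taxi_split)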
# Resampling results
# Manual resampling
# A tibble: 1 × 6
splits id .metrics .notes .predictions .workflow
<list> <chr> <list> <list> <list> <list>
1 <split [8000/2000]> train/test split <tibble> <tibble> <tibble> <workflow>
What is in final_fit?
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.914 Preprocessor1_Model1
2 roc_auc binary 0.638 Preprocessor1_Model1
These are metrics computed with the test set
What is in final_fit?
# A tibble: 2,000 × 7
id .pred_yes .pred_no .row .pred_class tip .config
<chr> <dbl> <dbl> <int> <fct> <fct> <chr>
1 train/test split 0.957 0.0426 4 yes yes Preprocessor1_Mo…
2 train/test split 0.938 0.0621 10 yes yes Preprocessor1_Mo…
3 train/test split 0.958 0.0416 19 yes yes Preprocessor1_Mo…
4 train/test split 0.894 0.106 23 yes yes Preprocessor1_Mo…
5 train/test split 0.943 0.0573 28 yes yes Preprocessor1_Mo…
6 train/test split 0.979 0.0213 34 yes yes Preprocessor1_Mo…
7 train/test split 0.954 0.0463 35 yes yes Preprocessor1_Mo…
8 train/test split 0.928 0.0722 38 yes yes Preprocessor1_Mo…
9 train/test split 0.985 0.0147 40 yes yes Preprocessor1_Mo…
10 train/test split 0.948 0.0523 42 yes no Preprocessor1_Mo…
# ℹ 1,990 more rows
What is in final_fit?
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
tip ~ .
── Model ───────────────────────────────────────────────────────────────────────
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
Type: Probability estimation
Number of trees: 1000
Sample size: 8000
Number of independent variables: 6
Mtry: 2
Target node size: 10
Variable importance mode: none
Splitrule: gini
OOB prediction error (Brier s.): 0.07069778
Use this fitted workflow for prediction on new data, for example when deploying the model.
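A sketch of extracting the trained workflow for that purpose:

fitted_wflow <- extract_workflow(final_fit)   # the trained workflow printed above
predict(fitted_wflow, new_data = taxi_test)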
rf_spec <- rand_forest(min_n = tune()) %>%
set_mode("classification")
rf_wflow <- workflow(tip ~ ., rf_spec)
rf_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
tip ~ .
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
min_n = tune()
Computational engine: ranger
tune_grid() works similarly to fit_resamples() but covers multiple parameter values.
Inspect the results and select the best-performing hyperparameter(s), as sketched below:
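A sketch of the tuning call and of inspecting/selecting the results; the grid size and metric here are assumptions consistent with the five candidates shown below:

set.seed(22)
rf_res <- tune_grid(rf_wflow, taxi_folds, grid = 5)

show_best(rf_res, metric = "roc_auc")     # the five candidates below
select_best(rf_res, metric = "roc_auc")   # the single best min_n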
# A tibble: 5 × 7
min_n .metric .estimator mean n std_err .config
<int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 33 roc_auc binary 0.623 10 0.0149 Preprocessor1_Model1
2 31 roc_auc binary 0.622 10 0.0154 Preprocessor1_Model3
3 21 roc_auc binary 0.620 10 0.0149 Preprocessor1_Model4
4 13 roc_auc binary 0.617 10 0.0137 Preprocessor1_Model5
5 6 roc_auc binary 0.611 10 0.0156 Preprocessor1_Model2
# A tibble: 1 × 2
min_n .config
<int> <chr>
1 33 Preprocessor1_Model1
collect_metrics() and autoplot() are also available.
require(tidyverse)
sitedf <- readr::read_csv("https://www.epa.gov/sites/default/files/2014-01/nla2007_sampledlakeinformation_20091113.csv") |>
select(SITE_ID,
lon = LON_DD,
lat = LAT_DD,
name = LAKENAME,
area = LAKEAREA,
zmax = DEPTHMAX
) |>
group_by(SITE_ID) |>
summarize(lon = mean(lon, na.rm = TRUE),
lat = mean(lat, na.rm = TRUE),
name = unique(name),
area = mean(area, na.rm = TRUE),
zmax = mean(zmax, na.rm = TRUE))
visitdf <- readr::read_csv("https://www.epa.gov/sites/default/files/2013-09/nla2007_profile_20091008.csv") |>
select(SITE_ID,
date = DATE_PROFILE,
year = YEAR,
visit = VISIT_NO
) |>
distinct()
waterchemdf <- readr::read_csv("https://www.epa.gov/sites/default/files/2013-09/nla2007_profile_20091008.csv") |>
select(SITE_ID,
date = DATE_PROFILE,
depth = DEPTH,
temp = TEMP_FIELD,
do = DO_FIELD,
ph = PH_FIELD,
cond = COND_FIELD,
)
sddf <- readr::read_csv("https://www.epa.gov/sites/default/files/2014-10/nla2007_secchi_20091008.csv") |>
select(SITE_ID,
date = DATE_SECCHI,
sd = SECMEAN,
clear_to_bottom = CLEAR_TO_BOTTOM
)
trophicdf <- readr::read_csv("https://www.epa.gov/sites/default/files/2014-10/nla2007_trophic_conditionestimate_20091123.csv") |>
select(SITE_ID,
visit = VISIT_NO,
tp = PTL,
tn = NTL,
chla = CHLA) |>
left_join(visitdf, by = c("SITE_ID", "visit")) |>
select(-year, -visit) |>
group_by(SITE_ID, date) |>
summarize(tp = mean(tp, na.rm = TRUE),
tn = mean(tn, na.rm = TRUE),
chla = mean(chla, na.rm = TRUE)
)
phytodf <- readr::read_csv("https://www.epa.gov/sites/default/files/2014-10/nla2007_phytoplankton_softalgaecount_20091023.csv") |>
select(SITE_ID,
date = DATEPHYT,
depth = SAMPLE_DEPTH,
phyta = DIVISION,
genus = GENUS,
species = SPECIES,
tax = TAXANAME,
abund = ABUND) |>
mutate(phyta = gsub(" .*$", "", phyta)) |>
filter(!is.na(genus)) |>
group_by(SITE_ID, date, depth, phyta, genus) |>
summarize(abund = sum(abund, na.rm = TRUE)) |>
nest(phytodf = -c(SITE_ID, date))
envdf <- waterchemdf |>
filter(depth < 2) |>
select(-depth) |>
group_by(SITE_ID, date) |>
summarise_all(~mean(., na.rm = TRUE)) |>
ungroup() |>
left_join(sddf, by = c("SITE_ID", "date")) |>
left_join(trophicdf, by = c("SITE_ID", "date"))
nla <- envdf |>
left_join(phytodf) |>
left_join(sitedf, by = "SITE_ID") |>
filter(!purrr::map_lgl(phytodf, is.null)) |>
mutate(cyanophyta = purrr::map(phytodf, ~ .x |>
dplyr::filter(phyta == "Cyanophyta") |>
summarize(cyanophyta = sum(abund, na.rm = TRUE))
)) |>
unnest(cyanophyta) |>
select(-phyta) |>
mutate(clear_to_bottom = ifelse(is.na(clear_to_bottom), TRUE, FALSE))
Name | nla |
Number of rows | 1208 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
character | 3 |
list | 1 |
logical | 1 |
numeric | 14 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
SITE_ID | 0 | 1.00 | 12 | 24 | 0 | 1114 | 0 |
date | 0 | 1.00 | 8 | 10 | 0 | 116 | 0 |
name | 44 | 0.96 | 5 | 48 | 0 | 990 | 0 |
Variable type: list
skim_variable | n_missing | complete_rate | n_unique | min_length | max_length |
---|---|---|---|---|---|
phytodf | 0 | 1 | 1207 | 4 | 4 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
clear_to_bottom | 0 | 1 | 0.96 | TRU: 1154, FAL: 54 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
temp | 1 | 1.00 | 24.13 | 4.06 | 10.95 | 21.43 | 24.53 | 27.20 | 37.73 | ▁▅▇▅▁ |
do | 54 | 0.96 | 7.84 | 1.97 | 0.77 | 6.85 | 7.88 | 8.70 | 21.00 | ▁▇▂▁▁ |
ph | 3 | 1.00 | 8.08 | 0.88 | 4.10 | 7.55 | 8.25 | 8.64 | 10.33 | ▁▁▅▇▂ |
cond | 247 | 0.80 | 714.37 | 2499.57 | 3.00 | 86.93 | 219.40 | 471.00 | 42487.75 | ▇▁▁▁▁ |
sd | 62 | 0.95 | 2.15 | 2.49 | 0.04 | 0.60 | 1.35 | 2.75 | 36.71 | ▇▁▁▁▁ |
tp | 0 | 1.00 | 112.77 | 301.34 | 1.00 | 10.00 | 25.00 | 90.00 | 4865.00 | ▇▁▁▁▁ |
tn | 0 | 1.00 | 1174.39 | 2061.71 | 5.00 | 317.75 | 584.00 | 1174.25 | 26100.00 | ▇▁▁▁▁ |
chla | 5 | 1.00 | 29.91 | 69.27 | 0.07 | 2.96 | 7.79 | 26.08 | 936.00 | ▇▁▁▁▁ |
lon | 0 | 1.00 | -94.34 | 14.08 | -124.25 | -103.12 | -94.48 | -84.32 | -67.70 | ▃▃▇▆▃ |
lat | 0 | 1.00 | 40.55 | 5.02 | 26.94 | 37.12 | 41.31 | 44.64 | 48.98 | ▁▃▆▇▆ |
area | 0 | 1.00 | 12.01 | 78.53 | 0.04 | 0.24 | 0.70 | 2.87 | 1674.90 | ▇▁▁▁▁ |
zmax | 0 | 1.00 | 9.41 | 10.13 | 0.50 | 2.90 | 5.90 | 12.00 | 97.00 | ▇▁▁▁▁ |
depth | 9 | 0.99 | 1.58 | 0.59 | 0.08 | 1.13 | 2.00 | 2.00 | 2.00 | ▁▂▂▁▇ |
cyanophyta | 0 | 1.00 | 38382.63 | 191373.91 | 0.66 | 1200.61 | 5483.81 | 23504.81 | 4982222.22 | ▇▁▁▁▁ |
nla |>
filter(tp > 1) |>
ggplot(aes(tn, tp)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))) +
scale_y_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x)))
Call:
lm(formula = log10(tp) ~ log10(tn), data = nla)
Residuals:
Min 1Q Median 3Q Max
-1.8063 -0.2360 0.0125 0.2245 1.9140
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.92315 0.07166 -26.84 <2e-16 ***
log10(tn) 1.21700 0.02528 48.13 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.405 on 1206 degrees of freedom
Multiple R-squared: 0.6576, Adjusted R-squared: 0.6574
F-statistic: 2317 on 1 and 1206 DF, p-value: < 2.2e-16
nla |>
filter(tp > 1) |>
ggplot(aes(tp, cyanophyta)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))) +
scale_y_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x)))
Call:
lm(formula = log10(cyanophyta) ~ log10(tp), data = nla)
Residuals:
Min 1Q Median 3Q Max
-5.1551 -0.5128 0.1407 0.6546 3.1811
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.82739 0.06181 45.74 <2e-16 ***
log10(tp) 0.58577 0.03784 15.48 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9095 on 1206 degrees of freedom
Multiple R-squared: 0.1658, Adjusted R-squared: 0.1651
F-statistic: 239.7 on 1 and 1206 DF, p-value: < 2.2e-16
<Training/Testing/Total>
<844/364/1208>
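A sketch of a split consistent with these counts (844 + 364 = 1208); the exact prop and seed are assumptions, and stratification could shift the counts slightly:

set.seed(42)
nla_split <- rsample::initial_split(nla, prop = 0.7)
nla_train <- rsample::training(nla_split)
nla_test  <- rsample::testing(nla_split)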
# A tibble: 844 × 19
SITE_ID date temp do ph cond sd clear_to_bottom tp tn
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl>
1 NLA06608-0… 6/14… 22.2 NaN 5.52 62.1 0.55 TRUE 36 695
2 NLA06608-0… 8/29… 30.1 7.5 8.35 1128. 0.71 TRUE 43 738
3 NLA06608-0… 9/6/… 30.0 9 8.4 1220. 0.49 TRUE 50 843
4 NLA06608-0… 9/14… 22.7 NaN 5.8 44.5 1.05 TRUE 20 264
5 NLA06608-0… 9/4/… 25.1 NaN 5.26 45.5 0.65 TRUE 28 384
6 NLA06608-0… 8/23… 18.1 8.4 9.4 9052. 0.63 TRUE 175 4456
7 NLA06608-0… 8/6/… 21.6 7.6 9.4 9080 0.85 TRUE 175 4147
8 NLA06608-0… 6/11… 22.9 4.75 8.38 3373 0.35 TRUE 801 7047
9 NLA06608-0… 8/7/… 21.2 4.71 9.03 3125. 0.55 TRUE 1376 6578
10 NLA06608-0… 8/9/… 29.4 7.8 8.2 168 0.95 TRUE 12 349
# ℹ 834 more rows
# ℹ 9 more variables: chla <dbl>, phytodf <list>, lon <dbl>, lat <dbl>,
# name <chr>, area <dbl>, zmax <dbl>, depth <dbl>, cyanophyta <dbl>
# A tibble: 364 × 19
SITE_ID date temp do ph cond sd clear_to_bottom tp tn
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl>
1 NLA06608-00… 7/31… 16.3 8.25 8.15 152. 6.4 TRUE 6 151
2 NLA06608-00… 7/23… 24.8 NaN 5.07 46.0 0.45 TRUE 22 469
3 NLA06608-00… 7/17… 25.3 8.56 8.15 77 3.21 TRUE 7 184
4 NLA06608-00… 8/30… 24.5 8.68 7.84 83 4.1 TRUE 4 223
5 NLA06608-00… 6/13… 26.8 5.4 7.43 196. 0.31 TRUE 159 1026
6 NLA06608-00… 9/18… 24.1 6.87 7.78 221. 0.27 TRUE 142 1052
7 NLA06608-00… 7/10… 24.4 8 7.9 NaN 0.37 TRUE 109 470
8 NLA06608-00… 7/2/… 27 7.45 8.3 215 1.07 TRUE 20 466
9 NLA06608-00… 7/11… 27.8 7.1 7.15 176. 0.9 TRUE 35 860
10 NLA06608-00… 8/14… 32.8 8.5 8.8 136. 1.08 TRUE 29 943
# ℹ 354 more rows
# ℹ 9 more variables: chla <dbl>, phytodf <list>, lon <dbl>, lat <dbl>,
# name <chr>, area <dbl>, zmax <dbl>, depth <dbl>, cyanophyta <dbl>
nla_formula <- as.formula("cyanophyta ~ temp + do + ph + cond + sd + tp + tn + chla + clear_to_bottom")
# nla_formula <- as.formula("cyanophyta ~ temp + do + ph + cond + sd + tp + tn")
nla_recipe <- recipes::recipe(nla_formula, data = nla_train) |>
recipes::step_string2factor(all_nominal()) |>
recipes::step_nzv(all_nominal()) |>
recipes::step_log(chla, cyanophyta, base = 10) |>
recipes::step_normalize(all_numeric_predictors()) |>
prep()
nla_recipe
nla_cv <- recipes::bake(
nla_recipe,
new_data = training(nla_split)
) |>
rsample::vfold_cv(v = 10)
nla_cv
# 10-fold cross-validation
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [759/85]> Fold01
2 <split [759/85]> Fold02
3 <split [759/85]> Fold03
4 <split [759/85]> Fold04
5 <split [760/84]> Fold05
6 <split [760/84]> Fold06
7 <split [760/84]> Fold07
8 <split [760/84]> Fold08
9 <split [760/84]> Fold09
10 <split [760/84]> Fold10
xgboost_model <- parsnip::boost_tree(
mode = "regression",
trees = 1000,
min_n = tune(),
tree_depth = tune(),
learn_rate = tune(),
loss_reduction = tune()
) |>
set_engine("xgboost", objective = "reg:squarederror")
xgboost_model
Boosted Tree Model Specification (regression)
Main Arguments:
trees = 1000
min_n = tune()
tree_depth = tune()
learn_rate = tune()
loss_reduction = tune()
Engine-Specific Arguments:
objective = reg:squarederror
Computational engine: xgboost
# grid specification
xgboost_params <- dials::parameters(
min_n(),
tree_depth(),
learn_rate(),
loss_reduction()
)
xgboost_params
Collection of 4 parameters for tuning
identifier type object
min_n min_n nparam[+]
tree_depth tree_depth nparam[+]
learn_rate learn_rate nparam[+]
loss_reduction loss_reduction nparam[+]
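The xgboost_grid used in the tuning call below is not defined in this excerpt; a sketch consistent with the 180-row results later (60 candidates × 3 metrics):

set.seed(234)   # illustrative seed
xgboost_grid <- dials::grid_max_entropy(xgboost_params, size = 60)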
xgboost_wf <- workflows::workflow() |>
add_model(xgboost_model) |>
add_formula(nla_formula)
xgboost_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
cyanophyta ~ temp + do + ph + cond + sd + tp + tn + chla + clear_to_bottom
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (regression)
Main Arguments:
trees = 1000
min_n = tune()
tree_depth = tune()
learn_rate = tune()
loss_reduction = tune()
Engine-Specific Arguments:
objective = reg:squarederror
Computational engine: xgboost
# hyperparameter tuning
if (FALSE) {
xgboost_tuned <- tune::tune_grid(
object = xgboost_wf,
resamples = nla_cv,
grid = xgboost_grid,
metrics = yardstick::metric_set(rmse, rsq, mae),
control = tune::control_grid(verbose = TRUE)
)
saveRDS(xgboost_tuned, "./xgboost_tuned.RDS")
}
xgboost_tuned <- readRDS("./xgboost_tuned.RDS")
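The table of best RMSE configurations below can be produced with show_best():

xgboost_tuned |>
  tune::show_best(metric = "rmse") |>
  knitr::kable()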
min_n | tree_depth | learn_rate | loss_reduction | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|---|---|
17 | 3 | 0.0767369 | 6.9949310 | rmse | standard | 1.824333 | 10 | 0.0461643 | Preprocessor1_Model21 |
19 | 8 | 0.0361629 | 24.8793941 | rmse | standard | 1.825676 | 10 | 0.0468745 | Preprocessor1_Model16 |
10 | 1 | 0.0093868 | 0.4234282 | rmse | standard | 1.827963 | 10 | 0.0478933 | Preprocessor1_Model29 |
35 | 10 | 0.0103573 | 30.7044800 | rmse | standard | 1.836934 | 10 | 0.0492418 | Preprocessor1_Model26 |
12 | 1 | 0.0370675 | 0.0000002 | rmse | standard | 1.840259 | 10 | 0.0447300 | Preprocessor1_Model04 |
# A tibble: 180 × 10
min_n tree_depth learn_rate loss_reduction .metric .estimator mean n
<int> <int> <dbl> <dbl> <chr> <chr> <dbl> <int>
1 3 2 0.0000362 0.0000000523 mae standard 7.76 10
2 3 2 0.0000362 0.0000000523 rmse standard 8.09 10
3 3 2 0.0000362 0.0000000523 rsq standard 0.328 10
4 31 1 0.0000481 0.000000376 mae standard 7.67 10
5 31 1 0.0000481 0.000000376 rmse standard 8.00 10
6 31 1 0.0000481 0.000000376 rsq standard 0.218 10
7 39 12 0.0586 0.0316 mae standard 1.57 10
8 39 12 0.0586 0.0316 rmse standard 1.99 10
9 39 12 0.0586 0.0316 rsq standard 0.311 10
10 12 1 0.0371 0.000000200 mae standard 1.46 10
# ℹ 170 more rows
# ℹ 2 more variables: std_err <dbl>, .config <chr>
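A sketch of finalizing the model with the best RMSE parameters, yielding the specification printed below:

xgboost_best_params <- xgboost_tuned |>
  tune::select_best(metric = "rmse")

xgboost_model_final <- xgboost_model |>
  tune::finalize_model(xgboost_best_params)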
Boosted Tree Model Specification (regression)
Main Arguments:
trees = 1000
min_n = 17
tree_depth = 3
learn_rate = 0.0767368959881715
loss_reduction = 6.99493103991011
Engine-Specific Arguments:
objective = reg:squarederror
Computational engine: xgboost
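train_processed used below is not defined in this excerpt; presumably it is the baked training data shown next:

train_processed <- recipes::bake(nla_recipe, new_data = nla_train)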
# A tibble: 844 × 10
temp do ph cond sd tp tn chla clear_to_bottom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
1 -0.501 NaN -2.90 -0.278 -0.689 -0.317 -0.243 -0.581 TRUE
2 1.47 -0.157 0.304 0.191 -0.616 -0.286 -0.222 0.395 TRUE
3 1.44 0.569 0.360 0.232 -0.717 -0.255 -0.172 0.214 TRUE
4 -0.390 NaN -2.58 -0.286 -0.460 -0.388 -0.450 -0.0359 TRUE
5 0.221 NaN -3.20 -0.286 -0.643 -0.352 -0.393 0.358 TRUE
6 -1.53 0.279 1.49 3.68 -0.653 0.300 1.57 0.521 TRUE
7 -0.666 -0.108 1.49 3.70 -0.552 0.300 1.42 -0.0396 TRUE
8 -0.330 -1.49 0.338 1.18 -0.781 3.08 2.81 -0.549 TRUE
9 -0.748 -1.51 1.07 1.07 -0.689 5.63 2.59 1.03 TRUE
10 1.29 -0.0116 0.134 -0.232 -0.506 -0.423 -0.410 -0.793 TRUE
# ℹ 834 more rows
# ℹ 1 more variable: cyanophyta <dbl>
train_prediction <- xgboost_model_final |>
# fit the model on all the training data
fit(
formula = nla_formula,
data = train_processed
) |>
# predict cyanophyta abundance for the training data
predict(new_data = train_processed) |>
bind_cols(nla_train |>
mutate(.obs = log10(cyanophyta)))
xgboost_score_train <-
train_prediction |>
yardstick::metrics(.obs, .pred) |>
mutate(.estimate = format(round(.estimate, 2), big.mark = ","))
knitr::kable(xgboost_score_train)
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 0.81 |
rsq | standard | 0.38 |
mae | standard | 0.64 |
test_processed <- bake(nla_recipe, new_data = nla_test)
test_prediction <- xgboost_model_final |>
# fit the model on all the training data
fit(
formula = nla_formula,
data = train_processed
) |>
# use the model fit on the training data to predict the test data
predict(new_data = test_processed) |>
bind_cols(nla_test |>
mutate(.obs = log10(cyanophyta)))
# measure the accuracy of our model using `yardstick`
xgboost_score <- test_prediction |>
yardstick::metrics(.obs, .pred) |>
mutate(.estimate = format(round(.estimate, 2), big.mark = ","))
knitr::kable(xgboost_score)
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 0.80 |
rsq | standard | 0.35 |
mae | standard | 0.62 |