Exploration of sample size and diatom-based indicator performance in three North American phosphorus training sets

Euan D. Reavie; Steve Juggins

doi:10.1007/s10452-011-9373-9

Natural Resources Research Institute

Research output: Contribution to journal › Article › peer-review

28 Scopus citations

Abstract

Three large training sets were investigated to determine optimal sample sizes for diatom-based inference models. The sample sets represented (1) assemblages from Great Lakes coastlines, (2) phytoplankton from the pelagic Great Lakes and (3) surface sediment assemblages from Minnesota lakes. Diatom-based weighted average models to infer nutrient concentrations were developed for each training set. Training set sample sizes ranging from 10 to the maximum number of samples were created through random sample selection, and performance of each model was evaluated. For each model iteration, diatom-inferred (DI) nutrient data were related to stressor data (e.g., adjacent agricultural or urban development) to characterize the ability of each model to track human activities. The relationships between model performance parameters (DI-stressor correlations and model r², error and bias) and sample size were used to determine the minimum sample size needed to optimize models for each region. Depending on the training set, at least 40-70 samples were needed to capture the variation in diatom assemblages and environmental conditions to such a degree that non-analog situations should be rare and so should provide an unambiguous result if the model was applied to any sample assemblage from the region. It is recommended that one exercises caution when dealing with smaller training sets unless there is certainty that the selected samples reflect the regional variability in diatom assemblages and environmental conditions.
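The subsampling procedure described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' code or data: the training set is synthetic (Gaussian taxon responses with Poisson counts), the weighted-averaging model uses simple inverse deshrinking, and performance is measured as root-mean-square error on the samples left out of each random subset. The abstract does not state which validation scheme the study used, and the function names (`wa_fit`, `wa_predict`, `subsample_performance`, `synthetic_training_set`) are hypothetical.

```python
import numpy as np


def wa_fit(Y, x):
    """Fit a simple weighted-averaging (WA) model with inverse deshrinking."""
    totals = Y.sum(axis=0)
    optima = np.full(Y.shape[1], np.nan)
    seen = totals > 0  # taxa absent from the subset get no optimum
    # Taxon optimum = abundance-weighted mean of the environmental variable.
    optima[seen] = (Y[:, seen] * x[:, None]).sum(axis=0) / totals[seen]
    raw = wa_raw(Y, optima)
    # Inverse deshrinking: regress observed values on the raw WA estimates.
    slope, intercept = np.polyfit(raw, x, 1)
    return optima, slope, intercept


def wa_raw(Y, optima):
    """Raw WA estimate per sample: abundance-weighted mean of taxon optima."""
    known = ~np.isnan(optima)
    num = (Y[:, known] * optima[known]).sum(axis=1)
    den = Y[:, known].sum(axis=1).astype(float)
    out = np.full(den.shape, np.nan)  # NaN where no known taxon is present
    np.divide(num, den, out=out, where=den > 0)
    return out


def wa_predict(Y, model):
    """Apply a fitted WA model to new assemblages."""
    optima, slope, intercept = model
    return intercept + slope * wa_raw(Y, optima)


def subsample_performance(Y, x, sizes, n_reps=20, seed=0):
    """Mean hold-out RMSE for random training subsets of each size.

    Each size must be smaller than the number of samples so that some
    samples remain for evaluation.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    results = {}
    for size in sizes:
        errors = []
        for _ in range(n_reps):
            idx = rng.permutation(n)
            train, test = idx[:size], idx[size:]
            model = wa_fit(Y[train], x[train])
            resid = wa_predict(Y[test], model) - x[test]
            errors.append(np.sqrt(np.nanmean(resid ** 2)))
        results[size] = float(np.mean(errors))
    return results


def synthetic_training_set(n_samples=150, n_taxa=40, seed=1):
    """Synthetic stand-in for a training set (not the study's data)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n_samples)  # scaled nutrient gradient
    true_optima = np.linspace(-0.2, 1.2, n_taxa)
    expected = 50.0 * np.exp(
        -((x[:, None] - true_optima[None, :]) ** 2) / (2 * 0.15 ** 2)
    )
    return rng.poisson(expected), x


if __name__ == "__main__":
    Y, x = synthetic_training_set()
    for size, rmse in subsample_performance(Y, x, range(10, 101, 10)).items():
        print(size, round(rmse, 4))
```

Plotting the error against subset size gives the kind of curve from which a minimum adequate sample size could be read. The numbers produced here reflect only the synthetic gradient and say nothing about the 40-70 sample range reported for the real training sets.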

Original language	English (US)
Pages (from-to)	529-538
Number of pages	10
Journal	Aquatic Ecology
Volume	45
Issue number	4
DOIs	https://doi.org/10.1007/s10452-011-9373-9
State	Published - Nov 2011

Bibliographical note

Funding Information:
Acknowledgments: The Minnesota lake dataset has been progressively developed by Steve Heiskary and Mark Tomasek (Minnesota Pollution Control Agency), Dan Engstrom, Mark Edlund, Shawn Schottler and Joy Ramstack (St. Croix Watershed Research Station). Amy Kireta, Gerald Sgro, Norman Andresen and Michael Ferguson supported diatom assessments for GLEI samples. Michael Agbeti supported diatom assessments of the GLNPO phytoplankton samples. There are several people to thank for GLEI project management and field support, including Valerie Brady, Jerry Henneck, John Ameel, Gerald Niemi, John (Jack) Kelly, Russell Kreis and Jeffrey Johansen. This research was supported by grants to E. Reavie from the US Environmental Protection Agency under Cooperative Agreements EPA/R–8286750 (GLEI) and GL-00E23101 (GLNPO). This document has not been subjected to the EPA’s required peer and policy review and therefore does not necessarily reflect the view of the Agency, and no official endorsement should be inferred. This is contribution number 530 of the Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota Duluth.

Keywords

Diatoms
Inference models
Models
Sample size
Stressors
Training sets

Cite this

@article{7afd3817d8c34ee185fa716c55019aa1,

title = "Exploration of sample size and diatom-based indicator performance in three North American phosphorus training sets",

abstract = "Three large training sets were investigated to determine optimal sample sizes for diatom-based inference models. The sample sets represented (1) assemblages from Great Lakes coastlines, (2) phytoplankton from the pelagic Great Lakes and (3) surface sediment assemblages from Minnesota lakes. Diatom-based weighted average models to infer nutrient concentrations were developed for each training set. Training set sample sizes ranging from 10 to the maximum number of samples were created through random sample selection, and performance of each model was evaluated. For each model iteration, diatom-inferred (DI) nutrient data were related to stressor data (e.g., adjacent agricultural or urban development) to characterize the ability of each model to track human activities. The relationships between model performance parameters (DI-stressor correlations and model r$^{2}$, error and bias) and sample size were used to determine the minimum sample size needed to optimize models for each region. Depending on the training set, at least 40-70 samples were needed to capture the variation in diatom assemblages and environmental conditions to such a degree that non-analog situations should be rare and so should provide an unambiguous result if the model was applied to any sample assemblage from the region. It is recommended that one exercises caution when dealing with smaller training sets unless there is certainty that the selected samples reflect the regional variability in diatom assemblages and environmental conditions.",

keywords = "Diatoms, Inference models, Models, Sample size, Stressors, Training sets",

author = "Reavie, {Euan D.} and Juggins, Steve",

note = "Funding Information: Acknowledgments The Minnesota lake dataset has been progressively developed by Steve Heiskary and Mark Tomasek (Minnesota Pollution Control Agency), Dan Engstrom, Mark Edlund, Shawn Schottler and Joy Ramstack (St. Croix Watershed Research Station). Amy Kireta, Gerald Sgro, Norman Andresen and Michael Ferguson supported diatom assessments for GLEI samples. Michael Agbeti supported diatom assessments of the GLNPO phytoplankton samples. There are several people to thank for GLEI project management and field support, including Valerie Brady, Jerry Henneck, John Ameel, Gerald Niemi, John (Jack) Kelly, Russell Kreis and Jeffrey Johansen. This research was supported by grants to E. Reavie from the US Environmental Protection Agency under Cooperative Agreements EPA/R–8286750 (GLEI) and GL-00E23101 (GLNPO). This document has not been subjected to the EPA{\textquoteright}s required peer and policy review and therefore does not necessarily reflect the view of the Agency, and no official endorsement should be inferred. This is contribution number 530 of the Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota Duluth.",

year = "2011",

month = nov,

doi = "10.1007/s10452-011-9373-9",

language = "English (US)",

volume = "45",

pages = "529--538",

journal = "Aquatic Ecology",

issn = "1386-2588",

publisher = "Kluwer Academic Publishers",

number = "4",

}

TY - JOUR

T1 - Exploration of sample size and diatom-based indicator performance in three North American phosphorus training sets

AU - Reavie, Euan D.

AU - Juggins, Steve

N1 - Funding Information: Acknowledgments The Minnesota lake dataset has been progressively developed by Steve Heiskary and Mark Tomasek (Minnesota Pollution Control Agency), Dan Engstrom, Mark Edlund, Shawn Schottler and Joy Ramstack (St. Croix Watershed Research Station). Amy Kireta, Gerald Sgro, Norman Andresen and Michael Ferguson supported diatom assessments for GLEI samples. Michael Agbeti supported diatom assessments of the GLNPO phytoplankton samples. There are several people to thank for GLEI project management and field support, including Valerie Brady, Jerry Henneck, John Ameel, Gerald Niemi, John (Jack) Kelly, Russell Kreis and Jeffrey Johansen. This research was supported by grants to E. Reavie from the US Environmental Protection Agency under Cooperative Agreements EPA/R–8286750 (GLEI) and GL-00E23101 (GLNPO). This document has not been subjected to the EPA’s required peer and policy review and therefore does not necessarily reflect the view of the Agency, and no official endorsement should be inferred. This is contribution number 530 of the Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota Duluth.

PY - 2011/11

Y1 - 2011/11

AB - Three large training sets were investigated to determine optimal sample sizes for diatom-based inference models. The sample sets represented (1) assemblages from Great Lakes coastlines, (2) phytoplankton from the pelagic Great Lakes and (3) surface sediment assemblages from Minnesota lakes. Diatom-based weighted average models to infer nutrient concentrations were developed for each training set. Training set sample sizes ranging from 10 to the maximum number of samples were created through random sample selection, and performance of each model was evaluated. For each model iteration, diatom-inferred (DI) nutrient data were related to stressor data (e.g., adjacent agricultural or urban development) to characterize the ability of each model to track human activities. The relationships between model performance parameters (DI-stressor correlations and model r², error and bias) and sample size were used to determine the minimum sample size needed to optimize models for each region. Depending on the training set, at least 40-70 samples were needed to capture the variation in diatom assemblages and environmental conditions to such a degree that non-analog situations should be rare and so should provide an unambiguous result if the model was applied to any sample assemblage from the region. It is recommended that one exercises caution when dealing with smaller training sets unless there is certainty that the selected samples reflect the regional variability in diatom assemblages and environmental conditions.

KW - Diatoms

KW - Inference models

KW - Models

KW - Sample size

KW - Stressors

KW - Training sets

UR - http://www.scopus.com/inward/record.url?scp=80355144436&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80355144436&partnerID=8YFLogxK

U2 - 10.1007/s10452-011-9373-9

DO - 10.1007/s10452-011-9373-9

M3 - Article

AN - SCOPUS:80355144436

SN - 1386-2588

VL - 45

SP - 529

EP - 538

JO - Aquatic Ecology

JF - Aquatic Ecology

IS - 4

ER -
