Applications of a novel clustering approach using non-negative matrix factorization to environmental research in public health

Paul Fogel; Yann Gaston-Mathé; Douglas Hawkins; Fajwel Fogel; George Luta; S. Stanley Young

doi:10.3390/ijerph13050509

Statistics (Twin Cities)

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Often data can be represented as a matrix, e.g., observations as rows and variables as columns, or as a doubly classified contingency table. Researchers may be interested in clustering the observations, the variables, or both. If the data is non-negative, then Non-negative Matrix Factorization (NMF) can be used to perform the clustering. By its nature, NMF-based clustering is focused on the large values. If the data is normalized by subtracting the row/column means, it becomes of mixed signs and the original NMF cannot be used. Our idea is to split and then concatenate the positive and negative parts of the matrix, after taking the absolute value of the negative elements. NMF applied to the concatenated data, which we call PosNegNMF, offers the advantages of the original NMF approach, while giving equal weight to large and small values. We use two public health datasets to illustrate the new method and compare it with alternative clustering methods, such as K-means and clustering methods based on the Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). With the exception of situations where a reasonably accurate factorization can be achieved using the first SVD component, we recommend that the epidemiologists and environmental scientists use the new method to obtain clusters with improved quality and interpretability.
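The split-and-concatenate step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `center_columns` and `split_pos_neg`, the choice of column centering, and the side-by-side layout of the two parts are all assumptions made for illustration, since the abstract does not specify them. The sketch covers only the preprocessing; the resulting non-negative matrix would then be passed to an NMF routine of the reader's choice.

```python
def center_columns(rows):
    """Subtract each column's mean, which yields a mixed-sign matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def split_pos_neg(rows):
    """Return [positive part | absolute value of negative part] for each row.

    Assumption: the two parts are placed side by side, so a matrix with
    p columns becomes a non-negative matrix with 2 * p columns.
    """
    out = []
    for row in rows:
        pos = [x if x > 0 else 0.0 for x in row]
        neg = [-x if x < 0 else 0.0 for x in row]
        out.append(pos + neg)
    return out


# Toy data: 3 observations (rows) by 2 variables (columns).
data = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]
centered = center_columns(data)    # mixed signs, so plain NMF does not apply
stacked = split_pos_neg(centered)  # non-negative, so NMF can be applied
```

In this layout a large negative deviation becomes a large positive entry in the second half of the columns, which is one way to read the abstract's claim that large and small values receive equal weight.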

Original language	English (US)
Article number	509
Journal	International journal of environmental research and public health
Volume	13
Issue number	5
DOIs	https://doi.org/10.3390/ijerph13050509
State	Published - May 18 2016

Bibliographical note

Publisher Copyright:
© 2016 by the authors; licensee MDPI, Basel, Switzerland.

Keywords

K-means
NMF
PCA
SVD

Cite this

@article{b53bca38b5aa44ec849573fa92f1f5f4,

title = "Applications of a novel clustering approach using non-negative matrix factorization to environmental research in public health",

abstract = "Often data can be represented as a matrix, e.g., observations as rows and variables as columns, or as a doubly classified contingency table. Researchers may be interested in clustering the observations, the variables, or both. If the data is non-negative, then Non-negative Matrix Factorization (NMF) can be used to perform the clustering. By its nature, NMF-based clustering is focused on the large values. If the data is normalized by subtracting the row/column means, it becomes of mixed signs and the original NMF cannot be used. Our idea is to split and then concatenate the positive and negative parts of the matrix, after taking the absolute value of the negative elements. NMF applied to the concatenated data, which we call PosNegNMF, offers the advantages of the original NMF approach, while giving equal weight to large and small values. We use two public health datasets to illustrate the new method and compare it with alternative clustering methods, such as K-means and clustering methods based on the Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). With the exception of situations where a reasonably accurate factorization can be achieved using the first SVD component, we recommend that the epidemiologists and environmental scientists use the new method to obtain clusters with improved quality and interpretability.",

keywords = "K-means, NMF, PCA, SVD",

author = "Paul Fogel and Yann Gaston-Math{\'e} and Douglas Hawkins and Fajwel Fogel and George Luta and Young, {S. Stanley}",

note = "Publisher Copyright: {\textcopyright} 2016 by the authors; licensee MDPI, Basel, Switzerland.",

year = "2016",

month = may,

day = "18",

doi = "10.3390/ijerph13050509",

language = "English (US)",

volume = "13",

journal = "International journal of environmental research and public health",

issn = "1661-7827",

publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",

number = "5",

}

TY - JOUR

T1 - Applications of a novel clustering approach using non-negative matrix factorization to environmental research in public health

AU - Fogel, Paul

AU - Gaston-Mathé, Yann

AU - Hawkins, Douglas

AU - Fogel, Fajwel

AU - Luta, George

AU - Young, S. Stanley

PY - 2016/5/18

Y1 - 2016/5/18

N2 - Often data can be represented as a matrix, e.g., observations as rows and variables as columns, or as a doubly classified contingency table. Researchers may be interested in clustering the observations, the variables, or both. If the data is non-negative, then Non-negative Matrix Factorization (NMF) can be used to perform the clustering. By its nature, NMF-based clustering is focused on the large values. If the data is normalized by subtracting the row/column means, it becomes of mixed signs and the original NMF cannot be used. Our idea is to split and then concatenate the positive and negative parts of the matrix, after taking the absolute value of the negative elements. NMF applied to the concatenated data, which we call PosNegNMF, offers the advantages of the original NMF approach, while giving equal weight to large and small values. We use two public health datasets to illustrate the new method and compare it with alternative clustering methods, such as K-means and clustering methods based on the Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). With the exception of situations where a reasonably accurate factorization can be achieved using the first SVD component, we recommend that the epidemiologists and environmental scientists use the new method to obtain clusters with improved quality and interpretability.


KW - K-means

KW - NMF

KW - PCA

KW - SVD

UR - http://www.scopus.com/inward/record.url?scp=84969567757&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84969567757&partnerID=8YFLogxK

U2 - 10.3390/ijerph13050509

DO - 10.3390/ijerph13050509

M3 - Article

C2 - 27213413

AN - SCOPUS:84969567757

SN - 1661-7827

VL - 13

JO - International journal of environmental research and public health

JF - International journal of environmental research and public health

IS - 5

M1 - 509

ER -
