Chapter 5 Functional Clustering

Clustering is an exploratory analysis technique that identifies patterns in data by grouping similar elements together so that the resulting groups are as distinct as possible from one another. Beyond exploratory description, clustering can be used to detect outliers, recognize patterns, and generate hypotheses about the underlying population.

When working with functional data, traditional clustering methods can be adapted by incorporating notions of distance between functions or curves. This chapter introduces the concepts and tools available for functional data clustering with the fda.clust package.

5.1 R Packages for Clustering

Here is a list of popular R packages for clustering:

  • cluster: General-purpose clustering methods, including k-means, PAM, and hierarchical clustering.
  • factoextra: Visualization and interpretation of clustering results.
  • fpc: Additional methods for clustering and cluster validation.
  • mclust: Model-based clustering, classification, and density estimation.
  • NbClust: Determination of the optimal number of clusters.
  • clustertend: Check clustering tendency of a dataset.
  • dendextend: Visualization and adjustment of dendrograms.
  • clValid: Cluster validation statistics.
  • dbscan: Density-based spatial clustering of applications with noise (DBSCAN).
  • tclust: Trimmed k-means and clustering methods for robust statistics.
  • fclust: Fuzzy clustering methods (fuzzy clustering is also available via the fanny() function of the cluster package).

5.2 R Packages for Clustering Functional Data

Here is a list of R packages specifically for clustering functional data:

  • fdacluster: Implements k-means, hierarchical agglomerative, and DBSCAN clustering methods for functional data, allowing for the joint alignment and clustering of curves.
  • fda.usc: Methods for functional data analysis, including clustering of functional data.
  • fda: Core functional data analysis tools (basis representations and smoothing) on which clustering workflows are often built.
  • fda.clust: Clustering of functional data using methods such as k-means and hierarchical clustering.
  • funHDDC: Mixture models for high-dimensional functional data.
  • FDboost: Boosting methods for regression models with functional data.
  • funFEM: Clustering functional data by modeling the curves within a common and discriminative functional subspace.
  • funLBM: Model-based co-clustering of functional data.
  • fdasrvf: Functional data analysis with square-root velocity framework (SRVF) and clustering.
  • funcharts: Control charts and monitoring tools for functional data.
  • fdapace: Functional principal component analysis (PACE), which can support clustering workflows as a preprocessing step.

5.3 K-Means Clustering for Functional Data

K-means is one of the most popular clustering methods due to its simplicity and efficiency. The goal is to partition the data into \(k\) groups in such a way that the within-group distances between points and their group center (centroid) are minimized.

The optimization problem can be formulated as follows:

\[ \min_{\{m_1, \dots, m_k\}} \sum_{j=1}^k \sum_{\mathbf{x}_i\in C_j} \| \mathbf{x}_i - m_j \|^2 \]

Where \(m_j\) represents the centroid of cluster \(C_j\), and the goal is to minimize the sum of squared distances between each point \(\mathbf{x}_i\) in cluster \(C_j\) and its corresponding centroid \(m_j\).

The K-means algorithm follows these steps (a minimal base-R sketch follows the list):

  1. Initialization: Select \(k\) initial centroids (randomly or using a heuristic).
  2. Assignment: Assign each data point to the cluster with the nearest centroid.
  3. Update: Recompute the centroids as the mean of all points assigned to the cluster.
  4. Convergence: Repeat steps 2 and 3 until the centroids no longer change (or a maximum number of iterations is reached).
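
To make these steps concrete, here is a minimal base-R sketch of the Lloyd iteration on a toy bivariate dataset. It is illustrative only (all names are hypothetical and empty clusters are not handled):

set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
k <- 2
centers <- x[sample(nrow(x), k), , drop = FALSE]   # 1. initialization
repeat {
  # 2. assignment: nearest centroid in squared Euclidean distance
  d2 <- sapply(seq_len(k), function(j) colSums((t(x) - centers[j, ])^2))
  cl <- max.col(-d2)
  # 3. update: centroids as within-cluster means
  new_centers <- t(sapply(seq_len(k),
                          function(j) colMeans(x[cl == j, , drop = FALSE])))
  # 4. convergence: stop when the centroids no longer move
  if (max(abs(new_centers - centers)) < 1e-8) break
  centers <- new_centers
}
table(cl)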

However, some limitations or considerations should be noted:

  • Choice of k: The number of clusters \(k\) must be specified beforehand.
  • Local optima: The algorithm can converge to local optima depending on the initialization.
  • Empty clusters: If a cluster becomes empty or very small during the iterations, the algorithm may need to be restarted with different initial centers.

To address the problem of choosing \(k\), the elbow method is commonly used, where the within-group sum of squares is plotted for different values of \(k\) and the “elbow” point is selected as the optimal number of clusters.
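
As an illustration, the elbow plot can be produced with the multivariate kmeans() function from base R (a sketch on toy data):

set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
# total within-cluster sum of squares for k = 1, ..., 8
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Within-group sum of squares")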

The fda.clust package provides tools to perform K-means clustering for functional data. The goal is to cluster curves instead of points, and the package allows for several initialization options:

  • ncl = NULL: Random initialization with 2 groups.
  • ncl = integer: Number of clusters to form, and initial centers are selected automatically.
  • ncl = vector of integers: Specifies the indices of initial cluster centers.
  • ncl = fdata object: Uses the provided functional data as the initial cluster centers.

The main function for K-means clustering in fda.clust returns:

  • cluster: Indices of the groups assigned to each observation.
  • centers: Functional data objects representing the centers of the clusters.

The following example applies functional K-means to the first 150 curves of the phoneme dataset:

library(fda.usc)
data(phoneme)
mlearn <- phoneme$learn[1:150, ]
ylearn <- as.numeric(phoneme$classlearn[1:150])

# Perform K-means clustering, using curves 1, 51 and 101 as initial centers
out.fd1 <- fda.usc::kmeans.fd(mlearn, ncl = c(1, 51, 101))

table(out.fd1$cluster, ylearn)
##    ylearn
##      1  2  3
##   1 50 10  3
##   2  0 40  1
##   3  0  0 46

5.4 Functional Data Clustering with DBSCAN (fdbscan)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters of points in regions of high density, separated by regions of low density. It defines clusters based on two parameters:

  • \(\varepsilon\): Maximum distance between two points to be considered neighbors.
  • \(\delta\): Minimum number of points required to form a dense region.

The algorithm proceeds as follows (a multivariate sketch follows the list):

  1. Select a random unvisited point and compute its neighborhood \(N(\mathbf{x}_i)\).
  2. If \(|N(\mathbf{x}_i)| \ge \delta\), mark it as a core point and form a cluster by including its neighbors.
  3. Expand the cluster by adding points that are density-reachable from the core points.
  4. If \(|N(\mathbf{x}_i)| < \delta\), mark the point as noise.
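
Before the functional version, the following sketch illustrates the roles of \(\varepsilon\) (eps) and \(\delta\) (minPts) on toy multivariate data, using the dbscan package; the kNN distance plot is a common heuristic for choosing eps:

library(dbscan)
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
           matrix(runif(20, min = -2, max = 5), ncol = 2))  # background noise
kNNdistplot(x, k = 5)        # the "knee" of the curve suggests a value for eps
db <- dbscan(x, eps = 0.5, minPts = 5)
table(db$cluster)            # cluster 0 collects the noise points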

The following example applies fdbscan to the phoneme curves, selecting eps and minPts over a grid with optim.fdbscan():

library(fda.clust)

# Load functional data
set.seed(123)
data(phoneme, package = "fda.usc")
mlearn <- phoneme$learn[1:150, ]

# Optimize parameters for fdbscan
o <- optim.fdbscan(mlearn,
                   eps = seq(5, 45, by = 5),
                   minPts = seq(5, 45, by = 5))

# Perform DBSCAN clustering
result_fdbscan <- fdbscan(mlearn, 
                          eps = o$optimal$eps,
                          minPts = o$optimal$minPts)

# Plot the clustering results
par(mfrow=c(1,1))
plot(mlearn, col = result_fdbscan$cluster+1)

# Display cluster assignments
table(result_fdbscan$cluster)
## 
##   0   1   2 
## 113  24  13

5.5 Functional Data Clustering with Mean Shift (fmeanshift)

Mean-Shift clustering is a non-parametric, density-based algorithm. It treats the modes (local maxima) of the estimated density as cluster centers: each point is shifted iteratively towards the nearest mode, and points that converge to the same mode form a cluster. At iteration \(t\), each point is moved to the kernel-weighted mean of its neighborhood:

\[ \mathbf{x}_j^{(t+1)} = \mathrm{m} \left( \mathbf{x}_j^{(t)} \right) \]

Where the local mean \(\mathrm{m}(\mathbf{x}_j^{(t)})\) is given by:

\[ \mathrm{m} \left( \mathbf{x}_j^{(t)} \right) = \frac{ \sum_{\mathbf{x}_i \in N(\mathbf{x}_j^{(t)})} \mathbf{x}_i \cdot \mathrm{K}_h \left( \mathbf{x}_i - \mathbf{x}_j^{(t)} \right) }{ \sum_{\mathbf{x}_i \in N(\mathbf{x}_j^{(t)})} \mathrm{K}_h \left( \mathbf{x}_i - \mathbf{x}_j^{(t)} \right) } \]
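
The following minimal sketch illustrates this update with a Gaussian kernel on a toy univariate sample (all names are illustrative; fmeanshift implements the functional version):

set.seed(1)
x <- c(rnorm(50, mean = 0, sd = 0.3),
       rnorm(50, mean = 4, sd = 0.3))           # two density modes
h <- 0.5                                        # bandwidth
gauss <- function(u) exp(-0.5 * (u / h)^2)      # Gaussian kernel K_h
modes <- x
for (it in 1:50) {
  modes <- sapply(modes, function(m) {
    w <- gauss(x - m)                           # weights K_h(x_i - m)
    sum(w * x) / sum(w)                         # kernel-weighted local mean
  })
}
# points whose iterates converge to the same mode form one cluster
cluster <- as.integer(factor(round(modes, 2)))
table(cluster)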

The following example applies fmeanshift to the phoneme curves:

library(fda.clust)

# Load functional data
data(phoneme, package = "fda.usc")
mlearn <- phoneme$learn[1:150, ]

# Perform mean shift clustering
set.seed(123)
result_fmeanshift <- fmeanshift(mlearn, h = 15)  # h is the bandwidth parameter
## Min. Dist. InterPares: 20.0270419183663
# Plot the clustering results
par(mfrow=c(1,1))
plot(mlearn, col = result_fmeanshift$cluster)

# Display cluster assignments
table(result_fmeanshift$cluster)
## 
##   1   2 
## 104  46

5.6 Multivariate Functional Data Clustering

When dealing with multivariate functional data, clustering can be performed using a multidimensional metric. The aim is to cluster multivariate functional observations, where each observation may have multiple functional components.

Two key approaches for multivariate functional clustering are:

  1. Hierarchical clustering using the mfhclust() function from the fda.clust package, which combines information from multiple functional components.
  2. K-means clustering for multivariate functional data using the mfkmeans() function, which partitions the data into \(k\) clusters to minimize within-cluster variance.

5.6.1 Hierarchical Clustering with mfhclust()

The mfhclust() function performs hierarchical clustering on multivariate functional data. The following example demonstrates how to use mfhclust() with the Spanish weather dataset, which contains information on temperature and log-precipitation.

data(aemet, package = "fda.usc")
datos <- mfdata("temp"=aemet$temp, "logprec"=aemet$logprec)
# Perform hierarchical clustering using Ward's method
result <- mfhclust(datos, method = "ward.D2")

# Plot the dendrogram
plot(result, main = "Dendrogram of Multivariate Functional Data (Ward's Method)")

# Cut the dendrogram into 3 clusters
groups <- cutree(result, k = 3)

# Visualize the clustering results for the multivariate functional data
par(mfrow=c(1,3))
plot(aemet$temp, col=groups)
plot(aemet$logprec, col=groups)
plot(aemet$df[7:8], col=groups, asp=TRUE)

5.6.2 K-means Clustering with mfkmeans()

The mfkmeans() function performs k-means clustering for multivariate functional data. Similar to standard k-means, this approach aims to partition the multivariate functional data into \(k\) clusters by minimizing the within-cluster variance.

The following example demonstrates how to apply k-means clustering to the Spanish weather dataset using mfkmeans().

# Set the number of clusters
k <- 3

# Perform k-means clustering for multivariate functional data
set.seed(123) # Ensures reproducibility
result_kmeans <- mfkmeans(datos, ncl = k)

# Visualize the cluster assignments for the functional components
par(mfrow=c(1,3))
plot(aemet$temp, col=result_kmeans$cluster)
plot(aemet$logprec, col=result_kmeans$cluster)
plot(aemet$df[7:8], col=result_kmeans$cluster, asp=TRUE)

5.7 Cluster Validation Measures

The examples below evaluate the functional K-means partition of the phoneme curves obtained in Section 5.3 (out.fd1$cluster).

5.7.1 Silhouette Index

The Silhouette index \(IS_k\) measures how well each observation fits its assigned cluster. For an observation \(\mathbf{x}_i\), \(\mathrm{a}(\mathbf{x}_i)\) denotes the average distance from \(\mathbf{x}_i\) to the other members of its own cluster, and \(\mathrm{b}(\mathbf{x}_i)\) the smallest average distance to the members of any other cluster. The index is:

\[ IS_k = \frac{1}{n} \sum_{i=1}^n \frac{\mathrm{b}(\mathbf{x}_i) - \mathrm{a}(\mathbf{x}_i)}{\max \{ \mathrm{a}(\mathbf{x}_i), \mathrm{b}(\mathbf{x}_i) \}} \]

Values close to 1 indicate compact, well-separated clusters.

result_silhouette <- fda.clust::fclust.measures(mlearn, out.fd1$cluster, index = "silhouette")
result_silhouette
## [1] NA
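
As a cross-check, the silhouette can also be computed from the \(L^2\) distance matrix between the curves using cluster::silhouette(); this sketch assumes mlearn and out.fd1 from Section 5.3 are available in the workspace:

library(cluster)
D <- fda.usc::metric.lp(mlearn)          # L2 distances between the curves
sil <- silhouette(out.fd1$cluster, dmatrix = D)
summary(sil)$avg.width                   # average silhouette width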

5.7.2 Dunn Index

The Dunn Index \(ID_k\) is given by:

\[ ID_k = \frac{\min_{i \neq j} \mathrm{d}(C_i, C_j)}{\max_{i} \mathrm{d}(C_i)} \]

Where \(\mathrm{d}(C_i, C_j)\) is the distance between clusters \(C_i\) and \(C_j\), and \(\mathrm{d}(C_i)\) is the diameter of cluster \(C_i\); larger values indicate better partitions.

result_dunn <- fda.clust::fclust.measures(mlearn, out.fd1$cluster, index = "dunn")
result_dunn
## [1] 0.1479544

5.7.3 Davies-Bouldin Index

The Davies-Bouldin index \(IDB_k\) is defined as:

\[ IDB_k = \frac{1}{K} \sum_{j=1}^K \max_{i \neq j} \left( \frac{\sigma_i + \sigma_j}{d(m_i, m_j)} \right) \]

Where \(\sigma_j\) is the average distance from the observations in cluster \(j\) to its center \(m_j\), and \(d(m_i, m_j)\) is the distance between centers; smaller values indicate better partitions.

result_db_index <- fda.clust::fclust.measures(mlearn, out.fd1$cluster, index = "db")
result_db_index
## [1] 6.099786
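
The following base-R sketch computes the index directly from the definition on a toy multivariate example (illustrative only):

set.seed(1)
x  <- rbind(matrix(rnorm(100), ncol = 2),
            matrix(rnorm(100, mean = 3), ncol = 2))
cl <- kmeans(x, centers = 2, nstart = 20)$cluster
K  <- max(cl)
# centers m_j and average within-cluster distances sigma_j
m <- t(sapply(seq_len(K), function(j) colMeans(x[cl == j, , drop = FALSE])))
s <- sapply(seq_len(K), function(j)
  mean(sqrt(colSums((t(x[cl == j, , drop = FALSE]) - m[j, ])^2))))
# R[i, j] = (sigma_i + sigma_j) / d(m_i, m_j) for i != j
R <- outer(seq_len(K), seq_len(K),
           Vectorize(function(i, j)
             if (i == j) NA_real_
             else (s[i] + s[j]) / sqrt(sum((m[i, ] - m[j, ])^2))))
mean(apply(R, 1, max, na.rm = TRUE))     # Davies-Bouldin index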

5.7.4 Calinski-Harabasz Index

The Calinski-Harabasz index \(ICH_k\) is defined as:

\[ ICH_k = \frac{SS_M (n - k)}{SS_W (k - 1)} \]

Where \(SS_M\) is the between-group sum of squares, \(SS_W\) the within-group sum of squares, \(n\) the number of observations, and \(k\) the number of clusters; larger values indicate better partitions.

result_ch_index <- fda.clust::fclust.measures(mlearn, out.fd1$cluster, index = "ch")
result_ch_index
## [1] 0.5357454
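
As a further cross-check, fpc::cluster.stats() computes both the Dunn and Calinski-Harabasz indices from a distance matrix; this sketch again uses the \(L^2\) distances between the phoneme curves and the Section 5.3 partition:

library(fpc)
D  <- fda.usc::metric.lp(mlearn)         # L2 distances between the curves
cs <- cluster.stats(as.dist(D), out.fd1$cluster)
cs$dunn                                  # Dunn index
cs$ch                                    # Calinski-Harabasz index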

The selection of the most appropriate measure depends on the shape, size, and separation of clusters.

Conclusions and Future Work

The development of fda.clust is ongoing, with several key areas for future improvement and expansion:

  1. Automatic Parameter Selection: Automatically select key parameters such as the number of clusters (\(k\)), the mean-shift bandwidth (\(h\)), and the DBSCAN parameters (\(\varepsilon\) and \(\delta\)).

  2. Efficiency Improvements: Avoid redundant distance calculations by caching results and enabling parallel processing.

  3. Additional Clustering Procedures: Extend support to mode-based clustering, PCA-based clustering, and clustering of 2D functional data.

  4. Heterogeneity Measures: Incorporate heterogeneity measures from Ferraty and Vieu (2006, Chapter 9, Section 3).

  5. Enhanced Visualizations: Develop interactive plots.

  6. Community Feedback: Incorporate suggestions and contributions from users of the package.

Chapter references

  • Bouveyron, C., Côme, E., & Jacques, J. (2014). The functional latent block model for the co-clustering of electricity consumption curves. Journal of the Royal Statistical Society: Series C (Applied Statistics), 63(3), 403-427.
  • Centofanti, F., Vichi, M., & Vittadini, G. (2024). Sparse and smooth functional clustering (SaS-Funclust). Statistical Papers, 65(1), 1-26.
  • Hartigan, J. A., & Wong, M. A. (1979). A k-means clustering algorithm. Applied Statistics, 28(1), 100-108.
  • Kaufman, L., & Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley.

Funding and Financial Support

This work is part of grants PID2020-113578RB-I00 and PID2023-147127OB-I00, funded by MCIN/AEI/10.13039/501100011033 and by "ERDF/EU". It has also been supported by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2024/14) and by CITIC, which, as a center accredited for excellence within the Galician University System and a member of the CIGUS Network, receives subsidies from the Department of Education, Science, Universities, and Vocational Training of the Xunta de Galicia and is co-financed by the EU through the FEDER Galicia 2021-27 operational program (Ref. ED431G 2023/01).

References

Akaike, Hirotugu. 1973. "Maximum Likelihood Identification of Gaussian Autoregressive Moving Average Models." Biometrika 60 (2): 255–65.
Aneiros-Pérez, Germán, and Philippe Vieu. 2006. “Semi-Functional Partial Linear Regression.” Statist. Probab. Lett. 76 (11): 1102–10.
Berrendero, José R, Antonio Cuevas, and José L Torrecilla. 2018. "On the Use of Reproducing Kernel Hilbert Spaces in Functional Classification." Journal of the American Statistical Association 113: 1210–18.
Berrendero, José R, Antonio Cuevas, and José L Torrecilla. 2016. “Variable Selection in Functional Data Classification: A Maxima-Hunting Proposal.” Statistica Sinica, 619–38.
Cardot, Hervé, Christophe Crambes, and Pascal Sarda. 2005. “Quantile Regression When the Covariates Are Functions.” Nonparametric Statistics 17 (7): 841–56.
Cardot, Hervé, Frédéric Ferraty, and Pascal Sarda. 1999. “Functional Linear Model.” Statist. Probab. Lett. 45 (1): 11–22.
Carmack, Patrick S, Jeffrey S Spence, and William R Schucany. 2012. “Generalised Correlated Cross-Validation.” Journal of Nonparametric Statistics 24 (2): 269–82.
Chiou, Jeng-Min, Hans-Georg Müller, Jane-Ling Wang, et al. 2004. "Functional Response Models." Statistica Sinica 14 (3): 675–94.
Cuesta-Albertos, Juan A, Manuel Febrero-Bande, and Manuel Oviedo de la Fuente. 2017. "The \(DD^G\)-Classifier in the Functional Setting." Test 26 (1): 119–42.
Cuesta-Albertos, Juan, and Alicia Nieto-Reyes. 2008. “The Random Tukey Depth.” Computational Statistics and Data Analysis 52 (11): 4979–88.
Cuevas, Antonio, Manuel Febrero, and Ricardo Fraiman. 2007. “Robust Estimation and Classification for Functional Data via Projection-Based Depth Notions.” Comput. Statist. 22 (3): 481–96. http://link.springer.com/article/10.1007/s00180-007-0053-0.
Efron, Bradley, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. 2004. “Least Angle Regression.” The Annals of Statistics 32 (2): 407–99.
Faraway, Julian J. 1997. “Regression Analysis for a Functional Response.” Technometrics 39 (3): 254–61.
Febrero-Bande, Manuel, Pedro Galeano, and Wenceslao González-Manteiga. 2008. “Outlier Detection in Functional Data by Depth Measures, with Application to Identify Abnormal \({\rm NO}_x\) Levels.” Environmetrics 19 (4): 331–45.
———. 2010. “Measures of Influence for the Functional Linear Model with Scalar Response.” J. Multivariate Anal. 101 (2): 327–39.
Febrero-Bande, Manuel, and Wenceslao González-Manteiga. 2013. “Generalized Additive Models for Functional Data.” Test 22 (2): 278–92. http://dx.doi.org/10.1007/s11749-012-0308-0.
Febrero-Bande, Manuel, Wenceslao González-Manteiga, and Manuel Oviedo de la Fuente. 2019. “Variable Selection in Functional Additive Regression Models.” Computational Statistics 34 (2): 469–87. https://doi.org/10.1007/s00180-018-0844-5.
Febrero-Bande, Manuel, and M Oviedo de la Fuente. 2012. “Statistical Computing in Functional Data Analysis: The R Package fda.usc.” J. Statist. Software 51 (4): 1–28.
Ferraty, Frédéric, Aldo Goia, Ernesto Salinelli, and Philippe Vieu. 2013. “Functional Projection Pursuit Regression.” Test 22 (2): 293–320.
Ferraty, Frédéric, P Hall, and Philippe Vieu. 2010. “Most-Predictive Design Points for Functional Data Predictors.” Biometrika 97 (4): 807–24.
Ferraty, Frédéric, Juhyun Park, and Philippe Vieu. 2011. “Estimation of a Functional Single Index Model.” In Recent Advances in Functional Data Analysis and Related Topics, 111–16. Springer.
Ferraty, Frédéric, Ingrid Van Keilegom, and Philippe Vieu. 2012. “Regression When Both Response and Predictor Are Functions.” Journal of Multivariate Analysis 109: 10–28.
Ferraty, Frédéric, and Philippe Vieu. 2003. “Curves Discrimination: A Nonparametric Functional Approach.” Comput. Statist. Data Anal. 44 (1): 161–73. http://www.sciencedirect.com/science/article/pii/S016794730300032X.
———. 2006. Nonparametric Functional Data Analysis. Springer Series in Statistics. New York: Springer-Verlag.
Ferraty, F., and P. Vieu. 2009. “Additive Prediction and Boosting for Functional Data.” Computational Statistics & Data Analysis 53 (4): 1400–1413. http://www.sciencedirect.com/science/article/pii/S0167947308005628.
Fraiman, Ricardo, and Graciela Muniz. 2001. “Trimmed Means for Functional Data.” Test 10 (2): 419–40.
García-Portugués, Eduardo, Wenceslao González-Manteiga, and Manuel Febrero-Bande. 2014. "A Goodness-of-Fit Test for the Functional Linear Model with Scalar Response." Journal of Computational and Graphical Statistics 23 (3): 761–78.
Kato, Kengo. 2012. "Estimation in Functional Linear Quantile Regression." The Annals of Statistics 40 (6): 3108–36.
Krämer, Nicole, and Masashi Sugiyama. 2011. “The Degrees of Freedom of Partial Least Squares Regression.” Journal of the American Statistical Association 106 (494): 697–705.
Li, Jun, Juan A Cuesta-Albertos, and Regina Y Liu. 2012. "\(DD\)-Classifier: Nonparametric Classification Procedure Based on \(DD\)-Plot." J. Amer. Statist. Assoc. 107 (498): 737–53. http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2012.688462.
Lin, Yi, Hao Helen Zhang, et al. 2006. “Component Selection and Smoothing in Multivariate Nonparametric Regression.” The Annals of Statistics 34 (5): 2272–97.
López-Pintado, Sara, and Juan Romo. 2009. "On the Concept of Depth for Functional Data." Journal of the American Statistical Association 104: 718–34.
Müller, Hans-Georg et al. 2008. "Functional Modeling of Longitudinal Data." In Longitudinal Data Analysis, 237–66. Chapman; Hall/CRC.
Müller, Hans-Georg, and Ulrich Stadtmüller. 2005. “Generalized Functional Linear Models.” Annals of Statistics, 774–805.
Müller, Hans-Georg, and Fang Yao. 2012. “Functional Additive Models.” Journal of the American Statistical Association.
Ordóñez, Celestino, Manuel Oviedo de la Fuente, Javier Roca-Pardiñas, and José Ramón Rodríguez-Pérez. 2018. "Determining Optimum Wavelengths for Leaf Water Content Estimation from Reflectance: A Distance Correlation Approach." Chemometrics and Intelligent Laboratory Systems 173: 41–50.
Oviedo de la Fuente, Manuel, Manuel Febrero-Bande, María Pilar Muñoz, and Àngela Domínguez. 2018. "Predicting Seasonal Influenza Transmission Using Functional Regression Models with Temporal Dependence." PloS One 13 (4): e0194250.
Oviedo-de la Fuente, Manuel, Carlos Cabo, Celestino Ordóñez, and Javier Roca-Pardiñas. 2021. “A Distance Correlation Approach for Optimum Multiscale Selection in 3D Point Cloud Classification.” Mathematics 9 (12): 1328.
Peng, Hanchuan, Fuhui Long, and Chris Ding. 2005. “Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy.” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 8: 1226–38.
Preda, C., and G. Saporta. 2005. “PLS Regression on a Stochastic Process.” Comput. Statist. Data Anal. 48 (1): 149–58.
Ramsay, J. O., and B. W. Silverman. 2005a. Functional Data Analysis. Springer.
———. 2005b. Functional Data Analysis. Second. Springer Series in Statistics. New York: Springer-Verlag.
Székely, G. J., M. L. Rizzo, and N. K. Bakirov. 2007. “Measuring and Testing Dependence by Correlation of Distances.” The Annals of Statistics 35 (6): 2769–94. http://projecteuclid.org/euclid.aos/1201012979.
Székely, Gábor J, and Maria L Rizzo. 2013. “The Distance Correlation \(t\)-Test of Independence in High Dimension.” J. Multivariate Anal. 117: 193–213.
Wood, Simon N. 2004. “Stable and Efficient Multiple Smoothing Parameter Estimation for Generalized Additive Models.” Journal of the American Statistical Association 99 (467): 673–86.
Yenigün, C Deniz, and Maria L Rizzo. 2015. “Variable Selection in Regression Using Maximal Correlation and Distance Correlation.” Journal of Statistical Computation and Simulation 85 (8): 1692–1705.
Zhao, Yihong, Huaihou Chen, and R Todd Ogden. 2015. “Wavelet-Based Weighted LASSO and Screening Approaches in Functional Linear Regression.” Journal of Computational and Graphical Statistics 24 (3): 655–75.
Zivot, Eric, and Jiahui Wang. 2007. Modeling Financial Time Series with S-PLUS. Vol. 191. Springer Science & Business Media.