Chapter 1 Functional Data: Definition, Representation and Manipulation

1.1 Some definitions of Functional Data

Functional data analysis is a branch of statistics that analyzes data providing information about curves, surfaces or anything else varying over a continuum. The continuum is often time, but may also be spatial location, wavelength, probability, etc.

Functional data analysis is a branch of statistics concerned with analysing data in the form of functions.

Definition 1: A random variable \(\mathcal{X}\) is called a functional variable if it takes values in a functional space \(\mathcal{E}\) –complete normed (or seminormed) space–, (Frédéric Ferraty and Vieu 2006)
Definition 2: A functional dataset \(\{\mathcal{X}_1,\ldots,\mathcal{X}_n\}\) is the observation of n functional variables \(\mathcal{X}_1,\ldots,\mathcal{X}_n\) identically distributed as \(\mathcal{X}\), (Frédéric Ferraty and Vieu 2006).

A research stream focuses on the computational treatment that is used to working with functional data as an extension of multivariate data. Thus, the following definition of functional data proposed by (Muller et al. 2008) is also common in practice.

Definition 3. Functional data is multivariate data with an ordering on the dimensions, so that \(a=t_1<t_2,\ldots,t_{m-1}< b=t_m\).

1.2 In fda.usc: ``The data are curves’’

\(X_i(t)\) represents the mean temperature (averaged over 1980-2009 years) at the ith weather station in Spain, and at time time \(t\) during the year.

Temperature curves (right) located in Spanish airports (left),

But what code has been used to obtain the graph?

data(aemet)
par(mfrow=c(1,2))
CanaryIslands <- ifelse(aemet$df$latitude < 31, 2, 4)
plot(aemet$df[,c("longitude","latitude")], col = CanaryIslands, lwd=2)
plot(temp, col = CanaryIslands, lwd=2)

1.2.1 Definition of –fdata– class in R

Definition of fdata class object: An object called fdata as a list of the following components:

data: typically a matrix of (n x m) dimension which contains a set of n curves discretized in m points or argvals.
argvals: locations of the discretization points, by default: \({{t}_1=1,\ldots,{t}_m=m}\}\).
rangeval: rangeval of discretization points.
names: (optional) list with three components: main, an overall title, xlab, a title for the x axis and ylab, a title for the y axis.

lapply(aemet,class)
names(aemet$df)
lapply(aemet$temp,class)
temp <- aemet$temp
names(temp)
dim(temp)
length(argvals(temp))
rangeval(temp)
temp$names

1.2.2 Some utilities of fda.usc package

Basic operations for fdata class objects:

Group Math: abs, sqrt, floor, ceiling,semimetric.basis(),trunc,round,signif,exp,log,cos,sin,tan`.
Other operations: [], is.fdata(), c(), dim(), ncol(), nrow().

is.fdata(aemet$temp)
is.fdata(aemet$df)
dim(aemet$df)
dim(aemet$temp)

Convert the class

The class fdata only uses the evaluations at the discretization points.
The fdata2fd() converts fdata object to fd object (using the basis representation).
Inversely, the fdata() converts object of class: fd, fds, fts, sfts, vector, matrix, data.frame to an object of class fdata.

temp.fd=fdata2fd(temp,type.basis="fourier",nbasis=15)
temp.fdata=fdata(temp.fd) #back to fdata
class(temp.fd)

## [1] "fd"

class(temp.fdata)

## [1] "fdata"

split.fdata() and unlist(): A wrapper for the split and unlist function for fdata object

# Canary Islands in red vs Iberian Peninsula in blue
l1 <- fda.usc:::split.fdata(temp,CanaryIslands)
par(mfrow=c(1,2))
plot(l1[[1]],col=2,ylim=c(0,30))
plot(l1[[2]],col=4,ylim=c(0,30))

dim(l1[[1]]);dim(l1[[2]])

## [1]   9 365

## [1]  64 365

Group Ops: + -, *, /, ^, %%, %/%, &, |, !, ==, !=, <, <=, >=, >.

l1[[1]]==l1[[2]]

## [1] FALSE

order.fdata() A wrapper for the order function. The funcional data is ordered w.r.t the sample order of the values of vector.

temp2 <- order.fdata(order(CanaryIslands),temp)

Generate random process of fdata class.

rproc2fdata() function generates Functional data from: Ornstein Uhlenbeck process, Brownian process, Gaussian process or Exponential variogram process.

par(mfrow=c(1,2))
lent <- 30
tt <- seq(0,1,len=lent)
xgen1 <- rproc2fdata(200,t=tt,sigma="OU",par.list=list("scale"=1))
plot(xgen1)
mu <- fdata(sin(2*pi*tt),tt)
xgen2 <- rproc2fdata(200,mu=mu,sigma="OU",par.list=list("scale"=1))
plot(xgen2)

1.2.3 Definition of –ldata– class in R

ldata is a list with two type of objects:

df is a data frame with the multivariate data with n rows.
... objects of class fdata with n rows.

ldat <- ldata(df=aemet$df,temp=aemet$temp, logprec=aemet$logprec )
plot(ldat)

1.3 Resume by smoothing

If we supposed that our functional data \(Y(t)\) is observed through the model \(Y(t_i)=X(t_i)+\varepsilon(t_i)\) where the residuals \(\varepsilon(t)\) are independent with \(X(t)\).

We can to get back the original signal \(X(t)\) using a linear smoother: \[\hat{X}(t_i)=\sum_{i=1}^{n} s_{i}(t_j)Y(t_i) \Rightarrow \mathbf{\hat{X}}=\mathbf{S}\mathbf{Y} \] where \(s_{i}(t_j)\) is the weight that the point \(t_j\) gives to the point \(t_i\).

We use two methods to estimate the smoothing matrix \(S\).

Finite representation in a fixed basis (Ramsay and Silverman 2005a):

Let \(X(t)\in \mathcal{L}_2\), \[ X(t)= \sum_{k\in\mathbb{N}}{c_k\phi_k(t)}\approx\sum_{k=1}^K c_k\phi_k(t)=c^{\top}\Phi\]

The smoothing matrix is given by: \(\mathbf{S}=\Phi(\Phi^{\top}W\Phi+\lambda R)^{-1}\Phi^{\top}W\) where \(\lambda\) is the penalty parameter.

Type of basis:

Fourier: Design to represent periodic functions. Orthonormal
BSplines: Set of polynomials (of order m) defined in subintervals constructed in such a way that in the border of the subintervals the polynomials coincide (up to \(m - 2\) derivative). Banded
Wavelets: Powerful especially when the grid is a power of 2. Orthonormal
Polynomials: Adjust a polynomial to the whole curve.

Example: raw (left) and smoothed temperature curves with 111 (center) and 11 (right) elements of a B-spline base:

bsp11 <- create.bspline.basis(temp$rangeval,nbasis=11)
bsp111 <- create.bspline.basis(temp$rangeval,nbasis=111)
S.bsp11 <-  S.basis(temp$argvals, bsp11)
S.bsp111 <- S.basis(temp$argvals, bsp111)
temp.bsp11 <- temp.bsp111 <- temp
temp.bsp11$data <- temp$data%*%S.bsp11
temp.bsp111$data <- temp$data%*%S.bsp111 
par(mfrow=c(1,3))
plot(temp)
plot(temp.bsp111, main="111 Bspline basis elements")
plot(temp.bsp11, main="11 Bspline basis elements")

# Another way 
out1 <- optim.basis(temp, type.CV=CV.S)
names(out1)

##  [1] "gcv"          "numbasis"     "lambda"       "fdataobj"     "fdata.est"   
##  [6] "gcv.opt"      "numbasis.opt" "lambda.opt"   "S.opt"        "base.opt"

out1$numbasis.opt

## [1] 76

Kernel Smoothing (Frédéric Ferraty and Vieu 2006)

The problem is to estimate the smoothing parameter or bandwidth \(\nu=h\) that better represents the functional data using kernel smoothing. Now,the nonparametric smoothing of functional data is given by the smoothing matrix \(S\): \[s_{ij}=\frac{1}{h}K\left(\frac{t_i-t_j}{h}\right)\]

\[S(h)=(s_j(t_i))=\frac{K\left(\frac{t_i-t_j}{h}\right)}{\sum_{k=1}^{\top}K\left(\frac{t_k-t_j}{h}\right)}\] where \(h\) is the bandwidth and \(K()\) the Kernel function.

Different types of kernels \(K()\) are defined in the package, see Kernel function.

Example: three examples of non-parametric smoothing with different values of the bandwidth parameter (\(h = 1\), left; \(h = 10\), center and \(h\) selected by the GCV criteria, right).

S <- S.NW(temp$argvals,h=1)
temp <- aemet$temp
temp.hat <- temp
temp.hat$data <- temp$data%*%S
args(optim.np)

## function (fdataobj, h = NULL, W = NULL, Ker = Ker.norm, type.CV = GCV.S, 
##     type.S = S.NW, par.CV = list(trim = 0, draw = FALSE), par.S = list(), 
##     correl = TRUE, verbose = FALSE, ...) 
## NULL

# Another way 
temp.est <- optim.np(temp,h=1)$fdata.est
par(mfrow=c(1,3))
plot(temp.est,main="Kernel smooth - h=1")
temp.est == temp.hat

## [1] FALSE

plot(temp.hat, main="Kernel smooth - h=10")
plot(optim.np(temp)$fdata.est, main="Kernel smooth- h-GCV")

Finite representation in a Data-driven basis (Cardot, Ferraty, and Sarda 1999)

Functional Principal Components (FPC) Functional Principal Components Analysis (FPCA) and the Functional PLS (FPLS) allow display the functional in a few components.

The functional data can be rewritten as a decomposition in an orthonormal PC basis: maximizing the variance of \(X(t)\):

\[ \hat{X}_{i}(t)=\sum_{i=1}^{K}f_{ik}\xi_{k}(t) \]

where \(f_{i,k}\) is the score of the principal component PC, \(\xi_k\).

Example: Functional principal component analysis (FPCA)

y <- rowSums(aemet$logprec$data)
par(mfrow=c(1,3))
plot(aemet$temp)
plot((pc <- create.pc.basis(aemet$temp,l=1:2))[["fdata.est"]],
     main="Temperature, 2 FPC")
pc2 <- fdata2pc(aemet$temp)
plot(pc2$rotation, main="2 FPC basis", xlab="day")

summary(pc2)

## [1] TRUE
## 
##      - SUMMARY:   fdata2pc  object   -
## 
## -With 2  components are explained  98.78 %
##  of the variability of explicative variables.
##  
## -Variability for each component (%):
## [1] 85.57 13.21

Functional Partial Least Squares (FPLS) (Krämer and Sugiyama 2011) The PLS components seek to maximize the covariance of \(X(t)\) and \(y\).

args(create.pls.basis)
args(fdata2pls)

An integrated version of these functions is coming.

Example: Exploratory analysis for spectrometric curves

Tecator dataset: 215 spectrometric curves of meat samples also with Fat, Water and Protein contents obtained by analytic procedures.

Tecator Goal: Explain the fat content through spectrometric curves.

data(tecator)
names(tecator)

## [1] "absorp.fdata" "y"

names(tecator$y)

## [1] "Fat"     "Water"   "Protein"

As shown in tecator Figure, the plot of \(X(t)\) against \(t\) is not necessarily the most informative and other semi-metric can allow to extract much information from functional variables.

The information about fat seems to be contained in the shape of the curves, so, the semimetric of derivatives could be preferred to \(\mathcal{L}_2\). It is difficult to find the best plot given a particular functional dataset because the shape of graphics depends strongly on the proximity measure.

x <- tecator$absorp
par(mfrow=c(1,2))
boxplot(tecator$y$Fat)
ycat <- ifelse(tecator$y$Fat < 20,2,4)
plot(x, col=ycat)

1.3.1 Derivatives

To compute derivatives there are several options (see fdata.deriv function):

Raw Diferentiation: Only interesting when the number of discretization points are dense.
Basis representation: Dierentiate a finite representation in a basis.
Spline Interpolation: Perform a spline interpolation of given data points and use this interpolation for computing derivatives.
Nonparametric Estimation: Use a Local Polynomial Estimator of the functional data.

Example:

### Computing distances, norms and inner products

Utilities for computing distances, norm and inner products are included in the package. Below are described those but of course, many others can be built with the only restriction that the first two arguments correspond to class –fdata–. In addition, the procedures of the package fda.usc that contains the argument –metric– allow the use of metric or semi-metric functions as shown in the following sections.

Distance between functional elements, metric.lp: \(d(X(t),Y(t))=\left\|X(t)-Y(t)\right\|\) with a norm \(\left\| \cdot \right\|\)
Norm norm.fdata: \(\left\|X(t)\right\|=\sqrt{\left\langle X(t),X(t)\right\rangle}\)
Inner product inprod.fdata: \(\left\langle x,y \right\rangle=(1/4)(\left\|x+y\right\|^2-\left\|x-y\right\|^2)\).

fda.usc package collects several metric and semi-metric functions which allow to extract as much information possible from the functional variable.

Option 1: If the data is in a Hilbert space, represent your data in a basis, and compute all the things accordingly. From a practical point of way, the representation of a functional datum must be approximated by a finite number of terms.

semimetric.basis()

Option 2: Approximates all quantities using numerical approximations (Valid for Hilbert and non-Hilbert spaces). This usually involves numerical approximations of integrals, derivatives and so on. A collection of semi-metrics proposed by (Frédéric Ferraty and Vieu 2006) are also included in the package.

semimetric.pca(), based on the Principal Components.
semimetric.mplsr(), based on the Partial Least Squares.
semimetric.deriv(), based on B-spline representation.
semimetric.hshift(), measure the horizontal shift effect.
semimetric.fourier(), based on ther Fourier representation

Example of computing norms

Consider \(X(t) = t^2; t\in[0, 1]\). In this case \(||X_1|| =\sqrt{1/5}=0.4472136\) and \(||X_2|| =1/3\)

t = seq(0, 1, len = 101)
x2 = t^2
x2 = fdata(x2, t)
(c(norm.fdata(x2), norm.fdata(x2, lp = 1))) # L2, L1 norm

## [1] 0.4472509 0.3333500

f1 = fdata(rep(1, length(t)), t)
f2 = fdata(sin(2 * pi * t), t) * sqrt(2)
f3 = fdata(cos(2 * pi * t), t) * sqrt(2)
f4 = fdata(sin(4 * pi * t), t) * sqrt(2)
f5 = fdata(cos(4 * pi * t), t) * sqrt(2)
fou5 = c(f1, f2, f3, f4, f5) # Fourier basis is Orthonormal
round(inprod.fdata(fou5), 5)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
## [3,]    0    0    1    0    0
## [4,]    0    0    0    1    0
## [5,]    0    0    0    0    1

Example of computing distances

set.seed(1:4)
p <- 1001
r <- rnorm(p,sd=.1)
x <- seq(0,2*pi,length=p)
fx <- fdata((sin(x)+r)/sqrt(pi),x)
fx0 <- fdata(r,x)
plot(c(fx,fx0))

round(metric.lp(fx,fx0)[1,1],3)

## [1] 1.003

round(semimetric.basis(fx,fx0,nbasis1=5,nbasis2=5)[1,1],3)

## [1] 0.993

round(semimetric.basis(fx,fx0,nbasis1=11,nbasis2=111)[1,1],3)

## [1] 0.999

round(semimetric.fourier(fx,fx0,nbasis=5)[1,1],3)

## [1] 0.997

round(semimetric.fourier(fx,fx0,nbasis=11)[1,1],3)

## [1] 0.999

round(semimetric.deriv(fx,fx0,nderiv=0,nknot=5)[1,1],3)

## [1] 0.993

round(semimetric.deriv(fx,fx0,nderiv=0,nknot=11)[1,1],3)

## [1] 0.998

integrate(function(x){(sin(x)/sqrt(pi))^2},0,2*pi)

## 1 with absolute error < 7.3e-10

1.4 Semi-metric as classification rule

In this section, we explore how semi-metrics can be used as a unupervised ## Semi-metric as Classification Rule

In this section, we explore how semi-metrics can be used as an unsupervised classification tool for functional data. A semi-metric defines distances between curves and allows for the grouping of similar observations. This approach is used to cluster functional data based on the similarity of their shapes, derivatives, or other features.

We begin by applying hierarchical clustering using the metric.lp semi-metric, both on the raw functional data and its first derivative. The goal is to assess how the input choice (raw vs. derivative) affects the classification results.

data(tecator)
set.seed(4:1)
#ind <- sample(215,30)
#xx <- tecator$absorp[ind]

yy <- ifelse(tecator$y$Fat<17,0,1)
ind <- c(sample(which(yy==0),10),sample(which(yy==1),10))
xx <- fdata.deriv(tecator$absorp[ind])
yy <- yy[ind]

par(mfrow=c(1,1))
d1 <- as.dist(metric.lp(xx),T,T)
c1 <- hclust(d1)
c1$labels=yy
plot(c1,main="Raw data -- metric.lp",xlab="Class of eah leaf")

Next, we try with the derivative of the curves:

xx.d1 <- fdata.deriv(xx,nderiv=1)
d2 <- as.dist(metric.lp(xx.d1),T,T)
c2 <- hclust(d2)
c2$labels=yy
plot(c2,main="Derivative -- metric.lp",xlab="Class of eah leaf")

# Cut the dendrograms into 2 clusters
table(yy, cutree(c1, k = 2))

##    
## yy   1  2
##   0 10  0
##   1  8  2

table(yy, cutree(c2, k = 2))

##    
## yy   1  2
##   0 10  0
##   1  6  4

We introduce the fhclust() function (in the fda.clust package), which allows users to achieve this more easily.

fhclust(xx)

See help for more details.

Extension: Multivariate Functional Data

We also explore multivariate functional data, where multiple components are combined using a multidimensional metric. A common choice is defined as:

\[ m\left(\left(x_0^1, \ldots, x_0^p\right), \left(x_i^1, \ldots, x_i^p\right)\right) := \sqrt{m_1\left(x_0^1, x_i^1\right)^2 + \cdots + m_p\left(x_0^p, x_i^p\right)^2} \]

where \(m_{i}\) represents the metric corresponding to the \(i\)-th component of the product space. For implementation details, see the fda.usc:::metric.ldata function.

It is essential to ensure that the metrics for the individual components have similar scales to prevent one component from disproportionately influencing the overall distance.

# Checking the range of two components to assess scale differences
range(d1)

## [1] 0.009269209 0.074329337

range(d2)

## [1] 0.0003392917 0.0077699816

Now the mfhclust() function from the fda.clust package allows the user to do this easily. The call would be:

library(fda.clust)
mfdat=list("xx"=xx,
           xx.d1 = xx.d1)
c12 <- mfhclust(mfdat)
c12$labels=yy
plot(c12,main="mfdata - metric.mfdata",xlab="Class of eah leaf")

# Cut the dendrograms into 2 clusters
table(yy, cutree(c1, k = 2))

##    
## yy   1  2
##   0 10  0
##   1  8  2

table(yy, cutree(c2, k = 2))

##    
## yy   1  2
##   0 10  0
##   1  6  4

table(yy, cutree(c12, k = 2))

##    
## yy   1  2
##   0 10  0
##   1  5  5

An example with real data for weather in Spain:

data(aemet, package = "fda.usc")
datos <- mfdata("temp"=aemet$temp,"logprec"=aemet$logprec)
# dd = metric.mfdata(datos)
# Perform hierarchical clustering using the default method (ward.D2)
result <- mfhclust(datos, method = "ward.D2")
# Plot the dendrogram
plot(result, main = "Dendrogram of Multivariate Functional Data (Ward's Method)")

# Cut the dendrogram into  clusters
groups <- cutree(result, k = 3)
colors <- c(rgb(1, 0, 0, alpha = 0.25),
            rgb(0, 1, 0, alpha = 0.25), 
            rgb(0, 0, 1, alpha = 0.25)) 

gcolors <- colors[groups]
par(mfrow=c(1,3))
plot(aemet$temp,col=gcolors )
plot(aemet$logprec,col=gcolors )
plot(aemet$df[7:8],col=gcolors ,asp=T)

See Example of multivariate FDA usingh Spanish Weather to view a use case.

NOTE: Dependent data in tecator dataset?

# acf(tecator$y$Fat)
ar(tecator$y$Fat)

## 
## Call:
## ar(x = tecator$y$Fat)
## 
## Coefficients:
##      1       2  
## 0.6552  0.1161  
## 
## Order selected 2  sigma^2 estimated as  72.82

# plot(tecator$y,col=1:3)
par(mfrow=c(1,1))
# ts.plot(tecator$y$Fat)
barplot(t(tecator$y))

1.5 Correlation Distances (G. J. Székely, Rizzo, and Bakirov 2007)

The correlation Distances characterizes independence between vectors of arbitrary finite dimensions. Recently, in (Gábor J. Székely and Rizzo 2013), a bias-corrected version is considered and a test of independence developed.

data(tecator)
xx <- tecator$absorp
xx.d2 <- fdata.deriv(xx,nderiv=2)
dcor.xy(xx,xx.d2)

## 
##  dcor t-test of independence
## 
## data:  D1 and D2
## T = 35.933, df = 22789, p-value < 2.2e-16
## sample estimates:
## Bias corrected dcor 
##           0.2315581

d1 <- metric.lp(xx)
d2 <- metric.lp(xx.d2)
bcdcor.dist(d1,d2)

## [1] 0.2315581

Fat <- tecator$y[,"Fat",drop=FALSE]
d3 <- metric.dist(Fat)
bcdcor.dist(d1,d3)

## [1] 0.1963254

bcdcor.dist(d2,d3)

## [1] 0.9147076

# Functional / Factor
Fat.cut <- as.matrix(cut(Fat[,1],labels=1:4,breaks=4))
dcor.xy(Fat.cut,xx.d2)

## 
##  dcor t-test of independence
## 
## data:  D1 and D2
## T = 252.57, df = 22789, p-value < 2.2e-16
## sample estimates:
## Bias corrected dcor 
##           0.8583661

1.5.1 Depth for functional data

Depth (in univariate or multivariate context) is a statistical tool that provides a center-outward ordering of data points. This ordering can be employed to define location measures (and by oposition outliers). Also, some measures can be defined as maximizers of a particular depth.

Mahalanobis Depth (Mean) mdepth.MhD
Halfspace Depth, also known as Tukey Depth (Median) mdepth.HS
Convex Hull Peeling Depth (Mode)
Oja Depth
Simplicial Depth mdepth.SD
Likelihood Depth (Mode) mdepth.LD

y12 <- tecator$y[,1:2]
dep <- mdepth.LD(x=y12,scale=T)$dep
plot(y12,col=grey(1-dep))
points(y12[which.max(dep),],col=4,lwd=3)

Many of the preceding depth functions defined in a multivariate contex cannot be applied to functional data. In any case, given a depth for functional data.

The deepest point is a location measure and based on the depth function can have a dierent interpretation: Mean, Median, Mode …
Computing the depths of the whole sample and ordering them (in decreasing order) gives a rank from the most central point (\(x_{[1]}\)) to the most outlying one (\(x_{[n]}\)). So, those points with larger rank are the candidates to be outliers w.r.t. the data cloud (in the sense of the depth function).
Summarizing the \((1-\alpha)\) deepest points could lead also to a robust location measure.
Fraiman-Muniz Depth (Fraiman and Muniz 2001) depth.FM
Modal Depth (Cuevas, Febrero, and Fraiman 2007) depth.mode
Random Projection Depth (Cuevas, Febrero, and Fraiman 2007) depth.RP
Random Tukey Depth (J. Cuesta-Albertos and Nieto-Reyes 2008) depth.RT
Band Depth (López-Pintado and Romo 2009) depth.MBD (private function)

x <- tecator$absorp
depth.mode(x,draw=TRUE)

1.5.2 Depth (and distances) for multivariate functional data (J. A. Cuesta-Albertos, Febrero-Bande, and Oviedo de la Fuente 2017)

Modify the procedure to incorporate the extended information.

Fraiman–Muniz: Compute a multivariate depth marginally. depth.FMp
Modal depth: Use a new distance between data (for derivatives, for example, the Sobolev metric). depth.FMp
Random Projection: Consider a multivariate depth to be applied to the dierent projections (also the random projection method could be applied twice). depth.RPp

x <- tecator$absorp
x.d1 <- fdata.deriv(x)
depth.modep(list(x=x,x.d1=x.d1),draw=T)

Example poblenou dataset

Hourly levels of nitrogen oxides in Poblenou (Barcelona). This dataset has 127 daily records (2005/01/06-2005/06/26).

Objective: Explain the diferences in NOx levels as a function of day.

data(poblenou)
dayw = ifelse(poblenou$df$day.week == 7 | poblenou$df$day.festive ==1, 3, ifelse(poblenou$df$day.week == 6, 2, 1))
plot(poblenou$nox,col=dayw)

The \(\mathcal{L}_2\) space (distance is area between curves) seems appropriate although other possibilities could be take into account (\(\mathcal{L}_1\), for example).

fmd = depth.FM(poblenou$nox,draw=T)

md =  depth.mode(poblenou$nox)
rpd = depth.RP(poblenou$nox, nproj = 50)
rtd = depth.RT(poblenou$nox)
print(cur <- c(fmd$lmed, md$lmed, rpd$lmed, rtd$lmed))

## 2005-05-06 2005-05-06 2005-05-06 2005-05-11 
##         63         63         63         67

plot(poblenou$nox,col="grey")
lines(poblenou$nox[cur], lwd = 2, lty = 1:4, col = 1:4)
legend("topleft", c("FMD", "MD", "RPD", "RTD"), lwd = 2, lty = 1:4,col = 1:4)

1.5.3 Outliers detection

There is no general accepted definition of outliers in Functional data so, we define outlier as a datum generated from a dierent process than the rest of the sample with the following characteristics Its number in the sample is unknown but probably low.

An outlier will have low depth and it will be an outlier in the sense of the depth used.

md1 = depth.mode(lab <- poblenou$nox[dayw == 1]) #Labour days
md2 = depth.mode(sat <- poblenou$nox[dayw == 2]) #Saturdays
md3 = depth.mode(sun <- poblenou$nox[dayw == 3]) #Sundays/Festive
rbind(poblenou$df[dayw == 1, ][which.min(md1$dep), ], poblenou$df[dayw ==
2, ][which.min(md2$dep), ], poblenou$df[dayw == 3, ][which.min(md3$dep),])

##          date day.week day.festive
## 22 2005-03-18        5           0
## 57 2005-04-30        6           0
## 58 2005-05-01        7           0

plot(poblenou$nox,col="grey")
lines(poblenou$nox[c(22,57,58),], lwd = 2, lty = 2:4, col = 2:4)

# Method for detecting outliers, see Febrero-Bande, 2008
#out1 <- outliers.depth.trim(poblenou$nox[dayw == 1],nb=100)$outliers

Two procedures for detecting outliers are implemented in the package. (Febrero-Bande, Galeano, and González-Manteiga 2008)

Detecting outliers based on trimming: outliers.depth.trim()
Detecting outliers based on weighting: outliers.depth.pond()

# Method for detecting outliers, see Febrero-Bande, 2008
#out1 <- outliers.depth.trim(poblenou$nox[dayw == 1],nb=100)$outliers
#out2 <- outliers.depth.pond(poblenou$nox[dayw == 1],nb=100)$outliers

References

Cardot, Hervé, Frédéric Ferraty, and Pascal Sarda. 1999. “Functional Linear Model.” Statist. Probab. Lett. 45 (1): 11–22.

Cuesta-Albertos, Juan A, Manuel Febrero-Bande, and Manuel Oviedo de la Fuente. 2017. Test 26 (1): 119–42.

Cuesta-Albertos, Juan, and Alicia Nieto-Reyes. 2008. “The Random Tukey Depth.” Computational Statistics and Data Analysis 52 (11): 4979–88.

Cuevas, Antonio, Manuel Febrero, and Ricardo Fraiman. 2007. “Robust Estimation and Classification for Functional Data via Projection-Based Depth Notions.” Comput. Statist. 22 (3): 481–96. http://link.springer.com/article/10.1007/s00180-007-0053-0.

Febrero-Bande, Manuel, Pedro Galeano, and Wenceslao González-Manteiga. 2008. “Outlier Detection in Functional Data by Depth Measures, with Application to Identify Abnormal \({\rm NO}_x\) Levels.” Environmetrics 19 (4): 331–45.

———. 2006. Nonparametric Functional Data Analysis. Springer Series in Statistics. New York: Springer-Verlag.

Fraiman, Ricardo, and Graciela Muniz. 2001. “Trimmed Means for Functional Data.” Test 10 (2): 419–40.

Krämer, Nicole, and Masashi Sugiyama. 2011. “The Degrees of Freedom of Partial Least Squares Regression.” Journal of the American Statistical Association 106 (494): 697–705.

López-Pintado, Satan, and Juan Romo. 2009. “On the Concept of Depth for Functional Data.” Journal of the American Statistical Association 104: 718–34.

Muller, Hans-Georg et al. 2008. “Functional Modeling of Longitudinal Data.” In Longitudinal Data Analysis, 237–66. Chapman; Hall/CRC.

Ramsay, J. O., and B. W. Silverman. 2005a. Functional Data Analysis. Springer.

Székely, G. J., M. L. Rizzo, and N. K. Bakirov. 2007. “Measuring and Testing Dependence by Correlation of Distances.” The Annals of Statistics 35 (6): 2769–94. http://projecteuclid.org/euclid.aos/1201012979.

Székely, Gábor J, and Maria L Rizzo. 2013. “The Distance Correlation \(t\)-Test of Independence in High Dimension.” J. Multivariate Anal. 117: 193–213.