Analysis of large and complex data sets
This broad area considers methods for analysing large data sets with complex structure. Often the data is collected over time and/or space, and the structure in the data is due to correlations between data collected at similar times/positions.
Data of this type is collected in many modern-day applications. Examples include image data, medical data such as fMRI or ECG, and financial time series.
Work in this area is carried out by Idris Eckley, Paul Fearnhead, Phil Jonathan, Juhyun Park, Joe Whittaker and a range of PhD students. The group has links with Shell Research and Unilever Research.
Statistical Signal Processing
This area covers the analysis of signals observed in time or space, for example analysing speech data or processing noisy observations of images. One focus of research is on methods for detecting abrupt changes in these signals, with applications to oil drilling and fault detection. A second focus is on the use of multiscale methods, which permit inferences to be made about the signal at both large and fine scales simultaneously. These methods have been applied, for example, to the analysis of time series and textured images.
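As a rough illustration of what detecting an abrupt change can mean in practice, the sketch below estimates a single change in the mean of a noisy signal by choosing the split that minimises the residual sum of squares. It is a deliberately simple offline method, not the online multiple-changepoint approach of the publications listed on this page; the function name single_mean_changepoint and the simulated signal are assumptions made purely for illustration.

```python
import numpy as np


def single_mean_changepoint(y):
    """Least-squares estimate of one change in mean (illustrative sketch).

    Returns (tau, gain): y[:tau] and y[tau:] are the two fitted segments,
    and gain is the reduction in residual sum of squares relative to
    fitting a single constant mean.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    if n < 2:
        raise ValueError("need at least two observations")
    csum = np.cumsum(y)
    csum2 = np.cumsum(y ** 2)
    # Residual sum of squares with no change at all.
    total_cost = csum2[-1] - csum[-1] ** 2 / n
    # k = number of observations in the left-hand segment.
    k = np.arange(1, n)
    left = csum2[k - 1] - csum[k - 1] ** 2 / k
    right = (csum2[-1] - csum2[k - 1]) - (csum[-1] - csum[k - 1]) ** 2 / (n - k)
    cost = left + right
    best = int(np.argmin(cost))
    return int(k[best]), float(total_cost - cost[best])


# Simulated example (assumed data): the mean jumps from 0 to 2 at index 100.
rng = np.random.default_rng(0)
signal = np.concatenate([np.zeros(100), 2.0 * np.ones(100)])
y = signal + rng.normal(scale=1.0, size=signal.size)
tau, gain = single_mean_changepoint(y)
print(f"estimated changepoint at index {tau}, RSS reduction {gain:.1f}")
```

In a real analysis the reduction in residual sum of squares would be compared against a penalty or threshold before declaring a change, and multiple changes would need a search over segmentations.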
Functional Data Analysis
In many applications it is becoming increasingly common to observe large data sets of high-dimensional observations. Often the data can be represented as curves, that is, as discretised values of smooth random functions. Examples include growth curves in biomedicine, EEG brain potentials and implied volatilities in econometrics. The underlying structure of such data can be understood and characterised using functional data analysis techniques. Many multivariate techniques, such as principal component analysis (PCA), regression and classification, have been studied and extended to functional data, but there still remain many issues to be resolved.
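To make the idea of extending PCA to curves concrete, the following minimal sketch treats each curve as a vector of values on a common, equally spaced grid and extracts the main modes of variation with a singular value decomposition. This is only one simple discretised approximation: it assumes a shared grid, ignores quadrature weights and applies no smoothing. The function name functional_pca and the simulated curves are hypothetical.

```python
import numpy as np


def functional_pca(curves, n_components=2):
    """PCA of curves sampled on a common, equally spaced grid (sketch).

    curves: array of shape (n_curves, n_grid_points).
    Returns the mean curve, the leading component curves (rows),
    the per-curve scores and the proportion of variance explained.
    """
    X = np.asarray(curves, dtype=float)
    mean_curve = X.mean(axis=0)
    centred = X - mean_curve
    # Rows of Vt are orthonormal directions of variation across curves.
    _, s, Vt = np.linalg.svd(centred, full_matrices=False)
    components = Vt[:n_components]
    scores = centred @ components.T
    explained = s ** 2 / np.sum(s ** 2)
    return mean_curve, components, scores, explained[:n_components]


# Simulated curves (assumed data): a common shape plus two random modes.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)
a = rng.normal(scale=1.0, size=40)
b = rng.normal(scale=0.3, size=40)
curves = (np.sin(2 * np.pi * t)
          + a[:, None] * t
          + b[:, None] * np.cos(2 * np.pi * t)
          + rng.normal(scale=0.05, size=(40, t.size)))
mean_curve, components, scores, explained = functional_pca(curves)
print("proportion of variance explained:", np.round(explained, 3))
```

Each row of components can be plotted against t and read as a curve describing how individual observations deviate from the mean curve.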
Data Mining
Many studies in the social, biological and medical sciences, and economic activities in business, industry and government, result in the collection of substantial amounts of data. These data sets vary in their nature and complexity: they may be one-off or repeated, and they may be hierarchical, spatial or temporal. Their common thread is that they are large.
The very size of these sets has promoted new ways of analysis, often subsumed under the heading of data mining. A traditional strategy for analysis takes summary measures and looks for trends, patterns and anomalies. However, such findings may just capitalise on chance.
At Lancaster we are concerned to go further and model the data, in order to propose sensible ways of deciding whether discoveries in the data reflect random fluctuation or real variation. Many techniques developed within this framework of statistical model building and inference are generalisations of regression and classification. The inference uses frequentist, likelihood and Bayesian paradigms, depending on the study, and may well utilise shrinkage and other forms of regularisation. The statistical models are inherently multivariate and are subject to the acid test of predicting future outcomes and behaviour.
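As a small illustration of shrinkage combined with the test of predicting future outcomes, the sketch below fits a ridge regression by its closed-form solution and scores it on held-out data. The simulated data, the penalty values and the helper names ridge_fit and mse are assumptions chosen for illustration, and the example omits an intercept because the simulated variables have mean zero by construction.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge coefficients (X'X + lam * I)^{-1} X'y, with no intercept (sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def mse(X, y, beta):
    """Mean squared prediction error of the linear predictor X @ beta."""
    return float(np.mean((y - X @ beta) ** 2))


# Simulated data (assumed): only 3 of the 10 predictors carry signal.
rng = np.random.default_rng(0)
n_train, n_test, p = 60, 200, 10
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))
y_train = X_train @ beta_true + rng.normal(size=n_train)
y_test = X_test @ beta_true + rng.normal(size=n_test)

# Compare in-sample fit with out-of-sample prediction for several penalties.
for lam in (0.0, 1.0, 10.0, 100.0):
    beta = ridge_fit(X_train, y_train, lam)
    print(f"lambda={lam:6.1f}  train MSE={mse(X_train, y_train, beta):.3f}  "
          f"test MSE={mse(X_test, y_test, beta):.3f}")
```

The training error can only increase as the penalty grows, so the penalty has to be judged on data not used for fitting; in practice it would usually be chosen by cross-validation, not from a fixed list.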
Recent Publications
- Fearnhead, P. and Liu, Z. (2007). Online Inference for Multiple Changepoint Problems. Journal of the Royal Statistical Society, Series B, 69, 589-605.
- Eckley, I. A. and Nason, G. P. (2005). Efficient computation of the discrete autocorrelation wavelet inner product matrix. Statistics and Computing, 15, 83-92.
- Whittaker, J., Whitehead, C. and Somers, M. (2005). The neglog transformation and quantile regression for the analysis of a large credit scoring database. Journal of the Royal Statistical Society, Series C (Applied Statistics), 54(5), 863-878.