A Program to Process Strainmeter Data
by J. Langbein, USGS
Introduction
I have written a set of software to process strainmeter data called cleanstrain+; once installed, it is easy to run with a one-line command (with the usual UNIX style arguments). It runs on both Linux and MacIntosh (Intel type is probably better) computers. The main script, cleanstrain+, calls other UNIX scripts and Fortran programs.
This code will simultaneously estimate the tidal constituents, pressure admittance, offsets, rate changes, and other terms using least-squares but, importantly, incorporating the temporally correlated nature of strain data (that is, the power spectra is red). I am aware of two alternative methods to analyze time-series data from strainmeters; one being baytap, and the other being the application of Kalman filter techniques. Although Kalman filter techniques are available in the literature, I'm not aware of any written as an easy application to strainmeter data. Although baytap is rigorous in its approach (ie, it assumes that the strainmeter data have a background noise process that is a combination of random walk and white noise), one needs to pre-clean the data by estimating and removing offsets and other transients and removing spurious observations which all might bias the results from baytap. Much subjectivity is used to pre-clean the data; the subjectivity is minimized using cleanstrain+.
This code will not automatically fix outliers and offsets. But, it will help identify those problems in the data and, once identified (with a time), it will then fix these problems. The cleanstrain package also includes a script called glitch+ which will also help identify the times of problems in the data.
The software can be downloaded where one can get the latest version (5 Dec 2008 is 1.12) and a README file. A UNIX style manual is also available.
This note provides a general introduction to the use of the script, cleanstrain+, and provides some motivation of why it works better than standard, weighted least-squares. Finally, I summarize what is required to install this program.
An Example
The user needs to supply:
- Strainmeter data (time series)
- Atmospheric pressure data (but not required)
- Auxiliary time-series data (but not required) such as the first derivative of pressure or water level changes measured at a nearby tide gauge
The format of these data can be USGS bottles (which are binary data with a header) or ASCII data with each record specifying the time of the observation and its value.
Additional information will help with improving the processing of the strain data. This includes:
- List of times of "edits" or bad/spurious data
- List of times of offsets in the data
- List of times of different trends such as rate changes and exponential functions
Note that the amplitudes or sizes of the offsets, exponentials and rate change are NOT specified; this program will provide those estimates.
On output, the script provides:
- Pressure sensitivity (and its standard error)
- Size of the offsets (and their standard error)
- Tidal amplitudes (real and imaginary) (and their standard error)
- Sensitivity to auxiliary data
- Noise spectrum (white, power law amplitude and index)
- Estimate of amplitudes of specified functions and their time constants
- Various data plots
- baytap analysis if requested
For example, one might want to analyze strainmeter data in the bottle file mc01_1996.bot with the pressure data mc01_1996.bot over a 65 day long interval from late 1995 through early 1996. Although the sampling interval of these data is 10 minutes, this needs not to be specified as an input. To run:
cleanstrain+ -s mc01_1996.bot -p mc05_1996.bot -f bot -u cnts -O mc.off -E mc.edit -b 1995j3630000 -e 1996j0620000 -m 2 -V X97062 -B
A listing of the edits file, mc.edit is:
- T 199512281825 199512281845 spike
- N 199603120000 199503150000 bad pressure readings
- N 199603121350 199603150030 ditto
- N 199601222255 199601222345 spike
- T 199602170645 199602170725 teleseismic?
- T 199602250325 199602250345 spike?
A listing of the offset file, mc.off is:
- N 199507291345 valve reset
- N 199601191735 valve reset
- N 199602220645 valve reset
Each record in the offsets and edits starts with either a N or T which specifies their cause and allows the program to deal with each appropriately. More later, but the N identifies times of instrumental problems (ie, the date should be delete/adjusted for all analysis) and T identified times where signals of either tectonic or of unknown origin for which the data should be deleted/adjusted to estimate the tides and pressure admittance, but should be left in the data untouched for further analysis and interpretation. In addition, the program accepts two different formats of time; year -- month -- day, and year -- day of year.
On output is processed strain data in the same format as the input:
- mc01_1996_cl_X97062.bot is the clean data ie, non tectonic offsets and outliers removed
- mc01_1996_pr_cl_X97062.bot is the clean data, pressure adjusted, and ie, non tectonic offsets and outliers removed
- mc01_1996_pr_td_cl_X97062.bot is the clean data ie, non tectonic offsets and outliers removed; pressure and tides removed
- mc01_1996_pr_X97062.bot is the data with all outliers and pressure response removed
- mc01_1996_pr_td_X97062.bot is the data with all outliers, pressure response and tides removed
- mc01_1996_pr_td_off_X97062.bot is the data with all outliers, pressure response,tides and all offsets removed
In addition, there are plots of the data in:
- pl_mc01_1996_X97062.ai shows the progression of data
- pl_out_mc01_1996_X97062.ai shows the outliers and the data used for PSD
- psd_mc01_1996_X97062.ai shows the power spectra and periodogram
- bay_mc01_1996_X97062.ai shows the results of baytap (optional)
Finally, the analysis of the data is summarized a text file:
- mc01_1996_X97062_results
The plot file showing outliers and processed data:
The plot showing the Power spectrum (PSD) and noise model of the processed data:
Before discussing why this program works, tagging the spurious data with N or T needs to be discussed.
Instrumental: Labeled with N
- Signify signals that are clearly instrumental in origin
- Dilatometers; valve resets
- Telemetry glitches
- Offsets or spikes from field visits
Other: Labeled with T
- Signify signals that are clearly tectonic in origin
- Coseismic offsets
- Teleseismisms (aliased or otherwise)
- Other signals the could affect estimates of tidal and pressure admittance such as response to rainfall(the gray area....)
Why it works
This code works using least-squares but using a data covariance matrix that consistent with the power-law noise character of strainmeter data. The examples will show the difference between incorporation of the proper covariance matrix and what I call, standard, weighted, least-squares.
Least squares solves the problem of:
y = A x
where y is the data vector, A is the design matrix and x is the vector of model parameters. For fitting a linear trend to data, the A will have two columns; the first being 1s and the second being the time of each observation. If there is an offset in the data at time to, then there would be a third column with 0s in the slots preceding time
t _{o} and 1s for time >= t _{o}.
In addition, least-squares allows for data error which is specified by the data covariance matrix, C. In standard, weighted least squares, the data covariance is a diagonal matrix with each elements specifying the variance of each observation (standard error squared). It is assumed that the data are independent and the equivalent power spectra is flat (or frequency independent). Thus, the estimate of the model parameters in a least-squares sense is:
x = ( A ^{t} C ^{-1} A ) ^{-1} A ^{t} C ^{-1} d
or
x = H d
where A ^{t} is the transpose of A and the index -1 represents the matrix inverse.
However, data independence is not satisfied by the strain data; examination of it power spectra reveals a strong power law relation (see plots above) plus white noise;
P(f) = P _{o} / f ^{n} + P _{w}
For random-walk noise, which is appropriate for most strainmeter data, the index n is equal to 2. The white noise part, P _{w}, is often proportional to the least count noise of the telemetry. Using the results summarized by Langbein (2004) (and other references, too), it is straight forward to convert a power-law function of a PSD to a data covariance matrix, C. For example, a data covariance representing a random walk will have the following structure
- 0 0 0 0 0 0 ...
- 0 1 1 1 1 1 ...
- 0 1 2 2 2 2 ...
- 0 1 2 3 3 3 ...
- 0 1 2 3 4 4 ...
- 0 1 2 3 4 5 ...
- etc
The simple examples below will show the affect upon the estimate of model parameters x using the white noise assumption (diagonal data covariance), the random-walk assumption, and a mix of random-walk and white noise. The inverse of the data covariance essentially filters both the data and the design matrix, A such that the filtered data becomes white noise. For random-walk, the inverse of the covariance matrix is the same as that derived from a differentiator.
In the examples below, I will show how the unfiltered data, d, is re-weighted to estimate a particular element of the vector of model parameters, x. Recall the least-square estimate of x is H d. For the i^{th} parameter of vector x, one uses the i ^{th} row of H to multiply with the data d. Or,
x _{i} = H _{i} d
where H _{i} is the i ^{th} row of H. Below, I show the values of H _{i} estimating the average value, the rate, an offset, and a sinusoid using three different data covariances; purely white noise, purely random-walk noise, and a mix of random walk and white noise. In the plots that follow, the top panel represents the weighting of the data using white noise (a diagonal covariance matrix), the middle panel is the weighting of the data using random-walk noise, and the bottom panel is the weighting using a mix of random walk and white noise.
With a diagonal covariance matrix (all observations are weighted equally), the estimate of the average value is as expected; a summation of all of the data divided by the number of observation. However, for random-walk noise, the concept of "average" is different; the estimated value is the first observation.
With the diagonal covariance matrix, the estimate of the rate conforms to our expectation from standard, weighted least-squares; all of the data influence the estimate of rate, but the observations at the ends of the time series carry more weight. However, for random-walk noise, the estimate of rate only requires two values, the first and last observation; the remaining n-2 observations don't count. Think of this as your rate of return from buying and selling stocks; the only thing that counts is the price when you bought and the price when you sold. For a mix of random walk and white noise, more of the data at the ends of the time-series are used to estimate the rate, but the data in the middle have essentially no influence.
Likewise, for estimating offsets, using the diagonal covariance matrix results in averaging the data before and after the offset. With a covariance matrix the represent random-walk noise, the offset is simply the difference of the two observations adjacent to the offset. With a mix of random walk and white noise, the estimate of the offset is a weighted sum of the few observations nearest the time of the offset with those observations nearest the offset having more influence.
Finally, for estimating amplitude of a sinusoid, different covariance matrices have no influence on those estimates. Hence, the application of weighted least squares will get the same answer as that from random-walk model for the size of the sinusoid. However, as we seen in the examples above, if the correct covariance is not used, then the estimate of the other parameters will be wrong.
The cleanstrain+ uses two approaches to obtain a noise model of the data. The best way to estimate the noise model is to use the Maximum Likelihood Estimator (MLE) to juggle the parameters of the noise model and simultaneously fit the data. This can be CPU intensive. I only recommend this approach (set with a flag in cleanstrain+) when there are large gaps in the data. Instead, when there are no significant gaps in the data, a faster method is to guess a combination of random-walk and white noise, perform the least-squares fit using that noise model, and, estimate the power-spectrum of the residuals to that fit. Then, a power-law function plus white noise, P(f) = P _{o} / f ^{n} + P _{w} is fit to the PSD. The values of the index n, P _{o}, and P _{w} are used to construct an updated covariance matrix using the method of Langbein (2004) (and other references, too). Last, least-squares is used to fit the data using the new data covariance. Either way, cleanstrain+ finds the optimal data covariance and the best fitting model that represents the observations.
Comments on Installation
I have successfully installed this software on my LINUX-box and several different Intel based, Macs. The development was done on my LINUX-box but tested on the Macs. To install the software after downloading the code and untarring the file, read the README file. For LINUX-boxes (Intel style CPUs), download the file identified with Linux. This tar file consists on not only the Fortran source codes, but compiled executables with all of the libraries statically linked. The README file will instruct you to read the Installation.pdf file. Addition stuff is needed:
- A Fortran compiler if you are using a Mac. I've used gfortan on the Macs. For LINUX, I've only had experience using the Intel's Fortran compiler (version 7.1) which is fairly old.
- You'll need the LAPACK/BLAS libraries which are used by my Fortran program est_noise6ac. The compiled version for LINUX comes with these libraries statically linked so that one does not need to get these libraries. However, if one choses to compile, the libraries are often packaged with commercial compilers (e.g. Intel packages them with MKL, I'm sure the Portland/Absoft has a similar set) or one can download either the Atlas distribution or download the netlib distribution. For Macs, this is painless as LAPACK/BLAS is included in the software developer's kit (SDK) which is found on the Mac OSX installation CD.
- In addition, you'll need to download the current (greater than 4.0) version of GMT plotting software. Older versions don't include the features needed to plot time axis in sensible units.
- I'm currently running these programs both on a 3.0 GHz LINUX box (circa 2003 - single core) and the newer Intel Macs including the bottom of the line MacBook laptop. (This is not a product endorsement). With any current LINUX-PC or Mac, this code should run OK; there are no heavy memory requirements; the faster the CPU, the better. On the Mac, the compilation using gfortran seems to spread the CPU load over both cores; thus a nominal 2.0 GHz dual-core Mac seems to run est_noise6ac about 30% faster than my 3.0 GHz LINUX box.
- The program baytap is nice to have, but not required.
- If you are working with USGS bottled-style data and don't have the software, the Linux version is included in the link to the cleanstrain code. There is a version also for Macs. The Mac codes are compiled for the PPC version of the Macs but run OK on the Intel-Macs; the Intel versions have not been completely debugged.
- Finally, you should have a license to use the Numerical Recipes of Press et al. [2007]. est_noise6ac uses some of the subroutines.
John Langbein US Geological Survey, Menlo Park (langbein at usgs.gov)
December 2007.