data.avg

data.avg is primarily used to average data, but it can also fill data (so that no records are missing), convert data to rectangular form (a single record type), and split variables by cut size. The result is written to standard output.

The normal mode of operation is to average on a one hour interval with no filling, with variables split on cut size and standard deviations generated for them. Averaging bins are aligned on natural time boundaries where possible (e.g. one hour average bins start on the hour). Normally data are combined into one record type when not re-averaging, but this can be controlled with the combine switch.

It can operate either by requesting data from data.get itself or by operating on its standard input (as part of a pipe). By default it attempts to detect whether its input has already been run through data.avg (i.e. it is averaging already averaged data to a longer time base); this detection can be overridden with the reavg switch.

It uses contaminate.conf and ignorecut.conf to determine the default contamination handling and cut size splitting.

Command Line Usage

data.avg [--interval=<time>] [--stddev=on/off | --nostddev] 
         [--cut=on/off | --nocut] [--cutflags] 
         [--ignorecut=pattern1,pattern2,...]
         [--count=on/off] [--source=archive] [--fill] [--reavg]
         [--picklast=pattern1,pattern2,...]
         [--noalign] [--[no]combine]
         [--contam] [--contam-match=pattern1,pattern2,...]
         [--require-uncontaminated=N]
         [--decimal-format=FORMAT,MVC]
         [--include=RANGES]
         [--cpx] [--nocache]
         [station rec start end [source]]

Arguments

If the station, record, start and end are omitted, data.avg works on standard input instead of requesting data. Specifying a toggle argument without an “on/off” code enables that toggle. That is, “--count” turns on count generation.

start and end

The time specifiers for the data to be retrieved. Start is inclusive while end is exclusive, so all data contained within the half open interval [start,end) will be returned. Any convertible time format is accepted.

station

The station identifier code. For example 'brw'. Case insensitive.

records

The cpd2 record type to be retrieved. For example: 'S11a'. Case sensitive. Multiple record types may be separated by “,”, “;” or “:”. Note that this is a single argument and that spaces are not allowed.

--interval=<time>

The averaging interval. This may be 0 (or negative) to perform no averaging, an exact amount of time in seconds, an integer and a multiplier (one of “d” for days, “h” for hours, “m” for minutes or “s” for seconds, e.g. “15m” for 15 minutes), or a special time range. Special time ranges imply alignment to the relevant boundary. The following are valid:

  • year - Yearly averages
  • quarter - Standard quarter averages
  • month or m - Monthly averages
  • week - Weekly averages
  • day or d - Daily averages
  • hour or h - Hourly averages
  • minute - One minute averages
  • forever, infinity, inf, all, or always - Entire input data range (not aligned)

If a special time range is not given and alignment is enabled (that is, the noalign flag is not used), the alignment depends on the interval: if it is divisible by one day, bins are aligned to the day boundary; if it is divisible by one hour, to the hour boundary; if it is divisible by one minute, to the minute boundary; otherwise the bins are not aligned. Special time ranges are always aligned (except “forever”).
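The divisibility rule for plain numeric intervals can be sketched as follows. This is only an illustration of the rule as described here, not code from data.avg: the function names are hypothetical, times are assumed to be UTC epoch seconds, and flooring the start time down to the chosen boundary is an assumption about how the alignment is applied.

```python
from typing import Optional

DAY, HOUR, MINUTE = 86400, 3600, 60

def alignment_unit(interval_s: int) -> Optional[int]:
    """Return the boundary (in seconds) a numeric interval aligns to, or None."""
    if interval_s <= 0:
        return None  # 0 or negative means no averaging, so nothing to align
    for unit in (DAY, HOUR, MINUTE):  # largest boundary first
        if interval_s % unit == 0:
            return unit
    return None  # not divisible by a minute: bins are not aligned

def first_bin_start(start_epoch: int, interval_s: int, align: bool = True) -> int:
    """Assumed behavior: floor the start time to the alignment boundary."""
    unit = alignment_unit(interval_s) if align else None
    if unit is None:
        return start_epoch
    return start_epoch - start_epoch % unit
```

Under these assumptions a 7200 second interval aligns to the hour, a 900 second interval to the minute, and a 45 second interval is left unaligned.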

--stddev=on/off --nostddev

Enable or disable (default enabled) standard deviation calculation for each field. For each averaged field, a separate field containing its standard deviation for that record is generated.

--cut=on/off --nocut

Enable or disable (default enabled) cut size splitting. If cut size splitting is enabled, all fields will be separated into 0 (default/coarse/10um) and 1 (flagged/fine/1um) variants based on the cut size flag for that record.

--cutflags

Split flags based on the cut size (default off).

--ignorecut=pattern1,pattern2,...

Set patterns to ignore the cut split on. If a variable matches the regular expression defined by the pattern anchored as “^pattern$”, then cut size splitting is not done on that variable.
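The effect of the “^pattern$” anchoring is that a pattern must match the whole variable name, not just part of it. A minimal sketch, using Python's re module as a stand-in for the tool's own regular expression engine (simple patterns behave the same way); the function name and the variable names are made up for illustration:

```python
import re

def matches_any(variable: str, patterns: str) -> bool:
    """True if the variable matches any comma-separated pattern, anchored as ^pattern$."""
    # Note: splitting on "," means a pattern cannot itself contain a comma.
    return any(re.search("^" + p + "$", variable) for p in patterns.split(","))

print(matches_any("BsG_S11", "Bs.*,T_S11"))   # True: "Bs.*" covers the whole name
print(matches_any("XBsG_S11", "Bs.*"))        # False: the anchor rejects a partial match
```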

--count=on/off

Enable or disable (default disable) generating a count field for each variable (after cut splitting). The count field is the variable name with an “N” added to the end.

--source=archive source

Selects the source archive to request data from, defaulting to clean. Has no effect when working on standard input.

--fill

Enable filling of missing time ranges. This enables filling with MVCs for time ranges that are missing entirely from the source archive. Alignment and normal binning are preserved. That is, a gap of two hours in length when generating one hour averages would result in two records of all MVCs. When not combining records, filling will also cause record types that are missing for a given bin to be output.
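The two hour gap case can be illustrated with a small sketch. It assumes bins keyed by their start time in seconds and uses None as a stand-in for an MVC; the function name and data layout are hypothetical and say nothing about how data.avg stores records internally.

```python
MVC = None  # stand-in for a missing value code

def fill_bins(bins: dict, start: int, end: int, interval: int) -> list:
    """Return (bin_start, value) for every bin in [start, end), using MVC where no data exists."""
    return [(t, bins.get(t, MVC)) for t in range(start, end, interval)]

# Hourly bins with data at hours 0 and 3 only: hours 1 and 2 form a two hour gap.
hourly = {0: 1.0, 10800: 2.0}
print(fill_bins(hourly, 0, 14400, 3600))
# [(0, 1.0), (3600, None), (7200, None), (10800, 2.0)]
```

The gap produces exactly two filled records, and the existing bins keep their original start times.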

--reavg --reavg=off

By default data.avg will inspect its input and if it appears to have already been averaged it will enter reaveraging mode. This mode is designed to properly handle variable naming and related issues when operating to lengthen the time base of already averaged data. The mode can be forced to be enabled or disabled with this switch.

--combine --nocombine --combine=off

Enables or disables record combination (conversion to rectangular data). Enabled by default when not re-averaging and disabled when re-averaging. If enabled, only a single record type is output, containing the averages of all input variables. If disabled, the output records have the same types as the input records but contain averages (and calculated values). In this mode a record will not be output for a given averaging bin unless the base input record existed in that bin or the fill switch is used.

--picklast=pattern1,pattern2,...

Set patterns to only pick the latest value instead of averaging. If a variable matches the regular expression defined by the pattern anchored as “^pattern$”, then only its last value is output instead of any averaging being done.
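As a rough sketch of the difference between the two behaviors, the function below returns the last value in a bin for variables that match a picklast pattern and the mean otherwise. The function and variable names are invented for illustration, and re.fullmatch is used as the Python equivalent of anchoring the pattern as “^pattern$”.

```python
import re
from statistics import mean

def reduce_bin(name: str, values: list, picklast_patterns: list) -> float:
    """Pick the last value for matching variables; average everything else."""
    if any(re.fullmatch(p, name) for p in picklast_patterns):
        return values[-1]
    return mean(values)

print(reduce_bin("F1_S11", [1.0, 2.0, 6.0], ["F1_.*"]))    # 6.0 (last value)
print(reduce_bin("BsG_S11", [1.0, 2.0, 6.0], ["F1_.*"]))   # 3.0 (mean)
```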

--noalign

Disable bin alignment as described above.

--contam

Enable averaging of contaminated data. Normally contaminated data are excluded (not averaged). The variables that are excluded while contaminated are set either with the contam-match switch or loaded from contaminate.conf.

--contam-match=pattern1,pattern2,...

List of Perl regular expressions that define all variables that are affected by contamination. Any variable that matches one of these patterns will not be averaged when it is contaminated (assuming --contam has not been set). If this option is set then contaminate.conf is not used; instead these patterns are applied for all time.

--require-uncontaminated=N

Set the fraction of data required to be uncontaminated before any average is emitted. That is, if the fraction of the non-MVC input data that is uncontaminated is less than N, then the output is an MVC. Defaults to 0.25 (require 25% of data to be uncontaminated).
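One reading of this threshold is sketched below: MVC values are ignored, the uncontaminated share of the remaining values is compared with N, and only the uncontaminated values are averaged. This is an interpretation of the description, not the tool's actual code; the function name, the (value, contaminated) pair layout, and the use of None for an MVC are all assumptions.

```python
from statistics import mean

MVC = None  # stand-in for a missing value code

def average_bin(samples, required=0.25):
    """samples: list of (value, contaminated) pairs; value is None for an MVC."""
    valid = [(v, c) for v, c in samples if v is not MVC]
    clean = [v for v, c in valid if not c]
    # Too little uncontaminated data (or no data at all): emit an MVC.
    if not valid or len(clean) / len(valid) < required:
        return MVC
    return mean(clean)

# 1 of 3 valid values is clean (33%), which meets the 25% default.
print(average_bin([(1.0, False), (3.0, True), (None, False), (5.0, True)]))  # 1.0
# 1 of 5 valid values is clean (20%), which is below the 25% default.
print(average_bin([(1.0, False)] + [(2.0, True)] * 4))                        # None
```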

--decimal-format=FORMAT,MVC

Set the format and/or MVC of all decimal numbers that are averaged. For example, “--decimal-format=%0+10.3e,+9.999e-99” would cause all decimal floating point numbers to be output in scientific notation instead of their existing decimal format.
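To see what the example format produces, the sketch below applies it with Python's printf-style % operator. It assumes FORMAT is an ordinary printf-style conversion, which the example suggests but the text does not state; the function name and the use of None for a missing value are hypothetical.

```python
def format_value(value, fmt="%0+10.3e", mvc="+9.999e-99"):
    """Format a number with a printf-style conversion, or emit the MVC text if missing."""
    return mvc if value is None else fmt % value

print(format_value(12.5))        # +1.250e+01
print(format_value(-0.001234))   # -1.234e-03
print(format_value(None))        # +9.999e-99
```

Note that the MVC in the example has the same width and layout as a formatted value, so output columns stay aligned.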

--include=RANGES

Specify the range(s), separated by “;” or “,”, for the times that should be included within each average bin. Each range consists of either a single time specifier or a pair of them separated by “-”. A single time specifies the exact time after the bin start to include; for a pair, any times between the start (inclusive) and end (exclusive) are included.

Each time specifier can be formed like “[ [hours:]minutes:]seconds”. That is, it may consist of optional hour and minute specifiers in addition to the number of seconds after the bin start. It may also be an integer and a multiplier (one of “d” for days, “h” for hours, or “m” for minutes), for example “30m” for 30 minutes.

For example, to only include minutes 10-15 and 25-35 in an hour average: --include=10:00-15:00,25:00-35:00 or --include=10m-15m,25m-35m
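The range syntax can be illustrated with a small parser. This is a sketch of the syntax as described here, not the parser data.avg uses: the function names are hypothetical, and it assumes whole seconds and non-negative offsets.

```python
import re

MULTIPLIERS = {"d": 86400, "h": 3600, "m": 60}

def parse_offset(spec: str) -> int:
    """Convert '[[hours:]minutes:]seconds' or '<int><d|h|m>' to seconds after bin start."""
    m = re.fullmatch(r"(\d+)([dhm])", spec)
    if m:
        return int(m.group(1)) * MULTIPLIERS[m.group(2)]
    seconds = 0
    for part in spec.split(":"):  # each colon shifts the running total up by a factor of 60
        seconds = seconds * 60 + int(part)
    return seconds

def parse_include(arg: str) -> list:
    """Return (start, end) offsets; end is None for a single exact time."""
    ranges = []
    for item in re.split(r"[;,]", arg):
        first, sep, second = item.partition("-")
        ranges.append((parse_offset(first), parse_offset(second) if sep else None))
    return ranges

def included(offset: int, ranges: list) -> bool:
    """Start is inclusive, end is exclusive; a single time must match exactly."""
    return any((offset == s) if e is None else (s <= offset < e) for s, e in ranges)

ranges = parse_include("10:00-15:00,25:00-35:00")
print(ranges)                  # [(600, 900), (1500, 2100)]
print(included(600, ranges))   # True: range start is inclusive
print(included(900, ranges))   # False: range end is exclusive
```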

--cpx

Pass --cpx to data.get.

--nocache

Pass --nocache to data.get.

Example Usage

Single record one hour average

data.avg sgp S11a 2008:10 2008:11

Multiple records of raw data in one day average

data.avg --interval=86400 --source=raw bnd S11a,A11a 2003W02 2003W03

Filling data from a pipe to a file

data.get sgp A11a 2008 2009 raw | data.avg --fill > 2008_hourly