Cumsum and Diff Tricks
There are many instances where you want to look for “trigger” events in a table (data frame) that mark the beginning of a sequence of records that need to be grouped together. In some cases the end of such a sequence is defined by the next similar event; in other cases a different kind of “trigger” event marks the end of such a group of records.
For example, we might have records of wind direction and be interested in identifying periods when the direction was generally toward the west. The beginning of such a period would be a record for which the direction is “west”, but the direction in the previous record was something other than “west”. Such a period of westerly wind direction would logically end when the direction again became something other than “west”.
The obvious solution for automating the identification of such groups of records is to write a while or for loop that sequentially examines each record and uses if-else logic to generate a new column in the table that indicates the assignment of each record to an appropriate group. However, in many (but not all) cases such logic can instead be implemented using fast-running vectorized functions by creating a sequence of a few column-wise vectors that lead to generation of such a grouping column using simpler and faster basic operations.
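Before the worked examples, here is a toy sketch (data invented purely for illustration) contrasting the loop approach with the vectorized cumsum approach for numbering periods that begin whenever the direction switches to “W”:
dir <- c( "W", "W", "N", "W", "W", "E", "W" )   # made-up directions
# Loop approach: examine each record and carry a running group counter
grp_loop <- integer( length( dir ) )
g <- 0L
for ( i in seq_along( dir ) ) {
  is_start <- dir[ i ] == "W" && ( i == 1 || dir[ i - 1 ] != "W" )
  if ( is_start ) g <- g + 1L   # a new "W" period begins here
  grp_loop[ i ] <- g
}
# Vectorized approach: a start event is a transition into "W",
# and cumsum of the event flags numbers the groups
is_w <- dir == "W"
start <- is_w & !c( FALSE, head( is_w, -1 ) )
grp_vec <- cumsum( start )
identical( grp_loop, grp_vec )   # TRUE for this toy vector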
(Note that these techniques can be applied in Python as well, using numpy/pandas functions, but examples in Python are not included in this post.)
suppressPackageStartupMessages({
library(dplyr)
library(ggplot2)
library(kableExtra)
})
theme_set(theme_minimal())
The idea
If we have a timeline, or any real number line, we may be able to figure out when certain events happened, and it can be useful to lump each event together with the records that follow it.
dta1 <- read.table( text=
"X Start
1 1
2 0
3 1
4 0
5 0
6 0
7 1
8 0
",header=TRUE,as.is=TRUE)
X | Start |
---|---|
1 | 1 |
2 | 0 |
3 | 1 |
4 | 0 |
5 | 0 |
6 | 0 |
7 | 1 |
8 | 0 |
Here we have some events at X “times” of 1, 3, and 7, but we would like to group each event together with the records that follow it (rows with X = 1–2, 3–6, and 7–8).
The trick to marking these records as belonging together is to use cumsum on the Start events:
dta1$G <- cumsum( dta1$Start )
X | Start | G |
---|---|---|
1 | 1 | 1 |
2 | 0 | 1 |
3 | 1 | 2 |
4 | 0 | 2 |
5 | 0 | 2 |
6 | 0 | 2 |
7 | 1 | 3 |
8 | 0 | 3 |
The G column can be used in grouped calculations to identify each group of records.
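For example (a minimal sketch using the dplyr syntax loaded above; the summary columns are just illustrative), the number of records and the range of X in each group can be computed by grouping on G:
grp_summary <- ( dta1
  %>% group_by( G )
  %>% summarise( n = n()
               , first_X = min( X )
               , last_X = max( X )
               )
  %>% ungroup()
)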
Note that diff and cumsum are useful in this way when the data records are in increasing “time” order. It may be necessary to sort your data before applying these tools to it.
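For example, a minimal sketch of that sorting step in base R (dplyr::arrange( dta1, X ) would be equivalent):
# ensure the records are in increasing X order before using diff/cumsum
dta1 <- dta1[ order( dta1$X ), ]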
Intervals
Now we look at a more involved extension of this idea: collapsing records that redundantly specify overlapping intervals.
dta2 <- read.table( text=
"Label Start End
A 5 8
B 2 6
C 9 11
D 14 17
",header=TRUE)
Label | Start | End |
---|---|---|
A | 5 | 8 |
B | 2 | 6 |
C | 9 | 11 |
D | 14 | 17 |
As we see below, there are parts of the number line that are not identified by these Start/End pairs (red bars), and there are individual intervals shown by the black bars. Intervals A and B overlap, so the goal is to represent them by one black bar.
We start by sorting by the beginning of each interval (Start).
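The sorted data frame is referred to below as dta2b; a minimal sketch of that sorting step:
dta2b <- ( dta2
  %>% arrange( Start )
)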
Label | Start | End |
---|---|---|
B | 2 | 6 |
A | 5 | 8 |
C | 9 | 11 |
D | 14 | 17 |
Next we notice that we can subtract each interval’s End value from the Start value of the following interval to figure out how big the “hole” is preceding each Start value. The hole before the first interval is treated as infinitely large.
dta2c <- ( dta2b
%>% mutate( hole = c( Inf, Start[-1]-End[-n()] ) )
)
Label | Start | End | hole |
---|---|---|---|
B | 2 | 6 | Inf |
A | 5 | 8 | -1 |
C | 9 | 11 | 1 |
D | 14 | 17 | 3 |
Next we identify events: which intervals have holes greater than or equal to zero in front of them?
dta2d <- ( dta2c
%>% mutate( group_start = ( 0 <= hole ) )
)
Label | Start | End | hole | group_start |
---|---|---|---|---|
B | 2 | 6 | Inf | TRUE |
A | 5 | 8 | -1 | FALSE |
C | 9 | 11 | 1 | TRUE |
D | 14 | 17 | 3 | TRUE |
Now that we have events identified, we can use cumsum to identify records that belong together:
dta2e <- ( dta2d
%>% mutate( group = cumsum( group_start ) )
)
Label | Start | End | hole | group_start | group |
---|---|---|---|---|---|
B | 2 | 6 | Inf | TRUE | 1 |
A | 5 | 8 | -1 | FALSE | 1 |
C | 9 | 11 | 1 | TRUE | 2 |
D | 14 | 17 | 3 | TRUE | 3 |
With the group column defined, the minimum Start and the maximum End within each group value can be used to form a single interval record.
dta2f <- ( dta2e
%>% group_by( group )
%>% summarise( Start = min( Start )
, End = max( End )
)
%>% ungroup()
)
group | Start | End |
---|---|---|
1 | 2 | 8 |
2 | 9 | 11 |
3 | 14 | 17 |
With dplyr, this series of operations can be piped together instead of built up incrementally as we did above to show the intermediate results:
dta2g <- ( dta2
%>% arrange( Start )
%>% mutate( hole = c( Inf, Start[-1]-End[-n()] )
, group = cumsum( 0 <= hole )
)
%>% group_by( group )
%>% summarise( Start = min( Start )
, End = max( End )
)
%>% ungroup()
)
group | Start | End |
---|---|---|
1 | 2 | 8 |
2 | 9 | 11 |
3 | 14 | 17 |
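As a quick check (not part of the original sequence), the piped result should match the incrementally built one:
all( dta2f == dta2g )   # expected to be TRUE, since both build the same summary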
One day (max) separation
Suppose you have intervals that are always defined using dates within one year, so technically December 31 does not actually overlap with January 1 of the following year. Depending on the nature of the data this may in fact be close enough, so you may want to define “overlap” as “within one day”. Consider a sample data set:
library(dplyr)
dta3 <- read.csv(text=
"id|start|end
1|2015-08-01|2015-12-31
1|2016-01-01|2016-12-31
1|2017-01-01|2017-05-08
2|2014-08-12|2014-12-31
2|2015-01-01|2015-07-23
2|2016-01-12|2016-12-31
2|2017-01-01|2017-08-22
",sep="|",as.is=TRUE
,colClasses=c("numeric","Date","Date"))
We can again compute hole size, and compare the hole size with 1 day instead of 0 days:
dta3a <- ( dta3
%>% arrange( id, start )
%>% group_by( id )
%>% mutate( hole = c( Inf
, as.numeric( start[-1]-end[-n()]
, units="days"
)
)
, gp = cumsum( 1 < hole )
)
%>% ungroup()
)
id | start | end | hole | gp |
---|---|---|---|---|
1 | 2015-08-01 | 2015-12-31 | Inf | 1 |
1 | 2016-01-01 | 2016-12-31 | 1 | 1 |
1 | 2017-01-01 | 2017-05-08 | 1 | 1 |
2 | 2014-08-12 | 2014-12-31 | Inf | 1 |
2 | 2015-01-01 | 2015-07-23 | 1 | 1 |
2 | 2016-01-12 | 2016-12-31 | 173 | 2 |
2 | 2017-01-01 | 2017-08-22 | 1 | 2 |
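The hole values of 1 at the year boundaries come from ordinary Date subtraction; a quick standalone check of that arithmetic:
# December 31 and the following January 1 differ by exactly one day,
# so the "1 < hole" test treats them as contiguous
as.numeric( as.Date( "2016-01-01" ) - as.Date( "2015-12-31" ), units="days" )
## [1] 1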
Taking this same sequence of calculations and adding a group-by and summarise step, we can identify the starting date and the ending date for each group.
dta3b <- ( dta3
%>% arrange( id, start )
%>% group_by( id )
%>% mutate( hole = c( Inf
, as.numeric( start[-1]-end[-n()]
, units="days"
)
)
, gp = cumsum( 1 < hole )
)
%>% ungroup()
%>% group_by( id, gp )
%>% summarise( start = min( start )
, end = max( end )
)
)
id | gp | start | end |
---|---|---|---|
1 | 1 | 2015-08-01 | 2017-05-08 |
2 | 1 | 2014-08-12 | 2015-07-23 |
2 | 2 | 2016-01-12 | 2017-08-22 |
Wind Direction
In a question on the R-help mailing list, the following problem is posed:
[…] the wind direction is very stable during these situations, and therefore I would like to start from it. […]
In the case of the example below reported, I know that the directions of this particular automatic station must be only SW or WSW.
My biggest problem, […] is to find the beginning and the end of each event, when there is a change in the main direction. Thinking about categorical data in general, is there a way to detect periods when one particular category is more frequent?
So the sample data provided is one day of wind data, with the intent to handle many days of data efficiently:
first_day_POSIX <- as.POSIXct( "2020-02-19" )
last_day_POSIX <- as.POSIXct( "2020-02-20" )
mydf <- data.frame(
data_POSIX = seq( first_day_POSIX, last_day_POSIX, by="10 min" )
, main_dir = c( "WSW", "WSW", "SW", "SW", "W", "WSW", "WSW", "WSW", "W"
, "W", "SW", "WSW", "SSW", "S", "SW", "SW", "WSW", "WNW"
, "W", "WSW", "WSW", "SE", "SE", "SE", "NW", "NNE", "ENE"
, "SE", "NNW", "NW", "NW", "NW", "NW", "NW", "NW", "NE"
, "NW", "NW", "NW", "NW", "NW", "N", "WNW", "NW", "NNW"
, "NNW", "NW", "NW", "NW", "WNW", "ESE", "W", "WSW", "SW"
, "SW", "SW", "WSW", "SW", "S", "S", "SSW", "SW", "WSW"
, "WSW", "WSW", "WSW", "WSW", "WSW", "WSW", "SW", "WSW"
, "WSW", "WSW", "WSW", "SW", "SW", "WSW", "WSW", "WSW"
, "WSW", "WSW", "SW", "SW", "SW", "SW", "SW", "SW", "SW"
, "SW", "SW", "WSW", "WSW", "WSW", "WSW", "SW", "SW"
, "SW", "SW", "WSW", "SW", "SW", "SW", "SW", "SW", "WSW"
, "SW", "SW", "W", "WSW", "WSW", "SSW", "S", "WNW", "SW"
, "W", "WSW", "WSW", "SE", "SE", "SE", "NW", "NNE", "ENE"
, "SE", "NNW", "NW", "NW", "NW", "NW", "NW", "NW", "NE"
, "NW", "NW", "NW", "NW", "NW", "N", "WNW", "NW", "NNW"
, "NNW", "NW", "NW", "NW"
)
, max_speed = c( 4.60, 4.60, 3.40, 3.10, 4.80, 4.20, 4.10, 4.50, 4.70
, 4.30, 2.40, 2.30, 2.20, 2.10, 2.90, 2.80, 1.80, 2.70
, 4.30, 3.30, 2.30, 2.30, 3.20, 3.20, 2.90, 2.30, 1.50
, 1.80, 2.90, 2.40, 1.80, 2.40, 2.30, 2.60, 1.80, 2.30
, 1.90, 2.20, 2.80, 2.40, 1.00, 1.10, 1.60, 2.30, 2.50
, 3.30, 3.40, 3.20, 4.50, 3.90, 3.10, 2.40, 6.00, 7.80
, 6.30, 7.80, 8.10, 6.10, 7.40, 9.50, 8.90, 9.10, 10.10
, 10.50, 11.10, 10.10, 10.90, 11.30, 13.40, 13.50, 12.80
, 11.50, 13.10, 13.50, 11.10, 10.50, 8.50, 10.10, 10.70
, 13.60, 11.90, 14.90, 10.90, 10.90, 12.80, 12.10, 9.10
, 8.30, 8.80, 7.40, 8.40, 10.30, 10.00, 7.00, 8.50, 8.40
, 8.60, 6.70, 7.30, 6.20, 5.90, 5.90, 5.10, 5.80, 5.60
, 6.50, 6.60, 11.70, 11.30, 8.70, 7.10, 6.90, 4.30
, 3.80, 4.30, 3.30, 2.30, 2.30, 3.20, 3.20, 2.90, 2.30
, 1.50, 1.80, 2.90, 2.40, 1.80, 2.40, 2.30, 2.60, 1.80
, 2.30, 1.90, 2.20, 2.80, 2.40, 1.00, 1.10, 1.60, 2.30
, 2.50, 3.30, 3.40, 3.20, 4.50
)
)
We begin with the base-R approach. After this we will show the equivalent dplyr syntax.
First, we identify candidate records to be considered as “foehn” conditions and take a first stab at the groups of records of interest:
mydf$foehn1a <- mydf$main_dir %in% c( "WSW", "SW" )
mydf$foehn1b <- cumsum( !mydf$foehn1a )
Having the ability to consider the groups separately, we identify sequences of more than ten records (one hour, as an example minimum period of time):
mydf$foehn1c <- ave( rep( 1, nrow( mydf ) )
, mydf$foehn1b
, FUN=function(v) 10 < length( v )
)
(If you are not familiar with the ave function: it calls the given function on subsets of the first argument corresponding to distinct values of the second argument (the grouping variable), and combines the results into a vector just as long as the original input data. The function should return either a scalar (which will be extended as necessary) or a vector the same length as the input vector.)
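A tiny standalone illustration of ave, with made-up values:
# each element is replaced by the length of its group: "a" has one member, "b" has two
ave( c( 10, 20, 30 ), c( "a", "b", "b" ), FUN = length )
## [1] 1 2 2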
With the long-enough groups of similar-direction records flagged, we can re-compute events that mark only the beginnings of those long intervals:
mydf$foehn1d <- 0 < diff( c( 0, mydf$foehn1c ) )
Now we can generate distinct, incrementing group numbers, and use the previously determined flags for the long intervals to zero out the group numbers of records that do not belong to a relevant “foehn” period of time:
mydf$foehn1e <- with( mydf
, ifelse( foehn1c
, cumsum( foehn1d )
, 0
)
)
data_POSIX | main_dir | max_speed | foehn1a | foehn1b | foehn1c | foehn1d | foehn1e |
---|---|---|---|---|---|---|---|
2020-02-19 00:00:00 | WSW | 4.6 | TRUE | 0 | 0 | FALSE | 0 |
2020-02-19 00:10:00 | WSW | 4.6 | TRUE | 0 | 0 | FALSE | 0 |
2020-02-19 00:20:00 | SW | 3.4 | TRUE | 0 | 0 | FALSE | 0 |
2020-02-19 00:30:00 | SW | 3.1 | TRUE | 0 | 0 | FALSE | 0 |
2020-02-19 00:40:00 | W | 4.8 | FALSE | 1 | 0 | FALSE | 0 |
2020-02-19 00:50:00 | WSW | 4.2 | TRUE | 1 | 0 | FALSE | 0 |
2020-02-19 01:00:00 | WSW | 4.1 | TRUE | 1 | 0 | FALSE | 0 |
2020-02-19 01:10:00 | WSW | 4.5 | TRUE | 1 | 0 | FALSE | 0 |
2020-02-19 01:20:00 | W | 4.7 | FALSE | 2 | 0 | FALSE | 0 |
2020-02-19 01:30:00 | W | 4.3 | FALSE | 3 | 0 | FALSE | 0 |
2020-02-19 01:40:00 | SW | 2.4 | TRUE | 3 | 0 | FALSE | 0 |
2020-02-19 01:50:00 | WSW | 2.3 | TRUE | 3 | 0 | FALSE | 0 |
2020-02-19 02:00:00 | SSW | 2.2 | FALSE | 4 | 0 | FALSE | 0 |
2020-02-19 02:10:00 | S | 2.1 | FALSE | 5 | 0 | FALSE | 0 |
2020-02-19 02:20:00 | SW | 2.9 | TRUE | 5 | 0 | FALSE | 0 |
2020-02-19 02:30:00 | SW | 2.8 | TRUE | 5 | 0 | FALSE | 0 |
2020-02-19 02:40:00 | WSW | 1.8 | TRUE | 5 | 0 | FALSE | 0 |
2020-02-19 02:50:00 | WNW | 2.7 | FALSE | 6 | 0 | FALSE | 0 |
2020-02-19 03:00:00 | W | 4.3 | FALSE | 7 | 0 | FALSE | 0 |
2020-02-19 03:10:00 | WSW | 3.3 | TRUE | 7 | 0 | FALSE | 0 |
2020-02-19 03:20:00 | WSW | 2.3 | TRUE | 7 | 0 | FALSE | 0 |
2020-02-19 03:30:00 | SE | 2.3 | FALSE | 8 | 0 | FALSE | 0 |
2020-02-19 03:40:00 | SE | 3.2 | FALSE | 9 | 0 | FALSE | 0 |
2020-02-19 03:50:00 | SE | 3.2 | FALSE | 10 | 0 | FALSE | 0 |
2020-02-19 04:00:00 | NW | 2.9 | FALSE | 11 | 0 | FALSE | 0 |
2020-02-19 04:10:00 | NNE | 2.3 | FALSE | 12 | 0 | FALSE | 0 |
2020-02-19 04:20:00 | ENE | 1.5 | FALSE | 13 | 0 | FALSE | 0 |
2020-02-19 04:30:00 | SE | 1.8 | FALSE | 14 | 0 | FALSE | 0 |
2020-02-19 04:40:00 | NNW | 2.9 | FALSE | 15 | 0 | FALSE | 0 |
2020-02-19 04:50:00 | NW | 2.4 | FALSE | 16 | 0 | FALSE | 0 |
2020-02-19 05:00:00 | NW | 1.8 | FALSE | 17 | 0 | FALSE | 0 |
2020-02-19 05:10:00 | NW | 2.4 | FALSE | 18 | 0 | FALSE | 0 |
2020-02-19 05:20:00 | NW | 2.3 | FALSE | 19 | 0 | FALSE | 0 |
2020-02-19 05:30:00 | NW | 2.6 | FALSE | 20 | 0 | FALSE | 0 |
2020-02-19 05:40:00 | NW | 1.8 | FALSE | 21 | 0 | FALSE | 0 |
2020-02-19 05:50:00 | NE | 2.3 | FALSE | 22 | 0 | FALSE | 0 |
2020-02-19 06:00:00 | NW | 1.9 | FALSE | 23 | 0 | FALSE | 0 |
2020-02-19 06:10:00 | NW | 2.2 | FALSE | 24 | 0 | FALSE | 0 |
2020-02-19 06:20:00 | NW | 2.8 | FALSE | 25 | 0 | FALSE | 0 |
2020-02-19 06:30:00 | NW | 2.4 | FALSE | 26 | 0 | FALSE | 0 |
2020-02-19 06:40:00 | NW | 1.0 | FALSE | 27 | 0 | FALSE | 0 |
2020-02-19 06:50:00 | N | 1.1 | FALSE | 28 | 0 | FALSE | 0 |
2020-02-19 07:00:00 | WNW | 1.6 | FALSE | 29 | 0 | FALSE | 0 |
2020-02-19 07:10:00 | NW | 2.3 | FALSE | 30 | 0 | FALSE | 0 |
2020-02-19 07:20:00 | NNW | 2.5 | FALSE | 31 | 0 | FALSE | 0 |
2020-02-19 07:30:00 | NNW | 3.3 | FALSE | 32 | 0 | FALSE | 0 |
2020-02-19 07:40:00 | NW | 3.4 | FALSE | 33 | 0 | FALSE | 0 |
2020-02-19 07:50:00 | NW | 3.2 | FALSE | 34 | 0 | FALSE | 0 |
2020-02-19 08:00:00 | NW | 4.5 | FALSE | 35 | 0 | FALSE | 0 |
2020-02-19 08:10:00 | WNW | 3.9 | FALSE | 36 | 0 | FALSE | 0 |
2020-02-19 08:20:00 | ESE | 3.1 | FALSE | 37 | 0 | FALSE | 0 |
2020-02-19 08:30:00 | W | 2.4 | FALSE | 38 | 0 | FALSE | 0 |
2020-02-19 08:40:00 | WSW | 6.0 | TRUE | 38 | 0 | FALSE | 0 |
2020-02-19 08:50:00 | SW | 7.8 | TRUE | 38 | 0 | FALSE | 0 |
2020-02-19 09:00:00 | SW | 6.3 | TRUE | 38 | 0 | FALSE | 0 |
2020-02-19 09:10:00 | SW | 7.8 | TRUE | 38 | 0 | FALSE | 0 |
2020-02-19 09:20:00 | WSW | 8.1 | TRUE | 38 | 0 | FALSE | 0 |
2020-02-19 09:30:00 | SW | 6.1 | TRUE | 38 | 0 | FALSE | 0 |
2020-02-19 09:40:00 | S | 7.4 | FALSE | 39 | 0 | FALSE | 0 |
2020-02-19 09:50:00 | S | 9.5 | FALSE | 40 | 0 | FALSE | 0 |
2020-02-19 10:00:00 | SSW | 8.9 | FALSE | 41 | 1 | TRUE | 1 |
2020-02-19 10:10:00 | SW | 9.1 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 10:20:00 | WSW | 10.1 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 10:30:00 | WSW | 10.5 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 10:40:00 | WSW | 11.1 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 10:50:00 | WSW | 10.1 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 11:00:00 | WSW | 10.9 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 11:10:00 | WSW | 11.3 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 11:20:00 | WSW | 13.4 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 11:30:00 | SW | 13.5 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 11:40:00 | WSW | 12.8 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 11:50:00 | WSW | 11.5 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 12:00:00 | WSW | 13.1 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 12:10:00 | WSW | 13.5 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 12:20:00 | SW | 11.1 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 12:30:00 | SW | 10.5 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 12:40:00 | WSW | 8.5 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 12:50:00 | WSW | 10.1 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 13:00:00 | WSW | 10.7 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 13:10:00 | WSW | 13.6 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 13:20:00 | WSW | 11.9 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 13:30:00 | SW | 14.9 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 13:40:00 | SW | 10.9 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 13:50:00 | SW | 10.9 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 14:00:00 | SW | 12.8 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 14:10:00 | SW | 12.1 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 14:20:00 | SW | 9.1 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 14:30:00 | SW | 8.3 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 14:40:00 | SW | 8.8 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 14:50:00 | SW | 7.4 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 15:00:00 | WSW | 8.4 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 15:10:00 | WSW | 10.3 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 15:20:00 | WSW | 10.0 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 15:30:00 | WSW | 7.0 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 15:40:00 | SW | 8.5 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 15:50:00 | SW | 8.4 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 16:00:00 | SW | 8.6 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 16:10:00 | SW | 6.7 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 16:20:00 | WSW | 7.3 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 16:30:00 | SW | 6.2 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 16:40:00 | SW | 5.9 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 16:50:00 | SW | 5.9 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 17:00:00 | SW | 5.1 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 17:10:00 | SW | 5.8 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 17:20:00 | WSW | 5.6 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 17:30:00 | SW | 6.5 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 17:40:00 | SW | 6.6 | TRUE | 41 | 1 | FALSE | 1 |
2020-02-19 17:50:00 | W | 11.7 | FALSE | 42 | 0 | FALSE | 0 |
2020-02-19 18:00:00 | WSW | 11.3 | TRUE | 42 | 0 | FALSE | 0 |
2020-02-19 18:10:00 | WSW | 8.7 | TRUE | 42 | 0 | FALSE | 0 |
2020-02-19 18:20:00 | SSW | 7.1 | FALSE | 43 | 0 | FALSE | 0 |
2020-02-19 18:30:00 | S | 6.9 | FALSE | 44 | 0 | FALSE | 0 |
2020-02-19 18:40:00 | WNW | 4.3 | FALSE | 45 | 0 | FALSE | 0 |
2020-02-19 18:50:00 | SW | 3.8 | TRUE | 45 | 0 | FALSE | 0 |
2020-02-19 19:00:00 | W | 4.3 | FALSE | 46 | 0 | FALSE | 0 |
2020-02-19 19:10:00 | WSW | 3.3 | TRUE | 46 | 0 | FALSE | 0 |
2020-02-19 19:20:00 | WSW | 2.3 | TRUE | 46 | 0 | FALSE | 0 |
2020-02-19 19:30:00 | SE | 2.3 | FALSE | 47 | 0 | FALSE | 0 |
2020-02-19 19:40:00 | SE | 3.2 | FALSE | 48 | 0 | FALSE | 0 |
2020-02-19 19:50:00 | SE | 3.2 | FALSE | 49 | 0 | FALSE | 0 |
2020-02-19 20:00:00 | NW | 2.9 | FALSE | 50 | 0 | FALSE | 0 |
2020-02-19 20:10:00 | NNE | 2.3 | FALSE | 51 | 0 | FALSE | 0 |
2020-02-19 20:20:00 | ENE | 1.5 | FALSE | 52 | 0 | FALSE | 0 |
2020-02-19 20:30:00 | SE | 1.8 | FALSE | 53 | 0 | FALSE | 0 |
2020-02-19 20:40:00 | NNW | 2.9 | FALSE | 54 | 0 | FALSE | 0 |
2020-02-19 20:50:00 | NW | 2.4 | FALSE | 55 | 0 | FALSE | 0 |
2020-02-19 21:00:00 | NW | 1.8 | FALSE | 56 | 0 | FALSE | 0 |
2020-02-19 21:10:00 | NW | 2.4 | FALSE | 57 | 0 | FALSE | 0 |
2020-02-19 21:20:00 | NW | 2.3 | FALSE | 58 | 0 | FALSE | 0 |
2020-02-19 21:30:00 | NW | 2.6 | FALSE | 59 | 0 | FALSE | 0 |
2020-02-19 21:40:00 | NW | 1.8 | FALSE | 60 | 0 | FALSE | 0 |
2020-02-19 21:50:00 | NE | 2.3 | FALSE | 61 | 0 | FALSE | 0 |
2020-02-19 22:00:00 | NW | 1.9 | FALSE | 62 | 0 | FALSE | 0 |
2020-02-19 22:10:00 | NW | 2.2 | FALSE | 63 | 0 | FALSE | 0 |
2020-02-19 22:20:00 | NW | 2.8 | FALSE | 64 | 0 | FALSE | 0 |
2020-02-19 22:30:00 | NW | 2.4 | FALSE | 65 | 0 | FALSE | 0 |
2020-02-19 22:40:00 | NW | 1.0 | FALSE | 66 | 0 | FALSE | 0 |
2020-02-19 22:50:00 | N | 1.1 | FALSE | 67 | 0 | FALSE | 0 |
2020-02-19 23:00:00 | WNW | 1.6 | FALSE | 68 | 0 | FALSE | 0 |
2020-02-19 23:10:00 | NW | 2.3 | FALSE | 69 | 0 | FALSE | 0 |
2020-02-19 23:20:00 | NNW | 2.5 | FALSE | 70 | 0 | FALSE | 0 |
2020-02-19 23:30:00 | NNW | 3.3 | FALSE | 71 | 0 | FALSE | 0 |
2020-02-19 23:40:00 | NW | 3.4 | FALSE | 72 | 0 | FALSE | 0 |
2020-02-19 23:50:00 | NW | 3.2 | FALSE | 73 | 0 | FALSE | 0 |
2020-02-20 00:00:00 | NW | 4.5 | FALSE | 74 | 0 | FALSE | 0 |
And if all we want is to identify the start and end of each period:
mydf1 <- mydf[ 0 != mydf$foehn1e, c( "data_POSIX", "foehn1e") ]
mydf2 <- do.call( rbind
, lapply( split( mydf1
, mydf1$foehn1e
)
, function( DF ) {
data.frame( foehn1e = DF$foehn1e[ 1 ]
, Start = min( DF$data_POSIX )
, End = max( DF$data_POSIX )
)
}
)
)
foehn1e | Start | End |
---|---|---|
1 | 2020-02-19 10:00:00 | 2020-02-19 17:40:00 |
Alternatively, the above steps can be encoded using dplyr:
mydf3alt <- ( mydf
%>% mutate( foehn1a = main_dir %in% c( "WSW", "SW" )
, foehn1b = cumsum( !foehn1a )
)
%>% group_by( foehn1b )
%>% mutate( foehn1c = 10 < n() )
%>% ungroup()
%>% mutate( foehn1d = 0 < diff( c( 0, foehn1c ) )
, foehn1e = ifelse( foehn1c
, cumsum( foehn1d )
, 0
)
)
)
Just to confirm that this method obtains the same result as the base R method:
all( mydf == mydf3alt )
## [1] TRUE
And the collapsed form is:
mydf3alt2 <- ( mydf3alt
%>% filter( 0 < foehn1e )
%>% group_by( foehn1e )
%>% summarise( Start = min( data_POSIX )
, End = max( data_POSIX )
)
)
foehn1e | Start | End |
---|---|---|
1 | 2020-02-19 10:00:00 | 2020-02-19 17:40:00 |
Just to confirm the same final results:
all( mydf2 == mydf3alt2 )
## [1] TRUE
👍
Fini
In summary:
- The diff function is useful for finding “events” that begin or end groups of record classifications.
- Events can also be identified using logical tests of information in the records.
- The cumsum function is useful for converting events into grouping variables.
- These grouping columns can be used to divide the records into groups and collapse them into summary values such as start or end times, or aggregate measures relevant to those records.
- These techniques can efficiently scale up to very large data sets.
- There are some forms of event specifications that cannot be identified this way, and in such cases a for loop in R or Rcpp may be needed. In particular, when continuous values are accumulated differently according to events, and subsequent events are defined in terms of those continuous accumulations, such calculations are in general not amenable to these cumsum and diff tricks (see the sketch after this list).
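For example, here is a hypothetical sketch of such a case: a running total that resets whenever it crosses a threshold. Because each reset (event) depends on the accumulated value itself, the group boundaries cannot be precomputed with vectorized cumsum/diff, and a sequential loop is needed.
# made-up data and threshold, purely for illustration
x <- c( 2, 3, 4, 1, 5, 2, 2, 6 )
threshold <- 6
total <- 0
group <- integer( length( x ) )
g <- 1L
for ( i in seq_along( x ) ) {
  total <- total + x[ i ]
  group[ i ] <- g
  if ( threshold <= total ) {  # the event is defined by the accumulation itself
    total <- 0                 # reset the accumulator
    g <- g + 1L                # start a new group at the next record
  }
}
group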