Use ggplot to draw a single curve

As an example, consider the function

\[f(r) = 1/r\]

We may want to plot values of \(r\) in the interval \([1,5]\). Let's make a data frame with one column:

DF1 <- data.frame( r = 1:5 )
DF1
##   r
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5

Now compute the result and put it into a new column in the data frame:

DF1$t <- 1 / DF1$r
DF1
##   r         t
## 1 1 1.0000000
## 2 2 0.5000000
## 3 3 0.3333333
## 4 4 0.2500000
## 5 5 0.2000000

Now we plot the data in DF1, mapping r to the x-axis and t to the y-axis, and drawing lines between the (x, y) pairs:

library(ggplot2)
ggplot( DF1, aes( x = r, y = t ) ) +
  geom_line()
Figure 1-1. Simple line plot

We can change this to plot points instead:

ggplot( DF1, aes( x = r, y = t ) ) +
  geom_point()
Figure 1-2. Simple points plot

or we can overlay them and change the size and color of the points:

ggplot( DF1, aes( x = r, y = t ) ) +
  geom_line() +
  geom_point( size = 3, color = "red" )
Figure 1-3. Simple line plot with points having fixed size and color

but we cannot get a legend for the size or color when we specify them on the fly like this. Rather, we have to map columns of categorical data to the size and color aesthetics:

DF1$Test <- "First Test"
DF1$Data <- "Samples"
DF1
##   r         t       Test    Data
## 1 1 1.0000000 First Test Samples
## 2 2 0.5000000 First Test Samples
## 3 3 0.3333333 First Test Samples
## 4 4 0.2500000 First Test Samples
## 5 5 0.2000000 First Test Samples
ggplot( DF1, aes( x = r, y = t ) ) +
  geom_line() +
  geom_point( mapping = aes( size = Test, color = Data ) )
## Warning: Using size for a discrete variable is not advised.
Figure 1-4. Simple line plot with points and auto legend

Finally, if we want to control how column data are translated into size and colour, we have to make sure discrete variables are set up as factors first, and then adjust their “scales”:

DF1$Testf <- factor( DF1$Test, levels = "First Test" )
DF1$Dataf <- factor( DF1$Data, levels = c( "Samples", "Function" ) )
ggplot( DF1, aes( x = r, y = t ) ) +
  geom_line() +
  geom_point( mapping = aes( size = Testf, color = Dataf ) ) +
  scale_size_manual( name = "Test", values = 3 ) +
  scale_color_manual( name = "Data", values = "red" )
Figure 1-5. Simple line plot with points and modified legend

The values arguments give, by position, the graphical value corresponding to each factor level.

You can read about factors in the Introduction to R manual that comes with R. They look a lot like character vectors, but are actually vectors of integers that are automatically used as indexes into a (usually shorter) character vector of “levels”. They tend to be useful in the final stages of analysis or display, but are not well suited for combining data sources… for example, don't try concatenating factor vectors.
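A minimal sketch of that internal structure (fct is a throwaway example):

fct <- factor( c( "low", "high", "low" ) )
levels( fct ) # the character vector of levels (sorted alphabetically by default)
## [1] "high" "low"
as.integer( fct ) # the underlying integer codes that index into the levels
## [1] 2 1 2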

Plotting functions with ggplot

Normally functions are evaluated first and the resulting data are plotted, but sometimes you just want a quick way to overlay what a function looks like on your graph of data points. The stat_function function handles this:

g <- function( x ) {
  1 / x
}
ggplot( DF1, aes( x = r, y = t ) ) +
  geom_line() +
  geom_point( mapping = aes( size = Test, color = Data ) ) +
  stat_function( fun = g, mapping = aes( color = "Function" ) ) +
  scale_size_manual( values = 3 ) +
  scale_color_manual( values = c( "blue", "red" ) ) +
  xlab( "r (km/h)" ) +
  ylab( "t (h)" )
## Warning: `mapping` is not used by stat_function()
Figure 2-1. Simple line plot with points and modified legend and function

(Optional detail: notice that factors are not specified here for size and color. The legend applies to all data shown on the plot, and stat_function automatically generates and adds a batch of points to the plot, so their size and color must be assigned on the fly. To do this, ggplot appends new records to the mapped data, with the color column for those records set to “Function”. Since concatenating factors does not work in general, the character vectors are concatenated and the unique strings are sorted alphabetically to form the “effective” levels of the color factor. Because this happens internally, there is no way for us to specify the order of those levels outside ggplot. This is why stat_function is not used very often… better control over the results can be obtained by generating all of the points to plot before giving them to ggplot.)
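As a sketch of that alternative, we can evaluate the function on a fine grid ourselves and hand ggplot an ordinary data frame (fun_pts is a made-up name):

fun_pts <- data.frame( r = seq( 1, 5, 0.1 ) )
fun_pts$t <- g( fun_pts$r )
ggplot( DF1, aes( x = r, y = t ) ) +
  geom_line( data = fun_pts, color = "blue" ) + # the function, as precomputed points
  geom_point( color = "red", size = 3 )         # the data samples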

Use ggplot to draw multiple curves

Now consider the function

\[f(r) = d/r\]

where \(d=1,2,3\) is a distance, \(r\) is a rate, and \(t=f(r)\) is time. We may want to plot values of \(r\) in the interval \([1,5]\) for each value of \(d\).

The direct way to make a data frame that contains combinations of x,y points is:

x <- 1:5
DF2 <- data.frame( d = rep( c( 1, 2, 3 )
                          , each = length( x ) )
                 , r = c( x, x, x )
                 )
str( DF2 )
## 'data.frame':    15 obs. of  2 variables:
##  $ d: num  1 1 1 1 1 2 2 2 2 2 ...
##  $ r: int  1 2 3 4 5 1 2 3 4 5 ...
head( DF2 )
##   d r
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 2 1

Now we have a data frame with all combinations of the inputs needed for the function. (Note that there are more compact ways to build such combinations of inputs, such as the expand.grid function that will be discussed later.) We can augment this data frame with a new column that contains the computed results:

f <- function( r, d ) {
    d/r
}
DF2$t <- f( DF2$r, DF2$d )
head( DF2 )
##   d r         t
## 1 1 1 1.0000000
## 2 1 2 0.5000000
## 3 1 3 0.3333333
## 4 1 4 0.2500000
## 5 1 5 0.2000000
## 6 2 1 2.0000000

Now we have a data frame with both the input values and the corresponding output values. Let's make a first try at the graph:

ggplot( DF2, aes( x = r, y = t, color = d ) ) +
  geom_line()
Figure 3-1. Multi-curve continuous color

Not quite what we would have hoped. The problem is that we have mapped color to a continuous (numeric) variable. We can fix this by changing d to a discrete “factor” variable:

DF2$df <- factor( DF2$d )
str( DF2 )
## 'data.frame':    15 obs. of  4 variables:
##  $ d : num  1 1 1 1 1 2 2 2 2 2 ...
##  $ r : int  1 2 3 4 5 1 2 3 4 5 ...
##  $ t : num  1 0.5 0.333 0.25 0.2 ...
##  $ df: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...

Now plot a second try using the factor:

ggplot( DF2, aes( x = r, y = t, color = df ) ) +
    geom_line()
Figure 3-2. Multi-curve factor mapped to color

The sequence of values in the legend is controlled by the sequence of levels used in the factor, so:

DF2$df <- factor( as.character( DF2$d ), levels = c( "3", "1", "2" ) )
str( DF2 )
## 'data.frame':    15 obs. of  4 variables:
##  $ d : num  1 1 1 1 1 2 2 2 2 2 ...
##  $ r : int  1 2 3 4 5 1 2 3 4 5 ...
##  $ t : num  1 0.5 0.333 0.25 0.2 ...
##  $ df: Factor w/ 3 levels "3","1","2": 2 2 2 2 2 3 3 3 3 3 ...

and plot a third try:

ggplot( DF2, aes( x = r, y = t, color = df ) ) +
    geom_line()
Figure 3-3. Multi-curve factor with explicit levels mapped to color

The colors are mapped by default using a scale_colour_discrete function. However, we can manually assign colors according to the sequence of the levels in the factor:

ggplot( DF2, aes( x = r, y = t, color = df ) ) +
    geom_line() +
    scale_colour_manual( values = c( "green", "red", "black" ) )
Figure 3-4. Multi-curve with manual color scale

We can change the label on the legend by modifying the name of the colour scale. We can also restore a more sensible order to the levels, embellish the axis labels, and increase the line width:

DF2$df <- factor( DF2$d )
str( DF2 )
## 'data.frame':    15 obs. of  4 variables:
##  $ d : num  1 1 1 1 1 2 2 2 2 2 ...
##  $ r : int  1 2 3 4 5 1 2 3 4 5 ...
##  $ t : num  1 0.5 0.333 0.25 0.2 ...
##  $ df: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...
ggplot( DF2, aes( x = r, y = t, color = df ) ) +
    geom_line( size = 2 ) +
    scale_colour_manual( name = "Distance (km)"
                       , values = c( "red", "orange", "green", "blue", "black" ) ) +
    xlab( "Speed (km/h)" ) +
    ylab( "Time (h)" )
Figure 3-5. Multi-curve with manual color scale and explicit scale titles

Use expand.grid to make the input variable columns

The expand.grid function makes all combinations of the values in two or more vectors appear in the rows of a data frame. For example:

expand.grid( d = c( 0, 1 ), b = c( 0, 2, 4 ) )
##   d b
## 1 0 0
## 2 1 0
## 3 0 2
## 4 1 2
## 5 0 4
## 6 1 4

So we could easily make lots of points at which a function is to be evaluated:

DF3 <- expand.grid( d = 1:5, r = (10:500)/10 )
DF3$df <- factor( DF3$d )
str( DF3 )
## 'data.frame':    2455 obs. of  3 variables:
##  $ d : int  1 2 3 4 5 1 2 3 4 5 ...
##  $ r : num  1 1 1 1 1 1.1 1.1 1.1 1.1 1.1 ...
##  $ df: Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5 1 2 3 4 5 ...
##  - attr(*, "out.attrs")=List of 2
##   ..$ dim     : Named int [1:2] 5 491
##   .. ..- attr(*, "names")= chr [1:2] "d" "r"
##   ..$ dimnames:List of 2
##   .. ..$ d: chr [1:5] "d=1" "d=2" "d=3" "d=4" ...
##   .. ..$ r: chr [1:491] "r= 1.0" "r= 1.1" "r= 1.2" "r= 1.3" ...

Then compute the function values and plot:

DF3$t <- with( DF3, f( r, d ) )
ggplot( DF3, aes( x = r, y = t, color = df ) ) +
    geom_line() +
    scale_colour_manual( name = "Distance (km)"
                       , values = c( "red", "orange", "green", "blue", "black" ) ) +
    xlab( "Speed (km/h)" ) +
    ylab( "Time (h)" )
Figure 4-1. Curves generated from expand.grid input

Find implicit solution using root solver uniroot

Suppose we want to look at how solutions to the quadratic equation change as the coefficients change. The equation is:

\[x^2 + \mathit{BA}\,x + \mathit{CA} = 0\]

so we will plot solutions versus \(CA\) in \([-5,-1]\), for a few values of \(BA\), say \(\{1, 2, 3, 4\}\). First, write a function that expresses the left side of that equation:

E <- function( x, CA, BA ) {
    x^2 + BA * x + CA
}

Note that by basic calculus (set the derivative \(2x + BA\) to zero) this function E has an extremum at \(x = -\frac{BA}{2}\), but we won't always be able to figure this out analytically, so often a fixed search domain has to be chosen some other way. To get a clue, we can plot E:

x <- seq( -5, 5, 0.1 )
# "quick" plot version of ggplot, works with x and y vectors
qplot(x, E( x, -1, 1 ) ) + geom_hline( yintercept = 0, colour = "blue" )
Figure 5-1. Exploratory plot of function \(E\) where \(BA=1\) and \(CA=-1\)

and note that the minimum occurs at \(x<0\). When \(BA=5\) and \(CA=-5\) the minimum is deeper and the leftmost solution lies further left than in the first plot, so while the rightmost solution is positive in these cases, in general we should make the left endpoint of our search interval adjust with the parameters.

x <- seq( -5, 5, 0.1 )
qplot(x, E( x, -5, 5 ) ) + geom_hline( yintercept = 0, colour = "blue" )
Figure 5-2. Exploratory plot of function \(E\) where \(BA=5\) and \(CA=-5\)

Now write a function that returns the right-most (greatest value) zero of E, which is also the solution to the original equation:

f <- function( CA, BA ) {
    uniroot( E, c( -BA/2, 5 ), CA = CA, BA = BA )$root
}
f( -1, 1 )
## [1] 0.618025
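As an aside, uniroot returns a list, of which we keep only the root element; a quick sketch of the other components (using the same E):

r1 <- uniroot( E, c( -0.5, 5 ), CA = -1, BA = 1 )
names( r1 ) # f.root is the function value at the root; estim.prec estimates precision
## [1] "root"       "f.root"     "iter"       "init.it"    "estim.prec"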

We can set up a data frame of possible input combinations as before:

DF4 <- expand.grid( BA = 1:4, CA = seq( -5, -1, 0.1))
str( DF4 )
## 'data.frame':    164 obs. of  2 variables:
##  $ BA: int  1 2 3 4 1 2 3 4 1 2 ...
##  $ CA: num  -5 -5 -5 -5 -4.9 -4.9 -4.9 -4.9 -4.8 -4.8 ...
##  - attr(*, "out.attrs")=List of 2
##   ..$ dim     : Named int [1:2] 4 41
##   .. ..- attr(*, "names")= chr [1:2] "BA" "CA"
##   ..$ dimnames:List of 2
##   .. ..$ BA: chr [1:4] "BA=1" "BA=2" "BA=3" "BA=4"
##   .. ..$ CA: chr [1:41] "CA=-5.0" "CA=-4.9" "CA=-4.8" "CA=-4.7" ...

Our f expects a single value of CA and a single value of BA, since uniroot only finds one root at a time. This means we can no longer use a vectorized assignment:

## won't work
# DF4$x <- with( DF4, f( CA, BA ) )

but we can use a loop to call it one row at a time:

DF4$x <- NA # make room for all the answers
for ( i in seq.int( nrow( DF4 ) ) ) {
    DF4[ i, "x" ] <- f( DF4[ i, "CA" ], DF4[ i, "BA" ] )
}

and then plot the results:

DF4$BAf <- factor( DF4$BA )
ggplot( DF4, aes( x = CA, y = x, colour = BAf ) ) +
    geom_line( size = 3 )
Figure 5-3. For-loop version of quadratic equation solutions

Rowwise processing with dplyr

The notation in the for loop can get cumbersome. A relatively new way to clarify how these steps are performed is based on the dplyr package.

The basic principle is that the data “flow” from a starting point through a sequence of operations. dplyr uses some sophisticated tricks to figure out all the DF5$ prefixes for you (and to be fair, while the result is easy to read when it works, the dplyr rules of evaluation can surprise you while you are composing it).

To illustrate this method, the intermediate results are shown first, and the complete calculation follows.

First, a basic expand.grid generates all combinations of BA and CA:

str( expand.grid( BA = 1:4, CA = seq( -5, -1, 0.1 ) ) ) # construct the inputs
## 'data.frame':    164 obs. of  2 variables:
##  $ BA: int  1 2 3 4 1 2 3 4 1 2 ...
##  $ CA: num  -5 -5 -5 -5 -4.9 -4.9 -4.9 -4.9 -4.8 -4.8 ...
##  - attr(*, "out.attrs")=List of 2
##   ..$ dim     : Named int [1:2] 4 41
##   .. ..- attr(*, "names")= chr [1:2] "BA" "CA"
##   ..$ dimnames:List of 2
##   .. ..$ BA: chr [1:4] "BA=1" "BA=2" "BA=3" "BA=4"
##   .. ..$ CA: chr [1:41] "CA=-5.0" "CA=-4.9" "CA=-4.8" "CA=-4.7" ...

Next we introduce the %>% pipe operator and mutate function from the dplyr package:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
str(   expand.grid( BA = 1:4, CA = seq( -5, -1, 0.1 ) ) # construct the inputs
   %>% mutate( BAf = factor( BA ) ) # vectorwise calculation of a new column
   )
## 'data.frame':    164 obs. of  3 variables:
##  $ BA : int  1 2 3 4 1 2 3 4 1 2 ...
##  $ CA : num  -5 -5 -5 -5 -4.9 -4.9 -4.9 -4.9 -4.8 -4.8 ...
##  $ BAf: Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...

The mutate function lets us modify or create columns based on existing columns or any other variables in the working environment.
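For instance, a minimal sketch mixing a column with an ordinary variable from the working environment (scale_factor is a made-up name):

scale_factor <- 2 # an ordinary variable, not a column
str( data.frame( a = 1:3 ) %>% mutate( b = scale_factor * a ) )
## 'data.frame':    3 obs. of  2 variables:
##  $ a: int  1 2 3
##  $ b: num  2 4 6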

Next, we modify the resulting data frame so that later operations will apply one row at a time:

str(   expand.grid( BA = 1:4, CA = seq( -5, -1, 0.1 ) ) # construct the inputs
   %>% mutate( BAf = factor( BA ) ) # vectorwise calculation of a new column
   %>% rowwise  # mark results obtained so far to be handled row-by-row
   )
## tibble [164 × 3] (S3: rowwise_df/tbl_df/tbl/data.frame)
##  $ BA : int [1:164] 1 2 3 4 1 2 3 4 1 2 ...
##  $ CA : num [1:164] -5 -5 -5 -5 -4.9 -4.9 -4.9 -4.9 -4.8 -4.8 ...
##  $ BAf: Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...

The details are not important for using dplyr, but you can see that the data frame has been altered so that dplyr functions know they need to handle it differently than usual.
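For instance, the extra classes that mark this handling are visible directly (the class list matches the str header above):

class( expand.grid( BA = 1:4, CA = seq( -5, -1, 0.1 ) ) %>% rowwise )
## [1] "rowwise_df" "tbl_df"     "tbl"        "data.frame"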

Now we apply another mutate to create the x column one value at a time:

str(   expand.grid( BA = 1:4, CA = seq( -5, -1, 0.1 ) ) # construct the inputs
   %>% mutate( BAf = factor( BA ) ) # vectorwise calculation of a new column
   %>% rowwise  # mark results obtained so far to be handled row-by-row
   %>% mutate( x = f( CA, BA ) ) # rowwise calculation of a new column
   )
## tibble [164 × 4] (S3: rowwise_df/tbl_df/tbl/data.frame)
##  $ BA : int [1:164] 1 2 3 4 1 2 3 4 1 2 ...
##  $ CA : num [1:164] -5 -5 -5 -5 -4.9 -4.9 -4.9 -4.9 -4.8 -4.8 ...
##  $ BAf: Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
##  $ x  : num [1:164] 1.79 1.45 1.19 1 1.77 ...

and the column is created in one line instead of a whole for loop. However, the rowwise modification is still active. Since this tends to slow future dplyr operations down, and some non-dplyr functions don’t work well with the modified data frame, we convert it back to a normal data frame again:

str(   expand.grid( BA = 1:4, CA =seq( -5, -1, 0.1 ) ) # construct the inputs
   %>% mutate( BAf = factor( BA ) ) # vectorwise calculation of a new column
   %>% rowwise  # mark results obtained so far to be handled row-by-row
   %>% mutate( x = f( CA, BA ) ) # rowwise calculation of a new column
   %>% as.data.frame # optionally convert result back to a plain data frame
   )
## 'data.frame':    164 obs. of  4 variables:
##  $ BA : int  1 2 3 4 1 2 3 4 1 2 ...
##  $ CA : num  -5 -5 -5 -5 -4.9 -4.9 -4.9 -4.9 -4.8 -4.8 ...
##  $ BAf: Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
##  $ x  : num  1.79 1.45 1.19 1 1.77 ...
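Alternatively, if you are happy to keep working with a tibble, ungroup removes just the rowwise marking (a sketch, assuming a reasonably recent dplyr):

str(   expand.grid( BA = 1:4, CA = seq( -5, -1, 0.1 ) ) # construct the inputs
   %>% mutate( BAf = factor( BA ) ) # vectorwise calculation of a new column
   %>% rowwise  # mark results obtained so far to be handled row-by-row
   %>% mutate( x = f( CA, BA ) ) # rowwise calculation of a new column
   %>% ungroup # drop the rowwise marking but keep the tibble
   )
## tibble [164 × 4] (S3: tbl_df/tbl/data.frame)
##  $ BA : int [1:164] 1 2 3 4 1 2 3 4 1 2 ...
##  $ CA : num [1:164] -5 -5 -5 -5 -4.9 -4.9 -4.9 -4.9 -4.8 -4.8 ...
##  $ BAf: Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
##  $ x  : num [1:164] 1.79 1.45 1.19 1 1.77 ...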

Thus, the previous example can be boiled down to:

DF5 <- (   expand.grid( BA = 1:4, CA = seq( -5, -1, 0.1 ) ) # construct the inputs
       %>% mutate( BAf = factor( BA ) ) # vectorwise calculation of a new column
       %>% rowwise  # mark results obtained so far to be handled row-by-row
       %>% mutate( x = f( CA, BA ) ) # rowwise calculation of a new column
       %>% as.data.frame # optionally convert result back to a plain data frame
       )
ggplot( DF5, aes( x = CA, y = x, colour = BAf ) ) +
    geom_line( size = 3 ) +
    scale_color_discrete( name = "BA" )
Figure 6-1. Rowwise version of dplyr quadratic equation solutions

Now that is very little code to do quite a lot of work!

Vectorizing

Some algorithms like uniroot need to be applied to one scalar value at a time. However, it can get tedious to keep writing for loops every time you want to use such an algorithm. The solution is to make a function that implements the for loop:

fv1 <- function( CA, BA ) {
  stopifnot( length( CA ) == length( BA ) )
  x <- rep( NA, length( CA ) ) # make room for all the answers
  for ( i in seq.int( length( CA ) ) ) {
    x[ i ] <- f( CA[ i ], BA[ i ] )
  }
  x # last expression gets returned - avoid using return(x) at the end of the function
}

Now we can compute a new column in a data frame in one step:

DF4$x1 <- with( DF4, fv1( CA, BA ) ) # this works
head( DF4 )
##   BA   CA        x BAf       x1
## 1  1 -5.0 1.791313   1 1.791313
## 2  2 -5.0 1.449499   2 1.449499
## 3  3 -5.0 1.192597   3 1.192597
## 4  4 -5.0 1.000030   4 1.000030
## 5  1 -4.9 1.769381   1 1.769381
## 6  2 -4.9 1.428994   2 1.428994
identical( DF4$x, DF4$x1 ) # test whether result is the same
## [1] TRUE

In fact, this pattern is common enough that a function has been included in R that does it for us: the Vectorize function. This function accepts the original function and the names of the scalar arguments, and returns to us a new function that can be called with vector arguments.

fv2 <- Vectorize( f, c( "CA", "BA" ) )
DF4$x2 <- with( DF4, fv2( CA, BA ) ) # this works
head( DF4 )
##   BA   CA        x BAf       x1       x2
## 1  1 -5.0 1.791313   1 1.791313 1.791313
## 2  2 -5.0 1.449499   2 1.449499 1.449499
## 3  3 -5.0 1.192597   3 1.192597 1.192597
## 4  4 -5.0 1.000030   4 1.000030 1.000030
## 5  1 -4.9 1.769381   1 1.769381 1.769381
## 6  2 -4.9 1.428994   2 1.428994 1.428994
identical( DF4$x, DF4$x2 ) # test whether result is the same
## [1] TRUE

and the previous dplyr solution can be shortened:

DF5 <- (   expand.grid( BA = 1:4, CA = seq( -5, -1, 0.1 ) ) # construct the inputs
       %>% mutate( BAf = factor( BA ) # vectorwise calculation of a new column
                 , x = fv2( CA, BA ) # "vectorwise" calculation of a new column
                 )
       %>% as.data.frame # optionally convert result back to a plain data frame
       )
ggplot( DF5, aes( x = CA, y = x, colour = BAf ) ) +
    geom_line( size = 3 ) +
    scale_color_discrete( name = "BA" )
Figure 7-1. Vectorwise dplyr version of quadratic equation solutions

Keep in mind that the Vectorize-d form of the function does not run any faster than the for loop version would; it is just much easier to use alongside the usual vectorwise data that R programmers work with. The speed improvement normally associated with vectorized functions comes from the typically-faster processing of compiled vectorwise code, which is not what is happening here.
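Under the hood, Vectorize is built on mapply (see ?Vectorize), which we can also call directly; a quick sketch using the same f:

x3 <- with( DF4, mapply( f, CA, BA ) ) # one call of f per row, like fv1 and fv2
identical( DF4$x, x3 )
## [1] TRUE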

Handling Time Data

R has no concept of time-of-day independent of date, mostly because the number and numbering of hours in a day are different on the days when daylight savings time starts and stops. If you really want to process time data that has no date information, then you need to assume a date so R can work with it.

If you are dealing with date data (no time of day), the Date type can be used without concern about timezones. However, if time of day is part of your analysis, then the base R tools for handling it are the POSIXlt and POSIXct types, which follow the (surprise) POSIX standard for timekeeping, and the difftime type for time intervals (time arithmetic). There are some packages that can be used if you want to ignore timezone idiosyncrasies (zoo and chron are a couple).

The rest of this discussion focuses on POSIXct, with a little POSIXlt and difftime as well.
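As a minimal sketch of those three types in action (anticipating the spring-forward example below):

d1 <- as.Date( "2015-03-08" ) # a Date: no time of day, so no timezone worries
t1 <- as.POSIXct( "2015-03-08 01:45:00", tz = "America/Los_Angeles" ) # POSIXct
t2 <- as.POSIXct( "2015-03-08 03:00:00", tz = "America/Los_Angeles" )
t2 - t1 # a difftime: only 15 minutes, because the clocks spring forward in between
## Time difference of 15 mins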

First let’s read in some time data.

tmdta1 <- read.csv( text = 
"Hrs,DateTime1,DST1,DateTime2,DST2,DateTime2s,DST2s
0,3/8/2015 0:00,ST,11/1/2015 0:00,DST,10/31/2015 23:00,ST
0.25,3/8/2015 0:15,ST,11/1/2015 0:15,DST,10/31/2015 23:15,ST
0.5,3/8/2015 0:30,ST,11/1/2015 0:30,DST,10/31/2015 23:30,ST
0.75,3/8/2015 0:45,ST,11/1/2015 0:45,DST,10/31/2015 23:45,ST
1,3/8/2015 1:00,ST,11/1/2015 1:00,DST,11/1/2015 0:00,ST
1.25,3/8/2015 1:15,ST,11/1/2015 1:15,DST,11/1/2015 0:15,ST
1.5,3/8/2015 1:30,ST,11/1/2015 1:30,DST,11/1/2015 0:30,ST
1.75,3/8/2015 1:45,ST,11/1/2015 1:45,DST,11/1/2015 0:45,ST
2,3/8/2015 3:00,DST,11/1/2015 1:00,ST,11/1/2015 1:00,ST
2.25,3/8/2015 3:15,DST,11/1/2015 1:15,ST,11/1/2015 1:15,ST
2.5,3/8/2015 3:30,DST,11/1/2015 1:30,ST,11/1/2015 1:30,ST
2.75,3/8/2015 3:45,DST,11/1/2015 1:45,ST,11/1/2015 1:45,ST
3,3/8/2015 4:00,DST,11/1/2015 2:00,ST,11/1/2015 2:00,ST
", as.is=TRUE )
str( tmdta1 )
## 'data.frame':    13 obs. of  7 variables:
##  $ Hrs       : num  0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 ...
##  $ DateTime1 : chr  "3/8/2015 0:00" "3/8/2015 0:15" "3/8/2015 0:30" "3/8/2015 0:45" ...
##  $ DST1      : chr  "ST" "ST" "ST" "ST" ...
##  $ DateTime2 : chr  "11/1/2015 0:00" "11/1/2015 0:15" "11/1/2015 0:30" "11/1/2015 0:45" ...
##  $ DST2      : chr  "DST" "DST" "DST" "DST" ...
##  $ DateTime2s: chr  "10/31/2015 23:00" "10/31/2015 23:15" "10/31/2015 23:30" "10/31/2015 23:45" ...
##  $ DST2s     : chr  "ST" "ST" "ST" "ST" ...

The DateTime1 and DateTime2 columns contain characters that cannot all be converted to numeric, and the as.is=TRUE option was specified, so read.csv left them as character data. This retains the content of the columns exactly as it was given to R, but in this form it cannot be treated like continuous data in plots, nor can time intervals be selected. If you need these features (you might not always), we can convert the columns to a data type suited to such operations.

Since the DateTime columns contain time along with date information, we cannot use the Date type. The POSIXct and POSIXlt types are the standard choices supported by R for these purposes. The POSIXct type is the “compact” type, suitable for arithmetic and for storing in data frames. The POSIXlt type is like a miniature data frame with nine or more columns representing separate parts of the time (seconds, minutes, hours, day, month, year, etc.).

The hardest part about using the POSIX types is dealing with time zones… but then time zones are probably also the best thing about them, too. Keep in mind that R did not invent the calendar, nor is it responsible for daylight savings time or leap days… so having a system powerful enough to handle all of those special cases for you should be considered a blessing rather than a curse.

Spring Forward

Here is an example of converting to POSIXct:

tmdta1$DtmCivil1 <- as.POSIXct( tmdta1$DateTime1
                              , tz = "America/Los_Angeles"
                              , format = "%m/%d/%Y %H:%M"
                              )
str( tmdta1 )
## 'data.frame':    13 obs. of  8 variables:
##  $ Hrs       : num  0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 ...
##  $ DateTime1 : chr  "3/8/2015 0:00" "3/8/2015 0:15" "3/8/2015 0:30" "3/8/2015 0:45" ...
##  $ DST1      : chr  "ST" "ST" "ST" "ST" ...
##  $ DateTime2 : chr  "11/1/2015 0:00" "11/1/2015 0:15" "11/1/2015 0:30" "11/1/2015 0:45" ...
##  $ DST2      : chr  "DST" "DST" "DST" "DST" ...
##  $ DateTime2s: chr  "10/31/2015 23:00" "10/31/2015 23:15" "10/31/2015 23:30" "10/31/2015 23:45" ...
##  $ DST2s     : chr  "ST" "ST" "ST" "ST" ...
##  $ DtmCivil1 : POSIXct, format: "2015-03-08 00:00:00" "2015-03-08 00:15:00" ...

Note that DtmCivil1 is printed differently than it was represented in the CSV file. This is typical, since a POSIXct value is represented internally as the number of seconds since some point in time (the “epoch”, typically 1970-01-01 0:00 GMT), and whenever it is printed some timezone and format have to be assumed. That timezone may be associated with the POSIXct vector itself (from the tz argument), or, if the vector's tz is "", it may be whatever is currently set as the default in your R workspace (see Sys.setenv(TZ=...) below). The default time format used in R is YYYY-mm-dd HH:MM:SS.
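To see the separation between the stored value and its presentation, a quick sketch using the column we just made:

unclass( tmdta1$DtmCivil1[ 1 ] ) # the raw number of seconds since the epoch
## [1] 1425801600
## attr(,"tzone")
## [1] "America/Los_Angeles"
format( tmdta1$DtmCivil1[ 1 ], tz = "UTC" ) # the same instant, rendered in another timezone
## [1] "2015-03-08 08:00:00"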

The “America/Los_Angeles” time definition comes from an open-source timezone database (Olson) that is used in many Unix-based operating systems. R for Windows bundles this database with the installation package as well, because the built-in Windows timezone names are too confusing for most people to adapt to. The more familiar “Pacific” or “PST” notation may or may not work for input because it is not universal… these notations are used on the west side of the Pacific ocean as well as the east (American) side. To identify which timezones are available to you, you should be able to use the OlsonNames() function.
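For example (the exact contents of the database vary by system):

grep( "Los_Angeles", OlsonNames(), value = TRUE ) # find the zone we used above
## [1] "America/Los_Angeles"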

The format option specification notation is documented in the strptime help page in R. Note that %m represents month while %M represents minutes, and %y represents a two-digit year while %Y represents a four-digit year. The use of month/day/year is also regional… day/month/year is familiar to Europeans, and year-month-day is often used as a standardized format in computer applications because it sorts well even when stored as character strings.
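A quick sketch of how the format string changes the interpretation of the same instant (the input strings are hypothetical):

p1 <- as.POSIXct( "3/8/2015 1:45", format = "%m/%d/%Y %H:%M", tz = "Etc/GMT+8" ) # US month/day/year
p2 <- as.POSIXct( "8/3/2015 1:45", format = "%d/%m/%Y %H:%M", tz = "Etc/GMT+8" ) # European day/month/year
p3 <- as.POSIXct( "2015-03-08 01:45", tz = "Etc/GMT+8" ) # ISO-style, parsed by the default formats
identical( p1, p2 ) && identical( p2, p3 )
## [1] TRUE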

In Figure 8-1, R demonstrates that it understands that, in the spring, 3:00am Daylight Savings Time comes 15 minutes after 1:45am Standard Time in the DtmCivil1 column.

ggplot( tmdta1, aes( x=Hrs, y=DtmCivil1 ) ) +
  geom_point()
Figure 8-1. Civil Time versus Hours, Spring Forward

We can confirm this by calculation as well:

tmdta1$DtmCivil1[ 8:9 ]
## [1] "2015-03-08 01:45:00 PST" "2015-03-08 03:00:00 PDT"
deltaT1 <- tmdta1$DtmCivil1[ 9 ] - tmdta1$DtmCivil1[ 8 ]
str( deltaT1 )
##  'difftime' num 15
##  - attr(*, "units")= chr "mins"
deltaT1
## Time difference of 15 mins

The deltaT1 variable is a difftime value that gets stored in convenient units. If you need to convert it to numeric for some reason, you should always specify the units, because otherwise the units change automatically as larger or smaller values are computed:

deltaTh <- tmdta1$DtmCivil1[ 9 ] - tmdta1$DtmCivil1[ 5 ]
deltaTh # User-friendly output
## Time difference of 1 hours
as.numeric( deltaT1 )  # minutes... dangerous
## [1] 15
as.numeric( deltaTh ) # inconsistent with deltaT1 ... dangerous
## [1] 1
as.numeric( deltaT1, units="mins" )  # consistent output
## [1] 15
as.numeric( deltaTh, units="mins" ) # consistent output
## [1] 60

Fall Back

Although the timezone definition can handle the Spring Forward case without help, the Fall Back case usually leads to ambiguity:

tmdta1$DtmCivil2a <- as.POSIXct( tmdta1$DateTime2
                               , tz = "America/Los_Angeles"
                               , format = "%m/%d/%Y %H:%M"
                               )
str( tmdta1 )
## 'data.frame':    13 obs. of  9 variables:
##  $ Hrs       : num  0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 ...
##  $ DateTime1 : chr  "3/8/2015 0:00" "3/8/2015 0:15" "3/8/2015 0:30" "3/8/2015 0:45" ...
##  $ DST1      : chr  "ST" "ST" "ST" "ST" ...
##  $ DateTime2 : chr  "11/1/2015 0:00" "11/1/2015 0:15" "11/1/2015 0:30" "11/1/2015 0:45" ...
##  $ DST2      : chr  "DST" "DST" "DST" "DST" ...
##  $ DateTime2s: chr  "10/31/2015 23:00" "10/31/2015 23:15" "10/31/2015 23:30" "10/31/2015 23:45" ...
##  $ DST2s     : chr  "ST" "ST" "ST" "ST" ...
##  $ DtmCivil1 : POSIXct, format: "2015-03-08 00:00:00" "2015-03-08 00:15:00" ...
##  $ DtmCivil2a: POSIXct, format: "2015-11-01 00:00:00" "2015-11-01 00:15:00" ...
ggplot( tmdta1, aes( x=Hrs, y=DtmCivil2a ) ) +
  geom_point()
Figure 8-2. Civil Time versus Hours, Fall Back is a problem

In Figure 8-2 we see from the y-axis that while R is prepared to understand that 1am standard time is one hour after 1am daylight time, the information needed to distinguish between them is not present in the DateTime2 column, so both sets of 1am values get interpreted as daylight savings time.

In this case I have cheated and created a column DST2 that indicates whether each DateTime2 value is daylight savings time or standard time. We can use this information to “fix” the DtmCivil2a time if it is represented in the “long” (POSIXlt) format:

# create a temporary variable not in the data frame
d <- as.POSIXlt( tmdta1$DtmCivil2a )
str( d ) # R hides the gory details
##  POSIXlt[1:13], format: "2015-11-01 00:00:00" "2015-11-01 00:15:00" "2015-11-01 00:30:00" ...
str( unclass( d ) ) # forget the class of d
## List of 11
##  $ sec   : num [1:13] 0 0 0 0 0 0 0 0 0 0 ...
##  $ min   : int [1:13] 0 15 30 45 0 15 30 45 0 15 ...
##  $ hour  : int [1:13] 0 0 0 0 1 1 1 1 1 1 ...
##  $ mday  : int [1:13] 1 1 1 1 1 1 1 1 1 1 ...
##  $ mon   : int [1:13] 10 10 10 10 10 10 10 10 10 10 ...
##  $ year  : int [1:13] 115 115 115 115 115 115 115 115 115 115 ...
##  $ wday  : int [1:13] 0 0 0 0 0 0 0 0 0 0 ...
##  $ yday  : int [1:13] 304 304 304 304 304 304 304 304 304 304 ...
##  $ isdst : int [1:13] 1 1 1 1 1 1 1 1 1 1 ...
##  $ zone  : chr [1:13] "PDT" "PDT" "PDT" "PDT" ...
##  $ gmtoff: int [1:13] -25200 -25200 -25200 -25200 -25200 -25200 -25200 -25200 -25200 -25200 ...
##  - attr(*, "tzone")= chr [1:3] "America/Los_Angeles" "PST" "PDT"
# read the help ?DateTimeClasses
# optional: View( unclass( d ) )
d$isdst
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 0
d$isdst <- ifelse( "DST" == tmdta1$DST2, 1, 0 )
d$isdst
##  [1] 1 1 1 1 1 1 1 1 0 0 0 0 0
tmdta1$DtmCivil2b <- as.POSIXct( d )
ggplot( tmdta1, aes( x=Hrs, y=DtmCivil2b ) ) +
  geom_point()
Figure 8-3. Civil Time versus Hours, Fall Back Fix 1

If we don’t have the DST2 column handy, then we can reconstruct it using knowledge about the data. For example, if we know that the timestamp column was actually sampled regularly and monotonically, then duplicated timestamps must be in standard time:

duplicated( tmdta1$DtmCivil2a )
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [13] FALSE
d <- as.POSIXlt( tmdta1$DtmCivil2a )
d$isdst
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 0
d$isdst <- ifelse( duplicated( tmdta1$DtmCivil2a )
                 , 0
                 , d$isdst
                 )
d$isdst
##  [1] 1 1 1 1 1 1 1 1 0 0 0 0 0
tmdta1$DtmCivil2c <- as.POSIXct( d )
ggplot( tmdta1, aes( x=Hrs, y=DtmCivil2c ) ) +
  geom_point()
Figure 8-4. Civil Time versus Hours, Fall Back Fix 2

However, these solutions are hacks, so by far the least painful approach is to work with standard-time input data as much as possible:

tmdta1$DtmCivil2s <- as.POSIXct( tmdta1$DateTime2s
                               , tz = "Etc/GMT+8"
                               , format = "%m/%d/%Y %H:%M"
                               )
str( tmdta1 )
## 'data.frame':    13 obs. of  12 variables:
##  $ Hrs       : num  0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 ...
##  $ DateTime1 : chr  "3/8/2015 0:00" "3/8/2015 0:15" "3/8/2015 0:30" "3/8/2015 0:45" ...
##  $ DST1      : chr  "ST" "ST" "ST" "ST" ...
##  $ DateTime2 : chr  "11/1/2015 0:00" "11/1/2015 0:15" "11/1/2015 0:30" "11/1/2015 0:45" ...
##  $ DST2      : chr  "DST" "DST" "DST" "DST" ...
##  $ DateTime2s: chr  "10/31/2015 23:00" "10/31/2015 23:15" "10/31/2015 23:30" "10/31/2015 23:45" ...
##  $ DST2s     : chr  "ST" "ST" "ST" "ST" ...
##  $ DtmCivil1 : POSIXct, format: "2015-03-08 00:00:00" "2015-03-08 00:15:00" ...
##  $ DtmCivil2a: POSIXct, format: "2015-11-01 00:00:00" "2015-11-01 00:15:00" ...
##  $ DtmCivil2b: POSIXct, format: "2015-11-01 00:00:00" "2015-11-01 00:15:00" ...
##  $ DtmCivil2c: POSIXct, format: "2015-11-01 00:00:00" "2015-11-01 00:15:00" ...
##  $ DtmCivil2s: POSIXct, format: "2015-10-31 23:00:00" "2015-10-31 23:15:00" ...

(The +8 doesn’t make sense mathematically for zones west of GMT, but that is historically how it was represented, because computer science people in the US did not want to be bothered with minus signs in their day-to-day work.)

ggplot( tmdta1, aes( x=Hrs, y=DtmCivil2s ) ) +
  geom_point()
Figure 8-5. Civil Time versus Hours, Bypassing Fall Back

Dataloggers often ignore daylight savings time anyway, so this is actually a reasonable solution for many engineering applications.

Tips on Timezones

Notice that the DtmCivil2a and DtmCivil2s values in each row of tmdta1 represent the same instants even though they are presented differently. For example:

attributes( tmdta1$DtmCivil2a[ 1 ] ) # tzone America/Los_Angeles
## $class
## [1] "POSIXct" "POSIXt" 
## 
## $tzone
## [1] "America/Los_Angeles"
attributes( tmdta1$DtmCivil2s[ 1 ] ) # tzone Etc/GMT+8
## $class
## [1] "POSIXct" "POSIXt" 
## 
## $tzone
## [1] "Etc/GMT+8"
tmdta1$DtmCivil2a[ 1 ] # appears DST
## [1] "2015-11-01 PDT"
tmdta1$DtmCivil2s[ 1 ] # appears ST
## [1] "2015-10-31 23:00:00 -08"
tmdta1$DtmCivil2a[ 1 ] == tmdta1$DtmCivil2s[ 1 ] # TRUE with warning
## Warning in check_tzones(e1, e2): 'tzone' attributes are inconsistent
## [1] TRUE

Be aware that each POSIXct or POSIXlt variable is a vector, and the timezone associated with that vector applies to all of its elements. This means that R cannot directly deal with a data frame column containing a different timezone in each row. I have handled this in the past by converting each value separately and either a) keeping them in separate single-element vectors, or b) putting them into one vector and displaying them using one timezone.
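A minimal sketch of approach b), using hypothetical two-row data (the df, when, and tz names are made up):

df <- data.frame( when = c( "2015-06-01 12:00", "2015-06-01 12:00" )
                , tz = c( "America/Los_Angeles", "America/New_York" )
                , stringsAsFactors = FALSE )
# convert each row with its own timezone, then store as one UTC-displayed vector
df$utc <- as.POSIXct( mapply( function( w, z ) as.POSIXct( w, tz = z, format = "%Y-%m-%d %H:%M" )
                            , df$when, df$tz, USE.NAMES = FALSE )
                    , origin = "1970-01-01", tz = "UTC" )
df$utc
## [1] "2015-06-01 19:00:00 UTC" "2015-06-01 16:00:00 UTC"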

If your analysis works entirely in one timezone, you can often set your default timezone once and not worry about tz arguments in your code. For example, to work with PST data:

Sys.setenv( TZ = "Etc/GMT+8" ) # works on most, but perhaps not all, operating systems
# then, for example (mydata.csv is hypothetical):
# dta <- read.csv( "mydata.csv", as.is = TRUE ) # read data without converting to factors
# dta$DateTime <- as.POSIXct( dta$DateTime, format = "%m/%d/%Y %H:%M" ) # uses the default TZ
# ... analyze and output ...