Jeff Newmiller
October 2, 2018 (updated May 10, 2020)
Will show examples of fixing common time-related problems:
Sample residential electric load data from London, England[^1]. Peak energy usage usually occurs around 8 pm, depending somewhat on day of week.
Why do the following simple commands make such messy x-axis labeling? And it is slow, too!
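The offending commands are not reproduced in this copy; a minimal sketch of the kind of naive code that triggers the problem might look like the following (qplot from ggplot2 is used later in this document; stringsAsFactors = TRUE mimics the pre-4.0.0 read.csv default so that DateTime becomes a factor):

library( ggplot2 )
dta <- read.csv( "../data/MAC000002clean.csv", stringsAsFactors = TRUE )
qplot( dta$DateTime, dta$KWH )  # tens of thousands of factor levels on the x axis: messy and slow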
In versions of R before 4.0.0 the read.csv function by default looked for floating point, integer, or logical types, and treated anything else as a factor. In R 4.0.0 and later, the default is to import unrecognized columns of data as character strings. Plotting character data directly does not work, so an error “NAs introduced by coercion” would appear.
Factors are labels, like “Male” or “Female”. The above plot assigns each record its own label, and then “skips” labels to avoid writing them on top of each other.
Date or POSIXct for date/time values

Timestamp values may look similar to character or factor representations, but they are much more flexible.
dta <- read.csv( "../data/MAC000002clean.csv"
# , as.is=TRUE # avoid auto-convert to factor (unnecessary in R 4.0.0)
)
# Simulate YYYY-mm-ddTHH:MM:SS ISO time format
dta$DateTime <- sub( " ", "T", dta$DateTime )
Sys.setenv( TZ = "US/Pacific" ) # If you forget TZ, this is the likely default for the SF Bay Area
dta$Dtm <- as.POSIXct( dta$DateTime ) # New column
str( dta )
## 'data.frame': 24140 obs. of 4 variables:
## $ LCLid : chr "MAC000002" "MAC000002" "MAC000002" "MAC000002" ...
## $ DateTime: chr "2012-10-12T00:30:00" "2012-10-12T01:00:00" "2012-10-12T01:30:00" "2012-10-12T02:00:00" ...
## $ KWH : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Dtm : POSIXct, format: "2012-10-12" "2012-10-12" ...
I prefer to make a new column to help troubleshoot the time conversion process.
Note there are ways to find out how to refer to timezones on Wikipedia[^2] and in R[^3].
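For example, OlsonNames() lists the timezone names R understands, so you can search it for a likely candidate (on most systems this returns "Europe/London"):

grep( "London", OlsonNames(), value = TRUE )
## [1] "Europe/London"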
Date or POSIXct for date/time values

The x-axis breaks are more sensible now:
| Item | Date | POSIXct | POSIXlt |
|---|---|---|---|
| from chr | as.Date( "2013-03-31" ) | as.POSIXct( "2013-03-31 03:30:00" ) | as.POSIXlt( "2013-03-31 03:30:00" ) |
| to chr | as.character( Dt ) | as.character( Dtm ) | as.character( Dtm ) |
| implementation | [1] 15795 | [1] 1364700600, attr(,"tzone") = "" | list of 9: sec 0, min 30, hour 3, mday 31, mon 2, year 113, wday 0, yday 89, isdst 0; attr "tzone" = "GMT" |
| units | days since 1970-01-01 GMT | sec since 1970-01-01 00:00:00 GMT | separate units for each of 9 elements |
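The “implementation” row can be reproduced with unclass(); a small sketch (the numeric values shown in the table assume a GMT-like TZ setting, so your tzone attribute may differ):

Dt  <- as.Date( "2013-03-31" )
Dtm <- as.POSIXct( "2013-03-31 03:30:00" )
Dlt <- as.POSIXlt( "2013-03-31 03:30:00" )
unclass( Dt )          # days since 1970-01-01
unclass( Dtm )         # seconds since 1970-01-01 00:00:00 GMT, plus a "tzone" attribute
str( unclass( Dlt ) )  # list of 9 calendar elements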
Converting formatted strings to POSIXct

The converted Dtm column does not seem to have hours and minutes in it:
## LCLid DateTime KWH Dtm
## 1 MAC000002 2012-10-12T00:30:00 0 2012-10-12
## 2 MAC000002 2012-10-12T01:00:00 0 2012-10-12
## 3 MAC000002 2012-10-12T01:30:00 0 2012-10-12
## 4 MAC000002 2012-10-12T02:00:00 0 2012-10-12
## 5 MAC000002 2012-10-12T02:30:00 0 2012-10-12
## 6 MAC000002 2012-10-12T03:00:00 0 2012-10-12
Just to confirm this problem, we can plot a few days of Dtm versus row position:
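The plotting command itself is not shown in this copy; something along these lines (a window of four days of half-hour readings is an assumption) makes the check:

idx <- 1:192                  # roughly four days of half-hour records
qplot( idx, dta$Dtm[ idx ] )  # Dtm steps once per day, confirming the time-of-day was lost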
R is not correctly guessing what format the data are stored in.
Experiment: remove the choice to guess which format to use:
dta$Dtm <- as.POSIXct( dta$DateTime
, format = "%Y-%m-%dT%H:%M:%S"
)
which( is.na( dta$Dtm ) ) # this may not work on Linux/Mac
## integer(0)
## LCLid DateTime KWH Dtm
## 7105 MAC000002 2013-03-10T02:00:00 0.245 2013-03-10 01:00:00
## 7106 MAC000002 2013-03-10T02:30:00 0.169 2013-03-10 01:30:00
These timestamps correspond to the 2013 “spring-forward” daylight saving time change… technically, they are invalid in this timezone. R chose the “date-only” "%Y-%m-%d" format, which made no attempt to convert the time information, because the “date-only” format succeeded for all values in the input.
What does the time look like around the spring-forward records using the forced format?
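The subsetting expression is not reproduced here; printing the rows around the gap (row numbers taken from the output below) is one way to get this view:

dta[ 7103:7109, ]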
## LCLid DateTime KWH Dtm
## 7103 MAC000002 2013-03-10T01:00:00 0.247 2013-03-10 01:00:00
## 7104 MAC000002 2013-03-10T01:30:00 0.235 2013-03-10 01:30:00
## 7105 MAC000002 2013-03-10T02:00:00 0.245 2013-03-10 01:00:00
## 7106 MAC000002 2013-03-10T02:30:00 0.169 2013-03-10 01:30:00
## 7107 MAC000002 2013-03-10T03:00:00 0.104 2013-03-10 03:00:00
## 7108 MAC000002 2013-03-10T03:30:00 0.120 2013-03-10 03:30:00
## 7109 MAC000002 2013-03-10T04:00:00 0.089 2013-03-10 04:00:00
In the US/Pacific timezone, 3am is one hour after 1am… what if we plot this?
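The plotting chunk is not shown in this copy; a sketch matching the corrected plot later in this section would be:

idx <- 7101:7109
qplot( idx, dta$Dtm[ idx ] )  # note the repeated 01:00/01:30 values where 02:00/02:30 were expected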
But if this is actually how the data were recorded, why the bad data?
TZ Environment Variable

The solution is to tell R what the correct timezone is before converting from character. These data are from London, and they show no “spring-forward” gap, so the recording clock does not observe daylight saving time:
Sys.setenv( TZ = "GMT" )
dta$Dtm <- as.POSIXct( dta$DateTime, format = "%Y-%m-%dT%H:%M:%S" )
idx <- 7101:7109
qplot( idx, dta$Dtm[ idx ] )
No NA values, and time increments uniformly! Looking better!
- Set Sys.setenv(TZ=AppropriateTimeZone) at the beginning of your script.
- R can guess a couple of common formats (e.g. "%Y-%m-%d" or "%Y-%m-%d %H:%M:%S"), but the number of possibilities that R will not try is very large, so if at all possible work with data laid out just one way and tell R what that way is (e.g. as.POSIXct( Dtm, format="%Y-%m-%d %H:%M:%S" )).

Look at February and April 2013:
# find beginning of day for each record
dta$DBegin <- as.POSIXct( trunc( dta$Dtm, units="days" ) )
# find hours from midnight
dta$DTime <- as.numeric( dta$Dtm - dta$DBegin, units="hours" )
# find timestamp at beginning of month for each record
dta$MBegin <- as.POSIXct( paste0( substr( dta$DateTime, 1, 7 )
, "-01" ) )
dta2mo <- subset( dta
, dta$MBegin
%in% as.POSIXct( c( "2013-02-01"
, "2013-04-01" ) ) )
Even though the data are recorded in GMT, the people using the electricity react to the civil time clock. After converting from character, we can change TZ so the same timestamps will display differently when plotted or printed:
Sys.setenv( TZ = "Europe/London" )
# find beginning of day for each record
dta$DBegin <- as.POSIXct( trunc( dta$Dtm, units="days" ) )
# find hours from midnight
dta$DTime <- as.numeric( dta$Dtm - dta$DBegin, units="hours" )
# find timestamp at beginning of month for each record
dta$MBegin <- as.POSIXct( paste0( substr( dta$DateTime, 1, 7 )
, "-01" ) )
dta2mo <- subset( dta
, dta$MBegin
%in% as.POSIXct( c( "2013-02-01"
, "2013-04-01" ) ) )
An old solar weather file for San Francisco Airport[^4]:
These files were recorded in “local standard time” (LST; no daylight saving time). We could set TZ="GMT" if desired, but here we illustrate what an LST data set looks like, and show how to convert multiple numeric columns into a timestamp using ISOdatetime (a sketch follows the data listing below):
## 'data.frame': 8760 obs. of 15 variables:
## $ Yr : int 90 90 90 90 90 90 90 90 90 90 ...
## $ Mo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Dy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Hr : int 1 2 3 4 5 6 7 8 9 10 ...
## $ G_etdir: int 0 0 0 0 0 0 0 69 246 446 ...
## $ G_etdni: int 0 0 0 0 0 0 0 731 1415 1415 ...
## $ G_g : int 0 0 0 0 0 0 0 7 64 87 ...
## $ G_g_src: chr " ?" " ?" " ?" " ?" ...
## $ G_g_unc: int 0 0 0 0 0 0 0 5 5 4 ...
## $ G_n : int 0 0 0 0 0 0 0 0 54 13 ...
## $ G_n_src: chr " ?" " ?" " ?" " ?" ...
## $ G_n_unc: int 0 0 0 0 0 0 0 4 4 4 ...
## $ G_d : int 0 0 0 0 0 0 0 7 55 83 ...
## $ G_d_src: chr " ?" " ?" " ?" " ?" ...
## $ G_d_unc: int 0 0 0 0 0 0 0 5 5 5 ...
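A hedged sketch of that ISOdatetime conversion, assuming the data frame above is named dta3 and that a fixed-offset zone is used for Pacific local standard time (both names are assumptions; they are not shown in this copy):

Sys.setenv( TZ = "Etc/GMT+8" )           # assumed fixed-offset zone, i.e. no daylight saving
dta3$Dtm <- ISOdatetime( 1900 + dta3$Yr  # two-digit years counted from 1900
                       , dta3$Mo
                       , dta3$Dy
                       , dta3$Hr         # if hours run 1..24, hour 24 yields NA and needs extra handling
                       , 0, 0 )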
Yet another old source of solar data[^5]:
## 'data.frame': 8760 obs. of 43 variables:
## $ YYYY-MM-DD : chr "2010-01-01" "2010-01-01" "2010-01-01" "2010-01-01" ...
## $ HH:MM (LST) : chr "1:00" "2:00" "3:00" "4:00" ...
## $ Zenith (deg) : num 99 99 99 99 99 99 99 87.4 80.1 71.7 ...
## $ Azimuth (deg) : num -99 -99 -99 -99 -99 ...
## $ ETR (Wh/m^2) : int 0 0 0 0 0 0 0 36 243 443 ...
## $ ETRN (Wh/m^2) : int 0 0 0 0 0 0 0 790 1415 1415 ...
## $ SUNY Glo (Wh/m^2) : logi NA NA NA NA NA NA ...
## $ SUNY Glo Flg : int 99 99 99 99 99 99 99 99 99 99 ...
## $ SUNY Glo Unc (%) : logi NA NA NA NA NA NA ...
## $ SUNY Dir (Wh/m^2) : logi NA NA NA NA NA NA ...
## $ SUNY Dir Flg : int 99 99 99 99 99 99 99 99 99 99 ...
## $ SUNY Dir Unc (%) : logi NA NA NA NA NA NA ...
## $ SUNY Dif (Wh/m^2) : logi NA NA NA NA NA NA ...
## $ SUNY Dif Flg : int 99 99 99 99 99 99 99 99 99 99 ...
## $ SUNY Dif Unc (%) : logi NA NA NA NA NA NA ...
## $ METSTAT Glo (Wh/m^2) : int 0 0 0 0 0 0 0 5 42 183 ...
## $ METSTAT Glo Unc (%) : int 0 0 0 0 0 0 0 24 24 24 ...
## $ METSTAT Dir (Wh/m^2) : int 0 0 0 0 0 0 0 0 0 131 ...
## $ METSTAT Dir Unc (%) : int 0 0 0 0 0 0 0 27 27 27 ...
## $ METSTAT Dif (Wh/m^2) : int 0 0 0 0 0 0 0 5 42 142 ...
## $ METSTAT Dif Unc (%) : int 0 0 0 0 0 0 0 24 24 24 ...
## $ METSTAT CSKY Glo (Wh/m^2): int 0 0 0 0 0 0 0 12 126 275 ...
## $ METSTAT CSKY Glo Unc (%) : int 0 0 0 0 0 0 0 10 10 10 ...
## $ METSTAT CSKY Dir (Wh/m^2): int 0 0 0 0 0 0 0 130 544 732 ...
## $ METSTAT CSKY Dir Unc (%) : int 0 0 0 0 0 0 0 16 16 16 ...
## $ METSTAT CSKY Dif (Wh/m^2): int 0 0 0 0 0 0 0 6 32 46 ...
## $ METSTAT CSKY Dif Unc (%) : int 0 0 0 0 0 0 0 10 10 10 ...
## $ Meas Glo (Wh/m^2) : logi NA NA NA NA NA NA ...
## $ Meas Glo Flg : int 99 99 99 99 99 99 99 99 99 99 ...
## $ Meas Dir (Wh/m^2) : logi NA NA NA NA NA NA ...
## $ Meas Dir Flg : int 99 99 99 99 99 99 99 99 99 99 ...
## $ Meas Dif (Wh/m^2) : logi NA NA NA NA NA NA ...
## $ Meas Dif Flg : int 99 99 99 99 99 99 99 99 99 99 ...
## $ Precip Wat (cm) : num 2.6 2.6 2.5 2.5 2.5 2.6 2.6 2.6 2.6 2.6 ...
## $ Precip Wat Flg : int 3 51 51 3 51 51 3 51 51 3 ...
## $ AOD (unitless) : num 0.108 0.108 0.108 0.108 0.108 0.108 0.108 0.108 0.108 0.108 ...
## $ AOD Flg : int 1 1 1 1 1 1 1 1 1 1 ...
## $ AOD RAN (unitless) : num 0.108 0.108 0.108 0.108 0.108 0.108 0.108 0.052 0.052 0.052 ...
## $ AOD RAN Flg : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ozone (cm) : num 0.303 0.303 0.303 0.303 0.302 0.302 0.302 0.302 0.302 0.302 ...
## $ Ozone Flg : int 2 51 51 51 51 51 51 51 51 51 ...
## $ Albedo (unitless) : num 0.16 0.16 0.16 0.16 0.16 0.16 0.16 0.16 0.16 0.16 ...
## $ Albedo Flg : int 0 0 0 0 0 0 0 0 0 0 ...
The Date type is convenient if you don’t need to worry about time, but beware of comparing or calculating between Date and POSIXct:
Sys.setenv(TZ = "Etc/GMT+8" ) # local standard time only
dta4$Dt <- as.Date( dta4$`YYYY-MM-DD` )
dta4$Dtm <- as.POSIXct( paste( dta4$`YYYY-MM-DD`, dta4$`HH:MM (LST)` ) )
head( dta4[ , c( "YYYY-MM-DD", "HH:MM (LST)", "Dt", "Dtm" ) ] )
## YYYY-MM-DD HH:MM (LST) Dt Dtm
## 1 2010-01-01 1:00 2010-01-01 2010-01-01 01:00:00
## 2 2010-01-01 2:00 2010-01-01 2010-01-01 02:00:00
## 3 2010-01-01 3:00 2010-01-01 2010-01-01 03:00:00
## 4 2010-01-01 4:00 2010-01-01 2010-01-01 04:00:00
## 5 2010-01-01 5:00 2010-01-01 2010-01-01 05:00:00
## 6 2010-01-01 6:00 2010-01-01 2010-01-01 06:00:00
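The expression that produced the difference below is not shown in this copy; it was presumably something like subtracting the Date (which becomes midnight GMT when coerced to POSIXct) from the first timestamp:

dta4$Dtm[ 1 ] - as.POSIXct( dta4$Dt[ 1 ] )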
## Time difference of 9 hours
When a Date value is converted to POSIXct, it is always treated as if it were in the GMT timezone! If you need a midnight date that can be compared with or subtracted from POSIXct intuitively, then use as.POSIXct( trunc( Dtm, units="days" ) ). Also avoid mixing variables created with ISOdate and ISOdatetime for the same reason.
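A small sketch of that fix using the dta4 columns above; the truncated value stays in the session timezone, so the difference should now be the expected one hour rather than nine:

dta4$Dtm[ 1 ] - as.POSIXct( trunc( dta4$Dtm[ 1 ], units="days" ) )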
- Use Date when none of your data has time-of-day, and POSIXct if it has time values.
- Set TZ before using time functions!
- Avoid converting POSIXct directly to numeric… it is unnecessary in most cases and makes your time manipulations overly-complicated.
- Convert difftime to and from numeric as needed… just remember to specify the units in both conversion directions.
- Use sum( is.na( Dtm ) ) to check whether invalid date/times were present. These may indicate incorrect timezone assumptions (including failure to set TZ) if they occur in the daylight-saving-time spring-forward hour. On other OSes you may just have to look closely at transition times.

[^1]: https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households
[^2]: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
[^3]: ?OlsonNames
[^4]: NSRDB 1961-1990: Hourly Data Files, https://rredc.nrel.gov/solar/old_data/nsrdb/1961-1990/
[^5]: National Solar Radiation Data Base 1991-2010 Update, https://rredc.nrel.gov/solar/old_data/nsrdb/1991-2010/