Jeff Newmiller
September 15, 2018
Often there are missing records, duplicated records, and records with unexpected content (e.g. the string Null
in a column of numbers). The following steps describe preparing the input data to be used for the October 2018 lightning talk.
Sample residential electric load data from London, England[^5] (pre-trimmed to one house)
dta <- read.csv( "../data/MAC000002.csv"
, as.is = TRUE # don't convert character to factor
, check.names = FALSE # don't replace odd characters in column names
)
str(dta)
## 'data.frame': 24158 obs. of 3 variables:
## $ LCLid : chr "MAC000002" "MAC000002" "MAC000002" "MAC000002" ...
## $ DateTime : chr "2012-10-12 00:30:00.0000000" "2012-10-12 01:00:00.0000000" "2012-10-12 01:30:00.0000000" "2012-10-12 02:00:00.0000000" ...
## $ KWH/hh (per half hour): chr " 0 " " 0 " " 0 " " 0 " ...
The timestamp column is stored as character data because read.csv does not automatically recognize timestamps.
Tell R to assume timezone is Greenwich Mean Time (or Universal Time Coordinated = UTC)
Make a Dtm
column using base R (works without extra packages, but not as convenient)
dta_b <- dta # make a copy so we can compare methods later
dta_b$Dtm <- as.POSIXct( dta_b$DateTime )
str( dta_b$Dtm ) # confirming new column type
## POSIXct[1:24158], format: "2012-10-12 00:30:00" "2012-10-12 01:00:00" "2012-10-12 01:30:00" ...
Sometimes there are missing records, so being able to calculate how much time elapsed between records is useful. The diff
function subtracts consecutive values in a vector:
## Time differences in secs
## [1] 1800 1800 1800 1800 1800 1800
Note the special label telling you what units the difference is in… R may sometimes choose minutes, hours or days which is fine for people but can be confusing if you let the computer work with that answer.
## [1] 30 30 30 30 30 30
Lets ask R to look through all 24 thousand time differences to see if there are non-30-minute intervals:
##
## 0 7.45 22.55 30 60 1470
## 17 1 1 24133 4 1
Although most of the intervals are 30 minutes, we can see that there are some some missing records (60, 1470), duplicates (0) and non-half-hour timestamps (7.45, 22.55). Note there are no reversed time sequences (negative differences).
library(ggplot2)
dtmdif <- as.numeric( diff( dta_b$Dtm ), units="hours" )
qplot( dta_b$Dtm[ -nrow( dta_b ) ], dtmdif, geom = "line", xlab="Time", ylab = "Difftime (hours)" )
Seems to have occasional one-hour jumps (skipped record), and more regular duplicate records.
Review records with the same timestamp (duplicated function only marks the second and following instances)
dupidx <- which( duplicated( dta_b$Dtm ) ) # get integer indexes where duplicated is true
head( dta_b[ dta_b$Dtm %in% dta_b$Dtm[ dupidx ], ] )
## LCLid DateTime KWH/hh (per half hour) Dtm
## 383 MAC000002 2012-10-20 00:00:00.0000000 0.2 2012-10-20
## 384 MAC000002 2012-10-20 00:00:00.0000000 0.2 2012-10-20
## 1823 MAC000002 2012-11-20 00:00:00.0000000 0.258 2012-11-20
## 1824 MAC000002 2012-11-20 00:00:00.0000000 0.258 2012-11-20
## 3312 MAC000002 2012-12-21 00:00:00.0000000 0.238 2012-12-21
## 3313 MAC000002 2012-12-21 00:00:00.0000000 0.238 2012-12-21
Seems alright to remove timestamp duplicates because KWH is also duplicated.
Remove rows where all fields are the same by referring only to the data frame duplicated(dta_b)
:
dta_b2 <- dta_b[ !duplicated( dta_b ), ]
dtmdif2 <- as.numeric( diff( dta_b2$Dtm ), units="hours" )
qplot( dta_b2$Dtm[ -nrow( dta_b2 ) ], dtmdif2, geom = "line", xlab="Time", ylab = "Difftime (hours)" )
Still one too-small difference…
## [1] 3238 3239
## LCLid DateTime KWH/hh (per half hour)
## 3239 MAC000002 2012-12-19 12:00:00.0000000 0.36
## 3240 MAC000002 2012-12-19 12:30:00.0000000 0.233
## 3241 MAC000002 2012-12-19 12:37:27.0000000 Null
## 3242 MAC000002 2012-12-19 13:00:00.0000000 0.147
## Dtm
## 3239 2012-12-19 12:00:00
## 3240 2012-12-19 12:30:00
## 3241 2012-12-19 12:37:27
## 3242 2012-12-19 13:00:00
Extra record between two valid records, can remove it.
The energy column should be numeric but is still stored as character data because when it was read in there were extra spaces around the numbers and the Null
value record that we deleted. The trimws
base R function can clean up the energy column:
## 'data.frame': 24140 obs. of 5 variables:
## $ LCLid : chr "MAC000002" "MAC000002" "MAC000002" "MAC000002" ...
## $ DateTime : chr "2012-10-12 00:30:00.0000000" "2012-10-12 01:00:00.0000000" "2012-10-12 01:30:00.0000000" "2012-10-12 02:00:00.0000000" ...
## $ KWH/hh (per half hour): chr " 0 " " 0 " " 0 " " 0 " ...
## $ Dtm : POSIXct, format: "2012-10-12 00:30:00" "2012-10-12 01:00:00" ...
## $ KWH : num 0 0 0 0 0 0 0 0 0 0 ...
## [1] 0
## [1] 0
None of the records have anything but zeroes after the decimal point, so chop them off to make converting the timestamp easier.
strptime
To get help on the format argument used by as.POSIXct
, look up the help page for ?strptime
. This relationship isn’t obvious, but:
as.POSIXct
and as.POSIXlt
mentions strptime
several times with hyperlinks to its help page.as.POSIXct
first calls as.POSIXlt
and then converts that into POSIXct
. The as.POSIXlt
function uses strptime
to convert character data into POSIXlt
representation.strptime
is the name of a function from the POSIX standard which “parses” a string to make a (list) time.