Topic

Often there are missing records, duplicated records, and records with unexpected content (e.g. the string Null in a column of numbers). The following steps describe preparing the input data to be used for the October 2018 lightning talk.

Residential Electricity Usage

Sample residential electric load data from London, England[^5] (pre-trimmed to one house)

dta <- read.csv( "../data/MAC000002.csv"
               , as.is = TRUE # don't convert character to factor
               , check.names = FALSE # don't replace odd characters in column names
               )
str(dta)

## 'data.frame':    24158 obs. of  3 variables:
##  $ LCLid                 : chr  "MAC000002" "MAC000002" "MAC000002" "MAC000002" ...
##  $ DateTime              : chr  "2012-10-12 00:30:00.0000000" "2012-10-12 01:00:00.0000000" "2012-10-12 01:30:00.0000000" "2012-10-12 02:00:00.0000000" ...
##  $ KWH/hh (per half hour): chr  " 0 " " 0 " " 0 " " 0 " ...

The timestamp column is stored as character data because read.csv does not automatically recognize timestamps.

Base R Timestamp

Tell R to assume timezone is Greenwich Mean Time (or Universal Time Coordinated = UTC)

Sys.setenv( TZ = "GMT" ) # when you don't know how the data was encoded, use GMT

Make a Dtm column using base R (works without extra packages, but not as convenient)

dta_b <- dta  # make a copy so we can compare methods later
dta_b$Dtm <- as.POSIXct( dta_b$DateTime )
str( dta_b$Dtm ) # confirming new column type

##  POSIXct[1:24158], format: "2012-10-12 00:30:00" "2012-10-12 01:00:00" "2012-10-12 01:30:00" ...

Check Consistency of Intervals (1)

Sometimes there are missing records, so being able to calculate how much time elapsed between records is useful. The diff function subtracts consecutive values in a vector:

head( diff( dta_b$Dtm ) )

## Time differences in secs
## [1] 1800 1800 1800 1800 1800 1800

Check Consistency of Intervals (2)

Note the special label telling you what units the difference is in… R may sometimes choose minutes, hours or days which is fine for people but can be confusing if you let the computer work with that answer.

head( as.numeric( diff( dta_b$Dtm ) 
                , units="mins" # convert from "whatever" to "minutes"
                ) 
      )

## [1] 30 30 30 30 30 30

Check Consistency of Intervals (3)

Lets ask R to look through all 24 thousand time differences to see if there are non-30-minute intervals:

table( as.numeric( diff( dta_b$Dtm ), units="mins" ) )

## 
##     0  7.45 22.55    30    60  1470 
##    17     1     1 24133     4     1

Although most of the intervals are 30 minutes, we can see that there are some some missing records (60, 1470), duplicates (0) and non-half-hour timestamps (7.45, 22.55). Note there are no reversed time sequences (negative differences).

Check Consistency of Intervals (4)

library(ggplot2)
dtmdif <- as.numeric( diff( dta_b$Dtm ), units="hours" )
qplot( dta_b$Dtm[ -nrow( dta_b ) ], dtmdif, geom = "line", xlab="Time", ylab = "Difftime (hours)" )

Seems to have occasional one-hour jumps (skipped record), and more regular duplicate records.

Find Misleading Timestamps

Review records with the same timestamp (duplicated function only marks the second and following instances)

dupidx <- which( duplicated( dta_b$Dtm ) ) # get integer indexes where duplicated is true
head( dta_b[ dta_b$Dtm %in% dta_b$Dtm[ dupidx ], ] )

##          LCLid                    DateTime KWH/hh (per half hour)        Dtm
## 383  MAC000002 2012-10-20 00:00:00.0000000                   0.2  2012-10-20
## 384  MAC000002 2012-10-20 00:00:00.0000000                   0.2  2012-10-20
## 1823 MAC000002 2012-11-20 00:00:00.0000000                 0.258  2012-11-20
## 1824 MAC000002 2012-11-20 00:00:00.0000000                 0.258  2012-11-20
## 3312 MAC000002 2012-12-21 00:00:00.0000000                 0.238  2012-12-21
## 3313 MAC000002 2012-12-21 00:00:00.0000000                 0.238  2012-12-21

Seems alright to remove timestamp duplicates because KWH is also duplicated.

Remove Misleading Timestamps

Remove rows where all fields are the same by referring only to the data frame duplicated(dta_b):

dta_b2 <- dta_b[ !duplicated( dta_b ), ]
dtmdif2 <- as.numeric( diff( dta_b2$Dtm ), units="hours" )
qplot( dta_b2$Dtm[ -nrow( dta_b2 ) ], dtmdif2, geom = "line", xlab="Time", ylab = "Difftime (hours)" )

Still one too-small difference…

Find Misleading Timestamps (2)

smalldifidx <- which( dtmdif2 < 0.5 )
smalldifidx

## [1] 3238 3239

dta_b2[ 3237:3240, ]

##          LCLid                    DateTime KWH/hh (per half hour)
## 3239 MAC000002 2012-12-19 12:00:00.0000000                  0.36 
## 3240 MAC000002 2012-12-19 12:30:00.0000000                 0.233 
## 3241 MAC000002 2012-12-19 12:37:27.0000000                   Null
## 3242 MAC000002 2012-12-19 13:00:00.0000000                 0.147 
##                      Dtm
## 3239 2012-12-19 12:00:00
## 3240 2012-12-19 12:30:00
## 3241 2012-12-19 12:37:27
## 3242 2012-12-19 13:00:00

Extra record between two valid records, can remove it.

Remove Misleading Timestamps (2)

dta_b3 <- dta_b2[ -3239, ]
dtmdif3 <- as.numeric( diff( dta_b3$Dtm ), units="hours" )
qplot( dta_b3$Dtm[ -nrow( dta_b3 ) ]
     , dtmdif3
     , geom = "line"
     , xlab="Time"
     , ylab = "Difftime (hours)" )

Fix Numeric Data

The energy column should be numeric but is still stored as character data because when it was read in there were extra spaces around the numbers and the Null value record that we deleted. The trimws base R function can clean up the energy column:

dta_b3$KWH <- as.numeric( trimws( dta_b3$`KWH/hh (per half hour)` ) )
str( dta_b3 )

## 'data.frame':    24140 obs. of  5 variables:
##  $ LCLid                 : chr  "MAC000002" "MAC000002" "MAC000002" "MAC000002" ...
##  $ DateTime              : chr  "2012-10-12 00:30:00.0000000" "2012-10-12 01:00:00.0000000" "2012-10-12 01:30:00.0000000" "2012-10-12 02:00:00.0000000" ...
##  $ KWH/hh (per half hour): chr  " 0 " " 0 " " 0 " " 0 " ...
##  $ Dtm                   : POSIXct, format: "2012-10-12 00:30:00" "2012-10-12 01:00:00" ...
##  $ KWH                   : num  0 0 0 0 0 0 0 0 0 0 ...

sum( is.na( dta_b3$KWH ) )

## [1] 0

Strip Fractional Seconds

sum( ".0000000" != substr( dta_b3$DateTime, 20, 27 ) )

## [1] 0

None of the records have anything but zeroes after the decimal point, so chop them off to make converting the timestamp easier.

dta_b4 <- dta_b3
dta_b4$DateTime <- substr( dta_b4$DateTime, 1, 19 )

Save Cleaned Residential Data

write.csv( dta_b4[ , c( "LCLid", "DateTime", "KWH" ) ]
         , file = "../data/MAC000002clean.csv"
         , row.names = FALSE
         , quote = FALSE
         )

Save a version encoded with a different Timezone

Sys.setenv( TZ = "US/Pacific" )
dta_b5 <- dta_b4
dta_b5$DateTime <- as.character( dta_b5$Dtm ) 
write.csv( dta_b5[ , c( "LCLid", "DateTime", "KWH" ) ]
         , file = "../data/MAC000002cleanPT.csv"
         , row.names = FALSE
         , quote = FALSE
         )

What is `strptime`

To get help on the format argument used by as.POSIXct, look up the help page for ?strptime. This relationship isn’t obvious, but:

The help page for as.POSIXct and as.POSIXlt mentions strptime several times with hyperlinks to its help page.
as.POSIXct first calls as.POSIXlt and then converts that into POSIXct. The as.POSIXlt function uses strptime to convert character data into POSIXlt representation.
strptime is the name of a function from the POSIX standard which “parses” a string to make a (list) time.

Timestamp Cleaning