This presentation discusses the contents of the SimpleData1.R example, which is intended to introduce a basic pattern of programming: the three kinds of tasks.
- Input
- Analysis (a.k.a. Processing)
- Output
4/21/2016
dta <- read.csv( "../data/sampledta1/Test1.csv" )
Potential hiccup: Windows conventionally uses \ to separate directory and file names, but the backslash is the escape character in R string literals. Either double each backslash when putting the string into R code ("C:\\TEMP\\file.csv") or use the forward slash alternative ("C:/TEMP/file.csv").
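A minimal sketch of the two spellings, using only the illustrative path from the text (no file is opened, so this runs on any platform). The variable names are placeholders chosen for this example.

```r
# Two ways to write the same Windows path in R source code.
escaped <- "C:\\TEMP\\file.csv"   # each \\ in the source is ONE backslash in the string
forward <- "C:/TEMP/file.csv"     # forward slashes need no escaping

# writeLines() shows the string as R stores it (single backslashes);
# print() would show the escaped source form instead.
writeLines(escaped)

# Both strings have the same number of characters, which confirms that the
# doubled backslashes are not stored twice.
nchar(escaped) == nchar(forward)

# chartr() swaps every backslash for a forward slash.
converted <- chartr("\\", "/", escaped)
identical(converted, forward)

# file.path() assembles a path from its pieces using "/" as the separator.
built <- file.path("C:", "TEMP", "file.csv")
identical(built, forward)
```

Building paths with file.path() or forward slashes avoids the escaping question entirely, which is probably the less error-prone habit.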
dta.lm <- lm( Reading ~ Seconds, data = dta )
lm is the “linear model” (regression) function. The dta.lm object holds the regression results.
The first argument is a “formula” because of the ~ operator. The left side indicates what y is, and the right side indicates that a coefficient is needed for Seconds. Unless a -1 appears on the right side, an intercept will also be computed.
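The effect of -1 can be sketched by fitting the same model both ways. The four data rows here are an assumption: they are inferred from the summary and regression output printed in this presentation, not read from Test1.csv. The names fit_with and fit_without are placeholders.

```r
# Assumption: these four rows are inferred from the output shown in this
# presentation; they are not read from Test1.csv.
dta <- data.frame(Seconds = c(0, 1, 2, 3),
                  Reading = c(2.2, 2.5, 3.0, 2.6))

# Default formula: an intercept plus a coefficient for Seconds.
fit_with <- lm(Reading ~ Seconds, data = dta)

# Adding -1 drops the intercept, forcing the line through the origin.
fit_without <- lm(Reading ~ Seconds - 1, data = dta)

names(coef(fit_with))     # "(Intercept)" "Seconds"
names(coef(fit_without))  # "Seconds"
```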
A general feature of analysis functions is that they don’t interact with the world directly. Information goes in as parameters, and one “blob” of data is returned as a result.
In fact, a good-quality analysis function won’t have side effects. R makes it difficult (but not impossible) to change global variables from inside functions. Don’t try to circumvent this… write your functions to direct all output through the return value, as the lm function does.
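One way this pattern might look in a function of your own is sketched here. The helper fit_line is hypothetical (it is not part of SimpleData1.R), and the data rows are inferred from the output printed in this presentation rather than read from Test1.csv.

```r
# Assumption: these four rows are inferred from the output shown in this
# presentation; they are not read from Test1.csv.
dta <- data.frame(Seconds = c(0, 1, 2, 3),
                  Reading = c(2.2, 2.5, 3.0, 2.6))

# Hypothetical helper: everything it needs arrives as an argument, and
# everything it produces leaves through the return value.
fit_line <- function(d) {
  fit <- lm(Reading ~ Seconds, data = d)   # local variable, gone after return
  est <- coef(fit)
  list(intercept = unname(est["(Intercept)"]),
       slope     = unname(est["Seconds"]))
}

# A variable named `fit` in the calling environment is untouched, because the
# assignment inside fit_line() creates a separate local variable.
fit <- "not a model"
result <- fit_line(dta)

result$slope   # the Seconds coefficient
fit            # still "not a model"
```

Returning a list keeps the single “blob” idea: the caller decides what to print, plot, or save.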
summary( dta )
##     Seconds        Reading     
##  Min.   :0.00   Min.   :2.200  
##  1st Qu.:0.75   1st Qu.:2.425  
##  Median :1.50   Median :2.550  
##  Mean   :1.50   Mean   :2.575  
##  3rd Qu.:2.25   3rd Qu.:2.700  
##  Max.   :3.00   Max.   :3.000
Overview of the contents of a data frame.
dta.lm # minimal printout
## 
## Call:
## lm(formula = Reading ~ Seconds, data = dta)
## 
## Coefficients:
## (Intercept)      Seconds  
##        2.32         0.17
Default console view of linear regression analysis result.
summary( dta.lm )
## 
## Call:
## lm(formula = Reading ~ Seconds, data = dta)
## 
## Residuals:
##     1     2     3     4 
## -0.12  0.01  0.34 -0.23 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   2.3200     0.2531   9.167   0.0117 *
## Seconds       0.1700     0.1353   1.257   0.3358  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3025 on 2 degrees of freedom
## Multiple R-squared:  0.4412, Adjusted R-squared:  0.1618 
## F-statistic: 1.579 on 1 and 2 DF,  p-value: 0.3358
Counterintuitively, summary yields a more detailed printout than just printing the object does.
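Part of the reason is that summary computes and returns a new object rather than just printing, so individual numbers can be pulled out of it. A minimal sketch follows, assuming data rows inferred from the output printed in this presentation (not read from Test1.csv); the name s is a placeholder.

```r
# Assumption: these four rows are inferred from the output shown in this
# presentation; they are not read from Test1.csv.
dta <- data.frame(Seconds = c(0, 1, 2, 3),
                  Reading = c(2.2, 2.5, 3.0, 2.6))
dta.lm <- lm(Reading ~ Seconds, data = dta)

# summary() returns an object of its own; the detailed printout is simply
# how that object prints.
s <- summary(dta.lm)
class(s)          # "summary.lm"

# Pieces of the printout are available as ordinary values.
s$r.squared       # multiple R-squared
coef(s)           # coefficient table as a numeric matrix

# Index the table by row and column name, e.g. the p-value for Seconds.
coef(s)["Seconds", "Pr(>|t|)"]
```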
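Three groups of plotting functions appear in this example:
- Base graphics functions “paint” directly on the screen
- lattice functions produce grid objects that can be printed
- ggplot2 functions produce grid objects using different syntax
Make sure when you search for graphing functions that you don’t try to mix and match… use one system at a time.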
# R base graphics... like "painting" on the screen
plot( dta$Seconds, dta$Reading )
abline( dta.lm, col = "blue" )
library(lattice)
p <- xyplot( Reading ~ Seconds, dta
           , panel = function( x, y ) {
               panel.xyplot( x, y )
               panel.abline( dta.lm, col = "blue" )
             }
           )
print( p )
library(ggplot2)
ggp <- ggplot( dta, aes( x=Seconds, y=Reading ) ) +
    geom_point() +
    geom_smooth( method="lm", se=FALSE, color = "blue" )
print( ggp )
## `geom_smooth()` using formula 'y ~ x'
ggp <-                      # ggplot produces a "printable" object
    ggplot( dta             # default source for data to plot
          , aes(            # define "aesthetic" map from data to display
              x = Seconds   # horizontal position
            , y = Reading   # vertical position
            )
          ) +               # ggplot objects are "added" together
    geom_point() +          # first layer plots data as points
    geom_smooth(            # second layer uses data to generate a "smooth" curve
        method = "lm"       # using linear regression
      , se = FALSE          # don't display confidence band
      , color = "blue"      # specify color of curve
      )
# object is not displayed until it is printed
# print( ggp )  # can be explicit, like this, or by interactive default
These three groups of plotting functions can be used consecutively, but they don’t work together (you cannot paint an abline on a ggplot object).
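Staying inside one system usually means finding that system's own equivalent. As a sketch, ggplot2 has a geom_abline layer that can draw the fitted line from the regression coefficients in place of the base abline call. This assumes the ggplot2 package is installed and uses data rows inferred from the output printed in this presentation (not read from Test1.csv); the name est is a placeholder.

```r
library(ggplot2)

# Assumption: these four rows are inferred from the output shown in this
# presentation; they are not read from Test1.csv.
dta <- data.frame(Seconds = c(0, 1, 2, 3),
                  Reading = c(2.2, 2.5, 3.0, 2.6))
dta.lm <- lm(Reading ~ Seconds, data = dta)
est <- coef(dta.lm)

# Add the fitted line as a ggplot2 layer instead of calling abline().
ggp <- ggplot(dta, aes(x = Seconds, y = Reading)) +
  geom_point() +
  geom_abline(intercept = est[["(Intercept)"]],
              slope     = est[["Seconds"]],
              color     = "blue")

# print( ggp )  # nothing is drawn until the object is printed
```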
My examples will primarily use ggplot2 functions.