By the end of this practical lab you will be able to:
Base R includes flexible functions for creating graphics, although static graphics are also very commonly created using the popular ggplot2 package. In this practical we will introduce base R plotting functions, ggplot2, and Plot.ly as a way of creating interactive graphics.
First we will read in some 2011 census data for the UK that we will use for the practical.
census <- read.csv("./data/census_small.csv")
This should look as follows…
head(census)
## Code Ward PCT_Good_Health PCT_Higher_Managerial
## 1 E05000886 Allerton and Hunts Cross 48.97327 10.091491
## 2 E05000887 Anfield 42.20538 2.912621
## 3 E05000888 Belle Vale 40.84911 3.931920
## 4 E05000889 Central 58.62832 7.019923
## 5 E05000890 Childwall 51.90538 10.787704
## 6 E05000891 Church 53.39201 17.437790
## PCT_Social_Rented_Households
## 1 13.005190
## 2 22.772576
## 3 42.555119
## 4 18.363917
## 5 6.937488
## 6 3.025153
As we showed in an earlier practical (4. Descriptive Statistics), we can provide a summary of the attributes using the summary() function:
summary(census)
## Code Ward PCT_Good_Health
## E05000886: 1 Allerton and Hunts Cross: 1 Min. :37.32
## E05000887: 1 Anfield : 1 1st Qu.:43.47
## E05000888: 1 Belle Vale : 1 Median :46.36
## E05000889: 1 Central : 1 Mean :46.68
## E05000890: 1 Childwall : 1 3rd Qu.:48.96
## E05000891: 1 Church : 1 Max. :58.63
## (Other) :24 (Other) :24
## PCT_Higher_Managerial PCT_Social_Rented_Households
## Min. : 2.418 Min. : 3.025
## 1st Qu.: 3.701 1st Qu.:15.372
## Median : 5.215 Median :21.930
## Mean : 7.012 Mean :26.629
## 3rd Qu.: 9.962 3rd Qu.:34.563
## Max. :17.438 Max. :57.993
##
However, it is also useful to present the distributions graphically. We can create a histogram using the hist() function, with additional options to specify the labels and the fill color (given here as a hex value).
# Histogram
hist(census$PCT_Good_Health, col="#00bfff", xlab="Percent", main="Histogram")
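The bins that hist() chooses can also be inspected or set explicitly. The following is a minimal sketch using a made-up vector (`pct` is hypothetical and is not the census data); it assumes bins of width 5 are wanted and uses `plot = FALSE` to return the histogram object without drawing it.

```r
# Hypothetical percentages standing in for census$PCT_Good_Health
pct <- c(37.3, 40.8, 42.2, 43.5, 45.1, 46.4, 47.0, 48.9, 51.9, 53.4, 58.6)

# Request bins of width 5 between 35 and 60; plot = FALSE returns the
# histogram object instead of drawing it
h <- hist(pct, breaks = seq(35, 60, by = 5), plot = FALSE)

h$breaks  # the bin edges
h$counts  # the number of values falling in each bin

# Draw the same histogram with the styling used in the practical
hist(pct, breaks = seq(35, 60, by = 5), col = "#00bfff",
     xlab = "Percent", main = "Histogram")
```

Setting `breaks` explicitly makes histograms of different variables easier to compare, because the bin edges no longer depend on the data.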
We might also be interested in the relationship between two variables. In the following plot, we show how the proportion of people in an area who identify themselves as being in good health relates to the proportion of people who live in socially rented housing.
plot(census$PCT_Good_Health,census$PCT_Social_Rented_Households,cex=.7,main="Good Health and Social Housing", xlab="% Good Health",ylab="% Social Housing",col="#00bfff",pch=19)
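A scatterplot can be complemented by a correlation coefficient and a fitted straight line. The sketch below is one possible way to do this in base R; it uses only the six wards printed by head(census) above (rounded to one decimal place), so the numbers it produces are illustrative rather than results for the full dataset.

```r
# Rounded values for the six wards shown in head(census)
good_health   <- c(49.0, 42.2, 40.8, 58.6, 51.9, 53.4)
social_rented <- c(13.0, 22.8, 42.6, 18.4, 6.9, 3.0)

# Pearson correlation coefficient between the two variables
r <- cor(good_health, social_rented)
r

# Scatterplot with the same styling as above
plot(good_health, social_rented, pch = 19, col = "#00bfff",
     xlab = "% Good Health", ylab = "% Social Housing")

# Overlay a straight line fitted by least squares
abline(lm(social_rented ~ good_health), lty = 2)
```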
As was shown in a previous practical (see 4. Descriptive statistics), a mean can be calculated as follows:
mean(census$PCT_Good_Health)
## [1] 46.67722
We can then use this to test each of the numbers contained in the “PCT_Good_Health” column.
census$PCT_Good_Health < mean(census$PCT_Good_Health)
## [1] FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
## [12] FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
## [23] FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
This returns a TRUE or FALSE value for each ward. We can then combine this test with the ifelse() function to create a new variable called “target”. Rather than TRUE and FALSE, ifelse() returns the values specified by its second and third parameters. In this case, these are the strings “Yes” and “No”.
# Calculate a target for PCT in good health
census$target <- ifelse(census$PCT_Good_Health < mean(census$PCT_Good_Health),"Yes","No")
# Calculate a target for PCT social housing
census$target2 <- ifelse(census$PCT_Social_Rented_Households < mean(census$PCT_Social_Rented_Households),"Yes","No")
You will now see that these values have been added as two new variables in the data frame object:
head(census)
## Code Ward PCT_Good_Health PCT_Higher_Managerial
## 1 E05000886 Allerton and Hunts Cross 48.97327 10.091491
## 2 E05000887 Anfield 42.20538 2.912621
## 3 E05000888 Belle Vale 40.84911 3.931920
## 4 E05000889 Central 58.62832 7.019923
## 5 E05000890 Childwall 51.90538 10.787704
## 6 E05000891 Church 53.39201 17.437790
## PCT_Social_Rented_Households target target2
## 1 13.005190 No Yes
## 2 22.772576 Yes Yes
## 3 42.555119 Yes No
## 4 18.363917 No Yes
## 5 6.937488 No Yes
## 6 3.025153 No Yes
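The logic of the comparison and of ifelse() is easier to follow on a very small vector. The sketch below uses four made-up numbers (the names `x`, `below` and `flag` are hypothetical and are not part of the census data).

```r
# Four made-up values; their mean is 48.75
x <- c(40, 50, 45, 60)

# Element-wise test: is each value below the mean?
below <- x < mean(x)
below

# Convert the logical vector into the labels "Yes" and "No"
flag <- ifelse(below, "Yes", "No")
flag

# TRUE counts as 1, so sum() gives the number of values below the mean
sum(below)
```

Here the first and third values are below the mean, so `flag` contains "Yes" in those positions and "No" elsewhere.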
A basic bar chart showing the number of wards within each category can be generated as follows:
#Create a table of the results
counts <- table(census$target)
barplot(counts, main="Target Distribution", xlab="Target",col="#00bfff")
You can also create stacked and side-by-side bar charts:
#Create a table of the results
counts <- table(census$target, census$target2)
#Create stacked bar chart
barplot(counts, main="Target Distribution", xlab="Target",col=c("#00bfff","#00cc66"),legend = rownames(counts))
#Create side by side bar chart
barplot(counts, main="Target Distribution", xlab="Target",col=c("#00bfff","#00cc66"),legend = rownames(counts),beside=TRUE)
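When barplot() is given a two-way table, each bar (or group of bars) corresponds to a column of the table and the coloured segments correspond to its rows. The sketch below illustrates this using only the six wards printed by head(census) above, so the counts differ from those for the full dataset.

```r
# target and target2 for the six wards shown in head(census)
target  <- c("No", "Yes", "Yes", "No", "No", "No")
target2 <- c("Yes", "Yes", "No", "Yes", "Yes", "Yes")

# Rows are the categories of target, columns the categories of target2
counts <- table(target, target2)
counts

# One group of bars per column (target2); one colour per row (target)
barplot(counts, beside = TRUE, col = c("#00bfff", "#00cc66"),
        legend.text = rownames(counts), xlab = "target2",
        main = "target by target2")
```

Swapping the order of the two arguments to table() swaps which variable appears on the x axis and which appears in the legend.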
We will now read in another dataset that shows the population of different racial groups within New York City between 1970 and 2010.
#Read data
racial <- read.csv("./data/NYC_Pop.csv")
# Create a plot of the total population, suppressing the default x-axis ticks (xaxt = "n")
plot(racial$Population,type = "o", col = "red", xlab = "Year", ylab = "Population", main = "Population over time",xaxt = "n")
# Add axis label
axis(1, at=1:5, labels=racial$Year)
It is also possible to add multiple lines to the plot using the lines() function:
# Create a plot of the White population (in units of 100,000), suppressing the default x-axis ticks
plot((racial$White)/100000,type = "o", col = "green", xlab = "Year", ylab = "Population (100k)", main = "Population over time",xaxt = "n",ylim=c(0,max(racial$White/100000)))
lines(racial$Black/100000, type = "o", col = "red")
lines(racial$Asian/100000, type = "o", col = "orange")
lines(racial$Hispanic_Latino/100000, type = "o", col = "blue")
# Add axis label
axis(1, at=1:5, labels=racial$Year)
#Add a legend
legend("topright", c("White","Black","Asian","Hispanic / Latino"), cex=0.8, col=c("green","red","orange","blue"),pch=1, lty=1)
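In the plot above the y-axis limit is taken from one column only, which works as long as that column holds the largest values. A more general approach, sketched below, computes the limits from all of the series before plotting. The data frame `pop` and its column names are hypothetical figures invented for illustration; they are not the New York City data.

```r
# Hypothetical figures in units of 100,000 people (NOT the NYC data)
pop <- data.frame(
  Year   = c(1970, 1980, 1990, 2000, 2010),
  GroupA = c(50, 37, 32, 28, 27),
  GroupB = c(15, 17, 18, 20, 19),
  GroupC = c(1, 2, 5, 8, 10)
)
series <- c("GroupA", "GroupB", "GroupC")
cols   <- c("green", "red", "orange")

# y-axis limits that cover zero and every value in every series
y_range <- range(0, unlist(pop[series]))

# Set up an empty plot (type = "n"), then add one line per series
plot(pop$Year, pop$GroupA, type = "n", ylim = y_range, xaxt = "n",
     xlab = "Year", ylab = "Population (100k)", main = "Population over time")
for (i in seq_along(series)) {
  lines(pop$Year, pop[[series[i]]], type = "o", col = cols[i])
}

# Label the x axis with the years actually present in the data
axis(1, at = pop$Year, labels = pop$Year)
legend("topright", legend = series, col = cols, pch = 1, lty = 1, cex = 0.8)
```

Plotting against `pop$Year` rather than the row index also means the tick positions do not need to be hard-coded as `1:5`.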
The ggplot2 library provides a range of functions that make graphs both visually appealing and simple to produce. There are two ways in which graphs can be created in ggplot2: the first is ggplot(), which we will discuss later, and the second is qplot(), which has a simplified syntax. Note that qplot() is deprecated as of ggplot2 3.4.0, so recent versions print a deprecation warning when the examples below are run.
library(ggplot2)
We can first create a bar chart using the factor column (“target”) of the data frame object “census”. The “geom” parameter tells qplot() what sort of plot to make. Recall that the target variable flags wards within Liverpool where the percentage of people in good health was less than the city mean.
qplot(target, data=census, geom="bar")
We can create a histogram by changing the “geom” and variable being plotted. Try adjusting the bin width, which alters the bins into which the values of the “PCT_Social_Rented_Households” column are aggregated.
qplot(PCT_Social_Rented_Households, data=census, geom="histogram",binwidth=10)
Another very common type of graph is a scatterplot, which will typically plot the values of two continuous variables against one another on the x and y axes of the graph. This graph looks at the relationship between the percentage of people in socially rented housing and those who are occupied in higher managerial roles. The default plot type is a scatterplot, so note that in the next couple of examples we do not include geom = "point"; however, this could be added and would return the same result (try it!).
qplot(PCT_Social_Rented_Households, PCT_Higher_Managerial, data = census)
In the previous graph all the points were black; however, if we map colour to a factor variable, which in this case is the “target” column, we can highlight its categories.
qplot(PCT_Social_Rented_Households, PCT_Higher_Managerial, data = census,colour=target)
Alternatively, you can also use “shape” to keep the points as black, but alter their shape by the factor variable.
qplot(PCT_Social_Rented_Households, PCT_Higher_Managerial, data = census,shape=target)
If we want to add a trend line to the plot, this is also possible by adding an additional value to the “geom” parameter.
qplot(PCT_Social_Rented_Households, PCT_Higher_Managerial, data = census,geom = c("point","smooth"))
## `geom_smooth()` using method = 'loess'
We might also want a simpler linear regression line, which requires two further parameters: “method” and “formula”.
qplot(PCT_Social_Rented_Households, PCT_Higher_Managerial, data = census,geom = c("point","smooth"),method="lm", formula=y~x)
## Warning: Ignoring unknown parameters: method, formula
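The warning above suggests that, in the version of ggplot2 used here, qplot() did not pass “method” and “formula” on to the smoothing layer, so the line drawn may not be the linear fit that was requested. One way around this, sketched below, is to use the full ggplot() syntax and give these parameters directly to geom_smooth(). The data frame `demo` is an assumption made for illustration: it holds only the six wards printed by head(census), rounded to one decimal place.

```r
library(ggplot2)

# Rounded values for the six wards shown in head(census)
demo <- data.frame(
  PCT_Social_Rented_Households = c(13.0, 22.8, 42.6, 18.4, 6.9, 3.0),
  PCT_Higher_Managerial        = c(10.1, 2.9, 3.9, 7.0, 10.8, 17.4)
)

# Points plus a straight line fitted by linear regression (method = "lm")
p_lm <- ggplot(demo, aes(PCT_Social_Rented_Households, PCT_Higher_Managerial)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

p_lm
```

With the full dataset, `demo` would simply be replaced by `census`.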
To illustrate how to create line plots we will read in some economic data downloaded from the Office for National Statistics which concerns household expenditure since 1948.
household_ex <- read.csv("./data/expenditure.csv")
We can then have a quick look at the data and check on the data class.
head(household_ex)
## Year Millions
## 1 1948 191274
## 2 1949 194639
## 3 1950 200097
## 4 1951 197686
## 5 1952 197993
## 6 1953 206868
str(household_ex)
## 'data.frame': 67 obs. of 2 variables:
## $ Year : int 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 ...
## $ Millions: int 191274 194639 200097 197686 197993 206868 215626 224699 226305 231475 ...
We can now attempt to plot the data.
qplot(Year, Millions, data = household_ex, geom = "line")
On the y axis, ggplot2 has defaulted to using scientific notation. We can change this, but to do so we will switch to the main ggplot() syntax. The first stage is to set up the plot, telling ggplot() what data to use and which “aesthetic mappings” (variables!) will be passed to the plotting function. These mappings are created with aes(), which is itself a function, although it is rarely called outside of ggplot2 code. The result is stored in a variable “p”.
p <- ggplot(household_ex, aes(Year, Millions))
If you just typed “p” into the console, no line would be drawn: older versions of ggplot2 return an error and more recent versions draw an empty plot, because you still need to tell ggplot() which type of graphical output is desired. We do this by adding layers using the “+” symbol.
p + geom_line()
Swapping out the scientific notation requires another package called “scales”. Once loaded, we can then add an additional parameter onto the graph.
library(scales)
p + geom_line() + scale_y_continuous(labels = comma)
We can also change the x and y axis labels:
# Add scale labels
p <- p + geom_line() + scale_y_continuous(labels = comma) + labs(x="Years", y="Millions (£)")
# Plot p
p
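The “labels” argument of scale_y_continuous() accepts a function that turns the numeric axis breaks into text, which is what `comma` from the scales package provides. As a rough sketch of the idea, a similar labelling function can be written with base R's format(); the name `plain_commas` is hypothetical and not part of any package.

```r
# A hypothetical labelling function built on base R's format():
# big.mark inserts thousands separators, scientific = FALSE switches off
# scientific notation, and trim = TRUE removes padding spaces
plain_commas <- function(x) {
  format(x, big.mark = ",", scientific = FALSE, trim = TRUE)
}

# Try it on two values of the size found in the expenditure data
plain_commas(c(200000, 1000000))
```

A function like this could then be passed to the scale in place of `comma`, for example `scale_y_continuous(labels = plain_commas)`.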
Making an interactive plot is very easy with the plotly package.
install.packages("plotly")
#Load package
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
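First, sign up to Plot.ly online, and then set up R to use Plot.ly by setting your username and an API key. The key is available from your Plot.ly account online, where you need to click the “Regenerate Key” button. These credentials are used when publishing charts to your Plot.ly account; drawing a chart locally with ggplotly() does not itself depend on them.
Once you have these details, enter them in the Sys.setenv() functions as follows and run: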
# Set username
Sys.setenv("plotly_username"="your_plotly_username")
# Set API key
Sys.setenv("plotly_api_key"="your_api_key")
Making an interactive plot becomes very simple when you already have a ggplot2 object created - earlier we created “p” which we can now make interactive with ggplotly():
# Create interactive plot
ggplotly(p)
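Interactive charts can also be built directly with plotly, without first creating a ggplot2 object. The following is a minimal sketch using plot_ly(); the data frame `demo` is an assumption made for illustration and contains only the six rows printed by head(household_ex) above.

```r
library(plotly)

# The six rows shown in head(household_ex)
demo <- data.frame(
  Year     = 1948:1953,
  Millions = c(191274, 194639, 200097, 197686, 197993, 206868)
)

# A line chart; the ~ prefix tells plotly to look up the column in 'demo'
fig <- plot_ly(demo, x = ~Year, y = ~Millions,
               type = "scatter", mode = "lines")

# Type fig at the console to display the interactive chart
```

With the full dataset, `demo` would be replaced by `household_ex`.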