Introduction to scatter plots

By Maria Fernanda Ibarra Gutiérrez

This post looks at how scatter plots can be used to visualise multiple sets of data and the relationships between several variables. Working from a single data set, it covers dealing with outliers, formatting the graphs for clarity, using bubbles to show a third variable, adding regression models and trend lines to the plots, and splitting the data into separate graphs to shift the focus from the whole data set to portions of it.

A scatter plot is a type of visualization mainly used to display the relationship between continuous variables. A specific variation of the scatter plot is the bubble chart, also called a weighted scatter plot, which plots three variables, one of which is depicted by the size of the bubble. Even though a scatter plot can be used to compare one continuous and one categorical variable, or two categorical variables, there are better geoms for these kinds of variables, for instance geom_jitter, geom_count or geom_bin2d. To work with scatter plots, I will use the ggplot2 package for R. For a deeper understanding of this package, you can review the cheat sheet (RStudio, n.d.).
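
As a rough illustration of two of those alternatives, the sketch below plots two categorical variables from a small made-up data frame. The `toy` data and its column names are assumptions for this example only and are not part of the femicide dataset. geom_count draws one point per distinct combination and sizes it by the number of observations, while geom_jitter draws every observation with a little random noise so that overlapping points become visible.

```r
library(ggplot2)

# Hypothetical toy data: two categorical variables with repeated combinations
toy <- data.frame(
  group  = c("a", "a", "b", "b", "b"),
  answer = c("u", "u", "u", "v", "v")
)

# geom_count: one point per distinct combination, sized by how often it occurs
p_count <- ggplot(toy) +
  geom_count(aes(x = group, y = answer))

# geom_jitter: every observation is drawn, nudged randomly to reduce overplotting
p_jitter <- ggplot(toy) +
  geom_jitter(aes(x = group, y = answer), width = 0.1, height = 0.1)

# Counts computed for the geom_count layer, one per distinct combination
counts <- layer_data(p_count)$n
```

Printing p_count or p_jitter displays the corresponding plot.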

To show how to build scatter plot visualizations, I will use a dataset that gathers information from 2013 to 2015 from three Mexican censuses: the National Census of Federal Justice Procuration, the National Census of State Justice Procuration and the National Census of Government, Public Security and State Penitentiary System, all conducted by Mexico's National Institute of Statistics and Geography (INEGI, n.d.). Using these data, I will explore the relationship between the victim rate and the defendant rate for femicide, the crime of murdering a girl or woman.

First, I loaded the tidyverse package, which includes ggplot2 (it needs to be installed beforehand). Then I loaded the dataset and corrected and renamed some of the variables.

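library(tidyverse) #Load the tidyverse, which includes ggplot2 and dplyr
#Database: `input` must already hold the path to the folder containing the file
bd <- read.csv(file.path(input, "bd_femicides.csv"), stringsAsFactors = FALSE)
view(bd) #Inspect the data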

In general, the ggplot2 package uses the grammar of graphics to build a plot. We need to define a few components: the data, a geometrical function to represent the data points (in this case geom_point, since I will work with scatter plots) and a coordinate system. We should also set aesthetic properties such as the size, shape and colour of the points. Further elements such as labels, legends, themes and facets can be added to enrich the visualization.

First, I will show the most basic way of making a scatter plot. I defined the dataset and the geometrical function, whose aesthetic argument holds the variables for the axes; the colour of the points is set outside aes() because it is a fixed value. The resulting figure only shows the data and the default names of the axes.

ggplot(data = bd) +
  geom_point(aes(x = vict_rate, y = defendant_rate), color = "steelblue4")

As Figure 1 shows, there is an outlier in the dataset that makes it hard to see the rest of the points, which take lower values on both axes. To improve the visualization, I plotted the data twice: once with all of the observations and once without the outlier. I loaded the gridExtra package to use its grid.arrange function, which places both graphics in the same grid. The syntax of both graphics follows the same structure as the previous one; in addition, I set the size of the points in geom_point, added labels for the title, subtitle, axis names and caption, and added a theme to customize the background of the plot.

library(gridExtra) #Provides grid.arrange
p1 <- ggplot(data=bd) +
  geom_point(aes(x = vict_rate , y = defendant_rate), color ="steelblue4", size = 3) + #Add a size of the point
  labs(title ="Justice for femicide victims", #Add title
       subtitle="Femicide victims vs Defendants rate", #Add subtitle
       x = "Victims per 100,000 women", y = "Defendants per 100,000 people", #Add names of axes
       caption = "Source: Census INEGI 2013, 2014 and 2015") +  #Add name of source
  theme_minimal() #Add theme

#Plot without outlier
sub_bd <- subset(bd, vict_rate < 10) #Keep observations with a victim rate below 10
p2 <- ggplot(data=sub_bd) +
  geom_point(aes(x = vict_rate , y = defendant_rate), color ="steelblue4", size = 3) + #Add a size of the point
  labs(title ="Justice for femicide victims", #Add title
       subtitle="Femicide victims vs Defendants rate", #Add subtitle
       x = "Victims per 100,000 women", y = "Defendants per 100,000 people", #Add names of axes
       caption = "Source: Census INEGI 2013, 2014 and 2015") +  #Add name of source
  theme_minimal() #Add theme

I saved these graphics in the objects p1 and p2 and used grid.arrange to lay them out side by side in a single row.

p3 <-grid.arrange(p1, p2, nrow = 1)

Figure 2 – Scatter plots of the two data sets
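
Before dropping an outlier it can help to check which observation it is and how many rows the filter removes. The following is only a minimal sketch on a made-up data frame: the values in `toy` are invented for illustration, and only the column names and the threshold of 10 are borrowed from the code above.

```r
# Hypothetical toy data using the same column names as the tutorial
toy <- data.frame(
  ent            = c("A", "B", "C", "D"),
  vict_rate      = c(0.8, 1.5, 2.3, 14.0),
  defendant_rate = c(0.2, 0.4, 0.9, 3.1)
)

# Row with the largest victim rate: the candidate outlier
outlier_row <- toy[which.max(toy$vict_rate), ]

# Same filtering rule as in the tutorial: keep victim rates below 10
toy_sub <- subset(toy, vict_rate < 10)

# Number of rows removed by the filter
n_removed <- nrow(toy) - nrow(toy_sub)
```

Inspecting outlier_row before filtering makes it explicit which state and year are being left out of the second plot.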

As shown in Figure 2, the titles and labels now give the context needed to understand the information. However, the plots cover three years of data, and the years are impossible to tell apart, so further improvements are required.

From this point on, I will only show the syntax and graphic for the dataset without the outlier, because the points are easier to read at that scale.

The next change focuses on distinguishing the year of each observation. Building on the previous syntax, I mapped the colour of the points to the year inside the aesthetic argument and added a scale_color_manual line to choose the colours manually.

p2<-ggplot(data=sub_bd) +
  geom_point(aes(x = vict_rate , y = defendant_rate, color=as.factor(year)), size = 3) + #Points' colour regarding the year
  scale_color_manual(values=c("darkolivegreen","dodgerblue4", "orange3")) + #Colours
  labs(title ="Justice for femicide victims", #Add title
       subtitle="Femicide victims vs Defendants rate", #Add subtitle
       x = "Victims per 100,000 women", y = "Defendants per 100,000 people", #Add names of axes
       caption = "Source: Census INEGI 2013, 2014 and 2015", color ="Year") + #Add name of source
  theme_minimal() #Add theme

Figure 3 – Scatter plot showing each year in different colour
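
An unnamed vector of colours is matched to the factor levels by position. As an optional variation, and only a sketch on made-up data, the colours can be given as a named vector so that each year is tied to its colour explicitly. The `toy` data frame and the object name `year_colours` are assumptions for this example.

```r
library(ggplot2)

# Hypothetical toy data: one observation per year, deliberately out of order
toy <- data.frame(
  vict_rate      = c(1, 2, 3),
  defendant_rate = c(0.5, 0.7, 0.9),
  year           = c(2015, 2013, 2014)
)

# Named vector: each factor level (year) is tied to a colour by name
year_colours <- c("2013" = "darkolivegreen",
                  "2014" = "dodgerblue4",
                  "2015" = "orange3")

p_named <- ggplot(toy) +
  geom_point(aes(x = vict_rate, y = defendant_rate, color = as.factor(year)),
             size = 3) +
  scale_color_manual(values = year_colours) +
  labs(color = "Year")

# Colours actually assigned to each point
drawn <- layer_data(p_named)
```

With named values, the mapping stays the same even if a year is missing from a subset of the data.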

Next, I would like to show one more variable in the graphic, so I will use a bubble chart, or weighted scatter plot, in which the size of the dots represents the total of femicide crimes and antisocial behaviours per year and state. Building on the previous structure, I mapped size to the variable crimes_tot in the aesthetic argument of geom_point.

p2<-ggplot(data=sub_bd) +
  geom_point(aes(x = vict_rate , y = defendant_rate, color=as.factor(year), size = crimes_tot)) + #Points'size regarding the total
  scale_color_manual(values=c("darkolivegreen","dodgerblue4", "orange3")) + #Colours
  labs(title ="Justice for femicide victims", #Add title
       subtitle="Femicide victims vs Defendants rate", #Add subtitle
       x = "Victims per 100,000 women", y = "Defendants per 100,000 people", #Add names of axes
       caption = "Source: Census INEGI 2013, 2014 and 2015", color ="Year", size="Total femicides") + #Add name of source
  theme_minimal() #Add theme

Figure 4 – Bubble plot comparing the number of femicides by year and state
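
One optional refinement, which is not used in the plots of this tutorial, is scale_size_area. It makes the area of each bubble proportional to the value, so a value of zero is drawn with zero size. The sketch below uses a made-up data frame; the values and the choice of max_size = 10 are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical toy data: totals chosen so the proportions are easy to follow
toy <- data.frame(
  vict_rate      = c(1, 2, 3),
  defendant_rate = c(0.5, 0.7, 0.9),
  crimes_tot     = c(0, 25, 100)
)

p_area <- ggplot(toy) +
  geom_point(aes(x = vict_rate, y = defendant_rate, size = crimes_tot)) +
  scale_size_area(max_size = 10) + # Bubble area proportional to crimes_tot
  labs(size = "Total femicides")

# Point sizes computed by the scale
bubbles <- layer_data(p_area)
```

Because the area, and not the diameter, follows the value, a state with four times as many cases gets a bubble with four times the area and twice the diameter.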

Another important aspect that can be explored graphically is the relationship between the analysed variables. In a scatter plot we can add a geom_smooth layer to show the fit of a linear regression model. In this layer, I set method = "lm" for a linear model, the formula y ~ x, se = FALSE to hide the confidence interval and the colour of the line.

p2<-ggplot(data=sub_bd) +
  geom_point(aes(x = vict_rate , y = defendant_rate, color=as.factor(year), size = crimes_tot)) + #Points'size regarding the total
  scale_color_manual(values=c("darkolivegreen","dodgerblue4", "orange3")) + #Colours
  geom_smooth(aes(x = vict_rate, y = defendant_rate), method = "lm", formula = y ~ x, se = FALSE, color = "lightskyblue4") + #Linear regression without confidence band
  labs(title ="Justice for femicide victims", #Add title
       subtitle="Femicide victims vs Defendants rate", #Add subtitle
       x = "Victims per 100,000 women", y = "Defendants per 100,000 people", #Add names of axes
       caption = "Source: Census INEGI 2013, 2014 and 2015", color ="Year",size="Total Femicides") + #Add name of source
  theme_minimal() #Add theme

Figure 5 – Bubble plot with linear regression line
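
The line drawn with method = "lm" and the formula y ~ x corresponds to an ordinary least squares fit, which can also be computed directly to read the intercept and slope. This is a minimal sketch on made-up values: the `toy` data frame is an assumption for illustration and only reuses the column names of the tutorial.

```r
# Hypothetical toy data with a roughly linear relationship
toy <- data.frame(
  vict_rate      = c(1, 2, 3, 4, 5),
  defendant_rate = c(2.4, 3.1, 3.4, 4.1, 4.4)
)

# Same model specification as geom_smooth(method = "lm", formula = y ~ x)
fit_lin <- lm(defendant_rate ~ vict_rate, data = toy)

# Intercept and slope of the fitted line
line_coef <- coef(fit_lin)

# Predicted defendant rate for a victim rate of 6
pred_6 <- predict(fit_lin, newdata = data.frame(vict_rate = 6))
```

Calling summary(fit_lin) additionally reports standard errors and the R-squared of the fit, which the plot alone does not show.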

It is also possible to show the fit of a quadratic model by following the same structure with a quadratic formula, y ~ x + I(x^2). Because se is not set to FALSE here, the confidence band is drawn around the curve.

p2<-ggplot(data=sub_bd) +
  geom_point(aes(x = vict_rate , y = defendant_rate, color=as.factor(year), size = crimes_tot)) + #Points'size regarding the total
  scale_color_manual(values=c("darkolivegreen","dodgerblue4", "orange3")) + #Colours
  geom_smooth(aes(x = vict_rate , y = defendant_rate),method="lm", formula=y ~ x + I(x^2), color="lightskyblue4") + #Quadratic regression
  labs(title ="Justice for femicide victims",#Add title
       subtitle="Femicide victims vs Defendants rate", #Add subtitle
       x = "Victims per 100,000 women", y = "Defendants per 100,000 people", #Add names of axes
       caption = "Source: Census INEGI 2013, 2014 and 2015", size="Total Femicides", color ="Year") + #Add name of source
  theme_minimal() #Add theme

Figure 6 – Bubble plot with quadratic regression line
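
To judge whether the quadratic term is worth adding, the two models can be fitted directly and compared. The sketch below does this with AIC on made-up data that is quadratic by construction; the data, the object names and the choice of AIC as the comparison criterion are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical toy data: a quadratic trend plus a small alternating disturbance
x <- 1:10
y <- 1 + x^2 + rep(c(0.2, -0.2), 5)
toy <- data.frame(x = x, y = y)

# Linear and quadratic models, written like the geom_smooth formulas
fit_lin  <- lm(y ~ x, data = toy)
fit_quad <- lm(y ~ x + I(x^2), data = toy)

# Lower AIC indicates the better trade-off between fit and complexity
aic_lin  <- AIC(fit_lin)
aic_quad <- AIC(fit_quad)
```

Comparing the residual sum of squares alone would always favour the larger model, which is why a criterion that penalises extra terms is used here.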

Since I am presenting crime data, it is important to show which places have the highest rates, so the points need names. This can be done by adding a geom_text layer, in which you define the axis variables, the label and the size of the text.

p2<-ggplot(data=sub_bd) +
  geom_point(aes(x = vict_rate , y = defendant_rate, color=as.factor(year), size = crimes_tot)) + #Points'size regarding the total
  geom_text(aes(x = vict_rate , y = defendant_rate, label=ent), size=3) + #Points' names
  scale_color_manual(values=c("darkolivegreen","dodgerblue4", "orange3")) + #Colours
  labs(title ="Justice for femicide victims",#Add title
       subtitle="Femicide victims vs Defendants rate", #Add subtitle
       x = "Victims per 100,000 women", y = "Defendants per 100,000 people", #Add names of axes
       caption = "Source: Census INEGI 2013, 2014 and 2015",size="Total Femicides", color ="Year") + #Add name of source
  theme_minimal() #Add theme

Figure 7 – Bubble plot with labels to the data points

As the previous graphic shows, the labels overlap each other because the points are so dense. The geom_text_repel function from the ggrepel package helps by repositioning the labels so that they do not collide. To use it, I kept the previous syntax and replaced geom_text with geom_text_repel.

library(ggrepel) #First load the package
p2<- ggplot(data=sub_bd) +
  geom_point(aes(x = vict_rate , y = defendant_rate, size = crimes_tot, color=as.factor(year))) + #Points'size regarding the total
  geom_text_repel(aes(x = vict_rate , y = defendant_rate, label=ent), size=3) + # Points' names with ggrepel
  scale_color_manual(values=c("darkolivegreen","dodgerblue4", "orange3")) + #Colours
  labs(title ="Justice for femicide victims", #Add title
       subtitle="Femicide victims vs Defendants rate", #Add subtitle
       x = "Victims per 100,000 women", y = "Defendants per 100,000 people", #Add names of axes
       caption = "Source: Census INEGI 2013, 2014 and 2015",size="Total Femicides", color ="Year") + #Add name of source
  theme_minimal() #Add theme

Figure 8 – Bubble plot with data labels without overlap

Even so, given the density of this dataset, it is better to label only some of the points by passing filtered data to geom_text_repel. In this case, I used dplyr's filter to label only the states with a victim rate higher than 2 per 100,000 women.

p2<- ggplot(data=sub_bd) +
  geom_point( aes(x = vict_rate , y = defendant_rate, size = crimes_tot, color=as.factor(year))) + #Points'size regarding the total
  geom_text_repel(data=filter(sub_bd, vict_rate > 2), aes(x = vict_rate, y = defendant_rate, label=ent), size=3) + #Names of some points with ggrepel
  scale_color_manual(values=c("darkolivegreen","dodgerblue4", "orange3")) + #Colours
  labs(title ="Justice for femicide victims", #Add title
       subtitle="Femicide victims vs Defendants rate", #Add subtitle
       x = "Victims per 100,000 women", y = "Defendants per 100,000 people", #Add names of axes
       caption = "Source: Census INEGI 2013, 2014 and 2015",size="Total Femicides", color ="Year") + #Add name of source
  theme_minimal() #Add theme

Figure 9 – Bubble plot showing only selected labels
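
The key idea is that each layer can receive its own data argument, so the points use the full dataset while the labels use a subset. The sketch below shows this on a made-up data frame, using geom_text and base R's subset so that it runs with ggplot2 alone; the `toy` values are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical toy data using the same column names as the tutorial
toy <- data.frame(
  ent            = c("A", "B", "C", "D"),
  vict_rate      = c(0.8, 1.5, 2.3, 3.1),
  defendant_rate = c(0.2, 0.4, 0.9, 1.2)
)

p_labels <- ggplot(toy, aes(x = vict_rate, y = defendant_rate)) +
  geom_point(size = 3) +                       # All observations
  geom_text(data = subset(toy, vict_rate > 2), # Only the rows to be labelled
            aes(label = ent), vjust = -1)

# Labels that end up in the text layer
labelled <- layer_data(p_labels, 2)$label
```

Swapping geom_text for geom_text_repel keeps the same data argument and only changes how the labels are positioned.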

Another aspect that can be depicted in this visualization is the trend of each year. To do this, we can plot smooth curves with confidence intervals for each year using geom_smooth, mapping the variables for each axis and colouring the lines by year.

p2<- ggplot() +
  geom_point(data=sub_bd, aes(x = vict_rate , y = defendant_rate, size = crimes_tot, color=as.factor(year))) + #Points'size regarding the total
  geom_smooth(data=sub_bd, aes(x = vict_rate , y = defendant_rate, color=as.factor(year)))+ #Smooth curve with confidence intervals
  geom_text_repel(data=filter(sub_bd, vict_rate > 2), aes(x = vict_rate , y = defendant_rate, label=ent), size=3) + #Names of some points with ggrepel
  scale_color_manual(values=c("darkolivegreen","dodgerblue4", "orange3")) + #Colours
  labs(title ="Justice for femicide victims", #Add title
       subtitle="Femicide victims vs Defendants rate", #Add subtitle
       x = "Victims per 100,000 women", y = "Defendants per 100,000 people", #Add names of axes
       caption = "Source: Census INEGI 2013, 2014 and 2015", size="Total Femicides", color ="Year") + #Add name of source
  theme_minimal() #Add theme

Figure 10 – Bubble plot with trend lines regarding the year

However, as Figure 10 shows, the trend lines overlap each other, so it is better to split the graphic and show each year separately. To do this, I used facets to divide the plot into subplots by year. Building on the previous code, I added facet_grid(year ~ .), which arranges the subplots by year in rows; to arrange them in columns instead, change it to facet_grid(. ~ year).

p2<-ggplot() +
  geom_point(data=sub_bd, aes(x = vict_rate , y = defendant_rate, size = crimes_tot, color=as.factor(year))) + #Points'size regarding the total
  geom_smooth(data=sub_bd, aes(x = vict_rate , y = defendant_rate, color=as.factor(year))) + #Smooth curve with confidence intervals
  facet_grid(year~ .) + #Split by year in rows
  geom_text_repel(data=filter(sub_bd, vict_rate > 2), aes(x = vict_rate , y = defendant_rate, label=ent), size=3) + #Names of some points with ggrepel
  scale_color_manual(values=c("darkolivegreen","dodgerblue4", "orange3")) + #Colours
  labs(title ="Justice for femicide victims", #Add title
       subtitle="Femicide victims vs Defendants rate", #Add subtitle
       x = "Victims per 100,000 women", y = "Defendants per 100,000 people", #Add names of axes 
       caption = "Source: Census INEGI 2013, 2014 and 2015", size="Total Femicides", color ="Year") + #Add name of source
  theme_minimal() #Add theme

Figure 11 – Bubble plot divided by years
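
The difference between the two facet_grid formulas can be seen in a minimal sketch on made-up data: the variable on the left of the tilde defines the rows and the variable on the right defines the columns. The `toy` data frame and the object names are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical toy data: two observations per year
toy <- data.frame(
  vict_rate      = c(1.0, 2.0, 1.5, 2.5, 1.2, 2.2),
  defendant_rate = c(0.3, 0.6, 0.4, 0.8, 0.5, 0.7),
  year           = rep(2013:2015, each = 2)
)

base_plot <- ggplot(toy, aes(x = vict_rate, y = defendant_rate)) +
  geom_point()

p_rows <- base_plot + facet_grid(year ~ .) # One row per year
p_cols <- base_plot + facet_grid(. ~ year) # One column per year

# Number of panels produced by the row layout
n_panels <- length(unique(layer_data(p_rows)$PANEL))
```

Rows make it easier to compare positions along the x axis across years, while columns make it easier to compare positions along the y axis.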

A final facet view of the graphic shows the values for each state rather than for each year.

p2<- ggplot() +
  geom_point(data=sub_bd, aes(x = vict_rate , y = defendant_rate, color=as.factor(year))) + #Points' colour regarding the year
  facet_wrap(.~ ent) + #Split  by state
  scale_color_manual(values=c("darkolivegreen","dodgerblue4", "orange3")) + #Colours
  labs(title ="Justice for femicide victims", #Add title
       subtitle="Femicide victims vs Defendants rate", #Add subtitle
       x = "Victims per 100,000 women", y = "Defendants per 100,000 people", #Add names of axes
       caption = "Source: Census INEGI 2013, 2014 and 2015", color ="Year") + #Add name of source
  theme_minimal() #Add theme