In this post we will get the datasets and write the code to create the following visualization of the incarceration rate over time in the U.S. (custody and jurisdiction counts of all prisoners).
What are we looking at? The video-graph represents the number of people in private and public prisons (sentenced and unsentenced), normalized by the corresponding state population. The darker a state, the higher its prison population relative to its total population. The year is indicated on the top left, and the video shows the data from 1999 to 2015.
Some changes are quite evident. For example, Texas is the darkest among Southern states at the start of the video, but by 2015 Oklahoma leads the South in terms of prison population.
If you liked the video/graph and you are interested in the code behind that visualization, keep reading!
If not, bye bye!
Introduction
Animated plots are like spinning back kicks in Muay Thai: they are extremely risky and seldom used, but they can be highly effective and hit you right between the eyes!
A few weeks ago, I stumbled upon a newly released package, av, that uses ffmpeg to capture R plots and create videos. I decided to give it a try and use it with heat maps to visualize the change in prison population in the U.S. Admittedly, the video does not knock you out (the changes are relatively small), but I think it still summarizes the data at a glance quite well.
Get the data and make it tidy
We load the libraries.
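The original code chunk is not reproduced here, so the following is only a plausible set of packages, inferred from the steps described in the rest of the post (the exact list the author loaded is an assumption).

```r
# Assumed package set, inferred from the steps in this post
library(dplyr)    # filter, mutate, joins, summarise
library(tidyr)    # reshaping between wide and long format
library(stringr)  # cleaning up the state names
library(ggplot2)  # plotting; also provides map_data() and `presidential`
library(maps)     # polygon coordinates of the U.S. states
library(av)       # captures R plots into a video via ffmpeg
```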
The primary dataset that we are going to use is the National Prisoner Statistics, 1978-2015, curated by the United States Department of Justice, the Office of Justice Programs and the Bureau of Justice Statistics.
The dataset (rda file) comes with many documents including the Codebook and details of the methodology used to collect the data. We double click on the rda file to import it, we assign it to prison_pop and we take a look at some rows and columns.
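If you prefer to import the file from a script rather than by double-clicking, a minimal sketch could look like the one below. Since the real file name and the name of the object stored inside it are not given in this post, the sketch writes a tiny stand-in .rda to a temporary path first; `nps_toy` and `rda_path` are hypothetical names.

```r
library(dplyr)

# Stand-in for the downloaded file: a toy .rda written to a temp path.
# The object name `nps_toy` is invented for this sketch.
nps_toy <- data.frame(
  YEAR    = c(2015, 2015),
  STATEID = factor(c("(50) 50. Vermont", "(51) 51. Virginia")),
  STATE   = factor(c("VT", "VA"))
)
rda_path <- tempfile(fileext = ".rda")
save(nps_toy, file = rda_path)

# load() returns the names of the objects it restored, so we do not need
# to know in advance what the data frame is called inside the file
env <- new.env()
loaded <- load(rda_path, envir = env)
prison_pop <- as_tibble(env[[loaded[1]]])

# Peek at the last rows and a few columns
prison_pop %>%
  select(YEAR, STATEID, STATE) %>%
  tail(9)
```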
## # A tibble: 9 x 6
##    YEAR STATEID                   STATE REGION              CUSGT1M CUSGT1F
##   <dbl> <fct>                     <fct> <fct>                 <dbl>   <dbl>
## 1  2015 (50) 50. Vermont          VT    (1) Northeast           966      83
## 2  2015 (51) 51. Virginia         VA    (3) South             28105    2325
## 3  2015 (53) 53. Washington       WA    (4) West              15848    1297
## 4  2015 (54) 54. West Virginia    WV    (3) South              5319     606
## 5  2015 (55) 55. Wisconsin        WI    (2) Midwest           20346    1329
## 6  2015 (56) 56. Wyoming          WY    (4) West               1888     245
## 7  2015 (60) State prison total   ST    (7) State total     1054949   78954
## 8  2015 (70) US prison total (st… US    (5) U.S. total      1192289   89180
## 9  2015 (99) Federal BOP          FE    (6) Federal Bureau…  137340   10226
The data is in the right format (long), but it is not tidy. For example, the last 3 observations are totals that we don’t need, and the state names (STATEID) are recorded together with other superfluous information. We need to clean it up and also narrow the focus of our analysis.
I decided to look at the variables that indicate the number of inmates in custody (private and public facilities). The Codebook says:
“Variables CNOPRIVM, CNOPRIVF, CWPRIVM, and CWPRIVF were created by BJS starting in 1999 to address the fact that some states were counting their private prisons in their custody counts, but others were not.”
So we are going to get those variables, filter the dataframe/tibble to have years from 1999 to 2015 and clean up the STATEID column.
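A minimal sketch of that cleaning step is shown below. It runs on a few toy rows that mimic the raw layout printed earlier (the counts are placeholders), and the regular expression is my own guess at how to strip the "(50) 50. " prefix; the author's actual code may differ.

```r
library(dplyr)
library(stringr)

# Toy rows mimicking the raw layout; all counts are placeholders
raw <- tibble(
  YEAR     = c(1998, 1999, 2015, 2015),
  STATEID  = c("(50) 50. Vermont", "(50) 50. Vermont",
               "(54) 54. West Virginia", "(99) Federal BOP"),
  STATE    = c("VT", "VT", "WV", "FE"),
  CNOPRIVM = c(10, 11, 12, 13), CNOPRIVF = c(1, 1, 1, 1),
  CWPRIVM  = c(10, 12, 14, 16), CWPRIVF  = c(1, 2, 2, 2)
)

clean <- raw %>%
  # keep the years for which the private/public variables exist
  filter(YEAR >= 1999, YEAR <= 2015) %>%
  # drop the state total, U.S. total and federal rows
  filter(!STATE %in% c("ST", "US", "FE")) %>%
  # "(54) 54. West Virginia" -> "west virginia"
  mutate(STATEID = as.character(STATEID) %>%
           str_remove("^\\(\\d+\\)\\s+\\d+\\.\\s+") %>%
           str_to_lower()) %>%
  select(YEAR, STATEID, STATE, CNOPRIVM, CNOPRIVF, CWPRIVM, CWPRIVF)

clean
```

Lower-casing the names is convenient because the state names used later for the map coordinates are lower case as well.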
We calculate the totals (males + females, private facilities only, public facilities only, and so on).
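One way to compute those totals is sketched below on made-up counts. It assumes, based on the Codebook quote above, that the CWPRIV* variables include private facilities and the CNOPRIV* variables exclude them, so the private-only count is their difference; the output column names follow the table printed in this post.

```r
library(dplyr)

# Toy custody counts (made-up numbers)
counts <- tibble(
  YEAR     = c(1999, 1999),
  STATEID  = c("alabama", "alaska"),
  STATE    = c("AL", "AK"),
  CNOPRIVM = c(900, 400), CNOPRIVF = c(100, 50),
  CWPRIVM  = c(900, 600), CWPRIVF  = c(100, 80)
)

prison_tot <- counts %>%
  group_by(YEAR, STATEID) %>%
  summarise(
    STATE     = first(STATE),
    PRIS      = sum(CWPRIVM, CWPRIVF, na.rm = TRUE),   # public + private
    PRIS_PUB  = sum(CNOPRIVM, CNOPRIVF, na.rm = TRUE), # public only
    PRIS_PRIV = PRIS - PRIS_PUB,                       # private only
    .groups = "keep"
  )

prison_tot
```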
...and this is what we obtain:
## Observations: 867
## Variables: 6
## Groups: YEAR, STATEID [867]
## $ YEAR      <dbl> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, …
## $ STATEID   <chr> "alabama", "alaska", "arizona", "arkansas", "californi…
## $ STATE     <fct> AL, AK, AZ, AR, CA, CO, CT, DE, DC, FL, GA, HI, ID, IL…
## $ PRIS      <dbl> 21227, 3916, 25986, 10388, 160687, 12995, 16987, 6585,…
## $ PRIS_PUB  <dbl> 21227, 2529, 24594, 9174, 156066, 12995, 16987, 6585, …
## $ PRIS_PRIV <dbl> 0, 1387, 1392, 1214, 4621, 0, 0, 0, 4024, 3773, 3001, …
But now we have our first problem: an apparent increase in the prison population might simply reflect a corresponding change in the U.S. population. To circumvent that problem, we need to express the prison population relative to the corresponding state population for the time interval of interest. Where do we get the data? On the website of the U.S. Census Bureau, of course!
In particular this is what we need:
I have already merged all the census data together into a csv file that is on the cloud and that we will assign to sp_wide (it is in wide format).
We need to make the STATEID column consistent with the dataframe/tibble of the prison population and to rearrange it in a long format.
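As a sketch of that reshaping, assume the wide table has one row per state and one column per year. The column names (`STATE_NAME`, `1999`, `2000`) and the population figures below are invented for illustration; only the target columns STATEID, YEAR and POP come from this post.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one row per state, one column per year (invented values)
sp_wide <- tibble(
  STATE_NAME = c("Alabama", "West Virginia"),
  `1999`     = c(1000, 500),
  `2000`     = c(1100, 490)
)

state_pop <- sp_wide %>%
  # same lower-case convention as the prison data
  mutate(STATEID = tolower(STATE_NAME)) %>%
  select(-STATE_NAME) %>%
  # one row per state and year
  pivot_longer(-STATEID, names_to = "YEAR", values_to = "POP") %>%
  # column names are strings, the prison data stores YEAR as a number
  mutate(YEAR = as.numeric(YEAR))

state_pop
```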
Finally, we can join the two dataframes/tibbles by STATEID and YEAR and normalize the prison data by state population (expressed as inmates per 10,000 residents).
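The join and the normalization can be sketched with a single row, using the Alabama 1999 figures that appear in the tables of this post. The factor of 10,000 is inferred from those printed values; whether the original code used an inner or a left join is an assumption.

```r
library(dplyr)

# Alabama, 1999: values taken from the tables printed in this post
prison_tot <- tibble(YEAR = 1999, STATEID = "alabama", PRIS = 21227)
state_pop  <- tibble(YEAR = 1999, STATEID = "alabama", POP = 4430141)

prison_norm <- prison_tot %>%
  inner_join(state_pop, by = c("STATEID", "YEAR")) %>%
  # inmates per 10,000 residents
  mutate(PERC_PRIS = PRIS / POP * 10000)

prison_norm$PERC_PRIS  # about 47.9
```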
Plot it!
What is a map? It is a series of polygons that have been drawn on a sheet of paper based on known latitudes and longitudes.
In our case the sheet of paper is the area within our Cartesian axes, and longitude and latitude are the X and Y coordinates. We need those coordinates to draw our map. Easily done! We use the package maps to get the coordinates of each U.S. state and then we merge the resulting dataframe/tibble with the one of the prison population.
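A minimal sketch of that merge, using `map_data("state")` from ggplot2 (which reads the polygons from the maps package) and a single row of prison data. Keeping the `group` and `order` columns is my own addition: they are not in the table printed in this post, but they make it easier to draw states that consist of several polygons.

```r
library(dplyr)
library(ggplot2)  # map_data() needs the maps package to be installed

# One row of normalized prison data (Alabama, 1999, value from this post)
prison_norm <- tibble(YEAR = 1999, STATEID = "alabama",
                      STATE = "AL", PERC_PRIS = 47.91495)

# Polygon vertices of the contiguous states; `region` holds lower-case names
states_xy <- map_data("state") %>%
  as_tibble() %>%
  select(STATEID = region, LONG = long, LAT = lat, group, order)

# Every prison row is repeated once per polygon vertex of its state
prison_map <- prison_norm %>%
  inner_join(states_xy, by = "STATEID") %>%
  arrange(YEAR, group, order)

glimpse(prison_map)
```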
## Observations: 264,163
## Variables: 10
## $ YEAR      <dbl> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, …
## $ STATEID   <chr> "alabama", "alabama", "alabama", "alabama", "alabama",…
## $ STATE     <fct> AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL…
## $ PRIS      <dbl> 21227, 21227, 21227, 21227, 21227, 21227, 21227, 21227…
## $ PRIS_PUB  <dbl> 21227, 21227, 21227, 21227, 21227, 21227, 21227, 21227…
## $ PRIS_PRIV <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ POP       <dbl> 4430141, 4430141, 4430141, 4430141, 4430141, 4430141, …
## $ PERC_PRIS <dbl> 47.91495, 47.91495, 47.91495, 47.91495, 47.91495, 47.9…
## $ LONG      <dbl> -87.46201, -87.48493, -87.52503, -87.53076, -87.57087,…
## $ LAT       <dbl> 30.38968, 30.37249, 30.37249, 30.33239, 30.32665, 30.3…
In our video/graph we also want to show the state code (2 letters). To do that, we need to create a dataframe/tibble that holds the coordinates of those labels. Ideally, they would sit in the center of each state, so we can just average the longitude and latitude of each state. Of course, don’t expect that to be perfect: the shape of a state is not a perfect square or circle!
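The averaging can be sketched on two toy outlines (a square and a triangle for the made-up states "XX" and "YY"); with the real data, the same `group_by()` and `summarise()` would run on the merged map table.

```r
library(dplyr)

# Toy outlines: a unit square ("XX") and a triangle ("YY"), both invented
outline <- tibble(
  STATE = c(rep("XX", 4), rep("YY", 3)),
  LONG  = c(0, 1, 1, 0, 10, 12, 11),
  LAT   = c(0, 0, 1, 1, 5, 5, 8)
)

# Label position = mean of the polygon vertices of each state
state_labels <- outline %>%
  group_by(STATE) %>%
  summarise(LONG = mean(LONG), LAT = mean(LAT))

state_labels
```

Note that this is the mean of the vertices, not the true centroid of the area, so a state with a long, detailed coastline on one side will have its label pulled toward that side.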
The District of Columbia is too small to be seen on the map. Let’s remove it.
Finally, we get to the code to make the video. The idea behind it is pretty simple. We write a function that splits our dataframe by year and plots each year's data in sequence. Then we call the function and capture the output with av. A video is just a sequence of single images!
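Here is a minimal sketch of that idea, assuming the av package is installed. It uses two toy square "states" with made-up rates instead of the real map, and the names `plot_years`, `toy_map` and `video_file` are hypothetical; `av_capture_graphics()` is the av function that records every plot drawn by an expression as one video frame.

```r
library(dplyr)
library(ggplot2)
library(av)

# Toy data: two square "states" over three years, with made-up rates
square <- function(x0) tibble(LONG = x0 + c(0, 1, 1, 0), LAT = c(0, 0, 1, 1))
toy_map <- bind_rows(lapply(1999:2001, function(yr) {
  bind_rows(
    mutate(square(0), STATE = "XX", PERC_PRIS = 30 + (yr - 1999) * 5),
    mutate(square(2), STATE = "YY", PERC_PRIS = 50 - (yr - 1999) * 5)
  ) %>% mutate(YEAR = yr)
}))

# One plot per year; each print() becomes one frame of the video
plot_years <- function(df) {
  for (year_df in split(df, df$YEAR)) {
    p <- ggplot(year_df, aes(LONG, LAT, group = STATE, fill = PERC_PRIS)) +
      geom_polygon(colour = "white") +
      # fixed limits keep the colour scale comparable across frames
      scale_fill_gradient(low = "grey90", high = "grey10",
                          limits = range(df$PERC_PRIS)) +
      coord_quickmap() +
      labs(title = as.character(year_df$YEAR[1]),
           fill = "Inmates per 10,000") +
      theme_void()
    print(p)
  }
}

video_file <- file.path(tempdir(), "prison_toy.mp4")
av_capture_graphics(plot_years(toy_map), output = video_file,
                    width = 720, height = 480, framerate = 1)
```

Fixing the limits of the fill scale matters here: without it, ggplot2 would rescale the colours for every year and the shades would not be comparable from one frame to the next.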
With that we have covered all the code related to the video/graph, but I could not resist…
…making at least another plot.
A simple line plot
I also wanted to create a classic line plot with the total U.S. prison population (1999-2015) that indicates the party of the president in office (as Hadley Wickham has done in his ggplot2 book).
Let’s get the presidents…
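One possible source is the `presidential` dataset that ships with ggplot2; whether the original code used it is an assumption. The sketch below keeps the terms that overlap 1999-2015 and clips them to that window, with the hypothetical column names `start_year` and `end_year`.

```r
library(dplyr)
library(ggplot2)  # ships the `presidential` dataset

pres <- presidential %>%
  # the prison data is yearly, so reduce the term dates to years
  mutate(start_year = as.numeric(format(start, "%Y")),
         end_year   = as.numeric(format(end, "%Y"))) %>%
  # keep only the terms overlapping 1999-2015
  filter(end_year > 1999, start_year < 2015) %>%
  # clip the first and last term to the plotted window
  mutate(start_year = pmax(start_year, 1999),
         end_year   = pmin(end_year, 2015)) %>%
  select(name, party, start_year, end_year)

pres
```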
and the plot!
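A sketch of such a plot is shown below: shaded rectangles for the presidential terms, with the line drawn on top. The series is a placeholder: only the approximate 2008 and 2015 levels (about 44 and 39) come from this post, and the other points are invented just to give the line a shape.

```r
library(ggplot2)

# Placeholder series: only the ~44 (2008) and ~39 (2015) levels come from
# this post, the remaining points are invented
us_rate <- data.frame(YEAR      = c(1999, 2004, 2008, 2012, 2015),
                      PERC_PRIS = c(40, 42, 44, 41, 39))

# Presidential terms clipped to the plotted window
terms <- data.frame(
  party      = c("Democratic", "Republican", "Democratic"),
  start_year = c(1999, 2001, 2009),
  end_year   = c(2001, 2009, 2015)
)

p <- ggplot() +
  # background bands, one per term, spanning the full height of the panel
  geom_rect(data = terms,
            aes(xmin = start_year, xmax = end_year,
                ymin = -Inf, ymax = Inf, fill = party),
            alpha = 0.2) +
  geom_line(data = us_rate, aes(YEAR, PERC_PRIS)) +
  scale_fill_manual(values = c(Democratic = "blue", Republican = "red")) +
  labs(x = "Year", y = "Inmates per 10,000 residents",
       fill = "President's party")

print(p)
```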
From 2008 to 2015, the prison population went from about 44 inmates per 10,000 people to about 39. That is a ≈ 10% decrease! That is going to be the last graph of the post!
Next time, I will try to focus on machine learning analysis.