---
title: "Analyzing Deaths in the Iraq War Using R"
author: "Professor Michael Spagat"
format: html
editor: visual
bibliography: references.bib
---
## Preliminaries
This document provides the R code for the paper "The Death Toll from the Iraq War: 2003 - 2023". A separate .zip file provides all the data needed to run the code.
First, shut off scientific notation, which I find confusing in figure labels. Also, load the R tidyverse (@tidyvers), which I'll use mainly for pictures and data wrangling.
```{r}
options(scipen=999)
library(tidyverse)
```
## Iraq Body Count (IBC)
### The core civilians-only data
The accompanying .zip file contains all the data analyzed in the present document, including the IBC data. However, if you want to start from scratch you can download the IBC data (@iraqbodycounthomepage) from their database page (@iraqbodycountdatabase). Note that IBC publishes both minimum and maximum figures, but I work only with the maximum figures because these are the ones that are freely downloadable. If you are working from scratch you will have to clean some explanatory text out of the downloadable .csv file so that R can read it, but you can skip this minor cleaning if you use the provided file.
```{r}
IBC <- read.csv("IBC_Maximum_2003_to_2023_by_month_no_garbage.csv")
#The above file is provided as part of the package accompanying this code.
```
The file has three columns. The first lists months, beginning in January of 2003. I will delete January and February of 2003, along with their 5 deaths, because the war did not officially start until March of 2003, and I will also begin the coverage for the surveys in March of 2003. Thus, I provide a level playing field for these comparisons. Of course, this deletion hardly matters because 5 deaths are so few compared to the full toll of the war.
Starting in February of 2017 IBC lowers the status of its numbers to "preliminary". Column 2 gives non-preliminary figures through February of 2017. Column 3 then continues with preliminary figures, including an overlapping one for February of 2017. I combine Columns 2 & 3 into a single column, choosing the preliminary figure for February of 2017 over the (slightly lower) non-preliminary one.
```{r}
IBC_monthly <- c(IBC$dataset.1[3:169],
                 IBC$dataset.2[170:length(IBC$Month.starting)])
#Delete January and February of 2003 and merge the settled figures of
#column 2 with the preliminary figures of column 3.
```
### Extend IBC civilian-only numbers to include combatants
I now have a civilians-only IBC monthly series. However, all the surveys ask their respondents about all deaths, i.e., they do not distinguish between civilian deaths and combatant deaths. Thus, they measure deaths of civilians plus combatants. For an apples-to-apples comparison, therefore, we must extend the IBC data to include combatants. Fortunately, I can do this by cobbling together scattered numbers, mostly relying on IBC research.
@conetta2009 estimates 9,200 Iraqi combatants killed during the initial combat phase of the war. I will spread these deaths evenly (below) across March, April and May of 2003.
@iraqwar provides the following combatant numbers along with explanations for how they are derived:
1. 597 for June through December of 2003.
2. 5,575 + 20,499 = 26,074 "host nation" + "enemy" combatants for January 2004 through December 2009, although, unfortunately, May 2004 and March 2009 are missing.
3. 652 for May of 2004.
4. 59 for March of 2009.
These combatant numbers were subsequently updated in a series of releases: @iraqbodycount2010, @iraqbodycount2011, @iraqbodycount2012, @iraqbodycount2015a, @iraqbodycount2016 and @iraq202:
1. **1,084 for 2010**
2. 2,187 for 2010-11, **implying 1,103 for 2011**
3. 2,819 for 2010-12, **implying 632 for 2012**
4. 24,324 for 2010-15, **implying 21,505 for 2013-15** which I allocate equally across the three years (7,168.33 for each year).
5. 32,330 for 2010-16, **implying 8,006 for 2016**,
6. **No information for 2017 through 2020**.
7. **981 for 2021**.
8. **Nothing for 2022 or 2023** but the IBC civilian figures I'm using only go through February of 2023.
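As a quick sanity check on the implied annual figures, we can difference the cumulative release totals (a small sketch; the totals are copied from the list above):

```{r}
#Differencing the running release totals recovers the implied annual figures
releases <- c(1084, 2187, 2819, 24324, 32330)
diff(releases)      #1103 for 2011, 632 for 2012, 21505 for 2013-15, 8006 for 2016
(24324 - 2819)/3    #7168.33 per year when 2013-15 is spread equally
```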
Next I use data from @ucdphom2022 to fill in the above gaps:
1. 10,025 for 2017
2. 866 for 2018
3. 498 for 2019
4. 671 for 2020
5. 701 for 2021. Note that I also have the IBC-released figure of 981 for 2021 and use that one, so as to rely on IBC numbers whenever possible.
6. 335 for 2022
I now create a monthly time series for combatants based on the figures given just above:
```{r}
IBCcompatible_combatants <- c(
  rep(9200/3, 3),              #March-May 2003
  rep(597/7, 7),               #June-December 2003
  #January 2004 through December 2009 is 72 months, but May 2004 and
  #March 2009 are covered separately, leaving 70 months at the average rate
  rep((5575 + 20499)/70, 4),   #January-April 2004
  652,                         #May 2004
  rep((5575 + 20499)/70, 57),  #June 2004 - February 2009
  59,                          #March 2009
  rep((5575 + 20499)/70, 9),   #April-December 2009
  rep(1084/12, 12),            #2010
  rep(1103/12, 12),            #2011
  rep(632/12, 12),             #2012
  rep(21505/36, 36),           #2013-2015
  rep(8006/12, 12),            #2016
  rep(10025/12, 12),           #2017
  rep(866/12, 12),             #2018
  rep(498/12, 12),             #2019
  rep(671/12, 12),             #2020
  rep(941/12, 12),             #2021
  rep(335/12, 12),             #2022
  rep(0, length(IBC_monthly) - (3 + 7 + 4 + 1 + 57 + 1 + 9 + 12 + 12 + 12 +
                                  36 + 12 + 12 + 12 + 12 + 12 + 12 + 12)))
#check that the lengths are equal
length(IBC_monthly)
length(IBCcompatible_combatants)
#check that the total number of combatants killed is right
sum(IBCcompatible_combatants)
9200 + 597 + 5575 + 20499 + 652 + 59 + 1084 + 1103 + 632 + 21505 + 8006 + 10025 + 866 + 498 + 671 + 941 + 335
```
### Extending IBC to include estimated civilian deaths not already captured
The official US military incident data (SIGACTS) for the Iraq War was obtained and released (@davies2010) by WikiLeaks and analyzed by IBC (@iraqbodycountsigactsanalysis). IBC took a random sample of SIGACTS events with reported deaths of civilians, matched them against incidents in IBC and estimated that, at the time the SIGACTS data was released, SIGACTS contained 15,000 deaths of civilians not captured by IBC. Subsequently, IBC managed to integrate 5,580 of these deaths into its database and lowered its estimate of civilian deaths in SIGACTS but not in IBC to 9,420 (@iraqbodycount2016). I add these non-captured deaths to my time series and allocate them equally across all months of 2004 through 2009:
```{r}
IBCcompatible_civilians <- c(rep(0, 10),rep(9420/72, 72),
rep(0, length(IBC_monthly)-82))
```
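A quick arithmetic check on the figures behind this allocation (all values copied from the discussion above):

```{r}
#15,000 estimated SIGACTS-only civilian deaths minus the 5,580 later
#integrated into IBC leaves the 9,420 allocated above.
15000 - 5580   #9420
9420/72        #roughly 131 added deaths per month, 2004 through 2009
```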
### Put it all together
I now add the combatant and additional civilian numbers to the core IBC numbers:
```{r}
IBC_extended <- IBC_monthly + IBCcompatible_combatants + IBCcompatible_civilians
sum(IBC_extended)
```
We now have 301,753 total deaths, close to the 300,000 figure that is posted on IBC's home page.
Finally I create a cumulative monthly series:
```{r}
IBC_extended_cumulative <- cumsum(IBC_extended)
```
All comparisons in the paper use cumulative figures.
## The Iraq Living Conditions Survey (ILCS)
@programme2005 estimated 24,000 "war-related deaths" covering the period from March 2003 through May 2004, with a 95% uncertainty interval running from 18,000 to 29,000. The ILCS does not provide monthly estimates. Nor does it provide raw data that would enable me to make such estimates. So all I can do is load the published ILCS figures into R.
The code below creates vectors that are the same length as the IBC_extended_cumulative vector by appending a lot of NA's. This is just because the graphing command expects these lengths to be identical.
```{r}
ILCS <- c(rep(0,13),24000, rep(NA, length(IBC_extended_cumulative)-14))
ILCS_low <- c(rep(0,13),18000, rep(NA, length(IBC_extended_cumulative)-14))
ILCS_high <- c(rep(0,13),29000, rep(NA, length(IBC_extended_cumulative)-14))
#Just check to make sure that the lengths of the vectors are all the same
length(ILCS)
length(ILCS_low)
length(ILCS_high)
length(IBC_extended_cumulative)
```
## Roberts et al. (2004)
@roberts2004 covers the period from March 2003 through September 2004. The paper provides only an excess-death estimate, not the violent-death estimate that my paper requires. Fortunately, I am able to produce a violent-death estimate here based on information in the paper:
1. There are 73 violent deaths recorded in the sample.
2. There are 138,439 person-months covered by the 18-month sample. (This figure is provided in the paper; I cannot recompute it, but it is plausible.)
I use a population number of 27,858,948 in 2004 taken from @iraqpop, rather than the figure of 24,400,000 used in @roberts2004.[^1] The following calculation divides the population of Iraq by the number of person-months in the sample divided by the number of months covered by the sample, arriving at a multiplier that converts in-sample violent deaths into estimated violent deaths:
[^1]: I use @iraqpop as much as possible for population figures so as to maintain consistency across sources.
```{r}
multiplier_L1 <- 27858948/(138439/18)
#Note that "L1" is coding shorthand for Roberts et al. (2004), the first of
#two surveys of Iraq war deaths published in the Lancet.
multiplier_L1
```
The calculation below shows that our central estimate, based on @roberts2004, for the number of violent deaths for the first 18 months of the war is 260,000 (rounded to two significant digits).[^2]
[^2]: Some readers may remember an estimate of 100,000 deaths associated with this survey; however, this is an excess-deaths estimate, not a violent-deaths estimate. Moreover, for reasons that go beyond the scope of the present paper, @roberts2004 omitted the extremely violent Fallujah cluster from its excess-deaths estimate, which would have been much higher, but with a far wider uncertainty interval, had this cluster been included.
```{r}
73*multiplier_L1
```
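The rounding to two significant digits can be made explicit with `signif()`. This sketch recomputes the multiplier from the figures given above so that the chunk stands alone:

```{r}
#Recompute the multiplier and round the central estimate to two
#significant digits.
multiplier_check <- 27858948/(138439/18)
signif(73*multiplier_check, 2)   #260000
```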
Place this estimate within a vector of the same length as the IBC vector to facilitate making pictures using both sources simultaneously.
```{r}
L1 <- c(rep(0,18),73*multiplier_L1, rep(NA, length(IBC_monthly)-19))
#Check that the length of this L1 vector is 240
length(L1)
```
Unfortunately, the authors of the paper have not released cluster-by-cluster data, let alone household-by-household data, thus limiting my ability to surround this estimate with a high-quality uncertainty interval. I can proceed, nonetheless, by making assumptions about how the violent deaths are allocated across clusters. These assumptions will be charitable to the authors of the paper in the sense that they will minimize the width of the uncertainty interval, which will turn out, nevertheless, to be gargantuan.

We know from the paper that 52 of the 73 violent deaths recorded in the survey are for a single cluster located in Fallujah. The charitable assumption is that the remaining 21 violent deaths are spread across 21 separate clusters.[^3] Thus, I will assume that the 33 clusters break down into 1 cluster with 52 violent deaths, 21 clusters with 1 death and 11 clusters with 0 deaths. Despite my uncertainty-reducing assumption that spreads deaths evenly across clusters outside Fallujah, the following bootstrap simulation still finds a 95% uncertainty interval running from 65,000 all the way up to 640,000, i.e., the top of the interval exceeds the bottom by almost a factor of 10:
[^3]: The width of the uncertainty interval would increase to the extent that some clusters have multiple deaths.
```{r}
set.seed(1)
deaths_by_cluster_generous <- c(52, rep(1, 21), rep(0, 11))
results_generous <- c()
for (i in 1:10000) {
  results_generous <- c(results_generous,
                        sum(sample(deaths_by_cluster_generous, 33,
                                   replace = TRUE)))
}
quantile(multiplier_L1*results_generous, probs = c(0.025, 0.975))
```
The most appropriate conclusion to draw from this uncertainty interval is that the survey was a waste of time that should be ignored, at least for the purpose of understanding the scale of violent deaths in the Iraq war. I will, nevertheless, incorporate the above information into figure 1 (below):
```{r}
ui_generous <- quantile(multiplier_L1*results_generous,
                        probs = c(0.025, 0.975))
bottom_generous <- ui_generous[[1]]
top_generous <- ui_generous[[2]]
L1_generous_low <-
c(rep(0,18),bottom_generous, rep(NA, length(IBC_monthly)-19))
L1_generous_high <-
c(rep(0,18),top_generous, rep(NA, length(IBC_monthly)-19))
#Again, check that the lengths are right
length(L1_generous_low)
length(L1_generous_high)
```
## Burnham et al. (2006)
The survey of @burnham2006 purportedly covers the period from the beginning of the war through June of 2006. However, it is important to note that there are 24 violent deaths recorded for July of 2006 in one of the clusters. These out-of-coverage-period deaths should not have been included in the authors' estimate but were included anyway, and they add more than 50,000 violent deaths to the final estimate. I also include these deaths but treat the survey as covering the period through July of 2006, contrary to what is claimed in the paper.[^4]
[^4]: Straining credibility, these 24 out-of-range violent deaths are all ascribed to car bombs. Yet the sampling protocols for the survey dictate that a cluster is composed of contiguous households within a residential area and there was no large car bombing of a residential area in July of 2006. There was a Baghdad car bomb in July of 2006 that killed more than 24 people but it was in a marketplace and we can comfortably rule out the possibility that 24 people from a single neighborhood all traveled from home to the market together and were killed there as a group. Yet it is even less likely that 24 people from one neighborhood were killed in multiple car bombings all around Baghdad in a single month.
I have data on monthly violent deaths by governorate for the @burnham2006 survey from the beginning of the war until the middle of 2006. Load this dataset:
```{r}
#Note that I will use "L2" as coding shorthand for Burnham et al. (2006),
#the second of two surveys of war deaths in Iraq published in the Lancet.
L2 <- read.csv("L2deaths_by_governorate_no_garbage.csv")
```
Aggregate by governorate to get national figures:
```{r}
L2_national <- L2 %>%
  mutate(sumrow = Anbar + Basrah + Qadissiya + Sulaymaniyah + Babylon +
           Baghdad + Thi.Qar + Diyala + Erbil + Kerbala + Tameem + Missan +
           Ninewa + Wasitt + Najaf + Saley.Al.Din)
L2_monthly <- L2_national$sumrow[1:47]
#Check to make sure that there are 302 violent deaths, as there should be
#(based on what it says in the paper)
sum(L2_monthly)
```
Some deaths are recorded only with years and governorates, not months. They sit in positions 44 - 47 of the L2_monthly vector, which have years but not months attached. Spread these deaths evenly over the years they are recorded for. Also, eliminate January and February of 2003 so that the series starts in March of 2003, just like the other series:
```{r}
allocation_2003 <- rep(L2_monthly[44]/10, 10)
allocation_2004 <- rep(L2_monthly[45]/12, 12)
allocation_2005 <- rep(L2_monthly[46]/12, 12)
allocation_2006 <- rep(L2_monthly[47]/7, 7)
allocation <- c(allocation_2003, allocation_2004,
allocation_2005, allocation_2006)
#Now we add in the deaths that only have years and locations while also eliminating January and February of 2003 so that the series begins in March of 2003.
L2_monthly <- L2_monthly[3:43]+allocation
#Let's check to make sure we have 300 in-sample violent deaths as there should be. (The paper says that 2 of the violent deaths in the sample occurred before the war began)
sum(L2_monthly)
```
For consistency, we take our population figure from @iraqpop, just as we did with the @roberts2004 survey. We get a 2005 population of 28,698,684, somewhat higher than the figure of 27,139,584 used in the paper. Now calculate the multiplier for converting in-sample deaths into estimated deaths, using the 12,801 individuals covered by the sample, integrate it into the monthly series and create a cumulative monthly series.
```{r}
multiplier_L2 <- 28698684/12801
L2_monthly <- L2_monthly*multiplier_L2
L2_cumulative <- cumsum(L2_monthly)
L2_cumulative <- c(L2_cumulative, rep(NA,length(IBC_monthly) -
length(L2_monthly)))
#check that the length is correct
length(L2_cumulative)
```
I do have data on violent deaths by cluster (included in the .zip file). However, these data include 2 violent deaths recorded as taking place in February of 2003 without any cluster information, so I cannot remove them specifically. Instead, I remove 1 violent death from each of 2 clusters listed as having 1 death. Obtain a 95% uncertainty interval through bootstrapping:
```{r}
L2_clusters <- read.csv("L2_violent deaths_by cluster.csv")
L2_clusters$Deaths
#Replace two of the clusters that had 1 violent death with clusters that had
#0 violent deaths to take care of the fact that there are 2 violent deaths
#before March of 2003 but we don't know which clusters they are in.
L2_clusters$Deaths[12:13] <- c(0,0)
#Check to make sure this worked
L2_clusters$Deaths
#There should be 47 clusters
length(L2_clusters$Deaths)
```
Now run the bootstrap procedure to construct 95% uncertainty intervals and also calculate the minimum estimate across all of the 10,000 bootstrap simulations.
```{r}
set.seed(1)
results_L2 <- c()
for (i in 1:10000) {
  results_L2 <- c(results_L2,
                  sum(sample(L2_clusters$Deaths, 47, replace = TRUE)))
}
quantile(multiplier_L2*results_L2, probs = c(0.025, 0.975))
min(multiplier_L2*results_L2)
```
Store the low and high values for the 95% uncertainty interval.[^5]
[^5]: The official published central estimate for violent deaths is 601,027 violent deaths with a 95% uncertainty interval of 426,369 to 793,663. This estimate is lower than mine because it relies on population figures published in @burnham2006 that are lower than the UN data that I use for all my survey-based calculations so as to keep them all on a comparable basis.
```{r}
L2_low <- c(rep(0,40),440000, rep(NA, length(IBC_monthly)-41))
L2_high <- c(rep(0,40),930000, rep(NA, length(IBC_monthly)-41))
length(L2_low)
length(L2_high)
```
I cannot create month-by-month uncertainty intervals because I do not have month-by-month-deaths-by-cluster data. The 95% uncertainty interval of 440,000 to 930,000 for March 2003 through July 2006 is very wide but not so wide as to be meaningless as was the case with @roberts2004. I will also use the figure of 270,000 which was the lowest estimate across all the 10,000 bootstrap estimates. This gives a true bootstrap-based rock-bottom number that is (just barely) compatible with the @burnham2006 data.
@rosenblum2009 proposes a different method for calculating uncertainty intervals which, they show, is more appropriate than bootstrapping for samples that are small and heavily skewed. They apply their method to the @burnham2006 data and calculate a 95% uncertainty interval, rounded to the nearest 10,000, of 290,000 to 990,000.
```{r}
L2_low_rosenblum <- c(rep(0,40),290000,
rep(NA, length(IBC_monthly)-41))
L2_high_rosenblum <- c(rep(0,40),990000,
rep(NA, length(IBC_monthly)-41))
length(L2_low_rosenblum)
length(L2_high_rosenblum)
```
## The Iraq Family Health Survey (IFHS)
The IFHS (@violence2008), like @burnham2006, covers the period from the beginning of the war through the first half of 2006. Unfortunately, the IFHS team has (also) not released its raw data, so all that follows is based on information supplied in the published paper.
The IFHS estimates daily death rates of 123 for March 2003 through April 2004, 115 for May 2004 through May 2005 and 126 for June 2005 through June 2006. Translate these into estimates of total deaths for each of these periods:
```{r}
#The next few lines of code are just for counting days.
date_1 <- as.Date("2003-03-20")
date_2 <- as.Date("2004-05-01")
date_3 <- as.Date("2005-06-01")
date_4 <- as.Date("2006-07-01")
IFHS1 <- 123*as.numeric(difftime(date_2, date_1, units = "days"))
IFHS2 <- 115*as.numeric(difftime(date_3, date_2, units = "days"))
IFHS3 <- 126*as.numeric(difftime(date_4, date_3, units = "days"))
IFHS1
IFHS2
IFHS3
IFHS1 + IFHS2 + IFHS3
```
The IFHS used population figures (see the appendix to the paper) that are very close to the figures in @iraqpop so it is tempting to just ignore the tiny differences and use the numbers in the IFHS. However, to maintain the greatest possible consistency with all the other figures used in the paper, I adjust the IFHS figures slightly upward by the ratio of the @iraqpop figures to the IFHS ones:
```{r}
macro_trends_pop_by_year <- c(27068823, 27858948,
                              28698684, 28905607)
IFHS_pop_by_year <- c(26388068, 27139584, 28127117, 29114649)
factor <- sum(macro_trends_pop_by_year)/sum(IFHS_pop_by_year)
IFHS1 <- IFHS1*factor
IFHS2 <- IFHS2*factor
IFHS3 <- IFHS3*factor
IFHS1
IFHS2
IFHS3
IFHS1 + IFHS2 + IFHS3
```
This population adjustment adds fewer than 3,000 deaths to the previous estimate, bringing it almost up to the official IFHS central estimate for violent deaths of 151,000.
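The size of this adjustment can be verified standalone (the period lengths of 408, 396 and 395 days follow from the dates defined above):

```{r}
#Standalone check that the population adjustment adds fewer than 3,000
#deaths: (ratio - 1) times the unadjusted total.
macro_pop <- c(27068823, 27858948, 28698684, 28905607)
ifhs_pop  <- c(26388068, 27139584, 28127117, 29114649)
pop_ratio <- sum(macro_pop)/sum(ifhs_pop)
unadjusted <- 123*408 + 115*396 + 126*395
(pop_ratio - 1)*unadjusted   #about 2,300 added deaths
```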
Create a cumulative series which increases three times (for each of the three periods for which the IFHS provides information) and is, otherwise, flat:
```{r}
IFHS_cumulative <- c(rep(0, 13), rep(IFHS1, 13), rep(IFHS1 + IFHS2, 13),
                     IFHS1 + IFHS2 + IFHS3,
                     rep(NA, length(IBC_monthly) - 13 - 13 - 13 - 1))
length(IFHS_cumulative)
```
It is crucial to understand that the above estimates incorporate two large upward adjustments; one of these is questionable and the other is highly dubious. The latter, aptly described as an "arbitrary fudge" (@bohannon2008), increases the estimate by more than 50% based on an unsupported claim that the IFHS failed to detect many deaths that were present in the sample. Note that none of the other surveys applied such an adjustment.
The second adjustment, which I will call the "missing cluster adjustment", has, at least, a plausible rationale: field teams in Baghdad and Anbar governorates were unable to visit some clusters due to safety concerns, and these clusters may have been more violent than the visited ones *over the entire course of the war*. The italicized text is important: while it is safe to assume that the unvisited clusters were more violent than the visited ones while the survey was conducted - this excess violence was the reason the former clusters were not visited - there is no reason to assume that the unvisited clusters were more violent than the visited ones for the entire war.

But the main problem with the missing cluster adjustment is the extravagant imputation to the unvisited Baghdad clusters of violence rates fully 4 times higher than the measured rate in the visited ones; the IFHS really makes a meal of the missing cluster adjustment. The IFHS justifies this choice on the grounds that applying the adjustment brings the ratio of violence in Baghdad to violence in a basket of relatively peaceful governorates into equality with the same ratio for IBC. However, IBC's ratio is probably higher than the true one because IBC's focus on media sources is better suited to capturing deaths in Baghdad than to capturing deaths elsewhere. In addition, there is probably a flow of dead bodies from outside Baghdad into the Baghdad morgue, where they are recorded by IBC, incorrectly, as Baghdad deaths.
The missing-cluster adjustment for Anbar is implemented in the same way as the Baghdad one: the IFHS ratio of Anbar violence to violence in a basket of less violent governorates is brought into line with the analogous IBC ratio. However, this adjustment strikes me as more plausible than the Baghdad one and, in any case, has little effect on the final estimate.
My paper presents IFHS estimates both with and without the two above adjustments. The following calculation eliminates them but the reader should bear in mind that I work both with and without the adjustments.
```{r}
#Deflate by the arbitrary fudge. The paper declares that the violence data is 62% complete.
IFHS_cumulative_nofudge <- IFHS_cumulative*0.62
#Deflate by the missing cluster adjustments
#Table 1 in the appendix to the paper gives the following information: the
#IFHS-measured violent death rate in Baghdad is 1.30 per 1,000 per year; the
#IFHS-measured violent death rate in Anbar is 0.98 per 1,000 per year; the
#multiplier used to increase the Baghdad estimate is 1.980 and the multiplier
#used to increase the Anbar estimate is 1.428.
#Table 2 in the appendix gives a population of 6,962,650 for Baghdad,
#1,431,717 for Anbar and 29,114,649 for all of Iraq.
Baghdad_added_deaths <- (1.30*(1.980-1)*6962650/29114649)*(29114649/1000)*
as.numeric(difftime(date_4, date_1, units = "days"))/365
Anbar_added_deaths <- (0.98*(1.428-1)*1431717/29114649)*(29114649/1000)*
as.numeric(difftime(date_4, date_1, units = "days"))/365
Baghdad_added_deaths
Anbar_added_deaths
Baghdad_added_deaths + Anbar_added_deaths
missing_cluster_factor <- 1 - (Baghdad_added_deaths + Anbar_added_deaths)/153000
missing_cluster_factor
#The missing cluster adjustment adds 31,111 deaths to the final estimate but this assumes that the arbitrary fudge is still in. Removing the fudge and the missing cluster adjustment simultaneously gives:
IFHS_cumulative_nofudge_nocluster <- IFHS_cumulative_nofudge * missing_cluster_factor
IFHS_cumulative_nofudge_nocluster[40]
length(IFHS_cumulative_nofudge_nocluster)
```
Removing the arbitrary fudge and the missing cluster adjustment cuts the central estimates for violent deaths in the IFHS roughly in half, from 150,000 to 73,000.
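The rough halving can be seen from the two deflation factors alone (a sketch using the 31,111 missing-cluster deaths computed in the chunk above):

```{r}
#Combined deflation from removing both adjustments: the completeness
#figure (0.62) times the missing-cluster factor.
0.62*(1 - 31111/153000)   #about 0.49, i.e., roughly half of the estimate survives
```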
The IFHS gives lower and upper limits of 104,000 and 223,000, respectively, for its 95% uncertainty interval for violent deaths over the whole period it covers. I will, perhaps pedantically and unnecessarily, adjust these slightly upward to reflect the national population figures I am using.
```{r}
IFHS_low <- c(rep(0,39),104000*factor,
rep(NA, length(IBC_monthly)-40))
IFHS_high <- c(rep(0,39),223000*factor,
rep(NA, length(IBC_monthly)-40))
length(IFHS_low)
length(IFHS_high)
```
Table 4 of the IFHS paper gives a 95% uncertainty interval around the daily death rate for March 2003 through the end of April 2004 running from 68 to 271. This allows me to calculate a 95% uncertainty interval for total deaths for the period.
```{r}
IFHS1_low <- 68*factor*as.numeric(difftime(date_2,
date_1, units = "days"))
IFHS1_high <- 271*factor*as.numeric(difftime(date_2,
date_1, units = "days"))
```
I build this uncertainty interval, covering the first 14 months of the war, into the vector that contains the uncertainty interval for the whole period covered by the IFHS and, again, place it within a long vector that will, ultimately, be embedded in a large dataframe to be used for generating the pictures.
```{r}
IFHS_low <- replace(IFHS_low, 14, IFHS1_low)
IFHS_high <- replace(IFHS_high, 14, IFHS1_high)
length(IFHS_low)
length(IFHS_high)
```
The IFHS paper does give a 95% uncertainty interval for the middle period, May 2004 through May 2005. However, I work with cumulative figures and choose not to attempt an uncertainty interval for the first two periods combined, as I would have to invoke some strong assumptions in order to simulate one.
## The ORB Survey
More than a decade ago Opinion Research Business (ORB) posted on their website data tables for their 2007 survey of Iraq. @spagat2010 used these in their analysis of the ORB survey. These tables have since been taken down, and the website of ORB International (@orbintea) does not mention the survey. Thus, the ORB survey is unique among our sources for its lack of online presence and support by its parent organization. I do, nevertheless, include it in my paper, although my only option is to copy the central estimate and 95% uncertainty interval that ORB once announced in a press release: 1,033,000, with a 95% uncertainty interval of 946,000 to 1,120,000, covering the period from the beginning of the war through August of 2007. Arguably, I should omit ORB's claimed uncertainty interval, since ORB calculated it as if they had drawn a simple random sample when, in fact, they had a cluster survey (@spagat2010). But I include both as a record of what ORB actually claimed and to give the survey visibility in figure 1 (below) beyond what it would have if it were represented merely by a point.
```{r}
ORB <- c(rep(0,54), 1033000, rep(NA,length(IBC_monthly) -55))
ORB_low <- c(rep(0,54), 946000, rep(NA,length(IBC_monthly) -55))
ORB_high <- c(rep(0,54), 1120000,
rep(NA,length(IBC_monthly) -55))
length(ORB)
length(ORB_low)
length(ORB_high)
```
## University Collaborative Iraq Mortality Study (UCIMS)
The UCIMS, written up in @hagopian2013a, provides data that can be used for two qualitatively different types of violent-death estimates for the period running from the beginning of the war through the middle of 2011. The first type is the standard household-based estimate based on deaths reported to have occurred within each *household* selected into the survey. All the other survey-based estimates in the present paper are of this type. The second type of estimate begins from the same sample of households drawn for the household-based estimates but, instead, works with reported deaths of the *siblings* of respondents to the survey rather than household members.[^6] Below I make both types of estimates.
[^6]: There are, generally, several people interviewed within each household about their siblings.
I stress that the UCIMS is the only survey covered in this paper that has released a proper data set.[^7] Consequently, I can do more with the UCIMS than I am able to do with the other surveys.
[^7]: @roberts2004 and ORB released scraps of data, @burnham2006 released a bit more than scraps but only to some people deemed reliable (@kaiser2007) and the IFHS and ILCS did not release any raw data.
### Household-based estimates
#### Estimates that include all reported violent deaths
First, load a file that contains a list of all the within-household deaths recorded by the survey and retain only the ones that are flagged as violent war-related deaths:
```{r}
UCIMS_deaths <- read.csv("hh_deaths.CSV")
UCIMS_deaths_war <- filter(UCIMS_deaths, war_death == "Y")
```
Next, create a time series giving the number of deaths each month:
```{r}
UCIMS_deaths_war_monthly <- c()
for (i in 0:101) {
  UCIMS_deaths_war_monthly <- c(UCIMS_deaths_war_monthly,
                                nrow(filter(UCIMS_deaths_war,
                                            yod == as.integer(2003 + i/12) &
                                              mod == i%%12 + 1)))
}
#The above formulation looks a bit confusing but all it does is count the
#number of violent deaths listed for January of 2003, then for February of
#2003, and so on through June of 2011.
#Here are some quality checks
nrow(UCIMS_deaths_war) #There are 76 violent deaths in the data set
sum(UCIMS_deaths_war_monthly) #The above procedure only picked up 73 out of the 76 deaths
length(UCIMS_deaths_war_monthly)
```
The monthly figures are missing three deaths that are listed with known years but unknown months. Two of these incompletely specified violent deaths are in 2006 and one is in 2009.[^8] Allocate these deaths evenly over the years within which they occur:
[^8]: A curious fact is that these deaths with unknown months are also listed as deaths for which interviewers viewed death certificates.
```{r}
UCIMS_deaths_war_monthly<- UCIMS_deaths_war_monthly +
c(rep(0, 36),rep(2/12, 12), rep(0, 24), rep(1/12, 12),
rep(0,length(UCIMS_deaths_war_monthly)- 84))
sum(UCIMS_deaths_war_monthly)
#Now we have all 76 deaths in the monthly time series as there should be.
length(UCIMS_deaths_war_monthly)
```
According to the @hagopian2013 paper, there were a total of 10,670 people in the sample. @iraqpop gives a population figure of 28,660,887 for 2007, the midpoint of the period covered by the survey. Use this information to calculate the multiplier for converting recorded violent deaths into estimated violent deaths:
```{r}
multiplier_UCIMS <- 28660887/10670
multiplier_UCIMS
```
Embed this multiplier into the monthly time series, delete the months of January and February, 2003, create a cumulative monthly series for violent deaths and extend its length to match the full time window covered in the present paper:
```{r}
UCIMS_deaths_war_monthly_est <-
UCIMS_deaths_war_monthly[3:length(UCIMS_deaths_war_monthly)]*
multiplier_UCIMS
UCIMS_cumulative <- cumsum(UCIMS_deaths_war_monthly_est)
UCIMS_cumulative <- c(UCIMS_cumulative,
rep(NA, length(IBC_extended_cumulative)-
length(UCIMS_cumulative)))
UCIMS_cumulative
length(UCIMS_cumulative)
```
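The NA-padding step above follows a pattern used throughout this document: extend a shorter series with NA so that every source spans the same number of months and the series can sit side by side in one data frame. A minimal sketch with made-up numbers:

```{r}
#Pad a short cumulative series out to a common length with NA.
short_series <- c(10, 25, 40)
target_length <- 6
padded <- c(short_series, rep(NA, target_length - length(short_series)))
padded  #10 25 40 NA NA NA
```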
I use bootstrapping to surround each central estimate for cumulative violent deaths with a 95% uncertainty interval for every month covered by the UCIMS, i.e., from March 2003 through June 2011. But first some further data preparation is needed.
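The idea of the cluster bootstrap can be sketched on made-up data before tackling the real preparation: resample whole clusters with replacement, recompute the total each time, and take the 2.5th and 97.5th percentiles of the replicates as the 95% interval:

```{r}
#Toy cluster bootstrap on hypothetical cluster totals, not UCIMS data.
set.seed(1)
toy_cluster_totals <- c(2, 0, 5, 1, 3)
toy_reps <- replicate(10000,
                      sum(sample(toy_cluster_totals,
                                 length(toy_cluster_totals),
                                 replace = TRUE)))
quantile(toy_reps, probs = c(0.025, 0.975))
```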
Break violent deaths down by cluster (of which there are 100) and month, remembering that three violent deaths are NA for month of death. As before, set each of these incompletely specified deaths equal to 1/12 in every month of its year so as to spread it evenly over the whole year:
```{r}
deaths_by_cluster_year_month <- c()
for (i in c(2003:2011)) {
for (j in c(1:12)) {
for (k in c(1:100)) {
temp <- UCIMS_deaths$yod == i &
UCIMS_deaths$mod == j &
UCIMS_deaths$cluster == k &
UCIMS_deaths$war_death == "Y"
temp[is.na(temp)] <- 1/12
deaths_by_cluster_year_month <- c(deaths_by_cluster_year_month,
(sum(temp)
))
}
}
}
#Check to make sure there are a total of 76 violent deaths
sum(deaths_by_cluster_year_month)
```
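The `temp[is.na(temp)] <- 1/12` line deserves a word of explanation. In R, `TRUE & NA` is `NA` while `FALSE & NA` is `FALSE`, so the conjunction comes out `NA` only when a death's month is missing but every other condition matches; replacing those `NA`s with 1/12 therefore spreads each month-less death evenly over its year's twelve candidate months. A minimal sketch with hypothetical values:

```{r}
#Second (hypothetical) death has a known cluster but an unknown month.
mod <- c(3, NA)
cluster <- c(7, 7)
temp <- mod == 3 & cluster == 7
temp  #TRUE NA
temp[is.na(temp)] <- 1/12
sum(temp)  #1 + 1/12
```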
Retain only the months running from March 2003 through June 2011.
```{r}
#Get rid of the first 2 (pre-war) months
deaths_by_cluster_year_month <- deaths_by_cluster_year_month[-c(1:200)]
#The survey only covers half of 2011 so get rid of the last six months
deaths_by_cluster_year_month <- deaths_by_cluster_year_month[-c(10001:10600)]
#Check to make sure that the length of the vector is correct
length(deaths_by_cluster_year_month)
(8*12+4)*100
```
Reshape the vector into a data frame and transform the entries in this data frame into cumulative sums of violent deaths by month and cluster.
```{r}
df_bootstrapping <- as.data.frame(t(matrix(deaths_by_cluster_year_month,
ncol = 100)))
df_bootstrapping_cum <- sapply(df_bootstrapping, cumsum)
```
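To see what the reshape does, here is the same sequence on a tiny made-up vector with 2 months and 3 clusters (cluster index varying fastest, as above): `matrix()` fills column by column, so each column holds one month's clusters; `t()` then puts months in rows and clusters in columns, and `cumsum` runs down each cluster's column:

```{r}
#Toy reshape: 2 months x 3 clusters, cluster index varying fastest.
v <- c(1, 2, 3,   #month 1, clusters 1-3
       4, 5, 6)   #month 2, clusters 1-3
toy_df <- as.data.frame(t(matrix(v, ncol = 2)))
sapply(toy_df, cumsum)  #cluster 1 column reads 1 then 5, etc.
```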
Here is the code for doing the bootstraps and saving the data. WARNING - IT TAKES AS MUCH AS AN HOUR TO DO ALL OF THESE BOOTSTRAPS SO YOU MIGHT PREFER TO JUST LOAD UP THE RESULTS FILE THAT COMES WITH THE PACKAGE.
```{r}
#set.seed(1)
#bootstrapping_data <- c()
#for (i in c(1:100)) {
# for (j in c(1:10000)) {
# bootstrapping_data <- c(bootstrapping_data,
# sum(sample(df_bootstrapping_cum[i,], 100,
# replace = TRUE)))
# }
#}
#Check to make sure the length of the vector is right
#length(bootstrapping_data)
#10000*100
#df_bootstrap_results <- as.data.frame(matrix(bootstrapping_data,
# ncol = 100))
#write.csv(df_bootstrap_results,
#"Household_bootstrapping_results.csv")
```
If you did not run the bootstrapping code then load the data file that contains the results:
```{r}
df_bootstrap_results <- read.csv("Household_bootstrapping_results.csv")
```
Finally extract lower and upper limits for 95% confidence intervals for every month. Remember that these are based on cumulative figures so, of course, the intervals rise over time.
```{r}
temp_function <- function(x){
quantile(x, probs = c(0.025, 0.975))
}
intervals <- lapply(df_bootstrap_results[,2:101]*multiplier_UCIMS, temp_function)
#Note that the [,2:101] is there because the first column of the dataframe just counts row numbers (1,2,3,...) and we don't want to extract percentiles for this counting machine.
intervals
```
Store the results in a form that enables incorporation into the figures of the paper:
```{r}
df_intervals <- as.data.frame(t(as.data.frame(intervals)))
colnames(df_intervals) <- c("low", "high")
UCIMS_cumulative_low <- c(df_intervals$low,
rep(NA, length(IBC_extended_cumulative)-
length(df_intervals$low)))
UCIMS_cumulative_high <- c(df_intervals$high,
rep(NA, length(IBC_extended_cumulative)-
length(df_intervals$high)))
UCIMS_cumulative_low
UCIMS_cumulative_high
length(UCIMS_cumulative_low)
length(UCIMS_cumulative_high)
```
#### Including only violent deaths backed by death certificates
The above calculations yield a household-based UCIMS central estimate along with 95% uncertainty intervals for every month of the war up through the middle of 2011. The final estimate for the whole period is 200,000 violent deaths from the beginning of the war until the middle of 2011, with a 95% uncertainty interval running from around 140,000 to 270,000.[^9]
[^9]: @hagopian2013 did not really publish a household-based violent-death estimate. The paper did, however, estimate 405,000 excess deaths and state that 60% of these were violent deaths, implying an estimate of 240,000 violent deaths without an uncertainty interval.
However, the UCIMS required interviewers to ask to see death certificates whenever respondents reported a death. It turns out that, once prompted, respondents were often unable to produce the requested death certificates. Failure to back a reported death with a death certificate casts at least some doubt on that report. I therefore repeat the above analysis using just the subset of reported deaths that are backed by death certificates.
```{r}
UCIMS_deaths_cert <-
filter(UCIMS_deaths, war_death == "Y" &
death_cert == "able to see death cert")
UCIMS_deaths_war_monthly_cert <- c()
for (i in 0:101) {
UCIMS_deaths_war_monthly_cert <-
c(UCIMS_deaths_war_monthly_cert,
nrow(filter(UCIMS_deaths_cert, yod == as.integer(2003 + i/12) & mod == i%%12 + 1)))
}
#Let's do some quality checks
nrow(UCIMS_deaths_cert)
sum(UCIMS_deaths_war_monthly_cert)
length(UCIMS_deaths_war_monthly_cert)
```
There are only 51 reported violent deaths in the UCIMS survey that are backed by death certificates. These include the three reports, noted above, that are NA for month of death. I note in passing that it is surprising that any deaths are backed by death certificates yet have unknown months of death, because death certificates should include dates of death. We will, nevertheless, include these three deaths in our analysis.
```{r}
UCIMS_deaths_war_monthly_cert <- UCIMS_deaths_war_monthly_cert +
c(rep(0, 36),rep(2/12, 12), rep(0, 24), rep(1/12, 12),
rep(0,length(UCIMS_deaths_war_monthly_cert)- 84))
sum(UCIMS_deaths_war_monthly_cert)
#Now we have all 51 deaths in the monthly time series for violent deaths backed by death certificates.
```
Make monthly and cumulative central estimates for violent deaths:
```{r}
UCIMS_deaths_war_monthly_cert_est <-
  UCIMS_deaths_war_monthly_cert[3:length(UCIMS_deaths_war_monthly_cert)]*
  multiplier_UCIMS
UCIMS_cumulative_cert <- cumsum(UCIMS_deaths_war_monthly_cert_est)
UCIMS_cumulative_cert <- c(UCIMS_cumulative_cert,
rep(NA, length(IBC_extended_cumulative)-
length(UCIMS_cumulative_cert)))
UCIMS_cumulative_cert
length(UCIMS_cumulative_cert)
```
Requiring death-certificate backing reduces the central estimate for violent deaths over the whole period covered by the UCIMS to around 140,000.
Prepare the data for bootstrapping on just the violent deaths backed by death certificates.
```{r}
deaths_by_cluster_year_month_cert <- c()
for (i in c(2003:2011)) {
for (j in c(1:12)) {
for (k in c(1:100)) {
temp <- UCIMS_deaths$yod == i &
UCIMS_deaths$mod == j &
UCIMS_deaths$cluster == k &
UCIMS_deaths$war_death == "Y" &
UCIMS_deaths$death_cert == "able to see death cert"
temp[is.na(temp)] <- 1/12
deaths_by_cluster_year_month_cert <- c(deaths_by_cluster_year_month_cert,
(sum(temp)
))
}
}
}
#Check to make sure there are 51 violent deaths
sum(deaths_by_cluster_year_month_cert)
```
Retain only the during-war months covered by the survey:
```{r}
#Get rid of the first 2 (pre-war) months
deaths_by_cluster_year_month_cert <- deaths_by_cluster_year_month_cert[-c(1:200)]
#The survey only covers half of 2011 so get rid of the last six months
deaths_by_cluster_year_month_cert <- deaths_by_cluster_year_month_cert[-c(10001:10600)]
#Check the length of the vector
length(deaths_by_cluster_year_month_cert)
(8*12+4)*100
```
Reshape the data into a data frame of cumulative sums to facilitate bootstrapping.
```{r}
df_bootstrapping_cert <- as.data.frame(t(matrix(deaths_by_cluster_year_month_cert,
ncol = 100)))
df_bootstrapping_cert_cum <- sapply(df_bootstrapping_cert, cumsum)
```
Run the bootstraps. Again, this takes time so you might prefer to just use the prepared file of results:
```{r}
#set.seed(1)
#bootstrapping_data_cert <- c()
#for (i in c(1:100)) {
# for (j in c(1:10000)) {
# bootstrapping_data_cert <- c(bootstrapping_data_cert,
# sum(sample(df_bootstrapping_cert_cum[i,], 100,
# replace = TRUE)))
# }
#}
#Check to make sure the length of the vector is right
#length(bootstrapping_data_cert)
#10000*100
#df_bootstrap_results_cert <- as.data.frame(matrix(bootstrapping_data_cert,
#                                 ncol = 100))
#write.csv(df_bootstrap_results_cert,
#          "Household_bootstrapping_results_cert.csv")
```
You can read in the bootstrapping data without running the bootstraps if you like.
```{r}
df_bootstrap_results_cert <- read.csv("Household_bootstrapping_results_cert.csv")
```
Extract the lower and upper bounds for the uncertainty intervals.
```{r}
temp_function <- function(x){
quantile(x, probs = c(0.025, 0.975))
}
intervals_cert <- lapply(df_bootstrap_results_cert[,2:101]*multiplier_UCIMS, temp_function)
#Note that the [,2:101] is there because the first column of the dataframe just counts row numbers (1,2,3,...) and we don't want to extract percentiles for this counting machine.
intervals_cert
```
Store the results in a way that will work for the pictures.
```{r}
df_intervals_cert <- as.data.frame(t(as.data.frame(intervals_cert)))
colnames(df_intervals_cert) <- c("low", "high")
UCIMS_cumulative_cert_low <- c(df_intervals_cert$low,
rep(NA, length(IBC_extended_cumulative)-
length(df_intervals_cert$low)))
UCIMS_cumulative_cert_high <- c(df_intervals_cert$high,
rep(NA, length(IBC_extended_cumulative)-
length(df_intervals_cert$high)))
UCIMS_cumulative_cert_low
UCIMS_cumulative_cert_high
length(UCIMS_cumulative_cert_low)
length(UCIMS_cumulative_cert_high)
```
### Sibling-based estimates
For the sibling-based estimates the sample covers all people interviewed plus their siblings. The household method, employed by all of the surveys above (although implemented poorly by ORB; see my paper for details), partitions the population into households of people living under a single roof. In contrast, the sibling method partitions the population into "sibships", i.e., groups of brothers and sisters.
As a first step toward producing a violent-death estimate, we need to figure out how many of these people were alive at the beginning of the war, i.e., in March of 2003. We cannot do this exactly because we only have monthly data and the war began on the 21st of March, 2003. We will, therefore, assume that anyone alive in March of 2003 was alive at the beginning of the war.[^10] The following code shows that there are 27,739 people covered by the survey:
[^10]: This assumption introduces a slight downward bias to my sibling-based violent-death estimate because it inflates our denominator to the extent that people covered by the sample died between March 1 and March 20, 2003.
```{r}
sibdata <- read.csv("IHME_IRAQ_MORTALITY_STUDY_2001_2011_SIBLINGS.CSV")
nrow(sibdata)
#There are 29,515 rows
#Let's find the number of people alive when the war started
nrow(filter(sibdata, yod > "2002" | is.na(yod)))
#27,749 were alive at the beginning of 2003. But we need to get rid of January and February of 2003.
sibdata_start_post2003 <- filter(sibdata, yod > "2003" | is.na(yod))
sibdata_start_2003 <- filter(sibdata, yod == "2003" & mod > 2)
sibdata_start <- rbind(sibdata_start_2003, sibdata_start_post2003)
nrow(sibdata_start_2003)
nrow(sibdata_start_post2003)
nrow(sibdata_start)
#Now we have all 27,739 people who were alive when the war started, or at least in March of 2003. Given that only 63 people died in the first two months of 2003, it is unlikely that a large number of people died during the first 3 weeks of March 2003.
```
The following bar plot shows that the age distribution of the sample in 2003 is far from representative of the age distribution for the population. In fact, the age distribution within the sample does not even decrease monotonically with age:
```{r}
barplot(table(2003 -sibdata_start$yob))
#Of course, the negative values in the barplot represent people who had not yet been born in 2003.
```
We need, therefore, to create weights to offset this problem of non-representativeness. The idea behind the weighting scheme is that an in-sample death within an age group that is under-represented in the sample, compared to its presence within the full population, should generate more estimated deaths than an in-sample death within an over-represented age group and vice versa.
Begin the construction of the weighting scheme by breaking the in-sample population down into the age groups 0-4, 5-9, 10-14,...,80+
```{r}
sibling_pop_pyramid <- c()
for (i in c(1:18)) {
sibling_pop_pyramid <- c(sibling_pop_pyramid,
sum(2003 - sibdata_start$yob < 5*i))
}
sibling_pop_pyramid <- diff(sibling_pop_pyramid)
sibling_pop_pyramid_fractions <-
sibling_pop_pyramid/sum(sibling_pop_pyramid)
sibling_pop_pyramid_fractions
```
Next, use a population file provided by the UCIMS that breaks the whole population of Iraq into the same age groups we just used for the sample:
```{r}
population <- read.csv("IHME_IRAQ_MORTALITY_STUDY_2001_2011_POPULATION.CSV")
population_2003 <- filter(population, year == 2003)
Iraq_population_2003 <- c()
for (i in c(0:16)) {
Iraq_population_2003 <- c(Iraq_population_2003,
sum(filter(population_2003,age == i*5)$population))
}
Iraq_population_2003_fractions <-
Iraq_population_2003/sum(Iraq_population_2003)
Iraq_population_2003_fractions
```
Next, calculate the weights which are just the ratios of the population fractions to the sample fractions:
```{r}
pop_weights <-
Iraq_population_2003_fractions/sibling_pop_pyramid_fractions
pop_weights
#Note - there is an extravagant weight of 145 on the 80+ category but this does not matter since the UCIMS recorded no violent deaths in this category.
```
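The logic of these ratio weights can be seen with made-up fractions: an age group holding 10% of the population but only 5% of the sample gets weight 2, so each of its in-sample deaths counts double, while an over-represented group gets a weight below 1:

```{r}
#Hypothetical fractions, not the UCIMS numbers.
toy_pop_fractions    <- c(0.10, 0.90)
toy_sample_fractions <- c(0.05, 0.95)
toy_pop_fractions/toy_sample_fractions  #2.000 0.947
```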
Extract just the war deaths from the sibling file and attach ages in 2003 to all of them:
```{r}
UCIMS_war_deaths <- filter(sibdata_start, war_death == "Y")
age_2003 <- c()
for (i in c(1:length(UCIMS_war_deaths$id))) {
age_2003 <- c(age_2003,
2003 - sibdata[sibdata$id ==
UCIMS_war_deaths[i,1] &
sibdata$sibid == UCIMS_war_deaths[i,2],]$yob)
}
UCIMS_war_deaths <- mutate(UCIMS_war_deaths, age_2003)
age_2003
```
Add a new variable to the data frame for deaths multiplied by population weights:
```{r}
weighted_deaths <- pop_weights[ceiling(UCIMS_war_deaths$age_2003/5)]
UCIMS_war_deaths <- mutate(UCIMS_war_deaths, weighted_deaths)
```
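The indexing trick above works because `ceiling(age_2003/5)` maps ages 1-5 to group 1, 6-10 to group 2, and so on, picking the matching entry straight out of the weight vector. A quick check with a hypothetical weight vector:

```{r}
#Hypothetical weights for the first three age groups.
toy_weights <- c(1.2, 0.8, 1.0)
toy_ages <- c(3, 5, 6, 11)
ceiling(toy_ages/5)              #1 1 2 3
toy_weights[ceiling(toy_ages/5)] #1.2 1.2 0.8 1.0
```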
Calculate the multiplier for converting in-sample deaths into estimated deaths and make a sibling-based central estimate for violent deaths over the whole period covered by the UCIMS:
```{r}
sibling_multiplier <- 28905607/nrow(sibdata_start)
sum(UCIMS_war_deaths$weighted_deaths)*sibling_multiplier
sum(UCIMS_war_deaths$weighted_deaths)
```
Note that @hagopian2013 did publish a sibling-based violent-death estimate of 132,000 with a 95% uncertainty interval of 89,000 to 174,000. My estimate is a bit higher but close to this published one.
Break weighted deaths down by month and year while remembering that some deaths are NA for month:
```{r}
weighted_sibling_deaths_by_year_month <- c()
for (i in c(2003:2011)) {
for (j in c(1:12)) {
temp <- UCIMS_war_deaths$yod == i &
UCIMS_war_deaths$mod == j
temp[is.na(temp)] <- 1/12
temp <- temp*UCIMS_war_deaths$weighted_deaths
weighted_sibling_deaths_by_year_month <-
c(weighted_sibling_deaths_by_year_month,
(sum(temp)))
}
}
#Note - this code would need to be modified if there were deaths that are NA for month in either 2003 or 2011 because neither of these are full years for the estimate. However, there are no such deaths in either of those years so this code is fine.
sum(weighted_sibling_deaths_by_year_month)
weighted_sibling_deaths_by_year_month
```
Get rid of January and February of 2003 and the last six months of 2011. However, there is one death listed for August of 2011, which is two months after the end of the period covered by the survey. I assume that this death did occur during the coverage period but was incorrectly coded, so I assign it to June of 2011.[^11]
[^11]: It seems much more likely that this death really was reported but coded wrong than that it was not reported at all but, nevertheless, entered into the data set with an impossible date.
```{r}
length(weighted_sibling_deaths_by_year_month)
weighted_sibling_deaths_by_year_month <-
weighted_sibling_deaths_by_year_month[-c(1:2)]
weighted_sibling_deaths_by_year_month[100] <-
weighted_sibling_deaths_by_year_month[100] +
weighted_sibling_deaths_by_year_month[102]
weighted_sibling_deaths_by_year_month <-
weighted_sibling_deaths_by_year_month[-c(101:106)]
length(weighted_sibling_deaths_by_year_month)
10+12*7+6
sum(weighted_sibling_deaths_by_year_month)
```
Convert the sample death numbers into estimated deaths, calculate the cumulative series and store it in a way that will facilitate pictures.
```{r}
weighted_sibling_series <- weighted_sibling_deaths_by_year_month*sibling_multiplier
weighted_sibling_cumulative <- cumsum(weighted_sibling_series)
sibling_cumulative <- c(weighted_sibling_cumulative,
rep(NA, length(IBC_extended_cumulative)-
length(weighted_sibling_cumulative)))
sibling_cumulative
length(sibling_cumulative)
```
We move now to bootstrapping, which begins, as usual, with breaking down deaths (in this case weighted deaths) by month and cluster. Inspection of the data reveals that, strangely, there are deaths in supposedly nonexistent clusters 109 and 110 but no deaths in clusters 99 and 100. We therefore simplify the code by recoding clusters 109 and 110 as 99 and 100:
```{r}
UCIMS_war_deaths$cluster[UCIMS_war_deaths$cluster == 109] <- 99
UCIMS_war_deaths$cluster[UCIMS_war_deaths$cluster == 110] <- 100
```
Break the deaths down by year, month and cluster:
```{r}
weighted_sibling_deaths_by_year_month_cluster <- c()
for (i in c(2003:2011)) {
for (j in c(1:12)) {
for (k in c(1:100)) {
temp <- UCIMS_war_deaths$yod == i &
UCIMS_war_deaths$mod == j &
UCIMS_war_deaths$cluster == k
temp[is.na(temp)] <- 1/12
temp <- temp*UCIMS_war_deaths$weighted_deaths
weighted_sibling_deaths_by_year_month_cluster <-
c(weighted_sibling_deaths_by_year_month_cluster,
(sum(temp)))
}
}
}
sum(weighted_sibling_deaths_by_year_month)
sum(weighted_sibling_deaths_by_year_month_cluster)
#The sums are consistent
```
Facilitate bootstrapping by creating a data frame with the year-month-cluster data and making the death numbers for each cluster cumulative:
```{r}
weighted_sibling_df_bootstrapping <-
as.data.frame(t(matrix(weighted_sibling_deaths_by_year_month_cluster,
ncol = 100)))
weighted_sibling_df_bootstrapping_cum <-
sapply(weighted_sibling_df_bootstrapping, cumsum)
```
Here is the bootstrapping code which, again, takes time and which you do not have to run if you prefer to just load the data:
```{r}
#set.seed(1)
#weighted_sibling_bootstrapping_data <- c()
#for (i in c(1:100)) {
# for (j in c(1:10000)) {
# weighted_sibling_bootstrapping_data <-
# c(weighted_sibling_bootstrapping_data,
# sum(sample(weighted_sibling_df_bootstrapping_cum[i,],
#                 100, replace = TRUE)))
# }
#}
```
Do a quick quality check and save the data:
```{r}
#Check to make sure the length of the vector is right
#length(weighted_sibling_bootstrapping_data)
#10000*100
#weighted_sibling_df_bootstrap_results <-
#  as.data.frame(matrix(weighted_sibling_bootstrapping_data, ncol = 100))
#write.csv(weighted_sibling_df_bootstrap_results,
# "weighted_sibling_bootstrap_results.csv")
```
Now load the data if you have not done the bootstraps:
```{r}
weighted_sibling_df_bootstrap_results <-
read.csv("weighted_sibling_bootstrap_results.csv")
```
Convert raw weighted deaths into estimated deaths:
```{r}
weighted_sibling_df_bootstrap_results <-
weighted_sibling_df_bootstrap_results*sibling_multiplier
```
Extract the upper and lower limits of the 95% confidence intervals from the bootstrapping:
```{r}
temp_function <- function(x){
quantile(x, probs = c(0.025, 0.975))
}
weighted_sibling_intervals <- lapply(weighted_sibling_df_bootstrap_results,
temp_function)
weighted_sibling_df_intervals <-
as.data.frame(t(as.data.frame(weighted_sibling_intervals)))
weighted_sibling_df_intervals <- weighted_sibling_df_intervals[2:101,]
colnames(weighted_sibling_df_intervals) <- c("low", "high")
weighted_sibling_intervals
```
Render the bootstrapping results in a form that will support the figures:
```{r}
sibling_cumulative_low <- c(weighted_sibling_df_intervals$low,
rep(NA, length(IBC_extended_cumulative)-
length(weighted_sibling_df_intervals$low)))
sibling_cumulative_high <- c(weighted_sibling_df_intervals$high,
rep(NA, length(IBC_extended_cumulative)-
length(weighted_sibling_df_intervals$high)))
length(sibling_cumulative_low)
length(sibling_cumulative_high)
```
## Put it all together
Collect all the processed data into a single grand vector:
```{r}
grand_vector <- c(IBC_extended_cumulative,
ILCS, ILCS_low, ILCS_high,
L1, L1_generous_low, L1_generous_high,
L2_cumulative, L2_low, L2_high,
L2_low_rosenblum, L2_high_rosenblum,
IFHS_cumulative, IFHS_low, IFHS_high,
IFHS_cumulative_nofudge_nocluster,
ORB, ORB_low, ORB_high,
UCIMS_cumulative,
UCIMS_cumulative_low,
UCIMS_cumulative_high,
UCIMS_cumulative_cert,
UCIMS_cumulative_cert_low,
UCIMS_cumulative_cert_high,
sibling_cumulative,
sibling_cumulative_low,
sibling_cumulative_high
)
length(grand_vector)/28
```
```{r}
all_data <- as.data.frame((matrix(grand_vector, ncol = 28 )))%>%
mutate(month = c(1:length(IBC_extended_cumulative)))%>%
mutate(date = seq(as.Date("2003/3/1"), by = "month",
length.out = length(IBC_extended_cumulative)))%>%
rename(IBC = V1, ILCS = V2, ILCS_low = V3,
ILCS_high = V4, L1 = V5, L1_low = V6,
L1_high = V7, L2 = V8, L2_low = V9,
L2_high = V10, L2_low_rosenblum = V11,
L2_high_rosenblum =V12,
IFHS = V13, IFHS_low = V14,
IFHS_high = V15, IFHS_nofudge = V16,
ORB = V17, ORB_low = V18,
ORB_high = V19, UCIMS = V20,
UCIMS_low = V21, UCIMS_high = V22,
UCIMS_cert = V23, UCIMS_cert_low = V24,
UCIMS_cert_high = V25, sibling = V26,
sibling_low = V27, sibling_high = V28
)
View(all_data)
```
Save the data:
```{r}
write.csv(all_data, "all_data.csv")
```
## Tables
Load the gt package for making tables:
```{r}
library(gt)
```
Create a data frame giving cumulative death numbers covering the period from the beginning of the war through April of 2004 for all sources from which these numbers can be extracted. When possible, include lower and upper bounds for 95% uncertainty intervals. Also, when possible, give the maximum value obtained in 10,000 bootstrap simulations.
```{r}
low_2004 <- c(NA, ILCS_low[14], NA, IFHS_low[14], NA,
UCIMS_cumulative_low[14],
UCIMS_cumulative_cert_low[14],
sibling_cumulative_low[14])
central_2004 <- c(IBC_extended_cumulative[14], ILCS[14],
L2_cumulative[14], IFHS_cumulative[14],
IFHS_cumulative_nofudge_nocluster[14],
UCIMS_cumulative[14],
UCIMS_cumulative_cert[14],
sibling_cumulative[14])
high_2004 <- c(NA, ILCS_high[14], NA, IFHS_high[14], NA,
UCIMS_cumulative_high[14],
UCIMS_cumulative_cert_high[14],
sibling_cumulative_high[14])
max_2004 <- c(NA, NA, NA, NA, NA,
max(df_bootstrap_results[,14]*multiplier_UCIMS),
max(df_bootstrap_results_cert[,14]*multiplier_UCIMS),
max(weighted_sibling_df_bootstrap_results[,14]))
Sources <- c("IBC Extended", "ILCS", "Burnham et al. (2006)",
"IFHS Adjusted", "IFHS Unadjusted",
"UCIMS Household - All Deaths",
"UCIMS Household - Deaths Backed by Death Certificates",
"UCIMS Sibling")
all_2004 <- c(low_2004, central_2004, high_2004, max_2004)
all_2004_matrix <- matrix(all_2004,ncol = 4)
all_2004_df <- as.data.frame(all_2004_matrix)
colnames(all_2004_df) <- c("Percentile_2.5", "Central_number",
"Percentile_97.5", "Percentile_100")
all_2004_df <- cbind(Sources, all_2004_df)
View(all_2004_df)
all_2004_df <- arrange(all_2004_df, Central_number)
View(all_2004_df)
```
Make and save table 1:
```{r}
table1 <- all_2004_df%>%
gt()%>%
fmt_number(c(`Percentile_2.5`,`Central_number`, `Percentile_97.5`,
`Percentile_100`),decimals = 0, sep_mark = ",",
n_sigfig = 2) %>%
tab_row_group(rows = 8, "Outside the Mainstream") %>%
tab_row_group(rows = 7, label = "On the Edge") %>%
tab_row_group(rows = 1:6, label = "Mainstream") %>%
cols_label(Percentile_2.5 = "Percentile 2.5",
Central_number = "Central Number",
Percentile_97.5 = "Percentile 97.5",
Percentile_100 = "Percentile 100") %>%
tab_header(title = "Table 1 - Violent Deaths in the Iraq War",
subtitle = "March 2003 through April 2004")
gtsave(table1, "table1.png")
```
Now do the same thing for the period March 2003 through September 2004:
```{r}
low_2004_Sept <- c(NA, L1_generous_low[19], NA,
UCIMS_cumulative_low[19],
UCIMS_cumulative_cert_low[19],
sibling_cumulative_low[19])
central_2004_Sept <- c(IBC_extended_cumulative[19], L1[19],
L2_cumulative[19],
UCIMS_cumulative[19],
UCIMS_cumulative_cert[19],
sibling_cumulative[19])
high_2004_Sept <- c(NA, L1_generous_high[19], NA,
UCIMS_cumulative_high[19],
UCIMS_cumulative_cert_high[19],
sibling_cumulative_high[19])
max_2004_Sept <- c(NA, max(multiplier_L1*results_generous), NA,
max(df_bootstrap_results[,19]*multiplier_UCIMS),
max(df_bootstrap_results_cert[,19]*multiplier_UCIMS),
max(weighted_sibling_df_bootstrap_results[,19]))
Sources <- c("IBC Extended", "Roberts et al. (2004)",
"Burnham et al. (2006)", "UCIMS Household - All Deaths",
"UCIMS Household - Deaths Backed by Death Certificates",
"UCIMS Sibling")
all_2004_Sept <- c(low_2004_Sept, central_2004_Sept,
high_2004_Sept, max_2004_Sept)
all_2004_Sept_matrix <- matrix(all_2004_Sept,ncol = 4)
all_2004_Sept_df <- as.data.frame(all_2004_Sept_matrix)
colnames(all_2004_Sept_df) <- c("Percentile_2.5", "Central_number",
"Percentile_97.5", "Percentile_100")
all_2004_Sept_df <- cbind(Sources, all_2004_Sept_df)
View(all_2004_Sept_df)
all_2004_Sept_df <- arrange(all_2004_Sept_df, Central_number)
View(all_2004_Sept_df)
table2 <- all_2004_Sept_df%>%
gt()%>%
fmt_number(c(`Percentile_2.5`,`Central_number`, `Percentile_97.5`,
`Percentile_100`),decimals = 0, sep_mark = ",",
n_sigfig = 2) %>%
tab_row_group(rows = 5:6, "Outside the Mainstream") %>%
tab_row_group(rows = 1:4, label = "Mainstream") %>%
cols_label(Percentile_2.5 = "Percentile 2.5",
Central_number = "Central Number",
Percentile_97.5 = "Percentile 97.5",
Percentile_100 = "Percentile 100") %>%
tab_header(title = "Table 2 - Violent Deaths in the Iraq War",
subtitle = "March 2003 through September 2004")
gtsave(table2, "table2.png")
```
Do it again for the period March 2003 through May 2005:
```{r}
low_2005 <- c(NA, NA, NA, NA,
UCIMS_cumulative_low[27],
UCIMS_cumulative_cert_low[27],
sibling_cumulative_low[27])
central_2005 <- c(IBC_extended_cumulative[27], L2_cumulative[27],
IFHS_cumulative[27],
IFHS_cumulative_nofudge_nocluster[27],
UCIMS_cumulative[27],
UCIMS_cumulative_cert[27],
sibling_cumulative[27])
high_2005 <- c(NA, NA, NA, NA,
UCIMS_cumulative_high[27],
UCIMS_cumulative_cert_high[27],
sibling_cumulative_high[27])
max_2005 <- c(NA, NA, NA, NA,
max(df_bootstrap_results[,27]*multiplier_UCIMS),
max(df_bootstrap_results_cert[,27]*multiplier_UCIMS),
max(weighted_sibling_df_bootstrap_results[,27]))
Sources <- c("IBC Extended", "Burnham et al. (2006)",
"IFHS Adjusted", "IFHS Unadjusted",
"UCIMS Household - All Deaths",
"UCIMS Household - Deaths Backed by Death Certificates",
"UCIMS Sibling")
all_2005 <- c(low_2005, central_2005, high_2005, max_2005)
all_2005_matrix <- matrix(all_2005,ncol = 4)
all_2005_df <- as.data.frame(all_2005_matrix)
colnames(all_2005_df) <- c("Percentile_2.5", "Central_number",
"Percentile_97.5", "Percentile_100")
all_2005_df <- cbind(Sources, all_2005_df)
View(all_2005_df)
all_2005_df <- arrange(all_2005_df, Central_number)
View(all_2005_df)
table3 <- all_2005_df%>%
gt()%>%
fmt_number(c(`Percentile_2.5`,`Central_number`, `Percentile_97.5`,
`Percentile_100`),decimals = 0, sep_mark = ",",
n_sigfig = 2) %>%
tab_row_group(rows = 7, "Outside the Mainstream") %>%
tab_row_group(rows = 1:6, label = "Mainstream") %>%
cols_label(Percentile_2.5 = "Percentile 2.5",
Central_number = "Central Number",
Percentile_97.5 = "Percentile 97.5",
Percentile_100 = "Percentile 100") %>%
tab_header(title = "Table 3 - Violent Deaths in the Iraq War",
subtitle = "March 2003 through May 2005")
gtsave(table3, "table3.png")
```
Now do March 2003 through June 2006:
```{r}
low_2006 <- c(NA, L2_low[41], L2_low_rosenblum[41], IFHS_low[40], NA,
UCIMS_cumulative_low[40],
UCIMS_cumulative_cert_low[40],
sibling_cumulative_low[40])
central_2006 <- c(IBC_extended_cumulative[40],
L2_cumulative[41], NA, IFHS_cumulative[40],
IFHS_cumulative_nofudge_nocluster[40],
UCIMS_cumulative[40],
UCIMS_cumulative_cert[40],
sibling_cumulative[40])
high_2006 <- c(NA, L2_high[41], L2_high_rosenblum[41],IFHS_high[40],
NA,
UCIMS_cumulative_high[40],
UCIMS_cumulative_cert_high[40],
sibling_cumulative_high[40])
max_2006 <- c(NA, max(multiplier_L2*results_L2), NA, NA, NA,
max(df_bootstrap_results[,40]*multiplier_UCIMS),
max(df_bootstrap_results_cert[,40]*multiplier_UCIMS),
max(weighted_sibling_df_bootstrap_results[,40]))
Sources <- c("IBC Extended", "Burnham et al. (2006)",
"Burnham et al. data - Rosenblum and van der Laan UI",
"IFHS Adjusted", "IFHS Unadjusted",
"UCIMS Household - All Deaths",
"UCIMS Household - Deaths Backed by Death Certificates",
"UCIMS Sibling")
all_2006 <- c(low_2006, central_2006,
high_2006, max_2006)
all_2006_matrix <- matrix(all_2006,ncol = 4)
all_2006_df <- as.data.frame(all_2006_matrix)
colnames(all_2006_df) <- c("Percentile_2.5", "Central_number",
"Percentile_97.5", "Percentile_100")
all_2006_df <- cbind(Sources, all_2006_df)
View(all_2006_df)
all_2006_df <- arrange(all_2006_df, Central_number)
View(all_2006_df)
table4 <- all_2006_df%>%
gt()%>%
fmt_number(c(`Percentile_2.5`,`Central_number`, `Percentile_97.5`,
`Percentile_100`),decimals = 0, sep_mark = ",",
n_sigfig = 2) %>%
tab_row_group(rows = 7:8, "Outside the Mainstream") %>%
tab_row_group(rows = 1:6, label = "Mainstream") %>%
cols_label(Percentile_2.5 = "Percentile 2.5",
Central_number = "Central Number",
Percentile_97.5 = "Percentile 97.5",
Percentile_100 = "Percentile 100") %>%
tab_header(title = "Table 4 - Violent Deaths in the Iraq War",
subtitle = "March 2003 through June/July 2006")
gtsave(table4, "table4.png")
```
Extract the minimum outcome for the 10,000 bootstrap simulations for the @burnham2006 survey:
```{r}
min(multiplier_L2*results_L2)
```
This minimum greatly exceeds the maximum bootstrap result for each of the three other survey estimates covering the same period for which I was able to do bootstrapping, signalling a radical incompatibility between these surveys.
Finally, do it for March 2003 through September 2007:
```{r}
low_2007 <- c(NA, ORB_low[55],
UCIMS_cumulative_low[55],
UCIMS_cumulative_cert_low[55],
sibling_cumulative_low[55])
central_2007 <- c(IBC_extended_cumulative[55],
ORB[55],
UCIMS_cumulative[55],
UCIMS_cumulative_cert[55],
sibling_cumulative[55])
high_2007 <- c(NA, ORB_high[55],
UCIMS_cumulative_high[55],
UCIMS_cumulative_cert_high[55],
sibling_cumulative_high[55])
max_2007 <- c(NA, NA,
max(df_bootstrap_results[,55]*multiplier_UCIMS),
max(df_bootstrap_results_cert[,55]*multiplier_UCIMS),
max(weighted_sibling_df_bootstrap_results[,55]))
Sources <- c("IBC Extended", "ORB",
"UCIMS Household - All Deaths",
"UCIMS Household - Deaths Backed by Death Certificates",
"UCIMS Sibling")
all_2007 <- c(low_2007, central_2007,
high_2007, max_2007)
all_2007_matrix <- matrix(all_2007,ncol = 4)
all_2007_df <- as.data.frame(all_2007_matrix)
colnames(all_2007_df) <- c("Percentile_2.5", "Central_number",
"Percentile_97.5", "Percentile_100")
all_2007_df <- cbind(Sources, all_2007_df)
View(all_2007_df)
all_2007_df <- arrange(all_2007_df, Central_number)
View(all_2007_df)
table5 <- all_2007_df%>%
gt()%>%
fmt_number(c(`Percentile_2.5`,`Central_number`, `Percentile_97.5`,
`Percentile_100`),decimals = 0, sep_mark = ",",
n_sigfig = 2) %>%
tab_row_group(rows = 5, "Outside the Mainstream") %>%
tab_row_group(rows = 1:4, label = "Mainstream") %>%
cols_label(Percentile_2.5 = "Percentile 2.5",
Central_number = "Central Number",
Percentile_97.5 = "Percentile 97.5",
Percentile_100 = "Percentile 100") %>%
tab_header(title = "Table 5 - Violent Deaths in the Iraq War",
subtitle = "March 2003 through August 2007")
gtsave(table5, "table5.png")
```
## Graphs
Create a graph that contains all the sources:
```{r}
x_axis_labels <- c("03-2003","03-2004","03-2005","03-2006","03-2007",
"03-2008","03-2009","03-2010","03-2011","03-2012",
"03-2013","03-2014","03-2015","03-2016","03-2017",
"03-2018","03-2019","03-2020","03-2021","03-2022")
ggplot(all_data)+
geom_line(aes(x = month, y = IBC)) +
geom_point(aes(x = 14, y = ILCS[14])) +
geom_errorbar(aes(x = 14, ymin=ILCS_low[14], ymax=ILCS_high[14]), width=5,
color = "grey50") +
geom_point(aes(x = 19, y = L1[19])) +
geom_errorbar(aes(x = 19, ymin=L1_low, ymax=L1_high), width=5,
color = "grey50") +
geom_line(aes(x = month, y = L2)) +
geom_errorbar(aes(x = month, ymin=L2_low, ymax=L2_high), width=5,
color = "grey50") +
geom_errorbar(aes(x = month, ymin=L2_low_rosenblum,
ymax=L2_high_rosenblum), width=5,
color = "grey50") +
geom_line(aes(x = month, y = IFHS), color = "red") +
geom_errorbar(aes(x = 14, ymin=IFHS_low[14], ymax=IFHS_high[14]), width=5,
color = "grey50") +
geom_errorbar(aes(x = 40, ymin=IFHS_low[40], ymax=IFHS_high[40]), width=5,
color = "grey50") +
geom_point(aes(x = 55, y = ORB[55])) +
geom_errorbar(aes(x = 55, ymin=ORB_low[55], ymax=ORB_high[55]), width=5,
color = "grey50") +
geom_line(aes(x = month, y = UCIMS))+
geom_ribbon(aes(x = month,
ymin=UCIMS_low, ymax=UCIMS_high ), alpha = 0.1) +
geom_line(aes(x = month, y = sibling))+
geom_ribbon(aes(x = month,
ymin=sibling_low, ymax=sibling_high ), alpha = 0.1) +
scale_y_continuous(breaks=seq(0,1000000,100000))+
scale_x_continuous(breaks=seq(1,(2022-2002)*12,12),
labels = x_axis_labels)+
geom_text(aes(x = 57.8, y = 1033000),label = "ORB", cex = 3)+
geom_text(aes(x = 10, y = 263000),label = "Roberts et al. (2004)", cex = 3)+
geom_text(aes(x = 50.2, y = 666000),label = "Burnham et al. (2006)", cex = 3)+
geom_text(aes(x = 106.3, y = 141500),label = "UCIMS Sibling", cex = 3)+
geom_text(aes(x = 97, y = 215000),label = "UCIMS Household - All Violent Deaths Included", cex = 3) +
geom_text(aes(x = 17.3, y = 16600),label = "ILCS", cex = 3)+
geom_text(aes(x = 33, y = 110000),label = "IFHS", cex = 3)+
geom_text(aes(x = 200, y = 281000),label = "Iraq Body Count Extended", cex = 3)+
labs(title = "Violent Deaths in the Iraq War",
subtitle = "Using all Surveys and the High Estimates for the IFHS and the UCIMS Household Surveys",
y = "Number of Violent Deaths",
x = "Time")+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.text.x = element_text(size = 10, angle=0),
legend.position = "none",
plot.margin = unit(c(1.0,0.1,0.1,0), "cm"))
ggsave("figure4.png", device = "png" ,dpi = 1500, width = 18, height = 9, units = "in")
```
Note that the above graph does not include all treatments of the sources. In particular, it has only the adjusted IFHS figures and the UCIMS-household figures that include all reported violent deaths, regardless of death certificate confirmation.
Now show just the reliable sources together with the unadjusted IFHS figures and the UCIMS-household figures that require death certificate confirmation:
```{r}
ggplot(all_data)+
geom_line(aes(x = month, y = IBC)) +
geom_point(aes(x = 14, y = ILCS[14])) +
geom_errorbar(aes(x = 14, ymin=ILCS_low[14], ymax=ILCS_high[14]), width=5,
color = "grey50") +
geom_line(aes(x = month, y = IFHS_nofudge), color = "red") +
geom_line(aes(x = month, y = UCIMS_cert))+
geom_ribbon(aes(x = month,
ymin=UCIMS_cert_low,
ymax=UCIMS_cert_high), alpha = 0.1) +
geom_line(aes(x = month, y = sibling))+
geom_ribbon(aes(x = month,
ymin=sibling_low, ymax=sibling_high ), alpha = 0.1) +
scale_y_continuous(breaks=seq(0,1000000,100000))+
scale_x_continuous(breaks=seq(1,(2022-2002)*12,12),
labels = x_axis_labels) +
geom_text(aes(x = 106.2, y = 144000),label = "UCIMS Sibling", cex = 3)+
geom_text(aes(x = 102.5, y = 124500),label = "UCIMS Household - Only Deaths Backed by Death Certificates Included", cex = 3) +
geom_text(aes(x = 12.5, y = 22000),label = "ILCS", cex = 3)+
geom_text(aes(x = 32, y = 50600),label = "IFHS",cex = 3)+
geom_text(aes(x = 200, y = 290500),label = "Iraq Body Count Extended", cex = 3)+
labs(title = "Violent Deaths in the Iraq War",
subtitle = "Using only the Reliable Surveys and low estimates for the IFHS and UCIMS Household Surveys",
y = "Number of Violent Deaths",
x = "Time")+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.text.x = element_text(size = 10, angle=0),
legend.position = "none",
plot.margin = unit(c(1.0,0.1,0.1,0), "cm"))
ggsave("figure5.png", device = "png" ,dpi = 1500, width = 18, height = 9, units = "in")
```
## Mapping the Roberts et al. (2004) Survey
Les Roberts, the first author of the @roberts2004 survey, released a spreadsheet (included in the .zip file accompanying this code) that clarifies, albeit with some ambiguity, the locations of the 33 clusters of contiguous households that were sampled. They are, with the number of clusters in parentheses, Baghdad (7), Balad (1), Baiji (2), Al Shirqat (1), Mosul (3), Sulaimaniya (3), Miqdadiya (1), Baqubah (1), Karbala (3), Babylon (3), Rifai (1), Shatrah (1), Suq Al-Shoyokh (1), Kut (1), Maysan (3) and Al-Fallujah (1). The ambiguity is that Babylon and Maysan are governorates, rather than specific cities or towns.
Create a map that shows Iraq's primary roads together with the 33 cluster locations of the survey, with the clusters for Maysan and Babylon placed randomly within these governorates. The first step is to load the sf package for mapping and read four shapefiles:
```{r}
library(sf)
Iraq_primary_routs <- st_read("primary_routes.shp")
Iraq_outline <- st_read("IRQ_adm0.shp")
Iraq_places <- st_read("irq_pplp_ocha_20140722.shp")
Iraq_governorates <- st_read("IRQ_adm1.shp")
```
The shapefiles were downloaded from @iraq-r, @boundary, @iraq-s and @first-le but are also made available in the .zip file accompanying the code.
List the locations of clusters that are specified below the governorate level, i.e., all cluster locations except those in Babylon and Maysan:
```{r}
L1_locations_specific <- c("Balad", "Kut", "Shatra", "Rifa'i",
"Suq Al-Shoyokh", "Kerbala",
"Sulaymaniya", "Falluja", "Mosul",
"Shirqat", "Muqdadiya", "Ba'quba",
"Baiji", "Baghdad")
```
The above "specific" locations are still not precise GPS points, so I searched the Iraq_places file and selected, with the help of Google Earth, a central place in each city/town:
```{r}
#IQ-P00046 Fallujah ,IQ-P06653 Sulaymaniya, IQ-P10981 Rifa'i,
#IQ-P11313 Shatra, IQ-P11595 Suq Al-Shoyokh,
#IQ-P12156 Ba'quba, IQ-P13160 Muqdadiya, IQ-P16007 Kerbala,
# IQ-P23224 Baiji,IQ-P19942 Mosul, IQ-P21592 Kut, IQ-P23365
#Balad,IQ-P23603, Shirqat, IQ-P08960 Baghdad
Location_codes_specific <- c("IQ-P00046", "IQ-P06653",
"IQ-P10981", "IQ-P11313",
"IQ-P11595", "IQ-P12156",
"IQ-P13160", "IQ-P16007",
"IQ-P23224", "IQ-P19942",
"IQ-P21592", "IQ-P23365",
"IQ-P23603", "IQ-P08960")
```
Here's a check that the procedure has worked:
```{r}
check <- filter(Iraq_places, PCode %in% Location_codes_specific)
View(check)
```
There are fourteen unique locations, as there should be.
Now select the random locations in Babylon and Maysan:
```{r}
L1_locations_general <- c("Babylon", "Babylon", "Babylon",
"Missan", "Missan", "Missan")
set.seed(1)
random_identifiers <- c()
for (i in L1_locations_general) {
temp <- filter(Iraq_places, A1NameEn == i)
temp_choices <- temp$PCode
choice <- sample(temp_choices, 1)
random_identifiers <- c(random_identifiers, choice)
}
random_identifiers
```
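For reference, the loop above can be condensed into a single sapply() call. Here is a minimal self-contained sketch with a toy places table whose column names (PCode, A1NameEn) mirror those of Iraq_places:

```r
#Toy stand-in for the Iraq_places attribute table; column names mirror the real file.
places <- data.frame(PCode    = sprintf("IQ-P%05d", 1:9),
                     A1NameEn = rep(c("Babylon", "Missan", "Baghdad"), each = 3))
targets <- rep(c("Babylon", "Missan"), each = 3) #three clusters per governorate
set.seed(1)
#Draw one random place code per requested cluster, replacing the explicit for loop:
random_ids <- sapply(targets, function(g) sample(places$PCode[places$A1NameEn == g], 1))
length(random_ids)
```

Note that sample() changes behavior when its first argument has length one, so this shortcut is safe only because every governorate here offers several candidate places.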
Collect all the sampled locations:
```{r}
Location_codes <- c(Location_codes_specific, random_identifiers)
sampled_locations <- filter(Iraq_places, PCode %in% Location_codes)
View(sampled_locations)
```
Put the sampled locations on a map that includes Iraq's primary roads and borders:
```{r}
ggplot() +
geom_sf(data = sampled_locations) +
geom_sf(data = Iraq_outline, alpha = 0) +
geom_sf(data = Iraq_primary_routs, alpha = 0.4, color = "blue") +
annotate("text", x = 44.95, y = 33.36, label = "Baghdad(7)", cex = 3) +
annotate("text", x = 43.5, y = 36.45, label = "Mosul(3)", cex = 3) +
annotate("text", x = 44.29, y = 34.11, label = "Balad", cex = 3) +
annotate("text", x = 43.46, y = 35.63, label = "Shirqat", cex = 3) +
annotate("text", x = 43.66, y = 33.275, label = "Fallujah", cex = 3) +
annotate("text", x = 45.92, y = 35.64, label = "Sulaimaniya(3)", cex = 3) +
annotate("text", x = 43.6, y = 32.71, label = "Karbala(3)", cex = 3) +
annotate("text", x = 45.4, y = 33.9, label = "Muqdadiya", cex = 3) +
annotate("text", x = 43.18, y = 35.01, label = "Baiji(2)", cex = 3) +
annotate("text", x = 45.03, y = 33.685, label = "Baqubah", cex = 3) +
annotate("text", x = 46.28, y = 31.71, label = "Rifa", cex = 3) +
annotate("text", x = 46.45, y = 31.49, label = "Shatra", cex = 3) +
annotate("text", x = 47.12, y = 31.03, label = "Suq Al-Shoyokh", cex = 3) +
annotate("text", x = 45.76, y = 32.4, label = "Kut", cex = 3) +
annotate("text", x = 41.8, y = 29, label = "Numbers of clusters at each location, when different from 1, are in parentheses", cex = 3) +
labs(title = "Primary Roads and the Sampling Locations for Roberts et al. (2004)" ,
subtitle = "Only locations specified below the governorate level are labelled",
y = "Latitude", x = "Longitude",
caption = "Note - The two triples of unlabelled points are random locations in Babylon (left) and Maysan (right) governorates") +
theme(plot.caption = element_text(hjust=0, vjust = -0.5)) +
theme(plot.title = element_text(size=12)) +
theme(plot.subtitle = element_text(size = 10)) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
ggsave( "figure1.png",device = "png" ,dpi = 700, width = 7.5, height = 10, units = "in")
```
Assume now that the clusters in Babylon and Maysan were located in, respectively, Hillah and Amarah, and create a second map under these assumptions:
```{r}
Location_codes1 <- c("IQ-P00046", "IQ-P06653", "IQ-P10981",
"IQ-P11313", "IQ-P11595", "IQ-P12156", "IQ-P13160", "IQ-P16007",
"IQ-P23224", "IQ-P19942", "IQ-P21592", "IQ-P23365",
"IQ-P23603", "IQ-P08960", "IQ-P07421",
"IQ-P18138")
sampled_locations1 <- filter(Iraq_places, PCode %in% Location_codes1)
View(sampled_locations1)
ggplot() +
geom_sf(data = sampled_locations1) +
geom_sf(data = Iraq_outline, alpha = 0) +
geom_sf(data = Iraq_primary_routs, alpha = 0.4, color = "blue") +
annotate("text", x = 44.95, y = 33.36, label = "Baghdad(7)", cex = 3) +
annotate("text", x = 43.5, y = 36.45, label = "Mosul(3)", cex = 3) +
annotate("text", x = 44.29, y = 34.11, label = "Balad", cex = 3) +
annotate("text", x = 43.46, y = 35.63, label = "Shirqat", cex = 3) +
annotate("text", x = 43.66, y = 33.275, label = "Fallujah", cex = 3) +
annotate("text", x = 45.92, y = 35.64, label = "Sulaimaniya(3)", cex = 3) +
annotate("text", x = 43.6, y = 32.71, label = "Karbala(3)", cex = 3) +
annotate("text", x = 45.4, y = 33.9, label = "Muqdadiya", cex = 3) +
annotate("text", x = 43.18, y = 35.01, label = "Baiji(2)", cex = 3) +
annotate("text", x = 45.03, y = 33.685, label = "Baqubah", cex = 3) +
annotate("text", x = 46.28, y = 31.71, label = "Rifa", cex = 3) +
annotate("text", x = 46.45, y = 31.49, label = "Shatra", cex = 3) +
annotate("text", x = 47.12, y = 31.03, label = "Suq Al-Shoyokh", cex = 3) +
annotate("text", x = 45.76, y = 32.4, label = "Kut", cex = 3) +
annotate("text", x = 47.605, y = 31.84, label = "Amarah(3)", cex = 3) +
annotate("text", x = 44.77, y = 32.51, label = "Hillah(3)", cex = 3) +
annotate("text", x = 41.8, y = 29, label = "Numbers of clusters at each location, when different from 1, are in parentheses", cex = 3) +
labs(title = "Primary Roads and the Sampling Locations for Roberts et al. (2004)" ,
subtitle = "Sampling Locations in Babylon and Maysan are Placed in the Seat of each Governorate",
y = "Latitude", x = "Longitude") +
theme(plot.caption = element_text(hjust=0, vjust = -0.5)) +
theme(plot.title = element_text(size=12)) +
theme(plot.subtitle = element_text(size = 10)) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
ggsave("figure2.png",device = "png" ,dpi = 700, width = 7.5, height = 10, units = "in")
```
Remarkably, seven governorates of Iraq had no clusters in the @roberts2004 sample. Create a map that shades these zero-cluster governorates in green:
```{r}
ggplot() +
geom_sf(data = Iraq_primary_routs[6,], color = "blue") +
geom_sf(data = sampled_locations1) +
geom_sf(data = Iraq_governorates, alpha = 0) +
geom_sf(data = Iraq_governorates[c(2:6,8,12),], fill = "green") +
labs(title = "The Governorates of Iraq without Clusters in the Roberts et al. (2004) Survey" ,
subtitle = "Green Shading Indicates no Cluster",
y = "Latitude", x = "Longitude") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
ggsave("figure3.png",device = "png" ,dpi = 700, width = 7.5, height = 10, units = "in")
```
The fact that the governorates with 0 clusters form two contiguous areas underscores the scale of the convenience sampling in the survey.
## One more little calculation
In the section of the paper entitled "Discussion" there is a back-of-the-envelope calculation aimed at quantifying the uncertainty surrounding the IBC-extended figures after June 2011, i.e., after survey coverage ends. Here is the code behind this calculation.
For each of the group-1 surveys, calculate the ratio of the central estimate for the whole time period covered by the survey to the IBC-extended figure for the same time period.
```{r}
ILCS_ratio <- ILCS[14]/IBC_extended_cumulative[14]
IFHS_adjusted_ratio <- IFHS_cumulative[40]/IBC_extended_cumulative[40]
IFHS_unadjusted_ratio <- IFHS_cumulative_nofudge_nocluster[40]/IBC_extended_cumulative[40]
#Month 8*12+4 = 100 corresponds to June 2011, the end of survey coverage.
UCIMS_all_ratio <- UCIMS_cumulative[8*12+4]/IBC_extended_cumulative[8*12+4]
UCIMS_cert_ratio <- UCIMS_cumulative_cert[8*12+4]/IBC_extended_cumulative[8*12+4]
UCIMS_sibling_ratio <- sibling_cumulative[8*12+4]/IBC_extended_cumulative[8*12+4]
ILCS_ratio
IFHS_unadjusted_ratio
IFHS_adjusted_ratio
UCIMS_all_ratio
UCIMS_cert_ratio
UCIMS_sibling_ratio
```
Compute the mean and median of these 6 ratios:
```{r}
mean(c(ILCS_ratio, IFHS_unadjusted_ratio, IFHS_adjusted_ratio,
UCIMS_all_ratio, UCIMS_cert_ratio, UCIMS_sibling_ratio))
median(c(ILCS_ratio, IFHS_unadjusted_ratio, IFHS_adjusted_ratio,
UCIMS_all_ratio, UCIMS_cert_ratio, UCIMS_sibling_ratio))
```
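To make the final back-of-the-envelope step explicit, here is a hedged sketch with invented numbers (not the paper's figures) showing how such ratios translate into an uncertainty range for a post-coverage IBC-extended increment:

```r
#Illustrative only: invented ratios and increment, not the paper's figures.
ratios        <- c(0.6, 0.9, 1.1, 1.3, 1.6, 2.0) #stand-ins for the six survey/IBC ratios
ibc_increment <- 50000 #stand-in for the IBC-extended count after June 2011
#Scaling the post-coverage increment by the spread of the ratios brackets the
#plausible toll, with the median ratio giving a central figure:
range(ibc_increment * ratios)
ibc_increment * median(ratios)
```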