rm(list = ls()) # clean-up workspace
Dr. Hua Zhou’s HW assignment
Of primary interest to the public is the risk of dying from COVID-19. A commonly used measure is the case fatality rate/ratio/risk (CFR), defined as \[ \frac{\text{number of deaths from disease}}{\text{number of diagnosed cases of disease}}. \] CFR is not a fixed constant; it changes with time, location, and other factors. CFR also differs from the infection fatality rate (IFR), the probability that someone infected with COVID-19 dies from it.
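As a quick worked example of this definition, the sketch below computes CFR in percent for three Alabama counties, using the confirmed and deaths counts that appear in the data preview later in this document. The helper name cfr_percent is made up for illustration.

```r
# Hypothetical helper: CFR as a percentage, following the definition above
cfr_percent <- function(deaths, confirmed) {
  100 * deaths / confirmed
}

# Counts for Autauga, Baldwin, and Barbour (Alabama) on 2021-11-02
confirmed <- c(Autauga = 10271, Baldwin = 37445, Barbour = 3605)
deaths    <- c(Autauga = 148,   Baldwin = 558,   Barbour = 76)

cfr <- cfr_percent(deaths, confirmed)
print(round(cfr, 2))
```

Rounded to two decimals, these values agree with the Case_Fatality_Ratio column reported for the same rows (1.44, 1.49, 2.11).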
In this exercise, we use logistic regression to study how US county-level CFR varies with demographic information and selected health, education, and economic indicators.
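To make the modeling idea concrete, here is a minimal sketch of a binomial logistic regression for CFR. It is fitted only to the ten Alabama counties shown in the data previews of this document, with percent_65_and_over as a single predictor. The choice of predictor and the tiny sample are purely illustrative assumptions; this is not the model the assignment arrives at.

```r
# Toy data: ten Alabama counties copied from the data previews in this document
toy <- data.frame(
  confirmed = c(10271, 37445, 3605, 4283, 10423, 1526, 3365, 22341, 5787, 3071),
  deaths    = c(148, 558, 76, 89, 179, 44, 96, 497, 144, 61),
  percent_65_and_over = c(15.6, 20.4, 19.4, 16.5, 18.2, 16.4, 20.3, 17.7, 19.5, 23.0)
)

# Each county contributes `deaths` events out of `confirmed` trials
fit <- glm(cbind(deaths, confirmed - deaths) ~ percent_65_and_over,
           family = binomial, data = toy)

# Coefficients are on the log-odds scale; exponentiate them for odds ratios
print(coef(fit))
print(exp(coef(fit)))
```

The two-column response cbind(deaths, confirmed - deaths) lets glm() weight each county by its number of confirmed cases, so large counties carry more information than small ones.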
11-02-2021.csv
: The data on COVID-19 confirmed cases and deaths on 2021-11-02 is retrieved from the Johns Hopkins COVID-19 data repository. It was downloaded from this link (commit a94c128).
us-county-health-rankings-2020.csv.zip
: The 2020 County Health Ranking Data was released by County Health Rankings. The data was downloaded from the Kaggle Uncover COVID-19 Challenge (version 1).
Load the tidyverse package for data manipulation and visualization.
# tidyverse for data manipulation and visualization
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Read in the data of COVID-19 cases reported on 2021-11-02.
county_count <- read_csv("./11-02-2021.csv") %>%
# cast fips into dbl for use as a key for joining tables
mutate(FIPS = as.numeric(FIPS)) %>%
filter(Country_Region == "US") %>%
print(width = Inf)
## Rows: 4006 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Admin2, Province_State, Country_Region, Combined_Key
## dbl (7): FIPS, Lat, Long_, Confirmed, Deaths, Incident_Rate, Case_Fatality_...
## lgl (2): Recovered, Active
## dttm (1): Last_Update
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 3,279 × 14
## FIPS Admin2 Province_State Country_Region Last_Update Lat Long_
## <dbl> <chr> <chr> <chr> <dttm> <dbl> <dbl>
## 1 1001 Autauga Alabama US 2021-11-03 06:22:09 32.5 -86.6
## 2 1003 Baldwin Alabama US 2021-11-03 06:22:09 30.7 -87.7
## 3 1005 Barbour Alabama US 2021-11-03 06:22:09 31.9 -85.4
## 4 1007 Bibb Alabama US 2021-11-03 06:22:09 33.0 -87.1
## 5 1009 Blount Alabama US 2021-11-03 06:22:09 34.0 -86.6
## 6 1011 Bullock Alabama US 2021-11-03 06:22:09 32.1 -85.7
## 7 1013 Butler Alabama US 2021-11-03 06:22:09 31.8 -86.7
## 8 1015 Calhoun Alabama US 2021-11-03 06:22:09 33.8 -85.8
## 9 1017 Chambers Alabama US 2021-11-03 06:22:09 32.9 -85.4
## 10 1019 Cherokee Alabama US 2021-11-03 06:22:09 34.2 -85.6
## Confirmed Deaths Recovered Active Combined_Key Incident_Rate
## <dbl> <dbl> <lgl> <lgl> <chr> <dbl>
## 1 10271 148 NA NA Autauga, Alabama, US 18384.
## 2 37445 558 NA NA Baldwin, Alabama, US 16774.
## 3 3605 76 NA NA Barbour, Alabama, US 14603.
## 4 4283 89 NA NA Bibb, Alabama, US 19126.
## 5 10423 179 NA NA Blount, Alabama, US 18025.
## 6 1526 44 NA NA Bullock, Alabama, US 15107.
## 7 3365 96 NA NA Butler, Alabama, US 17303.
## 8 22341 497 NA NA Calhoun, Alabama, US 19666.
## 9 5787 144 NA NA Chambers, Alabama, US 17402.
## 10 3071 61 NA NA Cherokee, Alabama, US 11723.
## Case_Fatality_Ratio
## <dbl>
## 1 1.44
## 2 1.49
## 3 2.11
## 4 2.08
## 5 1.72
## 6 2.88
## 7 2.85
## 8 2.22
## 9 2.49
## 10 1.99
## # … with 3,269 more rows
Standardize the variable names by changing them to lower case.
names(county_count) <- str_to_lower(names(county_count))
Sanity check by displaying the unique US states and territories:
county_count %>%
select(province_state) %>%
distinct() %>%
arrange(province_state) %>%
print(n = Inf)
## # A tibble: 59 × 1
## province_state
## <chr>
## 1 Alabama
## 2 Alaska
## 3 American Samoa
## 4 Arizona
## 5 Arkansas
## 6 California
## 7 Colorado
## 8 Connecticut
## 9 Delaware
## 10 Diamond Princess
## 11 District of Columbia
## 12 Florida
## 13 Georgia
## 14 Grand Princess
## 15 Guam
## 16 Hawaii
## 17 Idaho
## 18 Illinois
## 19 Indiana
## 20 Iowa
## 21 Kansas
## 22 Kentucky
## 23 Louisiana
## 24 Maine
## 25 Maryland
## 26 Massachusetts
## 27 Michigan
## 28 Minnesota
## 29 Mississippi
## 30 Missouri
## 31 Montana
## 32 Nebraska
## 33 Nevada
## 34 New Hampshire
## 35 New Jersey
## 36 New Mexico
## 37 New York
## 38 North Carolina
## 39 North Dakota
## 40 Northern Mariana Islands
## 41 Ohio
## 42 Oklahoma
## 43 Oregon
## 44 Pennsylvania
## 45 Puerto Rico
## 46 Recovered
## 47 Rhode Island
## 48 South Carolina
## 49 South Dakota
## 50 Tennessee
## 51 Texas
## 52 Utah
## 53 Vermont
## 54 Virgin Islands
## 55 Virginia
## 56 Washington
## 57 West Virginia
## 58 Wisconsin
## 59 Wyoming
We want to exclude entries from American Samoa, Diamond Princess, Grand Princess, Guam, Northern Mariana Islands, Puerto Rico, Recovered, and Virgin Islands, and only consider counties from the 50 states and DC.
county_count <- county_count %>%
filter(!(province_state %in% c("American Samoa", "Diamond Princess", "Grand Princess",
"Recovered", "Guam", "Northern Mariana Islands",
"Puerto Rico", "Virgin Islands"))) %>%
print(width = Inf)
## # A tibble: 3,192 × 14
## fips admin2 province_state country_region last_update lat long_
## <dbl> <chr> <chr> <chr> <dttm> <dbl> <dbl>
## 1 1001 Autauga Alabama US 2021-11-03 06:22:09 32.5 -86.6
## 2 1003 Baldwin Alabama US 2021-11-03 06:22:09 30.7 -87.7
## 3 1005 Barbour Alabama US 2021-11-03 06:22:09 31.9 -85.4
## 4 1007 Bibb Alabama US 2021-11-03 06:22:09 33.0 -87.1
## 5 1009 Blount Alabama US 2021-11-03 06:22:09 34.0 -86.6
## 6 1011 Bullock Alabama US 2021-11-03 06:22:09 32.1 -85.7
## 7 1013 Butler Alabama US 2021-11-03 06:22:09 31.8 -86.7
## 8 1015 Calhoun Alabama US 2021-11-03 06:22:09 33.8 -85.8
## 9 1017 Chambers Alabama US 2021-11-03 06:22:09 32.9 -85.4
## 10 1019 Cherokee Alabama US 2021-11-03 06:22:09 34.2 -85.6
## confirmed deaths recovered active combined_key incident_rate
## <dbl> <dbl> <lgl> <lgl> <chr> <dbl>
## 1 10271 148 NA NA Autauga, Alabama, US 18384.
## 2 37445 558 NA NA Baldwin, Alabama, US 16774.
## 3 3605 76 NA NA Barbour, Alabama, US 14603.
## 4 4283 89 NA NA Bibb, Alabama, US 19126.
## 5 10423 179 NA NA Blount, Alabama, US 18025.
## 6 1526 44 NA NA Bullock, Alabama, US 15107.
## 7 3365 96 NA NA Butler, Alabama, US 17303.
## 8 22341 497 NA NA Calhoun, Alabama, US 19666.
## 9 5787 144 NA NA Chambers, Alabama, US 17402.
## 10 3071 61 NA NA Cherokee, Alabama, US 11723.
## case_fatality_ratio
## <dbl>
## 1 1.44
## 2 1.49
## 3 2.11
## 4 2.08
## 5 1.72
## 6 2.88
## 7 2.85
## 8 2.22
## 9 2.49
## 10 1.99
## # … with 3,182 more rows
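An alternative to listing the entries to drop is to list the ones to keep. The sketch below builds such an allow-list from R's built-in state.name vector plus the District of Columbia and applies it to a small hand-picked vector of province_state values. It is written in base R and is not part of the original pipeline.

```r
# The 50 state names ship with base R (datasets::state.name); DC is added by hand
keep <- c(state.name, "District of Columbia")

# A few values taken from the sanity-check listing above
province_state <- c("Alabama", "Guam", "District of Columbia",
                    "Diamond Princess", "Recovered", "Wyoming")

kept    <- province_state[province_state %in% keep]
dropped <- setdiff(province_state, keep)

print(kept)
print(dropped)
```

An allow-list of this kind may be more robust if the upstream file later adds another non-state entry, since anything unrecognized is dropped by default.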
Graphically summarize the COVID-19 confirmed cases and deaths on 2021-11-02 by state.
county_count %>%
# turn into long format for easy plotting
pivot_longer(confirmed:recovered,
names_to = "case",
values_to = "count") %>%
group_by(province_state) %>%
ggplot() +
geom_col(mapping = aes(x = province_state, y = `count`, fill = `case`)) +
# scale_y_log10() +
labs(title = "US COVID-19 Situation on 2021-11-02", x = "State") +
theme(axis.text.x = element_text(angle = 90))
## Warning: Removed 3192 rows containing missing values (position_stack).
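The warning above is consistent with the recovered column being entirely NA: the range confirmed:recovered pulls it into the long table, contributing one missing count per row (3,192 in total). A minimal base-R sketch on toy rows, assuming only confirmed and deaths are of interest, aggregates those two columns by state instead.

```r
# Toy rows copied from the preview; `recovered` is NA throughout, as in the real file
toy <- data.frame(
  province_state = c("Alabama", "Alabama", "Alabama"),
  confirmed = c(10271, 37445, 3605),
  deaths    = c(148, 558, 76),
  recovered = NA
)

# Sum only the columns that carry data; `recovered` is left out on purpose
by_state <- aggregate(cbind(confirmed, deaths) ~ province_state,
                      data = toy, FUN = sum)
print(by_state)
```

In the tidyverse pipeline, the analogous change would be to pivot confirmed:deaths rather than confirmed:recovered, which should avoid stacking the all-NA column.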
Read in the 2020 county-level health ranking data.
county_info <- read_csv("./us-county-health-rankings-2020.csv") %>%
filter(!is.na(county)) %>%
# cast fips into dbl for use as a key for joining tables
mutate(fips = as.numeric(fips)) %>%
select(fips,
state,
county,
percent_fair_or_poor_health,
percent_smokers,
percent_adults_with_obesity,
# food_environment_index,
percent_with_access_to_exercise_opportunities,
percent_excessive_drinking,
# teen_birth_rate,
percent_uninsured,
# primary_care_physicians_rate,
# preventable_hospitalization_rate,
# high_school_graduation_rate,
percent_some_college,
percent_unemployed,
percent_children_in_poverty,
# `80th_percentile_income`,
# `20th_percentile_income`,
percent_single_parent_households,
# violent_crime_rate,
percent_severe_housing_problems,
overcrowding,
# life_expectancy,
# age_adjusted_death_rate,
percent_adults_with_diabetes,
# hiv_prevalence_rate,
percent_food_insecure,
# percent_limited_access_to_healthy_foods,
percent_insufficient_sleep,
percent_uninsured_2,
median_household_income,
average_traffic_volume_per_meter_of_major_roadways,
percent_homeowners,
# percent_severe_housing_cost_burden,
population_2,
percent_less_than_18_years_of_age,
percent_65_and_over,
percent_black,
percent_asian,
percent_hispanic,
percent_female,
percent_rural) %>%
print(width = Inf)
## Rows: 3193 Columns: 507
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): state, county, unreliable, primary_care_physicians_ratio, dentist...
## dbl (497): fips, num_deaths, years_of_potential_life_lost_rate, 95percent_ci...
## lgl (3): presence_of_water_violation, non_petitioned_cases, petitioned_cases
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 3,142 × 30
## fips state county percent_fair_or_poor_health percent_smokers
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1001 Alabama Autauga 20.9 18.1
## 2 1003 Alabama Baldwin 17.5 17.5
## 3 1005 Alabama Barbour 29.6 22.0
## 4 1007 Alabama Bibb 19.4 19.1
## 5 1009 Alabama Blount 21.7 19.2
## 6 1011 Alabama Bullock 31.0 22.9
## 7 1013 Alabama Butler 27.9 21.8
## 8 1015 Alabama Calhoun 23.1 20.6
## 9 1017 Alabama Chambers 24.0 19.4
## 10 1019 Alabama Cherokee 20.7 17.5
## percent_adults_with_obesity percent_with_access_to_exercise_opportunities
## <dbl> <dbl>
## 1 33.3 69.1
## 2 31 73.7
## 3 41.7 53.2
## 4 37.6 16.3
## 5 33.8 15.6
## 6 37.2 2.50
## 7 43.3 48.6
## 8 38.5 47.7
## 9 40.1 61.9
## 10 35 33.4
## percent_excessive_drinking percent_uninsured percent_some_college
## <dbl> <dbl> <dbl>
## 1 15.0 8.72 62.0
## 2 18.0 11.3 67.4
## 3 12.8 12.2 34.9
## 4 15.6 10.2 44.1
## 5 14.2 13.4 53.4
## 6 12.1 11.4 35.0
## 7 11.9 11.2 41.7
## 8 13.8 11.9 59.2
## 9 12.7 11.9 48.5
## 10 14.1 11.2 51.8
## percent_unemployed percent_children_in_poverty
## <dbl> <dbl>
## 1 3.63 19.3
## 2 3.62 13.9
## 3 5.17 43.9
## 4 3.97 27.8
## 5 3.51 18
## 6 4.69 68.3
## 7 4.79 36.3
## 8 4.65 26.5
## 9 3.91 30.7
## 10 3.57 24.7
## percent_single_parent_households percent_severe_housing_problems overcrowding
## <dbl> <dbl> <dbl>
## 1 26.2 14.7 1.20
## 2 24.1 13.6 1.27
## 3 56.6 14.6 1.69
## 4 28.7 10.5 0.255
## 5 28.6 10.5 1.89
## 6 74.8 18.1 0.113
## 7 52.7 13.2 1.69
## 8 40.2 13.7 1.54
## 9 46.6 16.0 4.04
## 10 23.8 13 1.5
## percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
## <dbl> <dbl> <dbl>
## 1 11.1 13.2 35.9
## 2 10.7 11.6 33.3
## 3 17.6 22 38.6
## 4 14.5 14.3 38.1
## 5 17 10.7 35.9
## 6 23.7 24.8 45.0
## 7 19.2 20.6 41.9
## 8 17.5 15.7 41.3
## 9 19.9 17.9 37.3
## 10 15.2 12.5 35.4
## percent_uninsured_2 median_household_income
## <dbl> <dbl>
## 1 11.1 59338
## 2 14.3 57588
## 3 16.1 34382
## 4 13 46064
## 5 17.1 50412
## 6 15.2 29267
## 7 14.5 37365
## 8 15.4 45400
## 9 15.2 39917
## 10 13.9 42132
## average_traffic_volume_per_meter_of_major_roadways percent_homeowners
## <dbl> <dbl>
## 1 88.5 74.9
## 2 87.0 73.6
## 3 102. 61.4
## 4 29.3 75.1
## 5 33.4 78.6
## 6 4.07 75.5
## 7 19.3 69.9
## 8 110. 69.5
## 9 20.3 67.8
## 10 25.9 79.0
## population_2 percent_less_than_18_years_of_age percent_65_and_over
## <dbl> <dbl> <dbl>
## 1 55601 23.7 15.6
## 2 218022 21.6 20.4
## 3 24881 20.9 19.4
## 4 22400 20.5 16.5
## 5 57840 23.2 18.2
## 6 10138 21.1 16.4
## 7 19680 22.2 20.3
## 8 114277 21.6 17.7
## 9 33615 20.8 19.5
## 10 26032 19.2 23.0
## percent_black percent_asian percent_hispanic percent_female percent_rural
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19.3 1.22 2.97 51.4 42.0
## 2 8.78 1.15 4.65 51.5 42.3
## 3 48.0 0.454 4.28 47.2 67.8
## 4 21.1 0.237 2.62 46.8 68.4
## 5 1.46 0.320 9.57 50.7 90.0
## 6 69.5 0.187 7.96 45.5 51.4
## 7 44.6 1.32 1.51 53.4 71.2
## 8 20.9 0.964 3.91 51.9 33.7
## 9 39.6 1.33 2.56 52.1 49.1
## 10 4.24 0.338 1.62 50.5 85.7
## # … with 3,132 more rows
For stability in estimating CFR, we restrict to counties with \(\ge 5\) confirmed cases.
county_count <- county_count %>%
filter(confirmed >= 5)
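One way to see why very small case counts make CFR unstable is the binomial standard error of an estimated proportion, sqrt(p(1 - p)/n). The sketch below evaluates it at several case counts for an assumed true CFR of 2%; the 2% figure and the helper name cfr_se are illustrative assumptions.

```r
# Hypothetical helper: binomial standard error of an estimated proportion
cfr_se <- function(p, n) sqrt(p * (1 - p) / n)

p <- 0.02                # assumed true CFR, for illustration only
n <- c(5, 100, 10000)    # numbers of confirmed cases

se <- cfr_se(p, n)
print(data.frame(confirmed = n, se_percent = round(100 * se, 2)))
```

With only five confirmed cases the standard error is several times larger than the assumed CFR itself, so the observed ratio says little about the underlying risk.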
We join the COVID-19 count data and the county-level information using the FIPS (Federal Information Processing Standards) code as the key.
county_data <- county_count %>%
left_join(county_info, by = "fips") %>%
print(width = Inf)
## # A tibble: 3,157 × 43
## fips admin2 province_state country_region last_update lat long_
## <dbl> <chr> <chr> <chr> <dttm> <dbl> <dbl>
## 1 1001 Autauga Alabama US 2021-11-03 06:22:09 32.5 -86.6
## 2 1003 Baldwin Alabama US 2021-11-03 06:22:09 30.7 -87.7
## 3 1005 Barbour Alabama US 2021-11-03 06:22:09 31.9 -85.4
## 4 1007 Bibb Alabama US 2021-11-03 06:22:09 33.0 -87.1
## 5 1009 Blount Alabama US 2021-11-03 06:22:09 34.0 -86.6
## 6 1011 Bullock Alabama US 2021-11-03 06:22:09 32.1 -85.7
## 7 1013 Butler Alabama US 2021-11-03 06:22:09 31.8 -86.7
## 8 1015 Calhoun Alabama US 2021-11-03 06:22:09 33.8 -85.8
## 9 1017 Chambers Alabama US 2021-11-03 06:22:09 32.9 -85.4
## 10 1019 Cherokee Alabama US 2021-11-03 06:22:09 34.2 -85.6
## confirmed deaths recovered active combined_key incident_rate
## <dbl> <dbl> <lgl> <lgl> <chr> <dbl>
## 1 10271 148 NA NA Autauga, Alabama, US 18384.
## 2 37445 558 NA NA Baldwin, Alabama, US 16774.
## 3 3605 76 NA NA Barbour, Alabama, US 14603.
## 4 4283 89 NA NA Bibb, Alabama, US 19126.
## 5 10423 179 NA NA Blount, Alabama, US 18025.
## 6 1526 44 NA NA Bullock, Alabama, US 15107.
## 7 3365 96 NA NA Butler, Alabama, US 17303.
## 8 22341 497 NA NA Calhoun, Alabama, US 19666.
## 9 5787 144 NA NA Chambers, Alabama, US 17402.
## 10 3071 61 NA NA Cherokee, Alabama, US 11723.
## case_fatality_ratio state county percent_fair_or_poor_health
## <dbl> <chr> <chr> <dbl>
## 1 1.44 Alabama Autauga 20.9
## 2 1.49 Alabama Baldwin 17.5
## 3 2.11 Alabama Barbour 29.6
## 4 2.08 Alabama Bibb 19.4
## 5 1.72 Alabama Blount 21.7
## 6 2.88 Alabama Bullock 31.0
## 7 2.85 Alabama Butler 27.9
## 8 2.22 Alabama Calhoun 23.1
## 9 2.49 Alabama Chambers 24.0
## 10 1.99 Alabama Cherokee 20.7
## percent_smokers percent_adults_with_obesity
## <dbl> <dbl>
## 1 18.1 33.3
## 2 17.5 31
## 3 22.0 41.7
## 4 19.1 37.6
## 5 19.2 33.8
## 6 22.9 37.2
## 7 21.8 43.3
## 8 20.6 38.5
## 9 19.4 40.1
## 10 17.5 35
## percent_with_access_to_exercise_opportunities percent_excessive_drinking
## <dbl> <dbl>
## 1 69.1 15.0
## 2 73.7 18.0
## 3 53.2 12.8
## 4 16.3 15.6
## 5 15.6 14.2
## 6 2.50 12.1
## 7 48.6 11.9
## 8 47.7 13.8
## 9 61.9 12.7
## 10 33.4 14.1
## percent_uninsured percent_some_college percent_unemployed
## <dbl> <dbl> <dbl>
## 1 8.72 62.0 3.63
## 2 11.3 67.4 3.62
## 3 12.2 34.9 5.17
## 4 10.2 44.1 3.97
## 5 13.4 53.4 3.51
## 6 11.4 35.0 4.69
## 7 11.2 41.7 4.79
## 8 11.9 59.2 4.65
## 9 11.9 48.5 3.91
## 10 11.2 51.8 3.57
## percent_children_in_poverty percent_single_parent_households
## <dbl> <dbl>
## 1 19.3 26.2
## 2 13.9 24.1
## 3 43.9 56.6
## 4 27.8 28.7
## 5 18 28.6
## 6 68.3 74.8
## 7 36.3 52.7
## 8 26.5 40.2
## 9 30.7 46.6
## 10 24.7 23.8
## percent_severe_housing_problems overcrowding percent_adults_with_diabetes
## <dbl> <dbl> <dbl>
## 1 14.7 1.20 11.1
## 2 13.6 1.27 10.7
## 3 14.6 1.69 17.6
## 4 10.5 0.255 14.5
## 5 10.5 1.89 17
## 6 18.1 0.113 23.7
## 7 13.2 1.69 19.2
## 8 13.7 1.54 17.5
## 9 16.0 4.04 19.9
## 10 13 1.5 15.2
## percent_food_insecure percent_insufficient_sleep percent_uninsured_2
## <dbl> <dbl> <dbl>
## 1 13.2 35.9 11.1
## 2 11.6 33.3 14.3
## 3 22 38.6 16.1
## 4 14.3 38.1 13
## 5 10.7 35.9 17.1
## 6 24.8 45.0 15.2
## 7 20.6 41.9 14.5
## 8 15.7 41.3 15.4
## 9 17.9 37.3 15.2
## 10 12.5 35.4 13.9
## median_household_income average_traffic_volume_per_meter_of_major_roadways
## <dbl> <dbl>
## 1 59338 88.5
## 2 57588 87.0
## 3 34382 102.
## 4 46064 29.3
## 5 50412 33.4
## 6 29267 4.07
## 7 37365 19.3
## 8 45400 110.
## 9 39917 20.3
## 10 42132 25.9
## percent_homeowners population_2 percent_less_than_18_years_of_age
## <dbl> <dbl> <dbl>
## 1 74.9 55601 23.7
## 2 73.6 218022 21.6
## 3 61.4 24881 20.9
## 4 75.1 22400 20.5
## 5 78.6 57840 23.2
## 6 75.5 10138 21.1
## 7 69.9 19680 22.2
## 8 69.5 114277 21.6
## 9 67.8 33615 20.8
## 10 79.0 26032 19.2
## percent_65_and_over percent_black percent_asian percent_hispanic
## <dbl> <dbl> <dbl> <dbl>
## 1 15.6 19.3 1.22 2.97
## 2 20.4 8.78 1.15 4.65
## 3 19.4 48.0 0.454 4.28
## 4 16.5 21.1 0.237 2.62
## 5 18.2 1.46 0.320 9.57
## 6 16.4 69.5 0.187 7.96
## 7 20.3 44.6 1.32 1.51
## 8 17.7 20.9 0.964 3.91
## 9 19.5 39.6 1.33 2.56
## 10 23.0 4.24 0.338 1.62
## percent_female percent_rural
## <dbl> <dbl>
## 1 51.4 42.0
## 2 51.5 42.3
## 3 47.2 67.8
## 4 46.8 68.4
## 5 50.7 90.0
## 6 45.5 51.4
## 7 53.4 71.2
## 8 51.9 33.7
## 9 52.1 49.1
## 10 50.5 85.7
## # … with 3,147 more rows
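A left join keeps every row of the left table and fills the right table's columns with NA wherever the key has no match, which is why some rows of the joined table lack health-ranking values. The toy example below uses base R's merge() with all.x = TRUE as a stand-in for left_join(); the two rows and the single info column are chosen for illustration.

```r
counts <- data.frame(fips = c(1001, 90002),
                     admin2 = c("Autauga", "Unassigned"),
                     stringsAsFactors = FALSE)
info <- data.frame(fips = 1001, percent_smokers = 18.1)

# all.x = TRUE keeps unmatched rows from `counts`, like dplyr::left_join()
joined <- merge(counts, info, by = "fips", all.x = TRUE)
print(joined)
```

The row with fips 90002 survives the join, but its percent_smokers value is NA.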
Numerical summaries of each variable:
summary(county_data)
## fips admin2 province_state country_region
## Min. : 1001 Length:3157 Length:3157 Length:3157
## 1st Qu.:18592 Class :character Class :character Class :character
## Median :29187 Mode :character Mode :character Mode :character
## Mean :30842
## 3rd Qu.:46006
## Max. :90053
## NA's :10
## last_update lat long_
## Min. :2021-11-03 06:22:09 Min. :19.60 Min. :-174.16
## 1st Qu.:2021-11-03 06:22:09 1st Qu.:34.68 1st Qu.: -98.14
## Median :2021-11-03 06:22:09 Median :38.37 Median : -90.30
## Mean :2021-11-03 06:22:09 Mean :38.45 Mean : -92.17
## 3rd Qu.:2021-11-03 06:22:09 3rd Qu.:41.83 3rd Qu.: -83.43
## Max. :2021-11-03 06:22:09 Max. :69.31 Max. : -67.63
## NA's :33 NA's :33
## confirmed deaths recovered active
## Min. : 8 Min. : 0.0 Mode:logical Mode:logical
## 1st Qu.: 1620 1st Qu.: 28.0 NA's:3157 NA's:3157
## Median : 3921 Median : 66.0
## Mean : 14557 Mean : 235.7
## 3rd Qu.: 10240 3rd Qu.: 162.0
## Max. :1495014 Max. :26661.0
##
## combined_key incident_rate case_fatality_ratio state
## Length:3157 Min. : 1962 Min. : 0.000 Length:3157
## Class :character 1st Qu.:12824 1st Qu.: 1.193 Class :character
## Mode :character Median :15077 Median : 1.632 Mode :character
## Mean :14994 Mean : 2.902
## 3rd Qu.:17153 3rd Qu.: 2.184
## Max. :54277 Max. :2829.070
## NA's :33
## county percent_fair_or_poor_health percent_smokers
## Length:3157 Min. : 8.121 Min. : 5.909
## Class :character 1st Qu.:14.361 1st Qu.:14.989
## Mode :character Median :17.261 Median :16.989
## Mean :17.975 Mean :17.525
## 3rd Qu.:20.953 3rd Qu.:19.766
## Max. :40.991 Max. :41.491
## NA's :43 NA's :43
## percent_adults_with_obesity percent_with_access_to_exercise_opportunities
## Min. :12.4 Min. : 0.00
## 1st Qu.:29.3 1st Qu.: 48.47
## Median :33.1 Median : 65.80
## Mean :32.9 Mean : 62.74
## 3rd Qu.:36.6 3rd Qu.: 79.99
## Max. :57.7 Max. :100.00
## NA's :43 NA's :48
## percent_excessive_drinking percent_uninsured percent_some_college
## Min. : 7.81 Min. : 2.263 Min. : 15.18
## 1st Qu.:15.34 1st Qu.: 7.381 1st Qu.: 49.79
## Median :17.58 Median :10.553 Median : 57.93
## Mean :17.55 Mean :11.471 Mean : 57.84
## 3rd Qu.:19.68 3rd Qu.:14.470 3rd Qu.: 66.47
## Max. :28.62 Max. :33.750 Max. :100.00
## NA's :43 NA's :43 NA's :43
## percent_unemployed percent_children_in_poverty
## Min. : 1.302 Min. : 2.50
## 1st Qu.: 3.126 1st Qu.:14.60
## Median : 3.875 Median :20.10
## Mean : 4.130 Mean :21.17
## 3rd Qu.: 4.818 3rd Qu.:26.40
## Max. :19.904 Max. :68.30
## NA's :43 NA's :43
## percent_single_parent_households percent_severe_housing_problems
## Min. : 0.00 Min. : 3.22
## 1st Qu.:25.63 1st Qu.:11.01
## Median :31.71 Median :13.33
## Mean :32.46 Mean :13.87
## 3rd Qu.:37.74 3rd Qu.:15.93
## Max. :87.20 Max. :70.89
## NA's :44 NA's :43
## overcrowding percent_adults_with_diabetes percent_food_insecure
## Min. : 0.000 Min. : 1.80 Min. : 2.90
## 1st Qu.: 1.231 1st Qu.: 9.30 1st Qu.:10.60
## Median : 1.878 Median :11.60 Median :12.75
## Mean : 2.415 Mean :12.14 Mean :13.25
## 3rd Qu.: 2.840 3rd Qu.:14.60 3rd Qu.:15.20
## Max. :51.585 Max. :34.10 Max. :36.30
## NA's :43 NA's :43 NA's :43
## percent_insufficient_sleep percent_uninsured_2 median_household_income
## Min. :23.03 Min. : 2.683 Min. : 25385
## 1st Qu.:30.10 1st Qu.: 8.537 1st Qu.: 43650
## Median :33.01 Median :12.481 Median : 50514
## Mean :33.07 Mean :13.586 Mean : 52725
## 3rd Qu.:36.13 3rd Qu.:17.429 3rd Qu.: 58741
## Max. :46.71 Max. :42.397 Max. :140382
## NA's :43 NA's :43 NA's :43
## average_traffic_volume_per_meter_of_major_roadways percent_homeowners
## Min. : 0.00 Min. :19.61
## 1st Qu.: 26.92 1st Qu.:67.54
## Median : 57.97 Median :72.58
## Mean : 129.63 Mean :71.43
## 3rd Qu.: 123.28 3rd Qu.:77.00
## Max. :4496.41 Max. :92.40
## NA's :43 NA's :43
## population_2 percent_less_than_18_years_of_age percent_65_and_over
## Min. : 152 Min. : 7.069 Min. : 4.83
## 1st Qu.: 11043 1st Qu.:20.025 1st Qu.:16.30
## Median : 26096 Median :22.051 Median :18.93
## Mean : 104770 Mean :22.038 Mean :19.28
## 3rd Qu.: 68348 3rd Qu.:23.840 3rd Qu.:21.80
## Max. :10105518 Max. :41.992 Max. :57.59
## NA's :43 NA's :43 NA's :43
## percent_black percent_asian percent_hispanic percent_female
## Min. : 0.0000 Min. : 0.0000 Min. : 0.6105 Min. :26.84
## 1st Qu.: 0.7283 1st Qu.: 0.4651 1st Qu.: 2.3948 1st Qu.:49.43
## Median : 2.2841 Median : 0.7381 Median : 4.3525 Median :50.32
## Mean : 9.0652 Mean : 1.5707 Mean : 9.6658 Mean :49.89
## 3rd Qu.:10.3587 3rd Qu.: 1.4350 3rd Qu.: 9.9949 3rd Qu.:51.03
## Max. :85.4143 Max. :43.3570 Max. :96.3595 Max. :56.87
## NA's :43 NA's :43 NA's :43 NA's :43
## percent_rural
## Min. : 0.00
## 1st Qu.: 33.25
## Median : 59.48
## Mean : 58.58
## 3rd Qu.: 87.67
## Max. :100.00
## NA's :49
List the rows in county_data that have no match in county_info:
county_data %>%
filter(is.na(state) & is.na(county)) %>%
print(n = Inf)
## # A tibble: 43 × 43
## fips admin2 province_state country_region last_update lat long_
## <dbl> <chr> <chr> <chr> <dttm> <dbl> <dbl>
## 1 2063 Chugach Alaska US 2021-11-03 06:22:09 61.2 -150.
## 2 2066 Copper … Alaska US 2021-11-03 06:22:09 60.4 -163.
## 3 90002 Unassig… Alaska US 2021-11-03 06:22:09 NA NA
## 4 90005 Unassig… Arkansas US 2021-11-03 06:22:09 NA NA
## 5 90006 Unassig… California US 2021-11-03 06:22:09 NA NA
## 6 90008 Unassig… Colorado US 2021-11-03 06:22:09 NA NA
## 7 90009 Unassig… Connecticut US 2021-11-03 06:22:09 NA NA
## 8 90010 Unassig… Delaware US 2021-11-03 06:22:09 NA NA
## 9 90012 Unassig… Florida US 2021-11-03 06:22:09 NA NA
## 10 80013 Out of … Georgia US 2021-11-03 06:22:09 NA NA
## 11 90013 Unassig… Georgia US 2021-11-03 06:22:09 NA NA
## 12 80015 Out of … Hawaii US 2021-11-03 06:22:09 NA NA
## 13 80017 Out of … Illinois US 2021-11-03 06:22:09 NA NA
## 14 90017 Unassig… Illinois US 2021-11-03 06:22:09 NA NA
## 15 90019 Unassig… Iowa US 2021-11-03 06:22:09 NA NA
## 16 90021 Unassig… Kentucky US 2021-11-03 06:22:09 NA NA
## 17 90022 Unassig… Louisiana US 2021-11-03 06:22:09 NA NA
## 18 NA Dukes a… Massachusetts US 2021-11-03 06:22:09 41.4 -70.7
## 19 90025 Unassig… Massachusetts US 2021-11-03 06:22:09 NA NA
## 20 NA Federal… Michigan US 2021-11-03 06:22:09 NA NA
## 21 NA Michiga… Michigan US 2021-11-03 06:22:09 NA NA
## 22 80026 Out of … Michigan US 2021-11-03 06:22:09 NA NA
## 23 90026 Unassig… Michigan US 2021-11-03 06:22:09 NA NA
## 24 90027 Unassig… Minnesota US 2021-11-03 06:22:09 NA NA
## 25 NA Kansas … Missouri US 2021-11-03 06:22:09 39.1 -94.6
## 26 90031 Unassig… Nebraska US 2021-11-03 06:22:09 NA NA
## 27 90033 Unassig… New Hampshire US 2021-11-03 06:22:09 NA NA
## 28 90034 Unassig… New Jersey US 2021-11-03 06:22:09 NA NA
## 29 90035 Unassig… New Mexico US 2021-11-03 06:22:09 NA NA
## 30 90036 Unassig… New York US 2021-11-03 06:22:09 NA NA
## 31 90040 Unassig… Oklahoma US 2021-11-03 06:22:09 NA NA
## 32 90044 Unassig… Rhode Island US 2021-11-03 06:22:09 NA NA
## 33 80047 Out of … Tennessee US 2021-11-03 06:22:09 NA NA
## 34 90047 Unassig… Tennessee US 2021-11-03 06:22:09 NA NA
## 35 NA Bear Ri… Utah US 2021-11-03 06:22:09 41.5 -113.
## 36 NA Central… Utah US 2021-11-03 06:22:09 39.4 -112.
## 37 NA Southea… Utah US 2021-11-03 06:22:09 39.0 -111.
## 38 NA Southwe… Utah US 2021-11-03 06:22:09 37.9 -111.
## 39 NA TriCoun… Utah US 2021-11-03 06:22:09 40.1 -110.
## 40 90049 Unassig… Utah US 2021-11-03 06:22:09 NA NA
## 41 NA Weber-M… Utah US 2021-11-03 06:22:09 41.3 -112.
## 42 90050 Unassig… Vermont US 2021-11-03 06:22:09 NA NA
## 43 90053 Unassig… Washington US 2021-11-03 06:22:09 NA NA
## # … with 36 more variables: confirmed <dbl>, deaths <dbl>, recovered <lgl>,
## # active <lgl>, combined_key <chr>, incident_rate <dbl>,
## # case_fatality_ratio <dbl>, state <chr>, county <chr>,
## # percent_fair_or_poor_health <dbl>, percent_smokers <dbl>,
## # percent_adults_with_obesity <dbl>,
## # percent_with_access_to_exercise_opportunities <dbl>,
## # percent_excessive_drinking <dbl>, percent_uninsured <dbl>, …
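The same unmatched rows can be found without joining first, by checking which keys in the count table are absent from the info table (the idea behind dplyr's anti_join()). A base-R sketch on toy data, with rows taken from the listings in this document:

```r
counts <- data.frame(
  fips   = c(1001, 2063, 90002, NA),
  admin2 = c("Autauga", "Chugach", "Unassigned", "Kansas City"),
  stringsAsFactors = FALSE
)
info <- data.frame(fips = c(1001, 1003))

# Rows of `counts` whose key is missing or not present in `info`
unmatched <- counts[!(counts$fips %in% info$fips), ]
print(unmatched)
```

Because NA %in% x is FALSE, rows with a missing fips are reported as unmatched too, which mirrors what the join-based check shows.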
Some rows are missing fips:
county_count %>%
filter(is.na(fips)) %>%
select(fips, admin2, province_state) %>%
print(n = Inf)
## # A tibble: 10 × 3
## fips admin2 province_state
## <dbl> <chr> <chr>
## 1 NA Dukes and Nantucket Massachusetts
## 2 NA Federal Correctional Institution (FCI) Michigan
## 3 NA Michigan Department of Corrections (MDOC) Michigan
## 4 NA Kansas City Missouri
## 5 NA Bear River Utah
## 6 NA Central Utah Utah
## 7 NA Southeast Utah Utah
## 8 NA Southwest Utah Utah
## 9 NA TriCounty Utah
## 10 NA Weber-Morgan Utah
We need to (1) manually set the fips for some counties, (2) discard the rows labeled Unassigned, unassigned, or Out of, along with any rows still missing fips, and (3) join with county_info again.
county_data <- county_count %>%
# manually set FIPS for some counties
mutate(fips = ifelse(admin2 == "Dukes and Nantucket" & province_state == "Massachusetts", 25019, fips)) %>%
mutate(fips = ifelse(admin2 == "Weber-Morgan" & province_state == "Utah", 49057, fips)) %>%
# remove the variables `recovered` and `active` because they are just columns of NAs
mutate(recovered = NULL, active = NULL) %>%
filter(!(is.na(fips) | str_detect(admin2, "Out of") | str_detect(admin2, "Unassigned"))) %>%
left_join(county_info, by = "fips") %>%
drop_na() %>%
print(width = Inf)
## # A tibble: 3,109 × 41
## fips admin2 province_state country_region last_update lat long_
## <dbl> <chr> <chr> <chr> <dttm> <dbl> <dbl>
## 1 1001 Autauga Alabama US 2021-11-03 06:22:09 32.5 -86.6
## 2 1003 Baldwin Alabama US 2021-11-03 06:22:09 30.7 -87.7
## 3 1005 Barbour Alabama US 2021-11-03 06:22:09 31.9 -85.4
## 4 1007 Bibb Alabama US 2021-11-03 06:22:09 33.0 -87.1
## 5 1009 Blount Alabama US 2021-11-03 06:22:09 34.0 -86.6
## 6 1011 Bullock Alabama US 2021-11-03 06:22:09 32.1 -85.7
## 7 1013 Butler Alabama US 2021-11-03 06:22:09 31.8 -86.7
## 8 1015 Calhoun Alabama US 2021-11-03 06:22:09 33.8 -85.8
## 9 1017 Chambers Alabama US 2021-11-03 06:22:09 32.9 -85.4
## 10 1019 Cherokee Alabama US 2021-11-03 06:22:09 34.2 -85.6
## confirmed deaths combined_key incident_rate case_fatality_ratio
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 10271 148 Autauga, Alabama, US 18384. 1.44
## 2 37445 558 Baldwin, Alabama, US 16774. 1.49
## 3 3605 76 Barbour, Alabama, US 14603. 2.11
## 4 4283 89 Bibb, Alabama, US 19126. 2.08
## 5 10423 179 Blount, Alabama, US 18025. 1.72
## 6 1526 44 Bullock, Alabama, US 15107. 2.88
## 7 3365 96 Butler, Alabama, US 17303. 2.85
## 8 22341 497 Calhoun, Alabama, US 19666. 2.22
## 9 5787 144 Chambers, Alabama, US 17402. 2.49
## 10 3071 61 Cherokee, Alabama, US 11723. 1.99
## state county percent_fair_or_poor_health percent_smokers
## <chr> <chr> <dbl> <dbl>
## 1 Alabama Autauga 20.9 18.1
## 2 Alabama Baldwin 17.5 17.5
## 3 Alabama Barbour 29.6 22.0
## 4 Alabama Bibb 19.4 19.1
## 5 Alabama Blount 21.7 19.2
## 6 Alabama Bullock 31.0 22.9
## 7 Alabama Butler 27.9 21.8
## 8 Alabama Calhoun 23.1 20.6
## 9 Alabama Chambers 24.0 19.4
## 10 Alabama Cherokee 20.7 17.5
## percent_adults_with_obesity percent_with_access_to_exercise_opportunities
## <dbl> <dbl>
## 1 33.3 69.1
## 2 31 73.7
## 3 41.7 53.2
## 4 37.6 16.3
## 5 33.8 15.6
## 6 37.2 2.50
## 7 43.3 48.6
## 8 38.5 47.7
## 9 40.1 61.9
## 10 35 33.4
## percent_excessive_drinking percent_uninsured percent_some_college
## <dbl> <dbl> <dbl>
## 1 15.0 8.72 62.0
## 2 18.0 11.3 67.4
## 3 12.8 12.2 34.9
## 4 15.6 10.2 44.1
## 5 14.2 13.4 53.4
## 6 12.1 11.4 35.0
## 7 11.9 11.2 41.7
## 8 13.8 11.9 59.2
## 9 12.7 11.9 48.5
## 10 14.1 11.2 51.8
## percent_unemployed percent_children_in_poverty
## <dbl> <dbl>
## 1 3.63 19.3
## 2 3.62 13.9
## 3 5.17 43.9
## 4 3.97 27.8
## 5 3.51 18
## 6 4.69 68.3
## 7 4.79 36.3
## 8 4.65 26.5
## 9 3.91 30.7
## 10 3.57 24.7
## percent_single_parent_households percent_severe_housing_problems overcrowding
## <dbl> <dbl> <dbl>
## 1 26.2 14.7 1.20
## 2 24.1 13.6 1.27
## 3 56.6 14.6 1.69
## 4 28.7 10.5 0.255
## 5 28.6 10.5 1.89
## 6 74.8 18.1 0.113
## 7 52.7 13.2 1.69
## 8 40.2 13.7 1.54
## 9 46.6 16.0 4.04
## 10 23.8 13 1.5
## percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
## <dbl> <dbl> <dbl>
## 1 11.1 13.2 35.9
## 2 10.7 11.6 33.3
## 3 17.6 22 38.6
## 4 14.5 14.3 38.1
## 5 17 10.7 35.9
## 6 23.7 24.8 45.0
## 7 19.2 20.6 41.9
## 8 17.5 15.7 41.3
## 9 19.9 17.9 37.3
## 10 15.2 12.5 35.4
## percent_uninsured_2 median_household_income
## <dbl> <dbl>
## 1 11.1 59338
## 2 14.3 57588
## 3 16.1 34382
## 4 13 46064
## 5 17.1 50412
## 6 15.2 29267
## 7 14.5 37365
## 8 15.4 45400
## 9 15.2 39917
## 10 13.9 42132
## average_traffic_volume_per_meter_of_major_roadways percent_homeowners
## <dbl> <dbl>
## 1 88.5 74.9
## 2 87.0 73.6
## 3 102. 61.4
## 4 29.3 75.1
## 5 33.4 78.6
## 6 4.07 75.5
## 7 19.3 69.9
## 8 110. 69.5
## 9 20.3 67.8
## 10 25.9 79.0
## population_2 percent_less_than_18_years_of_age percent_65_and_over
## <dbl> <dbl> <dbl>
## 1 55601 23.7 15.6
## 2 218022 21.6 20.4
## 3 24881 20.9 19.4
## 4 22400 20.5 16.5
## 5 57840 23.2 18.2
## 6 10138 21.1 16.4
## 7 19680 22.2 20.3
## 8 114277 21.6 17.7
## 9 33615 20.8 19.5
## 10 26032 19.2 23.0
## percent_black percent_asian percent_hispanic percent_female percent_rural
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19.3 1.22 2.97 51.4 42.0
## 2 8.78 1.15 4.65 51.5 42.3
## 3 48.0 0.454 4.28 47.2 67.8
## 4 21.1 0.237 2.62 46.8 68.4
## 5 1.46 0.320 9.57 50.7 90.0
## 6 69.5 0.187 7.96 45.5 51.4
## 7 44.6 1.32 1.51 53.4 71.2
## 8 20.9 0.964 3.91 51.9 33.7
## 9 39.6 1.33 2.56 52.1 49.1
## 10 4.24 0.338 1.62 50.5 85.7
## # … with 3,099 more rows
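The text mentions both Unassigned and unassigned, while the filter in the pipeline above matches the capitalized spelling only. If a lowercase spelling were present, a single case-insensitive pattern would catch it at the filtering step. The sketch below uses base R's grepl() on a made-up vector; "Out of XX" is a placeholder label, not a value from the data.

```r
# Made-up labels; "Out of XX" is a placeholder, not an actual entry
admin2 <- c("Autauga", "Unassigned", "unassigned", "Out of XX", "Weber-Morgan")

# One case-insensitive pattern covering both kinds of placeholder rows
is_placeholder <- grepl("^(unassigned|out of)", admin2, ignore.case = TRUE)

print(admin2[!is_placeholder])
```

Anchoring the pattern with ^ restricts the match to labels that start with these words, which avoids dropping a county whose name merely contains them.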
Summarize again:
summary(county_data)
## fips admin2 province_state country_region
## Min. : 1001 Length:3109 Length:3109 Length:3109
## 1st Qu.:18179 Class :character Class :character Class :character
## Median :29163 Mode :character Mode :character Mode :character
## Mean :30326
## 3rd Qu.:45051
## Max. :56045
## last_update lat long_
## Min. :2021-11-03 06:22:09 Min. :19.60 Min. :-174.16
## 1st Qu.:2021-11-03 06:22:09 1st Qu.:34.65 1st Qu.: -98.07
## Median :2021-11-03 06:22:09 Median :38.35 Median : -90.21
## Mean :2021-11-03 06:22:09 Mean :38.40 Mean : -92.01
## 3rd Qu.:2021-11-03 06:22:09 3rd Qu.:41.80 3rd Qu.: -83.40
## Max. :2021-11-03 06:22:09 Max. :69.31 Max. : -67.63
## confirmed deaths combined_key incident_rate
## Min. : 19 Min. : 0.0 Length:3109 Min. : 1962
## 1st Qu.: 1644 1st Qu.: 28.0 Class :character 1st Qu.:12829
## Median : 3941 Median : 67.0 Mode :character Median :15074
## Mean : 14650 Mean : 229.9 Mean :14998
## 3rd Qu.: 10275 3rd Qu.: 164.0 3rd Qu.:17168
## Max. :1495014 Max. :26661.0 Max. :54277
## case_fatality_ratio state county
## Min. :0.000 Length:3109 Length:3109
## 1st Qu.:1.209 Class :character Class :character
## Median :1.641 Mode :character Mode :character
## Mean :1.768
## 3rd Qu.:2.189
## Max. :7.628
## percent_fair_or_poor_health percent_smokers percent_adults_with_obesity
## Min. : 8.121 Min. : 5.909 Min. :12.4
## 1st Qu.:14.361 1st Qu.:14.987 1st Qu.:29.3
## Median :17.260 Median :16.985 Median :33.1
## Mean :17.968 Mean :17.508 Mean :32.9
## 3rd Qu.:20.950 3rd Qu.:19.755 3rd Qu.:36.6
## Max. :40.991 Max. :41.491 Max. :57.7
## percent_with_access_to_exercise_opportunities percent_excessive_drinking
## Min. : 0.00 Min. : 7.81
## 1st Qu.: 48.52 1st Qu.:15.34
## Median : 65.82 Median :17.58
## Mean : 62.79 Mean :17.54
## 3rd Qu.: 80.09 3rd Qu.:19.68
## Max. :100.00 Max. :28.62
## percent_uninsured percent_some_college percent_unemployed
## Min. : 2.263 Min. :15.18 Min. : 1.302
## 1st Qu.: 7.376 1st Qu.:49.80 1st Qu.: 3.120
## Median :10.528 Median :57.93 Median : 3.873
## Mean :11.452 Mean :57.85 Mean : 4.117
## 3rd Qu.:14.445 3rd Qu.:66.47 3rd Qu.: 4.814
## Max. :33.750 Max. :90.67 Max. :18.092
## percent_children_in_poverty percent_single_parent_households
## Min. : 2.50 Min. : 0.00
## 1st Qu.:14.60 1st Qu.:25.62
## Median :20.10 Median :31.70
## Mean :21.15 Mean :32.44
## 3rd Qu.:26.40 3rd Qu.:37.70
## Max. :68.30 Max. :87.20
## percent_severe_housing_problems overcrowding percent_adults_with_diabetes
## Min. : 3.22 Min. : 0.000 Min. : 1.80
## 1st Qu.:11.01 1st Qu.: 1.230 1st Qu.: 9.30
## Median :13.32 Median : 1.877 Median :11.60
## Mean :13.84 Mean : 2.391 Mean :12.15
## 3rd Qu.:15.92 3rd Qu.: 2.840 3rd Qu.:14.60
## Max. :60.26 Max. :38.058 Max. :34.10
## percent_food_insecure percent_insufficient_sleep percent_uninsured_2
## Min. : 2.90 Min. :23.03 Min. : 2.683
## 1st Qu.:10.60 1st Qu.:30.11 1st Qu.: 8.530
## Median :12.70 Median :33.01 Median :12.460
## Mean :13.24 Mean :33.07 Mean :13.564
## 3rd Qu.:15.20 3rd Qu.:36.13 3rd Qu.:17.383
## Max. :36.30 Max. :46.71 Max. :42.397
## median_household_income average_traffic_volume_per_meter_of_major_roadways
## Min. : 25385 Min. : 0.00
## 1st Qu.: 43650 1st Qu.: 26.94
## Median : 50525 Median : 58.08
## Mean : 52737 Mean : 129.85
## 3rd Qu.: 58742 3rd Qu.: 123.43
## Max. :140382 Max. :4496.41
## percent_homeowners population_2 percent_less_than_18_years_of_age
## Min. :19.61 Min. : 277 Min. : 7.069
## 1st Qu.:67.55 1st Qu.: 11113 1st Qu.:20.027
## Median :72.59 Median : 26158 Median :22.050
## Mean :71.44 Mean : 105013 Mean :22.028
## 3rd Qu.:77.01 3rd Qu.: 68557 3rd Qu.:23.838
## Max. :92.40 Max. :10105518 Max. :41.992
## percent_65_and_over percent_black percent_asian percent_hispanic
## Min. : 4.83 Min. : 0.0000 Min. : 0.0000 Min. : 0.6105
## 1st Qu.:16.30 1st Qu.: 0.7307 1st Qu.: 0.4654 1st Qu.: 2.3926
## Median :18.93 Median : 2.3228 Median : 0.7383 Median : 4.3532
## Mean :19.29 Mean : 9.0803 Mean : 1.5709 Mean : 9.6791
## 3rd Qu.:21.81 3rd Qu.:10.3662 3rd Qu.: 1.4353 3rd Qu.:10.0066
## Max. :57.59 Max. :85.4143 Max. :43.3570 Max. :96.3595
## percent_female percent_rural
## Min. :26.84 Min. : 0.00
## 1st Qu.:49.43 1st Qu.: 33.15
## Median :50.32 Median : 59.45
## Mean :49.90 Mean : 58.54
## 3rd Qu.:51.03 3rd Qu.: 87.30
## Max. :56.87 Max. :100.00
If any variables have missing values for many counties, we go back and remove those variables from consideration.
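One way to carry out this check is to count the missing values in each column. The sketch below uses a small made-up data frame (`toy`, with hypothetical values) in place of county_data, and the 25% cut-off for dropping a column is an assumption chosen only for illustration.

```r
# Toy data frame standing in for county_data (values are hypothetical).
toy <- data.frame(
  county = c("A", "B", "C", "D"),
  percent_smokers = c(17.5, NA, 20.1, 15.3),
  percent_rural = c(42.0, 67.8, NA, NA)
)

# Number of missing values in each column.
na_count <- colSums(is.na(toy))
print(na_count)

# Keep only columns missing in at most 25% of rows (threshold is an assumption).
keep <- names(na_count)[na_count / nrow(toy) <= 0.25]
toy_clean <- toy[, keep, drop = FALSE]
print(names(toy_clean))
```

The same two lines, `colSums(is.na(...))` followed by a column subset, should apply to the real table as well.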
Let’s create a final data frame for analysis.
county_data <- county_data %>%
mutate(state = as.factor(state)) %>%
select(county, confirmed, deaths, state, percent_fair_or_poor_health:percent_rural)
summary(county_data)
## county confirmed deaths state
## Length:3109 Min. : 19 Min. : 0.0 Texas : 253
## Class :character 1st Qu.: 1644 1st Qu.: 28.0 Georgia : 159
## Mode :character Median : 3941 Median : 67.0 Virginia: 133
## Mean : 14650 Mean : 229.9 Kentucky: 120
## 3rd Qu.: 10275 3rd Qu.: 164.0 Missouri: 115
## Max. :1495014 Max. :26661.0 Kansas : 105
## (Other) :2224
## percent_fair_or_poor_health percent_smokers percent_adults_with_obesity
## Min. : 8.121 Min. : 5.909 Min. :12.4
## 1st Qu.:14.361 1st Qu.:14.987 1st Qu.:29.3
## Median :17.260 Median :16.985 Median :33.1
## Mean :17.968 Mean :17.508 Mean :32.9
## 3rd Qu.:20.950 3rd Qu.:19.755 3rd Qu.:36.6
## Max. :40.991 Max. :41.491 Max. :57.7
##
## percent_with_access_to_exercise_opportunities percent_excessive_drinking
## Min. : 0.00 Min. : 7.81
## 1st Qu.: 48.52 1st Qu.:15.34
## Median : 65.82 Median :17.58
## Mean : 62.79 Mean :17.54
## 3rd Qu.: 80.09 3rd Qu.:19.68
## Max. :100.00 Max. :28.62
##
## percent_uninsured percent_some_college percent_unemployed
## Min. : 2.263 Min. :15.18 Min. : 1.302
## 1st Qu.: 7.376 1st Qu.:49.80 1st Qu.: 3.120
## Median :10.528 Median :57.93 Median : 3.873
## Mean :11.452 Mean :57.85 Mean : 4.117
## 3rd Qu.:14.445 3rd Qu.:66.47 3rd Qu.: 4.814
## Max. :33.750 Max. :90.67 Max. :18.092
##
## percent_children_in_poverty percent_single_parent_households
## Min. : 2.50 Min. : 0.00
## 1st Qu.:14.60 1st Qu.:25.62
## Median :20.10 Median :31.70
## Mean :21.15 Mean :32.44
## 3rd Qu.:26.40 3rd Qu.:37.70
## Max. :68.30 Max. :87.20
##
## percent_severe_housing_problems overcrowding percent_adults_with_diabetes
## Min. : 3.22 Min. : 0.000 Min. : 1.80
## 1st Qu.:11.01 1st Qu.: 1.230 1st Qu.: 9.30
## Median :13.32 Median : 1.877 Median :11.60
## Mean :13.84 Mean : 2.391 Mean :12.15
## 3rd Qu.:15.92 3rd Qu.: 2.840 3rd Qu.:14.60
## Max. :60.26 Max. :38.058 Max. :34.10
##
## percent_food_insecure percent_insufficient_sleep percent_uninsured_2
## Min. : 2.90 Min. :23.03 Min. : 2.683
## 1st Qu.:10.60 1st Qu.:30.11 1st Qu.: 8.530
## Median :12.70 Median :33.01 Median :12.460
## Mean :13.24 Mean :33.07 Mean :13.564
## 3rd Qu.:15.20 3rd Qu.:36.13 3rd Qu.:17.383
## Max. :36.30 Max. :46.71 Max. :42.397
##
## median_household_income average_traffic_volume_per_meter_of_major_roadways
## Min. : 25385 Min. : 0.00
## 1st Qu.: 43650 1st Qu.: 26.94
## Median : 50525 Median : 58.08
## Mean : 52737 Mean : 129.85
## 3rd Qu.: 58742 3rd Qu.: 123.43
## Max. :140382 Max. :4496.41
##
## percent_homeowners population_2 percent_less_than_18_years_of_age
## Min. :19.61 Min. : 277 Min. : 7.069
## 1st Qu.:67.55 1st Qu.: 11113 1st Qu.:20.027
## Median :72.59 Median : 26158 Median :22.050
## Mean :71.44 Mean : 105013 Mean :22.028
## 3rd Qu.:77.01 3rd Qu.: 68557 3rd Qu.:23.838
## Max. :92.40 Max. :10105518 Max. :41.992
##
## percent_65_and_over percent_black percent_asian percent_hispanic
## Min. : 4.83 Min. : 0.0000 Min. : 0.0000 Min. : 0.6105
## 1st Qu.:16.30 1st Qu.: 0.7307 1st Qu.: 0.4654 1st Qu.: 2.3926
## Median :18.93 Median : 2.3228 Median : 0.7383 Median : 4.3532
## Mean :19.29 Mean : 9.0803 Mean : 1.5709 Mean : 9.6791
## 3rd Qu.:21.81 3rd Qu.:10.3662 3rd Qu.: 1.4353 3rd Qu.:10.0066
## Max. :57.59 Max. :85.4143 Max. :43.3570 Max. :96.3595
##
## percent_female percent_rural
## Min. :26.84 Min. : 0.00
## 1st Qu.:49.43 1st Qu.: 33.15
## Median :50.32 Median : 59.45
## Mean :49.90 Mean : 58.54
## 3rd Qu.:51.03 3rd Qu.: 87.30
## Max. :56.87 Max. :100.00
##
Display the 10 counties with the highest CFR.
county_data %>%
mutate(cfr = deaths / confirmed) %>%
select(county, state, confirmed, deaths, cfr) %>%
arrange(desc(cfr)) %>%
top_n(10)
## Selecting by cfr
## # A tibble: 10 × 5
## county state confirmed deaths cfr
## <chr> <fct> <dbl> <dbl> <dbl>
## 1 Sabine Texas 957 73 0.0763
## 2 Hancock Georgia 1121 81 0.0723
## 3 McMullen Texas 117 8 0.0684
## 4 Harding New Mexico 44 3 0.0682
## 5 Knox Texas 351 21 0.0598
## 6 Jerauld South Dakota 301 17 0.0565
## 7 Motley Texas 161 9 0.0559
## 8 Candler Georgia 1560 87 0.0558
## 9 Twiggs Georgia 1084 60 0.0554
## 10 Foard Texas 181 10 0.0552
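The "Selecting by cfr" message appears because top_n() falls back to the last column when no ordering variable is given. top_n() has been superseded in dplyr by slice_max(), which names the ordering variable explicitly and returns rows already sorted in descending order. A minimal sketch on a made-up tibble (`toy`, hypothetical values):

```r
library(dplyr)

# Toy counts standing in for county_data (values are hypothetical).
toy <- tibble(
  county = c("A", "B", "C", "D"),
  confirmed = c(100, 2000, 50, 400),
  deaths = c(5, 40, 1, 30)
)

# Two counties with the highest deaths/confirmed ratio.
top2 <- toy %>%
  mutate(cfr = deaths / confirmed) %>%
  slice_max(cfr, n = 2)
print(top2)
```

Note that slice_max() keeps ties by default, so it may return more than n rows when several counties share the same ratio.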
Write the final data to a compressed CSV file for future use.
write_csv(county_data, "covid19-county-data-20211102.csv.gz")
Read and run the above code to generate a data frame county_data that includes county-level COVID-19 confirmed cases and deaths, along with demographic and health-related information.
What assumptions of CFR might be violated by defining CFR as deaths/confirmed from this data set? With acknowledgement of these severe limitations, we continue to use deaths/confirmed as a very rough proxy of CFR.
What assumptions of logistic regression may be violated by this data set?
Run a binomial regression, using variables state, …, percent_rural as predictors.
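As a hedged sketch of the mechanics only: a grouped binomial regression in R takes a two-column response, cbind(successes, failures). The data frame `toy` below is simulated and its coefficients are made up; with the real data the response would be built from deaths and confirmed in county_data and the predictors would be the columns named above.

```r
set.seed(2021)
n <- 200

# Simulated stand-in for county_data (all values are hypothetical).
toy <- data.frame(
  confirmed = rpois(n, 500) + 20,
  state = factor(sample(c("S1", "S2", "S3"), n, replace = TRUE)),
  percent_rural = runif(n, 0, 100)
)
toy$deaths <- rbinom(n, toy$confirmed, plogis(-4 + 0.005 * toy$percent_rural))

# Grouped binomial (logistic) regression: response is (deaths, survivors).
fit <- glm(cbind(deaths, confirmed - deaths) ~ state + percent_rural,
           family = binomial, data = toy)
print(summary(fit)$coefficients)

# exp(coefficient) is the multiplicative change in the odds of death
# per one-unit increase in the predictor, holding the others fixed.
print(exp(coef(fit)["percent_rural"]))
```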
Interpret the regression coefficients of 3 significant predictors with p-value < 0.01.
Apply analysis of deviance to (1) evaluate the goodness of fit of the model and (2) compare the model to the intercept-only model.
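Both parts can be sketched on simulated data (the data frame `toy` and its coefficients are hypothetical). Part (1) compares the residual deviance with a chi-square distribution on the residual degrees of freedom, an approximation that is usually considered reasonable for grouped binomial data with counts that are not too small. Part (2) is a likelihood ratio test against the intercept-only fit.

```r
set.seed(1)
n <- 200

# Simulated stand-in for county_data (all values are hypothetical).
toy <- data.frame(confirmed = rpois(n, 500) + 20, x1 = rnorm(n))
toy$deaths <- rbinom(n, toy$confirmed, plogis(-4 + 0.3 * toy$x1))

fit <- glm(cbind(deaths, confirmed - deaths) ~ x1,
           family = binomial, data = toy)

# (1) Goodness of fit: a small p-value suggests lack of fit.
gof_p <- pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)
print(gof_p)

# (2) Compare with the intercept-only model via the drop in deviance.
fit0 <- update(fit, . ~ 1)
lrt <- anova(fit0, fit, test = "Chisq")
print(lrt)
```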
Perform analysis of deviance to evaluate the significance of each predictor. Display the 10 most significant predictors.
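A possible approach, shown here on simulated data with hypothetical predictors x1, x2 and x3, is drop1(), which refits the model with each term removed in turn and reports a likelihood ratio test for it. Sorting the resulting table by p-value gives a ranking of the predictors.

```r
set.seed(1)
n <- 200

# Simulated stand-in for county_data; only x1 has a true effect here.
toy <- data.frame(
  confirmed = rpois(n, 500) + 20,
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n)
)
toy$deaths <- rbinom(n, toy$confirmed, plogis(-4 + 0.3 * toy$x1))

fit <- glm(cbind(deaths, confirmed - deaths) ~ x1 + x2 + x3,
           family = binomial, data = toy)

# Likelihood ratio test for dropping each term from the full model.
d1 <- drop1(fit, test = "Chisq")
d1 <- d1[-1, ]                          # remove the "<none>" row
ranked <- d1[order(d1$`Pr(>Chi)`), ]    # most significant first
print(head(ranked, 10))
```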
Construct confidence intervals of regression coefficients.
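Two common constructions are Wald intervals (estimate plus or minus a normal quantile times the standard error) and profile-likelihood intervals. The sketch below computes both on simulated data (`toy` is hypothetical); confint.default() gives the Wald version and confint() on a glm fit gives the profile version.

```r
set.seed(1)
n <- 200

# Simulated stand-in for county_data (all values are hypothetical).
toy <- data.frame(confirmed = rpois(n, 500) + 20, x1 = rnorm(n))
toy$deaths <- rbinom(n, toy$confirmed, plogis(-4 + 0.3 * toy$x1))

fit <- glm(cbind(deaths, confirmed - deaths) ~ x1,
           family = binomial, data = toy)

# Wald intervals based on the asymptotic normality of the estimates.
wald <- confint.default(fit, level = 0.95)
print(wald)

# Profile-likelihood intervals (the profiling progress message is silenced).
prof <- suppressMessages(confint(fit, level = 0.95))
print(prof)
```

With large counts the two sets of intervals are typically very close; they can differ more when the data are sparse.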
Plot the deviance residuals against the fitted values. Are there potential outliers?
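A minimal base-graphics sketch of such a plot, on simulated data (`toy` is hypothetical). The cut-off of 3 used to flag large residuals is only an illustrative assumption, not a formal test.

```r
set.seed(1)
n <- 200

# Simulated stand-in for county_data (all values are hypothetical).
toy <- data.frame(confirmed = rpois(n, 500) + 20, x1 = rnorm(n))
toy$deaths <- rbinom(n, toy$confirmed, plogis(-4 + 0.3 * toy$x1))

fit <- glm(cbind(deaths, confirmed - deaths) ~ x1,
           family = binomial, data = toy)

dev_res <- residuals(fit, type = "deviance")
fit_val <- fitted(fit)   # fitted probabilities of death

plot(fit_val, dev_res,
     xlab = "Fitted probability", ylab = "Deviance residual")
abline(h = 0, lty = 2)

# Row indices with unusually large residuals (threshold 3 is an assumption).
flagged <- which(abs(dev_res) > 3)
print(flagged)
```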
Plot the half-normal plot. Are there potential outliers in predictor space?
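Assuming the question refers to leverages (hat values), one way to draw the plot without extra packages is to sort the hat values and plot them against half-normal quantiles. The data frame `toy` is simulated, and labelling the two largest points is an arbitrary choice.

```r
set.seed(1)
n <- 200

# Simulated stand-in for county_data (all values are hypothetical).
toy <- data.frame(confirmed = rpois(n, 500) + 20, x1 = rnorm(n))
toy$deaths <- rbinom(n, toy$confirmed, plogis(-4 + 0.3 * toy$x1))

fit <- glm(cbind(deaths, confirmed - deaths) ~ x1,
           family = binomial, data = toy)

h <- hatvalues(fit)                 # leverages
ord <- order(h)                     # indices from smallest to largest leverage
hn_q <- qnorm((n + seq_len(n)) / (2 * n + 1))   # half-normal quantiles

plot(hn_q, h[ord],
     xlab = "Half-normal quantiles", ylab = "Leverage")

# Label the two observations with the largest leverage.
top <- tail(ord, 2)
text(hn_q[(n - 1):n], h[top], labels = top, pos = 2)
```

Points that sit well above the trend of the rest are candidates for unusual positions in predictor space.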
Find the best sub-model using the AIC criterion.
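A sketch using step(), which performs a greedy stepwise search by AIC. It does not examine every sub-model, so the result is the best model among those visited. The data frame `toy` and predictors x1, x2 and x3 are hypothetical.

```r
set.seed(1)
n <- 200

# Simulated stand-in for county_data; only x1 has a true effect here.
toy <- data.frame(
  confirmed = rpois(n, 500) + 20,
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n)
)
toy$deaths <- rbinom(n, toy$confirmed, plogis(-4 + 0.3 * toy$x1))

fit <- glm(cbind(deaths, confirmed - deaths) ~ x1 + x2 + x3,
           family = binomial, data = toy)

# Stepwise search starting from the full model; trace = 0 silences the log.
best <- step(fit, direction = "both", trace = 0)
print(formula(best))
print(c(full = AIC(fit), selected = AIC(best)))
```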
Find the best sub-model using the lasso with cross validation.
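A sketch assuming the glmnet package is installed (it is not loaded elsewhere in this document). For a binomial family, glmnet accepts a two-column matrix of counts as the response, with the second column treated as the target class. The data frame `toy`, the predictors and the choice of 5 folds are all hypothetical.

```r
library(glmnet)

set.seed(1)
n <- 200

# Simulated stand-in for county_data; only x1 has a true effect here.
toy <- data.frame(
  confirmed = rpois(n, 500) + 20,
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n)
)
toy$deaths <- rbinom(n, toy$confirmed, plogis(-4 + 0.3 * toy$x1))

# Design matrix without the intercept column (glmnet adds its own).
x <- model.matrix(~ x1 + x2 + x3, data = toy)[, -1]
# Two-column count response: (survivors, deaths); deaths is the target class.
y <- cbind(survived = toy$confirmed - toy$deaths, died = toy$deaths)

# Lasso (alpha = 1) with 5-fold cross-validation over the lambda path.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)
lasso_coef <- coef(cvfit, s = "lambda.min")
print(lasso_coef)
```

Using s = "lambda.1se" instead of "lambda.min" would give a sparser model at a small cost in cross-validated deviance.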