What I read in 2021

R
data visualization

My year in books, by the numbers.


Author

Mara Averick

Published

2021-12-31

Modified

2024-11-20


My relationship with reading borders on pathological (and by “borders on” I mean “has literally been a topic of discussion in therapy”). I mean, I’ve gotten it under control somewhat—we’ll use my 2014 Goodreads Reading Challenge as a bar for a bit out of control—which means I can take a look back on my 2021 year in books without too much self-recrimination.

The data

For all its faults (and there are many), Goodreads is where I’ve gotten in the habit of logging what I’m reading over the past seven(ish) years. If nothing else, it has a nice enough export function, which lets you (or me, or whomever) retrieve your reading data as a CSV.

I stashed my exported data in Google Sheets. So, I’ll use {googlesheets4} to read it into R with its sheet ID, and make our lives easier by passing it straight through janitor::clean_names().

library(tidyverse)
library(googlesheets4)
library(skimr)
gr_data <- read_sheet("1PqnJ2UOaYnfIRCVSvlYlyfeQ4OlynMiPq0eCIfDbjLU") |>
  janitor::clean_names()

Since I just want to see the books I read in 2021, I’m going to filter these by bookshelves—keeping only those books on my 2021-reads shelf. (The count of books on this shelf matches up with the count of books read this year according to my Goodreads 2021 Reading Challenge, 188, and I’m too lazy to track down the five books that go missing when I filter by date.) I’m also going to convert book_id to be a string, since it seems to come through as numeric by default.

read_2021 <- gr_data |> 
  filter(str_detect(bookshelves, "2021-reads")) |> 
  mutate(book_id = as.character(book_id))
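
To make the shelf filter concrete, here is a minimal sketch on an invented three-row stand-in for the export. The names toy_export and toy_2021 and all of the rows are made up for illustration; only the bookshelves and book_id columns mirror the real data. One thing worth knowing: str_detect() matches substrings, so a hypothetical shelf such as "2021-reads-again" would also pass this filter.

```r
library(dplyr)
library(stringr)

# Invented stand-in for the Goodreads export (not real data)
toy_export <- tibble::tibble(
  book_id = c(101, 102, 103),
  bookshelves = c("2021-reads, fiction", "2020-reads, nonfiction", "crime, 2021-reads")
)

# Keep rows whose shelf string contains "2021-reads", then make the ID a string
toy_2021 <- toy_export |>
  filter(str_detect(bookshelves, "2021-reads")) |>
  mutate(book_id = as.character(book_id))

toy_2021
```

Only the first and third rows survive, and book_id comes back as character rather than numeric.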

Summary stats

Let’s peep a quick summary of the data using skimr::skim().

skimr::skim(read_2021)
Data summary
Name read_2021
Number of rows 188
Number of columns 31
_______________________
Column type frequency:
character 15
list 1
logical 6
numeric 7
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
book_id 0 1.00 4 8 0 188 0
title 0 1.00 4 145 0 188 0
author 0 1.00 9 22 0 134 0
author_l_f 0 1.00 10 23 0 134 0
additional_authors 145 0.23 9 46 0 36 0
isbn 6 0.97 10 10 0 182 0
isbn13 25 0.87 13 13 0 163 0
binding 0 1.00 5 21 0 9 0
bookshelves 0 1.00 10 93 0 97 0
bookshelves_with_positions 0 1.00 16 138 0 188 0
exclusive_shelf 0 1.00 4 4 0 1 0
my_review 183 0.03 116 1156 0 5 0
recommended_for 188 0.00 NA NA 0 0 0
recommended_by 188 0.00 NA NA 0 0 0
condition 188 0.00 NA NA 0 0 0

Variable type: list

skim_variable n_missing complete_rate n_unique min_length max_length
publisher 0 1 108 0 1

Variable type: logical

skim_variable n_missing complete_rate mean count
spoiler 188 0 NaN :
private_notes 188 0 NaN :
original_purchase_date 188 0 NaN :
original_purchase_location 188 0 NaN :
condition_description 188 0 NaN :
bcid 188 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
my_rating 0 1.00 3.22 0.98 1 3.00 3.00 4.00 5 ▁▅▇▆▂
average_rating 0 1.00 4.02 0.37 2 3.81 4.06 4.25 5 ▁▁▃▇▂
number_of_pages 7 0.96 328.12 124.81 51 256.00 320.00 387.00 1152 ▃▇▁▁▁
year_published 0 1.00 2014.94 7.56 1964 2011.00 2018.00 2020.00 2022 ▁▁▁▂▇
original_publication_year 19 0.90 2008.56 16.15 1963 2002.00 2017.00 2020.00 2022 ▁▁▁▂▇
read_count 0 1.00 1.00 0.00 1 1.00 1.00 1.00 1 ▁▁▇▁▁
owned_copies 0 1.00 0.00 0.00 0 0.00 0.00 0.00 0 ▁▁▇▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date_read 5 0.97 2021-01-03 2021-12-30 2021-08-23 157
date_added 0 1.00 2012-03-21 2021-12-29 2021-07-11 147

Points of interest

For one thing, I don’t make use of the bulk of the 31 variables stored in the Goodreads export data. For another, every book here is on my read shelf: the exclusive_shelf variable can be “currently reading”, “read”, or “want to read”, but I only have one distinct value for it, and its min and max length are both four, the same number of letters as in the word “read”.
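
The same check can be done directly instead of being read off the skim() output. This is a small sketch on an invented column (toy_shelves is a made-up stand-in, not the real export) that computes the number of distinct values and the shortest and longest string lengths.

```r
library(dplyr)

# Invented stand-in: every book sits on the "read" shelf
toy_shelves <- tibble::tibble(exclusive_shelf = c("read", "read", "read"))

# One distinct value, and min and max character counts that both equal nchar("read")
shelf_check <- toy_shelves |>
  summarise(
    n_unique  = n_distinct(exclusive_shelf),
    min_chars = min(nchar(exclusive_shelf)),
    max_chars = max(nchar(exclusive_shelf))
  )

shelf_check
```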

My book ratings (my_rating) are somewhat normally distributed, though not quite, which is about as close as you can get when you can only give a book 1, 2, 3, 4, or 5 stars. I’m admittedly withholding with stars for this very reason: if I just start handing out five-star reviews willy-nilly, then my ratings become devoid of value.

The length of the books I read this year varied wildly! Though the mean number_of_pages (328) seems pretty typical, a standard deviation of 125 is a big swing. With a sample size of 181 (there are seven entries missing values), you wouldn’t think that a single book would make a huge difference. That said, James Clavell’s Shōgun clocked in at 1,152 pages, which is pretty darn hefty.
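
How much difference does that one book make? A rough back-of-the-envelope sketch, using only the rounded summary statistics from the skim() output above (so the results are approximate), is to remove the 1,152-page maximum from the mean and standard deviation algebraically. The variable names are invented for illustration.

```r
# Summary statistics for number_of_pages, copied from the skim() output
n_books    <- 181     # 188 books minus the 7 with missing page counts
mean_pages <- 328.12
sd_pages   <- 124.81
longest    <- 1152    # the p100 value

# Mean of the remaining 180 books once the longest one is dropped
mean_without <- (n_books * mean_pages - longest) / (n_books - 1)

# Sum of squared deviations, then the standard update for removing one point
ss_all     <- (n_books - 1) * sd_pages^2
ss_without <- ss_all - n_books / (n_books - 1) * (longest - mean_pages)^2
sd_without <- sqrt(ss_without / (n_books - 2))

round(c(mean = mean_without, sd = sd_without), 1)
```

Under these assumptions the mean only moves from about 328 to about 324 pages, but the standard deviation falls from about 125 to about 109, so a single very long book accounts for a noticeable share of the spread.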

There are, indeed, five entries that have nothing in them for date_read (the n_missing for date_read is five). I think this is because I entered these a few days after reading them: giving the dates for “Started Reading” and “Finished Reading” in the Goodreads interface is not the same as clicking “I’m finished” in the “Update Progress” interface when it comes to producing a date_read.

Example of “My Activity” for a book without a date_read entry. The start and finish dates are given (2021-02-22, and 2021-02-23, respectively), but the exported data does not include a date_read.

Example of Goodreads interface for updating your progress on reading a book. The “I’m finished” button seems to beget the date_read in the exported data.

If I decide to do any sort of temporal chart of my reading over the year, I’ll have to go in and manually fix the missing date_read entries, since the “started reading” and “finished reading” dates are not part of the data export.
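
One way to do that kind of manual fix without editing the export itself is to keep the hand-entered dates in a small lookup table and fill the gaps with coalesce(). This is only a sketch under assumptions: the toy_read, manual_dates, and toy_read_rev objects and the book_id values are invented, and the one filled-in date reuses the 2021-02-23 finish date from the screenshot above. It also stores date_read as a Date rather than POSIXct, which is what scale_x_date() expects.

```r
library(dplyr)

# Invented stand-in for the filtered data, with one missing date_read
toy_read <- tibble::tibble(
  book_id   = c("1", "2", "3"),
  date_read = as.Date(c("2021-01-03", NA, "2021-03-10"))
)

# Hand-entered finish dates, looked up in the Goodreads interface
manual_dates <- tibble::tibble(
  book_id          = "2",
  date_read_manual = as.Date("2021-02-23")
)

# Fill only the missing dates; existing date_read values are left alone
toy_read_rev <- toy_read |>
  left_join(manual_dates, by = "book_id") |>
  mutate(date_read = coalesce(date_read, date_read_manual)) |>
  select(-date_read_manual)

toy_read_rev
```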

Books over time

So, having fixed those entries with missing date_read manually, let’s take a peek at what my reading looked like over the course of the year.

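# read_2021_rev: read_2021 with the missing date_read values filled in by hand
read_2021_rev |> 
  arrange(date_read) |>  # cumsum() follows row order, so sort by date first
  mutate(total_books = cumsum(read_count)) |> 
  ggplot(aes(x = date_read, y = total_books)) +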
  geom_line() +
  scale_x_date(NULL,
               breaks = scales::date_breaks(width = "1 month"),
               labels = scales::label_date_short()) +
  labs(
    title = "Sum of books Mara read over the course of 2021",
    alt = "With x-axis range from January 2021 to January 2022, shows relatively steady increase in cumulative sum of books read over time (from zero to ~200, where y-max = 188).",
    x = "Time",
    y = "Total books read"
  ) +
  hrbrthemes::theme_ipsum_rc()
Figure 1: Sum of books read over the course of 2021.

Not particularly riveting. It’s a pretty steady climb, and I think the slope increases mainly where I went on series benders (e.g. the Parker books at the end of the summer), and also after the Batpig died in November (I read when I’m sad).
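
If the changes in slope are hard to see on a cumulative line, counting books per month is one simple way to surface them. A minimal sketch on invented dates (toy_dates and books_per_month are made-up names, and the dates are not from the real data), using lubridate::floor_date() to bucket each date_read into its month:

```r
library(dplyr)
library(lubridate)

# Invented finish dates: a busy August, a quiet September, a couple in November
toy_dates <- tibble::tibble(
  date_read = as.Date(c("2021-08-02", "2021-08-15", "2021-08-29",
                        "2021-09-05", "2021-11-20", "2021-11-25"))
)

# Round each date down to the first of its month, then count rows per month
books_per_month <- toy_dates |>
  count(month = floor_date(date_read, unit = "month"), name = "books")

books_per_month
```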

What about for pages?

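read_2021_rev |> 
  arrange(date_read) |>  # sort by date so the running total is chronological
  mutate(number_of_pages = replace_na(number_of_pages, 0)) |>  # missing page counts add nothing
  mutate(total_pages = cumsum(number_of_pages)) |> 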
  ggplot(aes(x = date_read, y = total_pages)) +
  geom_line() +
  scale_x_date(NULL,
               breaks = scales::date_breaks(width = "1 month"),
               labels = scales::label_date_short()) +
  scale_y_continuous(labels = scales::label_comma()) +
  labs(
    title = "Sum of pages Mara read over the course of 2021",
    alt = "With x-axis range from January 2021 to January 2022, shows relatively steady increase in cumulative sum of pages read over time (from zero to ~60000, where y-max = 59390).",
    x = "Time",
    y = "Total pages read"
  ) +
  hrbrthemes::theme_ipsum_rc()
Figure 2: Sum of pages read over the course of 2021.

Still looks like a steady climb. Translation: Nothing much to see here.

The books

Keeping in mind that just because I read a book doesn’t mean I recommend it (I have a thing about finishing books—that “thing” being that I have to do it), here’s a little widget of what I read in the year 2021 Anno Domini.

Citation

BibTeX citation:
@online{averick2021,
  author = {Averick, Mara},
  title = {What {I} Read in 2021},
  date = {2021-12-31},
  url = {https://dataand.me/blog/2021-12_what-i-read-in-2021/},
  doi = {10.59350/dzdwa-es082},
  langid = {en-US}
}
For attribution, please cite this work as:
Averick, Mara. 2021. “What I Read in 2021.” December 31, 2021. https://doi.org/10.59350/dzdwa-es082.