library(tidyverse)
library(googlesheets4)
library(skimr)
My year in books, by the numbers.
My relationship with reading borders on pathological (and by “borders on” I mean “has literally been a topic of discussion in therapy”). I mean, I’ve gotten it under control somewhat—we’ll use my 2014 Goodreads Reading Challenge as a bar for a bit out of control—which means I can take a look back on my 2021 year in books without too much self-recrimination.
For all its faults (and there are many), Goodreads is where I’ve gotten in the habit of logging what I’m reading over the past seven(ish) years. If nothing else, it has a nice enough export function, which lets you (or me, or whomever) retrieve your reading data as a CSV.
I stashed my exported data in Google Sheets. So, I’ll use {googlesheets4} to read it into R with its sheet ID, and make our lives easier by passing it straight through janitor::clean_names().
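Here is a minimal sketch of that read-and-clean step, not the exact code behind this post. The googlesheets4::read_sheet() call is commented out because it needs authentication and a real sheet ID (the one shown is a placeholder), so a toy tibble with Goodreads-style headers and made-up values stands in for the sheet.

```r
library(tibble)
library(janitor)

# With a real sheet, the read step would look something like this
# ("<your-sheet-id>" is a placeholder, not the ID used for this post):
# raw <- googlesheets4::read_sheet("<your-sheet-id>")

# Toy stand-in with Goodreads-style column headers (values are made up)
raw <- tibble(
  `Book Id` = c(101, 102),
  `Author l-f` = c("Lastname, Firstname", "Surname, Given"),
  `My Rating` = c(4, 3),
  `Exclusive Shelf` = c("read", "read")
)

# clean_names() converts the headers to snake_case
books <- clean_names(raw)
names(books)
```

The payoff is column names you can type without backticks, which is why the variables in the summary further down look like book_id and author_l_f.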
Since I just want to see the books I read in 2021, I’m going to filter these by bookshelves, keeping only those books on my 2021-reads shelf. (The count of books on this shelf matches up with the count of books read this year according to my Goodreads 2021 Reading Challenge, 188, and I’m too lazy to track down the five books that go missing when I filter by date.) I’m also going to convert book_id to be a string, since it seems to come through as numeric by default.
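A sketch of what that filter and conversion might look like, assuming bookshelves holds a comma-separated string of shelf names. The rows below are invented for illustration, and the object name read_2021 is borrowed from the summary output that follows.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-in for the cleaned export (made-up rows)
books <- tibble(
  book_id = c(101, 102, 103),
  title = c("Book A", "Book B", "Book C"),
  bookshelves = c("2021-reads, fiction", "2020-reads", "nonfiction, 2021-reads")
)

read_2021 <- books |>
  # keep rows whose shelf string mentions the 2021-reads shelf
  filter(str_detect(bookshelves, fixed("2021-reads"))) |>
  # IDs are labels, not quantities, so store them as strings
  mutate(book_id = as.character(book_id))

read_2021
```

Matching on the shelf name rather than on date_read is what keeps the books with a missing read date in the data.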
Let’s peep a quick summary of the data using skimr::skim().
Name | read_2021 |
Number of rows | 188 |
Number of columns | 31 |
_______________________ | |
Column type frequency: | |
character | 15 |
list | 1 |
logical | 6 |
numeric | 7 |
POSIXct | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
book_id | 0 | 1.00 | 4 | 8 | 0 | 188 | 0 |
title | 0 | 1.00 | 4 | 145 | 0 | 188 | 0 |
author | 0 | 1.00 | 9 | 22 | 0 | 134 | 0 |
author_l_f | 0 | 1.00 | 10 | 23 | 0 | 134 | 0 |
additional_authors | 145 | 0.23 | 9 | 46 | 0 | 36 | 0 |
isbn | 6 | 0.97 | 10 | 10 | 0 | 182 | 0 |
isbn13 | 25 | 0.87 | 13 | 13 | 0 | 163 | 0 |
binding | 0 | 1.00 | 5 | 21 | 0 | 9 | 0 |
bookshelves | 0 | 1.00 | 10 | 93 | 0 | 97 | 0 |
bookshelves_with_positions | 0 | 1.00 | 16 | 138 | 0 | 188 | 0 |
exclusive_shelf | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
my_review | 183 | 0.03 | 116 | 1156 | 0 | 5 | 0 |
recommended_for | 188 | 0.00 | NA | NA | 0 | 0 | 0 |
recommended_by | 188 | 0.00 | NA | NA | 0 | 0 | 0 |
condition | 188 | 0.00 | NA | NA | 0 | 0 | 0 |
Variable type: list
skim_variable | n_missing | complete_rate | n_unique | min_length | max_length |
---|---|---|---|---|---|
publisher | 0 | 1 | 108 | 0 | 1 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
spoiler | 188 | 0 | NaN | : |
private_notes | 188 | 0 | NaN | : |
original_purchase_date | 188 | 0 | NaN | : |
original_purchase_location | 188 | 0 | NaN | : |
condition_description | 188 | 0 | NaN | : |
bcid | 188 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
my_rating | 0 | 1.00 | 3.22 | 0.98 | 1 | 3.00 | 3.00 | 4.00 | 5 | ▁▅▇▆▂ |
average_rating | 0 | 1.00 | 4.02 | 0.37 | 2 | 3.81 | 4.06 | 4.25 | 5 | ▁▁▃▇▂ |
number_of_pages | 7 | 0.96 | 328.12 | 124.81 | 51 | 256.00 | 320.00 | 387.00 | 1152 | ▃▇▁▁▁ |
year_published | 0 | 1.00 | 2014.94 | 7.56 | 1964 | 2011.00 | 2018.00 | 2020.00 | 2022 | ▁▁▁▂▇ |
original_publication_year | 19 | 0.90 | 2008.56 | 16.15 | 1963 | 2002.00 | 2017.00 | 2020.00 | 2022 | ▁▁▁▂▇ |
read_count | 0 | 1.00 | 1.00 | 0.00 | 1 | 1.00 | 1.00 | 1.00 | 1 | ▁▁▇▁▁ |
owned_copies | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 | 0.00 | 0 | ▁▁▇▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date_read | 5 | 0.97 | 2021-01-03 | 2021-12-30 | 2021-08-23 | 157 |
date_added | 0 | 1.00 | 2012-03-21 | 2021-12-29 | 2021-07-11 | 147 |
For one thing, I don’t make use of the bulk of the 31 variables stored in the Goodreads export data. Every book here is on my read shelf (the exclusive_shelf variable can be: “currently reading”, “read”, or “want to read”; I only have one distinct value for that variable, and its min and max length are both four, the same number of letters as in the word “read”).
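If you would rather check that directly than infer it from string lengths, something like the following would do it. This is a sketch on a toy column, not the real data.

```r
library(dplyr)
library(tibble)

# Toy stand-in: every book sits on the same exclusive shelf
read_2021 <- tibble(exclusive_shelf = c("read", "read", "read"))

# The same three facts skim() reports: distinct values, min and max length
shelf_check <- read_2021 |>
  summarise(
    n_unique = n_distinct(exclusive_shelf),
    min_len = min(nchar(exclusive_shelf)),
    max_len = max(nchar(exclusive_shelf))
  )
shelf_check

# Or just tabulate the values
count(read_2021, exclusive_shelf)
```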
Though they’re not quite there, my book ratings (my_rating) are somewhat normally distributed, given that you can only give a book 1, 2, 3, 4, or 5 stars. I’m admittedly withholding with stars for this very reason: if I just start handing out five-star reviews willy-nilly, then my ratings become devoid of value.
The length of the books I read this year varied wildly! Though the mean number_of_pages (328) seems pretty typical, a standard deviation of 125 is a big swing. With a sample size of 181 (there are seven entries missing values), you wouldn’t think that a single book would make a huge difference. That said, James Clavell’s Shōgun clocked in at 1,152 pages, which is pretty darn hefty.
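To get a feel for how much one doorstop can move the spread, here is a sketch on a handful of made-up page counts (plus one missing value and one 1,152-page outlier). The helper summarise_pages() is hypothetical and only exists for this illustration.

```r
library(dplyr)
library(tibble)

# Made-up page counts: one missing value, one very long book
pages <- tibble(number_of_pages = c(250, 300, 320, 340, 390, 1152, NA))

# Hypothetical helper: count, mean, and sd, ignoring missing values
summarise_pages <- function(df) {
  df |>
    summarise(
      n = sum(!is.na(number_of_pages)),
      mean_pages = mean(number_of_pages, na.rm = TRUE),
      sd_pages = sd(number_of_pages, na.rm = TRUE)
    )
}

with_outlier <- summarise_pages(pages)

# Drop the longest book (the NA row also falls out of the filter)
without_outlier <- pages |>
  filter(number_of_pages < max(number_of_pages, na.rm = TRUE)) |>
  summarise_pages()

bind_rows(with_outlier = with_outlier, without_outlier = without_outlier, .id = "set")
```

In a sample this tiny the effect is exaggerated; with 181 books the same outlier would shift things far less, but the comparison is the same.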
There are, indeed, five entries that have nothing in them for date_read. (The n_missing for skim_variable date_read is five.) I think this is because I entered these a few days after reading them, and giving the dates for “Started Reading” and “Finished Reading” in the Goodreads interface is not the same as clicking “I’m finished” in the “Update Progress” interface in terms of giving you a date_read.
If I decide to do any sort of temporal chart of my reading over the year, I’ll have to go in and manually fix the missing date_read entries, since the “started reading” and “finished reading” dates are not part of the data export.
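One way to patch dates by hand in R, sketched on toy data: keep a small table of fixes keyed by book_id and apply it with dplyr::rows_update(). The IDs and dates are invented, date_fixes is a hypothetical name, and I am assuming date_read has been converted to a Date (the summary above shows it as POSIXct, but a date axis in ggplot2 expects a Date).

```r
library(dplyr)
library(tibble)

# Toy stand-in: one book is missing its read date
read_2021 <- tibble(
  book_id = c("101", "102", "103"),
  date_read = as.Date(c("2021-01-03", NA, "2021-03-15"))
)

# Hypothetical hand-entered fixes, one row per book to patch
date_fixes <- tibble(
  book_id = "102",
  date_read = as.Date("2021-02-10")
)

read_2021_rev <- read_2021 |>
  # overwrite date_read for the matching book_id values only
  rows_update(date_fixes, by = "book_id") |>
  # sort chronologically so running totals accumulate in reading order
  arrange(date_read)

read_2021_rev
```

Keeping the fixes in their own table leaves the export untouched and makes it obvious which dates were typed in by hand.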
So, having fixed those entries with missing date_read manually, let’s take a peek at what my reading looked like over the course of the year.
read_2021_rev |>
  # cumsum() follows row order, so sort by date before accumulating
  arrange(date_read) |>
  ggplot(aes(x = date_read, y = cumsum(read_count))) +
  geom_line() +
  scale_x_date(
    NULL,
    breaks = scales::date_breaks(width = "1 month"),
    labels = scales::label_date_short()
  ) +
  labs(
    title = "Sum of books Mara read over the course of 2021",
    alt = "With x-axis range from January 2021 to January 2022, shows relatively steady increase in cumulative sum of books read over time (from zero to ~200, where y-max = 188).",
    x = "Time",
    y = "Total books read"
  ) +
  hrbrthemes::theme_ipsum_rc()
Not particularly riveting. It’s a pretty steady climb, and I think the slope increases mainly where I went on series benders (e.g. the Parker books at the end of the summer), and also after the Batpig died in November (I read when I’m sad).
What about for pages?
read_2021_rev |>
  # sort by date so the running total accumulates in reading order
  arrange(date_read) |>
  mutate(
    # books with no page count contribute zero pages
    number_of_pages = replace_na(number_of_pages, 0),
    total_pages = cumsum(number_of_pages)
  ) |>
  ggplot(aes(x = date_read, y = total_pages)) +
  geom_line() +
  scale_x_date(
    NULL,
    breaks = scales::date_breaks(width = "1 month"),
    labels = scales::label_date_short()
  ) +
  scale_y_continuous(labels = scales::label_comma()) +
  labs(
    title = "Sum of pages Mara read over the course of 2021",
    alt = "With x-axis range from January 2021 to January 2022, shows relatively steady increase in cumulative sum of pages read over time (from zero to ~60000, where y-max = 59390).",
    x = "Time",
    y = "Total pages read"
  ) +
  hrbrthemes::theme_ipsum_rc()
Still looks like a steady climb. Translation: Nothing much to see here.
Keeping in mind that just because I read a book doesn’t mean I recommend it (I have a thing about finishing books—that “thing” being that I have to do it), here’s a little widget of what I read in the year 2021 Anno Domini.
@online{averick2021,
author = {Averick, Mara},
title = {What {I} Read in 2021},
date = {2021-12-31},
url = {https://dataand.me/blog/2021-12_what-i-read-in-2021/},
doi = {10.59350/dzdwa-es082},
langid = {en-US}
}