Goodreads
This is the very first homework assignment from my CodeClan course. Included here as I love books! Also because the clean up was more tricky than originally intended.
Data cleaning
- Initially reading in the data generated a parsing error due to mislabeled quotes
- This was fixed by including
quote = ""
in theread_csv()
- This was fixed by including
- 12 parsing errors remained in the data
- Investigating with
problems()
revealed four rows with data that has skipped a column because of an extra comma in the authors column - These commas were manually removed
- The
.csv
was resaved asbooks_edit.csv
- Investigating with
- Missing pages were recoded as missing (
NA
) if page number was zero - Average ratings were recoded as
NA
if average_rating = 0 and ratings_count > 0 (you can’t have an average rating of nothing if you have had a rating count) - Ratings count were recoded as
NA
if average_rating > 0 and ratings_count = 0 (you can’t have a rating count of nothing if there is an average rating score)
library(tidyverse)
library(here)
books_clean <- read_csv(here("./content/project/2021-09-10-goodreads/clean_data/books_clean.csv"))
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## book_id = col_double(),
## title = col_character(),
## authors = col_character(),
## average_rating = col_double(),
## isbn = col_character(),
## isbn13 = col_character(),
## language_code = col_character(),
## num_pages = col_double(),
## ratings_count = col_double(),
## text_reviews_count = col_double(),
## publication_date = col_character(),
## publisher = col_character()
## )
Looking at the authors
Questions I want to answer:
- How many different authors are there?
- Who are the top ten most rated authors (based on rating count)?
- Who are the ten authors with the longest books?
#count all the unique authors
books_clean %>%
distinct(authors) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 6643
There are 6643 different authors listed. However, I note some authors may be listed more than once if they are coauthors.
Let’s look at the top ten authors based on total number of reviews.
author_subset <- books_clean %>%
select(authors, title, ratings_count, num_pages)
reviewed_top_ten <- author_subset %>%
slice_max(ratings_count, n = 10)
reviewed_top_ten
## # A tibble: 10 x 4
## authors title ratings_count num_pages
## <chr> <chr> <dbl> <dbl>
## 1 Stephenie Meyer Twilight (Twilight #1) 4597666 501
## 2 J.R.R. Tolkien The Hobbit or There and Back… 2530894 366
## 3 J.D. Salinger The Catcher in the Rye 2457092 277
## 4 Dan Brown Angels & Demons (Robert Langd… 2418736 736
## 5 J.K. Rowling/Mary Gra… Harry Potter and the Prisoner… 2339585 435
## 6 J.K. Rowling/Mary Gra… Harry Potter and the Chamber … 2293963 341
## 7 J.K. Rowling/Mary Gra… Harry Potter and the Order of… 2153167 870
## 8 J.R.R. Tolkien The Fellowship of the Ring (T… 2128944 398
## 9 George Orwell/Boris G… Animal Farm 2111750 122
## 10 J.K. Rowling/Mary Gra… Harry Potter and the Half-Blo… 2095690 652
Looks like Stephenie Meyer with the first Twilight book has received the most ratings on Goodreads! J.K. Rowling also features several times in the top ten most rated authors.
Now I will have a look at the authors with the longest books.
longest_top_ten <- author_subset %>%
slice_max(num_pages, n = 10)
longest_top_ten
## # A tibble: 10 x 4
## authors title ratings_count num_pages
## <chr> <chr> <dbl> <dbl>
## 1 Patrick O'Brian The Complete Aubrey/Matu… 1338 6576
## 2 Winston S. Churchill/John … The Second World War 1493 4736
## 3 Marcel Proust/C.K. Scott M… Remembrance of Things Pa… 6 3400
## 4 J.K. Rowling Harry Potter Collection … 28242 3342
## 5 Thomas Aquinas Summa Theologica 5 Vols 2734 3020
## 6 Dennis L. Kasper/Dan L. Lo… Harrison's Principles of… 23 2751
## 7 J.K. Rowling/Mary GrandPré Harry Potter Boxed Set … 41428 2690
## 8 Terry Goodkind The Sword of Truth Boxe… 4196 2480
## 9 Christina Scull/Wayne G. H… The J.R.R. Tolkien Compa… 45 2264
## 10 Anonymous Study Bible: NIV 4166 2198
Ah, this was a bit of a trick question as the books listed with the highest page numbers are mostly box sets! Interestingly, two J.K. Rowling box sets feature in this list.
Looking at the languages
I would like to have a look at the different languages of books in this dataset.
Questions I want to answer:
- How many languages are there?
- How many text reviews do books written in English have?
- How many text reviews do books written in non-English have?
- Is this count similar for the overall ratings received for English and non-English books?
- Who are the top ten publishers of English books?
- Who are the top ten publishers of non-English books?
#counting how many languages there are and arranging them alphabetically
books_clean %>%
distinct(language_code) %>%
arrange(language_code)
## # A tibble: 27 x 1
## language_code
## <chr>
## 1 ale
## 2 ara
## 3 en-CA
## 4 en-GB
## 5 en-US
## 6 eng
## 7 enm
## 8 fre
## 9 ger
## 10 gla
## # … with 17 more rows
There are 27 different languages in the books dataset. It looks like English is coded four times: eng (which is not a localisation language code), en-US, en-GB and en-CA. (Additional note: enm is middle english, so I will not include this as an English book).
Let’s find out what the total text_reviews_count and ratings_count is in all four English groups compared to all other languages.
#subset the dataset to answer the questions on English books
#add a column, english, set to TRUE if the language is English
language_subset <- books_clean %>%
select(language_code, publisher, ratings_count, text_reviews_count) %>%
mutate(english = case_when(
language_code %in% c("eng", "en-GB", "en-CA", "en-US") ~ TRUE,
TRUE ~ FALSE)
)
#generate a summary table counting the total number of text reviews and ratings counts for English and non-English books
language_subset %>%
group_by(english) %>%
summarise(
sum(text_reviews_count),
sum(ratings_count, na.rm = TRUE))
## # A tibble: 2 x 3
## english `sum(text_reviews_count)` `sum(ratings_count, na.rm = TRUE)`
## <lgl> <dbl> <dbl>
## 1 FALSE 31823 1560813
## 2 TRUE 5997392 198017611
Unsurprisingly perhaps, there are far more text reviews for books written in English (eng, en-GB, en-US and en-CA) than in all other languages combined (5997392 compared to 31823). If we put it as a percentage, 99.47% of the text reviews in this Goodreads dataset are written for English books, and only 0.53% of the text reviews are in another language.
This is also seen when looking at the total for all ratings count (198017611 ratings for English books compared to 1560813, or 78.2% of ratings compared to 21.8%).
Let’s look at the publishers with the most English titles and the publishers with the most non-English titles.
publisher_subset <- language_subset %>%
group_by(publisher) %>%
mutate(non_english = !english) %>%
summarise(
tot_eng = sum(english),
tot_non_eng = sum(non_english)
)
publisher_subset
## # A tibble: 2,293 x 3
## publisher tot_eng tot_non_eng
## <chr> <int> <int>
## 1 "\"Tarcher\"" 1 0
## 2 "10/18" 0 2
## 3 "1st Book Library" 1 0
## 4 "1st World Library" 1 0
## 5 "A & C Black (Childrens books)" 1 0
## 6 "A Harvest Book/Harcourt Inc." 1 0
## 7 "A K PETERS" 1 0
## 8 "AA World Services" 1 0
## 9 "Abacus" 6 0
## 10 "Abacus Books" 1 0
## # … with 2,283 more rows
#select publishers with the most books in English
top_eng_publishers <- publisher_subset %>%
slice_max(tot_eng, n = 10)
top_eng_publishers
## # A tibble: 10 x 3
## publisher tot_eng tot_non_eng
## <chr> <int> <int>
## 1 Vintage 317 1
## 2 Penguin Books 261 0
## 3 Penguin Classics 183 1
## 4 Mariner Books 149 1
## 5 Ballantine Books 143 1
## 6 HarperCollins 112 0
## 7 Pocket Books 111 0
## 8 Bantam 110 0
## 9 Harper Perennial 110 2
## 10 VIZ Media LLC 88 0
#Select publishers with the most books in non-English
top_non_eng_publishers <- publisher_subset %>%
slice_max(tot_non_eng, n = 10)
top_non_eng_publishers
## # A tibble: 11 x 3
## publisher tot_eng tot_non_eng
## <chr> <int> <int>
## 1 Debolsillo 1 17
## 2 Gallimard 1 15
## 3 小学館 0 15
## 4 Pocket 3 14
## 5 Planeta Publishing 0 13
## 6 Plaza y Janes 1 13
## 7 集英社 0 12
## 8 J'ai Lu 0 10
## 9 Ediciones B 0 9
## 10 Glénat 0 9
## 11 Punto de Lectura 0 9
The top ten publishers with the most English titles are very different from the top ten publishers with the most non-English titles! Interestingly in both groups there are some book titles in other languages. The top ten publishers with non-English titles contains 11 publishers. This is because there are several publishers with the same total count of non-English books.