Who are the Swedish radio P1 summer guests? Answer via Wikidata

This week, a very promising new R blog was launched: the blog of Eric Persson, a.k.a. expersso on Twitter. I had really been looking forward to this because expersso’s code screenshots have always been quite cool, so seeing him no longer limited to screenshots is awesome! His first article series is about a game; you should really check it out. (PSA: if you post screenshots of R code on Twitter, have a look at Sean Kross’ codefinch package!)

Because I’m a nosy person, I asked Eric whether he was Swedish, his last name being quite Swedish-looking in my opinion. He is, which made me wonder about Swedish blog topics, and I actually decided to use one Swedish topic I came up with: the summer guests of the Swedish radio P1! Every summer since 1959, P1 has selected a bunch of famous or interesting people and had them record a program of a bit more than one hour in which they’re free to discuss what they want (important events of their life, for instance) and to choose the musical breaks (which you don’t get to listen to entirely in the online version because of copyright stuff). The program is then broadcast in the summer, one guest a day from the end of June to the beginning of August. There’s even a winter version now, but I’ll ignore it because it’s too hot here in Barcelona to even think about winter.

It’s a very cool radio program in my opinion. I discovered it at the end of my 5-month research stay in Gothenburg in 2010 and decided it’d be one way to keep my Swedish skills up to date (my other methods include listening to ABBA in Swedish and reading Camilla Läckberg’s novels). I haven’t listened to that many guests but I really enjoy it when I do, and I like how diverse the list of guests is. In this post, I’ll actually try to have a look at the occupations of the guests via Wikidata!

How to get the data?

In order to answer my question I needed a list of all summer guests and their occupations. I first thought I’d resort to webscraping the P1 website, which would have been a good way to provide expersso with an opportunity to give me some constructive feedback. However, after websurfing a bit to assess my options, I realized I could use Wikipedia data obtained through APIs rather than webscraping, as in my posts about famous dead people. Here was my strategy.

Getting the list of all summer guests from 1959

I downloaded the table of all summer guests from 1959 from this page, which gave me their names and the dates of their episode(s) – yeah some people are invited back.

# all summer guests
sommargaester <- readr::read_csv("data/p1sommar.csv", col_names = FALSE, 
                                 locale = readr::locale(encoding = "latin1"))
knitr::kable(sommargaester[1:10,])
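| X1 | X2 | X3 | X4 | X5 | X6 | X7 |
|---|---|---|---|---|---|---|
| Aaro,Lars-Eric | 110714 | NA | NA | NA | NA | NA |
| Abdelhadi,Magdi | 070809 | 120716 | NA | NA | NA | NA |
| Abdulle,Sherihan ‘Cherrie’ | 170706 | NA | NA | NA | NA | NA |
| Abele,Anton | 130725 | NA | NA | NA | NA | NA |
| Abenius,Folke | 800627 | NA | NA | NA | NA | NA |
| Abrahamson,Kjell-Albin | 880809 | NA | NA | NA | NA | NA |
| Ackebo,Lena | 970708 | NA | NA | NA | NA | NA |
| Adami,Zanyar | 060707 | NA | NA | NA | NA | NA |
| Adamo,Amelia | 860806 | 870718 | 080701 | NA | NA | NA |
| Adams,Maud | 920703 | 070805 | NA | NA | NA | NA |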
# get their names
sommargaester_names <- unique(sommargaester$X1)

# for putting names in the right order for later queries
transform_name <- function(name){
  paste(stringr::str_split(name, ",",
                           simplify = TRUE)[2],
        stringr::str_split(name, ",",
                           simplify = TRUE)[1])
}

pretty_sommargaester_names <- purrr::map_chr(sommargaester_names, transform_name)
sommargaester_names <- tibble::tibble(name = pretty_sommargaester_names,
                                      X1 = sommargaester_names)
sommargaester <- dplyr::left_join(sommargaester, sommargaester_names,
                                  by = "X1")

sommargaester <- dplyr::select(sommargaester, - X1)
# transform the date
sommargaester <- tidyr::gather(sommargaester, "rep", "date", X2:X7)
sommargaester <- dplyr::group_by(sommargaester, name)
sommargaester <- dplyr::mutate(sommargaester, rep = 1:n())
sommargaester <- dplyr::ungroup(sommargaester)

sommargaester <- dplyr::filter(sommargaester, !is.na(date))
# remove winter
sommargaester <- dplyr::filter(sommargaester, !stringr::str_detect(date, "V"))
# remove repeat episodes
sommargaester <- dplyr::filter(sommargaester, !stringr::str_detect(date, "R"))


# transform the date to a format with a non ambiguous year
sommargaester <- dplyr::mutate(sommargaester, date = as.numeric(date))
sommargaester <- dplyr::mutate(sommargaester, date = ifelse(date > 180000, paste0("19", date),
                                                            ifelse(date < 100000,
                                                                   paste0("200", date),
                                                                   paste0("20", date))))
sommargaester <- dplyr::mutate(sommargaester, pretty_date = lubridate::ymd(date))

knitr::kable(sommargaester[1:10,])
| name | rep | date | pretty_date |
|---|---|---|---|
| Lars-Eric Aaro | 1 | 20110714 | 2011-07-14 |
| Magdi Abdelhadi | 1 | 20070809 | 2007-08-09 |
| Sherihan ‘Cherrie’ Abdulle | 1 | 20170706 | 2017-07-06 |
| Anton Abele | 1 | 20130725 | 2013-07-25 |
| Folke Abenius | 1 | 19800627 | 1980-06-27 |
| Kjell-Albin Abrahamson | 1 | 19880809 | 1988-08-09 |
| Lena Ackebo | 1 | 19970708 | 1997-07-08 |
| Zanyar Adami | 1 | 20060707 | 2006-07-07 |
| Amelia Adamo | 1 | 19860806 | 1986-08-06 |
| Maud Adams | 1 | 19920703 | 1992-07-03 |
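The year disambiguation above can also be written as a small standalone helper. This is only a sketch: `expand_date` is a hypothetical name, and the rule it uses (a two-digit year of 59 or more means 19yy, since the program started in 1959, and anything else means 20yy) is my assumption rather than the threshold used in the code above.

```r
# Hypothetical helper: expand a "yymmdd" code into a Date.
# Assumption: the program started in 1959, so yy >= 59 means 19yy, otherwise 20yy.
expand_date <- function(code) {
  # restore the leading zeros that a numeric conversion would drop
  code <- sprintf("%06d", as.integer(code))
  yy <- as.integer(substr(code, 1, 2))
  century <- ifelse(yy >= 59, "19", "20")
  as.Date(paste0(century, code), format = "%Y%m%d")
}

expand_date(c("800627", "070809", "170706"))
# "1980-06-27" "2007-08-09" "2017-07-06"
```

Tying the cut-off to the first year of the program means the rule keeps working until 2058, which should be enough for a blog post.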

Getting Wikidata about the summer guests

Then I read the list of Wikipedia R packages put together by Mikhail Popov, data scientist at the Wikimedia Foundation, with whom I exchanged a few messages after my Wikipedia deaths posts. These packages were created either by him or by Oliver Keyes and really give you access to plenty of structured data, so I was happy to at last give the list a try!

I used WikidataQueryServiceR to access Wikidata via the query language SPARQL. Luckily I didn’t need to actually learn SPARQL: I took the first example of the documentation, which gets Douglas Adams’ data, and played with this online query tool to see how I could modify it to 1) obtain data in Swedish, because Swedish Wikipedia is probably more complete about Swedish people, and 2) obtain data about the summer guests, not Douglas Adams. For that second point I needed the item ID of each summer guest, which I queried via another R package, WikidataR. I could have modified the SPARQL even more since I wouldn’t be using picture information, but I wasn’t in a very perfectionist mood.
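For what it’s worth, here is a sketch of what a more trimmed-down query could look like if one only wanted occupations. The helper `build_occupation_query` is hypothetical (it is not part of any package), and it assumes that the `wd:`, `wdt:` and `rdfs:` prefixes are predefined by the Wikidata query service and that P106 is the occupation property. I only build the query string here; the actual call is left commented out because it needs network access.

```r
# Hypothetical helper, not part of any package: build a SPARQL query that
# asks only for the occupations (Wikidata property P106) of one item.
build_occupation_query <- function(entity_code, language = "sv") {
  paste0(
    "SELECT ?occupation ?occupationLabel WHERE {\n",
    "  wd:", entity_code, " wdt:P106 ?occupation .\n",
    "  ?occupation rdfs:label ?occupationLabel .\n",
    "  FILTER (LANG(?occupationLabel) = '", language, "')\n",
    "}"
  )
}

query <- build_occupation_query("Q42")  # Q42 is Douglas Adams
cat(query)

# To run it (needs network access):
# results <- WikidataQueryServiceR::query_wikidata(query)
```

Such a query would return one row per occupation instead of one row per property value, at the price of losing all the other properties.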

Doing all of this I only scratched the surface of the possibilities offered by the Wikipedia-related R packages, and I can already tell one could do a bunch of exciting analyses with them!

Note that I used Bob Rudis’ code as an example of how to insert a progress bar which made me feel quite cool. I also inserted a pause of 1 second between calls to the APIs in order to be a good person. My function name however shows how uninspired I was for naming things.

# function for getting someone's data
get_someone_data <- function(name, pb = NULL){
  if (!is.null(pb)) pb$tick()$print()
  Sys.sleep(1)
  item <- WikidataR::find_item(name, language = "sv")
  # sometimes people have no Wikidata entry so I need this condition
  if(length(item) > 0){
    entity_code <- item[[1]]$id
    query <-  paste0("PREFIX entity: <http://www.wikidata.org/entity/>
                     #partial results
                     
                     SELECT ?propUrl ?propLabel ?valUrl ?valLabel ?picture
                     WHERE
                     {
                     hint:Query hint:optimizer 'None' .
                     {	BIND(entity:",entity_code," AS ?valUrl) .
                     BIND(\"N/A\" AS ?propUrl ) .
                     BIND(\"identity\"@sv AS ?propLabel ) .
                     }
                     UNION
                     {	entity:", entity_code," ?propUrl ?valUrl .
                     ?property ?ref ?propUrl .
                     ?property rdf:type wikibase:Property .
                     ?property rdfs:label ?propLabel
                     }
                     
                     ?valUrl rdfs:label ?valLabel
                     FILTER (LANG(?valLabel) = 'sv') .
                     OPTIONAL{ ?valUrl wdt:P18 ?picture .}
                     FILTER (lang(?propLabel) = 'sv' )
                     }
                     ORDER BY ?propUrl ?valUrl
                     LIMIT 200")
    results <- WikidataQueryServiceR::query_wikidata(query)
    results$name<- name
    results
  } else {
    NULL
  }
}

pb <- dplyr::progress_estimated(length(unique(sommargaester$name)))
sommargaester_wiki <- purrr::map_df(unique(sommargaester$name),
                                    get_someone_data, pb=pb)

sommargaester <- dplyr::left_join(sommargaester, sommargaester_wiki, by = "name")

readr::write_csv(sommargaester, path = "data/p1_wiki_data.csv")
knitr::kable(sommargaester[1:10,])
| name | rep | pretty_date | propLabel | valLabel |
|---|---|---|---|---|
| Lars-Eric Aaro | 1 | 2011-07-14 | arbetsgivare | LKAB |
| Lars-Eric Aaro | 1 | 2011-07-14 | födelseort | Vittangi |
| Lars-Eric Aaro | 1 | 2011-07-14 | kön | man |
| Lars-Eric Aaro | 1 | 2011-07-14 | nationalitet | Sverige |
| Lars-Eric Aaro | 1 | 2011-07-14 | instans av | människa |
| Lars-Eric Aaro | 1 | 2011-07-14 | alma mater | Luleå tekniska universitet |
| Lars-Eric Aaro | 1 | 2011-07-14 | medlem av | Kungliga Ingenjörsvetenskapsakademien |
| Lars-Eric Aaro | 1 | 2011-07-14 | förnamn | Lars |
| Lars-Eric Aaro | 1 | 2011-07-14 | identity | Lars-Eric Aaro |
| Magdi Abdelhadi | 1 | 2007-08-09 | sysselsättning | journalist |

I get one line per guest and property value. Some property labels have more than one value for the same person, e.g. if the person has several occupations (occupation = sysselsättning).
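To make the long format a bit more concrete, here is a minimal sketch with invented toy data showing how several values per guest can be collapsed into a single string. The guests and occupations below are made up, and the rows are assumed to be already filtered to the occupation property.

```r
# Toy data in the same long format (values invented for illustration);
# rows are assumed to be already filtered to the occupation property
toy <- data.frame(
  name = c("Guest A", "Guest A", "Guest B"),
  valLabel = c("journalist", "poet", "musiker"),
  stringsAsFactors = FALSE
)

# one row per guest, with all occupations pasted together
per_guest <- aggregate(valLabel ~ name, data = toy, FUN = toString)
per_guest
```

Guest A ends up with "journalist, poet" on a single line, while Guest B keeps a single occupation.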

Translating the occupations

This is the point at which I realized that doing the search in English in the first place probably wouldn’t have given me less data. Oh well. In any case, for each possible occupation I can get an English translation, which is awesome.

occupation <- dplyr::filter(sommargaester, propLabel == "sysselsättning")
occupations <- unique(occupation$valLabel)
translate_occupation <- function(occupation){
  job <- WikidataR::find_item(occupation, language = "sv")[[1]]$label
  if(is.null(job)){
    job <- ""
  }
  return(job)
}
english_occupations <- purrr::map_chr(occupations, translate_occupation)
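# keep strings as characters so that the join below does not have to coerce factors
translations <- data.frame(sysselsaettning = occupations,
                           occupation = english_occupations,
                           stringsAsFactors = FALSE)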
occupation <- dplyr::left_join(occupation, translations, by = c("valLabel" = "sysselsaettning"))

I guess it’s good to know that Wikidata has data in many languages because I can totally see it being useful as a sort of translation service.
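If one wanted the lookup to be more defensive, for instance when a search returns no hit at all, a small wrapper could help. This is a sketch under assumptions: `first_label` is a hypothetical helper, and the mock objects below only imitate a list of hits with a `label` field, which is the shape I infer from the code above.

```r
# Hypothetical helper: return the label of the first search hit, or ""
# when the search returned nothing or the first hit has no label.
first_label <- function(items) {
  if (length(items) == 0) {
    return("")
  }
  label <- items[[1]]$label
  if (is.null(label)) "" else label
}

# Mock search results (invented, only the shape matters)
first_label(list(list(id = "mock-id", label = "journalist")))
first_label(list())                      # no hit at all
first_label(list(list(id = "mock-id")))  # a hit without a label
```

Checking the length before indexing avoids a "subscript out of bounds" error on empty results.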

Profiling the summer guests

How much data?

withoccupation <- dplyr::filter(sommargaester, !is.na(propLabel))
withoccupation <- dplyr::group_by(withoccupation, name)
withoccupation <- dplyr::summarise(withoccupation, withoccupation = any(propLabel == "sysselsättning"))

There are 2192 unique guests in the dataset. We got Wikidata information for 1587 of them, i.e. 72%, and at least one occupation for 1421 of them, i.e. 65%. At this point I have no way to know whether the sample is representative. Maybe the people with more Wikidata output are the most famous ones, and some occupations probably help you get famous more than others, so I tend to think this makes my sample a bit limited.
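The percentages follow directly from the three counts. A minimal sketch of the arithmetic, with the counts typed in by hand from the paragraph above and a hypothetical helper `share`:

```r
# Counts taken from the text above
n_guests <- 2192
n_with_wikidata <- 1587
n_with_occupation <- 1421

# Hypothetical helper: share of the total, as a rounded percentage
share <- function(part, total) round(100 * part / total)

share(n_with_wikidata, n_guests)    # 72
share(n_with_occupation, n_guests)  # 65
```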

How often do guests get invited?

As I said earlier some guests were invited several times. I want to look at the distribution of the number of invitations per guest.

invites <- dplyr::select(sommargaester, name, rep)
invites <- dplyr::group_by(invites, name)
invites <- dplyr::summarize(invites, no_invites = max(rep))
invites <- unique(invites)
library("ggplot2")
theme_set(theme_gray(base_size = 18))
ggplot(invites) +
  geom_histogram(aes(no_invites)) +
  xlab("Number of invitations per P1 summer guest")

[Figure: histogram of the number of invitations per P1 summer guest]

Some people have been invited a lot! If you remember my initial table, it had only a few columns for the episode dates, but one guest could have several lines.
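Another way to get the number of invitations, which does not rely on a counter column, would be to count one row per episode and guest. Here is a minimal base R sketch on invented toy data; the data frame `episodes` and its contents are hypothetical.

```r
# Toy data: one row per episode (names and dates invented for illustration)
episodes <- data.frame(
  name = c("Guest A", "Guest B", "Guest A", "Guest A"),
  pretty_date = as.Date(c("1990-07-01", "2001-07-15", "1995-07-20", "2003-08-01")),
  stringsAsFactors = FALSE
)

# count the rows per guest
no_invites <- as.data.frame(table(name = episodes$name),
                            responseName = "no_invites",
                            stringsAsFactors = FALSE)
no_invites
```

Here Guest A gets 3 invitations and Guest B gets 1, whatever the order of the rows.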

Who are the people that got invited that many times?

jobs <- dplyr::select(occupation, name, occupation) 
jobs <- unique(jobs)
# join the de-duplicated name/occupation pairs computed just above
invites <- dplyr::left_join(invites, jobs, by = "name")
invites <- dplyr::select(invites, name, no_invites, occupation)
invites <- unique(invites)
invites <- dplyr::group_by(invites, name, no_invites)
invites <- dplyr::summarise(invites, occupation = toString(occupation))
invites <- dplyr::ungroup(invites)
invites <- dplyr::arrange(invites, desc(no_invites))
knitr::kable(invites[1:20,])
| name | no_invites | occupation |
|---|---|---|
| Torsten Ehrenmark | 54 | journalist, translator, writer |
| Jörgen Cederberg | 53 | NA |
| Gösta Knutsson | 42 | cartoonist, journalist, translator, children’s writer, writer |
| Maud Reuterswärd | 35 | writer |
| Georg Eliasson | 29 | screenwriter |
| Pekka Langer | 29 | television presenter, actor |
| Britt Edwall | 24 | writer |
| Arne Ericsson | 23 | musician |
| Bengt Feldreich | 23 | journalist, television presenter |
| Berndt Friberg | 23 | journalist |
| Carl-Olof Lång | 23 | director |
| Carl-Uno Sjöblom | 23 | television presenter |
| Lars Ulvenstam | 23 | journalist |
| Lars Widding | 23 | journalist, writer |
| Åke Falck | 17 | film director, director, television presenter, screenwriter, actor |
| Barbro Alving | 17 | journalist, screenwriter, writer |
| Bo Setterlind | 17 | writer, poet |
| Bo Strömstedt | 17 | NA |
| Elisabeth Söderström | 17 | singer, opera singer |
| Fredrik Burgman | 17 | journalist |

Most frequent occupations (by episode)

everything <- dplyr::group_by(occupation, occupation)
everything <- dplyr::summarize(everything, n = n())
everything <- dplyr::filter(everything, n > 50)
library("magrittr")
everything %>%
  dplyr::arrange(n) %>%
  dplyr::mutate(occupation = factor(occupation, ordered = TRUE, levels = unique(occupation))) %>%
  ggplot() +
  geom_col(aes(occupation, n)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Number of episodes by occupation",
          subtitle = "for occupations represented more than 50 times")

[Figure: bar chart of the number of episodes by occupation, for occupations represented more than 50 times]

So in terms of episodes I’d say the media and culture fields are quite well represented. That said, it might also be due to the fact that a person doing sports, for instance, will be classified as either ice skater or sprinter, so there’s no chance for sports as a whole to appear here. I’ll leave the classification of occupations into categories as an exercise for the reader though.
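As a starting point for that exercise, one could use a simple lookup table from occupation to category. This is only a sketch: the categories and the assignments in `category_lookup` are invented for illustration, and `categorise` is a hypothetical helper.

```r
# Hypothetical mapping; the categories and assignments are illustrative only
category_lookup <- c(
  "journalist" = "media",
  "television presenter" = "media",
  "writer" = "culture",
  "actor" = "culture",
  "ice skater" = "sports",
  "sprinter" = "sports"
)

# Look up each occupation; anything not in the table becomes "other"
categorise <- function(occupation) {
  category <- unname(category_lookup[occupation])
  category[is.na(category)] <- "other"
  category
}

categorise(c("sprinter", "writer", "baker"))
# "sports" "culture" "other"
```

With such a column added to the data, the bar charts could be drawn per category instead of per occupation, and sports would get a fair chance to show up.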

Most frequent occupations (by unique guest)

everything <- dplyr::select(occupation, name, occupation)
everything <- unique(everything)
everything <- dplyr::group_by(everything, occupation)
everything <- dplyr::summarize(everything, n = n())
everything <- dplyr::filter(everything, n > 20)
library("magrittr")
everything %>%
  dplyr::arrange(n) %>%
  dplyr::mutate(occupation = factor(occupation, ordered = TRUE, levels = unique(occupation))) %>%
  ggplot() +
  geom_col(aes(occupation, n)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Number of guests by occupation",
          subtitle = "for occupations represented more than 20 times")

[Figure: bar chart of the number of guests by occupation, for occupations represented more than 20 times]

With this figure by guest rather than by episode, a bit more diversity has appeared, e.g. “university professor” and “association football player”. Moreover, although this was a nice exercise, I won’t pretend that someone’s occupation characterizes them fully! The P1 summer guest program really presents a variety of episodes, and I’d tend to think everyone can find an episode that they appreciate. For instance, I remember liking the baker Sébastien Boudet’s episode (yep, he’s French) and the one featuring the former high jumper Kajsa Bergqvist. Now all this data munging has left me more than ready to look at the list of episodes again to choose the next ones I’ll listen to!