European Drug Companies

tidytuesday

Author: Ted Laderas
Published: March 14, 2023

For Tidy Tuesday this week (3/14), I took a quick look at the drugs brought to market by European drug companies.

First, we’ll load the data and use skimr::skim() to understand its structure.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
drugs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-03-14/drugs.csv')
Rows: 1988 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (13): category, medicine_name, therapeutic_area, common_name, active_su...
dbl   (1): revision_number
lgl   (8): patient_safety, additional_monitoring, generic, biosimilar, condi...
dttm  (2): first_published, revision_date
date  (4): marketing_authorisation_date, date_of_refusal_of_marketing_author...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(drugs)
Data summary
Name drugs
Number of rows 1988
Number of columns 28
_______________________
Column type frequency:
character 13
Date 4
logical 8
numeric 1
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
category 0 1.00 5 10 0 2 0
medicine_name 0 1.00 3 125 0 1976 0
therapeutic_area 285 0.86 4 400 0 669 0
common_name 4 1.00 4 220 0 1261 0
active_substance 1 1.00 4 823 0 1345 0
product_number 0 1.00 6 6 0 1932 0
authorisation_status 1 1.00 7 10 0 3 0
atc_code 28 0.99 3 18 0 1074 0
marketing_authorisation_holder_company_name 4 1.00 4 65 0 615 0
pharmacotherapeutic_group 34 0.98 7 174 0 365 0
condition_indication 12 0.99 18 7597 0 1886 0
species 1709 0.14 4 67 0 59 0
url 0 1.00 53 148 0 1988 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
marketing_authorisation_date 60 0.97 1995-10-20 2023-02-20 2013-06-09 1127
date_of_refusal_of_marketing_authorisation 1913 0.04 2004-09-07 2022-04-29 2013-04-25 67
date_of_opinion 779 0.61 1995-07-12 2022-12-15 2016-07-21 389
decision_date 45 0.98 1998-08-20 2023-03-10 2022-02-16 815

Variable type: logical

skim_variable n_missing complete_rate mean count
patient_safety 0 1 0.01 FAL: 1977, TRU: 11
additional_monitoring 0 1 0.19 FAL: 1601, TRU: 387
generic 0 1 0.16 FAL: 1673, TRU: 315
biosimilar 0 1 0.05 FAL: 1896, TRU: 92
conditional_approval 0 1 0.02 FAL: 1940, TRU: 48
exceptional_circumstances 0 1 0.02 FAL: 1940, TRU: 48
accelerated_assessment 0 1 0.02 FAL: 1940, TRU: 48
orphan_medicine 0 1 0.08 FAL: 1826, TRU: 162

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
revision_number 96 0.95 13.53 11.65 0 4.75 11 19 89 ▇▃▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
first_published 0 1.00 1998-08-20 00:00:00 2023-03-09 18:50:00 2018-04-14 23:14:30 1760
revision_date 29 0.99 2000-07-17 02:00:00 2023-03-13 11:52:00 2022-05-11 11:58:00 1932
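
The skim output shows that a few columns are mostly empty (species has a complete_rate of 0.14, and date_of_refusal_of_marketing_authorisation only 0.04). As a minimal sketch, the same missingness ranking can be computed directly with dplyr and tidyr. The `toy_drugs` tibble below is a hypothetical stand-in, not the real data; swap in `drugs` to reproduce the figures above.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `drugs` (made-up rows, real column names)
toy_drugs <- tibble::tibble(
  medicine_name    = c("A", "B", "C", "D"),
  species          = c(NA, "Dogs", NA, NA),
  therapeutic_area = c("Asthma", NA, "Epilepsy", "Asthma")
)

# Share of missing values per column, most incomplete column first
missing_share <- toy_drugs |>
  summarise(across(everything(), \(x) mean(is.na(x)))) |>
  pivot_longer(everything(), names_to = "column", values_to = "share_missing") |>
  arrange(desc(share_missing))

missing_share
```

Note that share_missing is simply 1 minus the complete_rate that skimr reports.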

Then we’ll take a look at the data using visdat::vis_dat().

visdat::vis_dat(drugs)

Drugs by Therapeutic Area

Not surprisingly, type 2 diabetes is the top therapeutic area (73 drugs), followed by HIV infections (71) and hypertension (47).

drugs |> 
  tidyr::drop_na(therapeutic_area) |>
  count(therapeutic_area) |> 
  arrange(desc(n)) |> 
  dplyr::filter(n > 10) |>
  gt::gt()
therapeutic_area n
Diabetes Mellitus, Type 2 73
HIV Infections 71
Hypertension 47
Diabetes Mellitus 37
Pulmonary Disease, Chronic Obstructive 30
Hepatitis C, Chronic 22
Multiple Myeloma 22
Parkinson Disease 20
Carcinoma, Non-Small-Cell Lung 19
Epilepsy 19
Schizophrenia; Bipolar Disorder 18
Breast Neoplasms 17
Hemophilia A 16
Multiple Sclerosis 16
Erectile Dysfunction 15
Influenza, Human; Immunization; Disease Outbreaks 14
Prostatic Neoplasms 14
Asthma 13
COVID-19 virus infection 13
Hypertension, Pulmonary 13
Neutropenia 13
Osteoporosis, Postmenopausal 13
Alzheimer Disease 12
Peripheral Vascular Diseases; Stroke; Myocardial Infarction 12
Carcinoma, Non-Small-Cell Lung; Mesothelioma 11
Multiple Sclerosis, Relapsing-Remitting 11
Radionuclide Imaging 11
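
Several rows in the table combine more than one area in a single string, separated by semicolons (for example "Schizophrenia; Bipolar Disorder"), so those drugs are counted under the combination rather than under each area. One possible way to count per individual area is sketched below with tidyr::separate_longer_delim() (available in tidyr 1.3.0 and later). The `toy_drugs` rows are hypothetical, and the sketch assumes that semicolons only ever separate areas, since commas appear inside area names.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `drugs` (made-up rows, real column names)
toy_drugs <- tibble::tibble(
  medicine_name = c("A", "B", "C", "D"),
  therapeutic_area = c(
    "Schizophrenia; Bipolar Disorder",
    "Bipolar Disorder",
    "Carcinoma, Non-Small-Cell Lung; Mesothelioma",
    NA
  )
)

area_counts <- toy_drugs |>
  drop_na(therapeutic_area) |>
  # One row per drug-area pair; split on ";" only, because commas are part of area names
  separate_longer_delim(therapeutic_area, delim = ";") |>
  # Remove the whitespace left around each piece after splitting
  mutate(therapeutic_area = trimws(therapeutic_area)) |>
  count(therapeutic_area, sort = TRUE)

area_counts
```

With this approach a drug that lists two areas contributes to both counts, so the totals no longer sum to the number of drugs.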

Most Common Therapeutic Area Plot

Here’s a plotly plot of the therapeutic areas that have more than 10 drugs. Mouse over a bar to see the therapeutic area and its drug count.

my_plot <- drugs |> 
  tidyr::drop_na(therapeutic_area) |> 
  count(therapeutic_area) |> 
  arrange(desc(n)) |>
  dplyr::filter(n > 10) |> 
  dplyr::mutate(therapeutic_area = fct_reorder(therapeutic_area, n)) |>
  ggplot() +
  aes(x = therapeutic_area, y = n) +
  geom_col() +
  coord_flip() +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank()) +
  ggtitle("Number of Drugs by Therapeutic Area")

plotly::ggplotly(my_plot)

Orphan Medicines

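Looking at the therapeutic areas for orphan drugs, we see different priorities: rare cancers and inherited disorders such as multiple myeloma, acute myeloid leukemia, and spinal muscular atrophy lead the list, rather than diabetes or hypertension.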

drugs |>
  filter(orphan_medicine == TRUE) |>
  count(therapeutic_area) |>
  arrange(desc(n)) |>
  head(n=20) |>
  gt::gt()
therapeutic_area n
Multiple Myeloma 7
Leukemia, Myeloid, Acute 5
Gastrointestinal Stromal Tumors 3
Hemophilia B 3
Muscular Atrophy, Spinal 3
Tuberculosis, Multidrug-Resistant 3
Amyloidosis 2
Cushing Syndrome 2
Cystic Fibrosis 2
Cystinosis 2
Cytomegalovirus Infections 2
Gaucher Disease 2
Growth and Development 2
Hemoglobinuria, Paroxysmal 2
Hypertension, Pulmonary 2
Lymphoma, Non-Hodgkin 2
Muscular Dystrophy, Duchenne 2
Pancreatic Neoplasms 2
Precursor Cell Lymphoblastic Leukemia-Lymphoma 2
Urea Cycle Disorders, Inborn 2
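
To put the orphan and non-orphan rankings side by side in a single table, one option is to count within each orphan_medicine group and keep the top areas per group. This is only a sketch on hypothetical `toy_drugs` rows; with the real `drugs` tibble you would likely raise `n = 1` to something like 5.

```r
library(dplyr)

# Hypothetical stand-in for `drugs` (made-up rows, real column names)
toy_drugs <- tibble::tibble(
  therapeutic_area = c("Multiple Myeloma", "Multiple Myeloma", "Cystinosis",
                       "Hypertension", "Hypertension", "Asthma"),
  orphan_medicine  = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

top_by_status <- toy_drugs |>
  count(orphan_medicine, therapeutic_area) |>
  group_by(orphan_medicine) |>
  # Keep the single most common area per group; with_ties = FALSE caps the row count
  slice_max(n, n = 1, with_ties = FALSE) |>
  ungroup()

top_by_status
```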

Which Companies?

Which companies hold the most marketing authorisations in this set?

companies <- drugs |>
  count(marketing_authorisation_holder_company_name) |>
  arrange(desc(n)) |>
  head(n = 20) 

companies |>
  gt::gt()
marketing_authorisation_holder_company_name n
Accord Healthcare S.L.U. 58
Novartis Europharm Limited 58
Pfizer Europe MA EEIG 43
Zoetis Belgium SA 40
AstraZeneca AB 36
Boehringer Ingelheim Vetmedica GmbH 36
Merck Sharp & Dohme B.V. 32
Teva B.V. 32
Intervet International BV 31
Eli Lilly Nederland B.V. 30
Novo Nordisk A/S 28
Bristol-Myers Squibb Pharma EEIG 26
Mylan Pharmaceuticals Limited 26
Roche Registration GmbH 26
Janssen-Cilag International NV 23
Boehringer Ingelheim International GmbH 22
Gilead Sciences Ireland UC 21
Sanofi Winthrop Industrie 19
GlaxoSmithKline Biologicals S.A. 18
Sandoz GmbH 18
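
Some of the names in this list (Zoetis, Boehringer Ingelheim Vetmedica, Intervet) look like animal-health companies, which suggests the ranking mixes human and veterinary products. A hedged sketch of splitting the count by the category column follows. It assumes category takes the values "human" and "veterinary", which the skim output does not show directly, and it runs on hypothetical `toy_drugs` rows with made-up company names.

```r
library(dplyr)

# Hypothetical stand-in for `drugs`; the category values are an assumption
toy_drugs <- tibble::tibble(
  category = c("human", "human", "veterinary", "veterinary", "veterinary"),
  marketing_authorisation_holder_company_name = c(
    "Company A", "Company A", "Company B", "Company B", "Company B"
  )
)

# Count drugs per company within each category, largest counts first
companies_by_category <- toy_drugs |>
  count(category, marketing_authorisation_holder_company_name, sort = TRUE)

companies_by_category
```
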
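
Finally, here’s each drug’s revision number plotted against its decision date, colored by therapeutic area. Mouse over a point to get the drug name.

out_plot <- ggplot(drugs) +
  # medicine_name is mapped to `label` so that ggplotly() can include it in the tooltip
  aes(x=decision_date, y=revision_number, color=therapeutic_area, label=medicine_name) +
  geom_point() + theme(legend.position = "none")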

plotly::ggplotly(out_plot, tooltip = c("medicine_name", "decision_date", "revision_number", "therapeutic_area"))

Citation

BibTeX citation:
@online{laderas2023,
  author = {Laderas, Ted},
  title = {European {Drug} {Companies}},
  date = {2023-03-14},
  url = {https://laderast.github.io/articles/2023-03-14-drug-companies/},
  langid = {en}
}
For attribution, please cite this work as:
Laderas, Ted. 2023. “European Drug Companies.” March 14, 2023. https://laderast.github.io/articles/2023-03-14-drug-companies/.