For Tidy Tuesday this week (3/14), I took a quick look at the drugs released by European drug companies.

First, we’ll load the data and use skimr::skim() to get a sense of its structure.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

drugs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-03-14/drugs.csv')

Rows: 1988 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (13): category, medicine_name, therapeutic_area, common_name, active_su...
dbl   (1): revision_number
lgl   (8): patient_safety, additional_monitoring, generic, biosimilar, condi...
dttm  (2): first_published, revision_date
date  (4): marketing_authorisation_date, date_of_refusal_of_marketing_author...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(drugs)

Data summary
Name	drugs
Number of rows	1988
Number of columns	28
_______________________
Column type frequency:
character	13
Date	4
logical	8
numeric	1
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
category	0	1.00	5	10	2
medicine_name	0	1.00	3	125	1976
therapeutic_area	285	0.86	4	400	669
common_name	4	1.00	4	220	1261
active_substance	1	1.00	4	823	1345
product_number	0	1.00	6	6	1932
authorisation_status	1	1.00	7	10	3
atc_code	28	0.99	3	18	1074
marketing_authorisation_holder_company_name	4	1.00	4	65	615
pharmacotherapeutic_group	34	0.98	7	174	365
condition_indication	12	0.99	18	7597	1886
species	1709	0.14	4	67	59
url	0	1.00	53	148	1988

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
marketing_authorisation_date	60	0.97	1995-10-20	2023-02-20	2013-06-09	1127
date_of_refusal_of_marketing_authorisation	1913	0.04	2004-09-07	2022-04-29	2013-04-25	67
date_of_opinion	779	0.61	1995-07-12	2022-12-15	2016-07-21	389
decision_date	45	0.98	1998-08-20	2023-03-10	2022-02-16	815

Variable type: logical

skim_variable	complete_rate	mean	count
patient_safety	1	0.01	FAL: 1977, TRU: 11
additional_monitoring	1	0.19	FAL: 1601, TRU: 387
generic	1	0.16	FAL: 1673, TRU: 315
biosimilar	1	0.05	FAL: 1896, TRU: 92
conditional_approval	1	0.02	FAL: 1940, TRU: 48
exceptional_circumstances	1	0.02	FAL: 1940, TRU: 48
accelerated_assessment	1	0.02	FAL: 1940, TRU: 48
orphan_medicine	1	0.08	FAL: 1826, TRU: 162

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
revision_number	96	0.95	13.53	11.65	0	4.75	11	19	89	▇▃▁▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
first_published	0	1.00	1998-08-20 00:00:00	2023-03-09 18:50:00	2018-04-14 23:14:30	1760
revision_date	29	0.99	2000-07-17 02:00:00	2023-03-13 11:52:00	2022-05-11 11:58:00	1932

Then we’ll use visdat::vis_dat() to see the column types and missing values at a glance.

visdat::vis_dat(drugs)

Drugs by Therapeutic Area

Not surprisingly, type 2 diabetes is the top therapeutic area, followed closely by HIV infections and then hypertension.

drugs |> 
  tidyr::drop_na(therapeutic_area) |>
  count(therapeutic_area) |> 
  arrange(desc(n)) |> 
  dplyr::filter(n > 10) |>
  gt::gt()

therapeutic_area	n
Diabetes Mellitus, Type 2	73
HIV Infections	71
Hypertension	47
Diabetes Mellitus	37
Pulmonary Disease, Chronic Obstructive	30
Hepatitis C, Chronic	22
Multiple Myeloma	22
Parkinson Disease	20
Carcinoma, Non-Small-Cell Lung	19
Epilepsy	19
Schizophrenia; Bipolar Disorder	18
Breast Neoplasms	17
Hemophilia A	16
Multiple Sclerosis	16
Erectile Dysfunction	15
Influenza, Human; Immunization; Disease Outbreaks	14
Prostatic Neoplasms	14
Asthma	13
COVID-19 virus infection	13
Hypertension, Pulmonary	13
Neutropenia	13
Osteoporosis, Postmenopausal	13
Alzheimer Disease	12
Peripheral Vascular Diseases; Stroke; Myocardial Infarction	12
Carcinoma, Non-Small-Cell Lung; Mesothelioma	11
Multiple Sclerosis, Relapsing-Remitting	11
Radionuclide Imaging	11
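Several entries in the table above, such as "Schizophrenia; Bipolar Disorder", pack more than one therapeutic area into a single string, so the counts are per combination rather than per area. Here is a minimal sketch of one way to split them before counting. The helper name `count_split_areas()` and the toy data are made up for illustration and are not part of the original analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not from the original post): count drugs per
# individual therapeutic area after splitting ";"-separated entries.
count_split_areas <- function(data) {
  data |>
    drop_na(therapeutic_area) |>
    # One row per (drug, area) pair
    separate_longer_delim(therapeutic_area, delim = ";") |>
    # Remove the whitespace left around each piece after splitting
    mutate(therapeutic_area = stringr::str_squish(therapeutic_area)) |>
    count(therapeutic_area, sort = TRUE)
}

# Toy stand-in for `drugs`, invented for this example
toy_areas <- tibble::tibble(
  medicine_name = c("A", "B", "C", "D"),
  therapeutic_area = c(
    "Schizophrenia; Bipolar Disorder", "Schizophrenia", NA, "Bipolar Disorder"
  )
)

split_counts <- count_split_areas(toy_areas)
```

On the real data this would be called as `count_split_areas(drugs)`. Note that a drug listed under several areas is then counted once for each of them, so the totals no longer sum to the number of drugs.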

Most Common Therapeutic Area Plot

Here’s a plotly plot of the therapeutic areas that have more than 10 drugs. Mouse over a bar to see the therapeutic area and its drug count.

my_plot <- drugs |> 
  tidyr::drop_na(therapeutic_area) |> 
  count(therapeutic_area) |> 
  arrange(desc(n)) |>
  dplyr::filter(n > 10) |> 
  dplyr::mutate(therapeutic_area = fct_reorder(therapeutic_area, n)) |>
  ggplot() +
  aes(x = therapeutic_area, y = n) +
  geom_col() +
  coord_flip() +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank()) +
  ggtitle("Number of Drugs by Therapeutic Area")

plotly::ggplotly(my_plot)

Orphan Medicines

Looking at the therapeutic areas for orphan drugs, we see different priorities.

drugs |>
  filter(orphan_medicine) |>
  count(therapeutic_area) |>
  arrange(desc(n)) |>
  head(n=20) |>
  gt::gt()

therapeutic_area	n
Multiple Myeloma	7
Leukemia, Myeloid, Acute	5
Gastrointestinal Stromal Tumors	3
Hemophilia B	3
Muscular Atrophy, Spinal	3
Tuberculosis, Multidrug-Resistant	3
Amyloidosis	2
Cushing Syndrome	2
Cystic Fibrosis	2
Cystinosis	2
Cytomegalovirus Infections	2
Gaucher Disease	2
Growth and Development	2
Hemoglobinuria, Paroxysmal	2
Hypertension, Pulmonary	2
Lymphoma, Non-Hodgkin	2
Muscular Dystrophy, Duchenne	2
Pancreatic Neoplasms	2
Precursor Cell Lymphoblastic Leukemia-Lymphoma	2
Urea Cycle Disorders, Inborn	2

Which Companies?

Which companies have the most drugs in this dataset?

companies <- drugs |>
  count(marketing_authorisation_holder_company_name) |>
  arrange(desc(n)) |>
  head(n = 20) 

companies |>
  gt::gt()

marketing_authorisation_holder_company_name	n
Accord Healthcare S.L.U.	58
Novartis Europharm Limited	58
Pfizer Europe MA EEIG	43
Zoetis Belgium SA	40
AstraZeneca AB	36
Boehringer Ingelheim Vetmedica GmbH	36
Merck Sharp & Dohme B.V.	32
Teva B.V.	32
Intervet International BV	31
Eli Lilly Nederland B.V.	30
Novo Nordisk A/S	28
Bristol-Myers Squibb Pharma EEIG	26
Mylan Pharmaceuticals Limited	26
Roche Registration GmbH	26
Janssen-Cilag International NV	23
Boehringer Ingelheim International GmbH	22
Gilead Sciences Ireland UC	21
Sanofi Winthrop Industrie	19
GlaxoSmithKline Biologicals S.A.	18
Sandoz GmbH	18
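The list above mixes holders that look like animal-health companies (for example Zoetis and Boehringer Ingelheim Vetmedica) with human-medicine companies. The `category` column has two distinct values according to the skim output, so one option is to rank companies within each category. This is only a sketch: the helper name `top_companies_by_category()` is hypothetical, and the "human" and "veterinary" labels in the toy data are an assumption about what those two values are.

```r
library(dplyr)

# Hypothetical helper: the top `n_top` authorisation holders in each category.
# Ties at the cutoff are kept, so a category can return more than `n_top` rows.
top_companies_by_category <- function(data, n_top = 5) {
  data |>
    count(category, marketing_authorisation_holder_company_name, name = "n_drugs") |>
    slice_max(n_drugs, n = n_top, by = category)
}

# Toy stand-in for `drugs`; the category labels are assumed, not verified
toy_companies <- tibble::tibble(
  category = c("human", "human", "human", "veterinary"),
  marketing_authorisation_holder_company_name = c("X", "X", "Y", "Z")
)

top_toy <- top_companies_by_category(toy_companies, n_top = 1)
```

On the real data, `top_companies_by_category(drugs, n_top = 10) |> gt::gt()` would give a per-category version of the table above.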

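Finally, here’s each drug’s revision number plotted against its decision date, colored by therapeutic area. Mouse over a point to get the drug name.

out_plot <- ggplot(drugs) +
  # geom_point() does not draw `label`; it is mapped only so that
  # ggplotly() can show medicine_name in the tooltip
  aes(x = decision_date, y = revision_number, color = therapeutic_area, label = medicine_name) +
  geom_point() +
  theme(legend.position = "none")

plotly::ggplotly(out_plot, tooltip = c("medicine_name", "decision_date", "revision_number", "therapeutic_area"))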

Citation

BibTeX citation:

@online{laderas2023,
  author = {Laderas, Ted},
  title = {European {Drug} {Companies}},
  date = {2023-03-14},
  url = {https://laderast.github.io/articles/2023-03-14-drug-companies/},
  langid = {en}
}

For attribution, please cite this work as:

Laderas, Ted. 2023. “European Drug Companies.” March 14, 2023. https://laderast.github.io/articles/2023-03-14-drug-companies/.