For Tidy Tuesday this week (3/14), I took a quick look at the drugs brought out by European drug companies.
The first thing we’ll do is load the data and use skimr::skim() to understand its structure.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
drugs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-03-14/drugs.csv')
Rows: 1988 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): category, medicine_name, therapeutic_area, common_name, active_su...
dbl (1): revision_number
lgl (8): patient_safety, additional_monitoring, generic, biosimilar, condi...
dttm (2): first_published, revision_date
date (4): marketing_authorisation_date, date_of_refusal_of_marketing_author...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(drugs)
Data summary
| Name | drugs |
| Number of rows | 1988 |
| Number of columns | 28 |
| Column type frequency: | |
| character | 13 |
| Date | 4 |
| logical | 8 |
| numeric | 1 |
| POSIXct | 2 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| category | 0 | 1.00 | 5 | 10 | 0 | 2 | 0 |
| medicine_name | 0 | 1.00 | 3 | 125 | 0 | 1976 | 0 |
| therapeutic_area | 285 | 0.86 | 4 | 400 | 0 | 669 | 0 |
| common_name | 4 | 1.00 | 4 | 220 | 0 | 1261 | 0 |
| active_substance | 1 | 1.00 | 4 | 823 | 0 | 1345 | 0 |
| product_number | 0 | 1.00 | 6 | 6 | 0 | 1932 | 0 |
| authorisation_status | 1 | 1.00 | 7 | 10 | 0 | 3 | 0 |
| atc_code | 28 | 0.99 | 3 | 18 | 0 | 1074 | 0 |
| marketing_authorisation_holder_company_name | 4 | 1.00 | 4 | 65 | 0 | 615 | 0 |
| pharmacotherapeutic_group | 34 | 0.98 | 7 | 174 | 0 | 365 | 0 |
| condition_indication | 12 | 0.99 | 18 | 7597 | 0 | 1886 | 0 |
| species | 1709 | 0.14 | 4 | 67 | 0 | 59 | 0 |
| url | 0 | 1.00 | 53 | 148 | 0 | 1988 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| marketing_authorisation_date | 60 | 0.97 | 1995-10-20 | 2023-02-20 | 2013-06-09 | 1127 |
| date_of_refusal_of_marketing_authorisation | 1913 | 0.04 | 2004-09-07 | 2022-04-29 | 2013-04-25 | 67 |
| date_of_opinion | 779 | 0.61 | 1995-07-12 | 2022-12-15 | 2016-07-21 | 389 |
| decision_date | 45 | 0.98 | 1998-08-20 | 2023-03-10 | 2022-02-16 | 815 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| patient_safety | 0 | 1 | 0.01 | FAL: 1977, TRU: 11 |
| additional_monitoring | 0 | 1 | 0.19 | FAL: 1601, TRU: 387 |
| generic | 0 | 1 | 0.16 | FAL: 1673, TRU: 315 |
| biosimilar | 0 | 1 | 0.05 | FAL: 1896, TRU: 92 |
| conditional_approval | 0 | 1 | 0.02 | FAL: 1940, TRU: 48 |
| exceptional_circumstances | 0 | 1 | 0.02 | FAL: 1940, TRU: 48 |
| accelerated_assessment | 0 | 1 | 0.02 | FAL: 1940, TRU: 48 |
| orphan_medicine | 0 | 1 | 0.08 | FAL: 1826, TRU: 162 |
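Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revision_number | 96 | 0.95 | 13.53 | 11.65 | 0 | 4.75 | 11 | 19 | 89 | ▇▃▁▁▁ |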
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| first_published | 0 | 1.00 | 1998-08-20 00:00:00 | 2023-03-09 18:50:00 | 2018-04-14 23:14:30 | 1760 |
| revision_date | 29 | 0.99 | 2000-07-17 02:00:00 | 2023-03-13 11:52:00 | 2022-05-11 11:58:00 | 1932 |
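The missing-value figures in the skim output (for example, species has 1709 missing values out of 1988 rows, a complete rate of 0.14) can be reproduced by hand. The sketch below is a rough base R equivalent, run on a small hypothetical data frame called `toy` rather than the real `drugs` data; the values in it are made up for illustration.

```r
# Hypothetical toy data standing in for `drugs` (values are invented)
toy <- data.frame(
  medicine_name   = c("drug_a", "drug_b", "drug_c", "drug_d"),
  species         = c(NA, "Dogs", NA, NA),
  revision_number = c(3, NA, 11, 19),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix, so column sums and
# column means give the missing count and missing share per variable
missing_summary <- data.frame(
  variable      = names(toy),
  n_missing     = colSums(is.na(toy)),
  complete_rate = 1 - colMeans(is.na(toy)),
  row.names     = NULL
)

# Least complete variables first
missing_summary <- missing_summary[order(missing_summary$complete_rate), ]
print(missing_summary)
```

Swapping `toy` for `drugs` should give the same n_missing and complete_rate columns that skimr reports, up to rounding.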
Then we’ll take a look at the data using visdat::vis_dat().
visdat::vis_dat(drugs)
Drugs by Therapeutic Area
Not surprisingly, type 2 diabetes is the top therapeutic area, followed by HIV infections and hypertension.
# Count drugs per therapeutic area and keep areas with more than 10 drugs
drugs |>
  tidyr::drop_na(therapeutic_area) |>
  count(therapeutic_area) |>
  arrange(desc(n)) |>
  dplyr::filter(n > 10) |>
  gt::gt()
| therapeutic_area | n |
|---|---|
| Diabetes Mellitus, Type 2 | 73 |
| HIV Infections | 71 |
| Hypertension | 47 |
| Diabetes Mellitus | 37 |
| Pulmonary Disease, Chronic Obstructive | 30 |
| Hepatitis C, Chronic | 22 |
| Multiple Myeloma | 22 |
| Parkinson Disease | 20 |
| Carcinoma, Non-Small-Cell Lung | 19 |
| Epilepsy | 19 |
| Schizophrenia; Bipolar Disorder | 18 |
| Breast Neoplasms | 17 |
| Hemophilia A | 16 |
| Multiple Sclerosis | 16 |
| Erectile Dysfunction | 15 |
| Influenza, Human; Immunization; Disease Outbreaks | 14 |
| Prostatic Neoplasms | 14 |
| Asthma | 13 |
| COVID-19 virus infection | 13 |
| Hypertension, Pulmonary | 13 |
| Neutropenia | 13 |
| Osteoporosis, Postmenopausal | 13 |
| Alzheimer Disease | 12 |
| Peripheral Vascular Diseases; Stroke; Myocardial Infarction | 12 |
| Carcinoma, Non-Small-Cell Lung; Mesothelioma | 11 |
| Multiple Sclerosis, Relapsing-Remitting | 11 |
| Radionuclide Imaging | 11 |
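Some values of therapeutic_area, such as "Schizophrenia; Bipolar Disorder", pack several areas into one semicolon-separated string, so the counts here are per combination rather than per individual area. One possible way to count individual areas is sketched below with tidyr::separate_longer_delim() (available in tidyr 1.3.0 and later). The `toy` tibble and its drug names are hypothetical, and the sketch assumes a semicolon is the only separator used.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy data; only the column names match `drugs`
toy <- tibble::tibble(
  medicine_name    = c("drug_a", "drug_b", "drug_c"),
  therapeutic_area = c("Schizophrenia; Bipolar Disorder", "Schizophrenia", NA)
)

area_counts <- toy |>
  drop_na(therapeutic_area) |>
  # One row per individual area instead of one row per combination
  separate_longer_delim(therapeutic_area, delim = ";") |>
  # Remove the spaces left around each piece after splitting
  mutate(therapeutic_area = str_trim(therapeutic_area)) |>
  count(therapeutic_area, sort = TRUE)

print(area_counts)
```

With this approach a drug listed under several areas is counted once in each, so the totals no longer sum to the number of drugs.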
Most Common Therapeutic Area Plot
Here’s a plotly plot of the therapeutic areas that have more than 10 drugs. Mouse over a bar to see the therapeutic area and its drug count.
my_plot <- drugs |>
  tidyr::drop_na(therapeutic_area) |>
  count(therapeutic_area) |>
  arrange(desc(n)) |>
  dplyr::filter(n > 10) |>
  # Order the areas by count so the bars are sorted in the plot
  dplyr::mutate(therapeutic_area = fct_reorder(therapeutic_area, n)) |>
  ggplot() +
  aes(x = therapeutic_area, y = n) +
  geom_col() +
  coord_flip() +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank()) +
  ggtitle("Number of Drugs by Therapeutic Area")
plotly::ggplotly(my_plot)
Orphan Medicines
Looking at the therapeutic areas for orphan drugs, we see different priorities: rare cancers and inherited disorders make up much of the list.
# Top 20 therapeutic areas among orphan medicines
drugs |>
  filter(orphan_medicine == TRUE) |>
  count(therapeutic_area) |>
  arrange(desc(n)) |>
  head(n = 20) |>
  gt::gt()
| therapeutic_area | n |
|---|---|
| Multiple Myeloma | 7 |
| Leukemia, Myeloid, Acute | 5 |
| Gastrointestinal Stromal Tumors | 3 |
| Hemophilia B | 3 |
| Muscular Atrophy, Spinal | 3 |
| Tuberculosis, Multidrug-Resistant | 3 |
| Amyloidosis | 2 |
| Cushing Syndrome | 2 |
| Cystic Fibrosis | 2 |
| Cystinosis | 2 |
| Cytomegalovirus Infections | 2 |
| Gaucher Disease | 2 |
| Growth and Development | 2 |
| Hemoglobinuria, Paroxysmal | 2 |
| Hypertension, Pulmonary | 2 |
| Lymphoma, Non-Hodgkin | 2 |
| Muscular Dystrophy, Duchenne | 2 |
| Pancreatic Neoplasms | 2 |
| Precursor Cell Lymphoblastic Leukemia-Lymphoma | 2 |
| Urea Cycle Disorders, Inborn | 2 |
Which Companies?
Which companies have the most drugs in this set?
# Top 20 marketing authorisation holders by number of drugs
companies <- drugs |>
  count(marketing_authorisation_holder_company_name) |>
  arrange(desc(n)) |>
  head(n = 20)
companies |>
  gt::gt()
| marketing_authorisation_holder_company_name | n |
|---|---|
| Accord Healthcare S.L.U. | 58 |
| Novartis Europharm Limited | 58 |
| Pfizer Europe MA EEIG | 43 |
| Zoetis Belgium SA | 40 |
| AstraZeneca AB | 36 |
| Boehringer Ingelheim Vetmedica GmbH | 36 |
| Merck Sharp & Dohme B.V. | 32 |
| Teva B.V. | 32 |
| Intervet International BV | 31 |
| Eli Lilly Nederland B.V. | 30 |
| Novo Nordisk A/S | 28 |
| Bristol-Myers Squibb Pharma EEIG | 26 |
| Mylan Pharmaceuticals Limited | 26 |
| Roche Registration GmbH | 26 |
| Janssen-Cilag International NV | 23 |
| Boehringer Ingelheim International GmbH | 22 |
| Gilead Sciences Ireland UC | 21 |
| Sanofi Winthrop Industrie | 19 |
| GlaxoSmithKline Biologicals S.A. | 18 |
| Sandoz GmbH | 18 |
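This list mixes human and veterinary medicines: holders such as Zoetis Belgium SA and Boehringer Ingelheim Vetmedica GmbH are animal health companies. The skim output shows that `category` has two unique values, and the sketch below assumes they are "human" and "veterinary" (an assumption based on their string lengths of 5 and 10, not something checked here). The `toy` tibble and its company names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; the category labels are an assumption
toy <- tibble::tibble(
  category = c("human", "veterinary", "human", "human"),
  marketing_authorisation_holder_company_name =
    c("Company A", "Company B", "Company A", "Company C")
)

human_companies <- toy |>
  # Keep only medicines for human use before counting holders
  filter(category == "human") |>
  count(marketing_authorisation_holder_company_name, sort = TRUE)

print(human_companies)
```

Running the same filter on `drugs` before the count would give a ranking for human medicines only; changing the label to the veterinary value would give the animal health ranking.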
# Map medicine_name to the unofficial `text` aesthetic so ggplotly() can show it on hover
out_plot <- ggplot(drugs) +
  aes(x = decision_date, y = revision_number, color = therapeutic_area,
      text = medicine_name) +
  geom_point() +
  theme(legend.position = "none")
plotly::ggplotly(out_plot, tooltip = c("text", "decision_date", "revision_number", "therapeutic_area"))
Citation
BibTeX citation:
@online{laderas2023,
author = {Laderas, Ted},
title = {European {Drug} {Companies}},
date = {2023-03-14},
url = {https://laderast.github.io/articles/2023-03-14-drug-companies/},
langid = {en}
}
For attribution, please cite this work as:
Laderas, Ted. 2023. “European Drug Companies.” March 14, 2023. https://laderast.github.io/articles/2023-03-14-drug-companies/.