ETL Palmer Penguins (R)
What is an ETL? ?
Extract: Fetch the data.
Transform: Clean, prepare, and transform.
Load: Load or prepare the data for analysis or modeling.
In this exercise, we will work with the “Palmer Penguins” dataset. This dataset contains data on three species of penguins from the Palmer Archipelago (Antarctica), with variables such as:
species
island
bill length and depth
flipper length
body mass
sex
year
Let’s see the location of Palmer Station:
1. EXTRACT
We load the study dataset:
# install the package
#install.packages("palmerpenguins")
# view the dataset (first 6 rows)
head(penguins)
## species island bill_len bill_dep flipper_len body_mass sex year
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
## 3 Adelie Torgersen 40.3 18.0 195 3250 female 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
We note that we have 344 observations and 8 variables. The categorical variables are already factors (species, island, and sex), so we don’t need to convert them.
Dataset structure:
## 'data.frame': 344 obs. of 8 variables:
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_len : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_dep : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_len: int 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
We can also generate a detailed summary
## species island bill_len bill_dep
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_len body_mass sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
Name | penguins |
Number of rows | 344 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_len | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
bill_dep | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
flipper_len | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
body_mass | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
2. TRANSFORM
This is the most important step. A good ETL ensures data consistency and readiness for analysis.
There are missing values. We must decide whether to drop them or impute them, for example using the median. Since there are few, we’ll remove them:
## 'data.frame': 333 obs. of 8 variables:
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_len : num 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
## $ bill_dep : num 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
## $ flipper_len: int 181 186 195 193 190 181 195 182 191 198 ...
## $ body_mass : int 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
# Update column names (spanish):
nuevos_nombres <- c("especie", "isla","longitud_pico", "ancho_pico", "longitud_aleta","peso_corporal", "genero", "año")
colnames(penguins_df) <- nuevos_nombres
# Check new column names:
colnames(penguins_df)
## [1] "especie" "isla" "longitud_pico" "ancho_pico"
## [5] "longitud_aleta" "peso_corporal" "genero" "año"
We can group the data and calculate statistics:
penguins_df %>% group_by(especie) %>% summarise(promedio_long_pico = mean(longitud_pico), promedio_ancho_pico = mean(ancho_pico))
## # A tibble: 3 × 3
## especie promedio_long_pico promedio_ancho_pico
## <fct> <dbl> <dbl>
## 1 Adelie 38.8 18.3
## 2 Chinstrap 48.8 18.4
## 3 Gentoo 47.6 15.0
Convert body mass from grams to kilograms:
## especie isla longitud_pico ancho_pico longitud_aleta peso_corporal
## 1 Adelie Torgersen 39.1 18.7 181 3.750
## 2 Adelie Torgersen 39.5 17.4 186 3.800
## 3 Adelie Torgersen 40.3 18.0 195 3.250
## 4 Adelie Torgersen 36.7 19.3 193 3.450
## 5 Adelie Torgersen 39.3 20.6 190 3.650
## 6 Adelie Torgersen 38.9 17.8 181 3.625
## genero año
## 1 male 2007
## 2 female 2007
## 3 female 2007
## 4 female 2007
## 5 male 2007
## 6 female 2007
3. LOAD
We now have a clean dataset ready for analysis or modeling.
Save the transformed dataset to CSV:
Visualizations
Boxplot with ggplot2:
graf_box <- ggplot(penguins_df, aes(x = especie, y = peso_corporal, fill = especie)) +
geom_boxplot() +
labs(title = "Boxplot of Body Mass by Species", subtitle = "Palmer Island Penguins", caption = "Data collected between 2007 and 2009") +
theme_minimal()
graf_box
Trend line and scatter plot:
linea_tendencia <- ggplot(penguins_df) +
geom_smooth(aes(x = longitud_aleta, y = peso_corporal)) +
geom_point(aes(x = longitud_aleta, y = peso_corporal)) +
labs(title = "Trend Line: Flipper Length vs Body Mass") +
theme_bw()
ggplotly(linea_tendencia)
Facet by species:
ggplotly(ggplot(penguins_df) +
geom_point(aes(x = longitud_pico, y = peso_corporal, color = especie)) +
facet_wrap(~especie) +
labs(title = "Bill Length vs Body Mass by Species") +
theme_bw())
Facet grid by species and sex:
ggplot(penguins_df)+
geom_point(aes(x = longitud_aleta, y = peso_corporal, color = especie)) +
facet_grid(genero~especie) +
labs(title = "Body Mass vs Flipper Length by Species and Sex") +
theme_bw()
Heatmap: