ETL Palmer Penguins

Roberto Baca

10/07/2025

ETL Palmer Penguins (R)

What is an ETL? ?

  • Extract: Fetch the data.

  • Transform: Clean, prepare, and transform.

  • Load: Load or prepare the data for analysis or modeling.

In this exercise, we will work with the “Palmer Penguins” dataset. This dataset contains data on three species of penguins from the Palmer Archipelago (Antarctica), with variables such as:

  • species

  • island

  • bill length and depth

  • flipper length

  • body mass

  • sex

  • year

Penguin Species of the Palmer Islands
Penguin Species of the Palmer Islands

Let’s see the location of Palmer Station:

#install.packages("leaflet")
library(leaflet)

# Location of Palmer Station
# -64.77416897295598, -64.05381917264792

# Interactive map
leaflet() %>%
  addTiles() %>%
  addMarkers(lat = -64.77416897295598, lng = -64.05381917264792, popup = "Estacion Palmer")
# library
library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)

1. EXTRACT

We load the study dataset:

# install the package
#install.packages("palmerpenguins")

# view the dataset (first 6 rows)
head(penguins)
##   species    island bill_len bill_dep flipper_len body_mass    sex year
## 1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
## 2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
## 3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
## 4  Adelie Torgersen       NA       NA          NA        NA   <NA> 2007
## 5  Adelie Torgersen     36.7     19.3         193      3450 female 2007
## 6  Adelie Torgersen     39.3     20.6         190      3650   male 2007

We note that we have 344 observations and 8 variables. The categorical variables are already factors (species, island, and sex), so we don’t need to convert them.

Dataset structure:

str(penguins)
## 'data.frame':    344 obs. of  8 variables:
##  $ species    : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island     : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_len   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_dep   : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_len: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass  : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex        : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year       : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

We can also generate a detailed summary

summary(penguins)
##       species          island       bill_len        bill_dep    
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##   flipper_len      body_mass        sex           year     
##  Min.   :172.0   Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0   1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0   Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9   Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0   3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0   Max.   :6300                Max.   :2009  
##  NA's   :2       NA's   :2
library(skimr)

skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_len 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_dep 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_len 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

2. TRANSFORM

This is the most important step. A good ETL ensures data consistency and readiness for analysis.

There are missing values. We must decide whether to drop them or impute them, for example using the median. Since there are few, we’ll remove them:

# Remove NAs
library(tidyr)

penguins_df <- drop_na(penguins)

str(penguins_df)
## 'data.frame':    333 obs. of  8 variables:
##  $ species    : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island     : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_len   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
##  $ bill_dep   : num  18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
##  $ flipper_len: int  181 186 195 193 190 181 195 182 191 198 ...
##  $ body_mass  : int  3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
##  $ sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
##  $ year       : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
# Update column names (spanish):
nuevos_nombres <- c("especie", "isla","longitud_pico", "ancho_pico", "longitud_aleta","peso_corporal", "genero", "año")

colnames(penguins_df) <- nuevos_nombres

# Check new column names:
colnames(penguins_df)
## [1] "especie"        "isla"           "longitud_pico"  "ancho_pico"    
## [5] "longitud_aleta" "peso_corporal"  "genero"         "año"

We can group the data and calculate statistics:

penguins_df %>% group_by(especie) %>% summarise(promedio_long_pico = mean(longitud_pico), promedio_ancho_pico = mean(ancho_pico))
## # A tibble: 3 × 3
##   especie   promedio_long_pico promedio_ancho_pico
##   <fct>                  <dbl>               <dbl>
## 1 Adelie                  38.8                18.3
## 2 Chinstrap               48.8                18.4
## 3 Gentoo                  47.6                15.0

Convert body mass from grams to kilograms:

penguins_df$peso_corporal <- penguins_df$peso_corporal / 1000
head(penguins_df)
##   especie      isla longitud_pico ancho_pico longitud_aleta peso_corporal
## 1  Adelie Torgersen          39.1       18.7            181         3.750
## 2  Adelie Torgersen          39.5       17.4            186         3.800
## 3  Adelie Torgersen          40.3       18.0            195         3.250
## 4  Adelie Torgersen          36.7       19.3            193         3.450
## 5  Adelie Torgersen          39.3       20.6            190         3.650
## 6  Adelie Torgersen          38.9       17.8            181         3.625
##   genero  año
## 1   male 2007
## 2 female 2007
## 3 female 2007
## 4 female 2007
## 5   male 2007
## 6 female 2007

3. LOAD

We now have a clean dataset ready for analysis or modeling.

Save the transformed dataset to CSV:

write.csv(penguins_df, "penguins_transformado.csv", row.names = FALSE)

Visualizations

Boxplot with ggplot2:

graf_box <- ggplot(penguins_df, aes(x = especie, y = peso_corporal, fill = especie)) + 
  geom_boxplot() + 
  labs(title = "Boxplot of Body Mass by Species", subtitle = "Palmer Island Penguins", caption = "Data collected between 2007 and 2009") + 
  theme_minimal()

graf_box

Trend line and scatter plot:

linea_tendencia <- ggplot(penguins_df) +
  geom_smooth(aes(x = longitud_aleta, y = peso_corporal)) +
  geom_point(aes(x = longitud_aleta, y = peso_corporal)) +
  labs(title = "Trend Line: Flipper Length vs Body Mass") +
  theme_bw()


ggplotly(linea_tendencia)

Facet by species:

ggplotly(ggplot(penguins_df) +
  geom_point(aes(x = longitud_pico, y = peso_corporal, color = especie)) +
  facet_wrap(~especie) +
  labs(title = "Bill Length vs Body Mass by Species") +
  theme_bw())

Facet grid by species and sex:

ggplot(penguins_df)+
  geom_point(aes(x = longitud_aleta, y = peso_corporal, color = especie)) +
  facet_grid(genero~especie) +
  labs(title = "Body Mass vs Flipper Length by Species and Sex") +
  theme_bw()

Heatmap:

heatmap <- ggplotly(ggplot(penguins_df, aes(x = isla, y = especie)) +
  labs(title = "Heatmap: Species vs Island") +
  geom_bin2d())

heatmap