Data Source: Global Burden of Disease Collaborative Network. Global Burden of Disease Study 2021 (GBD 2021) Results. Seattle, United States: Institute for Health Metrics and Evaluation (IHME), 2022. Available from https://vizhub.healthdata.org/gbd-results/.
Load and clean the original depression prevalence dataset.
# Load CSV
data_raw <- read_csv("IHME-GBD_2021_DATA-dd81f9d8-1.csv")
# renaming variables, filtering for percent prevalence, selecting relevant variables
data_clean <- data_raw %>%
rename(
country_id = location_id,
country = location_name,
sex = sex_name,
age = age_name,
metric = metric_name,
prevalence = val
) %>%
filter(metric == "Prozent") %>%
select(country_id, country, year, sex, age, prevalence) %>%
arrange(country)
# creating dimension table for country, sorted alphabetically
country_dim <- data_clean %>%
distinct(country) %>%
arrange(country) %>%
mutate(country_id = row_number()) %>%
relocate(country_id)
write_csv(country_dim, "country_dim.csv") # writing .csv file for country dimension table
# creating dimension table for year
year_dim <- data_clean %>%
distinct(year) %>%
arrange(year) %>%
mutate(year_id = row_number()) %>%
relocate(year_id)
write_csv(year_dim, "year_dim.csv") # writing .csv file for year dimension table
Join the cleaned dataset with the dimension tables to create a structured fact table.
# creating fact table for prevalence data
df_fact <- data_clean %>%
select(-country_id) %>% # dropping old country_id
left_join(country_dim, by = "country") %>%
left_join(year_dim, by = "year") %>%
mutate(prevalence_id = row_number()) %>% # creating new id for primary key in fact table
relocate(prevalence_id) %>%
select(-sex, -age, -country, -year) # dropping redundant or irrelevant variables
write_csv(df_fact, "df_fact.csv") # writing .csv file for fact table
Create tables in PostgreSQL with appropriate primary and foreign keys.
Run this code in pgAdmin:
-- Country Dimension Table
CREATE TABLE country_dim (
country_id INT PRIMARY KEY,
country TEXT UNIQUE NOT NULL
);
-- Year Dimension Table
CREATE TABLE year_dim (
year_id INT PRIMARY KEY,
year INT UNIQUE NOT NULL
);
-- Fact Table
CREATE TABLE depression_fact (
prevalence_id SERIAL PRIMARY KEY,
prevalence NUMERIC,
country_id INT REFERENCES country_dim(country_id),
year_id INT REFERENCES year_dim(year_id)
);
Follow these steps:
🖱️ Right-click each table and click on Import/Export Data…
📂 Choose the matching CSV file
✅ Go to Options and make sure header is checked and correct delimiter is used
🔁 Repeat for all tables: ‘country_dim’, ‘year_dim’, ‘prevalence_fact’