This project performs an end-to-end data analysis on
Fitbit daily activity data to explore whether activity patterns vary
across menstrual cycle phases.
Using Fitbit data from 33 users (2016–2019) combined with a simulated
menstrual cycle dataset, each activity date was matched to a cycle day
and cycle phase. All work aligns with a data analyst workflow: data
cleaning, preparation, joining, exploratory analysis, descriptive
statistics, and visual communication.
Can Bellabeat identify meaningful activity differences across menstrual cycle phases that could support more personalized wellness insights for users?
| Metric | Value |
|---|---|
| Unique users analyzed | 33 |
| Average daily steps | 7,638 |
| Simulated mean cycle length | ~28 days |
| Cycle phases analyzed | Menstrual, Follicular, Ovulatory, Luteal |
| Mean Δ (Follicular − Menstrual) | −114 steps |
| Cycle days included | 1–28 |
| Final merged dataset | daily_activity_cycles_full.csv |
This section outlines the datasets used, how they were cleaned and prepared, the methodology behind generating simulated menstrual cycles, how BigQuery was used to merge datasets, and the analytical approach taken during exploratory data analysis. This reflects an end-to-end data analysis workflow suitable for both academic submission and professional review.
Cleaned Fitbit daily activity data containing steps, calories, sedentary minutes, and intensity metrics. Each row represents one user on one day. excel_dailyActivity_sample.png
User IDs paired with each user’s earliest activity date, used as seed input for menstrual cycle simulation. excel_extracted_id.png
Simulated period start dates for each user, generated using an R script that applies biologically realistic cycle lengths and variability. excel_periods_simulated_sample.png
Final merged dataset containing Fitbit activity + simulated menstrual features (CycleDay, CyclePhase, mean_cycle_user, cycle_confidence). excel_daily_activity_cycles_full_sample.png
Data cleaning was performed using R and basic spreadsheet checks to ensure the Fitbit dataset was ready for downstream analysis.
These steps ensured consistent and reliable base data before adding menstrual cycle information.
Since the provided Fitbit dataset does not include menstrual cycle tracking, a simulated menstrual cycle dataset was generated using R to support cycle-based analysis. The goal was not clinical accuracy, but analytic consistency sufficient for demonstrating how Bellabeat could use cycle-aware insights.
Each user was assigned: * A personalized average cycle length between 24–35 days * Small random jitter (±2 days) to mimic natural variability * An occasional “irregular user” pattern with slightly larger variation * A start-date window tied to their first recorded activity date * A reporting probability (some periods marked as missed) Cycles were generated from ~3 months before first activity to ~2 months after last activity to ensure complete alignment.
# simulate_periods_for_activity_window.R (embedded in Rmd)
# Purpose: generate realistic simulated menstrual period start dates per user
library(tidyverse)
library(lubridate)
set.seed(42) # reproducible
# ---- FILE PATHS (use relative paths) ----
ids_path <- "Extracted_Id.csv"
out_path <- "periods_simulated.csv"
# ---- ACTIVITY WINDOW ----
activity_min <- as.Date("2016-12-04")
activity_max <- as.Date("2019-07-01")
start_window_min <- activity_min %m-% months(3)
start_window_max <- activity_min
simulation_end <- activity_max %m+% months(2)
# ---- CYCLE PARAMS ----
mean_mu <- 28
mean_sd <- 3
min_cycle <- 24
max_cycle <- 35
miss_rate <- 0.08
irregular_user_rate <- 0.12
# ---- Read Ids ----
if (!file.exists(ids_path)) stop("Ids file not found: ", ids_path)
ids_df <- read_csv(ids_path, show_col_types = FALSE)
ids_colname <- names(ids_df)[1]
ids_df <- ids_df %>%
rename(Id = !!ids_colname) %>%
mutate(Id = as.character(Id))
ids <- ids_df$Id
# ---- Generate simulated periods ----
rows <- list()
for (uid in ids) {
mean_cycle_user <- round(rnorm(1, mean_mu, mean_sd))
mean_cycle_user <- max(min_cycle, min(max_cycle, mean_cycle_user))
is_irregular <- runif(1) < irregular_user_rate
first_start <- start_window_min + days(sample(0:as.integer(start_window_max - start_window_min), 1))
cur <- first_start
while (cur <= simulation_end) {
reported_flag <- runif(1) > miss_rate
rows[[length(rows) + 1]] <- tibble(
Id = as.character(uid),
PeriodStartDate = cur,
mean_cycle = mean_cycle_user,
simulated_flag = TRUE,
reported = reported_flag
)
jitter <- sample(-2:2, 1)
if (is_irregular) jitter <- jitter + sample(-3:3, 1)
next_interval <- max(21, mean_cycle_user + jitter)
cur <- cur + days(next_interval)
}
}
periods_simulated <- bind_rows(rows) %>% arrange(Id, PeriodStartDate)
# ---- Save CSV ----
dir.create(dirname(out_path), showWarnings = FALSE, recursive = TRUE)
periods_simulated %>%
mutate(PeriodStartDate = format(as.Date(PeriodStartDate), "%Y-%m-%d")) %>%
write_csv(out_path)
message("Saved simulated periods to: ", out_path)
# ---- Sanity checks ----
cat("\nSanity checks:\n")
##
## Sanity checks:
cat("Total simulated starts:", nrow(periods_simulated), "\n")
## Total simulated starts: 1239
cat("Unique users:", n_distinct(periods_simulated$Id), "\n")
## Unique users: 33
cat("PeriodStartDate range:",
min(periods_simulated$PeriodStartDate), "to",
max(periods_simulated$PeriodStartDate), "\n")
## PeriodStartDate range: 17048 to 18140
cat("Mean cycle (summary):\n")
## Mean cycle (summary):
print(summary(periods_simulated$mean_cycle))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.00 26.00 28.00 28.25 30.00 35.00
cat("\nSample rows:\n")
##
## Sample rows:
print(head(periods_simulated, 12))
## # A tibble: 12 × 5
## Id PeriodStartDate mean_cycle simulated_flag reported
## <chr> <date> <dbl> <lgl> <lgl>
## 1 1503960366 2016-09-28 32 TRUE TRUE
## 2 1503960366 2016-10-31 32 TRUE TRUE
## 3 1503960366 2016-12-01 32 TRUE TRUE
## 4 1503960366 2017-01-03 32 TRUE TRUE
## 5 1503960366 2017-02-06 32 TRUE TRUE
## 6 1503960366 2017-03-11 32 TRUE TRUE
## 7 1503960366 2017-04-11 32 TRUE TRUE
## 8 1503960366 2017-05-13 32 TRUE TRUE
## 9 1503960366 2017-06-12 32 TRUE TRUE
## 10 1503960366 2017-07-15 32 TRUE TRUE
## 11 1503960366 2017-08-18 32 TRUE TRUE
## 12 1503960366 2017-09-20 32 TRUE TRUE
To align activity with menstrual cycles, BigQuery was used to perform a rolling join (latest period before activity date).
For each user: - Each activity date was paired with all possible period start dates - A window function ranked period starts in descending order - The “most recent period on or before activity date” was selected - CycleDay = days since period start + 1 - CyclePhase assigned using standard biological ranges
* Day 1–5: Menstrual
* Day 6–13: Follicular
* Day 14–16: Ovulatory
* Day 17+: Luteal
Additional user-level metrics (mean_cycle_user, cycle_confidence) were joined for segmentation.
-- Quick preview of uploaded periods table
SELECT *
FROM `iconic-strategy-477903-e4.bellabeat_raw.periods_simulated`
ORDER BY Id, PeriodStartDate
LIMIT 20;
---------------------------------------------------------
-- 1) Create per-user period summary
---------------------------------------------------------
CREATE OR REPLACE TABLE `iconic-strategy-477903-e4.bellabeat_raw.periods_user_summary` AS
SELECT
Id,
COUNT(1) AS n_period_starts,
ROUND(AVG(mean_cycle)) AS mean_cycle_user,
CASE
WHEN COUNT(1) >= 2 THEN 'high'
ELSE 'low'
END AS cycle_confidence
FROM `iconic-strategy-477903-e4.bellabeat_raw.periods_simulated`
GROUP BY Id;
---------------------------------------------------------
-- 2) Rolling-join: find most recent period before each activity day
---------------------------------------------------------
CREATE OR REPLACE TABLE `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles` AS
WITH candidate_matches AS (
SELECT
da.Id,
da.ActivityDate,
ps.PeriodStartDate,
ps.mean_cycle,
DATE_DIFF(da.ActivityDate, ps.PeriodStartDate, DAY) + 1 AS CycleDay,
ROW_NUMBER() OVER (
PARTITION BY da.Id, da.ActivityDate
ORDER BY ps.PeriodStartDate DESC
) AS rn
FROM `iconic-strategy-477903-e4.bellabeat_raw.cleaned_daily_activity` AS da
LEFT JOIN `iconic-strategy-477903-e4.bellabeat_raw.periods_simulated` AS ps
ON da.Id = ps.Id
AND ps.PeriodStartDate <= da.ActivityDate
)
SELECT
Id,
ActivityDate,
PeriodStartDate,
mean_cycle,
CycleDay,
CASE
WHEN CycleDay BETWEEN 1 AND 5 THEN 'Menstrual'
WHEN CycleDay BETWEEN 6 AND 13 THEN 'Follicular'
WHEN CycleDay BETWEEN 14 AND 16 THEN 'Ovulatory'
WHEN CycleDay >= 17 THEN 'Luteal'
ELSE NULL
END AS CyclePhase
FROM candidate_matches
WHERE rn = 1;
---------------------------------------------------------
-- 3) Enrich with user-level metrics
---------------------------------------------------------
CREATE OR REPLACE TABLE `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles_enriched` AS
SELECT
c.*,
p.mean_cycle_user,
p.cycle_confidence
FROM `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles` AS c
LEFT JOIN `iconic-strategy-477903-e4.bellabeat_raw.periods_user_summary` AS p
USING (Id);
---------------------------------------------------------
-- 4) Join back to cleaned daily activity table
---------------------------------------------------------
CREATE OR REPLACE TABLE `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles_full` AS
SELECT
da.*,
c.PeriodStartDate,
c.CycleDay,
c.CyclePhase,
c.mean_cycle_user,
c.cycle_confidence
FROM `iconic-strategy-477903-e4.bellabeat_raw.cleaned_daily_activity` AS da
LEFT JOIN `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles_enriched` AS c
USING (Id, ActivityDate);
---------------------------------------------------------
-- 5) Sanity checks
---------------------------------------------------------
-- Overall matched percentage
SELECT
COUNT(*) AS total_rows,
COUNTIF(CycleDay IS NOT NULL) AS matched_rows,
ROUND(100 * SAFE_DIVIDE(COUNTIF(CycleDay IS NOT NULL), COUNT(*)), 2) AS matched_pct
FROM `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles_full`;
-- Per-phase counts
SELECT
CyclePhase,
COUNT(*) AS num_days
FROM `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles_full`
GROUP BY CyclePhase
ORDER BY num_days DESC;
-- Per-user matched percentage
SELECT
Id,
COUNT(*) AS total_days,
COUNTIF(CycleDay IS NOT NULL) AS matched_days,
ROUND(100 * SAFE_DIVIDE(COUNTIF(CycleDay IS NOT NULL), COUNT(*)), 2) AS pct_matched
FROM `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles_full`
GROUP BY Id
ORDER BY pct_matched ASC, total_days DESC
LIMIT 30;
The analysis was conducted primarily in R and focused on descriptive, non-predictive insights.
The purpose of EDA was to understand patterns, not to build predictive models.
## Loaded rows: 940
## Unique users: 33
## ActivityDate range: 2016-12-04 to 2019-07-01
## PeriodStartDate range: 2016-12-01 to 2019-06-24
## Columns present: Id, ActivityDate, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, SedentaryActiveDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories, PeriodStartDate, CycleDay, CyclePhase, mean_cycle_user, cycle_confidence
## # A tibble: 8 × 20
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <chr> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-12-08 12669 8.16 8.16
## 2 1503960366 2016-12-07 9762 6.28 6.28
## 3 1503960366 2016-12-06 10460 6.74 6.74
## 4 1503960366 2016-12-11 15506 9.88 9.88
## 5 1503960366 2016-12-12 10544 6.68 6.68
## 6 1503960366 2016-12-13 9819 6.34 6.34
## 7 1503960366 2016-12-10 13019 8.59 8.59
## 8 1503960366 2016-12-09 9705 6.48 6.48
## # ℹ 15 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>,
## # PeriodStartDate <date>, CycleDay <int>, CyclePhase <chr>,
## # mean_cycle_user <dbl>, cycle_confidence <chr>
## # A tibble: 1 × 10
## total_rows rows_with_cycle matched_pct n_users min_activity max_activity
## <int> <int> <dbl> <int> <chr> <chr>
## 1 940 940 100 33 2016-12-04 2019-07-01
## # ℹ 4 more variables: min_period <chr>, max_period <chr>, mean_cycle_len <dbl>,
## # sd_cycle_len <dbl>
## # A tibble: 1 × 7
## min_cycleDay q1 median q3 max_cycleDay mean_cycleDay sd_cycleDay
## <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1 7.75 14 22 38 14.8 8.6
## # A tibble: 4 × 7
## CyclePhase n_days avg_steps sd_steps se_steps avg_cal avg_sedentary
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Luteal 400 8031 5244. 262. 2353 951.
## 2 Ovulatory 101 7814 5212. 519. 2393 1014.
## 3 Follicular 276 7256 4914. 296. 2226 1014.
## 4 Menstrual 163 7212 4867. 381. 2260 1037.
##
## Welch Two Sample t-test
##
## data: fol and men
## t = 0.092253, df = 342.64, p-value = 0.4633
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -751.2966 Inf
## sample estimates:
## mean of x mean of y
## 7256.225 7211.712
##
## Wilcoxon rank sum test with continuity correction
##
## data: fol and men
## W = 22478, p-value = 0.5053
## alternative hypothesis: true location shift is greater than 0
## [1] -114.2079
## [1] 1736.255
## [1] -0.06577827
## [1] -114.2079
## [1] 1736.255
This section presents the core findings from the exploratory analysis. Each figure is followed by a short interpretation written for product stakeholders and hiring reviewers.
Average daily steps by cycle day (1–28). Point size ~ sample count (capped).
A gentle mid-cycle rise is visible around days ~13–18. Average steps are fairly stable across most days with a modest peak in the peri-ovulatory window. Sample counts vary by day (point size), which is why we capped the size to avoid extreme scaling.
Bar chart of mean daily steps per phase, with standard-error style error bars. Luteal and Ovulatory show the higher means, Menstrual shows the lowest.
Histogram of users’ mean deltas (Follicular − Menstrual). Spread is wide and centered near zero.
A real user’s daily steps plotted over time, with cycle-phase shading (where present). It demonstrates intra-user variability and occasional sustained changes.
This analysis provides several meaningful insights into how daily activity patterns relate to menstrual cycle phases. The findings highlight both population-level patterns and important user-level variability that Bellabeat can leverage in product design.
Across all users combined, average daily steps show a modest pattern:
This suggests that physical activity fluctuates gently throughout the cycle, but not enough to drive population-level decisions alone.
The cycle-day trend shows a visible but gentle uplift in activity around the mid-cycle window.
This aligns with:
This pattern appears in aggregate but varies across users.
The histogram of per-user deltas (Follicular − Menstrual steps) shows:
This tells us:
✔ Cycle influences behavior — but differently for each
user
✔ Personalization is more important than global rules
✔ Bellabeat should avoid one-size-fits-all recommendations
Users with higher estimated cycle confidence (more consistent simulated patterns):
This suggests segmentation is key.
The user time-series display shows:
This reinforces the idea that cycle effects are meaningful but not dominant.
Based on the analysis, Bellabeat can improve personalization and user engagement by applying the following key recommendations:
Cycle effects vary widely across users.
Bellabeat should: - Target phase-based insights to
high-confidence users
- Keep recommendations general for low-confidence or irregular users
More accurate cycle data leads to better personalization.
Bellabeat should nudge users to: - Log periods consistently
- Update symptoms
- Verify predicted cycle dates
Cycle effects are subtle compared to factors like stress, sleep, or
work schedule.
Blend cycle insights with: - Sleep patterns
- Activity history
- User context (e.g., travel days)
This avoids over-attributing changes to the cycle alone.
The dashboard you built helps teams quickly: - Review population
vs. individual patterns
- Identify cycle-sensitive segments
- Prototype and refine feature ideas
The findings of this analysis support several meaningful opportunities for Bellabeat’s product, retention, and user-engagement strategies.
Cycle-aware activity insights—especially for high-confidence
users—allow Bellabeat to deliver recommendations that feel timely,
empathetic, and relevant.
This strengthens daily engagement and positions Bellabeat as a
“body-aware” companion rather than a generic fitness app.
Subtle, supportive nudges during low-energy phases (Menstrual) and motivating prompts during higher-energy phases (Ovulatory/Luteal) can increase:
Better personalization → better long-term retention.
Cycle-confidence and user-level deltas offer new segmentation dimensions:
These segments help Bellabeat tailor messaging, notifications, and wellness programs.
The analysis identifies clear hypotheses for experimentation:
This is directly actionable by product teams.
By integrating cycle context with activity data, Bellabeat can expand beyond step tracking and deepen its positioning as a women’s health platform—supporting holistic wellness, not just fitness metrics.
While the analysis provides meaningful insights, several limitations should be noted to ensure accurate interpretation.
The menstrual cycle data used in this project was simulated,
not self-reported.
Although biologically reasonable patterns were used, simulated data
cannot fully capture:
As a result, findings should be interpreted as analytic prototypes, not medical conclusions.
The Fitbit dataset:
This limits generalizability.
Daily activity varies due to:
These factors can blur or outweigh cycle-driven differences.
Cycle effects are real but subtle relative to life context.
Cycle confidence was inferred from the consistency of simulated
data.
Low-confidence users may appear less cycle-sensitive simply due to:
This affects accuracy.
The analysis is descriptive, not predictive.
It identifies correlations, not causal relationships.
These limitations do not invalidate the analysis but highlight why future work should incorporate real cycle data and broader health signals.
The analysis provides a strong foundation, but the following steps would significantly improve accuracy and product value:
Replacing simulated cycles with actual user-logged periods would strengthen:
Integrate sleep, stress, symptom tracking, and device wear patterns to better understand why activity changes across the cycle.
With more data, Bellabeat can cluster users by how strongly their activity varies by phase — supporting more targeted recommendations.
Develop and test:
Add filters (symptoms, sleep, context tags) to help internal teams explore more detailed user patterns.
This appendix contains optional supplemental material for reviewers who want to explore the full technical workflow. All main insights and figures appear in earlier sections; the appendix provides transparency and reproducibility.
The complete R script used to generate
periods_simulated.csv is included below for reference.
# simulate_periods_for_activity_window.R (embedded in Rmd)
# Purpose: generate realistic simulated menstrual period start dates per user
library(tidyverse)
library(lubridate)
set.seed(42) # reproducible
# ---- FILE PATHS (use relative paths) ----
ids_path <- "Extracted_Id.csv"
out_path <- "periods_simulated.csv"
# ---- ACTIVITY WINDOW ----
activity_min <- as.Date("2016-12-04")
activity_max <- as.Date("2019-07-01")
start_window_min <- activity_min %m-% months(3)
start_window_max <- activity_min
simulation_end <- activity_max %m+% months(2)
# ---- CYCLE PARAMS ----
mean_mu <- 28
mean_sd <- 3
min_cycle <- 24
max_cycle <- 35
miss_rate <- 0.08
irregular_user_rate <- 0.12
# ---- Read Ids ----
if (!file.exists(ids_path)) stop("Ids file not found: ", ids_path)
ids_df <- read_csv(ids_path, show_col_types = FALSE)
ids_colname <- names(ids_df)[1]
ids_df <- ids_df %>%
rename(Id = !!ids_colname) %>%
mutate(Id = as.character(Id))
ids <- ids_df$Id
# ---- Generate simulated periods ----
rows <- list()
for (uid in ids) {
mean_cycle_user <- round(rnorm(1, mean_mu, mean_sd))
mean_cycle_user <- max(min_cycle, min(max_cycle, mean_cycle_user))
is_irregular <- runif(1) < irregular_user_rate
first_start <- start_window_min + days(sample(0:as.integer(start_window_max - start_window_min), 1))
cur <- first_start
while (cur <= simulation_end) {
reported_flag <- runif(1) > miss_rate
rows[[length(rows) + 1]] <- tibble(
Id = as.character(uid),
PeriodStartDate = cur,
mean_cycle = mean_cycle_user,
simulated_flag = TRUE,
reported = reported_flag
)
jitter <- sample(-2:2, 1)
if (is_irregular) jitter <- jitter + sample(-3:3, 1)
next_interval <- max(21, mean_cycle_user + jitter)
cur <- cur + days(next_interval)
}
}
periods_simulated <- bind_rows(rows) %>% arrange(Id, PeriodStartDate)
# ---- Save CSV ----
dir.create(dirname(out_path), showWarnings = FALSE, recursive = TRUE)
periods_simulated %>%
mutate(PeriodStartDate = format(as.Date(PeriodStartDate), "%Y-%m-%d")) %>%
write_csv(out_path)
message("Saved simulated periods to: ", out_path)
# ---- Sanity checks ----
cat("\nSanity checks:\n")
##
## Sanity checks:
cat("Total simulated starts:", nrow(periods_simulated), "\n")
## Total simulated starts: 1239
cat("Unique users:", n_distinct(periods_simulated$Id), "\n")
## Unique users: 33
cat("PeriodStartDate range:",
min(periods_simulated$PeriodStartDate), "to",
max(periods_simulated$PeriodStartDate), "\n")
## PeriodStartDate range: 17048 to 18140
cat("Mean cycle (summary):\n")
## Mean cycle (summary):
print(summary(periods_simulated$mean_cycle))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.00 26.00 28.00 28.25 30.00 35.00
cat("\nSample rows:\n")
##
## Sample rows:
print(head(periods_simulated, 12))
## # A tibble: 12 × 5
## Id PeriodStartDate mean_cycle simulated_flag reported
## <chr> <date> <dbl> <lgl> <lgl>
## 1 1503960366 2016-09-28 32 TRUE TRUE
## 2 1503960366 2016-10-31 32 TRUE TRUE
## 3 1503960366 2016-12-01 32 TRUE TRUE
## 4 1503960366 2017-01-03 32 TRUE TRUE
## 5 1503960366 2017-02-06 32 TRUE TRUE
## 6 1503960366 2017-03-11 32 TRUE TRUE
## 7 1503960366 2017-04-11 32 TRUE TRUE
## 8 1503960366 2017-05-13 32 TRUE TRUE
## 9 1503960366 2017-06-12 32 TRUE TRUE
## 10 1503960366 2017-07-15 32 TRUE TRUE
## 11 1503960366 2017-08-18 32 TRUE TRUE
## 12 1503960366 2017-09-20 32 TRUE TRUE
The SQL logic for rolling joins and cycle alignment is included for completeness.
-- Quick preview of uploaded periods table
SELECT *
FROM `iconic-strategy-477903-e4.bellabeat_raw.periods_simulated`
ORDER BY Id, PeriodStartDate
LIMIT 20;
---------------------------------------------------------
-- 1) Create per-user period summary
---------------------------------------------------------
CREATE OR REPLACE TABLE `iconic-strategy-477903-e4.bellabeat_raw.periods_user_summary` AS
SELECT
Id,
COUNT(1) AS n_period_starts,
ROUND(AVG(mean_cycle)) AS mean_cycle_user,
CASE
WHEN COUNT(1) >= 2 THEN 'high'
ELSE 'low'
END AS cycle_confidence
FROM `iconic-strategy-477903-e4.bellabeat_raw.periods_simulated`
GROUP BY Id;
---------------------------------------------------------
-- 2) Rolling-join: find most recent period before each activity day
---------------------------------------------------------
CREATE OR REPLACE TABLE `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles` AS
WITH candidate_matches AS (
SELECT
da.Id,
da.ActivityDate,
ps.PeriodStartDate,
ps.mean_cycle,
DATE_DIFF(da.ActivityDate, ps.PeriodStartDate, DAY) + 1 AS CycleDay,
ROW_NUMBER() OVER (
PARTITION BY da.Id, da.ActivityDate
ORDER BY ps.PeriodStartDate DESC
) AS rn
FROM `iconic-strategy-477903-e4.bellabeat_raw.cleaned_daily_activity` AS da
LEFT JOIN `iconic-strategy-477903-e4.bellabeat_raw.periods_simulated` AS ps
ON da.Id = ps.Id
AND ps.PeriodStartDate <= da.ActivityDate
)
SELECT
Id,
ActivityDate,
PeriodStartDate,
mean_cycle,
CycleDay,
CASE
WHEN CycleDay BETWEEN 1 AND 5 THEN 'Menstrual'
WHEN CycleDay BETWEEN 6 AND 13 THEN 'Follicular'
WHEN CycleDay BETWEEN 14 AND 16 THEN 'Ovulatory'
WHEN CycleDay >= 17 THEN 'Luteal'
ELSE NULL
END AS CyclePhase
FROM candidate_matches
WHERE rn = 1;
---------------------------------------------------------
-- 3) Enrich with user-level metrics
---------------------------------------------------------
CREATE OR REPLACE TABLE `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles_enriched` AS
SELECT
c.*,
p.mean_cycle_user,
p.cycle_confidence
FROM `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles` AS c
LEFT JOIN `iconic-strategy-477903-e4.bellabeat_raw.periods_user_summary` AS p
USING (Id);
---------------------------------------------------------
-- 4) Join back to cleaned daily activity table
---------------------------------------------------------
CREATE OR REPLACE TABLE `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles_full` AS
SELECT
da.*,
c.PeriodStartDate,
c.CycleDay,
c.CyclePhase,
c.mean_cycle_user,
c.cycle_confidence
FROM `iconic-strategy-477903-e4.bellabeat_raw.cleaned_daily_activity` AS da
LEFT JOIN `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles_enriched` AS c
USING (Id, ActivityDate);
---------------------------------------------------------
-- 5) Sanity checks
---------------------------------------------------------
-- Overall matched percentage
SELECT
COUNT(*) AS total_rows,
COUNTIF(CycleDay IS NOT NULL) AS matched_rows,
ROUND(100 * SAFE_DIVIDE(COUNTIF(CycleDay IS NOT NULL), COUNT(*)), 2) AS matched_pct
FROM `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles_full`;
-- Per-phase counts
SELECT
CyclePhase,
COUNT(*) AS num_days
FROM `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles_full`
GROUP BY CyclePhase
ORDER BY num_days DESC;
-- Per-user matched percentage
SELECT
Id,
COUNT(*) AS total_days,
COUNTIF(CycleDay IS NOT NULL) AS matched_days,
ROUND(100 * SAFE_DIVIDE(COUNTIF(CycleDay IS NOT NULL), COUNT(*)), 2) AS pct_matched
FROM `iconic-strategy-477903-e4.bellabeat_raw.daily_activity_cycles_full`
GROUP BY Id
ORDER BY pct_matched ASC, total_days DESC
LIMIT 30;
# STEP 1: Setup + Load Data
library(tidyverse)
library(lubridate)
# Your file is in the current working directory:
merged_csv <- "daily_activity_cycles_full.csv"
# Load the merged dataset
df <- read_csv(merged_csv, show_col_types = FALSE)
# Convert expected columns to proper types
df <- df %>%
mutate(
ActivityDate = as.Date(ActivityDate),
PeriodStartDate = as.Date(PeriodStartDate),
CycleDay = as.integer(CycleDay),
CyclePhase = as.character(CyclePhase),
Id = as.character(Id)
)
# ---- Clean readable printed checks ----
cat("Loaded rows:", nrow(df), "\n")
cat("Unique users:", n_distinct(df$Id), "\n")
cat("ActivityDate range:",
format(min(df$ActivityDate, na.rm=TRUE)), "to",
format(max(df$ActivityDate, na.rm=TRUE)), "\n")
cat("PeriodStartDate range:",
format(min(df$PeriodStartDate, na.rm=TRUE)), "to",
format(max(df$PeriodStartDate, na.rm=TRUE)), "\n")
cat("Columns present:", paste(names(df), collapse = ", "), "\n\n")
# peek at first 8 rows
print(head(df, 8))
# STEP 2: Global summary checks
library(dplyr)
global_summary <- df %>%
summarise(
total_rows = n(),
rows_with_cycle = sum(!is.na(CycleDay)),
matched_pct = round(100 * rows_with_cycle / total_rows, 2),
n_users = n_distinct(Id),
min_activity = format(min(ActivityDate, na.rm = TRUE)),
max_activity = format(max(ActivityDate, na.rm = TRUE)),
min_period = format(min(PeriodStartDate, na.rm = TRUE)),
max_period = format(max(PeriodStartDate, na.rm = TRUE)),
mean_cycle_len = round(mean(mean_cycle_user, na.rm = TRUE), 1),
sd_cycle_len = round(sd(mean_cycle_user, na.rm = TRUE), 1)
)
print(global_summary)
# Quick CycleDay distribution
cycleday_summary <- df %>%
filter(!is.na(CycleDay)) %>%
summarise(
min_cycleDay = min(CycleDay, na.rm = TRUE),
q1 = quantile(CycleDay, 0.25, na.rm = TRUE),
median = median(CycleDay, na.rm = TRUE),
q3 = quantile(CycleDay, 0.75, na.rm = TRUE),
max_cycleDay = max(CycleDay, na.rm = TRUE),
mean_cycleDay = round(mean(CycleDay, na.rm = TRUE), 1),
sd_cycleDay = round(sd(CycleDay, na.rm = TRUE), 1)
)
print(cycleday_summary)
# save summaries
dir.create("analysis_outputs", showWarnings = FALSE)
write.csv(global_summary, "analysis_outputs/global_summary.csv", row.names = FALSE)
write.csv(cycleday_summary, "analysis_outputs/cycleday_summary.csv", row.names = FALSE)
# STEP 3: Phase-level summary + bar plot
library(tidyverse)
# ensure df is loaded from previous step
# df <- read_csv("daily_activity_cycles_full.csv") # if needed
# compute phase-level summary
phase_summary <- df %>%
filter(!is.na(CyclePhase)) %>%
group_by(CyclePhase) %>%
summarise(
n_days = n(),
avg_steps = round(mean(TotalSteps, na.rm=TRUE), 0),
sd_steps = round(sd(TotalSteps, na.rm=TRUE), 1),
se_steps = round(sd_steps / sqrt(n_days), 1),
avg_cal = round(mean(Calories, na.rm=TRUE), 0),
avg_sedentary = round(mean(SedentaryMinutes, na.rm=TRUE), 1)
) %>%
arrange(desc(avg_steps))
print(phase_summary)
# Save CSV
dir.create("analysis_outputs", showWarnings = FALSE)
write_csv(phase_summary, "analysis_outputs/phase_summary.csv")
# Plot avg steps by phase with error bars (se)
phase_plot <- phase_summary %>%
mutate(CyclePhase = factor(CyclePhase, levels = CyclePhase))
p_phase <- ggplot(phase_plot, aes(x=CyclePhase, y=avg_steps)) +
geom_col() +
geom_errorbar(aes(ymin = avg_steps - se_steps, ymax = avg_steps + se_steps), width = 0.2) +
labs(title = "Average Steps by Cycle Phase", x = "Cycle Phase", y = "Average Steps") +
theme_minimal(base_size = 14)
ggsave("analysis_outputs/phase_avg_steps.png", p_phase, width = 8, height = 5, dpi = 300)
print(p_phase)
# STEP 4: CycleDay trend (avg steps by CycleDay 1-28)
library(tidyverse)
day_trend <- df %>%
filter(!is.na(CycleDay), CycleDay >= 1, CycleDay <= 28) %>%
group_by(CycleDay) %>%
summarise(
avg_steps = round(mean(TotalSteps, na.rm = TRUE), 0),
sd_steps = round(sd(TotalSteps, na.rm = TRUE), 1),
n = n()
) %>%
arrange(CycleDay)
# Save CSV
write_csv(day_trend, "analysis_outputs/cycleday_trend.csv")
# Plot
p_day <- ggplot(day_trend, aes(x = CycleDay, y = avg_steps)) +
geom_line() +
geom_point(aes(size = pmin(n, 200)), alpha = 0.8) + # size ~ observation count (capped)
scale_size_continuous(name = "n (capped at 200)") +
labs(title = "Average Steps by Cycle Day (1–28)", x = "Cycle Day", y = "Average Steps") +
theme_minimal(base_size = 13)
ggsave("analysis_outputs/cycleday_avg_steps.png", p_day, width = 9, height = 4.5, dpi = 300)
print(p_day)
# STEP 5: Per-user delta (Follicular - Menstrual)
user_phase_stats <- df %>%
filter(!is.na(CyclePhase)) %>%
group_by(Id, CyclePhase) %>%
summarise(mean_steps = mean(TotalSteps, na.rm = TRUE), .groups = "drop") %>%
pivot_wider(names_from = CyclePhase, values_from = mean_steps) %>%
mutate(delta_fol_minus_men = Follicular - Menstrual)
user_delta <- user_phase_stats %>% filter(!is.na(delta_fol_minus_men))
write_csv(user_delta, "analysis_outputs/user_delta_fol_minus_men.csv")
p_delta <- ggplot(user_delta, aes(x = delta_fol_minus_men)) +
geom_histogram(bins = 12) +
labs(title = "Per-user: Follicular − Menstrual (Avg Steps)", x = "Delta (Follicular − Menstrual)", y = "Number of users") +
theme_minimal(base_size = 13)
ggsave("analysis_outputs/user_delta_hist.png", p_delta, width = 8, height = 4, dpi = 300)
print(p_delta)
# Statistical comparison: Menstrual vs Follicular
men <- df %>% filter(CyclePhase == "Menstrual") %>% pull(TotalSteps)
fol <- df %>% filter(CyclePhase == "Follicular") %>% pull(TotalSteps)
t_res <- t.test(fol, men, alternative = "greater")
wil_res <- wilcox.test(fol, men, alternative = "greater")
t_res
wil_res
# Per-user test (paired at user-level): compute each user’s mean in each phase and then test whether the user-level deltas differ from 0. This avoids conflating within-user correlation.
library(broom)
user_means <- df %>%
filter(!is.na(CyclePhase)) %>%
group_by(Id, CyclePhase) %>%
summarise(mean_steps = mean(TotalSteps, na.rm=TRUE), .groups="drop") %>%
pivot_wider(names_from = CyclePhase, values_from = mean_steps) %>%
mutate(delta = Follicular - Menstrual) %>%
filter(!is.na(delta))
# One-sample t-test on deltas (is mean delta > 0?)
t.test(user_means$delta, mu = 0, alternative = "greater") %>% tidy()
# report mean delta and basic descriptives
mean_delta <- mean(user_means$delta); sd_delta <- sd(user_means$delta)
mean_delta; sd_delta
#Effect size (Cohen's d) for user-level delta:
cohens_d <- mean_delta / sd_delta
cohens_d
# interpret: 0.2 small, 0.5 medium, 0.8 large
#Check results only for high-confidence users
high_users <- df %>% filter(cycle_confidence == "high")
user_means_high <- high_users %>%
group_by(Id, CyclePhase) %>%
summarise(mean_steps = mean(TotalSteps, na.rm=TRUE), .groups="drop") %>%
pivot_wider(names_from = CyclePhase, values_from = mean_steps) %>%
mutate(delta = Follicular - Menstrual) %>%
filter(!is.na(delta))
t.test(user_means_high$delta, mu = 0, alternative = "greater") %>% tidy()
mean(user_means_high$delta); sd(user_means_high$delta)
# cap CycleDay at mean cycle length per user (reduces outlier long CycleDay)
df_capped <- df %>%
mutate(max_day = mean_cycle_user,
CycleDay_capped = ifelse(!is.na(CycleDay) & !is.na(max_day) & CycleDay > max_day, NA, CycleDay))
# Recompute day trend:
day_trend_capped <- df_capped %>%
filter(!is.na(CycleDay_capped), CycleDay_capped >=1, CycleDay_capped <=28) %>%
group_by(CycleDay_capped) %>%
summarise(avg_steps = mean(TotalSteps, na.rm=TRUE), n = n()) %>%
arrange(CycleDay_capped)
The following files were generated as part of the analysis:
For documentation, the final Tableau dashboard is included again below.