Case Study: Why a “Good” Sleep Score Can Leave You Exhausted

Takuya Sakoda
6 days ago

Updated: 3 days ago

A 59-Day Longitudinal Biometric and Autonomic Correlation Study

Executive Summary

Commercial wearable health devices rely on proprietary algorithms to summarize complex physiological states into simple, user-friendly numbers. However, these highly aggregated metrics can mask the specific underlying data channels that dictate daily human energy levels.

This case study investigates a persistent analytical problem: a Garmin smartwatch consistently reporting "Good" sleep scores (75–85/100) on mornings when the user experienced subjective fatigue and unrefreshed waking states. By extracting and modeling a 59-day longitudinal dataset (N=59) of raw biometric telemetry, this project evaluates the individual impacts of sleep volume, heart rate variability (HRV), and autonomic stress on net overnight recovery.

The ultimate objective of this study is twofold: to identify actionable lifestyle optimizations and to model an objective feature-development framework for digital health product consulting.

1. Ask: Problem Definition & Hypotheses

The primary analytical task was to identify which independent physiological metrics share the strongest statistical relationship with net overnight recovery, defined as the net change in Garmin's body battery metric over the sleep period (body_battery_change).

To structure the analysis, three distinct physiological hypotheses were established:

Hypothesis 1 (The Stress Throttle): High overnight autonomic nervous system activity (measured via sleep stress scores) limits the efficiency of physical recovery regardless of sleep duration.
Hypothesis 2 (The Sleep Ceiling): Total sleep duration establishes a strict operational ceiling on physical recharge that cannot be bypassed, even under conditions of near-zero stress.
Hypothesis 3 (Autonomic Flexibility): Higher parasympathetic activity (measured via average overnight HRV) runs in direct parallel with optimal cellular and physical recovery.

2. Prepare & Process: Data Pipeline & Technical Integrity

The raw data was collected over a continuous 60-day observation period from a Garmin Forerunner 255 Music smartwatch. The telemetry was captured as nested JSON objects via the Garmin Connect API and processed through an ETL (Extract, Transform, Load) pipeline using a Python cleaning script (process_sleep.py).

Upon initial inspection, the data was found to be deeply nested (objects inside objects), containing minute-by-minute epoch telemetry arrays, cardiorespiratory logs, and categorical sleep-stage markers. To organize this information into a workable format, I ran a Python bridge script (process_to_sql.py) to flatten the JSON hierarchies and extract the target metrics into a single flat table with one row per night: garmin_sleep_master.csv.
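The flattening step can be sketched as follows. This is a minimal illustration rather than the contents of process_to_sql.py: the payload layout, the nesting keys ("sleep", "hrv", "stress"), the TARGETS mapping, and the dig helper are all assumptions made for the example, and only the column names come from this study.

```python
import csv
import io
import json

# Hypothetical mapping: output column -> path inside the nested JSON.
# The nesting shown here is assumed, not Garmin's documented schema.
TARGETS = {
    "calendar_date": ("calendar_date",),
    "sleep_time_seconds": ("sleep", "sleep_time_seconds"),
    "awake_count": ("sleep", "awake_count"),
    "avg_overnight_hrv": ("hrv", "avg_overnight_hrv"),
    "avg_sleep_stress": ("stress", "avg_sleep_stress"),
}


def dig(obj, path):
    """Follow a sequence of keys into nested dicts; return None if any key is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def flatten_night(record):
    """Reduce one nested nightly record to a flat row of target metrics."""
    return {column: dig(record, path) for column, path in TARGETS.items()}


# Made-up payload for a single night (values are illustrative only).
raw = json.loads("""
{
  "calendar_date": "2000-01-01",
  "sleep": {"sleep_time_seconds": 25200, "awake_count": 2, "epochs": [1, 1, 2]},
  "hrv": {"avg_overnight_hrv": 48},
  "stress": {"avg_sleep_stress": 17}
}
""")

row = flatten_night(raw)

# Write the flat rows as CSV text; a real script would write to a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(TARGETS))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
```

Fields that are not listed in TARGETS, such as the per-minute epoch arrays, are simply left out of the flat row, and a missing path yields None so that incomplete nights can be caught in the quality checks.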

Data Transformation & Quality Controls:

Metric Conversion: Sleep duration was extracted in raw seconds and converted to decimal hours (sleep_time_hours) so that it could be treated as a continuous variable on a readable scale.
Missing Value Mitigation: Out of 60 consecutive tracking nights, one row (April 25) contained null fields caused by a device synchronization timeout. This record was removed from the dataset so that it could not distort downstream statistics, leaving a final analytical sample size of N=59.
Zero-Value Verification: Structural checks confirmed that metrics like restless_moments_count and awake_count did not possess artificial zero-inflation caused by sensor drop-offs, verifying that zero values reflected true physiological states rather than hardware errors.
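The three quality controls above could look roughly like this in code. It is a sketch under stated assumptions: the rows are made-up values, the helper names clean_rows and zero_share are hypothetical, and the actual cleaning script may be organized differently.

```python
def clean_rows(rows, required):
    """Drop rows with missing required fields and add sleep_time_hours."""
    cleaned = []
    for row in rows:
        if any(row.get(column) is None for column in required):
            continue  # e.g. a night lost to a sync timeout
        out = dict(row)
        # Seconds -> decimal hours (25200 s becomes 7.0 h).
        out["sleep_time_hours"] = round(row["sleep_time_seconds"] / 3600, 2)
        cleaned.append(out)
    return cleaned


def zero_share(rows, column):
    """Fraction of rows where the column is exactly zero."""
    return sum(1 for row in rows if row[column] == 0) / len(rows)


# Made-up rows for illustration; the middle one mimics a failed sync.
rows = [
    {"sleep_time_seconds": 25200, "awake_count": 2, "body_battery_change": 61},
    {"sleep_time_seconds": None, "awake_count": None, "body_battery_change": None},
    {"sleep_time_seconds": 23400, "awake_count": 0, "body_battery_change": 48},
]

required = ["sleep_time_seconds", "awake_count", "body_battery_change"]
cleaned = clean_rows(rows, required)
awake_zero_share = zero_share(cleaned, "awake_count")
```

A high share of zeros would not prove sensor drop-off on its own; it only flags which nights are worth checking against the raw epoch data.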

The fully cleaned data was exported to a normalized tabular format (garmin_sleep_master_clean.csv). The comprehensive pipeline script, raw data dictionaries, and structural JSON schema definitions are hosted transparently in the public repository for technical validation.

3. Analyze: Statistical Validation & Core Predictors

To move past simple visual observation, a rigorous statistical layer was applied to the full dataset. The continuous relationship between the target recovery metric (body_battery_change) and each independent predictor was evaluated using the Pearson Correlation Coefficient (r).

Complete Pearson Correlation Matrix (N=59)

Independent Variable (X)	Target Variable (Y)	Pearson r	Statistical Significance (p-value)	Relationship Interpretation
avg_sleep_stress	body_battery_change	-0.76	p < 0.0001	Extremely Strong Inverse
avg_overnight_hrv	body_battery_change	+0.70	p < 0.0001	Strong Positive Parallel
sleep_time_seconds	body_battery_change	+0.55	p < 0.0001	Moderate Positive

Given the sample size of N=59 (57 degrees of freedom), a two-tailed test at alpha = 0.05 requires |r| to exceed approximately 0.26. All three core metrics yielded p-values well below 0.0001, indicating that these correlations are highly statistically significant and very unlikely to have arisen by chance.
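The significance check can be reproduced with the standard t-transform of Pearson's r. The sketch below uses only the standard library; the critical t-value of about 2.0025 for 57 degrees of freedom is taken from a t-table and hard-coded as an assumption, and the function names are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t-statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def critical_r(t_crit, n):
    """Smallest |r| that reaches significance for a given critical t-value."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)


N = 59
T_CRIT = 2.0025  # assumed: two-tailed 5% critical t for 57 df, from a t-table

r_threshold = critical_r(T_CRIT, N)
# t-statistics for the three coefficients reported in the table above.
t_values = {r: t_statistic(r, N) for r in (-0.76, 0.70, 0.55)}
```

Each reported coefficient produces a t-statistic whose magnitude is far beyond the critical value, which is consistent with the p-values in the table. Turning a t-statistic into an exact p-value needs the t-distribution CDF, which a library such as SciPy provides.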

The Stress Throttle (r = -0.76): Autonomic sleep stress emerged as the strongest single predictor. The negative coefficient is consistent with elevated sympathetic nervous system activity acting as a brake on physical recovery, although correlation alone cannot establish the direction of cause.
The Autonomic Parallel (r = +0.70): Average overnight HRV was strongly and positively correlated with body battery recharge, consistent with robust parasympathetic engagement accompanying better recovery.

Secondary Exploratory Variables: In addition to the three core physiological drivers, two sleep continuity metrics, awake_count (r = -0.14) and restless_moments_count (r = -0.09), were tested against net recovery. Because neither coefficient reached statistical significance (p > 0.05), both were classified as weak, noisy secondary indicators for this specific baseline and omitted from the primary regression models to keep the feature set simple.

4. Share: Interactive Visual Discovery Panel

To isolate these physiological thresholds for business stakeholders, the full longitudinal dataset was mapped into an interactive three-panel Business Intelligence validation dashboard built in Tableau.

The dashboard plots each of the three core predictors against body_battery_change, with a linear trend line, across the full 59-day baseline.

Analytical Insights from the Visual Panel:

Panel 1: Sleep Volume Capacity Constraints (r = +0.55). The linear trend line points to a practical "Sleep Ceiling": on nights when sleep duration dropped below 6.5 hours, the body battery did not recharge past 45 points, even when autonomic stress was at its lowest.
Panel 2: Autonomic Flexibility Parallel (r = +0.70). The points form a tight, positive linear cluster. As average overnight HRV increases along the X-axis, recovery scores climb in step, supporting HRV as a useful proxy for overnight recovery.
Panel 3: Overnight Stress Inverse Curve (r = -0.76). This plot shows the strongest of the three relationships. The tight linear clustering indicates that once average overnight stress crosses a score of 25, body battery recovery drops markedly, which helps explain why a sleep score can look "good" while the user wakes unrefreshed.
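The trend lines and the stress threshold in these panels can be approximated outside Tableau with a simple least-squares fit. The sketch below uses invented stress and recovery values purely for illustration, not the study data, and the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_by_threshold(xs, ys, threshold):
    """Mean of y for x <= threshold and for x > threshold.

    Assumes both groups contain at least one point.
    """
    low = [y for x, y in zip(xs, ys) if x <= threshold]
    high = [y for x, y in zip(xs, ys) if x > threshold]
    return sum(low) / len(low), sum(high) / len(high)


# Invented values for illustration only (not the 59-night dataset).
stress = [12, 15, 18, 22, 27, 31]
recovery = [68, 63, 55, 49, 38, 30]

slope, intercept = fit_line(stress, recovery)
calm_mean, stressed_mean = mean_by_threshold(stress, recovery, threshold=25)
```

Comparing the group means on either side of a candidate threshold is a quick sanity check on whether a cut-off such as 25 marks a real step or just a point along a steady linear decline.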

5. Act: Product Implementation Roadmap

To translate these analytical findings into scalable user metrics, two key digital health product optimizations are proposed for wearable application software:

Feature A: The Dynamic "Time Cushion" Buffer. Current wearable applications alert users to high stress but offer no concrete way to respond. The application should monitor a user's early-night physiological load. If the algorithm detects overnight stress tracking 20% higher than the user's rolling 14-day baseline, it should adjust the wake-up target by calculating a personalized "Time Cushion" extension (e.g., adding 1.5 hours of sleep duration) to offset the added autonomic load and protect the user's morning energy.
Feature B: Machine Learning Baseline Personalization. A major limitation of modern digital health scoring is the application of rigid, universal targets (such as the standard "8 hours of sleep" rule). The application should compute an automated 30-day correlation matrix for each individual user. If the model shows that a user achieves an optimal +80 recovery on a baseline of 6.5 hours with high HRV, the app should recalibrate its sleep coaching notifications to respect that user's shorter individual baseline.
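The trigger logic proposed for Feature A can be sketched in a few lines. This is an illustrative rule under stated assumptions, not a production algorithm: the function name, the fixed 1.5-hour cushion, and the choice to return no extension when fewer than 14 nights of history exist are all hypothetical design choices.

```python
from statistics import mean


def time_cushion_hours(stress_history, tonight_stress,
                       window=14, trigger=0.20, cushion=1.5):
    """Suggested extra sleep in hours, or 0.0 when the rule does not fire.

    stress_history: prior nightly average stress scores, oldest first.
    tonight_stress: the early-night stress reading for the current night.
    """
    recent = stress_history[-window:]
    if len(recent) < window:
        return 0.0  # assumed behavior: no suggestion without a full baseline
    baseline = mean(recent)
    if tonight_stress > baseline * (1 + trigger):
        return cushion
    return 0.0


# Made-up history: 14 nights at a stress score of 20.
history = [20] * 14
elevated = time_cushion_hours(history, tonight_stress=25)
normal = time_cushion_hours(history, tonight_stress=23)
```

A fixed 1.5-hour cushion is the simplest option. A fuller version could scale the extension with how far tonight's stress sits above the baseline, using the per-user regression described in Feature B.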

📋 Technical Appendix & Professional Bio

Methodology Notes: Linear correlations were computed via ordinary least squares (OLS) regression using Python's NumPy and pandas libraries. The dashboard was built in Tableau Desktop and deployed via Tableau Public.
Data Availability: The cleaned dataset, ETL transformations, and metadata schemas are archived in the project GitHub repository.

About the Analyst: Leveraging a structural analytical background combined with core frameworks from the Google Data Analytics Professional Certificate, I specialize in transforming unstructured time-series telemetry into high-impact digital health product insights.