Baseball Analytics Case Study
A reproducible Python case study using Statcast, FanGraphs, and a curated prospect cohort to examine where measurable inputs translated into MLB value and where they did not.
The project evaluates pitch-level vulnerabilities, prospect readiness, roster archetypes, baserunning, defense, and team-level value using public data, notebooks, reusable feature engineering, and written findings.
It is framed as an analytical audit, not a causal claim. The workflow is designed to make assumptions, uncertainty, and limitations visible.
Use Statcast, FanGraphs, and curated prospect records to build a reproducible analysis base.
Create pitch-type, count, velocity, plate-discipline, batted-ball, roster, baserunning, and fielding features.
Evaluate player-development translation, organizational outcomes, and balanced roster archetypes.
Test exploratory pressure, baserunning, and defense composites against public WAR outcomes.
Separate measured patterns from causality and account for sample size, public-data gaps, and ongoing seasons.
notebooks/01_translation_gap.ipynb
Tools-to-production analysis of how measurable prospect inputs translated into MLB outcomes.
notebooks/02_pitch_diagnostics.ipynb
Pitch-type and count-specific vulnerabilities across chase, whiff, and contact-quality signals.
notebooks/07_yankees_systemic.ipynb
Curated cohort comparison across target organizations, with sample-size caveats.
notebooks/08_yankees_case_studies.ipynb
Baserunning, defense, home-run dependency, roster extremes, and construction tradeoffs.
notebooks/09_team_value_composite.ipynb
Exploratory pressure, baserunning, and defense composite tested against team WAR.
notebooks/10_rice_comparison.ipynb
Internal counterexample showing a different readiness-gate profile and a live validation case.
src/fire_fishman/
Project code fetches and caches public data, then exposes reusable feature-engineering helpers for notebooks.
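The fetch-then-cache pattern described here might look roughly like the sketch below. It is an illustration only: `cached_fetch` is a hypothetical helper, not a function confirmed to exist in the package, and the CSV cache format is an assumption. The fetcher is passed in as a callable so that the example does not depend on any particular data-source client.

```python
import csv
from pathlib import Path
from typing import Callable

Row = dict[str, str]


def cached_fetch(name: str, fetch: Callable[[], list[Row]], cache_dir: Path) -> list[Row]:
    """Return rows from cache_dir/<name>.csv, calling fetch() only on a cache miss.

    Values read back from the cache are strings, so callers should convert
    numeric columns themselves.
    """
    path = cache_dir / f"{name}.csv"
    if path.exists():
        # Cache hit: skip the (possibly slow or rate-limited) remote call.
        with path.open(newline="") as fh:
            return list(csv.DictReader(fh))

    rows = fetch()
    cache_dir.mkdir(parents=True, exist_ok=True)
    if rows:
        # Only write a cache file when there is something to store.
        with path.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows
```

Keying the cache on an explicit name (for example a source plus a date range) makes reruns of a notebook reproducible against the same snapshot, at the cost of needing a manual delete to refresh data for an ongoing season.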
pitch-level features
Builds chase, whiff, velocity-tier, pitch-type, batted-ball, and count-context features from public data.
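As a rough sketch of what such features involve, the example below derives per-pitch flags and then computes rates over filtered subsets. It assumes the public Statcast fields `pitch_type`, `release_speed`, `zone`, `description`, `balls`, and `strikes`, with zones 1 to 9 treated as in-zone. The pitch-group mapping, the 95 mph velocity tier, and the helper names `pitch_flags` and `rate` are assumptions for illustration, not the project's definitions.

```python
from dataclasses import dataclass

# Statcast `description` values treated as swings and whiffs (bunts ignored here).
SWINGS = {"swinging_strike", "swinging_strike_blocked", "foul", "foul_tip", "hit_into_play"}
WHIFFS = {"swinging_strike", "swinging_strike_blocked"}

# Assumed grouping of Statcast pitch-type codes; adjust to taste.
PITCH_GROUPS = {
    "FF": "fastball", "SI": "fastball", "FC": "fastball",
    "SL": "breaking", "CU": "breaking", "KC": "breaking", "ST": "breaking",
    "CH": "offspeed", "FS": "offspeed",
}
HIGH_VELO_MPH = 95.0  # assumed cutoff for the "high-velocity" tier


@dataclass(frozen=True)
class Pitch:
    pitch_type: str
    release_speed: float
    zone: int
    description: str
    balls: int
    strikes: int


def pitch_flags(p: Pitch) -> dict:
    """Derive group, count, and boolean swing/chase/whiff flags for one pitch."""
    swing = p.description in SWINGS
    out_of_zone = p.zone > 9  # zones 11-14 are outside the strike zone
    return {
        "group": PITCH_GROUPS.get(p.pitch_type, "other"),
        "count": f"{p.balls}-{p.strikes}",
        "high_velo": p.release_speed >= HIGH_VELO_MPH,
        "out_of_zone": out_of_zone,
        "swing": swing,
        "chase": swing and out_of_zone,
        "whiff": p.description in WHIFFS,
    }


def rate(flags: list[dict], numerator: str, denominator: str, **where):
    """Share of `numerator` among rows where `denominator` is true and `where` matches.

    Returns None when no rows qualify, so small samples are visible
    instead of silently becoming 0.
    """
    pool = [f for f in flags if f[denominator] and all(f[k] == v for k, v in where.items())]
    return sum(f[numerator] for f in pool) / len(pool) if pool else None
```

With flags built for a hitter's pitches, `rate(flags, "chase", "out_of_zone", group="offspeed")` gives an offspeed chase rate and `rate(flags, "whiff", "swing", high_velo=True)` gives a high-velocity whiff rate. Adding `count="0-2"` to either call restricts it to a single count.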
readiness gates
Uses readiness gates to compare prospect profiles first, then revises the assessment as MLB outcomes arrive.
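One way to picture a gate check is as a set of named pass/fail rules applied to a prospect's metrics. Everything specific in the sketch below is hypothetical: the gate names, metric keys, and thresholds are placeholders chosen for illustration and are not the project's actual gates.

```python
from typing import Callable, Mapping

Gate = Callable[[Mapping[str, float]], bool]

# HYPOTHETICAL gates and thresholds, for illustration only.
GATES: dict[str, Gate] = {
    "offspeed_chase": lambda m: m["offspeed_chase_rate"] <= 0.35,
    "breaking_chase": lambda m: m["breaking_chase_rate"] <= 0.35,
    "high_velo_whiff": lambda m: m["high_velo_whiff_rate"] <= 0.30,
    "contact_quality": lambda m: m["hard_hit_rate"] >= 0.40,
    "upper_minors_sample": lambda m: m["upper_minors_pa"] >= 300,
}


def evaluate_gates(
    metrics: Mapping[str, float],
    gates: Mapping[str, Gate] = GATES,
) -> dict[str, bool]:
    """Return a pass/fail result for each named gate."""
    return {name: gate(metrics) for name, gate in gates.items()}


def gate_summary(results: Mapping[str, bool]) -> str:
    """Format results as 'passed/total', e.g. '5/5'."""
    return f"{sum(results.values())}/{len(results)}"
```

Keeping the per-gate results, rather than only the total, lets a comparison show which specific gate separates two profiles. Fixing the gate definitions before looking at MLB outcomes is what keeps the later comparison from being tuned to the answer.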
notebooks/10_rice_comparison.ipynb
Rice is the internal counterexample: lower-profile path, stronger readiness-gate profile, and better early validation.
notebooks/09_team_value_composite.ipynb
Regression and Bayesian checks test whether non-offensive components add descriptive value beyond offense.
docs/limitations.md
Separates measured public-data patterns from causal claims, private decision-making, and scouting certainty.
Offspeed chase, breaking-ball chase, and high-velocity whiff created clearer separation than aggregate discipline.
Ben Rice passed 5/5 readiness gates, broke out in 2025, and opened 2026 on the OPS and SLG leaderboards.
Dominguez and Volpe remain live cases; the project flags why neither should be treated as closed.
The curated cohort shows Yankees outcomes lagging top comparison organizations, with modest-sample caveats.
Public BsR (FanGraphs baserunning runs) moved from +7.6 in 2017 to -17.2 in 2024, for a cumulative -39.2 across 2018-2024.
Pressure, baserunning, and defense added descriptive signal beyond offense, but not causal proof.
The analysis identifies public-data patterns and missed value. It cannot fully observe internal decisions, coaching context, private models, or health.
Prospect cohorts and organizational comparisons are modest in size. Labels and live-player evaluations can change as seasons and careers evolve.
Statcast and FanGraphs metrics are useful but incomplete proxies for player quality, fielding value, and organizational process.
Regression and machine-learning checks support exploratory analysis. They are not scouting grades, causal models, or predictive guarantees.