The US School Calendar Dataset: What It Is and Why It Matters
There are roughly 13,700 public school districts in the United States, and each sets its own calendar. Until now, there has been no clean, structured, machine-readable dataset covering all of them.
The Problem: School Calendars Are Everywhere and Nowhere
If you’ve ever tried to answer a simple question like “How many students are on spring break this week across the US?”, you already know the pain. Each of America’s 13,700+ school districts publishes its calendar independently, usually as a PDF on a district website that gets redesigned every couple of years. Some districts post clean date lists. Others publish image-only PDFs. A few still rely on printed handouts.
For a parent checking their own kid’s schedule, this works fine. For a data scientist trying to model tourism demand, retail foot traffic, or transportation patterns across multiple regions, it’s a nightmare. You’d need to visit thousands of individual websites, parse thousands of different formats, and reconcile thousands of different naming conventions — just to answer one question about one week.
The School Schedules Database (SSD) solves this by collecting, standardizing, and serving day-level school calendar data for every public school district in the country.
What’s in the Data
The SSD provides one row per district per day, covering the full calendar year. For every combination of district and date, you get a structured record telling you whether school is in session, and if not, why.
Each row includes:
The core classification: is_in_session is a decimal value from 0.0 to 1.0. A normal school day is 1.0. A day off is 0.0. A half day is 0.5. An early release might be 0.75. This matters for demand modeling — a half day still puts families in the car by lunchtime.
The reason: day_type and break_name tell you why school is out. Is it spring break? A teacher workday? Winter recess? Memorial Day? This distinction matters because a holiday Monday behaves differently from spring break week, even though both show is_in_session = 0.0.
The confidence: Every cell carries a confidence score from 0.0 to 1.0 and a source method indicator. Data extracted directly from a district’s official PDF calendar might score 0.95. Data inferred from state-level patterns might score 0.6. You decide your threshold.
Joinable IDs: Every district carries its NCES (National Center for Education Statistics) ID, making it trivial to join school calendar data with Census demographics, NCES enrollment figures, geographic boundaries, or any other federal dataset.
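As a rough illustration of how these fields fit together, here is a minimal Python sketch of one row and a confidence-filtered aggregate. The names is_in_session, day_type, break_name, and confidence come from the description above. The nces_id and enrollment field names, the "HALF_DAY" value, the sample rows, and the helper function are assumptions made for the example, not part of the documented schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class DayRecord:
    nces_id: str                 # assumed field name for the NCES district ID
    day: date
    is_in_session: float         # 1.0 full day, 0.5 half day, 0.0 no school
    day_type: str                # "BREAK" appears in the article; other values are guesses
    break_name: Optional[str]
    confidence: float            # 0.0 to 1.0
    enrollment: int              # assumed to be available per district


def share_of_students_out(rows: Iterable[DayRecord], min_confidence: float = 0.8) -> float:
    """Enrollment-weighted share of students out of school.

    Rows below min_confidence are dropped. A half day (0.5) counts as
    half of that district's enrollment being out.
    """
    kept = [r for r in rows if r.confidence >= min_confidence]
    total = sum(r.enrollment for r in kept)
    if total == 0:
        return 0.0
    out = sum(r.enrollment * (1.0 - r.is_in_session) for r in kept)
    return out / total


# Made-up rows for illustration only.
rows = [
    DayRecord("0000001", date(2026, 3, 27), 0.0, "BREAK", "Spring Break", 0.95, 1000),
    DayRecord("0000002", date(2026, 3, 27), 0.5, "HALF_DAY", None, 0.90, 2000),
    DayRecord("0000003", date(2026, 3, 27), 0.0, "BREAK", "Spring Break", 0.60, 5000),
]

strict = share_of_students_out(rows, min_confidence=0.8)   # ignores the 0.60 row
lenient = share_of_students_out(rows, min_confidence=0.5)  # keeps all three rows
```

Lowering the threshold pulls in the inferred row, which is the trade-off the confidence score is meant to expose.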
Here’s what a typical API response looks like:
GET /api/v1/days?state=FL&date=2026-03-27&is_in_session=0
# Response (truncated)
{
  "results": [
    {
      "district_name": "Orange County Public Schools",
      "enrollment": 207000,
      "is_in_session": 0.0,
      "day_type": "BREAK",
      "break_name": "Spring Break",
      "confidence": 0.95
    },
    // ... more districts
  ],
  "total_students_out": 1847000
}
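A request like the one above could be assembled and its response summarized along these lines. This is a sketch under assumptions: the host api.example.com is a placeholder, authentication is not described in this article and is left out, and the sample payload contains only the single district shown in the truncated response.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # placeholder host, not the real service


def days_url(state: str, day: str, is_in_session: int) -> str:
    """Build the /api/v1/days query shown in the article."""
    query = urlencode({"state": state, "date": day, "is_in_session": is_in_session})
    return f"{BASE_URL}/api/v1/days?{query}"


def summarize(payload: dict, min_confidence: float = 0.0) -> dict:
    """Count districts and sum enrollment for rows at or above a confidence floor."""
    results = [r for r in payload["results"] if r["confidence"] >= min_confidence]
    return {
        "districts": len(results),
        "students": sum(r["enrollment"] for r in results),
    }


# Sample payload copied from the truncated response; only one district is included.
sample = json.loads("""
{
  "results": [
    {
      "district_name": "Orange County Public Schools",
      "enrollment": 207000,
      "is_in_session": 0.0,
      "day_type": "BREAK",
      "break_name": "Spring Break",
      "confidence": 0.95
    }
  ],
  "total_students_out": 1847000
}
""")

url = days_url("FL", "2026-03-27", 0)
summary = summarize(sample, min_confidence=0.9)
```

In practice the URL would be passed to whatever HTTP client the pipeline already uses; the summary step only needs the decoded JSON.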
Three School Years: Past, Present, Future
The SSD currently covers three rolling school years: 2024–2025 (historical), 2025–2026 (current), and 2026–2027 (forward-looking). This three-year window enables year-over-year comparison — did a district shift its spring break a week earlier? Did a state add a fall break it didn’t have before? — while also providing the forward-looking data that planners actually need.
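One way to check whether a district moved its spring break is to compare the ISO week of the break's first day across two school years. The sketch below assumes rows have already been reduced to (date, break_name) pairs for a single district; the dates are invented for illustration and the helper names are hypothetical.

```python
from __future__ import annotations

from datetime import date, timedelta


def break_start(days: list[tuple[date, str]], break_name: str) -> date:
    """Earliest date tagged with the given break name."""
    return min(d for d, name in days if name == break_name)


def week_shift(earlier_year_start: date, later_year_start: date) -> int:
    """Difference in ISO week number between two break start dates.

    Negative means the break moved earlier in the later year. ISO week
    numbers reset at year boundaries, so this is only meaningful for
    breaks that fall well inside the calendar year, such as spring break.
    """
    return later_year_start.isocalendar()[1] - earlier_year_start.isocalendar()[1]


def weekdays_from(start: date, n: int = 5) -> list[tuple[date, str]]:
    """Helper that fabricates n consecutive 'Spring Break' days for the demo."""
    return [(start + timedelta(days=i), "Spring Break") for i in range(n)]


# Invented dates: a Monday-to-Friday break in each of two school years.
year_2024_25 = weekdays_from(date(2025, 3, 17))
year_2025_26 = weekdays_from(date(2026, 3, 9))

shift = week_shift(
    break_start(year_2024_25, "Spring Break"),
    break_start(year_2025_26, "Spring Break"),
)
print(shift)  # -1, i.e. one ISO week earlier
```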
The 2025–2026 school year has the highest data quality, with 100% district coverage and the majority of data extracted directly from official district calendars. The other two years are filled through a combination of direct extraction (when multi-year calendars are published) and statistical imputation from state-level patterns.
Who Uses School Calendar Data
Theme Parks and Attractions
School breaks are among the strongest predictors of theme park attendance. Spring break alone creates a rolling wave of demand from early March through late April as different districts take their weeks off at different times. Crowd forecasters have traditionally had to track individual district break dates by hand, a painstaking process that tops out at a few dozen districts. The SSD provides this at national scale, for every district, via a single API call.
Hotels and Travel
Revenue management teams at hotel chains use school schedule data for dynamic pricing. When the roughly 200,000 students in Orange County, Florida go on spring break, hotel demand in Orlando spikes, but so does demand in competing destinations. Knowing which districts are on break in which weeks lets operators optimize pricing across their entire portfolio.
Retail and Workforce Planning
Back-to-school shopping, holiday staffing, summer schedule adjustments — retail demand follows school calendars closely. Workforce management platforms incorporate school schedule signals to predict foot traffic and optimize staffing levels. A store near a district with an October fall break needs different coverage than one in a district that doesn’t break until Thanksgiving.
Transportation and Government
Public transit agencies adjust routes and frequencies around school schedules. Departments of transportation plan construction windows during summer breaks. Public health agencies use school calendars to model disease transmission patterns. The NCES IDs on every row make federal data joins trivial.
Machine Learning and Feature Engineering
For any time-series model where human behavior matters — ride-hailing demand, restaurant reservations, emergency room visits, electricity consumption — “percentage of students on break within N miles” is a powerful feature. The SSD provides this as a clean, structured input ready for your pipeline.
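A feature like this could be derived roughly as follows. The sketch assumes each district row has been joined to a latitude and longitude (for example via its NCES ID), which this article does not describe as part of the API response. The coordinates, enrollments, and function names are illustrative only.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def pct_students_out_within(origin: tuple, districts: list, radius_miles: float) -> float:
    """Enrollment-weighted percentage of students out of school within a radius.

    Each district contributes enrollment * (1 - is_in_session), so a half
    day counts as half of that district's students being out.
    """
    lat0, lon0 = origin
    nearby = [
        d for d in districts
        if haversine_miles(lat0, lon0, d["lat"], d["lon"]) <= radius_miles
    ]
    total = sum(d["enrollment"] for d in nearby)
    if total == 0:
        return 0.0
    out = sum(d["enrollment"] * (1.0 - d["is_in_session"]) for d in nearby)
    return 100.0 * out / total


# Illustrative points, not real district centroids.
origin = (28.54, -81.38)
districts = [
    {"lat": 28.54, "lon": -81.38, "enrollment": 1000, "is_in_session": 0.0},
    {"lat": 28.60, "lon": -81.30, "enrollment": 3000, "is_in_session": 1.0},
    {"lat": 25.76, "lon": -80.19, "enrollment": 10000, "is_in_session": 0.0},
]

feature_50mi = pct_students_out_within(origin, districts, 50)
```

Computing the same value for several radii and dates gives a small feature matrix that can be joined to the target series by date.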
How It’s Built
The SSD is constructed through a multi-stage pipeline that prioritizes real data over inference. The primary extraction method downloads official district calendars (PDFs and web pages), converts them to structured text, and uses large language models with anti-hallucination safeguards to extract specific calendar dates. Every extraction includes a confidence score and source provenance.
Districts where direct extraction fails get progressively more expensive treatment: alternative URL discovery, different document formats, and eventually manual review. The goal is always to find and parse the actual published calendar, not to guess.
For the small number of districts where no calendar can be found (currently under 100, all with enrollment under 8,000), a state-median imputation layer fills gaps using patterns from neighboring districts with known calendars. These rows are clearly marked with lower confidence scores so you can filter them out if you prefer to work only with directly extracted data.
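The exact imputation method is not spelled out here, so the following is only one plausible reading of "state-median": take the median start date of a break among districts with known calendars and attach a reduced confidence. The 0.6 score echoes the example given earlier for inferred data; the "state_median" tag, the function name, and the sample dates are assumptions.

```python
from __future__ import annotations

from datetime import date
from statistics import median_low


def impute_break_start(known_starts: list[date], confidence: float = 0.6) -> dict:
    """Impute a break start date as the median of known start dates.

    median_low is used so that the result is always one of the observed
    dates, even when the number of known districts is even.
    """
    if not known_starts:
        raise ValueError("need at least one district with a known calendar")
    ordinal = median_low([d.toordinal() for d in known_starts])
    return {
        "start": date.fromordinal(ordinal),
        "confidence": confidence,         # lower than directly extracted rows
        "source_method": "state_median",  # hypothetical tag name
    }


# Invented spring break start dates for five districts in one state.
known = [
    date(2026, 3, 9),
    date(2026, 3, 16),
    date(2026, 3, 16),
    date(2026, 3, 23),
    date(2026, 3, 30),
]

imputed = impute_break_start(known)
```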
Quality principle: Every value in the SSD carries a confidence score and source method. We’d rather tell you “we’re 60% sure about this” than present inferred data as fact. Your risk tolerance determines your filter threshold.
What Makes the SSD Different
A few things set this dataset apart:
Day-level granularity. Not just “spring break is in March” — a row for every day of the year, for every district. Half days, early releases, teacher workdays, and every other calendar nuance are captured.
Per-cell confidence scoring. Every value comes with a 0.0–1.0 confidence score and a source method tag. You control your own quality threshold.
Purpose-built for data pipelines. REST API, JSON and CSV output, NCES IDs for easy joins with federal datasets. No scraping, no manual lookups, no PDFs to parse.
Three rolling school years. Historical, current, and forward-looking data in a single dataset. Year-over-year comparison built in.
Priced for individual data scientists. At $99/month, it is designed to fit on a corporate card rather than require a procurement process.
Getting Started
The SSD is available as a REST API returning JSON or CSV. Pricing starts at $99/month for full access to all districts, all school years, and all endpoints, so you can start building today without a six-month vendor approval process.