Project Roadmap: Sharif OCW Scrapy's Downloader
Sprint Overview (Day 0)
Project Goal: Deliver an MVP Scrapy-based downloader for Sharif OCW that:
- Fetches course metadata and sessions
- Downloads all downloadable files
- Organizes outputs into structured folders
- Provides progress tracking and basic error handling
Success Criteria:
- Able to download at least one complete course (videos + PDFs)
- Correct directory structure with sanitized filenames
- Basic duplicate detection + retry handling works
- GitHub issues, milestones, and PRs follow roadmap
Team Size: 1 developer (solo)
Roles & Responsibilities:
- Developer: Implement, test, document, manage repo, and review
Definition of Done (DoD):
- Code compiles and runs without errors
- Passes basic integration tests on one sample course
- Artifacts stored in correct directory structure
- Pull requests merged into `main` with review checklist passed
Daily Breakdown (1-Week Sprint)
Day 1: Setup & Foundations
- Primary Focus: Repo setup, environment, baseline structure
- Deliverables: GitHub repo initialized, Scrapy project skeleton, `.env` config
- Tasks:
  - Create GitHub repo and add README (1h)
  - Set up Python env + Scrapy project (2h)
  - Implement `OCWSettings` using Pydantic (3h)
  - Set up GitHub Actions CI with lint/test (2h)
- Dependencies: None
- Review Points: Repo initialized with CI passing
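As a rough illustration of the `OCWSettings` task, the sketch below loads configuration from environment variables. It deliberately uses a stdlib dataclass instead of Pydantic so that it runs without extra dependencies. The field names (`base_url`, `download_dir`, `timeout_s`), the `OCW_` variable prefix, and all default values are assumptions made for the example, not project decisions.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class OCWSettings:
    """Hypothetical settings container; every field here is illustrative."""

    base_url: str = "https://example.invalid/ocw"  # placeholder, not the real endpoint
    download_dir: Path = Path("downloads")
    timeout_s: float = 30.0

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "OCWSettings":
        """Build settings from OCW_* variables, falling back to the defaults."""
        env = os.environ if env is None else env
        timeout = float(env.get("OCW_TIMEOUT_S", cls.timeout_s))
        if timeout <= 0:
            raise ValueError("OCW_TIMEOUT_S must be positive")
        return cls(
            base_url=env.get("OCW_BASE_URL", cls.base_url),
            download_dir=Path(env.get("OCW_DOWNLOAD_DIR", str(cls.download_dir))),
            timeout_s=timeout,
        )
```

Passing a plain dict to `from_env` keeps the class easy to unit test; the Pydantic version planned for Day 1 would replace the manual parsing and range check with declared field types and validators.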
Day 2: Data Models & Items
- Primary Focus: Define structured data models
- Deliverables: Pydantic models (`Course`, `Chapter`, `Session`), Scrapy Items
- Tasks:
  - Implement Pydantic models with validation (3h)
  - Implement Scrapy Items (2h)
  - Unit tests for model validation (3h)
- Dependencies: Day 1 settings module
- Review Points: Models validate sample OCW API responses
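The shape of the `Course`, `Chapter`, and `Session` models might look like the sketch below. It uses stdlib dataclasses with `__post_init__` checks as a stand-in for Pydantic validation, and the fields (`course_id`, `title`, `file_url`) are guesses for illustration, since the actual OCW response schema is not described here.

```python
from dataclasses import dataclass, field
from typing import List
from urllib.parse import urlparse


@dataclass
class Session:
    title: str
    file_url: str  # assumed field: link to the downloadable file

    def __post_init__(self) -> None:
        if not self.title.strip():
            raise ValueError("session title must not be empty")
        parsed = urlparse(self.file_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an http(s) URL: {self.file_url!r}")


@dataclass
class Chapter:
    title: str
    sessions: List[Session] = field(default_factory=list)


@dataclass
class Course:
    course_id: int  # assumed to be a positive integer
    title: str
    chapters: List[Chapter] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.course_id <= 0:
            raise ValueError("course_id must be positive")

    def session_count(self) -> int:
        """Total number of sessions across all chapters."""
        return sum(len(chapter.sessions) for chapter in self.chapters)
```

Rejecting bad data at construction time is what lets the Day 2 unit tests assert on a single exception type per invalid input.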
Day 3: Core Spider Implementation
- Primary Focus: Implement OCW Spider
- Deliverables: Working spider fetching metadata + sessions
- Tasks:
  - Implement `CourseSpider` (4h)
  - Handle request/response parsing (2h)
  - Basic error handling/logging (2h)
- Dependencies: Data models from Day 2
- Review Points: Spider outputs structured items for one course
Day 4: Pipelines (Validation & Storage)
- Primary Focus: Validate, deduplicate, and structure files
- Deliverables: ValidationPipeline, DuplicatesPipeline, FileSystemPipeline
- Tasks:
  - Implement ValidationPipeline (2h)
  - Implement DuplicatesPipeline (3h)
  - Implement FileSystemPipeline (3h)
- Dependencies: Spider outputs (Day 3)
- Review Points: Files stored in correct structure with validation
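One possible shape for the DuplicatesPipeline is sketched below. It follows the `process_item(item, spider)` method that Scrapy item pipelines expose, but runs without Scrapy installed: `DuplicateItem` is a local stand-in for `scrapy.exceptions.DropItem`, and treating the item's `file_url` as its identity is an assumption.

```python
import hashlib


class DuplicateItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class DuplicatesPipeline:
    """Drops items whose fingerprint has already been seen in this run."""

    def __init__(self) -> None:
        self._seen = set()

    @staticmethod
    def fingerprint(item: dict) -> str:
        # Assumption: the file URL uniquely identifies a download.
        url = item["file_url"].strip()
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def process_item(self, item: dict, spider=None) -> dict:
        fp = self.fingerprint(item)
        if fp in self._seen:
            raise DuplicateItem(f"duplicate file_url: {item['file_url']}")
        self._seen.add(fp)
        return item
```

The set lives in memory, so duplicates are only detected within one run; detecting files already downloaded in earlier runs would need a persisted index or a check against the output directory.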
Day 5: Download & Progress Tracking
- Primary Focus: File downloads + progress monitoring
- Deliverables: DownloadPipeline, ProgressPipeline, Extensions
- Tasks:
  - Implement DownloadPipeline extending Scrapy's FilesPipeline (3h)
  - Implement ProgressPipeline + Extension (3h)
  - Test downloads with sample courses (2h)
- Dependencies: Pipelines from Day 4
- Review Points: Files download correctly with progress logs
Day 6: Middleware & Error Handling
- Primary Focus: Robust request/response handling
- Deliverables: HeadersMiddleware, RetryMiddleware, ErrorHandler
- Tasks:
  - Implement HeadersMiddleware for API (2h)
  - Implement RetryMiddleware with backoff (3h)
  - Implement ErrorHandlerMiddleware + ResponseValidator (3h)
- Dependencies: Spider from Day 3
- Review Points: Handles retries, logs errors gracefully
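The backoff part of the retry middleware comes down to a delay schedule. The function below is a minimal sketch of capped exponential backoff with optional full jitter; the name `backoff_delay` and the default `base` and `cap` values are illustrative, and wiring the delay into a Scrapy downloader middleware is not shown.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the wait in seconds before retry number `attempt` (0-based).

    The delay doubles with each attempt and is capped at `cap`. When an
    `rng` is given, full jitter is applied: the result is drawn uniformly
    from [0, delay] so that parallel retries do not synchronize.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    delay = min(cap, base * (2 ** attempt))
    if rng is not None:
        delay = rng.uniform(0.0, delay)
    return delay
```

With the defaults, attempts 0 through 3 wait 1, 2, 4, and 8 seconds, and anything from attempt 6 onward is held at the 60-second cap.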
Day 7: Testing, Docs & Release
- Primary Focus: Final polish, testing, documentation
- Deliverables: CLI runner, docs, GitHub milestone closure
- Tasks:
  - Implement CLI `runner.py` (2h)
  - Integration tests on sample course (2h)
  - Write docs: setup + usage (2h)
  - Sprint review & GitHub release (2h)
- Dependencies: All previous work
- Review Points: MVP ready, can fetch and download a full course
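A starting point for `runner.py` could be an `argparse` front end like the one below. The positional `course_id` argument and the `--output` and `--retries` flags are hypothetical; the roadmap does not specify the CLI surface. Starting the crawl itself (for example through Scrapy's `CrawlerProcess`) is left out so the sketch stays runnable on its own.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Define the (assumed) command-line interface of the runner."""
    parser = argparse.ArgumentParser(
        prog="runner.py",
        description="Download one Sharif OCW course.",
    )
    parser.add_argument("course_id", type=int, help="numeric course identifier")
    parser.add_argument(
        "-o", "--output", type=Path, default=Path("downloads"),
        help="directory where files are stored",
    )
    parser.add_argument(
        "--retries", type=int, default=3,
        help="maximum retry attempts per request",
    )
    return parser


def main(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse arguments; a real runner would configure and start the spider here."""
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    print(main())
```

Accepting `argv` as a parameter lets the Day 7 integration tests call `main([...])` directly instead of spawning a subprocess.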
GitHub Integration
- Milestones:
  - `v0.1-MVP` → Completed after Sprint (Day 7)
- Issue Templates:
  - `bug_report.md`
  - `feature_request.md`
- Labels:
  - `must-have`, `should-have`, `could-have`, `won't-have`
  - `bug`, `enhancement`, `documentation`
- Branch Strategy:
  - `main` → stable
  - `dev` → active development
  - `feature/*` → per feature
- PR Workflow:
  - Open PR → CI checks → Self-review → Merge to `dev` → Periodic merge to `main`
Risk Mitigation
- Risk: API endpoint changes
  - Plan: Abstract URLs/settings in config, version pinning
- Risk: Large file downloads (timeouts)
  - Plan: Retry middleware + configurable timeout
- Risk: File name collisions / OS issues
  - Plan: Path sanitization utility + max length enforcement
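The path sanitization plan can be made concrete with a small helper. The sketch below is one way to do it, not the project's utility: the function name, the 100-character default limit, the `_` replacement, and the `untitled` fallback are all assumptions. It replaces characters that Windows forbids in file names, guards against reserved device names, and truncates the stem while keeping the extension.

```python
import os
import re
import unicodedata

# Characters rejected by Windows file systems, plus ASCII control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Device names that Windows reserves regardless of extension.
_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def sanitize_filename(name: str, max_length: int = 100, replacement: str = "_") -> str:
    """Return a file name that is safe on common file systems.

    Length is counted in characters, not bytes, which is a simplification.
    """
    name = unicodedata.normalize("NFC", name)
    name = _FORBIDDEN.sub(replacement, name)
    # Collapse whitespace runs and drop leading/trailing spaces and dots.
    name = re.sub(r"\s+", " ", name).strip(" .")
    stem, ext = os.path.splitext(name)
    if not stem:
        stem = "untitled"
    if stem.upper() in _RESERVED:
        stem = replacement + stem
    # Truncate the stem so that stem + extension fits in max_length.
    keep = max(1, max_length - len(ext))
    return stem[:keep].rstrip(" .") + ext
```

For example, `sanitize_filename("con.txt")` yields `_con.txt`. Non-ASCII titles are preserved, since only forbidden characters are replaced. Collisions between distinct titles that sanitize to the same name still need separate handling, such as appending a counter or a short hash.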
MoSCoW Prioritization
- Must Have: Basic spider, pipelines, file downloads, progress
- Should Have: Retry logic, duplicates detection, CLI runner
- Could Have: Advanced stats, pretty dashboards, parallel course downloads
- Won't Have: GUI interface (out of scope for MVP)
Mermaid Sprint Timeline
```mermaid
gantt
    title 1-Week Roadmap: Sharif OCW Scrapy's Downloader
    dateFormat YYYY-MM-DD
    section Setup
    Day 1 - Setup Repo & Settings          :done, 2025-08-27, 1d
    section Core Development
    Day 2 - Data Models & Items            :active, 2025-08-28, 1d
    Day 3 - Spider Implementation          :2025-08-29, 1d
    Day 4 - Pipelines (Validation+Storage) :2025-08-30, 1d
    Day 5 - Download & Progress Tracking   :2025-08-31, 1d
    Day 6 - Middleware & Error Handling    :2025-09-01, 1d
    section Release
    Day 7 - Testing, Docs & Release        :2025-09-02, 1d
```
Branch-name mapping (one branch per task)
Format: Task → `branch-name`
Each branch name follows git-flow conventions and the stated constraints.
Day 1: Setup & Foundations (8h)
- Create repo + initial README.md (1h) → `init-repo-readme`
- Initialize Python project (`pyproject.toml`) and virtualenv (1h) → `init-python-project`
- Add requirements.txt and basic CI (`.github/workflows/ci.yml`) (2h) → `add-ci-requirements`
- Implement `OCWSettings` Pydantic wrapper + `.env` example (4h) → `add-ocwsettings-env`
Day 2: Data Models & Items (8h)
- Implement Pydantic models (3h) → `add-pydantic-models`
- Implement Scrapy Items (2h) → `add-scrapy-items`
- Unit tests for validation (3h) → `add-model-validation-tests`
Day 3: Core Spider Implementation (8h)
- Implement `start_requests` and `parse_metadata` (3h) → `spider-start-parse-metadata`
- Implement `parse_sessions` & item generation (3h) → `spider-parse-sessions`
- Logging and error handling hooks (2h) → `add-spider-logging`
Day 4: Pipelines (Validation + Storage) (8h)
- Write ValidationPipeline and tests (3h) → `add-validation-pipeline`
- Write DuplicatesPipeline with simple hashing (2.5h) → `add-duplicates-pipeline`
- Write FileSystemPipeline + PathSanitizer (2.5h) → `add-filesystem-pipeline`
Day 5: Download & Progress (8h)
- Implement FilesPipeline override & test small download (3h) → `add-download-pipeline`
- Implement ProgressTracker + integrate with pipelines (3h) → `add-progress-tracker`
- Manual integration test (2h) → `integration-manual-run`
Day 6: Middleware & Robustness (8h)
- Implement request headers and referer enforcement (2h) → `add-headers-middleware`
- Implement retry middleware with exponential backoff (3h) → `add-retry-middleware`
- Add response validation and tests (3h) → `add-response-validator`
Day 7: Tests, CLI & Docs (8h)
- Implement runner CLI (2h) → `add-runner-cli`
- Run integration test on sample course (2h) → `run-integration-sample`
- Finalize README + CONTRIBUTING (2h) → `update-readme-contributing`
- Tag & release to GitHub (2h) → `create-release-tag`
Compact table
| Day | Task (short) | Branch name |
|---|---|---|
| 1 | Init repo & README | `init-repo-readme` |
| 1 | Init Python project | `init-python-project` |
| 1 | Add requirements & CI | `add-ci-requirements` |
| 1 | OCWSettings + .env | `add-ocwsettings-env` |
| 2 | Pydantic models | `add-pydantic-models` |
| 2 | Scrapy Items | `add-scrapy-items` |
| 2 | Unit tests (models) | `add-model-validation-tests` |
| 3 | Spider: start & metadata | `spider-start-parse-metadata` |
| 3 | Spider: parse sessions | `spider-parse-sessions` |
| 3 | Spider logging & errors | `add-spider-logging` |
| 4 | ValidationPipeline | `add-validation-pipeline` |
| 4 | DuplicatesPipeline | `add-duplicates-pipeline` |
| 4 | FileSystemPipeline | `add-filesystem-pipeline` |
| 5 | DownloadPipeline | `add-download-pipeline` |
| 5 | ProgressTracker | `add-progress-tracker` |
| 5 | Manual integration test | `integration-manual-run` |
| 6 | Headers middleware | `add-headers-middleware` |
| 6 | Retry middleware | `add-retry-middleware` |
| 6 | Response validator | `add-response-validator` |
| 7 | Runner CLI | `add-runner-cli` |
| 7 | Integration test run | `run-integration-sample` |
| 7 | Docs: README & CONTRIBUTING | `update-readme-contributing` |
| 7 | Release tagging | `create-release-tag` |