OCW Downloader System Analysis Document
Executive Summary
The OCW Downloader System downloads OpenCourseWare materials and stores them in a predictable layout on the local filesystem. It calls three OCW endpoints to retrieve course metadata, the chapter and session hierarchy, and the binary session files, then writes each file to a path derived deterministically from that metadata. Repeated runs for the same course therefore produce the same human-readable directory tree, following the pattern course_title/chapter_sort__chapter_title/session_sort__session_title.ext.
System Overview
Core Components
The system architecture comprises four primary components:
User/CLI Interface
Entry point for system interaction
Accepts courseId as primary input parameter
Receives status updates and completion summaries
Downloader (Spider/Worker)
Central orchestration engine
Manages API communication sequencing
Handles error recovery and retry logic
Implements deterministic path generation algorithm
OCW API Suite
Course API: Provides course-level metadata (title, type)
Sessions API: Returns hierarchical content structure with sort ordering
Session Link: Binary content delivery endpoint
Local Storage (File System)
Persistent storage layer
Maintains hierarchical directory structure
Preserves content with deterministic naming convention
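The Downloader is described as handling error recovery and retry logic, but the document does not say how. The following is a minimal sketch of one common approach, retry with exponential backoff. The function name with_retry, the attempt count, and the delays are all assumptions made for illustration, not values taken from the system.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    action: Callable[[], T],
    attempts: int = 3,                       # assumed default, not from the source
    base_delay: float = 0.5,                 # assumed default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except OSError:
            # OSError covers urllib.error.URLError and socket timeouts.
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    raise AssertionError("unreachable")
```

Passing sleep as a parameter keeps the helper testable without real waiting. Only network-style errors are retried here; whether HTTP 4xx responses should be retried at all is a separate decision the document leaves open.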
Component Interaction Diagram
Figure: OCW Downloader system architecture.
Components: User/CLI; Downloader (Spider/Worker), the core component; OCW API Gateway (Course API, Sessions API, Session Link); Local Storage (File System).
Downloader notes: orchestrates entire workflow; implements retry logic; handles path generation.
OCW API Gateway notes: RESTful API endpoints; JSON request/response; binary content delivery.
Interactions:
User/CLI to Downloader: courseId
Downloader to Course API (POST /api/v1/ocw/course/get): request {"id": courseId}; response {title, type}
Downloader to Sessions API (POST /api/v1/ocw/sessions): request { "limit": null, "order_type": "ASC", "course_id": courseId, "status": ["free","non-free"] }; response chapters[] {title, sort, sessions[] {title, link, type, sort}}
Downloader to Session Link (GET /cms/ocw/session_link): GET session.link (per session); response binary content
Downloader to Local Storage: save as course_title/chapter_sort__chapter_title/session_sort__session_title.ext
Interaction Analysis
The architecture separates concerns cleanly: the Downloader orchestrates the workflow, the OCW APIs supply metadata and content, and local storage persists the result.
Key Interaction Patterns
Sequential Dependency Chain: Course metadata is retrieved before the session listing, because the course title names the root directory; these two calls form the critical path
Hierarchical Data Resolution: The Sessions API returns the complete chapter and session hierarchy in a single response, minimizing API calls
Parallel Download Capability: Individual session downloads are independent of each other, so they could be run in parallel
Deterministic Path Generation: Sort keys ensure consistent, reproducible filesystem organization across multiple executions
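Because session downloads are independent, they are a natural fit for a thread pool. The sketch below is one possible way to do this, not the system's actual implementation. The function download_all, the fetch callable, and the worker count are hypothetical names and values introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def download_all(
    links: List[str],
    fetch: Callable[[str], bytes],   # hypothetical: performs GET session.link
    max_workers: int = 4,            # assumed value, tune for the server
) -> Dict[str, Optional[bytes]]:
    """Fetch every link concurrently; a failed link maps to None."""

    def safe_fetch(link: str) -> Optional[bytes]:
        try:
            return fetch(link)
        except Exception:
            return None  # one failed session must not stop the others

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so they line up with links.
        results = list(pool.map(safe_fetch, links))
    return dict(zip(links, results))
```

Parallel fetching does not affect path determinism, since each file's path depends only on its metadata and not on the order in which downloads finish.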
Communication Protocols
Metadata APIs: JSON-based POST requests with structured payloads
Binary Endpoint: Simple GET requests with URL-based session identification
Error Handling: A failed session download is logged and skipped; the remaining sessions are still processed
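The two metadata requests can be sketched with the standard library alone. The endpoint paths and payload fields below are the ones shown in the architecture diagram. The host in BASE_URL and the Content-Type header are assumptions, since the document names neither. The functions only build the requests and send nothing.

```python
import json
import urllib.request
from typing import Union

BASE_URL = "https://ocw.example.org"  # hypothetical host, not from the source


def _json_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def course_request(course_id: Union[int, str]) -> urllib.request.Request:
    """Course API: returns {title, type} for the course."""
    return _json_post("/api/v1/ocw/course/get", {"id": course_id})


def sessions_request(course_id: Union[int, str]) -> urllib.request.Request:
    """Sessions API: returns chapters[] with nested sessions[]."""
    payload = {
        "limit": None,               # serialized as JSON null
        "order_type": "ASC",
        "course_id": course_id,
        "status": ["free", "non-free"],
    }
    return _json_post("/api/v1/ocw/sessions", payload)
```

A request built this way would be sent with urllib.request.urlopen(request), and the response body decoded with json.loads.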
Process Flow Analysis
Sequence Diagram
Figure: OCW Downloader process flow (sequence diagram).
Participants: Client; Downloader (Spider/Worker); Course API (POST /api/v1/ocw/course/get); Sessions API (POST /api/v1/ocw/sessions); Session Link (GET /cms/ocw/session_link); File System.
Initialization
[01] start(courseId)
Phase 1: Course Metadata Retrieval
[02] POST { "id": courseId }
Success (200 OK):
[03] { title: "Data Structures", type: "undergraduate" } (course metadata fetched for directory naming)
Error (4xx/5xx):
[04] 4xx/5xx
[05] ERROR: Course fetch failed
Phase 2: Content Hierarchy Discovery
[07] POST { "limit": null, "order_type": "ASC", "course_id": courseId, "status": ["free","non-free"] }
Success (200 OK):
[08] chapters[] { title, sort, sessions[] { title, link, type, sort } } (complete hierarchy retrieved in single call)
Error (4xx/5xx):
[09] 4xx/5xx
[10] ERROR: Sessions fetch failed
Phase 3: Content Download Execution (ordered processing)
Loop for each chapter (ascending by sort): create chapter directory if it does not exist
Loop for each session (ascending by sort):
[12] GET session.link
Success (200 OK):
[13] content bytes
[14] write course_title/chapter_sort__chapter_title/session_sort__session_title.ext (path deterministically generated from metadata)
[15] write confirmation
Error (4xx/5xx):
[16] 4xx/5xx
[17] WARN: Session skipped (continue with next session)
Completion
[18] done(summary: { total_sessions: 47, successful_downloads: 45, failed_downloads: 2, target_paths: "./Data_Structures/" })
Process Flow Characteristics
Three-Phase Execution Model:
Phase 1: Course metadata acquisition (blocking)
Phase 2: Session hierarchy retrieval (blocking)
Phase 3: Content download (a failed session does not block the others)
Error Recovery Strategy:
Critical failures (Phases 1-2): Terminate execution
Non-critical failures (Phase 3): Log a warning, skip the session, and continue
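The three phases and their two failure policies can be sketched as a single control loop. This is an illustration under stated assumptions rather than the system's code. The names run_download, CriticalError, and the three injected callables are hypothetical. The summary keys mirror those in the sequence diagram's completion message.

```python
import logging
from typing import Any, Callable, Dict, List


class CriticalError(RuntimeError):
    """Raised when Phase 1 or Phase 2 fails and the run must stop."""


def run_download(
    course_id: Any,
    fetch_course: Callable[[Any], Dict[str, Any]],          # Phase 1 (hypothetical)
    fetch_chapters: Callable[[Any], List[Dict[str, Any]]],  # Phase 2 (hypothetical)
    download_session: Callable[[dict, dict, dict], None],   # Phase 3 (hypothetical)
) -> Dict[str, int]:
    # Phases 1 and 2 are critical: any failure terminates execution.
    try:
        course = fetch_course(course_id)
        chapters = fetch_chapters(course_id)
    except Exception as exc:
        raise CriticalError(f"metadata fetch failed: {exc}") from exc

    summary = {"total_sessions": 0, "successful_downloads": 0, "failed_downloads": 0}
    # Phase 3: process chapters, then sessions, in ascending sort order.
    for chapter in sorted(chapters, key=lambda c: c["sort"]):
        for session in sorted(chapter["sessions"], key=lambda s: s["sort"]):
            summary["total_sessions"] += 1
            try:
                download_session(course, chapter, session)
                summary["successful_downloads"] += 1
            except Exception as exc:
                # Non-critical: log, count the failure, and move on.
                logging.warning("Session skipped: %s (%s)", session["title"], exc)
                summary["failed_downloads"] += 1
    return summary
```

Injecting the three callables keeps the control flow independent of the HTTP and filesystem details, so each failure policy can be exercised with simple fakes.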
Data Model Analysis
API Data Model Diagram
Figure: OCW API data model.
Course (entity)
title: String [NOT NULL]
Constraints: title used for root directory
Notes: primary entity; identified by external courseId; retrieved via Course API; forms root of storage hierarchy
Chapter (entity)
title: String [NOT NULL]
sort: Integer [UNIQUE per course]
Constraints: sort determines processing order; sort used in directory naming; no direct content storage
Notes: organizational container; groups related sessions; sort-prefixed directory naming; no direct downloadable content
Session (entity)
title: String [NOT NULL]
link: URL [NOT NULL]
ext: String
sort: Integer [UNIQUE per chapter]
Constraints: link points to binary content; sort ensures consistent ordering; ext derived from link
Notes: content unit; atomic downloadable resource; binary content via link URL; sort-prefixed file naming
Relationship: contains (1:N)
Storage Path Generation Algorithm:
Path = {course.title}/{chapter.sort}__{chapter.title}/{session.sort}__{session.title}.{ext}
Example: "Introduction to Python/01__Getting Started/01__Installation Guide.pdf"
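The storage path algorithm can be written as a small pure function. The sketch below rests on two assumptions: sort values are zero-padded to two digits, which is inferred only from the "01__" prefixes in the example, and the extension is taken from the path portion of the link URL, following the "ext derived from link" constraint. The function names and the sample URL are hypothetical. Sanitizing titles for the filesystem is deliberately left out.

```python
import posixpath
from pathlib import PurePosixPath
from urllib.parse import urlparse


def extension_from_link(link: str) -> str:
    """Return the extension of the URL path, e.g. '.pdf', or '' if none."""
    return posixpath.splitext(urlparse(link).path)[1]


def session_path(
    course_title: str,
    chapter_sort: int,
    chapter_title: str,
    session_sort: int,
    session_title: str,
    link: str,
) -> PurePosixPath:
    """Build course_title/chapter_sort__chapter_title/session_sort__session_title.ext."""
    # Two-digit zero padding is an assumption based on the document's example.
    chapter_dir = f"{chapter_sort:02d}__{chapter_title}"
    file_name = f"{session_sort:02d}__{session_title}{extension_from_link(link)}"
    return PurePosixPath(course_title) / chapter_dir / file_name


# Reproduces the example path from the data model diagram.
example = session_path(
    "Introduction to Python", 1, "Getting Started", 1, "Installation Guide",
    "https://ocw.example.org/cms/ocw/session_link/guide.pdf?token=abc",  # made-up URL
)
print(example)  # Introduction to Python/01__Getting Started/01__Installation Guide.pdf
```

Because the function depends only on its arguments, the same metadata always yields the same path, which is what makes repeated runs reproducible. Zero padding also keeps lexicographic and numeric ordering aligned for up to 99 items per level.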
Data Model Insights
Schema Characteristics
Course Schema
Key attributes: title (directory naming)
Chapter Schema
Organizational unit without direct content
Sort attribute ensures deterministic ordering
Relationship: Children (Sessions)
Session Schema
Atomic content unit with downloadable resource
Link attribute provides content access URL
Sort attribute maintains consistent ordering within chapter
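The three schemas map naturally onto small immutable record types. The sketch below assumes the Sessions API response has the shape shown in the diagrams, chapters[] {title, sort, sessions[] {title, link, type, sort}}. The class and function names are illustrative, and real responses may carry additional fields.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Session:
    title: str
    link: str   # URL of the binary content
    type: str
    sort: int   # ordering within the chapter


@dataclass(frozen=True)
class Chapter:
    title: str
    sort: int   # ordering within the course
    sessions: List[Session] = field(default_factory=list)


@dataclass(frozen=True)
class Course:
    title: str  # names the root directory
    type: str


def parse_chapters(raw: List[Dict[str, Any]]) -> List[Chapter]:
    """Convert the decoded chapters[] payload into sorted Chapter objects."""
    chapters = [
        Chapter(
            title=item["title"],
            sort=item["sort"],
            sessions=sorted(
                (Session(s["title"], s["link"], s["type"], s["sort"])
                 for s in item["sessions"]),
                key=lambda s: s.sort,
            ),
        )
        for item in raw
    ]
    return sorted(chapters, key=lambda c: c.sort)
```

Sorting once at parse time means later stages can iterate in order without depending on the order_type the server happened to apply.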
Data Integrity Considerations
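Sort values must be unique within their scope: chapter sort values within a course, and session sort values within a chapter. Duplicate sort values would make the ordering, and possibly the generated paths, ambiguous
Generated paths must be valid on the target filesystem, so the path generation algorithm has to account for titles containing characters the filesystem does not accept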
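Both integrity considerations can be checked with a few lines of code. The document does not specify how the system sanitizes names or reacts to duplicate sort values, so the sketch below is only one plausible approach. The character set treated as unsafe, the underscore replacement, the "untitled" fallback, and both function names are assumptions.

```python
import re
from collections import Counter
from typing import Iterable, List

# Characters rejected by common filesystems (Windows is the strictest),
# plus ASCII control characters. This set is an assumption.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_name(title: str) -> str:
    """Replace unsafe characters and trim leading/trailing dots and spaces."""
    cleaned = _UNSAFE.sub("_", title).strip(" .")
    return cleaned or "untitled"  # never return an empty path component


def duplicate_sorts(sort_values: Iterable[int]) -> List[int]:
    """Return the sort values that occur more than once, in ascending order."""
    counts = Counter(sort_values)
    return sorted(value for value, n in counts.items() if n > 1)
```

A caller could run duplicate_sorts over the chapter sort values of a course, and over the session sort values of each chapter, before any download starts, then fail fast or disambiguate if the result is non-empty.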