
OCW Downloader System Analysis Document

Executive Summary

The OCW Downloader System downloads OpenCourseWare (OCW) materials and stores them in an organized local directory tree. It calls the OCW APIs to retrieve course metadata, the chapter and session hierarchy, and the binary session files, then writes each file to a deterministic path. Downloads are therefore repeatable, and the directory layout is human-readable, following the pattern: course_title/chapter_sort__chapter_title/session_sort__session_title.ext.

System Overview

Core Components

The system architecture comprises four primary components:

  1. User/CLI Interface
    • Entry point for system interaction
    • Accepts courseId as primary input parameter
    • Receives status updates and completion summaries
  2. Downloader (Spider/Worker)
    • Central orchestration engine
    • Manages API communication sequencing
    • Handles error recovery and retry logic
    • Implements deterministic path generation algorithm
  3. OCW API Suite
    • Course API: Provides course-level metadata (title, type)
    • Sessions API: Returns hierarchical content structure with sort ordering
    • Session Link: Binary content delivery endpoint
  4. Local Storage (File System)
    • Persistent storage layer
    • Maintains hierarchical directory structure
    • Preserves content with deterministic naming convention
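
The document says the Downloader handles error recovery and retry logic but does not describe the policy. The following is a minimal sketch of one common choice, exponential backoff. The function name with_retries, the attempt count, and the delays are all assumptions made for illustration, not part of the described system.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on any exception with exponential backoff.

    Assumption: the attempt count and delay schedule are invented here;
    the source document does not specify them.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so the helper can be exercised without real waiting. A real downloader would probably retry only on transient errors (timeouts, 5xx responses) rather than on every exception.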

Component Interaction Diagram

Figure: OCW Downloader, System Architecture

  • User/CLI → Downloader: courseId
  • Downloader → Course API (POST /api/v1/ocw/course/get): request {"id": courseId}; response {title, type}
  • Downloader → Sessions API (POST /api/v1/ocw/sessions): request {"limit": null, "order_type": "ASC", "course_id": courseId, "status": ["free","non-free"]}; response chapters[] {title, sort, sessions[] {title, link, type, sort}}
  • Downloader → Session Link (GET /cms/ocw/session_link): GET session.link (per session); response binary content
  • Downloader → Local Storage (File System): save as course_title/chapter_sort__chapter_title/session_sort__session_title.ext

Diagram notes:

  • Downloader («core»): orchestrates entire workflow, implements retry logic, handles path generation
  • OCW API Gateway: RESTful API endpoints, JSON request/response, binary content delivery

Interaction Analysis

The system follows a service-oriented design with clear separation of concerns: the Downloader orchestrates the workflow, the OCW APIs supply metadata and content, and Local Storage persists the results.

Key Interaction Patterns

  1. Sequential Dependency Chain: Course metadata must be retrieved before session listing, establishing a critical path for data acquisition
  2. Hierarchical Data Resolution: The Sessions API provides complete navigational structure in a single response, minimizing API calls
  3. Parallel Download Capability: Individual session downloads are independent, enabling potential parallelization
  4. Deterministic Path Generation: Sort keys ensure consistent, reproducible filesystem organization across multiple executions
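
Because session downloads are independent of one another, they could be spread across a small thread pool. The sketch below is one possible way to do that, not a description of the actual Downloader. The function name download_all, the injected fetch callable, and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional, Tuple


def download_all(
    links: Iterable[str],
    fetch: Callable[[str], bytes],
    max_workers: int = 4,
) -> Tuple[Dict[str, bytes], Dict[str, Exception]]:
    """Fetch every link concurrently; return (successes, failures) keyed by link."""
    successes: Dict[str, bytes] = {}
    failures: Dict[str, Exception] = {}

    def attempt(link: str) -> Tuple[str, Optional[bytes], Optional[Exception]]:
        # Catch per-link errors so one failed session cannot abort the others.
        try:
            return link, fetch(link), None
        except Exception as exc:
            return link, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, whatever the completion order.
        for link, content, error in pool.map(attempt, list(links)):
            if error is None:
                successes[link] = content
            else:
                failures[link] = error
    return successes, failures
```

Since paths are derived from sort keys and titles rather than from completion order, running downloads in parallel would not change where each file ends up.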

Communication Protocols

  • Metadata APIs: JSON-based POST requests with structured payloads
  • Binary Endpoint: Simple GET requests with URL-based session identification
  • Error Handling: A failed session download is logged and skipped; the run continues with the next session
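
As an illustration of the two metadata calls, the requests might be built with Python's standard library as follows. The paths and payload fields are taken from the architecture diagram; BASE_URL is a hypothetical host, since the document gives only paths, and the helper names are invented for this sketch. The requests are constructed but not sent.

```python
import json
import urllib.request

# Hypothetical host: the source document lists only the endpoint paths.
BASE_URL = "https://ocw.example.org"


def json_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given path."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def course_request(course_id) -> urllib.request.Request:
    """Course API call: returns {title, type} according to the diagram."""
    return json_post("/api/v1/ocw/course/get", {"id": course_id})


def sessions_request(course_id) -> urllib.request.Request:
    """Sessions API call: returns the chapters[] hierarchy in one response."""
    return json_post(
        "/api/v1/ocw/sessions",
        {
            "limit": None,  # serialized as JSON null
            "order_type": "ASC",
            "course_id": course_id,
            "status": ["free", "non-free"],
        },
    )
```

Passing either object to urllib.request.urlopen would send it. The binary endpoint needs no payload: each session.link is fetched with a plain GET.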

Process Flow Analysis

Sequence Diagram

Figure: OCW Downloader, Process Flow (sequence diagram)

Participants: Client, Downloader (Spider/Worker), Course API (POST /api/v1/ocw/course/get), Sessions API (POST /api/v1/ocw/sessions), Session Link (GET /cms/ocw/session_link), File System

Initialization

  • [01] start(courseId)

Phase 1: Course Metadata Retrieval (Fetch Course Metadata)

  • [02] POST { "id": courseId }
  • Success [200 OK]:
    • [03] { title: "Data Structures", type: "undergraduate" } (course metadata fetched for directory naming)
  • Error [4xx/5xx]:
    • [04] 4xx/5xx
    • [05] ERROR: Course fetch failed
    • [06] (unlabeled in the diagram)

Phase 2: Content Hierarchy Discovery (Fetch Chapter/Session Hierarchy)

  • [07] POST {"limit": null, "order_type": "ASC", "course_id": courseId, "status": ["free","non-free"]}
  • Success [200 OK]:
    • [08] chapters[] { title, sort, sessions[] { title, link, type, sort } } (complete hierarchy retrieved in single call)
  • Error [4xx/5xx]:
    • [09] 4xx/5xx
    • [10] ERROR: Sessions fetch failed
    • [11] (unlabeled in the diagram)

Phase 3: Content Download Execution (Download Sessions, Ordered Processing)

  • Loop: for each chapter (ascending by sort); create chapter directory if not exists
    • Loop: for each session (ascending by sort)
      • [12] GET session.link
      • Success [200 OK]:
        • [13] content bytes
        • [14] write course_title/chapter_sort__chapter_title/session_sort__session_title.ext
        • [15] write confirmation (path deterministically generated from metadata)
      • Error [4xx/5xx]:
        • [16] 4xx/5xx
        • [17] WARN: Session skipped (continue with next session)

Completion

  • [18] done(summary: {total_sessions: 47, successful_downloads: 45, failed_downloads: 2, target_paths: "./Data_Structures/"})

Process Flow Characteristics

  1. Three-Phase Execution Model:

    • Phase 1: Course metadata acquisition (blocking)
    • Phase 2: Session hierarchy retrieval (blocking)
    • Phase 3: Content download (non-blocking per session)
  2. Error Recovery Strategy:

    • Critical failures (Phases 1-2): Terminate execution
    • Non-critical failures (Phase 3): Log and continue
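
The three-phase model and its two failure classes can be sketched as a single orchestration function. This is an illustrative outline only: run_download and the three injected callables are hypothetical names, and the summary keys mirror the completion message in the sequence diagram.

```python
import logging
from typing import Callable, Dict, List


def run_download(
    course_id,
    fetch_course: Callable[[object], dict],
    fetch_chapters: Callable[[object], List[dict]],
    download_session: Callable[[dict, dict, dict], None],
) -> Dict[str, int]:
    """Run the three phases and return a summary of session outcomes."""
    # Phases 1 and 2 are critical: any exception propagates and ends the run.
    course = fetch_course(course_id)
    chapters = fetch_chapters(course_id)

    summary = {"total_sessions": 0, "successful_downloads": 0, "failed_downloads": 0}

    # Phase 3: process chapters and sessions in ascending sort order.
    for chapter in sorted(chapters, key=lambda c: c["sort"]):
        for session in sorted(chapter["sessions"], key=lambda s: s["sort"]):
            summary["total_sessions"] += 1
            try:
                download_session(course, chapter, session)
                summary["successful_downloads"] += 1
            except Exception as exc:
                # Non-critical failure: log it and continue with the next session.
                summary["failed_downloads"] += 1
                logging.warning("Session skipped: %s (%s)", session["title"], exc)
    return summary
```

Injecting the three callables keeps the control flow separate from networking and file I/O, so the error-handling rules can be tested with in-memory fakes.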

Data Model Analysis

API Data Model Diagram

Figure: OCW API Data Model

  1. Course («Entity»): Primary Entity
    • title: String [NOT NULL]
    • Constraints: title used for root directory
    • Identified by external courseId; retrieved via Course API; forms root of storage hierarchy
  2. Chapter («Entity»): Organizational Container
    • title: String [NOT NULL]
    • sort: Integer [UNIQUE per course]
    • Constraints: sort determines processing order; sort used in directory naming; no direct content storage
    • Groups related sessions; sort-prefixed directory naming; no direct downloadable content
  3. Session («Entity»): Content Unit
    • title: String [NOT NULL]
    • link: URL [NOT NULL]
    • ext: String
    • sort: Integer [UNIQUE per chapter]
    • Constraints: link points to binary content; sort ensures consistent ordering; ext derived from link
    • Atomic downloadable resource; binary content via link URL; sort-prefixed file naming

Relationship: contains (1:N)

Storage Path Generation Algorithm:

Path = {course.title}/{chapter.sort}__{chapter.title}/{session.sort}__{session.title}.{ext}

Example: "Introduction to Python/01__Getting Started/01__Installation Guide.pdf"

Data Model Insights

Schema Characteristics

  1. Course Schema
    • Key attributes: title (directory naming)
  2. Chapter Schema
    • Organizational unit without direct content
    • Sort attribute ensures deterministic ordering
    • Relationship: Children (Sessions)
  3. Session Schema
    • Atomic content unit with downloadable resource
    • Link attribute provides content access URL
    • Sort attribute maintains consistent ordering within chapter
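
One way to represent these schemas in code is with small immutable records, plus a parser for the chapters[] structure returned by the Sessions API. The class and function names below are assumptions for illustration; the field names follow the response shape shown in the diagrams.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Session:
    title: str
    link: str   # URL of the binary content
    sort: int   # ordering within the parent chapter
    type: str = ""


@dataclass(frozen=True)
class Chapter:
    title: str
    sort: int   # ordering within the course
    sessions: List[Session] = field(default_factory=list)


def parse_chapters(raw_chapters: List[dict]) -> List[Chapter]:
    """Convert raw chapters[] dictionaries into sorted Chapter records."""
    chapters = [
        Chapter(
            title=raw["title"],
            sort=int(raw["sort"]),
            sessions=sorted(
                (
                    Session(s["title"], s["link"], int(s["sort"]), s.get("type", ""))
                    for s in raw.get("sessions", [])
                ),
                key=lambda s: s.sort,
            ),
        )
        for raw in raw_chapters
    ]
    # Sort locally as well, rather than relying only on order_type "ASC".
    return sorted(chapters, key=lambda c: c.sort)
```

Sorting on the client side is a defensive choice: it keeps the processing order deterministic even if the API returns items in a different order.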

Data Integrity Considerations

  • Sort values must be unique within their scope (course/chapter)
  • Path generation must yield filesystem-compatible names, so titles containing reserved characters (for example "/") need sanitizing before use
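
Both considerations can be sketched in a few lines. The document does not say how titles are sanitized, how sort values are padded, or how duplicates are detected, so the character set replaced below, the two-digit padding (inferred from the "01__" example), and all function names are assumptions.

```python
import posixpath
import re
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, List
from urllib.parse import urlparse

# Assumed set of characters to replace: reserved on common filesystems.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_name(title: str) -> str:
    """Replace reserved characters and trim surrounding spaces and trailing dots."""
    cleaned = _UNSAFE.sub("_", title).strip().rstrip(".")
    return cleaned or "untitled"


def extension_from_link(link: str) -> str:
    """Return the extension of the URL path without the dot, or '' if none."""
    return posixpath.splitext(urlparse(link).path)[1].lstrip(".")


def build_path(course_title: str, chapter_sort: int, chapter_title: str,
               session_sort: int, session_title: str, link: str,
               width: int = 2) -> PurePosixPath:
    """Build course/chapter_sort__chapter/session_sort__session.ext."""
    filename = f"{session_sort:0{width}d}__{safe_name(session_title)}"
    ext = extension_from_link(link)
    if ext:
        filename += f".{ext}"
    return PurePosixPath(
        safe_name(course_title),
        f"{chapter_sort:0{width}d}__{safe_name(chapter_title)}",
        filename,
    )


def duplicate_sorts(sort_values: Iterable[int]) -> List[int]:
    """Return sort values that occur more than once within one scope."""
    counts = Counter(sort_values)
    return sorted(value for value, n in counts.items() if n > 1)
```

Checking duplicate_sorts for each course and each chapter before downloading would catch collisions early, since two sessions with the same sort and title would otherwise map to the same file.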