Course → Module 10: Batch Processing & Scale
Session 3 of 8

The Speed Problem

A single API call takes 3 to 15 seconds depending on the model, the prompt length, and the output length. A three-agent chain takes 9 to 45 seconds. Run 100 chains sequentially and you are waiting 15 to 75 minutes. That does nothing to content quality, but it is a real bottleneck for your time: you sit idle, watching a progress bar, when you could be reviewing outputs.

Concurrency solves this. Instead of running one chain at a time, you run multiple chains simultaneously. Total processing time drops from N * time_per_chain to roughly (N / concurrency_limit) * time_per_chain.

How asyncio Works

Python's asyncio library lets you run multiple tasks concurrently without threads or processes. The core concept: when one task is waiting for an API response (which takes seconds), another task can start its API call. No task blocks the others.

Three pieces make this work:

| Component | What It Does | Analogy |
| --- | --- | --- |
| async def | Declares a function that can pause and resume | A kitchen station that can pause mid-recipe while waiting for the oven |
| await | Pauses the function until a result arrives | "Wait for the oven" without blocking other stations |
| asyncio.gather | Runs multiple async functions concurrently | Running 5 kitchen stations at once, each cooking a different dish |
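The three components above fit together in a few lines. A minimal sketch, using asyncio.sleep as a stand-in for a real API call:

```python
import asyncio
import time

async def fake_api_call(name: str, seconds: float) -> str:
    # Stand-in for a real API request: await pauses this task,
    # letting other tasks run while the "response" is pending.
    await asyncio.sleep(seconds)
    return f"{name} done"

async def main() -> None:
    start = time.perf_counter()
    # gather starts all three calls concurrently; total time is
    # close to the slowest single call, not the sum of all three.
    results = await asyncio.gather(
        fake_api_call("research", 1.0),
        fake_api_call("write", 1.0),
        fake_api_call("edit", 1.0),
    )
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.1f}s")  # ~1s total, not ~3s

asyncio.run(main())
```

The three awaits overlap, so wall-clock time is roughly the longest single call. With real API calls you would await an HTTP request instead of asyncio.sleep, but the shape is identical.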

The Semaphore Pattern

Running 100 API calls simultaneously will trigger rate limits and may overwhelm your system. A semaphore caps the number of concurrent operations.

flowchart TD
    A["100 Tasks Queued"] --> B["Semaphore Gate (max 10 concurrent)"]
    B --> C["10 tasks running"]
    C --> D["Task completes"]
    D --> E["Semaphore releases slot"]
    E --> F["Next queued task enters"]
    F --> C
    D --> G["Result saved"]

The right semaphore value depends on your API rate limits. Each chain issues its requests one after another, so a semaphore of 10 caps you at 10 requests in flight and roughly 30 requests per chain cycle. If your provider allows 60 requests per minute, that stays within limits as long as a cycle takes at least 30 seconds.
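The gate in the diagram is a few lines of code. A minimal sketch that also tracks peak concurrency to show the cap holding (the sleep stands in for an API call):

```python
import asyncio

async def call_api(task_id: int, semaphore: asyncio.Semaphore,
                   counter: dict) -> int:
    async with semaphore:
        # At most `limit` tasks are inside this block at once;
        # the rest wait at the semaphore for a free slot.
        counter["active"] += 1
        counter["peak"] = max(counter["peak"], counter["active"])
        await asyncio.sleep(0.05)  # stand-in for the API call
        counter["active"] -= 1
        return task_id

async def run_batch(n: int, limit: int) -> tuple[list[int], int]:
    semaphore = asyncio.Semaphore(limit)
    counter = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(call_api(i, semaphore, counter) for i in range(n))
    )
    return results, counter["peak"]

results, peak = asyncio.run(run_batch(25, 10))
print(f"{len(results)} tasks finished, peak concurrency {peak}")
```

All 25 tasks are queued at once, but the `async with semaphore` line admits only 10 at a time; each completion releases a slot for the next waiting task.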

Real Performance Numbers

Tested with a three-agent chain (research, write, edit) generating 1,000-word articles. Average chain time: 25 seconds.

| Batch Size | Sequential | 5 Concurrent | 10 Concurrent | 25 Concurrent |
| --- | --- | --- | --- | --- |
| 10 articles | 4 min 10s | 1 min 15s | 40s | 30s |
| 25 articles | 10 min 25s | 2 min 45s | 1 min 30s | 40s |
| 50 articles | 20 min 50s | 5 min 25s | 2 min 55s | 1 min 15s |
| 100 articles | 41 min 40s | 10 min 50s | 5 min 45s | 2 min 30s |

At 25 concurrent chains, 100 articles generate in under 3 minutes. The bottleneck shifts entirely from generation to human review, which is where it should be.

Error Handling in Concurrent Batches

When one chain fails in a concurrent batch, it must not bring down the other chains. Wrap each chain in a try/except block. On failure, log the error and the failing row's ID. After the batch completes, retry only the failed rows.

async def safe_chain(topic, semaphore):
    async with semaphore:
        try:
            return await run_chain(topic)
        except Exception as e:
            # Log and return a failure record instead of raising,
            # so one bad chain never cancels the rest of the batch.
            log_error(topic, e)
            return {"id": topic.id, "status": "failed", "error": str(e)}

After batch completion, filter results for status "failed" and re-run only those rows. This gives you a complete batch with minimal wasted computation.
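The filter-and-retry loop can be sketched end to end. This is a self-contained version under assumed names: run_chain and log_error are simple stand-ins (the real ones come from your pipeline), topics are plain strings, and one topic is rigged to always fail so the retry path is visible:

```python
import asyncio

def log_error(topic, err):
    # Minimal stand-in for real logging.
    print(f"chain failed for {topic!r}: {err}")

async def run_chain(topic: str) -> dict:
    # Hypothetical chain; one topic always fails to exercise retries.
    if topic == "bad-topic":
        raise RuntimeError("model timeout")
    await asyncio.sleep(0.01)
    return {"id": topic, "status": "ok"}

async def safe_chain(topic: str, semaphore: asyncio.Semaphore) -> dict:
    async with semaphore:
        try:
            return await run_chain(topic)
        except Exception as e:
            log_error(topic, e)
            return {"id": topic, "status": "failed", "error": str(e)}

async def run_with_retries(topics: list[str], limit: int = 10,
                           max_retries: int = 2) -> list[dict]:
    semaphore = asyncio.Semaphore(limit)
    results = await asyncio.gather(*(safe_chain(t, semaphore) for t in topics))
    for _ in range(max_retries):
        # Collect only the rows that failed and re-run just those.
        failed_ids = [r["id"] for r in results if r["status"] == "failed"]
        if not failed_ids:
            break
        retried = await asyncio.gather(
            *(safe_chain(t, semaphore) for t in failed_ids)
        )
        fixed = {r["id"]: r for r in retried}
        # Replace each failed record with its retry result; keep the rest.
        results = [fixed.get(r["id"], r) for r in results]
    return results
```

Successful rows are never re-run, so a batch with a handful of transient failures costs only a handful of extra calls to complete.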

Connection Pooling

Each API call opens an HTTP connection. Opening 25 connections simultaneously and closing them after each call is wasteful. Connection pooling reuses connections across calls, reducing overhead. Python's aiohttp library handles this automatically with a session object.

Create one session at the start of the batch. Pass it to all chains. Close it when the batch completes. This single change can reduce per-call latency by 100 to 200 milliseconds, which compounds across hundreds of calls.
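The one-session-per-batch pattern looks like this with aiohttp. A sketch only: the endpoint URL and payload shape are placeholders for your provider's API, and call_model and run_pooled_batch are assumed names:

```python
import asyncio
import aiohttp

async def call_model(session: aiohttp.ClientSession, payload: dict) -> dict:
    # Hypothetical endpoint; the point is that `session` is shared,
    # so the underlying TCP connections are pooled and reused.
    async with session.post("https://api.example.com/v1/generate",
                            json=payload) as resp:
        return await resp.json()

async def run_pooled_batch(payloads: list[dict]) -> list[dict]:
    # One session for the whole batch, closed automatically at the end.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(call_model(session, p) for p in payloads)
        )
```

The session is created once, passed to every call, and closed when the `async with` block exits; the anti-pattern to avoid is constructing a new ClientSession inside call_model.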

Concurrency is a throughput multiplier, not a quality multiplier. Your pipeline produces the same output whether you run one chain or 25. Concurrency just means you wait 2 minutes instead of 40. Use the saved time for the stage that actually matters: human review.

Assignment

Write (or have your AI coding assistant write) a script that demonstrates concurrency:

  1. Make 10 API calls sequentially. Record total time.
  2. Make 10 API calls concurrently with a semaphore limiting to 5 simultaneous calls. Record total time.
  3. Calculate the speedup factor.

Then apply this to your real pipeline: run your batch manifest from Session 10.2 with concurrent processing. Compare the total batch time to sequential processing. Document any issues: rate limit errors, connection timeouts, or quality differences between sequential and concurrent outputs.