Downloading 73,000 Handwritten Cards — The Polite Way, Then the Fast Way

March 29, 2026 · Martijn Aslander

Yesterday I downloaded all 73,715 metadata records from the Luhmann Archive — the text, the cross-references, the structure. Nine API calls. About 100 MB. Then I built a database, ran a network analysis, mapped the cross-references, and published the results. All within a couple of hours.

But that was just the metadata. The actual scans — high-resolution JPEGs of every handwritten card — are 73,650 separate image files. Roughly 350 KB each. About 25 GB total. That's where things got interesting.

Attempt 1: The polite sequential download

The first script was respectful. One image at a time. Half a second delay between requests. Skip anything already downloaded. Resume where you left off.

import time
import requests

def download_one(image_id):
    dest = IMAGES_DIR / f"{image_id}.jpg"
    if dest.exists():
        return "skip"
    url = f"https://images.niklas-luhmann-archiv.de/image/{image_id}?size=2"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    dest.write_bytes(resp.content)
    time.sleep(0.5)  # stay polite: half a second between requests
    return "ok"

This worked. Slowly. At two images per second, the math was brutal:

73,650 images ÷ 2/sec = 36,825 seconds = 10.2 hours
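The same estimate as a one-line sanity check:

```python
total_images = 73_650
rate = 2  # images per second, throttled by the 0.5 s sleep
hours = total_images / rate / 3600
print(f"{hours:.1f} hours")  # 10.2 hours
```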

Around image 3,277, my MacBook's disk filled up. 926 GB drive, 27 GB free, and 25 GB of images still to go. The download crashed.

The pivot: move it to a machine that doesn't sleep

Instead of clearing disk space and restarting, I moved the entire operation to a Mac Mini server that sits in a closet: always on, 63 GB free, reachable via Tailscale.

The migration was surgical:

07:12 Verified Mac Mini has 63 GB free via ssh + df -h
07:13 rsync the project: data, scripts, 3,277 existing images (1.4 GB)
07:14 Start download as nohup background process on the Mini
07:15 Verify images are arriving, delete local copy, reclaim disk space

The download resumed automatically because the script checks for existing files. No progress was lost.
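That resume logic is just a set difference between the full id list and what's already on disk. A minimal sketch — the ids here are hypothetical placeholders; the real list comes from the metadata records:

```python
from pathlib import Path

IMAGES_DIR = Path("images")

# Hypothetical ids for illustration; the real list comes from the metadata.
all_ids = ["card-0001", "card-0002", "card-0003"]

done = {p.stem for p in IMAGES_DIR.glob("*.jpg")}  # already on disk
todo = [i for i in all_ids if i not in done]       # remaining work
```

Because `todo` is recomputed from the filesystem on every start, killing and restarting the script costs nothing.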

Attempt 2: why wait?

With the download running on the Mini, I watched the progress. Two images per second. Ten hours. On a machine with a gigabit connection that was doing nothing else.

The bottleneck wasn't the server, it wasn't the network, and it wasn't the archive's API. It was the sequential nature of the script. Each image waited for the previous one to finish, then waited another 0.5 seconds out of politeness.

So I rewrote it:

from concurrent.futures import ThreadPoolExecutor, as_completed

WORKERS = 20

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    futures = {pool.submit(download_one, img_id): img_id
               for img_id in todo}
    for i, future in enumerate(as_completed(futures), 1):
        result = future.result()
        if i % 500 == 0:  # log progress every 500 images
            print(f"{i}/{len(futures)} done")
Twenty parallel threads. No artificial delay. Same skip-if-exists logic. The archive's CDN handles concurrent requests just fine — it's designed to serve images to thousands of simultaneous web visitors.

Before:  2 images/sec — 10 hours
After:  23 images/sec — 50 minutes

A 10x improvement from removing artificial constraints.

But is this legal?

Before downloading 73,000 images from an academic archive, it's worth checking whether you're allowed to. I did a deep dive.

The short answer: yes. The Luhmann Archive is licensed under CC BY-SA 4.0, as registered with Text+, Germany's national research data infrastructure (NFDI). This means anyone may copy, redistribute, and adapt the material for any purpose, including commercially, provided they credit the archive and share derivatives under the same license.

The technical signals confirm this. Their robots.txt is wide open (Allow: /). The API has no authentication, no rate limiting, no API keys. The images sit on a public CDN.
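The robots.txt check can be done with the standard library alone; here the policy is parsed from an inline string matching the archive's wide-open file:

```python
from urllib.robotparser import RobotFileParser

# The archive's robots.txt allows everything; same policy inlined here.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Allow: /"])

allowed = rp.can_fetch("bulk-downloader",
                       "https://niklas-luhmann-archiv.de/image/123")
print(allowed)  # True
```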

The layered rights (for the curious)

German copyright law creates an interesting stack of protections:

Layer                           Law                          Status
Luhmann's texts                 UrhG, 70 years post mortem   Protected until 2068
University's scholarly edition  §70 UrhG                     25 years protection
Scan photographs                §72 UrhG (Lichtbildrecht)    50 years protection
CC BY-SA 4.0 license            Contract law                 Explicit permission from all rights holders

The CC license doesn't override copyright — it is the permission from the rights holders (Luhmann's heirs + the university) to use the material freely. They chose openness. Other researchers on the Zettelkasten Forum have done the same bulk downloads.

The final tally

73,650 images downloaded
~25 GB total
~60 errors (503s, retryable)
Wall time: 52 minutes
Zero rate limits hit
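Those ~60 errors were transient 503s, so a retry with exponential backoff recovers them. A sketch, not the script's actual code — the `get` callable is injected so the wrapper stays testable without a network:

```python
import time

def fetch_with_retry(get, image_id, retries=3, backoff=1.0):
    """Retry transient 503s with exponential backoff (1 s, 2 s, 4 s).

    `get` is any callable taking an id and returning an object with
    .status_code and .content (e.g. a requests-based downloader).
    """
    for attempt in range(retries):
        resp = get(image_id)
        if resp.status_code == 503:
            time.sleep(backoff * 2 ** attempt)
            continue
        return resp.content
    raise RuntimeError(f"{image_id}: still 503 after {retries} tries")
```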

What I learned

Politeness has diminishing returns. A 0.5-second delay per request is considerate when you're hitting someone's hobby server. For a CDN designed to serve thousands of concurrent users, it's just slow. The archive didn't notice the difference between 2 and 23 requests per second — it serves more than that to normal web traffic.

Move the work to the right machine. My laptop has limited disk space and needs to sleep. The Mac Mini has plenty of space and never sleeps. The migration took 3 minutes and saved hours of babysitting.

Check the license before you worry. I spent 20 minutes agonizing over whether bulk downloading was appropriate, when the archive had already answered that question: CC BY-SA 4.0. They want people to use this data.

Sequential is the default, not the optimum. The first version of any download script is a for-loop. That's fine for prototyping. For 73,000 files, ThreadPoolExecutor with 20 workers is still simple code — just faster code.

The archive was designed to be accessed one card at a time. But Luhmann never thought one card at a time. His whole system was about parallel connections.