Downloading 73,000 Handwritten Cards — The Polite Way, Then the Fast Way
March 29, 2026 · Martijn Aslander

Yesterday I downloaded all 73,715 metadata records from the Luhmann Archive — the text, the cross-references, the structure. Nine API calls. About 100 MB. Then I built a database, ran a network analysis, mapped the cross-references, and published the results. All within a couple of hours.
But that was just the metadata. The actual scans — high-resolution JPEGs of every handwritten card — are 73,650 separate image files. Roughly 350 KB each. About 25 GB total. That's where things got interesting.
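A quick sanity check on those round numbers (the ~350 KB average is the rough figure above, not an exact measurement):

```python
# Rough total size of the scan set: 73,650 JPEGs at ~350 KB each
IMAGES = 73_650
AVG_KB = 350

total_gb = IMAGES * AVG_KB / 1024**2  # KB -> GB
print(f"{total_gb:.1f} GB")  # → 24.6 GB
```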
Attempt 1: The polite sequential download
The first script was respectful. One image at a time. Half a second delay between requests. Skip anything already downloaded. Resume where you left off.
```python
import time
from pathlib import Path
import requests

IMAGES_DIR = Path("images")

def download_one(image_id):
    dest = IMAGES_DIR / f"{image_id}.jpg"
    if dest.exists():
        return "skip"  # already on disk: resume comes for free
    url = f"https://images.niklas-luhmann-archiv.de/image/{image_id}?size=2"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # surface 503s instead of saving error pages
    dest.write_bytes(resp.content)
    time.sleep(0.5)  # politeness delay
    return "ok"
```
This worked. Slowly. At two images per second, 73,650 images meant roughly ten hours of downloading.
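The arithmetic behind that estimate — one request plus the 0.5-second delay works out to about two images per second:

```python
TOTAL = 73_650  # images to fetch
RATE = 2        # images per second, sequential with a 0.5 s delay

hours = TOTAL / RATE / 3600
print(f"{hours:.1f} hours")  # → 10.2 hours
```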
Around image 3,277, my MacBook's disk filled up. 926 GB drive, 27 GB free, and 25 GB of images still to go. The download crashed.
The pivot: move it to a machine that doesn't sleep
Instead of clearing disk space and restarting on the laptop, I moved the entire operation to a Mac Mini server that sits in a closet. Always on, 63 GB free, reachable via Tailscale.
The migration was surgical:
- `ssh` in, `df -h` to confirm the free space
- `rsync` the project over: data, scripts, and the 3,277 existing images (1.4 GB)
- restart the script as a `nohup` background process on the Mini
The download resumed automatically because the script checks for existing files. No progress was lost.
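That skip-if-exists check means the work list can be rebuilt from the filesystem at any moment. A minimal sketch of the idea (the directory name and card ids are stand-ins, not the archive's real identifiers):

```python
from pathlib import Path

IMAGES_DIR = Path("images")
all_ids = ["card_001", "card_002", "card_003"]  # stand-in ids

# Anything already on disk simply drops out of the work list
todo = [i for i in all_ids
        if not (IMAGES_DIR / f"{i}.jpg").exists()]
print(len(todo))
```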
Attempt 2: why wait?
With the download running on the Mini, I watched the progress. Two images per second. Ten hours. On a machine with a gigabit connection that was doing nothing else.
The bottleneck wasn't the server, it wasn't the network, and it wasn't the archive's API. It was the sequential nature of the script. Each image waited for the previous one to finish, then waited another 0.5 seconds out of politeness.
So I rewrote it:
```python
from concurrent.futures import ThreadPoolExecutor, as_completed

WORKERS = 20

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    futures = {pool.submit(download_one, img_id): img_id
               for img_id in todo}
    for i, future in enumerate(as_completed(futures), 1):
        result = future.result()
        if i % 500 == 0:  # log progress every 500 images
            print(f"{i}/{len(futures)} downloaded")
```
Twenty parallel threads. No artificial delay. Same skip-if-exists logic. The archive's CDN handles concurrent requests just fine — it's designed to serve images to thousands of simultaneous web visitors.
After the rewrite: 23 images per second, and the whole set down in about 50 minutes.
A 10x improvement from removing artificial constraints.
But is this legal?
Before downloading 73,000 images from an academic archive, it's worth checking whether you're allowed to. I did a deep dive.
The short answer: yes. The Luhmann Archive is licensed under CC BY-SA 4.0, as registered with Text+, Germany's national research data infrastructure (NFDI). This means:
- Free to use, share, and adapt — including bulk downloads
- Attribution required — credit Niklas Luhmann-Archiv, Universität Bielefeld
- ShareAlike — derivative works must use the same license
- Commercial use allowed
The technical signals confirm this. Their robots.txt is wide open (Allow: /). The API has no authentication, no rate limiting, no API keys. The images sit on a public CDN.
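You can check that kind of robots.txt programmatically with the standard library; a sketch, where the two rule lines mirror what's described above (not a verbatim copy of the file) and the image path is made up:

```python
from urllib.robotparser import RobotFileParser

# What a wide-open robots.txt amounts to
rp = RobotFileParser()
rp.parse(["User-agent: *", "Allow: /"])

# Any path is permitted for any user agent
print(rp.can_fetch("BulkDownloader", "/image/12345"))  # → True
```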
The layered rights (for the curious)
German copyright law creates an interesting stack of protections:
| Layer | Law | Status |
|---|---|---|
| Luhmann's texts | UrhG, 70 years post mortem | Protected until 2068 |
| University's scholarly edition | §70 UrhG | 25 years protection |
| Scan photographs | §72 UrhG (Lichtbildrecht) | 50 years protection |
| CC BY-SA 4.0 license | Contract law | Explicit permission from all rights holders |
The CC license doesn't override copyright — it is the permission from the rights holders (Luhmann's heirs + the university) to use the material freely. They chose openness. Other researchers on the Zettelkasten Forum have done the same bulk downloads.
The final tally
- ~25 GB total
- ~60 errors (503s, all retryable)
- Wall time: 52 minutes
- Zero rate limits hit
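Transient 503s like those can be mopped up with a small retry wrapper. This is a sketch of the idea, not the code from the actual run; the function name and backoff values are mine:

```python
import time

def with_retries(fn, arg, attempts=3, backoff=1.0):
    """Call fn(arg), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return fn(arg)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller see the error
            time.sleep(backoff * 2 ** attempt)
```

A 503 then becomes a pause-and-retry instead of a lost image: `with_retries(download_one, img_id)`.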
What I learned
Politeness has diminishing returns. A 0.5-second delay per request is considerate when you're hitting someone's hobby server. For a CDN designed to serve thousands of concurrent users, it's just slow. The archive didn't notice the difference between 2 and 23 requests per second — it serves more than that to normal web traffic.
Move the work to the right machine. My laptop had limited disk space and needs to sleep. The Mac Mini has plenty of space and never sleeps. The migration took 3 minutes and saved hours of babysitting.
Check the license before you worry. I spent 20 minutes agonizing over whether bulk downloading was appropriate, when the archive had already answered that question: CC BY-SA 4.0. They want people to use this data.
Sequential is the default, not the optimum. The first version of any download script is a for-loop. That's fine for prototyping. For 73,000 files, ThreadPoolExecutor with 20 workers is still simple code — just faster code.
The archive was designed to be accessed one card at a time. But Luhmann never thought one card at a time. His whole system was about parallel connections.