
Scaling a legal data platform to 4 countries in a month

AI · Cloudflare · Architecture · Build Log

A month ago, I shipped sgcaselaw.com — a site that takes Singapore’s court judgments out of the clunky government system and makes them searchable, browsable, and fast. 642 cases, AI summaries, full-text search, SEO-optimized. A solid v1.

Then I thought: the architecture doesn’t care what country the data comes from. The schema is the same. Cases have judges, lawyers, parties, citations. Every Commonwealth jurisdiction structures legal data the same way.

So I scaled it. Singapore → Malaysia → Hong Kong → India. One month, 150+ agent sessions, and now the platform has 18,000+ cases, 44,000+ entities, and 68,000+ pieces of AI-generated content across four jurisdictions — all running on Cloudflare’s free tier.

Here’s what actually happened.

The architecture that made it possible

The key decision was made early, before I even thought about multiple countries: one schema, many databases.

caselaw/
├── packages/
│   ├── db/          ← One Drizzle schema (cases, entities, citations, statutes)
│   ├── core/        ← Shared Astro components, queries, formatters
│   ├── pipeline/    ← D1 REST client, fetch utils, batch helpers
│   └── ui/          ← Design system (Header, Footer, Base layout, OG images)
├── sites/
│   ├── sg/          ← SG scraper + pipeline + Astro site
│   ├── my/          ← MY scraper + pipeline + Astro site
│   ├── hk/          ← HK scraper + pipeline + Astro site
│   └── in/          ← IN scraper + pipeline + Astro site

Every country gets its own Cloudflare D1 database, but the tables are identical. The Astro pages are literally the same files — they import from @caselaw/core and read everything country-specific from a single site.config.ts file. Adding a new country means: write a scraper, adapt the entity extractor, create a config file.
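To make that concrete, here's a rough sketch of what a per-country config looks like. The field names and values are illustrative — my paraphrase of the shape, not the actual file:

```ts
// site.config.ts — hypothetical sketch of a per-country config;
// the real field names in the repo may differ.
export const siteConfig = {
  country: "my",                     // country code used for routing and queries
  siteName: "MY Caselaw",
  domain: "example.pages.dev",       // placeholder deployment URL
  d1Binding: "CASELAW_MY",           // Cloudflare D1 binding for this country
  courts: [
    { code: "FC", name: "Federal Court of Malaysia" },
    { code: "CA", name: "Court of Appeal" },
  ],
  locale: "en-MY",
} as const;

export type SiteConfig = typeof siteConfig;
```

Everything downstream — page titles, court filters, database binding — reads from this one object, which is what keeps the Astro pages country-agnostic.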

The pipeline is where countries diverge, and that’s by design — data sources are messy and country-specific. That’s where I spent most of the time.

Country by country

Malaysia — PDFs and watermarks

Malaysia’s eJudgment system has a JSON API behind the scenes (ASMX endpoints), which made indexing 23,000+ cases straightforward. The catch: judgment text is served as PDFs, not HTML.

Each PDF download takes 2–5 seconds, and the extracted text is riddled with eFILING watermarks — “S/N…Note: Serial number will be used to verify the originality…” — repeated throughout every document. The entity extraction pipeline has to clean all of that out before it can find judge names and counsel.
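A minimal sketch of the kind of cleanup involved — the patterns here are illustrative, not the exact ones in the pipeline:

```ts
// Strip eFILING serial-number watermarks from extracted PDF text.
// Illustrative only — the real pipeline's patterns are more thorough.
function stripWatermarks(text: string): string {
  return text
    // Serial-number lines like "S/N aBc123XyZ"
    .replace(/S\/N\s+\S+/g, "")
    // The repeated originality notice that follows them
    .replace(/Note\s*:\s*Serial number will be used to verify the originality[^\n]*/gi, "")
    // Collapse the whitespace left behind
    .replace(/[ \t]{2,}/g, " ")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```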

The Malaysian Bar Legal Directory turned out to be a goldmine. The API is open (no CAPTCHA, unlike Singapore’s LSRA), and I scraped the entire directory: 25,164 lawyers and 10,221 firms with admission dates, qualifications, firm affiliations, contact details. 232 lawyers and 72 firms matched to entities from the case data.

Final numbers: 3,997 cases, 15,353 entities, 13,706 attributes, 19,594 content items.

Hong Kong — when catchwords aren’t catchwords

Hong Kong’s HKLII is the cleanest data source of the four — a proper REST API with JSON responses, HTML judgments inline, reliable pagination. Scraping 9,262 cases was smooth.

The surprise was semantic. In Singapore and Malaysia, the “catchwords” field contains legal topics — “Criminal Law — Murder — Sentencing.” In Hong Kong, catchwords contain parallel citations — “(1999) 2 HKCFAR 4.” Completely different data masquerading as the same field. I had to rewrite the practice-area mapping for Hong Kong to derive topics from cited statutes (ordinances) instead of catchwords. 48 Hong Kong ordinances became the practice area taxonomy.
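Conceptually it's a lookup from cited ordinance to practice area, roughly like this (the entries are illustrative, not the actual 48-item taxonomy):

```ts
// Hypothetical excerpt of the ordinance → practice area lookup.
const ORDINANCE_PRACTICE_AREAS: Record<string, string> = {
  "Cap. 622": "Company Law",     // Companies Ordinance
  "Cap. 200": "Criminal Law",    // Crimes Ordinance
  "Cap. 57": "Employment Law",   // Employment Ordinance
};

function practiceAreasFromCitations(citedOrdinances: string[]): string[] {
  const areas = citedOrdinances
    .map((cap) => ORDINANCE_PRACTICE_AREAS[cap])
    .filter((a): a is string => Boolean(a));
  return [...new Set(areas)];
}
```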

For enrichment, the Hong Kong Bar Association’s website is public and well-structured: 1,764 barristers with names, chambers, year of call, SC/JC status. I matched 468 to entities in the database — 53% of all lawyer entities got enriched with professional data.

HKLII also has a noteup API that tells you which cases cite which other cases. I enriched all 1,744 cases that have judgment text with citation counts. Some CFA cases have 300+ noteups.

Final numbers: 9,262 cases, 11,055 entities, 6,616 attributes, 20,377 content items.

India — three courts, one Supreme Court website that blocks bots

India was the most complex expansion. Indian Kanoon is the data source — it has everything, but it caps search results at 400 per query. For the Supreme Court’s 1,024 cases in 2024, that meant switching to monthly search windows to get them all.
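Sliding a monthly window under the 400-result cap is simple date arithmetic; a sketch:

```ts
// Generate one (from, to) window per month to stay under Indian Kanoon's
// 400-results-per-query cap. Illustrative sketch, not the actual scraper code.
function monthlyWindows(year: number): Array<{ from: string; to: string }> {
  return Array.from({ length: 12 }, (_, m) => {
    const fmt = (d: Date) => d.toISOString().slice(0, 10);
    const from = new Date(Date.UTC(year, m, 1));
    const to = new Date(Date.UTC(year, m + 1, 0)); // day 0 = last day of month m
    return { from: fmt(from), to: fmt(to) };
  });
}

// e.g. monthlyWindows(2024)[0] → { from: "2024-01-01", to: "2024-01-31" }
```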

I expanded from just the Supreme Court to three courts: SCI, Delhi High Court, and Bombay High Court. The scraper needed a --court flag, --monthly mode for large courts, and court-specific transform logic.

The judiciary enrichment had its own challenges. The Supreme Court website (sci.gov.in) returns 403 for bot-like User-Agents — had to use a Chrome UA. Individual judge profile pages vary wildly: some have full biographical text, most have just a name and photo. Wikipedia filled the gaps for 31 judges.
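The workaround is just a browser-like User-Agent on the request — roughly:

```ts
// sci.gov.in rejects bot-like User-Agents with a 403; a desktop Chrome UA
// string gets through. The UA string below is illustrative.
const CHROME_UA =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";

async function fetchJudgeProfile(url: string): Promise<string> {
  const res = await fetch(url, { headers: { "User-Agent": CHROME_UA } });
  if (!res.ok) throw new Error(`Fetch failed: ${res.status} ${url}`);
  return res.text();
}
```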

Bombay High Court’s data was particularly noisy. Indian Kanoon indexes duplicate entries for the same procedural order under different party names — Central Railway alone generated 198 duplicate first appeals. Entity reconciliation had to be aggressive with deduplication.
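In practice that means the dedup key has to be something stronger than the party names. A sketch — the key fields here (court, decision date, normalized case number) are my guess at what "aggressive" looks like, not the actual reconciliation logic:

```ts
// Collapse Indian Kanoon duplicates that differ only in how the parties are named.
interface RawCase {
  court: string;
  decisionDate: string; // YYYY-MM-DD
  caseNumber: string;   // e.g. "FIRST APPEAL NO. 198 OF 2024"
  parties: string;
}

function dedupeCases(cases: RawCase[]): RawCase[] {
  const seen = new Map<string, RawCase>();
  for (const c of cases) {
    const key = [
      c.court,
      c.decisionDate,
      c.caseNumber.toLowerCase().replace(/\s+/g, " ").trim(),
    ].join("|");
    if (!seen.has(key)) seen.set(key, c);
  }
  return [...seen.values()];
}
```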

Final numbers: 4,003 cases across 3 courts, 12,805 entities, 2,925 attributes, 16,950 content items.

The pipeline pattern

Every country follows the same seven-stage pipeline:

  1. Scrape — fetch case metadata and judgment text from the source
  2. Transform — normalize to the shared schema (court codes, slugs, dates)
  3. Load — insert into D1 with batch operations and conflict handling
  4. Extract — pull entities (judges, lawyers, firms, parties) from judgment text
  5. Reconcile — deduplicate entities, create case↔entity links, load to D1
  6. Enrich — cross-reference external sources (bar associations, judiciary websites, Wikipedia)
  7. Content generate — AI-generated narratives, Q&A, bios for every entity and case

The first three stages are fully automated scripts. Stage 4 uses regex-based extraction (not AI — regex is faster and more reliable for structured legal text). Stage 5 handles name normalization per country — Malay honorifics (bin/binti, a/l, a/p, Tetuan), Hong Kong surname-first format (“Lee, Martin C.M.”), Indian initial variants (“B.R. Gavai” vs “B. R. Gavai”). Stage 6 is custom per country. Stage 7 uses template scripts that generate content at scale.
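A compressed sketch of what the per-country normalization in stage 5 looks like — the patterns are illustrative, not the actual rules:

```ts
// Per-country name normalization. Illustrative only.
function normalizeName(raw: string, country: "my" | "hk" | "in"): string {
  let name = raw.replace(/\s+/g, " ").trim();
  switch (country) {
    case "my":
      // Strip the firm honorific "Tetuan"; lowercase bin/binti, a/l, a/p so
      // "Ahmad bin Ali" and "Ahmad Bin Ali" compare equal.
      name = name.replace(/^Tetuan\s+/i, "");
      name = name.replace(/\b(bin|binti|a\/l|a\/p)\b/gi, (m) => m.toLowerCase());
      break;
    case "hk":
      // Surname-first format: "Lee, Martin C.M." → "Martin C.M. Lee"
      name = name.replace(/^([^,]+),\s*(.+)$/, "$2 $1");
      break;
    case "in":
      // Initial variants: "B. R. Gavai" → "B.R. Gavai"
      name = name.replace(/\b([A-Z])\.\s+(?=[A-Z]\.)/g, "$1.");
      break;
  }
  return name;
}
```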

The whole pipeline is idempotent. Every stage can be re-run safely. This matters more than you’d think — Wrangler OAuth tokens expire after an hour, D1 has FK constraint quirks, rate limits trigger retries. The pipeline doesn’t crash; it recovers.
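Most of the idempotence comes down to upserting on stable identifiers rather than inserting blindly. With Drizzle against D1 that looks roughly like this — the table export and conflict key are assumptions, not the actual schema:

```ts
// Idempotent load: re-running the same batch upserts instead of failing on duplicates.
// The schema import path and the `slug` conflict target are assumptions.
import { drizzle } from "drizzle-orm/d1";
import { cases } from "@caselaw/db/schema";

export async function loadCases(d1: D1Database, rows: (typeof cases.$inferInsert)[]) {
  const db = drizzle(d1);
  for (const row of rows) {
    await db
      .insert(cases)
      .values(row)
      .onConflictDoUpdate({ target: cases.slug, set: row });
  }
}
```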

150 sessions of autonomous execution

The most unusual aspect of this project is how it was built. Most of the work was done by autonomous agent sessions — my pi coding agent running against a phased plan.

The workflow: I write a plan with phases, items, and exit criteria. The agent reads the plan, picks up the next uncompleted item, executes it, verifies the build passes, updates the plan files, and exits. A shell wrapper (autopilot.sh) restarts it for the next item. I review the results and course-correct when needed.

The cross-country gap closure plan alone ran 107 sessions. The UI reskin was 24 sessions. The desktop polish was 19 sessions. Each session reads MEMORY.md (accumulated context and decisions — mine is 59 entries long), checks DRIFT.md for spec changes, executes work, and logs what it did in SESSION-LOG.md.

This isn’t “AI generated my code.” It’s more like having a very diligent junior developer who works 24/7, follows instructions precisely, and writes excellent session notes. The architecture, the decisions, the quality bar — those are mine. The execution at scale — scraping 4,000 cases, extracting entities from 3,400 judgments, generating content for 16,000 items — that’s where autonomous sessions shine.

The design system

Midway through, I realized that maintaining four separate sites with inconsistent UIs was a problem. So I built @caselaw/ui — a shared Astro component package with a warm editorial design system:

  • Warm stone palette (#fafaf9 backgrounds, #1c1917 text, amber accents)
  • Serif headings (Lora) + Inter body — authoritative legal feel
  • Dark mode via CSS custom properties
  • Shared Header, Footer, Base layout, 404 page
  • OG image generation via satori + resvg-wasm on Workers
  • View Transitions for smooth navigation

Each country site imports from @caselaw/ui through a thin wrapper. One design change propagates to all four sites.
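The OG image route is the usual satori → resvg-wasm pipeline on Workers. A trimmed sketch — font loading, caching, and the real template live in @caselaw/ui, and the wasm import path is an assumption about how the bundler resolves it:

```ts
// Render an OG image: satori turns an element tree into SVG,
// resvg-wasm rasterizes the SVG to PNG.
import satori from "satori";
import { Resvg, initWasm } from "@resvg/resvg-wasm";
// @ts-ignore — wasm module import, resolved by the Workers bundler (assumed path)
import resvgWasm from "@resvg/resvg-wasm/index_bg.wasm";

let wasmReady: Promise<void> | undefined;

export async function renderOgImage(title: string, fontData: ArrayBuffer): Promise<Uint8Array> {
  wasmReady ??= initWasm(resvgWasm);
  await wasmReady;

  const svg = await satori(
    { type: "div", props: { style: { fontSize: 64, padding: 48 }, children: title } },
    {
      width: 1200,
      height: 630,
      fonts: [{ name: "Lora", data: fontData, weight: 600, style: "normal" }],
    }
  );

  return new Resvg(svg).render().asPng();
}
```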

What’s next

The platform is live. sgcaselaw.com is the flagship, focused on building SEO traction first. The other three sites are deployed to Cloudflare Pages dev URLs but deliberately noindexed — no custom domains, no search console, no crawling. Singapore needs to prove the model before I invest in the others.

There’s also an open architectural question I haven’t resolved: should these be separate sites or one consolidated platform? Four separate domains (sgcaselaw.com, mycaselaw.com, etc.) give each country its own SEO authority and topical focus. One unified site gives cross-jurisdiction search and a stronger single brand. The monorepo supports either approach — the shared schema and components work regardless of how the sites are deployed. I’m letting the Singapore SEO data inform this decision before committing.

The immediate focus is simple: does sgcaselaw.com rank? Does anyone actually use it? The architecture can scale to 20 countries, but it needs to prove value for one first.

Stack

Component    Technology
Framework    Astro 5 (SSR) + Tailwind v4
Database     Cloudflare D1 (one per country)
ORM          Drizzle ORM
Hosting      Cloudflare Pages + Workers
AI           Claude (summaries, content generation)
Pipeline     Bun + custom scrapers per country
Design       @caselaw/ui shared package
Cost         ~$0/month (Cloudflare free tier)