Behind the Build
How We Verify Every Amazon Link Still Works
A static affiliate catalog rots in months. Here is the Playwright-based verifier that visits every product page weekly, classifies what it sees, and refuses to corrupt the catalog when Amazon's bot detection kicks in.
Most affiliate sites do not check their own links. They publish a "Best 10 Camping Tents" post in 2022, a quarter of the products get discontinued by 2024, and the page keeps recommending them. Click any of those links and you get the dreaded "we are sorry, this page is no longer available" Amazon page. The site keeps earning on the products that still work, the dead ones quietly underperform, and nobody fixes them because nobody is looking.
We do not want to be that site. Specifically, after a Canadian user pointed out that several of our amazon.com links were not shippable to Canada, we built infrastructure to prevent that from ever happening again. This post covers what we built and why it looks the way it does.
This is the deep-dive companion to How Gear Gadget Picks Your Camping Gear, which walks through the broader kit-builder logic. That post mentions the verifier in one paragraph; this post is the whole story.
The Naive Approach That Does Not Work
The obvious first attempt is to fetch each URL, check the HTTP status, and mark anything that is not 200 as broken. We tried this. It works for about 30 percent of requests.
The other 70 percent come back as 200 OK with HTML that says "Type the characters you see in this image" because Amazon's edge detected a non-browser client and served a CAPTCHA. Status-code-only verification cannot distinguish a CAPTCHA page from a product page. You either get false negatives (real products marked as broken) or false positives (broken products marked as fine) depending on how lenient your check is.
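To make the failure concrete, here is a minimal sketch of the status-only rule next to a check that reads the body. The function names and the sample HTML are hypothetical; only the CAPTCHA prompt text is taken from the page described above.

```typescript
// Hypothetical sketch: a status-only check next to a content-aware one.
type NaiveVerdict = "ok" | "broken";

// The naive rule: anything that is not 200 is broken.
function naiveStatusCheck(status: number): NaiveVerdict {
  return status === 200 ? "ok" : "broken";
}

// The CAPTCHA interstitial arrives as 200 OK, so it can only be spotted in the body.
function looksLikeCaptcha(html: string): boolean {
  return html.includes("Type the characters you see in this image");
}

// Illustrative response body, not a real Amazon page.
const captchaHtml =
  "<html><body><p>Type the characters you see in this image</p></body></html>";

console.log(naiveStatusCheck(200));         // "ok", even though no product was seen
console.log(looksLikeCaptcha(captchaHtml)); // true
```

The status check reports "ok" for a response that never showed a product, which is exactly the blind spot.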
A more elaborate fetch-based check parses the HTML and looks for product-page markers (a productTitle element, schema.org product JSON-LD). That gets you partway there. It still gets blocked the moment Amazon's bot detection tightens. Any approach that does not behave like a browser eventually breaks.
What Actually Works: Playwright + Chromium
The verifier runs a real headless Chromium browser via Playwright. It sets a locale (en-CA for amazon.ca, en-US for amazon.com), navigates to each product page, waits for the DOM to settle, then reads the rendered HTML. Amazon sees a real browser making real navigation requests; CAPTCHAs are rare.
The runner throttles to one navigation every 2 seconds. That is slower than necessary but it keeps the request rate under what triggers Amazon's per-IP rate limiter. The whole 50-product catalog takes about 4 minutes per run, which is fine for a once-a-week check.
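A throttled visit loop along these lines might look like the sketch below. It is not the code in our runner: `PageLike`, `visitAll`, and `localeFor` are hypothetical names, and the interface only mirrors the two Playwright `Page` methods the loop needs (`goto` and `content`), so a real page created from `browser.newContext({ locale })` could be passed in directly.

```typescript
// Minimal structural view of the Playwright Page methods this sketch relies on.
interface PageLike {
  goto(url: string, options?: { waitUntil?: "load" | "domcontentloaded" }): Promise<unknown>;
  content(): Promise<string>;
}

interface Visit {
  url: string;
  html: string | null; // null when navigation threw (timeout, network error)
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: pick the browser locale from the storefront host.
function localeFor(url: string): "en-CA" | "en-US" {
  return /^https?:\/\/(www\.)?amazon\.ca\//.test(url) ? "en-CA" : "en-US";
}

// Visit each URL in order, pausing `delayMs` between navigations.
async function visitAll(page: PageLike, urls: string[], delayMs = 2000): Promise<Visit[]> {
  const visits: Visit[] = [];
  let first = true;
  for (const url of urls) {
    if (!first) await sleep(delayMs); // throttle: one navigation per delayMs
    first = false;
    try {
      await page.goto(url, { waitUntil: "domcontentloaded" });
      visits.push({ url, html: await page.content() });
    } catch {
      // A failed navigation is recorded, not fatal; the classifier decides what it means.
      visits.push({ url, html: null });
    }
  }
  return visits;
}
```

At 2 seconds per gap, the throttle alone accounts for roughly 100 seconds across 50 products; the rest of the run time is presumably page loading.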
The script is in scripts/verify/checkAmazonCatalogCA.ts if you want to read it. The runner is straightforward; the interesting part is the classifier that decides what each response means.
The 4-State Classifier
Every Amazon page response gets one of four labels, decided by what markers appear in the HTML:
- catalog_present: The page rendered a productTitle element AND none of the out-of-stock markers fired. Product is buyable on this storefront. Safe to link.
- out_of_stock: Page rendered, productTitle is there, but the schema.org availability is OutOfStock OR the buybox includes the literal text "We don't know when or if this item will be back in stock." We do not link to OOS items for Canadian visitors. US visitors get them anyway under the lenient policy because Amazon US usually restocks fast.
- not_present: 404, 410, the Amazon "Looking for something?" soft-404 page, OR a search-results page (which Amazon sometimes redirects to when an ASIN does not exist). Product is gone. Remove from the catalog or replace with an alternative.
- inconclusive: 5xx response, CAPTCHA page, or anything else we cannot confidently classify. Do nothing. Re-check next run. Inconclusive means "we do not know," not "this is broken."
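The four labels can be sketched as one pure function over a page snapshot. This is an illustration of the decision order, not the contents of our classifier file: the `PageSnapshot` shape, the function name, and the regexes are assumptions, as is the rule that a search-results page is recognized by an `/s` path.

```typescript
type CatalogState = "catalog_present" | "out_of_stock" | "not_present" | "inconclusive";

// Hypothetical input shape: what the runner would hand over after a navigation.
interface PageSnapshot {
  status: number;   // HTTP status of the final response
  finalUrl: string; // URL after redirects
  html: string;     // rendered HTML
}

const CAPTCHA_TEXT = "Type the characters you see in this image";
const SOFT_404_TEXT = "Looking for something?";
const OOS_TEXT = "We don't know when or if this item will be back in stock.";

function classify(snap: PageSnapshot): CatalogState {
  const { status, finalUrl, html } = snap;
  if (status === 404 || status === 410) return "not_present";
  if (status >= 500) return "inconclusive";
  if (html.includes(CAPTCHA_TEXT)) return "inconclusive";
  if (html.includes(SOFT_404_TEXT)) return "not_present";
  // Assumption: search-results pages live under the /s path.
  if (/^https?:\/\/[^/]+\/s(\?|\/|$)/.test(finalUrl)) return "not_present";
  // No title element means this is not a product page we understand.
  if (!/id=["']productTitle["']/.test(html)) return "inconclusive";
  const schemaOos = /"availability"\s*:\s*"[^"]*OutOfStock"/.test(html);
  if (schemaOos || html.includes(OOS_TEXT)) return "out_of_stock";
  return "catalog_present";
}
```

The ordering matters: hard failures and CAPTCHAs are ruled out before any product marker is trusted, and anything that falls through without a title lands in inconclusive instead of being guessed at.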
Why We Picked These Specific Markers
The original classifier used a broader set of out-of-stock signals, including the text "Currently unavailable" anywhere on the page. That sounds reasonable until you discover that "Currently unavailable" also appears in variant selectors (one color out of stock but the product is buyable), marketplace seller blocks (one seller out of stock but Amazon has it), and shipping-options sections.
We narrowed to two high-signal markers tied to the actual buybox state:
- The schema.org JSON-LD availability field set to OutOfStock. This is Amazon's structured-data declaration of "this product is not for sale right now." Hard to misinterpret.
- The exact text "We don't know when or if this item will be back in stock." This appears only in the buybox of an OOS product, not in variant selectors.
Together, these two narrow markers cut the false-positive rate from "noticeable" to "near zero" without losing real OOS detection.
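The difference between the broad and narrow rules shows up on a page where only one variant is sold out. The sketch below uses made-up markup and hypothetical function names; the two marker strings are the ones listed above.

```typescript
// Broad rule from the first classifier: the phrase anywhere on the page.
function isOutOfStockBroad(html: string): boolean {
  return html.includes("Currently unavailable");
}

// Narrow rule: only the two buybox-level markers.
function isOutOfStockNarrow(html: string): boolean {
  const schemaSaysOos = /"availability"\s*:\s*"[^"]*OutOfStock"/.test(html);
  const buyboxSaysOos = html.includes(
    "We don't know when or if this item will be back in stock."
  );
  return schemaSaysOos || buyboxSaysOos;
}

// Illustrative markup only: a buyable product whose blue variant is sold out.
const buyableWithDeadVariant = `
  <span id="productTitle">Two-Person Dome Tent</span>
  <li class="variant">Blue <span>Currently unavailable</span></li>
  <script type="application/ld+json">
    {"@type":"Product","offers":{"availability":"https://schema.org/InStock"}}
  </script>`;

console.log(isOutOfStockBroad(buyableWithDeadVariant));  // true (a false positive)
console.log(isOutOfStockNarrow(buyableWithDeadVariant)); // false
```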
The Abort-on-Inconclusive Guard
The first version of the verifier had a real bug. When Amazon bot-detected the GitHub Actions runner IP (which happens periodically because the IP range is shared with thousands of other workflows), most of the responses came back as CAPTCHA pages. The classifier correctly labeled them inconclusive. The runner then wrote the inconclusive results to products.json.
Downstream, the engine saw 30 products with inconclusive Canadian status and excluded them from Canadian kits. CA visitors got near-empty kits for a week until the next run cleared the inconclusive results and put everything back.
The fix is a small guard at the top of the write step:

```typescript
// `inconclusive` and `total` are this run's result counts.
const inconclusiveRate = inconclusive / total;
if (inconclusiveRate > 0.5) {
  // More than half the pages could not be classified: assume the runner was blocked.
  console.error(
    `Aborting: ${Math.round(inconclusiveRate * 100)}% inconclusive (runner IP likely bot-flagged).`
  );
  process.exit(2); // non-zero exit, nothing is written to products.json
}
```

If more than half the results are inconclusive in a single run, the runner exits with code 2 without writing anything. The next scheduled run will try again, and the most-recent good catalog stays in production. This is the right default: we would rather show stale-but-correct data than fresh-but-corrupted data.
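One edge case worth noting if you adapt the guard: with zero results, the ratio is 0 / 0, which is NaN, and NaN > 0.5 is false, so an empty run would pass straight through. A hypothetical pure-function variant that closes that gap might look like this; treating an empty run as unsafe is our assumption here, not something the original guard does.

```typescript
interface RunCounts {
  inconclusive: number;
  total: number;
}

// Hypothetical helper; the 0.5 default matches the threshold in the guard above.
function shouldAbortWrite(counts: RunCounts, threshold = 0.5): boolean {
  // An empty run carries no evidence either way, so treat it as unsafe to write.
  if (counts.total === 0) return true;
  return counts.inconclusive / counts.total > threshold;
}

console.log(shouldAbortWrite({ inconclusive: 30, total: 50 })); // true
console.log(shouldAbortWrite({ inconclusive: 3, total: 50 }));  // false
console.log(shouldAbortWrite({ inconclusive: 0, total: 0 }));   // true
```

Keeping the decision in a function with no side effects also makes it easy to unit-test separately from the exit call.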
Drift Reporting and GitHub Issues
When a product flips from catalog_present to not_present (or to OOS for an extended period), the workflow opens a GitHub issue labeled catalog-drift with the drift report attached. The issue title includes the date and the product ID. If a previous drift issue is still open, the workflow comments on it instead of opening a duplicate.
The drift report itself is a short Markdown table:
| Product ID | Region | Was | Is now |
| --- | --- | --- | --- |
| tent-coleman-skydome-2 | CA | catalog_present | not_present |
| cooler-coleman-classic-62-ca | CA | catalog_present | out_of_stock |
| chair-helinox-one | US | catalog_present | out_of_stock |

The maintainer reads the issue, decides whether to replace the product or wait for stock to return, and the cycle continues. No human checks every run; the human only checks when something actually changed.
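Computing the drift and rendering the table can be sketched as two small functions. The type and function names are hypothetical, and the sketch simplifies one thing: it reports an out_of_stock flip immediately and does not model the "extended period" wait.

```typescript
type Status = "catalog_present" | "out_of_stock" | "not_present" | "inconclusive";
type Region = "CA" | "US";

interface StatusRecord { productId: string; region: Region; status: Status; }
interface DriftRow { productId: string; region: Region; was: Status; isNow: Status; }

// Compare the previous run with the current one and keep only real regressions.
function findDrift(previous: StatusRecord[], current: StatusRecord[]): DriftRow[] {
  const before = new Map<string, Status>();
  for (const p of previous) before.set(p.productId + "|" + p.region, p.status);

  const rows: DriftRow[] = [];
  for (const c of current) {
    const was = before.get(c.productId + "|" + c.region);
    const wentBad = c.status === "not_present" || c.status === "out_of_stock";
    // Inconclusive results are skipped: "we do not know" is not drift.
    if (was === "catalog_present" && wentBad) {
      rows.push({ productId: c.productId, region: c.region, was, isNow: c.status });
    }
  }
  return rows;
}

// Render the rows as the Markdown table attached to the issue.
function renderDriftTable(rows: DriftRow[]): string {
  const header = ["| Product ID | Region | Was | Is now |", "| --- | --- | --- | --- |"];
  const body = rows.map((r) => `| ${r.productId} | ${r.region} | ${r.was} | ${r.isNow} |`);
  return header.concat(body).join("\n");
}
```

An empty result from `findDrift` is the common case, and it is the signal for the workflow to open no issue at all.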
What the Verifier Does Not Catch
To be clear about the limits, the verifier does not catch:
- Price changes. The catalog stores approximate "typical" prices. Real prices update at click time on Amazon. A product that doubled in price still passes catalog_present.
- Review-score collapse. A product going from 4.5 stars to 3.1 stars does not flip any classifier signal.
- Model-year refreshes. A 2024 product replaced with a 2025 model under the same ASIN passes. The new version might be better or worse; we will not know.
- Quality regressions inside a product line. Brands sometimes silently downgrade components. We cannot detect that from the listing.
- Better alternatives that emerged. The verifier asks "is this still a real product I can link to" not "is this still the best pick." Catalog curation is a separate human review.
For everything the verifier cannot catch, we rely on quarterly manual catalog review and reader feedback. If something we recommend is clearly wrong, the right place to flag it is a GitHub issue.
If You Want to Steal This For Your Own Site
The whole thing is open source. The classifier is 45 lines of TypeScript. The runner is about 200 lines. The GitHub Actions workflow is about 100 lines of YAML. None of it is exotic technology, just Playwright plus a careful set of marker checks.
Key files to copy:
- src/lib/affiliate/amazonCatalogClassifier.ts (the 4-state classifier and marker list)
- scripts/verify/checkAmazonCatalogCA.ts (the Playwright runner with the abort-on-inconclusive guard)
- .github/workflows/amazon-verify.yml (the cron + issue-opening workflow)
All public at github.com/cainky/gear-gadget. Adapt the markers for whatever marketplace you point it at. The pattern is general; the specifics are Amazon-flavored.
FAQ
Why not just fetch the Amazon URLs with curl or node-fetch?
Amazon aggressively bot-detects non-browser clients. A curl request usually gets back a CAPTCHA page that is indistinguishable from a real product page if you only check HTTP status. A fetch-based verifier produces false negatives (real products marked as missing) and false positives (broken products marked as available). The defense is to behave like a real browser, which means a real browser. We use headless Chromium via Playwright.
How often does the verifier run?
Weekly, every Monday at 09:00 UTC, via a GitHub Actions cron. The schedule is in .github/workflows/amazon-verify.yml. It can also be triggered manually with workflow_dispatch. Each run takes about 4 minutes for the ~50-product catalog (2-second throttle between Amazon page navigations to stay polite).
What does the verifier do when a product disappears?
If a product goes from 'catalog_present' to 'not_present' on amazon.ca, the workflow opens (or comments on) a GitHub issue labeled 'catalog-drift' with the drift report attached. A human reviews and replaces the ASIN. The verifier does NOT silently overwrite products.json with garbage data: if more than 50 percent of results come back inconclusive (typically because the runner IP got bot-flagged), the script aborts before writing anything.
Does the same verifier run against amazon.com?
Optionally. Set VERIFY_US=1 to also classify amazon.com URLs. By default the workflow runs CA-strict and US-lenient: Canadian visitors only see products with confirmed amazon.ca availability, US visitors get a more permissive policy that lets recently-OOS products through. The reasoning is in src/lib/affiliate/region.ts.
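As a rough illustration of the CA-strict, US-lenient split, the policy can be expressed as a single predicate. This is a sketch, not the contents of region.ts: the function name is hypothetical, and treating an inconclusive US result as not linkable is an assumption, since the policy above only spells out the out-of-stock case.

```typescript
type Region = "CA" | "US";
type Status = "catalog_present" | "out_of_stock" | "not_present" | "inconclusive";

// Hypothetical predicate: may a product with this verified status be linked in this region?
function isLinkable(region: Region, status: Status): boolean {
  // Confirmed availability is linkable everywhere.
  if (status === "catalog_present") return true;
  // US-lenient: out-of-stock products stay linkable because restocks are usually fast.
  if (region === "US" && status === "out_of_stock") return true;
  // CA-strict: anything short of confirmed availability is excluded.
  // Assumption: inconclusive and not_present are excluded for US as well.
  return false;
}

console.log(isLinkable("CA", "out_of_stock")); // false
console.log(isLinkable("US", "out_of_stock")); // true
```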
What does the 4-state classifier return for each URL?
One of four labels: catalog_present (product page rendered with a title element and no out-of-stock markers); out_of_stock (product page rendered but the schema.org availability says OutOfStock or the buybox shows "We don't know when or if this item will be back in stock."); not_present (the URL returned 404, 410, the Amazon "Looking for something?" soft-404 page, or a search-results page instead of a product); inconclusive (5xx, CAPTCHA, or any other unexpected response we cannot confidently classify).
What does the verifier NOT catch?
Price changes (we display approximate prices from the catalog; real prices update at click time on Amazon). Review-score collapse (a product going from 4.5 stars to 3.1 stars does not flip any of our classifier signals). Model-year refreshes (a 2024 product replaced with a 2025 version under the same ASIN still passes catalog_present). Quality regressions inside a product line. The verifier checks 'is this URL still serving a buyable product page,' not 'is this still a good product.'
Can I see the verifier code?
Yes. The classifier is at src/lib/affiliate/amazonCatalogClassifier.ts (45 lines). The runner is at scripts/verify/checkAmazonCatalogCA.ts. The workflow is at .github/workflows/amazon-verify.yml. All public at github.com/cainky/gear-gadget.
See It In Action
Every product in the kit builder has passed this verifier within the last week. Run the builder and click any product to see a live verified link in your region.
Build My Kit