Automated Adobe Launch Verification with n8n

In digital analytics, user behaviour data is precious—and once lost, it’s gone forever. A missing or misconfigured tag can create silent blind spots that distort reports, undermine marketing insights, and lead to costly business decisions.

Most modern tracking implementations rely on tag management systems like Adobe Experience Platform Tags (commonly known as Adobe Launch). Its flexibility and rule-based firing make it a top choice, but a frequent pain point remains: the Launch embed script is usually deployed and maintained by website developers, not by the analytics team that configures the rules.

This separation of duties often results in accidental removals or modifications during site redesigns, CMS migrations, or even routine page updates. The <script> tag loading from assets.adobedtm.com gets overlooked, and suddenly pages stop collecting data: no alerts, just gaps in your reports.

Strong processes like code reviews and deployment checklists are crucial, but they’re not foolproof. Proactive, automated monitoring is the best defence: regularly confirming that the correct Launch library (with the expected property and environment) loads on every page.
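
For a quick manual spot-check, you can run a snippet like the one below in the browser console of any page; it mirrors what the automated workflow later extracts. The exact shape of the _satellite object can vary between Launch builds, so treat the property/environment fields as illustrative:

// Manual spot-check in the browser console
if (typeof _satellite === "undefined") {
    console.warn("Adobe Launch (_satellite) is NOT present on this page");
} else {
    // Property and environment details exposed by the Launch library
    console.log("Launch property:", _satellite.property);
    console.log("Launch environment:", _satellite.environment);
}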

Commercial tools like ObservePoint provide enterprise-grade tag auditing, governance, and journey testing. They’re excellent if you need comprehensive rule validation or performance insights. However, if your primary goal is to verify the presence and details of the Adobe Launch implementation across your site, a lighter, more focused solution works perfectly—and costs significantly less.

That’s why I built a fast, automated verification crawler using n8n, the open-source workflow automation tool, combined with Puppeteer for real-browser execution.

The n8n Workflow: Key Features

This scheduled workflow crawls your entire site (or a defined portion), reliably detects Adobe Launch, and exports detailed results to Google Sheets.

  • Accurate Detection with Puppeteer: Instead of parsing HTML or static JavaScript (which can miss indirect or delayed loading), the workflow uses Puppeteer to drive a headless Chrome browser, fully load each page, and check for the existence of the global _satellite object. If present, it safely extracts the property name/ID and environment stage (e.g., production, staging, development).
  • Parallel Scanning for Speed: Scans multiple pages concurrently (configurable batch size, default 5) to significantly reduce runtime, even on sites with thousands of pages.
  • Flexible Scope and Exclusion Rules: Supports regular expressions for including specific URL patterns (scope) and excluding others (e.g., PDFs, external links, or certain sections). This keeps the crawl focused and prevents it from wandering off-site.
  • Configurable Limits: Optional maximum number of pages to scan, preventing overly long runs on large sites.

How the Workflow Works

[Workflow diagram: Adobe Launch scan workflow]
  1. Read Configuration from Google Sheets and Initialise Queue
    • Pulls settings from a dedicated Google Sheet tab:
      • Starting URL
      • Batch size for parallel processing (optional, defaults to 5)
      • Max pages to visit (optional; if omitted, the crawl continues until every page in scope has been visited)
      • Scope regex patterns (to include certain paths), one regex per row in the Google Sheet tab
      • Exclude regex patterns (to skip unwanted sections), one regex per row in the Google Sheet tab
    • Using a Google Sheet for config makes adjustments easy without editing the workflow.
    • Initialises the crawler queue with the starting URL.
  2. Recursive Crawl Loop with Page Inspection
    • In a loop:
      • Takes a batch of unvisited URLs from the queue.
      • Uses Puppeteer in parallel to:
        • Load each page (blocking heavy resources like images/fonts for speed).
        • Execute JavaScript to detect _satellite, extract property and environment details.
        • Extract all internal links from <a> tags.
      • Processes results:
        • Marks pages as visited and stores detection details.
        • Normalises and filters new links (same domain, within scope, not excluded, no duplicates).
        • Adds valid new links to the queue.
      • Continues looping only if there are unvisited pages and the max limit hasn’t been reached.
  3. Export Results to Google Sheets
    • Creates a new sheet named with the run timestamp.
    • Exports a table with columns:
      • URL
      • Visited (YES/NO)
      • Visited At (timestamp)
      • Satellite Exists (YES/NO)
      • Satellite Property ID
      • Satellite Property Name
      • Satellite Environment ID
      • Satellite Environment Stage
    • Easy to review, filter for missing implementations, or track over time (a sketch of the row mapping follows below).
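
The export step itself is a standard Google Sheets node writing rows; as a rough sketch, a Code node along these lines could flatten the crawler queue into one row per URL matching the columns above. The field names assume the queue structure built later in this post, and the nested property/environment keys (id, name, stage) are assumptions about the _satellite object rather than guaranteed fields:

const queue = $getWorkflowStaticData("global").crawlerQueue || [];

// Sketch: one sheet row per queued URL, matching the export columns above
return queue.map((item) => ({
    json: {
        "URL": item.url,
        "Visited": item.visited ? "YES" : "NO",
        "Visited At": item.visitedAt || "",
        "Satellite Exists": item.satelliteExists ? "YES" : "NO",
        "Satellite Property ID": item.satelliteInfo?.property?.id || "",
        "Satellite Property Name": item.satelliteInfo?.property?.name || "",
        "Satellite Environment ID": item.satelliteInfo?.environment?.id || "",
        "Satellite Environment Stage": item.satelliteInfo?.environment?.stage || "",
    },
}));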

Key Code Highlights

Google Sheet config tab

All configuration lives in the config tab of the connected Google Sheet. It has two columns, "attribute" and "value", and supports the following parameters (a sample tab is sketched below):

  • start: the starting URL
  • batchSize: the number of URLs to process in each batch
  • maxVisited: the number of pages to visit before stopping
  • scope: regular expression to match URLs to visit; there can be multiple records of this
  • exclude: regular expression to exclude URLs from the crawler queue; there can be multiple records of this
[Screenshot: Google Sheets config tab]
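
As an illustration, a config tab might contain rows like these (the example.com values are placeholders, not part of the workflow):

attribute    value
start        https://www.example.com/
batchSize    5
maxVisited   500
scope        ^https://www\.example\.com/
exclude      \.pdf$
exclude      /wp-admin/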

Convert Config Format

This node reads the rows from a designated Google‑Sheet tab, where each row defines a configuration attribute and its corresponding value. It then converts those rows into a structured configuration object and stores that object in the workflow’s static (global) data. By placing the configuration in static data, any downstream node can retrieve the settings instantly, without having to re‑query the sheet.

const staticData = $getWorkflowStaticData("global");

const result = {};

if ($input.all().length > 0) {
    for (const item of $input.all()) {
        const attr = item.json.attribute;
        const val = item.json.value?.toString().trim();

        if (attr === "exclude" || attr === "scope") {
            if (!result[attr]) result[attr] = [];
            result[attr].push(val);
        } else {
            result[attr] = val;
        }
    }
}

staticData.config = result;

return [
    {
        json: {
            config: staticData.config,
        },
    },
];
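
For a config tab like the sample shown earlier, the resulting object stored in static data would look roughly like this; note that scope and exclude become arrays, while the other values remain strings exactly as they were read from the sheet:

{
    start: "https://www.example.com/",
    batchSize: "5",
    maxVisited: "500",
    scope: ["^https://www\\.example\\.com/"],
    exclude: ["\\.pdf$", "/wp-admin/"]
}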

Initialize Queue

This node seeds the crawler queue and stores it in the workflow’s global static data.

What is stored? For each web page, the queue holds a minimal record containing:

  • url – the target address,
  • visited – a Boolean flag (with a visitedAt timestamp added once the page has been scanned), and
  • satelliteExists / satelliteInfo – the _satellite detection result and its property/environment details, filled in after the page has been visited.

The queue is lightweight, so keeping it in static data provides a fast, in‑memory reference that all downstream nodes can access without additional I/O. However, global static data isn’t intended for large collections. If the queue grows substantially or you notice performance degradation, consider moving it to an external store (e.g., a database, Redis, or a NoSQL service). Be aware that external look‑ups will add latency to each workflow execution.

An initial implementation used a Google Sheet as the queue source. Although simple, the n8n ↔ Google‑Sheets connection introduced noticeable latency and quickly exhausted the sheet’s API rate limits, making the workflow sluggish.
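
If you do move the queue out of static data, one low-friction approach is to hide the storage behind a pair of helper functions so the rest of the node logic stays unchanged. The sketch below is purely illustrative and the helper names are hypothetical; it still uses static data, but the two function bodies are the only place an external read/write would need to go:

// Hypothetical helpers: centralise queue access so the backend can be swapped
function loadQueue(staticData) {
    // Today: read from n8n global static data.
    // Later: fetch from Redis, a database, etc.
    return staticData.crawlerQueue || [];
}

function saveQueue(staticData, queue) {
    // Today: write back to static data.
    // Later: persist to an external store instead.
    staticData.crawlerQueue = queue;
}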

const staticData = $getWorkflowStaticData("global");

staticData.crawlerQueue = [];

staticData.crawlerQueue.push({
    url: staticData.config.start,
    visited: false,
});

return [
    {
        json: {
            staticData: staticData,
        },
    },
];

Get Batch

Selects the next batch of unvisited URLs from the queued static data and outputs each as a separate item for Puppeteer. By retrieving multiple URLs at once, the workflow can load pages in parallel, speeding up execution.

const staticData = $getWorkflowStaticData("global");
const queue = staticData.crawlerQueue || [];
const batchSize = parseInt(staticData.config?.batchSize, 10) || 5; // sheet values arrive as strings

// Get up to batchSize unvisited URLs
const unvisited = queue.filter((item) => !item.visited);
const batch = unvisited.slice(0, batchSize);

// Return batch as separate items for Puppeteer
return batch.map((item) => ({
    json: {
        url: item.url,
    },
}));

Check Page (Puppeteer)

This node loads a page in Puppeteer, blocks heavyweight resources (images, stylesheets, fonts, media), and waits only for domcontentloaded rather than for the network to go idle. Once the DOM is ready, it:

  • Extracts all outbound links
  • Retrieves the page’s _satellite data (including any environment information)

By limiting resource loading and targeting DOM‑ready, the node dramatically reduces execution time while still delivering the required link and _satellite metadata.

// Block heavy resources
await $page.setRequestInterception(true);
$page.on("request", (req) => {
    const resourceType = req.resourceType();
    if (["image", "stylesheet", "font", "media", "other"].includes(resourceType)) {
        req.abort();
    } else {
        req.continue();
    }
});

// Navigate to the page (already handled by the node, but you can override if needed)
await $page.goto($input.item.json.url, { waitUntil: "domcontentloaded" });

// Run everything in one evaluate for efficiency
const results = await $page.evaluate(() => {
    // Extract all unique outbound URLs from document.links
    const links = Array.from(document.links)
        .map((a) => a.href.trim()) // Get href, clean up
        .filter((href) => href && !href.startsWith("#") && !href.startsWith("javascript:")) // Skip anchors & JS
        .filter((href, index, self) => self.indexOf(href) === index); // Unique only

    // Check if _satellite exists
    const satelliteExists = typeof _satellite !== "undefined";

    // Get properties if exists (safe handling)
    let satelliteInfo = null;
    if (satelliteExists) {
        try {
            satelliteInfo = {
                property: _satellite.property || null,
                environment: _satellite.environment || null,
            };
        } catch (err) {
            satelliteInfo = { error: err.message };
        }
    }

    return {
        urls: links, // Array of all extracted URLs
        satelliteExists, // true/false
        satelliteInfo, // { property: "...", environment: "..." } or null
    };
});

// Return as n8n item (add to your crawler queue later if needed)
return [
    {
        json: {
            currentUrl: $input.item.json.url,
            result: results,
        },
    },
];

Process Page Result

The core node handles the entire post‑Puppeteer processing in a single JavaScript block, which keeps the workflow tidy and reduces overhead compared with chaining many separate nodes.

It first marks the current URL as visited, then evaluates every link discovered on the page. Links are filtered against the configured include/exclude patterns, and a built‑in rule guarantees that only URLs sharing the same hostname as the page are accepted—preventing the crawler from drifting outside the intended scope when no explicit scope is defined. The remaining, unique URLs are added to the pending queue, and the node decides whether the crawl should continue based on the number of items left.

When the workflow is run manually, it also writes a brief status report to the console, showing the total queue size, how many URLs have been visited, and how many are still pending, giving the operator an immediate view of progress.

const { URL } = require("url");

function normalizeUrl(url) {
    if (!url || typeof url !== "string") return "";
    try {
        const u = new URL(url.trim());
        u.hash = "";
        if (u.pathname !== "/" && u.pathname.endsWith("/")) {
            u.pathname = u.pathname.slice(0, -1);
        }
        return (u.origin + u.pathname + u.search).toLowerCase();
    } catch {
        return "";
    }
}

function getHostname(url) {
    try {
        return new URL(url).hostname.toLowerCase();
    } catch {
        return "";
    }
}

// Load queue & config
const queue = $getWorkflowStaticData("global").crawlerQueue || [];
const config = $getWorkflowStaticData("global").config;

// Build scope regexes
const scopeRegexes = (config.scope || [])
    .map((p) => {
        try {
            return new RegExp(p, "i");
        } catch {
            console.log("Invalid scope regex:", p);
            return null;
        }
    })
    .filter(Boolean);

function isScoped(url) {
    // No scope patterns configured: accept everything (the same-hostname filter below still applies)
    if (scopeRegexes.length === 0) return true;
    return scopeRegexes.some((re) => re.test(url));
}

// Build exclude regexes
const excludeRegexes = (config.exclude || [])
    .map((p) => {
        try {
            return new RegExp(p, "i");
        } catch {
            console.log("Invalid exclude regex:", p);
            return null;
        }
    })
    .filter(Boolean);

function isExcluded(url) {
    return excludeRegexes.some((re) => re.test(url));
}

// Get ALL items from the parallel Puppeteer batch
const batchItems = $input.all();
const puppeteerResults = batchItems.map((item) => item.json);

// Process every page in the batch
puppeteerResults.forEach((batchJson) => {
    const currentUrl = batchJson.currentUrl;
    if (!currentUrl) return;

    const currentHostname = getHostname(currentUrl);
    if (!currentHostname) return;

    // Mark current URL as visited + save satellite data
    const normCurrent = normalizeUrl(currentUrl);
    const queueItem = queue.find((q) => normalizeUrl(q.url) === normCurrent);
    if (queueItem) {
        queueItem.visited = true;
        queueItem.visitedAt = new Date().toISOString();
        queueItem.satelliteExists = batchJson.result.satelliteExists ?? null;
        queueItem.satelliteInfo = batchJson.result.satelliteInfo ?? null;
    }

    // Add new URLs from this page
    const extracted = batchJson.result.urls || [];

    const newClean = extracted
        .map((u) => u.trim())
        .filter((u) => /^https?:\/\//i.test(u))
        .filter((u) => !/^(javascript|mailto|tel):/i.test(u))
        .map((u) => normalizeUrl(u))
        .filter((u) => getHostname(u) === currentHostname)
        .filter((u) => isScoped(u)) // Multi-scope support
        .filter((u) => !isExcluded(u)) // Multi-exclude support
        .filter((u, i, a) => a.indexOf(u) === i);

    const existingSet = new Set(queue.map((q) => normalizeUrl(q.url)));
    for (const norm of newClean) {
        if (!existingSet.has(norm)) {
            const orig = extracted.find((o) => normalizeUrl(o) === norm) || norm;
            queue.push({ url: orig, visited: false });
            existingSet.add(norm);
        }
    }
});

// Stats & decide if loop continues
const total = queue.length;
const visited = queue.filter((i) => i.visited).length;
const pending = total - visited;

console.log(`Total: ${total}`, `Visited: ${visited}`, `Queue: ${pending}`);

// Return SINGLE item for "if continue"
let continueCrawling = pending > 0;
const maxVisited = parseInt(config.maxVisited, 10);
if (!isNaN(maxVisited) && visited >= maxVisited) {
    continueCrawling = false;
}

return [
    {
        json: {
            continue: continueCrawling,
            queueStats: { total, visited, pending },
            queue: queue,
        },
    },
];
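
Downstream of this node, an IF node can read the continue flag to decide whether to loop back to "Get Batch" or move on to the export step; a boolean condition on an expression like the following (assuming the item structure returned above) is enough:

{{ $json.continue }}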

Benefits

  • Cost-Effective: Free with self-hosted n8n (or low-cost cloud).
  • Early Detection: Schedule daily/weekly runs to catch issues before they impact significant traffic.
  • Highly Extensible: Add Slack/email alerts, compare runs, or verify other tags (e.g., GTM); a minimal alert filter is sketched below.
  • Full Transparency: Full visibility into the process, with detailed logs and results stored centrally in Google Sheets.
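
For example, a small Code node placed after the crawl could reduce the results to just the problem pages before handing them to a Slack or email node. This is a minimal sketch, assuming the queue structure described above; the "production" check relies on the environment stage field extracted from _satellite and may need adjusting for your setup:

const queue = $getWorkflowStaticData("global").crawlerQueue || [];

// Keep only visited pages where Launch is missing or not running the production environment
const problems = queue.filter(
    (item) =>
        item.visited &&
        (!item.satelliteExists ||
            item.satelliteInfo?.environment?.stage !== "production")
);

// One item per problem page, ready to feed an alerting node
return problems.map((item) => ({
    json: {
        url: item.url,
        satelliteExists: item.satelliteExists || false,
        environmentStage: item.satelliteInfo?.environment?.stage || null,
    },
}));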

This workflow has already caught several near-misses for me during deployments, preventing real data loss. Pairing solid development processes with automated verification is a game-changer for analytics governance.

If you’re responsible for Adobe Launch (or any TMS) and want greater peace of mind, give this n8n approach a try. You may download the workflow JSON for a quick start.

