Session Recording Analysis

Sample failed-conversion session recordings, reconstruct each one as an event sequence, and have the LLM run a fixed set of behavioral checks to surface friction the aggregate detectors can't see. The daily batch is not a step inside the hourly pipeline: it runs on its own schedule and its own queue.

▸ Inputs · analyzes

Failed-conversion sessions that carry frustration signals (rage / dead clicks, JS errors), sampled from PostHog replays
A rolling 7-day window (-7d); falls back to older recordings when recent ones are unavailable (e.g. replay billing limits)
Store context (AOV, conversion rate, device split) for grounding — optional, best-effort

⚙ Process

Sample up to 15 failed sessions (minimum 5, configurable) and reconstruct each session's event sequence
LLM runs 12 standardized behavioral checks over the whole batch
Emit session_recording_friction signals from the failed checks
Daily batch only: then run diagnosis on all undiagnosed signals and dispatch alerts
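
The sampling step above can be sketched as follows. This is a minimal illustration, not the actual sampleFailedConversionRecordings implementation: the Recording shape, the option names, and the helper names are hypothetical. It also assumes that "minimum 5" is the threshold below which the sampler widens beyond the 7-day window, which the text does not spell out.

```typescript
// Hypothetical shape of a replay record; field names are assumptions.
interface Recording {
  sessionId: string;
  startedAt: Date;
  converted: boolean;
  rageClicks: number;
  deadClicks: number;
  jsErrors: number;
}

interface SampleOptions {
  limit: number;      // upper bound on sampled sessions (15 in the text)
  minimum: number;    // assumed: below this, fall back to older recordings (5 in the text)
  windowDays: number; // rolling window length (7 in the text)
}

const DEFAULTS: SampleOptions = { limit: 15, minimum: 5, windowDays: 7 };

// A session qualifies only if it carries at least one frustration signal.
function hasFrustrationSignal(r: Recording): boolean {
  return r.rageClicks > 0 || r.deadClicks > 0 || r.jsErrors > 0;
}

function sampleFailedSessions(
  all: Recording[],
  now: Date,
  opts: SampleOptions = DEFAULTS,
): Recording[] {
  const cutoff = now.getTime() - opts.windowDays * 24 * 60 * 60 * 1000;

  // Failed conversions with frustration signals, newest first.
  const candidates = all
    .filter((r) => !r.converted && hasFrustrationSignal(r))
    .sort((a, b) => b.startedAt.getTime() - a.startedAt.getTime());

  const recent = candidates.filter((r) => r.startedAt.getTime() >= cutoff);

  // Fallback: if the window yields too few sessions, draw from older ones too.
  const pool = recent.length >= opts.minimum ? recent : candidates;
  return pool.slice(0, opts.limit);
}
```

Sorting newest first means the fallback still prefers recent recordings and only reaches for older ones to fill the gap.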

✦ Outputs · generates

session_recording_friction signals → posthog_signals
A per-batch frictionScore (0–100: the percentage of the 12 checks that failed)
Daily batch only: persisted diagnoses + dispatched alerts

Two ways recordings get analyzed

The same sampling + LLM engine is invoked from two places on two schedules. Both paths share code (sampleFailedConversionRecordings → analyzeRecordingBatch) but differ in scope and in what happens after the signals are emitted.

flowchart TB
    subgraph DAILY["Daily batch · own queue, ~2am · store-wide"]
        direction LR
        DCRON["Cloud Scheduler
/recordings/analyze-daily"] --> DSAMPLE["Sample failed
sessions
(-7d, up to 15)"] --> DLLM["LLM: 12
behavioral
checks"] --> DSIG["friction signals
→ posthog_signals"] --> DDIAG["Diagnose
undiagnosed
signals"] --> DALERT["Dispatch
alerts"]
    end
    subgraph HOURLY["Inline Stage 3b · inside the hourly pipeline · top product page only"]
        direction LR
        HTOP["Top product_
performance
signal"] --> HSAMPLE["Sample failed
sessions on
that URL"] --> HLLM["LLM: 12
behavioral
checks"] --> HSIG["friction signals → posthog_signals
cluster with the product signal"]
    end
    DAILY ~~~ HOURLY
    style DAILY fill:#fce4ec,stroke:#E91E63
    style HOURLY fill:#fff3e0,stroke:#FF9800

The daily batch is a self-contained mini-pipeline (signals → diagnosis → alerts). The inline Stage 3b only produces signals, which the hourly pipeline then clusters and diagnoses downstream.

The daily batch flow

The daily batch is enqueued once per project onto the session-recording-analysis queue and runs independently of the hourly pipeline.

flowchart LR
    SAMPLE["sampleFailedConversionRecordings
-7d · limit 15 · min 5
fallback to older replays"]
    CTX["queryStoreMetrics
AOV · CVR · device split"]
    LLM["analyzeRecordingBatch
12 behavioral checks → frictionScore"]
    PERSIST["Persist session_recording_friction
→ posthog_signals"]
    DIAG["diagnoseSignals + persistDiagnoses
(all active undiagnosed)"]
    ALERT["dispatchAlertsForNewDiagnoses"]
    SAMPLE --> LLM
    CTX --> LLM
    LLM --> PERSIST --> DIAG --> ALERT
    style PERSIST fill:#fff3e0,stroke:#FF9800

Errors are captured to Sentry (job_type: daily_batch) and the diagnosis / alert steps are non-blocking.
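
One way to read "non-blocking" is that a failure in diagnosis or alerting is reported but does not fail the job once signals are persisted. The sketch below illustrates that reading only. The BatchDeps interface, the captureError callback (a stand-in for the Sentry call), and the choice to skip alerts when diagnosis fails are all assumptions, not the actual job code.

```typescript
type Step = () => Promise<void>;

// Hypothetical dependency bundle; the real job wires in concrete functions.
interface BatchDeps {
  analyzeAndPersist: Step; // sampling + LLM checks + signal persistence
  diagnose: Step;          // diagnosis over all active undiagnosed signals
  dispatchAlerts: Step;    // alerts for new diagnoses
  captureError: (err: unknown, tags: Record<string, string>) => void; // stand-in for Sentry
}

interface BatchOutcome {
  diagnosed: boolean;
  alerted: boolean;
}

async function runDailyBatch(deps: BatchDeps): Promise<BatchOutcome> {
  const tags = { job_type: "daily_batch" };

  // Core step: a failure here is reported and fails the job.
  try {
    await deps.analyzeAndPersist();
  } catch (err) {
    deps.captureError(err, tags);
    throw err;
  }

  // Follow-up steps: failures are reported but swallowed.
  let diagnosed = false;
  let alerted = false;
  try {
    await deps.diagnose();
    diagnosed = true;
  } catch (err) {
    deps.captureError(err, tags);
  }

  // Assumption: without fresh diagnoses there is nothing new to alert on.
  if (diagnosed) {
    try {
      await deps.dispatchAlerts();
      alerted = true;
    } catch (err) {
      deps.captureError(err, tags);
    }
  }
  return { diagnosed, alerted };
}
```

Under this reading the friction signals survive a diagnosis outage, and the next run can pick them up because diagnosis covers all undiagnosed signals.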

The 12 behavioral checks

Each check returns a pass flag, a quantified pattern (e.g. “5/12 sessions…”), specific session IDs with timestamps, replay URLs, and recommendations. frictionScore is the share of the 12 checks that failed, expressed as a percentage from 0 to 100.

flowchart TB
    subgraph INTERACTION["Interaction"]
        C1["1 · navigationFriction"]
        C2["2 · formInteractionClarity"]
        C3["3 · ctaResponsiveness"]
        C4["4 · variantSelectionFlow"]
        C5["5 · checkoutProgression"]
    end
    subgraph TECHNICAL["Technical (co-observation only)"]
        C6["6 · performanceImpact"]
        C7["7 · errorImpact"]
    end
    subgraph CONTEXT["Context & intent"]
        C8["8 · mobileSpecificIssues"]
        C9["9 · trustBarriers"]
        C10["10 · decisionParalysis"]
        C11["11 · loadingFeedback"]
        C12["12 · exitTrigger"]
    end
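
To make the check output and the score concrete, here is a small sketch. Apart from pass and frictionScore, the field names are assumptions chosen to mirror the description above, and rounding to a whole number is also an assumption.

```typescript
// Illustrative result shape; only `pass` is named in the text.
interface CheckResult {
  check: string;     // e.g. "ctaResponsiveness"
  pass: boolean;
  pattern: string;   // quantified observation across the batch
  evidence: { sessionId: string; timestamp: string; replayUrl: string }[];
  recommendations: string[];
}

// Share of checks that failed, as a 0-100 percentage (rounded; an assumption).
function frictionScore(results: CheckResult[]): number {
  if (results.length === 0) return 0;
  const failed = results.filter((r) => !r.pass).length;
  return Math.round((failed / results.length) * 100);
}
```

With 12 checks and 3 failures, this yields a frictionScore of 25.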

Why it works this way

The sample is drawn from failures only, with no converting sessions to compare against, so anything present in nearly every failed session (classically a high JS-error count) cannot be distinguished from population-wide noise and is not evidence of causation. The technical checks (performanceImpact, errorImpact) are therefore treated as co-observations: the LLM records what it sees but is barred from naming a technical artifact as the primary cause, or emitting a recommendation from it, unless a corresponding measured signal exists in the same cluster. This keeps the most open-ended LLM step in the system from inventing root causes the aggregate detectors never corroborated.
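
The co-observation rule can be expressed as a post-processing gate over the check findings. The sketch below assumes the rule is enforced in code after the LLM responds; the source may instead enforce it in the prompt. The Finding shape, the role labels, and the hasMeasuredSignal callback are hypothetical.

```typescript
// The two checks the text classifies as technical.
const TECHNICAL_CHECKS = new Set<string>(["performanceImpact", "errorImpact"]);

interface Finding {
  check: string;
  recommendations: string[];
}

interface GatedFinding extends Finding {
  role: "causal-candidate" | "co-observation";
}

// `hasMeasuredSignal` answers: does the same cluster contain a measured
// signal corroborating this check? How that lookup works is not specified.
function applyCoObservationRule(
  failedFindings: Finding[],
  hasMeasuredSignal: (check: string) => boolean,
): GatedFinding[] {
  return failedFindings.map((f): GatedFinding => {
    const uncorroborated =
      TECHNICAL_CHECKS.has(f.check) && !hasMeasuredSignal(f.check);
    if (uncorroborated) {
      // Keep the observation, but strip its recommendations and bar it
      // from being named as the primary cause.
      return { ...f, role: "co-observation", recommendations: [] };
    }
    return { ...f, role: "causal-candidate" };
  });
}
```

Non-technical checks pass through untouched; only performanceImpact and errorImpact need corroboration before they can drive a recommendation.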