How to Build a DSAR Pipeline That Actually Deletes Data

Every search result for "DSAR automation" is a vendor page selling you a platform. OneTrust, TrustArc, DataGrail, Transcend — $10K to $75K per year for a dashboard you don't own. None of them explain how the pipeline actually works. This post does. Five infrastructure components, code examples for each, and the hard parts nobody warns you about.

The $1,400 problem

A data subject access request arrives by email. Someone on your legal or privacy team reads it. Then the manual process begins:

  • Identify the requester across every system — email, phone number, account ID, device ID. Is jane@example.com the same person as Jane D. in your CRM, user_4821 in your analytics, and device_abc in your mobile logs?
  • Query every data store where PII might live — your primary database, data warehouse, analytics platform, CRM, email service, CDN logs, backups.
  • Compile results for access requests, or execute deletion for deletion requests — which means different operations in each store.
  • Redact third-party data so you don't accidentally expose someone else's PII in the response.
  • Document everything for the audit trail CalPrivacy expects when they ask how you handle DSARs.

Gartner estimates this costs $1,400 per request when done manually. That's 20 to 40 hours of engineering and legal time per DSAR. Organizations report 60% year-over-year increases in DSAR volume. If you're processing 50 requests per month, that's $840K per year in labor. At 200 per month, $3.36M.

The SaaS platforms solve this by putting a workflow layer on top of your infrastructure. You still have to build the connectors to each data store. You still have to define your deletion logic. You still have to maintain the integrations. The platform just gives you a dashboard and a ticketing system — and charges you $10K to $75K annually for the privilege.

The alternative: build the pipeline as infrastructure you own. Five discrete components, deployed in your cloud, version-controlled like the rest of your stack.

The pipeline architecture

A DSAR fulfillment pipeline has five stages. Each is a standalone infrastructure component with well-defined inputs and outputs. Think of it like a CI/CD pipeline for privacy compliance — each stage is idempotent, auditable, and independently testable.

  • Intake — Capture the request, validate it, classify the type (access, deletion, correction), and route it.
  • Identity resolution — Match the requester to records across every data store. One person, many identifiers.
  • Data discovery — Query every store where PII lives. Compile what you have on this person.
  • Execution — For deletion: soft delete, hard delete, or anonymize depending on the store and your retention policy. For access: compile and format the response.
  • Confirmation & audit — Generate an immutable audit trail, send confirmation to the requester, log everything for regulatory review.
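Wired together, the stages form a queue-driven state machine: each stage finishes by enqueuing the request for the next one. A minimal sketch of the stage ordering (stage names are illustrative — in production each transition would be a queue enqueue, not an in-process loop):

```typescript
// Sketch: the five stages as a queue-driven state machine.
type Stage = "intake" | "identity-resolution" | "discovery" | "execution" | "audit";

const NEXT_STAGE: Record<Stage, Stage | null> = {
  "intake":              "identity-resolution",
  "identity-resolution": "discovery",
  "discovery":           "execution",
  "execution":           "audit",
  "audit":               null,  // terminal stage
};

// Walk a request through every stage in order.
const visited: Stage[] = [];
let stage: Stage | null = "intake";
while (stage !== null) {
  visited.push(stage);
  stage = NEXT_STAGE[stage];
}
// visited now lists all five stages, intake through audit
```

The explicit transition table is the point: adding a stage (say, a human-review gate between discovery and execution) is a one-line change, and every transition is a place to write an audit record.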

Stage 1: Request intake

The intake endpoint receives DSAR submissions, validates the request type against CCPA categories, and creates a tracked request with a unique ID. This is the entry point — every request gets a correlation ID that follows it through the entire pipeline.

import { Router } from "express";
import { v4 as uuid } from "uuid";
import { dsarQueue } from "./queue";
import { auditLog } from "./audit";

const router = Router();

const VALID_TYPES = ["access", "deletion", "correction", "opt-out"];

// ISO timestamp N calendar days from now — used for the CCPA deadline
function deadlineFromNow(days: number): string {
  const d = new Date();
  d.setDate(d.getDate() + days);
  return d.toISOString();
}

router.post("/api/dsar", async (req, res) => {
  const { email, requestType, firstName, lastName } = req.body;

  // Validate request type against CCPA categories
  if (!VALID_TYPES.includes(requestType)) {
    return res.status(400).json({ error: "Invalid request type" });
  }

  const request = {
    id:          uuid(),
    email,
    requestType,
    firstName,
    lastName,
    status:      "received",
    receivedAt:  new Date().toISOString(),
    deadline:    deadlineFromNow(45),  // CCPA: 45 calendar days
  };

  await dsarQueue.enqueue("identity-resolution", request);
  await auditLog.write({
    requestId:  request.id,
    event:      "request_received",
    details:    { requestType, email },
    timestamp:  request.receivedAt,
  });

  res.status(201).json({ requestId: request.id, status: "received" });
});

Two things to note. First, the 45-day deadline is computed at intake and attached to the request object. Every downstream stage checks this. Second, the audit log write happens before the response — if you crash after acknowledging the request but before logging it, you have a compliance gap.

Stage 2: Identity resolution

This is the hardest stage. A single person might exist as an email address in your primary database, a phone number in your CRM, a device ID in your analytics, and a hashed identifier in your ad platform. You need to resolve all of these to a single canonical identity before you can discover and delete their data.

interface IdentityMatch {
  store:       string;
  identifiers: Record<string, string>;
  confidence:  "exact" | "probable" | "possible";
}

async function resolveIdentity(request: DSARRequest): Promise<IdentityMatch[]> {
  const matchers = [
    { store: "postgres",  fn: matchPostgres },
    { store: "hubspot",   fn: matchHubspot },
    { store: "segment",   fn: matchSegment },
    { store: "mixpanel",  fn: matchMixpanel },
    { store: "s3-logs",   fn: matchS3Logs },
    { store: "dynamodb",  fn: matchDynamoDB },
  ];

  const results = await Promise.allSettled(
    matchers.map(async ({ store, fn }) => {
      const ids = await fn(request.email, request.firstName, request.lastName);
      return ids.map(id => ({ store, ...id }));
    })
  );

  const matches: IdentityMatch[] = results
    .filter((r): r is PromiseFulfilledResult<IdentityMatch[]> => r.status === "fulfilled")
    .flatMap(r => r.value);

  // Record failed stores too — they get flagged for retry, not silently dropped
  const failedStores = matchers
    .filter((_, i) => results[i].status === "rejected")
    .map(m => m.store);

  await auditLog.write({
    requestId: request.id,
    event:     "identity_resolved",
    details:   {
      storesSearched: matchers.length,
      matchesFound:   matches.length,
      stores:         matches.map(m => m.store),
      failedStores,
    },
  });

  return matches;
}

Promise.allSettled is critical here. If one data store is down or slow, you still process the others. You don't want a flaky HubSpot API call to block the entire pipeline. Log the failures, flag them for retry, and continue.

The confidence levels matter for audit purposes. An exact email match in Postgres is different from a probable name + ZIP match in an analytics export. You execute deletion on exact and probable matches. Possible matches get flagged for human review.
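That routing rule reduces to a small pure function. A sketch, reusing the IdentityMatch shape defined above:

```typescript
// Sketch: route matches by confidence — execute on exact/probable,
// hold possible matches for human review.
interface IdentityMatch {
  store:       string;
  identifiers: Record<string, string>;
  confidence:  "exact" | "probable" | "possible";
}

function routeMatches(matches: IdentityMatch[]) {
  const execute = matches.filter(m => m.confidence !== "possible");
  const review  = matches.filter(m => m.confidence === "possible");
  return { execute, review };
}

const { execute, review } = routeMatches([
  { store: "postgres", identifiers: { userId: "4821" },  confidence: "exact" },
  { store: "mixpanel", identifiers: { name: "Jane D." }, confidence: "possible" },
]);
// execute → the Postgres match; review → the Mixpanel match
```

Keeping this as one function means the routing policy is testable in isolation — and when legal changes the policy (say, probable matches also require review), it changes in exactly one place.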

Stage 3: Data discovery

Once you know where the person exists, you need to discover what data you have on them. For access requests, this is the response payload. For deletion requests, this is the scope of what needs to be deleted.

Data discovery is store-specific. A Postgres query is different from an S3 prefix scan, which is different from a DynamoDB query. Each store needs its own discovery adapter that returns a normalized result.

interface DiscoveryResult {
  store:      string;
  recordType: string;
  recordId:   string;
  fields:     string[];    // which fields contain PII
  size:       number;      // bytes, for progress tracking
}

async function discoverData(
  matches: IdentityMatch[],
  request: DSARRequest
): Promise<DiscoveryResult[]> {
  const adapters: Record<string, DiscoveryAdapter> = {
    "postgres":  new PostgresDiscovery(pgPool),
    "dynamodb":  new DynamoDiscovery(dynamoClient),
    "s3-logs":   new S3Discovery(s3Client, "logs-bucket"),
    "hubspot":   new HubSpotDiscovery(hubspotClient),
    "segment":   new SegmentDiscovery(segmentClient),
    "mixpanel":  new MixpanelDiscovery(mixpanelClient),
  };

  const results: DiscoveryResult[] = [];

  for (const match of matches) {
    const adapter = adapters[match.store];
    if (!adapter) continue;

    const records = await adapter.discover(match.identifiers);
    results.push(...records);
  }

  await auditLog.write({
    requestId: request.id,
    event:     "data_discovered",
    details:   {
      totalRecords: results.length,
      byStore:      groupBy(results, "store"),
    },
  });

  return results;
}

The discovery phase produces a manifest: here is every record, in every store, that contains PII belonging to this person. For access requests, this manifest becomes the response. For deletion requests, it becomes the execution plan.
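The DiscoveryAdapter contract the orchestrator relies on isn't shown above. A minimal version might look like this — the InMemoryDiscovery class is a stand-in for illustration, not a real store client:

```typescript
interface DiscoveryResult {
  store:      string;
  recordType: string;
  recordId:   string;
  fields:     string[];
  size:       number;
}

// The shared adapter contract: identifiers in, normalized results out.
interface DiscoveryAdapter {
  discover(identifiers: Record<string, string>): Promise<DiscoveryResult[]>;
}

// In-memory adapter standing in for a real store client (Postgres, S3, ...).
class InMemoryDiscovery implements DiscoveryAdapter {
  constructor(
    private store: string,
    private rows: Array<{ id: string; email: string; piiFields: string[] }>
  ) {}

  async discover(ids: Record<string, string>): Promise<DiscoveryResult[]> {
    return this.rows
      .filter(r => r.email === ids.email)
      .map(r => ({
        store:      this.store,
        recordType: "user_row",
        recordId:   r.id,
        fields:     r.piiFields,
        size:       JSON.stringify(r).length,
      }));
  }
}
```

The normalized DiscoveryResult is what makes the rest of the pipeline store-agnostic: execution and verification only ever see the manifest, never a raw Postgres row or S3 object.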

Stage 4: The deletion orchestrator

This is where DSAR pipelines get interesting. "Delete" means different things in different data stores. You can't run DELETE FROM users WHERE id = ? everywhere and call it done.

interface DeletionStrategy {
  store:    string;
  method:   "hard_delete" | "soft_delete" | "anonymize" | "ttl" | "lifecycle";
  execute:  (identifiers: Record<string, string>) => Promise<DeletionResult>;
}

const strategies: DeletionStrategy[] = [
  {
    store:  "postgres",
    method: "hard_delete",
    async execute(ids) {
      const tx = await pgPool.connect();
      try {
        await tx.query("BEGIN");
        // Count actual rows deleted, not the number of statements
        const r1 = await tx.query("DELETE FROM user_profiles WHERE user_id = $1", [ids.userId]);
        const r2 = await tx.query("DELETE FROM user_events WHERE user_id = $1", [ids.userId]);
        const r3 = await tx.query("DELETE FROM user_preferences WHERE user_id = $1", [ids.userId]);
        await tx.query("COMMIT");
        return { success: true, rowsDeleted: r1.rowCount + r2.rowCount + r3.rowCount };
      } catch (err) {
        await tx.query("ROLLBACK");
        throw err;
      } finally {
        tx.release();
      }
    },
  },
  {
    store:  "s3-logs",
    method: "lifecycle",
    async execute(ids) {
      // Tag objects for lifecycle deletion — S3 doesn't do instant bulk delete well.
      // Paginate: listObjectsV2 returns at most 1,000 keys per call, and a DSAR
      // pipeline can't afford to silently miss objects beyond the first page.
      let tagged = 0;
      let token: string | undefined;
      do {
        const page = await s3.listObjectsV2({
          Bucket: "logs-bucket",
          Prefix: `users/${ids.userId}/`,
          ContinuationToken: token,
        }).promise();

        for (const obj of page.Contents || []) {
          await s3.putObjectTagging({
            Bucket: "logs-bucket",
            Key:    obj.Key,
            Tagging: { TagSet: [{ Key: "dsar-delete", Value: "true" }] },
          }).promise();
          tagged++;
        }
        token = page.NextContinuationToken;
      } while (token);
      // S3 lifecycle rule handles actual deletion within 24 hours
      return { success: true, objectsTagged: tagged };
    },
  },
  {
    store:  "dynamodb",
    method: "ttl",
    async execute(ids) {
      // Set TTL to expire records — DynamoDB handles actual deletion
      await dynamoClient.update({
        TableName: "user_sessions",
        Key: { userId: ids.userId },
        UpdateExpression: "SET #ttl = :ttl, #status = :status",
        ExpressionAttributeNames: { "#ttl": "expiresAt", "#status": "deletionStatus" },
        ExpressionAttributeValues: {
          ":ttl":    Math.floor(Date.now() / 1000),  // eligible for expiry now (DynamoDB typically purges within ~48 hours)
          ":status": "pending_deletion",
        },
      }).promise();
      return { success: true, method: "ttl_expiration" };
    },
  },
  {
    store:  "redis",
    method: "hard_delete",
    async execute(ids) {
      // SCAN, not KEYS — KEYS blocks the Redis server on large keyspaces
      let cursor = "0", keysDeleted = 0;
      do {
        const [next, keys] = await redis.scan(cursor, "MATCH", `user:${ids.userId}:*`, "COUNT", "1000");
        cursor = next;
        if (keys.length > 0) keysDeleted += await redis.del(...keys);
      } while (cursor !== "0");
      return { success: true, keysDeleted };
    },
  },
];

Each strategy encapsulates how deletion works for that specific store. Postgres gets a transaction with hard deletes. S3 gets lifecycle tagging because bulk deletion of objects across prefixes is slow and expensive — let the lifecycle rule handle it. DynamoDB uses TTL expiration. Redis gets a key scan and flush.

The orchestrator runs all strategies, collects results, and logs the outcome:

async function executeDeletion(
  request: DSARRequest,
  matches: IdentityMatch[]
): Promise<DeletionReport> {
  const results: DeletionResult[] = [];

  for (const strategy of strategies) {
    const match = matches.find(m => m.store === strategy.store);
    if (!match) continue;

    try {
      const result = await strategy.execute(match.identifiers);
      results.push({ store: strategy.store, ...result });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ store: strategy.store, success: false, error: message });
      // Don't throw — continue with other stores, flag for retry
    }
  }

  await auditLog.write({
    requestId: request.id,
    event:     "deletion_executed",
    details:   { results },
  });

  return { requestId: request.id, results, completedAt: new Date().toISOString() };
}

Failures don't stop the pipeline. If Postgres deletion succeeds but the HubSpot API is down, you record the partial completion and retry the failed stores. The audit trail shows exactly what succeeded and what needs another pass.
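Driving that retry pass starts with extracting the failed stores from the report. A sketch, assuming the DeletionResult shape the orchestrator produces:

```typescript
// Sketch: pull failed stores out of a deletion report so they can be re-enqueued.
interface DeletionResult {
  store:   string;
  success: boolean;
  error?:  string;
}

// Returns the stores that need another pass.
function storesToRetry(results: DeletionResult[]): string[] {
  return results.filter(r => !r.success).map(r => r.store);
}

const pending = storesToRetry([
  { store: "postgres", success: true },
  { store: "hubspot",  success: false, error: "503 Service Unavailable" },
]);
// pending → ["hubspot"]
```

A retry worker would re-run only the strategies for these stores; because each strategy is scoped to one store and the operations are idempotent (deleting already-deleted rows is a no-op), replaying is safe.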

Stage 5: Confirmation and audit trail

After execution, two things happen: the requester gets confirmation, and you generate an immutable audit record. The audit trail is what CalPrivacy asks for when they audit your DSAR process. It needs to show the complete lifecycle of every request.

interface AuditRecord {
  requestId:  string;
  event:      string;
  timestamp:  string;
  details:    Record<string, any>;
  hash:       string;     // SHA-256 of previous record + this record
  prevHash:   string;     // chain integrity
}

class ImmutableAuditLog {
  private lastHash: string = "";

  async write(entry: Omit<AuditRecord, "hash" | "prevHash">) {
    // Apply the timestamp default before hashing, so the stored record
    // matches the exact bytes that were hashed
    const base = { ...entry, timestamp: entry.timestamp || new Date().toISOString() };
    const record: AuditRecord = {
      ...base,
      prevHash:  this.lastHash,
      hash:      await this.computeHash(base, this.lastHash),
    };

    // Write to append-only store (S3 + DynamoDB, or dedicated audit DB)
    await this.persist(record);
    this.lastHash = record.hash;
  }

  private async computeHash(entry: any, prevHash: string): Promise<string> {
    const content = JSON.stringify({ ...entry, prevHash });
    const encoder = new TextEncoder();
    const data = encoder.encode(content);
    const hashBuffer = await crypto.subtle.digest("SHA-256", data);
    return Array.from(new Uint8Array(hashBuffer))
      .map(b => b.toString(16).padStart(2, "0"))
      .join("");
  }
}

The hash chain provides tamper evidence. Each record includes the hash of the previous record. If anyone modifies a historical audit entry, the chain breaks. This isn't blockchain — it's a plain hash chain, and it's the standard pattern for immutable audit logs in regulated industries.
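Tamper evidence is only useful if you actually check it. A verification pass walks the chain, recomputes each hash, and confirms the links. A sketch using Node's crypto module, mirroring the AuditRecord shape above:

```typescript
import { createHash } from "node:crypto";

// Record shape mirrors AuditRecord from the audit log above.
interface ChainRecord {
  requestId: string;
  event:     string;
  timestamp: string;
  details:   Record<string, unknown>;
  prevHash:  string;
  hash:      string;
}

// Recompute a record's hash from its content + prevHash.
function hashRecord(r: Omit<ChainRecord, "hash">): string {
  const { prevHash, ...entry } = r;
  return createHash("sha256")
    .update(JSON.stringify({ ...entry, prevHash }))
    .digest("hex");
}

// Returns the index of the first bad record, or -1 if the chain is intact.
function verifyChain(records: ChainRecord[]): number {
  let expectedPrev = "";
  for (let i = 0; i < records.length; i++) {
    const r = records[i];
    if (r.prevHash !== expectedPrev) return i;  // broken link
    const { hash, ...rest } = r;
    if (hashRecord(rest) !== hash) return i;    // tampered content
    expectedPrev = hash;
  }
  return -1;
}
```

One caveat: JSON.stringify is key-order-sensitive, so production code should canonicalize property order before hashing, or the verifier will flag records that were merely re-serialized.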

Verification queries

After deletion, run verification queries to confirm the data is actually gone. This is the step most pipelines skip, and it's the one CalPrivacy cares most about.

async function verifyDeletion(
  request: DSARRequest,
  matches: IdentityMatch[]
): Promise<VerificationReport> {
  const failures: string[] = [];

  for (const match of matches) {
    const adapter = discoveryAdapters[match.store];
    if (!adapter) continue;  // no adapter registered for this store

    const remaining = await adapter.discover(match.identifiers);

    if (remaining.length > 0) {
      failures.push(
        `${match.store}: ${remaining.length} records still present`
      );
    }
  }

  const verified = failures.length === 0;

  await auditLog.write({
    requestId: request.id,
    event:     verified ? "deletion_verified" : "deletion_incomplete",
    details:   { verified, failures },
  });

  if (verified) {
    await sendConfirmation(request.email, request.id);
  }

  return { requestId: request.id, verified, failures };
}

If verification fails, the pipeline flags the request for manual review and retry. The 45-day clock is still ticking. You need to know about failures immediately, not when the regulator asks.

The hard parts

The five-stage pipeline handles the straightforward case: structured data in known stores with clear identifiers. Production reality is messier. Here are the problems that make DSAR fulfillment genuinely hard.

Backups and snapshots

You deleted the user from production Postgres. But your automated database snapshots from 3 weeks ago still contain their data. Your data warehouse ETL ran yesterday and copied their records into your analytics tables. Your Elasticsearch cluster has a 30-day index rotation and their search queries are still in last week's index.

CCPA doesn't require you to delete from backups immediately, but it does require that backup data not be restored into production in an identifiable form. Your pipeline needs a "deletion registry" — a list of identifiers that have been deleted. When you restore a backup, the restoration process checks the registry and scrubs those records before they hit production.
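A minimal deletion registry looks like this — a sketch with an in-memory set standing in for what would be a durable table in production:

```typescript
// Sketch: a deletion registry consulted during backup restores.
// Identifiers of deleted subjects go in at deletion time; the restore
// path scrubs matching records before they reach production.
class DeletionRegistry {
  private deleted = new Set<string>();

  markDeleted(userId: string) {
    this.deleted.add(userId);
  }

  // Filter a batch of restored records, dropping anyone in the registry.
  scrub<T extends { userId: string }>(records: T[]): T[] {
    return records.filter(r => !this.deleted.has(r.userId));
  }
}

const registry = new DeletionRegistry();
registry.markDeleted("user_4821");

const restored = registry.scrub([
  { userId: "user_4821", email: "jane@example.com" },
  { userId: "user_9001", email: "other@example.com" },
]);
// restored contains only user_9001
```

The registry must outlive the backups it guards: if your oldest snapshot is 90 days, the registry entries need to survive at least that long, and the scrub step needs to be a mandatory gate in every restore runbook.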

Third-party vendor propagation

Your data doesn't just live in your infrastructure. It's in Segment, Mixpanel, HubSpot, Intercom, Google Analytics, your ad platforms, your email service provider, and every vendor you've ever piped data to. Under CCPA, when a consumer requests deletion, you have to propagate that request to every third party who received their data.

Each vendor has a different deletion API, different rate limits, different processing times, and different definitions of "deleted." Some offer self-serve APIs. Some require you to email their privacy team. Some don't have a deletion endpoint at all and you need to work through their support process.

The pipeline needs a vendor registry: every third party that receives consumer data, their deletion mechanism, their SLA, and a status tracker per DSAR request.
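The registry can start as a typed table in code. A sketch — the vendor names, mechanisms, and SLAs below are illustrative, not real contractual terms:

```typescript
// Sketch: vendor registry with per-request propagation tracking.
type DeletionMechanism = "api" | "email" | "support_ticket";
type PropagationStatus = "pending" | "submitted" | "confirmed" | "failed";

interface Vendor {
  name:      string;
  mechanism: DeletionMechanism;
  slaDays:   number;  // vendor's stated processing time
}

const VENDORS: Vendor[] = [
  { name: "segment",    mechanism: "api",   slaDays: 7 },
  { name: "mixpanel",   mechanism: "api",   slaDays: 30 },
  { name: "legacy-esp", mechanism: "email", slaDays: 14 },  // hypothetical vendor
];

// One status row per vendor per DSAR request.
function initPropagation(requestId: string) {
  return VENDORS.map(v => ({
    requestId,
    vendor: v.name,
    status: "pending" as PropagationStatus,
    dueBy:  new Date(Date.now() + v.slaDays * 86_400_000).toISOString(),
  }));
}
```

Every row has its own due date derived from the vendor's SLA, so the deadline-tracking stage can escalate a stuck vendor propagation the same way it escalates a stuck internal deletion.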

Derived and aggregate data

You trained an ML model on user behavior data. That model now contains learned patterns that include this user's data. Do you retrain? CCPA is ambiguous here, but the safe answer is: if the data was used in training and the model could be influenced by it, document it in the audit trail and have a legal position ready.

Aggregate analytics are cleaner. If you have "47% of users clicked this button" and the user contributed to that stat, you don't need to recompute it. Aggregated, de-identified data is generally exempt. But cached copies of their profile in Redis, CDN-cached pages that display their public data, search indexes that contain their records — those all need to be purged.

The 45-day clock

CCPA gives you 45 calendar days to fulfill a DSAR, with a possible 45-day extension if you notify the consumer. The clock starts when the request is received, not when you start processing it. If your pipeline takes 3 days to process but your intake queue has a 2-week backlog, you've burned half your runway before the first line of code executes.

The pipeline needs deadline tracking at every stage. If a request is approaching its deadline, it escalates — Slack alert, PagerDuty, email to the DPO. An overdue DSAR is a violation, and violations under CCPA are $2,663 per consumer, per incident (or $7,988 for intentional violations).
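The deadline check itself is simple date math. A sketch — the escalation thresholds here are illustrative, not regulatory:

```typescript
// Sketch: compute days remaining on the 45-day clock and an escalation level.
type Escalation = "none" | "warn" | "page" | "overdue";

function daysRemaining(deadline: string, now: Date = new Date()): number {
  return Math.floor((new Date(deadline).getTime() - now.getTime()) / 86_400_000);
}

function escalationLevel(deadline: string, now: Date = new Date()): Escalation {
  const days = daysRemaining(deadline, now);
  if (days < 0)   return "overdue";  // violation territory — all hands
  if (days <= 3)  return "page";     // PagerDuty
  if (days <= 10) return "warn";     // Slack alert to the DPO
  return "none";
}
```

A scheduled job runs this against every open request; because the deadline was stamped onto the request object at intake, the check needs no extra lookups.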

Build vs. buy

Not every company should build this pipeline from scratch. Here's the honest breakdown.

Build your own pipeline if:

  • You have more than 3 data stores where PII lives. The more stores, the more custom the deletion logic. SaaS platforms give you generic connectors; you'll spend just as much time configuring them as building your own.
  • You want to own your compliance infrastructure. When CalPrivacy audits you, "we use OneTrust" is not a defense. You need to demonstrate that deletion actually happened in your systems. Owning the pipeline means owning the audit trail.
  • You have engineers available. The initial build is 2-4 weeks of focused engineering time. Ongoing maintenance is minimal — you're adding new stores and updating vendor APIs, not rewriting the core pipeline.
  • You process more than 50 DSARs per month. At $1,400 per manual request, 50 per month is $840K per year. A purpose-built pipeline drops that to near zero marginal cost per request.

Use a SaaS platform if:

  • You have fewer than 3 data stores and your deletion logic is simple (one database, one CRM, done).
  • You need a solution this week and don't have engineering bandwidth. A platform can be operational in days; a custom pipeline takes weeks.
  • Your volume is low enough that the per-request cost of manual processing is less than the platform subscription.

The middle ground is where most mid-market companies land: too complex for a simple SaaS platform, not enough resources for a full internal build. That's where infrastructure-as-code compliance services make sense — someone builds the pipeline, deploys it in your cloud, and hands you the keys. You own the infrastructure. They do the engineering.

The pipeline is infrastructure, not a dashboard

A DSAR fulfillment pipeline is not a SaaS product you log into. It's infrastructure — API endpoints, queue workers, database adapters, vendor integrations, audit logs. It runs in your cloud, processes requests against your data stores, and produces audit trails that prove compliance.

The code examples in this post are a starting point. Your pipeline will be different because your data stores are different, your vendor stack is different, and your retention policies are different. But the architecture is the same: intake, identity resolution, discovery, execution, audit. Five stages, each independently testable, each producing an audit record.

Build it like you'd build any other critical infrastructure: version-controlled, tested, monitored, and owned by your team.

// Free CCPA gap assessment — we'll map your data stores, identify where PII lives, and scope the DSAR pipeline. 60 minutes, 48-hour gap report.