How a Vulnerability Deduplication Engine Works

Run two scanners against the same web application and you will likely see the same SQL injection reported twice, once by each tool, with two different descriptions and two different internal references. Run the same scanner again next week and you will see the same finding a third time. Multiply that across thousands of assets, a dozen scan profiles, and a handful of scanner vendors, and the result is a finding list that grows faster than anyone can remediate. The signal is buried in repetition.

A correlation and deduplication engine exists to solve exactly this. Its job is narrow and mechanical. For every result that arrives from a scanner, it decides one thing: is this a vulnerability the platform already knows about, or is it genuinely new? Everything downstream depends on that answer being correct and consistent.

This article walks through how such an engine works under the hood. It uses PMAP’s correlation engine as the worked example, because its behavior is well-defined and can be described concretely rather than in the abstract. By the end you should understand the two-stage lookup strategy, the four-case pipeline that turns a lookup into a decision, how a stable fingerprint is built, and why normalization has to come first. If you operate a multi-scanner environment and want to understand the mechanics before you evaluate any tool, this is the layer that matters.

For the broader context of how scanners are scheduled, imported, and reconciled across vendors, see the pillar guide on multi-vendor scan orchestration. This article zooms in on a single component of that larger pipeline.

What a Correlation and Deduplication Engine Actually Does

It helps to be precise about scope, because correlation engines are often described in vague terms.

A correlation engine is not a scanner. It does not connect to assets, send probes, or parse vendor file formats. It does not own a user interface, and in PMAP it has no HTTP routes of its own and is never mounted directly in the server. It is a pure engine. By the time a result reaches it, the scanner-specific parsing is already done and the result has been turned into a normalized request the engine can reason about.

Within that boundary the engine answers a single question for every inbound scanner result: does this vulnerability already exist in the platform, or is it new? That sounds simple, but the consequences of getting it wrong are significant in both directions. If the engine treats a recurring vulnerability as new, you accumulate duplicates and lose the history of when the issue first appeared. If it treats two genuinely different vulnerabilities as the same, you silently merge them and one of them disappears from view. The engine has to thread that needle thousands of times per scan, deterministically, with no human in the loop.

To do that, the engine combines two responsibilities that are sometimes split across separate components. The first is deduplication: recognizing that an inbound result corresponds to an existing finding. The second is normalization: reducing noisy, inconsistent scanner output to a stable canonical form so that the deduplication can be reliable. In PMAP these live in one engine, which matters, because normalization is what makes deduplication trustworthy. We will return to that relationship later.

When the engine reaches a decision, it does not just return a label and walk away. It writes the outcome through the finding repository, updates cross-scan visibility counters, and, on certain outcomes, triggers rule evaluation so that severity and status policies are applied at the moment of ingest. The decision and its side effects happen together in one pipeline step.

Why Duplicate Findings Pile Up Without One

To appreciate what the engine prevents, picture the ingest path without it.

Every scanner importer in PMAP calls the engine once for every result it emits. That includes the connectors for Nessus, Qualys, Rapid7, Acunetix, Invicti, Nuclei, SonarQube, and TenableSC, as well as the generic integration connector. This per-result call is the key fact. Correlation is not an occasional cleanup pass that runs nightly. It sits directly in the hot path of import, invoked per result, every time.

Remove the engine from that path and each importer would write a fresh finding for every result it parses. The same authentication bypass on the same host would be written once per scan run. A weekly scan schedule would produce fifty-two copies of an unremediated issue over a year. Two vendors covering overlapping scope would each contribute their own copy of every shared vulnerability. None of these copies would know about the others, so an analyst marking one as a false positive would leave the rest untouched, and the SLA clock on the real issue would be impossible to track because it would be smeared across many records.

This is why deduplication has to happen at ingest rather than as an afterthought. Once duplicates are written, untangling them is expensive and error-prone, because you have to reverse-engineer which records describe the same underlying problem. Deciding correctly at write time, before the duplicate ever exists, is far cheaper. The engine’s position in the pipeline, upstream of every finding write and downstream of every scanner connector, is what makes that possible.

The Two-Stage Lookup Strategy

Before the engine can decide what to do with a result, it has to look for a match. It does this in two stages, in a deliberate order, because the two stages answer slightly different questions.

The first stage asks: have I seen this exact result from this exact scanner before? The second stage asks: do I already have a finding that describes this same vulnerability, regardless of which scanner reported it? The first is a precise, identity-based check. The second is a content-based check that can bridge across scanners. Trying the precise check first, then falling back to the content-based check, gives the engine both idempotency for repeated scans and cross-scanner deduplication for overlapping coverage.

Stage One, Scanner Reference Matching

Most scanners assign their own stable identifier to a result. The engine captures this as a scanner reference key, a value like nessus:1234:plugin:192.168.1.1 that encodes the scanner, the scan or plugin, and the target. When a result carries a non-empty scanner reference, the engine’s first move is to look up an existing finding by that exact key.

This stage is what makes re-ingestion idempotent. Import the same scan file twice, or re-run the same scheduled scan, and every result carries the same scanner reference it did before. The lookup finds the existing finding immediately, the engine refreshes it rather than creating a copy, and the second import produces no duplicates. Scanner reference matching takes absolute priority for this reason. When a scanner gives you a reliable identity for its own results, honoring that identity first is the cleanest way to keep repeated runs stable.

The limitation is built into the strength. A scanner reference is, by definition, specific to one scanner. Nessus and Qualys will never produce the same reference for the same vulnerability, because the format encodes vendor-specific identifiers. So scanner reference matching alone can deduplicate a tool against itself, but it cannot recognize that two different tools found the same issue. That is what the second stage is for.
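The idempotency property of stage one can be sketched in a few lines of Python. This is a minimal illustration, not PMAP's implementation: the `Finding`, `FindingStore`, and `import_scan` names are hypothetical, and the in-memory dictionary stands in for whatever index the real finding repository uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Finding:
    scanner_ref: str
    title: str
    times_seen: int = 1


@dataclass
class FindingStore:
    """In-memory stand-in for a finding repository (hypothetical)."""
    findings: List[Finding] = field(default_factory=list)
    by_ref: Dict[str, Finding] = field(default_factory=dict)

    def find_by_scanner_ref(self, scanner_ref: str) -> Optional[Finding]:
        # An empty reference never matches, so stage one is skipped for it.
        if not scanner_ref:
            return None
        return self.by_ref.get(scanner_ref)

    def add(self, finding: Finding) -> None:
        self.findings.append(finding)
        if finding.scanner_ref:
            self.by_ref[finding.scanner_ref] = finding


def import_scan(store: FindingStore, results: List[Tuple[str, str]]) -> int:
    """Import (scanner_ref, title) pairs and return how many findings were created."""
    created = 0
    for scanner_ref, title in results:
        existing = store.find_by_scanner_ref(scanner_ref)
        if existing is not None:
            # Stage-one hit: refresh the existing finding instead of copying it.
            existing.title = title
            existing.times_seen += 1
            continue
        # Stage-one miss. A real engine would try the fingerprint stage next;
        # this sketch simply creates the finding.
        store.add(Finding(scanner_ref=scanner_ref, title=title))
        created += 1
    return created
```

Importing the same list of results twice creates findings only on the first pass. The second pass finds every reference already indexed and increments `times_seen` instead.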

Stage Two, SHA-1 Fingerprint Matching

When scanner reference matching does not produce a hit, or when a result arrives without a scanner reference at all, the engine falls back to a content-based fingerprint. This is a SHA-1 hash derived from the substance of the vulnerability rather than from any vendor identifier. We will cover exactly how it is constructed in a later section. For now the important property is that two results describing the same vulnerability on the same asset produce the same fingerprint, even if they came from different scanners.

The fingerprint lookup has one more characteristic that matters. It searches across findings in any status, including findings that have already been closed; terminal states are not filtered out. That deliberate choice is what makes the reopen behavior possible, because a recurring vulnerability has to be able to match a finding that was previously resolved. If the fingerprint lookup ignored closed findings, a vulnerability that came back after remediation would look brand new, and you would lose the connection to its history.

Together the two stages form a priority order: identity first, content second. Both can match the same finding, and when they do, the scanner reference takes precedence as the more specific signal. This ordering is not arbitrary. It puts the most reliable signal first and uses the broader signal as a safety net.
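The priority order can be sketched as a single lookup function. This is an illustrative Python sketch under stated assumptions: the `Finding` fields, the `find_match` name, and the status strings are hypothetical, and scoping the scanner reference lookup to a company is an assumption (the article only states company scoping for the fingerprint match). A linear scan stands in for indexed repository queries.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Finding:
    company_id: str
    scanner_ref: str
    fingerprint: str
    status: str  # "open" or "closed"; the status names are assumptions


def find_match(
    findings: List[Finding],
    company_id: str,
    scanner_ref: str,
    fingerprint: str,
) -> Tuple[Optional[Finding], str]:
    """Return (matched finding or None, which stage matched)."""
    # Stage one: exact scanner reference, only when the result carries one.
    if scanner_ref:
        for f in findings:
            if f.company_id == company_id and f.scanner_ref == scanner_ref:
                return f, "scanner_ref"
    # Stage two: content fingerprint, deliberately NOT filtered by status,
    # so closed findings remain matchable.
    for f in findings:
        if f.company_id == company_id and f.fingerprint == fingerprint:
            return f, "fingerprint"
    return None, "none"
```

A result from a second scanner carries a different reference, so stage one misses and stage two matches on the shared fingerprint. A result with no reference at all goes straight to stage two.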

The Four-Case Deduplication Pipeline

The two-stage lookup tells the engine whether a matching finding exists and, if so, what state it is in. The pipeline turns that lookup result into an action. There are exactly four outcomes, and every inbound result resolves to one of them.

The first case is the open update. The scanner reference lookup found an existing finding, and that finding is open, meaning it is not in a closed state. The engine refreshes the finding’s fields, its title, severity, fingerprint, and scanner source, from the new result and records that the vulnerability was seen again. No new finding is created. This is the common path for an unremediated issue that keeps showing up scan after scan. The record stays singular while its data stays current.

The second case is the fingerprint match. The scanner reference lookup found nothing, but the fingerprint lookup found an open finding in the same company. From there the engine behaves the same way as in the first case, updating the existing finding rather than creating a duplicate. This is the case that delivers cross-scanner deduplication. A vulnerability first reported by one scanner and later reported by another, with no shared reference key between them, still collapses to one finding because the fingerprints agree.

The third case is the reopen, which we will look at on its own in a moment.

The fourth case is creation. Neither lookup found anything, so the vulnerability is genuinely new to the platform. The engine creates a fully populated finding, carrying across the complete set of source fields the importer provided, including the extended SAST, DAST, and SCA context such as file path, line numbers, taint flow, dependency path, and license data. Creation is also one of the two outcomes that trigger immediate rule evaluation, so a brand new finding can have severity and status policy applied the instant it lands.

A useful way to read these four cases is as a single decision tree. Look up by reference. If that misses, look up by fingerprint. If either lookup hits a finding, update it, unless it is closed, in which case reopen it. If both miss, create. Every result that enters the engine exits through exactly one of those branches, which is what keeps the outcome deterministic.
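Written as code, the decision tree is a small pure function. The Python sketch below is illustrative only: the `Action` names and the `"closed"` status string are assumptions, and the inputs are simply the status of whatever each lookup matched, or `None` on a miss.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    UPDATE = "update"                        # case 1: reference hit, finding open
    FINGERPRINT_MATCH = "fingerprint_match"  # case 2: fingerprint hit, finding open
    REOPEN = "reopen"                        # case 3: matched finding was closed
    CREATE = "create"                        # case 4: both lookups missed


CLOSED = "closed"  # assumed name for the terminal state


def decide(ref_status: Optional[str], fp_status: Optional[str]) -> Action:
    """Map the two lookup results to exactly one of the four outcomes.

    ref_status / fp_status hold the status of the finding matched by the
    scanner reference or fingerprint lookup, or None when that lookup missed.
    """
    if ref_status is not None:
        # The scanner reference is checked first and wins when both match.
        return Action.REOPEN if ref_status == CLOSED else Action.UPDATE
    if fp_status is not None:
        return Action.REOPEN if fp_status == CLOSED else Action.FINGERPRINT_MATCH
    return Action.CREATE
```

Because the function has no hidden state and every input combination returns exactly one `Action`, the same lookup results always produce the same outcome.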

Reopening a Closed Finding on Recurrence

The third case deserves its own attention because it is where deduplication and history-keeping intersect.

Suppose a vulnerability was found, remediated, verified, and closed. Months later a scan finds it again, perhaps because a deployment reverted a patch or a configuration drifted back. Either lookup stage can match the closed finding: the fingerprint lookup deliberately searches closed records, and the scanner reference may still be the same. The engine sees that the matched finding is in a closed state and takes the reopen path. It refreshes the finding’s fields from the new result, issues a reopen, records the new scan occurrence, and runs rule evaluation.

The value here is in what it does not do. It does not create a new finding, which would suggest the vulnerability is novel and discard everything you learned the first time. Instead it revives the existing record, so the full history, when it was first seen, who worked it, when it was closed, stays attached. A recurring vulnerability is treated as a recurrence, not as a stranger. For anyone trying to understand whether remediation is actually holding, that continuity is the entire point.

Reopen is also why the fingerprint lookup has to span every status. The reopen path simply could not exist if closed findings were invisible to the matcher. The two design choices, status-agnostic lookup and the dedicated reopen case, are two halves of the same behavior.
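The continuity argument is easy to see in a sketch. In the hypothetical Python below, `reopen` mutates the matched record in place, so its identifier, first-seen timestamp, and earlier history survive the recurrence. The field names and the free-text history log are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Finding:
    finding_id: int
    title: str
    status: str
    first_seen: datetime
    closed_at: Optional[datetime] = None
    history: List[str] = field(default_factory=list)


def reopen(finding: Finding, new_title: str, seen_at: datetime) -> Finding:
    """Revive a closed finding in place instead of creating a new record."""
    finding.title = new_title  # refresh fields from the new scanner result
    finding.status = "open"
    # closed_at is left untouched so the earlier closure stays on record.
    finding.history.append(f"reopened at {seen_at.isoformat()}")
    return finding
```

After the call, the record is open again, but `finding_id`, `first_seen`, and the prior history entries are exactly what they were before the vulnerability came back.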

How a Stable Fingerprint Is Built

The fingerprint is the heart of content-based deduplication, so it is worth understanding exactly what goes into it.

The version one fingerprint is a SHA-1 hash of a small, ordered set of inputs: the normalized title, the asset identifier, the endpoint, and, when present, a template identifier. Conceptually it hashes normalized_title | asset_id | endpoint [| template_id], where the template identifier is omitted when it is nil or an empty UUID. The choice of inputs is what gives the fingerprint its useful property. The same vulnerability, on the same asset, at the same endpoint, produces the same hash every time, no matter how many scans run or which scanner runs them. Two genuinely different vulnerabilities, or the same vulnerability on a different asset, produce different hashes and stay separate.

The reason a hash is used at all is practical. The engine needs a single short, comparable key it can look up efficiently and store on each finding. SHA-1 turns a variable-length combination of inputs into a fixed-length digest, 40 characters when written as hex, that is cheap to index and compare. The cryptographic strength of SHA-1 is not the point here, because this is an identity key for deduplication, not a security signature. What matters is that the same inputs always yield the same output.
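A version one fingerprint along these lines can be sketched with Python's standard `hashlib`. The function name, the exact `|` separator bytes, and the string encoding are assumptions; the article gives only the conceptual input order. The inputs are expected to be normalized already.

```python
import hashlib
import uuid
from typing import Optional

EMPTY_UUID = uuid.UUID(int=0)  # 00000000-0000-0000-0000-000000000000


def fingerprint_v1(
    normalized_title: str,
    asset_id: str,
    endpoint: str,
    template_id: Optional[uuid.UUID] = None,
) -> str:
    """SHA-1 over title | asset | endpoint [| template], returned as hex."""
    parts = [normalized_title, asset_id, endpoint]
    # The template identifier is omitted when it is missing or the empty UUID.
    if template_id is not None and template_id != EMPTY_UUID:
        parts.append(str(template_id))
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()
```

The same three inputs always yield the same 40-character hex string, while changing the asset or adding a real template identifier yields a different one.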

Why Normalization Comes First

A fingerprint is only as stable as its inputs, and raw scanner output is anything but stable. The same vulnerability can arrive as SQL Injection, sql injection, or SQL Injection with stray whitespace, depending on the scanner and the run. The same endpoint can arrive as https://app.example.com/login, app.example.com, or app.example.com:443 depending on how the tool reports it. Hash those raw values directly and you would get different fingerprints for what is plainly the same issue, which defeats the entire purpose.

This is why normalization runs before fingerprinting. The engine canonicalizes its inputs first. Titles are lowercased and trimmed, so casing and surrounding whitespace stop mattering. Endpoints are reduced to a host:port form in three steps: the scheme prefix (https://, http://, or ftp://) is stripped, everything from the first slash onward is removed, and a port is appended only when one is supplied and is not empty or zero. Paths are stripped of trailing slashes. The result is a small canonical representation where superficial formatting differences have been squeezed out, so that two reports of the same issue converge on identical inputs.

The ordering is not optional. Normalization is the step that makes the fingerprint meaningful, because it guarantees that cosmetic variation in scanner output does not fracture one vulnerability into many fingerprints. Deduplication that hashed raw input would be deduplication in name only. This is also the clearest illustration of why normalization and deduplication belong in the same engine. They are not two features that happen to sit together. One is the precondition for the other.
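The normalization rules for titles and endpoints can be sketched as two small Python functions. This is a hedged illustration, not the engine's code: the function names are hypothetical, the port is assumed to arrive as a separate string field, and the guard against appending a port the endpoint already ends with is an addition of this sketch rather than a documented rule.

```python
def normalize_title(title: str) -> str:
    """Lowercase and trim so casing and surrounding whitespace stop mattering."""
    return title.strip().lower()


def normalize_endpoint(endpoint: str, port: str = "") -> str:
    """Reduce an endpoint to host or host:port form."""
    host = endpoint.strip()
    # Strip a leading scheme prefix, if any.
    for scheme in ("https://", "http://", "ftp://"):
        if host.lower().startswith(scheme):
            host = host[len(scheme):]
            break
    # Drop everything from the first slash onward (the path).
    host = host.split("/", 1)[0]
    # Append the port only when one is supplied and is not empty or zero.
    # Assumption of this sketch: skip it if the endpoint already ends with it.
    if port and port != "0" and not host.endswith(f":{port}"):
        host = f"{host}:{port}"
    return host
```

For example, `normalize_endpoint("https://app.example.com/login", "443")` and `normalize_endpoint("app.example.com", "443")` both return `app.example.com:443`, so the two reports would feed identical endpoint input into the hash.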

V2 Fingerprints and Definition Priority

The engine also offers a version two fingerprint for callers that have richer context. Version two adds priority-ordered identifier disambiguation. Conceptually it hashes normalized_title | asset_id | trimmed_endpoint [| definitionID or templateID], with at most one identifier appended. The rule is a priority order: a company-scoped definition identifier takes precedence over a global template identifier, and if neither is present the hash falls back to title, asset, and endpoint alone.

The reason for the second version is granularity. A global template identifier is shared across every tenant, while a definition identifier is scoped to a single company, which lets the engine distinguish company-specific vulnerability definitions that a global template would lump together. Callers that carry company-scoped definition context use version two to get that finer dedup resolution, while callers that do not have it use version one. Both versions share the same normalized core, so the behavior is consistent. Version two simply layers a more specific identity on top when the context is available.
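The priority rule can be sketched as follows. This Python illustration reads the rule as "append at most one identifier, preferring the definition". The function name, the `|` separator, and the use of plain strings for identifiers are assumptions, and the inputs are expected to be normalized already.

```python
import hashlib
from typing import Optional


def fingerprint_v2(
    normalized_title: str,
    asset_id: str,
    trimmed_endpoint: str,
    definition_id: Optional[str] = None,
    template_id: Optional[str] = None,
) -> str:
    """SHA-1 over the shared core plus at most one disambiguating identifier."""
    parts = [normalized_title, asset_id, trimmed_endpoint]
    if definition_id:
        # A company-scoped definition wins over a global template.
        parts.append(definition_id)
    elif template_id:
        parts.append(template_id)
    # With neither identifier, only title, asset, and endpoint are hashed.
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()
```

When both identifiers are supplied, the template is ignored, so two company-specific definitions that share a global template still produce distinct fingerprints.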

What the Engine Records After a Match

Reaching a decision is only part of the work. Whatever the outcome, the engine records side effects that keep the surrounding system coherent.

After any create, update, or reopen, the engine records the scan occurrence. This maintains per-finding cross-scan aggregates: how many scans a finding has been seen in, which scan saw it most recently, and when the most recent wave touched it. These counters are what let an analyst answer questions like whether a finding is persistent across many scans or appeared only once, which is exactly the kind of signal that gets lost when duplicates proliferate. Because deduplication keeps the finding singular, the occurrence count accumulates on one record instead of fragmenting across copies.
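A minimal sketch of those per-finding aggregates, in Python, might look like this. The `OccurrenceStats` class and its field names are hypothetical, and tracking distinct scan identifiers in a set is one possible reading of "how many scans a finding has been seen in".

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Set


@dataclass
class OccurrenceStats:
    """Cross-scan aggregates kept on a single, deduplicated finding."""
    scan_ids: Set[str] = field(default_factory=set)
    last_scan_id: Optional[str] = None
    last_seen_at: Optional[datetime] = None

    @property
    def scan_count(self) -> int:
        # Distinct scans, so re-importing the same scan does not inflate it.
        return len(self.scan_ids)

    def record(self, scan_id: str, seen_at: datetime) -> None:
        """Record one occurrence after a create, update, or reopen."""
        self.scan_ids.add(scan_id)
        if self.last_seen_at is None or seen_at >= self.last_seen_at:
            self.last_scan_id = scan_id
            self.last_seen_at = seen_at
```

Because every occurrence lands on the same record, `scan_count` answers the "persistent or one-off" question directly.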

Rule evaluation is the other notable side effect, and it fires selectively. The engine invokes rule evaluation only on the create and reopen outcomes, not on plain updates. The reasoning is that a freshly created or freshly reopened finding is the moment when severity and status policy should be applied, whereas a routine update of an already-governed finding does not need to re-run the same policies. Rule evaluation runs on a best-effort basis. If a rule fails, the failure is swallowed rather than aborting the import, so a single misbehaving rule can never block a finding from being recorded. Getting the finding written reliably takes priority over policy enforcement succeeding on the first pass.
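The selective, best-effort behavior can be sketched in Python as below. The function name, the action strings, and the idea of rules as plain callables are assumptions made for illustration; the point is only the shape of the control flow.

```python
import logging
from typing import Callable, Dict, Iterable

logger = logging.getLogger("correlation")

# Only these outcomes trigger rule evaluation; plain updates do not.
RULE_ACTIONS = {"create", "reopen"}


def evaluate_rules_best_effort(
    action: str,
    finding: Dict[str, str],
    rules: Iterable[Callable[[Dict[str, str]], None]],
) -> int:
    """Run rules on create/reopen and return how many succeeded.

    A failing rule is logged and swallowed so it can never abort the import.
    """
    if action not in RULE_ACTIONS:
        return 0
    applied = 0
    for rule in rules:
        try:
            rule(finding)
            applied += 1
        except Exception:
            logger.warning("rule failed; continuing import", exc_info=True)
    return applied
```

A rule that raises does not stop later rules from running, and the caller still gets control back to finish writing the finding.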

All of this happens inside one synchronous pipeline step. The correlation call is blocking. It returns a result that names the action taken, the affected finding, and the fingerprint used, or it returns an error. There are no internal queues or timers inside the engine itself. Any background behavior belongs to the importer that called it, which may well be running in its own goroutine or async job. The engine stays simple and deterministic on purpose.

Where the Engine Sits in the Scan Pipeline

Stepping back, the engine’s position explains a lot about its design.

It sits upstream of every finding write and downstream of every scanner connector. The importer layer does the vendor-specific work: connecting, parsing, and turning raw scanner output into a normalized request. Only then does the engine take over to make the create, update, or reopen decision and to write through the finding repository. This separation is what keeps the engine scanner-agnostic. It does not need to know whether a result came from Nessus or Nuclei, because by the time it sees the result, the vendor differences have already been flattened into a common request shape.

That placement is also why one engine serves every scanner. All ten importers instantiate the same correlation engine and call the same correlate function per result. There is no separate dedup logic per vendor, which means the deduplication behavior is identical no matter which tool produced the data. A vulnerability deduplicates the same way whether it was first seen by one scanner and later by another, or seen repeatedly by the same one. Consistency across vendors is a direct consequence of putting the decision in a single shared component rather than scattering it across connectors.

If you want to see how this single decision point fits into the wider machinery of scheduling, importing, and reconciling many scanners, the multi-vendor scan orchestration guide covers the full pipeline. For a deeper comparison of principled correlation against simple string-match approaches, see the upcoming piece on correlation versus naive deduplication. And for the product-level summary of how PMAP turns many scanner results into one finding set, read the correlation and deduplication engine datasheet.

Two external references are worth keeping close when you reason about this layer. NIST’s SP 800-115 technical guide describes the scanning and assessment process that produces the raw results an engine like this has to reconcile. And the CVE program at MITRE provides the public identifiers that frequently anchor a vulnerability’s identity once it has been deduplicated and enriched.

Frequently Asked Questions

What is the difference between deduplication and correlation?

In this engine they are closely linked but distinct. Correlation is the act of recognizing that an inbound scanner result corresponds to an existing finding, which the engine does through its two-stage lookup. Deduplication is the outcome of correlation: collapsing what would otherwise be many records into one. Normalization sits underneath both, reducing noisy scanner output to a canonical form so that correlation can be reliable. PMAP keeps normalization and deduplication in a single engine precisely because stable, normalized inputs are what make correlation trustworthy.

How does a fingerprint stay stable across re-scans?

The fingerprint is a SHA-1 hash of normalized inputs rather than raw scanner output. Before hashing, the title is lowercased and trimmed and the endpoint is reduced to a canonical host:port form with scheme and path stripped. Because the same vulnerability on the same asset always normalizes to the same inputs, it always produces the same hash, scan after scan. The normalization step is what absorbs cosmetic differences in how scanners format their output, so re-scans converge on one fingerprint instead of fracturing into many.

What happens when a closed vulnerability is found again?

The engine takes the reopen path, which is the third of its four cases. The fingerprint lookup deliberately searches findings in any status, including closed ones, so a recurring vulnerability matches the finding that was previously resolved. The engine refreshes the finding’s fields, reopens it, records the new scan occurrence, and runs rule evaluation. It does not create a new finding, so the full history of the original issue stays attached and the recurrence is visible as a recurrence rather than as a new problem.

Does deduplication work across different scanners?

Yes, and that is the purpose of the second lookup stage. Scanner reference matching only deduplicates a tool against itself, because reference keys are vendor-specific. When two different scanners report the same vulnerability, their reference keys differ, so the engine falls back to the content-based SHA-1 fingerprint. Because the fingerprint is derived from the normalized title, asset, and endpoint rather than any vendor identifier, the same issue produces the same fingerprint regardless of which scanner found it, and the results collapse to one finding.

Why is scanner reference matching tried before fingerprint matching?

Because the scanner reference is the more specific and reliable signal when it is present. It is the scanner’s own stable identifier for a result, so matching on it makes re-ingestion of the same scan idempotent, producing no duplicates on repeat runs. The fingerprint is the broader fallback that bridges across scanners and handles results that arrive without a reference. Trying the precise signal first and the broad signal second puts the most trustworthy match ahead while keeping a safety net for everything the reference cannot cover.

Does the engine ever run automation when it deduplicates a finding?

It runs rule evaluation, but only on two of the four outcomes. When a finding is newly created or reopened, the engine invokes rule evaluation immediately so that severity and status policies apply at ingest time. Plain updates of an already-known finding do not trigger it. Rule evaluation is best-effort, meaning a failing rule is swallowed and never aborts the import, so the finding is always written even if a policy misbehaves.

Where does the correlation engine sit relative to the scanners?

It sits downstream of every scanner connector and upstream of every finding write. The importer layer handles all vendor-specific connecting and parsing, then hands the engine a normalized request. The engine makes the create, update, or reopen decision and writes through the finding repository. Because all ten importers call the same engine, deduplication behaves identically no matter which scanner produced the data.

PMAP Security Team
