Correlation vs Naive Deduplication for Findings

Most teams discover the difference between correlation and naive deduplication the hard way. They turn on a second scanner, the finding count nearly doubles, and someone asks why the new tool is reporting the same SQL injection the old one already flagged. A naive deduplication step gets bolted on, the count drops a little, and everyone moves on. Then a vulnerability that was marked closed last quarter shows up again as a brand new finding, with no link to its own history, and the team realizes the dedup logic was never as smart as it looked.

This article is a direct comparison of the two approaches. Naive deduplication suppresses exact duplicates within a single scanner’s output. Correlation answers a harder question for every inbound result: is this the same real-world vulnerability the platform already knows about, regardless of which scanner reported it or what status it is in? Those are not two flavors of the same feature. They are different decisions with different failure modes, and choosing the wrong one quietly corrupts every count, every SLA clock, and every report downstream.

The argument here is not abstract. It is grounded in how PMAP’s correlation engine actually behaves, because a concrete reference makes the comparison honest. If you want the full step-by-step mechanics of the engine, read the companion piece on how a vulnerability deduplication engine works. This article stays on the decision: where naive dedup breaks, where correlation holds, and why the gap between them grows as your scanner count grows. For the broader category view, the multi-vendor scan orchestration pillar shows where correlation sits in the larger pipeline.

Two Ways to Handle Duplicate Findings

Before comparing, it is worth defining both models cleanly so the rest of the article has a stable reference.

Naive deduplication is the simplest possible duplicate suppression. It takes the raw output of a single scan, builds a key from a few obvious fields, and drops any result whose key it has seen before. In practice the key is often the scanner’s own identifier, or a hash of the plugin ID plus the host. Naive dedup is fast, easy to reason about, and genuinely useful for one job: stopping a single scanner from listing the same plugin result twice in one run. Its entire worldview is one scanner, one exact key, one moment in time.
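
As a rough illustration of that worldview, here is a minimal sketch of a run-scoped exact-key matcher. The dict shape and the `plugin_id` and `host` field names are assumptions made for the example, not any scanner's real export format.

```python
from typing import Iterable


def naive_dedup(results: Iterable[dict]) -> list[dict]:
    """Drop results whose exact key was already seen in this run."""
    seen: set[tuple[str, str]] = set()
    unique: list[dict] = []
    for result in results:
        # Hypothetical key: plugin ID plus the host string exactly as reported.
        key = (result["plugin_id"], result["host"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(result)
    return unique


run = [
    {"plugin_id": "1234", "host": "192.168.1.10"},
    {"plugin_id": "1234", "host": "192.168.1.10"},  # exact repeat, dropped
    # Same host described differently by another tool: different key, kept.
    {"plugin_id": "xss-rule", "host": "https://192.168.1.10/login"},
]
deduped = naive_dedup(run)
```

The `seen` set lives only for the duration of one call, which is the "one moment in time" limitation in code form.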

Correlation is a richer decision made once per inbound result. Instead of asking “have I seen this exact key in this run,” it asks “does a finding for this real vulnerability already exist anywhere in the platform, in any status, from any scanner.” That single change in the question forces a more capable mechanism. PMAP’s correlation engine answers it with a two-stage lookup followed by a four-case decision. For every result that arrives from a scanner, the engine settles on exactly one of three outcomes: create a new finding, update an existing one, or reopen a closed one. Nothing is silently dropped, and nothing that is genuinely new is silently merged.

The distinction matters because the cost of being wrong is asymmetric. A naive approach that under-merges produces duplicate noise, which is annoying but visible. A naive approach that over-merges hides a real finding behind an unrelated one, which is dangerous and invisible. Correlation is built to avoid both, and the rest of this article walks through the criteria that separate the two models in practice.

Criterion 1: What Counts as the Same Finding

Everything starts with the matching question. How does each model decide that two results describe the same vulnerability? This single design choice drives every behavior that follows.

Naive Dedup: One Exact Key, One Scanner

Naive deduplication matches on a single exact key. If two results produce the same key string, they are treated as duplicates. If they produce different keys, they are treated as distinct, even if a human would obviously call them the same issue.

This works inside one scanner because that scanner is internally consistent. It uses the same plugin IDs, the same host representation, and the same identifier scheme on every run, so its own duplicates collapse cleanly. The trouble begins the moment a second scanner enters the picture. Two tools almost never agree on identifiers. One calls the host 192.168.1.10, another calls it https://192.168.1.10/login. One uses a plugin ID, another a CWE reference, another a proprietary rule key. A single-key, single-scanner matcher has no way to see that these point at the same vulnerability, so it keeps them both. The finding count inflates, and the inflation looks like real risk.

Correlation: Two-Stage Lookup Across Scanners

Correlation replaces the single exact key with an ordered two-stage lookup, and that ordering is the heart of the design.

The first stage is the scanner reference lookup. Every scanner has its own stable unique key for a result, something like nessus:1234:plugin:192.168.1.1. The engine looks up any existing finding carrying that exact reference. When it matches, the engine knows with certainty that this is the same scanner reporting the same thing it reported before, which makes re-ingestion of repeated scan runs idempotent. This stage takes absolute priority, because a scanner’s own key is the most trustworthy signal available about its own results.

The second stage is the fingerprint lookup. When no scanner reference matches, either because this is a different scanner or because the reference changed, the engine computes a SHA-1 fingerprint from normalized fields and looks for an existing finding with the same fingerprint in the same company. Critically, this lookup spans findings in any status, including closed ones. The fingerprint is what lets two different tools converge on one finding, and it is exactly the capability naive dedup does not have. The companion article on how the deduplication engine works walks through the lookup order in full detail.
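
The ordering of the two stages can be sketched as follows. This is an illustrative in-memory model, not PMAP's implementation: the `Finding` class, the `find_existing` function, and the choice to scope the reference lookup by company are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Finding:
    company_id: str
    fingerprint: str
    status: str = "open"
    scanner_refs: set[str] = field(default_factory=set)


def find_existing(
    findings: list[Finding],
    company_id: str,
    scanner_ref: str,
    fingerprint_of: Callable[[], str],
) -> tuple[Optional[Finding], str]:
    """Return (finding, stage), where stage names the lookup that matched."""
    # Stage 1: the scanner's own reference takes absolute priority.
    # (Company scoping here is an assumption of this sketch.)
    if scanner_ref:
        for f in findings:
            if f.company_id == company_id and scanner_ref in f.scanner_refs:
                return f, "reference"
    # Stage 2: compute the fingerprint only after the reference misses,
    # then search the same company across every status, closed included.
    fp = fingerprint_of()
    for f in findings:
        if f.company_id == company_id and f.fingerprint == fp:
            return f, "fingerprint"
    return None, "none"


store = [
    Finding("acme", "fp-abc", status="closed",
            scanner_refs={"nessus:1234:plugin:192.168.1.1"}),
]
by_ref = find_existing(store, "acme", "nessus:1234:plugin:192.168.1.1", lambda: "unused")
by_fp = find_existing(store, "acme", "other-scanner:77", lambda: "fp-abc")
miss = find_existing(store, "acme", "other-scanner:78", lambda: "fp-zzz")
```

Passing the fingerprint as a callable keeps the hash lazy, mirroring the description that it is computed only when the reference lookup finds nothing.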

The takeaway for this criterion is simple. Naive dedup asks “is this string equal to a string I saw.” Correlation asks “is this the same vulnerability by reference, and if not by reference, then by fingerprint.” The second question survives contact with multiple scanners. The first does not.

Criterion 2: Cross-Scanner Matching

Cross-scanner matching is where the two models diverge most visibly, because it is the scenario naive dedup was never designed to handle.

Consider a web application scanned by both a DAST tool and a separate web vulnerability scanner. Both find the same reflected cross-site scripting on the same endpoint. Their scanner references are completely different, so the reference lookup finds nothing. Under naive dedup, that is the end of the story: two keys, two findings, one real vulnerability counted twice. The XSS now appears to be two separate problems, two separate SLA clocks start ticking, and two analysts may pick it up independently.

Correlation handles this through the fingerprint stage. Because the fingerprint is derived from normalized, vendor-neutral fields rather than any single scanner’s identifier, both tools produce the same fingerprint for the same vulnerability on the same asset. The reference lookup misses, the fingerprint lookup hits, and the engine updates the existing finding rather than creating a duplicate. One vulnerability, one finding, one clock, one owner.

This is not a niche case. It is the normal state of any program running more than one scanner, which is to say nearly every mature program. PMAP’s engine is invoked by the importers for Nessus, Qualys, Rapid7, Acunetix, Invicti, Nuclei, SonarQube, and Tenable.sc, as well as the generic integration connector, and every one of them feeds results through the same fingerprint logic. That shared path is what makes a finding from one scanner recognizable to the next. The neighboring comparison on multi-scanner versus single-scanner programs covers why teams add that second scanner in the first place, and the multi-vendor scan import article covers how those varied formats reach the engine.

Naive dedup cannot do this without becoming correlation. The moment you teach a single-key matcher to recognize the same vulnerability across different identifier schemes, you have rebuilt the fingerprint stage by hand, usually less carefully.

Criterion 3: Handling Recurrence, Reopen vs Duplicate

The second hidden failure of naive deduplication shows up over time rather than across scanners. It is the recurrence problem, and it is the one that quietly damages audit history.

A vulnerability gets remediated and the finding is closed. Two months later the fix regresses, a deployment reverts it, or a new server is provisioned from an old image, and the same vulnerability returns. The next scan reports it again. What should happen to that finding?

Naive dedup typically scopes its duplicate check to open or active findings, because closed records feel like history that no longer matters. So when the vulnerability returns, the matcher sees no active duplicate and creates a brand new finding. The original closed record sits in the archive, disconnected. You now have two records for one recurring problem, no link between them, and no visible signal that this is a regression rather than a first-time discovery. Trend reports treat it as new. Anyone reviewing the asset’s history sees a clean closure followed by an unrelated new issue, when the truth is that the same weakness came back.

Correlation treats recurrence as a first-class outcome. Its fingerprint lookup deliberately spans findings in any status, closed included. When an inbound result matches a closed finding, the engine does not create a duplicate. It updates the existing finding’s fields and issues a reopen, which is CASE 3 in PMAP’s pipeline. The original record is preserved with its full history intact, its status flips back to active, and the timeline now reads correctly: discovered, remediated, closed, recurred. That is a materially different and more honest picture than a disconnected new record. It also means recurrence is measurable, because reopen is a distinct outcome rather than a hidden create.
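
The difference between the two scopes is easy to see side by side. The sketch below is a simplified model under stated assumptions: the `Finding` shape, the `history` list, and both function names are hypothetical, and matching is reduced to a bare fingerprint comparison.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    fingerprint: str
    status: str = "open"
    history: list[str] = field(default_factory=lambda: ["discovered"])


def ingest_open_only(findings: list[Finding], fingerprint: str) -> str:
    """Naive scope: only open findings count as duplicates."""
    for f in findings:
        if f.fingerprint == fingerprint and f.status == "open":
            return "updated"
    findings.append(Finding(fingerprint))  # forks a disconnected record
    return "created"


def ingest_any_status(findings: list[Finding], fingerprint: str) -> str:
    """Correlation scope: a closed match is revived, not duplicated."""
    for f in findings:
        if f.fingerprint == fingerprint:
            if f.status == "closed":
                f.status = "open"
                f.history.append("recurred")
                return "reopened"
            return "updated"
    findings.append(Finding(fingerprint))
    return "created"


def closed_finding() -> Finding:
    return Finding("fp-1", status="closed",
                   history=["discovered", "remediated", "closed"])


naive_store, corr_store = [closed_finding()], [closed_finding()]
naive_action = ingest_open_only(naive_store, "fp-1")
corr_action = ingest_any_status(corr_store, "fp-1")
```

After the same recurring result is ingested, the open-only store holds two unrelated records, while the any-status store holds one record whose history ends in a recurrence.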

This is the cleanest example of why “deduplication” undersells what correlation does. Reopen is not duplicate suppression at all. It is the engine choosing to preserve and revive history instead of discarding it, and naive dedup has no equivalent move.

Criterion 4: The Fingerprint Itself

If the fingerprint is what makes cross-scanner and cross-time matching possible, then the quality of the fingerprint determines the quality of the whole result. A weak fingerprint matches things that are not the same, which over-merges and hides findings. An over-specific fingerprint fails to match things that are the same, which under-merges and creates duplicates. The construction has to be deliberate.

PMAP builds its fingerprint in two versions. V1 hashes a composed string of normalized_title | asset_id | endpoint, with an optional template_id appended when one is present and not a zero UUID. This binds a finding to its vulnerability type, the specific asset it lives on, and the network location where it was found. The same vulnerability on the same asset at the same endpoint produces the same fingerprint on every re-scan, which is exactly the stability you want.
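
A minimal sketch of a V1-style fingerprint, assuming the fields are joined with a bare `|` and encoded as UTF-8 before hashing. The exact byte layout PMAP uses is not specified here, so the separator, the encoding, and the function name `fingerprint_v1` are assumptions.

```python
import hashlib
from typing import Optional

ZERO_UUID = "00000000-0000-0000-0000-000000000000"


def fingerprint_v1(
    normalized_title: str,
    asset_id: str,
    endpoint: str,
    template_id: Optional[str] = None,
) -> str:
    """SHA-1 over title | asset | endpoint, plus template_id when meaningful."""
    parts = [normalized_title, asset_id, endpoint]
    # Append the template only when present and not the zero UUID.
    if template_id and template_id != ZERO_UUID:
        parts.append(template_id)
    # Assumed layout: fields joined with "|" and UTF-8 encoded.
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


base = fingerprint_v1("sql injection", "asset-1", "192.168.1.10")
rescan = fingerprint_v1("sql injection", "asset-1", "192.168.1.10")
zero_tpl = fingerprint_v1("sql injection", "asset-1", "192.168.1.10", ZERO_UUID)
with_tpl = fingerprint_v1("sql injection", "asset-1", "192.168.1.10", "tpl-9")
```

Here `base`, `rescan`, and `zero_tpl` are identical, while `with_tpl` differs because a real template identifier was appended.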

V2 adds priority-ordered identity disambiguation. It uses the same title, asset, and endpoint base, then prefers a company-scoped definitionID over a global templateID, falling back to the title-plus-asset-plus-endpoint base when neither is available. The priority order is definitionID, then templateID, then omit. V2 exists for callers that carry company-scoped definition context and want finer dedup granularity, so that two findings sharing a title but belonging to genuinely different definitions do not collapse into one. Both versions are real and both are in use. The full derivation lives in the deduplication engine deep dive; here it is enough to see that the fingerprint is a designed key, not an accident of whatever fields were lying around.
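
The V2 priority order can be sketched in the same style. As before, the `|` separator, the UTF-8 encoding, and the name `fingerprint_v2` are assumptions of this example, and "available" is taken to mean a non-empty string.

```python
import hashlib
from typing import Optional


def fingerprint_v2(
    normalized_title: str,
    asset_id: str,
    endpoint: str,
    definition_id: Optional[str] = None,
    template_id: Optional[str] = None,
) -> str:
    """Prefer a company-scoped definition, then a global template, else omit."""
    parts = [normalized_title, asset_id, endpoint]
    identity = definition_id or template_id  # priority: definition, template
    if identity:
        parts.append(identity)
    # Assumed layout: fields joined with "|" and UTF-8 encoded.
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


args = ("weak cipher", "asset-1", "10.0.0.5:443")
base_only = fingerprint_v2(*args)
def_a = fingerprint_v2(*args, definition_id="def-a", template_id="tpl-1")
def_a_other_tpl = fingerprint_v2(*args, definition_id="def-a", template_id="tpl-2")
def_b = fingerprint_v2(*args, definition_id="def-b", template_id="tpl-1")
tpl_only = fingerprint_v2(*args, template_id="tpl-1")
```

Because the definition wins whenever it is present, `def_a` and `def_a_other_tpl` match, while `def_a` and `def_b` stay separate even though they share a title, asset, and endpoint.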

Normalization Before Fingerprinting

A fingerprint is only as stable as the inputs that feed it, which is why normalization has to happen first. If the raw fields are inconsistent, the hash is inconsistent, and cross-scanner matching collapses no matter how good the rest of the logic is.

PMAP normalizes before it fingerprints. Titles are lowercased and trimmed, so “SQL Injection” and “sql injection ” resolve to the same input. Endpoints are reduced to a host:port form: scheme prefixes like https://, http://, and ftp:// are stripped, the URL path after the first slash is removed, and a port is appended only when one is supplied and is neither empty nor 0. Paths are stripped of trailing slashes. The point of all this is to erase the cosmetic differences between how two scanners describe the same location, so that https://192.168.1.10/login and 192.168.1.10 converge on the same normalized endpoint before the hash is ever computed.
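
The normalization rules above can be sketched as two small helpers. The function names are hypothetical, and accepting the port as either an integer or a string is an assumption of the example. An endpoint string that already embeds a port is not handled specially here.

```python
from typing import Optional, Union


def normalize_title(title: str) -> str:
    """Lowercase and trim so cosmetic differences do not change the hash."""
    return title.strip().lower()


def normalize_endpoint(raw: str, port: Optional[Union[int, str]] = None) -> str:
    """Reduce an endpoint to host or host:port."""
    host = raw.strip()
    # Strip a leading scheme prefix if one is present.
    for scheme in ("https://", "http://", "ftp://"):
        if host.lower().startswith(scheme):
            host = host[len(scheme):]
            break
    # Drop the URL path: keep only what precedes the first slash.
    host = host.split("/", 1)[0]
    # Append a port only when one is supplied and is neither empty nor 0.
    if port not in (None, "", "0", 0):
        host = f"{host}:{port}"
    return host


from_dast = normalize_endpoint("https://192.168.1.10/login")
from_network_scan = normalize_endpoint("192.168.1.10")
```

Both calls yield `192.168.1.10`, so the two scanners' descriptions of the same location feed the same input into the hash.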

Naive dedup almost never normalizes to this degree, because for a single scanner it does not need to. The scanner is already internally consistent. But that consistency is precisely what disappears the moment a second tool arrives, and without normalization the fingerprints diverge and the duplicates return. Normalization is the unglamorous step that makes everything above it work.

Criterion 5: Idempotent Re-Ingestion

There is a quieter property that separates a real correlation engine from a naive matcher: what happens when you import the same scan twice.

This is more common than it sounds. Scans get re-run on a schedule. An import job fails partway and gets retried. An operator re-uploads a file to be sure it landed. In every one of these cases, the platform receives results it has already seen, and the question is whether the second pass corrupts the data.

The scanner reference lookup makes re-ingestion idempotent by design. Because the engine looks up findings by the scanner’s own stable reference first, a repeated result for the same reference resolves to the existing finding and triggers an update rather than a create. Run the same scan ten times and you get one finding updated ten times, not ten duplicate findings. The reference key acts as the anchor that makes repetition safe.

A naive matcher that scopes its check to a single run gets this wrong across runs. Within one import it suppresses duplicates correctly, but it has no memory of previous imports, so the second run of the same scan looks entirely new and the whole result set lands again. Idempotency is not a bonus feature of correlation. It is a direct consequence of looking findings up by a stable reference that persists between runs, and it is the property that lets you re-scan freely without fear of inflating your own numbers.
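
A small sketch makes the contrast concrete. Both functions and the record shapes are hypothetical: one keeps a store keyed by the scanner reference that persists between imports, the other remembers keys only for the duration of one run.

```python
def import_with_reference_anchor(store: dict[str, dict], scan: list[dict]) -> None:
    """Persistent lookup by scanner reference: repeats become updates."""
    for result in scan:
        ref = result["scanner_ref"]
        if ref in store:
            store[ref]["title"] = result["title"]
            store[ref]["times_updated"] += 1
        else:
            store[ref] = {"title": result["title"], "times_updated": 0}


def import_with_run_scoped_memory(table: list[dict], scan: list[dict]) -> None:
    """Run-scoped dedup: no memory of earlier imports."""
    seen_this_run: set[str] = set()
    for result in scan:
        ref = result["scanner_ref"]
        if ref in seen_this_run:
            continue
        seen_this_run.add(ref)
        table.append(dict(result))


scan = [{"scanner_ref": "nessus:1234:plugin:192.168.1.1", "title": "sql injection"}]
anchored: dict[str, dict] = {}
run_scoped: list[dict] = []
for _ in range(10):
    import_with_reference_anchor(anchored, scan)
    import_with_run_scoped_memory(run_scoped, scan)
```

After ten identical imports, `anchored` holds one finding that was updated nine times, and `run_scoped` holds ten copies of the same result.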

Criterion 6: What Happens After the Match

A deduplication decision is rarely the end of the work. What the platform does immediately after deciding create, update, or reopen is part of what separates a correlation engine from a matcher that merely returns a verdict and stops.

Once PMAP’s engine resolves an inbound result, it does two more things in the same pipeline step. First, it records the scan occurrence, maintaining per-finding cross-scan aggregates such as how many scans have seen this finding, the last scan that touched it, and when it was last observed in a wave. This is what powers coverage and “last seen” views, and it is only possible because correlation kept one finding instead of scattering the observations across duplicates. Second, on a create or a reopen, it immediately invokes the rule engine to apply severity and status rules at ingest time. Plain updates deliberately skip rule evaluation, because re-applying rules to an unchanged finding adds churn without value. Rule evaluation runs best-effort, so a failing rule never aborts a successful import.
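
The post-match step might look roughly like this. The `OccurrenceStats` fields, the `after_match` function, and the convention of returning whether rules ran successfully are all assumptions of this sketch, chosen to mirror the behavior described above rather than PMAP's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class OccurrenceStats:
    scan_ids: set[str] = field(default_factory=set)  # scans that saw the finding
    last_scan_id: Optional[str] = None
    last_seen_at: Optional[datetime] = None


def after_match(
    action: str,
    stats: OccurrenceStats,
    scan_id: str,
    apply_rules: Callable[[], None],
) -> bool:
    """Record the occurrence, then run rules only on create or reopen."""
    stats.scan_ids.add(scan_id)
    stats.last_scan_id = scan_id
    stats.last_seen_at = datetime.now(timezone.utc)
    if action not in ("created", "reopened"):
        return False  # plain updates skip rule evaluation
    try:
        apply_rules()
    except Exception:
        # Best-effort: a failing rule must not abort the import.
        return False
    return True


def failing_rule() -> None:
    raise RuntimeError("rule engine unavailable")


calls: list[str] = []
stats = OccurrenceStats()
ran_on_create = after_match("created", stats, "scan-1", lambda: calls.append("rules"))
ran_on_update = after_match("updated", stats, "scan-2", lambda: calls.append("rules"))
ran_on_failure = after_match("reopened", stats, "scan-3", failing_rule)
```

The occurrence is recorded for all three actions, the rule callback fires only once (on the create), and the failing rule on the reopen is swallowed instead of propagating.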

Naive deduplication has none of this. It produces a yes or no on duplication and leaves the downstream bookkeeping to whatever runs next, if anything does. The result is that the metadata correlation maintains automatically (occurrence counts, last-seen timestamps, ingest-time rule application) simply does not exist, or has to be reconstructed later from incomplete data. The match is not the goal. The match is the gate that lets the right post-match accounting happen against the right single record.

The Cost of Getting Dedup Wrong

It is worth being concrete about what the wrong choice costs, because the failure modes are not symmetric and they are not always visible.

Under-merging is the loud failure. When duplicates are not collapsed, the finding count inflates, the same vulnerability gets worked twice, two SLA clocks run for one problem, and reports overstate risk. It is annoying and wasteful, but at least it is visible. Someone eventually notices the same CVE listed four times and asks why.

Over-merging is the quiet failure, and it is worse. When a matcher is too aggressive and collapses two genuinely different vulnerabilities into one finding, the second vulnerability disappears. There is no duplicate to notice, because the finding count looks clean. A real, exploitable issue is now hidden behind an unrelated record, marked as the same thing, and nobody is looking for it. This is the failure mode that a naive matcher invites whenever someone widens its key to catch more cross-scanner duplicates without the discipline of a designed fingerprint.

There is a third cost that is specific to scanner diversity: lost context. Modern findings carry rich source detail. A SAST result has a file path, line numbers, a code snippet, and taint flow. A SCA result has a component name, version, ecosystem, dependency path, and license. A DAST result has the HTTP request and response. PMAP’s correlation request carries this full extended field set so that the create path preserves it losslessly. A naive dedup step that flattens results to a simple key tends to discard everything that does not fit the key, so even when it deduplicates correctly it can throw away the very detail that makes the finding actionable. Correlation deduplicates and keeps the context. That combination is the point.

How PMAP’s 4-Case Correlation Pipeline Decides

Everything above resolves into a single decision procedure. PMAP’s engine runs the same four-case pipeline for every inbound scanner result, and seeing it laid out shows how the criteria connect.

CASE 1, scanner reference match. If the result carries a non-empty scanner reference and a finding with that reference already exists, the engine updates that finding’s fields and records the scan occurrence. This is the idempotency path that makes repeated scans safe.

CASE 2, fingerprint match. If no reference matches, the engine computes the SHA-1 fingerprint and looks for an existing finding with that fingerprint in the same company, across any status. A hit here is the cross-scanner path, where a different tool converges on the same finding. Because the reference lookup is always tried first, a result that would match on both signals resolves through its reference, and the fingerprint stage runs only when the reference misses.

CASE 3, closed match becomes reopen. If the existing finding located by either stage is closed, the engine does not treat it as a plain update. It updates the fields, issues a reopen, records the occurrence, and runs rule evaluation. This is the recurrence path that preserves history instead of forking a duplicate.

CASE 4, no match becomes create. If neither stage finds anything, the result is genuinely new. The engine creates a fully populated finding, including the complete SAST, DAST, and SCA source fields, logs the activity, records the occurrence, and runs rule evaluation. Nothing actionable is lost on the way in.

The engine itself is synchronous. Each call to correlate a finding is a blocking decision that returns one of three actions: created, updated, or reopened. It runs silently during every scan import and is never user-visible except through its outcome. That is the practical shape of correlation. One pass, four cases, three outcomes, and a single source of truth on the far side. A naive matcher gives you one of those cases, CASE 1 within a single run, and calls it deduplication. The gap between those two is the gap this whole article is about.
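
Putting the four cases together, the decision procedure can be sketched as one blocking function that returns one of the three actions. This is an illustrative model only: the `Finding` shape and `correlate` signature are hypothetical, occurrence recording and rule evaluation are omitted, and attaching a newly seen scanner reference to the matched finding is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    company_id: str
    fingerprint: str
    title: str
    status: str = "open"
    scanner_refs: set[str] = field(default_factory=set)


def correlate(findings: list[Finding], company_id: str, scanner_ref: str,
              fingerprint: str, title: str) -> str:
    """Return 'created', 'updated', or 'reopened' for one inbound result."""
    match = None
    if scanner_ref:  # CASE 1: scanner reference lookup comes first
        match = next((f for f in findings if f.company_id == company_id
                      and scanner_ref in f.scanner_refs), None)
    if match is None:  # CASE 2: fingerprint lookup, any status
        match = next((f for f in findings if f.company_id == company_id
                      and f.fingerprint == fingerprint), None)
    if match is None:  # CASE 4: genuinely new
        refs = {scanner_ref} if scanner_ref else set()
        findings.append(Finding(company_id, fingerprint, title, scanner_refs=refs))
        return "created"
    match.title = title
    if scanner_ref:
        match.scanner_refs.add(scanner_ref)  # assumption: remember the new ref
    if match.status == "closed":  # CASE 3: closed match becomes reopen
        match.status = "open"
        return "reopened"
    return "updated"


store: list[Finding] = []
first = correlate(store, "acme", "nessus:1", "fp-1", "sql injection")
rescan = correlate(store, "acme", "nessus:1", "fp-1", "sql injection")
other_tool = correlate(store, "acme", "qualys:9", "fp-1", "sql injection")
store[0].status = "closed"
recurrence = correlate(store, "acme", "nessus:1", "fp-1", "sql injection")
```

Across a first sighting, a re-scan, a second scanner, and a recurrence after closure, the store ends with a single finding, which is the "single source of truth" outcome described above.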

If you want the implementation walked through line by line, the deduplication engine guide is the place to go. If you are still deciding whether correlation is worth the move from raw exports, the “what is vulnerability deduplication” primer covers the foundations, and the correlation and deduplication datasheet shows how PMAP keeps one finding across every scanner.

PMAP Security Team

