Spotlight on security: The Curse of the False Positive
By David Harley
When is a false positive (FP) really a false positive? How much care should security vendors take to avoid FPs or, at worst, to fix them? Do they really matter at all?
Well, let’s start off with a definition – actually, two definitions. Here are two major detection problems to which all the security products tested by AV-Comparatives are vulnerable (though some marketroids may choose to disagree when promoting their own products): false negatives and false positives.
- A false negative is what we get when a security product designed to detect something malicious – such as the presence of malware – fails to detect a malicious object or activity. This is an example of what is known in statistical test theory as a Type 2 Error. Formally, you could say that the null hypothesis “There is no malicious activity present” is incorrectly retained when it should have been rejected. Less formally, you’d say something like “My security software didn’t notice a malicious sample.”
- A false positive (or Type 1 Error) takes place when that same null hypothesis is incorrectly rejected. “My security software claims that this file is malware, but it definitely isn’t!”
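The two definitions above can be made concrete with a small sketch. It assumes a tester who knows the ground truth for every scanned object; the `Verdict` type, the function names and the sample numbers are invented for illustration and are not taken from any real test.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    actually_malicious: bool  # ground truth, known to the tester
    flagged: bool             # what the security product reported


def classify(v: Verdict) -> str:
    """Name the outcome for one scanned object."""
    if v.actually_malicious and v.flagged:
        return "true positive"
    if v.actually_malicious and not v.flagged:
        return "false negative"  # Type 2 error: missed detection
    if not v.actually_malicious and v.flagged:
        return "false positive"  # Type 1 error: clean object flagged
    return "true negative"


def error_rates(verdicts: list[Verdict]) -> tuple[float, float]:
    """Return (false positive rate, false negative rate).

    Assumes the test set contains at least one clean and one malicious object.
    """
    clean = [v for v in verdicts if not v.actually_malicious]
    malicious = [v for v in verdicts if v.actually_malicious]
    fp_rate = sum(v.flagged for v in clean) / len(clean)
    fn_rate = sum(not v.flagged for v in malicious) / len(malicious)
    return fp_rate, fn_rate


# Invented test set: 4 clean files (1 wrongly flagged), 5 malicious (1 missed).
sample = (
    [Verdict(False, False)] * 3 + [Verdict(False, True)]
    + [Verdict(True, True)] * 4 + [Verdict(True, False)]
)
fp_rate, fn_rate = error_rates(sample)
```

Note that the two rates are measured against different populations: the FP rate is a fraction of the clean files, the missed-detection rate a fraction of the malicious ones.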
Obviously, a false negative/missed detection matters. How much it matters depends on quite a few factors, such as how widespread the real but undetected malware is (it doesn’t have to be malware, but let’s stick with what I know best!), and exactly what impact it has on an affected system. In the security industry, this impact metric is often referred to as criticality, which might be defined as the extent to which malware (or, indeed, an FP) might impact adversely on a user’s experience.
Security and Compromise
Let me introduce you – though if you haven’t met with it before, you probably aren’t in the security industry! – to what is sometimes called the CIA tripod (or triad) model: confidentiality, integrity and availability.
All three components are important for security implementation (duh), but you can’t realistically expect 100 percent success in all three. Let’s consider Marcus Ranum’s “Ultimately Secure DEEP PACKET INSPECTION AND APPLICATION SECURITY SYSTEM”. Paraphrasing, Ranum’s point is that you can make a device completely safe by cutting its power lead and the cable that connects it to the Internet, but that reduces its functionality to zero percent. To an individual or a business in the real world, availability is arguably the most important component of the CIA triad, but achieving availability tends to make complete confidentiality and integrity less achievable. Yet if you don’t have availability of every file or object that you’re entitled to, you don’t have perfect security or a viable business model. But perfection isn’t really an option. And FPs are essentially an unintended attack on availability.
FPs are also an issue because if a product often flags clean files as malicious, users might start not to trust the product’s detections (unsurprisingly): in this situation, it’s not unknown for a legitimate detection to be ignored, and the user whitelists real malware, allowing it to execute with potentially devastating results.
Criticality, Malware and FPs
Back in the days of viruses – if you’ll forgive a brief wallow in nostalgia – most viral malware didn’t actually do anything except take up a small area of main memory and/or storage. Most virus writers wanted the kudos of writing a successful (preferably widespread) virus rather than to cause damage or extort money, and some (notably Mike Ellison a.k.a. Stormbringer) were even apologetic and helpful if/when their creations caused noticeable damage. Even in recent years, there have been (infrequent) examples of ransomware authors showing some measure of remorse (or cold feet). That said, there are, of course, innumerable examples of viruses and more recent malware whose impact is considerable (from the Morris Worm to Michelangelo and CIH to ransomware, to wipers which may masquerade as ransomware but whose purpose is purely destructive). That impact is defined by the security industry in terms of criticality, which you might informally define as the amount of damage done. Fortunately, even malware that is intended to be destructive may fail to achieve its objectives because the environment in which it finds itself doesn’t allow it to trigger, or because the payload simply doesn’t work as expected there.
You may think that since even real malware doesn’t necessarily have significant impact, a false positive shouldn’t matter at all. Unfortunately, this isn’t the case. In fact, many of the scenarios under which malware can do significant damage also apply to FPs: not because of the misdiagnosed object itself, but because of the actions that security software takes to prevent it from doing damage. In such scenarios, the cure is usually worse than the symptom. When you think of the consequences of malware, it’s usually damage/criticality that comes to mind first, but there are other factors that also play a part:
- How widespread the FP and the misdiagnosed object happen to be (prevalence)
- How easy it is to repair any damage (recoverability)
- The type of user environment in which the incident takes place (types of protection in place, policy issues, and so on)
Extrapolating from those broad categories, FPs can cause a wide range of problems.
Some false positives render an individual system unbootable, or allow it to boot but not to connect to a local network or to the Internet. Serious instances of such incidents are rare (but widely publicized when they occur). Yet no responsible vendor can guarantee that it won’t happen to their customers at some point. There have been occasions when security products have managed to prevent Windows from launching properly or at all because system files such as svchost.exe have been incorrectly diagnosed as malicious. (Sometimes, of course, such files really are compromised by malware with similar impact, but that’s another issue.) Sometimes these incidents are restricted to esoteric combinations of OS version and region, but a widespread incident originating with a major vendor is not only inconvenient (or worse) for the customer, but also causes considerable damage to the reputation of the vendor and therefore the marketability of its product range.
Then there are FPs that allow access to the Internet but prevent access to services such as email or web services. One of my favourite FPs of all time was the email filter that blocked all emails containing a certain letter of the alphabet. Don’t you hate it when the alphabet starts to spam you? (“Today’s denial of service is brought to you by the letter P and the number 7.”)
Blocked access to such sites (for the individual or the enterprise) is often associated with the correct or incorrect diagnosis of malware being served from specific URLs. Unfortunately, the occurrence of both Type 1 and Type 2 errors is symptomatic of serious and ongoing problems with the way in which criminals distribute malware from websites.
Question: when is a malicious URL not a malicious URL?
Answer: quite often, as happens when the gang maintaining malicious URLs puts in measures to reduce the chances of those URLs being detected as malicious by security researchers. Obvious examples of such countermeasures are seen when a site avoids serving malware to IP blocks known to be associated with law enforcement or security companies, or to IP addresses in the same block that keep returning to the malicious site.
There are many other techniques used to hamper detection of malicious code that are, perhaps, better known. In most cases, early viruses leapt from system to system without actually changing the core code that enabled them to work, making them fairly simple to detect. Present day malware is very different: sophisticated programming techniques make it easy to keep altering code, for instance by obfuscating it differently with each iteration. What do security companies do to counter such countermeasures?
In fact, the anti-malware industry several decades ago started to move away from the simplistic ‘signature-based’ model of ‘detect a virus (or whatever), analyse a virus, blacklist a virus’. After all, even the biggest labs aren’t resourced to handle all the samples they receive on a daily basis individually and with visual inspection by a human being. (Hence the present-day emphasis on the use of machine intelligence in sample processing.) Every time vendors add a detection, they try to generate code that will catch a wide spread of similar samples, not just one short-lived iteration of malicious code. Unfortunately, this generalized or generic approach can be extended so far as to block a whole class of objects, many of which contain perfectly innocent code. There have been many examples of this over the years, but one interesting instance that AV-Comparatives has come across is as follows.
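A toy sketch may make the trade-off clearer. Everything in it is hypothetical: the byte strings stand in for files, "PACKER-X" is an invented packer marker, and real detection engines are vastly more sophisticated. It only contrasts an exact-match blacklist with two generic rules, one of which is drawn too broadly.

```python
import hashlib

# Hypothetical byte strings standing in for files; none is real malware.
MALWARE_V1 = b"PACKER-X|payload:steal_passwords|pad=0001"
MALWARE_V2 = b"PACKER-X|payload:steal_passwords|pad=0002"  # trivially altered variant
CLEAN_APP = b"PACKER-X|payload:install_photo_editor"       # innocent, same packer

# Exact detection: a blacklist of hashes of samples already analysed.
KNOWN_BAD_HASHES = {hashlib.sha256(MALWARE_V1).hexdigest()}


def exact_match(data: bytes) -> bool:
    """Flag only files identical to an already analysed sample."""
    return hashlib.sha256(data).hexdigest() in KNOWN_BAD_HASHES


def generic_by_packer(data: bytes) -> bool:
    """Over-broad generic rule: flag anything built with the packer."""
    return data.startswith(b"PACKER-X|")


def generic_by_payload(data: bytes) -> bool:
    """Narrower generic rule: key on the malicious behaviour marker."""
    return b"payload:steal_passwords" in data


for name, data in [("v1", MALWARE_V1), ("v2", MALWARE_V2), ("clean", CLEAN_APP)]:
    print(name, exact_match(data), generic_by_packer(data), generic_by_payload(data))
```

In this sketch the exact match misses the altered variant (a false negative), the packer rule catches both variants but also flags the clean app (a false positive), and only the narrower rule gets all three right. In practice, finding a rule that is both general and narrow enough is the hard part.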
Detection versus Blocking: an Example
Consider a Word document containing a macro that launches a legitimate and signed app. Word documents have been a point of contention since the first widespread macro virus in the 1990s, of course. Even then, macros were a legitimate and potentially extraordinarily useful tool, especially for business users. However, early on in the development of macro viruses, at least one vendor overstepped the mark by ‘cleaning’ all files by disabling the execution of macros (and then announcing that they’d ‘fixed’ the macro virus problem). Today’s macros are a lot harder to compromise than in the heady days of the WM/Concept virus: Microsoft and other players in the security industry have expended a great deal of effort in order to rehabilitate them. And, of course, there are many business scenarios in which the launch of an app is not only legitimate but essential. Yet AV-Comparatives has found a number of products that tag as suspicious and/or block the execution of innocent files via macro.
Delegating the Decision
Is this a false positive? Well, not if a reasonably well-informed system owner has agreed – explicitly or simply by accepting a default – to go with one of the following positions:
- To accept the responsibility of deciding whether an object or process flagged by the product as suspicious is, in fact, malicious.
- To assume that the chances of sustaining major damage from a false positive are low enough to justify sustaining some false positives.
But few computer owners – or even system administrators – are experts in the art of malware management, and many are content to accept whatever defaults a security vendor chooses to set, in the hope that these defaults will afford them the Holy Grail of anti-malware technology: 100% protection against 100% of malware, plus a complete absence of false positives. Sadly, a great deal of security product marketing promotes the fallacy that 100% detection and 0% FPs are compatible goals. Offering defaults that devolve at least part of the responsibility for ‘recognizing’ malware upon the customer is one of the anti-malware industry’s ways of attempting to implement an acceptable compromise.
The Good, the Bad, and the Indeterminate
Here’s another example from AV-Comparatives of what is, arguably, a false positive. You may never have thought about the process of software installation, just accepting it as a process that runs when you first acquire an app and very possibly just accepting the defaults offered.
When I was a young (OK, young-ish) aspiring programmer, many programs – even quite large ones, sometimes – didn’t require an installation process worth mentioning. Once the program was on your system, all you had to do was run it and maybe make a few decisions about default settings.
As operating systems have become more sophisticated and complex, though, the installation process for even the simplest applications has also become more complex, and there is a whole class of installer program available designed to simplify the installation of other programs. Such programs usually include features such as compression and encryption that can be used to camouflage the presence of malicious software routines. The security industry has for many years benefited from the ability to identify some specific installers, packers and crypters (encryption software) that were exclusively associated with malware. However, if the package is sometimes (even if only rarely) used by legitimate software, and that legitimate software is diagnosed as malicious, we have a problem. Whatever action a product takes when it detects such a package already puts it close to the borderline between generic detection and false positives.
What AV-Comparatives is reporting, though, is something even more problematic. The testers have created or compiled installation routines using the NSIS scripting language and drawn from official Open Source projects, and found that several products detect clean installers as malicious. It’s probably safe to assume that not all users of security software realize that their product includes this generic detection by class rather than by specific detection of malicious code. And diagnosing innocent code as malicious is a perfectly viable definition of a false positive.
The most critical impact on a system will normally involve more than one of the categories into which FPs may fall: for example, impeding access to applications, to data such as corporate databases, to networks, and to business processes that impact the ability of the individual, the financial institution or the retailer to complete a transaction. The precise details, though, will depend on many factors. And criticality is not the only factor used to define how serious a false positive is: prevalence is another, though it can be difficult to measure. After all, if you’re a home user of the Internet doing everything from your laptop or even your smartphone or tablet, the inability to use that device may be more devastating than the loss of function on one or more PCs on a corporate network. Assuming, that is, that such assets as software licenses and business data are easily replaced or recovered – as should be the case in a well-administered IT infrastructure.
Like recoverability, environment is also important: some environments are able to be much more tolerant of false positives than others, and – in principle at least – customers are better able to decide where they stand on that continuum than vendors. But it’s a complicated issue that many companies and most individuals have difficulty in understanding, desirable though it would be for people and organizations in general to have a better working knowledge of the malware problem.
Today, the role of security testers in terms of helping customers to reach a better understanding of the subject is more vital than ever, and well-implemented false positive testing is a major component of the services they provide. But so is their ability to indicate to vendors the need to better understand the needs of their customers.
Low Prevalence, Low Interest
A small developer creates an app which is flagged as malicious or at least suspicious by security products. The developer generally doesn’t know about this until their customers complain about it and/or they see it misdiagnosed on VirusTotal or a similar resource. We know that a widespread and widely publicized false positive is harmful to the reputation of the security vendor, but in this case it’s the reputation and marketability of the app vendor that is threatened, and the security vendor may see this as less of a priority. When the developer tries to contact the security vendor(s) to get the FP fixed, it’s reported that vendors may respond that:
- Because the incorrectly flagged file is not in widespread use, the problem will not be addressed: the effort required to make the necessary changes to the security product is considered out of proportion to the size of the problem. It is true that resolving such an issue may entail considerable re-engineering. The industry has become very reliant on making detections as generic as possible in order to maximize automatic detection, so amending and testing a complex detection in order to eliminate a false positive may be far from trivial. The correction needs to be carefully implemented, not only so that it remains effective with the next build, but also so that it doesn’t generate detection problems when other files are scanned. Simply whitelisting the affected file is an unsatisfactory alternative, since each rebuild of the app introduces the possibility of another FP. There is also the possibility of a security breach resulting from a compromised version of the app.
Should the low prevalence of an FP really absolve a vendor of any responsibility for the accuracy of their detection? We may not expect a security product to detect every threat at first sight, but we do expect such products to be updated to detect threats as they become known. Indeed, while you might expect vendors to keep information on new threats to themselves in order to maintain competitive advantage, in fact reputable vendors trade information on such threats, putting the welfare of the wider community above their own competitive interests. Nor is there an official threshold below which no one cares if information is not shared. Is it unreasonable to expect vendors to be equally responsive to FPs, irrespective of prevalence?
- They are marketing an enterprise security product so it’s of no consequence to them if niche products such as (innocent) games are misdiagnosed, since such apps are not expected to be used legitimately in a corporate environment. Apart from the fact that this expectation sounds almost willfully naïve, surely it’s not a security developer’s role to ignore known damage to another developer’s reputation, even if that developer is not one of their customers or a major player in its own market? This is a problem when a single vendor FPs, but there is a further problem when other vendors follow suit, not wanting to show up on VirusTotal as failing to detect malware that is detected by at least one competing vendor.
This reflects a long-recognized problem, in that some vendors have actually ‘stolen’ detections without verification, resulting in a ‘snowball effect’ by which a potential misdetection becomes more widely assumed to be correct because of the number of vendors who misdiagnose the so-called malicious object. Kaspersky highlighted this problem a decade ago by experimentally creating innocent executable files, deliberately flagging some of them as malicious, and uploading the files to VirusTotal. Kaspersky reported subsequently that within 10 days, 14 other vendors had flagged the same files as malicious, obviously without checking the files themselves. While I didn’t altogether approve of the way in which the experiment was performed, I was disturbed enough by its implications to write a joint blog article with Kaspersky researcher Magnus Kalkuhl focusing on the problem of ‘cascading’ FPs rather than on the details of the demonstration.
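The snowball effect can be sketched as a simple threshold cascade. This is an illustrative toy, not a reconstruction of the Kaspersky experiment or of any vendor’s actual process: the function name and the thresholds are invented, and each hypothetical vendor either analyses the (clean) file itself or copies the verdict once enough others already flag it.

```python
def simulate_cascade(thresholds, seed_flags=1, max_rounds=20):
    """Simulate unverified copying of a single mistaken detection.

    thresholds[i] is the number of existing detections vendor i needs to
    see before copying the verdict without analysis. None means the vendor
    verifies the file itself and, because the file is clean, never flags it.
    seed_flags is the number of vendors that made the original mistake.
    Returns the total number of flagging vendors after each round.
    """
    flagged = [False] * len(thresholds)
    history = [seed_flags]
    for _ in range(max_rounds):
        total = seed_flags + sum(flagged)
        followers = [
            i for i, t in enumerate(thresholds)
            if not flagged[i] and t is not None and total >= t
        ]
        if not followers:
            break  # nobody else is persuaded; the cascade has stopped
        for i in followers:
            flagged[i] = True
        history.append(seed_flags + sum(flagged))
    return history


# Invented population: six copiers with different thresholds, two verifiers.
print(simulate_cascade([1, 1, 2, 3, None, 5, None, 9]))  # [1, 3, 5, 6]

# If every vendor verifies for itself, the original mistake stays isolated.
print(simulate_cascade([None] * 5))  # [1]
```

In the first run a single bad detection ends up repeated by five more vendors, each of which makes the detection look more credible to the next. Only the vendors that verify for themselves, or that demand more corroboration than ever materializes, stay out of it.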
It’s also worth asking how many people would, in any case, realize that a security vendor might have different detection criteria according to whether a product is intended for the enterprise or for the home market, for the corporate server or for the home desktop. Probably not the majority. Yet this remains an issue of some importance when computer users make inappropriate use of VirusTotal to compare products. Furthermore, it’s reported that in some cases there is no way to contact the vendor to get FPs fixed, or at any rate the contact point is hard to find. In particular, it seems that certain enterprise-targeted vendors only allow contact/support for their paying customers.
It’s important for vendors to take false positives as seriously as they do missed detections (false negatives). It’s not good enough to cut corners by ignoring FPs using such rationalizations as “it doesn’t matter if it isn’t prevalent”, “this isn’t the sort of program our customer base needs to take into account because it’s not a corporate application”, or “it has no critical impact in the real world”. Any incorrect detection of clean files is an FP, and an FP’s eventual impact may turn out to be far greater than may be realized when it’s first reported. How and how well a company deals with a real FP is a viable indicator of its ethics as well as its professionalism. So it’s not surprising and entirely reasonable that sound testing organizations have evolved methodologies by which to measure it.
Though David Harley regards himself more as a musician than a security guy these days, he has spent more than three decades working in cybersecurity, in areas including system and network administration/management in the medical research and public health sectors, product testing and evaluation, plus over a decade working closely with a major security vendor. He is therefore still unable to resist offering his opinion on security matters when invited to. Most of his publicly available writing on security is linked or stored here.