Spotlight on security: Why do AV products score so highly in professional tests?
This question often arises on security-focussed internet forums: why do antivirus solutions perform worse when tested by amateurs than when tested by professional testing organizations? It seems odd that hobbyist home testers publish tests on YouTube which appear to pose a greater challenge to AV programs than the comparative tests of professional organizations. Despite popular conspiracy theories, there is a logical explanation for these apparently contradictory test results.
YouTube testers’ results may not be a reflection of reality
Although most home/hobbyist testers don’t publish their tests on YouTube or security forums, we will refer to home testers in general as YouTube testers (as they are often so called in the AV industry). However, we would like to point out that some YouTube testers publish their reviews with the best of intentions, and these may contain useful insights into e.g. the GUI of the product. We encourage users to install a trial version of any AV product they are interested in before making a purchase, so that they can decide for themselves whether it fits their own personal requirements. We also suggest you don’t rely blindly on one single test report, even from reputable independent test labs. Two or three tests, covering all aspects of a product’s protection and performance, will give a more complete picture, and a cross-check using a test by another reputable lab might not be a bad idea.
Most home and hobbyist testers only have access to publicly available malware packs, and upload the samples that their test subjects have missed to VirusTotal, in order to check the validity of their sample set. Often only a few of the AV products flag these samples as malware, leading to the (incorrect) conclusion that because the AV products tested failed to detect them, AV solutions in general are useless against malware. This conclusion is wrong for various reasons. One of them is that the participating AV products on VirusTotal scan the uploaded samples only with their on-demand scanners. This explains why many AV products miss these samples: only one of their many detection mechanisms is used, while their most powerful technologies (such as behavioural analysis and sandboxing) are ignored.

In general, YouTube testers often use flawed (partial) test methods. A typical YouTube tester downloads a malware pack, unzips it (with the antivirus disabled), enables the AV solution of choice, and runs (or even just scans) the malware executables. By doing so, they cut out many of the protection mechanisms of the AV solution tested. For example, most AV products will behave differently when a file is downloaded from a URL with a poor reputation. This is why a specific sample might bypass an AV solution in an ‘unzip and scan/execute’ test, while the same sample with the same AV solution could be blocked in AV-Comparatives’ ‘Real-World Protection’ test (which mimics the infection chain and uses the real source URL in the execution scenario). Another mistake often made by hobbyist testers is the use of flawed (partially inactive) test samples. Some of the samples might only work on an unpatched PC.
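The layering effect described above can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical model: the URLs, hashes and rule order are invented for illustration, and real products chain far more mechanisms than these three.

```python
# Hypothetical sketch of why 'unzip and scan' tests bypass protection layers.
# All URLs, hashes and rules below are invented for illustration only.
BLOCKED_URLS = {"http://bad.example/payload.exe"}   # invented URL-reputation list
KNOWN_SIGNATURES = {"deadbeef"}                     # invented signature database

def evaluate(sample):
    """Apply protection layers in the order a real infection chain meets them."""
    if sample.get("source_url") in BLOCKED_URLS:
        return "blocked by URL reputation"
    if sample["sha256"] in KNOWN_SIGNATURES:
        return "blocked by on-demand scan"
    if sample.get("behaviour") == "suspicious":
        return "blocked by behavioural analysis"
    return "not blocked"

# Real-world scenario: the download URL is known, so the first layer fires.
fresh = {"sha256": "0123abcd", "source_url": "http://bad.example/payload.exe"}
print(evaluate(fresh))                   # blocked by URL reputation

# 'Unzip and scan' scenario: the URL context is lost, the signature is
# unknown, and the file is never executed, so no layer can trigger.
print(evaluate({"sha256": "0123abcd"}))  # not blocked
```

The same file is blocked in one scenario and missed in the other, which is exactly why an ‘unzip and scan’ result says little about real-world protection.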
When a YouTube tester copies URLs from a public source of malicious URLs and tries to execute the downloaded samples, it could well be that vendor A ignores a non-functional sample completely, vendor B blocks it only at execution, and vendor C blocks access to the URL (maybe even for the sole reason that the URL is listed on the public repository). To the YouTube tester, this sample would appear to show a difference in protection strength between these three vendors; yet in all three scenarios the sample would not have infected the system, no matter which AV product was installed. Finally, most hobbyist testers misinterpret the results of their tests, because they have insufficient means to check whether the malware really infected the system. For ransomware this is often quite easy to show, but for other types of malware (worms, backdoors and keyloggers) it is sometimes harder for non-professionals to determine whether the system was really compromised. A typical YouTube test procedure is to run a series of (supposedly active and working) malware samples. An antivirus prompt or pop-up is counted as a successful block (although a detection message does not mean that the malware was really blocked or the system protected). Failures are determined by simply subtracting the blocks from the number of samples. In this scenario, crippled samples that do not infect the system are counted as failures if they are not blocked. After this, the system is often scanned with a few popular malware-cleaning tools, and if remnants are found they are used as proof that the AV product has failed or that another product is better.
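The scoring flaw described above is simple arithmetic, and a short sketch makes it concrete. The counts below are invented for illustration, and we assume (hypothetically) that all blocked samples were among the working ones.

```python
# Invented numbers for illustration; the point is the arithmetic, not the data.

def naive_score(total_samples, blocks):
    """YouTube-style scoring: every sample not visibly blocked counts as a miss."""
    return blocks / total_samples * 100

def adjusted_score(total_samples, blocks, inactive):
    """Scoring that first removes samples which could not have infected any
    system (corrupted, wrong OS patch level, dead URLs, ...). Assumes all
    blocks occurred on working samples."""
    working = total_samples - inactive
    return blocks / working * 100

total, blocked, inactive = 100, 85, 10

print(f"{naive_score(total, blocked):.1f}%")              # 85.0% -- 15 apparent failures
print(f"{adjusted_score(total, blocked, inactive):.1f}%") # 94.4% -- only 5 real misses
```

Ten crippled samples are enough to turn five real misses into fifteen apparent ones, and the gap grows with the share of non-working samples in the pack.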
Furthermore, an aspect that is often ignored by some YouTube testers is false alarms. It is quite easy to create a product which blocks every malware sample, if this is done at the cost of a high false-positive rate (or by asking the user to take every decision themselves).
Professional testers collect thousands of samples per day. Most of the over 300k samples seen each day are auto-generated and not really different: a huge number of them are just variants of the same malware, which will probably never be seen actively infecting a user’s system. The chance of an average home user running into these specific generated variants in real-world conditions is near zero. This is why we focus on malware-family diversity and exclude such ‘more of the same family’ variants, as well as focussing on prevalent malware seen in the field, in order to create a representative test set. Two-thirds of those 300k samples are usually unsuitable for various reasons (remnants, corrupted files, not working on the target operating system’s patch level, etc.) or are PUA (Potentially Unwanted Applications). Such files are excluded from professional tests, as they may be merely potentially unwanted, or may not perform any malicious activity on the target system. Due to differing opinions and classifications of what is PUA and what is legitimate (cultural differences also apply), including PUAs would make it impossible to compare the results of one vendor with those of another. Since that is the main goal of a comparative test, we filter out PUAs too.
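The filtering steps above can be sketched as a small pipeline. This is a hypothetical illustration only: the field names, family labels and the one-per-family cap are invented, not the actual selection criteria of any test lab.

```python
# Hypothetical sketch of sample-set filtering; all names and labels are invented.
from collections import defaultdict

samples = [
    {"sha256": "a1f0", "family": "FamilyA", "working": True,  "pua": False},
    {"sha256": "b2e1", "family": "FamilyA", "working": True,  "pua": False},  # same-family variant
    {"sha256": "c3d2", "family": "FamilyB", "working": False, "pua": False},  # corrupted sample
    {"sha256": "d4c3", "family": "Toolbar", "working": True,  "pua": True},   # PUA
    {"sha256": "e5b4", "family": "FamilyC", "working": True,  "pua": False},
]

def build_test_set(samples, per_family_limit=1):
    """Drop non-working samples and PUAs, then cap near-identical variants
    so the set reflects family diversity, not raw sample volume."""
    kept, per_family = [], defaultdict(int)
    for s in samples:
        if not s["working"] or s["pua"]:
            continue
        if per_family[s["family"]] >= per_family_limit:
            continue
        per_family[s["family"]] += 1
        kept.append(s)
    return kept

print([s["family"] for s in build_test_set(samples)])  # ['FamilyA', 'FamilyC']
```

Of five raw samples, only two survive: one corrupted sample, one PUA and one duplicate variant are filtered out, mirroring why only a fraction of the daily sample feed ends up in a comparative test set.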
Some YouTube testers might even write their own malware (which could be considered unethical and, in some countries, also illegal). Besides the fact that such self-written/artificial test malware would (hopefully) never be seen in the wild (and therefore does not represent the real world), one should also keep in mind that practically any product can be bypassed if enough time and resources are invested. So, depending on the intent of the person doing the YouTube tests (it is often unknown who is behind an anonymous nickname and what affiliations the person has), any outcome can be constructed.
Most malicious websites serving malware are only active for a short time (hours, or days at most). We balance our test set to reflect the malware prevalence seen in the field under current real-world conditions; we do not use proxies or other shortcuts, and we use live test cases which are currently active on the web. This explains why we ‘only’ use an average test-set size of about 200 malicious test cases per month. Other professional test labs use similar numbers of malicious test cases in their real-world protection tests (e.g. AV-Test 250 per two months, MRG-Effitas 300 per quarter and SE Labs around 100 per quarter). The test cases are executed simultaneously in separate environments, with each AV product running on a separate machine. It is important that products are tested in parallel, to prevent timing from influencing the results, and to stop one AV product benefiting from a detection by another (as some AV products use third-party AV signatures/cloud services, they might detect differently if tested a few seconds later).
Most YouTube tests finish with a block percentage for the tested AV product, which may be well below the score achieved by the same product in a professional test. With over-stretched conclusions based on limited test approaches, and without a representative and balanced set of malware samples, YouTube testers’ results may not reflect an AV product’s real protection performance. As explained above, most professional testers select their final sample test set to reflect malware-family representation in the field, and also include a false-positives test to compare the detection effectiveness of the different AV products. At AV-Comparatives we perform systematic and certified testing to provide unbiased comparative reports. The huge numbers of views and downloads underline the information value of these reports to consumers. We are proud to have earned this trust, and thank our readers for the feedback provided in our annual security survey.