The Role of Content Categorization & URL Databases
Categorizing the entire World Wide Web is a monumental task, especially considering that it is continuously evolving. New websites and services are constantly launching, while hundreds of millions of existing sites add or change content (to say nothing of the constant changes in underlying code and infrastructure). In most cases, these are legitimate improvements and updates made by well-intentioned website owners. But what about sites that are launched with the express intent of causing harm? And what about sites that are compromised, where a hacker plants harmful payloads or makes other changes for nefarious purposes?
A myriad of technologies play a role in helping billions of network devices determine which destinations are safe and which are not. Most internet users are familiar with some of these, at least in concept, such as firewalls and antivirus software. But other terms, technologies, and security methodologies are less well known, hidden beneath the hood.
Distributed URL/IP databases along with the underlying content classification and malicious detection systems that drive them are examples of critical, but often overlooked and undervalued components in network security. Categorizing web content and identifying malicious threats is zvelo’s core business offering. For over 20 years, we have focused on developing and maintaining robust, scalable solutions that deliver industry-leading web content categorization and traffic data to support our OEM partners and keep internet users safe all over the world. Our partners take our data and solutions, using them to augment their own solutions and serve more than 1 billion end users worldwide.
There have been a number of challenges to overcome along the way, but through those, zvelo has been on a continuous path of improvement—and today delivers the industry’s leading URL database with 99.9% Coverage and over 99% Accuracy of ActiveWeb URLs. In this blog, we’ll explore the important categories and malicious threats that require rapid response for critical security and other needs, as well as the deployment options our partners leverage to deliver real-time coverage and protection for networks all over the world.
Critical Real-Time Coverage For Newly Identified Threats
First of all, let’s talk about timing. Why is real-time, up-to-the-minute coverage important?
As mentioned, the web is changing at a rapid rate. Malicious payloads and phishing URLs are constantly being deployed—used to target victims as soon as they “go live”. A well-timed and executed attack can compromise, infect, or steal data from countless victims before many traditional cybersecurity measures can even hope to identify, block, and/or mitigate the threat.
While these traditional security measures remain critical to protecting networks and users, they are still reactive, bolstering a defense only after the hacker has begun executing an attack. Without a “Minority Report,” it’s just not possible to flag and defend against every threat before (or as) it is launched. However, thanks to advancements in AI and machine learning, the efficiency and speed with which models and computers can identify suspicious web destinations are rapidly improving. And even though they are reactive, these systems are becoming faster and more accurate, enabling us to identify and mitigate threats sooner. That, in turn, translates to fewer instances of compromise, fewer victims, and less damage done.
So, every day, month, and year we get better at identifying various malicious threats. But we still have to ensure that those identifications (i.e. data) are propagated to our partners and their systems (e.g. computers, routers, UTMs, gateways, data centers, IoT devices, etc.) all over the world, and as fast as possible. To stay “current” with all the newly identified threats, we would need to update all instances of URL databases around the world as threats are identified. With nearly 2 billion websites on the Internet at the time of this writing, even if malicious and objectionable identifications represented only a fraction of a percent of the total domains and webpages, that would equate to a significant number of URLs and pages.
This presents some challenges, especially considering that we support a wide range of proprietary partner systems. To accomplish the task, we offer a variety of flexible deployment options. Many of these implementations communicate directly with the zveloDB™ Master Database, which is always up to date. But some implementations require a smaller, “consolidated” version of our database. This is often a requirement for consumer networking devices and systems designed with lower internal storage space, but it can also include UTMs, gateways, routers, etc. Many of our partners choose to deploy our zveloDB SDK for these devices. Our SDK is equipped with a consolidated “cache” of our Master URL database. Despite their smaller size, these versions of our Master database still provide 99.9% Coverage of the ActiveWeb, a figure we are immensely proud to maintain.
Depending on their location, purpose, and implementation, keeping these devices (i.e. database instances) up to date can be more difficult, though it is equally important for security. For these types of devices and implementations, updating the entire database every time a category change happens just isn’t feasible. So, to optimize data usage, these devices are typically configured to pull down a “fresh” version of the database once per day. But as we discussed earlier, that is a lot of time between updates, and a significant number of URLs will change over the course of 24 hours.
To support these devices, we designed our SDK infrastructure to deliver “real-time” critical security updates, which we call zvelo Instant Protection, or zIP. When important category changes are made, they are delivered to deployed SDKs all over the world in “real-time” (within seconds).
Here’s how we tackle those individual updates. In addition to the primary “consolidated” database on these devices (updated daily), our SDK is also equipped with a temporary “CustomDB”, which stores all of the incremental category changes (real-time zIP updates) that occur between daily refreshes. When a query is made to the SDK on these devices, it is first checked against the “CustomDB” in case any updates have been made. If there is no entry there, the query is then made against the “cached” version of zveloDB. If the URL is not found in either local database, the SDK can be configured to make a cloud query to zvelo’s Master Database. The SDK can also be configured to submit uncategorized URL requests to the zveloAI cloud network, improving categorization and coverage for everyone via a crowd-sourced approach to gathering URLs.
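To make the lookup order concrete, here is a minimal Python sketch of a tiered lookup. It is an illustration only and not the zveloDB SDK API: the class `TieredLookup`, its methods, the `cloud_lookup` callback, and the example URLs and category names are all hypothetical. It also assumes that the temporary CustomDB is emptied whenever a fresh daily snapshot is loaded, which the description above implies but does not state outright.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical names throughout; this is not the zveloDB SDK API.
CloudLookup = Callable[[str], Optional[str]]


class TieredLookup:
    """Checks real-time updates first, then the daily cache, then the cloud."""

    def __init__(self, cached_db: Dict[str, str],
                 cloud_lookup: Optional[CloudLookup] = None) -> None:
        self.custom_db: Dict[str, str] = {}  # incremental real-time changes
        self.cached_db = dict(cached_db)     # consolidated daily snapshot
        self.cloud_lookup = cloud_lookup     # optional cloud fallback

    def apply_realtime_update(self, url: str, category: str) -> None:
        """Record a category change received between daily refreshes."""
        self.custom_db[url] = category

    def daily_refresh(self, fresh_snapshot: Dict[str, str]) -> None:
        """Swap in a new snapshot. Assumption: it already contains the
        incremental changes, so the temporary store can be cleared."""
        self.cached_db = dict(fresh_snapshot)
        self.custom_db.clear()

    def lookup(self, url: str) -> Tuple[Optional[str], str]:
        """Return (category, source), where source names the tier that answered."""
        if url in self.custom_db:
            return self.custom_db[url], "custom"
        if url in self.cached_db:
            return self.cached_db[url], "cache"
        if self.cloud_lookup is not None:
            category = self.cloud_lookup(url)
            if category is not None:
                return category, "cloud"
        return None, "uncategorized"
```

In this sketch, a URL that was categorized as "News" in the morning snapshot and re-categorized as "Phishing" by a real-time update is answered from the temporary store, so the newer category wins without waiting for the next daily refresh.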
Prioritizing URL Categorization Updates For Up-to-the-Minute Protection
With our global network of more than 1 billion end users, category changes are happening continuously, 24/7. To provide the highest levels of protection and coverage for newly identified threats and classification changes, we sort ActiveWeb URL category updates into four priority bands, as follows:
- New Malicious Detection: A URL category change because of a malicious detection
- Safety Update: A URL category change from malicious to a “safe” security reputation
- Miscategorization Update: A URL category change due to a reported miscategorization
- Objectionable Update: A URL category change to an objectionable or highly blocked category
In this way, we always propagate critical security updates and malicious and phishing detections first (Malware, Phishing & Fraud, Command & Control, Botnets, and more), and then miscategorization and objectionable-category changes immediately after. When a new identification or category change happens in our system, it is instantly propagated (within seconds) to all SDKs and database instances worldwide that have zvelo Instant Protection updates enabled.
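One simple way to picture band-based ordering is a priority queue. The Python sketch below is an assumption-laden illustration, not a description of zvelo's actual pipeline: the `Band` enum, the `UpdateQueue` class, and the example URLs are hypothetical, and the numeric order simply follows the order in which the four bands are listed above. Updates within the same band are released first in, first out.

```python
import heapq
import itertools
from enum import IntEnum
from typing import List, Tuple


class Band(IntEnum):
    """Hypothetical numbering; a lower value is propagated earlier."""
    NEW_MALICIOUS_DETECTION = 1
    SAFETY_UPDATE = 2
    MISCATEGORIZATION_UPDATE = 3
    OBJECTIONABLE_UPDATE = 4


class UpdateQueue:
    """Releases category updates by band, then by arrival order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, str]] = []
        self._counter = itertools.count()  # tie-breaker preserving arrival order

    def push(self, band: Band, url: str, category: str) -> None:
        heapq.heappush(self._heap, (int(band), next(self._counter), url, category))

    def pop(self) -> Tuple[Band, str, str]:
        band, _, url, category = heapq.heappop(self._heap)
        return Band(band), url, category

    def __len__(self) -> int:
        return len(self._heap)


queue = UpdateQueue()
queue.push(Band.OBJECTIONABLE_UPDATE, "a.example", "Objectionable")
queue.push(Band.NEW_MALICIOUS_DETECTION, "b.example", "Phishing")
queue.push(Band.MISCATEGORIZATION_UPDATE, "c.example", "News")
queue.push(Band.NEW_MALICIOUS_DETECTION, "d.example", "Malware")

# Drain the queue; the two malicious detections come out first.
release_order = [queue.pop()[1] for _ in range(len(queue))]
```

The arrival counter matters here: without it, two updates in the same band would be compared by URL string, which would reorder them alphabetically rather than by the time they were detected.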
Additional Thoughts
This is a broad overview of our real-time updates infrastructure, which supports the continuous stream of base domain and full-path ActiveWeb URLs received from our global network of end users. For more information about how we manage newly visited uncategorized URLs or how we are able to maximize coverage of the ActiveWeb, check out our blog: Secret to the Market’s Most Accurate and Comprehensive Web Categorization Service.
You can also take our web categorization service for a test drive and enter URLs to get back a category value on the zveloLIVE: URL Checker. If you have questions or are ready and committed to implementing the industry’s most comprehensive and accurate URL Database for your solution, contact our sales team.