We are often asked how we achieve such high coverage of the web with our URL database. A question raised in a recent discussion with a prospective partner went as follows:
“These numbers look really good. Can we take a moment and a step back and have you explain how you build your database of URLs and categorizations? It sounds like it’s a combination of spidering and feeds.”
The question provides a perfect opportunity to take someone through the story behind what goes into creating the best database coverage in the market.
History and Challenges
When zvelo first started working on a web classification database over a decade ago, we tried spidering but quickly found that it didn’t work very well. That approach leaves large “gaps” and does a poor job of capturing what we refer to as the “Active Web”: the web pages that actual users are actively visiting.
Spidering also created massive bloat by inserting URLs into the database that weren’t active. You end up with the worst of both worlds: the most active and important URLs are missing from the database, while far too many URLs that simply aren’t active clutter it.
As that was a dead-end approach, we took a step back and realized we had to address the following technical and business challenges:
- Instantaneous detection of new URLs at the domain, sub-domain, and full-path levels, as they become active
- Very fast processing and classification of the URLs
- Real-time updating of classified URLs
Further, we had to do this within certain constraints: we didn’t have billions of dollars to solve the problems and we didn’t have hundreds of PhDs on staff. In other words, we weren’t Google. Yet we needed to solve many of the same problems as Google, only in mirror image: rather than providing insight into which URLs match a certain type of content or search term, we needed to provide insight into the type of content at a given URL.
Instant Detection of New URLs and Domains
As noted above, spidering doesn’t solve the problem. After considerable analysis, we landed on a variation of a crowd-sourcing approach that combined a technical solution with a business model solution. We would establish partnerships with network security vendors, antivirus vendors, endpoint security providers, web filtering companies, telcos/ISPs, SIEM vendors, CASBs, and others who would deploy zvelo’s probes, giving us insight into relevant web traffic and visibility into the web activities of users. In exchange, these partners would get the broadest and most accurate web categorization database in the market, providing the most effective protection for their users.
This approach took significant time, but we felt it was worth the investment: today we have very close to 100% coverage of the active web. Through these relationships we were able to deploy probes/sensors, proxies, and a network used to detect the active web surfing of actual users. These partners represent over 650 million end users in every corner of the world. This gives us a continuous source of “active” web traffic and activity drawn from the clickstream behavior of actual human users, allowing us to see, second by second, which URLs (domain, sub-domain, full-path) are being visited, which are already categorized (typically 99.9% at the base domain level), which still need to be categorized, and so forth. Each of our partners and their users gets the coverage and protection represented by the combined and collective web activity of ALL of our partners’ users.
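To make that flow a little more concrete, here is a minimal sketch of how a single clickstream event from a partner probe might be triaged: check whether the full URL, host, or base domain is already categorized, and if not, queue it for classification. The names used (ClickEvent, lookup_category, enqueue) are illustrative assumptions, not zvelo’s actual interfaces.

```python
# Hypothetical sketch: triaging one clickstream event from a partner probe.
# All names and structures are illustrative assumptions, not zvelo's real API.
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class ClickEvent:
    url: str          # full URL observed by the probe
    timestamp: float  # when the visit occurred


def triage(event: ClickEvent, db, classify_queue) -> None:
    """Decide whether an observed URL is already categorized or needs work."""
    parts = urlsplit(event.url)
    host = parts.hostname or ""
    base_domain = ".".join(host.split(".")[-2:])  # naive base-domain extraction

    # Most lookups resolve at the base-domain level (typically ~99.9%).
    for key in (event.url, host, base_domain):
        category = db.lookup_category(key)   # assumed lookup interface
        if category is not None:
            return                           # already covered; nothing to do

    # New or uncategorized URL: queue it for near-real-time classification.
    classify_queue.enqueue(event.url, priority="active-web")
```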
Very Fast Domain and Full-Path URL Classification
All of this traffic, along with SIEM and subscriber analytics logs, impression data, and other data, is fed into our AI-based content categorization engines and malicious detection systems, where processing and categorization occur in seconds. These “human supervised” models support 500 topical, objectionable, malicious, phishing, and other classifications and detections. Further, on a daily basis we compare the zvelo database against website ranking services covering the global top 1 million sites, as well as the most popular websites in the top 25 Internet markets around the world.
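The sketch below illustrates one plausible shape of that classification step: page content is scored against a set of human-supervised category models and labeled wherever the confidence is high enough. The threshold value, model interface, and review routing are assumptions made for illustration only.

```python
# Hypothetical sketch: scoring page content against a set of human-supervised
# category models. Thresholds and interfaces are assumptions, not zvelo's own.
from typing import Dict, List

CONFIDENCE_THRESHOLD = 0.80  # assumed cut-off before a label is accepted


def classify_page(text: str, models: Dict[str, object]) -> List[str]:
    """Return every category label whose model confidence clears the threshold."""
    labels = []
    for category, model in models.items():  # e.g. ~500 category/threat models
        score = model.score(text)            # assumed scoring call returning 0..1
        if score >= CONFIDENCE_THRESHOLD:
            labels.append(category)
    # Low-confidence pages would be routed to human reviewers
    # (the "human supervised" part of the pipeline).
    return labels
```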
Real-Time Updates For Categorizations and Threat Detections
The resulting categorizations and threat detections are immediately propagated to our master database and all local deployments, so every deployment has the immediate, real-time coverage and protection of the cumulative web activity of the total end user base.
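As a rough illustration of that propagation step, a fresh categorization could be written once to the master database and then pushed to every local deployment, so all partners see the update at effectively the same time. The publisher and deployment interfaces below are assumptions; the actual propagation mechanism is not described here.

```python
# Hypothetical sketch: fanning out a new categorization to the master database
# and to all local deployments. Interfaces are illustrative assumptions only.
import json
import time


def publish_update(url: str, categories: list, master_db, deployments) -> None:
    """Write the new categorization once, then push it to every subscriber."""
    record = {
        "url": url,
        "categories": categories,
        "updated_at": time.time(),
    }
    master_db.upsert(record)          # assumed master-database interface
    payload = json.dumps(record)
    for deployment in deployments:    # local partner deployments
        deployment.push(payload)      # assumed push/streaming interface
```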
The Result?
A URL database with the broadest coverage in the market, the fastest classification speed, and real-time updates, providing our partners with unmatched protection for their users and a critical competitive advantage. Or, as we call it: the industry’s most comprehensive and accurate view of the “ActiveWeb”.