Secret Behind The Most Accurate Web Content Categorization


At one time, the artist Steve Martin was actually a (pretty funny) comedian. In one particular skit, he would announce the secret to making a million dollars. He’d play his banjo and then say…drumroll…the first step is to start with a million dollars. Hey, back then, before the Internet (BTI??) it was considered humor.

Anyway, if you started reading this article because you thought you were going to find the actual secret to the market’s most accurate web categorization service, all we can say is… check out our blog on clickbait headlines. While we won’t cover the down-in-the-weeds details, we will outline the concepts, processes, and framework that form the foundation for zvelo’s web categorization services.

It’s this foundation, along with secret combinations of AI models, machine learning, and data mining, that provides the classification results that make zveloDB the database you can trust.

A Hybrid Approach: Human-Supervised Machine Learning

Over many years of testing and trial and error, zvelo determined that a human-machine “hybrid” approach to classification produced the best outcomes. The human element provided the verification necessary for the highest levels of accuracy, while machines (i.e., AI/ML models and calculations) provided the scale necessary to handle the enormous volume of new URLs and content being published at an increasing rate.

Specifically, zvelo adopted a “human-supervised” approach to the classification challenge, in which humans prepare the training corpora (aka “datasets”) used to teach and train the AI models. Because there are hundreds of AI models, extensive amounts of training data are needed for the models to achieve the targeted levels of accuracy and efficiency. For zvelo, this has meant creating a database of millions of human-verified training datasets.

In addition to the training datasets, it’s also necessary to have an independent, human-verified set of test corpora to prevent confirmation bias and incorrect training of the AI models. For zvelo, this means continuously creating human-verified testing datasets for each of the AI models, which has added millions more items of human-verified test data.
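One way to keep test corpora independent of training corpora is to assign each human-verified URL to exactly one split, deterministically. The sketch below is only an illustration under that assumption; the function names, the hash-based rule, and the 20% test fraction are hypothetical and are not a description of zvelo's actual pipeline.

```python
import hashlib


def assign_split(url: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a URL to 'train' or 'test' (hypothetical rule)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_corpus(labeled, test_fraction: float = 0.2):
    """Split (url, category) pairs into disjoint train and test lists."""
    train, test = [], []
    for url, category in labeled:
        target = test if assign_split(url, test_fraction) == "test" else train
        target.append((url, category))
    return train, test
```

Because the assignment depends only on the URL, a URL that is re-verified later lands in the same split, so a test example never leaks into training.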

Continuous Quality Assurance and Sampling

Finally, zvelo implemented random daily sampling of the classification output from the production systems, covering thousands of URLs. The sampled results are human-verified, and the confirmed results serve as a 360-degree feedback loop that gives the AI models a continuous stream of new, human-verified, real-time training data.
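The sampling loop can be sketched in a few lines. This is a minimal illustration, not zvelo's implementation: the function names are hypothetical, production output is assumed to be a list of (url, predicted_category) pairs, and the human review step is stood in for by a dictionary of verified labels.

```python
import random


def sample_for_review(production_output, sample_size, seed=None):
    """Draw a uniform random sample of (url, predicted_category) pairs."""
    rng = random.Random(seed)
    k = min(sample_size, len(production_output))
    return rng.sample(production_output, k)


def apply_review(sample, human_labels):
    """Combine a sample with human verdicts.

    Returns the human-verified (url, category) pairs, which can be fed back
    as new training data, and the fraction of sampled predictions that the
    reviewers confirmed.
    """
    training = []
    confirmed = 0
    for url, predicted in sample:
        verified = human_labels[url]
        training.append((url, verified))
        if verified == predicted:
            confirmed += 1
    rate = confirmed / len(sample) if sample else 0.0
    return training, rate
```

The confirmation rate doubles as a daily estimate of production accuracy, while the verified pairs extend the training data regardless of whether the original prediction was right.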

For tracking purposes, zvelo identified a set of KPIs to measure accuracy, precision, recall, and efficiency at the micro (individual AI model) and macro (overall/aggregate) levels, and tracks these KPIs across languages and geographies. Using these KPIs, zvelo is able to monitor performance on an hourly and daily basis on both the test data (in a staging environment) and the production classifications.
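For readers unfamiliar with these metrics, here is a small sketch of how accuracy, precision, and recall could be computed per model and in aggregate. It assumes each model is a binary classifier for a single category and that the aggregate view simply pools the per-model counts; both are assumptions made for illustration, and the names `Counts`, `kpis`, and `aggregate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for one model against human-verified labels."""
    tp: int = 0  # predicted in category, verified in category
    fp: int = 0  # predicted in category, verified not in category
    fn: int = 0  # predicted not in category, verified in category
    tn: int = 0  # predicted not in category, verified not in category


def kpis(c: Counts) -> dict:
    """Compute accuracy, precision, and recall, guarding against empty counts."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "accuracy": (c.tp + c.tn) / total if total else 0.0,
        "precision": c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0,
        "recall": c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0,
    }


def aggregate(per_model: dict) -> Counts:
    """Pool the counts of all models into one overall Counts object."""
    agg = Counts()
    for c in per_model.values():
        agg.tp += c.tp
        agg.fp += c.fp
        agg.fn += c.fn
        agg.tn += c.tn
    return agg
```

Calling `kpis(per_model[name])` gives the individual-model view, and `kpis(aggregate(per_model))` gives the overall view. The same functions can be applied to counts grouped by language or geography.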

The source of the URLs being classified by zvelo is the “active web” (referred to as ActiveWeb): the web surfing of our OEM partners’ hundreds of millions of web filtering and parental controls users, as well as the subscriber analytics and brand safety applications where zveloDB is implemented. This ActiveWeb activity allows zvelo to focus its categorization on the URLs that are actively being visited and not waste time crawling the parts of the web that are inactive.
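The idea of letting real traffic drive the categorization queue can be illustrated with a simple frequency count. This sketch assumes a stream of visited URLs and a set of URLs that already have a category; the function name and the "most visited first" ordering are hypothetical choices for illustration only.

```python
from collections import Counter


def prioritize(visited_urls, already_categorized, limit):
    """Return up to `limit` uncategorized URLs, most frequently visited first."""
    counts = Counter(u for u in visited_urls if u not in already_categorized)
    return [url for url, _ in counts.most_common(limit)]
```

URLs that nobody visits never enter the queue, which is the point of focusing on active traffic rather than crawling the whole web.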


The focus on ActiveWeb URLs, along with the industry’s most advanced categorization approach that combines the relative strengths of humans and machines, is what allows our partners to have complete confidence in the accuracy and coverage of the zveloDB!

Web Content Categorization Workflow

The diagram below provides a high-level overview of the zvelo categorization process.

Dynamic Web Content Categorization Diagram

zveloDB and Content Categorization Services—performance and results you can trust.

99.9% Coverage. Over 99% Accurate. Categorically Superior.

zveloLIVE is our web-facing tool for checking the category and malicious status of URLs and domains across the internet. Check any URL for its category status, malicious status, and seven other taxonomy lookups. When you’re ready to start an evaluation, contact us!
