Estimated Reading Time: 7 minutes
Based on data from January 1, 2019, there are now roughly 1.94 billion websites on the internet. Compared to ~1.81 billion from January 2018, that would equate to a net gain of over 350,000 new sites per day over the course of the year. While the majority of web content remains in English, new domains are increasingly being published in non-English languages. The proliferation of smartphones and technology that connects people to the internet has led to increased numbers of non-English users on the web. According to Statista, here’s the current breakdown of internet users by language.
For businesses with a global perspective, it is essential to be aware of the growing diversity of users on the web. These trends represent new markets, opportunities, and challenges—requiring regular adjustment of marketing efforts, sales strategies, etc. For other industries, such as advertising, web filtering, and more, it is even more important—requiring continuous adjustment of products and services to accommodate language, culture, politics, and any number of other preferences pertaining to operations in target countries and regions. Even straying off course or overlooking a seemingly simple adjustment can have lasting consequences and create lasting negative sentiments towards a product or company.
At zvelo, we provide state-of-the-art web categorization in a growing number of languages. We categorize web content at the hostname (site level) and beyond—all the way down to the individual page or post-level. One of the key factors for categorizing web content is, of course, understanding the target language of the content. By leveraging language translation, and applying machine learning and other techniques for categorization, we are able to achieve a high level of accuracy and consistency.
How do we do this and what challenges do we encounter along the way? Come find out.
Introduction to Web Content Categorization
Web categorization is the task of classifying a domain, URL, or webpage under a pre-classified category. zvelo has 497 defined topic-based, malicious, and objectionable content categories. These categories range from Tennis to Medical to Travel to Malicious.
We operate in two primary business domains, one is for content filtering, parental controls, and malicious detection. The other is Ad Tech. If a parent would like to filter out objectionable content or a school would like to protect itself from malicious websites, then content filtering is the challenge and our zveloDB™ URL Database is the solution. On the other hand, Ad Tech leverages the high accuracy of our content categorizations in order to increase the relevancy of advertisements, and thereby improve conversions for said advertisement. An advertisement about a basketball product is more likely to sell when the advertisement is placed on a basketball-related website—whereas conversions would decrease if the same advertisement was displayed on a website about sailing, or cosmetics, or tobacco. They also use those same categorizations for brand safety—protecting their advertisements from appearing on sites and pages that they would not want to be associated with.
Some key challenges to web categorization are:
- Subjectivity in categories: One person might classify a website as a Blog but another might declare it to be News or Entertainment
- Granularity of categories and hierarchical categories: Should a website be classified as something specific like “Tennis” or is “Sports” good enough?
- Speed & Scalability: There are nearly 2 billion websites out there so the solution has to be fast and far reaching!
- The internet is constantly changing: models need to be updated constantly
Machine learning and artificial intelligence are central to the categorization task. Humans can efficiently annotate some urls, but it would never be cost effective to have humans classify ALL websites. Automation is a must.
The classification problem can easily be cast as a Bayesian inference problem. As a quick overview, Bayesian inference is a statistical method for calculating and updating the probability of a given hypothesis as additional information is gained. This form of inference derives its name from Bayes’ theorem—which describes the probability of an event given prior knowledge and conditions. For you history buffs, it is named after Thomas Bayes, who is credited for providing the first equation to update the probability of events by accounting for changing evidence.
So, in application, let’s say that we are trying to predict the probability that a particular website is part of the “Medical” category. We go to that website and see a rendered version of the underlying html. We can grab all of the html data along with any links and images associated with the website. Essentially, what we want is a mapping from all of the “features” that can be found in the website to the probability that a particular website belongs to each of the 497 categories. This is essentially a programmatic equivalent to what a human would do when they look at a webpage and try to fit it into a category. We want to know the probability distribution over the categories that we have defined given the features extracted from a website. What we want to find is a function f(features) that accurately maps the features to the probability of a particular category.
Selecting the top category is then as simple finding the category that has the highest probability given the features.
Let’s return to our example of estimating a probability for the “Medical” category. We have fetched and analyzed the page’s entire html document (including tags, css, text, and images). The English language is complex and full of exceptions and nuances, but let’s say we found the word “heart” within the text of the webpage. This might be a signal that the page “could” fit under the medical category as it’s discussing “heart attacks”, or “heart arythmia”. Alternatively, it could be describing Valentines day (i.e. a “Dating & Relationships” category) or perhaps it is being used as a metaphor for in a number of other language-specific contexts. For this reason, our model would assign:
You can see how the category is highly dependent upon the context in which the word is found. For example, what if we also see the words, “Valentines Day”. Now we can be fairly certain that the site isn’t medical at all. Therefore:
What if we see the word “surgeon” instead? Now the probability is likely to be higher. Especially, consider if we have a picture of a surgeon with a medical mask, then we are just about certain that the webpage belongs under the “Medical” category.
To calculate p(category | features) = f(features) we turn to machine learning.
Supervised vs. Unsupervised Machine Learning
Supervised machine learning involves collecting pre-categorized websites (training labels) along with the corresponding html and images for those websites (training features). We then “train” a model to create a mapping from the large number of features to the labels. Feedback is provided to the supervised model in the form of a loss function, where the model is penalized for incorrect answers and rewarded for correct answers. In this way, the machine learning algorithm slowly gets better as more and more labelled data enters the model.
On the other hand, unsupervised learning involves using only the training features WITHOUT labels to determine useful trends and “clusters” in the data. This method can work well if you have lots and lots of data and need a place to start; however, models will be much less accurate. And yes, it is exhaustive work to analyze and label the amount of websites that is required to achieve a state-of-the-art model.
At zvelo, we prioritize accuracy and strive to provide the best-in-class solution. That is why we focus on accuracy first, then move towards scalability. This means that supervised learning is our only option. That is also why we have established a dedicated internal team focused on creating reliable, consistent, and quality “ground-truth” examples.
Unfortunately, a machine learning model is never perfect and just like humans, mistakes happen. That is why it is essential to be constantly evaluating models against humans and vice versa. We are constantly testing our models against human verifiers to make sure that they are always up to date and accurate. When a human finds that the algorithm has made a mistake, this data is automatically incorporated back into the system so that the model can be retrained and avoid such mistakes in the future. Through countless hours of quality assurance and testing, we have found that this constant monitoring, flagging, and retraining process is key to building the quality of accuracy that our customers have come to expect.
We have now established how we extract information from a website, use data quality to ensure that we have annotated labels for training, and that we want to calculate the probability distribution over categories given those features.
How do we calculate probability distribution over categories? Which machine learning models do we use?
If you have read much about the machine learning space, you have probably heard of models like logistic regression, classification and regression trees, support vector machines, and maybe even more exotic models like random forests, gradient boosting trees, and deep neural networks. It is important not to get too bogged down in any particular model. Each has their own strengths and weaknesses.
That is why we use an ensemble method to combine a number of different models into one. The philosophy is that one model may be more adept at classifying a particular category or part of the webpage or specific feature than another. If this is true, then an ideal system would enable many models to work in concert and automatically determine when a model is better suited for a particular task than another. This approach also allows us to add new models to the system very easily. These models can be added and tested to determine the increase in accuracy on a test set. If the new model, significantly improves performance, then it will be added to the overall system and deployed to production. In this way, we have built an extremely flexible and adaptable model that can be automatically retrained, corrected by humans, and quickly refined as new models and research is released.
Thanks for reading and stay tuned for Part II where I’ll explore our approach to content translation—and its integral role in our categorization systems for non-English URLs.