Fake News: Understanding the Problem and the Challenge with its Classification


With the recent, polarizing U.S. Presidential election, the topic of fake news has come up often. The national conversation about it usually centers on two topics: how much it has influenced, and continues to influence, the political opinions of Americans who frequent social media, and how something needs to be done to better identify and filter it out. At zvelo, since web categorization is our bread and butter, this is a very important and relevant issue. With the increased pressure to better classify news sources, web page categorization becomes a paramount issue for media buyers and technology companies alike. It is equally important to better define fake news and lay out the characteristics these sites possess.

Not Just TheOnion.com Anymore

When fake news is brought up, sites like TheOnion.com, Clickhole.com, and Reductress.com come to my mind. Before early November, my definition of this term was firmly grounded in sites that produce fictional news stories with humorous or inflammatory headlines to accompany them. Taking a small sampling of headlines from the aforementioned sites, you find things such as:




But after the Nov. 8th election, the definition of this term seems to have shifted, due in large part to the research of Melissa Zimdars, a media professor at Merrimack College. Zimdars and her team generated a list of what they deemed to be fake news sites. Her work garnered a lot of media attention, and the reception to the list was varied. Some of the very site properties that her research labeled as fake questioned the validity of the list. However, much of the mainstream media coverage of her research implied that these fake news sites heavily influenced the outcome of the election, mainly due to the viral and sometimes misleading way in which links originating from these domains were shared. This caused a perceived hysteria that most unassuming Americans could no longer discern real news from misinformation, because of how Facebook and other social media sites were presenting links to their users. Due to this backlash and the increased external pressure that came with it, Facebook recently promised to take measures to crack down on this newly coined “fake news.” But ultimately, many news sources and critics assert that it is too difficult a problem to solve. We wanted to look further into this assertion.

Pre-Analysis Observations

Even before we ran the sites on Zimdars’ list through our content categorization engine, a cursory manual review made the reason for the list’s somewhat controversial reception apparent. A lot of the domains called out in the research are really just partisan-leaning, in either direction. These sites, for the most part, did not contain any fictionalized accounts of news stories or humorous headlines. Rather, many reported on events that actually occurred, but offered a partisan-leaning (or just highly opinionated) spin on the event. For sites such as Breitbart.com or OccupyDemocrats.com, you’d have headlines such as the following, respectively:



A closer examination of the articles that accompany these headlines demonstrates that they are covering actual events, albeit with a partisan agenda. There is no doubt that some of the articles within these labeled fake news domains may stretch the truth, disparage certain groups, or leave out key details, but for the most part, many of them do report on events that have actually occurred (i.e., what I’d define as “news”). I’d place sites like Breitbart and Occupy Democrats in an Opinion/Op-Ed category rather than a Fake one. However, it’s worth noting that completely satirical and/or fictionalized domains still appear on Zimdars’ list, such as those mentioned earlier in this post: TheOnion.com, Clickhole.com, and Reductress.com.

zvelo Content Categorization Analysis

Our findings from running the sites through the zvelo content categorization engine corroborated the hypothesis that the websites carry actual news stories, albeit in many cases with an opinionated or partisan-leaning bias. Several sites are intentionally satirical in nature, while a few websites truly are fake and appear designed specifically to generate traffic in order to drive a digital advertising revenue stream. What makes this experiment more challenging, yet is necessary for an accurate understanding of the topic, is that you have to examine and make an overarching categorization decision for individual articles, blog posts, and pages within a domain. Articles, contributors, and opinions vary greatly across many news and aggregator websites, which makes categorizing entire websites at the domain level a somewhat pointless and often misleading exercise.

Of zvelo’s nearly 500 content categories, I expected that these fake news sites (based on the working definition I had of “fake news” prior to last month) would likely fall into any of the following categories, either as their primary category or as a secondary or tertiary one: Humor, Personal Pages & Blog, User Generated Content, Community Forums, Hate Speech, or Social Media.

Our analysis of the list of URLs (keep in mind these are mostly base domain-level URLs) concluded the following:

  • Only ~11% of sites have a primary or secondary content category of Humor; ~8% have the Personal Pages & Blog content category; none have the User Generated Content, Hate Speech, or Community Forums category; and just one site has the Social Media category.
  • ~59% fall into the National News; International News; Portal, and Search; or Politics content categories.
  • ~15% simultaneously have News and Humor or Blog content categories.
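
For readers who want to run a similar tally on their own list of sites, a minimal Python sketch follows. It is only an illustration: the domains and category assignments are made-up placeholders, not zvelo's data, and the function names are hypothetical rather than part of any zvelo API. Only the category names are taken from this post.

```python
# Hypothetical sample: domain -> assigned categories (primary first).
# These assignments are invented for illustration; they are NOT zvelo output.
SITES = {
    "satire.example": ["Humor", "National News"],
    "partisan.example": ["Politics"],
    "blog.example": ["Personal Pages & Blog", "Politics"],
    "wire.example": ["National News", "International News"],
}

NEWS = {"National News", "International News"}
HUMOR_OR_BLOG = {"Humor", "Personal Pages & Blog"}


def share_with_any(sites, categories):
    """Percentage of sites that carry at least one of `categories`."""
    if not sites:
        return 0.0
    hits = sum(1 for cats in sites.values() if categories & set(cats))
    return 100.0 * hits / len(sites)


def share_with_both(sites, group_a, group_b):
    """Percentage of sites that carry a category from BOTH groups at once."""
    if not sites:
        return 0.0
    hits = sum(
        1
        for cats in sites.values()
        if group_a & set(cats) and group_b & set(cats)
    )
    return 100.0 * hits / len(sites)


if __name__ == "__main__":
    print("Humor:", share_with_any(SITES, {"Humor"}))
    print("News or Politics:", share_with_any(SITES, NEWS | {"Politics"}))
    print("News and Humor/Blog:", share_with_both(SITES, NEWS, HUMOR_OR_BLOG))
```

Because a site can carry more than one category, the shares overlap and are not expected to sum to 100%, which is also true of the figures above.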

Fake News is a Grey Area

The results of our analysis help to illustrate the challenge of classifying fake news, particularly at the domain level. It is quite a grey area. From a topic classification standpoint, any of these sites can contain valid, truthful topics or keywords that would lead a categorization engine to assign it to a “News” category while it simultaneously contains untruthful topics or points of view. Categorizing content at the individual article, blog, or page level provides better granularity and topic specificity. However, in most cases, the content still reflects an actual event combined with the author’s opinion or bias. The opinion or bias doesn’t necessarily make the article “fake.” As much as Facebook deserves increased scrutiny over how it controls the content that its users see in their newsfeeds, the problem of effectively filtering fake news really isn’t as easy as it sounds. Taking the obvious issue of revenue out of the equation, I really don’t blame Facebook for not volunteering to step up and “solve” the fake news problem immediately.

What Should We Do About It?

Companies have recently purported to identify fake news sites. My best guess is that this would have to be a heavily manual task, because most of these sites have the trappings of a legitimate site, which makes them hard for an algorithm to identify. Further, in some of the more clear-cut fake news cases, most recently that of Jestin Coler, publishers often spin up and shut down new versions of these sites, which makes a manual effort unscalable.

In order for companies to move towards solving this problem, more focus has to be placed on the page/article-level, as opposed to blacklisting entire domains that fall under a certain content category. Additionally, more emphasis will be needed on contextual/keyword identification and sentiment analysis at the page/article-level.  Even with this due diligence, the context, sentiment and opinions that one would define as helping to identify fake news will vary greatly from customer to customer.  This issue will also need to be approached similarly to that of Brand Safety, where there are many custom taxonomies to suit individual customer needs.
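
To make the page-level idea concrete, here is a rough Python sketch of a filter that prefers page-level categories and only falls back to the domain-level ones when a page has not been categorized. Everything in it is an assumption for illustration: the URLs, the lookup tables, and the function names are hypothetical, and a real system would query a categorization service rather than in-memory dictionaries. The per-customer set of blocked categories stands in for the custom taxonomies mentioned above.

```python
from urllib.parse import urlparse

# Hypothetical lookup tables standing in for a categorization service.
PAGE_CATEGORIES = {
    "https://mixed.example/world/election-results": {"National News"},
    "https://mixed.example/satire/mayor-is-a-cat": {"Humor"},
}
DOMAIN_CATEGORIES = {
    "mixed.example": {"National News", "Humor"},
}


def categories_for(url):
    """Prefer page-level categories; fall back to domain-level ones."""
    if url in PAGE_CATEGORIES:
        return PAGE_CATEGORIES[url]
    host = urlparse(url).hostname or ""
    return DOMAIN_CATEGORIES.get(host, set())


def is_allowed(url, blocked_categories):
    """Allow a page unless it carries a category this customer blocks."""
    return not (categories_for(url) & blocked_categories)


if __name__ == "__main__":
    customer_blocklist = {"Humor"}  # one customer's own definition
    for page in PAGE_CATEGORIES:
        print(page, is_allowed(page, customer_blocklist))
```

With a blocklist applied at the domain level, both pages on the example domain would be blocked together; with page-level categories, the news article stays available while the satire piece is filtered for the customer who asked for that.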

It is really important that media and technology companies come to a consensus on how fake news is defined and how it is handled when presented to the masses. Thus far, I don’t think this has happened. There is a certain irony to this whole issue. The alarmist headlines about the fake news phenomenon have spawned a knee-jerk reaction, with many scrambling to block entire websites as soon as possible (taking high-traffic websites and valuable impression opportunities out of the digital advertising ecosystem); this reaction is based on the misleading information in the very headlines that are reporting on the problem of fake news. So, in essence, fake news about fake news has got us all up in arms about fake news.

zvelo will remain diligent about closely monitoring this issue, and we will look to use our page-level content categorization dataset, as well as our upcoming Keyword and Sentiment datasets, to help offer solutions for customers.