The truth about Google facts

Links are better than the truth, for now.

Google has a new paper out that discusses how it might rank pages based on facts as opposed to links. If this were to become the case, it would represent a huge move for the search engine, which has historically used links as a major indication of relevance. In fact, it was the PageRank algorithm that really put Google on the map in the first place, and led to the search engine overtaking other players like Yahoo years ago.

These days, Google has over 200 signals it uses to rank content, but links are still a significant part of that. Just how significant they are is debatable, particularly as Google includes more and more content and answers directly in its search results.

Of course, just having this paper doesn’t mean that Google has implemented such a ranking strategy, nor does it necessarily mean that it will. The company has countless patents, and not all of them are in use. That said, the fact that Google has been researching this, and has indeed authored a paper on it, combined with the moves the search engine has already made, suggests that this is something Google could implement at some point.

The abstract reads as follows:

The quality of web sources has been traditionally evaluated using exogenous signals such as the hyperlink structure of the graph. We propose a new approach that relies on endogenous signals, namely, the correctness of factual information provided by the source. A source that has few false facts is considered to be trustworthy.

The facts are automatically extracted from each source by information extraction methods commonly used to construct knowledge bases. We propose a way to distinguish errors made in the extraction process from factual errors in the web source per se, by using joint inference in a novel multi-layer probabilistic model.

We call the trustworthiness score we computed Knowledge-Based Trust (KBT). On synthetic data, we show that our method can reliably compute the true trustworthiness levels of the sources. We then apply it to a database of 2.8B facts extracted from the web, and thereby estimate the trustworthiness of 119M webpages. Manual evaluation of a subset of the results confirms the effectiveness of the method.

So, they’ve confirmed the effectiveness of this method. That’s interesting. And if that wasn’t enough to get you thinking about where Google might be headed, the opening paragraph of the paper’s introduction pretty much discredits links as a valuable signal:

Quality assessment for web sources is of tremendous importance in web search. It has been traditionally evaluated using exogenous signals such as hyperlinks and browsing history. However, such signals mostly capture how popular a webpage is. For example, the gossip websites listed in mostly have high PageRank scores, but would not generally be considered reliable. Conversely, some less popular websites nevertheless have very accurate information.

Curious about which “gossip sites” they’re referring to? Well, the section it references sends readers to this list of the top 15 most popular celebrity gossip sites, which includes: Yahoo! OMG!, TMZ, E Online, People, USMagazine, WonderWall, Gawker, ZimBio, PerezHilton, HollywoodLife, RadarOnline, PopSugar, WetPaint, MediaTakeOut, and FishWrapper.

Later in the paper, it notes that among these fifteen sites, fourteen have a PageRank among the top 15% of websites due to popularity, but for all of them, the KBT scores are in the bottom 50%.

“In other words, they are considered less trustworthy than half of the websites,” it says. It also says that forum websites tend to get low KBT, specifically calling out an example of inaccurate info found on Yahoo Answers, which you’ve probably seen ranking highly in Google results repeatedly.

The paper does also note that KBT as a signal is orthogonal to more traditional signals like PageRank. It also appears to hint at identifying content that is irrelevant to the main topic of a website.
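To make the basic idea a little more concrete, here is a toy sketch in Python. It is not the paper's method: the paper describes a multi-layer probabilistic model that also separates extraction errors from real factual errors, and none of that is attempted here. This version simply counts how many of a page's extracted facts agree with a small reference knowledge base. The knowledge base, the fact format, and the function name `naive_trust_score` are all made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

# A fact is a (subject, predicate, object) triple. This format is an
# assumption for illustration, not something specified in the article.
Fact = Tuple[str, str, str]

# Hypothetical reference knowledge base: (subject, predicate) -> accepted value.
KNOWLEDGE_BASE: Dict[Tuple[str, str], str] = {
    ("Paris", "capital_of"): "France",
    ("Canberra", "capital_of"): "Australia",
    ("Ottawa", "capital_of"): "Canada",
}


def naive_trust_score(
    facts: List[Fact],
    kb: Dict[Tuple[str, str], str] = KNOWLEDGE_BASE,
) -> Optional[float]:
    """Return the share of checkable facts that match the knowledge base.

    Facts the knowledge base knows nothing about are skipped. If nothing
    can be checked, return None rather than guessing a score.
    """
    checked = 0
    correct = 0
    for subject, predicate, obj in facts:
        expected = kb.get((subject, predicate))
        if expected is None:
            continue  # cannot verify this fact either way
        checked += 1
        if obj == expected:
            correct += 1
    if checked == 0:
        return None
    return correct / checked


# Two made-up pages: one gets a fact wrong, the other gets both right.
popular_page = [
    ("Paris", "capital_of", "France"),
    ("Canberra", "capital_of", "New Zealand"),
]
niche_page = [
    ("Ottawa", "capital_of", "Canada"),
    ("Canberra", "capital_of", "Australia"),
]

print(naive_trust_score(popular_page))
print(naive_trust_score(niche_page))
```

Notice that nothing in the score looks at links or traffic. That is the sense in which a fact-based signal can be orthogonal to a popularity-based one: a heavily linked page and an obscure one are judged only on what they say.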

This all really just scratches the surface of what the paper itself gets into, so feel free to jump in there for a deeper dive into what we’re dealing with.

In theory, what Google is proposing here could lead to some major improvements to search rankings. It makes some really good points. Chief among them is that popularity isn’t necessarily the best indicator of relevance.

Questions will remain, however, about just how well Google really is able to distinguish fact from fiction and/or fact versus outdated information. We’ve seen Google struggle with this time and time again with its Knowledge Graph. If Google’s “knowledge” is to become the backbone of ranking in the way that PageRank has been historically, it could open the algorithm up to potential errors.

That said, given that Google uses so many signals, and this would still just be one of them, I personally feel like this could be a more legitimate signal than PageRank. It’s been well-documented how links can be manipulated while Google plays whack-a-mole both manually and algorithmically. This might be harder for evildoers to game. Facts would certainly be harder to buy, although you have to wonder how the native advertising/sponsored content industry will play into this.

For now, it’s all theoretical anyway. You should really be more concerned with getting your site mobile-friendly. This is an actual signal Google will launch next month. If you have an Android app, you should get it set up for app indexing. These are the things that can make a difference in the near term.