Abandon Conventional Data to Win

Today’s digital sphere presents an endless number of opportunities for brands to gain unprecedented insight into their audiences.

In his debut post for /Scribe, Social Chain’s lead data analyst, Rob Sullivan, discusses the merits of using social listening tools, like our own Crimson Hexagon, to scrape data from more unconventional sources:

When we think about social listening tools, we typically think about what we can obtain from mainstream social media channels, like Twitter, Facebook and Instagram. So, why look elsewhere?

Twitter: there’s more to data capture

Twitter is often a primary source of data capture in the social media world. However, the platform still has its biases; many accounts are fake, making for skewed and often meaningless results.

Harvard Business Review gave the example of Twitter data around Hurricane Sandy a couple of years ago.

Whilst data taken from Twitter did reveal some fascinating details, such as the revival of nightlife activity a day after the storm, most of the data came from Manhattan where smartphone and Twitter usage is highest.

Very few messages came from Coney Island, Breezy Point or Rockaway – some of the most severely affected areas. It is debatable whether analysts pulled useful data from the most affected communities.

When relying on insights from Twitter, ask yourself the following: are you getting data from people who really care about a subject, or just from those who shout loudest? Is Twitter a representative sample of the general population in a given country?

Case Study: Pew Research on American Twitter users

In 2016, Pew Research released survey data on who American Twitter users are (seen below). They skew female and younger, and are more likely to have a college degree and a higher income.

So even from survey data, we can see the requirement to be careful about whose posts we are analysing. Are those 18-29-year-olds representative of the 18-29-year-olds in the population?

[Table: Pew Research Center survey data on US Twitter users]

Source: Pew Research Center, US Twitter Survey from Social Media Update, n=696 (2016)
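One rough way to act on the representativeness question in this case study is to compare each group's share of your sample with its share of the wider population. The sketch below is in Python, and every number in it is a made-up placeholder rather than Pew or census data; the function name `representation_ratios` is also just an illustration. It only shows whether a group is over- or under-represented. It cannot tell you whether the people inside a group resemble their peers offline.

```python
def representation_ratios(sample_counts, population_shares):
    """Return sample share divided by population share for each group.

    A ratio above 1 means the group is over-represented in the sample,
    below 1 means it is under-represented.
    """
    total = sum(sample_counts.values())
    ratios = {}
    for group, population_share in population_shares.items():
        sample_share = sample_counts.get(group, 0) / total
        ratios[group] = sample_share / population_share
    return ratios


# Hypothetical figures for illustration only (not real survey data).
sample_counts = {"18-29": 450, "30-49": 350, "50-64": 150, "65+": 50}
population_shares = {"18-29": 0.20, "30-49": 0.35, "50-64": 0.25, "65+": 0.20}

ratios = representation_ratios(sample_counts, population_shares)
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}x its population share")
```

Groups with ratios far from 1 are the ones whose posts deserve the most caution, or a down-weighting, before you draw conclusions.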

Unconventional sources

Indeed, given these concerns, there is a strong argument for grabbing data from less obvious sources: for example, blogs, news articles, forums, YouTube comments, VKontakte (for Eastern European campaigns), and Reddit comments.

With a few lines of code in R, an open-source programming language for statistical analysis, Social Chain can grab all the Reddit comments from a thread or a subreddit to analyse users’ true feelings about a particular subject.
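To give a feel for what that involves, here is a minimal sketch in Python rather than R, and not Social Chain's actual code. It assumes the comments have already been downloaded as a nested listing shaped like Reddit's public JSON (a "Listing" whose children are "t1" comment objects, each with an optional nested "replies" listing). The sample data is hand-written, and `flatten_comments` is a hypothetical helper name.

```python
def flatten_comments(listing):
    """Yield (author, body) for every comment in a nested listing, depth first."""
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":  # assumed: "t1" marks a comment; skip other stubs
            continue
        data = child["data"]
        yield data.get("author"), data.get("body")
        replies = data.get("replies")
        if isinstance(replies, dict):  # assumed: an empty string means no replies
            yield from flatten_comments(replies)


# Hand-written sample in the assumed shape (not real Reddit data).
sample_listing = {
    "kind": "Listing",
    "data": {"children": [
        {"kind": "t1", "data": {
            "author": "user_a", "body": "Great signing",
            "replies": {"kind": "Listing", "data": {"children": [
                {"kind": "t1", "data": {
                    "author": "user_b", "body": "Agreed", "replies": ""}},
            ]}},
        }},
        {"kind": "t1", "data": {
            "author": "user_c", "body": "Not convinced", "replies": ""}},
        {"kind": "more", "data": {"count": 12}},
    ]},
}

comments = list(flatten_comments(sample_listing))
for author, body in comments:
    print(f"{author}: {body}")
```

Once the thread is flattened into simple (author, comment) pairs, it can be passed to whatever sentiment or topic analysis you prefer.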

Using more advanced code, we could analyse the subreddits where commenters cross over.


Source: Reddit user Felipe Hoffa used BigQuery to analyse 1.7 billion Reddit comments; this is just one screenshot from a much bigger map (2016)
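On a much smaller scale, the crossover idea can be sketched in a few lines of Python. This is not the BigQuery analysis credited above, just one simple way to score overlap: for every pair of subreddits, divide the number of shared commenters by the number of distinct commenters across both (the Jaccard index). The usernames and subreddit names are invented for illustration, and `subreddit_overlap` is a hypothetical helper.

```python
from itertools import combinations


def subreddit_overlap(comments):
    """Map each pair of subreddits to the Jaccard overlap of their commenters.

    `comments` is an iterable of (author, subreddit) pairs.
    """
    authors_by_sub = {}
    for author, subreddit in comments:
        authors_by_sub.setdefault(subreddit, set()).add(author)

    overlap = {}
    for sub_a, sub_b in combinations(sorted(authors_by_sub), 2):
        shared = authors_by_sub[sub_a] & authors_by_sub[sub_b]
        combined = authors_by_sub[sub_a] | authors_by_sub[sub_b]
        overlap[(sub_a, sub_b)] = len(shared) / len(combined)
    return overlap


# Invented sample data: (author, subreddit) pairs.
sample_comments = [
    ("ann", "soccer"), ("bob", "soccer"), ("cat", "soccer"),
    ("ann", "reddevils"), ("bob", "reddevils"), ("dan", "reddevils"),
    ("ann", "nba"), ("eve", "nba"),
]

overlap = subreddit_overlap(sample_comments)
for pair, score in sorted(overlap.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Pairs with a high score share a large part of their audience, which is the kind of signal a marketer could use to decide where a brand conversation is likely to spread.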

The value of looking elsewhere

Very quickly, you can see how marketers can glean valuable data from Reddit – i.e. where there are influencers on subreddits, where there are particularly engaged communities, where brands and products appear on subreddits, and what kind of audience a brand is getting in front of.

Let’s return to Crimson, where we have conducted a search around the footballer Marcus Rashford. All the available sources we can get data from are shown below – notice this extends far beyond Facebook and Twitter.

Source: Crimson Hexagon ForSight Platform (2017)

Approach with relevance

Naturally, because of its partnership with Twitter and access to the firehose, Crimson still pulls in a lot of Twitter data. However, we see other sources emerging too. From here we can start to build up a more conclusive picture, as each source helps to lessen the biases of the others.


You can see the types of comments and articles we can draw from. With too many sources, the results can start to fill up with spam, although Crimson lets you filter this out. Even so, including every content source in every search is not always practical.


For instance, longer-form content, like a news or opinion article from The Economist or The Wall Street Journal, may only briefly mention a brand or a person, and in a less relevant context.

If we were searching for “Harvard” to analyse how it is perceived online, scraping for mentions in news articles would find many instances of y politician or z representative “who went to Harvard”.

None of that would be as relevant to understanding how Harvard is perceived as a brand as, say, a Facebook comment from a student voicing their opinion. However, if you had a specific research piece looking at how the media portrays universities, including news sources would be a top priority.
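As a rough illustration of this kind of relevance filtering, the Python sketch below drops mentions where “Harvard” appears only as a biographical detail. The phrase list is an assumption and far from complete, the sample mentions are invented, and `is_brand_relevant` is a hypothetical helper, not a feature of any listening tool.

```python
import re

# Assumed, incomplete list of phrasings that signal a biographical aside.
BIOGRAPHICAL = re.compile(
    r"\b(?:went to|graduated from|studied at|educated at)\s+Harvard\b",
    re.IGNORECASE,
)


def is_brand_relevant(text):
    """Keep a mention only if it names Harvard outside a biographical aside."""
    return "harvard" in text.lower() and not BIOGRAPHICAL.search(text)


# Invented sample mentions for illustration.
mentions = [
    "Senator Smith, who went to Harvard, voted against the bill.",
    "Harvard's new admissions policy is a joke, honestly.",
    "She studied at Harvard before joining the firm.",
    "The weather is nice today.",
]

kept = [text for text in mentions if is_brand_relevant(text)]
for text in kept:
    print(text)
```

A filter this blunt will also throw away posts that contain both a biographical aside and a genuine opinion, so in practice it would need to be checked against a hand-labelled sample before being trusted.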

The key to working with social data is not to create more noise than necessary, or you will miss the signals. Ultimately, which sources you should and shouldn’t include is a qualitative decision based on technical ability, accuracy and usefulness.

Want to find out more? Check out our interview with Rob on the future of logo recognition. Alternatively, like us on Facebook for more key insights.

