Social media and the web at large are becoming ever more visual worlds. Rather than share the world around us through words, we allow others to see it through our own eyes in the form of images, audio clips and video. This visually rich world has opened unprecedented opportunities to experience life across the world, watching major events as they happen through the perspective of those involved in or witnessing them. At the same time, as the web evolves from its textual roots towards a visual-first world, it is becoming increasingly inaccessible to the deep learning and data mining algorithms we rely upon to make sense of its deluge.
Even the most powerful computer vision algorithms today are able to make only the most basic sense of images. The most common production deep learning capability is to tag photographs with predefined libraries of topical keywords. Images are essentially tested against this library of topics and each relevant keyword is assigned. At the most basic level, they can typically tell you an image contains a crowd, perhaps labeling it a “protest” and maybe even recording that there is a police car and a fire in the background. That’s amazing compared to the limits of even just a few years ago. Yet, it still tells us precious little about the image. It is good enough to flag that there is a dramatic surge in imagery depicting protests coming out of a certain city over the last few minutes, suggesting something big is happening. That alone is tremendously powerful for some applications. At the same time, it tells us nothing more than that. We can’t tell if the police are dispersing the protesters, whether it is a peaceful protest with the “fire” being an orange flag waving in the breeze or if it is a violent riot clashing with police and setting entire buildings on fire.
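To make the tagging step concrete, here is a minimal sketch of testing an image against a predefined keyword library. It assumes some vision model has already produced a confidence score per label; the vocabulary, the scores, the 0.5 threshold and the function name `assign_tags` are all hypothetical and chosen only for illustration.

```python
# Hypothetical label library; production systems use far larger vocabularies.
VOCABULARY = ["crowd", "protest", "police car", "fire", "beach"]


def assign_tags(scores, vocabulary=VOCABULARY, threshold=0.5):
    """Return every library keyword whose score clears the threshold."""
    return sorted(
        label for label in vocabulary if scores.get(label, 0.0) >= threshold
    )


# Made-up scores for two very different scenes.
# Scene A: a peaceful march where the "fire" is an orange flag in the breeze.
peaceful_march = {"crowd": 0.97, "protest": 0.88, "police car": 0.71, "fire": 0.62}
# Scene B: a violent riot with buildings actually burning.
violent_riot = {"crowd": 0.95, "protest": 0.90, "police car": 0.83, "fire": 0.91}

tags_a = assign_tags(peaceful_march)
tags_b = assign_tags(violent_riot)

print(tags_a)
print(tags_b)
print("Indistinguishable by tags alone:", tags_a == tags_b)
```

Under these assumed scores, both scenes receive the same four tags, which is exactly the limitation described above: the tags can signal a surge in protest imagery, but they cannot say what kind of protest it is.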
The most powerful production tools have larger vocabularies that can add context with a richer library of topical tags, but at the end of the day they are still merely assigning metadata tags to images. In many ways we’re back to the days of human catalogers appending a few basic topic tags to each image in the photo morgue.
Many tools recognize logos, which is useful for corporate brand monitoring. Others can recognize locations, faces and text.
Research grade and some production tools can craft intriguing, but still extraordinarily basic, captions that lend an air of fluency to an otherwise mundane list of metadata tags.
At the end of the day, however, even the best of today’s tools are merely toddlers when it comes to offering useful insights from the firehose of imagery that is increasingly taking over the social sphere.
The extraordinary computational cost of even basic image recognition algorithms means that few social media analytics companies run them at scale across the totality of social media imagery. Some will run basic analyses, such as logo recognition or a small set of predefined topics, but few companies run recognition models with tens or hundreds of thousands of labels across the entirety of imagery in the Twitter firehose, for instance.
In fact, the overwhelming majority of production social media analytics research relies primarily on hashtags and associated caption text, rather than the actual contents of images themselves.
In contrast, textual analytics form the basis of most of the real insights we draw from social media today. From simplistic bag of words counting algorithms through advanced deep learning approaches, it is text that forms the lens through which we see social media. We count hashtags, compute word and phrase histograms, measure textual sentiment, flag brand mentions, compute follower and retweet graphs, measure meme velocity and so on. All of these are based on text or structural characteristics.
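As a rough illustration of how simple these text-based measures are, the sketch below counts hashtags, builds a word histogram and counts brand mentions using only the Python standard library. The sample posts and the brand name "Acme" are invented for the example, and the function names are hypothetical rather than any vendor's API.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")
WORD_RE = re.compile(r"[a-z']+")


def hashtag_counts(posts):
    """Count hashtags across posts, ignoring case."""
    counts = Counter()
    for text in posts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts


def word_histogram(posts):
    """Bag-of-words counts over caption text, with hashtags removed."""
    counts = Counter()
    for text in posts:
        without_tags = HASHTAG_RE.sub(" ", text.lower())
        counts.update(WORD_RE.findall(without_tags))
    return counts


def brand_mentions(posts, brand):
    """Number of posts whose text or hashtags mention the brand."""
    pattern = re.compile(r"\b" + re.escape(brand) + r"\b", re.IGNORECASE)
    return sum(1 for text in posts if pattern.search(text))


# Invented sample data; the empty string stands in for an uncaptioned photo.
SAMPLE_POSTS = [
    "Loving the new Acme phone! #Acme #tech",
    "Huge crowd downtown tonight #protest",
    "",
    "acme support was slow today #acme",
]

print(hashtag_counts(SAMPLE_POSTS).most_common(3))
print(word_histogram(SAMPLE_POSTS).most_common(3))
print(brand_mentions(SAMPLE_POSTS, "acme"))
```

Note that the uncaptioned post contributes nothing to any of these measures, no matter what its image shows.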
Even Twitter’s own frontpage trends list is based on text, not visual analysis. A trending image is surfaced only insofar as it has a single distinct hashtag used to describe it, rather than a wide array of hashtags capturing the differing interpretations, contexts and languages of those sharing it.
Take a look at the features list of most major social media analytics firms and you will see that it is nearly all textual or structural. The rare firm offering image analysis typically offers only a few basic lenses through which to analyze that content, such as logos and a small number of topics.
Any social analytics firm can tell you how many tweets per day over the past month have mentioned your company in the tweet text or hashtag. Nearly all of them will offer you a timeline of the average “tone” of those tweets by day as well, along with who the most positive and most negative tweeters about your brand are. A few can even tell you how many tweeted images contained your brand’s logo somewhere in the image.
Yet, very few can take all those images with your brand’s logo and tell you whether they portrayed your brand in a positive or negative light. Some can report that all of the human faces in the image were smiling and that it did not depict violence or large numbers of police, suggesting that perhaps it was a positive image, but that’s about it.
Of course, visual context is extraordinarily hard to assess. Images don’t capture reality, they construct it. Myriad factors from framing to lighting can have an immense impact on the emotional tenor of an image, not to mention the subject it depicts.
There is also the non-visual context that cannot be inferred directly from the image itself. An image of a smiling person walking out of a building clutching a designer briefcase might at first seem like a positive image for the briefcase’s designer. However, what if the person in question is an alleged mass rapist and serial murderer who is walking jubilantly out of court after having their case overturned on a technicality? Suddenly the brand association might not be so beneficial.
Conversely, an image of the US president, looking exhausted after signing a peace deal ending a genocide, walking out of a garbage-filled loading dock to the waiting motorcade, surrounded by police officers and holding that same briefcase, might be a fantastic endorsement, despite the garbage and heavy police presence.
Much as humans require additional non-visual context to assess an image’s “tone,” so too do machines, though current deep learning approaches are largely unable to incorporate such external world knowledge into their assessments.
The end result is that computer vision analysis of social media imagery does not today form a meaningful percentage of the total analytics that are generated from social media each day. Even those companies that heavily tout deep learning image analytic capabilities still focus the majority of their capabilities and output on textual analytics.
Why does this matter? It matters because as social media is becoming increasingly visual, we increasingly express ourselves exclusively through visual forms. We share an uncaptioned photograph of a clear blue sky to express a beautiful day, we share a photograph of our beautiful meal, we livestream ourselves at a concert, we share a video of our child saying their first words. Each of these moments is shared as a self-explanatory self-contained visual object, without text or hashtags to make it accessible to our machine algorithms. For the majority of social analytics firms, this visual content is entirely inaccessible. Yet, even for those companies offering basic visual analytic capabilities, their ability to discern meaning from visual content is merely a microscopic fraction of what can be done with text.
In short, as social media becomes more visual, it becomes less accessible to our data mining algorithms. In turn, as social media is less and less data minable, we are less and less able to understand it. Given that visual expression skews towards the younger and influencer demographics of greatest interest to many brands, this transition is especially damaging to their ability to extract useful insights from social media.
Most social platforms offer some form of accessibility features, such as ALT text for imagery or captioning for video. However, few users bother to use these tools to make their content accessible. If more users cared about making their posts accessible to all, such as including rich descriptive ALT text for all of their images, the visual transition would also be less disruptive to social analytics.
Putting this all together, as the web and especially social media become ever more visual, that content is increasingly inaccessible to the text-first landscape that has come to dominate modern social analytics. While a small number of companies offer rudimentary image analysis, they are few and far between and their capabilities pale in comparison to those offered for text. At the same time, few consumers of social analytics seem to fully grasp this transition and the rapidly growing hole in the social insights they receive.
In the end, as social platforms rush towards a visual-first world, the vast landscape of social analytics is getting less and less representative of what we’re really talking about.