Text Analytics works on sampled data: it reads unstructured text and identifies the grammatical elements and categories found within social mentions.


These grammatical elements enhance our ability to structure and understand words.

Our algorithms analyze mentions as they are crawled in order to detect potential entities and the part-of-speech composition.

That’s the magic of machine learning: it learns from experience what an entity is and which category it belongs to. That is also why mistakes can happen.


We used public and in-house annotated datasets to achieve this level of performance.

At Synthesio, we’ve analyzed what insights the grammatical composition could provide for our customers, and we’ve grouped them into the following categories:

  • Single words (also called actions & descriptions) that could bring value: which verbs, adjectives and nouns are people using online?

  • Expressions (or chunks), which better show what people actually say and how they say it.


*The number of mentions shown for Text Analytics is an estimated volume generated from sampled data.



If we compare Text Analytics and Keywords in dashboards, we will often find that Text Analytics words and expressions have a lower volume than Keywords. Here we explore why this difference exists.

 

Explanations for the difference

 

1. Different counting rules for Text Analytics

For Keywords, every token from every mention in the dashboard is counted. Text Analytics, by contrast, applies the following rules:

  • Retweets are not processed: since retweets are redundant information and represent a high volume in Blackhole, we do not process Text Analytics and Entities for retweets 

  • Quoted retweets are not processed entirely: only the non-quoted part of the retweet is processed by Text Analytics and Entities

  • Titles of Comments are not processed: for comments, the title corresponds to the original post, so we don’t process it to avoid counting the original mention multiple times
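
The rules above can be sketched in a few lines of Python. This is only an illustration, not the actual pipeline: the `Mention` class, its field names and the `kind` labels are hypothetical and were chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mention:
    # All field names and "kind" labels are hypothetical, for illustration only.
    kind: str              # "post", "comment", "retweet" or "quoted_retweet"
    text: str              # text written by the author of this mention
    title: str = ""        # for comments, the title is the original post
    quoted_text: str = ""  # quoted part of a quoted retweet


def keyword_text(m: Mention) -> str:
    """Keywords: every part of every mention is counted."""
    return " ".join(part for part in (m.title, m.text, m.quoted_text) if part)


def text_analytics_text(m: Mention) -> Optional[str]:
    """Text Analytics: return the text to process, or None if skipped."""
    if m.kind == "retweet":
        return None       # retweets are not processed at all
    if m.kind == "quoted_retweet":
        return m.text     # only the non-quoted part is processed
    if m.kind == "comment":
        return m.text     # the title (original post) is ignored
    return " ".join(part for part in (m.title, m.text) if part)


comment = Mention(kind="comment", text="agreed", title="Launch day")
print(keyword_text(comment))         # title and text
print(text_analytics_text(comment))  # text only
```

Because each rule removes text before counting, the same set of mentions yields fewer tokens for Text Analytics than for Keywords.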

 

2. Text Analytics words are lemmatized

Words from Text Analytics are lemmatized, meaning that we keep only the canonical form of the word (infinitive for a verb, singular for a noun, etc.): for example, if the word “games” appears in a mention, we will count it as “game”. This is not the case for Keywords, where we keep each word as it appears in the mention.
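
As a rough illustration of how lemmatization changes the counts, the sketch below compares the two counting styles on a few sample sentences. The `LEMMAS` lookup table and the whitespace tokenizer are toy assumptions made for this example; a real lemmatizer relies on a full dictionary and part-of-speech information.

```python
from collections import Counter

# Toy lemma lookup, for illustration only (hypothetical, not the real model).
LEMMAS = {"games": "game", "played": "play", "playing": "play"}


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def keyword_counts(mentions):
    """Keywords: count each word exactly as it appears."""
    return Counter(tok for m in mentions for tok in tokenize(m))


def text_analytics_counts(mentions):
    """Text Analytics: count the canonical form (lemma) of each word."""
    return Counter(LEMMAS.get(tok, tok) for m in mentions for tok in tokenize(m))


mentions = ["Games are fun", "I played one game", "playing games"]

print(keyword_counts(mentions)["games"])         # surface form "games"
print(text_analytics_counts(mentions)["game"])   # "game" and "games" merged
print(text_analytics_counts(mentions)["games"])  # surface form no longer counted
```

In this sketch, “games” exists as its own entry on the Keywords side but never on the Text Analytics side, where it is folded into “game”. Comparing the two widgets word for word can therefore show different volumes.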