Background: Social media platforms such as Twitter are rapidly becoming key resources for public health surveillance applications, yet little is known about Twitter users' levels of informedness and sentiment toward tobacco, especially with regard to the emerging tobacco control challenges posed by hookah and electronic cigarettes.
Objective: To develop a content and sentiment analysis of tobacco-related Twitter posts and build machine learning classifiers to detect tobacco-relevant posts and sentiment towards tobacco, with a particular focus on new and emerging products like hookah and electronic cigarettes.
Methods: We collected 7362 tobacco-related Twitter posts at 15-day intervals from December 2011 to July 2012. Each tweet was manually classified using a triaxial scheme, capturing genre, theme, and sentiment. Using the collected data, machine-learning classifiers were trained to detect tobacco-related vs irrelevant tweets as well as positive vs negative sentiment, using Naïve Bayes, k-nearest neighbors, and Support Vector Machine (SVM) algorithms. Finally, phi contingency coefficients were computed between each pair of categories to discover emergent patterns.
Results: The most prevalent genres were first- and second-hand experience and opinion, and the most frequent themes were hookah, cessation, and pleasure. Sentiment toward tobacco was overall more positive (1939/4215, 46% of tweets) than negative (1349/4215, 32%) or neutral among tweets mentioning it, even excluding the 9% of tweets categorized as marketing. Three separate metrics converged to support an emergent distinction between, on one hand, hookah and electronic cigarettes corresponding to positive sentiment, and on the other hand, traditional tobacco products and more general references corresponding to negative sentiment. These metrics included correlations between categories in the annotation scheme (phi(hookah-positive)=0.39; phi(e-cigs-positive)=0.19); correlations between search keywords and sentiment (χ²(4)=414.50, P<.001, Cramer's V=0.36); and the most discriminating unigram features for positive and negative sentiment ranked by log odds ratio in the machine learning component of the study. In the automated classification tasks, SVMs using a relatively small number of unigram features (
Conclusions: Novel insights available through Twitter for tobacco surveillance are attested through the high prevalence of positive sentiment. This positive sentiment is correlated in complex ways with social image, personal experience, and recently popular products such as hookah and electronic cigarettes. Several apparent perceptual disconnects between these products and their health effects suggest opportunities for tobacco control education. Finally, machine classification of tobacco-related posts shows a promising edge over strictly keyword-based approaches, yielding an improved signal-to-noise ratio in Twitter data and paving the way for automated tobacco surveillance applications.
Keywords: social media; twitter messaging; smoking; natural language processing
Social media platforms such as Twitter are rapidly becoming key resources for public health surveillance applications. Vast amounts of freely available, user-generated online content, in addition to allowing for efficient and potentially automated, real-time monitoring of public sentiment and informedness, allow for bottom-up discovery of emergent patterns that may not be readily detectable using traditional surveillance methodologies such as pre-formulated surveys. In this study, we demonstrate the feasibility of a Twitter-based "infoveillance" [
Twitter offers a number of key benefits as a data source for public health surveillance. First, the dataset is large and readily accessible. In 2012, 340 million tweets were being posted daily [
Recent applications have sought to harness the unique public health surveillance opportunities offered by Twitter. A number of studies have tracked public sentiment and informedness during natural disasters, such as the 2011 Tohoku earthquake [
"Infoveillance" is defined by Eysenbach as "the science of distribution and determinants of information in an electronic medium, specifically the Internet, or in a population, with the ultimate aim to inform public health and public policy" [
Preliminary tobacco research using Twitter data has addressed several specific domains. Freeman [
Our first objective was to provide a content analysis of tobacco-related tweets. Work reported by Prier et al [
Our second objective was to improve the signal-to-noise ratio in Twitter data by automatically filtering out irrelevant content. Strictly keyword-based approaches are susceptible to lexical ambiguity in natural language: the keyword and wildcard combination smok*, for example, matches not only tobacco-related tweets but also tweets referring to smoked cheese. In order to reduce the presence of this type of noise, we trained machine classifiers to distinguish between tobacco-related and unrelated tweets.
The third distinctive objective of our work was to demonstrate the utility of Twitter in addressing new public health challenges related to tobacco usage. Two such issues are the growing popularity of hookah and e-cigarettes. As we discuss below, Twitter surveillance is particularly suited to understanding these new challenges.
A hookah (also known as shisha or narghile) is a waterpipe used to smoke flavored tobacco. Hookah is smoked by an estimated 100 million people daily [
E-cigarettes (or e-cigs) are another recently popular tobacco product subject to only sparse regulation and research. An e-cigarette is an electronic inhaler that produces vapor to simulate cigarette smoking and that may or may not contain nicotine. While e-cigarettes have surged in popularity as cessation devices, no consensus exists among public health researchers regarding their health effects, and they are not endorsed by either the USFDA or the CDC [
Regulation of e-cigarettes is sparse and variable by jurisdiction: no warning labels are required, and the product is easily available online [
Using the Twitter Application Programming Interface (API), we collected a sample of tweets between November 2011 and July 2012 that represented 1% of the entire Twitter feed. This 1% sample consisted of an average 1.3 million tweets per day. In order to extract tobacco-related tweets from this dataset, we constructed a list of keywords relevant to general tobacco usage as well as hookah and e-cigarettes. Our initial list consisted of 30 such terms culled from online slang dictionaries, but we pruned this list to the 11 terms that were attested more than once per day in our Twitter sample (see below). These were cig*, nicotine, smok*, tobacco; hookah, shisha, waterpipe; e-juice, e-liquid, vape, and vaping (where * is a wildcard such that cig* matches tweets containing cigar, e-cig, and so on).
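The keyword filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes case-insensitive matching on word boundaries, and the helper names (`compile_keyword`, `matched_keywords`) and the example strings are hypothetical.

```python
import re

# Keyword list taken from the text; "*" is a trailing wildcard.
KEYWORDS = ["cig*", "nicotine", "smok*", "tobacco",
            "hookah", "shisha", "waterpipe",
            "e-juice", "e-liquid", "vape", "vaping"]


def compile_keyword(keyword):
    """Turn a keyword with an optional trailing '*' into a word-boundary regex.

    Assumption: matching is case-insensitive and anchored on word boundaries,
    so "cig*" matches "cigar" and also the "cig" in "e-cig".
    """
    if keyword.endswith("*"):
        body = re.escape(keyword[:-1]) + r"\w*"
    else:
        body = re.escape(keyword)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


PATTERNS = {kw: compile_keyword(kw) for kw in KEYWORDS}


def matched_keywords(tweet):
    """Return the keywords (in list order) whose pattern occurs in the tweet."""
    return [kw for kw, pattern in PATTERNS.items() if pattern.search(tweet)]


# Made-up example strings, for illustration only.
print(matched_keywords("Trying my new e-cig tonight"))     # ['cig*']
print(matched_keywords("smoked cheese on toast"))          # ['smok*']
print(matched_keywords("Hookah and shisha with friends"))  # ['hookah', 'shisha']
```

The second example shows why a keyword match alone is not evidence of tobacco relevance: a filter of this kind cannot tell smoked cheese from smoking.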
Our initial dataset included all tweets containing these keywords at 15-day intervals from December 5, 2011, to July 17, 2012, inclusive. Because a 15-day step advances the day of the week by one each time, the 16 sampled dates cover every day of the week two or three times. We thus limited potential bias based on day of the week, as has been observed for alcohol-related tweets, which spike in positive sentiment on Fridays and Saturdays [
One of our keywords, smok*, was dramatically more frequent and ambiguous than any of the others, matching far more tobacco-irrelevant tweets (for example, tweets referring to smoked cheese). In a preliminary sample of 500 smok* tweets, only 16.8% were relevant to tobacco according to manual classification. Furthermore, over 100,000 smok* tweets were included in our 16-day dataset, making hand classification impractical. We thus included smok* tweets only for days where there were fewer than 400 total tweets matched by all other keywords, so that each day's total tweet count was at least 400, ensuring a balance such that no individual date was underrepresented. Following this procedure, 0.04% of all smok* tweets were included in the dataset. The resulting final dataset thus contained 7362 tweets, with a mean of 460 tweets per day (SD 35).
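The 16 sampling days referred to above follow from the 15-day schedule between December 5, 2011, and July 17, 2012. The short sketch below regenerates those dates and tallies them by day of the week; the function and variable names are illustrative only.

```python
from collections import Counter
from datetime import date, timedelta

START = date(2011, 12, 5)   # first sampling date stated in the text
END = date(2012, 7, 17)     # last sampling date stated in the text
STEP = timedelta(days=15)   # sampling interval stated in the text

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def sampling_dates(start, end, step):
    """List every date from start to end (inclusive) at the given step."""
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += step
    return dates


DATES = sampling_dates(START, END, STEP)
WEEKDAY_COUNTS = Counter(WEEKDAY_NAMES[d.weekday()] for d in DATES)

print(len(DATES))              # 16
print(dict(WEEKDAY_COUNTS))    # Monday and Tuesday: 3 each; other days: 2 each
print(round(7362 / len(DATES)))  # 460, the mean tweets per day reported above
```

Since 15 leaves a remainder of 1 when divided by 7, each sampling date falls one weekday later than the previous one, which is what spreads the sample across the week.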
We developed a triaxial classification scheme to capture each tweet's genre, theme, and sentiment. The former two axes are similar in scope to the content and qualifier categories developed in Chew & Eysenbach [
The set of 7362 tweets was then manually classified according to the final version of this scheme by the 2 annotators. Tweets were assigned multiple categories within a single axis if applicable, and duplicate or re-tweeted posts were included only once to prevent spam or overly popular posts from biasing the sample. Non-English, unintelligible, or tobacco-irrelevant tweets were coded as belonging to none of the categories in the classification scheme.
In order to discover emergent trends in tobacco-related Twitter content, we computed correlations for each pairwise combination of the 30 categories within the entire coding scheme. In other words, given two categories such as hookah and positive sentiment, we compared the number of tweets manually classified under both categories to the number expected by chance to be classified under both categories. The contingency coefficient phi (which is equivalent to Cramer's V in the current 2×2 case) equals the square root of χ²/n, where χ² is the chi-square statistic for the 2×2 contingency table, and n is the total number of observations. The phi coefficient ranges from 0 to 1, with 0 indicating no correlation between the two categories and 1 indicating perfect correlation.
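As a worked illustration of the definition above, the following sketch computes phi for a pair of binary categories from four counts. The function name, its arguments, and the example counts are hypothetical; only the formula (the square root of the chi-square statistic divided by n) comes from the text.

```python
import math


def phi_coefficient(n_both, n_a, n_b, n_total):
    """Phi for two binary categories A and B, computed as sqrt(chi2 / n).

    n_both:  tweets coded under both A and B
    n_a:     tweets coded under A
    n_b:     tweets coded under B
    n_total: all tweets
    Assumes every row and column total is nonzero.
    """
    # Observed 2x2 table: rows = A / not A, columns = B / not B.
    observed = [
        [n_both, n_a - n_both],
        [n_b - n_both, n_total - n_a - n_b + n_both],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected by chance if A and B were independent.
            expected = row_totals[i] * col_totals[j] / n_total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return math.sqrt(chi2 / n_total)


# Hypothetical counts, for illustration only.
print(round(phi_coefficient(30, 40, 50, 100), 3))  # 0.408
print(phi_coefficient(25, 50, 50, 100))            # 0.0 (exactly as expected by chance)
print(phi_coefficient(50, 50, 50, 100))            # 1.0 (categories always co-occur)
```

Because phi in this form is unsigned, the direction of an association has to be read from whether the observed co-occurrence count is above or below the expected one.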
We compared the performance of several machine learning algorithms on three classification tasks on the corpus of manually annotated tweets: relevance to tobacco, positive sentiment, and negative sentiment. Relevance to tobacco was operationalized as whether the tweet was classified under any of the categories in the scheme. Our goal was to test the feasibility of creating a natural language processing machine learning classifier with which we could automatically identify tobacco-related tweets in real-time.
We varied three parameters for each task: the machine learning algorithm, the order of n-gram used as features, and the number of features used. Algorithms used were Naïve Bayes, k-nearest-neighbors (KNN), and Support Vector Machines (SVM) (see Figure 3 for a brief description of these algorithms) [
We employed the Rainbow toolkit [
Features used for machine learning were represented as binary presence/absence of words in a tweet rather than the number of times each term occurred in a tweet. Term frequencies are unlikely to be significantly more informative, since words are relatively rarely repeated within tweets (mean type-token ratio 0.96, SD 0.08). Two additional standard feature-processing measures were taken: first, all tweets were passed through the Porter stemmer [
In order to evaluate the machine learning results, five standard classification metrics were computed for each task. Accuracy is simply the percentage of tweets correctly classified by the algorithm. We also computed precision, recall, specificity, and F scores, which are defined in Multimedia Appendix 2.
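For readers without access to the appendix, the sketch below gives the standard textbook definitions of these five metrics in terms of confusion-matrix counts. It assumes the F score is the balanced F1 measure (the harmonic mean of precision and recall); the appendix definitions may differ in detail, and the counts in the example are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fp: true/false positives; tn/fn: true/false negatives.
    Assumption: the F score is the balanced F1 measure.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f_score": 2 * precision * recall / (precision + recall),
    }


def f1(precision, recall):
    """Balanced F score from precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 2) for name, value in metrics.items()})
# {'accuracy': 0.85, 'precision': 0.8, 'recall': 0.89, 'specificity': 0.82, 'f_score': 0.84}

# Precision 0.73 with recall 0.95 gives an F1 of about 0.83.
print(round(f1(0.73, 0.95), 2))  # 0.83
```

Specificity matters here alongside recall because a relevance filter that accepts nearly everything can reach high recall while removing very little noise.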
The corpus of 7362 tweets was annotated by authors MM and MC according to the classification scheme described in the Methods section. Interannotator agreement (kappa) met the standard threshold of 0.7 for each of the three axes of the scheme: genre=0.78, theme=0.70, sentiment=0.77. Of the tweets, 4215 (57.3%) were classified as relevant to tobacco, with the remainder comprising tweets that were not in English or that matched alternate senses of their keyword, such as smoked cheese in the case of smok*.
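The text does not say which kappa statistic was used. As an illustration, the sketch below computes Cohen's kappa for two annotators, under the simplifying assumption that each item receives exactly one label per annotator (the actual scheme allowed multiple categories per axis). The labels shown are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who each assign one label per item.

    Undefined (division by zero) when chance agreement is exactly 1,
    that is, when both annotators use a single identical label throughout.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    return (observed - expected) / (1 - expected)


# Made-up labels, for illustration only.
annotator_1 = ["pos"] * 5 + ["neg"] * 5
annotator_2 = ["pos"] * 4 + ["neg"] * 6
print(round(cohens_kappa(annotator_1, annotator_2), 2))  # 0.8
```

In this toy case the annotators agree on 9 of 10 items, but half of that agreement would be expected by chance, which is why kappa (0.8) is lower than raw agreement (0.9).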
Among the tobacco-related tweets (ie, 4215 out of a total of 7362), the most prevalent genre was first-hand experience, matching 40% of tweets, followed by second-hand experience (14%), and opinion (9%) (recall that tweets may be assigned multiple categories). The top themes were hookah (20%), cessation (14%), and pleasure (11%). Finally, sentiment toward tobacco was overall more positive (46% of tweets) than negative (32%) or neutral, even excluding the 9% of tweets categorized as marketing, which resulted in a 41%/30% positive/negative ratio.
Search keywords associated with each tweet correlated significantly with more general properties, such as sentiment. Examining the five most frequent keywords (representing 96% of tweets), Figure 6 illustrates the tendency for tweets matching the keywords hookah, shisha, and vape/vaping to be classified as showing positive sentiment more often than expected by chance, and for those matching tobacco to show negative sentiment disproportionately often (note that the low-frequency keywords nicotine, waterpipe, e-juice, and e-liquid were excluded). The correlation is highly significant according to a two-tailed chi-square test for independence (χ²(4)=414.50, P<.001, Cramer's V=0.36). In this way, a general split in sentiment is observed between, on one hand, the new public health challenges represented by hookah and e-cigarettes, which are viewed more positively, and on the other hand, traditional products such as cigarettes as well as more general references to tobacco, which are viewed more negatively. In other words, smoking hookah is viewed more favorably than smoking traditional tobacco products, even though smoking hookah typically involves smoking tobacco.
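The statistics reported above can be illustrated with a small sketch that computes the chi-square statistic, its degrees of freedom, and Cramer's V for a table of counts. The underlying keyword-by-sentiment table is not reproduced in the text, so the counts below are invented; a table of five keywords by two sentiment classes is simply one layout that yields 4 degrees of freedom. The P value is omitted because it would need a chi-square distribution function from outside the standard library.

```python
import math


def chi_square_independence(table):
    """Return (chi2, degrees_of_freedom, cramers_v) for an r x c table of counts.

    Assumes every row and column total is nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    rows, cols = len(table), len(col_totals)
    dof = (rows - 1) * (cols - 1)
    cramers_v = math.sqrt(chi2 / (n * min(rows - 1, cols - 1)))
    return chi2, dof, cramers_v


# Invented counts (rows: five keywords; columns: positive, negative).
hypothetical_table = [
    [300, 100],
    [120, 60],
    [90, 40],
    [200, 350],
    [150, 160],
]
chi2, dof, v = chi_square_independence(hypothetical_table)
print(dof)  # 4
print(round(chi2, 2), round(v, 2))
```

Cramer's V rescales the chi-square statistic by sample size and table shape, which makes effect sizes comparable across tables where the raw statistic is not.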
Correlations between all pairwise combinations of categories in the classification scheme, computed as described in the Methods section, are reported in Figure 1. The highest intercategory correlations were observed between (
The three classification tasks investigated here are (
The most informative unigram features for each of the three classification tasks, ranked by log odds ratio, are listed in Table 2. Among the most informative features distinguishing tobacco-related from unrelated tweets are relatively predictable, unambiguous words such as cigarette, hookah, and tobacco. Several other emergent classes of words are apparent: marketing-related words including buy and http (typically part of sales website URLs); words semantically or pragmatically associated with tobacco usage such as smell and bar; and conversational words such as I'm, don't, and lol that are suggestive of personal expression rather than, for example, news or marketing.
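A ranking of this kind can be sketched as follows. The exact log odds ratio formula used by the toolkit is not given in the text, so this version is an assumption: it scores each word by the log odds of its presence in one class minus the log odds of its presence in the other, with add-0.5 smoothing to avoid division by zero. The tokenized example documents are made up.

```python
import math


def log_odds_ratios(docs_in_class, docs_out_of_class):
    """Map each word to the log odds ratio of its presence in vs out of a class.

    Each document is an iterable of tokens, and presence is binary.
    Assumption: add-0.5 smoothing, which keeps every probability strictly
    between 0 and 1.
    """
    in_sets = [set(doc) for doc in docs_in_class]
    out_sets = [set(doc) for doc in docs_out_of_class]
    vocabulary = set().union(*in_sets, *out_sets)

    scores = {}
    for word in vocabulary:
        p = (sum(word in doc for doc in in_sets) + 0.5) / (len(in_sets) + 1)
        q = (sum(word in doc for doc in out_sets) + 0.5) / (len(out_sets) + 1)
        scores[word] = math.log(p / (1 - p)) - math.log(q / (1 - q))
    return scores


# Made-up tokenized tweets, for illustration only.
positive = [["love", "hookah"], ["hookah", "tonight"]]
negative = [["hate", "smoke"], ["smoke", "smell"]]

scores = log_odds_ratios(positive, negative)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0], round(scores[ranked[0]], 2))    # hookah 3.22
print(ranked[-1], round(scores[ranked[-1]], 2))  # smoke -3.22
```

Words with large positive scores are the most characteristic of the class, and words with large negative scores are the most characteristic of its complement.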
Turning to the most informative features for positive and negative sentiment, several Twitter-specific expressions appear. The tokens gt and lt correspond to the greater-than symbol and the less-than symbol, which are, respectively, explicit tokens of positive and negative sentiment. The token smh, an acronym for shaking my head, is a general token of disapproval and is among the most informative features for negative sentiment toward tobacco.
A key point of contrast between highly informative positive words and highly informative negative words is evident in the kind of tobacco product to which they refer. Words related to hookah and e-cigarettes are highly predictive of positive sentiment (respectively, hookah, hose, shisha; electronic), whereas cigarettes and more general terms such as smoke and tobacco are predictive of negative sentiment. Discussion of this distinction, as well as its relation to the similar result in the interaction of search keywords and sentiment, is taken up in the next section.
The remaining positive and negative unigrams reveal informative semantic groupings. Words related to recreation and social interaction generally predict positive sentiment toward tobacco, and include bar, tonight, and night. Marketing-related words, such as buy, free, coupon, checkout, code, and win, are also prevalent in the positive category. Groupings in the negative category include words related to disgust and social image, such as nasty, unattractive, people, and girls, where these last two terms most often occurred in tweets disapproving of particular social groups' use of tobacco. Finally, words predictive of negative sentiment toward tobacco were also related to health, information, and cessation: health, kill, study, finds, quit.
Algorithm         Naïve Bayes                     KNN                             SVM
Features          Acc   F     Pre   Rec   Spe     Acc   F     Pre   Rec   Spe     Acc   F     Pre   Rec   Spe
Relevance
  Unigrams        0.77  0.83  0.73  0.95  0.53    0.73  0.78  0.73  0.83  0.59    0.82  0.85  0.82  0.88  0.75
  Bigrams         0.66  0.77  0.63  0.97  0.24    0.65  0.76  0.63  0.97  0.24    0.73  0.75  0.82  0.69  0.79
  Trigrams        0.61  0.74  0.60  0.99  0.10    0.60  0.74  0.59  0.97  0.11    0.61  0.74  0.59  0.99  0.10
Positive sentiment
  Unigrams        0.76  0.50  0.56  0.45  0.87    0.76  0.37  0.58  0.27  0.93    0.75  0.38  0.53  0.30  0.91
  Bigrams         0.77  0.44  0.62  0.34  0.93    0.76  0.42  0.58  0.33  0.92    0.77  0.43  0.61  0.33  0.92
  Trigrams        0.76  0.26  0.62  0.16  0.96    0.76  0.26  0.62  0.17  0.96    0.76  0.27  0.61  0.17  0.96
Negative sentiment
  Unigrams        0.84  0.52  0.57  0.48  0.92    0.72  0.30  0.27  0.33  0.80    0.83  0.39  0.53  0.30  0.94
  Bigrams         0.85  0.35  0.73  0.23  0.98    0.31  0.30  0.18  0.82  0.20    0.84  0.44  0.59  0.35  0.95
  Trigrams        0.84  0.24  0.76  0.14  0.99    0.22  0.30  0.18  0.94  0.07    0.84  0.37  0.66  0.25  0.97
Acc: accuracy; F: F score; Pre: precision; Rec: recall; Spe: specificity.
Table 2. Most informative unigram features for each of the three classification tasks, ranked by log odds ratio.
Tobacco-related   Positive sentiment   Negative sentiment
cigarette         hookah               lt
hookah            cigar                cigarettes
lt                bar                  smell
smoking           tonight              hate
tobacco           gt                   smoke
cigs              electronic           people
electronic        night                disgusting
http              good                 tobacco
smell             code                 finds
cigar             checkout             study
im                love                 girls
bar               lol                  alcohol
hate              free                 nasty
day               ecigarette           unattractive
dont              buy                  smh
gt                hose                 smells
buy               win                  kill
lol               coupon               health
people            flavored             mouth
good              shisha               quit
The Twitter surveillance results converge in several key classes of findings, which we discuss in turn in this section. First, the content analysis allows for a general pulse or snapshot to be taken of tobacco-related discussion on Twitter. Second, new insight can be gained into causes for positive and negative sentiment toward tobacco, especially with respect to hookah and e-cigarettes. Finally, several potential opportunities for tobacco education emerge, and we discuss them in the context of future research directions.
The relative prevalence of the various categories in the content analysis reflects a general pulse of tobacco-related discussion on Twitter. By far the most common categories are personal experiences and opinion, affirming the value of Twitter in assessing public sentiment and informedness. The next most common genre, marketing, is followed relatively distantly by information and news, and most tweets in these categories are not posted by recognized health or news organizations. In sum, reliable information is far less accessible on Twitter than are opinions, marketing posts, and information from unverified sources, indicating potential for greater public education in tobacco prevention policies.
Twitter surveillance allows for new insight into the correlates of positive and negative sentiment toward tobacco. Among Twitter users that post about tobacco in our dataset, sentiment is overall more positive than negative, even with marketing posts excluded. The strongest correlate of positive sentiment is first-hand personal experience, while negative sentiment correlates more strongly with opinion. In this regard, Twitter surveillance may reveal insights not available through surveys, where participants do not spontaneously relate experiences to an audience of friends and followers and are instead more likely to express more carefully crafted opinions. Indeed, surveys may thus underestimate the prevalence of positive sentiment toward tobacco.
Among the clearest correlates of positive sentiment are hookah and e-cigarettes. On all measures computed in this study, including (
Social relationships, especially among younger users, emerge as another key component of positive sentiment toward tobacco on Twitter, often in conjunction with products such as hookah. In the following example, tobacco usage is a central component of a positive experience in a social relationship: "Smoking that good hookah with the bro Sultan! #GoodOldDays #brotherforlife". These positive tobacco-centric social experiences also frequently involve young or under-age users: "Beer ponggg / hookah round 2 with my goons waddduppppppp. I love when my parents rnt home!"
In a related vein, these products are also associated with initiation of tobacco usage, as in the following: "an e-cigarette salesman at a mall to Parris and I: 'Do you guys smoke?' 'No.' 'Do you wanna start?'"
In this way, positive sentiment toward tobacco appears to participate in a complex interaction between newer products such as hookah and e-cigarettes, younger users, and positive social experiences.
A social component is also central to negative sentiment toward tobacco. Categories corresponding to disgust and stereotypes were among the most highly correlated with negative sentiment, in fact outranking the explicit health category. A key distinction, however, is that while the category of social image correlated with negative sentiment, social relationships correlated with positive sentiment. Taken together, these findings indicate that social factors are central in driving sentiment toward tobacco and suggest that public health campaigns may do well to make use of this correlation.
Several novel findings, in sum, speak to the unique insights available through Twitter surveillance. All measures converged on an emergent distinction between two recently popular tobacco products, hookah and e-cigarettes, which corresponded to positive sentiment, and other products as well as references to tobacco more generally, which corresponded to negative sentiment. Sentiment toward tobacco overall among Twitter users is more positive than negative, affirming Twitter's value as a resource to understand positive sentiment in developing improved prevention policies. Negative sentiment is equally useful; for example, observed high correlations between negative sentiment and social image, but not health issues, may usefully inform tobacco control strategies. Twitter surveillance further reveals opportunities for education. Positive sentiment toward the term hookah but negative sentiment toward tobacco suggests a disconnect in users' perceptions of the health effects of hookah (ie, hookah is not regarded in the same negative light as traditional tobacco products). Finally, machine classification of tobacco-related posts shows a promising edge over strictly keyword-based approaches, yielding an improved signal-to-noise ratio and paving the way for automated tobacco surveillance applications.
The work reported in this paper does have some limitations. First, we harvested our data from the free 1% Twitter feed, rather than the full Twitter firehose. Second, our annotated dataset was relatively small, and there is some risk of our model overfitting. Third, the number of smoking keywords used to identify tobacco-relevant tweets was quite limited. It would be useful to augment our keyword list with tobacco-related slang (eg, "cancer sticks", "coffin nails") or electronic cigarette brands (eg, "blucigs", "greensmoke"). Fourth, in this work we have concentrated exclusively on analyzing tobacco-related tweets using natural language processing rather than on the social network aspect of Twitter (ie, we did not analyze the characteristics of those tweets most likely to be retweeted). Finally, one key issue that we have not addressed in this work is the role of novelty effects in attitudes towards e-cigarettes (ie, will interest in the products be sustained over time?). In future work we will address all these issues.
Our medium-term goal, building on the work described in this paper, is to create a Web-based social media monitoring system for tobacco-related products and smoking behaviors, integrating natural language processing, geographical information systems, and social network analysis to provide a service that will allow public health workers and other interested parties to monitor and track public attitudes towards a range of both established and emerging tobacco products, and to formulate policy and interventions accordingly.
We would like to thank Ms Madeleine Lee (Department of Family & Preventive Medicine at the University of California, San Diego) for her support and useful suggestions and Drs Son Doan and Sharon Cummins (Department of Medicine at the University of California, San Diego, and the Department of Family & Preventive Medicine at University of California, San Diego, respectively) for offering useful comments on an earlier version of this manuscript.
This work was supported in part by grants from the National Cancer Institute (grant: U01 CA154280) and the NIH Roadmap for Medical Research (grant: U54HL108460). Author MM was partially supported by a Jacob K Javits Graduate Fellowship.
None declared.
Multimedia Appendix 1. Annotation scheme [PDF file, 144KB].
Multimedia Appendix 2. Evaluation metrics [PDF file, 61KB].
API: Application Programming Interface
CDC: Centers for Disease Control and Prevention
KNN: k-nearest neighbors
SVM: support vector machine
USFDA: United States Food and Drug Administration
Figure 1. Correlations between all pairwise combinations of categories; values range from 0-1; correlations greater than 0.3 are underlined.
Figure 2. Example tweets manually classified using annotation scheme (relevant categories are shaded).
Figure 3. Machine learning algorithm description.
Figure 4. N-gram text representation.
Figure 5. Machine learning experiment workflow.
Figure 6. Tweet sentiment by search keyword.
Figure 7. Classification accuracy as a function of number of unigram features for 3 algorithms in the tobacco-relevance task.
By Mark Myslín, Department of Linguistics, University of California, San Diego, La Jolla, CA, United States; Shu-Hong Zhu, PhD, Department of Family and Preventive Medicine, University of California, San Diego, La Jolla, CA, United States; Wendy Chapman, PhD, Department of Medicine, University of California, San Diego, La Jolla, CA, United States and Mike Conway, PhD, 9500 Gilman Drive, La Jolla, CA, 92093, United States, Phone: 1 858 822 4931, Fax: 1 858 822 1934, Email: mconway@ucsd.edu
Edited by G Eysenbach