Understanding and exploiting user-generated (textual) content (UGC) on social media is at the forefront of information management challenges today. The variety of UGC in detailed blog commentaries, collaborative wiki content, online conversations, short micro-blog messages, and similar sources is powering several personalization, monetization, and crowd/business intelligence applications, and is also providing an electronic microscope on social phenomena at an extraordinary scale. Certain characteristics of UGC, however, necessitate key computational linguistic interventions before systems can tap into this data. A large portion of the language found on social media is in the Informal English domain: a blend of abbreviations, slang, and context-dependent terms delivered with an indifferent approach to grammar and spelling. Large-scale informal content analysis issues are not only a challenge for natural language engineers but are also relevant to scientists observing the effects of content semantics and style on the structure and behavior of a social medium. In this talk, I will cover some representative work in understanding three dimensions of user-generated informal content on social media: which named entities and topics users refer to (what), which words or language constructs they use (how), and what intentions lie behind what they write (why). Many of these investigations were conducted in collaboration with researchers at IBM, MSR and UC Berkeley. We will demonstrate how these perspectives, along with other contextual properties of the data (when and where it was generated) and the network it was generated in (type of media, poster characteristics), have been absorbed into two deployed Social Intelligence applications.
We will close with a discussion of the potential of social data for understanding complex phenomena such as online conversations, the diffusion of information, and emergent social order, all of which necessitate a confluence of network and content analyses.


Keynote at the Social Data on the Web Workshop, held with the International Semantic Web Conference, Washington, D.C., 2009.