The diversity of Atlanta’s 242 neighborhoods captivated me. Some neighborhoods are indistinguishable, while others possess vivid personalities. There is Cabbagetown, an eclectic, hip community with historic shotgun-style homes. Walking from Cabbagetown through the Krog Street Tunnel, you will discover walls of evolving street art. Nearby Inman Park has its own character with Queen Anne, Colonial Revival, Jacobean Revival and Shingle houses surrounded by verdant nature. Drive further north and you will encounter Buckhead, with its massive mansions, mid-rise residences, skyscrapers, ritzy restaurants, and exceptional shopping.
Between 2010 and 2015, the city of Atlanta gained more than 40,000 new residents. By 2040, metro Atlanta is predicted to have more than 8 million residents, up from the 5.7 million today. This fast-paced growth continuously alters neighborhoods that have become home to both transplants and locals. As one of these transplants, I have lived in Atlanta long enough to see development projects like Atlantic Station, the Beltline, and Ponce City Market reshape sections of the city. Sometimes I feel like a tourist in my own city as I stumble into new shops and restaurants. This varied and evolving vibe of Atlanta neighborhoods intrigued me and became the focal point of my master’s project at Georgia Tech. My project was inspired by the neighborhoods feature on Airbnb, where local editors highlight the personalities of neighborhoods in various major cities. I was curious to see if, through the use of social media data, one could discover neighborhood vibe profiles that would change alongside neighborhood transformations.
Understanding Social Media Data
I chose Twitter as the initial step into my data science project. Twitter is a public platform for nonstop immediate expression of thoughts and sentiments through tweets that can also include geotags—critical for geography-based analysis. As appealing a source as Twitter is for data science, it also has inherent challenges.
- Tweets are naturally short. Limited to just 140 characters, users can be resourceful, employing abbreviations and slang. Creating dictionaries for translating non-standard or locale-specific usage is critical. In addition, document clustering algorithms work best with large documents containing several topics. (Note: Since the completion of my master’s project, Twitter has relaxed the 140-character limit.)
- Tweets are unstructured. There are few rules for how tweets are written. They may be any combination of words, short phrases, or complete sentences. In addition, they may have hashtags (like #riseup, used by Atlanta Falcons football fans). These tags must be parsed into individual words for further analysis, and may require local knowledge to clarify meaning. Other challenges include emoticons, which can emphasize or reverse a sentiment. In addition, the tweets may have simple misspellings that must be corrected before analysis.
- Geotagged tweets may not represent the full population. Other research on Twitter shows that only a small percentage of users geotag their tweets, often because of privacy concerns. These willing users are sometimes travelers at major airports, parks, and public facilities. In other cases, they are non-personal accounts—such as marketers, bots, and spammers—that can deliberately weaken or reinforce the typecast of a neighborhood, as opposed to exposing novel encounters.
- Using only geotagged tweets limits the volume of data. I expected small neighborhoods would have fewer tweets than large neighborhoods. After cleaning/processing tweets, my most active Twitter neighborhoods averaged ~5,000 tweets per month whereas my least active neighborhoods averaged 300 tweets per month. The small data size added to the risk of not accurately representing the neighborhood population.
- Location may not match the topic or context of the tweet. While Twitter has the allure of immediacy, there is no guarantee that users will tweet at the moment of experience. To properly match a tweet to a neighborhood, context is needed to determine neighborhood relevancy; however, the short text limits the ability to derive context.
Collecting, Processing, and Analyzing Tweets
As a student, I had the freedom to explore in the face of challenges and determine if tweets could be used for neighborhood vibe research.
I used the Twitter Search API to collect geotagged English tweets within a 50-mile radius of Atlanta’s city center. Using neighborhood boundaries from mapping data and each tweet’s geotag, I associated each tweet with a neighborhood.
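The geotag-to-neighborhood step can be sketched with a standard ray-casting point-in-polygon test. This is a minimal illustration, not the project's actual tooling: it assumes each boundary is a simple polygon of (longitude, latitude) vertices, and the rectangles, neighborhood labels, and function names below are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: flip 'inside' each time a ray heading east crosses an edge."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_neighborhood(geotag: Point, boundaries: Dict[str, List[Point]]) -> Optional[str]:
    """Return the first neighborhood whose polygon contains the geotag, else None."""
    for name, polygon in boundaries.items():
        if point_in_polygon(geotag, polygon):
            return name
    return None


# Hypothetical, simplified rectangles; these are NOT real neighborhood boundaries.
BOUNDARIES = {
    "Neighborhood A": [(-84.40, 33.77), (-84.37, 33.77), (-84.37, 33.80), (-84.40, 33.80)],
    "Neighborhood B": [(-84.37, 33.74), (-84.35, 33.74), (-84.35, 33.76), (-84.37, 33.76)],
}

print(assign_neighborhood((-84.385, 33.785), BOUNDARIES))
```

Tweets whose geotag falls outside every polygon come back as None and can simply be dropped, which is one way the 50-mile collection radius gets narrowed down to city neighborhoods.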
After data collection, I cleaned the data by replacing common misspellings, slang words, and abbreviations with English words from a custom dictionary of Atlanta-specific slang and abbreviations. I also parsed hashtags into separate words and removed all emoticons that could not be translated into relevant word meanings.
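The cleaning steps described here might look roughly like the sketch below. The dictionary entries, the emoticon table, the hashtag vocabulary, and the function names are all hypothetical stand-ins; the project's real custom dictionary is not reproduced in this article. Hashtags are split with a simple greedy longest-match, which is only one of several possible approaches.

```python
import re
from typing import List, Set

# Hypothetical entries standing in for the custom slang/abbreviation/misspelling dictionary.
SLANG = {"atl": "atlanta", "u": "you", "gr8": "great", "2nite": "tonight"}
# Hypothetical emoticon translations; emoticons without a translation are dropped.
EMOTICONS = {":)": "happy", ":(": "sad"}
# Hypothetical vocabulary used to split hashtags into words.
KNOWN_WORDS = {"rise", "up", "atlanta", "falcons"}
# Rough pattern for simple emoticons such as ":P" or ";-)".
EMOTICON_RE = re.compile(r"^[:;=][-']?[)(DPpOo/\\|]$")


def split_hashtag(tag: str, vocabulary: Set[str]) -> List[str]:
    """Greedy longest-match split of a hashtag into known words."""
    text = re.sub(r"[^a-z0-9]", "", tag.lower())
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocabulary:
                words.append(text[i:j])
                i = j
                break
        else:
            return [text]  # no clean split found, so keep the tag as one token
    return words


def clean_tweet(tweet: str) -> str:
    tokens: List[str] = []
    for raw in tweet.split():
        if raw in EMOTICONS:
            tokens.append(EMOTICONS[raw])          # translate known emoticons
        elif EMOTICON_RE.match(raw):
            continue                               # drop untranslatable emoticons
        elif raw.startswith("#"):
            tokens.extend(split_hashtag(raw, KNOWN_WORDS))
        else:
            word = re.sub(r"[^a-z0-9']", "", raw.lower())
            if word:
                tokens.append(SLANG.get(word, word))  # replace slang and abbreviations
    return " ".join(tokens)


print(clean_tweet("Gr8 game 2nite in ATL :) #RiseUp :P"))
```

Greedy matching can fail on tags where a shorter first word would have allowed a full split, so a production version would likely need backtracking or a word-frequency model.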
Because unsupervised document-clustering algorithms may not perform well with short tweets, I organized the tweets for each neighborhood into several large documents, grouped by hashtag when available. I assumed tweets with the same hashtag would be related to each other. Tweets without hashtags were grouped together by Twitter account. In addition, I managed the impact of spam accounts by limiting the number of tweets contributed to a document by a single account. For example, non-personal accounts such as job recruiters or new nightclub marketers could have contributed more than 25% of the tweets for a neighborhood if left unchecked.
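A rough sketch of this pooling step follows. It assumes each tweet is a (neighborhood, account, text) tuple; the cap of two tweets per account per document, the sample tweets, and the account names are hypothetical, since the actual limit used in the project is not stated.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Tweet = Tuple[str, str, str]  # (neighborhood, account, text)
MAX_TWEETS_PER_ACCOUNT = 2    # hypothetical cap, chosen only for illustration


def build_documents(
    tweets: List[Tweet], cap: int = MAX_TWEETS_PER_ACCOUNT
) -> Dict[Tuple[str, str], List[str]]:
    """Pool tweets into documents keyed by (neighborhood, hashtag or account)."""
    documents: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    contributed: Dict[Tuple[Tuple[str, str], str], int] = defaultdict(int)
    for neighborhood, account, text in tweets:
        hashtags = re.findall(r"#(\w+)", text.lower())
        # Group by hashtag when present; otherwise fall back to the account.
        labels = [f"#{h}" for h in hashtags] or [f"@{account}"]
        for label in labels:
            key = (neighborhood, label)
            # Limit how many tweets one account may add to a single document.
            if contributed[(key, account)] < cap:
                documents[key].append(text)
                contributed[(key, account)] += 1
    return dict(documents)


# Hypothetical sample tweets.
SAMPLE = [
    ("Midtown", "alice", "Brunch then #piedmontpark"),
    ("Midtown", "bob", "Sunny day at #PiedmontPark"),
    ("Midtown", "recruiter", "Now hiring! #jobs"),
    ("Midtown", "recruiter", "Apply today #jobs"),
    ("Midtown", "recruiter", "Still hiring #jobs"),
    ("Midtown", "carol", "Traffic again"),
]

for key, texts in build_documents(SAMPLE).items():
    print(key, len(texts))
```

In this sample the recruiter account posts three tweets with the same hashtag, but only two reach the document, which is the kind of dampening the per-account limit is meant to provide.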
The tweet documents were input into Gensim, a free Python library for topic modeling/clustering. I then tweaked the parameters (number of topics, words per topic, and alpha and beta hyperparameters) to get a list of neighborhood topics. I manually classified the discovered topics into four vibe categories: points of interest, restaurants, perceptions/sentiments, and food/cuisine.
Putting the Vibe Topics to the Test
Understanding the many challenges Twitter-derived data presented, I had low initial expectations. However, I was pleasantly surprised to see distinct topics emerging per neighborhood. To determine if the topics were relevant, I ran a user study to assess two metrics: precision and recall. Precision measures the fraction of discovered topics that are relevant to a neighborhood; recall measures the fraction of relevant topics that were actually discovered.
I included the vibe topics for three neighborhoods with distinct vibes: Midtown, Virginia Highlands, and Cabbagetown. These three areas also had a varied volume of tweets, ranging from approximately 5,000 tweets per month in Midtown to Cabbagetown’s average of 300 tweets per month.
The user study was an anonymous online survey using SurveyMonkey. To recruit different population groups, I shared the survey on Facebook, Twitter, Yik Yak, and Atlanta subreddits, encouraging participants to also share with their friends and family. In formal terms, this is a snowball non-probability sampling method.
For each neighborhood, I included a Google classic map of its boundaries and asked potential participants to self-qualify based on recent knowledge (in the past six months) of the neighborhood. As people qualified, they proceeded to the topic questions. Each of the vibe topic categories (the aforementioned points of interest, restaurants, perceptions/sentiments, and food/cuisine) had two questions:
- First, the survey presented a list of topics and asked participants to select topics that matched their understanding of that neighborhood (precision testing).
- The survey then asked participants to add topics that they believed were missing from that category (recall testing).
Precision
I calculated precision based on the number of topics that participants confirmed as matching their knowledge of that neighborhood. To eliminate outliers, a topic was included in the precision calculation only if at least 10% of the participants confirmed relevancy. Based on these criteria, I found that three out of four topics were relevant to a neighborhood. The food category performed worst, with only one out of two topics being relevant. This could be due to personal bias for certain cuisines.
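The calculation can be sketched as follows, under one reading of the rule above: a discovered topic counts as relevant when at least 10% of participants confirmed it, and precision is the share of discovered topics that clear that bar. The confirmation counts, the participant total, and the generic topic labels are hypothetical.

```python
from typing import Dict


def precision(confirmations: Dict[str, int], participants: int, threshold: float = 0.10) -> float:
    """Fraction of discovered topics confirmed by at least `threshold` of participants."""
    if not confirmations or participants <= 0:
        raise ValueError("need at least one topic and one participant")
    relevant = [
        topic for topic, count in confirmations.items()
        if count / participants >= threshold
    ]
    return len(relevant) / len(confirmations)


# Hypothetical confirmation counts for four discovered topics, out of 20 participants.
CONFIRMATIONS = {"park": 18, "theatre": 12, "brunch": 7, "Club Kapture": 0}

print(precision(CONFIRMATIONS, participants=20))
```

With these made-up counts, three of the four topics clear the 10% bar, giving a precision of 0.75.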
Despite attempts to limit the effect of marketers, they still had an impact on precision. For example, Club Kapture appeared as a top topic for Midtown, yet none of the survey participants found it relevant.
I also observed that participants did not always confirm a point of interest with low ratings as being relevant, even when the location was within a neighborhood’s boundaries.
The survey also showed that neighborhood boundaries are fuzzy. For example, Righteous Room and Majestic are in Poncey Highlands, which borders the Virginia Highlands. Both places appeared in the list of topics for Virginia Highlands. However, some participants adamantly opposed including them in Virginia Highlands, saying they belonged to Poncey Highlands. Others confirmed relevancy within the Virginia Highlands, perhaps feeling they were close enough to be included in that neighborhood.
This might also be an example of survey acquiescence bias: participants may have wanted to please the researcher by confirming relevancy. Supporting this explanation, participants confirmed other points of interest as associated with Virginia Highlands that were nowhere near its borders. For example, participants confirmed the relevancy of Smith’s Olde Bar to Virginia Highlands, yet the bar is in Piedmont Heights, separated from Virginia Highlands by a whole neighborhood.
Recall
I then calculated recall by comparing the topics that participants suggested as missing with the topics discovered via tweet analysis. Using the topics found from one month of tweets, recall was 20%. This means that the topic modeling only found 20% of the potential topics for that neighborhood. Because of the low recall rate, I expanded the topic modeling to use three months of tweets.
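One way to compute recall from the survey is sketched below. It assumes the set of relevant topics is the union of discovered topics that participants confirmed and topics they suggested as missing; that definition is an assumption made for illustration, and the topic names other than Moe's and Joe's are hypothetical placeholders.

```python
from typing import Set


def recall(confirmed_discovered: Set[str], suggested_missing: Set[str]) -> float:
    """Fraction of relevant topics that the topic model actually discovered."""
    # Assumption: relevant = confirmed discovered topics + participant-suggested topics.
    relevant = confirmed_discovered | suggested_missing
    if not relevant:
        raise ValueError("no relevant topics to evaluate")
    return len(confirmed_discovered) / len(relevant)


# Hypothetical example: one confirmed topic and four suggested as missing.
CONFIRMED = {"discovered topic 1"}
MISSING = {"Moe's and Joe's", "missing topic 2", "missing topic 3", "missing topic 4"}

print(recall(CONFIRMED, MISSING))
```

Here one relevant topic out of five was discovered, which mirrors a 20% recall.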
This low recall rate likely reflected how geotagged tweets do not represent the population and are therefore missing key topics for a neighborhood vibe. For example, Midtown is known for being an LGBT-friendly neighborhood; however, none of the topics that emerged from the data analysis were LGBT-related.
Participants also complained that favorite local hangout spots were missing. For example, Moe’s and Joe’s, a fixture in the Virginia Highlands for 67 years, did not appear in enough geotagged tweets to become a top topic. This is likely caused by either geotagged tweets not representing the population or Twitter users not tweeting about their common daily experiences.
I also found that ambiguous proper names led to misclassification of favorite hot spots. For example, Home Grown is a popular restaurant in Cabbagetown, but I manually classified it as homegrown vegetables. Little’s Food Store is a neighborhood grocery store also in Cabbagetown; however, Little’s may have been misclassified as an adjective. Finally, Yeah! Burger, a popular burger joint in Midtown, may have been misclassified as an affirmation (Yeah!).
Analyzing Temporal Vibe
For my final analysis, I compared topics from the same neighborhoods in Spring 2015 and Spring 2016, hoping to see how the vibe of some neighborhoods was evolving. I was disappointed to find no dramatic differences, only a few that could have resulted from chance. For example, Spring 2016 introduced tweets about the Atlanta Beltline, an urban revitalization project to convert a railway corridor that encircles Atlanta into walking trails. Tweets could have increased because of new developments near the Atlanta Beltline. I also saw a rise in voting topics in 2016; however, 2016 was a presidential election year. Mentions of exhibits at the High Museum surged in 2016, which may have reflected improved marketing or more interesting exhibits rather than changes to the neighborhood vibe.
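A simple way to compare two periods is to treat each season's topics as a set and look at what emerged, faded, or persisted. The sketch below assumes topics have already been reduced to normalized labels; the topic lists are hypothetical, with only the Beltline, voting, and High Museum examples taken from the text.

```python
from typing import Dict, Set


def compare_periods(before: Set[str], after: Set[str]) -> Dict[str, Set[str]]:
    """Split topics into those that emerged, faded, or persisted between two periods."""
    return {
        "emerged": after - before,
        "faded": before - after,
        "persisted": before & after,
    }


# Hypothetical topic labels for one neighborhood.
SPRING_2015 = {"high museum", "park", "brunch"}
SPRING_2016 = {"high museum", "park", "beltline", "voting"}

changes = compare_periods(SPRING_2015, SPRING_2016)
for label in ("emerged", "faded", "persisted"):
    print(label, sorted(changes[label]))
```

A set difference only flags candidates. As the voting example shows, an emerged topic can reflect a one-off event rather than a lasting change in vibe, so each one still needs human interpretation.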
My project was a proof of concept to see if vibe analysis was possible with the use of geotagged tweets. Motivated by Airbnb’s curated neighborhood feature, I explored whether social media, rather than local subject matter experts, could describe the neighborhood vibe. Vibe can be derived from tweets on subjects like food, music, activity, art, and events.
Because social media data can carry context such as location and time, it can reflect the up-to-the-minute vibe. The feel of a neighborhood can shift based on time of day, day of the week, or season of the year. An obvious example is how a neighborhood can evolve from a business/corporate crowd to a hopping nightlife in the span of a day. These are nuances that can be overlooked when constructing a careful descriptive narrative for a neighborhood.
In addition to current vibe, social media may allow us to discover tipping points or emerging trends. Discovering these trends can steer important decisions, such as home-buying, site selections for restaurants or retail, and even city revitalization projects.
The vibe of a neighborhood is often dictated by the people congregating within it. My analysis could have included ascertaining categories of personalities for Twitter users and determining which neighborhoods attract those users.
As a proof of concept, I was able to see both the weaknesses and strengths of using social media data to understand neighborhood vibe.
[bluebox]
References:
“Census: Metro Atlanta Population Reaches 5.7 Million” Atlanta Journal-Constitution. Web. 9 Jan. 2017.
Malik, Momin, et al. “Population bias in geotagged tweets.” ICWSM Workshop on Standards and Practices in Large-Scale Social Media Research. 2015.
Acknowledgment: Dr. Munmun De Choudhury (Master’s project Advisor)
[/bluebox]