Basic analysis of Twitter takedown data

This entry is part 3 of 3 in the series Consider Social Media

Earlier this month, political analyst Amal Sinha tweeted some preliminary analyses of Twitter data about false accounts / bots:

Some background: On 12th June, Twitter released new datasets that compiled anonymised data from accounts that seem to be linked to information operations run by the Chinese (PRC), Russian, and Turkish states. These accounts have since been shut down, but Twitter has retained data about the profiles and their tweets. This is part of Twitter’s ongoing compilation of data about “potentially state-backed information operations” on their platform.

Sinha’s analyses looked at behavioural trends in the Chinese account dataset, including the timing of tweets:

This piqued my interest, of course. As you can probably tell from this blogchain, I’ve been thinking about social media and its influence on information dissemination and consumption. Sinha ends his thread by pointing out how these behavioural patterns and attributes could be used to create some accessible way to identify / flag fake accounts like these. That’s catnip for nerds in a world of digital disinformation, really. [Footnote 1: Even if we factor out the very relevant fact that social media is destroying public discourse in my home country.]

So I went and downloaded a copy of Twitter’s datasets to try poking through the data myself. The better to practice some R programming, too.

Simple Tweet Data Analysis with R

First things first, Twitter’s datasets are about as tidy as you can hope for. The Chinese set contained two main datasets:

  • account information, which compiled metadata about each profile (so attributes like user’s reported location, number of followers, etc.)
  • tweet information, which compiled individual tweet contents as well as metadata (time the tweet was published; reply, retweet, and quote counts; etc.)

There were 23,750 accounts in all, and a total of 348,608 individual tweets.

If you download the datasets, Twitter also provides a handy Read Me file that enumerates all the variables available for each dataset. For these quick probes of the data, I mostly did some simple transformations to isolate the variables I wanted to look at.

Examining Tweet Timings

First, I tried to recreate Sinha’s graph of tweet timings. I think the trickiest step here might be remembering to convert time zones, since Twitter provides timings in UTC by default.

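## load packages: dplyr for wrangling, lubridate for date-time helpers
library(dplyr)
library(lubridate)

## convert tweet_time (UTC) to China Standard Time, then extract the hour of day
## (assumes tweet_time was parsed as a date-time when the data was read in)
by_hour <- tweets_all %>% 
  mutate(chn_hour = with_tz(tweet_time, tzone = "Asia/Shanghai"),
         hour_level = hour(chn_hour))

## check new object
glimpse(by_hour)

## count tweets per hour of day
by_hour_sum <- by_hour %>% 
  group_by(hour_level) %>% 
  summarise(count = n())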
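To see what the conversion is doing, here is a minimal sketch on a few made-up timestamps. It uses only base R (so it runs without lubridate or dplyr), and the values in `utc_times` are illustrative, not taken from Twitter's dataset. It also tabulates against all 24 hours, because a grouped count only returns hours that actually occur in the data.

```r
# Made-up UTC timestamps for illustration (not taken from the dataset)
utc_times <- as.POSIXct(
  c("2020-01-01 01:30:00", "2020-01-01 17:45:00", "2020-01-02 01:05:00"),
  tz = "UTC"
)

# Hour of day after shifting to China Standard Time (UTC+8)
chn_hours <- as.integer(format(utc_times, "%H", tz = "Asia/Shanghai"))
print(chn_hours)  # 9 1 9

# Tabulate against all 24 hours so that hours with no tweets show up as 0
hour_counts <- table(factor(chn_hours, levels = 0:23))
print(hour_counts[hour_counts > 0])
```

A tweet sent at 17:45 UTC lands at 01:45 the next day in Shanghai, which is exactly the kind of shift that changes the shape of an hourly graph if the conversion is skipped.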

From there, it’s a simple matter of visualising the data using ggplot2, with “hour_level” (the hour of day in China’s standardised local time, extracted from “chn_hour”) as the focal variable:

## line graph version
library(ggplot2)

## hours of the day run from 0 to 23, so the axis stops at 23
by_hour_graph_line <- by_hour_sum %>% 
  ggplot() +
  geom_line(aes(x = hour_level, y = count)) +
  scale_x_continuous(name = "Hour of Day",
                     limits = c(0,23),
                     breaks = 0:23) +
  scale_y_continuous(name = "Tweets",
                     breaks = seq(0,60000,5000)) +
  labs(title="PRC Fake Twitter Accounts - Tweets By Hour", 
       subtitle="Tweeting trends correspond with working hours in China",
       caption="Source: Dataset from Twitter.com") +
  ggthemes::theme_economist()

This gives us the following graph:

I tried to create a bar graph version too, since it might be a better representation of discrete hours (as opposed to the line graph, which links the hours together as if tweet volume were a continuous phenomenon).

## bar graph version
## geom_bar() counts the rows per hour itself, so it takes the unsummarised data
## limits are padded by half a unit so the bars at hours 0 and 23 are not cut off
by_hour_graph_bar <- by_hour %>% 
  ggplot() +
  geom_bar(aes(x = hour_level)) +
  scale_x_continuous(name = "Hour of Day",
                     limits = c(-0.5,23.5),
                     breaks = 0:23) +
  scale_y_continuous(name = "Tweets",
                     breaks = seq(0,60000,5000)) +
  labs(title="PRC Fake Twitter Accounts - Tweets By Hour", 
       subtitle="Tweeting trends correspond with working hours in China",
       caption="Source: Dataset from Twitter.com") +
  ggthemes::theme_economist()

Which gives us this graph:

The findings track with Sinha’s own graph, which he shared in his Twitter thread. That was the expected outcome, since we were working with the same dataset, but it’s always good to have that quick assurance that your own code was structured correctly and yielded the same results.

Examining Twitter Profile Age

Sinha didn’t tweet about this, but I figured I might as well check. In the Philippines, just from what I’ve seen from regular social media browsing, troll accounts tend to be fairly new. I wondered if that might be the case for these PRC accounts as well — if, perhaps, that indicated that most accounts used for specific information ops goals are only created shortly before the campaign starts.

First, then, I had to figure out how long each account was active — that is, each account’s “age.”

Twitter’s dataset doesn’t include activity ranges, but it does provide the account creation date for each profile. The Twitter profiles included in the dataset were taken down in May 2020, so I used that as my end date. Then, it was time to calculate ages for each account.

# Grouping accounts by age ####
## takedown month, used as the end date for every account
mark_date <- as.Date("2020-05-01")

by_age <- accounts_all %>% 
  mutate(current = mark_date)

## set interval between account creation date and the end date
by_age <- by_age %>% 
  mutate(int = interval(account_creation_date, current))

## find length of interval in months (rounded) and assign ranges
## include.lowest = TRUE keeps accounts that round to 0 months in the first bin
by_age <- by_age %>% 
  mutate(duration = round(time_length(int, unit = "month"))) %>% 
  mutate(range = cut(duration,
                     breaks = c(0,3,6,9,12,Inf),
                     labels = c("0-3 months", "4-6 months", "7-9 months", "10-12 months", "13+ months"),
                     include.lowest = TRUE))
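As a rough cross-check on the age calculation, the sketch below computes account age in base R without lubridate. Note that it is a swapped-in technique, not the code used above: it counts completed calendar months (rounding down), whereas the lubridate version rounds to the nearest month. The helper name `months_between` and the creation dates are made up for illustration.

```r
# End date: the takedown month used in the post
mark_date <- as.Date("2020-05-01")

# Made-up account creation dates for illustration (not from the dataset)
created <- as.Date(c("2020-04-15", "2019-11-01", "2017-06-20"))

# Hypothetical helper: completed calendar months between two dates
months_between <- function(from, to) {
  from_lt <- as.POSIXlt(from)
  to_lt <- as.POSIXlt(to)
  whole <- (to_lt$year - from_lt$year) * 12 + (to_lt$mon - from_lt$mon)
  # Subtract one month if the day-of-month has not been reached yet
  whole - (to_lt$mday < from_lt$mday)
}

age_months <- months_between(created, mark_date)
print(age_months)  # 0 6 34
```

The two approaches can differ by a month for accounts created late in a month, which matters mostly for accounts sitting right on the edge of an age range.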

I figured there would be considerable variation when it came to the number of months each profile was active. To avoid getting a fairly messy graph [footnote 2: just imagine 30+ ticks all over your X-axis], I decided to simplify things further and group accounts according to specified age ranges:

  • 0-3 months
  • 4-6 months
  • 7-9 months
  • 10-12 months
  • 13+ months
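A quick way to sanity-check this kind of binning is to run cut() on a handful of made-up month values and confirm that each lands in the intended range. The sketch below assumes the same breaks and labels as the ranges listed above; the values in `months_active` are illustrative only. It passes include.lowest = TRUE, since by default cut() returns NA for a value equal to the lowest break (here, 0 months).

```r
# Made-up ages in months, chosen to sit on the edges of each range
months_active <- c(0, 2, 3, 4, 6, 7, 12, 13, 40)

age_range <- cut(
  months_active,
  breaks = c(0, 3, 6, 9, 12, Inf),
  labels = c("0-3 months", "4-6 months", "7-9 months",
             "10-12 months", "13+ months"),
  include.lowest = TRUE  # keep 0 months in the first range instead of NA
)

# Show each value next to its assigned range, then the counts per range
print(data.frame(months_active, age_range))
print(table(age_range))
```

Each range is closed on the right, so an account at exactly 3 months stays in "0-3 months" and one at 4 months moves to "4-6 months".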

Then, it was a matter of graphing the results using ggplot2:

## check count per month age
by_age_sum <- by_age %>% 
  group_by(duration) %>% 
  summarise(accounts = n())

glimpse(by_age_sum)

## graph count per range level
by_age_graph <- by_age %>%
  ggplot(aes(x = range)) +
  geom_bar(aes(fill = range), show.legend = FALSE) +
  scale_y_continuous(name = "Number of Accounts",
                     breaks = seq(1000,13000,1000)) +
  scale_x_discrete(name = "Age") +
  labs(title="PRC Fake Twitter Accounts by Age", 
       subtitle="Most fake accounts tend to be less than 7 months old",
       caption="Source: Dataset from Twitter.com") +
  ggthemes::theme_economist()

This gives us the following graph:

The vast majority of these troll accounts appear to have been less than a year old. There are a lot of factors that could affect account age, though: maybe Twitter tends to identify and take down troll accounts before most of them can breach the 6-month mark; maybe accounts get abandoned or deleted after a certain campaign; and so on.

This graph is mostly descriptive; sussing out some kind of explanation for this behaviour will take much more research and analysis. Still, it’s an interesting point to bring to light about these kinds of accounts.

More Information

I tried visualising these accounts as a network, but apparently that was too much work for my lone laptop. R couldn’t even produce a visualisation. 😅

Other, better analysts have, of course, studied this data and come up with much more sophisticated analyses. Twitter has been working with the Stanford Cyber Policy Center’s Internet Observatory, which has published its findings online. They’ve got a fantastic model of the network as divided among the topics of their tweets, as well as some interesting takeaways about the specific narratives that these accounts tried to amplify.

There’s a lot more data to be studied, but if nothing else, this quick look at a couple of Twitter’s datasets highlights the scale and sophistication of the information operations being carried out online. Social media can be a scary place, more so when you consider how its massive reach and influence are essentially unchecked. As Sinha pointed out in his thread, though, studying these information operations could give us a fighting chance against disinformation online.