AQL for Twitter Analysis

oauth aql arangodb

May 2014

The Twitter API is a great way to tap into machine-readable representations of Twitter data. In this post, we explore a small combination of a Node script and ArangoDB.

As ArangoDB is built around the V8 JavaScript engine, it is possible to load and run JavaScript modules from the ArangoDB console to explore data. This post discusses preliminary ideas on connecting to Twitter and some basic AQL queries.

Authentication

Before you can play with data from Twitter, you must set up authentication for API calls to Twitter. This is the first hurdle to accessing Twitter data. To make requests to the Twitter API, you must sign up for a free Twitter developer account and register your app.

For this, you can go to dev.twitter.com/app/new and register your app. After successful registration, you’ll get an API key and an API secret. These tokens are needed to verify that a request is coming from your app.

Next, you will find some URLs:

App-only authentication: https://api.twitter.com/oauth2/token
Request token URL: https://api.twitter.com/oauth/request_token
Authorize URL: https://api.twitter.com/oauth/authorize
Access token URL: https://api.twitter.com/oauth/access_token

With these URLs, your app can request access tokens for a user. This might be interesting if you want to analyze data for multiple Twitter users. In order to authorize an application for a user, you can use this script.

Lookup Twitter Users

Once your Twitter user has an access token, you can make all the requests to the Twitter API that are documented at dev.twitter.com. For our data analysis experiments, things get interesting in the Friends & Followers section.

Let’s first fetch our friends. Since the OAuth requests can get a bit hairy, Twitter provides an OAuth tool in its API documentation.

For example, if you go to the documentation of GET /users/lookup, you’ll see a button “Generate OAuth Signature”. Clicking through the different options, you eventually end up with an example curl request that might look similar to:

$ curl --get 'https://api.twitter.com/1.1/users/lookup.json' --data 'screen_name=twitterapi%2Ctwitter' --header 'Authorization: OAuth oauth_consumer_key="....", oauth_nonce="....", oauth_signature="....", oauth_signature_method="HMAC-SHA1", oauth_timestamp="1400617420", oauth_token=".....", oauth_version="1.0"' --verbose

If you run this curl request from the command line, you should see something like:

[{"id":6253282,"id_str":"6253282","name":"Twitter API","screen_name":"twitterapi","location":"San Francisco, CA","description":"The Real Twitter API. ..."}]

Let’s replace the manual curl call with a small Node script:

var request = require('request');
var fs = require('fs');

var CONSUMER_KEY = '....';
var CONSUMER_SECRET = '...';
var access_token = '....';
var access_token_secret = '....';

var oauth = {
  consumer_key: CONSUMER_KEY,
  consumer_secret: CONSUMER_SECRET,
  token: access_token,
  token_secret: access_token_secret
};

var friends = 'https://api.twitter.com/1.1/friends/ids.json';
var lookup = 'https://api.twitter.com/1.1/users/lookup.json?';

// Fetch the ids of the accounts we follow.
request.get({url: friends, oauth: oauth, json: true}, function(e, r, resp) {

  if (e) {
    console.log('error: ', e);
    return;
  }

  console.log(resp.ids.length);
  // users/lookup accepts at most 100 comma-separated ids per request.
  lookup += 'user_id=' + resp.ids.slice(1, 100).join(',');
  console.log(lookup);

  // Resolve the ids into full user objects.
  request.get({url: lookup, oauth: oauth, json: true}, function (e, r, users) {
    if (e) {
      console.log('error: ', e);
      return;
    }
    console.log(users.length);
    fs.writeFile('friends.json', JSON.stringify(users), function (err) {
      if (err) console.log('write error: ', err);
    });
  });
});

If all goes well, you’ll have about 100 friends in a file named friends.json.
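The script above only looks up the first slice of ids because users/lookup accepts at most 100 user ids per request. To fetch all friends, you would batch the id list into chunks of 100; a small sketch (the chunk and lookupUrls helpers are our own, not part of any library):

```javascript
// Split an array into batches of at most `size` elements.
function chunk(ids, size) {
  var batches = [];
  for (var i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}

// Turn each batch into a users/lookup URL with comma-separated ids.
function lookupUrls(ids) {
  return chunk(ids, 100).map(function(batch) {
    return 'https://api.twitter.com/1.1/users/lookup.json?user_id=' +
           batch.join(',');
  });
}
```

Each of these URLs could then be fetched with request.get as above; keep in mind that Twitter rate-limits this endpoint, so you may want to space the requests out.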

AQL Basics

Let’s do some data manipulation with AQL and ArangoDB next. First, you must import your Twitter data into the database.

$ arangoimp --create-collection true --collection mulpat_friends friends.json

----------------------------------------
database:         _system
collection:       mulpat_friends
create:           yes
file:             friends.json
type:             json
connect timeout:  3
request timeout:  300
----------------------------------------
Starting JSON import...
2014-05-21T05:33:43Z [76325] INFO processed 32768 bytes (11.42%) of input file
2014-05-21T05:33:43Z [76325] INFO processed 65536 bytes (22.83%) of input file
2014-05-21T05:33:43Z [76325] INFO processed 98304 bytes (34.25%) of input file
2014-05-21T05:33:43Z [76325] INFO processed 131072 bytes (45.66%) of input file
2014-05-21T05:33:43Z [76325] INFO processed 163840 bytes (57.08%) of input file
2014-05-21T05:33:43Z [76325] INFO processed 196608 bytes (68.49%) of input file
2014-05-21T05:33:43Z [76325] INFO processed 229376 bytes (79.91%) of input file
2014-05-21T05:33:43Z [76325] INFO processed 262144 bytes (91.32%) of input file
2014-05-21T05:33:43Z [76325] INFO processed 287050 bytes (100.00%) of input file

created:          99
errors:           0
total:            99

Let’s explore the data with AQL. To run AQL, I prefer going to the local ArangoDB web interface at http://localhost:8529 and selecting the AQL editor.

For the first query, we want to get a quick overview on the friends at a Twitter location:

for u in mulpat_friends
  filter u.location == 'San Francisco'
  limit 2
  return u.screen_name

The syntax should be self-explanatory, but for reference you might want to check out the AQL docs. Note that the order of clauses matters: the filter must come before the limit, otherwise you would filter only the first two documents.

The query above returns:

[
  "Fullstack node.js developer",
  "Labs Manager @WeWork"
]

Often it is nice to aggregate information. In AQL, the main statement for this is ‘collect’, and its syntax looks like this:

for u in mulpat_friends
  collect city = u.location into g
  return {city: city}

This results in:

[
  {
    "city": "Amsterdam"
  },
  {
    "city": "Barcelona"
  },
  {
    "city": "Barcelona, the World"
  },
  {
    "city": "bavaria and berlin"
  },
  {
    "city": "Bavaria/Berlin "
  }
  // ...
]

You can already see that grouping friends by location is difficult, since the free-text location field would require parsing and normalization. But even with this “imperfect” data, which location has the most friends? This question can be answered by sorting by the number of members in each group:

for u in mulpat_friends
  collect city = u.location into g
  sort length(g) desc
  return {city: city, count: length(g)}

This results in:

[
  {
    "city": "",
    "count": 13
  },
  {
    "city": "San Francisco, CA",
    "count": 5
  },
  {
    "city": "Munich, Germany",
    "count": 4
  },
  {
    "city": "Munich",
    "count": 4
  },
  // ...
]
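Outside the database, the same group-and-count logic can be sketched in a few lines of plain JavaScript, which makes a handy sanity check for the AQL result (the sample documents below are made up):

```javascript
// Group documents by location and count group sizes, analogous to
// AQL's `collect city = u.location into g` combined with length(g).
function countByLocation(docs) {
  var counts = {};
  docs.forEach(function(d) {
    counts[d.location] = (counts[d.location] || 0) + 1;
  });
  // Sort descending by count, like `sort length(g) desc`.
  return Object.keys(counts).map(function(city) {
    return {city: city, count: counts[city]};
  }).sort(function(a, b) { return b.count - a.count; });
}

var sample = [
  {location: 'Munich'}, {location: 'San Francisco, CA'},
  {location: 'Munich'}, {location: 'San Francisco, CA'},
  {location: 'San Francisco, CA'}
];
console.log(countByLocation(sample));
// San Francisco, CA first with count 3, then Munich with count 2.
```

Of course, for real data sizes you would let the database do this work — that is the point of AQL — but the JavaScript version makes the semantics of collect explicit.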

Last, AQL also allows subqueries. For example, in which locations could we reach the most followers? To answer this question, you could use the following nested query:

for u in mulpat_friends
  collect city = u.location into g
  sort length(g) desc
  return { 
           city: city, count: length(g),
           friends_count: (for f in g return f.u.friends_count)
         }

This query would return something like:

[
  {
    "city": "San Francisco, CA",
    "count": 5,
    "friends_count": [
      112,
      573,
      1007,
      49,
      286
    ]
  },
  {
    "city": "Munich, Germany",
    "count": 4,
    "friends_count": [
      995,
      38,
      48,
      246
    ]
  },
  {
    "city": "Munich",
    "count": 4,
    "friends_count": [
      25,
      282,
      47,
      63
    ]
  },
  // ...
]

Or, you could ask: which locations are home to the most Twitter influencers, based on the ratio between friends and followers:

for u in mulpat_friends
  collect city = u.location into g
  sort length(g) desc
  return {
           city: city, count: length(g),
           ratios: (for f in g return f.u.friends_count / f.u.followers_count)
         }

Of course, these ratios would still need to be averaged per location.
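As a sketch of that last step, the per-location average of the friends/followers ratios can be computed like this in plain JavaScript (the ratio values below are made up for illustration; in AQL itself you could also reach for the built-in AVERAGE() function):

```javascript
// Average a list of friends/followers ratios for one location.
function averageRatio(ratios) {
  if (ratios.length === 0) return 0;
  var sum = ratios.reduce(function(acc, r) { return acc + r; }, 0);
  return sum / ratios.length;
}

// Hypothetical ratios for one location's group of friends.
var munich = [0.25, 0.3, 0.5, 0.25];
console.log(averageRatio(munich)); // ≈ 0.325
```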

Leave me feedback

Follow me on Twitter here.
