
Observing & Querying Tweets (Part 1)

As we look to build compelling examples using Saffron Sierra we’ve often talked about using Twitter as a data source. If we could get Twitter data (tweets) into Sierra then we could build lots of interesting analytical capabilities using Sierra’s APIs. To enable our exploration, and others’, I’ve decided to start a little series (probably 3 parts) of blog posts about observing and querying Twitter data in Sierra.

In this first part I’m going to cover the process of grabbing a Twitter feed and building the necessary resource XML for Sierra observations. For now, you can think of a “resource” in Sierra speak as our method of “inserting” data into the system. For my example I’m going to be using Groovy. This code should be easily replicated in your language of choice. I chose Groovy because I’m familiar with it and it provides a few tools that make this kind of thing really easy.

The first thing that we need to do is grab some tweets. I’m going to use the “friends_timeline” API call for Twitter. There are a number of other Twitter API calls that would return data formatted the same way. Please note that if you try to run this yourself you’ll need to put in a valid Twitter username/password in order for the authentication to work. I’m using the HTTPBuilder available in Groovy to build the GET request needed for the call to Twitter.

def http = new HTTPBuilder('http://twitter.com')
http.auth.basic 'put your username here', 'put your password here'
 
http.request(groovyx.net.http.Method.GET, groovyx.net.http.ContentType.JSON) {
    uri.path = '/statuses/friends_timeline.json'
 
    response.success = {r, json ->
        println r.statusLine
    }
 
    response.failure = {r ->
        println "Unexpected error: ${r.statusLine.statusCode} : ${r.statusLine.reasonPhrase}"
    }
}

The important thing to note is the “response.success” closure. This is what gets called once we receive a successful response from Twitter. Since we specified our desired content type as JSON, HTTPBuilder will go ahead and parse the JSON string coming back from Twitter and hand us a JSON array (pretty sweet).
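To see what that parsed structure looks like, here’s a minimal sketch that uses Groovy’s JsonSlurper to stand in for the parsing HTTPBuilder does for us. The sample payload below is made up for illustration, but the field names match the ones we use later:

```groovy
import groovy.json.JsonSlurper

// An illustrative payload shaped like one element of the statuses array.
// This is NOT a real API response, just a stand-in for testing.
def sample = '''[
  {
    "text": "@ccopeland I believe... so sad!",
    "created_at": "Tue Nov 10 00:23:46 +0000 2009",
    "in_reply_to_screen_name": "ccopeland",
    "user": { "screen_name": "jaredpeterson" }
  }
]'''

def json = new JsonSlurper().parseText(sample)

// Each element is a plain map, so the fields we need are simple property paths
json.each { tweet ->
    println tweet.user.screen_name          // prints jaredpeterson
    println tweet.text
    println tweet.in_reply_to_screen_name   // prints ccopeland
}
```

Once you have the array, everything else is just walking maps and lists, which is exactly what the builder code in the next section does.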

The next thing we’ll need to do is iterate over that JSON array and build the XML we’ll need to observe a “resource” in Sierra. Inside our “response.success” closure we’ll have the following:

def xml = new StreamingMarkupBuilder()
xml.encoding = 'UTF-8'
 
def doc = xml.bind {
    // add the needed items at the beginning of the xml doc
    mkp.xmlDeclaration()
    mkp.yieldUnescaped ''
    rs {
        json.eachWithIndex {tweet, i ->
            delegate.r(n: i) {
                delegate.as {
                    // add the author as a "person" attribute
                    a(c: 'person') {
                        v(tweet.user.screen_name)
                    }
 
                    // add the text of the tweet as a "text" attribute
                    def text = tweet.text
                    a(c: 'text') {
                        v(text)
                    }
 
                    // add the create date as a "date" attribute
                    a(c: 'date') {
                        v(tweet.created_at)
                    }
 
                    // if this is a reply then grab that user as well and create another "person" attribute
                    if (tweet.in_reply_to_screen_name && tweet.in_reply_to_screen_name != 'null') {
                        a(c: 'person') {
                            v(tweet.in_reply_to_screen_name)
                        }
                    }
 
                    // if we have geo information then create a "geocoordinate" attribute
                    if (tweet.geo && tweet.geo != 'null') {
                        a(c: 'geocoordinate') {
                            v(tweet.geo)
                        }
                    }
 
                    // grab any other usernames mentioned using @ and create "person" attributes
                    text.findAll(/[@]+[A-Za-z0-9-_]+/).each { username ->
                        a(c: 'person') {
                            // todo - modify the regex above so that this replace is not necessary
                            v(username.replaceAll('@', ''))
                        }
                    }
 
                    // grab all of the hashtags out of the text and create "hashtag" attributes for them
                    text.findAll(/[#]+[A-Za-z0-9-_]+/).each { hashtag ->
                        a(c: 'hashtag') {
                            // todo - modify the regex above so that this replace is not necessary
                            v(hashtag.replaceAll('#', ''))
                        }
                    }
 
                    // grab any urls out of the text and create "url" attributes for them
                    text.findAll("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]").each { url ->
                        a(c: 'url') {
                            v(url)
                        }
                    }
                }
            }
        }
    }
}
 
System.out << doc

There’s a lot going on here so let me cover a few things. We’re using Groovy’s StreamingMarkupBuilder to build our XML document, and “json.eachWithIndex” to iterate over each tweet returned from Twitter. For each tweet we build an XML “r” (resource) stanza. Each “r” (resource) stanza contains a set of attributes, “as”. We build each attribute “a” using values from the tweet. In this case we are building attributes for the author of the tweet, the date it was created, etc… Finally, you’ll see the use of a few regular expressions to grab things like usernames, hashtags, and links out of the tweet text and build attributes for those as well. If all goes well we should get XML that looks like this (from my twitter stream):

<?xml version="1.0" encoding="UTF-8"?>
<rs>
    <r n="0">
        <as>
            <a c="person">
                <v>newmediatim</v>
            </a>
            <a c="text">
                <v>so close.</v>
            </a>
            <a c="date">
                <v>Tue Nov 10 00:25:51 +0000 2009</v>
            </a>
        </as>
    </r>
    <r n="1">
        <as>
            <a c="person">
                <v>bigcartel</v>
            </a>
            <a c="text">
                <v>We're back! Thanks again for being troopers. We're doing a full investigation on what needs to be done to prevent further issues like this.</v>
            </a>
            <a c="date">
                <v>Tue Nov 10 00:25:20 +0000 2009</v>
            </a>
        </as>
    </r>
    <r n="2">
        <as>
            <a c="person">
                <v>grantLuckey</v>
            </a>
            <a c="text">
                <v>I'm putting up 2 Christmas trees in my home this year ??</v>
            </a>
            <a c="date">
                <v>Tue Nov 10 00:24:45 +0000 2009</v>
            </a>
        </as>
    </r>
    <r n="3">
        <as>
            <a c="person">
                <v>jaredpeterson</v>
            </a>
            <a c="text">
                <v>@ccopeland I believe... so sad!</v>
            </a>
            <a c="date">
                <v>Tue Nov 10 00:23:46 +0000 2009</v>
            </a>
            <a c="person">
                <v>ccopeland</v>
            </a>
        </as>
    </r>
</rs>

In the future we’ll probably want to grab other things out of the tweets, but this should give us a good starting point.

In my next post I’ll cover how to send this XML resource to a Sierra instance so that I can leverage the Sierra APIs to query it.



This entry was posted on Monday, November 9th, 2009 at 7:32 pm and is filed under SaffronSierra. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

5 Responses to “Observing & Querying Tweets (Part 1)”

  1. Megan says:

    Hmm, I’m not sure any of what I read made sense in my brain, but I’ll just have my husband translate. He knows some tech languages. =) Neat to read anyways.

  2. Jared Peterson says:

    Megan, don’t worry… it barely makes sense in my brain and I wrote it ;)

  3. Social comments and analytics for this post…

    This post was mentioned on Twitter by jaredpeterson: here’s a pretty geeky blog post I wrote yesterday for @saffrontech http://bit.ly/1NWThp...

  4. [...] & Querying Tweets (Part 2) Dec 28, 2009 by Jared Peterson In my last post I walked through grabbing tweets from Twitter using their API. We then took those tweets and built [...]

  5. [...] & Querying Tweets (Part 3) Jan 13, 2010 by Jared Peterson In my previous posts I’ve discussed how to fetch data from Twitter and massage it into the form that is [...]
