Writing the Script
As we saw in walkthrough #1, there are four basic commands we will use in our script: LOAD, for loading data; EXTRACT, for extracting features from the raw data; PROJECT, for projecting specific relationships and context from the raw data; and OUTPUT, for saving the result to a file. Let's take things step by step.
Step 5. Create a new file, twitter-hashtags.dgt, using notepad.exe, emacs, vi, or your favorite text editor.
Step 6. LOAD the data.
The first command in the script loads the data file. The tweets we downloaded are in a JSON-based record format: each line in the file is a JSON-formatted key-value field of a record, and records are separated by blank lines. The LOAD Twitter() command can parse this file. Add the following line as the first command in the script file:
LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");
The Twitter data source already knows about the key fields in the Twitter JSON data file, so we don’t have to specify any more information. The twitter-tools utility adds some non-JSON lines to its output, so we also set the ignoreErrors flag to true, which tells DGT to ignore misformatted lines in the input.
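To make the record format concrete, here is a hypothetical two-record fragment in the style described above (the field names and values are made up for illustration, not actual Twitter output):

```
"text": "Goal! #NED #ESP"
"username": "Alice"

"text": "What a match! #WorldCup"
"username": "Bob"
```

Each line holds one JSON-formatted key-value field of a record, and the blank line marks the boundary between the two records.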
Step 7. EXTRACT higher-level features from the raw data
Add the following line as the second command in the script file:
EXTRACT AffectDetector(), Gender(), hashtag;
This EXTRACT statement generates 3 higher-level features:
- The AffectDetector() call infers the affect, or mood, of a text. By default, the AffectDetector() looks for a field named “text” in the raw data, though we could set the “field” argument to make it look at other fields instead.
- The Gender() call infers the gender of the author, based on the author’s first name. By default, the Gender() extractor looks for a field named “username” in the raw data. Again, we could override this using the “field” argument.
- By naming the hashtag field—without parentheses—we tell the script to pass the hashtag field through without modification.
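To give a feel for what first-name-based gender inference does, here is a toy sketch in Python. This is purely illustrative: the name table is a tiny made-up sample, and DGT's actual Gender() extractor is certainly more sophisticated than a dictionary lookup.

```python
# Toy illustration of first-name-based gender inference,
# in the spirit of the Gender() extractor. NOT DGT's implementation;
# the name lists below are tiny, hypothetical samples.
NAME_GENDER = {
    "emma": "female", "olivia": "female", "sophia": "female",
    "liam": "male", "noah": "male", "mason": "male",
}

def infer_gender(username):
    """Guess gender from the first token of a username, if known."""
    first = username.strip().split()[0].lower()
    return NAME_GENDER.get(first, "unknown")
```

A lookup like this only works when the first token of the username is a recognizable given name; everything else falls back to "unknown", which is why the extractor's output is best treated as a noisy signal rather than ground truth.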
Step 8. PROJECT the data to focus on the relationships of importance
Now we tell the script what relationships we care about. Here, we want to extract the pair-wise co-occurrence relationships among hashtags: that is, which hashtags are used together?
PLANAR PROJECT TO hashtag;
By projecting to “hashtag”, we are telling DGT to build a co-occurrence graph among hashtags. By default, DGT defines the co-occurrence relationships by the co-occurrence of values within the same record.
In this exercise, we’re using a PLANAR PROJECT command because we’re going to visually display the resulting hashtag graph at the end of this walkthrough, and planar graphs are simply easier to render. However, it’s worth noting that the planar representation is incomplete: if three hashtags always co-occur together, for example, that information is lost, because a planar graph can only represent pairwise relationships. A hyper-graph can represent such complex co-occurrences. For this reason, the PROJECT command defaults to a hyper-graph, and we recommend that representation if you are going to be computing on the result.
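Conceptually, the difference between the two projections can be sketched in a few lines of Python. The records below are made up, and this is an illustration of the idea, not DGT's implementation:

```python
from itertools import combinations
from collections import Counter

# Hypothetical records: the set of hashtags seen in each tweet.
records = [
    {"#NED", "#ESP", "#WorldCup"},
    {"#NED", "#ESP"},
    {"#WorldCup"},
]

# Planar projection: count each unordered PAIR of co-occurring hashtags.
planar = Counter()
for tags in records:
    for a, b in combinations(sorted(tags), 2):
        planar[(a, b)] += 1

# Hyper-graph projection: count each full co-occurring SET of hashtags.
hyper = Counter(frozenset(tags) for tags in records)
```

In this toy example the planar projection records that #NED and #ESP co-occurred twice, but its pairwise counts alone cannot tell you whether #NED, #ESP, and #WorldCup ever appeared together in a single tweet; the hyper-graph keeps each full co-occurring set distinct.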
Step 9. OUTPUT the results to disk
Finally, we add the following command to the script to save the results:
OUTPUT TO "twitter_hashtags.graph";
If you haven’t already, now would be a good time to save your script file… The whole script should look like this:
LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");
EXTRACT AffectDetector(), Gender(), hashtag;
PLANAR PROJECT TO hashtag;
OUTPUT TO "twitter_hashtags.graph";
Run the Script
Step 10. From the command line, run DGT against the script twitter-hashtags.dgt.
The output file “twitter_hashtags.graph” should now be in the e:\dgt-sample directory. Each row of the output file represents a relationship between a pair of hashtags, since we projected to the planar relationship between co-occurring hashtags in our script. Columns are tab-separated. The first column of each row is the name of the edge in the graph (the edge name is simply the concatenation of the two node names, in this case the two hashtags); the second column is the count of tweets seen with the pair of hashtags; and the third column is a JSON-formatted bag of data distributions for gender and affect observations.
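Given that row format, each line can be pulled apart with a few lines of Python. The sample row below is fabricated for illustration; in particular, the separator inside the edge name and the exact shape of the JSON bag are assumptions, not documented DGT output:

```python
import json

def parse_graph_row(row):
    """Split one tab-separated output row into its three columns:
    edge name, co-occurrence count, and the JSON bag of distributions."""
    edge, count, bag = row.rstrip("\n").split("\t", 2)
    return edge, int(count), json.loads(bag)

# Hypothetical row, made up to match the format described above.
sample = '#NED,#ESP\t42\t{"gender": {"male": 30, "female": 12}}'
edge, count, bag = parse_graph_row(sample)
```

A loop over the file applying this function is enough to feed the edges into your own analysis code, independent of the bundled conversion utilities.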
To import this data into visualization and analysis tools, we have included two command-line utilities dgt2tsv.exe and dgt2gexf.exe that can extract specific values into a tab-separated values (TSV) file or a Graph Exchange XML Format (GEXF) file.
We’ll use the dgt2gexf command and visualize the result with the Gephi graph visualization tool.
If your twitter sample is large, you might consider adding the option “filtercount=N” (without the quotes) to the command-line. This will only include edges that were seen at least N times in your sample. Use an appropriate number, from 10 to 1000 or higher, depending on the size of your sample.
Here’s the resulting hashtag graph. Each of the clusters represents a group of hashtags that are frequently co-mentioned in our tiny sample of Twitter data…
For clarity and fun, we’ll filter out low-frequency edges and zoom into one of the clusters of hashtags about World Cup-related topics. We see from the thickness of the edges that #NED and #ESP are the most frequently co-occurring hashtags, and each also co-occurs relatively frequently with #WorldCup. We also see a number of people piggy-backing on the popular #WorldCup hashtag with topically unrelated hashtags (#followers, #followback, #retweet, #followme) to solicit followers and retweets.
Further Explorations
There are many interesting things to explore in hashtag relationships, such as the evolution of hashtag relationships over time — for example, use PROJECT TO hashtag,absoluteday; — hashtag relationships conditioned on gender — PROJECT TO hashtag,Gender(); — and inspections of token distributions, moods and other features associated with hashtags and their relationships.
What are you going to explore next? Let us know what you do! My twitter handle is @emrek, or you can reach the whole team by emailing us at discussiongraph@microsoft.com. Thanks!