Discussion Graph Tool

Established: April 25, 2014

  • Analyzing Mood of Product Reviews

    This walkthrough focuses on answering the question: How does mood (joviality, anger, guilt, …) correlate with product review score? Does this vary by gender? As a bonus, see how to extract a graph of products based on their common reviewers. Read the step-by-step.

     

    In this walkthrough, we will be working with Amazon review data for fine food products. First, we are going to ask the question, “what are the moods associated with positive and negative reviews?” Then, we will go a little deeper into the data and see how the mood distributions differ based on the gender of the reviewer, and also suggest other explorations.

    Through this example, we will introduce the basic concepts and commands of a DGT script. We’ll show how to load data, extract fields and derived features from social media, and project and aggregate the results.

    Getting the Discussion Graph Tool

    Step 1. Download the Discussion Graph Tool (DGT)

    If you haven’t already, download and install the discussion graph tool. The rest of this walkthrough will assume that you have installed the tool and added it to your executable path.

    To double-check the installation, open a new command-line window and type the command “dgt --help”. You should see the following output:

    >dgt --help
    Discussion Graph Tool Version 0.5
    More info: http://approjects.co.za/?big=en-us/research/project/discussion-graph-tool/?locale=zh-cn
    Contact: discussiongraph@microsoft.com
    Usage: dgt.exe filename.dgt [options]
    Options:
      --target=local|…       Specify target execution environment.
      --config=filename.xml  Specify non-default configuration file

    Step 2. Create a new directory for this walkthrough. Here, we’ll use the directory E:\dgt-sample

    >mkdir e:\dgt-sample

     Getting the Data

    Before we start to write our first script, let’s get some data to analyze. We’ll be using Amazon review data collected by McAuley and Leskovec. This dataset includes over 500K reviews of 74K food-related products. Each review record includes a product id, user id, user name, review score, helpfulness rating, timestamp and both review and summary text. The user names are often real names, and review scores are integers on a scale from 1 to 5.

    Step 3. Download finefoods.txt.gz from the Stanford Network Analysis Project’s data archive. Save the file to E:\dgt-sample.

    e:\> cd e:\dgt-sample
    e:\dgt-sample> dir
     Volume in drive E is DISK
     Volume Serial Number is AAAA-AAAA

     Directory of E:\dgt-sample

    06/10/2014  11:17 AM    <DIR>          .
    06/10/2014  11:17 AM    <DIR>          ..
    06/10/2014  11:16 AM       122,104,202 finefoods.txt.gz
                   1 File(s)    122,104,202 bytes
                   2 Dir(s)  45,007,622,144 bytes free

    Writing the Script

    There are 4 basic commands we will use in our script: LOAD for loading data; EXTRACT for extracting features from the raw data; PROJECT for projecting specific relationships and context from the raw data; and OUTPUT for saving the result to a file. Let’s take things step-by-step.

    Step 4. Create a new file mood-reviews.dgt. Use notepad.exe, emacs, vi or your favorite text editor.

    e:\dgt-sample> notepad mood-reviews.dgt

    Step 5. LOAD the data.

    The first command in the script loads the data file. The reviews we downloaded are in a multi-line record format, where each line in the file represents a key-value field of a record, and records are separated by blank lines. The LOAD Multiline() command will parse this data file. Add the following line as the first command in the script file:

    LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");

    Since the multi-line format naturally embeds the schema within the data file, we don’t have to specify it in the LOAD command. There are some spurious newlines in the finefoods.txt.gz data, so we need to set the ignoreErrors flag to true. This will tell DGT to ignore data that is misformatted.
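    For reference, here’s roughly what one record in finefoods.txt looks like (abridged; the key names follow the SNAP dataset’s documented format, though the values shown here are only illustrative):

    product/productId: B001E4KFG0
    review/userId: A3SGXH7AUHU8GW
    review/profileName: delmartian
    review/helpfulness: 1/1
    review/score: 5.0
    review/time: 1303862400
    review/summary: Good Quality Dog Food
    review/text: I have bought several of the Vitality canned dog food products and ...

    Note how the slash-separated keys map onto the underscore names DGT uses (review/text becomes review_text, review/profileName becomes review_profileName); we’ll rely on this in the next step.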

    Step 6. EXTRACT higher-level features from the raw data

    Add the following line as the second command in the script file:

    EXTRACT AffectDetector(field:"review_text"),
            Gender(field:"review_profileName"),
            review_score;

    This EXTRACT statement generates 3 higher-level features:

      • The AffectDetector() call infers the affect, or mood, of a text. The field argument tells it which of the raw fields to analyze. We’ll choose the long review field but could just as easily have selected the summary field. If you don’t pass a field argument, then the AffectDetector() extractor will by default look for a field named “text” in the raw data.
      • The Gender() call infers the gender of the author, based on the author’s first name. The field argument tells it which field includes the author’s name. If you don’t pass a field argument, then the Gender() extractor will by default look for a field named “username” in the raw data.
      • By naming the review_score field—without parentheses—we tell the script to pass the review_score field through without modification.
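    As an aside, the field arguments above are only needed because this dataset’s field names differ from the extractors’ defaults. On data that did have fields named “text” and “username”, the same extraction could lean on the defaults (a sketch, not part of this walkthrough’s script):

    EXTRACT AffectDetector(), Gender(), review_score;

    We’ll use this default-field style for real in the Twitter walkthrough below.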

    A note on naming outputs and inputs: By default, EXTRACT, PROJECT and OUTPUT commands operate on the results of the previous statement. You can also explicitly name the results of commands. To do so, use the “var x =” notation to assign results to a variable, then add “FROM x” to later commands. For example:

    var finefoodsdata = LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
    EXTRACT AffectDetector(field:"review_text"), Gender(field:"review_profileName"), review_score FROM finefoodsdata;

    Step 7. PROJECT the data to focus on the relationships of importance

    Now we tell the script what relationships we care about. Often, we’ll be using DGT to extract a graph of co-occurrence relations from a set of data. In this first example, we’re going to ask for a simpler result set, essentially using DGT as a simple aggregator or “group by” style function. Add the following line to the script:

    PROJECT TO review_score;

    By projecting to “review_score”, we are telling DGT to build a co-occurrence graph among review scores. By default, DGT assumes the co-occurrence relationships are defined by the co-occurrence of values within the same record. Since every record in this dataset has at most one review score, there are no co-occurrence relationships, and the resulting graph is simply a degenerate graph of 5 isolated nodes (one for each score from 1 to 5). For each of these nodes, DGT aggregates the affect and gender information that we extracted.

    Step 8. OUTPUT the results to disk

    Finally, we add the following command to the script to save the results:

    OUTPUT TO "finefoods_reviewscore_context.graph";

    If you haven’t already, now would be a good time to save your script file…  The whole script should look like this:

    LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
    EXTRACT AffectDetector(field:"review_text"),
            Gender(field:"review_profileName"),
            review_score;
    PROJECT TO review_score;
    OUTPUT TO "finefoods_reviewscore_context.graph";

    Run the Script

    Step 9. From the command line, run DGT against the script mood-reviews.dgt:

    e:\dgt-sample> dgt.exe mood-reviews.dgt

    The output file “finefoods_reviewscore_context.graph” should now be in the e:\dgt-sample directory. Each row of the output file represents a review score, since that is what we projected to in our script. Columns are tab-separated: the first column of each row is the name of the edge (or node) in the graph; the second column is the count of records seen with the given review score; and the third column is a JSON-formatted bag of data distributions for gender and affect observations.
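    To make that layout concrete, a row might look roughly like this (tab-separated; the values are elided here since this is only an illustration of the shape, so inspect your own output for the exact format):

    5	...	{"gender":{"m":...,"f":...,"u":...},"mood":{"joviality":...,"sadness":...,...}}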

    To import this data into R, Excel or other tools, we have included a command-line utility dgt2tsv.exe that can pull out specific values.  Use the following command to build a TSV file that summarizes the gender and mood for each review score:

    e:\dgt-sample> dgt2tsv.exe finefoods_reviewscore_context.graph count,gender.m,gender.f,gender.u,mood.joviality,mood.fatigue,mood.hostility,mood.sadness,mood.serenity,mood.fear,mood.guilt finefoods_reviewscore_gendermood.tsv

    Here’s a quick graph of the results, showing how mood varies across review scores.

    We see that joviality increases and sadness decreases with higher review scores. There is more hostility in lower review scores and more serenity in higher ones. And while most moods rise or fall monotonically with review score, guilt peaks in 2- and 3-star reviews.

    Further Explorations

    The design goal of DGT is to make it easy to explore the relationships embedded in social media data and capture the context of the discussions from which the relationships were inferred.

    Are the distributions of mood across review scores different for men and women? Conditioning the mood distributions on gender as well as review score gives us this information. We can do this simply by adding the gender field to our PROJECT command, as follows (the changes from the original script are in the PROJECT and OUTPUT lines):

    LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
    EXTRACT AffectDetector(field:"review_text"),
            Gender(field:"review_profileName"),
            review_score;
    PROJECT TO review_score, gender;
    OUTPUT TO "finefoods_reviewscore_gender_context.graph";

    Here’s a quick look at the results.  Here, I’ve graphed the joviality (solid line) and sadness (dashed line) for men (orange) and women (green).  We see that the general trends hold, though there are some differences that one might continue digging deeper into…

    How are products related to each other by reviewer?  For example, how many people that wrote a review of “Brand A Popcorn” also wrote about “Brand X chocolate candies”?  We can answer this question by defining a co-occurrence relationship based on user id.  That is, we’ll say that two product ids are related if the same user reviewed both products.  Here’s how we do that in the script:

    LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
    EXTRACT product_productId, review_userId;
    RELATE BY review_userId;
    PLANAR PROJECT TO product_productId AGGREGATE();
    OUTPUT TO "finefoods_products_relateby_user.graph";

    (We’ll learn more about the RELATE BY and PLANAR PROJECT commands in the next walkthroughs.) This will generate a discussion graph that connects pairs of products that were reviewed by the same person; without the RELATE BY line, DGT would only relate products co-occurring within a single record, and since each review names exactly one product, that graph would have no edges. We can convert the result into a file readable by the Gephi graph visualization tool using the dgt2gexf command:

    e:\dgt-sample> dgt2gexf.exe finefoods_products_relateby_user.graph count finefoods_products_relateby_user.gexf filterbycount=1000

    The dgt2gexf command mirrors the dgt2tsv command.  In this case, we decided to use a filterbycount option to only output edges that have at least 1000 users who have co-reviewed the pair of products.  This filter helps keep the visualization relatively manageable.

    Here’s the resulting product graph, laid out using Gephi’s Fruchterman-Reingold algorithm. Each of the clusters represents a group of food products on Amazon that are frequently co-reviewed…

  • Analyzing Twitter Hashtags

    This walkthrough focuses on Twitter data and extracting a graph of related hashtags based on co-occurrences. Read the step-by-step.

    In this walkthrough, we will be working with public stream data from Twitter. We are going to extract the hashtags used in each tweet, along with the mood and gender of its author, and then build and visualize a graph of hashtags that are used together. At the end, we suggest other explorations.

    Through this example, we will introduce the basic concepts and commands of a DGT script. We’ll show how to load data, extract fields and derived features from social media, and project and aggregate the results.

    Getting the Discussion Graph Tool

    Step 1. Download the Discussion Graph Tool (DGT)

    If you haven’t already, download and install the discussion graph tool (see the detailed installation instructions). The rest of this walkthrough will assume that you have installed the tool and added it to your executable path.

    To double-check the installation, open a new command-line window and type the command “dgt --help”. You should see the following output:

    >dgt --help
    Discussion Graph Tool Version 1.0
    More info: http://approjects.co.za/?big=en-us/research/project/discussion-graph-tool/?locale=zh-cn
    Contact: discussiongraph@microsoft.com
    Usage: dgt.exe filename.dgt [options]
    Options:
      --target=local|…       Specify target execution environment.
      --config=filename.xml  Specify non-default configuration file

    Step 2. Create a new directory for this walkthrough. Here, we’ll use the directory E:\dgt-sample

    >mkdir e:\dgt-sample

    Getting Twitter Data

    First, let’s get some data to analyze. We’ll be using Twitter data for this walkthrough.  Twitter doesn’t allow redistribution of its data, but does have an API for retrieving a sample stream of tweets.  There are a number of steps you’ll have to complete, including registering for API keys and access tokens from Twitter.  We’ve put up full instructions.

    Step 3. Install the twitter-tools package. See our instructions.

    Step 4. Download a sample of tweets. Run GatherStatusStream.bat for “a while”—press Ctrl-C to stop the download. This will generate a file (or files) called statuses.log.YYYY-MM-DD-HH, where YYYY-MM-DD-HH represents the current date and hour. The files may be compressed (indicated with a .gz file suffix).

    Each line in this file represents a tweet (*), in JSON format, and includes all available metadata about the tweet, its author, etc. (* The file also includes some other information, such as tweet deletions. There’s no need to worry about those for this walkthrough.)

    > twitter-tools-master\twitter-tools-core\target\appassembler\bin\GatherStatusStream.bat
    1000 messages received.
    2000 messages received.
    3000 messages received.
    4000 messages received.
    5000 messages received.
    6000 messages received.
    7000 messages received.
    8000 messages received.
    9000 messages received.
    10000 messages received.
    Terminate batch job (Y/N)? Y

    > dir statuses*
     Volume in drive C is DISK
     Volume Serial Number is AAAA-AAAA

     Directory of E:\dgt-sample\twitter-tools-core

    06/13/2014  12:53 PM        49,665,736 statuses.log.2014-06-13-12
                   1 File(s)     49,665,736 bytes
                   0 Dir(s)  43,039,879,168 bytes free

    Writing the Script

    As we saw in walkthrough #1, there are 4 basic commands we will use in our script: LOAD for loading data; EXTRACT for extracting features from the raw data; PROJECT for projecting specific relationships and context from the raw data; and OUTPUT for saving the result to a file. Let’s take things step-by-step.

    Step 5. Create a new file twitter-hashtags.dgt. Use notepad.exe, emacs, vi or your favorite text editor.

    e:\dgt-sample> notepad twitter-hashtags.dgt

    Step 6. LOAD the data.

    The first command in the script loads the data file. The tweets we downloaded are in a JSON-based record format, where each line in the file is a JSON-formatted record representing a single tweet. The LOAD Twitter() command can parse this file. Add the following line as the first command in the script file:

    LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");

    The Twitter data source already knows about the key fields in the Twitter JSON data file, so we don’t have to specify any more information. The twitter-tools package adds some non-JSON lines into its output, so we’ll also set the ignoreErrors flag to true. This will tell DGT to ignore misformatted lines in the input.
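    For orientation, here is a heavily abridged sketch of one status line (based on Twitter’s v1.1 JSON format; real records carry many more fields than shown):

    {"created_at":"Fri Jun 13 12:53:00 +0000 2014","text":"Go #NED! #WorldCup","user":{"name":"...","screen_name":"..."},"entities":{"hashtags":[{"text":"NED"},{"text":"WorldCup"}]}}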

    Step 7. EXTRACT higher-level features from the raw data

    Add the following line as the second command in the script file:

    EXTRACT AffectDetector(), Gender(), hashtag;

    This EXTRACT statement generates 3 higher-level features:

      • The AffectDetector() call infers the affect, or mood, of a text.  By default, the AffectDetector() looks for a field named “text” in the raw data, though we could set the “field” argument to make it look at other fields instead.
      • The Gender() call infers the gender of the author, based on the author’s first name. By default, the Gender() extractor looks for a field named “username” in the raw data.  Again, we could override this using the “field” argument.
      • By naming the hashtag field—without parentheses—we tell the script to pass the hashtag field through without modification.
    Note: The output of twitter-tools already includes hashtags, user mentions, urls and stock symbols as explicit fields parsed out of the raw text. We’ll see in the further explorations how we can use exact phrase matching and regular expression matching to pull values out of the text ourselves.

    Step 8. PROJECT the data to focus on the relationships of importance

    Now we tell the script what relationships we care about. Here, we want to extract the pair-wise co-occurrence relationships among hashtags. That is, which hashtags are used together? Add the following line to the script:

    PLANAR PROJECT TO hashtag;

    By projecting to “hashtag”, we are telling DGT to build a co-occurrence graph among hashtags. By default, DGT assumes the co-occurrence relationships are defined by the co-occurrence of values within the same record.

    In this exercise, we’re choosing to use a PLANAR PROJECT command because we’re going to visually display the resulting hashtag graph at the end of this walkthrough, and planar graphs are simply easier to render. However, it’s worth noting that the planar representation is incomplete. For example, if 3 hashtags always co-occur together, that information will be lost, because a planar graph can only record pairwise relationships. A hyper-graph can represent such complex co-occurrences, however. For this reason, the PROJECT command defaults to a hyper-graph, and we recommend using that representation if you are going to be computing on the result.
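    If you do plan to compute on the result rather than render it, the only change needed is to drop the PLANAR keyword and take the default hyper-graph projection:

    PROJECT TO hashtag;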

    Step 9. OUTPUT the results to disk

    Finally, we add the following command to the script to save the results:

    OUTPUT TO "twitter_hashtags.graph";

    If you haven’t already, now would be a good time to save your script file… The whole script should look like this:

    LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");
    EXTRACT AffectDetector(), Gender(), hashtag;
    PLANAR PROJECT TO hashtag;
    OUTPUT TO "twitter_hashtags.graph";

    Run the Script

    Step 10. From the command line, run DGT against the script twitter-hashtags.dgt:

    e:\dgt-sample> dgt.exe twitter-hashtags.dgt

    The output file “twitter_hashtags.graph” should now be in the e:\dgt-sample directory. Each row of the output file represents a relationship between a pair of hashtags, since we projected to the planar relationship between co-occurring hashtags in our script. Columns are tab-separated: the first column of each row is the name of the edge in the graph (the edge name is simply the concatenation of the two node names, in this case the two hashtags); the second column is the count of tweets seen with the pair of hashtags; and the third column is a JSON-formatted bag of data distributions for gender and affect observations.

    To import this data into visualization and analysis tools, we have included two command-line utilities dgt2tsv.exe and dgt2gexf.exe that can extract specific values into a tab-separated values (TSV) file or a Graph Exchange XML Format (GEXF) file.

    We’ll use the dgt2gexf command and visualize the result with the Gephi graph visualization tool:

    e:\dgt-sample> dgt2gexf.exe twitter_hashtags.graph count twitter_hashtags.gexf

    If your twitter sample is large, you might consider adding the option “filterbycount=N” (without the quotes) to the command line, the same option used in walkthrough #1. This will only include edges that were seen at least N times in your sample. Use an appropriate number, from 10 to 1000 or higher, depending on the size of your sample.
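    For example, to keep only edges seen at least 100 times (mirroring the option as used in walkthrough #1):

    e:\dgt-sample> dgt2gexf.exe twitter_hashtags.graph count twitter_hashtags.gexf filterbycount=100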

    Here’s the resulting hashtag graph.  Each of the clusters represents a group of hashtags that are frequently co-mentioned in our tiny sample of Twitter data…

    For clarity and fun, we’ll filter out low-frequency edges and zoom into one of the clusters of hashtags about world-cup related topics.  We see from the thickness of the edges that #NED and #ESP are the most frequently co-occurring hashtags, and each also co-occurs relatively frequently with #WorldCup.  We also see a number of people piggy-backing on the popular #worldcup hashtag with topically unrelated hashtags (#followers, #followback, #retweet, #followme)  to solicit followers and retweets.

    Further Explorations

    There are many interesting things to explore in hashtag relationships, such as the evolution of hashtag relationships over time (for example, use PROJECT TO hashtag,absoluteday;), hashtag relationships conditioned on gender (PROJECT TO hashtag,Gender();), and inspections of token distributions, moods and other features associated with hashtags and their relationships.
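    As a sketch of the first idea, a complete by-day script might look like the following (this assumes absoluteday can be projected exactly as the snippet above suggests; we haven’t exercised that field in this walkthrough, and the output filename is our own):

    LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");
    EXTRACT AffectDetector(), Gender(), hashtag;
    PROJECT TO hashtag,absoluteday;
    OUTPUT TO "twitter_hashtags_by_day.graph";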

    What are you going to explore next? Let us know what you do! My twitter handle is @emrek, or you can reach the whole team by emailing us at discussiongraph@microsoft.com. Thanks!