Discussion Graph Tool Reference Guides
-
In the discussion graph tool framework, a co-occurrence analysis consists of the following key steps:

Step 1 (LOAD): Read from a social media data source.

Step 2 (EXTRACT): Extract low-level features from individual messages.

Step 3, optional (RELATE BY): Declare the feature that defines a co-occurrence. What defines the fact that two or more features have co-occurred? By default, two features are considered to co-occur if they both occur in the same social media message.

Steps 2 and 3 implicitly define an initial discussion graph: all feature values that were seen to co-occur in the raw social media data are connected by hyper-edges to form a large, multi-dimensional hyper-graph.

Step 4, optional (WEIGHT BY): Reweight the data. By default, each social media message is weighted equally. We can change this so that the data is weighted by user, location, or another feature. For example, we might want data from every user to count equally, regardless of how many social media messages each user sent. This prevents our analyses from being dominated by users who post very frequently.

Step 5 (PROJECT): Project the initial discussion graph to focus on those relationships we care about for our analysis. For this step, the task must specify the domains we care about.

Step 6 (OUTPUT): Output results.

Step 7, optional: Often, we'll want to further analyze our results with higher-level machine learning, network analyses, and visualization techniques. This is outside the scope of DGT.

For more details on the core concepts behind discussion graphs, we recommend reading our ICWSM 2014 paper.
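For intuition, the default co-occurrence semantics of steps 3 and 5 can be sketched in plain Python. This is a toy in-memory corpus of our own, not DGT's implementation: features co-occur when they appear in the same message, and we project onto the Activity and Location domains by counting their co-occurrences.

```python
from collections import Counter
from itertools import product

# Hypothetical toy corpus standing in for a LOADed data source,
# with features already EXTRACTed into domains.
messages = [
    {"Activity": ["hiking"], "Location": ["tiger mountain"], "Author": ["Alice"]},
    {"Activity": ["hiking"], "Location": ["tiger mountain"], "Author": ["Bob"]},
    {"Activity": ["skiing"], "Location": ["snoqualmie"], "Author": ["Alice"]},
]

# RELATE BY message (the default): each message defines one co-occurrence.
# PROJECT TO Activity, Location: keep only edges among the target domains,
# aggregating observation counts across messages.
edges = Counter()
for msg in messages:
    for activity, location in product(msg["Activity"], msg["Location"]):
        edges[(activity, location)] += 1

print(edges[("hiking", "tiger mountain")])  # observed in 2 messages
```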
A note on projecting weighted data
Often, feature values are weighted. For example, the affect classifier produces a weighted feature value indicating how likely a message is to be expressing joviality, sadness, etc. (In other cases, the use of the WEIGHT BY command implicitly creates a weighted value).
When it encounters a weighted feature value in its target domains, the PROJECT TO command treats the weights as probabilities of a feature value having occurred. For example, let’s continue our analysis of activity and location mentions such as in the following message:
"I'm having fun hiking tiger mountain" tweeted by Alice on a Saturday at 10am
Let's say our mood analysis indicates that the message expresses joviality with a weight of 0.8 and serenity with a weight of 0.4, in addition to the other, discrete features:

Domain     Feature          Weighted value
Mood       Joviality        0.8
Mood       Serenity         0.4
Activity   hiking           1.0
Location   tiger mountain   1.0
Author     Alice            1.0

The two weighted features are interpreted as independent probabilities. That is, there is an 80% likelihood of this message being jovial and a 20% likelihood of not being jovial. Independently, there is a 40% likelihood of the message being serene and a 60% chance of not being serene.
If we project this single message to the relationship between location and mood (PROJECT TO Mood, Location;), this message will expand to the following 4 projected edges:

Edge                                        Weight   Metadata
Joviality and Tiger Mountain                0.48     hiking, Alice
Serenity and Tiger Mountain                 0.08     hiking, Alice
Joviality and Serenity and Tiger Mountain   0.32     hiking, Alice
(No mood) and Tiger Mountain                0.12     hiking, Alice

Of course, when analyzing a larger corpus of social messages, each message will be expanded individually and the results aggregated.
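The expansion above can be reproduced in a few lines of Python. The function name and data layout are ours, but the arithmetic follows the independent-probability interpretation described in this section: each weighted feature is treated as an independent event that either occurred or did not.

```python
from itertools import product

def expand_weighted(moods):
    """Expand independent weighted features into every present/absent
    combination, multiplying the probability of each outcome."""
    names = list(moods)
    outcomes = {}
    for included in product([True, False], repeat=len(names)):
        weight = 1.0
        present = []
        for name, inc in zip(names, included):
            weight *= moods[name] if inc else (1.0 - moods[name])
            if inc:
                present.append(name)
        label = " and ".join(present) if present else "(No mood)"
        outcomes[label] = weight
    return outcomes

edges = expand_weighted({"Joviality": 0.8, "Serenity": 0.4})
# Joviality only: 0.48, Serenity only: 0.08, both: 0.32,
# neither: 0.12 (up to floating-point rounding)
```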
-
The discussion graph tool’s scripting language currently supports the following commands.
Note that square brackets [ ] indicate optional elements of a command. Italicized terms indicate user-specified arguments, variable names, etc.
LOAD
Syntax: LOAD Datasource([arguments]);
Example: LOAD MultiLine(path:"productreviews.txt");
The LOAD command loads social media data from some datasource. The required arguments are datasource-specific. Generally, datasources require a path to the input file as well as schema information to interpret the file. See the Common things you’ll want to do section below for examples of loading TSV, Multiline record, JSON and Twitter files.
EXTRACT
Syntax: EXTRACT [PRIMARY] field|FeatureExtractor([arguments]),… [FROM varname];
Example: EXTRACT PRIMARY hashtag, Gender(), AffectDetector();
The EXTRACT command runs a series of feature extractors against the raw social media messages loaded from a data source via the LOAD command.
Extracting a field will pass through a field from the raw data unmodified.
Extracting a feature using a FeatureExtractor() will run the specified feature extractor against the social media message. Feature extractors may generate 0, 1 or more feature values for each message they process, and the domain of the feature need not match the name of the feature extractor. For example, the AffectDetector() generates features in several domains (Subjective, Mood and PosNegAffect), and other feature extractors, such as Phrases(), can generate features in custom domains.
The PRIMARY flag acts as a kind of filter on the raw social media data. EXTRACT must find at least one PRIMARY field or feature in a message, otherwise the message will be ignored. If no fields or features are marked as PRIMARY, then EXTRACT will not filter messages.
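As a sketch of this filtering behavior (hypothetical message dictionaries, not DGT internals), a message survives only if it yields at least one feature value in some PRIMARY domain:

```python
def passes_primary(message_features, primary_domains):
    """Keep a message only if it has at least one feature value in a
    PRIMARY domain; with no PRIMARY domains declared, nothing is filtered."""
    if not primary_domains:
        return True
    return any(message_features.get(d) for d in primary_domains)

assert passes_primary({"hashtag": ["#hike"]}, {"hashtag"})       # kept
assert not passes_primary({"text": ["hello"]}, {"hashtag"})      # dropped
assert passes_primary({"text": ["hello"]}, set())                # no filter
```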
FROM varname tells the EXTRACT command where to get its input data. If not specified, EXTRACT will read from the output of the previous command.
WEIGHT BY
Syntax: WEIGHT BY featureDomain[, …] [FROM varname];
Example: WEIGHT BY userid;
The WEIGHT BY command reweights the data from social media messages. By default, every social media message counts as a single observation. If we see a co-occurrence relationship occurring in 2 social media messages, then the co-occurrence relationship will have a weight of 2. We can change this using the WEIGHT BY command so that every unique user (or location or other feature value) counts as a single observation. So, for example, if a co-occurrence relationship is expressed by 2 unique users, then it will have a weight of 2. Conversely, if a single user expresses 2 distinct co-occurrence relationships, each relationship will have a weight of only 0.5.
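A small Python sketch of this reweighting, assuming (as the weights quoted above imply) that each user's unit weight is split evenly across that user's distinct co-occurrence relationships:

```python
from collections import Counter, defaultdict

def weight_by_user(observations):
    """observations: list of (userid, relationship) pairs.  Each user
    contributes a total weight of 1, split evenly across the distinct
    relationships that user expresses."""
    per_user = defaultdict(set)
    for user, rel in observations:
        per_user[user].add(rel)
    weights = Counter()
    for rels in per_user.values():
        for rel in rels:
            weights[rel] += 1.0 / len(rels)
    return weights

w = weight_by_user([
    ("alice", ("hiking", "tiger mountain")),
    ("bob",   ("hiking", "tiger mountain")),
    ("carol", ("hiking", "tiger mountain")),
    ("carol", ("skiing", "snoqualmie")),   # carol splits: 0.5 to each
])
# ("hiking", "tiger mountain") gets 1 + 1 + 0.5 = 2.5
```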
Note that we can WEIGHT BY one feature but RELATE BY another feature.
RELATE BY
Syntax: RELATE BY featureDomain [FROM varname];
Example: RELATE BY userid;
The RELATE BY command declares the domain that defines a co-occurrence relationship. All features that co-occur with the same feature value in this domain are considered to have co-occurred.
FROM varname tells the RELATE BY command where to get its input data. If not specified, RELATE BY will read from the output of the previous command.
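For intuition, relating by a domain amounts to grouping features by their value in that domain (hypothetical rows, not DGT internals). With RELATE BY userid, features from different messages by the same user co-occur:

```python
from collections import defaultdict

# Each row holds the features extracted from one message.
rows = [
    {"userid": "alice", "hashtag": "#hike"},
    {"userid": "alice", "hashtag": "#views"},
    {"userid": "bob",   "hashtag": "#ski"},
]

# RELATE BY userid: all features seen with the same userid value are
# considered to have co-occurred, even across messages.
groups = defaultdict(set)
for row in rows:
    groups[row["userid"]].add(row["hashtag"])
# groups["alice"] holds {"#hike", "#views"}: one co-occurrence relationship
```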
Note that we can WEIGHT BY one feature but RELATE BY another feature.
PROJECT
Syntax: PROJECT TO [featureDomain, …] [FROM varname];
Variant: PLANAR PROJECT TO [featureDomain, …] [FROM varname];
Variant: PLANAR BIPARTITE PROJECT TO [featureDomain, …] [FROM varname];
Example: PROJECT TO hashtag;
The PROJECT TO command will project an initial hyper-graph to focus on only relationships among the specified feature domains. That is, only edges that connect 1 or more nodes in the specified domains will be kept, and any nodes in other feature domains will be removed from the structure of the graph. By default, the PROJECT TO command generates a hyper-graph. This means that nodes that do not co-occur with other nodes will still be described by a degenerate 1-edge, and if many nodes simultaneously co-occur together, their relationship will be described by a k-edge (where k is the number of co-occurring nodes).
Often, especially for ease of visualization, it is useful to restrict the discussion graph to be a planar graph (where every edge in the graph connects exactly 2 nodes). The PLANAR PROJECT TO command achieves this. All hyper-edges will be decomposed and re-aggregated into their corresponding 2-edges.
Furthermore, it can be useful to restrict the graph to be bipartite, where only edges that cross domains are kept. For example, we may only care about the relationship between users and the hashtags they use, and not care about the relationship among hashtags themselves. The PLANAR BIPARTITE PROJECT TO command achieves this. Semantically, this is the equivalent of doing a planar projection and then dropping all edges that connect nodes in the same domain.
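The decomposition into 2-edges and the bipartite filter can be sketched as follows (the (domain, value) node representation is our own choice, not DGT's internal one):

```python
from itertools import combinations

# One hyper-edge connecting three nodes, each a (domain, value) pair.
hyper_edge = [("user", "alice"), ("hashtag", "#hike"), ("hashtag", "#views")]

# PLANAR: decompose the hyper-edge into all of its constituent 2-edges.
planar = list(combinations(hyper_edge, 2))  # 3 two-edges

# PLANAR BIPARTITE: additionally drop edges within a single domain,
# here the #hike / #views edge.
bipartite = [(a, b) for a, b in planar if a[0] != b[0]]
```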
MERGE
Syntax: MERGE varname1,varname2[,…];
Example: MERGE MentionAndUserGraph,HashTagAndUserGraph;
The MERGE command overlays two discussion graphs atop each other. Nodes with the same feature domain and values will be merged.
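A sketch of this overlay, under the assumption that the observation counts of coinciding edges add up (the exact aggregation rule is not spelled out here):

```python
from collections import Counter

# Two discussion graphs as edge -> observation-count maps (our sketch).
mention_graph = Counter({("alice", "#hike"): 3})
hashtag_graph = Counter({("alice", "#hike"): 2, ("bob", "#ski"): 1})

# MERGE: overlay the graphs; edges over the same nodes coincide,
# and their counts combine.
merged = mention_graph + hashtag_graph
```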
OUTPUT
Syntax: OUTPUT TO “filename.graph” [FROM varname];
Example: OUTPUT TO “mentions.graph”;
The OUTPUT TO command saves a discussion graph to the specified file.
Files are saved in DGT's native format. This format consists of 3 tab-separated columns. The first column is the edge identifier: the comma-separated list of nodes connected by this edge. The second column is the number of times this co-occurrence relationship was observed to occur. The third column is a JSON-formatted representation of the context of the relationship or, in other words, the distribution of feature values conditioned on the co-occurrence relationship.
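A record in this format might be parsed as follows; the sample line is our own invention following the three-column description above, not actual DGT output:

```python
import json

# One line of DGT's native format: edge id \t count \t JSON context.
line = 'hiking,tiger mountain\t2\t{"Author": {"Alice": 1, "Bob": 1}}'

nodes_col, count_col, context_col = line.split("\t")
nodes = nodes_col.split(",")       # nodes connected by this edge
count = int(count_col)             # times the relationship was observed
context = json.loads(context_col)  # conditional feature-value distribution
```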
Naming variables
We can assign the result of commands to variables, and use these variables in later commands:
Syntax:
var x = COMMAND1;
COMMAND2 FROM x;
Example:
var reviewData = LOAD Multiline(path:"finefoods.tar.gz");
var reviewFeatures = EXTRACT AffectDetector(),reviewscore FROM reviewData;
-
Here’s a current list of feature extractors included in the discussion graph tool release.
AffectDetector() - Infers mood from text.
  Arguments:
    field: input field to analyze (default='text')
  Output domains:
    Mood: weights for 7 moods (joviality, sadness, guilt, fatigue, hostility, serenity, fear)
    PosNeg: aggregation of positive/negative affects

Gender() - Infers gender from user names.
  Arguments:
    field: input field to analyze (default='username')
    discrete: whether to output discrete or weighted gender values (default='true')
  Output domain:
    gender: m=male, f=female, u=unknown

GeoPoint() - Extracts explicit lat-lon coordinates.
  Arguments:
    field: input field to analyze (default='geopoint')
    rounding: number of decimal places to include
  Output domain:
    geopoint: lat-lon value

GeoShapeMapping() - Maps lat-lon points to feature values via a user-specified GeoJSON-formatted shapefile.
  Arguments:
    field: input field to analyze (default='geopoint'); this field should contain both lat and lon coordinates, separated by a space or comma
    latfield: input field containing the latitude value
    lonfield: input field containing the longitude value
    shapefile: GeoJSON-formatted shapefile
    propertynames: comma-separated list of property:domain pairs. The property names a property within the shapefile, and the domain specifies a custom domain name for that property. If a lat-lon point falls within a shape specified in the shapefile, the feature extractor will output all the specified properties in the propertynames list.
    unknownvalue: value to assign to a lat-lon outside of the given shapes
    Note: specify either the field argument or both the latfield and lonfield arguments.
  Output domain:
    [custom domain name]

Country() - An instance of GeoShapeMapping that maps lat-lon to country/region two-letter codes and country/region names.
  Arguments:
    field: input field to analyze (default='geopoint'); this field should contain both lat and lon coordinates, separated by a space or comma
    latfield: input field containing the latitude value
    lonfield: input field containing the longitude value
    unknownvalue: value to assign to a lat-lon outside of countries/regions
    Note: specify either the field argument or both the latfield and lonfield arguments.
  Output domains:
    fips_country, country

USAState() - An instance of GeoShapeMapping that maps lat-lon to USA subregions and states.
  Arguments:
    field: input field to analyze (default='geopoint'); this field should contain both lat and lon coordinates, separated by a space or comma
    latfield: input field containing the latitude value
    lonfield: input field containing the longitude value
    unknownvalue: value to assign to a lat-lon outside of US states
    Note: specify either the field argument or both the latfield and lonfield arguments.
  Output domains:
    USA_subregion, USA_state, USA_fips

CountyFIPS() - An instance of GeoShapeMapping that maps lat-lon to US county names and FIPS codes.
  Arguments:
    field: input field to analyze (default='geopoint'); this field should contain both lat and lon coordinates, separated by a space or comma
    latfield: input field containing the latitude value
    lonfield: input field containing the longitude value
    unknownvalue: value to assign to a lat-lon outside of US counties
    Note: specify either the field argument or both the latfield and lonfield arguments.
  Output domains:
    countygeoid, countyname

Time() - Extracts various temporal features.
  Arguments:
    field: input field to analyze (default='creationdate')
    options: list of time features to extract: absoluteminute, absolutehour, absoluteday, absoluteweek, monthofyear, dayofweek, hourofday (default is to output all fields)
    format: 'unix' or 'ticks' (default='unix')
  Output domains:
    absoluteminute, absolutehour, absoluteday, absoluteweek, monthofyear, dayofweek, hourofday

ProfileLocation() - Maps geographic regions from user profile locations with a user-specified mapping file.
  Arguments:
    field: input field to analyze (default='userlocation')
    domain: set custom output domain
    mappingfile: model for mapping from user location names to geographic locations. DGT comes with a mapping file for major international metropolitan areas, and United States country regions and divisions.
    unknownvalue: value to assign to unrecognized profile locations
  Output domain:
    [custom domain name]

ProfileLocationToCountry() - Maps user profile locations to 2-letter country/region FIPS codes.
  Arguments:
    field: input field to analyze (default='userlocation')
    unknownvalue: value to assign to unrecognized profile locations
  Output domain:
    country

ProfileLocationToCountryName() - Maps user profile locations to country/region names.
  Arguments:
    field: input field to analyze (default='userlocation')
    unknownvalue: value to assign to unrecognized profile locations
  Output domain:
    countryname

ProfileLocationToUSASubregion() - Maps user profile locations to subregions of the USA (e.g., Pacific, Mid-Atlantic).
  Arguments:
    field: input field to analyze (default='userlocation')
    unknownvalue: value to assign to unrecognized profile locations
  Output domain:
    usa_subregion

ProfileLocationToUSAState() - Maps user profile locations to US states.
  Arguments:
    field: input field to analyze (default='userlocation')
    unknownvalue: value to assign to unrecognized profile locations
  Output domain:
    usa_state

ProfileLocationToUSACounty() - Maps user profile locations to US county FIPS codes.
  Arguments:
    field: input field to analyze (default='userlocation')
    unknownvalue: value to assign to unrecognized profile locations
  Output domain:
    usa_county

ProfileLocationToUSACountyName() - Maps user profile locations to US county names.
  Arguments:
    field: input field to analyze (default='userlocation')
    unknownvalue: value to assign to unrecognized profile locations
  Output domain:
    usa_countyname

ProfileLocationToMetroArea() - Maps user profile locations to major metropolitan areas.
  Arguments:
    field: input field to analyze (default='userlocation')
    unknownvalue: value to assign to unrecognized profile locations
  Output domain:
    metroarea

ExactPhrases() - Matches specific phrases in a given list or mapping file.
  Arguments:
    field: input field to analyze (default='text')
    domain: set custom output domain
    accept: a comma-separated list of phrases to match
    acceptfile: a text file listing phrases; use a tab-separated two-column file to specify canonical forms for matched phrases
  Output domain:
    [custom domain name]

Regex() - Matches regular expressions.
  Arguments:
    field: input field to analyze
    domain: set custom output domain
    regex: the regular expression to match against text
  Output domain:
    [custom domain name]

Tokens() - Extracts unigram tokens.
  Arguments:
    field: input field to analyze
    domain: set custom output domain
    stopwordsfile: file of tokens to ignore (default=none)
    porter: use porter stemming (default='false')
  Output domain:
    [custom domain name]