Social network analysis 1 + 2
Petr Ocelík
ESS418 Research Methods in Social Science
9th October 2015
Outline
• Empirical instances of networks
• History of SNA
• Graph theory
• Data organization
• Mini-case study
• R: working with network data
Introduction
• Networks are everywhere.
• Social sciences deal, by definition, with social
interactions.
• SNA allows us to collect and analyze relational
data.
Introduction
• The main assumption: the world is organized relationally.
“...transactions, interactions, social ties,
and conversations constitute
central stuff of social life.”
(Tilly 2008: 7)
• node
• edge
Terminology (Guclu 2012)
points      lines
vertices    edges, arcs        math
nodes       links              computer science
sites       bonds              physics
actors      ties, relations    sociology
History of SNA
• SNA dates back to the 1930s.
• Its beginnings are mostly connected with the work of Jacob Moreno.
• SNA then developed on an ad hoc basis in
separate research centers.
• The structural approach became widely
recognized in the 1970s (Granovetter 1973).
• The revolution of social physics in the 1990s:
– Watts and Strogatz (1998): Small-world networks
– Barabasi and Albert (1999): Scale-free networks
Jacob Moreno
Mark Granovetter
Small-world network
Scale-free network
Graph theory
• Graph theory investigates graphs, i.e.
mathematical structures that model pairwise
relations between objects.
• A graph (G) is an ordered pair consisting of
a set of vertices (V) and a set of (undirected)
edges (E) or (directed) arcs (A).
• G = (V, E ∪ A)
Graph theory
• A network consists of a set of nodes and a
set of edges.
node
edge
• network = graph
Graph theory
• order = # of nodes
• size = # of edges
• degree = # of connections of an individual node
Graph theory
• order = 5
• size = 7
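A minimal base R sketch of order, size and degree. The edge list is a hypothetical 5-node, 7-edge undirected graph chosen to match the counts above; it is not necessarily the graph shown in the figure.

```r
# Hypothetical undirected graph with 5 nodes and 7 edges (not the slide's figure).
edges <- rbind(c(1, 2), c(1, 3), c(1, 4), c(2, 3), c(2, 5), c(3, 4), c(4, 5))

n_nodes <- 5
adj <- matrix(0, nrow = n_nodes, ncol = n_nodes)
adj[edges] <- 1            # mark each (row, column) pair from the edge list
adj[edges[, 2:1]] <- 1     # mirror it, because the graph is undirected

graph_order <- nrow(adj)       # number of nodes
graph_size  <- sum(adj) / 2    # every undirected edge is stored twice
node_degree <- rowSums(adj)    # number of connections of each node

graph_order   # 5
graph_size    # 7
node_degree   # 3 3 3 3 2
```

The degrees sum to 14, twice the size, because each edge contributes to the degree of both of its endpoints.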
Graph theory
• A complete graph is a maximally connected
graph: every pair of nodes is joined by an edge.
• An empty graph is a graph with no edges.
Graph theory: relations
undirected directed
binary
weighted
Graph theory
• Network topology is defined by two main
concepts: connectivity and centrality.
• Connectivity describes interconnectedness of
nodes in network (focus on flows).
• Centrality describes location of nodes in network
(focus on positions).
Graph theory
• Step: a move along one edge.
• Walk: a sequence of steps in the network.
• Path: a walk in which no node and no edge
occurs more than once.
• Geodesic: the shortest path that connects two nodes.
• Distance between two nodes = length of the geodesic.
• Diameter: the longest distance between any two
nodes in the graph.
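The definitions above can be sketched in base R with a breadth-first search. The graph and the helper name `geodesic_from` are hypothetical, and distance is counted in steps on an unweighted, undirected graph.

```r
# Hypothetical undirected graph with edges 1-2, 2-3, 3-4, 2-4, 4-5.
adj <- matrix(0, 5, 5)
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(2, 4), c(4, 5))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# Breadth-first search: geodesic distance (in steps) from `source` to every node.
geodesic_from <- function(adj, source) {
  dist <- rep(Inf, nrow(adj))   # Inf marks nodes that have not been reached
  dist[source] <- 0
  frontier <- source
  while (length(frontier) > 0) {
    # all nodes adjacent to at least one node of the current frontier
    neighbours <- which(colSums(adj[frontier, , drop = FALSE]) > 0)
    new_nodes <- neighbours[is.infinite(dist[neighbours])]
    dist[new_nodes] <- dist[frontier[1]] + 1   # frontier nodes share one distance
    frontier <- new_nodes
  }
  dist
}

dist_matrix <- t(sapply(seq_len(nrow(adj)), function(i) geodesic_from(adj, i)))
graph_diameter <- max(dist_matrix[is.finite(dist_matrix)])

dist_matrix[1, 5]   # 3, via the geodesic 1-2-4-5
graph_diameter      # 3
```

Unreachable pairs keep the value `Inf`, so the diameter here is taken over reachable pairs only.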
Graph theory
• Two nodes joined by an edge are called adjacent.
• An edge is called incident to the nodes it links.
• All nodes adjacent to a given node form its
neighbourhood.
Graph theory
• A subgraph is any subset of the nodes and edges of
a graph.
• A component is a maximal connected subgraph.
Graph theory
• Reachability is given by the existence of a
path between two nodes.
• Isolate is a node without any connection, i.e.
node with zero degree.
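A small base R sketch of reachability, components and isolates. The graph is hypothetical; reachability is computed by repeatedly extending known paths until nothing changes, which is simple but not the most efficient approach.

```r
# Hypothetical undirected graph: 1-2-3 form one component, 4-5 another, 6 is an isolate.
adj <- matrix(0, 6, 6)
edges <- rbind(c(1, 2), c(2, 3), c(4, 5))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# reach[i, j] is 1 if some path connects i and j (every node reaches itself).
reach <- (adj + diag(nrow(adj)) > 0) * 1
repeat {
  longer <- (reach %*% reach > 0) * 1   # join two known paths into a longer one
  if (all(longer == reach)) break
  reach <- longer
}

# Label each component by its lowest-numbered member.
component_id <- apply(reach, 1, function(row) which(row > 0)[1])
n_components <- length(unique(component_id))
isolates <- which(rowSums(adj) == 0)    # nodes with zero degree

reach[1, 3]     # 1: node 3 is reachable from node 1
reach[1, 4]     # 0: nodes 1 and 4 lie in different components
n_components    # 3 (the isolate counts as its own component)
isolates        # 6
```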
Graph theory
• A structural hole is a lack of connection
between two nodes or subgraphs.
• A cutpoint is a node whose removal creates a
structural hole.
• A bridge is an edge whose removal creates a
structural hole.
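One way to test for cutpoints and bridges is to remove each node or edge in turn and count components. The sketch below assumes that "creates a structural hole" means "increases the number of components"; the slides do not give an operational test, and both the graph and the helper `count_components` are hypothetical.

```r
# Count the connected components of an undirected 0/1 adjacency matrix.
count_components <- function(adj) {
  label <- rep(NA_integer_, nrow(adj))
  current <- 0L
  for (start in seq_len(nrow(adj))) {
    if (!is.na(label[start])) next      # already assigned to a component
    current <- current + 1L
    frontier <- start
    while (length(frontier) > 0) {
      label[frontier] <- current
      neighbours <- which(colSums(adj[frontier, , drop = FALSE]) > 0)
      frontier <- neighbours[is.na(label[neighbours])]
    }
  }
  current
}

# Hypothetical graph: triangle 1-2-3 with a tail 3-4-5.
adj <- matrix(0, 5, 5)
edges <- rbind(c(1, 2), c(2, 3), c(1, 3), c(3, 4), c(4, 5))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1
base_count <- count_components(adj)

# Cutpoint: deleting the node leaves more components than before.
is_cutpoint <- sapply(seq_len(nrow(adj)), function(v) {
  count_components(adj[-v, -v, drop = FALSE]) > base_count
})

# Bridge: deleting the edge leaves more components than before.
is_bridge <- apply(edges, 1, function(e) {
  reduced <- adj
  reduced[e[1], e[2]] <- 0
  reduced[e[2], e[1]] <- 0
  count_components(reduced) > base_count
})

which(is_cutpoint)                 # 3 4
edges[is_bridge, , drop = FALSE]   # the edges 3-4 and 4-5
```

The triangle edges are not bridges, because each of them can be bypassed through the third node.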
Graph theory
• Inclusivity is given by # of connected nodes
over total # of nodes in network.
• Density is given by # of observed connections
over total # of all possible connections in
network.
Graph theory
• Inclusivity is given by # of connected nodes
over total # of nodes in network.
• Inclusivity = 5 / 6 = 0.83
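The 5 / 6 calculation can be reproduced in base R. The graph is a hypothetical 6-node network with a single isolate, and "connected" is read as "having at least one edge".

```r
# Hypothetical 6-node undirected graph in which node 6 is an isolate.
adj <- matrix(0, 6, 6)
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

connected_nodes <- sum(rowSums(adj) > 0)     # nodes with degree of at least 1
inclusivity <- connected_nodes / nrow(adj)

connected_nodes         # 5
round(inclusivity, 2)   # 0.83
```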
Graph theory: notation
• G = graph/network
• N = # of nodes in network, n = individual node
• e = edge, g = geodesic
• i, j, … = indices (labels for selected elements)
• g_ij = geodesic between nodes i and j, n_i = node i
• k = # of selected elements (typically nodes)
• Upper case: global indicators
• Lower case: local indicators
• c_d(n_i) = node i degree centrality
• C_d(G) = graph G degree centralization
Graph theory
• Density is given by # of observed connections
(∑ e) over total # of all possible connections in
network.
• # of all possible connections in undirected
network = (N * (N – 1)) / 2
• # of all possible connections in directed
network = (N * (N – 1))
• Density (undirected): ∑ e / ((N * (N – 1)) / 2)
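The two density formulas can be written as short base R functions. The function names and example graphs are hypothetical, and both functions assume a 0/1 adjacency matrix without self-loops.

```r
density_undirected <- function(adj) {
  n <- nrow(adj)
  observed <- sum(adj > 0) / 2       # every undirected edge is stored twice
  observed / (n * (n - 1) / 2)
}

density_directed <- function(adj) {
  n <- nrow(adj)
  sum(adj > 0) / (n * (n - 1))       # every ordered pair is a possible arc
}

# Hypothetical undirected graph: 5 nodes, 7 edges.
undirected <- matrix(0, 5, 5)
edges <- rbind(c(1, 2), c(1, 3), c(1, 4), c(2, 3), c(2, 5), c(3, 4), c(4, 5))
undirected[edges] <- 1
undirected[edges[, 2:1]] <- 1

# Hypothetical directed graph: 3 nodes, arcs 1 -> 2 and 2 -> 3.
directed <- matrix(0, 3, 3)
directed[rbind(c(1, 2), c(2, 3))] <- 1

density_undirected(undirected)   # 7 / 10 = 0.7
density_directed(directed)       # 2 / 6, about 0.33
```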
Graph theory
• Density (undirected): ∑ e / ((N * (N – 1)) / 2)
• Quiz: assume that you create a network in which an edge
exists between two of you if you sit next to or diagonally from each other.
– What are the inclusivity and density of this network?
Graph theory
• A bipartite (or two-mode) network consists of
two disjoint sets of vertices (U and V).
• The connections are allowed only between these
two sets of vertices, not within them.
Graph theory
• Example: two types of nodes: (1) individuals and
(2) concepts.
• The edges are between individuals and concepts.
Graph theory
• We can do one-mode projections; i.e. we can
reconstruct one-mode networks of individuals
and concepts.
• Individuals are connected if they share at least
one concept.
• Concepts are connected if they are shared by at
least one individual.
Graph theory
• An egocentric network is the personal network of a
given individual (ego).
• The number of steps that connect a given node to ego
classifies the node into a zone.
• The first-order zone includes all directly connected
nodes (alters), the second-order zone includes all
nodes connected by two steps, etc.
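Zones can be sketched in base R as breadth-first layers around ego. The graph and the choice of node 1 as ego are hypothetical; nodes that ego cannot reach keep the zone `NA`.

```r
# Hypothetical undirected graph: 1-2, 1-3, 2-4, 3-4, 4-5; node 6 is unconnected.
adj <- matrix(0, 6, 6)
edges <- rbind(c(1, 2), c(1, 3), c(2, 4), c(3, 4), c(4, 5))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

ego <- 1
zone <- rep(NA_integer_, nrow(adj))   # NA = not reachable from ego
zone[ego] <- 0L
frontier <- ego
step <- 0L
while (length(frontier) > 0) {
  step <- step + 1L
  neighbours <- which(colSums(adj[frontier, , drop = FALSE]) > 0)
  frontier <- neighbours[is.na(zone[neighbours])]   # keep only newly reached nodes
  zone[frontier] <- step
}

which(zone == 1)   # 2 3: first-order zone (the alters)
which(zone == 2)   # 4:   second-order zone
zone               # 0 1 1 2 3 NA
```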
Graph theory
• A multiplex network consists of one set of
vertices and more than one set of edges.
• E.g.: imagine a group of people and how they are
interconnected through various social media
(Facebook, Twitter, LinkedIn etc.).
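One simple way to hold a multiplex network in base R is a list of adjacency matrices over the same node set, one matrix per layer. The people and ties below are invented for illustration.

```r
people <- c("Ann", "Bob", "Eva")   # hypothetical node set shared by all layers
empty <- matrix(0, 3, 3, dimnames = list(people, people))

facebook <- empty
facebook["Ann", "Bob"] <- facebook["Bob", "Ann"] <- 1

twitter <- empty
twitter["Ann", "Bob"] <- twitter["Bob", "Ann"] <- 1
twitter["Bob", "Eva"] <- twitter["Eva", "Bob"] <- 1

multiplex <- list(facebook = facebook, twitter = twitter)

# Number of layers in which each pair of people is tied.
layers_per_tie <- Reduce(`+`, multiplex)
# Number of edges in each layer.
edges_per_layer <- sapply(multiplex, function(layer) sum(layer) / 2)

layers_per_tie["Ann", "Bob"]   # 2
layers_per_tie["Bob", "Eva"]   # 1
edges_per_layer                # facebook 1, twitter 2
```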
Data organization
• Attributional data: individual characteristics.
– E.g.: age, income, education, GDP, TPES, etc.
• Relational data: characteristics of ties.
– E.g.: kinship ties, trade flows, conflicts, etc.
Data organization: network borders
• Network border delineation is usually problematic.
• Often no natural borders.
• Different strategies for border delineation:
– nominal (e.g. all EU member states)
– positional (e.g. all democratic states)
– realistic (e.g. all states that present themselves as
democracies)
– relational (e.g. all states that are referred to by others as
democracies)
– event-based (e.g. all states that participated in the Iraq
War)
Data organization: network sampling
• Typically it is impossible to have population
data.
• Random sampling is not appropriate – why?
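• Burt’s formula of information loss = (100 - k)%, where k is the sample
size as a percentage of the population (e.g. a 10% sample loses about
90% of the relational data).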
• Sampling methods:
– Snowballing
– Attribute based selection
Data organization: data collection
• Questionnaires / interviews
• Name generator (questionnaires)
• Observation / experiment
• Archival data
Exercise
• Define your research question.
• Define network borders and population.
• Define sampling method and data collection
technique.
Data organization
• (Social) data, attributional as well as
relational, are organized in data matrices.
• Case-by-variable matrix is a standard way of
data organization in quantitative research.
• Not appropriate for relational data.
Case-by-variable matrix
Adjacency (case-by-case) matrix
Data organization
• Adjacency matrix represents which nodes are
adjacent to other nodes.
• Incidence matrix represents the relations
between two classes of nodes.
– Rows represent one class of nodes.
– Columns represent second class of nodes.
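A minimal base R sketch of how an adjacency matrix can be built from a list of ties. The actors and ties are hypothetical; rows are read as senders and columns as receivers.

```r
# Hypothetical directed ties: A -> B, A -> C, B -> C.
ties <- data.frame(from = c("A", "A", "B"),
                   to   = c("B", "C", "C"),
                   stringsAsFactors = FALSE)

actors <- sort(unique(c(ties$from, ties$to)))
directed <- matrix(0, length(actors), length(actors),
                   dimnames = list(actors, actors))
directed[cbind(ties$from, ties$to)] <- 1   # row = sender, column = receiver

# Ignoring direction gives a symmetric (undirected) adjacency matrix.
undirected <- (directed + t(directed) > 0) * 1

directed["A", "B"]        # 1
directed["B", "A"]        # 0: the tie is not reciprocated
isSymmetric(undirected)   # TRUE
rowSums(directed)         # out-degree: A 2, B 1, C 0
colSums(directed)         # in-degree:  A 0, B 1, C 2
```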
Undirected one-mode network
Directed one-mode network
Undirected weighted one-mode network
Incidence (case-by-event) matrix
Adjacency (case-by-case) matrix
Adjacency (event-by-event) matrix
Matrix operations: one-mode projection
• We get a one-mode projection (adjacency matrix) by
multiplying the incidence matrix by its transpose.
– Transposed matrix: rows turned into columns and vice
versa.
• For cases (rows) we put transpose on the second
place.
– matrix %*% t(matrix)
• For events (columns) we put transpose on the first
place.
– t(matrix) %*% matrix
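Both projections can be run in base R on the 3 x 4 incidence matrix used in the matrix slides of this deck. The row and column labels are hypothetical, and the final dichotomization step is one common convention rather than part of the slides.

```r
# Incidence matrix: 3 cases (rows) by 4 events (columns).
incidence <- matrix(c(1, 0, 1, 1,
                      0, 0, 1, 1,
                      1, 1, 1, 0),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(paste0("case", 1:3), paste0("event", 1:4)))

case_projection  <- incidence %*% t(incidence)   # 3 x 3, cases by cases
event_projection <- t(incidence) %*% incidence   # 4 x 4, events by events

case_projection    # rows: 3 2 2 / 2 2 1 / 2 1 3
event_projection   # rows: 2 1 2 1 / 1 1 1 0 / 2 1 3 2 / 1 0 2 2

# Off-diagonal cells count shared events; the diagonal holds each case's own total.
# A binary adjacency matrix ties two cases if they share at least one event.
case_adjacency <- (case_projection > 0) * 1
diag(case_adjacency) <- 0                        # drop self-ties
```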
Matrix transposition
• Incidence matrix
1 0 1 1
0 0 1 1
1 1 1 0
• Transposition
1 0 1
0 0 1
1 1 1
1 1 0
Matrix by matrix multiplication (cases)
Dot product: first we take the first row and the first column,
(1, 0, 1, 1) and (1, 0, 1, 1); second we multiply corresponding
elements and sum up the products.
(1, 0, 1, 1) * (1, 0, 1, 1) = 1*1 + 0*0 + 1*1 + 1*1 = 3
1 0 1 1         1 0 1       3 2 2
0 0 1 1   %*%   0 0 1   =   2 2 1
1 1 1 0         1 1 1       2 1 3
                1 1 0
Matrix by matrix multiplication (events)
Dot product: first we take the first row and the first column,
(1, 0, 1) and (1, 0, 1); second we multiply corresponding elements
and sum up the products.
(1, 0, 1) * (1, 0, 1) = 1*1 + 0*0 + 1*1 = 2
1 0 1         1 0 1 1       2 1 2 1
0 0 1   %*%   0 0 1 1   =   1 1 1 0
1 1 1         1 1 1 0       2 1 3 2
1 1 0                       1 0 2 2
Exercise
• Do one-mode projections of the incidence matrix:
               Jan   Petr   Hedvika
Introduction    1     0      1
Methodology     1     1      0
Mini-case study
• Deep geological repository designed to contain nuclear
waste for hundreds of thousands of years.
• There are 7 pre-selected (candidate) localities in the
Czech Republic.
• Local opposition has occurred repeatedly since the
beginning.
• Research objective: to map how the issue is framed by
the local opposition / acceptance opinion leaders.
Discourse network
• A bipartite network that consists of actors
and concepts.
Haunss, Dietz & Nullmeier 2013: 13
Frame
• A frame is defined as a shared interpretative scheme
through which actors understand and promote a
particular version of reality (see e.g. Benford & Snow
2000).
• Actors – strategically – use frames to emphasize or
suppress particular aspects of the contested issue.
• Intuition: group of nodes (cluster) in the concept
network which are in a similar position to the rest of
the network.
(Discourse / frame) coalition
• Frame coalition is understood as a “...group of actors
that share social construct [frame].” (Hajer 1995: 43)
• Intuition: densely connected segment (community) of
the actor network.
Data and coding
• Data:
– 47 semi-standardized interviews (mayors, activists and
state officials).
• Coding:
– Corpus coded by 2 independent coders with
Krippendorff’s alpha = 0.81 (inter-rater reliability) and r = 0.79
(intra-rater reliability).
– The coded corpus contains 634 observations (38 codes).
RQDA package
• The corpus was coded in the RQDA package (Ronggui
Huang 2014).
• R package for qualitative data analysis with GUI.
• Provides basic functions of CAQDAS.
Incidence matrix
• Rows are actors (interviews): actor 1, …, actor i.
• Columns are concepts: concept 1, …, concept j.
• The ij cell indicates how many times actor i uses concept j.
Network communities
• Network community is a segment of the network
created by a set of nodes (members) that are more
densely connected internally than with the nonmembers
of the community.
community 1      14 MAYs, 1 NGO, 4 STOs
community 2      15 MAYs, 9 NGOs
community 3      3 MAYs
“community” 4    STO_046
p ≤ 0.001
correlation            0.37
p-value                0.013
density                0.31
deg. centralization    0.43
bet. centralization    0.12
state mistrust: “Man cannot
believe… that this project, or
anything else about the
project, is gonna be proper …
the rules are not set, they are
changing as it goes … they are
changing according to current
political situation, how sirs
need them, so they can move
towards their goals.”
(NGO_038: 82-87)
Reconstructed frames
Responsibility
• We consume electricity and thus produce radioactive waste.
• We (as well as the state) have a moral and legal obligation to deal with this burden.
• The repository is the only viable solution.
• By delays and opposition we transfer this responsibility to future generations.
• Opposition is irresponsible and based on emotional/irrational argumentation.
• The state has (legitimately) the last word; localities will be financially compensated.
Risk
• The siting process as well as the potential construction of the repository is
accompanied by a number of risks (environmental, economic, social, health etc.).
• We have a responsibility to future generations to preserve the localities.
• It is necessary to stop or slow down the project until another (technological)
solution is available.
Dysfunctional state
• The state is not able to deal with the issue competently and legitimately.
• The localities are not effectively involved in the selection process.
• The Working group is just a facade; the final decision depends solely on the
state.
• There is a lack of trust among stakeholders and the whole process lacks
legitimacy.
R: advantages
• Freeware
• Open source
• Worldwide active community
• Flexible and developed
R community / sources
• There is a huge number of free resources
• R package / library manuals
• R site: http://cran.r-project.org
• Community forums:
– http://stackoverflow.com
– http://www.statmethods.net
– http://www.r-bloggers.com
• Youtube videos:
https://www.youtube.com/watch?v=qHfSTRNg6jE
• Googling (often fastest)
R libraries / packages
• Library / package:
– Can be thought of as an extension that adds new
functionality.
– Libraries must be installed (once, before the first
use) and loaded (in every session).
– Sometimes there can be conflicts among libraries
(e.g. different functions with same names) – we
can unload them.
– Often there are dependencies among libraries
(some libraries use functions from other libraries).
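The install, load and unload cycle can be sketched as follows. The `tools` package is used only because it ships with R, and `igraph` is mentioned purely as an example of a package name; neither choice comes from the slides.

```r
# install.packages("igraph")   # run once per machine; "igraph" is only an example name

library(tools)                                    # load (attach) for this session
attached_after_load <- "package:tools" %in% search()

extension <- file_ext("network_data.csv")         # a function provided by tools
# package::function names the source package explicitly, which avoids
# conflicts between functions that share a name.
extension_explicit <- tools::file_ext("network_data.csv")

detach("package:tools")                           # unload the package
attached_after_detach <- "package:tools" %in% search()

attached_after_load     # TRUE
extension               # "csv"
attached_after_detach   # FALSE
```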
R: disadvantages
• Not as easily accessible as “clicking-programs”
• Data preparation could be demanding
• Could be slower for large datasets
R language
• object-oriented programming
– object: an instance of a certain data class that can be
manipulated according to a set of procedures (methods)
• functional programming
– function: a relation that associates input(s) with
output(s)
• We can define certain objects and apply
functions to them and vice versa.
Data types
• Numeric: continuous numeric data (-1, 0.5, 10.49)
• Integer: discrete numeric data (-1, 0, 1, …)
• Character: string values = "anything within quotes"
• Logical: output of a logical operation
5 > 10           # FALSE
5 < 7 | 7 > 10   # TRUE
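The types listed above can be inspected with `class()`. This is a short illustrative sketch; the example values are arbitrary.

```r
class(10.49)        # "numeric"
class(1L)           # "integer": the L suffix asks for an integer, a plain 1 is numeric
class("any text")   # "character"
class(5 > 10)       # "logical"

5 > 10              # FALSE
5 < 7 | 7 > 10      # TRUE: | is a logical OR and the first comparison holds
is.numeric(1L)      # TRUE: integers also count as numeric in R
```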
Data types: factor
• Factor: a variable that takes a limited number of discrete
values, called levels (categorical variable).
• The factor() function converts a vector of values into a factor;
its levels are always stored as character strings.
• Factors can be unordered (nominal variable) or
ordered (ordinal variable).
http://www.r-tutor.com/
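A short sketch of unordered and ordered factors, using made-up education data.

```r
education <- c("primary", "tertiary", "secondary", "primary")   # hypothetical data

# Unordered factor (nominal variable): levels are sorted alphabetically by default.
nominal <- factor(education)
levels(nominal)        # "primary" "secondary" "tertiary"
as.integer(nominal)    # 1 3 2 1: the integer codes stored behind the labels

# Ordered factor (ordinal variable): the order of the levels is given explicitly.
ordinal <- factor(education,
                  levels = c("primary", "secondary", "tertiary"),
                  ordered = TRUE)
ordinal[1] < ordinal[2]   # TRUE: primary ranks below tertiary

counts <- table(nominal)  # frequency of each level
counts[["primary"]]       # 2
```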
R: object and function
• Object:
vector <- c(1,2,3,4,5)
• Function:
fun <- function(x) { x^2 }
• Output:
fun(vector)   # 1 4 9 16 25
• Nesting:
fun_2 <- function(x) { fun(x) + 1 }
R functions
• word() indicates function
• mean(vector)
• function(argument_1, argument_2, …)
• sample(0:100, 10, replace = FALSE)
• base functions (part of the base R package)
• package functions (part of a particular package)
• user functions (user-defined functions)
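The three kinds of functions can be illustrated side by side. The object names and the `square` function are hypothetical; `median` is called with the `stats::` prefix only to show the package::function form.

```r
values <- c(2, 4, 9)

average <- mean(values)            # base function
middle  <- stats::median(values)   # package function, written as package::function
square  <- function(x) x^2         # user-defined function
squared <- square(values)

# Arguments can be passed by position or by name.
draw <- sample(0:100, size = 10, replace = FALSE)   # 10 distinct values from 0 to 100

average   # 5
middle    # 4
squared   # 4 16 81
```

The content of `draw` changes from run to run unless a seed is set with `set.seed()` first.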
R objects
• Vector
– Sequence (1-dimensional) of elements of same data type
• Matrix
– 2-dimensional rectangular collection of elements of same
data type
– Array: n-dimensional matrix.
• List
– Vector that can contain elements of different data types
• Data frame
– List of vectors of equal length
– Table data
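Each object type above can be created in one line. All names and values are invented for illustration.

```r
ages <- c(31, 45, 27)                                # vector: one data type
ties <- matrix(c(0, 1, 1,
                 1, 0, 0,
                 1, 0, 0), nrow = 3, byrow = TRUE)   # matrix: 2 dimensions
layers <- array(0, dim = c(3, 3, 2))                 # array: here two 3 x 3 slices
actor <- list(id = "MAY_001", age = 45, active = TRUE)   # list: mixed data types
actors <- data.frame(id = c("a", "b", "c"), age = ages)  # data frame: table data

length(ages)      # 3
dim(ties)         # 3 3
dim(layers)       # 3 3 2
actor$age         # 45
nrow(actors)      # 3
is.list(actors)   # TRUE: a data frame is a list of equal-length vectors
```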
Vector
http://www.r-tutor.com/
Matrix
http://www.r-tutor.com/
List
http://www.r-tutor.com/
Data frame
http://www.r-tutor.com/