Given a dataset composed of users (i.e., players) whose features (i.e., actions performed by the players) evolve over time, we aim to extract meaningful patterns of behavior by applying tensor decomposition techniques. To this end, we need to represent our data as a tensor (i.e., a multi-dimensional array). This can be done by assigning the different dimensions of the data, namely users, features, and time, to the dimensions of the tensor. However, not all the available features might be predictive of a user's performance or behavior. Thus, our framework, shown in Figure 1, includes a feature selection step (described in the following) that allows us to identify the most informative features, which are then used to build the tensor.
Feature Selection
The first step in our framework is to evaluate which features we need to build our multi-dimensional representation (tensor) of the data. We use Decision Trees to design a simple prediction task and detect the features that are most predictive of users' performance. Decision Trees are well-known machine learning techniques used for predicting a target outcome (in our case, users' performance) by learning decision rules from the features describing the data. The model also provides as an output the importance of each feature, i.e., the Gini importance or impurity (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm). Each node of a decision tree represents a condition on one feature used to split the dataset in two parts. The Gini importance is the measure used by the algorithm to choose the optimal split condition (here we use the Python scikit-learn implementation, http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#r251), and it is defined as
\(IG(p)=1−∑_{j=1}^{J} p^{2}_j\), (1)
where \(J\) is the number of classes, and \(p_j\) is the fraction of items labeled with the j-th class. The features in the dataset can then be ranked according to the Gini importance yielded by their addition to the model. We select the features whose Gini values sum to more than 90% of the overall sum (i.e., the sum of the Gini values of all features).
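As an illustration, this selection step could be sketched as follows; `X_feat` (the matrix of candidate features), `y` (the performance label), and `feature_names` are placeholder names, and the actual pipeline may differ:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def select_features(X_feat, y, feature_names, threshold=0.9):
    """Rank features by Gini importance and keep the top ones covering `threshold` of the total."""
    tree = DecisionTreeClassifier(criterion="gini", random_state=0)
    tree.fit(X_feat, y)
    importances = tree.feature_importances_      # Gini importances, normalized to sum to 1
    order = np.argsort(importances)[::-1]        # features ranked from most to least important
    cumulative = np.cumsum(importances[order])
    n_keep = np.searchsorted(cumulative, threshold) + 1
    return [feature_names[i] for i in order[:n_keep]]
```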
Finally,
as the selected features might be characterized by different ranges of
values, we normalize them as follows. Given the vector \(x_f\) related to feature f, we normalize each entry i of the vector as
\(\hat x_{f,i}=\dfrac{x_{f,i}−x_{min}}{x_{max}−x_{min}}\), (2)
where \(x_{min}\) and \(x_{max}\) are, respectively, the minimum and maximum values of the vector \(x_f\).
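Equation (2) is a standard min-max rescaling to the [0, 1] range; a minimal sketch (assuming the feature vector is not constant):

```python
import numpy as np

def min_max_normalize(x_f: np.ndarray) -> np.ndarray:
    """Rescale a feature vector to [0, 1] as in Equation (2)."""
    x_min, x_max = x_f.min(), x_f.max()
    return (x_f - x_min) / (x_max - x_min)
```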
Non-Negative Tensor Factorization
The dataset, now composed of users, the selected features, and their temporal sequence (which in our data is the series of matches played by each player), can be represented as a tensor. This tensor will then be decomposed in our framework to identify groups of users with similar behaviors in time, characterized by common sets of features. Following the notation reported in Table 1, we thus define a three-dimensional array, denoted as \( \mathscr X∈ \mathbb{R}^{I×J×K} \), where I is the number of users, J is the number of features, and K is the number of time steps (i.e., played matches). In this formulation, the entry \(x_{ijk}\) of the tensor corresponds to the value of the j-th feature for the i-th user at the k-th match in her/his gaming history.
Table 1. Notation used throughout the text.

| Notation | Definition |
| --- | --- |
| \(X\) | constant |
| \(x\) | scalar |
| \(\mathbf{x}\) | vector |
| \(\mathbf{X}\) | matrix |
| \(x_{ij}\) | matrix entry |
| \(\mathscr X\) | tensor |
| \(x_{ijk}\) | entry of a three-dimensional tensor |
| ○ | outer product |
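As an example of how this tensor could be assembled in practice, the following sketch assumes a hypothetical pandas DataFrame `df` with one row per (user, match) pair, a `user` column, a `match` column giving the position of the match in that player's history (all players are assumed to have the same number of matches K), and the selected, normalized features as the remaining columns:

```python
import numpy as np
import pandas as pd

def build_tensor(df: pd.DataFrame, features: list) -> np.ndarray:
    """Stack per-match feature vectors into a tensor of shape (I users, J features, K matches)."""
    users = sorted(df["user"].unique())
    matches = sorted(df["match"].unique())
    u_idx = {u: i for i, u in enumerate(users)}
    m_idx = {m: k for k, m in enumerate(matches)}
    X = np.zeros((len(users), len(features), len(matches)))
    for _, row in df.iterrows():
        X[u_idx[row["user"]], :, m_idx[row["match"]]] = row[features].to_numpy()
    return X
```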
Once the dataset is represented in tensor form, we can decompose the tensor to perform a dimensionality reduction on the data. Here, we focus on Non-negative Tensor Factorization (NTF), which is given by the PARAFAC/CANDECOMP (CP) decomposition with non-negative constraints. The NTF approximates the tensor \(\mathscr X\) as a sum of rank-one tensors, called components:
\( \mathscr X≈∑_{r=1}^R λ_r a_r ∘ b_r ∘ c_r \), (3)
where \(R\) is the rank of the tensor, which is the number of lower-dimensional components found by the decomposition to approximate the data; \(λ_r\) are the values of the tensor core \(\mathscr{L}=diag(λ)\); and the outer product \(a_r ∘ b_r ∘ c_r\) identifies the component \(r\), with \(r=1,⋯,R\).
The vectors \(a_r\), \(b_r\), and \(c_r\) respectively provide the level of membership of the users in component r, the level of membership of the features in component r, and the temporal activation of component r. These vectors can be encoded in three matrices \(A∈\mathbb{R}^{I×R}\), \(B∈\mathbb{R}^{J×R}\), and \(C∈\mathbb{R}^{K×R}\), called factor matrices, whose r-th columns coincide with the vectors \(a_r\), \(b_r\), and \(c_r\). Therefore, the approximation in Equation (3) can be rewritten in the Kruskal form as
\(\mathscr X≈〚\mathscr{L};A,B,C〛\). (4)
Note that the Kruskal operator 〚⋅〛 is just a compact notation to write Equation (3) in terms of the core and factor matrices.
To obtain the factor matrices, and thus the approximated tensor, we need to solve an optimization problem of the form:
\( \min_{λ,A,B,C} ∥\mathscr X−〚\mathscr{L};A,B,C〛∥^2_F \quad \text{s.t.} \quad λ,A,B,C≥0 \), (5)
where \(∥⋅∥_F\) is
the Frobenius norm, and non-negativity constraints are imposed on the
factor matrices. To solve the optimization problem, we rely on the
Alternating Non-negative Least Squares (ANLS), combined with the
Block Principal Pivoting (BPP) method. In particular,
we use the Python package available online
(https://github.com/panisson/ntf-school/blob/master/ncp.py) to compute
the tensor decomposition.
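For illustration only, the decomposition step can be sketched with TensorLy's non_negative_parafac as a stand-in for the referenced ncp.py implementation (which relies on ANLS with Block Principal Pivoting); the rank R used here is a placeholder:

```python
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

R = 5                          # placeholder rank; the actual rank is selected as described below
X_t = tl.tensor(X)             # the (I x J x K) tensor built above
cp = non_negative_parafac(X_t, rank=R, init="random", n_iter_max=200, random_state=0)
A, B, C = cp.factors           # A: (I x R) users, B: (J x R) features, C: (K x R) time
```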
Rank Selection
To find the best approximation, we run 5 simulations for each rank value and assess the suitable number of components, i.e., the rank, using two approaches: the Core Consistency Diagnostic (CORCONDIA) and the Automatic Relevance Determination (ARD) for Tucker models (i.e., a generalization of the CP decomposition).
CORCONDIA
We compute the core
consistency value for each simulation and look at the change in the
slope of the resulting core consistency curve to select a suitable
number of components R. In particular, the core consistency values
provide an evaluation of the closeness of the computed decomposition to
the ideal one. This is done by comparing the core tensor \(\mathscr{L}\) of the decomposition to a super-diagonal tensor \(\mathscr{G}\), i.e., a tensor having entries equal to 1 on the diagonal and 0 otherwise. Therefore, the core consistency between \(\mathscr{L}\) and \(\mathscr{G}\) is defined as:
\( cc=100\left(1−\dfrac{∑^{R}_{l=1} ∑^{R}_{m=1} ∑^{R}_{n=1} (λ_{lmn}−g_{lmn})^2}{R}\right) \), (6)
where \(λ_{lmn}\) are the entries of the core tensor \(\mathscr{L} \in \mathbb{R}^{R \times R \times R}\) and \(g_{lmn}\) are the entries of the ideal core \(\mathscr{G} \in \mathbb{R}^{R \times R \times R}\).
The
core consistency has an upper bound: its values cannot exceed 100. As a
result, high values of the core consistency mean high similarity
between the cores, while low values indicate that the model selected
could be problematic. Finally, the core consistency can assume negative
values, thus indicating that the model selected is inappropriate. In the
Results section we will discuss the optimal number of
components as discovered by the Core Consistency Diagnostic.
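A minimal sketch of the diagnostic is given below; following the usual CORCONDIA construction, the core \(\mathscr{L}\) is computed here as the least-squares core of the data given the factor matrices (A, B, C from the decomposition above), which may differ in detail from the implementation actually used:

```python
import numpy as np

def core_consistency(X, A, B, C):
    """Core consistency (Equation 6) of a CP model with factor matrices A, B, C."""
    R = A.shape[1]
    Ap, Bp, Cp = np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C)
    # Least-squares core of X given the factors: L = X x_1 A^+ x_2 B^+ x_3 C^+
    core = np.einsum('ijk,ri,sj,tk->rst', X, Ap, Bp, Cp)
    # Ideal super-diagonal core: 1 on the diagonal, 0 elsewhere
    ideal = np.zeros((R, R, R))
    ideal[np.arange(R), np.arange(R), np.arange(R)] = 1.0
    return 100.0 * (1.0 - ((core - ideal) ** 2).sum() / R)
```

The core consistency of the runs at each rank can then be plotted against R to spot the change in slope discussed above.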
ARD Tucker
The algorithm is based on a hierarchical Bayesian framework, whose parameters directly define the importance of the features. Thus, when approximating the tensor, ARD automatically optimizes the hyperparameters and prunes irrelevant components. In particular, the minimization problem based on the least-squares objective function coincides with the minimization of the negative log-likelihood, assuming i.i.d. tensor values with Gaussian noise.
As a result, the ARD Tucker algorithm detects the relevant components, while pruning the others and minimizing the negative log-likelihood corresponding to the ANLS problem. The best model corresponds to the one having the largest log-likelihood. In the Results section we will show the detected model and, to select the rank, we will compare this result with the one provided by CORCONDIA.
Once the number of components is identified, we can analyze the information provided by the resulting factor matrices. By studying the matrices \(A\) and \(B\), we can define the level of membership of each user in a specific component, as well as the level of membership of each feature in the components. It is worth noting that NTF allows users (or features) to belong to several components, leading to overlapping groups, but it also allows users (or features) to have a level of membership so low that it may not be enough to label them as members of a specific component.
There are two possible strategies to define whether a user (and analogously a feature) belongs to a component: either we accumulate the memberships within a component until 95% of its norm is reached, or we compute an intra-component k-means with \(k=2\) clusters, i.e., for each component we divide the users into two groups: those who belong to the specific component, and those who do not. These methods allow us to identify user clusters that might overlap. However, the use of such methods could lead to users that do not belong to any component.
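Both strategies can be sketched as follows for a single component (a column of \(A\) or \(B\)); interpreting the 95% criterion as 95% of the component's total mass is our assumption here:

```python
import numpy as np
from sklearn.cluster import KMeans

def members_by_accumulation(component: np.ndarray, fraction: float = 0.95) -> np.ndarray:
    """Indices whose sorted memberships accumulate to `fraction` of the component's total mass."""
    order = np.argsort(component)[::-1]
    cumulative = np.cumsum(component[order])
    n_keep = np.searchsorted(cumulative, fraction * component.sum()) + 1
    return order[:n_keep]

def members_by_2means(component: np.ndarray) -> np.ndarray:
    """Indices falling in the higher-membership cluster of an intra-component 2-means."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(component.reshape(-1, 1))
    high = int(np.argmax([component[labels == c].mean() for c in (0, 1)]))
    return np.flatnonzero(labels == high)
```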
We use the first of the two methods to analyze the level of membership of the features, provided by the matrix \(B\). This decision is justified by our expectation that, in League of Legends (and in MOBA games in general), players can exhibit different strategies during each match, reflecting different personal goals: for example, a user may try to kill as many enemies as possible to earn a large amount of gold, which, however, would risk her/his hero's death more frequently; an alternative could be avoiding getting the hero killed too often by helping (assisting) other teammates, which, however, results in earning less gold at the end of the match.
However, since our aim is to identify groups of users with different behaviors, we decided to study the user membership as follows. We consider the components of \(A\) as observations recorded for each user, and we fit the k-means algorithm by imposing \(k=R\), i.e., we want to obtain one cluster of users for each component. In this way, the users are divided into k disjoint clusters. Here, k-means is applied on the lower-dimensional representation provided by the NTF, and thus on the components' values in \(A\). The algorithm aims at finding centroids that minimize the distance between the elements in each cluster. Formally, given a set of observations, the algorithm partitions the observations into k sets \( S=\{S_1,⋯,S_k\} \) so that the within-cluster sum of squares is minimized:
\(\min_S ∑_{i=1}^k ∑_{a∈S_i} ∥a−μ_i∥^2\), (7)
where \(a\) is one observation (here a row of \(A\)) and \(μ_i\) is the mean of set \(S_i\). As the values of \(A\) indicate how strongly each component represents a user, we can thus select the group of users that mostly belong to each component. Note that the use of k-means on the raw data would not provide the same level of information given by NTF, which discovers correlations in the data in multiple dimensions at a time and allows us to follow the temporal evolution of the uncovered groups.
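In practice, this corresponds to running k-means with k = R on the rows of \(A\); a minimal sketch using scikit-learn, with A and R as obtained above:

```python
from sklearn.cluster import KMeans

# Cluster users on their component memberships (the rows of A), one cluster per component.
kmeans = KMeans(n_clusters=R, n_init=10, random_state=0)
user_cluster = kmeans.fit_predict(A)    # user_cluster[i] is the cluster label of the i-th user
```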
Finally, we can recover the temporal activation of each component in the columns of the factor matrix \(C\). This information can be used to investigate the evolution of each extracted behavior.
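For instance, the columns of \(C\) can be plotted over the match index to inspect when each behavioral component is active (a sketch, assuming matplotlib and the matrix C from the decomposition above):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for r in range(C.shape[1]):
    ax.plot(C[:, r], label=f"component {r + 1}")   # temporal activation of the r-th component
ax.set_xlabel("match index (k)")
ax.set_ylabel("temporal activation")
ax.legend()
plt.show()
```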