
Materials and Methods

Given a dataset composed of users (i.e., players) whose features (i.e., actions performed by the players) evolve over time, we aim to extract meaningful patterns of behavior by applying tensor decomposition techniques. To this end, we need to represent our data as a tensor (i.e., a multi-dimensional array). This can be done by assigning the different dimensions of the data, namely users, features, and time, to the dimensions of the tensor. However, not all the available features might be predictive of a user's performance or behavior. Thus, our framework, shown in Figure 1, includes a feature selection step (described below) that allows us to identify the most informative features, which are then used to build the tensor.


Feature Selection

The first step in our framework is to evaluate which features we need to build our multi-dimensional representation (tensor) of the data. We use Decision Trees to design a simple prediction task and detect the features that are most predictive of users' performance. Decision Trees are well-known machine learning techniques used for predicting a target outcome (in our case, users' performance) by learning decision rules from the features describing the data. The model also provides as output the importance of each feature, i.e., the Gini importance or impurity (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm). Each node of a decision tree represents a condition on one feature used to split the dataset into two parts. The Gini importance is the measure used by the algorithm (here we use the Python scikit-learn implementation, http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#r251) to choose the optimal split condition, and it is defined as

\(IG(p) = 1 - \sum_{j=1}^{J} p^{2}_{j}\), (1)

where \(J\) is the number of classes and \(p_j\) is the fraction of items labeled with the j-th class. The features in the dataset can then be ranked according to the Gini importance yielded by their addition to the model. We select those features whose Gini values sum to more than 90% of the overall sum (i.e., the sum of the Gini values of all features).

Finally, as the selected features might span different ranges of values, we normalize them as follows. Given the vector \(x_f\) associated with feature \(f\), we normalize each entry \(i\) of the vector as

\(\hat x_{f,i} = \dfrac{x_{f,i} - x_{min}}{x_{max} - x_{min}}\), (2)

where \(x_{min}\) and \(x_{max}\) respectively are the minimum and maximum values of vector \(x_f\).
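As a concrete illustration, the sketch below implements the selection and normalization steps with scikit-learn, assuming a hypothetical feature table X_features and performance labels y: features are ranked by Gini importance, the smallest subset whose importances sum to more than 90% of the total is kept, and the selected features are rescaled according to Equation (2).

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def select_and_normalize(X_features: pd.DataFrame, y, threshold=0.9):
    """Rank features by Gini importance, keep the smallest set whose importances
    exceed `threshold` of the total, and min-max normalize the selected features."""
    # Fit a decision tree on the (hypothetical) performance labels y
    tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_features, y)
    ranked = pd.Series(tree.feature_importances_, index=X_features.columns)
    ranked = ranked.sort_values(ascending=False)

    # Smallest prefix of the ranking whose cumulative importance exceeds 90%
    cumulative = ranked.cumsum() / ranked.sum()
    n_keep = int((cumulative < threshold).sum()) + 1
    selected = ranked.index[:n_keep].tolist()

    # Min-max normalization of each selected feature (Equation 2)
    X_sel = X_features[selected]
    normalized = (X_sel - X_sel.min()) / (X_sel.max() - X_sel.min())
    return selected, normalized
```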


Non-Negative Tensor Factorization

The dataset, now composed of users, selected features, and their temporal sequence (which in our data is the series of matches played by each player), can be represented as a tensor. In our framework, this tensor is then decomposed to identify groups of users that share similar behaviors over time and are characterized by common sets of features.

Following the notation reported in Table 1, we thus define a three-dimensional array, denoted as \(\mathscr X \in \mathbb{R}^{I \times J \times K}\), where \(I\) is the number of users, \(J\) is the number of features, and \(K\) is the number of time steps (i.e., played matches). In this formulation, the entry \(x_{ijk}\) of the tensor corresponds to the value of the j-th feature for the i-th user at the k-th match in her/his gaming history.

Table 1. Notation used throughout the text.

Notation | Definition
\(X\) | constant
\(x\) | scalar
\(\mathbf{x}\) | vector
\(\mathbf{X}\) | matrix
\(x_{ij}\) | matrix entry
\(\mathscr X\) | tensor
\(x_{ijk}\) | entry of a three-dimensional tensor
\(\circ\) | outer product
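As an illustration of how \(\mathscr X\) can be assembled, the sketch below builds the users × features × matches array from a long-format table with one row per (user, match) pair; the column names (user_id, match_index) are hypothetical.

```python
import numpy as np
import pandas as pd

def build_tensor(df: pd.DataFrame, feature_cols):
    """Stack the normalized features into a users x features x matches array.

    df is assumed to have one row per (user, match) pair, with a 'user_id'
    column, a 'match_index' column (0-based position of the match in the
    user's history), and one column per selected feature.
    """
    users = sorted(df["user_id"].unique())
    K = int(df["match_index"].max()) + 1
    X = np.zeros((len(users), len(feature_cols), K))
    user_pos = {u: i for i, u in enumerate(users)}
    for _, row in df.iterrows():
        i = user_pos[row["user_id"]]
        k = int(row["match_index"])
        X[i, :, k] = row[feature_cols].to_numpy(dtype=float)
    return X, users
```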

Once the dataset is represented in tensor form, we can decompose the tensor to perform a dimensionality reduction on the data. Here, we focus on Non-negative Tensor Factorization (NTF), which is given by the PARAFAC/CANDECOMP (CP) decomposition with non-negativity constraints. The NTF approximates the tensor \(\mathscr X\) as a sum of rank-one tensors, called components:

\(\mathscr X \approx \sum_{r=1}^{R} \lambda_{r}\, a_{r} \circ b_{r} \circ c_{r}\), (3)

where \(R\) is the rank of the tensor, i.e., the number of lower-dimensional components found by the decomposition to approximate the data; \(\lambda_r\) are the values of the core tensor \(\mathscr{L} = diag(\lambda)\); and the outer product \(a_r \circ b_r \circ c_r\) identifies the component \(r\), with \(r = 1, \cdots, R\).

The vectors \(a_r\), \(b_r\), and \(c_r\) provide, respectively, the level of membership of the users in component r, the level of membership of the features in component r, and the temporal activation of component r. These vectors can be encoded in three matrices \(A \in \mathbb{R}^{I \times R}\), \(B \in \mathbb{R}^{J \times R}\), and \(C \in \mathbb{R}^{K \times R}\), called factor matrices, whose r-th columns coincide with the vectors \(a_r\), \(b_r\), and \(c_r\). Therefore, the approximation in Equation (3) can be rewritten in the Kruskal form as

\(\mathscr X \approx 〚\mathscr{L}; A, B, C〛\). (4)

Note that the Kruskal operator 〚⋅〛 is just a compact notation to write Equation (3) in terms of the core and the factor matrices.

To obtain the factor matrices, and thus the approximated tensor, we need to solve an optimization problem of the form:

\(\min \|\mathscr X - 〚\mathscr{L}; A, B, C〛\|^{2}_{F} \quad \text{s.t.} \quad \lambda, A, B, C \geq 0\), (5)

where \(\|\cdot\|_F\) is the Frobenius norm, and non-negativity constraints are imposed on the factor matrices. To solve the optimization problem, we rely on Alternating Non-negative Least Squares (ANLS) combined with the Block Principal Pivoting (BPP) method. In particular, we use the Python package available online (https://github.com/panisson/ntf-school/blob/master/ncp.py) to compute the tensor decomposition.
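For illustration, the sketch below performs this step using TensorLy's non-negative CP routine as a stand-in for the ANLS-BPP script referenced above; it solves the same constrained problem of Equation (5), although with a different optimization scheme.

```python
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

def decompose(X, rank):
    """Non-negative CP decomposition of the users x features x matches tensor X."""
    tensor = tl.tensor(X)
    cp = non_negative_parafac(tensor, rank=rank, init="random", n_iter_max=500)
    weights, (A, B, C) = cp  # factor matrices for users, features, and time
    # Relative error of the approximation in Equation (3)
    rel_error = tl.norm(tensor - tl.cp_to_tensor(cp)) / tl.norm(tensor)
    return A, B, C, weights, rel_error
```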


Rank Selection

To find the best approximation, we run 5 simulations for each rank value and assess the suitable number of components, i.e., the rank, by using two approaches: the Core Consistency Diagnostic (CORCONDIA) and Automatic Relevance Determination (ARD) for Tucker models (i.e., a generalization of the CP decomposition).

CORCONDIA
We compute the core consistency value for each simulation and look at the change in the slope of the resulting core consistency curve to select a suitable number of components \(R\). In particular, the core consistency values evaluate how close the computed decomposition is to the ideal one. This is done by comparing the core tensor \(\mathscr{L}\) of the decomposition to a super-diagonal tensor \(\mathscr{G}\), i.e., a tensor having entries equal to 1 on the diagonal and 0 otherwise. The core consistency between \(\mathscr{L}\) and \(\mathscr{G}\) is therefore defined as:

\(cc = 100\left(1 - \dfrac{\sum_{l=1}^{R}\sum_{m=1}^{R}\sum_{n=1}^{R}(\lambda_{lmn} - g_{lmn})^{2}}{R}\right)\), (6)

where \(λ_{lmn}\) are the entries of the core tensor \(\mathscr{L} \in \mathbb{R}^{R \times R \times R}\) and \(g_{lmn}\) are the entries of the ideal core \(\mathscr G∈ \mathbb R^{R×R×R}\).

The core consistency has an upper bound: its value cannot exceed 100. High values of the core consistency indicate high similarity between the cores, while low values indicate that the selected model could be problematic. Finally, the core consistency can assume negative values, indicating that the selected model is inappropriate. In the Results section we will discuss the optimal number of components as discovered by the Core Consistency Diagnostic.
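A minimal NumPy sketch of Equation (6) is reported below; it assumes the component weights have been absorbed into the factor matrices, computes the least-squares core given \(A\), \(B\), and \(C\), and compares it to the ideal super-diagonal core.

```python
import numpy as np

def core_consistency(X, A, B, C):
    """Core Consistency Diagnostic (Equation 6) for a rank-R CP model of X."""
    R = A.shape[1]
    # Least-squares core given the factor matrices:
    # L = X x1 pinv(A) x2 pinv(B) x3 pinv(C)
    L = np.einsum("li,mj,nk,ijk->lmn",
                  np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C), X)
    # Ideal super-diagonal core: ones on the diagonal, zeros elsewhere
    G = np.zeros((R, R, R))
    G[np.arange(R), np.arange(R), np.arange(R)] = 1.0
    return 100.0 * (1.0 - np.sum((L - G) ** 2) / R)
```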

ARD Tucker
The algorithm is based on a hierarchical Bayesian framework whose parameters directly encode the importance of the features. Thus, while approximating the tensor, ARD automatically optimizes the hyperparameters and prunes irrelevant components. In particular, the minimization problem based on the least-squares objective function coincides with the minimization of the negative log-likelihood, assuming i.i.d. tensor values with Gaussian noise.

As a result, the ARD Tucker algorithm detects the relevant components while pruning the others, minimizing the negative log-likelihood corresponding to the ANLS problem. The best model is the one with the largest log-likelihood. In the Results section we will show the detected model and, to select the rank, we will compare this result with the one provided by CORCONDIA.

Once the number of components is identified, we can analyze the information provided by the resulting factor matrices. By studying the matrices \(A\) and \(B\), we can define the level of membership of each user in a specific component, as well as the level of membership of each feature in the components. It is worth noting that NTF allows users (or features) to belong to several components, leading to overlapping groups, but it also allows users (or features) to have a level of membership so low that it might not be enough to label them as members of a specific component.

There are two possible strategies to define whether a user (and analogously a feature) belongs to a component: either we accumulate the membership values of a component until 95% of its norm is reached, or we compute an intra-component k-means with \(k = 2\) clusters, i.e., for each component we divide the users into two groups: those who belong to the specific component and those who do not. These methods allow us to identify user clusters that might overlap. However, their use could lead to users that do not belong to any component. A sketch of the first strategy is shown below.
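The following sketch accumulates the entries of each component until 95% of the column norm is reached; the membership matrix is assumed non-negative (as guaranteed by NTF), and the choice of the L1 norm is our assumption.

```python
import numpy as np

def members_by_norm(M, fraction=0.95):
    """For each column (component) of a non-negative membership matrix M,
    return the indices of the rows that together account for `fraction`
    of the column's (L1) norm."""
    members = []
    for r in range(M.shape[1]):
        col = M[:, r]
        order = np.argsort(col)[::-1]                 # rows by decreasing membership
        cumulative = np.cumsum(col[order]) / col.sum()
        n_keep = int(np.searchsorted(cumulative, fraction)) + 1
        members.append(order[:n_keep].tolist())
    return members
```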

We use the first of the two methods to analyze the level of membership of the features, provided by the matrix \(B\). This decision is justified by our expectation that, in League of Legends (and in MOBA games in general), players can exhibit different strategies during each match, reflecting different personal goals: for example, a user may try to kill as many enemies as possible to earn a large amount of gold, which however would potentially risk her/his hero's death more frequently; an alternative could be avoiding getting the hero killed too often by helping (assisting) other teammates, which however results in earning less gold at the end of the match.

However, since our aim is to identify groups of users with different behaviors, we decided to study the user membership as follows. We consider the components of \(A\) as observations recorded for each user, and we fit the k-means algorithm with \(k = R\), i.e., we want to obtain one cluster of users for each component. In this way, the users are divided into k disjoint clusters. Here, k-means is applied to the lower-dimensional representation provided by the NTF, and thus to the components' values in \(A\). The algorithm aims at finding centroids that minimize the distance between elements in each cluster. Formally, given a set of observations, the algorithm partitions the observations into k sets \(S = \{S_1, \cdots, S_k\}\) so that the within-cluster sum of squares is minimized:

\(\min_{S} \sum_{i=1}^{k} \sum_{a \in S_i} \|a - \mu_i\|^{2}\), (7)

where \(a\) is one observation (here, a row of \(A\)) and \(\mu_i\) is the mean of set \(S_i\). As the values of \(A\) indicate how strongly each component represents a user, we can thus select the group of users that mostly belongs to each component. Note that the use of k-means on the raw data would not provide the same level of information given by NTF, which discovers correlations in the data across multiple dimensions at a time and allows us to follow the temporal evolution of the uncovered groups.
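A minimal sketch of this clustering step with scikit-learn, assuming the user factor matrix \(A\) and the selected rank R are available:

```python
from sklearn.cluster import KMeans

def cluster_users(A, R, seed=0):
    """Partition users into R disjoint groups by running k-means (k = R)
    on the rows of the user factor matrix A (Equation 7)."""
    km = KMeans(n_clusters=R, n_init=10, random_state=seed).fit(A)
    return km.labels_  # labels_[i] is the cluster assigned to user i
```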

Finally, we can recover the temporal activation of each component in the columns of the factor matrix \(C\). This information can be used to investigate the evolution of each extracted behavior.
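For instance, the activation profiles can be inspected by plotting the columns of \(C\) against the match index (a minimal matplotlib sketch):

```python
import matplotlib.pyplot as plt

def plot_temporal_activations(C):
    """Plot the temporal activation of each component (columns of C)."""
    for r in range(C.shape[1]):
        plt.plot(C[:, r], label=f"component {r + 1}")
    plt.xlabel("match index")
    plt.ylabel("activation")
    plt.legend()
    plt.show()
```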