Methods and results

Step 2

Establishing prediction indicators

To construct predictive models and association rules, the target variable and the input variables used in the predictive analysis techniques must first be determined and defined.


Defining target variable

To determine the target variable of the decision tree (DT) algorithm and logistic regression (LR) analysis, cluster analysis was performed on the sample of 77 respondents using the two factors extracted by the exploratory factor analysis, namely the business information management effectiveness of BI and the decision support effectiveness of BI. The K-means method of non-hierarchical cluster analysis was adopted to classify the BISE of financial services firms in Taiwan. Discriminant analysis was then applied to validate the clustering results; its classifications agreed with the K-means assignments in every case, giving a classification accuracy of 100.00%.
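The clustering step described above can be sketched as follows. This is a minimal illustration of K-means with two clusters, not the study's actual procedure: the factor scores are invented, the farthest-point initialisation is an assumption chosen to make the run deterministic, and the discriminant-analysis validation is not reproduced.

```python
from math import dist

def kmeans(points, k=2, max_iter=100):
    """Lloyd's K-means with a deterministic farthest-point start (an assumption)."""
    # Start from the first point, then add the point farthest from the
    # centroids chosen so far until k centroids exist.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = []
    for _ in range(max_iter):
        # Assignment step: each firm joins its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged: centroids no longer move
        centroids = updated
    return labels, centroids

# Hypothetical (factor 1, factor 2) scores, invented for illustration only.
scores = [(-1.2, -0.9), (-0.8, -1.1), (-1.0, -1.3), (-1.1, -0.7),
          (0.5, 0.7), (0.9, 0.4), (0.6, 0.6), (0.8, 0.9), (0.4, 0.5), (0.7, 0.8)]
labels, centroids = kmeans(scores, k=2)

# The cluster whose centroid has the lower factor means is the "low" group.
low = min(range(2), key=lambda j: sum(centroids[j]))
groups = ["low BISE" if lab == low else "high BISE" for lab in labels]
```

The resulting `groups` list plays the role of the binary target variable. In practice a statistics package would normally be used for both the clustering and the discriminant validation.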

Table 3 lists the results of the cluster analysis performed on the two BISE factors: group one comprises 26 firms and group two comprises 51 firms. A t-test was conducted to determine whether the two factors differed significantly between the groups. The t-values were 7.111 and 11.787, and both p-values were 0.000 (<0.05), indicating significant differences. The means of both factors were lower in group one than in group two; group one therefore had lower BISE and was named the "low BISE group", while group two had higher BISE and was named the "high BISE group". This group membership served as the target variable of the DT algorithm and LR analysis.

Table 3 Results of cluster analysis

Groups                     Firms   Factor 1            Factor 2            Discriminant analysis
                                   t-value   p-value   t-value   p-value   validation (%)
Group 1: low BISE group    26      7.111     0.000     11.787    0.000     100
Group 2: high BISE group   51                                              100


  1. Factor 1: the business information management effectiveness of BI
  2. Factor 2: the decision support effectiveness of BI


Determining input variables

Overfitting can arise when the DT algorithm and LR analysis are given numerous input variables, because the models may then select variables that are unrelated to the target. To avoid biasing the analysis results, this study used the chi-square test and the independent-sample t-test to retain only statistically significant variables as inputs to the DT algorithm and LR analysis.

A chi-square test was used to assess whether each of six independent variables describing enterprise characteristics was significantly associated with the target variable. The p-value (0.016) for the number of years the enterprise had implemented BI solutions was below 0.05, indicating a significant association with the target variable. The p-values of the remaining five variables exceeded 0.05 (range, 0.143–0.127), with χ² values in the range 2.224–5.425, indicating no significant association with the target variable. Thus, only the number of years the enterprise had implemented BI solutions was selected as an input variable of the DT algorithm.
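As an illustration of this screening step, the sketch below computes a Pearson chi-square test of independence for one categorical variable against the two BISE groups. The contingency counts and category labels are invented and do not come from the survey; the p-value uses the closed-form recurrence for the chi-square upper tail, so only the standard library is assumed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with df degrees of freedom."""
    z = x / 2.0
    # Base cases: Q(1, z) = exp(-z) for even df, Q(1/2, z) = erfc(sqrt(z)) for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def chi_square_independence(table):
    """Pearson chi-square test of independence for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, chi2_sf(stat, df)

# Hypothetical counts (invented): rows are three made-up categories of
# years since BI implementation, columns are the low and high BISE groups.
years_by_group = [[10, 5],
                  [8, 12],
                  [4, 21]]
stat, df, p_value = chi_square_independence(years_by_group)
keep_as_input = p_value < 0.05  # retain the variable only if significant
```

A variable would be retained as a DT input only when `keep_as_input` is true, which mirrors the 0.05 threshold used in the text.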

Additionally, this study performed independent-sample t-tests comparing the two target-variable groups on each of the 16 measurement items of BISE. The p-values of all 16 items were below 0.05, with t-values in the range 4.254–8.010, so all 16 items were retained as input variables of the DT algorithm and LR analysis.
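The item-level screening can be sketched in the same way. The example below assumes a pooled-variance (Student) t-test, which the text does not specify, and obtains the two-sided p-value by numerically integrating the t density with Simpson's rule. The item names and scores are hypothetical.

```python
import math
from statistics import mean, variance

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value from Simpson integration of the t density on [0, |t|]."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    return max(0.0, 1.0 - 2.0 * area)

def pooled_t_test(a, b):
    """Independent-sample t-test assuming equal variances (an assumption)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df, t_two_sided_p(t, df)

# Hypothetical item scores (invented): (low BISE group, high BISE group).
items = {
    "item_A": ([2, 3, 2, 3, 2, 3], [4, 5, 4, 5, 4, 4]),
    "item_B": ([3, 4, 3, 4, 3, 4], [3, 4, 4, 3, 4, 3]),
}
# Keep only the items whose group means differ significantly at the 0.05 level.
selected = [name for name, (low, high) in items.items()
            if pooled_t_test(low, high)[2] < 0.05]
```

Here `selected` holds the items that would pass on to the DT algorithm and LR analysis. If the group variances differ markedly, Welch's unequal-variance test would be the more cautious choice.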