A Review of the Enabling Methodologies for Knowledge Discovery from Smart Grids Data

The large-scale deployment of pervasive sensors and decentralized computing in modern smart grids is expected to exponentially increase the volume of data exchanged by power system applications. In this context, the search for scalable and flexible methodologies aimed at supporting rapid decisions in a data-rich, but information-limited environment represents a relevant issue to address. To this end, this paper outlines the potential role of Knowledge Discovery from massive Datasets in smart grid computing, presenting the most recent activities developed in this field by the Task Force on “Enabling Paradigms for High-Performance Computing in Wide Area Monitoring Protective and Control Systems” of the IEEE PSOPE Technologies and Innovation Subcommittee.


Introduction
On-line smart grid operation requires quickly identifying reliable decisions in a complex, data-rich, but information-limited domain [1]. In this context, the data streams generated by the network of pervasive sensors distributed along the entire power system do not always provide smart grid operators with the information needed to react to external disturbances in the time-frames required to minimize their impact.
Although the effectiveness of these knowledge discovery-based paradigms has been successfully assessed in solving specific smart grid problems, their global integration in realistic decision support systems requires the development of ontology middleware, which provides functionalities aimed at facilitating operational data acquisition and handling in interoperable formats, enabling information services through a coordinated process chain [9]. These functions can be obtained by processing heterogeneous smart grid datasets with ontology-based techniques and smart reasoning systems, which enable access to the information content rather than keyword-based searches. This paradigm supports knowledge discovery and provides decision support to smart grid operators by making computing systems interact more closely at human conceptual levels, modeling the semantics of the data instead of relying only on their syntactic and structural representations. These features allow the ontology middleware to become a flexible and extendable platform for knowledge management solutions in smart grids.
According to the research directions identified by these papers, the Task Force on "Enabling Paradigms for High-Performance Computing in Wide Area Monitoring Protective and Control Systems" of the IEEE PSOPE Technologies and Innovation Subcommittee analyzed the open problems, the challenging issues, and the most promising enabling technologies for knowledge discovery from smart grids data. The main results of this analysis are summarized in this paper, and the experimental results obtained on a complex case study are presented and discussed in order to emphasize the potential role of computational and cognitive techniques for situation awareness in smart grid applications.

Knowledge Discovery from Massive Data
The recent technological advancements in data storing and processing allow the growth of electronic archives, coupled to a large and pervasive diffusion of online sensors, which transmit high frequency information streams about the operation states of complex and distributed systems [10]. Online processing of these massive data allows for improving the knowledge about the behavior of complex systems characterized by large uncertainty sources, which make the deterministic modeling of the analyzed system difficult [11].
Unfortunately, the massive increase in data volume has reduced the effectiveness of the traditional approaches employed to extract useful information. Indeed, a large amount of data is not guaranteed to be a reliable source of information; in the majority of cases, the data need to be processed to reveal their true intrinsic knowledge value [12]. Furthermore, acquiring and storing data entail costs in terms of equipment and storage technologies. For this reason, extracting the most valuable information from the data plays a strategic role in modern complex systems analysis [13]. In this context, the main objective of the data analyst is to develop strategies aimed at giving value to this process, promoting reliable software and hardware architectures able to perform this task effectively.
For this reason, when the cardinality of heterogeneous data becomes too large for complete human management or for traditional approaches, Artificial Intelligence (AI) can support analysts in extracting reliable and useful information [14]. In this domain, Knowledge Discovery in large Databases (KDD) represents a strategic solution, as it allows the identification of valid, novel, potentially useful, and ultimately understandable patterns in data [15]. Such patterns, which may take the form of models or relations, represent an added value with respect to a certain aspect of the analyzed system. Typical examples are prediction models or deeper insights about an economy or product system that allow for better management of it.
This research is necessary because large datasets cannot be understood at a glance, and they contain more information than is immediately apparent. Trends, regularities, and patterns can be revealed only after a complex data processing procedure. In particular, the KDD process is an activity made of different interaction and retrieval steps, which requires human action in certain phases. The process is commonly confused with Data Mining, which is only one of the KDD steps. The interactivity of the KDD process is related to the crucial role played by humans in the supervision and validation of the discovered information. Their contribution depends both on their expertise with mining tools and on their domain knowledge, that is, the ability to exploit the understanding of data to separate knowledge from irrelevant and incorrect data [15].
In particular, according to [15], the main steps composing the KDD process are:
• Definition of the goal of the KDD process from the customer's point of view, and understanding of the domain and of the a priori knowledge;
• Selection of the target data, among the available ones, on which the KDD process is performed;
• Data cleaning and preprocessing: this includes the basic operations of noise removal, the collection and merging of samples, and the accounting of date and time information;
• Data reduction and projection: the features of the samples are processed by adopting cardinality reduction or feature selection techniques aimed at either reducing the dataset to the most relevant features or finding invariant transformations of the data;
• Matching the goal of the KDD process to a particular data mining method (e.g., clustering, regression, classification, etc.);
• Selection of the data mining algorithm used to find patterns in the data, in consideration of the goal and of the available data;
• Execution of the data mining algorithm to search for patterns in the data;
• Interpretation of the mined patterns: this involves visualizing the mining results and possibly returning to the previous steps to adjust the patterns or select a different algorithm to improve the results;
• Knowledge consolidation: this consists of arranging the results in the most suitable form for either successive KDD processes or visual report generation for the customer.
The data mining step is the core of the KDD process and involves the repeated iteration of data mining algorithms, where the kind of applied algorithm depends on the goal to pursue. Goals can be classified as verification and discovery. When the objective is simply to validate the user's hypothesis, the goal is called 'verification', whereas, when the system is required to find new patterns, the goal is called 'discovery'. Furthermore, discovery refers to the following data mining tasks:
• Prediction: the goal is to develop patterns for predicting the behavior of certain features over a given forecasting horizon;
• Description: the goal is to develop patterns aimed at presenting data in a more understandable form.
Nevertheless, the boundary between the described goals of data mining is not sharp. Indeed, descriptive models could also be employed for prediction, and vice versa. Data mining methods span a wide spectrum of techniques, and the choice of one or more methods depends on the considered objective. The canonical classification considers the following families of methods [15]:
• Classification: learning a function that maps a data item to a certain class;
• Regression: learning a function that relates an observed set of input-output data, discovering possible functional relations;
• Clustering: grouping the data of a given set based on their similarity, by identifying samples (or patterns) with similar features;
• Summarization: finding a compact representation of multivariate data;
• Dependency Modeling: learning a model that describes the dependencies between variables in probabilistic and graphical terms;
• Change and Deviation Detection: learning a model to find differences or strong deviations measured in a flow process.
The outlined KDD process goals can be reached by constructing specific algorithms, which come in a large variety of typologies, all decomposable into three key concepts [15]: model representation, model evaluation, and search method. Model representation is the language employed to describe discoverable patterns, and it includes the data analyst's knowledge of the assumptions made in applying a certain model. This is fundamental because overly simplistic hypotheses about the process under study will lead to poor results regardless of the amount of data and training time.
The model evaluation criteria are the quantitative representation of how well a discovered pattern meets the goal of the KDD process. In the case of predictive models, evaluation is limited to assessing the accuracy of the estimated quantities with respect to the observed ones for each case. In the case of descriptive models, the evaluation concerns the novelty, utility, and understandability of the fitted model. Finally, once the models are selected and the evaluation criteria fixed, the search method is aimed at finding the parameters or the family of models that maximize the fixed objectives, reducing the task to an optimization problem. In particular, the employed data mining methods can be classified into the families described in the following.
Decision Trees are one of the most common methods employed in data mining for classification [16]. The goal of the method is to train a model that assigns a class to a sample by considering the values of its features. The model is based on partitioning the domain into sub-domains by applying tree branching. The process is extended to regression problems, where the target values are real numbers, in which case the methods are called 'regression trees' [17]. Nonlinear regression is instead based on developing predictive models that combine basis functions, such as polynomials, sigmoids, and splines [18]. Polynomial regression is one of the simplest approaches, and it aims at fitting a model by using curves of order n ≥ 2 (quadratic, cubic, etc.), while the spline approach aims at producing a piecewise model in which each piece is trained only with the values lying in a specified interval.
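As a hedged illustration of the domain partitioning performed by tree-based models, the sketch below fits a single-split regression tree (a "stump") by exhaustive search over thresholds. The function names and the temperature/load values are hypothetical and chosen only for this example; real regression trees apply the same split search recursively.

```python
def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing the squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbours
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def predict(stump, x):
    """Each sub-domain is represented by the mean target of its training samples."""
    thr, lm, rm = stump
    return lm if x <= thr else rm


# Hypothetical data: ambient temperature (feature) vs. load (target).
temps = [10, 12, 14, 26, 28, 30]
loads = [1.0, 1.1, 0.9, 2.0, 2.1, 1.9]
stump = best_split(temps, loads)
```

With these values, the search separates the cold samples from the warm ones, and `predict(stump, 27)` returns the mean load of the warm group.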
Artificial Neural Networks (ANNs) are the most representative class in the data-driven learning domain [19]. ANNs are parametric regression and classification models whose structure imitates the behavior and the topology of biological nervous systems, in particular their connections, and whose parameters are estimated in a supervised fashion by means of input-output examples of the task to be accomplished. In early traditional ANNs, the number of layers is limited; these are also called shallow neural networks. ANNs can also be used in combination with fuzzy logic to implement fuzzy neural networks, which are able to deal with the uncertainty of data more naturally [20]. Some recent studies considered the application of such networks to load forecasting [21], where the robustness of fuzzy logic in handling noisy and unreliable measurements is combined with the ability of ANNs to learn from numerical examples rather than from linguistic rules (as in the case of general fuzzy inference systems).
As an extension of shallow ANNs, deep ANNs have been proposed in the context of deep learning, which is widely applied to complex classification tasks typically involving a huge amount of data, as in the case of image-based datasets and information processing [22]. Here, the word 'deep' emphasizes that the learning process is based on successive layers of data representation. In most cases, the data transformation consists of hundreds of successive representation layers [23]. The enormous increase in data size has driven the emergence of deep learning algorithms and architectures in many power system applications. One of the most common architectures employed in the deep learning field is the Convolutional Neural Network (CNN) [24], which has shown great capability in dealing with large spatial data. Many libraries, such as TensorFlow, Torch/PyTorch, and Theano, have been developed for several programming languages, allowing users to reliably apply deep learning to their specific needs on CPU/GPU architectures. CNNs are largely employed in computer vision and for dealing with data having spatial relationships. The name derives from the mathematical operation of convolution, which is employed in specific convolutional layers. The data processing in a CNN aims at progressively extracting features from sub-samples of the original data, which have to be arranged in an input tensor. Thanks to this capability, CNNs have been widely employed in spatial load forecasting applications, such as in [25].
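To make the convolution operation concrete, the following minimal sketch computes a one-dimensional discrete convolution in pure Python, restricted to the positions where the kernel fully overlaps the signal. The signal and kernel values are hypothetical; note also that deep learning libraries typically implement convolutional layers as cross-correlation (no kernel flip) with learned kernels, so this is an illustration of the mathematical operation rather than of any specific library layer.

```python
def conv1d_valid(signal, kernel):
    """Discrete convolution, evaluated only where the kernel fully overlaps the signal."""
    k = len(kernel)
    flipped = kernel[::-1]  # true convolution flips the kernel before sliding it
    return [
        sum(flipped[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Hypothetical load samples and a kernel that responds to changes between neighbours.
load = [1.0, 2.0, 4.0, 7.0, 11.0]
diff_kernel = [1.0, -1.0]
features = conv1d_valid(load, diff_kernel)
```

Here each output value is the difference between two consecutive samples, a simple example of a local feature extracted from a sub-sample of the input.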
A special kind of ANN is the Recurrent Neural Network (RNN), which is capable of keeping the memory of the past in an internal state while it incrementally processes data; for this reason, RNNs have great potential for managing time series. It was developed based on the proposal in [26] in the framework of 'Reservoir Computing', and it has gained increasing consideration in speech and text recognition due to its capability to account for the whole dynamics of the process under study. In a basic RNN architecture, the output is generated by combining the input data with a recurrent connection. An RNN can be equivalently considered as many feed-forward ANNs operating sequentially to supply outputs over the time sequence to predict. Starting from randomized versions of shallow ANN architectures, as in the case of the Echo State Network (ESN) [27], several advancements have been developed over the years in order to overcome the limits of the RNN unit in the deep learning field. The Long Short-Term Memory (LSTM) network is the most popular approach to this end [28]. It is based on computational units whose basic structure is composed of a cell, which keeps the memory in the unit, and three regulators or gates, which manage the information flow inside the sequential units. They are called the input, output, and forget gates, but they are not present in all architectures. LSTM is particularly suited to deal with the vanishing gradient, a typical problem of deep learning [29]. Furthermore, another type of RNN unit, called the Gated Recurrent Unit (GRU), has been developed in order to avoid overfitting issues, increasing the forecasting accuracy as shown in [30].
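The role of the cell and of the three gates can be sketched with a single scalar LSTM unit. This is a minimal, assumption-laden example: the parameter layout (`params` mapping each gate name to input weight, recurrent weight, and bias) and all numerical values are hypothetical, and practical implementations operate on vectors with learned weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM unit; params maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information enters the cell
    f = sigmoid(pre("forget"))        # how much of the previous cell state is kept
    o = sigmoid(pre("output"))        # how much of the cell state is exposed
    g = math.tanh(pre("candidate"))   # candidate content computed from the inputs
    c = f * c_prev + i * g            # updated cell state (the unit's memory)
    h = o * math.tanh(c)              # updated hidden state (the unit's output)
    return h, c


# Hypothetical parameters: forget gate saturated open, input gate saturated closed.
keep_memory = {
    "input": (0.0, 0.0, -50.0),
    "forget": (0.0, 0.0, 50.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=keep_memory)
```

With the forget gate open and the input gate closed, the cell state is carried over essentially unchanged, which is the mechanism that helps the unit preserve information over long sequences.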
Among data-driven approaches to the regression/classification task, there are also nonparametric models based, for instance, on Case-Based Reasoning [31] and Nearest Neighbors regression or classification [32]. One of the main critical issues in this kind of application is adopting a well-defined metric for weighting the similarity between the stored examples and the query sample. Because of the increasing cardinality of databases, these methods are often supported by cardinality reduction techniques to avoid the so-called curse of dimensionality [33].
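A minimal sketch of nearest-neighbor regression is given below, assuming the Euclidean distance as similarity metric and an unweighted average of the k closest stored examples. Both choices, as well as the data, are assumptions made for illustration; other metrics and weighting schemes are equally possible.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k stored examples closest to the query (Euclidean metric)."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical stored examples: two-dimensional features and a scalar target.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
train_y = [1.0, 2.0, 3.0, 10.0, 12.0]
estimate = knn_predict(train_x, train_y, query=(0.2, 0.1), k=3)
```

Since no parameters are fitted, all the computational effort is deferred to query time, which is why the size of the stored database directly affects the response time.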
Probabilistic graphical models are employed for characterizing the dependency between variables, where the variable dependencies are taken into account via a graph structure. This approach was initially employed for categorical discrete variables and was then extended to continuous variables with Gaussian density. One of the most employed models is based on Bayesian networks, where the graphical relation between variables is expressed in the form of conditional probabilities, which can be either assigned by the expert system or learned from the observed data by applying inference procedures [34]. Finally, Relational Learning Models combine machine learning with first-order logic, defining Inductive Logic Programming. It is a form of investigation aimed at finding patterns and discovering insights in data. It is based on the employment of clausal first-order logic as a representation language for both data and hypotheses [35].

Research and Application Challenges in Smart Grids
The KDD process allows for harnessing big data in power systems across a large number of research fields, with possible applications ranging over the entire chain of the electric power infrastructure. In particular, the employment of big data in power systems can be seen from a holistic point of view, where the improvements produced by the discovered knowledge for each component of the system improve the reliability and flexibility of the overall system [36]. The main data streams in power system operation are generated by Supervisory Control and Data Acquisition (SCADA) systems, Phasor Measurement Units (PMUs), and the Advanced Metering Infrastructure (AMI) [10]. The SCADA system is widespread in power stations and power grids (transmission and distribution), and its measurement interval is on the order of a few seconds. The system collects a wide range of variables depending on the type of monitored system.
The PMU is a measurement device operating at a higher sampling frequency (30-60 measurements per second), which allows for acquiring voltage and current phasors synchronized with a common time reference (e.g., provided by a Global Positioning System). These devices are mainly deployed in transmission networks, where they represent the backbone of Wide Area Monitoring Systems (WAMSs) [37], and, more recently, in active distribution networks, where they are typically referred to as "micro-PMUs" [38]. Moreover, AMI is a system interacting with multiple metering sources (electric, heat, gas), which allows for collecting multiple heterogeneous variables in distribution networks. It is one of the most promising enabling technologies for demand response-based frameworks, since it allows interaction with home devices and IoT-based sensors [39].
The availability of different data sources, which characterize different subsystems in power grids, causes a deep heterogeneity in the corresponding data streams. In particular, the latter can be classified as:

• Raw waveform data (voltages and currents, exchanged active and reactive power at buses, conductor temperature, etc.);
• Preprocessed waveforms (voltages and currents, weather parameters over the grid);
• Status variables of system components;
• Consumer consumption/distributed generation data;
• Power plant operation and energy bidding data;
• Electricity market data.
To extract actionable information from this large set of heterogeneous data, many papers outline the potential role of big-data-based knowledge discovery in solving several power system operation problems. Predictive maintenance, process and control optimization, and the analysis and prediction of electricity market prices have been addressed by resorting to the KDD process. Furthermore, the spread of Variable Renewable Energy (VRE) power plants has extended the application of KDD to the temporal and spatial prediction of wind/solar power profiles over several forecasting horizons [40][41][42]. In addition, harnessing visualization and data description in the KDD process allows for introducing advanced and exhaustive analyses of the forecasting performance, by adopting a rigorous comparison of metrics and statistical tests for accuracy and performance analysis.
The enhancement of accuracy in VREs forecasting is a clear example of the previously described holistic approach, where a reduction in uncertainty in the power generation amount leads to benefits for all systems, by reducing the cost related to the reserve procurement. Furthermore, KDD is useful in the estimation and forecasting of the water amount in hydroelectric power stations. In transmission networks, the role of KDD is related to the detection [43], classification, and analysis of faults, detection of the most sensitive substation to disturbances [44], impact of severe weather events on the network for resilience study, and analysis of conductor temperature for Dynamic Thermal Rating (DTR) application.
Generally, distribution networks still do not have the same density of installed sensors as transmission networks. However, the increase in Distributed Generation (DG) and in complex loads participating in demand response requires an effort to improve the communication infrastructure of the distribution grid [45,46]. In this sense, KDD plays an enabling role in extracting valuable information from the limited number of available data streams. An example is power system state and topology estimation, where the graph configuration of the distribution network is identified by analyzing voltage measurements at buses in the presence of radial networks with active connections and switchable root nodes [47].
The KDD process provides important support in characterizing the load profiles in distribution grids, especially for those hosting a large capacity of DG [48]. In particular, the characterization of the Net Load (Demand minus DG) and its forecasting represent one of the greatest challenges in the management of grid flexibility. Energy consumer profiling is largely supported by clustering techniques and auto-correlation analysis [49]. Finally, the KDD process is employed for electricity market analysis by both power generation companies and customers in order to reveal useful insights for developing advanced bidding strategies in electricity markets [50].

Cardinality Reduction and Data Compression
The large-scale diffusion of sensor networks in Smart Grids raises severe issues in data storage and transmission, which affect many online applications, such as load flow studies, state estimation, and contingency analysis. Despite the improvements in data transmission capabilities, this massive amount of streaming data may cause bottlenecks in communication networks, which are not infrequent in Smart Grids, where the development of dedicated wide area communication networks is not feasible due to the presence of widely dispersed energy resources on both the customer and the distributed generation side. In this context, adopting techniques for reducing the volume of data is crucial to satisfy the time constraints of the required data processing. Clearly, the type of data compression depends on the specific needs, such as the data type (numerical or categorical variables), whether the process is lossy or lossless, etc. [51].
In particular, the reduction process for data compression can act on: (i) features; (ii) samples. In the first case, the compression is performed by acting on the features of the processed dataset. The most employed linear techniques are Factor Analysis (FA) and Principal Component Analysis (PCA), whereas nonlinear approaches include Locally Linear Embedding (LLE), Isomap, and their derivatives [51]. The aim of these methods is to transform the original variables into new ones through a combination of them according to the principles of the adopted method, so that the data cardinality is reduced by discarding the most irrelevant or redundant features. Further techniques, such as minimum Redundancy Maximum Relevancy (mRMR) [52], aim to extract a subset of the variables from the original dataset. The extracted variables have the highest mutual dependency with respect to a target in the dataset, as measured by statistical information metrics.
Data sampling is the simplest form of sample reduction, because it naïvely extracts a subset of samples from the original dataset by applying simple rules [53]. On the contrary, data squashing produces artificial samples having the same statistical moments as the original data [54]. Data clustering aims at grouping samples with common features. The number of developed clustering algorithms is very large, with effective results in the task of classification. Binning methods consist of transforming a continuous variable into a categorical one, with approaches ranging from naïve methods to statistically based ones.
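As an example of the naïve end of the binning spectrum, the sketch below maps a continuous variable to equal-width categories. The function name and the sample values are hypothetical; statistically based variants (for instance, quantile-based bins) would replace the equal-width rule.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # a constant variable falls into a single bin
    width = (hi - lo) / n_bins
    # The maximum lies on the upper edge, so it is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical load measurements in kW.
loads_kw = [0.0, 1.0, 2.5, 7.0, 10.0]
labels = equal_width_bins(loads_kw, 4)
```

The resulting integer labels can then be treated as a categorical variable, trading resolution for a smaller data volume.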
In this domain, Principal Component Analysis is one of the most employed methods for linear data reduction [55]. It is an unsupervised process that projects the data from the original space into a lower-dimensional one, where the axes of this new space, called Principal Components (PCs), are computed by combining the original variables. The first PC is oriented along the direction of maximum variance of the data [18]. Mathematically, this corresponds to finding the vector $a = [a_1, \ldots, a_n] \in \mathbb{R}^n$ onto which a generic data pattern $x$ is projected, so as to maximize the variance of the projection $z$:

$$z = a_1 x_1 + \cdots + a_n x_n = a^T x. \quad (1)$$

It can be proved that the variance of $z$ is maximized when $a$ is the eigenvector of $\mathrm{var}(x)$ corresponding to its largest eigenvalue; thus, in the case of basic PCA, the algorithmic procedure is the following for a given matrix $X$ with dimensions $[N, f]$, where $N$ and $f$ are the number of samples and features, respectively:
1. Normalize the data matrix $X$ so that each column of the normalized matrix $\tilde{X}$ has zero mean and unit variance;
2. Compute the Singular Value Decomposition of $\tilde{X}$:

$$\tilde{X} = U D V^T, \quad (2)$$

where $U$ is an orthogonal matrix of order $N$, $D$ is a rectangular diagonal matrix with dimensions $[N, f]$, whose diagonal elements satisfy $d_1 \geq d_2 \geq \ldots \geq d_f$, and $V$ is an orthogonal matrix of order $f$;
3. The new variables in the lower-dimensional space are computed by choosing the first $k \leq f$ columns of the matrix $Z$, where:

$$Z = \tilde{X} V = U D. \quad (3)$$

There are many ways to choose the optimal number of PCs; one of them is to consider the percentage of variance retained by the chosen components, where a value greater than 95% is considered satisfactory.
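The three steps above can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions: it uses the thin SVD (which yields the same principal component scores as the full decomposition), assumes that no column of the data matrix is constant, and selects the smallest number of components whose cumulative variance share reaches the threshold. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def pca_svd(X, var_threshold=0.95):
    """Return (Z_k, k, explained): scores on the first k PCs and the variance share of each PC."""
    X = np.asarray(X, dtype=float)
    # Step 1: standardize each column (zero mean, unit variance); assumes non-constant columns.
    X_tilde = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: thin SVD, X_tilde = U @ diag(d) @ Vt, with d sorted in decreasing order.
    U, d, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    # Step 3: principal component scores, Z = X_tilde @ V (equal to U @ diag(d)).
    Z = X_tilde @ Vt.T
    explained = d**2 / np.sum(d**2)
    # Smallest k whose cumulative explained variance reaches the threshold.
    k = int(np.searchsorted(np.cumsum(explained), var_threshold) + 1)
    k = min(k, len(d))
    return Z[:, :k], k, explained


# Hypothetical data: three strongly correlated features driven by one latent signal.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.01 * rng.normal(size=200),
    -t + 0.01 * rng.normal(size=200),
])
Z_k, k, explained = pca_svd(X)
```

Because the three synthetic features are almost collinear, a single component is expected to retain nearly all of the variance, so the reduced dataset has one column instead of three.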
PCA-based methods have been applied to reduce the computational burden in a large number of smart grid applications. In particular, in [5], PCA was applied to solve power flow and optimal power flow problems in large-scale power systems. In this study, a new formalization of the system equations in the PCA domain allowed for reducing the problem cardinality by identifying the hidden relations between the state variables obtained from the analysis of the historical problem solutions. Furthermore, PCA has proved to be effective in wide-area smart grid monitoring, where it allows for developing effective online power system security analysis by reducing the complexity of the contingency screening process [56]. Other interesting application domains include the definition of bidding strategies for wind power generators, where PCA has been applied to find hidden correlations between spatially distributed wind farms [57], and the development of spatial and temporal wind power forecasting tools based on Knowledge Discovery from large datasets [41].
If cardinality reduction does not supply adequate results, an alternative is represented by feature selection techniques. Differently from the former, the latter do not transform the original variables, but restrict the original dataset to the most relevant features according to a certain metric [29]. In the literature, early research selected the features that maximize the mutual information between them and a target variable, which is called the maximum relevancy strategy. Unfortunately, several works have shown that the features selected by maximum relevancy do not guarantee the best prediction accuracy [58]. The reason is that feature redundancy is neglected. Considering this, a trade-off between lower redundancy and greater relevancy was pursued through the development of the minimum Redundancy Maximum Relevancy (mRMR) technique [59] to overcome the limit of maximum relevancy. Mathematically, applying the mRMR technique corresponds to maximizing the following function:

$$\max_{x_j \in X \setminus B} \left[ I(x_j; v) - \frac{1}{|B|} \sum_{x_i \in B} I(x_j; x_i) \right], \quad (4)$$

where $X$ is the set of generic features, $B$ is the set of the features already considered, $d$ is the number of desired best features (the selection is repeated until $B$ contains $d$ features), $v$ is a generic target variable, $I(\cdot)$ is the mutual dependency function, $x_j$ is a candidate feature of $X$ not yet included in $B$, and $x_i$ is a generic feature of $B$. In (4), the left member in the parenthesis is the relevancy between the jth feature and the target variable, whilst the right member is the redundancy between the jth feature and the others of B.
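The trade-off between relevancy and redundancy can be illustrated with a greedy selection loop. The sketch below is a simplified example, not the implementation of [59]: it assumes discrete (already binned) variables, estimates mutual information from empirical frequencies, and averages the redundancy over the features already selected. All names and data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Empirical mutual information (in nats) between two discrete sequences of equal length."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (pa[x] * pb[y]))
        for (x, y), c in pab.items()
    )


def mrmr(features, target, d):
    """Greedy mRMR: features maps name -> discrete sequence; returns up to d names in selection order."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected = []
    while len(selected) < min(d, len(features)):
        def score(name):
            if not selected:
                return relevance[name]  # first pick: maximum relevancy only
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Hypothetical example: the target depends on two independent binary features;
# "a_copy" duplicates "a" and is therefore relevant but fully redundant.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
v = [2 * x + y for x, y in zip(a, b)]
chosen = mrmr({"a": a, "a_copy": list(a), "b": b}, v, d=2)
```

A pure maximum relevancy ranking could not distinguish the duplicated feature from the complementary one, whereas the redundancy penalty leads the loop to pick `b` as the second feature.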

Proposed Methodology
The Knowledge Discovery Process (KDP) aims at extracting useful hidden information from the available data. Here, usefulness stands for the ability to supply an advantage to the user. In particular, revealed information is useful when it is used either to gain direct knowledge from its visualization or as the input of a further processing stage in order to extract new knowledge, as in the case of prediction models. For this reason, the proposed methodology aims at proving the capability to develop an accurate, fully data-driven model based on KDP for multi-temporal forecasting.
When the number of time steps ahead to predict increases, the challenge is to characterize the behavior of the signal in order to catch correlations over different periods. Hence, harnessing the hidden information content of the available data is crucial for developing a good forecasting model, since raw data are seldom suitable for immediate, effective use. For this reason, the proposed methodology, whose workflow scheme is reported in Figure 1, includes: (i) a tool for transforming date and time information into numerical predictor variables; (ii) a feature engineering procedure; (iii) a tool for recasting the time series prediction problem as a supervised learning one; (iv) a feature selection procedure; (v) two predictive models based on random forest and lazy learning; (vi) time rolling window validation; (vii) statistical analysis of the results.

1. Time-referenced datasets about load consumption acquired by smart meters and customer substations are valuable information sources for capturing the user behavior profile. Generally, electric load trajectories show similar patterns according to the season, the day type (workweek, weekend, and annual holidays), and the load type (households, tertiary sector, industrial, etc.). Electricity prices, weather conditions, and occasional social events complete the list of phenomena affecting the electric load. Much of this information, such as the date-time, is available in string or character format and needs adequate transformations to allow the application of regression models. Given a date-time sample, a simple preprocessing step allows for extracting several useful coded variables from its timestamp, such as the day type, season, month, day of the week, day of the month, and so on.
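A minimal sketch of this preprocessing step, using only the Python standard library, is reported below. The timestamp format, the selected variables, and the season coding (meteorological seasons of the northern hemisphere) are assumptions made for illustration; holidays would require an external calendar and are not handled here.

```python
from datetime import datetime


def datetime_features(stamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Turn a date-time string into numerical predictors (the chosen set is illustrative)."""
    t = datetime.strptime(stamp, fmt)
    return {
        "month": t.month,
        "day_of_month": t.day,
        "day_of_week": t.weekday(),            # Monday = 0 ... Sunday = 6
        "hour": t.hour,
        "is_weekend": int(t.weekday() >= 5),   # coarse day-type indicator
        # Meteorological seasons, northern hemisphere: 0 = winter ... 3 = autumn.
        "season": (t.month % 12) // 3,
    }


features = datetime_features("2019-07-20 18:30:00")
```

Each string timestamp is thus replaced by a small set of integers that regression models can use directly as predictors.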

2.
A raw time series matrix $Y_0$, with $n_0$ samples and $c_0$ variables (or features), is often affected by noise or chaotic behavior, which prevents a clear understanding of the signal trajectory over time. Excessive volatility needs to be managed in order to obtain more stable signals that are able to catch the time series trend. Feature engineering moves in this direction by extracting a large number of hidden features and smooth signals from the original time series, producing the matrix $Y$, which has dimensions $[n_0, c]$, with $c > c_0$. Table 1 summarizes the main smoothing variables used in the literature and the corresponding equations. For the sake of clarity, the matrix dimensions are summarized in Table 2.
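The smoothing step can be sketched as follows, assuming a trailing rolling window of size m and q = 3 derived features per raw variable (smoothing average, rolling upper bound, rolling lower bound). The window alignment and the function names are assumptions, since the equations of Table 1 are not reproduced here.

```python
def rolling_features(z, m):
    """Trailing-window mean, max, and min of a series z with window size m."""
    mean, upper, lower = [], [], []
    for t in range(len(z)):
        window = z[max(0, t - m + 1): t + 1]   # shorter at the start of the series
        mean.append(sum(window) / len(window))
        upper.append(max(window))
        lower.append(min(window))
    return mean, upper, lower

def engineer(Y0, m):
    """Append q = 3 smoothed columns per raw column, so that c = c0 * (1 + q)."""
    columns = [list(col) for col in zip(*Y0)]      # column-wise view of Y0
    for col in list(columns):                      # iterate over the raw columns only
        columns.extend(rolling_features(col, m))
    return [list(row) for row in zip(*columns)]    # back to a row-wise matrix
```

With two raw variables, `engineer` returns a matrix with 2 · (1 + 3) = 8 columns and the same number of rows as the input.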

3.
The supervised learning approach to time series forecasting requires a transformation of the data, which are usually arranged in matrix form. Preparing data for this approach requires producing an input-output pair for each sample t (the jth row of matrix $Y$), which considers a portion of the predictor trajectories (how many past samples are considered as process memory) and the forecasting horizon of the target variables (how many samples ahead we want to predict) (Figure 2). The embedding procedure is a map between the samples of a time series: given an input matrix $Y$, an auto-regressive lag L, a delay d, and a forecasting horizon H, it produces two matrices, $P$, whose dimensions are $n_1$ and p, and $R$, whose dimensions are $n_1$ and r, called the predictor and target matrices, respectively. The parameter r is the product of $c_r$ and H, where $c_r$ is the number of variables in $Y$ to predict. The delay is crucial since it shifts the most recent available sample into the past with respect to time t. A rough indication of the value of L can be obtained from the signal auto-correlation analysis. $P$ and $R$ are then split into $P_t$, $P_v$, $R_t$, and $R_v$, which are the training and test sets of the predictor and target matrices. For the sake of clarity, the variable list is summarized in Table 2.
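The embedding can be illustrated with the sketch below. It assumes that the predictors of sample t are the L rows of $Y$ ending at t − d, that the targets are the next H values of each target column, and that the columns of $R$ are grouped per target variable; these indexing conventions and the function name are assumptions, not the authors' exact definition.

```python
def embed(Y, target_cols, L, d, H):
    """Map a row-wise series Y into a predictor matrix P and a target matrix R."""
    P, R = [], []
    n = len(Y)
    for t in range(L + d - 1, n - H):
        last = t - d                                 # most recent usable sample
        lags = Y[last - L + 1: last + 1]             # L rows of process memory
        P.append([value for row in lags for value in row])
        # targets grouped per variable: H consecutive steps for each target
        R.append([Y[t + h][c] for c in target_cols for h in range(1, H + 1)])
    return P, R
```

Under these assumptions, p = c · L, r = $c_r$ · H, and $n_1$ = n − H − L − d + 1; a chronological cut of the rows of P and R then gives the training and test sets.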

4.
The previous steps cause a huge increase in the number of variables; indeed, L new predictors (the lagged variables in the past) are produced for each starting variable (column of $Y$). Unfortunately, this cardinality growth may degrade the prediction accuracy, since high-dimensional data lead to the previously described "curse of dimensionality", which hinders the correct operation of learning models. For this reason, techniques for cardinality reduction, such as PCA, and for feature selection, such as MRMR, were considered. As described, the main difference between them is that the former produces a new set of uncorrelated variables in the PCA domain, whilst the latter extracts the variables that are most relevant to a target variable and least redundant among themselves, without transforming the original dataset. The reduced predictor training and test matrices are denoted by $P_{t,r}$ and $P_{v,r}$.

5.
Two different machine-learning models, Lazy Learning [60] and Random Forest Regression [61], are assessed in this methodology. Random Forest (RF) originates from bootstrap aggregation (bagging), a technique aimed at reducing the variance of the prediction function by averaging several prediction functions trained on samples randomly extracted from the dataset. RF extends this concept to the features in order to build decorrelated trees, where a random selection of variables is considered for each split. On the contrary, a Lazy Learning model such as K-Nearest Neighbors is based on local regression, where the predictor training set is used to extract the samples nearest to a given query. These neighbors and the corresponding targets are then used to build a local learner that supplies the prediction. Since the nearest neighbors are chosen according to a distance metric, cardinality reduction is crucial to limit the number of dimensions (features) considered in the distance computation. Given the multi-step nature of the problem, a direct strategy was applied, which, even if it requires more computational effort than an iterative approach, is less subject to error accumulation. Hence, the multi-step load forecasting problem was decomposed into H MISO problems, one for each time step ahead.

6.
An exhaustive validation of the proposed methodology requires testing on a large number of cases in order to appreciate how the accuracy spreads as the training and test sets change. For this reason, a time-rolling window validation was employed to slice $Y$ into the ith training and test sets, according to a sequence of splitting points.

7.
The model performance data were analyzed in order to assess the effectiveness of the proposed methodology, where a Naive model was considered as a benchmark. The tests were performed by progressively increasing the forecasting horizon. The MSE was computed for each sample (jth row of $R^{(i)}_v$) and for the wth target variable over the considered forecasting horizon span according to (5), where $\hat{R}^{(i)}_v$ is the predicted value matrix for the ith test case, $w \in [1, c_r]$ is an indexing variable used for slicing over the columns of both $\hat{R}^{(i)}_v$ and $R^{(i)}_v$ in order to extract the forecasting horizon span for the wth target, and $n_v$ is the number of rows of $\hat{R}^{(i)}_v$.

8.
The aggregate data are analyzed by means of statistical tests such as the Friedman test [62]. The aim is to assess whether the models perform differently or not. In particular, the Friedman test is a non-parametric randomized block analysis of variance, where the null hypothesis $H_0$ states that all methods have the same error distribution. The test does not make any assumption about the data distribution. If the test rejects the null hypothesis, the Tukey-based Post Hoc test is performed in order to analyze the difference between the performance of each pair of models. In particular, Tukey's test supplies an upper triangular matrix where the elements are sorted by accuracy rank. This information is processed to produce useful visualizations for the choice of the best model.
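The error metric of step 7 and the aggregate test of step 8 can be sketched as follows. The sketch assumes that the columns of the target matrices are grouped per target variable (H consecutive columns for each target), which is an assumption about the slicing in (5), and it computes the classical Friedman chi-square statistic without tie correction; the function names are hypothetical.

```python
def horizon_mse(R_true, R_pred, H):
    """MSE per test sample j and target w, averaged over the H-step span.

    Assumes columns [w*H, (w+1)*H) hold the horizon span of target w (0-based).
    """
    c_r = len(R_true[0]) // H
    out = []
    for true_row, pred_row in zip(R_true, R_pred):
        row = []
        for w in range(c_r):
            span = range(w * H, (w + 1) * H)
            row.append(sum((true_row[k] - pred_row[k]) ** 2 for k in span) / H)
        out.append(row)
    return out

def friedman_statistic(errors):
    """Friedman chi-square for a (blocks x models) error table, assuming no ties."""
    n, k = len(errors), len(errors[0])
    rank_sums = [0.0] * k
    for block in errors:
        order = sorted(range(k), key=lambda m: block[m])
        for rank, m in enumerate(order, start=1):    # rank 1 = lowest error
            rank_sums[m] += rank
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
```

Here each block is one test case and each column one model. The statistic is compared against a chi-square distribution with k − 1 degrees of freedom; in practice a library routine such as `scipy.stats.friedmanchisquare`, which also handles ties, would normally be preferred.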

Table 1. Smoothing variables.
Feature | Equation | Notes
smoothing average | | t and z are the generic time sample and raw variable, respectively; m is the size of the rolling window
rolling upper bound | max |
rolling lower bound | min |

Table 2. Matrix dimensions.
Matrix | Dimensions | Notes
Y | [n_0, c] | signal matrix after feature engineering process; c = c_0 · (1 + q), where q is the number of features made per variable of Y_0
Y^(i) | [n, c] | slice of Y used in the ith test case
where c_r is the number of target variables

Case Study
The proposed methodology was tested on a large dataset generated by a pervasive smart meter network deployed at a large commercial user located in the south of Italy, whose main features are summarized in Figure 3. In particular, the heat maps show the level of active power, reactive power, and power factor over the whole day for a full month. The active power heat map (top inset) reveals that the highest consumption level is mainly related to the time window 8-18. The central inset shows the reactive power level, whose pattern deviates from that of the active power. This is confirmed by the bottom inset, which depicts the distribution of the power factor over the day. (1-30), in hours and minutes. The target variable is the active power for an assigned forecasting horizon. Furthermore, given the 5 min time resolution of the data, a 1 h forecasting horizon corresponds to 12 steps ahead. The simulations were performed on an Intel® i7-9700 CPU, running a single-core instance of R. Two case studies, named 'A' and 'B', were conducted on the prediction of the three-phase active power. In particular, case A analyzed three forecasting horizons, H = {12, 72, 144}, which at the 5 min resolution are equivalent to 1, 6, and 12 h ahead; for the sake of conciseness, the case study set-up is reported in Table 3. Several forecasting horizons were chosen to assess how the accuracy of the methodology changes as the forecasting horizon increases. The considered values are related to the most frequent time constraints for the submission of offers in electricity markets, considering the possibility for the utility to participate in energy/ancillary services markets. Clearly, this kind of forecasting may be used to manage the utility and to schedule activities in response to external needs, for instance to avoid system stress conditions caused by very high load levels.
For each forecasting horizon, the raw data were processed according to the described pipeline.
The raw dataset $Y_0$ was normalized, and smoothed variables were computed for each dataset variable over different time lags and added to the available variable set. According to a generated set of splitting points, a subset $Y^{(i)}$ was extracted and transformed into the predictor and target matrices $P^{(i)}$ and $R^{(i)}$ through the embedding procedure. Each of these matrices was consequently split into $P^{(i)}_t$, $P^{(i)}_v$, $R^{(i)}_t$, and $R^{(i)}_v$, which are the training and test sets of the predictor and target matrices.
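The generation of the ith training and test sets from a sequence of splitting points can be sketched as below. A fixed-size sliding window is assumed here (an expanding window would be an equally plausible reading), and the parameter names are hypothetical.

```python
def rolling_splits(n, train_size, test_size, step):
    """Yield (train, test) index ranges for a window sliding along n samples."""
    start = 0
    while start + train_size + test_size <= n:
        split = start + train_size                   # ith splitting point
        yield range(start, split), range(split, split + test_size)
        start += step
```

Each pair of ranges selects the rows of the ith slice that feed the embedding procedure, so the test samples always follow the training samples in time.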
Since it is not reasonable to expect all predictors to carry the same information, PCA and MRMR were considered to reduce the dataset to the most meaningful variables. On the basis of preliminary results, MRMR was selected, since PCA showed a reduced capability to reconstruct the predictor test matrix in the presence of highly noisy data, which reduced the prediction accuracy. Unfortunately, the adoption of a direct prediction strategy requires a number of models equal to the number of time steps ahead to predict. Consequently, MRMR would have to be performed the same number of times in order to find the predictors most correlated to the hth time step ahead. For this reason, a sub-optimal solution was adopted: MRMR was applied only once, between the predictor training matrix $P^{(i)}_t$ and the time step ahead nearest to the half width of the forecasting horizon, where the optimal number of selected features was chosen by a preliminary analysis.
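A greedy relevance-minus-redundancy selection in the spirit of MRMR can be sketched as follows. For simplicity, the sketch measures both relevance and redundancy with the absolute Pearson correlation, whereas MRMR is commonly formulated with mutual information; this substitution, the difference-based score, and the function names are assumptions.

```python
from math import sqrt

def abs_corr(a, b):
    """Absolute Pearson correlation between two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return abs(cov) / sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def mrmr(columns, target, f):
    """Greedily pick f column indices with high relevance and low mean redundancy."""
    relevance = [abs_corr(col, target) for col in columns]
    selected = [max(range(len(columns)), key=lambda j: relevance[j])]
    while len(selected) < min(f, len(columns)):
        def score(j):
            redundancy = sum(abs_corr(columns[j], columns[s]) for s in selected)
            return relevance[j] - redundancy / len(selected)
        rest = [j for j in range(len(columns)) if j not in selected]
        selected.append(max(rest, key=score))
    return selected
```

In the single-application variant described above, `columns` would hold the columns of the predictor training matrix and `target` the column of the target training matrix for the step closest to H/2.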
Once the f most meaningful predictors were selected, the supervised learning models were trained on the training set matrices. In particular, Random Forest, Lazy Learning, and Naive models were compared. The Naive model supplies each prediction over the forecasting horizon by averaging the available samples according to (7), where h is the hth time step ahead and g is the number of samples considered for computing the expected value.
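A minimal sketch of such a benchmark is given below, assuming that every step h of the horizon is predicted with the mean of the last g available samples. Whether (7) depends on h in any other way cannot be inferred from the text, so the flat forecast is an assumption.

```python
def naive_forecast(history, g, H):
    """Predict each of the H steps ahead with the mean of the last g samples."""
    window = history[-g:]                 # the g most recent available samples
    level = sum(window) / len(window)
    return [level] * H                    # same expected value for every step h
```

For example, with g = 3 the forecast is the average of the three most recent load samples, repeated over the whole horizon.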
Case study B changes the resolution of the described data from 5 min to 30 min in order to reduce both the high volatility of the time series, which is shown by Figures 2-5, and the computational costs. In particular, the tested forecasting horizons are 2, 3, and 6 h, which correspond to H = 4, 6, and 12; the experimental set-up is summarized in Table 4. In this case, RF and Lazy Learning are run after reducing the predictor training set with both MRMR and PCA. An important difference between cases A and B concerns the choice of the best features by MRMR. Indeed, thanks to the lower computational cost deriving from the coarser time resolution, applying MRMR for each step of the forecasting horizon span becomes feasible. Furthermore, this case study includes a further Naive model (Naive 2), where the predicted value for a certain time of the day is computed by averaging the values observed at the same time in the previous days. This model was added because it works differently from traditional time series forecasting models, so that differences in performance behavior can be highlighted.
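The change of resolution and the Naive 2 benchmark can be sketched as follows. Block averaging is assumed for the 5 min to 30 min conversion (factor 6), and the Naive 2 sketch assumes that the horizon does not exceed one day and that at least one full day of history is available; names and parameters are hypothetical.

```python
def downsample(series, factor):
    """Average consecutive blocks, e.g. factor 6 turns 5 min data into 30 min data."""
    usable = len(series) - len(series) % factor      # drop an incomplete last block
    return [sum(series[k:k + factor]) / factor for k in range(0, usable, factor)]

def naive2_forecast(history, steps_per_day, H, days):
    """Predict each step ahead by averaging the same time slot of the previous days."""
    n = len(history)
    forecast = []
    for h in range(1, H + 1):
        t = n - 1 + h                                # index of the step to predict
        past = [history[t - k * steps_per_day] for k in range(1, days + 1)
                if 0 <= t - k * steps_per_day < n]
        forecast.append(sum(past) / len(past))
    return forecast
```

At the 30 min resolution, `steps_per_day` would be 48, so that H = 4, 6, and 12 cover the 2, 3, and 6 h horizons of case B.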


Case A
In the analyzed case studies, Random Forest and Lazy Learning show a better prediction accuracy than the Naive model, especially for large forecasting horizons, as shown by the boxplots in Figure 4. In the left plot of Figure 4, however, the Naive model performs similarly to the more complex ones. Indeed, a very noisy signal may compromise the entire data processing chain, decreasing the prediction accuracy of the more complex models, which try to catch relationships inside the data. Conversely, noise does not affect the Naive model in this way, since it neglects any form of data analysis.
The Naive model predicts the whole forecasting horizon by performing a simple moving average of the available past samples of the signal to be predicted; as a consequence, its performance degrades dramatically as the forecasting horizon increases, as shown in Figures 5 and 6. In particular, these figures show the actual and predicted load trajectories for two samples of the forecasting horizon span, where the volatility of the signal is well highlighted by the actual signal trajectories (red lines).
The computational burden rises linearly with the forecasting horizon, where the maximum waiting time is 3 min for predicting a test target matrix of 300 time samples with a forecasting horizon span of 12 steps per time sample. Each time sample ahead was predicted by applying a direct strategy for both the Random Forest and Lazy Learning models. According to the set-up of the rolling window, the number of test cases is a, b, and c for the three forecasting horizons (H = 12, 72, and 144), respectively.
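The direct strategy can be sketched as below for a single target variable, using a plain k-nearest-neighbor local mean as a simplified stand-in for the Lazy Learning model; the Euclidean distance, the value of k, and the function names are assumptions. One independent single-output model is queried for each step ahead, which is why the cost grows linearly with H.

```python
def knn_predict(P_train, y_train, query, k):
    """Local mean of the targets of the k nearest training samples."""
    dist = [sum((a - b) ** 2 for a, b in zip(row, query)) for row in P_train]
    nearest = sorted(range(len(P_train)), key=lambda i: dist[i])[:k]
    return sum(y_train[i] for i in nearest) / len(nearest)

def direct_forecast(P_train, R_train, query, H, k=3):
    """Direct strategy: one independent single-output model per step ahead."""
    forecast = []
    for h in range(H):
        y_h = [row[h] for row in R_train]            # target of the hth MISO problem
        forecast.append(knn_predict(P_train, y_h, query, k))
    return forecast
```

Replacing `knn_predict` with any other single-output regressor leaves the decomposition into H problems unchanged.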

Case B
Case study B is focused on comparing the performance of PCA and MRMR in forecasting applications. In particular, their effectiveness appears related to the type of coupled machine learning algorithm, as shown in Figure 7. Indeed, PCA performs well in combination with Random Forest, whereas the combination of PCA with Lazy Learning shows the worst performance for all forecasting horizons. It is interesting to note that the MRMR-based models perform better than any PCA-based model. These results are confirmed by observing the trajectories for the considered forecasting horizons (Figures 8-10). The latter figures show that the Lazy Learning-PCA model is less accurate than the other forecasting models because it is unable to follow the true value (ytest). On the contrary, Lazy Learning in combination with MRMR produces an accurate forecast, where the actual and predicted trajectories are often very close. The low accuracy of PCA-based models may be related to the limited capacity of PCA to transform new data from the original domain, where Recursive PCA may improve the accuracy [63].
According to the workflow in Figure 1, the model performance is aggregated by means of the Friedman test. Since the null hypothesis is rejected for each forecasting horizon, the Tukey-based Post Hoc test is performed to check dissimilarities between each pair of models. In particular, as observed in Figure 11, the Post Hoc test outputs are fused in a heat map according to the KDD principles. The heat map has the models arranged according to the Post Hoc rank on both axes, where each cell is the result of the Post Hoc test between two models. The first element of the rank is placed in the lower left corner. A green cell means that the two models perform equally, whereas an orange cell means that they are statistically different. In particular, for H = 2 h, the MRMR-based Lazy Learning model is the most accurate one, but its performance cannot be considered significantly different from that of the second model in the rank, which is the MRMR-based Random Forest. The Post Hoc tests for H = 3 h and H = 6 h do not show relevant differences with respect to H = 2 h. In conclusion, such a visualization is effective because it allows a rapid understanding of the performance differences between two models, supporting the decision maker in the choice of the most suitable model.

Figure 11. Visualization of a Post Hoc Test.

Critical Discussion
According to both the description of the methodology workflow and the obtained results, the main advantage of this framework is its generalization capability. Indeed, the authors similarly addressed forecasting problems in different environments, such as wind power forecasting [64].
Furthermore, as happens in every machine learning framework, one of the drawbacks is that the prediction accuracy depends on the features of the training/validation sets. If patterns unseen during training appear in the validation set, it is highly probable that the forecasting accuracy will decrease. In this case, the decision maker is supported by the KDD in the preliminary data-analysis steps, which allow for recognizing possible seasonal cycles in the target profile. This allows for a correct tuning of the model, by considering an adequate size of the training set or a certain number of smooth/lagged variables.
In particular, one of the potential limits is the processing of data that evolve without a definite pattern over time. Conversely, in the case of utility load, where the daily consumption profile follows similar schemes, the methodology works well also for long forecasting horizons, since it is not hard to find correlations between the predictors and the target over time.
In the presence of highly volatile data, a reasonable approach may be to combine different models, based on different learners or trained with different data features. In particular, adaptive ensemble forecasting, where the forecast is supplied by averaging the predictions of single learners according to weights reflecting their local accuracy, may increase the prediction accuracy without resorting to complex and time-consuming deep learning models.
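One possible form of such an adaptive ensemble is sketched below, assuming that the local accuracy of each learner is measured by its MSE over a recent window of errors and that the weights are proportional to the inverse of that MSE. The weighting rule and the names are assumptions chosen for illustration; other schemes are equally possible.

```python
def adaptive_ensemble(predictions, recent_errors, eps=1e-9):
    """Weighted average of learner forecasts with weights proportional to 1 / recent MSE."""
    mse = [sum(e * e for e in errs) / len(errs) for errs in recent_errors]
    raw = [1.0 / (m + eps) for m in mse]             # eps avoids division by zero
    total = sum(raw)
    weights = [r / total for r in raw]               # weights sum to one
    H = len(predictions[0])
    combined = [sum(w * p[h] for w, p in zip(weights, predictions)) for h in range(H)]
    return combined, weights
```

Recomputing the weights at each rolling-window step lets the ensemble follow whichever learner has recently been more accurate.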

Conclusions
In increasingly connected and liberalized power systems, the volume of exchanged information is growing dramatically, generating massive data sets that may deteriorate the effectiveness of traditional exploration and data mining tools in supplying useful knowledge to power system stakeholders. For this reason, we explored the current scenario of the employment of artificial intelligence in smart grids, with particular interest in decision support systems and data extraction.
To this aim, this review characterizes the employment of artificial intelligence in power systems, analyzing the main critical issues and the most relevant KDD-based methodologies, and exploring their advantages and drawbacks. At the same time, we conduct a critical analysis of a forecasting framework inspired by the fundamental KDD steps, analyzing it in a data-driven load forecasting case study.
In particular, from the analysis of the literature, Knowledge Discovery has emerged as a fundamental tool in smart grid computing, allowing system operators to model the semantics of the data, instead of just relying on syntactic and structural representations, and to access data resources while solving heterogeneity problems. This could allow smart grid computing entities to interact closely at human conceptual levels, providing functionalities for ontology management, query, and inference services. In this context, future research activities will be oriented toward the conceptualization of an ontology middleware system that combines the processing of real-time or near real-time data streams generated by heterogeneous data sources with ontology-based services and intelligent reasoning. Together, these components enable a Knowledge Discovery process based on the information context instead of just keyword-based searches.
Furthermore, the second part of this manuscript analyzes a specific KDD-based methodology for data-driven load forecasting in a real case study, describing how the KDD may improve the development of a decision support system. The conducted experimental analysis allows for assessing the quality of the proposed KDD-based methodology for load forecasting, and the obtained results indicate future research trends in this field.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: