The field of artificial neural networks has grown rapidly over the past 10-15 years. Typical applications include image processing, audio processing, and other areas with high-dimensional data. In machine learning, however, there are many tasks where the volume of input data is small: for example, modeling rare events, processing manually collected analytical data, or analyzing signals from low-frequency sensors. Under such conditions, careful work with the characteristics (“features”) on which the system is trained becomes an important stage, in particular the generation of new features from the existing basic ones, which can improve the quality of the resulting system. Such features are usually generated manually, but a good alternative is to use neural networks, which can not only learn basic mathematical operations but also identify highly complex patterns in the input data.
This paper describes our experience of using multilayer neural networks to generate additional features for low-dimensional data, where the number of basic features ranges from one to two dozen. Two datasets are used: a real one (data from recorders), on which the final models are trained and evaluated, and a synthetic one (generated data), which is used to train the neural network that subsequently generates the additional features.
Task and Data Description
The machine learning system was designed to predict the failure of industrial electrical installations caused by the accumulation of non-critical micro-breakdowns between electrical coils. The breakdowns were sporadic and were caused by high-voltage pickup and impulse noise arising when other equipment was switched on and off.
To study this factor, a recorder was connected to the test section of the circuit; it registered the potentials at two control points of the coils at a sampling frequency of 50 kHz. The information from the recorder was then processed semi-automatically, yielding a training sample that describes the statistics of micro-breakdowns registered per shift of equipment operation (8 hours).
The primary analysis of the problem showed that, due to the asymmetric nature of the system, breakdowns of different polarities affect the reliability of the system differently and, moreover, largely compensate for each other. Therefore, the following criterion was chosen as the target variable: if the balance (i.e., the difference between the numbers of micro-breakdowns of type 1-2 and type 2-1 registered during a shift) exceeds a preset threshold T, the classifier should output 1; otherwise, it should output 0. For the first stage of research, the threshold T was set to zero, since this gave well-balanced classes.
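The labeling rule above can be written down in a few lines. This is a minimal sketch: the function name `balance_label` and the per-shift counts are hypothetical and are not taken from the original dataset.

```python
def balance_label(n_12: int, n_21: int, threshold: int = 0) -> int:
    """Return 1 if the shift balance (n_12 - n_21) exceeds the threshold, else 0."""
    return 1 if (n_12 - n_21) > threshold else 0


# Hypothetical per-shift counts: (breakdowns of type 1-2, breakdowns of type 2-1).
shifts = [(12, 9), (4, 4), (3, 8)]
labels = [balance_label(n_12, n_21) for n_12, n_21 in shifts]
print(labels)  # [1, 0, 0]
```

Note that a balance exactly equal to T falls into class 0 under this rule.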
The features of the collected dataset are listed in the table below.
In addition to the two basic objective features (processed data from the recorder), additional features were calculated: expert estimates based on empirical rules and accumulated experience. Only the features that passed a preliminary test for variation are included in the table.
|Feature|Description|
|---|---|
|u1|The average absolute potential at the test point of coil 1, averaged over a shift (8 hours).|
|u2|The same for coil 2.|
|exp_t|Expert estimate of the total number of micro-breakdowns (of both “1-2” and “2-1” types) for a given shift, based on empirical rules.|
|exp_b|The same for the total balance (number of “1-2” events minus number of “2-1” events).|
|exp_pb1|Expert estimate of the probability that the total micro-breakdown balance will exceed the threshold T.|
|exp_pb2|The same for balance < T.|
|exp_pb0|The same for balance = T (for some types of installations this situation is quite likely).|
In total, the collected dataset contained 1376 observations. It was split chronologically into two parts, which guarantees that no information “leaks” from the training sample into the test sample.
|Part|Number of observations|
|---|---|
|Training and validation part|1040|
|Test part|336|
To assess model quality, we chose the AUC ROC metric, the area under the ROC (receiver operating characteristic) curve. This metric estimates classification quality without choosing a decision threshold (unlike other standard metrics such as accuracy, precision, recall, and F1).
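To make the threshold-free nature of the metric concrete, the sketch below computes AUC ROC directly from its probabilistic interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy scores are illustrative assumptions. In practice a library routine such as `sklearn.metrics.roc_auc_score` would normally be used.

```python
def roc_auc(y_true, scores):
    """AUC ROC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores, not from the article's dataset).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```

Only the ordering of the scores matters here, which is why no decision threshold has to be chosen.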
Visualization of Objective Features
Fig. 1 shows the mapping of points from the training sample in the coordinates of the two main features – u1 and u2. The color of the point corresponds to the class (red – 0, aqua – 1).
From the distribution of points it is clear that this classification task is rather difficult.
Figure 1. Mapping of points (Source: Auriga)
Model 1: Approach through Original Features
As the base model, we chose logistic regression with feature normalization. The reason is that good calibration of the model is very important for this task, and popular alternatives based on decision trees (Random Forest, XGBoost, LGBM, etc.) are, on their own, not as well calibrated as logistic regression.
The training results of the model are shown in Fig. 2 (ROC curve for test sample) and Fig. 3 (classification contours in the feature space u1, u2 and points from the training sample).
The obtained AUC ROC value of 0.5532 exceeds 0.5, the value that corresponds to random guessing; that is, despite the difficult dataset, the model was able to extract useful patterns from the data.
Figure 2. ROC curve for test sample (Source: Auriga)
Figure 3. Classification contours (Source: Auriga)
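A baseline of this kind might be set up roughly as follows with scikit-learn. This is a sketch under stated assumptions, not the authors' code: the data are random placeholders standing in for the recorder dataset, `StandardScaler` is assumed as the normalization step (the article does not specify which one was used), and the chronological split is modeled by simply taking the first 1040 rows of a time-ordered array.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder data (NOT the article's dataset); rows are assumed to be
# ordered by time, and the two columns play the role of u1 and u2.
n_samples, n_train = 1376, 1040
X = rng.normal(size=(n_samples, 2))
noise = rng.normal(scale=3.0, size=n_samples)
y = (X[:, 0] - X[:, 1] + noise > 0).astype(int)

# Chronological split: earlier rows for training, later rows for testing.
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Normalization (assumed: standardization) followed by logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# AUC ROC is computed from predicted probabilities, not from hard labels.
test_scores = model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_scores)
print(f"test AUC ROC: {test_auc:.4f}")
```

Putting the scaler inside the pipeline means it is fitted on the training part only, so no statistics of the test part leak into training.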
Model 2: Adding Additional Manually Generated Features
Based on empirical rules, we assumed that an additional feature, the potential difference u1 – u2, could improve the quality of the model. After adding this feature, we obtained the results shown in Fig. 4. The quality metric improved slightly, from 0.5532 to 0.5535.
In addition to this feature, other polynomial features of the 1st and 2nd orders were also tested, but they did not lead to an improvement in the quality of the model.
Figure 4. ROC curve (Source: Auriga)
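The manual features discussed in this section are simple column operations. The NumPy sketch below shows one possible way to build them; the helper names and the sample values of u1 and u2 are hypothetical.

```python
import numpy as np


def add_difference(U):
    """Append the potential difference u1 - u2 as a third column."""
    return np.column_stack([U, U[:, 0] - U[:, 1]])


def polynomial_features_deg2(U):
    """Return the 1st- and 2nd-order terms: u1, u2, u1^2, u1*u2, u2^2."""
    u1, u2 = U[:, 0], U[:, 1]
    return np.column_stack([u1, u2, u1**2, u1 * u2, u2**2])


# Hypothetical values of (u1, u2) for three shifts.
U = np.array([[1.0, 0.5],
              [2.0, 3.0],
              [0.2, 0.2]])

X_model2 = add_difference(U)          # shape (3, 3)
X_poly = polynomial_features_deg2(U)  # shape (3, 5)
print(X_model2)
print(X_poly)
```

Either matrix can then be passed to the same normalization and logistic regression steps as the original two features.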
Model 3: Generation of Additional Features using a Neural Network
An alternative to the manual feature generation described above is to use artificial neural networks, whose advantage is that they can learn a rather complicated function that is difficult or impossible to describe analytically. This is exactly our situation: under non-deterministic conditions, it is useful to have an additional feature that describes the balance between two discrete probabilistic processes whose parameters are set by the features u1 and u2.
To implement this approach, a synthetic dataset of 10,000 examples was generated. Random values were fed to the inputs x1 and x2, the parameters of two independent Poisson processes, and the target variable was calculated as a binary condition: if the balance between the number of events in process 1 and the number of events in process 2 is positive, the target variable is 1; otherwise, it is 0.
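A synthetic sample of this kind could be generated as sketched below. Several details are assumptions, because the article does not state them: x1 and x2 are treated as the expected number of events per observation, they are drawn uniformly from [0, `max_rate`), and the function name `make_synthetic` and the seed are illustrative.

```python
import numpy as np


def make_synthetic(n_examples=10_000, max_rate=10.0, seed=0):
    """Generate (x1, x2) Poisson rates and the binary balance target."""
    rng = np.random.default_rng(seed)
    # Assumed sampling scheme: rates drawn uniformly from [0, max_rate).
    x = rng.uniform(0.0, max_rate, size=(n_examples, 2))
    n1 = rng.poisson(x[:, 0])  # events in process 1
    n2 = rng.poisson(x[:, 1])  # events in process 2
    y = (n1 - n2 > 0).astype(int)  # 1 if the balance is positive, else 0
    return x, y


X_syn, y_syn = make_synthetic()
print(X_syn.shape, y_syn.mean())
```

Because ties (n1 = n2) are labeled 0, the share of positive examples is somewhat below one half under this sampling scheme.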
A simple fully connected neural network was trained on this synthetic sample, the architecture of which is shown in Fig. 5.
Figure 5. Architecture (Source: Auriga)
Then the trained neural network was used to generate an additional feature column, separately for the training and test samples.
After the base model (logistic regression with normalization) was retrained with this additional feature, the AUC ROC value for the test sample was 0.5539, i.e., better than in Model 2.
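The overall procedure of this section can be sketched as follows. Everything specific here is an assumption for illustration: the architecture from Fig. 5 is not reproduced in the text, so a small scikit-learn `MLPClassifier` with two hidden layers of 16 and 8 units stands in for it; the synthetic rates are drawn from [0, 10); the arrays `U_train` and `U_test` are random placeholders for the real (u1, u2) columns; and those columns are assumed to be on a scale comparable to the synthetic inputs.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic sample: two Poisson rates and the sign of the event balance.
x_syn = rng.uniform(0.0, 10.0, size=(10_000, 2))
n1 = rng.poisson(x_syn[:, 0])
n2 = rng.poisson(x_syn[:, 1])
y_syn = (n1 - n2 > 0).astype(int)

# Small fully connected network (hypothetical layer sizes, see lead-in).
net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=300, random_state=0)
net.fit(x_syn, y_syn)


def add_nn_feature(U):
    """Append the network's estimate of P(balance > 0) as a new column."""
    p = net.predict_proba(U)[:, 1]
    return np.column_stack([U, p])


# Placeholders for the real (u1, u2) columns of the two samples.
U_train = rng.uniform(0.0, 10.0, size=(1040, 2))
U_test = rng.uniform(0.0, 10.0, size=(336, 2))

# The feature is generated separately for the training and test samples.
X_train_ext = add_nn_feature(U_train)
X_test_ext = add_nn_feature(U_test)
print(X_train_ext.shape, X_test_ext.shape)
```

The extended matrices can then be passed to the same normalization and logistic regression steps as in Model 1. Since the network is trained only on synthetic data, no labels from the real dataset are involved in producing the new feature.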
Conclusions and Perspectives
Fig. 6 contains a summary diagram of model quality for the three approaches described above.
Figure 6. Summary diagram (Source: Auriga)
The results show that the problem under investigation is rather complicated. Nevertheless, the achieved quality indicators make it possible to use this system under real-life conditions to schedule preventive maintenance and avert equipment failure.
A comparison of the approaches shows that the best option is to train a small neural network on a separate synthetic dataset and then use it to generate additional features for the main training and test samples.
In the future, it would be worth trying tree-based algorithms instead of logistic regression, with additional measures to calibrate their output estimates. It would also be of interest to study different neural network architectures in terms of their ability to represent various complex functions.
In general, such combined systems (a standard machine learning algorithm plus a shallow neural network to enrich the features) are suited to tasks where simple approaches do not work well because of hidden interconnections in the input data, while deep neural networks are inapplicable because of the low dimensionality of the data. Examples include equipment failure prediction systems, anomaly detection, credit scoring, and other similar tasks.
Andrey Teterin is a Senior Software Engineer at Auriga, Inc. He is an experienced data scientist with a strong software development background. He has more than 20 years of experience with a focus on complex algorithms, machine learning, and deep learning in different areas, including computational linguistics (text analysis, including syntax, semantics, and sentiment analysis), investment analysis, medical and biotechnological systems, clinical trials, digital processing of audio signals and images, and recommender systems.