Anomaly Detection Using Log Stream Clustering

Type: Thesis

Level: Master's

Title: Anomaly Detection Using Log Stream Clustering

Presenter: Mahsa Moradi

Supervisor: Dr. Moharram Mansourizadeh

Advisors:

Examiners: Dr. Mehdi Sakhainia and Dr. Reza Mohammadi

Date and time of presentation: 17/10/2023, 6:30 PM

Place of presentation: Class 27

Abstract: Data mining, combined with technologies such as artificial intelligence and machine learning, makes it possible to analyze data, extract the concepts hidden in it, and apply them to a range of important tasks. Data mining is the science of extracting patterns and information from raw data sets produced by an organization or any other source. Such data are often produced at high speed and can form a data stream: a continuous transfer of data at a constant, high rate. In some cases, information systems generate a stream of logs. A log records the events, occurrences, and errors that arise while software or an operating system runs, which makes it a rich source of information for detecting and predicting errors and abnormal behavior. These errors can be discovered by analyzing logs with big data algorithms. Clustering is one method for analyzing streaming data and detecting anomalies. The purpose of clustering is to divide a set of objects into separate groups, and AutoCloud is one of the algorithms designed for clustering data streams. AutoCloud is an online, single-pass, recursive algorithm for data stream clustering based on Euclidean distance. It builds on the concept of typicality and eccentricity data analytics (TEDA), which is mainly used for anomaly detection tasks. AutoCloud can also handle problems inherent to data streams, such as concept drift and concept evolution. However, its accuracy is not satisfactory on most data sets, which suggests that other distance measures may be more appropriate and that adding further methods to AutoCloud might improve accuracy. This research therefore implements several ideas to check whether modifying AutoCloud improves it.
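The recursive, Euclidean-distance eccentricity computation described above can be sketched as follows. This is a minimal illustration, not the AutoCloud implementation or the thesis code: the update formulas are one commonly cited TEDA form, and the class name `RecursiveEccentricity`, the function `flag_anomalies`, and the Chebyshev-style threshold with `m = 3` are assumptions made for the example. Whether this threshold corresponds to the rule used in the thesis is likewise an assumption.

```python
from typing import List, Sequence


class RecursiveEccentricity:
    """Running mean and variance of a stream, updated one sample at a time.

    Hypothetical helper: the formulas follow a commonly cited TEDA form and
    are an assumption, not taken from the thesis.
    """

    def __init__(self) -> None:
        self.k = 0                    # number of samples seen so far
        self.mean: List[float] = []   # running mean vector
        self.var = 0.0                # running scalar variance

    def update(self, x: Sequence[float]) -> float:
        """Absorb sample x and return its normalized eccentricity."""
        self.k += 1
        k = self.k
        if k == 1:
            self.mean = [float(v) for v in x]
            return 0.5  # a single sample has eccentricity 1, normalized to 1/2
        # Recursive mean update.
        self.mean = [((k - 1) / k) * m + v / k for m, v in zip(self.mean, x)]
        # Squared Euclidean distance from the sample to the updated mean.
        dist2 = sum((v - m) ** 2 for v, m in zip(x, self.mean))
        # Recursive variance update.
        self.var = ((k - 1) / k) * self.var + dist2 / (k - 1)
        if self.var == 0.0:
            return 1.0 / (2 * k)  # all samples identical so far
        eccentricity = 1.0 / k + dist2 / (k * self.var)
        return eccentricity / 2.0


def flag_anomalies(stream: Sequence[Sequence[float]], m: float = 3.0) -> List[bool]:
    """Flag each sample whose normalized eccentricity exceeds (m^2 + 1) / (2k)."""
    model = RecursiveEccentricity()
    flags = []
    for x in stream:
        zeta = model.update(x)
        flags.append(zeta > (m * m + 1) / (2 * model.k))
    return flags
```

Because each sample updates the mean and variance before it is scored, this form can only flag a sample after roughly m² + 1 samples have been seen, so the first few samples of a stream are never marked as anomalies.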
The first idea is to use Mahalanobis distance in this algorithm. The results show that AutoCloud performs better with Euclidean distance than with Mahalanobis distance, so using Mahalanobis distance in AutoCloud is not very effective. In AutoCloud, the way clusters are formed can affect how data are absorbed; therefore, the second idea is to run the K-means algorithm before AutoCloud. In this approach, the first 1000 data samples are processed offline and the rest of the data are processed online. The aim is to create the clusters with K-means and then absorb the data using eccentricity, so that a suitable choice of clusters leads to better clustering. In general, the results obtained with this method are worse than those of the basic method. The third idea works like the second, except that the threshold for absorbing data samples into clusters is changed: it is calculated from the eccentricity of the K-means clusters. Finally, to detect anomalies in the log, the "σ gap" principle has been implemented in the AutoCloud algorithm. The results show that the accuracy of anomaly detection in the log using the proposed method is very low. Overall, the proposed methods perform worse than the original AutoCloud algorithm.
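To illustrate how the two distance measures compared in the first idea differ, the following minimal sketch computes both for two-dimensional points. It is not the thesis implementation: the function names, the restriction to two dimensions, and the sample points are assumptions made for the example.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Cov = Tuple[Tuple[float, float], Tuple[float, float]]


def mean_and_cov(points: Sequence[Point]) -> Tuple[Point, Cov]:
    """Sample mean and 2x2 sample covariance of a set of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def euclidean(x: Point, center: Point) -> float:
    """Straight-line distance; ignores the shape of the cluster."""
    return math.hypot(x[0] - center[0], x[1] - center[1])


def mahalanobis(x: Point, center: Point, cov: Cov) -> float:
    """Distance scaled by the cluster covariance (2-D case only)."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - center[0], x[1] - center[1]
    # The inverse of [[a, b], [b, d]] is (1 / det) * [[d, -b], [-b, a]].
    q = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Hypothetical cluster stretched along the x-axis.
cluster = [(-4.0, 0.2), (-2.0, -0.2), (0.0, 0.0), (2.0, 0.2), (4.0, -0.2)]
center, cov = mean_and_cov(cluster)
along, across = (3.0, 0.0), (0.0, 3.0)
```

Both `along` and `across` lie at the same Euclidean distance from the cluster center, but the Mahalanobis distance of `across` is much larger because the cluster has little spread in that direction. The Mahalanobis variant also needs a well-conditioned covariance estimate for each cluster, which the Euclidean variant does not.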
