Enhancement of Distributed Big Data Clustering

Type: Thesis

Level: Master's

Title: Enhancement of Distributed Big Data Clustering

Presenter: Morteza Yousef Sanati

Supervisors: Morteza Yousef Sanati

Advisors: Muharram Mansoorizadeh

Examiners: Hassan Khotanlou, Mehdi Sakhaei-nia

Date and time of presentation: 1400/07/20, 4 pm

Place of presentation: Faculty of Engineering

Abstract: Today, data is generated at very high speed and volume, and in many cases it arrives as a data stream. A data stream is an unbounded sequence of data objects that arrive over time at high speed and volume. One of the most common tasks on data streams is clustering, which aims to divide the data into homogeneous groups. One existing algorithm for this task is CluStream, for which an implementation exists in the Apache Spark distributed environment. CluStream maintains a constant number of microclusters in the online phase. This is an impractical assumption for an evolving data stream, given the complexity of the input data in real-world streams. In addition, the algorithm retains historical data as the stream progresses and has no mechanism for gradually removing expired clusters. As data keeps arriving, more points are added to each cluster and the cluster radii grow over time, which reduces the accuracy of the clusters. In the offline phase, the final clusters are determined by a fixed parameter. In practice, this parameter can cause one cluster to be split into several clusters or several clusters to be merged into one, which may reduce the quality of the clusters the algorithm detects. To address these problems, this thesis modifies the implementation of the CluStream algorithm. In the online phase, two ideas are proposed to make clustering more dynamic and to discard historical data. The first idea is to add a pruning function that deletes expired clusters, and the second is to use a sliding window that preserves recent data and deletes old data. In the offline phase, an algorithm is proposed that determines the number of final clusters dynamically.
With the first idea, the quality of the clusters fluctuates: in some time units the quality improves, while in others it decreases. With the second idea, the quality and accuracy of clustering improve significantly in all cases, and in some time units clustering accuracy improves by more than 50%. In terms of speed, both ideas keep the running time at an acceptable level. The algorithm proposed in the second idea runs more slowly in some cases, but in the best case its execution speed improves by up to 50%.
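The pruning idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation and not the Spark version of CluStream: it is a one-dimensional, single-machine illustration, and the names MicroCluster, prune_expired, and horizon are hypothetical, chosen only for this example. It assumes that a microcluster counts as "expired" when it has not absorbed a point within a fixed time horizon; the abstract does not state the actual expiry rule.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Additive summary of absorbed points (1-D for brevity, hypothetical)."""
    n: int = 0                 # number of points absorbed
    linear_sum: float = 0.0    # sum of the points
    squared_sum: float = 0.0   # sum of the squared points
    last_update: float = 0.0   # timestamp of the most recent point

    def add(self, x: float, t: float) -> None:
        """Absorb point x observed at time t."""
        self.n += 1
        self.linear_sum += x
        self.squared_sum += x * x
        self.last_update = t

    def centroid(self) -> float:
        return self.linear_sum / self.n

    def radius(self) -> float:
        """Standard deviation of the absorbed points around the centroid."""
        mean = self.centroid()
        variance = max(self.squared_sum / self.n - mean * mean, 0.0)
        return math.sqrt(variance)


def prune_expired(clusters, now, horizon):
    """Keep only microclusters updated within the last `horizon` time units.

    Assumption: expiry is judged by the last update time alone.
    """
    return [mc for mc in clusters if now - mc.last_update <= horizon]


if __name__ == "__main__":
    fresh = MicroCluster()
    fresh.add(1.0, t=8.0)
    fresh.add(3.0, t=9.0)

    stale = MicroCluster()
    stale.add(10.0, t=0.0)

    kept = prune_expired([fresh, stale], now=10.0, horizon=5.0)
    print(len(kept), kept[0].centroid(), kept[0].radius())
```

Because the summary is additive, a microcluster that is never pruned keeps absorbing points and its radius can only reflect the full history, which is the accuracy problem the abstract attributes to the unmodified algorithm.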
