1.Introduction
A Database DB is a group of related data, used by a DBMS. Database system is a means of storing and managing internally related data instead of the outdated, straightforward file system, which only allowed access to the data stored there through designated methods of storage and retrieval, database systems allow for the management and archiving of internally related data [1]..NoSQL database is short for not only SQL, but it is also known as a non-relational database (that is, it does not have any fixed relationships or schemas) or a distributed database (because the data in a NoSQL database management system DBMS can be divided into servers). It’s a database management system which does not require the data to display a pre-defined schema. It allows a variety of data types to be stored. This implies that the data can be arranged in a variety of ways based on the database's intended use [2].
The number of connected devices, such as sensors, cameras, and radars that communicate through local networks, Networked IoT, and social networks, leads to the generation of a significant amount of streaming data. Data from these sources are often generated continuously, and are called data streams. In a streaming data environment, time is a very important constraint [3]; the data stream always changes over time. it is temporary and there is no upper limit to the data. Data is produced at high-speed streaming data environment. Considering all these parameters the data stream processing has many limitations, which must be processed in a limited amount of time as well as limited space usage[4]. An incoming series of data (also known as a data stream) has been designed to be collected and processed by a streaming database. instantaneous data in real time following its creation. Instead of referring to a distinct class of database management systems, this term encompasses a variety of databases that process live streaming data. Consisting of time-series databases, NewSQL databases, NoSQL databases, and in-memory data grids [5].
Data streaming mining is a method to extract useful knowledge and obtain relevant information from data. Data classification is one of the data mining techniques that has shown its power in predicting large amounts of data. There are different classification methods, traditional ones such as decision trees, rule-based methods, neural networks, and the simple Naive Bayes method, which have shown tremendous performance in various applications. These techniques are often used in streaming data mining where stream classifiers operate on data in a single pass because they are time-constrained. The accuracy of data classifiers depends on class label prediction, and in stream data identification of proper label is very difficult [6]. This research paper is organized as follows, Section 2 presents the NoSQL database and its features, characteristics and types. Section 3 presents data stream data, its characteristics, processing methods, data stream processing tools, and the stream database. Section 4 presents data stream mining, its characteristics, limitations and challenges facing stream data classification, and the algorithms used in data stream classification.
- Related Works
Researchers [7] proposed a data streaming method to predict solar radiation in real time with variable climate conditions and cloud coverage. Our method works within an asynchronous two-pipeline framework and using deep learning models, in order to evaluate four deep learning architectures, multilayer perceptron (MLP), long. -short term memory network (LSTM), convolutional network (CNN), and Transformer network, where (MLP) and (CNN) achieved the best accuracy with a high ability to adapt to evolving data. Researchers [8] proposed a new nonlinear feature selection method targeting multi-class classification problems in the framework of support vector machines. The proposed method is achieved using a multi-class support vector machine with a fast version of recursive feature removal. The proposed method selects features that perform well for all classes. The classifier in question simultaneously builds multiple decision functions that separate each class from the others. Researchers [9] proposed a data streaming method to predict solar radiation in real time with variable climate conditions and cloud coverage. Our method works within an asynchronous two-pipeline framework and using deep learning models, in order to evaluate four multilayer perceptron (MLP), long-short, deep learning architectures. term memory network (LSTM), convolutional network (CNN), and Transformer network, where (MLP) and (CNN) achieved the best accuracy with a high ability to adapt to evolving data. Researchers[10] presented a study on the impact of assessment scores and online activity data in a learning management system(LMS) on students' academic performance. Based on one of the commonly used data mining techniques for forecasting,. Five classification algorithms were applied, decision tree, random forest, sequential minimal optimization, multilayer perceptron, and logistic regression, and the results showed that assessment scores are the most important attributes that affect students' academic performance. Prediction models that included assessment scores alone or in combination with activity data performed better than models based on activity data alone. The random forest algorithm performed well for predicting student academic performance, followed by decision tree.
NoSQL Database.3
The term NoSQL was first used in 1998 in databases. Its importance has become increasing during the twenty-first century, particularly with considering quickly of the Internet. Due to the increasing acceptance of Web services, traditional databases are being used more frequently on a web scale, which has resulted in the creation of massive amounts of data. New data management systems were therefore required in order to handle enormous amounts of streaming data from various sources [11]. NoSQL databases are distributed database systems, which do not require structured data, are designed to scale horizontally and may be open source. NoSQL databases do not fully provide the standard ACID.(.Atomicity, Consistency, Isolation, .and Durability) properties that are provide relational databases. NoSQL databases are characterized by the ability to manage, more flexibly store and index massive data sets whereas accepting numerous concurrent user requests. NoSQL databases are now being widely used in various application fields, including manufacturing, healthcare, and social media. NoSQL was mainly used to process the drawbacks of relational database systems RDMBS in web applications; the drawbacks ware related to performance and scalability [12].
3.1 NoSQL. Database Characteristics [13]:
- Non-Relational: Relational database models are not used by NoSQL databases, and they are also not compatible with SQL join operations.
- Distributed: NoSQL databases typically store data across multiple servers and geographical locations.
- Open-Source: The majority of NoSQL databases are free to download and open source.
- Horizontally Scalable: Reduce or increase the total number of servers to match the NoSQL database's processing capacity for data.
- Schema-Free: NoSQL databases do not require the definition of database schema prior to data inserting, in contrast to RDBs. As a result, NoSQL databases allow for flexible data addition.
- Simple API: The NoSQL database provides APIs for network delivery, data collection, etc. for programmers to use, so that programmers do not need to design additional programs to make writing programs easier.
- Its commitment to the characteristics of BASE: It is an abbreviation for Basically Available Soft-State and Eventual Consistency, and the meanings are described as follows:
- Basically Available: The database system can execute and always provide services. Some parts of the DB system may have partial failures and the rest of the database system can continuously operate. Some NoSQL DBs typically keep several copies of specific data on different servers, which allows the database system to respond to all queries even if few of the servers fail.
- Soft-State: The database system does not require a state of strong consistency. Strong consistency means that no matter which replication of a certain data is updated, all later reading operations of the data must be able to obtain the latest information.
- Eventual Consistency: The database system needs to meet the consistency requirement after a certain time. Sometimes the database may be in an inconsistent state. For example, some NoSQL DBs keep multiple copies of certain data on multiple servers. However, these copies may be inconsistent in a short time, which may happen when a copy of the data is updated while the other copies continue .to have data from the old version. Eventually, the replication mechanism in the NoSQL database system will update all replicas to be consistent.
3.2 NoSQL Database Properties
Brewer's CAP Theorem demonstrates that distributed systems are only able to have two of the following characteristics at once.[14]:
- Consistency: shows that each read retrieves the most lately write.
- Availability: suggests that writing and reading are always successful. Stated alternatively, every non-failing node will provide a response within an appropriate amount of time.
- Partition Tolerance: This indicates that the system will keep working regardless of a system malfunction or data loss occurs.
In most cases, NoSQL databases prioritize high availability and partitioning. Which leads to loss of data consistency. Which in turn leads to the creation of systems known BASE mentioned previously.
3.3 NoSQL Database Types
NoSQL databases are those that offer an alternative mechanism for storing data than SQL tables. Four different kinds of NoSQL databases exist:
- Key-Value Database
A key-value database stores data in pairs, each of which consists of a key and a value. The arrangement of information (values) that the database can hold is the focus of the key, and the value corresponds to the key. This structure is quick to query and simple.. The advantages of Key-Value are size and scalability [12,15]. An example of this type is the Riak database.
- Document Oriented Database
. A document store or document-oriented database. In terms of structure, this model is comparable to Key value-based models; however, the value documents are semantic, and is stored in JSO or .XML format. This type is schema less, which helps in heterogeneous semi-structured data storage, and high flexibility. A reference key in a document-oriented database is similar to a foreign key in a relational database. The keys are unique and are associated with each document collection [12,15]. An example of this type is the MongoDb.
- Column Database
A column database stores data by columns instead of rows, and is stored separately for each column feature which uses the table as a data model, making it ideal for storing data. and processing big data. Columnar databases are widely used in data warehouse environments. Column database is known for its high performance and can support complex data types, unstructured text and graphics such as jpeg, gif, bmp, etc. [15]. Cassandra is an example of such a database.
- Graph Database
. Graphs can be used to represent data and can be more relational than a relational database. Graph databases are employed when the goal is to extract intricate relationships from massive data sets. These databases are extensively used by social media networks. Graphs are frequently utilised in link structures on websites. [12,15]. An example of this type is the OrientDb database.
- Data Stream
Technology has been more prevalent in many societal sectors recently, including banking, transportation, and healthcare. More sensors and systems that continuously produce enormous amounts of data as data streams are progressively added as part of this digital revolution. These systems and sensors include, for instance, weather forecasting, drones, and cell phones. Massive data streams are produced by these devices, and it is anticipated that these data streams will increase soon [16]. A data sequence with a timestamp is called a data stream. Data is variable; it comes in at a high rate and speed, is generated by various applications, and may change over time. The massive volume of data in the data stream is processed in real time, at a high rate of speed, and in an infinite size with no set length or at least unknown length [17].
4.1 Data Stream Characteristics
Streaming data has the following properties [18]
- Timestamp: Each item in streaming data has a timestamp, which is time sensitive with decreasing importance after a period of time.
- Continuous: The data stream has no beginning or end That is, data is collected continuously.
- Unique: Repeated transmission of data streams represents a real challenge due to time sensitivity. Accurate processing of data in real time is crucial. This means that it processes each element in streaming data at least once.
- Heterogeneous: Data arrives from different sources, so the data is different.
- Very large data: The size of the data may be huge, so it is difficult to store all the stream data in memory.
4.2 Data Stream Processing
It is the processing unit responsible for standardizing formats to analyze the collected data and then extract the relevant information. (i.e. converting the received data from its raw form into usable and more readable information). It is also responsible for removing potentially incorrect data. This processing can be performed in two ways: Stream Processing and Batch Processing [19]. Table (1) shows the difference between the two methods.
Table (1): The difference between Stream Processing and Batch Processing [19]
|
Stream Processing
|
Batch Processing
|
|
.data is processed continuously during its production
|
Data processing is in the form of a large quantity of data in one batch.
|
|
Low response time (milliseconds)
|
High response time (minutes to hours)
|
|
.The data's size is unknown and undefined
|
The data's size is known and specified
|
|
.Processing is done online in real time
|
To process it is without an internet connection and there is no real time
|
|
.Implementation of work is continuous
|
The work is scheduled to be carried out
|
|
Data is variable and dynamic
|
Fixed batch data
|
|
Fast data generation
|
Slow data generation
|
4.3 . Streaming Database
is a kind of database created especially to handle massive volumes of real-time data. data in real time. Analytics and insights are possible in real-time thanks to streaming databases, which process data as soon as it is created. For latency-critical applications like real-time analytics, fraud detection, and network monitoring, streaming databases are perfect [20].
4.4. Streaming Database Processing Tools
- Apache Spark
It is an open-source framework for real-time data entry, storing, processing, and analysis. It primarily uses small batch processing mode, in which events are handled according to predetermined time frames. It allows for the processing of multiple data types, expanding the capabilities of the MapReduce algorithm. Spark provides specialised libraries known as Spark Streaming to facilitate stream processing with short latency. For algorithms involving iterative processing, Spark is an excellent tool. To implement algorithms, a variety of programming languages can be used, including Java, Python, R, and Scala [21]. Data streams are managed by Spark Streaming as a sequence of small-batch jobs. hence achieving 100 ms end-to-end latencies. A Structured Streaming library based on the Spark SQL engine was introduced by Spark. The foundation of structured streaming is a brand-new, end-to-end latency-achieving low-latency processing mode called Continuous Processing.as little as one millisecond [22]. Figure (1) shows the Apache Spark architecture.
Figure (1): shows Apache Spark
- Apache kafka
It is an open0source software platform for processing data streams in real time. Designed specifically for Distributed Messaging System, it collects, delivers and processes large amounts of data with low latency and access. The Kafka architecture basically consists of three parts (see Figure (2)): The Producer, which collects data from different sources and then inserts it into the Topic, where the Topic describes the Data Stream Event and contains the Queue Structure. The third element of the Kafka architecture is the consumer, which retrieves topic elements [23].
Figure (2): Apache Kafka architecture
Apache Strom.3
It is an open-source framework made for processing massive volumes of data with ease and speed while enabling in-depth analysis. Programmers can create real-time distributed processing systems with the aid of Apache Strom. Its Application Programming Interface (APL) is straightforward. Apache Strom uses a series of tuples to represent streaming data. figure (3) shows the architecture of Apache Strom. Spout is the source of the data stream. A bolt processes any number of inputs streams and produces any number of new output streams. Topology is a Network that contains bolts and spouts as well as data stream between them [24].
Figure (3): shows Apache Strom
Apache Flink.3
It is framework for distributed processing and open source. It is a platform designed to process sets of streaming data from different inputs. Scala and Java are the languages used in its development, which is done by the ASF. No latency, fault tolerance, and high throughput are offered by the framework. With its high-level libraries for Java, Python, Scala, and SQL, Apache Flink facilitates both batch and stream processing [25].
- Streaming Database Mining
Numerous tasks, including frequency counts, time series analysis, clustering, and classification, are included in streaming data mining. Over the past ten years, the data stream classification task has been the most important and focused area of research among all these tasks. The traditional static data classification task differs slightly from data stream classification [26]. Table (2) shows a comparison between traditional data mining and data stream mining.
Table (2): Comparison between traditional and streaming data mining [27]
|
Streaming data mining
|
Traditional data mining
|
Advantage
|
|
Online
|
Offline
|
Processing
|
|
Impossible
|
Possible
|
Storage
|
|
No Specific
|
Specific
|
Size
|
|
.Dynamic
|
.Static
|
.Data generation
|
|
.Just one pass
|
.More time to access data
|
.Time
|
|
.Heterogeneous
|
.Homogeneous
|
.Data type
|
|
.Approximation
|
.Accurate
|
.Result
|
5.1 Characteristics of Mining Streaming Database
Data stream mining has several unique characteristics [27]:
- Size.:. Such data cannot be stored for further mining due to the massive size of the data stream. Put another way, data mining is necessary to extract useful information from this limitless amount of data. The data stream contains a variety of data types, including text, pictures, and video files. As a result, the data stream mining is only partially saved.
- Speed.: The dynamic arrival of the data stream changes over time, and the high speed at which this data is generated places strict requirements on the effectiveness of data mining.
- Multi-Dimensional.: Enormous streams of data with a variety of data types are generated from sources that are widely dispersed. This requires the need for complex algorithms.. The enormous amount of streaming data that enables processing and calculation in a single pass presents significant challenges.
5.2 Constraints and Challenges Facing Classification Streaming Database
Provides research [28] mainly presents some common constraints for streaming data
classification:
- Single Pass.: While data streams contain dynamic data instances that can only be read once, the static data sets used in traditional data mining can be read numerous times.
- Real-Time Response.: Real-time response is necessary for many real-world data stream applications, including credit card transactions and stock market forecasts. Lower time costs, or quicker response times, are essential for data processing and decision-making.
- Bounded Memory.: Many instances will be discarded due to limited memory storage and processing power, since the size of the data stream is open-ended with the arrival of data instances. As a result, only the data summary can be computed and stored to produce results that are acceptable approximations.
- Drift Detection.: The concept drift refers to the changing of data distributions in the Class Space, the Feature Space, or both spaces over time. Drift detection is useful for updating the built model for adapting to dynamic changes hidden in data streams.
- Multi-Label Problems.: In a data stream, each Instance is associated with one or more Labels. In this scene, such as a type of news stream, it may belong to more than one topic. Multi-label data stream classification aims to classify each data instance into a predefined label or a label set.
5.3 Classification Streaming Database
The classification of data streams has become more significant in various real-time applications. A data stream classification method can be developed using a variety of methods [29]. Figure (4) shows the most important data stream mining techniques. Machine learning, statistics, databases, data warehouses, data retrieval, and decision support systems. Machine Learning is a rapidly growing system that research how computers learn or improve their performance based on data. Statistic This technique studies the analysis collection, interpretation and presentation of data. Data Mining has an inherent relationship with statistics. Database systems are primarily concerned with building, managing, and utilising databases for end users and organisations. Using data storage and data mining, modern database systems have integrated systematic data analysis capabilities on database data. Data from various sources and time periods are combined in a data warehouse. To create data cubes, it aggregates data into a multidimensional space. The science of finding information or documents within text or multimedia documents that may be available online is known as information retrieval. A computerised system called a decision support system aids in the making of decisions, assessments, and plans of action for businesses or organisations. To find answers to issues and make decisions, a decision support system gathers and evaluates a lot of data. [30].
Figure (4): shows data mining techniques for data stream
5.3.1 Machine Learning Techniques for Streaming Database
The basic idea behind machine learning (ML), a branch of artificial intelligence, is that machines are capable of learning from data, identifying patterns, and making decisions without the need for human input. Data extraction is the goal of machine learning. Machine learning algorithms adapt to the style of data stream, which is dynamic, large, and continuous, as opposed to mining static data; instead, they develop an automated computational model and continuously improve its performance based on experience. supervised machine learning, unsupervised machine learning, semi-supervised machine learning, and reinforcement machine learning make up the four categories of machine learning. Classification and regression are two examples of supervised learning tasks. Supervised machine learning is one of the key areas of machine learning. One novel method is supervised machine learning [29]:
1.Machine learning is able to analyze massive quantity of data and identifying specific trends and patterns that may have gone undetected by humans.
2.It has the ability to learn on its own, make predictions, and make improvements to algorithms.
3.Machine learning algorithms gather experience, their accuracy and efficiency improve
4.Machine learning algorithms are capable of dealing with multidimensional .and multivarious data and data in a dynamic or uncertain state.
5.3.2Classification Algorithms for Streaming Database
Classification algorithms in a streaming data environment are often designed based on some basic traditional machine learning algorithms [31], which may include the following algorithms:
- Decision Tree
One simple technique for classification is a decision tree. Due to its low computational overhead, decision trees can be easily translated into highly accurate classification rules that are simple to comprehend and appropriate for high-dimensional data. Decision trees are therefore frequently used for streaming data and classification models [31]. Among the most popular algorithms in the academic world are decision trees. Tree models exhibit a high degree of interpretability and are free of distribution. These elements have played a significant role in its rising stature within the data mining community. For the data mining community, one of the most difficult problems is learning decision trees from data streams. One effective illustration of this is the Very Fast Decision Tree system, or VFDT. Repositioning leaves with decision nodes repeatedly teaches the very fast decision tree algorithm (VFDT). Sufficient information about attribute values is stored in each leaf. The statistics required by the evaluation function, which assesses the benefits of split tests based on attribute values, are considered sufficient. It moves from the root to the leaf of the tree, assessing.[32]. This is the VFDT algorithm(1)
Naive Bayes 2.
The Naive Bayes algorithm is a straightforward probability classifier that determines a set of probabilities by figuring out how frequently certain values occur and how they might combine in a given data set. Being a lossless classifier, the Naive Bayes algorithm generates a classifier that is functionally equivalent to the corresponding classifier that was trained on batch data. The foundation of the Naive Bayes algorithm is the Bayes theorem, a mathematical expression that expresses conditional probability. Taking into account the value of the class variable, this algorithm makes the assumption that all variables are independent. This conditional independence assumption is considered naive because it is rarely true in practical applications. In a range of controlled classification tasks, this algorithm has a tendency to pick things up quickly [33]. This is the Naive Bayes Algorithm (2).
|
Input : S:Stream of examples
X: Set of nominal Attributes
Y:Y={y1,……,yk} Set of class labels
H(.): Split evaluation function
Nmin :Minimum number of example
ʆ: is one minus the desired probability
of choosing the correct attribute at any node.
T: Contant to solve ties
Output : HT: is a Decision Tree
Begin
Let HT← Empty leaf (Root) ;
Foreach example (x,yk) ϵ S do
Traverse the tree HT from the root till a leaf I;
If yk ==? Then
Classify the example with the majority class in the leaf l;
else
Update sufficient statistics
If Number of examples in l >Nmin then
Compute HI(xi) for all the attributes ;
Let Xa be the attribute with highest HI ;
Let Xb be the attribute with second highest HI ;
Compute є = (Hoeffding bound)
If (H(Xa) – H(Xb)> є ) then
Replace I with a splitting test based on attribute Xa ;
Add a new empty leaf for each branch of the split ;
else
If є < T then
Replace I with a splitting test based on attribute Xa ;
Add a new empty leaf for each branch of the split ;
|
Algorithm(1): shows Very Fast Decision Tree Algorithm
|
1.function INITIALIZE (attributes,labels)
2. (A,Y)← (attributes, labels)
3. foreach class c in C do
4. N(c) =0
5. n=0
6. return
7.function TRAIN (X, y)
8. Increase n by 1
9. Increase number N(y) by 1
10. Calculate probability P(y) =N(y)/n
11. foreach attribute ai € A that holds the value of xi in x do
12. Increase number N(xi|y) by 1
13. foreach class c in C do
14. Calculate probability P(xi|c) = N(xi|c) / N(c)
15. return
16. function TEST(X)
17. foreach class c in C do
18. Calculate likelihood P(x|c) =π|x|k=1 P(xk|c). P(c)
19. Calculate relative – likelihood Rel(c|x) =P(x| c) .P(c)
20. return the class with the maximum relative-likelihood
|
Algorithm (2) shows Naive Bayes Algorithm
- 3. Neural Network
Artificial intelligence that uses neural networks aims to mimic the functions of the human brain. When a neural network has a sizable database, it performs especially well in event prediction and pattern recognition [34]. Recently, promising results for text classification have been demonstrated by neural network-based applications. One of the key components of NLP, or natural language processing, is text classification. which has been used in a variety of fields, including information retrieval, document classification, sentiment analysis, and text classification. Neural network-based models are becoming more widespread. An artificial neural network with multiple layers of inputs and outputs is called deep learning. The RNN, or Recurrent Neural Network, is one of the deep learning techniques utilised [35]. The primary applications of a recurrent neural network are in sequential labelling and prediction. From sequential and time-series data, a recurrent neural network can learn features and long-term dependencies. Any dynamic system can be modelled by a properly trained RNN. One type of supervised machine learning model is the RNN. The input layer, the hidden recurrent layer, and the output layer are the three layers of a basic RNN network. Recurrent neural networks operate on the basis of reusing a layer's output by feeding it back into the input layer to aid in layer prediction.as depicted inFigure (5). [34].
.Figure (5) shows .Recurrent Neural Network.
N input units make up the input layer. A series of vectors over time T, such as {..., xt−1, xt, xt+1,...}, where Xt = (x1, x2,..., Xn), serve as the input for this layer. In a fully connected RNN, the hidden units in the hidden layer are connected to the input units. The hidden layer contains M hidden units, ht = (h1, h2,..., hM), where the connections are defined by the weight matrix WlH. These hidden units are connected to one another over time through recurrent connections, which may result in the initialization of the hidden units using small, non-Zero elements. can enhance the network's overall stability and performance. System memory is defined by the hidden layer.
(1)
where
(2)
fH()is the hidden layer activation function, bh is the bias vector of hidden units. The hidden units are connected to the output layer through weighted WHO connections. The output layer contains P units, where yt = (y1, y2, ..., yP) that is calculated as follows:
(3)
Where fo().is the activation function and bo is the bias vector in the output layer, and since the input target pairs are sequential over time, the above steps are repeated over time [36].