evoStream


evoStream is a new clustering algorithm for data streams based on an evolutionary algorithm. It identifies groups of similar objects in a data stream over time. Stream clustering algorithms typically aggregate the data stream first and, upon demand, cluster these aggregations to identify groups of interest. By using an evolutionary algorithm, our approach can perform this clustering step incrementally. This makes much more efficient use of available resources, e.g. by running the algorithm during the idle time of the stream. This contribution matters for stream clustering because it enables real-time clustering and analysis of data streams.
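The idle-time idea can be illustrated with a small sketch. It is not the evoStream implementation: `CountingClusterer`, `process_stream`, `idle_budget_s` and `clock` are hypothetical names, and the stand-in class only counts calls. The only assumption about evoStream is that the clusterer exposes `cluster(observation)` and `recluster(generations)`, as in the Python interface described later in this README.

```python
import time


class CountingClusterer:
    """Hypothetical stand-in that only counts calls; not the evoStream implementation."""

    def __init__(self):
        self.observations = 0
        self.generations = 0

    def cluster(self, observation):
        self.observations += 1

    def recluster(self, generations):
        self.generations += generations


def process_stream(clusterer, observations, idle_budget_s=0.001, clock=time.monotonic):
    """Insert every observation, then spend up to idle_budget_s on extra generations."""
    for observation in observations:
        clusterer.cluster(observation)
        deadline = clock() + idle_budget_s
        # Evaluate one generation at a time so the loop stops as soon as the budget is used up.
        while clock() < deadline:
            clusterer.recluster(1)


demo = CountingClusterer()
process_stream(demo, [[10.0, 20.0], [11.0, 19.0], [50.0, 60.0]])
print(demo.observations, "observations,", demo.generations, "generations")
```

In a real deployment the fixed budget would presumably be replaced by the actual time until the next observation arrives; the fixed value here only keeps the sketch self-contained.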

Publication

The corresponding article is published in the Journal of Big Data Research:


Implementation (R-Package)

An implementation of the proposed algorithm is available in the development version of the popular R package stream. The algorithm is implemented in C++ with interfaces to R for easier prototyping.

The latest release of stream on CRAN does not yet include evoStream but will do so with the next release. Until then, the easiest way to get the algorithm is to install the development version of the stream package:

devtools::install_git("https://github.com/mhahsler/stream/")

For users who do not want to switch to the development version, the algorithm is also available as an extension package for stream (https://wiwi-gitlab.uni-muenster.de/m_carn01/evoStream), which can be installed with:

devtools::install_git("https://wiwi-gitlab.uni-muenster.de/m_carn01/evoStream")

The algorithm can then be used alongside other algorithms using the DSC_evoStream class:

library(stream)
## uncomment when using the extension package instead of the development version
# library(evoStream)

stream <- DSD_Memory(DSD_Gaussians(k = 3, d = 2), 1000)

## init evoStream
evoStream <- DSC_evoStream(r=0.05, k=3, incrementalGenerations=1, reclusterGenerations=1000)

## insert observations
update(evoStream, stream, n = 1000)

## micro clusters
get_centers(evoStream, type="micro")

## micro weights
get_weights(evoStream, type="micro")

## macro clusters
get_centers(evoStream, type="macro")

## macro weights
get_weights(evoStream, type="macro")

## plot result
reset_stream(stream)
plot(evoStream, stream, type = "both")

## if we have time, evaluate additional generations. This can be called at any time, also between observations.
## by default, 1 generation is evaluated after each observation and 1000 generations during reclustering (parameters)
evoStream$RObj$recluster(2000)

## plot improved result
reset_stream(stream)
plot(evoStream, stream, type="both")

## get assignment of micro to macro clusters
microToMacro(evoStream)

Implementation (C++)

The implementation from the R-Package has now been ported to plain C++ without the glue code for R. This should allow easier interfacing for other languages. The implementation makes use of C++11 features and is available here: https://wiwi-gitlab.uni-muenster.de/m_carn01/evoStream_C

Compile the files using your compiler of choice, for example:

g++ -g -std=c++11 -O2 -Wall -mtune=generic MC.cpp evoStream.cpp main.cpp

The interfaces are the same as in the R-Package. An example of the main interfaces as well as how to read a comma-separated file and cluster the data points is shown below:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

#include "evoStream.hpp"

int main(){


  // Main Interface:
  EvoStream evo = EvoStream(0.05, 0.001, 100, 4, .8, .001, 100, 2*4, 1000); // init
  std::vector<double> observation { 10.0, 20.0, 30.0 }; // read observation
  evo.cluster(observation); // cluster new observation
  evo.get_microclusters();
  evo.get_microweights();
  evo.get_macroclusters();
  evo.get_macroweights();
  evo.recluster(100); // evaluate 100 more macro solutions
  evo.microToMacro();


  // Full Example: Read CSV file (here: comma-separated, numeric values)
  evo = EvoStream(0.05, 0.001, 100, 4, .8, .001, 100, 2*4, 1000);

  std::ifstream in("data.csv");
  std::string line;
  while (std::getline(in, line)) {
      std::stringstream sep(line);
      std::string field;
      std::vector<double> fields;
      while (std::getline(sep, field, ',')) {
          fields.push_back(std::stod(field));
      }
      evo.cluster(fields); // insert observation
      evo.recluster(1); // evaluate 1 generation after every observation. This can be adapted to the available time
  }


  // Get micro-clusters
  std::vector< std::vector<double> > micro = evo.get_microclusters();
  std::vector<double> microWeights = evo.get_microweights();

  std::cout << "Micro Clusters" << std::endl;
  for(unsigned int row=0; row < micro.size(); row++){
    for(unsigned int col=0; col < micro[0].size(); col++){
      std::cout << micro[row][col] << " ";
    }
    std::cout << "(weight: " << microWeights[row] << ")" << std::endl;
  }


  // Get macro-clusters
  std::vector< std::vector<double> > macro = evo.get_macroclusters(); // reclustering
  std::vector<double> macroWeights = evo.get_macroweights();

  std::cout << "\n\nMacro Clusters" << std::endl;
  for(unsigned int row=0; row < macro.size(); row++){
    for(unsigned int col=0; col < macro[0].size(); col++){
      std::cout << macro[row][col] << " ";
    }
    std::cout << "(weight: " << macroWeights[row] << ")" << std::endl;
  }


  // Get assignment of micro to macro clusters
  auto microToMacro = evo.microToMacro();
  for(unsigned int i=0; i < microToMacro.size(); i++){
    std::cout << "Micro " << i << " -> " << "Macro " << microToMacro[i] << std::endl;
  }


  return 0;

}

Implementation (Python)

There also exists a Python port of evoStream. It is based on the C++ implementation with wrappers for Python. It is available as a Python module here: https://wiwi-gitlab.uni-muenster.de/m_carn01/evoStream_python

To install the module, run the following command in the module's main directory:

python setup.py install --force

For convenience, the command can be issued using the install.bat or install.sh files.

Once installed, the interfaces are the same as in the C++ and R implementations:

import evoStream

evo = evoStream.EvoStream(0.05, 0.001, 100, 4, .8, .001, 100, 2*4, 1000) ## init
evo.cluster([10.0, 20.0, 30.0]) ## read observation
evo.get_microweights()
evo.get_microclusters()
evo.get_macroclusters()
evo.get_macroweights()
evo.recluster(100) ## evaluate 100 more macro solutions
evo.microToMacro()



## Full Example: Read CSV file (here: comma-separated, numeric values)
import csv

evo = evoStream.EvoStream(0.05, 0.001, 100, 4, .8, .001, 100, 2*4, 1000)

with open('data.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_NONNUMERIC)
    for row in reader:
        evo.cluster(row)
        evo.recluster(1) # evaluate 1 generation after every observation. This can be adapted to the available time


print("Micro Clusters:")
x = evo.get_microclusters()
print(x)

print("\nMicro Weights:")
x = evo.get_microweights()
print(x)

print("\nMacro Clusters (here: performs an additional 1000 reclustering steps, see parameter)")
x = evo.get_macroclusters()
print(x)

print("\nMacro Weights")
x = evo.get_macroweights()
print(x)

print("\nAssignment of Micro Clusters to Macro Clusters")
x = evo.microToMacro()
print(x)
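The assignment of micro to macro clusters is often easier to work with once it is grouped per macro cluster. The following is a minimal sketch that assumes `microToMacro()` returns one macro-cluster index per micro cluster, in micro-cluster order (as the C++ example prints it). `group_by_macro` and `summed_weights` are hypothetical helpers, not part of evoStream, and the input values are made up so that the snippet runs without the module installed.

```python
from collections import defaultdict


def group_by_macro(assignment):
    """Map each macro-cluster index to the list of micro-cluster indices assigned to it."""
    groups = defaultdict(list)
    for micro_index, macro_index in enumerate(assignment):
        groups[macro_index].append(micro_index)
    return dict(groups)


def summed_weights(assignment, micro_weights):
    """Sum the micro-cluster weights per macro cluster."""
    totals = defaultdict(float)
    for macro_index, weight in zip(assignment, micro_weights):
        totals[macro_index] += weight
    return dict(totals)


# Made-up values for illustration; in practice these would come from
# evo.microToMacro() and evo.get_microweights().
assignment = [0, 1, 0, 2, 1]
micro_weights = [2.0, 1.0, 3.0, 4.0, 0.5]

print(group_by_macro(assignment))
print(summed_weights(assignment, micro_weights))
```

The helpers do not depend on whether the indices start at 0 or 1. Whether the summed weights coincide with the values returned by `get_macroweights()` is not stated here and should be checked against the implementation.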