The corresponding article is published in the journal Big Data Research:
@article{BDR18,
  author  = {Matthias Carnein and Heike Trautmann},
  title   = {evoStream -- Evolutionary Stream Clustering Utilizing Idle Times},
  journal = {Big Data Research},
  year    = {2018},
  volume  = {14},
  pages   = {101 - 111},
  doi     = {10.1016/j.bdr.2018.05.005}
}
An implementation of the proposed algorithm is available in the development version of the popular library stream. The algorithm is implemented in C++ with interfaces to R for easier prototyping.
The latest release of the library on CRAN does not yet include evoStream but will do so with the next release.
For users not wanting to switch to the development version, the algorithm is also available as an extension package here: https://wiwi-gitlab.uni-muenster.de/m_carn01/evoStream.
The easiest way to get evoStream is to install the development version of the stream package:
devtools::install_git("https://github.com/mhahsler/stream/")
Alternatively, the extension package for stream can be installed on top of the CRAN release:
devtools::install_git("https://wiwi-gitlab.uni-muenster.de/m_carn01/evoStream")
The algorithm can then be used alongside other algorithms via the DSC_evoStream class:
library(stream)
# library(evoStream)  # only needed when using the extension package
stream <- DSD_Memory(DSD_Gaussians(k = 3, d = 2), 1000)
## init evoStream
evoStream <- DSC_evoStream(r=0.05, k=3, incrementalGenerations=1, reclusterGenerations=1000)
## insert observations
update(evoStream, stream, n = 1000)
## micro clusters
get_centers(evoStream, type="micro")
## micro weights
get_weights(evoStream, type="micro")
## macro clusters
get_centers(evoStream, type="macro")
## macro weights
get_weights(evoStream, type="macro")
## plot result
reset_stream(stream)
plot(evoStream, stream, type = "both")
## if we have time, evaluate additional generations. This can be called at any time, also between observations.
## by default, 1 generation is evaluated after each observation and 1000 generations during reclustering (parameters)
evoStream$RObj$recluster(2000)
## plot improved result
reset_stream(stream)
plot(evoStream, stream, type="both")
## get assignment of micro to macro clusters
microToMacro(evoStream)
The implementation from the R package has also been ported to plain C++ without the glue code for R, which should make it easier to interface with other languages. It uses C++11 features and is available here: https://wiwi-gitlab.uni-muenster.de/m_carn01/evoStream_C
Compile the files using your compiler of choice, for example:
g++ -g -std=c++11 -O2 -Wall -mtune=generic MC.cpp evoStream.cpp main.cpp
The interfaces are the same as in the R-Package. An example of the main interfaces as well as how to read a comma-separated file and cluster the data points is shown below:
#include <sstream>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "evoStream.hpp"

int main(){
    // Main Interface:
    EvoStream evo = EvoStream(0.05, 0.001, 100, 4, .8, .001, 100, 2*4, 1000); // init
    std::vector<double> observation { 10.0, 20.0, 30.0 }; // read observation
    evo.cluster(observation); // cluster new observation
    evo.get_microclusters();
    evo.get_microweights();
    evo.get_macroclusters();
    evo.get_macroweights();
    evo.recluster(100); // evaluate 100 more macro solutions
    evo.microToMacro();

    // Full Example: Read CSV file (here: comma-separated, numeric values)
    evo = EvoStream(0.05, 0.001, 100, 4, .8, .001, 100, 2*4, 1000);
    std::ifstream in("data.csv");
    std::string line;
    while (std::getline(in, line)) {
        std::stringstream sep(line);
        std::string field;
        std::vector<double> fields;
        while (std::getline(sep, field, ',')) {
            fields.push_back(std::stod(field)); // parse one numeric field
        }
        evo.cluster(fields); // insert observation
        evo.recluster(1); // evaluate 1 generation after every observation. This can be adapted to the available time
    }

    // Get micro-clusters
    std::vector< std::vector<double> > micro = evo.get_microclusters();
    std::vector<double> microWeights = evo.get_microweights();
    std::cout << "Micro Clusters" << std::endl;
    for(unsigned int row=0; row < micro.size(); row++){
        for(unsigned int col=0; col < micro[0].size(); col++){
            std::cout << micro[row][col] << " ";
        }
        std::cout << "(weight: " << microWeights[row] << ")" << std::endl;
    }

    // Get macro-clusters
    std::vector< std::vector<double> > macro = evo.get_macroclusters(); // reclustering
    std::vector<double> macroWeights = evo.get_macroweights();
    std::cout << "\n\nMacro Clusters" << std::endl;
    for(unsigned int row=0; row < macro.size(); row++){
        for(unsigned int col=0; col < macro[0].size(); col++){
            std::cout << macro[row][col] << " ";
        }
        std::cout << "(weight: " << macroWeights[row] << ")" << std::endl;
    }

    // Get assignment of micro to macro clusters
    auto microToMacro = evo.microToMacro(); // container type deduced via C++11 auto
    for(unsigned int i=0; i < microToMacro.size(); i++){
        std::cout << "Micro " << i << " -> " << "Macro " << microToMacro[i] << std::endl;
    }
    return 0;
}
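The loop above evaluates a fixed single generation after each observation, and its comment notes that this can be adapted to the available time. One way to do so is to keep calling recluster until a time budget runs out. The following is a minimal sketch of that idea in Python (the Python port described below exposes the same recluster method). The names recluster_for and FakeClusterer are hypothetical and introduced only for illustration; the only thing taken from evoStream is the recluster(n) call.

```python
import time


def recluster_for(clusterer, budget_seconds, generations_per_call=1):
    """Call clusterer.recluster() repeatedly until the time budget is used up.

    `clusterer` is any object with a recluster(n) method, for example an
    EvoStream instance. Returns the total number of generations requested.
    """
    deadline = time.monotonic() + budget_seconds
    total = 0
    while time.monotonic() < deadline:
        clusterer.recluster(generations_per_call)
        total += generations_per_call
    return total


class FakeClusterer:
    """Hypothetical stand-in so the sketch runs without evoStream installed."""

    def __init__(self):
        self.generations = 0

    def recluster(self, n):
        # A real clusterer would evaluate n generations here.
        self.generations += n


fake = FakeClusterer()
requested = recluster_for(fake, budget_seconds=0.01)
print(requested == fake.generations)  # True
```

With a real clusterer, the budget could be set to the expected gap until the next observation arrives, so that idle time is spent improving the macro-clusters rather than waiting.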
There also exists a Python port of evoStream. It is based on the C++ implementation with wrappers for Python. It is available as a Python module here: https://wiwi-gitlab.uni-muenster.de/m_carn01/evoStream_python
To install the module, run the following command in the module's main directory:
python setup.py install --force
For convenience, the command can be issued using the install.bat or install.sh files.
Once installed, the interfaces are the same as in the C++ and R implementations:
import evoStream
evo = evoStream.EvoStream(0.05, 0.001, 100, 4, .8, .001, 100, 2*4, 1000) ## init
evo.cluster([10.0, 20.0, 30.0]) ## read observation
evo.get_microweights()
evo.get_microclusters()
evo.get_macroclusters()
evo.get_macroweights()
evo.recluster(100) ## evaluate 100 more macro solutions
evo.microToMacro()
## Full Example: Read CSV file (here: comma-separated, numeric values)
import csv
evo = evoStream.EvoStream(0.05, 0.001, 100, 4, .8, .001, 100, 2*4, 1000)
with open('data.csv', 'r', newline='') as csvfile:
    # QUOTE_NONNUMERIC converts every unquoted field to float
    reader = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_NONNUMERIC)
    for row in reader:
        evo.cluster(row)
        evo.recluster(1) # evaluate 1 generation after every observation. This can be adapted to the available time
print("Micro Clusters:")
x = evo.get_microclusters()
print(x)
print("\nMicro Weights:")
x = evo.get_microweights()
print(x)
print("\nMacro Clusters (here: performs an additional 1000 reclustering steps, see parameter)")
x = evo.get_macroclusters()
print(x)
print("\nMacro Weights")
x = evo.get_macroweights()
print(x)
print("\nAssignment of Micro Clusters to Macro Clusters")
x = evo.microToMacro()
print(x)
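The assignment returned by microToMacro() gives, for each micro-cluster index, the macro-cluster it belongs to, so it can be used to roll micro-level information up to the macro level. The following is a minimal sketch under that assumption. It uses made-up values in place of evo.get_microweights() and evo.microToMacro(), and weight_per_macro is a hypothetical helper, not part of evoStream.

```python
from collections import defaultdict


def weight_per_macro(micro_weights, assignment):
    """Sum micro-cluster weights by the macro-cluster they are assigned to.

    assignment[i] is the macro-cluster label of micro-cluster i. Labels are
    used as dictionary keys, so it does not matter where numbering starts.
    """
    if len(micro_weights) != len(assignment):
        raise ValueError("need exactly one assignment per micro-cluster")
    totals = defaultdict(float)
    for weight, macro in zip(micro_weights, assignment):
        totals[macro] += weight
    return dict(totals)


# Made-up example values, not real evoStream output.
micro_weights = [2.0, 1.5, 4.0, 0.5]
assignment = [0, 1, 0, 1]
print(weight_per_macro(micro_weights, assignment))  # {0: 6.0, 1: 2.0}
```

The same grouping pattern works for other per-micro-cluster quantities, for example collecting the micro-cluster centers that make up each macro-cluster.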