confStream


A common problem in clustering is the proper choice of algorithm and optimal parameter settings. This is even more challenging in stream clustering where clusters are identified and updated over time. Traditionally, automated algorithm selection and configuration is available which can automatically find the best parameter settings. Unfortunately, none of the existing approaches are directly applicable to the streaming scenario. confStream presents the first approach for automated algorithm selection and configuration of stream clustering algorithms. It uses an ensemble of different configurations which allows to automatically find and adapt optimal parameters over time.

Publications

Carnein M, Trautmann H, Bifet A and Pfahringer B (2020), "confStream: Automated Algorithm Selection and Configuration of Stream Clustering Algorithms", In 14th Learning and Intelligent Optimization Conference (LION 2020). Cham , pp. 80-95. Springer International Publishing.
[BibTeX] [DOI]
@inproceedings{LION20,
	author = {Matthias Carnein AND Heike Trautmann AND Albert Bifet AND Bernhard Pfahringer},
	editor = {Kotsireas, Ilias S. AND Pardalos, Panos M.},
	title = {confStream: Automated Algorithm Selection and Configuration of Stream Clustering Algorithms},
	booktitle = {14th Learning and Intelligent Optimization Conference (LION 2020)},
	publisher = {Springer International Publishing},
	year = {2020},
	pages = {80--95}
}
Carnein M, Trautmann H, Bifet A and Pfahringer B (2020), "Towards Automated Configuration of Stream Clustering Algorithms", In Workshop on Automated Data Science at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD '19). Würzburg, Germany , pp. 137-143. Springer International Publishing.
[BibTeX] [DOI]
@inproceedings{ECMLPKDD19,
	author = {Matthias Carnein and Heike Trautmann and Albert Bifet and Bernhard Pfahringer},
	editor = {Cellier, Peggy and Driessens, Kurt},
	title = {Towards Automated Configuration of Stream Clustering Algorithms},
	booktitle = {Workshop on Automated Data Science at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD '19)},
	publisher = {Springer International Publishing},
	year = {2020},
	pages = {137--143},
	doi = {10.1007/978-3-030-43823-4_12}
}

Implementation

The algorithm is implemented in Java as a stream clustering algorithm for the Massive Online Analysis (MOA) framework. The code has been merged into MOA project and will be available with its next release. For completeness, the implementation is also still available in its initial feature branch https://github.com/MatthiasCarnein/moa/tree/LION

The algorithm makes use of a configuration file which specifies the algorithms to use as well as their initial configuration. All stream clustering algorithms, all their parameters and all evaluation measure implemented in MOA are available. The algorithm is currently developed in a fork and will be merged eventually.

An example on how to cluster a Random RBF stream is shown below:

public static void main(String[] args) {

	// initialise algorithm and set configuration file
	ConfStream algorithm = new ConfStream();
	algorithm.fileOption.setValue("settings.json");
	algorithm.prepareForUse();

	// generate stream
	RandomRBFGeneratorEvents stream = new RandomRBFGeneratorEvents();
	stream.prepareForUse();

	// train algorithm
	for (int i = 0; i < 1000000; i++) {
		Instance inst = stream.nextInstance().getData();
		algorithm.trainOnInstanceImpl(inst);
	}

	// get result
	algorithm.getClusteringResult();

}

A configuration file which performs algorithm selection and configuration for the DenStream ("denstream.WithDBSCAN"), ClusTree ("clustree.ClusTree"), CluStream ("clustream.WithKmeans") and BICO ("kmeanspm.BICO") algorithm and all their relevant parameters is shown below. For evaluation, the Silhouette Width ("SilhouetteCoefficient") is used:

{
	"windowSize": 1000,
	"ensembleSize": 20,
	"newConfigurations": 10,
	"keepCurrentModel": "true",
	"reinitialiseWithClusters": "true",
	"preventAlgorithmDeath": "true",
	"evaluateMacro": "false",
	"keepGlobalIncumbent": "true",
	"keepAlgorithmIncumbents": "true",
	"keepInitialConfigurations": "true",
	"useTestEnsemble": "true",
	"lambda": 0.05,
	"resetProbability": 0.01,
	"numberOfCores": 1,
	"performanceMeasure": "SilhouetteCoefficient",
	"performanceMeasureMaximisation": "true",
	"algorithms": [
		{
			"algorithm": "denstream.WithDBSCAN",
			"parameters": [
				{"parameter": "e", "type":"numeric", "value":0.02, "range":[0,1]},
				{"parameter": "b", "type":"numeric", "value":0.2, "range":[0,1]},
				{"parameter": "m", "type":"integer", "value":1, "range":[0,10000]},
				{"parameter": "o", "type":"integer", "value":2, "range":[2,20]},
				{"parameter": "l", "type":"numeric", "value":0.25, "range":[0,1]}
			]
		}
		,
		{
			"algorithm": "clustree.ClusTree",
			"parameters": [
				{"parameter": "H", "type":"integer", "value":8, "range":[1,20]},
				{"parameter": "B", "type":"boolean", "value":"false"}
			]
		}
		,
		{
			"algorithm": "clustream.WithKmeans",
			"parameters": [
				{"parameter": "k", "type":"integer", "value":5, "range":[2, 20]},
				{"parameter": "m", "type":"integer", "value":100, "range":[1,10000]},
				{"parameter": "t", "type":"integer", "value":2, "range":[1,10]}
			]
		}
		,
		{
			"algorithm": "kmeanspm.BICO",
			"parameters": [
				{"parameter": "k", "type":"integer", "value":5, "range":[2, 20]},
				{"parameter": "n", "type":"integer", "value":1000, "range":[1,2000]},
				{"parameter": "p", "type":"integer", "value":10, "range":[1,20]},
				{"parameter": "d", "type":"integer", "value":10, "optimise":"false"}
			]
		}
	]
}

For every parameter, optimisation can be turned off with "optimise":"false" in order to prevent changes to the defined parameter value. In the example above, the dimensionality parameter ("d") of BICO ("kmeanspm.BICO") is set to a fixed value without parameter optimisation. If a parameter is undefined, its default value will be used.

A wide range of parameter types is supported, including numerical, integer, ordinal, categorical and boolean. Their basic usage is shown below:

"parameters": [
	{"parameter": "n", "type":"numeric", "value":0.02, "range":[0,1]},
	{"parameter": "i", "type":"integer", "value":10, "range":[0,100]},
	{"parameter": "o", "type":"ordinal", "value":"good", "range":["bad","medium","good"]},
	{"parameter": "c", "type":"categorical", "value":"euclidean", "range":["euclidean","manhattan","mahalanobis"]},
	{"parameter": "b", "type":"boolean", "value":"true"}
]

Legacy ECMLPKDD ’19 version

We proposed an initial idea for our algorithm at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD '19). For reference, the implementation at the time of publication is still available here: https://github.com/MatthiasCarnein/moa/tree/ECMLPKDD.

Since then, this version has been superseded and heavily extended as shown above. In particular, the configuration file used to have fewer settings. A configuration file for this version which optimises the ε ("e") parameter of the DenStream algorithm ("denstream.WithDBSCAN") is shown below:

{
	"windowSize": 1000,
	"ensembleSize": 25,
	"newConfigurations": 10,
	"keepCurrentModel": "true",
	"preventAlgorithmDeath": "true",
	"lambda": 0.05,
	"algorithms": [
		{
			"algorithm": "denstream.WithDBSCAN",
			"parameters": [
				{"parameter": "e", "type":"numeric", "value":0.02, "range":[0,1]}
			]
		}
	]
}