textClust is a stream clustering algorithm for textual data. It identifies and tracks topics over time in a stream of texts. The algorithm uses the widely used two-phase clustering approach, in which the stream is first summarised in real time. The result is a large number of small preliminary clusters, called 'micro-clusters'. Our micro-clusters maintain enough information to update them over time and to efficiently calculate the cosine similarity between them based on the TF-IDF vectors of their texts. Upon request, the micro-clusters can be reclustered to generate the final result using any distance-based clustering algorithm, such as hierarchical clustering. To keep the micro-clusters up to date, our algorithm applies a fading strategy in which micro-clusters that are not updated regularly lose relevance and are eventually removed.
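
The ideas described above can be illustrated with a small base R sketch. It is not the package's internal implementation: the toy term counts, the smoothed IDF variant, the helper `fade()` and the exponential decay `2^(-lambda * dt)` (a common choice in stream clustering) are all assumptions made for illustration.

```r
# Toy micro-clusters, each summarised by its term frequencies (made-up data)
clusters <- list(
  mc1 = c(main = 3, topic = 3),
  mc2 = c(similar = 2, topic = 2),
  mc3 = c(something = 1, different = 1)
)

# Term-frequency matrix: one row per micro-cluster, one column per term
vocab <- sort(unique(unlist(lapply(clusters, names))))
tf <- t(sapply(clusters, function(mc) {
  v <- setNames(numeric(length(vocab)), vocab)
  v[names(mc)] <- mc
  v
}))

# TF-IDF weighting (assumed variant: log(N / df) + 1)
idf <- log(nrow(tf) / colSums(tf > 0)) + 1
tfidf <- sweep(tf, 2, idf, `*`)

# Pairwise cosine similarity between micro-clusters
norms <- sqrt(rowSums(tfidf^2))
sim <- (tfidf %*% t(tfidf)) / (norms %o% norms)

# Recluster the micro-clusters with hierarchical clustering on cosine distance
macro <- cutree(hclust(as.dist(1 - sim), method = "average"), k = 2)

# Hypothetical fading function: a weight decays exponentially
# with the time dt since the micro-cluster was last updated
fade <- function(weight, lambda, dt) weight * 2^(-lambda * dt)

print(round(sim, 3))
print(macro)
print(fade(weight = 1, lambda = 0.1, dt = 10))
```

In this toy example `mc1` and `mc2` share the term "topic" and end up in the same macro-cluster, while `mc3` shares no terms with either and stays separate.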



An implementation of the proposed algorithm is available here: https://github.com/MatthiasCarnein/textClust.

The easiest way to install the package is by using devtools:
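
Assuming the GitHub repository linked above is laid out as a standard R package, a typical devtools installation looks like this (the repository path is taken from the URL above):

```r
# Install devtools first if it is not yet available
# install.packages("devtools")

# Install textClust from GitHub (assumes a standard package layout in the repository)
devtools::install_github("MatthiasCarnein/textClust")
```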


Usage and interfaces are largely based on the R-package stream with modifications for the analysis of text data:


## load packages
library(stream)
library(textClust)

## define data stream
data = data.frame(text=sample(c("Main Topic", "Similar Topic", "Something Different"), size=1000, replace=TRUE), stringsAsFactors=FALSE)
stream = DSD_Memory(data)

# Alternatively read data from file:
# stream = DSD_ReadCSV("file.txt", sep = "\t", comment.char="", quote="")

## define text clustering algorithm
algorithm = DSC_textClust(r=.4, lambda=0.1, tgap=100, nmin=1, nmax=2, k=3, stopword=c(), minWeight=3, textCol=1)

## run the algorithm
update(algorithm, stream, n=1000)

## get micro clusters
get_centers(algorithm, "micro")

## get macro clusters
get_centers(algorithm, "macro")

## Assign new texts to existing clusters
data = data.frame(text=sample(c("Main Topic", "Something Different"), size=100, replace=TRUE), stringsAsFactors=FALSE)
get_assignment(algorithm, data)

The algorithm can also be evaluated using prequential (interleaved test-then-train) evaluation:

evaluation = textClust::evaluate_cluster(algorithm, stream, measure=c("numMicroClusters", "purity"), n=1000, assign="micro", type="micro", assignMethod="nn", horizon=100)
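
To make the test-then-train order concrete, here is a minimal base R sketch of a prequential loop. It does not use textClust at all: the toy "model" (the set of texts seen so far) and the score (the share of a batch the model already knows) are hypothetical stand-ins for the clusterer and its evaluation measures.

```r
set.seed(1)
texts <- sample(c("Main Topic", "Similar Topic", "Something Different"),
                size = 300, replace = TRUE)

horizon <- 100          # number of observations per evaluation window
seen <- character(0)    # toy "model": the distinct texts observed so far
known <- numeric(0)     # one score per window

for (start in seq(1, length(texts), by = horizon)) {
  batch <- texts[start:min(start + horizon - 1, length(texts))]
  # Test first: score the batch with the model as it was before seeing it
  known <- c(known, mean(batch %in% seen))
  # Then train: update the model with the same batch
  seen <- union(seen, batch)
}

print(known)
```

The first window always scores 0 because the model is still empty when it is tested, which is exactly the property that distinguishes prequential evaluation from testing on data the model has already seen.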