Usage

The package offers two modes of operation: Basic and Advanced.

Basic

This mode runs with mostly predefined settings; saving checkpoints is not recommended in this mode. The model is run using the fit method, which minimally requires the data and the α concentration parameter. When no prior is supplied, a weak NIW prior is used automatically.

fit(all_data::AbstractArray{Float32,2}, local_hyper_params::distribution_hyper_params, α_param::Float32;
   iters::Int64 = 100, init_clusters::Int64 = 1, seed = nothing, verbose = true, save_model = false, burnout = 20, gt = nothing, max_clusters = Inf, outlier_weight = 0, outlier_params = nothing)

Run the model (basic mode).

Args and Kwargs

  • all_data::AbstractArray{Float32,2} a DxN array containing the data
  • local_hyper_params::distribution_hyper_params the prior hyperparams
  • α_param::Float32 the concentration parameter
  • iters::Int64 number of iterations to run the model
  • init_clusters::Int64 number of initial clusters
  • seed defines a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random.
  • verbose will print progress on every iteration.
  • save_model will save a checkpoint every 25 iterations.
  • burnout the number of iterations to wait after creating a cluster before allowing it to split/merge
  • gt ground truth; when supplied, NMI and VI analysis will be performed on every iteration.
  • max_clusters limits the number of clusters
  • outlier_weight constant weight of an extra non-splitting component (see the sketch after the example below)
  • outlier_params hyperparams for the extra non-splitting component

Return Values

  • labels Label assignments
  • clusters Cluster parameters
  • weights The cluster weights; these do not sum to 1, but to 1 minus the weight of all uninstantiated clusters.
  • iter_count Timing for each iteration
  • nmi_score_history NMI score per iteration (if gt is supplied)
  • likelihood_history Log likelihood per iteration.
  • cluster_count_history Cluster counts per iteration.
  • sub_labels Sub-label assignments

Example:

julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
...

julia> hyper_params = DPMMSubClusters.niw_hyperparams(1.0,
                  zeros(2),
                  5,
                  [1 0;0 1])
DPMMSubClusters.niw_hyperparams(1.0f0, Float32[0.0, 0.0], 5.0f0, Float32[1.0 0.0; 0.0 1.0])

julia> ret_values = fit(x, hyper_params, 10.0, iters = 100, verbose = false)

...

julia> unique(ret_values[1])
6-element Array{Int64,1}:
 3
 6
 1
 2
 5
 4
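
A minimal sketch of using the outlier kwargs, reusing x and hyper_params from the example above. The extra component has a constant weight and is never split or merged, so it can absorb points that fit none of the regular clusters; the broad prior and the 0.05 weight here are illustrative choices, not package defaults:

julia> outlier_hypers = DPMMSubClusters.niw_hyperparams(1.0,
                  zeros(2),
                  5,
                  [100 0;0 100])

julia> ret_values = fit(x, hyper_params, 10.0, iters = 100,
                  outlier_weight = 0.05, outlier_params = outlier_hypers)
...
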
fit(all_data::AbstractArray{Float32,2}, α_param::Float32;
    iters::Int64 = 100, init_clusters::Int64 = 1, seed = nothing, verbose = true, save_model = false, burnout = 20, gt = nothing, max_clusters = Inf, outlier_weight = 0, outlier_params = nothing)

Run the model (basic mode) with default NIW prior.

Args and Kwargs

  • all_data::AbstractArray{Float32,2} a DxN array containing the data
  • α_param::Float32 the concentration parameter
  • iters::Int64 number of iterations to run the model
  • init_clusters::Int64 number of initial clusters
  • seed defines a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random (see the multi-worker sketch after this docstring).
  • verbose will print progress on every iteration.
  • save_model will save a checkpoint every 25 iterations.
  • burnout the number of iterations to wait after creating a cluster before allowing it to split/merge
  • gt ground truth; when supplied, NMI and VI analysis will be performed on every iteration.
  • outlier_weight constant weight of an extra non-splitting component
  • outlier_params hyperparams for the extra non-splitting component

Return Values

  • labels Label assignments
  • clusters Cluster parameters
  • weights The cluster weights; these do not sum to 1, but to 1 minus the weight of all uninstantiated clusters.
  • iter_count Timing for each iteration
  • nmi_score_history NMI score per iteration (if gt is supplied)
  • likelihood_history Log likelihood per iteration.
  • cluster_count_history Cluster counts per iteration.
  • sub_labels Sub-label assignments

Example:

julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
...

julia> ret_values = fit(x, 10.0, iters = 100, verbose = false)

...

julia> unique(ret_values[1])
6-element Array{Int64,1}:
 3
 6
 1
 2
 5
 4
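
Inference runs on the available Julia workers, which is why the seed kwarg must be visible on each of them. A minimal sketch of a seeded multi-worker run (the worker count of 2 and the seed value are arbitrary):

julia> using Distributed

julia> addprocs(2)

julia> @everywhere using DPMMSubClusters

julia> @everywhere using Random

julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
...

julia> ret_values = fit(x, 10.0, iters = 100, seed = 12345, verbose = false)
...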

Advanced

This mode allows greater flexibility and requires a parameters file (see below). It is run via the function dp_parallel.

dp_parallel(model_params::String; verbose = true, save_model = true, burnout = 5, gt = nothing)

Run the model in advanced mode.

Args and Kwargs

  • model_params::String A path to a parameters file (see below)
  • verbose will print progress on every iteration.
  • save_model will save a checkpoint every X iterations, where X is specified in the parameter file.
  • burnout the number of iterations to wait after creating a cluster before allowing it to split/merge
  • gt ground truth; when supplied, NMI and VI analysis will be performed on every iteration.

Return Values

dp_model, iter_count, nmi_score_history, likelihood_history, cluster_count_history

  • dp_model The DPMM model inferred
  • iter_count Timing for each iteration
  • nmi_score_history NMI score per iteration (if gt is supplied)
  • likelihood_history Log likelihood per iteration.
  • cluster_count_history Cluster counts per iteration.
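
A minimal sketch of an advanced-mode run; "my_params.jl" is a hypothetical path to a parameters file such as the one shown in the Parameter File section below:

julia> dp_model, iter_count, nmi_score_history, likelihood_history, cluster_count_history =
           dp_parallel("my_params.jl", verbose = true)
...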

In addition, you may restart a previously saved checkpoint:

run_model_from_checkpoint(filename)

Runs the model from a checkpoint it previously created; filename is the path to the checkpoint. Only applicable to the advanced mode. Note that the data must be in the same path as before.

Example:

julia> dp = run_model_from_checkpoint("checkpoint__50.jld2")
Loading Model:
  1.073261 seconds (2.27 M allocations: 113.221 MiB, 2.60% gc time)
Including params
Loading data:
  0.000881 seconds (10.02 k allocations: 378.313 KiB)
Creating model:
Node Leaders:
Dict{Any,Any}(2=>Any[2, 3])
Running model:
...

Note that the data is read from an npy file and, unlike the fit function above, should be of shape Samples × Dimensions.
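
For instance, if x is a DxN matrix as used by fit above, one way to produce a file in the expected layout is the NPZ.jl package (an assumption; any npy writer will do), transposing so that rows are samples. data_path and data_prefix refer to the parameter file fields described below:

julia> using NPZ

julia> npzwrite(joinpath(data_path, data_prefix * ".npy"), permutedims(x))  # writes an N x D array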

Parameter File

To run the advanced mode you need to specify a parameters file. It is a plain Julia file with the following structure:

#Data Loading specifics
data_path = "/path/to/data/"
data_prefix = "data_prefix"  #If the data file name is bob.npy, this should be 'bob'


#Model Parameters
iterations = 100
hard_clustering = false  #Soft or hard assignments
initial_clusters = 1
argmax_sample_stop = 0 #Change to hard assignment from soft at iterations - argmax_sample_stop
split_stop  = 0 #Stop split/merge moves at  iterations - split_stop

random_seed = nothing #When nothing, a random seed will be used.

max_split_iter = 20
burnout_period = 20

#Model hyperparams
α = 10.0 #Concentration parameter
using LinearAlgebra  #needed for the identity matrix I below
hyper_params = DPMMSubClusters.niw_hyperparams(1.0,
    zeros(Float32,2),
    5,
    Matrix{Float32}(I, 2, 2)*1.0)



#Saving specifics:
enable_saving = true
model_save_interval = 1000
save_path = "/path/to/save/dir/"
overwrite_prec = false
save_file_prefix = "checkpoint_"
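
Since the file is plain Julia, a quick sanity check before launching a long run is to include it in a fresh session; "my_params.jl" is a hypothetical path:

julia> using DPMMSubClusters

julia> include("my_params.jl")  # defines the globals listed above
...

julia> α, iterations
(10.0, 100)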