Usage
The package offers two modes for running it: Basic and Advanced.
Basic
This mode runs with mostly predefined settings; saving checkpoints is not recommended in this mode. The model is run with the fit method, which requires at minimum the data and the α concentration parameter. When a prior is not supplied, a weak NIW prior is used automatically.
DPMMSubClusters.fit (Method)
fit(all_data::AbstractArray{Float32,2}, local_hyper_params::distribution_hyper_params, α_param::Float32;
iters::Int64 = 100, init_clusters::Int64 = 1, seed = nothing, verbose = true, save_model = false, burnout = 20, gt = nothing, max_clusters = Inf, outlier_weight = 0, outlier_params = nothing)
Run the model (basic mode).
Args and Kwargs
all_data::AbstractArray{Float32,2}: a DxN array containing the data
local_hyper_params::distribution_hyper_params: the prior hyperparams
α_param::Float32: the concentration parameter
iters::Int64: number of iterations to run the model
init_clusters::Int64: number of initial clusters
seed: a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random
verbose: prints on every iteration
save_model: saves a checkpoint every 25 iterations
burnout: how long to wait after creating a cluster before allowing it to split/merge
gt: ground truth; when supplied, NMI and VI analysis is performed on every iteration
max_clusters: limits the number of clusters
outlier_weight: constant weight of an extra non-splitting component
outlier_params: hyperparams for an extra non-splitting component
Return Values
labels: label assignments
clusters: cluster parameters
weights: the cluster weights; they do not sum to 1, but to 1 minus the weight of all uninstantiated clusters
iter_count: timing for each iteration
nmi_score_history: NMI score per iteration (if gt is supplied)
likelihood_history: log likelihood per iteration
cluster_count_history: cluster counts per iteration
sub_labels: sub-label assignments
Example:
julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
...
julia> hyper_params = DPMMSubClusters.niw_hyperparams(1.0,
zeros(2),
5,
[1 0;0 1])
DPMMSubClusters.niw_hyperparams(1.0f0, Float32[0.0, 0.0], 5.0f0, Float32[1.0 0.0; 0.0 1.0])
julia> ret_values= fit(x,hyper_params,10.0, iters = 100, verbose=false)
...
julia> unique(ret_values[1])
6-element Array{Int64,1}:
3
6
1
2
5
4
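The return values listed above come back as a single tuple, which is why the example indexes ret_values[1] to get the labels. As a minimal sketch, assuming the tuple follows the documented order of the eight return values, it can be destructured into named variables. The variable names on the left-hand side are illustrative choices, not part of the API.

```julia
using DPMMSubClusters

# Synthetic data, generated with the same call as in the example above.
x, y, clusters = generate_gaussian_data(10000, 2, 6, 100.0)

# Same prior as in the example above.
hyper_params = DPMMSubClusters.niw_hyperparams(1.0, zeros(2), 5, [1 0; 0 1])

# Assumption: the tuple follows the documented order of return values.
labels, cluster_params, weights, iter_count, nmi_score_history,
    likelihood_history, cluster_count_history, sub_labels =
    fit(x, hyper_params, 10.0, iters = 100, verbose = false)

# One label per data point (one column of x).
println(length(labels) == size(x, 2))

# Number of distinct clusters found.
println(length(unique(labels)))
```

Naming the fields once at the call site avoids positional indexing such as ret_values[6] later in the analysis.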
DPMMSubClusters.fit (Method)
fit(all_data::AbstractArray{Float32,2}, α_param::Float32;
iters::Int64 = 100, init_clusters::Int64 = 1, seed = nothing, verbose = true, save_model = false, burnout = 20, gt = nothing, max_clusters = Inf, outlier_weight = 0, outlier_params = nothing)
Run the model (basic mode) with the default NIW prior.
Args and Kwargs
all_data::AbstractArray{Float32,2}: a DxN array containing the data
α_param::Float32: the concentration parameter
iters::Int64: number of iterations to run the model
init_clusters::Int64: number of initial clusters
seed: a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random
verbose: prints on every iteration
save_model: saves a checkpoint every 25 iterations
burnout: how long to wait after creating a cluster before allowing it to split/merge
gt: ground truth; when supplied, NMI and VI analysis is performed on every iteration
max_clusters: limits the number of clusters
outlier_weight: constant weight of an extra non-splitting component
outlier_params: hyperparams for an extra non-splitting component
Return Values
labels: label assignments
clusters: cluster parameters
weights: the cluster weights; they do not sum to 1, but to 1 minus the weight of all uninstantiated clusters
iter_count: timing for each iteration
nmi_score_history: NMI score per iteration (if gt is supplied)
likelihood_history: log likelihood per iteration
cluster_count_history: cluster counts per iteration
sub_labels: sub-label assignments
Example:
julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
...
julia> ret_values= fit(x,10.0, iters = 100, verbose=false)
...
julia> unique(ret_values[1])
6-element Array{Int64,1}:
3
6
1
2
5
4
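The seed keyword described above requires Random to be loaded on every worker before fit is called. A minimal sketch of that setup follows. It assumes an integer is an acceptable seed value (the argument list above does not state the expected type), and 1234 is an arbitrary value.

```julia
using Distributed

# Load Random on all processes, as the seed argument requires.
@everywhere using Random

using DPMMSubClusters

x, y, clusters = generate_gaussian_data(10000, 2, 6, 100.0)

# Assumption: an integer seed is accepted; 1234 is an arbitrary value.
ret_values = fit(x, 10.0, iters = 100, seed = 1234, verbose = false)

labels = ret_values[1]
println(length(unique(labels)))
```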
Advanced
This mode allows greater flexibility and requires a parameters file (see below). It is run with the function dp_parallel.
DPMMSubClusters.dp_parallel (Method)
dp_parallel(model_params::String; verbose = true, save_model = true, burnout = 5, gt = nothing)
Run the model in advanced mode.
Args and Kwargs
model_params::String: a path to a parameters file (see below)
verbose: prints on every iteration
save_model: saves a checkpoint every X iterations, where X is specified in the parameter file
burnout: how long to wait after creating a cluster before allowing it to split/merge
gt: ground truth; when supplied, NMI and VI analysis is performed on every iteration
Return values
dp_model, iter_count, nmi_score_history, likelihood_history, cluster_count_history
dp_model: the inferred DPMM model
iter_count: timing for each iteration
nmi_score_history: NMI score per iteration (if gt is supplied)
likelihood_history: log likelihood per iteration
cluster_count_history: cluster counts per iteration
In addition, you may restart from a previously saved checkpoint:
run_model_from_checkpoint(filename)
Runs the model from a checkpoint it created; filename is the path to the checkpoint. Use this only in advanced mode, and note that the data must be at the same path as before.
Example:
julia> dp = run_model_from_checkpoint("checkpoint__50.jld2")
Loading Model:
1.073261 seconds (2.27 M allocations: 113.221 MiB, 2.60% gc time)
Including params
Loading data:
0.000881 seconds (10.02 k allocations: 378.313 KiB)
Creating model:
Node Leaders:
Dict{Any,Any}(2=>Any[2, 3])
Running model:
...
Note that the data is read from an npy file and, unlike the input to the fit function above, should be laid out as Samples X Dimensions.
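The two layouts differ only by a transpose. As a small sketch with hypothetical toy values, a Samples X Dimensions matrix (the layout of the npy file) can be converted to the DxN layout that fit expects with permutedims from Julia's Base:

```julia
# Hypothetical toy data: 5 samples with 2 dimensions each (Samples x Dimensions),
# the layout expected in the npy file used by the advanced mode.
samples_by_dims = Float32[1 2; 3 4; 5 6; 7 8; 9 10]

# fit expects a DxN array, so swap the two axes.
dims_by_samples = permutedims(samples_by_dims)

println(size(samples_by_dims))  # (5, 2)
println(size(dims_by_samples))  # (2, 5)
```

permutedims returns a new Matrix{Float32} rather than a lazy wrapper, so the result matches the AbstractArray{Float32,2} type in the fit signature.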
Parameter File
To run the advanced mode you need to specify a parameters file. It is a Julia
file with the following structure:
#Data Loading specifics
data_path = "/path/to/data/"
data_prefix = "data_prefix" #If the data file name is bob.npy, this should be 'bob'
#Model Parameters
iterations = 100
hard_clustering = false #Soft or hard assignments
initial_clusters = 1
argmax_sample_stop = 0 #Change to hard assignment from soft at iterations - argmax_sample_stop
split_stop = 0 #Stop split/merge moves at iterations - split_stop
random_seed = nothing #When nothing, a random seed will be used.
max_split_iter = 20
burnout_period = 20
#Model hyperparams
α = 10.0 #Concentration parameter
hyper_params = DPMMSubClusters.niw_hyperparams(1.0,
zeros(Float32,2),
5,
Matrix{Float32}(I, 2, 2)*1.0)
#Saving specifics:
enable_saving = true
model_save_interval = 1000
save_path = "/path/to/save/dir/"
overwrite_prec = false
save_file_prefix = "checkpoint_"
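The argmax_sample_stop and split_stop settings are offsets counted back from the last iteration, as their comments indicate. The arithmetic can be sketched as follows, using hypothetical non-zero offsets. Whether the switch happens exactly at or just after the computed iteration is not specified here, so treat the results as approximate.

```julia
# Hypothetical settings for illustration; the example file above uses 0 for both offsets.
iterations = 100
argmax_sample_stop = 5
split_stop = 10

# Iteration around which assignments change from soft to hard.
hard_assignment_from = iterations - argmax_sample_stop

# Iteration around which split/merge moves stop.
split_merge_until = iterations - split_stop

println(hard_assignment_from)  # 95
println(split_merge_until)     # 90
```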