Usage
The package offers two modes for running it: Basic and Advanced.
Basic
This mode runs with mostly predefined settings; saving checkpoints is not recommended in this mode. The model is run using the fit method, whose minimal requirements are the data and the α concentration parameter. When no prior is supplied, a weak NIW prior is used automatically.
DPMMSubClusters.fit — Method

fit(all_data::AbstractArray{Float32,2}, local_hyper_params::distribution_hyper_params, α_param::Float32;
    iters::Int64 = 100, init_clusters::Int64 = 1, seed = nothing, verbose = true, save_model = false,
    burnout = 20, gt = nothing, max_clusters = Inf, outlier_weight = 0, outlier_params = nothing)

Run the model (basic mode).
Args and Kwargs
- all_data::AbstractArray{Float32,2}: a DxN array containing the data
- local_hyper_params::distribution_hyper_params: the prior hyperparams
- α_param::Float32: the concentration parameter
- iters::Int64: number of iterations to run the model
- init_clusters::Int64: number of initial clusters
- seed: a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random
- verbose: print progress on every iteration
- save_model: save a checkpoint every 25 iterations
- burnout: how many iterations to wait after creating a cluster before allowing it to split/merge
- gt: ground truth; when supplied, NMI and VI analysis is performed on every iteration
- max_clusters: limit on the number of clusters
- outlier_weight: constant weight of an extra non-splitting component
- outlier_params: hyperparams for the extra non-splitting component
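The seed kwarg only takes effect if Random is loaded on every worker. With the Distributed standard library, that setup looks roughly like this (a sketch; the fit call is commented out and uses hypothetical data):

```julia
using Distributed

# Start two worker processes; the package runs its samplers on workers.
addprocs(2)

# Random must be available on all workers for the seed kwarg to work.
@everywhere using Random

# With the workers ready, a seeded run would then look like:
# fit(all_data, hyper_params, 10.0f0, seed = 12345)
```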
Return Values
- labels: label assignments
- clusters: cluster parameters
- weights: the cluster weights; these sum not to 1, but to 1 minus the weight of all uninstantiated clusters
- iter_count: timing for each iteration
- nmi_score_history: NMI score per iteration (if gt supplied)
- likelihood_history: log likelihood per iteration
- cluster_count_history: cluster count per iteration
- sub_labels: sub-label assignments
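Because the returned weights sum to 1 minus the mass of the uninstantiated clusters, they can be renormalized when proper mixture probabilities are needed (a standalone illustration with made-up weights, not package output):

```julia
# Hypothetical weights as returned by fit; they sum to less than 1
# because uninstantiated clusters hold the remaining mass.
weights = [0.35, 0.30, 0.25]

# Renormalize so the instantiated clusters form a proper distribution.
renormalized = weights ./ sum(weights)

sum(renormalized)  # ≈ 1.0
```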
Example:
julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
...
julia> hyper_params = DPMMSubClusters.niw_hyperparams(1.0,
zeros(2),
5,
[1 0;0 1])
DPMMSubClusters.niw_hyperparams(1.0f0, Float32[0.0, 0.0], 5.0f0, Float32[1.0 0.0; 0.0 1.0])
julia> ret_values = fit(x, hyper_params, 10.0, iters = 100, verbose = false)
...
julia> unique(ret_values[1])
6-element Array{Int64,1}:
3
6
1
2
5
4

DPMMSubClusters.fit — Method

fit(all_data::AbstractArray{Float32,2}, α_param::Float32;
    iters::Int64 = 100, init_clusters::Int64 = 1, seed = nothing, verbose = true, save_model = false,
    burnout = 20, gt = nothing, max_clusters = Inf, outlier_weight = 0, outlier_params = nothing)

Run the model (basic mode) with the default NIW prior.
Args and Kwargs
- all_data::AbstractArray{Float32,2}: a DxN array containing the data
- α_param::Float32: the concentration parameter
- iters::Int64: number of iterations to run the model
- init_clusters::Int64: number of initial clusters
- seed: a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random
- verbose: print progress on every iteration
- save_model: save a checkpoint every 25 iterations
- burnout: how many iterations to wait after creating a cluster before allowing it to split/merge
- gt: ground truth; when supplied, NMI and VI analysis is performed on every iteration
- max_clusters: limit on the number of clusters
- outlier_weight: constant weight of an extra non-splitting component
- outlier_params: hyperparams for the extra non-splitting component
Return Values
- labels: label assignments
- clusters: cluster parameters
- weights: the cluster weights; these sum not to 1, but to 1 minus the weight of all uninstantiated clusters
- iter_count: timing for each iteration
- nmi_score_history: NMI score per iteration (if gt supplied)
- likelihood_history: log likelihood per iteration
- cluster_count_history: cluster count per iteration
- sub_labels: sub-label assignments
Example:
julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
...
julia> ret_values = fit(x, 10.0, iters = 100, verbose = false)
...
julia> unique(ret_values[1])
6-element Array{Int64,1}:
3
6
1
2
5
4

Advanced
This mode allows greater flexibility, and requires a parameters file (see below). It is run via the function dp_parallel.
DPMMSubClusters.dp_parallel — Method

dp_parallel(model_params::String; verbose = true, save_model = true, burnout = 5, gt = nothing)

Run the model in advanced mode.
Args and Kwargs
- model_params::String: a path to a parameters file (see below)
- verbose: print progress on every iteration
- save_model: save a checkpoint every X iterations, where X is specified in the parameters file
- burnout: how many iterations to wait after creating a cluster before allowing it to split/merge
- gt: ground truth; when supplied, NMI and VI analysis is performed on every iteration
Return values
dp_model, iter_count, nmi_score_history, likelihood_history, cluster_count_history

- dp_model: the inferred DPMM model
- iter_count: timing for each iteration
- nmi_score_history: NMI score per iteration (if gt supplied)
- likelihood_history: log likelihood per iteration
- cluster_count_history: cluster count per iteration
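For reference, the score in nmi_score_history is the normalized mutual information between the inferred labels and the ground truth. A self-contained sketch of the quantity (not the package's internal implementation) is:

```julia
# Shannon entropy of a labeling, from empirical cluster frequencies.
function entropy_of(labels)
    n = length(labels)
    h = 0.0
    for c in unique(labels)
        p = count(==(c), labels) / n
        h -= p * log(p)
    end
    return h
end

# Normalized mutual information between two labelings of the same points.
function nmi(a, b)
    n = length(a)
    mi = 0.0
    for ca in unique(a), cb in unique(b)
        nij = count(i -> a[i] == ca && b[i] == cb, 1:n)
        nij == 0 && continue
        mi += (nij / n) * log(n * nij / (count(==(ca), a) * count(==(cb), b)))
    end
    return mi / sqrt(entropy_of(a) * entropy_of(b))
end

nmi([1, 1, 2, 2], [2, 2, 1, 1])  # identical partitions up to relabeling → 1.0
```

Since NMI is invariant to relabeling, a perfect clustering scores 1.0 even when the inferred cluster indices differ from the ground-truth ones.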
In addition, you may restart a previously saved checkpoint:
run_model_from_checkpoint(filename)

Run the model from a checkpoint it previously created; filename is the path to the checkpoint. This should be run only in advanced mode, and note that the data must be in the same path as before.
Example:
julia> dp = run_model_from_checkpoint("checkpoint__50.jld2")
Loading Model:
1.073261 seconds (2.27 M allocations: 113.221 MiB, 2.60% gc time)
Including params
Loading data:
0.000881 seconds (10.02 k allocations: 378.313 KiB)
Creating model:
Node Leaders:
Dict{Any,Any}(2=>Any[2, 3])
Running model:
...

Note that the data is read from an npy file and, unlike the fit function above, should be of shape Samples x Dimensions.
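Since fit expects a D x N array while the npy files used in advanced mode are stored Samples x Dimensions, converting between the two layouts is a simple transpose (an illustration with random data, not package code):

```julia
# 1000 samples of dimension 2, stored samples-by-dimensions as in an npy file.
data_npy_layout = randn(Float32, 1000, 2)

# fit expects dimensions-by-samples (D x N).
data_for_fit = permutedims(data_npy_layout)

size(data_for_fit)  # (2, 1000)
```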
Parameter File
To run the advanced mode you need to specify a parameters file. It is a Julia file with the following structure:
#Data Loading specifics
data_path = "/path/to/data/"
data_prefix = "data_prefix" #If the data file name is bob.npy, this should be 'bob'
#Model Parameters
iterations = 100
hard_clustering = false #Soft or hard assignments
initial_clusters = 1
argmax_sample_stop = 0 #Change to hard assignment from soft at iterations - argmax_sample_stop
split_stop = 0 #Stop split/merge moves at iterations - split_stop
random_seed = nothing #When nothing, a random seed will be used.
max_split_iter = 20
burnout_period = 20
#Model hyperparams
α = 10.0 #Concentration Parameter
hyper_params = DPMMSubClusters.niw_hyperparams(1.0,
zeros(Float32,2),
5,
Matrix{Float32}(I, 2, 2)*1.0)
#Saving specifics:
enable_saving = true
model_save_interval = 1000
save_path = "/path/to/save/dir/"
overwrite_prec = false
save_file_prefix = "checkpoint_"