Package: prclust 1.3

prclust: Penalized Regression-Based Clustering Method

Clustering is unsupervised and exploratory in nature, yet it can be performed through penalized regression with grouping pursuit. This package provides two algorithms for fitting penalized regression-based clustering (PRclust) with non-convex grouping penalties, such as the group truncated lasso (TLP), MCP and SCAD. One algorithm is based on the quadratic penalty and the difference-of-convex (DC) method; the other combines the DC method with ADMM (DC-ADMM) and is more efficient. Generalized cross-validation and a stability-based method are provided to select the tuning parameters. The Rand index, adjusted Rand index and Jaccard index are provided to measure the agreement between the estimated cluster memberships and the truth.

Authors: Chong Wu, Wei Pan

prclust_1.3.tar.gz (source)
Binaries: Windows (.zip), macOS (.tgz, x86_64 and arm64), Linux (noble) and WebAssembly (emscripten) builds are available for R 4.3-4.5.
Documentation: prclust.pdf | prclust.html
API: prclust/json

# Install 'prclust' in R:
install.packages('prclust', repos = c('https://chongwu-biostat.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/chongwu-biostat/prclust/issues (0 issues)

Uses libs:
  • C++ – GNU Standard C++ Library v3

On CRAN: prclust 1.3 (2016-12-13)


2.70 score | 6 scripts | 250 downloads | 4 exports | 1 dependency

Last updated 8 years ago, from commit 3da1a5ebea. Checks: 1 OK, 10 NOTE. Indexed: yes.

Target             | Result | Latest binary
Doc / Vignettes    | OK     | Feb 20 2025
R-4.5-win-x86_64   | NOTE   | Feb 20 2025
R-4.5-mac-x86_64   | NOTE   | Feb 20 2025
R-4.5-mac-aarch64  | NOTE   | Feb 20 2025
R-4.5-linux-x86_64 | NOTE   | Feb 20 2025
R-4.4-win-x86_64   | NOTE   | Feb 20 2025
R-4.4-mac-x86_64   | NOTE   | Feb 20 2025
R-4.4-mac-aarch64  | NOTE   | Feb 20 2025
R-4.3-win-x86_64   | NOTE   | Feb 20 2025
R-4.3-mac-x86_64   | NOTE   | Feb 20 2025
R-4.3-mac-aarch64  | NOTE   | Feb 20 2025

Exports: clusterStat, GCV, PRclust, stability

Dependencies: Rcpp

Citation

To cite package ‘prclust’ in publications use:

Wu C, Pan W (2016). prclust: Penalized Regression-Based Clustering Method. R package version 1.3, https://github.com/chongwu-biostat/prclust.

ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see ‘help("citation")’.

Corresponding BibTeX entry:

  @Manual{,
    title = {prclust: Penalized Regression-Based Clustering Method},
    author = {Chong Wu and Wei Pan},
    year = {2016},
    note = {R package version 1.3},
    url = {https://github.com/chongwu-biostat/prclust},
  }

Readme and manuals

prclust

prclust is an R package that makes penalized regression-based clustering easy to use.

Benefits

  • It can handle complex clustering situations, for example the presence of non-convex clusters, where traditional methods such as K-means break down.
  • Many established results and techniques from regression, such as model selection criteria, can be applied or adapted to clustering.

Features

  • Two algorithms (DC-ADMM and a quadratic penalty based algorithm) to implement penalized regression-based clustering.
  • Two criteria (generalized cross-validation and a stability-based criterion) to select the tuning parameters.
  • A function to calculate the Rand, adjusted Rand and Jaccard indices, which measure the agreement between the estimated clusters and the truth; higher values indicate better agreement.

Installation

To install the stable version from CRAN, simply run the following from an R console:

install.packages("prclust")

To install the latest development build, install it from the author's R-universe repository or directly from GitHub with devtools:

# from R-universe
install.packages("prclust", repos = c('https://chongwu-biostat.r-universe.dev', 'https://cloud.r-project.org'))

# or directly from GitHub
if (!require("devtools"))
  install.packages("devtools")
devtools::install_github("ChongWu-Biostat/prclust")

Using prclust

Clustering analysis is regarded as unsupervised learning, in the absence of a class label, as opposed to supervised learning. Over the last few years, a new framework for clustering analysis has been introduced that treats it as a penalized regression problem based on over-parameterization. Specifically, each p-dimensional observation is parameterized with its own centroid; two observations are said to belong to the same cluster if their corresponding centroids are equal. Clustering analysis is then formulated as identifying a small number of distinct centroid values by solving a penalized regression problem. For more details, see the following two papers.

  • Pan W, Shen X, Liu B. "Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty." Journal of Machine Learning Research. 2013;14(1):1865-1889.
  • Wu C, Kwon S, Shen X, Pan W. "A New Algorithm and Theory for Penalized Regression-based Clustering." Journal of Machine Learning Research. 2016;17(188):1-25.
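
In the notation of these papers, the idea can be sketched roughly as the following penalized regression problem (a sketch only; the exact penalty and parameterization differ between the quadratic penalty and DC-ADMM formulations):

\min_{\mu_1, \ldots, \mu_n} \; \frac{1}{2} \sum_{i=1}^{n} \| x_i - \mu_i \|_2^2 \; + \; \lambda \sum_{i < j} p_{\tau}\!\left( \| \mu_i - \mu_j \|_2 \right)

where each observation x_i is assigned its own centroid mu_i and p_tau is a non-convex grouping penalty (for example, the truncated lasso penalty with threshold tau); observations whose estimated centroids coincide form one cluster. The quick example below generates two well-separated Gaussian clusters and fits PRclust to them.
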
library("prclust")
## generate the data
data = matrix(NA,2,100)
data[1,1:50] = rnorm(50,0,0.33)
data[2,1:50] = rnorm(50,0,0.33)
data[1,51:100] = rnorm(50,1,0.33)
data[2,51:100] = rnorm(50,1,0.33)
## set the tunning parameter
lambda1 =1
lambda2 = 3
tau = 0.5
a =PRclust(data,lambda1,lambda2,tau)
a #clustering results

Selecting the tuning parameters

A bonus with the regression approach to clustering is the potential application of many existing model selection methods for regression or supervised learning to clustering. We propose using generalized cross-validation (GCV). GCV can be regarded as an approximation to leave-one-out cross-validation (CV). Hence, GCV provides an approximately unbiased estimate of the prediction error.

We try various tuning parameter values, obtain their corresponding generalized degrees of freedom (GDF) and thus GCV statistics, and then choose the set of tuning parameters with the minimum GCV statistic.

#case 1
gcv1 = GCV(data,lambda1=1,lambda2=1,tau=0.5,sigma=0.25)
gcv1

#case 2
gcv2 = GCV(data,lambda1=1,lambda2=0.7,tau=0.3,sigma=0.25)
gcv2
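
Building on the two cases above, the search can be carried out as a small grid over the tuning parameters, keeping the values that minimize the GCV statistic. The sketch below is illustrative: the component name gcv on the object returned by GCV() is an assumption, so inspect str(gcv1) and adjust the extraction if the statistic is stored differently.

lambda2.grid = c(0.5, 1, 2, 3)
gcv.values = sapply(lambda2.grid, function(l2) {
  fit = GCV(data, lambda1 = 1, lambda2 = l2, tau = 0.5, sigma = 0.25)
  fit$gcv   # assumed name of the component holding the GCV statistic
})
# tuning parameter value with the smallest GCV statistic
lambda2.grid[which.min(gcv.values)]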

GCV, while performing well, requires extensive computation and the specification of a hyper-parameter (the perturbation size). Here, we provide an alternative by modifying a stability-based criterion for determining the tuning parameters.

The main idea of the method is based on cross-validation. That is,

  • Randomly partition the entire data set into a training set and a test set of almost equal size;
  • Cluster the training and test sets separately via PRclust with the same tuning parameters;
  • Measure how well the clusters from the training set predict the clusters of the test set;
  • Try various tuning parameter values, obtain their corresponding stability-based statistics (average prediction strengths), and then choose the set of tuning parameters with the maximum average prediction strength.

#case 1
stab1 = stability(data,rho=1,lambda=1,tau=0.5,n.times = 10)
stab1

#case 2
stab2 = stability(data,rho=1,lambda=0.7,tau=0.3,n.times = 10)
stab2
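
The same grid-search pattern applies to the stability criterion, except that we now keep the tuning parameters with the maximum average prediction strength. A minimal sketch, assuming stability() returns the average prediction strength as a single number (as the printed stab1 above suggests); check str(stab1) if the return value is structured differently.

lambda.grid = c(0.3, 0.5, 0.7, 1)
stab.values = sapply(lambda.grid, function(l)
  stability(data, rho = 1, lambda = l, tau = 0.5, n.times = 10))
# tuning parameter value with the largest average prediction strength
lambda.grid[which.max(stab.values)]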

Evaluating the clustering results

We provide a function, clusterStat, which calculates the Rand, adjusted Rand and Jaccard indices, measuring the agreement between the estimated clusters and the truth; a higher value indicates better agreement.

truth = c(rep(1,50),rep(2,50))
clusterStat(a$group,truth)
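
As a quick sanity check on these indices: comparing the true labels with themselves yields perfect agreement (all three indices equal 1), while comparing them against a random permutation of the labels gives values near chance level.

clusterStat(truth, truth)          # identical partitions: all indices equal 1
clusterStat(sample(truth), truth)  # randomly permuted labels: much lower agreement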

Other examples

Some further examples are provided to illustrate the performance of prclust.


###################################################
#### case 2
###################################################
x1 = -1 + 2*c(0:99)/99
data = matrix(NA,2,200)
data[1,1:100] = x1
data[2,1:100] = (rbinom(100,1,0.5)*2-1)*sqrt(1-x1^2)+runif(100,min = -0.1,max = 0.1)

x2 = -2 + 4*c(0:99)/99
data[1,101:200] = x2
data[2,101:200] = (rbinom(100,1,0.5)*2-1)*sqrt(4-x2^2)+runif(100,min = -0.1,max = 0.1)
## set the tuning parameter
lambda1 =1
lambda2 = 30
tau = 0.6
a =PRclust(data,lambda1,lambda2,tau)
a
## quadratic penalty
lambda1 =1
lambda2 = 1
tau = 0.5
a =PRclust(data,lambda1,lambda2,tau, algorithm ="Quadratic")
a

##################################################
### case 3
##################################################
data = matrix(runif(10*200),10,200)
## set the tuning parameter
lambda1 =1
lambda2 =5
tau = 1
a =PRclust(data,lambda1,lambda2,tau)
a

## quadratic penalty
lambda1 =1
lambda2 =1
tau = 1
a =PRclust(data,lambda1,lambda2,tau, algorithm ="Quadratic")
a
##################################################
### case 4
##################################################
### generate the data set (this one is somewhat more involved)
judge = 1
while(judge != 0)
{
    tempCenter = matrix(rnorm(12,0,5),3,4)
    c1 = tempCenter[,1]
    c2 = tempCenter[,2]
    c3 = tempCenter[,3]
    c4 = tempCenter[,4]

    tempObs1 = 25 + rbinom(1,1,0.5)*25
    tempObs2 = 25 + rbinom(1,1,0.5)*25
    tempObs3 = 25 + rbinom(1,1,0.5)*25
    tempObs4 = 25 + rbinom(1,1,0.5)*25

    Obs = tempObs1 + tempObs2 + tempObs3 + tempObs4
    data = matrix(NA,3,Obs)
    data[1,1:tempObs1] = rnorm(tempObs1,c1[1],1)
    data[2,1:tempObs1] = rnorm(tempObs1,c1[2],1)
    data[3,1:tempObs1] = rnorm(tempObs1,c1[3],1)

    data[1,(tempObs1+1):(tempObs1 + tempObs2)] = rnorm(tempObs2,c2[1],1)
    data[2,(tempObs1+1):(tempObs1 + tempObs2)] = rnorm(tempObs2,c2[2],1)
    data[3,(tempObs1+1):(tempObs1 + tempObs2)] = rnorm(tempObs2,c2[3],1)

    data[1,(tempObs1+tempObs2+1):(tempObs1 + tempObs2 +tempObs3)] = rnorm(tempObs3,c3[1],1)
    data[2,(tempObs1+tempObs2+1):(tempObs1 + tempObs2 +tempObs3)] = rnorm(tempObs3,c3[2],1)
    data[3,(tempObs1+tempObs2+1):(tempObs1 + tempObs2 +tempObs3)] = rnorm(tempObs3,c3[3],1)

    data[1,(tempObs1 + tempObs2 +tempObs3+1):Obs] = rnorm(tempObs4,c4[1],1)
    data[2,(tempObs1 + tempObs2 +tempObs3+1):Obs] = rnorm(tempObs4,c4[2],1)
    data[3,(tempObs1 + tempObs2 +tempObs3+1):Obs] = rnorm(tempObs4,c4[3],1)
    
    a =as.matrix(dist(t(data)))
    if((min(a[1:tempObs1,(tempObs1+1):Obs]) <1 | min(a[(tempObs1+1):(tempObs1+tempObs2),(tempObs1+tempObs2+1):Obs])<1 |min(a[(tempObs1+tempObs2+1):(tempObs1+tempObs2+tempObs3),(tempObs1+tempObs2+tempObs3+1):Obs])<1 )== 0)
    judge =0
}

## set the tuning parameter
lambda1 =1
lambda2 =10
tau =3
a =PRclust(data,lambda1,lambda2,tau)
a

## quadratic penalty
lambda1 =1
lambda2 =1.8
tau =2.3
a =PRclust(data,lambda1,lambda2,tau, algorithm ="Quadratic")
a

#############################################
#### case 5
#############################################
## generate the data set
x1 = -0.5 + c(0:99)/99
data = matrix(NA,3,200)
data[1,1:100] = x1 + rnorm(100,0,0.1)
data[2,1:100] = x1 + rnorm(100,0,0.1)
data[3,1:100] = x1 + rnorm(100,0,0.1)

data[1,101:200] = x1 + 2 + rnorm(100,0,0.1)
data[2,101:200] = x1 + 2 + rnorm(100,0,0.1)
data[3,101:200] = x1 + 2 + rnorm(100,0,0.1)

## set the tuning parameter
lambda1 =1
lambda2 = 1
tau = 0.5
a =PRclust(data,lambda1,lambda2,tau)
a

## quadratic penalty
lambda1 =1
lambda2 = 0.45
tau = 0.35
a =PRclust(data,lambda1,lambda2,tau, algorithm ="Quadratic")
a

#############################################
### case 6
############################################
data = matrix(NA,2,150)
data[1,1:50] =1.5*sin(2*pi*(30+c(0:49)*5)/360)
data[2,1:50] =1.5*cos(2*pi*(30+c(0:49)*5)/360) + runif(50,-0.025,0.025)

data[1,51:100] = rnorm(50,0,0.1)
data[2,51:100] = rnorm(50,0,0.1)

data[1,101:150] = rnorm(50,0.8,0.1)
data[2,101:150] = rnorm(50,0,0.1)
#plot(data[1,],data[2,])
## set the tuning parameter
lambda1 =1
lambda2 = 1
tau = 0.35
a =PRclust(data,lambda1,lambda2,tau)
a

# quadratic penalty
## set the tuning parameter
lambda1 =1
lambda2 = 0.5
tau = 0.4
a =PRclust(data,lambda1,lambda2,tau, algorithm ="Quadratic")
a