scRiskCell

Description¶

scRiskCell is a toolkit for identifying functionally perturbed cell populations from single-cell transcriptomic data, enabling the study of dynamic cellular processes at single-cell resolution. In scRiskCell, cells from different conditions (e.g., healthy vs. disease) are treated equally within each cell type to avoid confounding effects from donor labels. The toolkit reorders cells (e.g., β-cells) based on their association with a given phenotype (such as disease status), constructing a trajectory that reflects phenotypic progression or functional transition.

While scRiskCell is particularly well suited to identifying disease-associated risk cells and exploring disease progression, it is also broadly applicable to other scenarios. For example, it can be used to identify drug-resistant cells, stress-responsive cells, or other phenotypically distinct subpopulations based on any binary or continuous trait of interest, not limited to donor identity.

For a video tutorial on how to use scRiskCell, please see: https://youtu.be/Zx_ymupnoeE

Graphical abstract

Key methodological steps include:

Split the data into two subsets: the endpoint group and the intermediate group.
Apply PCA for dimensionality reduction.
Use logistic regression to model the phenotype (e.g., disease status) as a binary variable.
Compute an index for each cell based on the regression coefficients.
Determine an index threshold.
Define risk cells as those with index values above the threshold.
Calculate the proportion of risk cells per donor.

Installation¶

The Python version must be >= 3.9.

wget http://lin-group.cn/server/scRiskCell/download/scRiskCell.zip
unzip scRiskCell.zip
conda create --name scRiskCell python=3.10
conda activate scRiskCell
cd scRiskCell
pip install . --ignore-installed
pip show scRiskCell

Data preparation¶

To use scRiskCell, the input should be a pandas.DataFrame where rows represent cells and columns represent genes. This matrix can be extracted from a processed Seurat object using:

    # In R
    expr_matrix <- t(as.data.frame(Seurat_object@assays$RNA@scale.data))

After extracting the scaled expression matrix, you must append four required metadata columns to the DataFrame:

Donor: Specifies the donor identity each cell originates from.
Category: Indicates the phenotype class for each cell.
For example, in a diabetic progression study, this column may contain values such as:

"ND" (non-diabetic)
"preT2D" (pre-type 2 diabetes)
"T2D" (type 2 diabetes)

Label: A numeric encoding of the Category column.
For instance, you can encode:

"ND" as 0
"preT2D" as 1
"T2D" as 2

Cell_id: A unique identifier for each cell.

⚠️ Important Notes:

The names of the four metadata columns must be exactly: Donor, Category, Label, and Cell_id.
The Category column must contain at least two distinct phenotype groups (e.g., "ND" and "T2D").
The Label column should correspond numerically to the Category values for modeling purposes.
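
As a rough illustration of the layout described above, the sketch below builds a tiny table with made-up values and checks the required columns. The helper `check_input`, the gene names, and the cell IDs are hypothetical and are not part of scRiskCell.

```python
import pandas as pd

REQUIRED = ["Donor", "Category", "Label", "Cell_id"]

def check_input(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if df breaks the rules listed above."""
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"missing metadata columns: {missing}")
    if df["Category"].nunique() < 2:
        raise ValueError("Category needs at least two distinct phenotype groups")
    if not df["Cell_id"].is_unique:
        raise ValueError("Cell_id values must be unique")
    # Each Category must map to exactly one Label value.
    if (df.groupby("Category")["Label"].nunique() != 1).any():
        raise ValueError("each Category must map to exactly one Label")

# Toy scaled expression matrix: rows are cells, columns are genes (made-up values).
expr = pd.DataFrame({"GENE_A": [0.5, -1.2, 0.3], "GENE_B": [1.1, 0.0, -0.7]})
expr["Donor"] = ["H1", "H1", "T2D1"]
expr["Category"] = ["ND", "ND", "T2D"]
expr["Label"] = expr["Category"].map({"ND": 0, "T2D": 1})
expr["Cell_id"] = ["H1_c1", "H1_c2", "T2D1_c1"]

check_input(expr)  # passes silently for a well-formed table
```

Deriving Label from Category with a single mapping keeps the two columns consistent by construction.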

Basic usage¶

Functions included in scRiskCell:
Split · PerformPCA · PerformRegression · GetDiaseaseIndex · GetIndexThreshold · GetCellRisk · GetDonorRiskRatio · scRiskCell · PlotIndexViolin_by_Category · PlotIndexBoxplot_by_Category · PlotIndexViolin_by_Donor · PlotIndexBoxplot_by_Donor · PlotRatioBoxplot_by_Category · PlotRatioStackBar_by_Donor · PlotROC

`Split`¶

The goal is to separate the end states (such as healthy and fully diseased) from the intermediate state (such as a pre-disease condition), enabling more accurate modeling of functional transitions.

Usage¶

Split(df, category_group1, category_group2)

Argument¶

df: a pandas.DataFrame containing scaled gene expression data and metadata. Must follow the structure described in the Data preparation section.
category_group1: a list of phenotype names representing the two endpoint categories. For example: ["ND", "T2D"].
category_group2: a list of phenotype names representing the intermediate categories. For example: ["preT2D"].

Result¶

Returns two DataFrames:

endpoint_df: a subset of df containing cells from category_group1.
intermediate_df: a subset of df containing cells from category_group2.
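
Conceptually, the split is a row filter on the Category column. The sketch below shows that idea with plain pandas; `split_by_category` is an illustrative stand-in written for this page, not the library's implementation, and the toy values are made up.

```python
import pandas as pd

def split_by_category(df, category_group1, category_group2):
    """Illustrative stand-in for Split: subset rows by the Category column."""
    endpoint_df = df[df["Category"].isin(category_group1)].copy()
    intermediate_df = df[df["Category"].isin(category_group2)].copy()
    return endpoint_df, intermediate_df

toy = pd.DataFrame({
    "GENE_A": [0.1, 0.4, -0.3, 0.9],
    "Category": ["ND", "preT2D", "T2D", "ND"],
})
endpoint_df, intermediate_df = split_by_category(toy, ["ND", "T2D"], ["preT2D"])
print(len(endpoint_df), len(intermediate_df))  # 3 1
```

Passing an empty list as the second group simply yields an empty intermediate table, which matches how the two-category example dataset is handled later on this page.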

`PerformPCA`¶

Reduces the high-dimensional gene expression matrix to a lower-dimensional space defined by user-specified principal components (e.g., 20 PCs).

Usage¶

PerformPCA(endpoint_df, intermediate_df, n_components)

Argument¶

endpoint_df and intermediate_df: the output from Split.
n_components: an integer specifying the number of principal components to retain (default 20).

Result¶

pca_endpoint_df and pca_intermediate_df: two reduced-dimension expression matrices, each with the four required columns: Donor, Category, Label, and Cell_id.
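
For readers unfamiliar with the idea, here is a minimal NumPy sketch of projecting two cell groups onto a shared set of principal axes. It assumes the axes are fitted on the endpoint cells and then reused for the intermediate cells; this page does not state how PerformPCA fits its components, so treat that as an assumption. The function `pca_project` and the random data are hypothetical.

```python
import numpy as np

def pca_project(endpoint, intermediate, n_components):
    """Fit PCA on `endpoint` via SVD and project both matrices onto the same axes."""
    mean = endpoint.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(endpoint - mean, full_matrices=False)
    axes = vt[:n_components]
    return (endpoint - mean) @ axes.T, (intermediate - mean) @ axes.T

rng = np.random.default_rng(0)
endpoint = rng.normal(size=(50, 10))      # 50 cells x 10 genes (random toy data)
intermediate = rng.normal(size=(8, 10))   # 8 cells x 10 genes
pcs_end, pcs_mid = pca_project(endpoint, intermediate, n_components=3)
print(pcs_end.shape, pcs_mid.shape)       # (50, 3) (8, 3)
```

Sharing one set of axes keeps the two groups comparable, so a model trained on one can be applied to the other.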

`PerformRegression`¶

Fits a logistic regression model on the PCA-reduced endpoint data. The fitted model is then used by GetDiaseaseIndex to calculate the disease index for each cell.

Usage¶

PerformRegression(pca_endpoint_df, n_components)

Argument¶

pca_endpoint_df: output from PerformPCA.
n_components: consistent with the n_components parameter used in PerformPCA.

Result¶

A fitted logistic regression model object.
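
To make the modeling step concrete, the sketch below fits a binary logistic regression on toy PC scores with plain gradient descent. It only illustrates the technique; the solver, regularization, and settings used by PerformRegression are not described on this page, and `fit_logistic` and the data are hypothetical.

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean logistic loss; returns (weights, intercept)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy PC scores: cells with Label 1 sit at higher PC1 values (made-up numbers).
X = [[-2.0, 0.3], [-1.5, -0.2], [-1.0, 0.1], [1.0, 0.0], [1.5, 0.4], [2.0, -0.3]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
```

In this toy setup the weight on the first component comes out positive, because higher PC1 values go with Label 1.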

`GetDiaseaseIndex`¶

Using the logistic regression model and the PCA-reduced expression data, this function computes the disease index for each cell, reflecting how likely the cell is to be functionally perturbed.

Usage¶

GetDiaseaseIndex(pca_endpoint_df, pca_intermediate_df, model, n_components)

Argument¶

pca_endpoint_df and pca_intermediate_df: the output from PerformPCA.
model: the output from PerformRegression.
n_components: consistent with the n_components parameter used in PerformPCA.

Result¶

A pandas.DataFrame, index_df, containing the key columns Donor, Category, Label, Cell_id, and Disease_index.
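
One plausible reading of "an index based on regression coefficients" is the linear predictor (log-odds) of the fitted model, which can be negative or positive like the Disease_index values shown in the Example section. The sketch below assumes that reading; the exact formula used by GetDiaseaseIndex is not given on this page, and `linear_index`, the coefficients, and the cell IDs are made up.

```python
def linear_index(pcs, weights, intercept):
    """Log-odds style score: intercept + sum_k weights[k] * pcs[k]."""
    return intercept + sum(w * x for w, x in zip(weights, pcs))

weights, intercept = [1.5, -0.5], 0.2   # made-up coefficients
cells = {"c1": [-2.0, 1.0], "c2": [0.0, 0.0], "c3": [2.0, -1.0]}
index = {cid: linear_index(pcs, weights, intercept) for cid, pcs in cells.items()}

# Cells ordered from least to most disease-like under this score.
ordered = sorted(index, key=index.get)
print(ordered)  # ['c1', 'c2', 'c3']
```

Sorting cells by such a score is what gives the ordering along the phenotype axis described in the Description section.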

`GetIndexThreshold`¶

Determine a disease index threshold using one of two methods:

Quantile Selection Method (default):
Based on the ascending order of Disease_index, a cutoff threshold is selected according to a specified quantile. Cells with Disease_index values above this threshold are designated as risk cells.
Sliding Window Method:
Apply a moving window across the ascending order of Disease_index to identify the threshold at which the number of cells from the most severe phenotype (highest Label value) within the window meets or exceeds a specified count. Cells with Disease_index values above this threshold are designated as risk cells.

Usage¶

GetIndexThreshold(index_df, get_threshold_params)

Argument¶

index_df : the output from GetDiaseaseIndex.
get_threshold_params: a dictionary that specifies the method for threshold determination:
- For Quantile Selection Method:
```
{"ratio": 0.85}
```
- For Sliding Window Method:
```
{"window_size": 100, "threshold": 60, "step": 1}
```
  window_size: size of the sliding window.
  threshold: minimum number of high-label cells required in the window.
  step: step size for moving the window forward.
Note: Keys "ratio", "window_size", "threshold", and "step" are fixed; only values can be changed. If get_threshold_params is not provided, the function defaults to the quantile method with ratio = 0.85.

Result¶

A single numeric value representing the disease index threshold.
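
The two threshold rules can be sketched in a few lines of plain Python. This is an interpretation of the description above, not the library's code: the quantile version takes the value at the quantile position without interpolation, and the sliding-window version returns the index value at the start of the first qualifying window. Both choices are assumptions, and the function names and data are hypothetical.

```python
def quantile_threshold(values, ratio=0.85):
    """Value at the given quantile position of the ascending-sorted indices."""
    ordered = sorted(values)
    pos = min(int(ratio * len(ordered)), len(ordered) - 1)
    return ordered[pos]

def sliding_window_threshold(values, labels, window_size, threshold, step=1):
    """Index value at the start of the first window holding >= threshold top-label cells."""
    top = max(labels)
    pairs = sorted(zip(values, labels))  # ascending by index value
    for start in range(0, len(pairs) - window_size + 1, step):
        window = pairs[start:start + window_size]
        if sum(1 for _, lab in window if lab == top) >= threshold:
            return window[0][0]
    return None  # no window reached the required count

values = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
q = quantile_threshold(values, ratio=0.5)
s = sliding_window_threshold(values, labels, window_size=4, threshold=3)
print(q, s)  # 2.0 0.0
```

The quantile rule depends only on the index distribution, while the sliding-window rule also uses the labels, so the two can give quite different cutoffs on the same data.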

`GetCellRisk`¶

Label cells as risk or non-risk based on a predefined disease index threshold. The threshold can be derived from quantile-based or sliding window-based methods.

Usage¶

GetCellRisk(index_df, threshold)

Argument¶

index_df: the output from GetDiaseaseIndex.
threshold: the output from GetIndexThreshold.

Result¶

A pandas.DataFrame identical to the input index_df, but with an additional column:
Risk: a binary indicator where 1 denotes a risk cell (Disease_index > threshold), and 0 denotes a non-risk cell.

`GetDonorRiskRatio`¶

Calculate the proportion of risk and non-risk cells for each donor.

Usage¶

GetDonorRiskRatio(risk_df)

Argument¶

risk_df: the output from GetCellRisk.

Result¶

A pandas.DataFrame where each row corresponds to a donor, and columns include:
Donor: donor/source ID;
Category: phenotype of the donor (e.g., control or disease);
Label: phenotype label of the donor (e.g., 0 or 1);
Risk_ratio: proportion of risk cells in the donor;
non_Risk_ratio: proportion of non-risk cells in the donor.
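
The labeling and per-donor summary amount to a comparison followed by a grouped mean. A minimal sketch with plain Python records is shown below; `label_risk`, `ratio_per_donor`, and the toy cells are hypothetical and only mirror the column names used on this page.

```python
from collections import defaultdict

def label_risk(cells, threshold):
    """Add Risk = 1 when Disease_index > threshold, else 0."""
    return [{**c, "Risk": int(c["Disease_index"] > threshold)} for c in cells]

def ratio_per_donor(risk_cells):
    """Per-donor fraction of risk and non-risk cells."""
    counts = defaultdict(lambda: [0, 0])  # donor -> [risk cells, all cells]
    for c in risk_cells:
        counts[c["Donor"]][0] += c["Risk"]
        counts[c["Donor"]][1] += 1
    return {
        donor: {"Risk_ratio": r / n, "non_Risk_ratio": 1 - r / n}
        for donor, (r, n) in counts.items()
    }

cells = [
    {"Donor": "H1", "Disease_index": -4.0},
    {"Donor": "H1", "Disease_index": 0.5},
    {"Donor": "T2D1", "Disease_index": 3.0},
    {"Donor": "T2D1", "Disease_index": 5.0},
    {"Donor": "T2D1", "Disease_index": 1.0},
    {"Donor": "T2D1", "Disease_index": 6.0},
]
ratios = ratio_per_donor(label_risk(cells, threshold=2.0))
print(ratios["T2D1"])  # {'Risk_ratio': 0.75, 'non_Risk_ratio': 0.25}
```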

`scRiskCell`¶

Integrates all key steps (dimensionality reduction, model fitting, disease index calculation, risk cell labeling, and risk cell ratio calculation) into one unified workflow to identify risk cells and compute donor-level risk cell proportions.

Usage¶

scRiskCell(df, category_group1, category_group2, n_components, get_threshold_params)

Argument¶

df: a pandas.DataFrame containing scaled gene expression data and metadata. Must follow the structure described in the Data Preparation section.
category_group1: a list of phenotype names representing the two endpoint categories. For example: ["ND", "T2D"].
category_group2: a list of phenotype names representing the intermediate categories. For example: ["preT2D"].
n_components: an integer specifying the number of principal components to retain (default 20).
get_threshold_params: a dictionary that specifies the method for threshold determination:
- For Quantile Selection Method:
```
{"ratio": 0.85}
```
- For Sliding Window Method:
```
{"window_size": 100, "threshold": 60, "step": 1}
```
  window_size: size of the sliding window.
  threshold: minimum number of high-label cells required in the window.
  step: step size for moving the window forward.
Note: Keys "ratio", "window_size", "threshold", and "step" are fixed; only values can be changed.
If get_threshold_params is not provided, the function defaults to the quantile method with ratio = 0.85.

Result¶

pca_endpoint_df and pca_intermediate_df: two reduced-dimension expression matrices, each containing four required metadata columns: Donor, Category, Label, and Cell_id.
model: a fitted logistic regression model object.
index_df: a pandas.DataFrame containing columns: Donor, Category, Label, Cell_id, and the computed Disease_index for each cell.
threshold: a single numeric value representing the determined disease index threshold used to define risk cells.
risk_df: a pandas.DataFrame indicating whether each cell is classified as a risk cell.
donor_risk_ratio: a pandas.DataFrame summarizing the proportion of risk cells per donor.

📊 Visualization functions¶

These functions help visualize index values, risk cell proportions, and donor-level summaries.

PlotIndexViolin_by_Category¶

Draws violin plots to visualize the distribution of the disease index across different categories (e.g., control vs. disease).

Usage¶
```
   PlotIndexViolin_by_Category(index_df, colors=None, save_path=None)
```
PlotIndexBoxplot_by_Category¶

Draws boxplots to compare distributions of the disease index across different categories (e.g., control vs. disease).

Usage¶
```
   PlotIndexBoxplot_by_Category(index_df, colors=None, save_path=None)
```
PlotIndexViolin_by_Donor¶

Draws violin plots to show index distribution per donor.

Usage¶
```
   PlotIndexViolin_by_Donor(index_df, colors=None, save_path=None)
```
PlotIndexBoxplot_by_Donor¶

Draws boxplots to show index distributions per donor.

Usage¶
```
   PlotIndexBoxplot_by_Donor(index_df, colors=None, save_path=None)
```
PlotRatioBoxplot_by_Category¶

Draws boxplots of the proportion of risk cells per donor grouped by category.

Usage¶
```
   PlotRatioBoxplot_by_Category(donor_risk_ratio, colors=None, save_path=None)
```
PlotRatioStackBar_by_Donor¶

Creates a stacked bar chart showing the composition of risk vs non-risk cells for each donor.

Usage¶
```
   PlotRatioStackBar_by_Donor(donor_risk_ratio, colors=None, save_path=None)
```
PlotROC¶

Plots the ROC curve for phenotype classification based on the risk ratio. If the dataset includes more than two phenotype categories, the ROC curve is computed only for the two endpoint categories (i.e., excluding any intermediate disease states).

Usage¶
```
   PlotROC(donor_risk_ratio, color=None, save_path=None)
```
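
To show what a donor-level ROC summarizes, the sketch below computes the area under the curve directly from risk ratios and labels, using the pairwise-ranking definition of AUC. It is independent of how PlotROC is implemented, which this page does not describe; `roc_auc` is a hypothetical helper.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Risk_ratio and Label values taken from the donor table in the Example section.
risk_ratio = [0.001369, 0.0, 0.0, 0.0, 0.823591, 0.773842, 0.751152]
label = [0, 0, 0, 0, 1, 1, 1]
print(roc_auc(risk_ratio, label))  # 1.0
```

An AUC of 1.0 here simply reflects that every T2D donor in the example has a higher risk ratio than every ND donor.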

Arguments Description¶

index_df : output from the GetDiaseaseIndex function.
donor_risk_ratio : output from the GetDonorRiskRatio function.
colors / color :
- For all plotting functions except PlotROC, pass colors as a list of colors matching the number of categories.
- For PlotROC, pass a single color through the color argument.
save_path :
- If provided, the figure will be saved to the specified path.
- If not provided, the figure will not be saved.

Example¶

🧪 Example data¶

We provide an example dataset consisting of pancreatic β-cells from human donors across two disease states:

"ND" (non-diabetic)
"T2D" (type 2 diabetes)

Mode 1: Step-by-Step Execution¶

import pandas as pd
import scRiskCell

df = pd.read_parquet("./example_data.parquet")
df

	A1BG	A1BG-AS1	A1CF	A2M	A2M-AS1	AAAS	AACS	AADAT	AAED1	AAGAB	...	VTRNA1-1	XRCC6P5	YBX2	YPEL4	ZBTB20-AS1	ZNF492	Donor	Category	Label	Cell_id
0	2.155548	-0.102392	-0.190453	-0.013606	-0.041375	-0.189619	-0.269934	-0.097576	1.173756	-0.208078	...	0.0	-0.025799	-0.013606	-0.019101	-0.013606	0.0	H1	ND	0	H1_TGTGAGCTGAGA
1	1.656031	2.846145	-0.190453	-0.013606	-0.041375	-0.189619	1.598809	-0.097576	1.188750	-0.208078	...	0.0	-0.025799	-0.013606	-0.019101	-0.013606	0.0	H1	ND	0	H1_TCTCACCCTTCN
2	-0.268511	-0.102392	-0.190453	-0.013606	-0.041375	-0.189619	-0.269934	-0.097576	-0.225120	-0.208078	...	0.0	-0.025799	-0.013606	-0.019101	-0.013606	0.0	H1	ND	0	H1_CACGTTACCGCT
3	-0.268511	-0.102392	-0.190453	-0.013606	-0.041375	-0.189619	0.994045	-0.097576	-0.225120	-0.208078	...	0.0	-0.025799	-0.013606	-0.019101	-0.013606	0.0	H1	ND	0	H1_TATCCGTTTAGC
4	1.057544	-0.102392	3.400730	-0.013606	-0.041375	-0.189619	1.017674	-0.097576	2.236631	-0.208078	...	0.0	-0.025799	-0.013606	-0.019101	-0.013606	0.0	H1	ND	0	H1_TCTCCTTGGACG
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5397	-0.268511	-0.102392	-0.190453	-0.013606	-0.041375	-0.189619	-0.269934	-0.097576	-0.225120	-0.208078	...	0.0	-0.025799	-0.013606	-0.019101	-0.013606	0.0	T2D3	T2D	1	T2D3_CGATGTATTGCC
5398	-0.268511	-0.102392	-0.190453	-0.013606	-0.041375	-0.189619	-0.269934	-0.097576	-0.225120	-0.208078	...	0.0	-0.025799	-0.013606	-0.019101	-0.013606	0.0	T2D3	T2D	1	T2D3_CTATGTATTGGC
5399	-0.268511	-0.102392	-0.190453	-0.013606	-0.041375	-0.189619	-0.269934	-0.097576	-0.225120	-0.208078	...	0.0	-0.025799	-0.013606	-0.019101	-0.013606	0.0	T2D3	T2D	1	T2D3_CTAGGTATTGCC
5400	-0.268511	-0.102392	-0.190453	-0.013606	-0.041375	-0.189619	-0.269934	-0.097576	-0.225120	-0.208078	...	0.0	-0.025799	-0.013606	-0.019101	-0.013606	0.0	T2D3	T2D	1	T2D3_CCACGCGAACGG
5401	-0.268511	-0.102392	-0.190453	-0.013606	-0.041375	-0.189619	-0.269934	-0.097576	-0.225120	-0.208078	...	0.0	-0.025799	-0.013606	-0.019101	-0.013606	0.0	T2D3	T2D	1	T2D3_CTATGGATTGCC

5402 rows × 16639 columns

category_group1 = ["ND", "T2D"]
category_group2 = []
endpoint_df, intermediate_df = scRiskCell.Split(df=df, category_group1=category_group1, category_group2=category_group2)

pca_endpoint_df, pca_intermediate_df = scRiskCell.PerformPCA(endpoint_df=endpoint_df, intermediate_df=intermediate_df)

model = scRiskCell.PerformRegression(pca_endpoint_df=pca_endpoint_df)

index_df = scRiskCell.GetDiaseaseIndex(pca_endpoint_df=pca_endpoint_df, pca_intermediate_df=pca_intermediate_df, model=model)
index_df

	Donor	Category	Label	Cell_id	Disease_index
0	H1	ND	0	H1_TGTGAGCTGAGA	-19.711830
3151	H6	ND	0	H6_GACAAACCCCTC	-6.027563
3152	H6	ND	0	H6_ACCGTCTGGATG	-15.407243
3153	H6	ND	0	H6_AAACTTCGTTAT	-14.495945
3154	H6	ND	0	H6_CCTTGGACGGAT	-12.182768
...	...	...	...	...	...
3058	T2D2	T2D	1	T2D2_AGATTTGGGGCN	1.349378
3057	T2D2	T2D	1	T2D2_GGGCAGTTTGTG	6.269058
3056	T2D2	T2D	1	T2D2_TAAGGTCATACG	4.058519
3126	T2D2	T2D	1	T2D2_TTCCTTTGCCTT	4.930024
5401	T2D3	T2D	1	T2D3_CTATGGATTGCC	2.526536

5402 rows × 5 columns

get_threshold_params = {'window_size': 500, 'threshold': 280, 'step': 1}
threshold = scRiskCell.GetIndexThreshold(index_df=index_df, get_threshold_params=get_threshold_params)

risk_df = scRiskCell.GetCellRisk(index_df=index_df, threshold=threshold)
risk_df

	Donor	Category	Label	Cell_id	Disease_index	Risk
0	H1	ND	0	H1_TGTGAGCTGAGA	-19.711830	0
3151	H6	ND	0	H6_GACAAACCCCTC	-6.027563	0
3152	H6	ND	0	H6_ACCGTCTGGATG	-15.407243	0
3153	H6	ND	0	H6_AAACTTCGTTAT	-14.495945	0
3154	H6	ND	0	H6_CCTTGGACGGAT	-12.182768	0
...	...	...	...	...	...	...
3058	T2D2	T2D	1	T2D2_AGATTTGGGGCN	1.349378	0
3057	T2D2	T2D	1	T2D2_GGGCAGTTTGTG	6.269058	1
3056	T2D2	T2D	1	T2D2_TAAGGTCATACG	4.058519	1
3126	T2D2	T2D	1	T2D2_TTCCTTTGCCTT	4.930024	1
5401	T2D3	T2D	1	T2D3_CTATGGATTGCC	2.526536	0

5402 rows × 6 columns

donor_risk_ratio = scRiskCell.GetDonorRiskRatio(risk_df=risk_df)
donor_risk_ratio

	Donor	Category	Label	Risk_ratio	non_Risk_ratio
2	H3	ND	0	0.001369	0.998631
0	H1	ND	0	0.000000	1.000000
1	H2	ND	0	0.000000	1.000000
3	H6	ND	0	0.000000	1.000000
6	T2D3	T2D	1	0.823591	0.176409
5	T2D2	T2D	1	0.773842	0.226158
4	T2D1	T2D	1	0.751152	0.248848

Mode 2: One-Step Execution¶

category_group1 = ["ND", "T2D"]
category_group2 = []
get_threshold_params = {'window_size': 500, 'threshold': 280, 'step': 1}
pca_endpoint_df, pca_intermediate_df, model, index_df, threshold, risk_df, donor_risk_ratio = scRiskCell.scRiskCell(df=df, category_group1=category_group1, category_group2=category_group2, get_threshold_params=get_threshold_params)
donor_risk_ratio

	Donor	Category	Label	Risk_ratio	non_Risk_ratio
2	H3	ND	0	0.001369	0.998631
0	H1	ND	0	0.000000	1.000000
1	H2	ND	0	0.000000	1.000000
3	H6	ND	0	0.000000	1.000000
6	T2D3	T2D	1	0.823591	0.176409
5	T2D2	T2D	1	0.773842	0.226158
4	T2D1	T2D	1	0.751152	0.248848

Visualization¶

scRiskCell.PlotIndexViolin_by_Category(index_df=index_df, colors=None, save_path=None)


scRiskCell.PlotIndexBoxplot_by_Category(index_df=index_df, colors=None, save_path=None)

scRiskCell.PlotIndexViolin_by_Donor(index_df=index_df, colors=None, save_path=None)

scRiskCell.PlotIndexBoxplot_by_Donor(index_df=index_df, colors=None, save_path=None)

scRiskCell.PlotRatioBoxplot_by_Category(donor_risk_ratio=donor_risk_ratio, colors=None, save_path=None)

scRiskCell.PlotRatioStackBar_by_Donor(donor_risk_ratio=donor_risk_ratio, colors=None, save_path=None)

scRiskCell.PlotROC(donor_risk_ratio=donor_risk_ratio, color=None, save_path=None)
