Description¶
scRiskCell is a toolkit for identifying functionally perturbed cell populations from single-cell transcriptomic data, enabling the study of dynamic cellular processes at single-cell resolution. In scRiskCell, cells from different conditions (e.g., healthy vs disease) are treated equally within each cell type to avoid confounding effects from donor labels. The toolkit reorders cells (e.g., β-cells) based on their association with a given phenotype (such as disease status), constructing a trajectory that reflects phenotypic progression or functional transition. While scRiskCell is particularly well-suited for identifying disease-associated risk cells and exploring disease progression, it is also broadly applicable to other scenarios. For example, it can be used to identify drug-resistant cells, stress-responsive cells, or other phenotypically distinct subpopulations based on any binary or continuous trait of interest, not limited to donor identity. For a video tutorial on how to use scRiskCell, please see: https://youtu.be/Zx_ymupnoeE
Key methodological steps include:
- Split the data into two subsets: the endpoint group and the intermediate group.
- Apply PCA for dimensionality reduction.
- Use logistic regression to model the phenotype (e.g., disease status) as a binary variable.
- Compute an index for each cell based on the regression coefficients.
- Determine an index threshold.
- Define risk cells as those with index values above the determined threshold.
- Calculate the proportion of risk cells per donor.
Installation¶
The Python version must be >= 3.9.
wget http://lin-group.cn/server/scRiskCell/download/scRiskCell.zip
unzip scRiskCell.zip
conda create --name scRiskCell python=3.10
conda activate scRiskCell
cd scRiskCell
pip install . --ignore-installed
pip show scRiskCell
Data preparation¶
To use scRiskCell, the input should be a pandas.DataFrame where rows represent cells and columns represent genes. This matrix can be extracted from a processed Seurat object using:
# In R
expr_matrix <- t(as.data.frame(Seurat_object@assays$RNA@scale.data))
After extracting the scaled expression matrix, you must append four required metadata columns to the DataFrame:
- Donor: Specifies the donor each cell originates from.
- Category: Indicates the phenotype class of each cell. For example, in a diabetes progression study, this column may contain values such as:
  - "ND" (non-diabetic)
  - "preT2D" (pre-type 2 diabetes)
  - "T2D" (type 2 diabetes)
- Label: A numeric encoding of the Category column. For instance, you can encode "ND" as 0, "preT2D" as 1, and "T2D" as 2.
- Cell_id: A unique identifier for each cell.
⚠️ Important Notes:
- The names of the four metadata columns must be exactly: Donor, Category, Label, and Cell_id.
- The Category column must contain at least two distinct phenotype groups (e.g., "ND" and "T2D").
- The Label column should correspond numerically to the Category values for modeling purposes.
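As an illustration of the layout described above, the following sketch builds a tiny input DataFrame and checks it against the listed requirements. The gene names, values, cell identifiers, and the check_input helper are hypothetical and are not part of scRiskCell.

```python
import pandas as pd

REQUIRED_COLUMNS = ["Donor", "Category", "Label", "Cell_id"]

# Toy scaled-expression matrix: rows are cells, columns are genes (made-up values).
expr = pd.DataFrame(
    {"GENE_A": [0.5, -1.2, 0.3, 1.1], "GENE_B": [-0.4, 0.9, 1.5, -0.7]}
)

# Append the four required metadata columns.
df = expr.copy()
df["Donor"] = ["H1", "H1", "T2D1", "T2D1"]
df["Category"] = ["ND", "ND", "T2D", "T2D"]
df["Label"] = df["Category"].map({"ND": 0, "preT2D": 1, "T2D": 2})
df["Cell_id"] = ["H1_c1", "H1_c2", "T2D1_c1", "T2D1_c2"]


def check_input(frame: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if the layout rules are violated."""
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"missing metadata columns: {missing}")
    if frame["Category"].nunique() < 2:
        raise ValueError("Category needs at least two phenotype groups")
    if not frame["Cell_id"].is_unique:
        raise ValueError("Cell_id values must be unique")
    # Each Category value should map to exactly one Label value.
    if (frame.groupby("Category")["Label"].nunique() != 1).any():
        raise ValueError("each Category must map to a single Label")


check_input(df)  # passes silently for the toy data
```

Running such a check before calling the toolkit can catch misspelled column names early, since the metadata column names must match exactly.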
Basic usage¶
Functions included in scRiskCell:
Split · PerformPCA · PerformRegression · GetDiaseaseIndex · GetIndexThreshold · GetCellRisk · GetDonorRiskRatio · scRiskCell · PlotIndexViolin_by_Category · PlotIndexBoxplot_by_Category · PlotIndexViolin_by_Donor · PlotIndexBoxplot_by_Donor · PlotRatioBoxplot_by_Category · PlotRatioStackBar_by_Donor · PlotROC
Split¶
The goal is to separate the end states (such as healthy and fully diseased) from the intermediate state (such as a pre-disease condition), enabling more accurate modeling of functional transitions.
Usage¶
Split(df, category_group1, category_group2)
Argument¶
- df: a pandas.DataFrame containing scaled gene expression data and metadata. Must follow the structure described in the Data preparation section.
- category_group1: a list of phenotype names representing the two endpoint categories. For example: ["ND", "T2D"].
- category_group2: a list of phenotype names representing the intermediate categories. For example: ["preT2D"].
Result¶
Returns two DataFrames:
- endpoint_df: a subset of df containing cells from category_group1.
- intermediate_df: a subset of df containing cells from category_group2.
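The behaviour described above amounts to filtering rows by the Category column. A minimal sketch with pandas follows; split_by_category is a hypothetical stand-in written for illustration, not the package's own Split, and the toy values are made up.

```python
import pandas as pd


def split_by_category(df, category_group1, category_group2):
    """Hypothetical stand-in for Split: filter rows by the Category column."""
    endpoint_df = df[df["Category"].isin(category_group1)].copy()
    intermediate_df = df[df["Category"].isin(category_group2)].copy()
    return endpoint_df, intermediate_df


toy = pd.DataFrame({
    "GENE_A": [0.1, -0.3, 0.8, 1.2],
    "Donor": ["H1", "P1", "P1", "T2D1"],
    "Category": ["ND", "preT2D", "preT2D", "T2D"],
    "Label": [0, 1, 1, 2],
    "Cell_id": ["H1_c1", "P1_c1", "P1_c2", "T2D1_c1"],
})

endpoint_df, intermediate_df = split_by_category(toy, ["ND", "T2D"], ["preT2D"])
```

In this sketch an empty category_group2 yields an empty intermediate DataFrame, which matches how the example later in this document passes category_group2 = [] for a two-category dataset.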
PerformPCA¶
Reduces the high-dimensional gene expression matrix to a lower-dimensional space defined by user-specified principal components (e.g., 20 PCs).
Usage¶
PerformPCA(endpoint_df, intermediate_df, n_components)
Argument¶
- endpoint_df and intermediate_df: the output from Split.
- n_components: an integer specifying the number of principal components to retain (default 20).
Result¶
- pca_endpoint_df and pca_intermediate_df: two reduced-dimension expression matrices, each with the four required columns: Donor, Category, Label, and Cell_id.
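To make the dimensionality reduction step concrete, here is a minimal NumPy sketch of PCA via singular value decomposition. It assumes the principal axes are estimated on the endpoint cells and then reused to project the intermediate cells; this document does not state how PerformPCA handles the two groups, so treat that choice, the function names fit_pca and project, and the random toy data as assumptions.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return the column means and top principal axes of X (cells x genes)."""
    mean = X.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(X, mean, components):
    """Center cells with the stored means and project onto the principal axes."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
endpoint_X = rng.normal(size=(50, 10))     # toy endpoint cells
intermediate_X = rng.normal(size=(8, 10))  # toy intermediate cells

mean, components = fit_pca(endpoint_X, n_components=3)
pca_endpoint = project(endpoint_X, mean, components)
pca_intermediate = project(intermediate_X, mean, components)
```

Projecting both groups with the same axes keeps their coordinates comparable, which matters if a model fitted on one group is later applied to the other.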
PerformRegression¶
Fits a logistic regression model on the PCA-reduced gene expression data. The fitted model is then used to calculate the disease index for each cell.
Usage¶
PerformRegression(pca_endpoint_df, n_components)
Argument¶
- pca_endpoint_df: output from PerformPCA.
- n_components: consistent with the n_components parameter used in PerformPCA.
Result¶
- A fitted logistic regression model object.
GetDiaseaseIndex¶
Using the logistic regression model and the PCA-reduced expression data, this function computes the disease index for each cell, reflecting the likelihood that the cell is functionally perturbed.
Usage¶
GetDiaseaseIndex(pca_endpoint_df, pca_intermediate_df, model, n_components)
Argument¶
- pca_endpoint_df and pca_intermediate_df: the output from PerformPCA.
- model: the output from PerformRegression.
- n_components: consistent with the n_components parameter used in PerformPCA.
Result¶
- A pandas.DataFrame index_df containing the key columns Donor, Category, Label, Cell_id, and Disease_index.
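One common way to derive a per-cell index from regression coefficients is the linear predictor (log-odds) of the logistic model. The sketch below assumes that reading; the document does not spell out the exact formula used by GetDiaseaseIndex, and the PCA scores, coefficients, and intercept here are made up.

```python
import numpy as np

# Made-up PCA scores for three cells (rows) on two components (columns).
pcs = np.array([[-2.0, 0.5],
                [0.0, 0.0],
                [1.5, -1.0]])

# Made-up parameters standing in for a fitted logistic regression.
coef = np.array([2.0, -1.0])
intercept = -0.5

# Linear predictor (log-odds): larger values lean towards the disease endpoint.
disease_index = pcs @ coef + intercept

# The logistic function maps the index to a probability between 0 and 1.
probability = 1.0 / (1.0 + np.exp(-disease_index))
```

Under this assumption the index is unbounded and can be negative, while the corresponding probability increases monotonically with it.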
GetIndexThreshold¶
Determine a disease index threshold using one of two methods:
- Quantile Selection Method (default): Based on the ascending order of Disease_index, a cutoff threshold is selected according to a specified quantile. Cells with Disease_index values above this threshold are designated as risk cells.
- Sliding Window Method: Apply a moving window across the ascending order of Disease_index to identify the threshold at which the number of cells from the most severe phenotype (highest Label value) within the window meets or exceeds a specified count. Cells with Disease_index values above this threshold are designated as risk cells.
Usage¶
GetIndexThreshold(index_df, get_threshold_params)
Argument¶
- index_df: the output from GetDiaseaseIndex.
- get_threshold_params: a dictionary that specifies the method for threshold determination:
  - For Quantile Selection Method:
    {"ratio": 0.85}
  - For Sliding Window Method:
    {"window_size": 100, "threshold": 60, "step": 1}
    window_size: size of the sliding window.
    threshold: minimum number of high-label cells required in the window.
    step: step size for moving the window forward.
  Note: Keys "ratio", "window_size", "threshold", and "step" are fixed; only values can be changed. If get_threshold_params is not provided, the function defaults to the quantile method with ratio = 0.85.
Result¶
- A single numeric value representing the disease index threshold.
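The two threshold strategies can be sketched as follows. This is an illustrative reading, not the package's code: in particular, the document does not say which value inside the matching window becomes the threshold, so the sketch assumes the smallest index in the first qualifying window. The function names and toy values are hypothetical.

```python
import numpy as np


def quantile_threshold(index_values, ratio=0.85):
    """Threshold at the given quantile of the disease index."""
    return float(np.quantile(index_values, ratio))


def sliding_window_threshold(index_values, labels, window_size, threshold, step=1):
    """Slide a window over cells sorted by index (ascending) and stop at the
    first window holding at least `threshold` cells of the highest Label."""
    order = np.argsort(index_values)
    sorted_index = np.asarray(index_values)[order]
    is_top = np.asarray(labels)[order] == np.max(labels)
    for start in range(0, len(sorted_index) - window_size + 1, step):
        if is_top[start:start + window_size].sum() >= threshold:
            # Assumption: report the smallest index in the matching window.
            return float(sorted_index[start])
    return None  # no window met the requested count


index_values = [-3, -2, -1, 0, 1, 2, 3, 4]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

q = quantile_threshold(index_values, ratio=0.5)
w = sliding_window_threshold(index_values, labels, window_size=3, threshold=2)
```

The quantile method fixes the share of cells called risk cells, whereas the sliding window method ties the cutoff to where cells of the most severe phenotype begin to dominate the ordering.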
GetCellRisk¶
Label cells as risk or non-risk based on a predefined disease index threshold. The threshold can be derived from quantile-based or sliding window-based methods.
Usage¶
GetCellRisk(index_df, threshold)
Argument¶
- index_df: the output from GetDiaseaseIndex.
- threshold: the output from GetIndexThreshold.
Result¶
- A pandas.DataFrame identical to the input index_df, but with an additional column:
  Risk: a binary indicator where 1 denotes a risk cell (Disease_index > threshold), and 0 denotes a non-risk cell.
GetDonorRiskRatio¶
Calculate the proportion of risk and non-risk cells for each donor.
Usage¶
GetDonorRiskRatio(risk_df)
Argument¶
- risk_df: the output from GetCellRisk.
Result¶
- A pandas.DataFrame where each row corresponds to a donor, and columns include:
  Donor: donor/source ID;
  Category: phenotype of the donor (e.g., control or disease);
  Label: phenotype label of the donor (e.g., 0 or 1);
  Risk_ratio: proportion of risk cells in the donor;
  non_Risk_ratio: proportion of non-risk cells in the donor.
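The labeling and per-donor summary described for GetCellRisk and GetDonorRiskRatio can be reproduced in spirit with a few pandas operations. The sketch below uses made-up index values and a made-up threshold, and is not the package's own implementation.

```python
import pandas as pd

index_df = pd.DataFrame({
    "Donor": ["H1", "H1", "T2D1", "T2D1", "T2D1", "T2D1"],
    "Category": ["ND", "ND", "T2D", "T2D", "T2D", "T2D"],
    "Label": [0, 0, 1, 1, 1, 1],
    "Cell_id": ["H1_c1", "H1_c2", "T2D1_c1", "T2D1_c2", "T2D1_c3", "T2D1_c4"],
    "Disease_index": [-8.0, -3.5, 1.0, 4.2, 5.1, 6.3],
})
threshold = 3.0  # made-up cutoff

# Risk = 1 when the disease index is strictly above the threshold.
risk_df = index_df.copy()
risk_df["Risk"] = (risk_df["Disease_index"] > threshold).astype(int)

# The mean of a 0/1 column per donor is the proportion of risk cells.
donor_risk_ratio = (
    risk_df.groupby(["Donor", "Category", "Label"], as_index=False)["Risk"]
    .mean()
    .rename(columns={"Risk": "Risk_ratio"})
)
donor_risk_ratio["non_Risk_ratio"] = 1.0 - donor_risk_ratio["Risk_ratio"]
```

Here donor H1 has no risk cells and donor T2D1 has three of four cells above the cutoff, so the two ratios in each row sum to 1.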
scRiskCell¶
Integrates all key steps (dimensionality reduction, model fitting, disease index calculation, risk cell labeling, and risk cell ratio calculation) into one unified workflow to identify risk cells and compute donor-level risk cell proportions.
Usage¶
scRiskCell(df, category_group1, category_group2, n_components, get_threshold_params)
Argument¶
- df: a pandas.DataFrame containing scaled gene expression data and metadata. Must follow the structure described in the Data preparation section.
- category_group1: a list of phenotype names representing the two endpoint categories. For example: ["ND", "T2D"].
- category_group2: a list of phenotype names representing the intermediate categories. For example: ["preT2D"].
- n_components: an integer specifying the number of principal components to retain (default 20).
- get_threshold_params: a dictionary that specifies the method for threshold determination:
  - For Quantile Selection Method:
    {"ratio": 0.85}
  - For Sliding Window Method:
    {"window_size": 100, "threshold": 60, "step": 1}
    window_size: size of the sliding window.
    threshold: minimum number of high-label cells required in the window.
    step: step size for moving the window forward.
  Note: Keys "ratio", "window_size", "threshold", and "step" are fixed; only values can be changed. If get_threshold_params is not provided, the function defaults to the quantile method with ratio = 0.85.
Result¶
- pca_endpoint_df and pca_intermediate_df: two reduced-dimension expression matrices, each containing four required metadata columns: Donor, Category, Label, and Cell_id.
- model: a fitted logistic regression model object.
- index_df: a pandas.DataFrame containing columns: Donor, Category, Label, Cell_id, and the computed Disease_index for each cell.
- threshold: a single numeric value representing the determined disease index threshold used to define risk cells.
- risk_df: a pandas.DataFrame indicating whether each cell is classified as a risk cell.
- donor_risk_ratio: a pandas.DataFrame summarizing the proportion of risk cells per donor.
📊 Visualization functions¶
These functions help visualize index values, risk cell proportions, and donor-level summaries.
PlotIndexViolin_by_Category¶
Draws violin plots to visualize the distribution of index across different categories (e.g., control vs. disease).
Usage¶
PlotIndexViolin_by_Category(index_df, colors=None, save_path=None)
PlotIndexBoxplot_by_Category¶
Draws boxplots to compare distributions of index across different categories (e.g., control vs. disease).
Usage¶
PlotIndexBoxplot_by_Category(index_df, colors=None, save_path=None)
PlotIndexViolin_by_Donor¶
Draws violin plots to show index distribution per donor.
Usage¶
PlotIndexViolin_by_Donor(index_df, colors=None, save_path=None)
PlotIndexBoxplot_by_Donor¶
Draws boxplots to show index distributions per donor.
Usage¶
PlotIndexBoxplot_by_Donor(index_df, colors=None, save_path=None)
PlotRatioBoxplot_by_Category¶
Draws boxplots of the proportion of risk cells per donor grouped by category.
Usage¶
PlotRatioBoxplot_by_Category(donor_risk_ratio, colors=None, save_path=None)
PlotRatioStackBar_by_Donor¶
Creates a stacked bar chart showing the composition of risk vs non-risk cells for each donor.
Usage¶
PlotRatioStackBar_by_Donor(donor_risk_ratio, colors=None, save_path=None)
PlotROC¶
Plots the ROC curve for phenotype classification based on the risk ratio. If the dataset includes more than two phenotype categories, the ROC curve is computed only for the two endpoint categories (i.e., excluding any intermediate disease states).
Usage¶
PlotROC(donor_risk_ratio, color=None, save_path=None)
Arguments Description¶
- index_df: output from the GetDiaseaseIndex function.
- donor_risk_ratio: output from the GetDonorRiskRatio function.
- colors:
  - For all plotting functions except PlotROC, provide a list of colors matching the number of categories.
  - For PlotROC, provide only one color (passed as color).
- save_path:
  - If provided, the figure will be saved to the specified path.
  - If not provided, the figure will not be saved.
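To show what a donor-level ROC analysis on the risk ratio measures, the sketch below computes the area under the ROC curve directly from pairwise comparisons. The roc_auc helper is hypothetical and is not how PlotROC is necessarily implemented; the ratios are the Risk_ratio values from the donor_risk_ratio table in the Example section.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    tied pairs count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Risk_ratio per donor: four ND donors (Label 0), then three T2D donors (Label 1).
risk_ratio = [0.001369, 0.0, 0.0, 0.0, 0.823591, 0.773842, 0.751152]
label = [0, 0, 0, 0, 1, 1, 1]

auc = roc_auc(risk_ratio, label)  # every T2D donor outranks every ND donor
```

An AUC of 1 means the risk ratio separates the two endpoint categories perfectly in this small example; with only seven donors, such a value should be read cautiously.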
Example¶
🧪 Example data¶
We provide an example dataset consisting of pancreatic β-cells from human donors across two disease states:
- "ND" (non-diabetic)
- "T2D" (type 2 diabetes)
Mode 1: Step-by-Step Execution¶
import pandas as pd
import scRiskCell
df = pd.read_parquet("./example_data.parquet")
df
| | A1BG | A1BG-AS1 | A1CF | A2M | A2M-AS1 | AAAS | AACS | AADAT | AAED1 | AAGAB | ... | VTRNA1-1 | XRCC6P5 | YBX2 | YPEL4 | ZBTB20-AS1 | ZNF492 | Donor | Category | Label | Cell_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.155548 | -0.102392 | -0.190453 | -0.013606 | -0.041375 | -0.189619 | -0.269934 | -0.097576 | 1.173756 | -0.208078 | ... | 0.0 | -0.025799 | -0.013606 | -0.019101 | -0.013606 | 0.0 | H1 | ND | 0 | H1_TGTGAGCTGAGA |
| 1 | 1.656031 | 2.846145 | -0.190453 | -0.013606 | -0.041375 | -0.189619 | 1.598809 | -0.097576 | 1.188750 | -0.208078 | ... | 0.0 | -0.025799 | -0.013606 | -0.019101 | -0.013606 | 0.0 | H1 | ND | 0 | H1_TCTCACCCTTCN |
| 2 | -0.268511 | -0.102392 | -0.190453 | -0.013606 | -0.041375 | -0.189619 | -0.269934 | -0.097576 | -0.225120 | -0.208078 | ... | 0.0 | -0.025799 | -0.013606 | -0.019101 | -0.013606 | 0.0 | H1 | ND | 0 | H1_CACGTTACCGCT |
| 3 | -0.268511 | -0.102392 | -0.190453 | -0.013606 | -0.041375 | -0.189619 | 0.994045 | -0.097576 | -0.225120 | -0.208078 | ... | 0.0 | -0.025799 | -0.013606 | -0.019101 | -0.013606 | 0.0 | H1 | ND | 0 | H1_TATCCGTTTAGC |
| 4 | 1.057544 | -0.102392 | 3.400730 | -0.013606 | -0.041375 | -0.189619 | 1.017674 | -0.097576 | 2.236631 | -0.208078 | ... | 0.0 | -0.025799 | -0.013606 | -0.019101 | -0.013606 | 0.0 | H1 | ND | 0 | H1_TCTCCTTGGACG |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5397 | -0.268511 | -0.102392 | -0.190453 | -0.013606 | -0.041375 | -0.189619 | -0.269934 | -0.097576 | -0.225120 | -0.208078 | ... | 0.0 | -0.025799 | -0.013606 | -0.019101 | -0.013606 | 0.0 | T2D3 | T2D | 1 | T2D3_CGATGTATTGCC |
| 5398 | -0.268511 | -0.102392 | -0.190453 | -0.013606 | -0.041375 | -0.189619 | -0.269934 | -0.097576 | -0.225120 | -0.208078 | ... | 0.0 | -0.025799 | -0.013606 | -0.019101 | -0.013606 | 0.0 | T2D3 | T2D | 1 | T2D3_CTATGTATTGGC |
| 5399 | -0.268511 | -0.102392 | -0.190453 | -0.013606 | -0.041375 | -0.189619 | -0.269934 | -0.097576 | -0.225120 | -0.208078 | ... | 0.0 | -0.025799 | -0.013606 | -0.019101 | -0.013606 | 0.0 | T2D3 | T2D | 1 | T2D3_CTAGGTATTGCC |
| 5400 | -0.268511 | -0.102392 | -0.190453 | -0.013606 | -0.041375 | -0.189619 | -0.269934 | -0.097576 | -0.225120 | -0.208078 | ... | 0.0 | -0.025799 | -0.013606 | -0.019101 | -0.013606 | 0.0 | T2D3 | T2D | 1 | T2D3_CCACGCGAACGG |
| 5401 | -0.268511 | -0.102392 | -0.190453 | -0.013606 | -0.041375 | -0.189619 | -0.269934 | -0.097576 | -0.225120 | -0.208078 | ... | 0.0 | -0.025799 | -0.013606 | -0.019101 | -0.013606 | 0.0 | T2D3 | T2D | 1 | T2D3_CTATGGATTGCC |
5402 rows × 16639 columns
category_group1 = ["ND", "T2D"]
category_group2 = []
endpoint_df, intermediate_df = scRiskCell.Split(df=df, category_group1=category_group1, category_group2=category_group2)
pca_endpoint_df, pca_intermediate_df = scRiskCell.PerformPCA(endpoint_df=endpoint_df, intermediate_df=intermediate_df)
model = scRiskCell.PerformRegression(pca_endpoint_df=pca_endpoint_df)
index_df = scRiskCell.GetDiaseaseIndex(pca_endpoint_df=pca_endpoint_df, pca_intermediate_df=pca_intermediate_df, model=model)
index_df
| | Donor | Category | Label | Cell_id | Disease_index |
|---|---|---|---|---|---|
| 0 | H1 | ND | 0 | H1_TGTGAGCTGAGA | -19.711830 |
| 3151 | H6 | ND | 0 | H6_GACAAACCCCTC | -6.027563 |
| 3152 | H6 | ND | 0 | H6_ACCGTCTGGATG | -15.407243 |
| 3153 | H6 | ND | 0 | H6_AAACTTCGTTAT | -14.495945 |
| 3154 | H6 | ND | 0 | H6_CCTTGGACGGAT | -12.182768 |
| ... | ... | ... | ... | ... | ... |
| 3058 | T2D2 | T2D | 1 | T2D2_AGATTTGGGGCN | 1.349378 |
| 3057 | T2D2 | T2D | 1 | T2D2_GGGCAGTTTGTG | 6.269058 |
| 3056 | T2D2 | T2D | 1 | T2D2_TAAGGTCATACG | 4.058519 |
| 3126 | T2D2 | T2D | 1 | T2D2_TTCCTTTGCCTT | 4.930024 |
| 5401 | T2D3 | T2D | 1 | T2D3_CTATGGATTGCC | 2.526536 |
5402 rows × 5 columns
get_threshold_params = {'window_size': 500, 'threshold': 280, 'step': 1}
threshold = scRiskCell.GetIndexThreshold(index_df=index_df, get_threshold_params=get_threshold_params)
risk_df = scRiskCell.GetCellRisk(index_df=index_df, threshold=threshold)
risk_df
| | Donor | Category | Label | Cell_id | Disease_index | Risk |
|---|---|---|---|---|---|---|
| 0 | H1 | ND | 0 | H1_TGTGAGCTGAGA | -19.711830 | 0 |
| 3151 | H6 | ND | 0 | H6_GACAAACCCCTC | -6.027563 | 0 |
| 3152 | H6 | ND | 0 | H6_ACCGTCTGGATG | -15.407243 | 0 |
| 3153 | H6 | ND | 0 | H6_AAACTTCGTTAT | -14.495945 | 0 |
| 3154 | H6 | ND | 0 | H6_CCTTGGACGGAT | -12.182768 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 3058 | T2D2 | T2D | 1 | T2D2_AGATTTGGGGCN | 1.349378 | 0 |
| 3057 | T2D2 | T2D | 1 | T2D2_GGGCAGTTTGTG | 6.269058 | 1 |
| 3056 | T2D2 | T2D | 1 | T2D2_TAAGGTCATACG | 4.058519 | 1 |
| 3126 | T2D2 | T2D | 1 | T2D2_TTCCTTTGCCTT | 4.930024 | 1 |
| 5401 | T2D3 | T2D | 1 | T2D3_CTATGGATTGCC | 2.526536 | 0 |
5402 rows × 6 columns
donor_risk_ratio = scRiskCell.GetDonorRiskRatio(risk_df=risk_df)
donor_risk_ratio
| | Donor | Category | Label | Risk_ratio | non_Risk_ratio |
|---|---|---|---|---|---|
| 2 | H3 | ND | 0 | 0.001369 | 0.998631 |
| 0 | H1 | ND | 0 | 0.000000 | 1.000000 |
| 1 | H2 | ND | 0 | 0.000000 | 1.000000 |
| 3 | H6 | ND | 0 | 0.000000 | 1.000000 |
| 6 | T2D3 | T2D | 1 | 0.823591 | 0.176409 |
| 5 | T2D2 | T2D | 1 | 0.773842 | 0.226158 |
| 4 | T2D1 | T2D | 1 | 0.751152 | 0.248848 |
Mode 2: One-Step Execution¶
category_group1 = ["ND", "T2D"]
category_group2 = []
get_threshold_params = {'window_size': 500, 'threshold': 280, 'step': 1}
pca_endpoint_df, pca_intermediate_df, model, index_df, threshold, risk_df, donor_risk_ratio = scRiskCell.scRiskCell(df=df, category_group1=category_group1, category_group2=category_group2, get_threshold_params=get_threshold_params)
donor_risk_ratio
| | Donor | Category | Label | Risk_ratio | non_Risk_ratio |
|---|---|---|---|---|---|
| 2 | H3 | ND | 0 | 0.001369 | 0.998631 |
| 0 | H1 | ND | 0 | 0.000000 | 1.000000 |
| 1 | H2 | ND | 0 | 0.000000 | 1.000000 |
| 3 | H6 | ND | 0 | 0.000000 | 1.000000 |
| 6 | T2D3 | T2D | 1 | 0.823591 | 0.176409 |
| 5 | T2D2 | T2D | 1 | 0.773842 | 0.226158 |
| 4 | T2D1 | T2D | 1 | 0.751152 | 0.248848 |
Visualization¶
scRiskCell.PlotIndexViolin_by_Category(index_df=index_df, colors=None, save_path=None)
scRiskCell.PlotIndexBoxplot_by_Category(index_df=index_df, colors=None, save_path=None)
scRiskCell.PlotIndexViolin_by_Donor(index_df=index_df, colors=None, save_path=None)
scRiskCell.PlotIndexBoxplot_by_Donor(index_df=index_df, colors=None, save_path=None)
scRiskCell.PlotRatioBoxplot_by_Category(donor_risk_ratio=donor_risk_ratio, colors=None, save_path=None)
scRiskCell.PlotRatioStackBar_by_Donor(donor_risk_ratio=donor_risk_ratio, colors=None, save_path=None)
scRiskCell.PlotROC(donor_risk_ratio=donor_risk_ratio, color=None, save_path=None)