Welcome to Shihua Zhang's Lab
Bioinformatics and Data Science
Software:
scAND (scATAC-seq data Analysis via Network Diffusion) is a python-based package for scalable embedding of massive scATAC-seq data. scAND treats peaks-by-cells matrix as a bipartite network that indicates the accessible relationship between cells and peaks and employs a network diffusion method to alleviate the data sparsity and gather the global information. scAND improves the clustering performance on both simulated and real datasets, and can be applied to data integration.
[scAND_Guide]
|
JRIM is a package for Jointly Reconstructing cis-regulatory Interaction Maps of multiple cell populations using single-cell chromatin
accessibility data and identifying shared and common interaction patterns. It uses an aggregation process to deal with the sparsity of single-cell data,
exploits similarity between cell types via a group lasso penalty, and generates comparable networks.
JRIM could be used to characterize difference between cell types or identify dynamic changes during cell development.
[JRIM_Guide]
|
CIRCLET, a powerful tool for accurate reconstruction of circular trajectory with high resolution by
considering multi-scale features of chromosomal architectures of single cells. Further division of
the reconstructed trajectory helps to accurately characterize the dynamics of chromosomal structures
and uncover important regulatory genes along cell-cycle progression, providing a novel framework for
discovering regulatory regions even cancer markers at single-cell resolution.
[Guide]
|
Single-cell RNA sequencing (scRNA-seq) data analysis remains challenging due to the presence of dropout events (i.e., excess zero counts).
Taking account of cell heterogeneity and expression effect on dropout, we propose PBLR to accurately impute the dropouts of scRNA-seq data.
PBLR is an effective tool to recover dropout events on both simulated and real datasets,and can dramatically improve low-dimensional representation and reveal
gene-gene relationship compared to several state-of-the-art methods.
|
MSTD is a generic and efficient method to identify multi-scale topological domains (MSTD)
from symmetric Hi-C and other high resolution asymmetric promoter capture Hi-C datasets.
[Guide]
|
gkm-DNN (gapped k-mer deep neural network) is a software which uses gapped k-mer frequency vector (gkm-fv) as input to train neural networks.
gkm-DNN is designed for classification but can be easily extended to other problems such as regression and ranking. The software is open sourced.
gkm-DNN consists of calculating gkm-fv (using R) and training the neural networks (using Java + DL4J). For more information please see user guide.
[Guide]
|
MIA (Matrix Integration Analysis) is a MATLAB package, implementing
and extending four computational methods (Guide). MIA can integrate diverse types of
genomic data (e.g., copy number variation, DNA methylation, gene expression, microRNA
expression profiles and/or gene network data) to identify the underlying modular patterns.
MIA is flexible and can handle a wide range of biological problems and data types.
In addition, MIA can also be run for users without a MATLAB license.
[Guide]
|
MDPFinder (Mutated Driver Pathway Finder) is a package
for identifying driver pathways promoting cancer
proliferation and filtering out the unfunctional and
passenger ones. It includes two methods to solve the
so-called Maximum Weight Submatrix problem which is
designed to de novo identify mutated driver pathways
from mutation data in cancer. The first one is an exact
method which can be helpful for assessing other approximate
or/and heuristic algorithms. The second one is a stochastic and
flexible method which can be employed to incorporate other types
of information to improve the first method. [Pubmed]
|
CoMDP (Co-occurring Mutated Driver Pathway) is a package
for de novo identifying co-occurring driver pathways in cancer with mutation data.
The modified version mod_CoMDP can be used to model the situation where a certain
pathway has been previously proven to play important roles in some cancers and one
wants to know whether there are other pathways with cooperative
effects with it. [Pubmed]
|
dCMA (differential Chromatin Modification Analysis)
is a package for identifying cell-type specific genomic regions with distinctive
chromatin modifications. It can find cell-type specific elements which are unique
to a cell type investigated. This differential comparative epigenomic strategy
is a promising tool in deciphering the human genome and characterizing cell
specificity. [Pubmed]
|
jNMF is a package which implemented the
joint matrix factorization technique to
integrating multi-dimensional genomics
data for the discovery of combinatorial
patterns. It projects multiple types of
genomic data onto a common coordinate system,
in which heterogeneous variables weighted highly
in the same projected direction form a multi-dimensional
module (md-module). Genomic variables in such modules
are characterized by significant correlations and likely
functional associations. [Pubmed]
|
sMBPLS (sparse Multi-Block Partial Least Squares) is a package
to identify multi-dimensional regulatory modules from multiple
datasets in a regression manner. A multi-dimensional regulatory
module contains sets of regulatory factors from different layers
that are likely to jointly contribute to a local
"gene expression factory". [Pubmed]
|
SNPLS (Sparse Network-regularized Partial Least Squares) is a package
to integrate pairwise gene expression and drug response data as well
as a gene interaction network for identifying joint gene-drug
co-modules in a regression manner. This package can be easily adapted
to other biological pairwise data. [Pubmed]
|
HTTMM (Hierarchical Taxonomy Tree based Mixture Model) is a package designed for
estimating the abundance of taxon within a microbial community by incorporating
the structure of the taxonomy tree. In this model, genome specific short reads and
homologous short reads among genomes can be distinguished and represented by leaf
and intermediate nodes in the taxonomy tree respectively. An expectation-
maximization algorithm has been adopted to solve this model. [Pubmed]
|
NSLR (Network-regularized Sparse Logistic Regression) is a package to integrate gene expression data,
clinical binary outcome, and normalized Laplacian matrix encoding the protein-protein interaction (PPI)
network for clinical risk prediction and biomarker discovery.
[Guide]
|
ESPCA (Edge-group Sparse PCA) is a package to integrate the group structure from a prior gene network into the PCA framework for dimension reduction and feature interpretation.
ESPCA enforces sparsity of principal component (PC) loadings through considering the connectivity of gene variables in the prior network. Based on such prior knowledge,
ESPCA can overcome the drawbacks of sparse PCA and capture some gene modules with better biological interpretations.
We also extended ESPCA for analyzing multiple gene expression matrices simultaneously.
[Guide]
|
JMF (Joint Matrix Factorization) is a MATLAB package to integrate multi-view data as well as prior relationship knowledge within or between multi-view data for pattern recognition and data mining.
Four update rules are adopted for solving JMF. Additionally, two adapted prediction JMF models based on JMF are provided.
[Guide]
|
CSMF (Common and Specific Matrix Factorization) is a MATLAB package to simultaneously extract common and specific patterns from the data of two or multiple biological
interrelated conditions via matrix factorization. In addition to the main functions, this package also includes data simulation, parameter selection, solution fine tuning, etc.
CSMF can be widely used to analyze various data types such as RNA-seq, Chip-seq and scRNA-seq.
[Guide]
[CSMF_tutorial]
|