Track Awesome Computational Biology Updates Daily
Awesome list of computational biology.
😺 inoue0426/awesome-computational-biology · ⭐ 120 · 🏷️ Miscellaneous
Mar 13, 2026
Protein Foundation Models / Protein Structure Prediction and Design
- EvoDiff (⭐664) — Discrete diffusion framework for protein sequence generation trained on evolutionary-scale data, supporting unconditional generation, disordered region design, and functional motif scaffolding. [ paper-2023 ]
Mar 11, 2026
Benchmarks & Datasets
- 1000 Genomes Project — Reference panel of human genetic variation from 2,504 individuals across 26 populations.
- BACE — Binary classification and regression dataset for β-secretase 1 (BACE-1) inhibitor binding affinity.
- BEAT AML — Functional ex vivo drug sensitivity measurements paired with genomics for acute myeloid leukemia.
- ClinTox — Clinical toxicity dataset contrasting FDA-approved drugs with those that failed clinical trials due to toxicity.
- CPTAC (Clinical Proteomic Tumor Analysis Consortium) — Multi-omic proteogenomic datasets for multiple cancer types linking proteomics with genomics.
- FLIP (Fitness Landscape Inference for Proteins) (⭐117) — Benchmark collection of protein fitness landscape datasets for evaluating protein ML models.
- LINCS L1000 — Gene expression profiles (978 landmark genes) for >20,000 chemical and genetic perturbations across cell lines.
- OGB (Open Graph Benchmark) — Large-scale graph ML benchmark suite including biological datasets such as ogbl-ppa (protein-protein associations) and ogbg-molhiv.
- PharmGKB — Curated pharmacogenomics dataset linking genetic variants to drug response phenotypes across thousands of drugs.
- PRISM — Cancer drug sensitivity profiling of >4,500 drugs across >900 cancer cell lines using pooled-cell-line barcoding.
- ProteinGym (⭐395) — Large-scale benchmark of deep mutational scanning assays for evaluating protein fitness landscape models.
- QM9 — Quantum chemistry properties for 134K stable small organic molecules computed at DFT level.
- scIB (Single-cell Integration Benchmarks) (⭐408) — Comprehensive benchmarking framework for single-cell data integration methods.
- SIDER (Side Effect Resource) — Database of 1,430 approved drugs with their recorded adverse drug reactions across 27 system-organ classes.
- Tabula Muris — Comprehensive single-cell atlas of 20 mouse organs and tissues, enabling cross-tissue and cross-species comparisons.
- Tabula Sapiens — Comprehensive human single-cell atlas of ~500K cells from 24 organs and tissues across multiple donors.
- TAPE (Tasks Assessing Protein Embeddings) (⭐733) — Benchmark suite of five biologically meaningful semi-supervised learning tasks for evaluating protein representations.
- The Cancer Genome Atlas (TCGA) — Comprehensive multi-omics (genomics, transcriptomics, proteomics, methylation) dataset for 33 cancer types across ~11,000 patients.
- Tox21 — 12,707 compounds tested in 12 nuclear receptor and stress-response pathway biochemical assays for toxicity prediction.
- UK Biobank — Large-scale biomedical database of ~500K participants with genetic, imaging, and health data for population genetics and disease studies.
Preprocessing Tools
- scVelo (⭐490) — RNA velocity estimation for single-cell transcriptomics, inferring the direction and speed of cell differentiation.
Drug Response Prediction
- RECOVER (⭐24) — Machine learning framework for predicting synergistic drug combination responses across cell lines.
Molecular Generation
- DiffDock (⭐1.5k) — Diffusion generative model for molecular docking, predicting the binding pose of small molecules to protein targets.
LLM for Biology
- BioMedLM — 2.7B parameter GPT-2-style language model trained exclusively on biomedical literature from PubMed for biomedical question answering and text generation.
Single-cell Foundation Models / Transcriptomics Foundation Models
- UCE (⭐246) — Universal Cell Embeddings: single-cell embedding model trained on 36M cells across species, tissues, and assays, applied zero-shot without fine-tuning.
- GEARS (⭐342) — Graph-based model for predicting transcriptional responses to single and combinatorial genetic perturbations using biological priors.
Protein Foundation Models / Protein Structure Prediction and Design
- OpenFold (⭐3.3k) — Trainable, memory-efficient open-source reproduction of AlphaFold2 enabling custom protein structure prediction workflows.
- SaProt — Protein language model whose structure-aware tokens encode both residue identity and backbone geometry for improved function prediction.
Genomics Foundation Models
- Borzoi (⭐228) — Extended successor to Enformer for predicting RNA-seq coverage from long genomic sequence windows (524 kb) with improved resolution.
Mar 03, 2026
scRNA
- CZ CELLxGENE — Single-cell dataset repository and interactive explorer from the Chan Zuckerberg Initiative.
- Human Cell Atlas — Open global atlas of all cells in the human body.
Compound
- HMDB (Human Metabolome Database) — Comprehensive database of small molecule metabolites found in the human body.
- DrugCentral — Online drug compendium with drug mode of action and indication information.
Protein
- SAbDab — Structural Antibody Database containing all antibody structures in the PDB.
- OAS (Observed Antibody Space) — Database of antibody sequences from immune repertoire sequencing.
Genome
- ENCODE — Encyclopedia of DNA Elements; regulatory and functional genomic elements across the genome.
- Ensembl — Genome browser and annotation database for vertebrate and other eukaryotic genomes.
- gnomAD — Genome Aggregation Database; genetic variation from large-scale sequencing projects.
- Rfam — Database of RNA families with sequence alignments and consensus structures.
Disease
- DisGeNET — Database of gene-disease associations integrating expert-curated and GWAS data.
- OMIM (Online Mendelian Inheritance in Man) — Comprehensive database of human genes and genetic disorders.
Protein-Protein Interaction
- IntAct — Open-source molecular interaction database and analysis system from EMBL-EBI.
Benchmarks & Datasets
- BindingDB Curated Sets — Curated binding affinity datasets for protein–ligand interaction benchmarking.
- Cancer Therapeutics Response Portal (CTRP) — Drug sensitivity profiles across ~900 cancer cell lines for >400 compounds.
- GuacaMol (⭐500) — Benchmark suite for generative molecular design models.
- MOSES (⭐957) — Benchmarking platform for molecular generation models.
- Therapeutics Data Commons (TDC) — Unified benchmark suite covering ADMET, drug-target interaction, drug response, and more.
Preprocessing Tools
- Biopython — Collection of Python tools for biological computation including sequence analysis, structure parsing, and database access.
- DeepChem (⭐6.6k) — Deep learning library for drug discovery, quantum chemistry, and materials science.
- scvi-tools — Probabilistic models for single-cell omics data analysis.
- CellTypist (⭐457) — Automated cell type annotation for scRNA-seq.
- GROMACS — Molecular dynamics simulation package for biochemical molecules.
- MDAnalysis — Python library for analyzing and altering molecular dynamics simulation trajectories.
- OpenMM — High-performance toolkit for molecular simulation and GPU-accelerated MD.
Molecular Generation
- REINVENT (⭐370) — Reinforcement learning for de novo drug design.
- MolGPT (⭐169) — Transformer-based model for molecular generation.
- Molecular Transformer (⭐413) — Sequence-to-sequence Transformer for chemical reaction outcome prediction.
- TargetDiff (⭐323) — 3D equivariant diffusion model for structure-based drug design.
LLM for Biology
- ClawBio (⭐106) — Bioinformatics-native AI agent skill library with local-first pharmacogenomics, ancestry PCA, semantic similarity, nutrigenomics, and metagenomics skills.
Single-cell Foundation Models / Transcriptomics Foundation Models
- Geneformer — Context-aware, attention-based deep learning model pretrained on a large corpus of single-cell transcriptomes.
- scBERT (⭐347) — BERT-based foundation model pretrained on large-scale scRNA-seq data for cell type annotation.
- CellPLM (⭐101) — Cell pre-trained language model with inter-cell transformer architecture for diverse single-cell analysis tasks.
Single-cell Foundation Models / Spatial Foundation Models
- GigaPath (⭐578) — Slide-level digital pathology foundation model pretrained on 1.3 billion pathology image tiles from whole-slide images.
- UNI (⭐681) — General-purpose self-supervised pathology foundation model trained on 100K+ whole-slide images for diverse computational pathology tasks.
- CONCH (⭐472) — Vision-language foundation model for computational pathology trained with contrastive captioning on pathology image–text pairs.
- Phikon — ViT-based pathology foundation model pretrained with iBOT self-supervision on TCGA whole-slide images.
Single-cell Foundation Models / Multi-Omics Foundation Models
- scMulan (⭐62) — Single-cell multi-omic language model pretrained on ~10M cells spanning transcriptomics, epigenomics, and proteomics for cross-omics transfer tasks.
- totalVI (⭐1.6k) — Probabilistic framework for joint analysis of paired scRNA-seq and protein (CITE-seq) data enabling multi-modal cell state representation across single-cell datasets.
- MultiVI (⭐1.6k) — Multi-modal variational autoencoder for integrating paired and unpaired single-cell RNA-seq and ATAC-seq measurements into a unified latent space.
- MIRA (⭐67) — Probabilistic multimodal topic model jointly modeling single-cell transcriptomics and chromatin accessibility for regulatory network inference.
- GLUE (⭐455) — Graph-Linked Unified Embedding framework for unpaired single-cell multi-omics data integration across RNA, ATAC, methylation, and protein modalities.
- BABEL (⭐47) — Cross-modality translation model enabling prediction between scRNA-seq and scATAC-seq profiles without requiring paired single-cell measurements.
- Multigrate (⭐31) — Asymmetric multi-omics variational autoencoder for integrating single-cell data across RNA, ATAC, and protein modalities with missing-modality support.
- MOFA+ (⭐384) — Multi-Omics Factor Analysis framework identifying shared axes of variation across bulk and single-cell datasets including RNA, ATAC, proteomics, methylation, and copy number.
- GeneCompass (⭐111) — Large-scale foundation model integrating DNA regulatory sequences and single-cell transcriptomics from 120M+ cells across multiple species for gene regulation prediction.
- UnitedNet (⭐52) — Interpretable multi-task deep neural network for single-cell multi-omics integration spanning transcriptomics, chromatin accessibility, and proteomics.
- SpatialGlue — Graph attention network for spatial multi-omics integration jointly embedding spatial transcriptomics with chromatin accessibility or proteomics.
- MIDAS (⭐62) — Mosaic integration and knowledge transfer model for single-cell multi-omics data that handles arbitrary missing-modality combinations across transcriptomics, chromatin accessibility, and proteomics.
Single-cell Foundation Models / Domain Alignment
- scArches (⭐399) — Transfer learning framework for mapping new single-cell datasets onto pre-trained reference atlases across batches, conditions, and modalities.
- TOSICA — Transformer-based framework for one-stop interpretable cell-type annotation supporting cross-dataset and cross-species transfer.
Protein Foundation Models / Protein Structure Prediction and Design
- AlphaFold3 (⭐7.7k) — Predicts structures of proteins, nucleic acids, small molecules, and their complexes.
- Boltz-1 (⭐3.8k) — Open-source all-atom biomolecular structure prediction model for proteins, nucleic acids, small molecules, and their complexes achieving AlphaFold3-level accuracy.
- Chai-1 (⭐1.9k) — Unified molecular structure prediction model covering proteins, nucleic acids, small molecules, and complexes.
- ESM3 (⭐2.3k) — Multimodal protein language model that jointly reasons over sequence, structure, and function for generative protein design and engineering.
- ESMFold (⭐4k) — Fast protein structure prediction using language model embeddings.
- RFdiffusion (⭐2.8k) — Generative model for protein backbone design using diffusion.
- ProteinMPNN (⭐1.6k) — Deep learning model for protein sequence design given backbone structure.
- OmegaFold (⭐613) — High-resolution de novo protein structure prediction from sequence.
- RoseTTAFold (⭐2.2k) — Three-track neural network for protein structure prediction.
Multi-Modal Foundation Models
- CHIEF (⭐688) — Clinical Histopathology Imaging Evaluation Foundation model integrating histology images and clinical context for pan-cancer analysis.
- BiomedCLIP — CLIP-based vision-language foundation model for biomedical images and text trained on PubMed figure–caption pairs.
Genomics Foundation Models
- Nucleotide Transformer (⭐831) — Foundation model for genomic sequences across multiple species.
- DNABERT (⭐744) — Pre-trained bidirectional encoder for DNA sequence analysis.
- DNABERT-2 (⭐460) — Improved genome foundation model with efficient tokenization.
- Enformer (⭐15k) — Transformer model predicting gene expression from DNA sequence.
- Basenji (⭐466) — Sequential regulatory activity prediction from DNA sequences.
- Caduceus (⭐226) — Bidirectional equivariant long-range DNA sequence model based on Mamba.
- Evo (⭐1.5k) — Long-context genomic foundation model (up to 1M tokens).
- HyenaDNA (⭐764) — Long-range genomic foundation model handling sequences up to 1M tokens using sub-quadratic Hyena operators in place of attention.
Feb 08, 2026
Pathway
- Reactome — Expert-curated, peer-reviewed pathway database with detailed reaction mechanisms.
- BioCyc — Collection of pathway/genome databases across thousands of organisms.
- SIGNOR — Database of causal signaling interactions and pathways.
- MSigDB (Molecular Signatures Database) — Curated gene sets derived from pathways and biological processes.
Protein
- Protein Data Bank (PDB) — 3D structures of proteins, nucleic acids, and their complexes.
- RCSB Protein Data Bank — Repository for structural data of biological molecules.
Disease
- DrugBank — Database of drugs and targets (University of Alberta).
Drug-Gene Interaction
- Comparative Toxicogenomics Database — Chemical-gene interactions, chemical-disease and gene-disease associations, chemical-phenotype associations.
- SNAP — Dataset of drug-gene interactions.
Benchmarks & Datasets
- Genomics of Drug Sensitivity in Cancer (GDSC) — Drug sensitivity for ~1,000 human cancer cell lines and hundreds of compounds.
- CrossDocked2020 — Large-scale dataset for structure-based virtual screening.
- OpenBioLink (⭐158) — Benchmark datasets for biological knowledge graph completion.
Drug (Cell Line) Response
- Cancer Cell Line Encyclopedia — Database of ~1,000 cancer cell lines.
- CellMiner Cross Database (CellMinerCDB) — Integrates multiple cancer cell line databases.
Chemical-Protein Interaction
- BindingDB — Database of measured binding affinities between compounds and protein targets.
- PDBBind — Binding affinity data for biomolecular complexes.
Protein-Protein Interaction
- BioGRID — Protein, genetic, and chemical interactions.
- HIPPIE — Human protein-protein interaction database.
Knowledge Graph
- DRKG (⭐671) — Large-scale biological knowledge graph for drug discovery.
- Hetionet (⭐343) — Heterogeneous network integrating genes, diseases, drugs, pathways, and more.
- PrimeKG (⭐706) — Multi-modal precision medicine knowledge graph integrating clinical, genetic, and drug data.
API
- PubMed E-utilities (esearch/efetch) — APIs for searching and retrieving biomedical literature from PubMed.
- NCBI E-utilities — Unified APIs for accessing NCBI databases (Gene, GEO, SRA, PubChem, etc).
- UniProt REST API — Programmatic access to protein sequence and functional annotation data.
- Ensembl REST API — API for genomic annotations, variants, genes, and comparative genomics.
- KEGG REST API — API for accessing KEGG pathways, compounds, genes, and reactions.
- ChEMBL Web Services — REST API for bioactive molecules, targets, and bioassays.
- Open Targets Platform API — API for target–disease associations integrating genetics, genomics, and drug data.
- ClinicalTrials.gov API — API for querying clinical trial metadata and results.
Drug Target Interaction
- DTINet (⭐185) — Network-based framework integrating heterogeneous biological data for DTI prediction.
- DeepDTA (⭐293) — Deep learning model using CNNs on protein sequences and drug SMILES.
- GraphDTA (⭐293) — Graph neural network–based DTI prediction using molecular graphs.
- MolTrans (⭐225) — Transformer-based DTI model leveraging molecular substructures.
- DrugBAN (⭐138) — Bilinear attention network for interpretable DTI prediction.
Protein Foundation Models / Pre-trained Embedding
- Evolutionary Scale Modeling (ESM) (⭐4k) — Pretrained protein language models for sequence embeddings.
Jan 07, 2026
Preprocessing Tools
- ChatSpatial (⭐18) — MCP server for spatial transcriptomics analysis via natural language.
Single-cell Foundation Models / Transcriptomics Foundation Models
- scFoundation (⭐392) — Large-scale foundation model for single-cell gene expression, enabling multiple downstream tasks.
- scGPT (⭐1.5k) — Transformer-based foundation model pretrained on millions of single-cell profiles.
- BulkFormer (⭐42) — Foundation model for bulk RNA-seq data; learns general transcriptomic representations.
Jan 03, 2026
Preprocessing Tools
- FlashDeconv (⭐13) — High-performance spatial transcriptomics deconvolution (~1M spots in ~3 min).
- Squidpy — Python library for spatial single-cell analysis.
Nov 17, 2024
Drug Response Prediction
- MOFGCN (⭐6) — Graph convolutional network combined with a heterogeneous network.
- DeepDSC — Autoencoder followed by a fully connected neural network.
- DGDRP (⭐0) — Multi-view embedding neural network.
- DeepAEG (⭐3) — GNN embedding with an attention mechanism.
Sep 01, 2024
Compound
- ZINC ligand discovery database — Free database of commercially-available compounds for virtual screening.
Protein
- Critical Assessment of Structure Prediction (CASP) — Assessing methods for protein structure prediction.
- Uniclust — Clustered protein sequence databases.
- CATH database — Hierarchical classification of protein domain structures.
Genome
- 10x Genomics Dataset — Collection of single-cell datasets.
- The Genotype-Tissue Expression (GTEx) — Human gene expression and regulation resource.
- Dependency Map (DepMap) — CRISPR-Cas9 screens in cancer cell lines.
- Catalogue Of Somatic Mutations In Cancer (COSMIC) — Resource on somatic mutations in cancers.
- MGnify — Resource for metagenomic and metatranscriptomic data.
- JASPAR — Database of transcription factor binding profiles.
Clinical Trial
- ClinicalTrials.gov — Registry of privately and publicly funded clinical studies.
- ICD10 — International Classification of Diseases, 10th revision.
- EU Drug Regulating Authorities Clinical Trials DB (EudraCT) — European clinical trial database.
- MIMIC-IV — Freely accessible critical care database.
Benchmarks & Datasets
- MoleculeNet — Benchmark datasets for molecular machine learning.
Aug 11, 2024
Compound
- Therapeutic Target Database — Drug-target, target-disease, and drug-disease datasets.
Aug 10, 2024
LLM for Biology
- scPRINT (⭐142) — Pretrained on 50M cells for scRNA-seq denoising & zero imputation.
Jul 17, 2024
Knowledge Graph
- Drug Mechanism Database (DrugMechDB) (⭐69) — Mechanisms of action from drug to disease.
Drug Response Prediction
- drGAT (⭐1) — Attention-based model for drug response prediction with gene explainability.
LLM for Biology
- GeneGPT (⭐423) — LLM for biomedical information, integrated with various APIs.
- GenePT (⭐310) — Foundation LLM for single-cell data.
Mar 17, 2024
LLM for Biology
- AI4Chem/ChemLLM-7B-Chat — LLM for chemical & molecular science.
- BioGPT (⭐4.5k) — LLM for biomedical text generation.
Protein Foundation Models / Pre-trained Embedding
- ChemBERTa-2 (⭐487) — Chemical embeddings & prediction.
Nov 29, 2023
Compound
- Drug Repurposing Hub — Collections of drug repurposing data (drug, MoA, target, etc).
Protein
- AlphaFold Protein Structure Database — 3D protein structure predictions.
Protein-Protein Interaction
- STRING — PPI networks for multiple organisms.
Sep 07, 2023
Compound
- Rhea — Database of chemical reactions.
Jun 13, 2023
scRNA
- Single Cell Expression Atlas — Public database of single-cell RNA-seq data.
Pathway
- PathwayCommons — Database of pathways and interactions.
Genome
- cBioPortal — Cancer genomics database; aggregating many patient datasets.
Preprocessing Tools
- Scanpy — Python library for scRNA-seq analysis.
- Seurat — R library for scRNA-seq analysis.
Apr 17, 2023
scRNA
- Gene Expression Omnibus — Public functional genomics database.
- Single Cell Portal — Public database of single-cell RNA-seq data.
Dec 29, 2022
Benchmarks & Datasets
- NCI60 — Drug sensitivity benchmark across 60 diverse human cancer cell lines.
May 18, 2022
Compound
- ChEBI — Database focused on small chemical compounds.
May 15, 2022
Compound
- PubChem — One of the largest chemical databases (compounds, genes, and proteins).
- ChEMBL — Bioactive molecules with drug-like properties.
- ChemSpider — Chemical structure database.
- KEGG COMPOUND — Collection of small molecules and biopolymers.
- LIPID MAPS — Database of lipids.
Pathway
- KEGG PATHWAY — Collection of pathway maps.
- WikiPathways — Database of biological pathways.
Mass Spectra
- MassBank — Open source databases and tools for mass spectrometry reference spectra.
- MoNA (MassBank of North America) — Meta-database of metabolite mass spectra, metadata, and associated compounds.
Protein
- THE HUMAN PROTEIN ATLAS — Comprehensive human protein database (cells, tissues, organs).
- UniProt — Functional information on proteins.
Genome
- Human Genome Resources at NCBI — Database for genomics, proteomics, transcriptomics, and systems biology.
- GenBank — NCBI's database of genetic sequences.
- UCSC Genome Browser — Interactive browser for genome assemblies and annotation tracks.
Disease
- KEGG DRUG — Comprehensive, approved drug information.
Preprocessing Tools
- Chemistry Development Kit (⭐571) — Cheminformatics software & machine learning tools.
- RDKit (⭐3.3k) — Cheminformatics software & machine learning toolkit.
Drug Repurposing
- DeepPurpose (⭐1.1k) — Deep learning library for drug repurposing.
Drug Target Interaction
- NeoDTI (⭐77) — Library for drug-target interaction prediction.
Compound-Protein Interaction
- MCPINN (⭐3) — Drug discovery via compound-protein interaction and machine learning.
- TransformerCPI (⭐153) — CPI prediction using Transformer.
Feb 22, 2022
Drug-Gene Interaction
- DGIdb — Drug-gene interactions and the druggable genome.
Chemical-Protein Interaction
- STITCH — Chemical-protein interactions.