Final Project: disordR
disordR: a minimal IDP toolkit in R
Project Goal
I built disordR to provide a small toolkit for analyzing intrinsically disordered proteins (IDPs). The package includes functions for amino-acid properties, classic Uversky charge-hydropathy metrics/plots, and a simple consensus combiner for per-residue disorder scores. A small, bundled dataset makes testing the functions simple and reproducible.
Design Reasoning
I chose four functions to cover common IDP tasks:
- aa_props() → mean Kyte–Doolittle hydropathy, net charge, and fraction of charged residues (FCR).
- uversky_metrics() → scaled hydropathy (0–1) and mean net charge per residue, with a simple IDP vs. Ordered call.
- uversky_plot() → classic Uversky scatter with the boundary line for visual interpretation.
- consensus_disorder() → mean consensus of per-residue predictor scores.
I limited dependencies to tibble and ggplot2 to make installs reliable and simple. A future direction of this package could include lightweight helpers for AlphaFold/ColabFold pLDDT-based IDR segment calling. This would provide insight into specific intrinsically disordered regions and how they may relate to structure/function.
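As a rough illustration of the first function's job, here is a minimal base-R sketch of the three quantities aa_props() reports. It is not the packaged implementation: the name aa_props_sketch is hypothetical, it returns a data.frame rather than a tibble to stay dependency-free, and it assumes K/R count as +1, D/E as -1, and histidine as neutral. The hydropathy values are the standard Kyte–Doolittle scale.

```r
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
kd_scale <- c(
  A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5,
  Q = -3.5, E = -3.5, G = -0.4, H = -3.2, I = 4.5,
  L = 3.8, K = -3.9, M = 1.9, F = 2.8, P = -1.6,
  S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2
)

# Hypothetical stand-in for aa_props(); the packaged function may differ.
aa_props_sketch <- function(seq) {
  residues <- strsplit(toupper(seq), "")[[1]]
  residues <- residues[residues %in% names(kd_scale)]  # drop non-standard letters
  stopifnot(length(residues) > 0)

  n_pos <- sum(residues %in% c("K", "R"))  # assumption: His treated as neutral
  n_neg <- sum(residues %in% c("D", "E"))

  data.frame(
    mean_hydro = mean(kd_scale[residues]),          # mean KD hydropathy
    net_charge = n_pos - n_neg,                     # total net charge
    fcr        = (n_pos + n_neg) / length(residues) # fraction of charged residues
  )
}

props <- aa_props_sketch("MKKEEDLLAG")
print(props)
```

Keeping the scale as a named vector lets the lookup be a single indexing step, which is one reason a function like this needs no dependencies beyond base R.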
Dataset Selection & Justification
I bundled a synthetic protein-sequence dataset (n = 120, p = 12 features + label) with the package to guarantee reproducibility and to cover the composition classes that matter for Uversky plots: basic-rich, acidic-rich, mixed, and disorder-ish sequences.
- File: inst/extdata/disordR_sequences.csv
- Shape: 120 rows × 13 columns (sequence + 12 derived features, including length, mean KD, charge metrics, residue-class fractions, and a coarse label).
Data Cleaning & Preparation
Protein sequences were programmatically generated with controlled amino-acid compositions (to create basic/acidic/mixed/disorderish groups). I computed the following per-sequence features:
- length and mean KD hydropathy
- total net charge and mean net charge per residue
- fraction of charged residues (FCR)
- fractions of acidic, basic, polar, and hydrophobic residues
The CSV keeps one row per sequence and uses plain ASCII column names. No missing values are present.
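The generation step described above could look something like the following sketch. It is an assumption about the general approach, not the script that produced the bundled CSV: make_sequence and frac_in are hypothetical helpers, the sampling weights are illustrative only, and the residue classes shown (K/R as basic, D/E as acidic) are one common convention.

```r
# Hypothetical generator: draw residues with user-supplied sampling weights.
make_sequence <- function(len, weights) {
  paste(sample(names(weights), size = len, replace = TRUE, prob = weights),
        collapse = "")
}

# Fraction of residues in `seq` that belong to the residue set `set`.
frac_in <- function(seq, set) {
  residues <- strsplit(seq, "")[[1]]
  mean(residues %in% set)
}

set.seed(1)  # fixed seed so the draw is reproducible

aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Illustrative weights only: up-weight K and R to bias toward a basic-rich sequence.
basic_w <- setNames(rep(1, length(aa)), aa)
basic_w[c("K", "R")] <- 6

s <- make_sequence(80, basic_w)

# A few of the per-sequence features, one row per sequence.
features <- data.frame(
  length     = nchar(s),
  frac_basic = frac_in(s, c("K", "R")),  # assumption: His excluded from "basic"
  frac_acid  = frac_in(s, c("D", "E")),
  fcr        = frac_in(s, c("K", "R", "D", "E"))
)
print(features)
```

Changing the weight vector (for example, up-weighting D and E instead) would give the acidic-rich group in the same way.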
Reproducible code (demo from README.md)
Expected results from functions
#1) Amino-acid properties: aa_props(seq) returns a 1×3 tibble (mean_hydro, net_charge, fcr).
#2) Uversky metrics: uversky_metrics(seq) returns scaled hydropathy (mean_h_scaled in [0,1]), net_charge_per_res (≥ 0), and class ∈ {IDP, Ordered}.
Uversky plot: uversky_plot(df) draws mean net charge per residue (R) vs. scaled hydropathy (H) with the boundary line and sensible axis limits (H: 0.2–0.7; R: 0–0.7).
#3) Full scatter: The Uversky scatter, built from the full CSV, separates IDP-like sequences above the boundary from ordered sequences below it.
#4) Consensus: consensus_disorder(...) averages scores by position across any number of inputs.
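To make the expected return values concrete, here is a hedged base-R sketch of the Uversky call and the consensus average. Both function names are hypothetical stand-ins, not the package's code. The sketch assumes hydropathy is rescaled from the KD range [-4.5, 4.5] to [0, 1], that net charge per residue is taken as an absolute value, and that the boundary is the commonly cited Uversky line R = 2.785·H − 1.151; the package may use different details.

```r
# Hypothetical stand-in for uversky_metrics(). Inputs: mean KD hydropathy
# (on the -4.5..4.5 scale) and signed mean net charge per residue.
uversky_sketch <- function(mean_kd, mean_net_charge) {
  h <- (mean_kd + 4.5) / 9        # rescale KD range [-4.5, 4.5] to [0, 1]
  r <- abs(mean_net_charge)       # absolute value, so always >= 0
  boundary <- 2.785 * h - 1.151   # assumed boundary: commonly cited Uversky line
  data.frame(
    mean_h_scaled      = h,
    net_charge_per_res = r,
    class              = ifelse(r > boundary, "IDP", "Ordered")  # above line = IDP
  )
}

# Hypothetical stand-in for consensus_disorder(): position-wise mean of any
# number of equal-length per-residue score vectors.
consensus_sketch <- function(...) {
  scores <- list(...)
  stopifnot(length(scores) > 0,
            length(unique(lengths(scores))) == 1)  # all inputs must align
  rowMeans(do.call(cbind, scores))
}

u <- uversky_sketch(mean_kd = -0.74, mean_net_charge = -0.1)
print(u)

cons <- consensus_sketch(c(0.2, 0.8, 0.6), c(0.4, 0.6, 1.0))
print(cons)
```

Binding the score vectors as columns and taking row means keeps the consensus step independent of how many predictors are supplied.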
Limitations & future work
- The dataset is synthetic and is meant for demonstration. Real proteomes will have broader distributions and will require more fine-tuning.
- Planned helpers for pLDDT-based IDR calling (AlphaFold/ColabFold) will likely be added in a later release.
Links
- GitHub repo: https://github.com/hannahcardenas4/disordR