disordR: a minimal IDP toolkit in R

Project Goal

I built disordR to provide a small toolkit for analyzing intrinsically disordered proteins (IDPs). The package includes functions for amino-acid properties, classic Uversky charge-hydropathy metrics/plots, and a simple consensus combiner for per-residue disorder scores. A small, bundled dataset makes testing the functions simple and reproducible.

Design Reasoning

I chose four functions to cover common IDP tasks:

aa_props() → mean Kyte–Doolittle hydropathy, net charge, and fraction of charged residues (FCR).
uversky_metrics() → scaled hydropathy (0–1) and mean net charge per residue, with a simple IDP vs. Ordered call.
uversky_plot() → classic Uversky scatter with the boundary line for visual interpretation.
consensus_disorder() → mean consensus of per-residue predictor scores.
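
To make the charge-hydropathy call concrete, here is a minimal base-R sketch of the calculation behind a Uversky-style classification. It is not the package's source: `uversky_sketch` is a hypothetical name, histidine is treated as neutral by assumption, and the boundary is the line quoted later in this post.

```r
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
kd <- c(A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5,
        Q = -3.5, E = -3.5, G = -0.4, H = -3.2, I = 4.5,
        L = 3.8, K = -3.9, M = 1.9, F = 2.8, P = -1.6,
        S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2)

# Hypothetical helper, for illustration only
uversky_sketch <- function(seq) {
  aa <- strsplit(toupper(seq), "")[[1]]
  aa <- aa[aa %in% names(kd)]                # drop non-standard characters
  # Rescale KD from [-4.5, 4.5] to [0, 1], then average over residues
  h <- mean((kd[aa] + 4.5) / 9)
  # K/R count +1, D/E count -1; histidine treated as neutral (assumption)
  net <- sum(aa %in% c("K", "R")) - sum(aa %in% c("D", "E"))
  r <- abs(net) / length(aa)
  # Points above the boundary line R = 2.785 H - 1.151 are called IDP
  boundary <- 2.785 * h - 1.151
  data.frame(mean_h_scaled = h,
             net_charge_per_res = r,
             class = if (r > boundary) "IDP" else "Ordered")
}

uversky_sketch("MKKSSSDEE")
```

A lysine-only sequence lands far above the line (low hydropathy, high charge), while an isoleucine-only sequence lands below it (high hydropathy, no charge).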

I limited dependencies to tibble and ggplot2 to make installs reliable and simple. A future direction of this package could include lightweight helpers for AlphaFold/ColabFold pLDDT-based IDR segment calling. This would provide insight into specific intrinsically disordered regions and how they may relate to structure/function.

Dataset Selection & Justification

I bundled a synthetic protein-sequence dataset (n = 120, p = 12 features + label) with the package to guarantee reproducibility and to cover the composition classes that matter for Uversky plots: basic-rich, acidic-rich, mixed, and disorder-ish.

File: inst/extdata/disordR_sequences.csv
Shape: 120 rows × 13 columns (sequence + 12 derived features, including length, mean KD, charge metrics, residue-class fractions, and a coarse label).

Data Cleaning & Preparation

Protein sequences were programmatically generated with controlled amino-acid compositions (to create basic/acidic/mixed/disorderish groups). I computed the following per-sequence features:

length, mean KD hydropathy
total net charge and mean net charge per residue
fraction charged residues (FCR)
fraction acidic/basic/polar/hydrophobic residues

The CSV keeps one row per sequence and uses plain ASCII column names. No missing values are present.
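
The generation and feature steps described above could look roughly like the sketch below. This is an illustration under stated assumptions, not the script that produced the bundled CSV: `make_seq`, `seq_features`, the enrichment weight, and the column names are all hypothetical.

```r
set.seed(1)  # fixed seed so the sketch is reproducible
aa_all <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Hypothetical generator: up-weight the residues in `enriched`
make_seq <- function(len, enriched, weight = 5) {
  w <- ifelse(aa_all %in% enriched, weight, 1)
  paste(sample(aa_all, len, replace = TRUE, prob = w), collapse = "")
}

# Fraction of residues that belong to a given set
frac <- function(aa, set) mean(aa %in% set)

# Hypothetical per-sequence feature table (a subset of the features listed above)
seq_features <- function(seq) {
  aa <- strsplit(seq, "")[[1]]
  net <- sum(aa %in% c("K", "R")) - sum(aa %in% c("D", "E"))
  data.frame(sequence = seq,
             length = length(aa),
             net_charge = net,
             net_charge_per_res = net / length(aa),
             fcr = frac(aa, c("K", "R", "D", "E")),
             frac_acidic = frac(aa, c("D", "E")),
             frac_basic = frac(aa, c("K", "R")))
}

seqs <- c(basic = make_seq(60, c("K", "R")),
          acidic = make_seq(60, c("D", "E")))
feats <- do.call(rbind, lapply(seqs, seq_features))
feats
```

Because the fraction of charged residues is the sum of the acidic and basic fractions, `fcr` always equals `frac_acidic + frac_basic` in this sketch, which makes a handy sanity check.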

Reproducible code (demo from README.md)


# install.packages("remotes")
remotes::install_github("hannahcardenas4/disordR")
library(disordR)
library(tibble)

# 1) basic properties
aa_props("MKKSSSDEE")

# 2) Uversky metrics + single-point plot
u  <- uversky_metrics("MDSEKEKKEKEKEGGGGGSSTTTTTSSSSSSSSSS")
df <- tibble(name="demo", class=u$class,
             mean_h_scaled=u$mean_h_scaled,
             net_charge_per_res=u$net_charge_per_res)
uversky_plot(df)

# 3) full scatter from bundled CSV
fp  <- system.file("extdata","disordR_sequences.csv", package="disordR")
dat <- read.csv(fp, stringsAsFactors = FALSE)
um  <- do.call(rbind, lapply(dat$sequence, uversky_metrics))
plot_df <- tibble(name=dat$name, class=um$class,
                  mean_h_scaled=um$mean_h_scaled,
                  net_charge_per_res=um$net_charge_per_res)
uversky_plot(plot_df, show_labels = FALSE)

# 4) consensus example
s1 <- tibble(position=1:5, score=c(.2,.4,.6,.8,1))
s2 <- tibble(position=1:5, score=c(.3,.5,.7,.9,1))
consensus_disorder(s1, s2)
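
For intuition, averaging `score` by `position` across several predictors can be sketched in base R as follows. This is a stand-in to show the idea, not the package's implementation; `consensus_sketch` is a hypothetical name.

```r
# Hypothetical helper: stack any number of position/score tables,
# then take the mean score at each position
consensus_sketch <- function(...) {
  tables <- lapply(list(...), function(d) {
    as.data.frame(d)[, c("position", "score")]
  })
  stacked <- do.call(rbind, tables)
  aggregate(score ~ position, data = stacked, FUN = mean)
}

s1 <- data.frame(position = 1:5, score = c(.2, .4, .6, .8, 1))
s2 <- data.frame(position = 1:5, score = c(.3, .5, .7, .9, 1))
out <- consensus_sketch(s1, s2)
out
```

With the two inputs above, each consensus value is the midpoint of the two scores at that position, for example 0.25 at position 1.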

Expected results from functions

#1) Amino-acid properties: `aa_props(seq)` returns a 1×3 tibble (`mean_hydro`, `net_charge`, `fcr`).
#2) Uversky metrics and plot: `uversky_metrics(seq)` returns scaled hydropathy (`mean_h_scaled` in [0,1]), `net_charge_per_res` (≥ 0), and `class` ∈ {IDP, Ordered}. `uversky_plot(df)` draws mean net charge per residue $R$ against scaled hydropathy $H$ with the boundary line $R = 2.785 H - 1.151$ and sensible axis limits (H: 0.2–0.7; R: 0–0.7).
#3) Full scatter: the Uversky scatter built from the full CSV separates IDP-like sequences above the boundary from ordered sequences below it.
#4) Consensus: `consensus_disorder(...)` averages `score` by `position` across any number of inputs.

Limitations & future work

The dataset is synthetic and is meant for demonstration. Real proteomes will have broader distributions and will require more fine-tuning.
Planned helpers for pLDDT-based IDR calling (AlphaFold/ColabFold) will likely be added in a later release.
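
As a rough idea of what such a helper might do, the sketch below calls contiguous low-pLDDT runs as candidate IDR segments. Nothing here exists in disordR yet: `call_idr_segments` is a hypothetical name, and the cutoff of 50 and minimum length of 5 are assumptions chosen for illustration.

```r
# Hypothetical helper: find runs of residues with pLDDT below a cutoff
call_idr_segments <- function(plddt, cutoff = 50, min_len = 5) {
  runs <- rle(plddt < cutoff)            # run-length encode low/high stretches
  ends <- cumsum(runs$lengths)           # last residue index of each run
  starts <- ends - runs$lengths + 1      # first residue index of each run
  keep <- runs$values & runs$lengths >= min_len
  data.frame(start = starts[keep],
             end = ends[keep],
             length = runs$lengths[keep])
}

# Toy per-residue pLDDT vector (made up for this example)
plddt <- c(rep(90, 10), rep(35, 8), rep(85, 5), rep(40, 3), rep(92, 4))
segs <- call_idr_segments(plddt)
segs
```

In the toy vector, only the eight-residue low-confidence stretch passes the length filter; the three-residue dip is too short to be called.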

R Programming Journal – Hannah Cardenas

Final Project: disordR
