Final Project: disordR

 

disordR: a minimal IDP toolkit in R

Project Goal 

I built disordR to provide a small toolkit for analyzing intrinsically disordered proteins (IDPs). The package includes functions for amino-acid properties, classic Uversky charge-hydropathy metrics/plots, and a simple consensus combiner for per-residue disorder scores. A small, bundled dataset makes testing the functions simple and reproducible.

Design Reasoning

I chose four functions to cover common IDP tasks:

  • aa_props() → mean Kyte–Doolittle hydropathy, net charge, and fraction charged residues (FCR).

  • uversky_metrics() → scaled hydropathy (0–1) and mean net charge per residue, with a simple IDP vs Ordered call.

  • uversky_plot() → classic Uversky scatter with the boundary line for visual interpretation.

  • consensus_disorder() → mean consensus of per-residue predictor scores.

I limited dependencies to tibble and ggplot2 to make installs reliable and simple. A future direction of this package could include lightweight helpers for AlphaFold/ColabFold pLDDT-based IDR segment calling. This would provide insight into specific intrinsically disordered regions and how they may relate to structure/function.
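To make the arithmetic behind the first two functions concrete, here is a base-R sketch. It is not the package source: the helper names sketch_aa_props() and sketch_uversky() are hypothetical, the aa_props-style column names are my assumption, and I assume histidine is treated as neutral when counting charges. The hydropathy values are the published Kyte–Doolittle scale, and the boundary is the classic Uversky line R = 2.785 H − 1.151.

```r
# Kyte-Doolittle hydropathy scale (published values, range -4.5 to 4.5)
kd <- c(A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5,
        Q = -3.5, E = -3.5, G = -0.4, H = -3.2, I = 4.5,
        L = 3.8, K = -3.9, M = 1.9, F = 2.8, P = -1.6,
        S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2)

# Hypothetical stand-in for aa_props(): mean hydropathy, net charge, FCR
sketch_aa_props <- function(seq) {
  aa <- strsplit(toupper(seq), "")[[1]]
  stopifnot(length(aa) > 0, all(aa %in% names(kd)))
  pos <- sum(aa %in% c("K", "R"))   # assumption: His counted as neutral
  neg <- sum(aa %in% c("D", "E"))
  data.frame(
    mean_hydro = mean(unname(kd[aa])),
    net_charge = pos - neg,
    fcr        = (pos + neg) / length(aa)
  )
}

# Hypothetical stand-in for uversky_metrics()
sketch_uversky <- function(seq) {
  p <- sketch_aa_props(seq)
  h <- (p$mean_hydro + 4.5) / 9            # rescale KD from [-4.5, 4.5] to [0, 1]
  r <- abs(p$net_charge) / nchar(seq)      # absolute mean net charge per residue
  data.frame(
    mean_h_scaled      = h,
    net_charge_per_res = r,
    # Points above the boundary line are called IDP, the rest Ordered
    class = if (r > 2.785 * h - 1.151) "IDP" else "Ordered"
  )
}

sketch_aa_props("MKKSSSDEE")
sketch_uversky("MKKSSSDEE")
```

Returning a one-row data frame per sequence keeps the results easy to stack with rbind() when scoring many sequences at once.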

Dataset Selection & Justification

I bundled a synthetic protein-sequence dataset (n = 120, p = 12 features + label) with the package to guarantee reproducibility and to cover the composition classes that matter for Uversky plots: basic-rich, acidic-rich, mixed, and disorder-ish sequences.

  • File: inst/extdata/disordR_sequences.csv

  • Shape: 120 rows × 13 columns (sequence + 12 derived features, including length, mean KD, charge metrics, residue-class fractions, and a coarse label).

Data Cleaning & Preparation

Protein sequences were programmatically generated with controlled amino-acid compositions (to create the basic, acidic, mixed, and disorder-ish groups). I computed the following per-sequence features:

  • length, mean KD hydropathy

  • total net charge and mean net charge per residue

  • fraction charged residues (FCR)

  • fraction acidic/basic/polar/hydrophobic residues

The CSV keeps one row per sequence and uses plain ASCII column names. No missing values are present.
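The generation step can be sketched roughly as follows. This is only an illustration in base R, not the script that produced the bundled CSV: the weights, the sequence length of 60, the group names, and the helpers make_weights(), gen_seq(), and frac() are all hypothetical, and only a few of the features are computed.

```r
set.seed(1)  # fixed seed so the sketch is repeatable

aas <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Hypothetical composition weights: boost some residues, then normalize
make_weights <- function(boost, factor = 6) {
  w <- setNames(rep(1, length(aas)), aas)
  w[boost] <- factor
  w / sum(w)
}

groups <- list(
  basic  = make_weights(c("K", "R")),
  acidic = make_weights(c("D", "E")),
  mixed  = make_weights(character(0))   # uniform composition
)

# Draw one sequence of a given length from a composition
gen_seq <- function(w, len = 60) {
  paste(sample(aas, len, replace = TRUE, prob = w), collapse = "")
}

# Fraction of residues in a sequence that belong to a residue set
frac <- function(seq, set) {
  mean(strsplit(seq, "")[[1]] %in% set)
}

seqs <- vapply(groups, gen_seq, character(1))

features <- data.frame(
  label       = names(groups),
  sequence    = unname(seqs),
  length      = nchar(seqs),
  frac_basic  = vapply(seqs, frac, numeric(1), set = c("K", "R")),
  frac_acidic = vapply(seqs, frac, numeric(1), set = c("D", "E")),
  row.names   = NULL
)
features$fcr <- features$frac_basic + features$frac_acidic

stopifnot(!anyNA(features))  # same no-missing-values check as for the CSV
features
```

Sampling from a weighted alphabet is the simplest way to control composition. It does not control residue order, so patterning effects are not represented.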

Reproducible code (demo from README.md)

# install.packages("remotes")
remotes::install_github("hannahcardenas4/disordR")
library(disordR)
library(tibble)

# 1) basic properties
aa_props("MKKSSSDEE")

# 2) Uversky metrics + single-point plot
u <- uversky_metrics("MDSEKEKKEKEKEGGGGGSSTTTTTSSSSSSSSSS")
df <- tibble(name = "demo", class = u$class,
             mean_h_scaled = u$mean_h_scaled,
             net_charge_per_res = u$net_charge_per_res)
uversky_plot(df)

# 3) full scatter from bundled CSV
fp <- system.file("extdata", "disordR_sequences.csv", package = "disordR")
dat <- read.csv(fp, stringsAsFactors = FALSE)
um <- do.call(rbind, lapply(dat$sequence, uversky_metrics))
plot_df <- tibble(name = dat$name, class = um$class,
                  mean_h_scaled = um$mean_h_scaled,
                  net_charge_per_res = um$net_charge_per_res)
uversky_plot(plot_df, show_labels = FALSE)

# 4) consensus example
s1 <- tibble(position = 1:5, score = c(.2, .4, .6, .8, 1))
s2 <- tibble(position = 1:5, score = c(.3, .5, .7, .9, 1))
consensus_disorder(s1, s2)

Expected results from functions

  • #1) Amino-acid properties: aa_props(seq) returns a 1×3 tibble (mean_hydro, net_charge, fcr).

  • #2) Uversky metrics: uversky_metrics(seq) returns scaled hydropathy (mean_h_scaled in [0,1]), net_charge_per_res (≥ 0), and class ∈ {IDP, Ordered}.

  • Uversky plot: uversky_plot(df) draws R vs H with the boundary line R = 2.785 H − 1.151 and sensible axis limits (H: 0.2–0.7; R: 0–0.7).

  • #3) Full scatter: the Uversky scatter built from the full CSV separates IDP-like sequences above the boundary from ordered sequences below it.

  • #4) Consensus: consensus_disorder(...) averages score by position across any number of inputs.
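The per-position averaging in the last item can be sketched in base R as below. The helper sketch_consensus() is hypothetical and is not the package's consensus_disorder(). It assumes every input has position and score columns, as in the demo above.

```r
# Hypothetical stand-in for consensus_disorder(): mean score per position
sketch_consensus <- function(...) {
  inputs <- list(...)
  stopifnot(length(inputs) >= 1)
  # Stack all predictor tables, keeping only the two columns we need
  stacked <- do.call(rbind, lapply(inputs, function(d) {
    as.data.frame(d)[, c("position", "score")]
  }))
  # Average the scores that share a position
  out <- aggregate(score ~ position, data = stacked, FUN = mean)
  out[order(out$position), ]
}

s1 <- data.frame(position = 1:5, score = c(.2, .4, .6, .8, 1))
s2 <- data.frame(position = 1:5, score = c(.3, .5, .7, .9, 1))
consensus <- sketch_consensus(s1, s2)
consensus$score
# 0.25 0.45 0.65 0.85 1.00
```

Stacking first and aggregating second means positions that appear in only some inputs are still averaged over whatever scores exist for them, which may or may not match what the package does.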

Limitations & future work

  • The dataset is synthetic and is meant for demonstration. Real proteomes will have broader distributions and will require more fine-tuning. 

  • Planned helpers for pLDDT-based IDR calling (AlphaFold/ColabFold) will likely be added in a later release.
