Documentation

CheRRI’s core method scripts

CheRRI is built as a modular tool calling individual scripts accomplishing the various tasks of CheRRI’s core methods. If you would like to perform only one step of the CheRRI pipeline, you can do this by calling the individual scripts. A short description of this scripts is given in the following.

RRI detection with find_trusted_RRI.py

Here we search for trusted RRIs, so RRIs which can be found in all replicates. In a first filter step only uniquely mapped RRIs are taken. Than RRI sequence partners in all replicates are found, using a overlap threshold. Output are the ChiRA input tables, now containing only the trusted RRIs. Out of all RRI pairs of the replicates only the one with the highest overlap to all others is added to the trusted_RRI data set.

Input parameters for find_trusted_RRI.py

ID

name

description

-i

--input_path

Path to folder storing input data (containing all replicates)

-r

--list_of_replicates

List of file names for all replicates

-o

--overlap_th

Overlap threshold to find trusted RRIs

-d

--output_path

Path where output folder should be stored

-n

--experiment_name

Name of the data source of positive trusted RRIs

-s

--score_th

Threshold for EM score from ChiRA

-fh

--filter_hybrid

Filter the data for hybrids already detected by ChiRA

Output of find_trusted_RRI.py

The filtered set of trusted RRI sites in tabular format.

Compute occupied regions with find_occupied_regions.py

Given the RRI information tables from ChiRA and RNA-protein binding positions, an InterLab object is build. The occupied information can be used to mask parts of the genome and therefore enable to select negative interaction regions.

Input parameters for find_occupied_regions.py

ID

name

description

-i1

--RRI_path

Path to folder storing all RRI data (table)

-i2

--rbp_path

Path to RBP site data file (BED format)

-r

--list_of_replicates

List of file names for all replicates

-o

--out_path

Path where output folder should be stored

-t

--overlap_th

Overlap threshold

-s

--score_th

Score threshold

-e

--external_object

External RRI overlapping object (InterLap dict)

-fh

--filter_hybrind

Filter the data for hybrids already detected by ChiRA

-mo

--mode

Function call within which CheRRI mode (train/eval)

Output of find_occupied_regions.py

A python pickle file object storing occupied regions in an InterLap dictionary.

Interaction predictions with generate_pos_neg_with_context.py

Given a set of trusted RRI sites and occupied regions, a given context is appended. Then positive interactions are computed by calling IntaRNA, specifying the trusted RRI sites as seed regions. The negative interactions are computed by calling IntaRNA on regions outside the RRI sites / occupied regions.

Input parameters for generate_pos_neg_with_context.py

ID

name

description

-i1

--input_rris

Path to file storing all trusted RRIs

-i2

--input_occupied

Path to occupied regions file

-d

--output_path

Path where output folder should be stored

-n

--experiment_name

Name of the data source of positive trusted RRIs

-g

--genome_file

Path to genome FASTA file

-c

--context

How much context should be added up- and downstream

--pos_occ

Occupied regions are set (default)

--no_pos_occ

Set if no occupied regions should be used

-b

--block_ends

# nucleotides blocked at the ends of each extended RRI site

-s

--no_sub_opt

# of interactions IntraRNA will give if possible

-l

--chrom_len_file

Tabular file containing chrom name \t chrom length for each chromosome

-p

--param_file

IntaRNA parameter file

-m

--mode

Which CheRRI mode is running (train/eval)

IntaRNA parameters used within CheRRI

To generate the current features IntaRNA parameters by default are set to:

parameters

value

description

outMode

C

Output style (C=tabluar format)

seedBP

5

the number of base pairs within the seed

seedMinPu

0

the minimal unpaired probability of each seed region in query and target

accW

150

sliding window length (0=global folding)

acc

N/C

To globally turn off accessibility consideration: turn off/on

outMaxE

-5

maximal energy for any interaction reported

outOverlap

B

overlapping of interaction sites of suboptimal allowed (B:both)

outNumber

5

generate up to N interactions for each query-target pair

seedT/QRange

positive interaction

genomic positions of the trusted RRI

q/tAccConstr

negative interaction

genomic positions of the occupied regions

intLenMax

50

restrict the overall length an interaction

temperature

37

experimental temperature

intLoopMax

3

number of unpaired bases between inter molecular base pairs

IntaRNA parameters can be changed by specifying a custom IntaRNA parameter file. CheRRIs default parameter set can be found here.

Output of generate_pos_neg_with_context.py

Positive and negative datasets stored in tabular format.

Feature extraction with get_features.py

Here additional sequence features are computed and the output is filtered for the final or a given feature set. The features are stored in tabular format. Note that the graph-kernel features are computed directly in CheRRI’s main functions and do not have a separate file.

Input parameters for get_features.py

ID

name

description

-i

--input

Path to input file

-f

--feature_set_list

Set of features the script will output

-o

--output_file

Output file path including the file name

Output of get_features.py

Tabular file having all features given via --feature_set_list.

Feature selection and optimization

If you would only like to run the feature selection or optimization, please check out biofilm.