Usage
You can use CheRRI in two modes. The eval mode predicts whether an RRI site of interest is biologically relevant. Pre-computed human and mouse models exist and can be downloaded from Zenodo. If you would like to build a model based on novel RRI interactome data, you can use the train mode.
Evaluation of RRIs
Based on a tabular file containing chromosomal position data of the RRIs, CheRRI classifies whether the interaction region is likely to be biologically relevant.
For the eval mode, a model and the filtered feature set have to be specified.
CheRRI has pre-trained models for human and mouse, which can be downloaded from Zenodo (see content.txt inside the zip folder for details).
If an RNA-RNA interactome dataset exists for your preferred organism, we recommend training your own organism-specific model using CheRRI’s train mode. After training, the model can then be used in eval mode to classify your predicted RRI positions.
RRI input format in evaluation mode
The RRI instances to be evaluated need to be given in CSV format (parameter --RRIs_table). The table needs the following header line:
chrom1,start1,stop1,strand1,chrom2,start2,stop2,strand2
Following the header line, each subsequent line represents an RRI, with chromosome ID (format: 1,2,3 …), interaction start, interaction end, and strand (“+” or “-”) of the two interacting partners. For example, you might want to evaluate the following three RRI sites:
19,18307518,18307539,-,14,90454500,90454521,+
X,109054541,109054590,+,9,89178539,89178562,-
10,123136102,123136122,+,5,1245880,1245902,+
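As a minimal sketch, assuming plain Python with only the standard library, such a table could be assembled and sanity-checked before it is handed to CheRRI. The helper name `format_rri_table` is hypothetical and not part of CheRRI, and the check that start is not larger than stop is an assumption rather than a documented requirement.

```python
import csv
import io

# Header line required by CheRRI's eval mode (taken from the text above).
HEADER = ["chrom1", "start1", "stop1", "strand1",
          "chrom2", "start2", "stop2", "strand2"]


def format_rri_table(rows):
    """Return the RRI rows as CSV text after basic sanity checks.

    Hypothetical helper for illustration; not part of CheRRI.
    """
    for row in rows:
        if len(row) != len(HEADER):
            raise ValueError(f"expected {len(HEADER)} fields, got {len(row)}")
        # Each interaction partner is described by (chrom, start, stop, strand).
        for chrom, start, stop, strand in (row[:4], row[4:]):
            if strand not in ("+", "-"):
                raise ValueError(f"invalid strand: {strand!r}")
            if int(start) > int(stop):  # assumption: start <= stop
                raise ValueError(f"start {start} is after stop {stop}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# The three example RRI sites from above.
rris = [
    ("19", 18307518, 18307539, "-", "14", 90454500, 90454521, "+"),
    ("X", 109054541, 109054590, "+", "9", 89178539, 89178562, "-"),
    ("10", 123136102, 123136122, "+", "5", 1245880, 1245902, "+"),
]
table_text = format_rri_table(rris)
print(table_text)
```

The returned text can be written to a file of your choice and that file passed to the eval mode as the RRI table.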
Example call for CheRRI’s evaluation mode
For the test call, please download the Cherri_models_data zip folder. The human model is needed to execute the call. Be sure to provide the correct location for the model and its feature set (-m, -mp). For example, assuming the data (zip folder extracted to folder Cherri_models_data) is stored inside the CheRRI folder:
cherri eval -i1 test_data/evaluate/test_evaluate_rris.csv -g human -l human -o ./ -n test_eval -c 150 -st on -m Cherri_models_data/Model_with_graph_features/human/full_human_context_150.model -mp Cherri_models_data/Model_with_graph_features/human/training_data_human_context_150.npz
If you would like to perform a test evaluation, you can use the test_file stored in CheRRI’s GitHub repository.
Input parameters in evaluation mode
Input parameters for CheRRI’s eval mode (cherri eval):
Required:
| ID | name | description |
|---|---|---|
| | | Table containing all RRIs that should be evaluated in the correct format |
| | | Path to genome FASTA file, or use the built-in download function if you want the human or mouse genome |
| | | Path to output directory where the output folder will be stored. It will contain separate output folders for each step of the data and feature preparation as well as the evaluated instances |
| | | Tabular file containing data in two-column format for each chromosome: ‘chrom name’ \t ‘chrom length’. You can directly specify ‘human’ or ‘mouse’ |
| | | Set path to the model which should be used for evaluation |
| | | Set path to the feature file of the given model |
Optional:
| ID | name | description |
|---|---|---|
| | | Path to occupied regions python object. This file should be used if there are regions that should be blocked from interactions. One can create this file with find_occupied_regions.py |
| | | How much context should be added up- and downstream of the sequence. Default: 150 |
| | | Name of the data source of the RRIs, e.g. experiment and organism. Default: eval_rri |
| | | IntaRNA parameter file. Default: file in path_to_cherri_folder/Cherri/rrieval/IntaRNA_param |
| | | Set ‘off’ if you want to disable structure. Default: ‘on’ |
| | | Name for the output directory. Default: ‘date_Cherri_evaluating_RRIs’ |
| | | If you want to start from hand-curated feature files. Use this for evaluating test set performance (set ‘on’). Default: ‘off’ |
| | | Number of jobs used for graph feature computation. Default: 1 |
Output in evaluation mode
At the end of the run the location of the results table is given.
The final results table contains the target and query IDs of your input sequences (target_ID, query_ID), the score of each instance (instance_score), the predicted class of each RRI, 0 or 1 (predicted_label), and, if you are running the validation mode with -ef on, the positive or negative label (true_lable), followed by all features of the instance.
The IDs are a summary of chromosome;strand;start;stop of the first (target) and the second (query) sequence.
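As an illustration of the ID layout described above, the following sketch splits such an ID back into its parts. The names `Site` and `parse_site_id` are hypothetical, and the example ID is constructed from the described chromosome;strand;start;stop format rather than copied from real CheRRI output.

```python
from typing import NamedTuple


class Site(NamedTuple):
    """One interaction partner, as encoded in target_ID or query_ID."""
    chrom: str
    strand: str
    start: int
    stop: int


def parse_site_id(site_id: str) -> Site:
    """Split an ID of the form chromosome;strand;start;stop."""
    chrom, strand, start, stop = site_id.split(";")
    return Site(chrom, strand, int(start), int(stop))


# Illustrative ID built from the documented format (an assumption, not real output).
site = parse_site_id("19;-;18307518;18307539")
print(site.chrom, site.strand, site.start, site.stop)
```

An ID that does not have exactly four `;`-separated fields makes the tuple unpacking raise a `ValueError`, which is a convenient early signal that the table has a different layout than assumed here.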
Throughout the program, several output files are generated and stored in the following structure:
├── date_Cherri_evaluation_mode
|   ├── evaluate_RRIs.csv
|   ├── positive_instance
|   |   ├── {name}_context_{context}pos.csv
|   |   ├── {name}_context_{context}_block_ends_0_RRI_dataset.csv
|   ├── feature_files
|   |   ├── feature_filtered_{name}_context_{context}_pos.csv
|   |   ├── training_data_{name}_context_{context}.npz
|   ├── evaluation
|   |   ├── evaluation_results_{name}.csv
Validate your model using the eval mode
You can also use CheRRI’s eval mode to create a validation result table and then use compute_f1 to get the F1 score.
The following is an example call to validate a theoretical model built from DataA with data from a different source, e.g. DataB.
You can set a MODELPATH leading to your models, e.g. of DataA and DataB. Here we assume that DataA and DataB are in the same MODELPATH location (e.g. MODELPATH=~/myproject/cherri_models).
You can set the MODELPATH on a Linux-based system like this:
export MODELPATH=~/myproject/cherri_models
cherri eval -i1 $MODELPATH/<DataB>/feature_files/feature_filtered_<DataB>_context_<150>_pos_occ -g human -l human -o $MODELPATH -n <eval_modelA_using_DataB> -c 150 -st on -m $MODELPATH/DataA/model/full_<DataA>_context_<150>.model -mp $MODELPATH/DataA/feature_files/training_data_<DataA>_context_<150>.npz -j 10 -on evaluation -ef on
The following is an example call to cross-validate a theoretical model built from DataA. biofilm-cv will split the data into 5 parts, use 4 of them to retrain a model, and use the 5th, left-out part to evaluate the trained model.
python -m biofilm.biofilm-cv --infile $MODELPATH/DataA/feature_files/training_data_<DataA>_context_<150>.npz --foldselect 0 --model $MODELPATH/DataA/model/optimized/full_<DataA>_context_<150>.model --out $MODELPATH
python -m biofilm.biofilm-cv --infile $MODELPATH/DataA/feature_files/training_data_<DataA>_context_<150>.npz --foldselect 1 --model $MODELPATH/DataA/model/optimized/full_<DataA>_context_<150>.model --out $MODELPATH
python -m biofilm.biofilm-cv --infile $MODELPATH/DataA/feature_files/training_data_<DataA>_context_<150>.npz --foldselect 2 --model $MODELPATH/DataA/model/optimized/full_<DataA>_context_<150>.model --out $MODELPATH
python -m biofilm.biofilm-cv --infile $MODELPATH/DataA/feature_files/training_data_<DataA>_context_<150>.npz --foldselect 3 --model $MODELPATH/DataA/model/optimized/full_<DataA>_context_<150>.model --out $MODELPATH
python -m biofilm.biofilm-cv --infile $MODELPATH/DataA/feature_files/training_data_<DataA>_context_<150>.npz --foldselect 4 --model $MODELPATH/DataA/model/optimized/full_<DataA>_context_<150>.model --out $MODELPATH
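The five calls above differ only in the value of --foldselect, so they can be generated in a loop. The following is a minimal sketch that only builds and prints the commands; it uses the same placeholders as the calls above, and the helper name `biofilm_cv_command` is hypothetical.

```python
# Placeholders mirroring the example calls above; replace them with real values.
MODELPATH = "$MODELPATH"
INFILE = f"{MODELPATH}/DataA/feature_files/training_data_<DataA>_context_<150>.npz"
MODEL = f"{MODELPATH}/DataA/model/optimized/full_<DataA>_context_<150>.model"


def biofilm_cv_command(fold):
    """Build the argument list for one cross-validation fold."""
    return [
        "python", "-m", "biofilm.biofilm-cv",
        "--infile", INFILE,
        "--foldselect", str(fold),
        "--model", MODEL,
        "--out", MODELPATH,
    ]


# One command per fold, 0 to 4, as in the five calls above.
commands = [biofilm_cv_command(fold) for fold in range(5)]
for command in commands:
    print(" ".join(command))
```

Once the placeholders are replaced by real paths, each argument list could be executed with `subprocess.run(command, check=True)` instead of being printed.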
Then use the result file to compute the F1 score using compute_f1.
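To make explicit what the F1 score measures, here is a small sketch that computes it by hand from a results table. It is not CheRRI's compute_f1 script. It assumes a comma-separated table with the predicted_label and true_lable columns described in the output section, labels encoded as 0 and 1, and class 1 as the positive class; the example rows are made up.

```python
import csv
import io


def f1_from_results(handle):
    """Compute the F1 score of the positive class (label 1) from a results table."""
    tp = fp = fn = 0
    for row in csv.DictReader(handle):
        predicted = int(float(row["predicted_label"]))
        # Column name spelled as in the output description above.
        truth = int(float(row["true_lable"]))
        if predicted == 1 and truth == 1:
            tp += 1  # true positive
        elif predicted == 1 and truth == 0:
            fp += 1  # false positive
        elif predicted == 0 and truth == 1:
            fn += 1  # false negative
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Tiny made-up table for illustration only (not real CheRRI output).
example = io.StringIO(
    "predicted_label,true_lable\n"
    "1,1\n"
    "1,0\n"
    "0,1\n"
    "1,1\n"
)
print(f1_from_results(example))
```

With two true positives, one false positive and one false negative, precision and recall are both 2/3, so the F1 score of this toy table is 2/3 as well.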
Build a new CheRRI model in training mode
Within CheRRI’s train mode you can train your own model. The input data are the RRIs found by Direct Duplex Detection (DDD) methods or other interactome HTS protocols. In theory, any set of RRIs can be used, as long as it is large enough to build a solid training dataset. If the interactions are from different organisms, CheRRI needs to be called in the mixed_model ‘on’ training mode (explained later).
Retrieve RNA-RNA interactome files using ChiRA
To extract RRI interactions from DDD methods, a tool named ChiRA is used to generate the ‘ChiRA interaction summary’ table. CheRRI expects as input the ‘ChiRA interaction summary’ file.
If you want to prepare a ‘ChiRA interaction summary’ table file, please follow this tutorial. You should prepare one ChiRA interaction summary file per replicate.
Starting from the RRI site information, CheRRI will build a model based on features generated from the DDD method interaction site data.
ChiRA RRI output files are needed as input for CheRRI’s train mode. --RRI_path (-i1) demands the path to the ChiRA interaction summary files, and --list_of_replicates (-r) demands the ChiRA interaction summary file names of the replicates used by CheRRI inside the -i1 folder.
Build RRI interactome file as input for CheRRI
You can add your own interaction data in a tabular CSV format. Please follow the specific guides given below to build the table. You would need to provide the position information on an interaction using the following header line:
['chrom_1st','start_1st','end_1st','strand_1st','chrom_2end','start_2end','end_2end','strand_2end']
If you want to add the result of an interaction, set all predicted interaction positions using the following header names:
'chrom_seq_1st_site', 'start_seq_1st_site','stop_seq_1st_site','strand_seq_1st_site','chrom_seq_2end_site', 'start_seq_2end_site', 'stop_seq_2end_site','strand_seq_2end_site'
The interaction information needs to be complete or not provided at all. If it is not provided, the RRI position information from above is used as score for finding overlaps between replicates.
If you have a score for the interactions you can also provide it in the following columns:
'score_seq_1st_site', 'score_seq_2end_site'
You can check the example file to get an impression of how it could look.
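The column groups described above can be summarized in a short sketch that assembles a header and checks the "complete or not provided at all" rule for the predicted interaction positions. The helper names `build_header` and `check_row` are hypothetical, and the order in which the optional column groups are appended is an assumption; the example file remains the reference for the exact layout.

```python
# Required position columns (header names taken from the text above).
POSITION_COLUMNS = [
    "chrom_1st", "start_1st", "end_1st", "strand_1st",
    "chrom_2end", "start_2end", "end_2end", "strand_2end",
]
# Optional predicted interaction positions: all of them or none.
SITE_COLUMNS = [
    "chrom_seq_1st_site", "start_seq_1st_site",
    "stop_seq_1st_site", "strand_seq_1st_site",
    "chrom_seq_2end_site", "start_seq_2end_site",
    "stop_seq_2end_site", "strand_seq_2end_site",
]
# Optional score columns.
SCORE_COLUMNS = ["score_seq_1st_site", "score_seq_2end_site"]


def build_header(with_sites=False, with_scores=False):
    """Assemble the header; the column group order is an assumption."""
    header = list(POSITION_COLUMNS)
    if with_sites:
        header += SITE_COLUMNS
    if with_scores:
        header += SCORE_COLUMNS
    return header


def check_row(row):
    """Reject rows whose interaction site information is only partly filled."""
    present = [col for col in SITE_COLUMNS if row.get(col) not in (None, "")]
    if present and len(present) != len(SITE_COLUMNS):
        missing = [col for col in SITE_COLUMNS if col not in present]
        raise ValueError(f"incomplete interaction site information: {missing}")


print(",".join(build_header(with_sites=True, with_scores=True)))
```

`check_row` expects one table row as a dictionary keyed by column name, for example a row produced by `csv.DictReader`.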
Example call for CheRRI’s training mode
This is an example call to invoke CheRRI’s model training mode. If you download the CheRRI GitHub folder, you can use its files for a test call:
cherri train -i1 test_data/training/Paris/ -r miRNA_human_1.tabular miRNA_human_2.tabular miRNA_human_3.tabular -g human -l human -o ./ -n human_test -c 50 -st on -t 600 -me 8000 -j 7
Input parameters in training mode
Input parameters for CheRRI’s train mode (cherri train):
Required:
| ID | name | description |
|---|---|---|
| | | Path to folder storing the ChiRA interaction summary files |
| | | Path to output directory where the output folder will be stored. It will contain separate output folders for each step of the data, feature and model preparation |
| | | List the ChiRA interaction summary file for each replicate |
| | | Tabular file containing data in two-column format for each chromosome: ‘chrom name’ \t ‘chrom length’. You can directly specify ‘human’ or ‘mouse’ |
| | | Path to genome FASTA file, or use the built-in download function if you want the human or mouse genome |
Optional:
| ID | name | description |
|---|---|---|
| | | How much context should be added up- and downstream of the sequence. Default: 150 |
| | | Name of the data source of RRIs. Will be used for the file names. Default: ‘model_rri’ |
| | | IntaRNA parameter file. Default: file in path_to_cherri_folder/Cherri/rrieval/IntaRNA_param |
| | | Set ‘off’ if you want to disable graph-kernel features. Default: ‘on’ (when set to ‘on’ the feature optimization will be performed directly and the data will be stored in feature_files and no model/feature folder will be created) |
| | | Path to the genomic RBP crosslink or binding site locations (in BED format) |
| | | Time used for the optimization in seconds. Default: 43200 (12h) |
| | | Memory in MB each thread can use (total RAM/threads). Default: 4300 |
| | | Number of jobs used for graph feature computation and model selection. Default: 1 |
| | | Use mixed model to combine different datasets into a combined model. Default: ‘off’ |
| | | Filter the data for hybrids already detected by ChiRA (set ‘on’ to filter). Default: ‘off’ |
| | | Name for the output directory. Default: ‘date_Cherri_evaluating_RRIs’ |
| | | Set a temporary directory for autosklearn. Either provide a path or ‘out’ to set it to the output directory. Default: ‘off’ |
| | | Number of interactions IntaRNA will give, if possible. Default: 5 |
| | | Score threshold for the additional occupied regions [BED]. Default: 10 |
| | | Context to extend left and right for the BED file instances. Default: 5 |
| | | A 5-fold cross-validation of the pipeline will be performed using the training data. Set ‘off’ to skip. Default: ‘on’ |
| | | Number of folds for the cross-validation. Default: 5 |
| | | Methods used for model selection. Default: any |
Output in training mode
At the end of the run the location of the trained model is given.
Throughout the program, several output files are generated inside the output folder (default: date_Cherri_model_build), with the following structure:
├── date_CheRRI_model_build
|   ├── date_occ_out
|   |   ├── occupied_regions.obj
|   |   ├── rri_occupied_regions_overlap_0.3
|   |   ├── rri_occupied_regions_overlapTH_0.3_scoreTH_1.csv
|   ├── pos_neg_data
|   |   ├── {name}_context_{context}_pos_occ__block_ends_40_RRI_dataset.csv
|   |   ├── {name}_context_{context}_pos_occ_neg.csv
|   |   ├── {name}_context_{context}_pos_occ_pos.csv
|   ├── feature_files
|   |   ├── feature_filtered_{name}_context_{context}_pos.csv
|   |   ├── feature_filtered_{name}_context_{context}_neg.csv
|   |   ├── training_data_{name}_context_{context}.npz
|   ├── model
|   |   ├── {name}_context_{context}_fold{0-4}.model
|   |   ├── {name}_context_{context}_fold{0-4}.csv
|   |   ├── full_{name}_context_{context}.model
Run train in mixed model mode
CheRRI is able to build one model based on different datasets. If the mixed parameter is set to ‘on’, CheRRI will connect the training data of the different datasets. However, before running the mixed mode, you need to create the training data for the individual datasets. Below is a theoretical example with DataA, DataB and DataC, which should be trained together. It is best to create a dedicated output folder, e.g. CheRRI_build_model.
Then build your first model:
cherri train -i1 /path/to/CheRRI_build_model/ -r dataA_1.tabular dataA_2.tabular -g human -l human -o ./ -n Data_A -c 150 -st on -t 600 -me 8000 -j 7
If you don’t need the models of the individual datasets, you can either set -t very low or even interrupt the call once the model is built. Next, rename the output folder created by CheRRI to the name you gave to the Data/Model (-n Data_A):
mv /path/to/CheRRI_build_model/<date>_CheRRI_build_model /path/to/CheRRI_build_model/Data_A
Then run the next dataset:
cherri train -i1 /path/to/CheRRI_build_model/ -r dataB_1.tabular dataB_2.tabular -g human -l human -o ./ -n Data_B -c 150 -st on -t 600 -me 8000 -j 7
Next, rename the output folder created by CheRRI to the name you gave to the Data/Model (-n Data_B):
mv /path/to/CheRRI_build_model/<date>_CheRRI_build_model /path/to/CheRRI_build_model/Data_B
Then run the last dataset:
cherri train -i1 /path/to/CheRRI_build_model/ -r dataC_1.tabular dataC_2.tabular -g mouse -l mouse -o ./ -n Data_C -c 150 -st on -t 600 -me 8000 -j 7
And rename the output folder created by CheRRI to the name you gave to the Data/Model (-n Data_C):
mv /path/to/CheRRI_build_model/<date>_CheRRI_build_model /path/to/CheRRI_build_model/Data_C
Finally you can run CheRRI train in the mixed model mode like this:
cherri train -i1 /path/to/CheRRI_build_model/ -r Data_A Data_B Data_C -g /not/needed/ -l /not/needed/ -o /path/to/CheRRI_build_model/ -n Full
This time your replicates are the names of the training datasets you want to connect (Data_A Data_B Data_C).
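The mixed model workflow above repeats the same train-and-rename pattern for every dataset before the final combined call. The following sketch only assembles and prints that sequence of commands so the pattern is easier to see; it does not run anything. The dataset names, replicate files, paths and the `<date>` placeholder are the theoretical ones from the example above, and the helper names `train_command` and `mixed_command` are hypothetical.

```python
BASE = "/path/to/CheRRI_build_model"

# Dataset name -> (replicate files, organism), mirroring the example above.
DATASETS = {
    "Data_A": (["dataA_1.tabular", "dataA_2.tabular"], "human"),
    "Data_B": (["dataB_1.tabular", "dataB_2.tabular"], "human"),
    "Data_C": (["dataC_1.tabular", "dataC_2.tabular"], "mouse"),
}


def train_command(name, replicates, organism):
    """Per-dataset training call, using the flags from the example calls."""
    return ["cherri", "train", "-i1", f"{BASE}/", "-r", *replicates,
            "-g", organism, "-l", organism, "-o", "./", "-n", name,
            "-c", "150", "-st", "on", "-t", "600", "-me", "8000", "-j", "7"]


def mixed_command(names):
    """Final call, where the replicates are the names of the datasets."""
    return ["cherri", "train", "-i1", f"{BASE}/", "-r", *names,
            "-g", "/not/needed/", "-l", "/not/needed/",
            "-o", f"{BASE}/", "-n", "Full"]


steps = []
for name, (replicates, organism) in DATASETS.items():
    steps.append(train_command(name, replicates, organism))
    # Rename the dated output folder to the dataset name.
    steps.append(["mv", f"{BASE}/<date>_CheRRI_build_model", f"{BASE}/{name}"])
steps.append(mixed_command(list(DATASETS)))

for step in steps:
    print(" ".join(step))
```

Adding a further dataset then only requires one more entry in `DATASETS`.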