Basic concepts

Pergola allows to explore and process longitudinal data using genomic tools.

Two main benefits derive from using pergola.

Visualization and data processing mature genomic tools can be used.

Input

Input data

Pergola can process any sequence of temporal events contained in a character-separated file as in the example below:

id    t_ini   t_end type      value   description
1     137     156     type_x  0.06    "type x event"
1     168     192     type_y  0.02    "type y event"
1     250     281     type_x  0.07    "type x event"
2     311     333     type_x  0.08    "type x event"
2     457     482     type_y  0.02    "type y event"
2     569     601     type_z  0.03    "type z event"
...

The minimal input file must contain at least two columns. One of these columns should correspond to time points and the second one to any value assigned to each of the time points.

Tip

Pergola also takes as input an excel file (xlsx) with the same structure as the file shown above in the first sheet of the file, as shown below:

id t_ini t_end type value description
1 137 156 type_x 0.06 type x event
1 168 192 type_y 0.02 type y event
1 250 281 type_x 0.07 type x event
2 311 333 type_x 0.08 type x event
2 457 482 type_y 0.02 type y event
2 569 601 type_z 0.03 type z event

Pergola ontology

In order to specify to pergola the content of each of the fields of the input data, user has to map the input fields to a set of defined terms or pergola ontology.

The pergola ontology consists on a set of controlled terms or vocabulary to define the content of each of the input fields inside the input data.

Pergola ontology terms are shown in the table below:

Term Definition
start Refers to start time points of each interval of the original data. If “chrom_end” is not set all “chrom_start” should be equidistant and intervals will be set to the delta between time points (mandatory).
data_value Refers to associated values consider for the representation of data (mandatory).
end Refers to the end of each time interval.
track Refers to each of the experimental entities present in the file.
data_types Refers to each of the different features annotated in the file.
chrom Refers to different phases of the experiment.
dummy All additional fields in the original input data not used by pergola

To see how to create a mapping file using pergola ontology to set the equivalence between input terms in the original data and Pergola output terms, read next section.

Mapping file

The mapping file sets the correspondence between the input data and the terms used by pergola. It is thus the way pergola knows what is encoded in each of the fields of the input data. Provided we have an input file as the show in the input data section, a mapping file looks like the following example:

! Mapping of behavioural fields into pergola ontology terms
!
! Any line starting with an exclamation mark is a comment

input_file:id > pergola:track
input_file:t_ini > pergola:start
input_file:t_end > pergola:end
input_file:type > pergola:data_types
input_file:value > pergola:data_value
input_file:description > pergola:dummy

Mapping file uses the external mapping file format from the Gene Ontology Consortium to set the correspondence between the input data and the pergola ontology.

Note

The reserved term input_file is arbitrary. The external mapping file format requires to tag the left part of the assignment and this way the mapping file follows this specification. However, you can use any other term to tag your input file. This might be changed in following pergola versions.

Output

Pergola adapted several of the more commonly used formats of the genomics community to encode longitudinal data. The idea is very simple, both types of data are sequential and thus, it is relatively easy to adapt the scaffold thought to encode genomic data to encode a temporal sequence of events. In this section we present the formats we adapted and for which purpose they can be used.

Discrete data

Longitudinal data many times presents the form of a sequence of irregular discrete events or time intervals with a series of associated data such as the type of event or the magnitude. Genomics provides formats that are specially suitable to encode this type of data such as the BED (Browser Extensible Format) or the GFF (General Feature Format) formats. These two file formats designed to encode information such as genomic features or genomic annotations can be adapted to encode discrete temporal events by regarding at chromosome positions in a genome as analogous to time points in a behavioral trajectory.

Continuous data

Pergola enables the calculation of parameters such as accumulated or mean values of a quantifiable measure over user-defined time windows. The BedGraph file format used in genomics to store continuous-valued data such as probability scores or transcriptome data, provides a perfect structure for Pergola calculations or for whatever type of scores derived from the analysis of your input data.

Reference data

Some types of genomics tools, such as several genome browsers, need a reference sequence in order to align the data sequences to this reference (genome). For this reason we adapted the FASTA file format to enable the use of these type of tools with pergola processed data.

File formats adapted as pergola output are summarized in the following table:

Type of data Format
Discrete BED
GFF
Continuous BedGraph
Reference FASTA

Tip

You can go to the specifications of each of these data file formats clicking on the file names in the table.