.. _Basic_concepts:

Basic concepts
==============

Pergola makes it possible to explore and process longitudinal data using genomic tools.
Using pergola brings two main benefits: mature genomic tools become available both for visualizing the data and for processing it.

-------------------------
Input
-------------------------

.. _input-data:

*****************
Input data
*****************

Pergola can process any sequence of temporal events contained in a character-separated file, as in the example below:

::

    id  t_ini  t_end  type    value  description
    1   137    156    type_x  0.06   "type x event"
    1   168    192    type_y  0.02   "type y event"
    1   250    281    type_x  0.07   "type x event"
    2   311    333    type_x  0.08   "type x event"
    2   457    482    type_y  0.02   "type y event"
    2   569    601    type_z  0.03   "type z event"
    ...

A minimal input file must contain at least two columns: one holding the time points and another holding the value assigned to each time point.

.. tip::
    Pergola also accepts an **Excel** file (**xlsx**) as input, provided its first sheet has the same structure as the file shown above:

    .. csv-table::
        :header: "id", "t_ini", "t_end", "type", "value", "description"
        :widths: 2, 2, 2, 3, 2, 10

        1, 137, 156, "type_x", 0.06, "type x event"
        1, 168, 192, "type_y", 0.02, "type y event"
        1, 250, 281, "type_x", 0.07, "type x event"
        2, 311, 333, "type_x", 0.08, "type x event"
        2, 457, 482, "type_y", 0.02, "type y event"
        2, 569, 601, "type_z", 0.03, "type z event"

.. Mappings

.. _pergola-ontology:

*****************
Pergola ontology
*****************

To tell pergola what each field of the input data contains, the user has to map the input fields to a set of defined terms, the pergola ontology.
The **pergola ontology** consists of a controlled vocabulary that defines the content of each field of the input data.
Pergola ontology terms are shown in the table below:

.. =============== ============ =================
.. Term            Mandatory    Definition
.. =============== ============ =================
.. chrom_start     yes          Refers to start time points of each interval of the original data. If “chrom_end” is not set all “chrom_start” should be equidistant and intervals will be set to the delta between time points.
.. data_values     yes          Refers to associated values consider for the representation of data.
.. chrom_end       no           Refers to the end of each time interval.
.. track           no           Refers to each of the experimental entities present in the file.
.. data_types      no           Refers to each of the different features annotated in the file.
.. chrom           no           Refers to different phases of the experiment.
.. dummy           no           All additional fields in the original input data not used by pergola
.. =============== ============ =================

=============== =================
Term            Definition
=============== =================
start           Refers to the start time point of each interval of the original data. If ``end`` is not set, all ``start`` values should be equidistant, and intervals are set to the delta between time points (**mandatory**).
data_value      Refers to the associated values considered for the representation of the data (**mandatory**).
end             Refers to the end of each time interval.
track           Refers to each of the experimental entities present in the file.
data_types      Refers to each of the different features annotated in the file.
chrom           Refers to the different phases of the experiment.
dummy           All additional fields of the original input data that are not used by pergola.
=============== =================

The next section explains how to create a mapping file that uses the pergola ontology to set the equivalence between the fields of the original data and pergola terms.

.. _mapping-file:

*************
Mapping file
*************

The mapping file sets the correspondence between the input data and the terms used by pergola.
It is thus how pergola knows what is encoded in each field of the input data.
Given an input file like the one shown in the :ref:`input data <input-data>` section, a mapping file looks like the following example:

.. :ref:`pergola-ontology`.

::

    ! Mapping of behavioural fields into pergola ontology terms
    !
    ! Any line starting with an exclamation mark is a comment
    input_file:id > pergola:track
    input_file:t_ini > pergola:start
    input_file:t_end > pergola:end
    input_file:type > pergola:data_types
    input_file:value > pergola:data_value
    input_file:description > pergola:dummy

The mapping file uses `the external mapping file format `_ from the `Gene Ontology Consortium `_ to set the correspondence between the input data and the pergola ontology.

.. note::
    The tag ``input_file`` is arbitrary. The external mapping file format requires the left part of each assignment to be tagged, and the mapping file follows this specification. However, you can use any other term to tag your input file. This might change in future pergola versions.

.. comment
.. `GFF `_

-------------------------
Output
-------------------------

Pergola adapts several of the most commonly used formats of the genomics community to encode longitudinal data.
The idea is simple: both types of data are sequential, so the scaffold designed to encode genomic data is relatively easy to adapt to a temporal sequence of events.
This section presents the adapted formats and the purpose each one serves.

.. _discrete-data:

****************
Discrete data
****************

Longitudinal data often take the form of a sequence of irregular discrete events or time intervals, each with associated data such as the type of event or its magnitude.
Genomics provides formats that are especially suitable for encoding this type of data, such as `BED `_ (Browser Extensible Data) or `GFF `_ (General Feature Format).
These two file formats, designed to encode information such as genomic features or annotations, can be adapted to encode discrete temporal events by treating chromosome positions in a genome as analogous to time points in a behavioral trajectory.

.. _continuous-data:

****************
Continuous data
****************

Pergola can calculate parameters such as accumulated or mean values of a quantifiable measure over user-defined time windows.
The `BedGraph `_ file format, used in genomics to store continuous-valued data such as probability scores or transcriptome data, provides a suitable structure for pergola calculations and for any other type of score derived from the analysis of your input data.

.. _reference-data:

****************
Reference data
****************

Some genomics tools, such as several genome browsers, need a reference sequence (a genome) against which the data sequences are aligned.
For this reason we adapted the `FASTA `_ file format to enable the use of this type of tool with pergola-processed data.

.. TODO mention the cytoband file used for display periods of time relevant for
.. whatever reason, maybe the more intuitive example (signal) might be days and
.. nights in data that could follow a circadian rhythm.

.. provide a way to define irregular intervals and associated values. Users can select these formats
.. to encode the duration of behavioral bouts (for example, feeding, grooming or activity) and also their magnitude,
.. or to store additional environmental information (for example, contextual cues or light-dark cycles).

.. So for instance if we take again the file as the show in the `Data input`_ section

The file formats adapted as pergola output are summarized in the following table:

+--------------+----------------+
| Type of data | Format         |
+==============+================+
| Discrete     | `BED `_        |
+              +----------------+
|              | `GFF `_        |
+--------------+----------------+
| Continuous   | `BedGraph `_   |
+--------------+----------------+
| Reference    | `FASTA `_      |
+--------------+----------------+

.. tip::
    You can open the specification of each of these file formats by clicking on the format names in the table.

.. -------------------------
.. Operations
.. -------------------------

.. TODO Add point to the documentation explaining how to use visualization and exploration of genomic tools
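
As a rough illustration of how a mapping file can drive the conversion of input rows into a discrete, BED-like record, the sketch below parses the mapping example from the mapping file section and renders three rows of the input example.
It is not pergola's implementation: the helper names ``parse_mapping`` and ``to_bed_lines``, the tab separator, the ``chr1`` placeholder and the column order (chrom, start, end, name, value) are assumptions made for this example, and the sketch expects ``end`` and ``data_types`` to be mapped in addition to the two mandatory terms.

.. code-block:: python

    import csv
    import io

    # Mapping lines copied from the mapping file example in this page.
    MAPPING_TEXT = "\n".join([
        "! Mapping of behavioural fields into pergola ontology terms",
        "!",
        "! Any line starting with an exclamation mark is a comment",
        "input_file:id > pergola:track",
        "input_file:t_ini > pergola:start",
        "input_file:t_end > pergola:end",
        "input_file:type > pergola:data_types",
        "input_file:value > pergola:data_value",
        "input_file:description > pergola:dummy",
    ])

    # Three rows of the input example; the tab separator is an assumption.
    INPUT_TEXT = "\n".join([
        "id\tt_ini\tt_end\ttype\tvalue\tdescription",
        "1\t137\t156\ttype_x\t0.06\ttype x event",
        "1\t168\t192\ttype_y\t0.02\ttype y event",
        "2\t311\t333\ttype_x\t0.08\ttype x event",
    ])


    def parse_mapping(text):
        """Return {input field: pergola term} (hypothetical helper)."""
        mapping = {}
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("!"):
                continue  # skip blank lines and comments
            left, right = (part.strip() for part in line.split(">"))
            # Drop the "input_file:" and "pergola:" tags.
            mapping[left.split(":", 1)[1]] = right.split(":", 1)[1]
        return mapping


    def to_bed_lines(input_text, mapping, chrom="chr1"):
        """Render each input row as a tab-separated, BED-like line."""
        term_to_field = {term: field for field, term in mapping.items()}
        for term in ("start", "data_value"):
            if term not in term_to_field:
                raise ValueError("mandatory term not mapped: " + term)
        reader = csv.DictReader(io.StringIO(input_text), delimiter="\t")
        lines = []
        for row in reader:
            lines.append("\t".join([
                chrom,  # placeholder for the experimental phase
                row[term_to_field["start"]],
                row[term_to_field["end"]],
                row[term_to_field["data_types"]],
                row[term_to_field["data_value"]],
            ]))
        return lines


    mapping = parse_mapping(MAPPING_TEXT)
    bed_lines = to_bed_lines(INPUT_TEXT, mapping)
    print("\n".join(bed_lines))

Time points take the place of chromosome positions and the event type takes the place of the feature name, which is the analogy described in the discrete data section. Note that a strict BED score column is an integer, so the raw value is kept here only for illustration.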