Tutorials

Transcriptome reference generation

The first steps of cerberus are to generate a reference set of transcript start sites (TSSs), intron chains (ICs), and transcript end sites (TESs). TSSs and TESs can either be added from GTFs or BED files. ICs can be added from GTFs.

_images/cerberus_pipeline_ref_gen.png

Extracting triplet features from GTFs

To generate BED files for the TSSs and TESs from a GTF file, use cerberus gtf_to_bed.

Usage: cerberus gtf_to_bed [OPTIONS]

Options:
  --gtf TEXT       GTF file  [required]
  --mode TEXT      Choose tss or tes  [required]
  -o TEXT          Output file name  [required]

  --dist INTEGER   Distance (bp) to extend regions on either side  [default:
                   50]

  --slack INTEGER  Distance allowable for merging regions  [default: 50]
  --help           Show this message and exit.

Output file format:

Chromosome

Start

Stop

Name

Score

Strand

ThickStart

chr1

169804296

169804436

ENSG00000000460_2

.

cerberus

chr1

169807786

169807887

ENSG00000000460_4

.

cerberus

  • The ThickStart column is used to indicate the source of the data

To generate tab-separated files for the ICs from a GTF file, use cerberus gtf_to_ics.

Usage: cerberus gtf_to_ics [OPTIONS]


Options:
  --gtf TEXT  GTF file  [required]
  -o TEXT     Output file name  [required]
  --help      Show this message and exit.

Output file format:

Chromosome

Strand

Coordinates

Name

chr1

ENSG00000000460_1

chr1

54-87

ENSG00000000460_2

  • Coordinates is a --separated list of the internal splice sites used

  • Monoexonic transcripts are assigned the coordinate -

Aggregating TSSs and TESs across multiple sources

Cerberus can aggregate information to support the existence or creation of TSS or TES regions across multiple bed files. To perform this aggregation, use cerberus agg_ends.

Usage: cerberus agg_ends [OPTIONS]

Options:
  --input TEXT     Path to config file. Each line contains file path, whether
                   to add ends (True / False), whether to use as a reference,
                   source name  [required]

  --mode TEXT      Choose tss or tes  [required]
  --slack INTEGER  Distance (bp) allowable for merging regions  [default: 20]
  -o TEXT          Output file name  [required]
  --help           Show this message and exit.

Notes on behavior:

  • TSSs / TESs will be numbered from 1…n for each gene based on

    • MANE status, appris_principal status, and basic_set status of TSS / TES (if tags are present in GTF used to call regions)

    • Order of BED files in config file (earlier BED files will yield lower numbers)

  • BED files without gene IDs cannot be used to initialize new TSS / TES regions

  • After a region is initialized, the boundaries are fixed. Regions that are considered thereafter are simply added as additional forms of support for the already-initialized regions. The reasoning behind this is to avoid “super regions” that grow as new data is added.

Input BED file format:

  • Generally, BED files follow the BED file specifications

  • If you have gene ids in your BED file, they must be in the name field (4th column), and must be _ separated with additional text that makes the name unique. Additionally, the thickStart (7th) column must contain the string cerberus to indicate to Cerberus how to parse the gene ids. This is how the files output from cerberus gtf_to_bed are automatically formatted.

Chromosome

Start

Stop

Name

Score

Strand

ThickStart

chr1

169804296

169804436

ENSG00000000460_2

.

cerberus

chr1

169807786

169807887

ENSG00000000460_4

.

cerberus

  • The ThickStart column is used to indicate the source of the data

Input config file format:

  • Comma-separated

  • No header

  • First BED must have True add ends setting

  • BED files with True add ends setting must have gene ids and strand

  • Recommended to use True add ends setting for GTFs that you are planning to annotate

BED file path

Reference

Add ends

Source name

path/to/bed/v40_tss.bed

True

True

v40

path/to/bed/encode_tss.bed

False

True

encode

Output file format:

Chromosome

Start

Stop

Name

Score

Strand

Sources

Novelty

chr1

169794989

169795129

ENSG00000000460_1

.

v40,v29,encode

Known

chr1

300

400

ENSG00000000460_2

.

encode

Novel

  • The ThickStart column is used to indicate the sources, defined in the config file, for each region. Sources are comma-separated if there was more than one form of support for the region.

  • The ThickEnd column is used to indicate the novelty of the region. Regions not supported by one of the sources used as references will be Novel, others will be Known.

Aggregating ICs across multiple sources

Cerberus can aggregate information to support the existence or creation of intron chains across multiple datasets. To perform this aggregation, use cerberus agg_ics.

Usage: cerberus agg_ics [OPTIONS]

Options:

  --input TEXT  Path to config file. Each line contains file path, whether to
                use as a reference, source name  [required]

  -o TEXT       Output file name  [required]
  --help        Show this message and exit.

Notes on behavior:

  • ICs will be numbered from 1…n for each gene based on

    • MANE status, appris_principal status, and basic_set status of IC (if tags are present in GTF used to call intron chains)

    • Order of IC TSV files in config file (earlier TSV files will yield lower numbers)

Input IC file format:

  • Output from cerberus gtf_to_ics.

Chromosome

Strand

Coordinates

Name

chr1

ENSG00000000460_1

chr1

54-87

ENSG00000000460_2

  • Coordinates is a --separated list of the internal splice sites used

  • Monoexonic transcripts are assigned the coordinate -

Input config file format:

  • Comma-separated

  • No header

  • BED files with True add ends setting must have gene ids and strand

  • Recommended to use True add ends setting for GTFs that you are planning to annotate

Output file format:

Chromosome

Strand

Coordinates

Name

source

novelty

chr1

ENSG00000000460_1

v40,v29

Known

chr1

54-87

ENSG00000000460_2

encode

NIC

  • Coordinates is a --separated list of the internal splice sites used

  • Monoexonic transcripts are assigned the coordinate -

  • Sources are comma-separated if there was more than one form of support for the intron chain.

  • Novelty of the intron chain is determined with respect to the sources used as references. Those supported are Known. Others are assigned a SQANTI novelty category (ISM, NIC, NNC, and Unspliced).

Generating a cerberus reference in h5 format

The final step of reference generation is to write an h5 representation of the TSSs, ICs, and TESs present in the input data as a series of tables in h5 format.

Usage: cerberus write_reference [OPTIONS]

Options:

  --tss TEXT  TSS bed file output from `agg_ends`  [required]
  --tes TEXT  TES bed file output from `agg_ends`  [required]
  --ics TEXT  IC tsv file output from `agg_ics`  [required]
  -o TEXT     Output .h5 file name
  --help      Show this message and exit.

Input TSS / TES file format:

  • Output from cerberus agg_ends.

Chromosome

Start

Stop

Name

Score

Strand

Sources

Novelty

chr1

169794989

169795129

ENSG00000000460_1

.

v40,v29,encode

Known

chr1

300

400

ENSG00000000460_2

.

encode

Novel

  • The ThickStart column is used to indicate the sources, defined in the config file, for each region. Sources are comma-separated if there was more than one form of support for the region.

  • The ThickEnd column is used to indicate the novelty of the region. Regions not supported by one of the sources used as references will be Novel, others will be Known.

Input IC file format:

  • Output from cerberus agg_ics.

Chromosome

Strand

Coordinates

Name

source

novelty

chr1

ENSG00000000460_1

v40,v29

Known

chr1

54-87

ENSG00000000460_2

encode

NIC

  • Coordinates is a --separated list of the internal splice sites used

  • Monoexonic transcripts are assigned the coordinate -

  • Sources are comma-separated if there was more than one form of support for the intron chain.

  • Novelty of the intron chain is determined with respect to the sources used as references. Those supported are Known. Others are assigned a SQANTI novelty category (ISM, NIC, NNC, and Unspliced).

Output cerberus reference h5 format:

  • CerberusAnnotation() object saved in h5 format.

  • Can be read in in Python using

import cerberus
ca = cerberus.read(<ref name>)

TSS / TES regions accessible using:

ca.tss
ca.tes

Chromosome

Start

End

Strand

Name

novelty

source

gene_id

tss/tes

chr1

169794989

169795129

ENSG00000000460_1

v40,v29

Known

ENSG00000000460

1

chr1

169795358

169795459

ENSG00000000460_2

encode

Novel

ENSG00000000460

2

  • BED-inspired with a few added columns

Because end matching is less precise than intron chain matching, the Cerberus reference h5 also store the mapping between each input TSS / TES and what region in the reference (ie from ca.tss or ca.tes), if any, the input region matched too.

These maps are stored in:

ca.tss_map
ca.tes_map

Chromosome

Start

End

Strand

source

Name

chr1

169794989

169795129

v40

ENSG00000000460_1

chr1

169795358

169795459

encode

ENSG00000000460_2

  • Each BED entry from BED files input to cerberus agg_ends will have an entry here.

  • Original coordinates are reported in the Chromosome, Start, End, and Strand columns.

  • Source column has source name as defined in cerberus agg_ends

  • Name column includes the final ID for the Cerberus end region that is in the TSS or TES table.

  • If the region was not included in the final reference, the Name column will be NaN.

Intron chains accessible using:

ca.ic

Chromosome

Strand

Coordinates

Name

source

novelty

gene_id

ic

chr1

ENSG00000000460_1

v40,v29

Known

ENSG0000000046

1

chr1

54-87

ENSG00000000460_2

encode

NIC

ENSG0000000046

2

Transcriptome annnotation

After generating an h5 Cerverys reference of TSSs, ICs, and TESs, you can now annotate any transcriptome in GTF format using the features present in your Cerberus reference; generating transcript triplets for each transcript in your transcriptome. Multiple transcriptomes can be annotated using the same Cerberus reference. You can then use this mapping to add the Cerberus transcript triplets to preexisting GTF files and abundance matrices.

_images/cerberus_pipeline_annotate_transcriptome.png

Assigning transcript triplets to a transcriptome

Assign each transcript in a GTF file a Cerberus TSS, IC, and TES. You can annotate multiple transcriptomes using the same Cerberus reference.

Note: It is recommended that for each transcriptome you annotate, that you have extracted TSSs, ICs, and TESs from the transcriptome’s GTF and added them to the Cerberus reference. Otherwise you run the risk of encountering triplet features that are unannotated in the Cerberus reference.

cerberus annotate_transcriptome \
            --gtf {gtf} \
            --h5 {cerberus ref} \
            --source {source name} \
            -o {cerberus annot}

Output file format:

Update counts matrix with Cerberus IDs

Replace transcript IDs from your transcriptome in an abundance matrix with cerberus isoform triplets.

cerberus replace_ab_ids \
            --h5 {annot} \
            --ab {ab} \
            --source {source} \
            --collapse \
            -o {update_ab}

Notes on behavior:

  • Some transcripts from a transcriptome may be assigned the same triplet. Using the --collapse flag will deduplicate these in the abundance file and sum up the counts across each transcript with the same isoform triplet.

Input abundance file format:

  • Currently cerberus only works with TALON abundance matrices, but we want to support other formats in the future.

Update GTF with Cerberus IDs

Replace transcript IDs from your transcriptome in a GTF with cerberus isoform triplets.

cerberus replace_gtf_ids \
            --h5 {annot} \
            --gtf {gtf} \
            --source {source} \
            --update_ends \
            --collapse \
            -o {update_gtf}

Notes on behavior:

  • Some transcripts from a transcriptome may be assigned the same triplet. Using the --collapse flag will deduplicate these in the abundance file and sum up the counts across each transcript with the same isoform triplet.

  • Because cerberus uses regions to represent TSSs and TESs rather than single-bp coordinates, the ends of each of your transcripts are not representable in GTF format. We circumvent this by choosing the furthest upstream boundary of the region for the TSS and the furthest downstream boundary for the TES to report in the GTF when the --update_ends option is used. If it is not, the original ends from your transcriptome will be used.

Calculating gene triplets

_images/gene_triplets.png

After you’ve generated your final Cerberus annotation h5 file for as many transcriptomes as you wish with cerberus annotate_transcriptome, you can now start to calculate gene triplets as a metric of measuring transcriptional diversity. There are a few different ways of doing this.

Calculating gene triplets from Cerberus sources

The most simple gene triplet is calculated from all of the isoforms present in a given transcriptome source. You can calculate and add these triplets to your CerberusAnnotation using the cerberus.get_source_triplets() function.

In Python, import cerberus, read in your CerberusAnnotation (output from cerberus annotate_transcriptome), compute source triplets, and add them to your CerberusAnnotation:

import cerberus

ca = cerberus.read('cerberus_annot.h5')
df = ca.get_source_triplets()
ca.add_triplets(df)

After adding these source triplets, you can save your updated CerberusAnnotation in h5 format. The triplets will be saved in the ca.triplets DataFrame slot.

ca.write('cerberus_annot_triplets.h5')

Calculating gene triplets from a list of transcriptomes

You may want to filter the isoforms used to calculate your gene triplets. For example, maybe you want to only consider isoforms that are expressed above a certain TPM level. To compute gene triplets from each gene from any arbitrary list of isoform triplets, use cerberus.get_subset_triplets():

import cerberus
ca = cerberus.read('cerberus_annot.h5')
source_name = 'highly_expressed'
tids = ['Gene[1,1,1]', 'Gene[1,2,1]']
df = ca.get_subset_triplets(tids, source_name)
ca.add_triplets(df)

Calculating gene triplets based on transcripts expressed in a sample

You also may want to compute gene triplets based on a the set of isoforms expressed in a given sample. For this analysis, we use Swan to store our sample metadata and expression information:

import swan_vis as swan

sg = swan.SwanGraph()
sg.add_annotation(input.annot)
sg.add_transcriptome(input.gtf, include_isms=True)
sg.add_abundance(input.ab)
sg.add_abundance(input.gene_ab, how='gene')
sg.add_metadata(input.meta)

sg.save_graph(output.prefix)

We can use the expression values and metadata stored in the SwanGraph to determine which isoforms are expressed in each sample, and use cerberus.get_expressed_triplets() to calculate these values:

import swan_vis as swan
import cerberus

sg = swan.read(swan_file)
ca = cerberus.read('cerberus_annot.h5')

source_name = 'sample_isos'
metadata_column = 'sample'
min_tpm = 1
df = ca.get_expressed_triplets(sg,
                               obs_col=metadata_column,
                               min_tpm=min_tpm,
                               source=source_name)
ca.add_triplets(df)

Calculating gene triplets directly from a GTF

In cases where you already have a finalized transcriptome and don’t wish to use the Cerberus tools to generate your transcriptome, you have the option to generate summary gene triplets directly from a GTF using a command line tool.

cerberus gtf_to_triplets

Usage: cerberus gtf_to_triplets [OPTIONS]

Options:
  --gtf TEXT            GTF file to replace ids in  [required]
  --source TEXT         name of source in cerberus object to map from
                        [required]

  --gene_id_col TEXT    Attribute name in GTF w/ gene id
  --gene_name_col TEXT  Attribute name in GTF w/ gene name
  -o TEXT               Output file name  [required]
  --help                Show this message and exit.

The output file will be a CerberusAnnotation h5 object that is directly compatible with Cereberus’ downstream plotting functions.

Calculating gene triplets from custom subset files

In the cases where the above gene triplet calculation functions don’t fulfill your exact needs, cerberus supports adding triplets for individual samples using a tab-separated isoform subset file. Coming soon