DataStructures - Alignment

This document contains information about the data structures used in the TRIC algorithm.

  • Run contains all data pertaining to an LC-MS/MS run, particularly references to measured precursors

  • PrecursorGroup represents a set of precursors (e.g. precursors deriving from the same peptide sequence but identified by different charge states and isotopic labelling); see also CyPrecursorGroup for a Cython implementation

  • A Precursor represents a single precursor (e.g. a single measured analyte with a precursor m/z identified by its chemical formula, charge state and isotopic labelling)
  • A peak group represents a single RT region in the chromatogram of a single Precursor
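
The nesting described in the list above can be pictured with a minimal sketch. The dataclasses below are hypothetical stand-ins (the `...Sketch` names and their fields are assumptions made for illustration, not the classes of msproteomicstoolslib); they only show how a run contains precursor groups, which contain precursors, which contain peak groups.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PeakGroupSketch:
    """One retention-time region in the chromatogram of a single precursor."""
    feature_id: str
    fdr_score: float
    normalized_rt: float


@dataclass
class PrecursorSketch:
    """One measured analyte (sequence, charge state, isotopic label)."""
    precursor_id: str
    peakgroups: List[PeakGroupSketch] = field(default_factory=list)


@dataclass
class PrecursorGroupSketch:
    """Precursors derived from the same peptide sequence (e.g. heavy/light)."""
    peptide_group_label: str
    precursors: List[PrecursorSketch] = field(default_factory=list)


@dataclass
class RunSketch:
    """One LC-MS/MS run, holding its precursor groups keyed by label."""
    run_id: str
    precursor_groups: Dict[str, PrecursorGroupSketch] = field(default_factory=dict)


# Illustrative data only: a light and a heavy form of the same peptide.
light = PrecursorSketch("PEPTIDEK/2", [PeakGroupSketch("f1", 0.01, 1200.5)])
heavy = PrecursorSketch("PEPTIDEK[+8]/2", [PeakGroupSketch("f2", 0.02, 1200.9)])

run = RunSketch("run_0")
run.precursor_groups["PEPTIDEK"] = PrecursorGroupSketch("PEPTIDEK", [light, heavy])

# Walk the hierarchy from the run down to the individual peak groups.
n_peakgroups = sum(
    len(precursor.peakgroups)
    for group in run.precursor_groups.values()
    for precursor in group.precursors
)
```

The library classes documented below carry more state (headers, filenames, cluster ids); the sketch keeps only the containment relations.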

Run Module

Run

class msproteomicstoolslib.data_structures.Run.Run(header, header_dict, runid, orig_input_filename=None, filename=None, aligned_filename=None, useCython=False)

A run contains references to identified precursor groups and precursors.

The run stores references to the precursor groups (heavy/light pairs) identified in it. It has a unique id and stores the headers from the CSV file.

A run has the following attributes:
  • an identifier that is unique to this run
  • a filename where it originally came from
  • a dictionary of precursor groups, which are accessible through the following functions:
      - getPrecursorGroup
      - hasPrecursor
      - getPrecursor
      - addPrecursor
Parameters:
  • header (str) – Run header
  • header_dict (dict) – Run header dictionary
  • runid (str) – Unique identifier of the run
  • orig_input_filename (str) – Original filename of the csv file
  • filename (str) – Original filename of the mzML (e.g. the column “filename”)
  • aligned_filename (str) – Aligned filename (e.g. the column “align_origfilename”)
addPrecursor(precursor, peptide_group_label)

Add a new precursor to the run using a specific peptide label.

If the corresponding precursor group does not yet exist, a new precursor group is created. Otherwise the precursor is added to the precursor group.

Parameters:
  • precursor (CyPrecursor, Precursor or GeneralPrecursor) – Precursor to be added (e.g. PEPT[+98]IDE/2)
  • peptide_group_label (str) – Label of the corresponding peptide group (e.g. PEPTIDE)
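
The get-or-create behaviour described for addPrecursor can be sketched as follows. This is a simplified illustration under stated assumptions: `RunGroupingSketch`, its snake_case method names and the nested-dictionary layout are hypothetical and are not taken from the library source.

```python
class RunGroupingSketch:
    """Hypothetical stand-in showing how precursors are grouped by label."""

    def __init__(self):
        # peptide_group_label -> {precursor id -> precursor object}
        self._groups = {}

    def add_precursor(self, precursor_id, precursor, peptide_group_label):
        # Create the group on first use, otherwise extend the existing one.
        group = self._groups.setdefault(peptide_group_label, {})
        group[precursor_id] = precursor

    def has_precursor(self, peptide_group_label, precursor_id):
        return precursor_id in self._groups.get(peptide_group_label, {})

    def get_precursor(self, peptide_group_label, precursor_id):
        # Returns None when either the group or the precursor is unknown.
        return self._groups.get(peptide_group_label, {}).get(precursor_id)


sketch_run = RunGroupingSketch()
# The first call creates the group "PEPTIDE", the second call reuses it.
sketch_run.add_precursor("PEPTIDE/2", {"charge": 2}, "PEPTIDE")
sketch_run.add_precursor("PEPT[+98]IDE/2", {"charge": 2}, "PEPTIDE")
```

Both precursors end up in the single group labelled PEPTIDE, mirroring the example values given in the parameter list.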
getPrecursor(peptide_group_label, trgr_id)

Return the precursor corresponding to the given peptide group label and transition group id.

getPrecursorGroup(curr_id)
get_aligned_filename()
get_best_peaks()

Return the best peakgroup for each peptide precursor

get_best_peaks_with_cutoff(cutoff)

Return the best peak per run (with cutoff)

get_id()
get_openswath_filename()
get_original_filename()
hasPrecursor(peptide_group_label, trgr_id)

PrecursorGroup Module

PrecursorGroup

class msproteomicstoolslib.data_structures.PrecursorGroup.PrecursorGroup(peptide_group_label, run)

Bases: object

A set of precursors that are isotopically modified versions or different charge states of each other.

A collection of precursors that are isotopically modified versions or different charge states of the same underlying peptide sequence. Generally these are heavy/light forms. This class groups these Precursors together.

- self.peptide_group_label_

Identifier of the precursor group

- self.run_

Reference to the Run where this PrecursorGroup is from

- self.precursors_

List of actual precursors

addPrecursor(self, precursor)

Add precursor to peptide group

getAllPeakgroups(self)

Generator of all peakgroups attached to the precursors in this group

getAllPrecursors(self)

Return a list of all precursors in this precursor group

getOverallBestPeakgroup(self)

Get the best peakgroup (by fdr score) of all precursors contained in this precursor group
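
As an illustration of selecting the best peakgroup across a group, the sketch below assumes that "best" means the lowest FDR score and represents each peakgroup as a plain `(feature_id, fdr_score)` tuple. Both the helper name `overall_best_peakgroup` and this data layout are assumptions made for the example.

```python
def overall_best_peakgroup(precursors):
    """Return the peakgroup tuple with the lowest FDR score, or None.

    `precursors` maps a precursor id to a list of (feature_id, fdr_score)
    tuples; this layout is hypothetical and chosen for brevity.
    """
    all_peakgroups = (pg for pgs in precursors.values() for pg in pgs)
    return min(all_peakgroups, key=lambda pg: pg[1], default=None)


group = {
    "PEPTIDEK/2": [("f1", 0.05), ("f2", 0.001)],
    "PEPTIDEK[+8]/2": [("f3", 0.01)],
}
best = overall_best_peakgroup(group)
```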

getPeptideGroupLabel(self)

Get peptide group label

getPrecursor(self, curr_id)

Get the precursor for the given transition group id

get_decoy()

Whether the current peptide is a decoy or not

Returns: decoy – Whether the peptide is a decoy or not
Return type: bool
peptide_group_label_
precursors_
run_

Precursor Module

PrecursorBase

class msproteomicstoolslib.data_structures.Precursor.PrecursorBase(this_id, run)

Bases: object

find_closest_in_iRT(delta_assay_rt)
get_all_peakgroups()
get_best_peakgroup()
get_decoy()
get_id()
get_selected_peakgroup()
select_pg(this_id)
set_decoy(decoy)
unselect_pg(id)

GeneralPrecursor

class msproteomicstoolslib.data_structures.Precursor.GeneralPrecursor(this_id, run)

Bases: msproteomicstoolslib.data_structures.Precursor.PrecursorBase

A set of peakgroups that belong to the same precursor in a single run.

== Implementation details ==

This is a plain implementation in which all peakgroup objects are stored in a simple list. It is not very efficient, since many objects need to be created, which takes a lot of memory in Python.

add_peakgroup(peakgroup)
append(transitiongroup)
find_closest_in_iRT(delta_assay_rt)
getProteinName()
getRun()
getRunId()
getSequence()
get_all_peakgroups()
get_best_peakgroup()

Return the best peakgroup according to fdr score

get_run_id()
get_selected_peakgroup()
id
peakgroups
precursor_group
protein_name
run
sequence
setProteinName(p)
setSequence(s)
set_precursor_group(p)

Precursor

class msproteomicstoolslib.data_structures.Precursor.Precursor(this_id, run)

Bases: msproteomicstoolslib.data_structures.Precursor.PrecursorBase

A set of peakgroups that belong to the same precursor in a single run.

Each precursor has a back-reference to the precursor group (heavy/light pair) and to the run it belongs to, and it stores its amino acid sequence. Furthermore, a unique id for the precursor and the protein name are stored.

A precursor can return its best transition group, its selected peakgroup, or the transition group that is closest to a given iRT time. Its id is the transition_group_id (i.e. the id of the chromatogram).

The “selected” peakgroup is represented by the peakgroup that belongs to cluster number 1 (cluster_id == 1) which in this case is “special”.

== Implementation details ==

For memory reasons, we store all information about the peakgroup in a tuple (immutable). This tuple contains a unique feature id, a score and a retention time. Additionally, we store the cluster to which the peakgroup belongs (if the user sets this).

A precursor has the following attributes:
  • an identifier that is unique among all other precursors
  • a set of peakgroups
  • a back-reference to the run it belongs to
add_peakgroup_tpl(pg_tuple, tpl_id, cluster_id=-1)

Adds a peakgroup to this precursor.

The peakgroup should be a tuple of length 4 (or 5 with the optional d_score) with the following components, listed by tuple index:
  0. id
  1. quality score (FDR)
  2. retention time (normalized)
  3. intensity
  4. d_score (optional)
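
The tuple-based storage and the "cluster 1 means selected" convention can be sketched as below. `TuplePrecursorSketch` is a hypothetical, simplified stand-in: the parallel list of cluster ids and the interpretation of `-1` as "no cluster assigned" are assumptions made for illustration and are not confirmed by this document.

```python
class TuplePrecursorSketch:
    """Hypothetical stand-in for a precursor that stores peakgroups as tuples."""

    def __init__(self, precursor_id):
        self.precursor_id = precursor_id
        self.peakgroups_ = []   # tuples: (id, fdr, rt, intensity[, d_score])
        self.cluster_ids_ = []  # parallel list; -1 is assumed to mean "none"

    def add_peakgroup_tpl(self, pg_tuple, cluster_id=-1):
        if len(pg_tuple) not in (4, 5):
            raise ValueError("expected (id, fdr, rt, intensity[, d_score])")
        self.peakgroups_.append(tuple(pg_tuple))
        self.cluster_ids_.append(cluster_id)

    def select_pg(self, feature_id):
        # Selecting a peakgroup assigns it to cluster 1.
        for index, pg in enumerate(self.peakgroups_):
            if pg[0] == feature_id:
                self.cluster_ids_[index] = 1

    def get_selected_peakgroup(self):
        for pg, cluster_id in zip(self.peakgroups_, self.cluster_ids_):
            if cluster_id == 1:
                return pg
        return None

    def find_closest_in_irt(self, target_rt):
        # Index 2 of each tuple holds the normalized retention time.
        return min(self.peakgroups_, key=lambda pg: abs(pg[2] - target_rt))


precursor = TuplePrecursorSketch("PEPTIDEK/2")
precursor.add_peakgroup_tpl(("f1", 0.20, 1180.0, 5.0e4))
precursor.add_peakgroup_tpl(("f2", 0.01, 1201.3, 8.2e5))
precursor.select_pg("f2")
selected = precursor.get_selected_peakgroup()
closest = precursor.find_closest_in_irt(1185.0)
```

Keeping the cluster ids outside the tuples is what allows the tuples themselves to stay immutable while the selection state changes.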

cluster_ids_
find_closest_in_iRT(delta_assay_rt)
getAllPeakgroups()
getClusteredPeakgroups()
getPrecursorGroup()
getProteinName()
getRun()
getRunId()
getSequence()
get_all_peakgroups()
get_best_peakgroup()
get_id()
get_run_id()
get_selected_peakgroup()
id
peakgroups_
precursor_group
protein_name
run
select_pg(this_id)
sequence
setClusterID(this_id, cl_id)
setProteinName(p)
setSequence(s)
set_precursor_group(p)
unselect_all()
unselect_pg(this_id)

PeakGroup Module

PeakGroupBase

class msproteomicstoolslib.data_structures.PeakGroup.PeakGroupBase

Bases: object

A single peakgroup that is defined by a retention time in a chromatogram of multiple transitions. Additionally, it has an fdr_score and an aligned RT (i.e. the retention time in normalized space). A peakgroup can be selected for quantification or not (this is stored as having cluster_id == 1).

For each precursor, there can be multiple clusters of peakgroups, with the first (or best) one generally being in cluster 1, therefore we store a cluster id. Generally, an alignment algorithm will assign a cluster id to zero, one or more peakgroups of each precursor.

cluster_id_
fdr_score
get_cluster_id()
get_fdr_score()
get_feature_id()
get_intensity()
get_normalized_retentiontime()
get_value(value)
id_
intensity_
is_selected()
normalized_retentiontime
select_this_peakgroup()
set_fdr_score(fdr_score)
set_feature_id(id_)
set_intensity(intensity)
set_normalized_retentiontime(normalized_retentiontime)
set_value(key, value)

MinimalPeakGroup

class msproteomicstoolslib.data_structures.PeakGroup.MinimalPeakGroup(unique_id, fdr_score, assay_rt, selected, cluster_id, peptide, intensity=None, dscore=None)

Bases: msproteomicstoolslib.data_structures.PeakGroup.PeakGroupBase

See PeakGroupBase for a detailed description.

This implementation is designed to be immutable as the actual data is stored in the Precursor class which generates this object on-the-fly to improve memory performance.

getPeptide()
get_cluster_id()
get_dscore()
print_out()
select_this_peakgroup()

Select this peakgroup for quantification (assigns cluster id 1; works since it calls back to its Precursor obj)

setClusterID(id_)

Set cluster id (works since it calls back to its Precursor obj)

set_fdr_score(fdr_score)

Raises exception as this object is immutable

set_feature_id(id_)

Raises exception as this object is immutable

set_intensity(intensity)

Raises exception as this object is immutable

set_normalized_retentiontime(normalized_retentiontime)

Raises exception as this object is immutable
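
The pattern described for MinimalPeakGroup (a read-only view whose setters raise, while selection is delegated back to the owning precursor) can be sketched as follows. All names here are hypothetical, and the use of `TypeError` is an assumption: the documentation only states that an exception is raised.

```python
class OwnerSketch:
    """Hypothetical owner that holds the mutable cluster assignment."""

    def __init__(self):
        self.cluster_ids = {}

    def set_cluster_id(self, feature_id, cluster_id):
        self.cluster_ids[feature_id] = cluster_id


class ImmutablePeakGroupSketch:
    """Read-only view created on the fly from data held by the owner."""

    __slots__ = ("_feature_id", "_fdr_score", "_owner")

    def __init__(self, feature_id, fdr_score, owner):
        self._feature_id = feature_id
        self._fdr_score = fdr_score
        self._owner = owner

    def get_feature_id(self):
        return self._feature_id

    def get_fdr_score(self):
        return self._fdr_score

    def set_fdr_score(self, fdr_score):
        # The view itself never changes; the exception type is an assumption.
        raise TypeError("peakgroup view is immutable")

    def get_cluster_id(self):
        return self._owner.cluster_ids.get(self._feature_id, -1)

    def select_this_peakgroup(self):
        # Works despite immutability because the state lives in the owner.
        self._owner.set_cluster_id(self._feature_id, 1)


owner = OwnerSketch()
view = ImmutablePeakGroupSketch("f2", 0.01, owner)
view.select_this_peakgroup()
```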

GuiPeakGroup

class msproteomicstoolslib.data_structures.PeakGroup.GuiPeakGroup(fdr_score, intensity, leftWidth, rightWidth, assay_rt, peptide)

Bases: msproteomicstoolslib.data_structures.PeakGroup.PeakGroupBase

See PeakGroupBase for a detailed description.

This implementation stores additional information including left/right width.

get_value(value)

GeneralPeakGroup

class msproteomicstoolslib.data_structures.PeakGroup.GeneralPeakGroup(row, run, peptide)

Bases: msproteomicstoolslib.data_structures.PeakGroup.PeakGroupBase

See PeakGroupBase for a detailed description.

This implementation stores the full row read from the CSV file, including all meta-data. It is generally not recommended to use this implementation except for toy examples.

getPeptide()
get_dscore()
get_value(value)
peptide
print_out()
row
run
setClusterID(clid)
set_value(key, value)

DataStructures - Basic

Aminoacides Module

Aminoacid

class msproteomicstoolslib.data_structures.aminoacides.Aminoacid(name, code, code3, composition)

Class to hold information about a single Amino Acid (AA)

code = None

One letter code

code3 = None

Three letter code

composition = None

Elemental composition

elementsLib = None

Library of elements

name = None

Full name of the AA

Aminoacides

class msproteomicstoolslib.data_structures.aminoacides.Aminoacides
addAminoacid(aminoacid)
getAminoacid(code)
initAminoacides()

Modifications Module

Modification

class msproteomicstoolslib.data_structures.modifications.Modification(aminoacid, tpp_Mod, unimodAccession, peakViewAccession, is_labeling, composition)

A modification on an Aminoacid

codes = ['TPP', 'unimod', 'ProteinPilot']

Available modification formats

getcode(code)

Modifications

class msproteomicstoolslib.data_structures.modifications.Modifications(default_mod_file=None)

A collection of modifications

appendModification(modification)
is_bool(expression)
printModifications()
readModificationsFile(modificationsfile)

Reads a TSV file with additional modifications. The modifications are appended to the default modifications of this class.

TSV file headers and an example row (columns are tab-separated):

  modified-AA  TPP-nomenclature  Unimod-Accession  ProteinPilot-nomenclature  is_a_labeling  composition-dictionary
  S            S[167]            21                [Pho]                      False          {'H' : 1,'O' : 3, 'P' : 1}
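
To make the file layout concrete, the following sketch parses the example row with the Python standard library. It is not the library's reader: `parse_modifications` and the keys of the returned dictionaries are hypothetical, and only the column names and the example values are taken from the description above.

```python
import ast
import csv
import io

# Header and example row as documented, joined with tab characters.
EXAMPLE_TSV = (
    "modified-AA\tTPP-nomenclature\tUnimod-Accession\t"
    "ProteinPilot-nomenclature\tis_a_labeling\tcomposition-dictionary\n"
    "S\tS[167]\t21\t[Pho]\tFalse\t{'H' : 1,'O' : 3, 'P' : 1}\n"
)


def parse_modifications(handle):
    """Parse a modifications TSV into a list of plain dictionaries."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "aminoacid": row["modified-AA"],
            "tpp": row["TPP-nomenclature"],
            "unimod": int(row["Unimod-Accession"]),
            "protein_pilot": row["ProteinPilot-nomenclature"],
            "is_labeling": row["is_a_labeling"].strip().lower() == "true",
            # literal_eval safely turns the dictionary text into a dict.
            "composition": ast.literal_eval(row["composition-dictionary"]),
        })
    return rows


mods = parse_modifications(io.StringIO(EXAMPLE_TSV))
```

`ast.literal_eval` is used instead of `eval` so that only Python literals are accepted in the composition column.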

translateModificationsFromSequence(sequence, code, aaLib=None)

Returns a Peptide object, given a sequence with modifications in any of the available codes. The code (TPP, Unimod,...) to be translated must be given.

Peak Module

Peak

class msproteomicstoolslib.data_structures.peak.Peak(str=None, spectraST=False)

Represents one peak of a spectrum.

init_with_self(peak)
initialize(peak, intensity, peak_annotation, statistics)
parse_str(peak)
to_write_string()

Peptide Module

Peptide

class msproteomicstoolslib.data_structures.peptide.Peptide(sequence, modifications={}, protein='', aminoacidLib=None)
addSpectrum(spectrum)

Deprecated definition

all_ions(ionseries=None, frg_z_list=[1, 2], fragmentlossgains=[0], mass_limits=None, label='')

Returns all fragment ions of the peptide as a tuple of two objects: (annotated, ionmasses_only).

  • annotated is a list of tuples of the form (ion_type, ion_number, ion_charge, lossgain, fragment_mz)
  • ionmasses_only is a list of fragment masses

When ionseries is not provided, all existing ion series (see Peptide.iontypes) are calculated. When frg_z_list is not provided, fragment ion charge states +1 and +2 are used.
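
The shape of the returned tuple can be illustrated with a small, self-contained sketch for unmodified peptides. It is not the library implementation: `all_ions_sketch` is a hypothetical helper that covers only b and y ions, ignores losses and gains (the fourth field is always 0), and uses a hand-written table of standard monoisotopic residue masses for a few amino acids.

```python
PROTON = 1.007276   # mass of a proton in Da
WATER = 18.010565   # monoisotopic mass of H2O in Da

# Monoisotopic residue masses (Da) for a small subset of amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "I": 113.08406,
    "D": 115.02694, "K": 128.09496, "E": 129.04259,
}


def all_ions_sketch(sequence, ionseries=("b", "y"), frg_z_list=(1, 2)):
    """Return (annotated, ionmasses_only) for b and y ions of a peptide."""
    annotated = []
    length = len(sequence)
    for ion_type in ionseries:
        for ion_number in range(1, length):
            if ion_type == "b":
                residues, extra = sequence[:ion_number], 0.0
            elif ion_type == "y":
                residues, extra = sequence[length - ion_number:], WATER
            else:
                raise ValueError("only b and y ions are sketched here")
            neutral = sum(MONO[aa] for aa in residues) + extra
            for charge in frg_z_list:
                mz = (neutral + charge * PROTON) / charge
                annotated.append((ion_type, ion_number, charge, 0, mz))
    ionmasses_only = [ion[4] for ion in annotated]
    return annotated, ionmasses_only


annotated, ionmasses_only = all_ions_sketch("GAS")
```

For a peptide of length n this yields n - 1 ions per series and charge state, so "GAS" produces eight entries with the default arguments.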

calIsoforms(switchingModification, modLibrary)

Returns the full list of peptide species of the same peptide family (isobaric, same composition, different modification site). The list is given as a list of Peptide objects. switchingModification must be given as a Modification object.
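
The idea of a peptide family that differs only in the modification site can be illustrated on plain strings. The helper below is a hypothetical sketch: `positional_isoforms`, its arguments and the string-based output are assumptions made for the example, whereas the library method works with Peptide and Modification objects.

```python
from itertools import combinations


def positional_isoforms(sequence, residue, n_mods, tag):
    """List all sequences carrying `n_mods` copies of `tag` on `residue`."""
    sites = [i for i, aa in enumerate(sequence) if aa == residue]
    isoforms = []
    for chosen in combinations(sites, n_mods):
        parts = [aa + tag if i in chosen else aa
                 for i, aa in enumerate(sequence)]
        isoforms.append("".join(parts))
    return isoforms


# One phosphorylation (written in TPP style as S[167]) on either serine.
isoforms = positional_isoforms("ASTSK", "S", 1, "[167]")
```

All returned sequences share the same elemental composition and therefore the same precursor mass; only the position of the modification differs.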

cal_UIS(otherPeptidesList, UISorder=2, ionseries=None, fragmentlossgains=[0], precision=1e-08, frg_z_list=[1, 2], mass_limits=None)

Calculates the UIS of this peptide with respect to a given list of other peptides. Returns a tuple of two objects: (all_UIS, all_UIS_annotated). all_UIS contains only a mass list.

comparePeptideFragments(otherPeptidesList, ionseries=None, fragmentlossgains=[0], precision=1e-08, frg_z_list=[1, 2])

Returns a tuple of lists: (CommonFragments, differentialFragments). The differential fragments are the fragments of this peptide (self) whose masses are not shared with any of the peptides in otherPeptidesList. otherPeptidesList must be a list of Peptide objects. Each fragment is reported as a tuple: (ionseries, ion_number, ion_charge, fragmentlossgain, mass).
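
The split into common and differential fragments can be sketched on bare mass lists. The helper below is hypothetical and assumes that `precision` is an absolute mass tolerance; the library method operates on annotated fragment tuples rather than plain floats.

```python
def compare_fragments(own_masses, other_masses, precision=1e-8):
    """Split own_masses into (common, differential) relative to other_masses."""
    common, differential = [], []
    for mass in own_masses:
        # A fragment is shared when any other mass lies within the tolerance.
        shared = any(abs(mass - other) <= precision for other in other_masses)
        (common if shared else differential).append(mass)
    return common, differential


own = [100.0, 200.0, 300.0]
others = [200.0 + 1e-9, 400.0]
common, differential = compare_fragments(own, others)
```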

fragmentSequence(ion_type, frg_number)
getDeltaMassFromSequence(sequence)
getMZ(charge, label='')
getMZfragment(ion_type, ion_number, ion_charge, label='', fragmentlossgain=0.0)
getSequenceWithMods(code)
get_decoy_Q3(frg_serie, frg_nr, frg_z, blackList=[], max_tries=1000)
pseudoreverse(sequence='None')
shuffle_sequence()

Residues Module

Residues

class msproteomicstoolslib.data_structures.Residues.Residues(type='mono')

A class that contains information about elements, amino acids and modifications. It mainly stores their masses, but also chemical formulas.

The most commonly used properties are:
  • Residues.average_elments : average element weights
  • Residues.monoisotopic_elments : monoisotopic element weights
  • Residues.aa_codes : Three and One letter amino acid codes
  • Residues.aa_names : English names of the amino acids
  • Residues.aa_sum_formulas_text : Chemical formulas of all amino acids
  • Residues.aa_sum_formulas: Chemical formulas of all amino acids as hash
  • Residues.mass_xxx: monoisotopic masses of different compounds (NH3, H2O, CO, HPO4 etc)
  • Residues.average_data: average weight of amino acids
  • Residues.monoisotopic_data: monoisotopic weight of amino acids
  • Residues.monoisotopic_mod: monoisotopic modification data
  • Residues.mod_mapping: mapping of + notation to absolute weight notation (K[+8] to K[136])
  • Residues.Hydropathy: Hydropathy of amino acids (gravy scores)
  • TODO hydrophobicity of amino acids
  • TODO basicity of amino acids
  • TODO helicity of amino acids
  • Residues.pI: pI of amino acids
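
The relation between the two notations mentioned for mod_mapping (K[+8] versus K[136]) can be illustrated by adding the mass shift to the monoisotopic residue mass and rounding to an integer. This is a sketch under that assumption: `delta_to_absolute` is a hypothetical helper with a small hand-written mass table, and the library attribute may well be a precomputed lookup table instead.

```python
import re

# Monoisotopic residue masses (Da) for the residues used in the examples.
MONO = {
    "K": 128.09496, "R": 156.10111, "S": 87.03203,
    "M": 131.04049, "C": 103.00919,
}

DELTA_PATTERN = re.compile(r"([A-Z])\[\+(\d+(?:\.\d+)?)\]")


def delta_to_absolute(sequence):
    """Rewrite X[+delta] as X[rounded absolute residue mass]."""
    def replace(match):
        residue, delta = match.group(1), float(match.group(2))
        return "%s[%d]" % (residue, round(MONO[residue] + delta))
    return DELTA_PATTERN.sub(replace, sequence)


heavy_lysine = delta_to_absolute("PEPTIDEK[+8]")
phospho_serine = delta_to_absolute("AS[+80]K")
```

The lysine example reproduces the K[+8] to K[136] mapping given in the list, and the serine example yields the S[167] notation used for phosphorylation in the Modifications section.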

DDB Module

DDB

Abstraction layer to the 2DDB software framework.