Ph.D.: Improvement of the number of proteins identified by LC-MS/MS with an AMTtag method

Posted 14 May 2022

University of Rouen Normandy

France (Ph.D.)

https://www.univ-rouen.fr/


Context

The thesis will take place in a multidisciplinary context, within a collaboration between the TIBS team of LITIS UR 4108 and the PISSARO proteomics platform of the University of Rouen Normandy, France. The identification of proteins in a complex sample by proteomic analysis is nowadays an essential step to better characterise a biological system under study, for example to highlight the regulation of signalling pathways associated with a phenotype, in all types of organisms and from different types of samples (tissues, cells, biological fluids, etc.). For this purpose, proteins are cut with a specific enzyme (protease) and the peptides generated are then analysed by mass spectrometry coupled to liquid chromatography (LC-MS/MS). First, on high-resolution, high-mass-accuracy instruments, the ionised peptides are analysed in a first MS scan. The retention time (RT) and the mass-to-charge ratio (m/z) of each of these ions (called parent ions) are determined. During these experiments, some of these peptide ions (often the most intense) are analysed by tandem mass spectrometry (MS/MS): they are selected and fragmented, and the fragment ions (called daughter ions) are analysed in a second MS analysis. Subsequently, bioinformatics tools process these data to determine the amino acid sequence of the analysed peptides, and thus to identify the proteins present in the samples. In these studies, peptides are most commonly identified by comparing the m/z values of the parent and daughter ions with the theoretical m/z values obtained by in silico digestion and fragmentation of proteins from a reference database of the organism under study.
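As an illustration of the in silico digestion step mentioned above, here is a minimal Python sketch, not the pipeline used by any particular search engine. It assumes trypsin as the protease (cleavage after K or R, except before P) and computes only the theoretical m/z of the parent ion; fragment ions and modifications are not covered. The function names digest_trypsin and peptide_mz are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to the residue sum
PROTON = 1.007276   # mass of the charge carrier


def digest_trypsin(protein):
    """Split a protein sequence after K or R, unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the C-terminal peptide when the sequence does not end in K/R.
        peptides.append(protein[start:])
    return peptides


def peptide_mz(peptide, charge=1):
    """Theoretical m/z of an unmodified peptide carrying `charge` protons."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral_mass + charge * PROTON) / charge


if __name__ == "__main__":
    for pep in digest_trypsin("AKRPGK"):
        print(pep, round(peptide_mz(pep, charge=2), 4))
```

A search engine would compare such theoretical values with the measured parent m/z within a mass tolerance before scoring the fragment spectrum.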

Today, a typical LC-MS/MS experiment produces tens of thousands, or even more than a hundred thousand, spectra per hour of analysis. However, many peptides (and therefore the proteins from which they originate) remain unidentified despite 1) extensive work on the development of identification software, and 2) improvements in mass spectrometry instruments. There are several main reasons for this limited level of identification:

  • only the most intense ions are generally selected for MS/MS analysis;
  • some MS/MS spectra lack the information needed for the peptide sequence to be annotated successfully;
  • most identification work is undertaken without taking into account the potential presence of post-translational modifications (PTMs).

Thus, a large fraction of the data generated is not used. An approach that exploits the recorded information would allow a significant increase in the number of proteins identified, and thus a better characterisation of biological systems, such as the biochemical pathways regulated during a disease, or the search for protein biomarkers that are signatures of a pathology. An interesting strategy to increase the number of identified peptides is the method called AMTtag (Accurate Mass and retention Time tag [1, 2, 4]), based on the use of m/z and RT coordinates. This methodology is a sequential process: 1) creating a database of the coordinates (m/z and RT) of peptides identified by classical software (e.g. Mascot [3]), and 2) using this database to predict a peptide identification in an unknown sample from these coordinates. This method became feasible thanks to the progress made in mass spectrometry, particularly in the accuracy of mass-to-charge ratio measurements. However, this method, which appeared in the mid-2000s, is only rarely used because it is associated with several limitations:

  • the variability of the RTs due to the strong dependence on the materials used during the chromatographic separation of peptides (change of columns, composition of elution solutions...);
  • because RTs vary more than m/z ratios, a preliminary RT alignment step is essential before an AMTtag strategy can be applied;
  • it is necessary to analyse peptide elution profiles very precisely to identify and separate peptide co-elution events, often found in these highly complex samples;
  • it requires a large number of datasets for the database creation/completion stage.
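The two-step AMTtag process and the RT alignment it depends on can be sketched as follows. This is a simplified illustration under stated assumptions, not the method used by the platform or by the cited works: the alignment is assumed to be a straight line fitted by least squares on peptides shared between a run and the database (real alignments are often non-linear), and the tolerances (5 ppm, 1 minute) as well as the peptide names are hypothetical.

```python
def fit_rt_alignment(run_rts, ref_rts):
    """Least-squares line ref_rt = slope * run_rt + intercept.

    run_rts and ref_rts are the retention times of the same peptides
    in the new run and in the reference database, respectively.
    """
    n = len(run_rts)
    mean_x = sum(run_rts) / n
    mean_y = sum(ref_rts) / n
    sxx = sum((x - mean_x) ** 2 for x in run_rts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(run_rts, ref_rts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def match_feature(mz, rt, database, ppm_tol=5.0, rt_tol=1.0):
    """Return database peptides whose (m/z, RT) lie within both tolerances.

    database is a list of (peptide, mz, rt) tuples; rt must already be
    expressed on the database time scale (i.e. aligned).
    """
    hits = []
    for peptide, db_mz, db_rt in database:
        ppm_error = abs(mz - db_mz) / db_mz * 1e6
        if ppm_error <= ppm_tol and abs(rt - db_rt) <= rt_tol:
            hits.append(peptide)
    return hits


if __name__ == "__main__":
    # Hypothetical database entries: (peptide, m/z, RT in minutes).
    database = [("PEP_A", 500.2500, 20.0),
                ("PEP_B", 500.2500, 35.0),
                ("PEP_C", 500.3000, 20.0)]
    # Anchor peptides seen both in the new run and in the database.
    slope, intercept = fit_rt_alignment([10.0, 20.0, 30.0], [12.0, 22.5, 33.0])
    aligned_rt = slope * 18.0 + intercept
    print(match_feature(500.2510, aligned_rt, database))
```

The linear scan in match_feature is kept for readability; it shows why both coordinates are needed, since PEP_A and PEP_B share the same m/z and are separated only by RT.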

Objectives

The PISSARO platform has taken an interest in this strategy and has begun bioinformatics work on the analysis of data (m/z and RT coordinates). It has already built a database for proteins from the bacterium Pseudomonas aeruginosa. This preliminary work has highlighted the feasibility and interest of this approach. Indeed, the database contains a total of 25871 peptides from 3386 different proteins. For a given LC-MS analysis, the use of this database allows the identification of an average of 13000 peptides (compared to 8000 peptides by Mascot) and an average of 2000 proteins (compared to 1500 proteins by Mascot). The objectives of this thesis are to further develop innovative methods and algorithms to improve the identification of peptides and proteins using the AMTtag method. A second part concerns the development of methods for the quantification and differential analysis of proteins/peptides identified by the AMTtag method. The first phase will therefore consist of optimising the AMTtag method described above. This will involve developing a new algorithm for identifying peptides based on their mass-to-charge ratio and their retention time. To improve the reliability of the identification, a new and more accurate alignment method will be developed. Complementary methods will be considered, such as taking into account isotopic mass, to confirm the reliability of the identification and reduce false positives. Finally, these algorithms will have to be optimised to process a large amount of data in a minimum amount of time. In a first version, the work developed during this thesis will be applied to Pseudomonas aeruginosa. Tests will then be carried out on Acinetobacter baumannii and then on eukaryotic organisms, to evaluate the robustness of the AMTtag method. The switch to other organisms will undoubtedly lead to scaling problems, which will need to be addressed by finding suitable data structures to support efficient searches.
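One possible data structure for such searches, given here only as an assumption for illustration, is a list of database entries sorted by m/z and queried by binary search, so that a mass tolerance window is located in logarithmic time instead of scanning every peptide. The class name MzIndex, the 5 ppm default and the peptide names are hypothetical.

```python
import bisect


class MzIndex:
    """Database entries sorted by m/z for fast tolerance-window lookups."""

    def __init__(self, entries):
        # entries: iterable of (mz, rt, peptide) tuples.
        self.entries = sorted(entries)
        self.mzs = [entry[0] for entry in self.entries]

    def query(self, mz, ppm_tol=5.0):
        """Return all entries whose m/z is within ppm_tol of the query m/z."""
        delta = mz * ppm_tol * 1e-6
        lo = bisect.bisect_left(self.mzs, mz - delta)
        hi = bisect.bisect_right(self.mzs, mz + delta)
        return self.entries[lo:hi]


if __name__ == "__main__":
    index = MzIndex([(500.30, 20.0, "PEP_C"),
                     (500.25, 20.0, "PEP_A"),
                     (650.10, 40.0, "PEP_D")])
    print(index.query(500.251))
```

Candidates returned by query would still have to be filtered on retention time; a two-dimensional structure such as a k-d tree is another option if the RT filter becomes the bottleneck.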
A second phase will consist of developing a quantification strategy aimed at the comparative analysis of protein abundance in different samples. This quantification will be based on the integration of the chromatographic signals of each peptide identified by the AMTtag method. During this process, a step of standardisation of the peptide signal intensities between all experiments will be necessary. Biostatistical tools will be needed to evaluate the variability of peptide and protein abundance within and between samples.
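The integration and standardisation steps could look like the sketch below. It assumes trapezoidal integration of each peptide's chromatographic signal and median normalisation between runs; both are common choices used here for illustration, not the strategy that the thesis will necessarily adopt. The function names and the example values are hypothetical.

```python
from statistics import median


def peak_area(times, intensities):
    """Trapezoidal integration of one peptide's chromatographic signal."""
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += width * (intensities[i] + intensities[i - 1]) / 2.0
    return area


def median_normalise(runs):
    """Rescale each run so that its median peptide area matches a common target.

    runs maps a run name to a dict {peptide: area}. The target is the
    median of the per-run medians.
    """
    run_medians = {run: median(areas.values()) for run, areas in runs.items()}
    target = median(run_medians.values())
    return {
        run: {pep: area * target / run_medians[run] for pep, area in areas.items()}
        for run, areas in runs.items()
    }


if __name__ == "__main__":
    print(peak_area([0.0, 1.0, 2.0], [0.0, 10.0, 0.0]))
    runs = {"run1": {"A": 100.0, "B": 200.0, "C": 300.0},
            "run2": {"A": 200.0, "B": 400.0, "C": 600.0}}
    print(median_normalise(runs))
```

After normalisation, the two hypothetical runs above, which differ only by a global factor of two, yield identical peptide areas, which is the behaviour expected when the difference comes from sample loading rather than biology.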

Bibliography

  1. A. Agron, D. M. Avtonomov, A. S. Kononikhin, I. A. Popov, S. A. Moshkovskii, and E. N. Nikolaev. Accurate mass tag retention time database for urine proteome analysis by chromatography mass spectrometry. Biochemistry 75(5) May 2010. URL: https://doi.org/10.1134/S0006297910050147.
  2. Conrads TP, Anderson GA, Veenstra TD, Pasa-Tolić L, Smith RD. Utility of accurate mass tags for proteome-wide protein identification. Anal Chem. 72(14) July 2000: 3349-54. doi:10.1021/ac0002386. PMID: 10939410.
  3. Mascot. URL: https://www.matrixscience.com/
  4. Wu C, Monroe ME, Xu Z, Slysz GW, Payne SH, Rodland KD, Liu T, Smith RD. An Optimized Informatics Pipeline for Mass Spectrometry-Based Peptidomics. J Am Soc Mass Spectrom. 26(12) December 2015: 2002-8. doi:10.1007/s13361-015-1169-z. Epub 2015 May 27. PMID: 26015166; PMCID: PMC4655184.

Perspectives

Eventually, software with a user interface will be developed to allow teams of biologists to apply the method to their organisms of interest. The idea is to have a system that is sufficiently general to be applied to any type of organism. Applications are possible in the health sector, particularly for the analysis of tumour samples.

Contact: Thierry Lecroq (thierry.lecroq@univ-rouen.fr)