Machine learning discovery of missing links that mediate alternative branches to plant alkaloids

The following Q&A explains how we used a Support Vector Machine (SVM) algorithm to discover enzymes that can be harnessed for biosynthesis.

Scheme for the machine learning discovery of missing links that mediate alternative branches to plant alkaloids, from the following paper: https://www.nature.com/articles/s41467-022-28883-8

1.  How many enzyme sequences were applied to the SVM models to find the missing link enzymes?

Our paper focused on distinguishing between aromatic acetaldehyde synthase (AAS) and aromatic amino acid decarboxylase (AAAD), and on predicting phenylpyruvate decarboxylase (PPDC). Because we had to curate positive training sequences that are highly likely to have these specialized functions, the number of positive training examples for these models is low. Even so, the Support Vector Machine (SVM) was still effective with these low numbers of positive training sequences.

For the AAAD models, we used 286 positive training sequences, 9,486 negative training sequences, and 7 Papaver somniferum tyrosine decarboxylase (TyDC) test sequences. P. somniferum has 9 TyDC sequences, but only 7 were used as test sequences: one was included as an AAAD training sequence to make sure our predictions were not biased, and TyDC4 has a premature stop codon.

For AAS models, we used 360 positive training sequences, 9,412 negative training sequences and 7 P. somniferum TyDC test sequences. For the PPDC prediction model, we used 244 positive training sequences, 6,268 negative training sequences and 27 P. somniferum pyruvate decarboxylase (PDC)-related test sequences.
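To make the class imbalance in these training sets concrete, the short sketch below computes the fraction of positive sequences for each model from the counts quoted above. It also computes "balanced" class weights with the common heuristic n_samples / (n_classes * n_in_class). Whether the models in the paper used class weighting is not stated here, so the weights are an illustrative assumption only.

```python
# Training-set sizes quoted in the answer above: (positives, negatives).
TRAINING_SETS = {
    "AAAD": (286, 9486),
    "AAS": (360, 9412),
    "PPDC": (244, 6268),
}


def class_balance(n_pos, n_neg):
    """Return the positive fraction and 'balanced' per-class weights.

    The weights follow the heuristic n_samples / (n_classes * n_in_class).
    Applying them to these models is an assumption for illustration.
    """
    total = n_pos + n_neg
    return {
        "total": total,
        "positive_fraction": n_pos / total,
        "weight_positive": total / (2 * n_pos),
        "weight_negative": total / (2 * n_neg),
    }


if __name__ == "__main__":
    for name, (n_pos, n_neg) in TRAINING_SETS.items():
        stats = class_balance(n_pos, n_neg)
        print(
            f"{name}: {stats['total']} sequences, "
            f"{stats['positive_fraction']:.1%} positive, "
            f"positive-class weight {stats['weight_positive']:.1f}"
        )
```

In every case the positive class is under 4% of the training set, which is the situation the answer describes as a low number of positive training sequences.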

There is also a combined PPDC+PDC prediction model, three cytochrome P450 (CYP450) prediction models and a CYP450 reductase prediction model. All of the training sequences are included in the Supplementary Data files, and all of the test sequences are listed in the Supplementary Tables.

2.   Of the sequences applied to the SVM, how many have the target function and how many do not?

The positive and negative training sequences we used to train the SVM models are described above. As for how the test sequences matched the target functions, all P. somniferum TyDC test sequences showed positive decision scores for both AAS and AAAD prediction, which suggests that these enzymes could be bifunctional. TyDC1-3 and TyDC5-7 showed higher decision scores for AAS than for AAAD, and TyDC8 showed a higher score for AAAD than for AAS.

In vitro assays confirmed that AAS activity was present in TyDC6, which showed the highest decision score for AAS. The TyDC6 prediction and discovery is the most important finding of the paper.
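The comparison of AAS and AAAD decision scores described above can be sketched as a small helper. This is a minimal illustration, assuming that a positive score places a sequence on the positive-class side of a model's decision boundary. The function name, the sequence names and the score values are hypothetical placeholders, not results from the paper.

```python
def interpret_scores(aas_score, aaad_score):
    """Summarize a pair of SVM decision scores for one test sequence.

    A score above zero is treated as a positive prediction for that model.
    When both are positive, the larger score decides the "leaning"
    (a tie is reported as AAAD-leaning, an arbitrary choice).
    """
    aas_hit = aas_score > 0
    aaad_hit = aaad_score > 0
    if aas_hit and aaad_hit:
        leaning = "AAS" if aas_score > aaad_score else "AAAD"
        return f"possibly bifunctional, {leaning}-leaning"
    if aas_hit:
        return "AAS only"
    if aaad_hit:
        return "AAAD only"
    return "neither"


if __name__ == "__main__":
    # Hypothetical (AAS score, AAAD score) pairs for illustration only.
    example_scores = {
        "seq_A": (1.2, 0.4),
        "seq_B": (0.3, 0.9),
        "seq_C": (-0.5, 0.7),
    }
    for name, (aas, aaad) in example_scores.items():
        print(name, "->", interpret_scores(aas, aaad))
```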

For the PPDC test sequences, 17 of 27 P. somniferum PDC-related sequences showed positive decision scores for PPDC prediction. A pyruvate decarboxylase 1 isoform X1 sequence with a 27-residue N-terminal truncation showed the highest PPDC decision score, so this sequence should be characterized in a future study.

How well test sequences match the target functions depends entirely on how the test sequences are chosen. Test sequences that are very similar to the positive training sequences will receive higher scores.

3.  Was structural information of reaction substrates and products included for feature extraction?

In the current study, only enzyme amino acid sequences were used to predict enzyme function. This was sufficient to predict the specialized functions of the AAS and PPDC missing links in this proof-of-concept work. Professor Michihiro Araki and PhD candidate Naoki Watanabe have already updated the machine learning algorithms to include substrate and product information, but these updates are being published separately.
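As an illustration of what sequence-only feature extraction can look like, the sketch below turns an amino acid sequence into a fixed-length numeric vector that an SVM could take as input. The feature encoding used in the paper is not described in this Q&A, so dipeptide composition is used here as a common stand-in, and the function name is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element list of dipeptide frequencies for one sequence.

    Pairs containing non-standard symbols (X, *, gaps, ...) are skipped,
    and frequencies are normalized by the number of pairs that were counted.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    n_valid = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            n_valid += 1
    if n_valid == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / n_valid for d in DIPEPTIDES]


if __name__ == "__main__":
    # Toy sequence, not a real enzyme.
    vector = dipeptide_composition("MKMK")
    print(len(vector), vector[DIPEPTIDES.index("MK")])
```

Every sequence maps to a vector of the same length regardless of protein length, which is what allows sequences of different sizes to be compared by the same model.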

4.  What kind of experimental data and database data were used to differentiate between enzyme functions?

To determine if AAS and AAAD training sequences belong to the correct function group, we devised structure-based rules that could differentiate between AAS and AAAD. The rules were extracted from reported experimental data, database entries and analysis of protein structure models. The AAS and AAAD training sequence database could then be curated using structure-based rules at two active site amino acid positions. Plant PPDC positive training sequences were curated according to a report of rose PPDC [Hirata, H. et al. Sci. Rep. 6, 20234 (2016)] and by selection of related sequences that grouped in the same phylogenetic clade.
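The curation step with structure-based rules at two active site positions can be sketched as a lookup on aligned sequences. The alignment columns, the residues and the requirement that both positions agree are hypothetical placeholders chosen for illustration. They are not the positions or rules used in the paper.

```python
# Hypothetical rule table: alignment column (0-based) -> {residue: class}.
# Columns and residues are placeholders, not those reported in the paper.
RULES = {
    3: {"Y": "AAAD", "F": "AAS"},
    7: {"H": "AAAD", "N": "AAS"},
}


def curate(aligned_sequence, rules=RULES):
    """Assign 'AAS' or 'AAAD' only when every rule position agrees.

    Returns None when a position is missing, holds an unrecognized residue
    (including a gap), or when the positions point to different classes.
    """
    labels = set()
    for column, residue_to_label in rules.items():
        if column >= len(aligned_sequence):
            return None
        label = residue_to_label.get(aligned_sequence[column].upper())
        if label is None:
            return None
        labels.add(label)
    return labels.pop() if len(labels) == 1 else None


if __name__ == "__main__":
    # Toy aligned sequences for illustration only.
    for seq in ("MKAYLLGHPQ", "MKAFLLGNPQ", "MKAYLLGNPQ", "MKA-LLGHPQ"):
        print(seq, "->", curate(seq))
```

Sequences that return None would be left out of both training groups in this sketch, which keeps ambiguous examples from being assigned to either class.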

To determine the enzyme function of test sequences, we first used in vivo screenings, but in vitro characterization should be included to prove new functions. For in vitro proof of P. somniferum TyDC6 function, the AAS product 4-hydroxyphenylacetaldehyde was detected with GC-MS, the AAS product 3,4-dihydroxyphenylacetaldehyde with LC-MS, and the AAAD products tyramine and dopamine with LC-MS. AAS activity was also demonstrated using a peroxidase-based fluorescent assay.

5.   Why was machine learning necessary, if sequences were first classified by structural analysis?

This important question was raised by one of the reviewers. One of the main purposes of the machine learning enzyme prediction is to better classify specialized enzymes that are incorrectly annotated throughout sequence databases. In our example, many enzymes with potential AAS activity and PPDC activity are simply annotated as AAAD and PDC, respectively. Therefore the training sequences that we gathered from databases had to be curated to make sure that they are correctly assigned to their specialized groups. To do this we examined the structures of clear examples for each enzyme group, and determined structure-based rules for the curation.

After training models with the curated sequences, many test sequences scored high for the target function, even though their active site structural features resembled those of the negative training group. For example, in vitro experiments showed that P. somniferum TyDC6 possesses AAS activity, but its active site configuration resembles that of a typical AAAD. Because the AAS specialized function is not suggested by the TyDC6 structure, the prediction of TyDC6 as AAS comes purely from the SVM scores. This shows that the SVM algorithm has made a true discovery of a novel enzyme, and suggests that protein structure analysis alone is not sufficient for functional prediction in at least some cases. We speculate that some emergent properties or elusive structural features that we cannot easily notice by eye may be responsible for some specialized functions, and that machine learning may be able to better predict these emergent properties.

This work is dedicated to Natalie Chanier, and was supported by the New Energy and Industrial Technology Development Organization (NEDO).