Many tools have been developed to predict the effect of missense variants in the genome. VUStruct is provided to additionally analyze factors that are often excluded by other tools, such as thermodynamic destabilization of the resulting protein structure, 3D colocalization of the variant with clusters of pathogenic mutations in the same protein, potential digenic effects arising from pairwise interactions of variants in different genes, and other factors described below. For details about how the various methods work, please visit the Calculation Details menu above.
In this tutorial, we shall begin with an example list of variants of unknown significance (VUS) as input. These variants come from the sequencing results of a fictional case for diagnosis of an unknown disease. This tutorial will walk you through preparation and submission of the input data, and the interpretation of the results generated by the system.
VUStruct is launched from a file of candidate variants for a single case. Each case typically includes 5-50 variants preselected from WES or WGS of the proband. This selection removes variants of unreliable quality, variants too common to be associated with the proband’s phenotype, variants associated with conditions unrelated to the patient’s condition, and known benign variants. In the UDN workflow, this step is usually performed by experts in clinical genetics, using standard software tools like VEP and databases of biomedical literature like OMIM.
The length of the input lists for this webserver is limited to 100 variants, because the ddG calculations are computationally intensive. But please contact us if you want to submit a larger set.
Submit only de-identified data. The link you are given to access the results is on the public internet, although it is difficult to guess.
Include inheritance information if you have it. Typically the proband’s parents are included in the WES and WGS testing, to ascertain the inheritance of the variants, and match it with the segregation of the phenotype. Sometimes additional relatives are included, e.g. siblings who are affected or unaffected.
The VUStruct launch form accepts a variety of input formats, all of which are preprocessed to "vustruct.csv" format.
index | gene | chrom | pos | change | effect | transcript | unp | refseq | mutation | genome | inheritance | zygosity |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | COA5 | chr2 | 98608380 | G/A | missense | ENST00000328709.8 | Q86WW8 | NM_001008215.2 | P9L | GRCh38 | father | heterozygous |
2 | NAGA | chr22 | 42067209 | C/T | missense | ENST00000396398.8;ENST00000402937.1;... | P17050 | NM_000262.2 | D136N | GRCh38 | mother | heterozygous |
3 | SMYD5 | chr2 | 73218934 | C/A | missense | ENST00000389501.9 | Q6GMV2 | NM_006062.2 | A57E | GRCh38 | father | heterozygous |
4 | SMARCAL1 | chr2 | 216447069 | G/A | missense | ENST00000357276.9;ENST00000358207.9;... | Q9NZC9 | NM_001127207.1 | A588T | GRCh38 | mother | heterozygous |
5 | ONECUT1 | chr15 | 52789080 | G/C | missense | ENST00000305901.7 | Q9UBC0 | NM_004498.2 | R269G | GRCh38 | father | heterozygous |
6 | SORD | chr15 | 45069018 | CG/C | Frameshift deletion | | Q00796-1 | NM_003104.6 | Frameshift deletion | GRCh38 | mother | homozygous |
7 | SEC61A2 | chr10 | 12085047 | | Deletion | | | | Deletion | GRCh38 | de novo | heterozygous |
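Before submitting, it can help to run a quick sanity check on a variant table of this shape. The following is a minimal sketch, not part of VUStruct: it assumes the file is comma-separated with a header row using the column names from the table above (the index column is ignored), and the function name `check_variant_table` is hypothetical. It checks only the required columns, the 100-variant limit, and that each row has a gene and an integer position.

```python
import csv
import io

# Column names copied from the example table above. The on-disk layout
# (comma-separated, with a header row) is an assumption of this sketch.
EXPECTED_COLUMNS = [
    "gene", "chrom", "pos", "change", "effect", "transcript",
    "unp", "refseq", "mutation", "genome", "inheritance", "zygosity",
]
MAX_VARIANTS = 100  # webserver limit stated in this tutorial


def check_variant_table(text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    problems = []
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    rows = list(reader)
    if len(rows) > MAX_VARIANTS:
        problems.append(f"{len(rows)} variants exceeds the limit of {MAX_VARIANTS}")
    for n, row in enumerate(rows, start=1):
        if not (row.get("gene") or "").strip():
            problems.append(f"row {n}: gene is empty")
        if not (row.get("pos") or "").strip().isdigit():
            problems.append(f"row {n}: pos is not an integer")
    return problems


# One row taken from the example table, written as comma-separated text.
example = (
    "gene,chrom,pos,change,effect,transcript,unp,refseq,mutation,genome,inheritance,zygosity\n"
    "COA5,chr2,98608380,G/A,missense,ENST00000328709.8,Q86WW8,NM_001008215.2,P9L,GRCh38,father,heterozygous\n"
)
print(check_variant_table(example))
```

An empty list means none of these basic checks failed. It does not guarantee that the server will accept the file.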
Go to the VUStruct launch page, which is linked from the menu at the top of this page. You will see an input form with a SUBMIT button at the bottom. Enter your Case Label in the top field and then select a Data Format of either 'VCF GRCh38' or 'Vanderbilt UDN Case Spreadsheet'. If you are proceeding with the Vanderbilt format .xlsx input data, you will not see the Additional Digenic entry box, as gene names are pulled from the spreadsheet itself.
To launch, click the Submit button at the bottom of the page, labeled 'Submit VUStruct Case to Computer Cluster'. You will receive a link to a page where the results will appear. You can bookmark that link. The duration of the calculation depends primarily on the size of the proteins. If the longest protein in your list is under 200 residues in length, the calculation can finish in as little as 4-6 hours. For proteins >1000 residues in length, the calculation can take more than 3 days to finish.
The results page for a calculation looks like the example below. The top table shows a summary of the predictions for all variants submitted. The DiGePred predictions are shown in a heat map linked from “DiGePred Analysis,” and summarized in the lower table.
Each line in the top table shows the variant information, and a summary of the PathProx and ddG calculations. Note that the summary does not give a full picture of the predictions, because it combines calculations performed on multiple isoforms, and multiple structural models for each variant. You should look at each one in detail as shown below:
You should first expand the rows for each variant of interest by clicking on the “collapsible” icon (it looks like a greater-than sign) at the left of each row. An example is highlighted in the red box here.
Once that row is expanded, you will see the links in the first column go to a “Report” for the variant as it appears in each curated transcript (and corresponding protein isoform). Example highlighted in red below [fig 05.png]. These are important links, because they will take you to the detailed prediction for the calculations. Again, you should not rely on the summary alone. Click on a “Report” link to see the results page for that variant.
Each Report link will take you to another page that is specific to that variant in one transcript/protein isoform. The calculations can be performed multiple times, depending on how many structural models that include the variant position are available for the protein. An example Report page is shown below. Helpful features to note on this Report page are:
Each section contains results for one of the structural models. The results include:
Here are shown the results of calculations using the model from ModBase. You can see that in this case, the ddG_monomer calculation gave a very high score, >46 kcal/mol, in contrast with the ddG_cartesian calculation, which did not give a score. We launch both protocols because sometimes one of them fails to complete, due to problems such as chain breaks in the input structure. Below that, the PathProx score of 0.27 is shown under “ClinVar Results,” which quantifies the colocalization of the variant compared to distributions of pathogenic vs benign variants (for details and references, see Calculation Details → PathProx).
The variant is shown in the context of the structural model using NGL viewer. This provides great flexibility in displaying the data, even permitting you to save a figure. For example, use the menu option Variants → ClinVar to show the locations of the pathogenic variants. Then use Colors → ClinVar PathProx to paint the ribbons with the corresponding scores at each residue. It becomes clear why this location scored highly, especially if you contrast that with the display of Variants → GnomAD. You can save your figure with File → “Export Image”.
The 3D viewer is followed by additional images showing the distributions of variants compared to the statistics of the null distribution used by PathProx.
It can be useful to view the ROC curve and the PR curve for the PathProx calculations. The score itself can be more or less convincing, depending on the number and distribution of the pathogenic variants and the neutral variants. You can see in the PathProx ClinVar ROC below that the classification confidence is suggestive, but not overwhelmingly strong. Does that match your intuition from looking at the distributions in the NGL viewer above? For a better understanding of spatial constraint, be sure to read (Sivley et al. 2018).
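To build intuition for what the area under a ROC curve summarizes, the following is a minimal sketch that computes it directly as the fraction of (pathogenic, neutral) pairs in which the pathogenic variant has the higher score. This is a generic rank-based calculation, not the VUStruct or PathProx implementation, and the scores are invented for illustration only.

```python
def roc_auc(pathogenic_scores, neutral_scores):
    """Probability that a random pathogenic variant outscores a random neutral one.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    wins = 0.0
    for p in pathogenic_scores:
        for n in neutral_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pathogenic_scores) * len(neutral_scores))


# Hypothetical per-variant scores, invented for this example.
pathogenic = [0.9, 0.7, 0.4, 0.6]
neutral = [0.1, 0.5, 0.3, 0.65]
print(roc_auc(pathogenic, neutral))  # 13 of 16 pairs are ordered correctly: 0.8125
```

With only four variants in each class, changing a single score moves the result by several percentage points. This is one reason a ROC curve built from few known variants should be read cautiously.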
The genes you provide in your list are evaluated for the potential of digenic disease interactions. This is done using a predictor called DiGePred, a random forest classifier for identifying candidate digenic disease gene pairs by features derived from biological networks, genomics, evolutionary history, and functional annotations. Go back to the “case page” that summarizes all the results for this VUStruct run. You’ll see a summary table of the DiGePred scores. Just above that is the link to the detailed results (highlighted in red below):
The digenic predictions are displayed as a heat map. Each cell represents the potential interaction between two of the genes in the list. The matrix is symmetric (redundant). The legend on the right-hand side shows the colors used to draw the cell borders, which highlight the inheritance information (if provided with the input data). The tabs at the bottom select subsets of the information used as support for the predicted interactions, and change the display of the heat map. Moving the mouse cursor over an individual cell displays a pop-up tip that summarizes the score and the information used to support that score.
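The symmetry of the heat map can be illustrated with a short sketch that lists each unordered gene pair once and then mirrors per-pair values into a full matrix. The gene names come from the example table in this tutorial. The function names and the placeholder score of 0.0 are hypothetical, and no real DiGePred scores are computed here.

```python
from itertools import combinations

genes = ["COA5", "NAGA", "SMYD5", "SMARCAL1", "ONECUT1", "SORD", "SEC61A2"]


def unique_pairs(gene_list):
    """Each unordered gene pair once, i.e. one triangle of the heat map."""
    return list(combinations(sorted(set(gene_list)), 2))


def symmetric_matrix(gene_list, pair_scores):
    """Mirror per-pair scores into a full matrix where m[a][b] == m[b][a]."""
    names = sorted(set(gene_list))
    # Diagonal cells (a gene paired with itself) stay None.
    matrix = {a: {b: None for b in names} for a in names}
    for (a, b), score in pair_scores.items():
        matrix[a][b] = score
        matrix[b][a] = score
    return matrix


pairs = unique_pairs(genes)
print(len(pairs))  # 7 genes give 7 * 6 / 2 = 21 distinct pairs

# Placeholder values only; real scores come from the DiGePred results page.
placeholder_scores = {pair: 0.0 for pair in pairs}
matrix = symmetric_matrix(genes, placeholder_scores)
print(matrix["COA5"]["SORD"] == matrix["SORD"]["COA5"])  # True
```

Because the number of pairs grows as n(n-1)/2, a 50-gene case has 1,225 distinct cells to review. The tabs and pop-up tips help narrow that down.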
Some of the results of these calculations are effectively orthogonal to the predictors most commonly used. We have found them useful for prioritizing candidates for additional investigation that are not clearly implicated by other methods. Summarizing the data is not trivial, so we create a spreadsheet highlighting our conclusions.