VUStruct is powered by Vanderbilt's ACCRE compute cluster.


Tutorial

Many tools have been developed to predict the effect of missense variants in the genome. VUStruct is provided to additionally analyze factors that are often excluded by other tools, such as thermodynamic destabilization of the resulting protein structure, 3D colocalization of the variant with clusters of pathogenic mutations in the same protein, potential digenic effects arising from pairwise interactions of variants in different genes, and other factors described below. For details about how the various methods work, please visit the Calculation Details menu above.

In this tutorial, we shall begin with an example list of variants of unknown significance (VUS) as input. These variants come from the sequencing results of a fictional case for diagnosis of an unknown disease. This tutorial will walk you through preparation and submission of the input data, and the interpretation of the results generated by the system.

VUStruct is designed to accept a list of candidate variants for a single case. Each case typically includes 5-50 variants selected from WES or WGS of the proband. This selection removes variants of unreliable quality, variants too common to be associated with the proband’s phenotype, variants associated with conditions unrelated to the patient’s condition, and known benign variants. In the UDN workflow, this step is usually performed by experts in clinical genetics, using standard software tools like VEP and databases of biomedical literature like OMIM.

Input lists for this webserver are limited to 100 variants because the ddG calculations are computationally intensive. Please contact us if you want to submit a larger set.

Submit only de-identified data. The link you are given to access the results is difficult to guess, but it is reachable on the public internet.

Include inheritance information if you have it. Typically the proband’s parents are included in the WES and WGS testing, to ascertain the inheritance of the variants, and match it with the segregation of the phenotype. Sometimes additional relatives are included, e.g. siblings who are affected or unaffected.

For this tutorial, we provide sample input data in two formats, either of which runs through the pipeline to the same end result.

  • SampleCase.xlsx: Input variants and Gene Names (with inheritance information) in Vanderbilt UDN case spreadsheet format.
  • Separate VCF and Gene List files:
    • SampleCase.vcf: The Genomic Coordinates of the variants in standard Variant Call Format
    • SampleCase.txt: A .txt file of Gene names and inheritance information
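
Before uploading a VCF, it can help to confirm that it stays within the 100-variant limit mentioned above. The following is a minimal sketch, not part of VUStruct: it assumes one variant per VCF data line (multi-allelic records would need extra handling), and the function names and the two-record example are hypothetical.

```python
import io

MAX_VARIANTS = 100  # webserver limit stated in this tutorial


def count_vcf_records(handle):
    """Count data lines in a VCF stream, skipping '#' header lines and blanks."""
    count = 0
    for line in handle:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


def within_limit(handle, limit=MAX_VARIANTS):
    """Return (ok, n): n is the record count, ok is True when n <= limit."""
    n = count_vcf_records(handle)
    return n <= limit, n


# Hypothetical two-record VCF, for illustration only (not the SampleCase data).
example_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\t.\tA\tG\t.\t.\t.\n"
    "2\t2000\t.\tC\tT\t.\t.\t.\n"
)
ok, n = within_limit(example_vcf)
print(f"{n} variants; within limit: {ok}")
```

To check a real file, pass an open file handle instead of the in-memory example, e.g. `within_limit(open("SampleCase.vcf"))`.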

Go to the VUStruct launch page, which is linked from the menu at the top of this page. You will see an input form with a SUBMIT button at the bottom. Enter your Case Label in the top field and then select a Data Format of either 'VCF GRCh38' or 'Vanderbilt UDN Case Spreadsheet'. If you are proceeding with the Vanderbilt format .xlsx input data, you will not see the Additional Digenic entry box, because gene names are pulled from the spreadsheet itself.


To launch, click the blue/green button at the bottom of the page, 'Submit VUStruct Case to Computer Cluster'. You will receive a link to a page where the results will appear. You can bookmark that link. The duration of the calculation depends primarily on the size of the proteins. If the longest protein in your list is under 200 residues in length, the calculation can finish in as little as 4-6 hours. For proteins >1000 residues in length, the calculation can take more than 3 days to finish.

General

The results page for a calculation is shown below. The top table shows a summary of the predictions for all variants submitted. The DiGePred predictions are shown in a heat map linked from “DiGePred Analysis,” and summarized in the lower table.

Each line in the top table shows the variant information, and a summary of the PathProx and ddG calculations. Note that the summary does not give a full picture of the predictions, because it combines calculations performed on multiple isoforms, and multiple structural models for each variant. You should look at each one in detail as shown below:

You should first expand the rows for each variant of interest by clicking on the “collapsible” icon (it looks like a greater-than sign) at the left of each row. An example is highlighted in the red box here.


Once that row is expanded, you will see that the links in the first column go to a “Report” for the variant as it appears in each curated transcript (and corresponding protein isoform). An example is highlighted in red below. These are important links, because they will take you to the detailed predictions for the calculations. Again, you should not rely on the summary alone. Click on a “Report” link to see the results page for that variant.

Report Pages

Each Report link will take you to another page that is specific to that variant in one transcript/protein isoform. The calculations can be performed multiple times, depending on how many structural models that include the variant position are available for the protein. An example Report page is shown below. Helpful features to note on this Report page are:

  • Summary information at the top, including links back to the main “case” page, the gene information at Medline+, and the protein information at UniProt.
  • A graphical view of the protein sequence, with Pfam domains highlighted, and showing the location of the variant of interest. This quickly shows whether the variant falls in an annotated domain, between domains, or near the protein termini. This is a quick view; for greater detail, you can look at the InterPro site at EMBL-EBI.
  • Rate4Site score for this position in the protein (for details, see the documentation page).
  • COSMIS score for this position in the protein (for details, see the documentation page).
  • Structure Summary table listing all structures used for the calculations. The PathProx and ddG calculations are run for each structure. There may be experimental structures from the PDB, including structures generated by X-ray crystallography, NMR solution spectroscopy, and cryo-EM techniques. There may also be models gathered from the AlphaFold2 database (the names start with “AF-”), models taken from ModBase at the Sali Lab (the names start with the Ensembl protein ID “ENSP…”), and models from the SWISS-MODEL Repository at Expasy (the names start with the UniProt ID, e.g. Q13825). Some of these structures and models contain the full-length protein, sometimes in multimeric states; others contain only domains or a fragment of the protein. The structures and models may vary widely in quality and relevance, so you will want to consider which one is best for your purpose.
  • The left margin contains shortcuts to the section of the page that contains the results for each structure listed in the Structure Summary table. Each section contains details of the PathProx and ddG calculations. To be clear: the links in the table take you to the source of each structural model, whereas the shortcuts in the left sidebar take you to the results of the calculations lower down on this page. Click on one of the shortcuts to go to that section. (Highlighted in red here.)
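
The naming conventions in the Structure Summary table can be turned into a quick lookup when you are sorting through many models. The sketch below is an illustration only, not VUStruct code: the function name is hypothetical, every example name except Q13825 is invented, and it assumes that experimental PDB entries appear under classic four-character PDB identifiers (a digit followed by three alphanumerics), which this page does not state.

```python
import re

# UniProt accession pattern as documented in the UniProt help pages.
UNIPROT_ACCESSION = re.compile(
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)
# Assumption: PDB entries are listed by classic 4-character IDs.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}(?![A-Za-z0-9])")


def guess_structure_source(name):
    """Guess where a structure or model came from, based on its name prefix."""
    if name.startswith("AF-"):
        return "AlphaFold"
    if name.startswith("ENSP"):
        return "ModBase"
    if PDB_ID.match(name):
        return "PDB"
    if UNIPROT_ACCESSION.match(name):
        return "SWISS-MODEL"
    return "unknown"


# Hypothetical names for illustration; only Q13825 appears in this tutorial.
for name in ["AF-Q13825-F1", "ENSP00000000000", "Q13825_model", "1abc_A"]:
    print(name, "->", guess_structure_source(name))
```

The prefix checks run before the regular expressions so that an AlphaFold name containing a UniProt accession is still reported as AlphaFold.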

Example section for a structural model

Each section contains results for one of the structural models. The results include:

  • PathProx calculations using variants from ClinVar as “Pathogenic”
  • PathProx calculations using variants from COSMIC as “Pathogenic”
  • a 3D display (using NGL viewer) of the variant in the structural context
  • and 14 thumbnails detailing the variant distributions and statistics for the PathProx calculations

Shown here are the results of calculations using the model from ModBase. You can see that in this case, the ddG_monomer calculation gave a very high score, >46 kcal/mol, in contrast with the ddG_cartesian calculation, which did not give a score. We launch both protocols because sometimes one of them fails to complete, due to problems such as chain breaks in the input structure. Below that, the PathProx score of 0.27 is shown under “ClinVar Results,” which quantifies the colocalization of the variant compared to distributions of pathogenic vs. benign variants (for details and references, see Calculation Details → PathProx).

Structure display

The variant is shown in the context of the structural model using NGL viewer. This provides great flexibility in displaying the data, even permitting you to save a figure. For example, use the menu option Variants → ClinVar to show the locations of the pathogenic variants. Then use Colors → ClinVar PathProx to paint the ribbons with the corresponding scores at each residue. It becomes clear why this location scored highly, especially if you contrast that with the display of Variants → GnomAD. You can save your figure with File → “Export Image”.


Thumbnails of PathProx Statistics

The 3D viewer is followed by additional images showing the distributions of variants compared to the statistics of the null distribution used by PathProx.

It can be useful to view the ROC curve and the PR curve for the PathProx calculations. The score itself can be more or less convincing, depending on the number and distribution of the pathogenic variants and the neutral variants. You can see in the PathProx ClinVar ROC below that the classification confidence is suggestive, but not overwhelmingly strong. Does that match your intuition from looking at the distributions in the NGL viewer above? For a better understanding of spatial constraint, be sure to read (Sivley et al. 2018).
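
For readers less familiar with ROC curves, the sketch below shows how such a curve and its area can be computed from a set of scores for known pathogenic and neutral variants. It is a generic textbook construction, not the PathProx implementation, and the scores and labels are invented for illustration.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low.

    labels: 1 for pathogenic, 0 for neutral. Tied scores share one point.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one pathogenic and one neutral variant")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every variant tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scores, invented for illustration (not VUStruct output).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(round(auc(roc_points(scores, labels)), 3))
```

An area near 1.0 means the scores separate pathogenic from neutral variants almost perfectly, while an area near 0.5 means they do no better than chance. With few known variants in a structure, the curve has only a few steps, which is one reason the same score can be more or less convincing from one protein to the next.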

DiGePred results

The genes you provide in your list are evaluated for the potential of digenic disease interactions. This is done using a predictor called DiGePred, a random forest classifier for identifying candidate digenic disease gene pairs by features derived from biological networks, genomics, evolutionary history, and functional annotations. Go back to the “case page” that summarizes all the results for this VUStruct run. You’ll see a summary table of the DiGePred scores. Just above that is the link to the detailed results (highlighted in red below):

The digenic predictions are displayed as a heat map. Each cell represents the potential interaction between two of the genes in the list. The matrix is symmetric, so each gene pair appears twice. The legend on the right-hand side shows the colors used for the cell borders, which highlight the inheritance information (if provided with the input data). The tabs at the bottom select subsets of the information used as support for the predicted interactions, and change the display of the heat map. Moving the mouse cursor over an individual cell displays a pop-up tip that summarizes the score and the information used to support that score.
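
The symmetric layout of the heat map can be illustrated with a small sketch that fills a square matrix from pairwise scores. This is not DiGePred or VUStruct code: the gene names, scores, and function name are hypothetical, and the diagonal is left empty on the assumption that a gene is not paired with itself.

```python
def symmetric_matrix(genes, pair_scores):
    """Build a square matrix in which cell [i][j] equals cell [j][i].

    pair_scores maps an unordered gene pair, given once as (a, b), to a score.
    Cells with no score (including the diagonal) stay None.
    """
    index = {gene: i for i, gene in enumerate(genes)}
    n = len(genes)
    matrix = [[None] * n for _ in range(n)]
    for (a, b), score in pair_scores.items():
        i, j = index[a], index[b]
        matrix[i][j] = score
        matrix[j][i] = score  # mirror the value across the diagonal
    return matrix


# Hypothetical genes and scores, invented for illustration.
genes = ["GENE_A", "GENE_B", "GENE_C"]
pair_scores = {
    ("GENE_A", "GENE_B"): 0.12,
    ("GENE_A", "GENE_C"): 0.45,
    ("GENE_B", "GENE_C"): 0.03,
}
for gene, row in zip(genes, symmetric_matrix(genes, pair_scores)):
    print(gene, row)
```

Because each pair is mirrored, reading only the cells above (or below) the diagonal covers every gene pair once.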


Summary

Some of the results of these calculations are effectively orthogonal to the predictors most commonly used. We have found them useful for prioritizing candidates for additional investigation that are not clearly implicated by other methods. Summarizing the data is not trivial, so we create a spreadsheet highlighting our conclusions.