MtDNA haplotypes can be queried in two different input formats, either as differences to the rCRS or as a FASTA-like sequence string.
differences to the rCRS
Sequence is entered as differences to the revised Cambridge Reference Sequence (rCRS, Andrews et al 1999).
FASTA/sequence string
Sequence is entered as a consecutive string of bases (e.g. copied and pasted from a text file or a consensus from sequence analysis software). Please do not enter header information as in FASTA format; enter nucleotides only.
In either case the query haplotype is internally converted into a consecutive string of nucleotides (FASTA-like format) and compared to the database sequences, which are also converted to strings.
Sample Info
The sample-specific information identifies a search. This is also the reference under which the query is reported. In addition the query is given a unique result identification number (RID, see search output).
Sequence range
It is important to specify the sequence range of your query haplotype.
EMPOP holds mtDNA sequences with a minimum range of HVS-I (16024-16365). Some data are represented by HVS-I and HVS-II (73-340), more recent data may be present as entire control region (CR) information (16024-576; see overview for details). In addition, individual sequence ranges including single positions can be defined.
mtDNA profile
Profile
MtDNA sequence data can be entered
as differences to rCRS: specify the mutations separated by blanks following the standard forensic annotation, e.g. 263G 309.1C 309.2C. Deletions can be annotated as 16193DEL or 16193del or 16193-. 16193D is considered a mixture of A, G, and T (IUB code).
or in string format (FASTA-like sequence without header information): you can copy&paste sequence stretches from analysis software or text strings. Each sequence string requires the respective sequence range.
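The notation for differences to the rCRS can be illustrated with a short Python sketch. It is not EMPOP's own parser; the names parse_profile and Difference are hypothetical, and the sketch only assumes the token shapes listed above (263G, 309.1C, 16193DEL, 16193del, 16193-).

```python
import re
from typing import NamedTuple

# One difference token: position, optional ".n" insertion index,
# then DEL / - for a deletion or a single IUB letter.
_TOKEN = re.compile(r"^(\d+)(?:\.(\d+))?(DEL|-|[A-Z])$", re.IGNORECASE)


class Difference(NamedTuple):
    position: int   # position relative to the rCRS
    insertion: int  # 0 for substitutions and deletions, 1, 2, ... for insertions
    base: str       # upper-case IUB code, or "-" for a deletion


def parse_profile(text: str) -> list[Difference]:
    """Split a blank-separated profile into structured differences."""
    differences = []
    for token in text.split():
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"cannot parse {token!r}")
        position, insertion, base = match.groups()
        base = base.upper()
        if base == "DEL":
            base = "-"  # normalise all deletion spellings
        differences.append(Difference(int(position), int(insertion or 0), base))
    return differences


print(parse_profile("263G 309.1C 309.2C 16193del"))
```

Note that 16193D is parsed as the IUB mixture code D and not as a deletion, in line with the rule above.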
Sequence range
e.g. 16024-16365, 16024-576 (full control region) or Control Region SNPs, such as 16519. Insertions after the last position of the sequence range are considered out of range, e.g. 455.1C is considered lying outside range 450-455. Extend the sequence range to the subsequent position to include insertions, e.g. 450-456.
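The rule for insertions at the end of a range can be sketched in Python as follows. This is only an illustration under stated assumptions: parse_ranges and in_range are hypothetical helper names, ranges are assumed to be written as in the examples above, and a range such as 16024-576 is assumed to wrap around the end of the 16569 bp rCRS.

```python
import re

MT_LENGTH = 16569  # length of the rCRS


def parse_ranges(text):
    """Expand e.g. '16024-576' or '16024-16365 73-340 16519' into a set of positions."""
    positions = set()
    for part in text.replace(",", " ").split():
        start, _, end = part.partition("-")
        start = int(start)
        end = int(end) if end else start  # single position such as 16519
        if start <= end:
            positions.update(range(start, end + 1))
        else:  # range wraps around the origin, e.g. 16024-576
            positions.update(range(start, MT_LENGTH + 1))
            positions.update(range(1, end + 1))
    return positions


def in_range(token, positions):
    """Check one difference; an insertion also needs the subsequent position in range."""
    match = re.match(r"(\d+)(\.\d+)?", token)
    if match is None:
        raise ValueError(f"cannot parse {token!r}")
    position = int(match.group(1))
    if position not in positions:
        return False
    if match.group(2) is None:  # substitution or deletion
        return True
    following = position % MT_LENGTH + 1
    return following in positions


print(in_range("455.1C", parse_ranges("450-455")))  # False
print(in_range("455.1C", parse_ranges("450-456")))  # True
```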
Options
Match type: Select how the bases of the query profile and the database profiles are matched.
Pattern match: a mixture designation matches its individual components (Y={C,T,Y}).
Example: 152Y matches 152T and 152C.
Literal match: a mixture designation is treated as distinct from all other nucleotide designations (Y={Y}).
Example: 152Y matches only 152Y.
This is helpful for determining the observed heteroplasmic mutations in EMPOP.
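The difference between the two match types can be sketched in Python. The function name bases_match is hypothetical, and the sketch makes one assumption beyond the text above: two different mixture codes (e.g. Y and N) are treated as non-matching, because the description only states that a mixture matches its individual components.

```python
# IUB codes and the bases they stand for
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def bases_match(query, database, pattern=True):
    """Compare two base designations at the same position."""
    query, database = query.upper(), database.upper()
    if query == database:
        return True
    if not pattern:  # literal match: only identical symbols match
        return False
    # pattern match: a mixture matches each of its unambiguous components
    query_is_base = len(IUB[query]) == 1
    database_is_base = len(IUB[database]) == 1
    return (database_is_base and database in IUB[query]) or (
        query_is_base and query in IUB[database]
    )


print(bases_match("Y", "T"))                 # True  (pattern match)
print(bases_match("Y", "T", pattern=False))  # False (literal match)
```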
Number of differences displayed
Select the maximum number of differences between the query profile and the database profiles (range 1 - 5). Database profiles with more than the selected number of differences are not shown individually in the results.
Default value is 5 (3 for unregistered users).
Disregard InDels in length variants at positions 16193, 309, 455, 573
Length variants that are known hotspots for indels can be ignored in a search. This involves the C-runs around positions 16193, 309 and 573 and the T-run around position 455 relative to the rCRS. You can select these positions individually in order to exclude mutations there from a search. To optimize the run time of a search the number of ignored mutations has been limited to 5 per queried region (e.g. HVS-I).
Note that the uninterrupted C-run around position 16189 may collapse into a single C-run if the respective transition occurs without other transitions in its vicinity.
Example 1:
Consider the following profile of the EMPOP database
16189C 16193.1C 16519C 263G 315.1C
and the query profile
16189C 16519C 263G 315.1C
The database profile harbors an uninterrupted C-run in HVS-I (16189C). The length variation at 16193.1C in the database profile is ignored in a search if the hotspot 16193 was selected from the list.
Example 2:
Consider the following profile of the EMPOP database
16189C 16192T 16270T 16398A 73G 150T 263G 315.1C
and the query profile
16189C 16191.1C 16192T 16270T 16398A 73G 150T 263G 315.1C
The database profile harbors an interrupted C-run in HVS-I that is shifted with respect to rCRS (16189C 16192T). The length variation at 16191.1C in the query profile is not ignored in a search, even if hotspot 16193 was selected from the list.
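Both examples can be reproduced with a small Python sketch. It assumes, consistent with the two examples but not confirmed as EMPOP's actual rule, that a length variant is ignored only if it is annotated exactly at a selected hotspot position. The function name strip_hotspot_indels is hypothetical, and the limit of 5 ignored mutations per region is not modeled.

```python
import re


def strip_hotspot_indels(profile, selected):
    """Drop insertions and deletions annotated at the selected hotspot positions."""
    kept = []
    for token in profile.split():
        match = re.match(r"(\d+)(\.\d+)?(.+)$", token)
        if match is None:  # leave anything unparseable untouched
            kept.append(token)
            continue
        position = int(match.group(1))
        is_indel = match.group(2) is not None or match.group(3).upper() in ("DEL", "-")
        if is_indel and position in selected:
            continue  # ignored length variant
        kept.append(token)
    return kept


# Example 1: 16193.1C is ignored, so database and query profile agree
database = strip_hotspot_indels("16189C 16193.1C 16519C 263G 315.1C", {16193})
print(database == "16189C 16519C 263G 315.1C".split())  # True

# Example 2: 16191.1C is annotated at 16191 and is therefore kept
query = strip_hotspot_indels(
    "16189C 16191.1C 16192T 16270T 16398A 73G 150T 263G 315.1C", {16193}
)
print("16191.1C" in query)  # True
```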
Source
EMPOP 2 uses a new structure for the haplotypes, which better suits forensic demands. Haplotypes are grouped with respect to the
geographical affiliation, therein classified into continent, UN region, country, and locality and
population-specific affiliation, therein classified into a hierarchical metapopulation system consisting of 4 levels.
This information can be retrieved in the output of a search.
Forensic data: Haplotypes stored under "forensic data" are represented by their raw data and they are permanently linked to the database entries. The minimum requirement is full double-strand coverage of high quality sequences within the sequence range.
Literature data represents published data and data that were submitted without adequate raw lane information. Literature data have also been scrutinized before being loaded onto EMPOP.
The figures show the total number of samples contained in the respective category in the database. In the output, an additional figure indicates how many of these samples were used for the given query. This number depends on the query range from the input: only database profiles whose sequence range is equal to or greater than the query range given in the input mask are included in a search.
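The selection of database profiles by sequence range can be sketched in Python. The names expand and usable_profiles and the three toy database entries are hypothetical; the sketch assumes that "equal to or greater than" means that the database range covers every position of the query range.

```python
MT_LENGTH = 16569  # length of the rCRS


def expand(ranges):
    """Turn (start, end) pairs into a set of positions; (16024, 576) wraps around."""
    positions = set()
    for start, end in ranges:
        if start <= end:
            positions.update(range(start, end + 1))
        else:
            positions.update(range(start, MT_LENGTH + 1))
            positions.update(range(1, end + 1))
    return positions


def usable_profiles(database, query_ranges):
    """Names of the database profiles whose range covers the whole query range."""
    query = expand(query_ranges)
    return [name for name, ranges in database.items() if query <= expand(ranges)]


# Hypothetical database entries with HVS-I, HVS-I/II and full control region data
database = {
    "A": [(16024, 16365)],
    "B": [(16024, 16365), (73, 340)],
    "C": [(16024, 576)],
}
print(usable_profiles(database, [(16024, 16365), (73, 340)]))  # ['B', 'C']
```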
Output
Please note that tabs can be used to change between input and output (do not use the "BACK"-function of your browser).
The Summary tab provides
an overview of the query settings chosen.
summary statistics of the number of haplotypes found in the database, grouped by the number of differences to the queried profile. The third column displays the cumulative number of haplotypes found, thus the lower-most number indicates the number of haplotypes that met the search criteria. This table is accompanied by a graphical representation of the number of haplotypes found. For a list of individual haplotypes in a new tab either click on one of the numbers in the first column of the table or select the appropriate column in the bar chart.
Matching profiles can further be restricted to a specific geographical affiliation and/or a (sub)metapopulation. You can choose from either of the two drop-down lists independently. Only those options remain in the two lists that meet the actual restrictions, that is, any selection in one list also affects the other list. To remove a restriction choose "All" in the appropriate list.
two statistical values for a statistical description of the current search result:
The frequency estimates following two different approaches together with their confidence intervals are calculated as follows:
PUc denotes the uncorrected frequency k/n where k is the number of hits and n is the sample size.
PN+1 denotes the N+1 counting method following the formula (k+1)/(n+1) for estimating the frequency.
For each of the estimated frequencies the confidence interval is computed following the approach of Wilson 1927:
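With $c = \Phi^{-1}(1 - \alpha/2)$, where $\Phi$ denotes the standard normal distribution function and $\alpha$ is set to 0.05, we get $c = 1.96$. Let $\hat{p}$ be the estimated frequency, then
$$\left[ \frac{\hat{p} + \frac{c^2}{2n} - c\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{c^2}{4n^2}}}{1 + \frac{c^2}{n}},\ \frac{\hat{p} + \frac{c^2}{2n} + c\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{c^2}{4n^2}}}{1 + \frac{c^2}{n}} \right]$$
denotes the Wilson interval.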
Please note that the precision of these calculations is limited to 14 decimal digits. Although presented on the web page, outcomes below that range cannot be guaranteed to be correct.
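The two frequency estimates and their Wilson intervals can be reproduced with a few lines of Python. This is a sketch using the standard Wilson score interval, not EMPOP's own code; the function names are hypothetical. One further assumption, which the text above does not settle, is the sample size used in the interval for PN+1: the sketch uses n+1 there and n for PUc.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(p_hat, n, alpha=0.05):
    """Wilson (1927) score interval for an estimated frequency p_hat."""
    c = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    centre = p_hat + c * c / (2 * n)
    half_width = c * sqrt(p_hat * (1 - p_hat) / n + c * c / (4 * n * n))
    denominator = 1 + c * c / n
    return (centre - half_width) / denominator, (centre + half_width) / denominator


def frequency_estimates(k, n):
    """Uncorrected (k/n) and N+1 ((k+1)/(n+1)) estimates with their intervals."""
    p_uc = k / n
    p_n1 = (k + 1) / (n + 1)
    return {
        "PUc": (p_uc, wilson_interval(p_uc, n)),
        # assumption: n + 1 is used as the sample size for the N+1 estimate
        "PN+1": (p_n1, wilson_interval(p_n1, n + 1)),
    }


for name, (estimate, (lower, upper)) in frequency_estimates(k=3, n=1000).items():
    print(f"{name}: {estimate:.6f} [{lower:.6f}, {upper:.6f}]")
```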
To change the query activate the tab labeled "Input", change the input options and press SEARCH again.
MtDNA data tables can be depicted as quasi-median networks to enhance the understanding of the data in regard to homoplasy and potential artifacts. Highly recurrent mutations are removed from the dataset (filtering) to help detect data idiosyncrasies that pinpoint sequencing and data interpretation problems. A detailed discussion of the method can be found in Bandelt and Dür (2007) and its application in Parson and Dür (2007).
The following section leads you through
the input and parameter selection of a network analysis
the output generated by NETWORK
network drawing and
interpretation of the results.
Input
Sample Info
The sample-specific information identifies a search. This is also the reference under which the query is reported. In addition the query is given a unique result identification number (RID, see search output).
Input file (=emp file)
The input file contains the annotated population data. The emp-file is a tab-delimited text file that can be created using standard text software or MS Excel (in that case, save the file in .txt format and change the extension from "txt" to "emp"). Its format needs to follow this example:
# Population data of 250 individuals from Austria; Walther Parson (walther.parson@i-med.ac.at)
# 100 samples from Innsbruck, 100 samples from Salzburg, 50 samples from Vienna
#! 16024-576
haplotype1 H1c 1 16519C 263G 523DEL 524DEL 477C
haplotype2 R0 2+
#! 16024-16365 73-340
haplotype4 T2b 1 16126C 16294T 16296T 16304C 73G 263G 315.1C
haplotype5 ? 1 16223T 73G 263G 315.1C
+ Note that haplotypes can be used in the condensed form for NETWORK and other EMPOP tools (while they need to be reported as individual haplotypes when applying for EMPOP accession numbers)
Structure of the emp-file
Lines starting with "#!" indicate the sequence range of the haplotypes that follow. A given sequence range applies to all subsequent mtDNA haplotypes until a new range is defined. Thus, multiple haplotypes with different sequence ranges can be handled in one file. The network is drawn from the input data trimmed to the greatest common range.
The file lists the haplotypes in columns with the following contents.
Column A: Sequence name: don't use blank space or special characters (allowed characters are letters (except umlauts ä, ö, ü), numbers, "-", "_", "/")
Column B: Haplogroup (hg) status: indicate hg, if unknown, use "?"
Column C: Frequency of haplotype (0 - 9999). If it is set to 0 the sample is not considered for the analysis.
Column D: Annotation of the haplotype relative to the rCRS. Separate differences by tabs (or use individual cells in MS Excel). Use forensic notation of sequences as outlined in the ISFG recommendations for mtDNA typing (Carracedo et al (2000)), e.g. 16311C, 249DEL (also 249Del, 249del, 249-), 315.1C, 524.1A 524.2C, etc.
Text lines can be included everywhere in the file for comments or description. They need to be marked with "#".
Avoid blank lines (except when marked with "#").
A sample datafile AUT273_spec.emp and further information on preparing emp files can be downloaded from the EMPOP download page.
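The structure of the emp-file described above can be read with a short Python sketch. It is not the EMPOP validator; parse_emp is a hypothetical name, character checks on the sequence name are omitted, and blank lines are silently skipped here although they should be avoided in real files.

```python
def parse_emp(text):
    """Read tab-delimited emp content into a list of haplotype records."""
    current_range = None
    records = []
    for line in text.splitlines():
        if line.startswith("#!"):      # new sequence range for the following haplotypes
            current_range = line[2:].split()
        elif line.startswith("#") or not line.strip():
            continue                   # comment (or blank) line
        else:
            fields = line.split("\t")
            name, haplogroup, frequency = fields[0], fields[1], int(fields[2])
            if frequency == 0:         # frequency 0: sample is not considered
                continue
            records.append({
                "name": name,
                "haplogroup": haplogroup,
                "frequency": frequency,
                "range": current_range,
                "differences": [field for field in fields[3:] if field],
            })
    return records


sample = (
    "# comment line\n"
    "#! 16024-576\n"
    "haplotype1\tH1c\t1\t16519C\t263G\t523DEL\t524DEL\t477C\n"
    "haplotype2\tR0\t2\n"
    "#! 16024-16365 73-340\n"
    "haplotype5\t?\t1\t16223T\t73G\t263G\t315.1C\n"
)
for record in parse_emp(sample):
    print(record["name"], record["range"], record["differences"])
```

A haplotype without any entries in column D, such as haplotype2, is read as identical to the rCRS within its range.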
Ambiguous symbols
The software accepts the IUB-code. However, ambiguous symbols (e.g. sequence heteroplasmy Y ~ C/T) can cause artificial nodes and links in the network. Therefore it is necessary to specify a non-ambiguous symbol either by calling the dominant type or by using the phylogenetic background of the sample. You will be notified of the presence of ambiguous symbols on the screen and in the network analysis report. For your information you also get a list of new insertions in your data set that are not known to the current EMPOP database.
New insertions
We collect positions with observed insertions in an EMPOP datafile to which new data are compared. New insertions that have not yet been recorded in EMPOP are displayed to draw attention to them. This, however, does not affect the performance of NETWORK.
Filtering
Highly recurrent mutations, which would otherwise increase the complexity of the network, are removed from the data set (filtering). You can choose between different filters depending on the application. The contents of the filters can be viewed by clicking on the symbol next to the dropdown box.
Available filters:
EMPOPspeedy: This filter removes highly recurrent mutations based on the lists provided in Bandelt et al (2002 and 2006). This filter is typically used for the analysis of mtDNA population data within the hypervariable segments - HVS-I (16024 - 16569) and HVS-II (1 - 576).
EMPOPspeedyWE: This filter removes highly recurrent mutations as presented in Zimmermann et al (2010). This filter is typically used for the analysis of west Eurasian mtDNA population data within the hypervariable segments - HVS-I (16024 - 16569) and HVS-II (1 - 576).
EMPOPall: This is a superfine filter that contains all mutations observed in EMPOP. This filter provides a very quick check on the data by highlighting only mutations that have not yet been observed. We update the EMPOPall filter periodically.
Unfiltered: None of the mutations are removed from your dataset. This is useful for the analysis of very short sequence stretches in the mtDNA CR (see below). The complexity of the network will increase rapidly if no filter is applied to the analysis of larger sequence regions.
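The effect of a filter can be sketched in Python. The sketch assumes that a filter is a list of individual mutations and that filtering removes exactly those mutations from every haplotype; apply_filter is a hypothetical name, and the filter contents below are placeholders for illustration, not the contents of any EMPOP filter.

```python
def apply_filter(haplotypes, filter_mutations):
    """Remove all mutations contained in the filter from every haplotype."""
    return {
        name: [mutation for mutation in differences if mutation not in filter_mutations]
        for name, differences in haplotypes.items()
    }


# Placeholder filter and data, for illustration only
example_filter = {"16519C", "315.1C"}
haplotypes = {
    "haplotype1": ["16519C", "263G", "315.1C"],
    "haplotype5": ["16223T", "73G", "263G", "315.1C"],
}
print(apply_filter(haplotypes, example_filter))

# "Unfiltered" corresponds to an empty filter: nothing is removed
print(apply_filter(haplotypes, set()) == haplotypes)  # True
```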
Range
The range determines the region for which the network is computed. Any range within 16024-16569 and 1-576 can be queried. In some data very small regions may be interesting for detailed network analysis (e.g. 450-460).
PROCEED starts the execution.
Output
The summary tab is displayed when execution is finished. A link to the archive file (zip format) is presented. Download the file and unzip it to obtain the folder [RID_FILTERNAME_REGION], which contains the following files:
Results file [FILENAME_FILTER_REGION_report.txt]: This file summarizes the settings and the results of the network analysis - for details see chapter Interpretation.
File for drawing the network [FILENAME_FILTERNAME_REGION_network.dnw]: This file can be used to draw the entire network of the mtDNA datafile by dnw.exe.
File for drawing the torso [FILENAME_FILTERNAME_REGION_torso.dnw]: This file can be used to draw the torso of the network of the mtDNA datafile by dnw.exe.
Difference table of the network [FILENAME_FILTERNAME_REGION_network.txt]: This file contains the filtered and reduced haplotypes of the entire network, displayed in dot table format.
Difference table of the torso [FILENAME_FILTERNAME_REGION_torso.txt]: This file contains the filtered and reduced haplotypes of the torso of the network, displayed in dot table format.
EMP-file
Execute the file and follow the instructions given by the software. Choose a destination folder where the software is to be installed.
Once the installation is finished you can find a folder called DNW containing the software and an uninstaller in the start menu. Files with the extension ".dnw" are automatically linked to the software. Double-clicking a dnw file opens the network in a separate window. The dialog box presents a legend of keys to edit the network (e.g. t ... for drawing a draft of the network, l ... for adding labels, etc.). During execution the current drawing can be exported in EPS (Encapsulated PostScript) or FIG format for printing or editing.
Interpretation
The Report.txt file summarizes relevant information of the network analysis. The network is described in a table by the number of samples (n), the number of polymorphic positions (p), the number of partitions or condensed characters (p’), the number of haplotypes (h), the number of nodes in the network (q), the number of nodes in the torso (t) and the number of nodes of the peeled torso (t’). These values are indicative of the quality of a network. However, they depend on the size and composition of the population data set in question. Generally, small t’-values (ideally 1) describe a star-like structure of the network, which is in agreement with the expected evolutionary pattern.
A more suggestive representation of the data is the graph of the quasi-median network. The nodes of this graph are given by the haplotypes or the quasi-medians generated from the haplotypes. In the drawing the frequencies of the haplotypes or quasi-medians are also shown. The root node is drawn with a bold circle and contains the filtered and reduced Anderson sequence (In the rare case that no haplotype contains the filtered and reduced Anderson sequence, the first haplotype is chosen instead and a warning is included in the report). The links are single or combined mutations specified by the syntax for single mutations or / for combined mutations, where the orientation is from the root node outwards. Links with the same mutation are drawn parallel and are labeled only once. The torso is obtained from the quasi-median network by collapsing all pendant subtrees into their base nodes. Thus the analysis of homoplasy can be restricted to the torso which contains all the reticulation of the network. For each base node the coinciding haplotypes are listed in the report to make it easy to find all corresponding samples.
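The idea of collapsing pendant subtrees can be illustrated with a small Python sketch that repeatedly removes nodes with at most one link. This is a simplified illustration on a plain undirected graph, not the NETWORK implementation: the function name torso is hypothetical, haplotype frequencies and the list of coinciding haplotypes per base node are not tracked, and for a network without any reticulation the sketch simply keeps one arbitrary node.

```python
def torso(edges):
    """Return the nodes that remain after collapsing all pendant subtrees."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    changed = True
    while changed:
        changed = False
        for node in list(adjacency):
            # a node with at most one link is the tip of a pendant subtree
            if len(adjacency) > 1 and len(adjacency[node]) <= 1:
                for neighbour in adjacency.pop(node):
                    adjacency[neighbour].discard(node)
                changed = True
    return set(adjacency)


# A 4-cycle (reticulation) with a pendant path a - x - y attached to node a
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "x"), ("x", "y")]
print(sorted(torso(edges)))  # ['a', 'b', 'c', 'd']
```

In this toy example node a is the base node into which the pendant subtree x, y is collapsed.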
Further reading
Bandelt HJ et al (2002) The fingerprint of phantom mutations in mitochondrial DNA data. Am J Hum Genet 71:1150-1160
Bandelt HJ et al (2006) Estimation of mutation rates and coalescence times: some caveats. In: Bandelt HJ, Macaulay V, Richards M (eds) Human mitochondrial DNA and the evolution of Homo sapiens. Springer-Verlag
Bandelt and Dür (2007) Translating DNA data tables into quasi-median networks for parsimony analysis and error detection. Mol Phylogenet Evol 42:256-271
Schwarz and Dür (2011) Visualization of quasi-median networks. Discrete Applied Mathematics 159(15):1608-1616
Input
EMP Tool
Please upload your EMP file. The file is checked for correct format, missing reading frame(s), clerical errors, and InDels that are not known to the current EMPOP database. If the format is valid, the software looks for irregular InDels and sorts the haplotypes according to the criteria selected.
Input file (=emp file)
The input file contains the annotated population data. The emp-file is a tab-delimited text file that can be created using standard text software or MS Excel (in that case, save the file in .txt format and change the extension from "txt" to "emp"). Its format needs to follow this example:
# Population data of 250 individuals from Austria; Walther Parson (walther.parson@i-med.ac.at)
# 100 samples from Innsbruck, 100 samples from Salzburg, 50 samples from Vienna
#! 16024-576
haplotype1 H1c 1 16519C 263G 523DEL 524DEL 477C
haplotype2 R0 2+
#! 16024-16365 73-340
haplotype4 T2b 1 16126C 16294T 16296T 16304C 73G 263G 315.1C
haplotype5 ? 1 16223T 73G 263G 315.1C
+ Note that haplotypes can be used in the condensed form for NETWORK and other EMPOP tools (while they need to be reported as individual haplotypes when applying for EMPOP accession numbers)
Structure of the emp-file
Lines starting with "#!" indicate the sequence range of the haplotypes that follow. A given sequence range applies to all subsequent mtDNA haplotypes until a new range is defined. Thus, multiple haplotypes with different sequence ranges can be handled in one file. The network is drawn from the input data trimmed to the greatest common range.
The file lists the haplotypes in columns with the following contents.
Column A: Sequence name: don't use blank space or special characters (allowed characters are letters (except umlauts ä, ö, ü), numbers, "-", "_", "/")
Column B: Haplogroup (hg) status: indicate hg, if unknown, use "?"
Column C: Frequency of haplotype (0 - 9999). If it is set to 0 the sample is not considered for the analysis.
Column D: Annotation of the haplotype relative to the rCRS. Separate differences by tabs (or use individual cells in MS Excel). Use forensic notation of sequences as outlined in the ISFG recommendations for mtDNA typing (Carracedo et al (2000)), e.g. 16311C, 249DEL (also 249Del, 249del, 249-), 315.1C, 524.1A 524.2C, etc.
Text lines can be included everywhere in the file for comments or description. They need to be marked with "#".
Avoid blank lines (except when marked with "#").
A sample datafile AUT273_spec.emp and further information on preparing emp files can be downloaded from the EMPOP download page.
Sort criteria
In case your EMP file is valid, haplotypes are grouped by reading frames and within these groups sorted either by