the output
the zip file
At the end of the script run, a compressed zip file is generated. It contains a folder with the same name as the zip file, which should be "gdf_timestamp". In it you will find text files containing the screen output (_out.txt) and, if applicable, the failed genotypes (_fails.txt) and third allele information (_overdata.txt). The zip file also contains folders with the input for the required programs (each folder is named after the program it applies to).
A file listing inconsistent information found in the input file (_skews.txt) may also be present. The same genotype is expected for the same sample whenever the same snp is tested, so if a different genotype appears (in duplicates on the same plate, on different plates, or even on different platforms) it is recorded for further study as a possible error (in our experience, in most cases this situation is due to a software miscall).
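The consistency check described above can be sketched as follows. This is only an illustration, not GDF's own code: the `(sample, snp, genotype)` tuple layout and the `find_skews` name are assumptions, and so is the choice to treat "AG" and "GA" as the same genotype.

```python
from collections import defaultdict


def find_skews(records):
    """Return {(sample, snp): [distinct genotypes]} for conflicting pairs.

    `records` is an iterable of (sample, snp, genotype) tuples. This layout
    is hypothetical and is not GDF's actual input format.
    """
    seen = defaultdict(set)
    for sample, snp, genotype in records:
        # Assumption: allele order is irrelevant, so "GA" is stored as "AG".
        seen[(sample, snp)].add("".join(sorted(genotype)))
    # A pair is a skew only when more than one distinct genotype was seen.
    return {key: sorted(calls) for key, calls in seen.items() if len(calls) > 1}


records = [
    ("S1", "rs1", "AG"),
    ("S1", "rs1", "GA"),  # same call with the alleles in a different order
    ("S1", "rs2", "CC"),
    ("S1", "rs2", "CT"),  # conflicting duplicate: reported as a skew
]
skews = find_skews(records)
```

Here only the ("S1", "rs2") pair is reported, because its two duplicates disagree.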
A file containing a crossmatch table of samples and snps (_snps.txt) allows visual review of the genotypes. Manually edited genotypes are marked with a "*" sign and, if data from more than one platform is entered in the input file, each genotype is identified by a platform code (Sequenom=1, SNPlex=2, Illumina=3, Taqman=4). Blank fields stand for no snp-sample pair, "-" signs stand for failed genotypes, and skews are represented by "??".
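As a reading aid, the cell conventions above can be expressed as a small lookup. This sketch assumes a cell has already been extracted from the table as a string. The `classify_cell` name and the returned labels are hypothetical, and since the position of the platform code inside a cell is not specified here, the sketch does not try to parse it.

```python
# Platform codes as listed in the text.
PLATFORMS = {1: "Sequenom", 2: "SNPlex", 3: "Illumina", 4: "Taqman"}


def classify_cell(cell):
    """Classify one crossmatch-table cell (labels are illustrative only)."""
    text = cell.strip()
    if not text:
        return "no snp-sample pair"
    if text == "-":
        return "failed genotype"
    if text == "??":
        return "skew"
    if "*" in text:
        # Assumption: the "*" mark appears somewhere inside the cell text.
        return "manually edited genotype"
    return "genotype"
```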
Statistics on relevant snp features, such as individual counts and percentages, minor allele frequencies, heterozygosities and deviation from Hardy-Weinberg equilibrium, are written to another file (_snpsStats.txt). A similar file for samples (_sampleStats.txt) contains success and failure rates.
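For readers unfamiliar with these quantities, the sketch below computes them from the three genotype counts of a biallelic snp using the standard textbook definitions. It is not taken from GDF: the `snp_stats` name and the returned keys are hypothetical, and GDF may measure the Hardy-Weinberg deviation with a different test than the chi-square statistic used here.

```python
def snp_stats(n_aa, n_ab, n_bb):
    """Textbook statistics for a biallelic snp from its genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no successful genotypes for this snp")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    # Chi-square statistic (1 degree of freedom); zero-expectation terms skipped.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return {
        "maf": min(p, q),      # minor allele frequency
        "het_obs": n_ab / n,   # observed heterozygosity
        "het_exp": 2 * p * q,  # expected heterozygosity
        "chi2": chi2,
    }


stats = snp_stats(30, 40, 30)
```

With 30, 40 and 30 genotypes, both allele frequencies are 0.5, the expected counts are 25, 50 and 25, and the chi-square statistic is 4.0.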
Input lines that were not processed because of their content or their format are optionally printed (_unprocessed.txt and _wrongFormat.txt), allowing the user to check the unprocessed lines and to correct the input format.
the screen output
The report that GDF provides at the end of each run is divided into sections. The first one is an inventory of the files used and the files generated. A series of sections may then appear if applicable: unused genes (genes present in the configuration file with all their snps untested), unknown snps (snps present in the data file but not in the configuration file), untested snps (snps present in the configuration file but not in the data file), failed snps (snps that failed in all the genotyped samples), failed samples (samples with no successful genotype) and no pedigree samples (samples with no pedigree information in the pedigree file, if one is used).
There are two major error checks implemented in GDF. The first one deals with snps that were not genotyped in a sample but were tested in the rest (the sample will appear in the "Unperformed tests samples" section), and the other one deals with samples that carry a third allele (the sample will then be recorded in the "Overlapping information samples" section). The results of both sections are presented in the following format:
sample ( gene - untestedsnp1 untestedsnp2 ... )
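If these report lines need to be processed further, they can be split with a short parser such as the one below. This is a sketch that assumes one gene per line and single spaces around the "-" separator, exactly as in the format shown. The `parse_entry` name is hypothetical.

```python
import re

# Matches "sample ( ... )" and captures the text between the parentheses.
ENTRY = re.compile(r"^\s*(?P<sample>\S+)\s*\(\s*(?P<body>.*?)\s*\)\s*$")


def parse_entry(line):
    """Return (sample, gene, [snps]) or None if the line does not match."""
    match = ENTRY.match(line)
    if match is None:
        return None
    # Split on " - " (with spaces) so hyphenated gene names stay intact.
    gene, separator, snps = match.group("body").partition(" - ")
    if not separator:
        return None
    return match.group("sample"), gene.strip(), snps.split()


entry = parse_entry("S01 ( BRCA1 - rs1 rs2 )")
```

Lines that do not follow the format, such as section titles, return None and can simply be skipped.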
There is a statistics section at the bottom of the report where relevant facts about the data in the input file are highlighted. It shows the percentages of the different genotyping groups into which the genotypes found may be classified, and the count of lines from the input file. Note that in this last count the "unprocessed data lines" that may appear refer to lines containing snps not present in the configuration file, lines whose sample name is "LADDER" or "NTC" (removed because they are not "real" samples), and omitted genotypes.