Frequently Asked Questions: Data File Formats
BED format
BED format provides a flexible way to define the data lines that
are displayed in an annotation track. BED lines have three required fields and
nine additional optional fields. The number of fields per line must be consistent
throughout any single set of data in an annotation track.
The first three required BED fields are:
- chrom - The name of the chromosome (e.g. chr3, chrY,
chr2_random) or scaffold (e.g. scaffold10671).
- chromStart - The starting position of the feature in the
chromosome or scaffold. The first base in a chromosome is numbered 0.
- chromEnd - The ending position of the feature in the
chromosome or scaffold. The chromEnd base is not included in the
display of the feature. For example, the first 100 bases of a
chromosome are defined as chromStart=0, chromEnd=100, and span
the bases numbered 0-99.
The 9 additional optional BED fields are:
- name - Defines the name of the BED line. This label is
displayed to the left of the BED line in the Genome Browser
window when the track is open to full display mode or directly to the
left of the item in pack mode.
- score - A score between 0 and 1000. If the track line
useScore attribute is set to 1 for this annotation data set, the
score value will determine the level of gray in which
this feature is displayed (higher numbers = darker gray).
- strand - Defines the strand - either '+' or '-'.
- thickStart - The starting position at which the feature
is drawn thickly (for example, the start codon in gene
displays).
- thickEnd - The ending position at which the feature is
drawn thickly (for example, the stop codon in gene displays).
- reserved - This should always be set to zero.
- blockCount - The number of blocks (exons) in the
BED line.
- blockSizes - A comma-separated list of the block
sizes. The number of items in this list should correspond to
blockCount.
- blockStarts - A comma-separated list of block starts.
All of the blockStart positions should be calculated relative to
chromStart. The number of items in
this list should correspond to blockCount.
Example:
Here's an example of an annotation track that uses a complete BED definition:
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
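As an illustration of how blockSizes and blockStarts combine with chromStart, the following minimal Python sketch computes absolute block coordinates for a 12-field BED line. The function name `bed_blocks` is hypothetical, and the sketch assumes whitespace-separated fields and tolerates the trailing comma seen in the example above.

```python
def bed_blocks(line):
    """Return absolute (start, end) pairs for each block of a 12-field BED line."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError("expected 12 BED fields, got %d" % len(fields))
    chrom_start = int(fields[1])
    block_count = int(fields[9])
    # The lists may carry a trailing comma, so drop empty items.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if len(sizes) != block_count or len(starts) != block_count:
        raise ValueError("blockSizes/blockStarts do not match blockCount")
    # blockStarts are relative to chromStart; each interval is half-open,
    # so the end position is not part of the block.
    return [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]


line = "chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512"
print(bed_blocks(line))  # [(1000, 1567), (4512, 5000)]
```

Note that the last block ends exactly at chromEnd (5000), which is a useful sanity check when writing BED lines by hand.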
PSL format
PSL lines represent alignments, and are typically taken from files generated by BLAT
or psLayout. See the BLAT documentation for more details. All of the following fields are
required on each data line within a PSL file:
- matches - Number of bases that match that aren't repeats
- misMatches - Number of bases that don't match
- repMatches - Number of bases that match but are part of repeats
- nCount - Number of 'N' bases
- qNumInsert - Number of inserts in query
- qBaseInsert - Number of bases inserted in query
- tNumInsert - Number of inserts in target
- tBaseInsert - Number of bases inserted in target
- strand - '+' or '-' for query strand. For translated alignments, the second '+' or '-' is for the genomic strand
- qName - Query sequence name
- qSize - Query sequence size
- qStart - Alignment start position in query
- qEnd - Alignment end position in query
- tName - Target sequence name
- tSize - Target sequence size
- tStart - Alignment start position in target
- tEnd - Alignment end position in target
- blockCount - Number of blocks in the alignment (a block contains no gaps)
- blockSizes - Comma-separated list of sizes of each block
- qStarts - Comma-separated list of starting positions of each block in query
- tStarts - Comma-separated list of starting positions of each block in target
Example:
Here is an example of an annotation track in PSL format. Note that line
breaks have been inserted into the PSL lines in this example for
documentation display purposes.
Click
here for a copy of this example that can be pasted into the browser without editing.
track name=fishBlats description="Fish BLAT" useScore=1
59 9 0 0 1 823 1 96 +- FS_CONTIG_48080_1 1955 171 1062 chr22
47748585 13073589 13073753 2 48,20, 171,1042, 34674832,34674976,
59 7 0 0 1 55 1 55 +- FS_CONTIG_26780_1 2825 2456 2577 chr22
47748585 13073626 13073747 2 21,45, 2456,2532, 34674838,34674914,
59 7 0 0 1 55 1 55 -+ FS_CONTIG_26780_1 2825 2455 2676 chr22
47748585 13073727 13073848 2 45,21, 249,349, 13073727,13073827,
Be aware that the coordinates for a negative strand in a PSL line are handled in a special way. In the
qStart and qEnd fields, the coordinates indicate the position where the query matches from
the point of view of the forward strand, even when the match is on the reverse strand.
However, in the qStarts list, the coordinates are reversed.
Example:
Here is a 31-base query (positions 0 through 30) containing 2 blocks that align on the minus strand and
2 blocks that align on the plus strand (this can sometimes happen as a result of
assembly errors):
0         1         2         3  tens position in query
0123456789012345678901234567890  ones position in query
            ++++          +++++  plus strand alignment on query
    --------    ----------       minus strand alignment on query
Plus strand:
qStart=12
qEnd=31
blockSizes=4,5
qStarts=12,26
Minus strand:
qStart=4
qEnd=26
blockSizes=10,8
qStarts=5,19
Essentially, the minus strand blockSizes and qStarts are
what you would get if you reverse-complemented the query.
However, the qStart and qEnd are not reversed. To convert one to the other:
qStart = qSize - revQEnd
qEnd = qSize - revQStart
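The conversion can be sketched in Python as follows. This is an illustration only: the function name `forward_query_blocks` is hypothetical, and the qSize of 31 is an assumption inferred from the ruler in the example (positions 0 through 30), not a value stated in the text.

```python
def forward_query_blocks(q_size, block_sizes, q_starts, strand):
    """Return (start, end) pairs for each block on the forward query strand."""
    blocks = []
    for size, start in zip(block_sizes, q_starts):
        if strand == "-":
            # qStarts refer to the reverse-complemented query, so both
            # ends of the block are flipped around qSize.
            blocks.append((q_size - (start + size), q_size - start))
        else:
            blocks.append((start, start + size))
    return blocks


# Minus strand values from the example; qSize=31 is an assumption.
minus = forward_query_blocks(31, [10, 8], [5, 19], "-")
print(minus)  # [(16, 26), (4, 12)]
# The smallest start and largest end reproduce qStart=4 and qEnd=26.
print(min(s for s, _ in minus), max(e for _, e in minus))  # 4 26
```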
GFF format
GFF (General Feature Format) lines are based on the GFF standard file format. GFF
lines have nine required fields that must be tab-separated. If the fields are
separated by spaces instead of tabs, the track will not display correctly. For more
information on GFF format, refer to http://www.sanger.ac.uk/Software/formats/GFF.
Here is a brief description of the GFF fields:
- seqname - The name of the sequence. Must be a chromosome or scaffold.
- source - The program that generated this feature.
- feature - The name of this type of feature. Some examples of
standard feature types are "CDS", "start_codon", "stop_codon", and
"exon".
- start - The starting position of the feature in the sequence. The first base is numbered 1.
- end - The ending position of the feature (inclusive).
- score - A score between 0 and 1000. If the track line
useScore attribute is set to 1 for this annotation data set, the
score value will determine the level of gray in which
this feature is displayed (higher numbers = darker gray). If there is no
score value, enter ".".
- strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
- frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the
first base. If the feature is not a coding exon, the value should be '.'.
- group - All lines with the same group are linked together into a single item.
Example:
Here's an example of a GFF-based track.
Click here for a copy of this example that can be
pasted into the browser without editing. NOTE: Paste operations on some
operating systems will replace tabs with spaces, which will result in an error
when the GFF track is uploaded. You can circumvent this problem by pasting the
URL of the above example
(http://genome.ucsc.edu/goldenPath/help/regulatory.txt) instead of the text
itself into the custom annotation track text box.
track name=regulatory description="TeleGene(tm) Regulatory Regions"
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2
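Because GFF positions are 1-based with an inclusive end, while BED positions are 0-based with an exclusive end, converting between the two is a common source of off-by-one errors. The sketch below, using a hypothetical helper named `gff_to_bed_interval`, shows one way to do the conversion and to reject lines that are not tab-separated.

```python
def gff_to_bed_interval(line):
    """Return (seqname, bed_start, bed_end) for one GFF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        # Space-separated lines end up as a single field and are rejected.
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    seqname, start, end = fields[0], int(fields[3]), int(fields[4])
    # GFF: first base is 1, end is inclusive.
    # BED: first base is 0, end is exclusive.
    return seqname, start - 1, end


gff = "chr22\tTeleGene\tenhancer\t1000000\t1001000\t500\t+\t.\ttouch1"
print(gff_to_bed_interval(gff))  # ('chr22', 999999, 1001000)
```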
GTF format
MAF format
The multiple alignment format stores a series of
multiple alignments in a format that is easy to
parse and relatively easy to read. This format
stores multiple alignments at the DNA level
between entire genomes. The previously existing formats
are suitable for multiple alignments of single proteins or
regions of DNA without rearrangements, and would require
considerable extension to cope with genomic issues such as
forward and reverse strand directions, multiple pieces
to the alignment, and so forth.
General Structure
The .maf format is
line-oriented. Each multiple alignment ends with a
blank line. Each sequence in an alignment is on a single
line, which can get quite long, but there is no length limit.
Words in a line are delimited by any white space. Lines
starting with # are considered to be comments.
Lines starting with ## can be ignored by most
programs, but contain meta-data of one form or
another.
The file is divided into paragraphs that terminate
in a blank line. Within a paragraph, the first
word of a line indicates its type. Each multiple
alignment is in a separate paragraph that begins
with an "a" line and contains an "s" line
for each sequence in the multiple alignment.
Some MAF files may contain two optional line types: an
"i" line containing information about what is in the aligned
species DNA before and after the immediately preceding "s"
line, and an "e" line containing information about the size of the gap between the alignments that span the current
block. Parsers may ignore any other types of paragraphs and
other types of lines within an alignment paragraph.
The Header Line
The first line of a .maf file begins with ##maf.
This word is followed by white-space-separated
variable=value pairs. There should be no white
space surrounding the "=".
##maf version=1 scoring=tba.v8
The currently defined variables are:
- version - Required. Currently set to one.
- scoring - Optional. A name for the scoring
scheme used for the alignments.
The current scoring schemes are:
- bit - roughly corresponds to blast bit
values (roughly 2 points per
aligning base minus penalties for
mismatches and inserts).
- blastz - blastz scoring scheme -- roughly
100 points per aligning base.
- probability - some score normalized
between 0 and 1.
- program - Optional. Name of the program
generating the alignment.
Undefined variables are ignored by the parser.
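A header line of this form can be read with a few lines of Python. The sketch below is illustrative; `parse_maf_header` is a hypothetical name, and values are kept as strings rather than interpreted.

```python
def parse_maf_header(line):
    """Return the variable=value pairs of a ##maf header line as a dict."""
    words = line.split()
    if not words or words[0] != "##maf":
        raise ValueError("not a MAF header line")
    # Each remaining word is variable=value with no space around '='.
    variables = dict(word.split("=", 1) for word in words[1:])
    if "version" not in variables:
        raise ValueError("missing required variable: version")
    return variables


print(parse_maf_header("##maf version=1 scoring=tba.v8"))
# {'version': '1', 'scoring': 'tba.v8'}
```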
The Alignments Parameter Line
The second line displays the parameters that were
used to run the alignment program.
# tba.v8 (((human chimp) baboon) (mouse rat))
Alignment Block Lines (lines starting with 'a') -- parameters for a new alignment block
a score=23262.0
Each alignment begins with an 'a' line that sets
variables for the entire alignment block. The 'a'
is followed by name=value pairs. There are no
required name=value pairs. The currently defined
variables are:
- score -- Optional. Floating point score. If
this is present, it is good practice
to also define scoring in the first line.
- pass -- Optional. Positive integer value.
For programs that do multiple pass
alignments such as blastz, this shows
which pass this alignment came from.
Typically, pass 1 will find the
strongest alignments genome-wide, and
pass 2 will find weaker alignments
between two first-pass alignments.
Lines starting with 's' -- a sequence within an alignment block
s hg16.chr7 27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon 249182 13 + 4622798 gcagctgaaaaca
s mm4.chr6 53310102 13 + 151104725 ACAGCTGAAAATA
The 's' lines together with the 'a' lines define a
multiple alignment. The 's' lines have the
following fields which are defined by position
rather than name=value pairs.
- src -- The name of one of the source sequences
for the alignment. For sequences that
are resident in a browser assembly, the
form 'database.chromosome' allows
automatic creation of links to other
assemblies. Non-browser sequences are
typically referenced by the species name
alone.
- start -- The start of the aligning region in the
source sequence. This is a
zero-based number. If the strand field
is '-' then this is the start
relative to the reverse-complemented
source sequence.
- size -- The size of the aligning region in the
source sequence. This number is equal
to the number of non-dash characters in
the alignment text field below.
- strand -- Either '+' or '-'. If '-', then the
alignment is to the
reverse-complemented source.
- srcSize -- The size of the entire source sequence,
not just the parts involved in the alignment.
- text -- The nucleotides (or amino acids) in
the alignment and any insertions
(dashes) as well.
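The positional fields above can be parsed as in the following Python sketch. The names `SLine`, `parse_s_line`, and `forward_interval` are hypothetical. The forward-strand conversion is derived from the statement that a '-' start is relative to the reverse-complemented source; it is an inference, not a formula given in this document.

```python
from collections import namedtuple

SLine = namedtuple("SLine", "src start size strand src_size text")


def parse_s_line(line):
    """Parse one 's' line and check size against the alignment text."""
    words = line.split()
    if len(words) != 7 or words[0] != "s":
        raise ValueError("not an 's' line")
    rec = SLine(words[1], int(words[2]), int(words[3]),
                words[4], int(words[5]), words[6])
    # size must equal the number of non-dash characters in the text.
    if rec.size != len(rec.text.replace("-", "")):
        raise ValueError("size does not match non-dash characters in text")
    return rec


def forward_interval(rec):
    """Return the zero-based, half-open (start, end) on the forward strand."""
    if rec.strand == "-":
        # start counts from the beginning of the reverse-complemented source.
        start = rec.src_size - rec.start - rec.size
    else:
        start = rec.start
    return start, start + rec.size


rec = parse_s_line("s mm4.chr6 53310102 13 + 151104725 ACAGCTGAAAATA")
print(forward_interval(rec))  # (53310102, 53310115)
```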
Lines starting with 'i' -- information about what's
happening before and after this block in the aligning
species
s hg16.chr7 27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
i panTro1.chr6 N 0 C 0
s baboon 249182 13 + 4622798 gcagctgaaaaca
i baboon I 234 n 19
The 'i' lines contain information about the context of the
sequence lines immediately preceding them. The following
fields are defined by position rather than name=value pairs:
- src -- The name of the source sequence for the
alignment. Should be the same as the 's' line immediately
above this line.
- leftStatus -- A character that specifies the
relationship between the sequence in this block and the
sequence that appears in the previous block.
- leftCount -- Usually the number of bases in the aligning
species between the start of this alignment and the end of
the previous one.
- rightStatus -- A character that specifies the
relationship between the sequence in this block and the
sequence that appears in the subsequent block.
- rightCount -- Usually the number of bases in the
aligning species between the end of this alignment and the
start of the next one.
The status characters can be one of the following values:
- C -- the sequence before or after is contiguous with
this block.
- I -- there are bases between the bases in this block
and the one before or after it.
- N -- this is the first sequence from this src chrom or
scaffold.
- n -- this is the first sequence from this src chrom or
scaffold but it is bridged by another alignment from a
different chrom or scaffold.
- M -- there is missing data before or after this block
(Ns in the sequence).
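As a rough illustration, an 'i' line can be split into its five positional fields and its status characters validated against the list above. The dictionary `I_STATUS` and the function `parse_i_line` are hypothetical names used only for this sketch.

```python
# Status characters listed above, with short paraphrased descriptions.
I_STATUS = {
    "C": "contiguous with this block",
    "I": "bases lie between this block and its neighbor",
    "N": "first sequence from this src chrom or scaffold",
    "n": "first sequence from this src, bridged by another alignment",
    "M": "missing data (Ns) before or after this block",
}


def parse_i_line(line):
    """Parse one 'i' line into a dict keyed by the field names above."""
    words = line.split()
    if len(words) != 6 or words[0] != "i":
        raise ValueError("not an 'i' line")
    src, left_status, left_count, right_status, right_count = words[1:]
    for status in (left_status, right_status):
        if status not in I_STATUS:
            raise ValueError("unknown status character: " + status)
    return {
        "src": src,
        "leftStatus": left_status,
        "leftCount": int(left_count),
        "rightStatus": right_status,
        "rightCount": int(right_count),
    }


info = parse_i_line("i baboon I 234 n 19")
print(info["leftStatus"], info["leftCount"])    # I 234
print(info["rightStatus"], info["rightCount"])  # n 19
```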
Lines starting with 'e' -- information about empty parts
of the alignment block
s hg16.chr7 27707221 13 + 158545518 gcagctgaaaaca
e mm4.chr6 53310102 13 + 151104725 I
The 'e' lines indicate that there isn't aligning DNA for a
species but that the current block is bridged by a chain
that connects blocks before and after this block. The
following fields are defined by position rather than
name=value pairs.
- src -- The name of one of the source sequences for the
alignment.
- start -- The start of the non-aligning region in the
source sequence. This is a zero-based number. If the
strand field is '-' then this is the start relative to
the reverse-complemented source sequence.
- size -- The size in base pairs of the non-aligning
region in the source sequence.
- strand -- Either '+' or '-'. If '-', then the
alignment is to the reverse-complemented source.
- srcSize -- The size of the entire source sequence,
not just the parts involved in the alignment.
- status -- A character that specifies the
relationship between the non-aligning sequence in this
block and the sequence that appears in the previous and
subsequent blocks.
The status character can be one of the following values:
- C -- the sequence before and after is contiguous
implying that this region was either deleted in the source
or inserted in the reference sequence. The browser draws a
single line or a '-' in base mode in these blocks.
- I -- there are non-aligning bases in the source species
between chained alignment blocks before and after this
block. The browser shows a double line or '=' in base mode.
- M -- there are non-aligning bases in the source and
more than 90% of them are Ns in the source. The browser
shows a pale yellow bar.
- n -- there are non-aligning bases in the source and the
next aligning block starts in a new chromosome or scaffold
that is bridged by a chain between still other blocks. The
browser shows either a single line or a double line based
on how many bases are in the gap between the bridging
alignments.
A Simple Example
Here is a simple example of three alignment
blocks derived from five starting sequences.
Repeats are shown as lowercase, and each block may
have a subset of the input sequences. All
sequence columns and rows must contain at least one nucleotide
(no full columns or rows of insertions).
##maf version=1 scoring=tba.v8
# tba.v8 (((human chimp) baboon) (mouse rat))
# multiz.v7
# maf_project.v5 _tba_right.maf3 mouse _tba_C
# single_cov2.v4 single_cov2 /dev/stdin
a score=23262.0
s hg16.chr7 27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s baboon 116834 38 + 4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
s mm4.chr6 53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
s rn3.chr4 81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG
a score=5062.0
s hg16.chr7 27699739 6 + 158545518 TAAAGA
s panTro1.chr6 28862317 6 + 161576975 TAAAGA
s baboon 241163 6 + 4622798 TAAAGA
s mm4.chr6 53303881 6 + 151104725 TAAAGA
s rn3.chr4 81444246 6 + 187371129 taagga
a score=6636.0
s hg16.chr7 27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon 249182 13 + 4622798 gcagctgaaaaca
s mm4.chr6 53310102 13 + 151104725 ACAGCTGAAAATA
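To show how the paragraph structure can be consumed, here is a minimal Python sketch that groups 's' lines under their 'a' line. The function name `read_maf_blocks` is hypothetical. The sketch skips comment lines and any other line types, and, as a tolerance that is an assumption of this sketch rather than part of the format, it also starts a new block when an 'a' line appears without a preceding blank line.

```python
def read_maf_blocks(lines):
    """Yield (a_variables, s_rows) for each alignment paragraph."""
    variables, rows = None, []
    for line in lines:
        words = line.split()
        if not words:
            # A blank line ends the current paragraph.
            if variables is not None:
                yield variables, rows
            variables, rows = None, []
        elif words[0].startswith("#"):
            continue  # comment or ##maf meta-data line
        elif words[0] == "a":
            if variables is not None:
                yield variables, rows  # previous block had no blank line
            variables = dict(w.split("=", 1) for w in words[1:])
            rows = []
        elif words[0] == "s" and variables is not None:
            rows.append(words[1:])
        # Other line types ('i', 'e', unknown) are ignored in this sketch.
    if variables is not None:
        yield variables, rows


text = """##maf version=1 scoring=tba.v8
# tba.v8 (((human chimp) baboon) (mouse rat))

a score=5062.0
s hg16.chr7 27699739 6 + 158545518 TAAAGA
s rn3.chr4 81444246 6 + 187371129 taagga

a score=6636.0
s hg16.chr7 27707221 13 + 158545518 gcagctgaaaaca
"""
for variables, rows in read_maf_blocks(text.splitlines()):
    print(variables["score"], [row[0] for row in rows])
```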
WIG format
Wiggle format (WIG) allows the display of continuous-valued data in a track
format. Click here for more
information.
.2bit format
A .2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact
randomly-accessible format. The file contains masking information as well
as the DNA itself.
The file begins with a 16-byte header containing the
following fields:
- signature - the number 0x1A412743 in the architecture of the machine that
created the file
- version - zero for now. Readers should abort if they see a version number
higher than 0.
- sequenceCount - the number of sequences in the file.
- reserved - always zero for now
All fields are 32 bits unless noted. If the signature value is not as given,
the reader program should byte-swap the signature and check if the swapped
version matches. If so, all multiple-byte entities in the file will have to be
byte-swapped. This enables these binary files to be used unchanged on
different architectures.
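The signature check can be sketched in Python with the standard `struct` module. Instead of byte-swapping the signature explicitly, this sketch tries both byte orders, which has the same effect. The function name `parse_twobit_header` is hypothetical.

```python
import struct

TWOBIT_SIGNATURE = 0x1A412743


def parse_twobit_header(data):
    """Return (byte_order, version, sequence_count) from the 16-byte header."""
    if len(data) < 16:
        raise ValueError("the header is 16 bytes long")
    # Try little-endian first, then big-endian (the byte-swapped case).
    for byte_order in ("<", ">"):
        signature, version, count, _reserved = struct.unpack(
            byte_order + "IIII", data[:16])
        if signature == TWOBIT_SIGNATURE:
            break
    else:
        raise ValueError("not a .2bit file: bad signature")
    if version != 0:
        raise ValueError("unsupported .2bit version %d" % version)
    return byte_order, version, count


header = struct.pack(">IIII", TWOBIT_SIGNATURE, 0, 3, 0)
print(parse_twobit_header(header))  # ('>', 0, 3)
```

The returned byte order should then be used for every other multiple-byte value read from the same file.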
The header is followed by a file index, which contains one entry for
each sequence. Each index entry contains three fields:
- nameSize - a byte containing the length of the name field
- name - the sequence name itself, of variable length depending on nameSize
- offset - the 32-bit offset of the sequence data relative to the start of the
file
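Reading the index amounts to walking the variable-length entries one after another. The Python sketch below assumes the bytes passed in begin immediately after the 16-byte header and that names are ASCII; both are assumptions of this illustration, as is the function name `parse_twobit_index`.

```python
import struct


def parse_twobit_index(data, sequence_count, byte_order="<"):
    """Return {name: offset} for the index entries at the start of `data`."""
    index, pos = {}, 0
    for _ in range(sequence_count):
        name_size = data[pos]  # nameSize is a single byte
        pos += 1
        name = data[pos:pos + name_size].decode("ascii")
        pos += name_size
        # offset is a 32-bit value in the file's byte order.
        (offset,) = struct.unpack_from(byte_order + "I", data, pos)
        pos += 4
        index[name] = offset
    return index


raw = (bytes([4]) + b"chr1" + struct.pack("<I", 100)
       + bytes([4]) + b"chr2" + struct.pack("<I", 2000))
print(parse_twobit_index(raw, 2))  # {'chr1': 100, 'chr2': 2000}
```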
The index is followed by the sequence records, which contain nine fields:
- dnaSize - number of bases of DNA in the sequence
- nBlockCount - the number of blocks of Ns in the file (representing unknown
sequence)
- nBlockStarts - the starting position for each block of Ns
- nBlockSizes - the size of each block of Ns
- maskBlockCount - the number of masked (lower-case) blocks
- maskBlockStarts - the starting position for each masked block
- maskBlockSizes - the size of each masked block
- reserved - always zero for now
- packedDna - the DNA packed to two bits per base, represented as so:
T - 00, C - 01, A - 10, G - 11. The first base is in the most significant
2 bits of a byte; the last base is in the least significant 2 bits. For example, the
sequence TCAG is represented as 00011011. The packedDna field is padded with 0
bits as necessary to take an even multiple of 32 bits in the file, which
improves I/O performance on some machines.
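The packing scheme for packedDna can be illustrated with a short Python sketch. The function name `unpack_dna` is hypothetical, and the sketch decodes only the 2-bit codes; applying the N blocks and mask blocks on top of the result is left out.

```python
BASES = "TCAG"  # 2-bit codes 00, 01, 10, 11


def unpack_dna(packed, dna_size):
    """Decode the first `dna_size` bases from packed 2-bit DNA bytes."""
    bases = []
    for i in range(dna_size):
        byte = packed[i // 4]
        # The first base of each byte sits in the two most significant bits.
        shift = 6 - 2 * (i % 4)
        bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases)


print(unpack_dna(bytes([0b00011011]), 4))  # TCAG
```

Passing dna_size explicitly matters because the padding bits at the end of the field would otherwise decode as extra T bases.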
.nib format
The .nib format pre-dates the .2bit format and is less compact. It
describes a DNA sequence by packing two bases into each byte.
Each .nib file contains only a single sequence. The file begins with a 32-bit
signature that is 0x6BE93D3A in the architecture of the machine that created
the file (or possibly a byte-swapped version of the same number on another
machine). This is followed by a 32-bit number in the same format that describes
the number of bases in the file. Next, the bases themselves are listed, packed
two bases to the byte. The first base is packed in the high-order 4 bits
(nibble); the second base is packed in the low-order four bits:
byte = (base1<<4) + base2
The numerical representations for the bases are:
- 0 - T
- 1 - C
- 2 - A
- 3 - G
- 4 - N (unknown)
The most significant bit in a nibble is set if the base is masked.
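The nibble layout can be decoded as in the Python sketch below. The function name `decode_nib_bases` is hypothetical, and showing masked bases in lower case is a display choice made for this illustration, not something the format requires. The input is assumed to be the packed bases that follow the signature and the base count.

```python
NIB_BASES = "TCAGN"  # numerical codes 0 through 4


def decode_nib_bases(packed, base_count):
    """Decode `base_count` bases from bytes holding two bases each."""
    bases = []
    for i in range(base_count):
        byte = packed[i // 2]
        # Even positions use the high nibble, odd positions the low nibble.
        nibble = (byte >> 4) if i % 2 == 0 else (byte & 0x0F)
        masked = bool(nibble & 0x8)  # most significant bit of the nibble
        code = nibble & 0x7
        if code >= len(NIB_BASES):
            raise ValueError("invalid base code %d" % code)
        base = NIB_BASES[code]
        bases.append(base.lower() if masked else base)
    return "".join(bases)


# 0x01 -> T, C; 0x23 -> A, G; 0xA4 -> masked A, then N
print(decode_nib_bases(bytes([0x01, 0x23, 0xA4]), 6))  # TCAGaN
```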