bx.interval_index_file module
Classes for index files that map genomic intervals to values.
- Authors: James Taylor (james@bx.psu.edu), Bob Harris (rsharris@bx.psu.edu)
An interval index file maps genomic intervals to values.
This implementation writes the version 1 file format and reads versions 0 and 1.
Index File Format
All fields are in big-endian format (most significant byte first).
All intervals are origin-zero, inclusive start, exclusive end.
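As a minimal sketch of these two conventions, the snippet below packs an interval as two big-endian fields and checks overlap under the half-open rule. It assumes each field is an unsigned 32-bit integer (the signedness is not stated here); `pack_interval` and `overlaps` are hypothetical helper names, not part of this module.

```python
import struct


def pack_interval(start, end):
    """Pack a zero-based, half-open interval [start, end) as two big-endian fields."""
    if not 0 <= start <= end:
        raise ValueError("expected 0 <= start <= end")
    # ">" selects big-endian; "I" is an unsigned 32-bit integer (assumed width and sign).
    return struct.pack(">II", start, end)


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a position."""
    return a_start < b_end and b_start < a_end


# [10, 20) covers positions 10 through 19, so its length is 20 - 10.
packed = pack_interval(10, 20)
start, end = struct.unpack(">II", packed)
```

Because the end is exclusive, two intervals that merely touch, such as [10, 20) and [20, 30), do not overlap.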
The file begins with a file header, immediately followed by an index table. The index table is made up of index headers; each index header points to its index data, and the index data points to bins. Index data and bins are referenced via pointers (file offsets), and can be placed more or less anywhere in the file.
File header
offset 0x00: | 2C FF 80 0A | magic number
offset 0x04: | 00 00 00 01 | version (00 00 00 00 is also supported)
offset 0x08: | 00 00 00 2A | (N) number of index headers in the index table
offset 0x0C: | …           | index table
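The header layout above could be read along these lines. This is a sketch, not the module's own reader: it assumes unsigned 32-bit fields and assumes the field at offset 0x08 is N, the number of index headers. `read_file_header` is a hypothetical name.

```python
import io
import struct

MAGIC = 0x2CFF800A  # bytes 2C FF 80 0A at offset 0x00


def read_file_header(f):
    """Read the 12-byte file header and return (version, n_headers)."""
    raw = f.read(12)
    if len(raw) != 12:
        raise ValueError("truncated file header")
    # Three consecutive big-endian 32-bit fields: magic, version, count (assumed to be N).
    magic, version, n_headers = struct.unpack(">3I", raw)
    if magic != MAGIC:
        raise ValueError("bad magic number: 0x%08X" % magic)
    if version not in (0, 1):
        raise ValueError("unsupported version: %d" % version)
    return version, n_headers


# The example bytes from the table: magic, version 1, and 0x2A in the third field.
header = bytes.fromhex("2CFF800A" "00000001" "0000002A")
version, n_headers = read_file_header(io.BytesIO(header))
```

After this call the stream is positioned at offset 0x0C, where the index table begins.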
Index table
The index table is a list of N index headers, packed sequentially and sorted by name. The first begins at offset 0x0C. Each header describes one set of intervals.
offset:     | xx xx xx xx | (L) length of index src name, in bytes
offset+4:   | …           | index src name (e.g. canFam1.chr1)
offset+4+L: | xx xx xx xx | offset (in this file) to index data
offset+8+L: | xx xx xx xx | (B) number of bytes in each value; for version 0, this field is absent, and B is assumed to be 4
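A reader for this table might look like the sketch below, which handles the version 0 case where the B field is absent. It assumes unsigned 32-bit fields, assumes the first field is the name length L, and assumes names are ASCII; `read_index_table` and `pack_entry` are hypothetical helpers written for this illustration.

```python
import io
import struct


def read_index_table(f, n_headers, version):
    """Read n_headers packed index headers as (name, data_offset, value_size) tuples."""
    entries = []
    for _ in range(n_headers):
        (name_len,) = struct.unpack(">I", f.read(4))  # assumed: (L), name length in bytes
        name = f.read(name_len).decode("ascii")       # assumed encoding
        (data_offset,) = struct.unpack(">I", f.read(4))
        if version == 0:
            value_size = 4  # field absent in version 0; B is taken to be 4
        else:
            (value_size,) = struct.unpack(">I", f.read(4))
        entries.append((name, data_offset, value_size))
    return entries


def pack_entry(name, data_offset, value_size=None):
    """Build one index header; omit value_size to mimic the version 0 layout."""
    raw = name.encode("ascii")
    out = struct.pack(">I", len(raw)) + raw + struct.pack(">I", data_offset)
    if value_size is not None:
        out += struct.pack(">I", value_size)
    return out


table_v1 = pack_entry("canFam1.chr1", 0x100, 8) + pack_entry("canFam1.chr2", 0x200, 8)
entries_v1 = read_index_table(io.BytesIO(table_v1), 2, version=1)

table_v0 = pack_entry("canFam1.chr1", 0x100)
entries_v0 = read_index_table(io.BytesIO(table_v0), 1, version=0)
```

Since the headers are variable length, they have to be read sequentially; there is no way to jump straight to the k-th header without reading the name lengths before it.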
Index data
The index data (for one entry in the index table) consists of the overall range of intervals followed by an array of pointers to bins, each paired with the number of intervals in that bin. The length of the array is 1+binForRange(maxEnd-1,maxEnd), where maxEnd is the maximum interval end.
offset:    | xx xx xx xx | minimum interval start
offset+4:  | xx xx xx xx | maximum interval end
offset+8:  | xx xx xx xx | offset (in this file) to bin 0
offset+12: | xx xx xx xx | number of intervals in bin 0
offset+16: | xx xx xx xx | offset (in this file) to bin 1
offset+20: | xx xx xx xx | number of intervals in bin 1
…          | …           | …
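The index data layout can be walked as in the following sketch. Since binForRange is not defined in this section, the number of bins is passed in by the caller rather than computed; `read_index_data` and its `n_bins` parameter are hypothetical, and the fields are assumed to be unsigned 32-bit integers.

```python
import io
import struct


def read_index_data(f, data_offset, n_bins):
    """Return (min_start, max_end, bins), where bins is a list of (bin_offset, n_intervals)."""
    f.seek(data_offset)
    min_start, max_end = struct.unpack(">2I", f.read(8))
    bins = []
    for _ in range(n_bins):
        # Each array element is a file offset followed by an interval count.
        bin_offset, n_intervals = struct.unpack(">2I", f.read(8))
        bins.append((bin_offset, n_intervals))
    return min_start, max_end, bins


# Four bytes of padding, then index data with two bins; the second bin is empty.
buf = (
    b"\x00" * 4
    + struct.pack(">2I", 0, 1000)
    + struct.pack(">2I", 64, 2)
    + struct.pack(">2I", 0, 0)
)
min_start, max_end, bins = read_index_data(io.BytesIO(buf), 4, 2)
```

A caller would typically skip any bin whose interval count is zero instead of seeking to its offset.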
Bin
A bin is an array of (start,end,val), sorted by increasing start (with end and val as tiebreakers). Note that bins may be empty (the number of intervals indicated in the index data is zero). Note that B is determined from the appropriate entry in the index table.
offset:      | xx xx xx xx | start for interval 1
offset+4:    | xx xx xx xx | end for interval 1
offset+8:    | …           | (B bytes) value for interval 1
offset+8+B:  | xx xx xx xx | start for interval 2
offset+12+B: | xx xx xx xx | end for interval 2
offset+16+B: | …           | (B bytes) value for interval 2
…            | …           | …
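Reading one bin, given its offset and interval count from the index data and B from the index table, might be sketched as follows. `read_bin` is a hypothetical name, start and end are assumed to be unsigned 32-bit integers, and each value is returned as raw bytes; decoding it as a big-endian unsigned integer in the example is an assumption, not something this section specifies.

```python
import io
import struct


def read_bin(f, bin_offset, n_intervals, value_size):
    """Read n_intervals records of (start, end, raw_value) starting at bin_offset."""
    f.seek(bin_offset)
    records = []
    for _ in range(n_intervals):
        start, end = struct.unpack(">2I", f.read(8))
        value = f.read(value_size)  # B raw bytes; interpretation is left to the caller
        if len(value) != value_size:
            raise ValueError("truncated bin record")
        records.append((start, end, value))
    return records


# Two records with B = 8, already sorted by increasing start.
buf = (
    struct.pack(">2I", 10, 20) + (7).to_bytes(8, "big")
    + struct.pack(">2I", 15, 40) + (9).to_bytes(8, "big")
)
records = read_bin(io.BytesIO(buf), 0, 2, 8)
decoded = [(s, e, int.from_bytes(v, "big")) for s, e, v in records]
```

Each record occupies 8+B bytes, so a bin with n intervals spans n*(8+B) bytes in the file.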
- class bx.interval_index_file.Index(min=0, max=536870912, filename=None, offset=0, value_size=None, version=None)
Bases: object
- add(start, end, val)
Add the interval (start,end) with associated value val to the index.
- bytes_required()
- find(start, end)
- get_value_size()
- iterate()
- load_bin(index)
- new(min, max)
Create an empty index for intervals in the range min, max
- open(filename, offset, version)
- property value_size
- write(f)