bx.interval_index_file module

Classes for index files that map genomic intervals to values.

Authors:

James Taylor (james@bx.psu.edu), Bob Harris (rsharris@bx.psu.edu)

An interval index file maps genomic intervals to values.

This implementation writes the version 1 file format and reads versions 0 and 1.

Index File Format

All fields are in big-endian format (most significant byte first).

All intervals are origin-zero, inclusive start, exclusive end.
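The half-open convention makes length and overlap arithmetic simple. The following is a minimal sketch of that arithmetic only; the helper names `interval_length` and `overlaps` are hypothetical and are not part of this module.

```python
def interval_length(start, end):
    """Length of an origin-zero, half-open interval [start, end)."""
    return end - start


def overlaps(a_start, a_end, b_start, b_end):
    """True if [a_start, a_end) and [b_start, b_end) share at least one position."""
    # With an exclusive end, intervals that merely touch do not overlap.
    return a_start < b_end and b_start < a_end


# The first ten bases of a sequence are the interval (0, 10).
assert interval_length(0, 10) == 10
assert not overlaps(0, 10, 10, 20)   # adjacent intervals, no shared position
assert overlaps(0, 10, 9, 20)        # position 9 is in both
```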

The file begins with an index file header, immediately followed by an index table. The index table points to index headers, and index headers point to bins. Index headers and bins are referenced via pointers (file offsets), and can be placed more or less anywhere in the file.

File header

    offset 0x00:  2C FF 80 0A  magic number
    offset 0x04:  00 00 00 01  version (00 00 00 00 is also supported)
    offset 0x08:  00 00 00 2A  (N) number of index sets
    offset 0x0C:               index table
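As an illustration of the file header layout, the following sketch decodes the first 12 bytes with Python's standard `struct` module. It is not the module's own reader: the function name `read_file_header` is hypothetical, and treating the fields as unsigned integers is an assumption (the layout only fixes their width and byte order).

```python
import struct

MAGIC = 0x2CFF800A  # magic number from the file header layout


def read_file_header(buf):
    """Return (version, number_of_index_sets) parsed from the first 12 bytes of buf."""
    # ">" selects big-endian; "I" is a 4-byte unsigned integer (signedness assumed).
    magic, version, count = struct.unpack_from(">III", buf, 0)
    if magic != MAGIC:
        raise ValueError("not an interval index file: bad magic number")
    if version not in (0, 1):
        raise ValueError("unsupported version: %d" % version)
    return version, count


# The example header bytes from the layout: version 1, 0x2A index sets.
header = bytes.fromhex("2CFF800A" "00000001" "0000002A")
print(read_file_header(header))  # (1, 42)
```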

Index table

The index table is a list of N index headers, packed sequentially and sorted by name. The first begins at offset 0x0C. Each header describes one set of intervals.

    offset:      xx xx xx xx  (L) length of index src name
    offset+4:                 index src name (e.g. canFam1.chr1)
    offset+4+L:  xx xx xx xx  offset (in this file) to index data
    offset+8+L:  xx xx xx xx  (B) number of bytes in each value; for version 0, this field is absent, and B is assumed to be 4
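The variable-length entries of the index table can be walked sequentially. The sketch below assumes the name is stored as ASCII text and that all integer fields are unsigned; neither is stated in the format description. The function name `read_index_table` is hypothetical.

```python
import struct


def read_index_table(buf, count, version=1, offset=0x0C):
    """Return a list of (name, data_offset, value_size), one per index table entry."""
    entries = []
    for _ in range(count):
        (name_len,) = struct.unpack_from(">I", buf, offset)       # L
        offset += 4
        name = buf[offset:offset + name_len].decode("ascii")      # encoding assumed
        offset += name_len
        (data_offset,) = struct.unpack_from(">I", buf, offset)    # pointer to index data
        offset += 4
        if version >= 1:
            (value_size,) = struct.unpack_from(">I", buf, offset)  # B
            offset += 4
        else:
            value_size = 4  # version 0 has no B field; B is assumed to be 4
        entries.append((name, data_offset, value_size))
    return entries


# Build a one-entry table placed after 12 bytes standing in for the file header.
name = b"canFam1.chr1"
table = struct.pack(">I", len(name)) + name + struct.pack(">II", 0x100, 4)
buf = b"\x00" * 0x0C + table
print(read_index_table(buf, 1))  # [('canFam1.chr1', 256, 4)]
```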

Index data

The index data (for one entry in the index table) consists of the overall range of intervals followed by an array of pointers to bins. The length of the array is 1+binForRange(maxEnd-1,maxEnd), where maxEnd is the maximum interval end.

    offset:     xx xx xx xx  minimum interval start
    offset+4:   xx xx xx xx  maximum interval end
    offset+8:   xx xx xx xx  offset (in this file) to bin 0
    offset+12:  xx xx xx xx  number of intervals in bin 0
    offset+16:  xx xx xx xx  offset (in this file) to bin 1
    offset+20:  xx xx xx xx  number of intervals in bin 1
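The index data layout can be decoded as a range followed by fixed-size (offset, count) pairs. This sketch does not implement binForRange; the caller is assumed to supply the array length through the hypothetical `bin_count` parameter. The function name `read_index_data` and the unsigned interpretation of the fields are also assumptions.

```python
import struct


def read_index_data(buf, offset, bin_count):
    """Return (min_start, max_end, bins), where bins is a list of (bin_offset, n_intervals)."""
    min_start, max_end = struct.unpack_from(">II", buf, offset)
    bins = []
    pos = offset + 8
    for _ in range(bin_count):
        # Each array element is 8 bytes: file offset of the bin, then its interval count.
        bins.append(struct.unpack_from(">II", buf, pos))
        pos += 8
    return min_start, max_end, bins


# Two bins with made-up offsets: the first holds 3 intervals, the second is empty.
data = struct.pack(">II", 0, 1000) + struct.pack(">II", 64, 3) + struct.pack(">II", 100, 0)
print(read_index_data(data, 0, 2))  # (0, 1000, [(64, 3), (100, 0)])
```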

Bin

A bin is an array of (start, end, val) records, sorted by increasing start (with end and val as tiebreakers). Bins may be empty, in which case the number of intervals indicated in the index data is zero. The value size B is taken from the corresponding entry in the index table.

    offset:       xx xx xx xx  start for interval 1
    offset+4:     xx xx xx xx  end for interval 1
    offset+8:                  (B bytes) value for interval 1
    offset+8+B:   xx xx xx xx  start for interval 2
    offset+12+B:  xx xx xx xx  end for interval 2
    offset+16+B:               (B bytes) value for interval 2
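Each bin record occupies 8+B bytes, so record i starts at offset + i*(8+B). A minimal sketch of reading a bin follows. Decoding the value as an unsigned big-endian integer is an assumption, since the format only specifies its width, and the function name `read_bin` is hypothetical.

```python
import struct


def read_bin(buf, offset, n_intervals, value_size=4):
    """Return a list of (start, end, value) records from one bin."""
    stride = 8 + value_size  # 4 bytes start + 4 bytes end + B bytes value
    records = []
    for i in range(n_intervals):
        pos = offset + i * stride
        start, end = struct.unpack_from(">II", buf, pos)
        raw = buf[pos + 8:pos + stride]
        value = int.from_bytes(raw, "big")  # integer interpretation is assumed
        records.append((start, end, value))
    return records


# Two records with B = 4, sorted by increasing start.
bin_bytes = struct.pack(">III", 10, 20, 7) + struct.pack(">III", 15, 30, 9)
print(read_bin(bin_bytes, 0, 2))  # [(10, 20, 7), (15, 30, 9)]
```

An empty bin is handled naturally: with `n_intervals` equal to zero the loop does not run and an empty list is returned.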

class bx.interval_index_file.Index(min=0, max=536870912, filename=None, offset=0, value_size=None, version=None)

Bases: object

add(start, end, val)

Add the interval (start,end) with associated value val to the index

bytes_required()
find(start, end)
get_value_size()
iterate()
load_bin(index)
new(min, max)

Create an empty index for intervals in the range min, max

open(filename, offset, version)
property value_size
write(f)
class bx.interval_index_file.Indexes(filename=None)

Bases: object

A set of indexes, each identified by a unique name

add(name, start, end, val, max=536870912)
find(name, start, end)
get(name)
open(filename)
write(f)