pyensembl package¶

Submodules¶

pyensembl.biotypes module¶

pyensembl.common module¶

pyensembl.common.dump_pickle(obj, filepath)¶

pyensembl.common.is_valid_ensembl_id(ensembl_id)¶: Is the argument a valid ID for any Ensembl feature?

pyensembl.common.is_valid_human_protein_id(protein_id)¶: Is the argument a valid identifier for human Ensembl proteins?

pyensembl.common.is_valid_human_transcript_id(transcript_id)¶: Is the argument a valid identifier for human Ensembl transcripts?
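
The validators above are listed without their rules. As a rough illustration only, the sketch below checks the public Ensembl stable ID convention for human features (a prefix such as ENST or ENSP followed by 11 digits). The function names and the exact patterns are assumptions and may be stricter or looser than what pyensembl checks.

```python
import re

# Assumed patterns based on the public Ensembl stable ID convention for
# human features; NOT taken from pyensembl's source.
HUMAN_TRANSCRIPT_ID = re.compile(r"ENST\d{11}")
HUMAN_PROTEIN_ID = re.compile(r"ENSP\d{11}")


def looks_like_human_transcript_id(value):
    """True if value is a string shaped like a human transcript ID."""
    return isinstance(value, str) and HUMAN_TRANSCRIPT_ID.fullmatch(value) is not None


def looks_like_human_protein_id(value):
    """True if value is a string shaped like a human protein ID."""
    return isinstance(value, str) and HUMAN_PROTEIN_ID.fullmatch(value) is not None
```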

pyensembl.common.load_pickle(filepath)¶

pyensembl.common.memoize(fn)¶: Simple reset-able memoization decorator for functions and methods, assumes that all arguments to the function can be hashed and compared.
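
A minimal sketch of what a reset-able memoization decorator can look like, assuming (as stated above) that every argument is hashable. The names resettable_memoize and clear_cache are hypothetical; pyensembl's own decorator may expose its reset hook differently.

```python
import functools


def resettable_memoize(fn):
    """Cache results of fn keyed by its (hashable) arguments."""
    cache = {}

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Sort keyword arguments so f(a=1, b=2) and f(b=2, a=1) share a key.
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = fn(*args, **kwargs)
        return cache[key]

    # Hypothetical reset hook: discard every cached result.
    wrapper.clear_cache = cache.clear
    return wrapper


calls = []


@resettable_memoize
def square(x):
    calls.append(x)  # record each real evaluation
    return x * x
```

Calling square(3) twice evaluates the body once; after square.clear_cache() the next call evaluates it again.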

pyensembl.common.require_ensembl_id(ensembl_id)¶

pyensembl.common.require_human_protein_id(protein_id)¶

pyensembl.common.require_human_transcript_id(transcript_id)¶

pyensembl.database module¶

class pyensembl.database.Database(gtf, install_string)¶

Bases: object

Wrapper around sqlite3 database so that the rest of the library doesn’t have to worry about constructing the .db file or writing SQL queries directly.

PRIMARY_KEY_COLUMNS = {'gene': 'gene_id', 'transcript': 'transcript_id'}¶

column_exists(*args, **kwargs)¶

column_values_at_locus(column_name, feature, contig, position, end=None, strand=None, distinct=False, sorted=False)¶: Get the non-null values of a column from the database at a particular range of loci

columns(*args, **kwargs)¶

connect_or_create(overwrite=False)¶: Return a connection to the database if it exists, otherwise create it. Overwrite the existing database if overwrite is True.

connection¶: Get a connection to the database or raise an exception

create(overwrite=False)¶

Create the local database (including indexing) if it’s not already set up. If overwrite is True, always re-create the database from scratch.

Returns a connection to the database.

distinct_column_values_at_locus(column, feature, contig, position, end=None, strand=None)¶

Gather all the distinct values for a property/column at some specified locus.

column : str: Which property are we getting the values of.
feature : str: Which type of entry (e.g. transcript, exon, gene) is the property associated with?
contig : str: Chromosome or unplaced contig name
position : int: Chromosomal position
end : int, optional: End position of a range, if unspecified assume we’re only looking at the single given position.
strand : str, optional: Either the positive (‘+’) or negative strand (‘-‘). If unspecified then check for values on either strand.
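
The locus lookup described above can be pictured as an inclusive range-overlap query. The following is a sketch against a made-up sqlite3 table, not pyensembl's actual schema or SQL: the table layout, the column names seqname, start and end, and both function names are assumptions for illustration.

```python
import sqlite3


def distinct_values_at_locus(conn, column, feature, contig, position,
                             end=None, strand=None):
    """Distinct non-null values of `column` for rows of table `feature`
    overlapping the inclusive range [position, end]."""
    if end is None:
        end = position  # a single position is a range of length one
    # Identifiers cannot be bound as '?' parameters, so only plain names
    # are accepted before being formatted into the SQL text.
    for identifier in (column, feature):
        if not identifier.replace("_", "").isalnum():
            raise ValueError("unexpected identifier: %r" % (identifier,))
    sql = (
        'SELECT DISTINCT {col} FROM {table} '
        'WHERE seqname = ? AND start <= ? AND "end" >= ? '
        'AND {col} IS NOT NULL'
    ).format(col=column, table=feature)
    params = [contig, end, position]
    if strand is not None:
        sql += " AND strand = ?"
        params.append(strand)
    return [row[0] for row in conn.execute(sql, params)]


def make_demo_connection():
    """Tiny in-memory table with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE gene (gene_name TEXT, seqname TEXT, '
        'start INT, "end" INT, strand TEXT)'
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
        [("A", "1", 100, 200, "+"),
         ("B", "1", 150, 300, "-"),
         ("C", "2", 100, 200, "+")],
    )
    return conn
```

Two inclusive ranges overlap exactly when each one starts no later than the other ends, which is what the two comparisons in the WHERE clause express.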

local_db_filename()¶

local_db_path()¶

query(*args, **kwargs)¶: Construct a SQL query and run against the sqlite3 database, filtered both by the feature type and a user-provided column/value.

query_distinct_on_contig(column_name, feature, contig)¶

query_feature_values(*args, **kwargs)¶: Run a SQL query against the sqlite3 database, filtered only on the feature type.

query_loci(filter_column, filter_value, feature)¶

Query for loci satisfying a given filter and feature type.

filter_column : str: Name of column to filter results by.
filter_value : str: Only return loci which have this value in their filter_column.
feature : str: Feature names such as ‘transcript’, ‘gene’, and ‘exon’

Returns list of Locus objects

query_locus(filter_column, filter_value, feature)¶

Query for a unique locus; raises an error if the locus is missing or if more than one locus matches in the database.

filter_column : str: Name of column to filter results by.
filter_value : str: Only return loci which have this value in their filter_column.
feature : str: Feature names such as ‘transcript’, ‘gene’, and ‘exon’

Returns single Locus object.

query_one(select_column_names, filter_column, filter_value, feature, distinct=False, required=False)¶

run_sql_query(sql, required=False, query_params=[])¶

Given an arbitrary SQL query, run it against the database and return the results.

sql : str: SQL query
required : bool: Raise an error if no results found in the database
query_params : list: For each ‘?’ in the query there must be a corresponding value in this list.
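
To illustrate the relationship between ‘?’ placeholders and query_params, here is a small stand-alone sketch using Python's sqlite3 module. It mirrors the documented parameters but is not pyensembl's implementation; the helper names, the demo table, and the choice of ValueError for the required case are assumptions.

```python
import sqlite3


def run_query(conn, sql, required=False, query_params=()):
    """Run sql with one bound value per '?' and return all rows."""
    # sqlite3 raises an error itself if the number of values does not
    # match the number of '?' placeholders.
    rows = conn.execute(sql, tuple(query_params)).fetchall()
    if required and not rows:
        raise ValueError("no results found for query: %s" % sql)
    return rows


def make_transcript_table():
    """In-memory table with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transcript (transcript_id TEXT, gene_id TEXT, strand TEXT)"
    )
    conn.executemany(
        "INSERT INTO transcript VALUES (?, ?, ?)",
        [("T1", "G1", "+"), ("T2", "G2", "-")],
    )
    return conn
```

Binding values through placeholders, rather than formatting them into the SQL string, lets the database driver handle quoting.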

pyensembl.download_cache module¶

class pyensembl.download_cache.DownloadCache(reference_name, annotation_name, annotation_version=None, decompress_on_download=False, copy_local_files_to_cache=False, install_string_function=None, cache_directory_path=None)¶

Bases: object

Downloads remote files to cache, optionally copies local files into cache, raises custom message if data is missing.

cache_directory_path¶

cached_path(path_or_url)¶: When downloading remote files, the default behavior is to name local files the same as their remote counterparts.

delete_cache_directory()¶

delete_cached_files(prefixes=[], suffixes=[])¶: Deletes any cached files matching the prefixes or suffixes given

download_or_copy_if_necessary(path_or_url, download_if_missing=False, overwrite=False)¶

Get the local path to a possibly remote file, downloading it or copying it into the cache first if necessary.

Download if file is missing from the cache directory and download_if_missing is True. Download even if local file exists if both download_if_missing and overwrite are True.

If the file is on the local file system then return its path, unless self.copy_local_to_cache is True, in which case copy it to the cache first.

path_or_url : str

download_if_missing : bool, optional: Download files if missing from local cache
overwrite : bool, optional: Overwrite existing copy if it exists
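
The rules above combine several flags, so a small decision function may make them easier to follow. This is a sketch of the described behaviour only: the function name, the is_remote and in_cache inputs, and the returned labels are hypothetical, and the handling of already-cached local copies is simplified.

```python
def plan_action(is_remote, in_cache, download_if_missing=False,
                overwrite=False, copy_local_to_cache=False):
    """Return a label for what the documented rules would do."""
    if is_remote:
        if in_cache:
            # Re-download only when both flags are set.
            if download_if_missing and overwrite:
                return "download"
            return "use_cached"
        # Not cached yet: download only if the caller allows it.
        return "download" if download_if_missing else "missing"
    # Local file: use it in place unless copying to the cache is requested.
    if copy_local_to_cache:
        return "copy_to_cache"
    return "use_local_path"
```

The "missing" outcome corresponds to the case where the class would report that data is not available locally.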

is_url_format(path_or_url)¶

local_path_or_install_error(field_name, path_or_url, download_if_missing=False, overwrite=False)¶

exception pyensembl.download_cache.MissingLocalFile(path)¶: Bases: exceptions.Exception

exception pyensembl.download_cache.MissingRemoteFile(url)¶: Bases: exceptions.Exception

pyensembl.download_cache.cache_subdirectory(reference_name=None, annotation_name=None, annotation_version=None)¶: Which cache subdirectory to use for a given annotation database over a particular reference. All arguments can be omitted to just get the base subdirectory for all pyensembl cached datasets.

pyensembl.ensembl_release module¶

Contains the EnsemblRelease class, which extends the Genome class to be specific to (a particular release of) Ensembl.

class pyensembl.ensembl_release.EnsemblRelease(release=87, species=Species(latin_name='homo_sapiens', synonyms=['human'], reference_assemblies={'GRCh38': (76, 87), 'GRCh37': (55, 75), 'NCBI36': (54, 54)}), server='ftp://ftp.ensembl.org')¶

Bases: pyensembl.genome.Genome

Bundles together the genomic annotation and sequence data associated with a particular release of the Ensembl database.

classmethod cached(release=87, species=Species(latin_name='homo_sapiens', synonyms=['human'], reference_assemblies={'GRCh38': (76, 87), 'GRCh37': (55, 75), 'NCBI36': (54, 54)}), server='ftp://ftp.ensembl.org')¶: Construct EnsemblRelease if it’s never been made before, otherwise return an old instance.

classmethod from_dict(state_dict)¶: Deserialize EnsemblRelease without creating duplicate instances.

install_string()¶

classmethod normalize_init_values(release, species, server)¶: Normalizes the arguments which uniquely specify an EnsemblRelease genome.

to_dict()¶
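
The cached and normalize_init_values class methods suggest a keyed instance cache: arguments are normalized first so that equivalent requests map to the same object. The sketch below shows that general pattern with a hypothetical CachedRelease class; the normalization rules here are simplified assumptions, not pyensembl's.

```python
class CachedRelease(object):
    """Hypothetical stand-in showing a per-class instance cache."""

    _instances = {}

    def __init__(self, release, species):
        self.release = release
        self.species = species

    @classmethod
    def normalize_init_values(cls, release, species):
        # Assumed normalization: integer release, lower-case species name
        # with spaces replaced by underscores.
        return int(release), species.strip().lower().replace(" ", "_")

    @classmethod
    def cached(cls, release=87, species="homo_sapiens"):
        key = cls.normalize_init_values(release, species)
        if key not in cls._instances:
            cls._instances[key] = cls(*key)
        return cls._instances[key]
```

With this pattern, CachedRelease.cached(75, "Homo sapiens") and CachedRelease.cached("75", "homo_sapiens") return the same object.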

pyensembl.ensembl_release_versions module¶

pyensembl.ensembl_release_versions.check_release_number(release)¶: Check to make sure a release is in the valid range of Ensembl releases.

pyensembl.ensembl_url_templates module¶

Templates for URLs and paths to a specific release, species, and file type on the Ensembl FTP server.

For example, the human chromosomal DNA sequences for release 78 are in:

ftp://ftp.ensembl.org/pub/release-78/fasta/homo_sapiens/dna/
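
As a sketch of how such a template can be filled in, the snippet below reproduces the example directory above. The template constant and function name are hypothetical and only cover this one directory layout.

```python
# Hypothetical template reproducing the directory layout in the example above.
FASTA_DNA_SUBDIR_TEMPLATE = "/pub/release-%(release)d/fasta/%(species)s/dna/"


def fasta_dna_directory_url(release, species, server="ftp://ftp.ensembl.org"):
    """Build the URL of the chromosomal DNA FASTA directory."""
    subdir = FASTA_DNA_SUBDIR_TEMPLATE % {"release": release, "species": species}
    return server + subdir
```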

pyensembl.ensembl_url_templates.make_fasta_dna_url(ensembl_release, species, contig, server='ftp://ftp.ensembl.org')¶: Construct URL to FASTA file with full sequence of a particular chromosome. Returns server_url/subdir and filename as tuple result.

pyensembl.ensembl_url_templates.make_fasta_url(ensembl_release, species, sequence_type, server='ftp://ftp.ensembl.org')¶

Construct URL to FASTA file with cDNA transcript or protein sequences

Parameter examples: ensembl_release = 75, species = "Homo_sapiens", sequence_type = "cdna" (other option: "pep")

pyensembl.ensembl_url_templates.make_gtf_url(ensembl_release, species, server='ftp://ftp.ensembl.org')¶: Returns a URL and a filename, which can be joined together.

pyensembl.exon module¶

class pyensembl.exon.Exon(exon_id, contig, start, end, strand, gene_name, gene_id)¶

Bases: pyensembl.locus.Locus

to_dict()¶

pyensembl.gene module¶

class pyensembl.gene.Gene(gene_id, gene_name, contig, start, end, strand, biotype, genome)¶

Bases: pyensembl.locus_with_genome.LocusWithGenome

exons¶

id¶: Alias for gene_id necessary for backwards compatibility.

name¶: Alias for gene_name necessary for backwards compatibility.

to_dict()¶

transcripts¶: Property which dynamically constructs Transcript objects for all transcript IDs associated with this gene.

pyensembl.genome module¶

Contains the Genome class, with its millions of accessors and wrappers around an arbitrary genomic database.

class pyensembl.genome.Genome(reference_name, annotation_name, annotation_version=None, gtf_path_or_url=None, transcript_fasta_path_or_url=None, protein_fasta_path_or_url=None, decompress_on_download=False, copy_local_files_to_cache=False, require_ensembl_ids=True, cache_directory_path=None)¶

Bases: serializable.serializable.Serializable

Bundles together the genomic annotation and sequence data associated with a particular genomic database source (e.g. a single Ensembl release) and provides a wide variety of helper methods for accessing this data.

clear_cache()¶: Clear any in-memory cached values and short-lived on-disk materializations from MemoryCache

contigs()¶: Returns all contig names for any gene in the genome (field called “seqname” in Ensembl GTF files)

db¶

delete_index_files()¶: Delete all data aside from source GTF and FASTA files

download(overwrite=False)¶

Download data files needed by this Genome instance.

overwrite : bool, optional: Download files regardless whether local copy already exists.

exon_by_id(exon_id)¶: Construct an Exon object from its ID by looking up the exon’s properties in the given Database.

exon_ids(contig=None, strand=None)¶

exon_ids_at_locus(contig, position, end=None, strand=None)¶

exon_ids_of_gene_id(gene_id)¶

exon_ids_of_gene_name(gene_name)¶

exon_ids_of_transcript_id(transcript_id)¶

exon_ids_of_transcript_name(transcript_name)¶

exons(contig=None, strand=None)¶: Create Exon objects for all exons in the database, optionally restricted to a particular contig and/or strand.

exons_at_locus(contig, position, end=None, strand=None)¶

gene_by_id(gene_id)¶: Construct a Gene object for the given gene ID.

gene_by_protein_id(protein_id)¶: Get the gene ID associated with the given protein ID, return its Gene object

gene_id_of_protein_id(protein_id)¶: What is the gene ID associated with a given protein ID?

gene_ids(contig=None, strand=None)¶: What are all the gene IDs (optionally restrict to a given chromosome/contig and/or strand)

gene_ids_at_locus(contig, position, end=None, strand=None)¶

gene_ids_of_gene_name(gene_name)¶: What are the gene IDs associated with a given gene name? (due to copy events, there might be multiple genes per name)

gene_name_of_exon_id(exon_id)¶

gene_name_of_gene_id(gene_id)¶

gene_name_of_transcript_id(transcript_id)¶

gene_name_of_transcript_name(transcript_name)¶

gene_names(contig=None, strand=None)¶: Return all gene names in the database, optionally restricted to a chromosome and/or strand.

gene_names_at_locus(contig, position, end=None, strand=None)¶

genes(contig=None, strand=None)¶

Returns all Gene objects in the database. Can be restricted to a particular contig/chromosome and strand by the following arguments:

contig : str: Only return genes on the given contig.
strand : str: Only return genes on this strand.

genes_at_locus(contig, position, end=None, strand=None)¶

genes_by_name(gene_name)¶: Get all the unique genes with the given name (there might be multiple due to copies in the genome) and return a list containing a Gene object for each distinct ID.

gtf¶

index(overwrite=False)¶: Assuming that all necessary data for this Genome has been downloaded, generate the GTF database and save efficient representation of FASTA sequence files.

install_string()¶: Add every missing file to the install string shown to the user in an error message.

loci_of_gene_names(gene_name)¶

Given a gene name, returns a list of Locus objects with fields: chromosome, start, stop, strand.

You can get multiple results since a gene might have multiple copies in the genome.

locus_of_exon_id(exon_id)¶: Given an exon ID returns Locus

locus_of_gene_id(gene_id)¶: Given a gene ID returns Locus with: chromosome, start, stop, strand

locus_of_transcript_id(transcript_id)¶

protein_ids(contig=None, strand=None)¶: What are all the protein IDs (optionally restrict to a given chromosome and/or strand)

protein_ids_at_locus(contig, position, end=None, strand=None)¶

protein_sequence(protein_id)¶: Return the amino acid sequence for the given protein ID, or None if no sequence is available for that ID.

protein_sequences¶

to_dict()¶: Returns a dictionary of the essential fields of this Genome.

transcript_by_id(transcript_id)¶: Construct Transcript object with given transcript ID

transcript_by_protein_id(protein_id)¶

transcript_id_of_protein_id(protein_id)¶: What is the transcript ID associated with a given protein ID?

transcript_ids(contig=None, strand=None)¶

transcript_ids_at_locus(contig, position, end=None, strand=None)¶

transcript_ids_of_exon_id(exon_id)¶

transcript_ids_of_gene_id(gene_id)¶

transcript_ids_of_gene_name(gene_name)¶

transcript_ids_of_transcript_name(transcript_name)¶

transcript_name_of_transcript_id(transcript_id)¶

transcript_names(contig=None, strand=None)¶: What are all the transcript names in the database (optionally, restrict to a given chromosome and/or strand)

transcript_names_at_locus(contig, position, end=None, strand=None)¶

transcript_names_of_gene_name(gene_name)¶

transcript_sequence(transcript_id)¶: Return cDNA nucleotide sequence of transcript, or None if transcript doesn’t have cDNA sequence.

transcript_sequences¶

transcripts(contig=None, strand=None)¶: Construct Transcript object for every transcript entry in the database. Optionally restrict to a particular chromosome using the contig argument.

transcripts_at_locus(contig, position, end=None, strand=None)¶

transcripts_by_name(transcript_name)¶

pyensembl.gtf module¶

class pyensembl.gtf.GTF(gtf_path, cache_directory_path=None)¶

Bases: object

Parse a GTF gene annotation file from a given local path. Represent its contents as a Pandas DataFrame (optionally filtered by locus, column, contig, etc.).

clear_cache()¶

data_subset_path(contig=None, feature=None, column=None, strand=None, distinct=False, extension='.csv')¶

Path to cached file for storing materialized views of the genomic data. Typically this is a CSV file whose filename reflects which filters have been applied to the entries of the database.

Parameters:

contig : str, optional: Path for subset of data restricted to given contig
feature : str, optional: Path for subset of data restricted to given feature
column : str, optional: Restrict to single column
strand : str, optional: Positive (“+”) or negative (“-”) DNA strand. Default = either.
distinct : bool, optional: Only keep unique values (default=False)

dataframe(contig=None, feature=None, strand=None, save_to_disk=False)¶: Load genome entries as a DataFrame, optionally restricted to particular contig or feature type.

dataframe_at_locus(contig, start, end=None, offset=None, strand=None)¶: Subset of entries which overlap an inclusive range of chromosomal positions

dataframe_column_at_locus(column_name, contig, start, end=None, offset=None, strand=None)¶: Subset of entries which overlap an inclusive range of loci

pyensembl.locus module¶

class pyensembl.locus.Locus(contig, start, end, strand)¶

Bases: serializable.serializable.Serializable

Base class for any entity which can be localized at a range of positions on a particular strand of a chromosome/contig.

can_overlap(contig, strand=None)¶: Is this locus on the same contig and (optionally) on the same strand?

contains(contig, start, end, strand=None)¶

contains_locus(other_locus)¶

distance_to_interval(start, end)¶: Find the distance between intervals [start1, end1] and [start2, end2]. If the intervals overlap then the distance is 0.
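
A minimal sketch of the interval distance described above, written as a free function over two inclusive intervals. The function name is hypothetical; the method on Locus compares the locus's own start and end against the arguments.

```python
def distance_between_intervals(start1, end1, start2, end2):
    """Gap between [start1, end1] and [start2, end2]; 0 if they overlap."""
    if start1 > end2:
        # The first interval lies entirely after the second.
        return start1 - end2
    if start2 > end1:
        # The second interval lies entirely after the first.
        return start2 - end1
    return 0
```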

distance_to_locus(other)¶

length¶

offset(position)¶

Offset of given position from stranded start of this locus.

For example, if a Locus goes from 10..20 and is on the negative strand, then the offset of position 13 is 7, whereas if the Locus is on the positive strand, then the offset is 3.
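
The worked example above can be checked with a short stand-alone function. This is a sketch with a hypothetical name and a plain ValueError for out-of-range positions; it is not the Locus method itself.

```python
def stranded_offset(locus_start, locus_end, strand, position):
    """Offset of position from the stranded start of an inclusive locus."""
    if not (locus_start <= position <= locus_end):
        raise ValueError(
            "position %d outside locus %d-%d" % (position, locus_start, locus_end)
        )
    if strand == "+":
        # On the positive strand the locus begins at its smallest coordinate.
        return position - locus_start
    if strand == "-":
        # On the negative strand the locus begins at its largest coordinate.
        return locus_end - position
    raise ValueError("strand must be '+' or '-', got %r" % (strand,))
```

For the locus 10..20, stranded_offset(10, 20, "-", 13) gives 7 and stranded_offset(10, 20, "+", 13) gives 3, matching the example.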

offset_range(start, end)¶: Database start/end entries are always ordered such that start < end. This makes computing a relative position (e.g. of a stop codon relative to its transcript) complicated, since the “end” position of a backwards locus is actually earlier on the strand. This function selects the start or end value depending on this locus’s strand and determines that position’s offset from the earliest position in this locus.

on_backward_strand¶

on_contig(contig)¶

on_forward_strand¶

on_negative_strand¶

on_positive_strand¶

on_strand(strand)¶

overlaps(contig, start, end, strand=None)¶

Does this locus overlap with a given range of positions?

Since locus position ranges are inclusive, we should make sure that e.g. chr1:10-10 overlaps with chr1:10-10
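
A sketch of an inclusive overlap test that also honours the contig and optional strand arguments. SimpleLocus and locus_overlaps are hypothetical names used only for this illustration.

```python
from collections import namedtuple

# Hypothetical minimal record; the real Locus class carries more behaviour.
SimpleLocus = namedtuple("SimpleLocus", ["contig", "start", "end", "strand"])


def locus_overlaps(locus, contig, start, end, strand=None):
    """Inclusive overlap test on the same contig and (optionally) strand."""
    same_contig = locus.contig == contig
    same_strand = strand is None or locus.strand == strand
    # Inclusive ranges overlap when each starts no later than the other ends.
    return same_contig and same_strand and locus.start <= end and start <= locus.end
```

Because both comparisons use <=, a single-base locus such as chr1:10-10 overlaps itself.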

overlaps_locus(other_locus)¶

to_dict()¶

pyensembl.memory_cache module¶

Caching and serializing the results of expensive computations. Used in pyensembl primarily to cache the heavy-weight parsing of GTF files and various filtering operations on Ensembl entries.

A piece of data is returned from one of three sources:

1) Cache cold: run the user-supplied compute_fn.
2) Cache warm on disk: parse or unpickle the serialized result into memory.
3) Cache warm in memory: return the cached object.

class pyensembl.memory_cache.MemoryCache¶

Bases: object

In-memory and on-disk caching of long-running queries and computations.

cached_dataframe(csv_path, compute_fn)¶

If a CSV path is in the _memory_cache, then return that cached value.

If we’ve already saved the DataFrame as a CSV then load it.

Otherwise run the provided compute_fn, store its result in memory, and save it as a CSV.

cached_object(path, compute_fn)¶

If cached_object has already been called for a value of path in this running Python instance, then it should have a cached value in the _memory_cache; return that value.

If this function was never called before with a particular value of path, then call compute_fn, and pickle it to path.

If path already exists, unpickle it and store that value in _memory_cache.
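
The lookup order described for cached_object (memory first, then disk, then compute) can be sketched with pickle and a dictionary. TwoLevelCache is a hypothetical class written for this illustration, not the MemoryCache implementation.

```python
import os
import pickle
import tempfile


class TwoLevelCache(object):
    """Sketch of an in-memory cache backed by pickled files on disk."""

    def __init__(self):
        self._memory_cache = {}

    def cached_object(self, path, compute_fn):
        # 1) Warm in memory: return the object already held for this path.
        if path in self._memory_cache:
            return self._memory_cache[path]
        if os.path.exists(path):
            # 2) Warm on disk: unpickle the previously saved result.
            with open(path, "rb") as f:
                obj = pickle.load(f)
        else:
            # 3) Cold: compute the value and pickle it for later runs.
            obj = compute_fn()
            with open(path, "wb") as f:
                pickle.dump(obj, f)
        self._memory_cache[path] = obj
        return obj
```

A second TwoLevelCache instance pointed at the same path loads the pickled value without calling compute_fn again, which is the behaviour a fresh Python process would see.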

clear_cached_objects()¶

delete_file(path)¶

is_empty(filename)¶

remove_from_cache(key)¶

pyensembl.search module¶

Helper functions for searching over collections of PyEnsembl objects

pyensembl.search.find_nearest_locus(start, end, loci)¶: Finds nearest locus (object with method distance_to_interval) to the interval defined by the given start and end positions. Returns the distance to that locus, along with the locus object itself.
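
A minimal sketch of a nearest-locus search over any objects that provide distance_to_interval. The Interval class and the find_nearest name are hypothetical, and returning (inf, None) for an empty collection is an assumption of this sketch.

```python
class Interval(object):
    """Hypothetical stand-in for a locus; only one method is needed."""

    def __init__(self, name, start, end):
        self.name = name
        self.start = start
        self.end = end

    def distance_to_interval(self, start, end):
        if self.start > end:
            return self.start - end
        if start > self.end:
            return start - self.end
        return 0


def find_nearest(start, end, loci):
    """Return (distance, locus) for the locus closest to [start, end]."""
    best_distance = float("inf")
    best_locus = None
    for locus in loci:
        distance = locus.distance_to_interval(start, end)
        if distance < best_distance:
            best_distance, best_locus = distance, locus
    return best_distance, best_locus
```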

pyensembl.sequence_data module¶

class pyensembl.sequence_data.SequenceData(fasta_path, require_ensembl_ids=False, cache_directory_path=None)¶

Bases: object

Container for reference nucleotide and amino acid sequences.

clear_cache()¶

fasta_dictionary¶

get(sequence_id)¶: Get sequence associated with given ID or return None if missing

index(overwrite=False)¶

pyensembl.shell module¶

Manipulate pyensembl’s local cache.

%(prog)s {install, delete, delete-sequence-cache} [--release XXX --species human...]

To install particular Ensembl human release(s): %(prog)s install --release 75 77
To install particular Ensembl mouse release(s): %(prog)s install --release 75 77 --species mouse
To delete all downloaded and cached data for a particular Ensembl release: %(prog)s delete-all-files --release 75 --species human
To delete only cached data related to transcript and protein sequences: %(prog)s delete-index-files --release 75
To install any genome: %(prog)s install --reference-name "GRCh38" --gtf URL_OR_PATH --transcript-fasta URL_OR_PATH --protein-fasta URL_OR_PATH

pyensembl.shell.run()¶

pyensembl.species module¶

class pyensembl.species.Species(latin_name, synonyms=[], reference_assemblies={})¶

Bases: serializable.serializable.Serializable

Container for combined information about a species name, its synonym names, and which reference to use for this species in each Ensembl release.

classmethod from_dict(state_dict)¶

classmethod register(latin_name, synonyms, reference_assemblies)¶: Create a Species object from the given arguments and enter into all the dicts used to look the species up by its fields.

to_dict()¶

which_reference(ensembl_release)¶
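
The reference_assemblies mapping shown in the EnsemblRelease signature pairs each assembly name with two release numbers. Assuming each pair is an inclusive (first, last) release range, a lookup from release to reference could be sketched as follows; the function name and the ValueError are choices made for this illustration.

```python
# Values copied from the default human Species shown in this reference.
HUMAN_REFERENCE_ASSEMBLIES = {
    "GRCh38": (76, 87),
    "GRCh37": (55, 75),
    "NCBI36": (54, 54),
}


def reference_for_release(reference_assemblies, ensembl_release):
    """Find the assembly whose assumed inclusive range covers the release."""
    for name, (first_release, last_release) in reference_assemblies.items():
        if first_release <= ensembl_release <= last_release:
            return name
    raise ValueError("no reference found for release %d" % ensembl_release)
```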

pyensembl.species.check_species_object(species_name_or_object)¶: Helper for validating user supplied species names or objects.

pyensembl.species.find_species_by_name(species_name)¶

pyensembl.species.find_species_by_reference(reference_name)¶

pyensembl.species.max_ensembl_release(reference_name)¶

pyensembl.species.normalize_reference_name(name)¶

Search the dictionary of species-specific references to find a reference name that matches aside from capitalization.

If no matching reference is found, raise an exception.

pyensembl.species.normalize_species_name(name)¶: If species name was “Homo sapiens” then replace spaces with underscores and return “homo_sapiens”. Also replace common names like “human” with “homo_sapiens”.
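
A sketch of the normalization just described, using a small hand-written synonym table. The table contents and the function name are illustrative assumptions; pyensembl builds its lookup from registered Species objects.

```python
# Illustrative synonym table; not the full set known to pyensembl.
COMMON_NAME_TO_LATIN = {
    "human": "homo_sapiens",
    "mouse": "mus_musculus",
}


def normalize_name(name):
    """Lower-case, replace spaces with underscores, resolve common names."""
    key = name.strip().lower().replace(" ", "_")
    return COMMON_NAME_TO_LATIN.get(key, key)
```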

pyensembl.species.which_reference(species_name, ensembl_release)¶

pyensembl.transcript module¶

class pyensembl.transcript.Transcript(transcript_id, transcript_name, contig, start, end, strand, biotype, gene_id, genome)¶

Bases: pyensembl.locus_with_genome.LocusWithGenome

Transcript encompasses the locus, exons, and sequence of a transcript.

Lazily fetches the sequence: when constructing many Transcripts without using their sequences, this avoids the memory/performance overhead of fetching and storing sequences from a FASTA file.

coding_sequence¶: cDNA coding sequence (from start codon to stop codon, without any introns)

coding_sequence_position_ranges¶: Return absolute chromosome position ranges for CDS fragments of this transcript

complete¶: Consider a transcript complete if it has start and stop codons and a coding sequence whose length is divisible by 3

contains_start_codon¶: Does this transcript have an annotated start_codon entry?

contains_stop_codon¶: Does this transcript have an annotated stop_codon entry?

exon_intervals¶: List of (start,end) tuples for each exon of this transcript, in the order specified by the ‘exon_number’ column of the exon table.

exons¶

first_start_codon_spliced_offset¶: Offset of first nucleotide in start codon into the spliced mRNA (excluding introns)

five_prime_utr_sequence¶: cDNA sequence of 5’ UTR (untranslated region at the beginning of the transcript)

gene¶

id¶: Alias for transcript_id necessary for backward compatibility.

last_stop_codon_spliced_offset¶: Offset of last nucleotide in stop codon into the spliced mRNA (excluding introns)

name¶: Alias for transcript_name necessary for backward compatibility.

protein_id¶

protein_sequence¶

sequence¶: Spliced cDNA sequence of transcript (includes 5’ UTR, coding sequence, and 3’ UTR)

spliced_offset(position)¶

Convert from an absolute chromosomal position to the offset into this transcript’s spliced mRNA.

Position must be inside some exon (otherwise raise exception).
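
To make the conversion concrete, the sketch below walks a list of inclusive exon intervals given in transcription order and accumulates exon lengths until it reaches the exon containing the position. The function name, the argument layout, and the use of ValueError are assumptions; the Transcript method works from its own exon objects.

```python
def spliced_offset_of(exon_intervals, strand, position):
    """Offset of a chromosomal position into the spliced transcript.

    exon_intervals: (start, end) pairs, inclusive, in transcription order
    (so for a negative-strand transcript the highest coordinates come first).
    """
    offset = 0
    for start, end in exon_intervals:
        if start <= position <= end:
            if strand == "+":
                return offset + (position - start)
            # Negative strand: transcription runs from end down to start.
            return offset + (end - position)
        # Skip this whole exon and keep counting spliced bases.
        offset += end - start + 1
    raise ValueError("position %d is not inside any exon" % position)
```

Intronic positions never contribute to the offset, which is why a position outside every exon has no spliced offset and raises an error.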

start_codon_positions¶: Chromosomal positions of nucleotides in start codon.

start_codon_spliced_offsets¶: Offsets from start of spliced mRNA transcript of nucleotides in start codon.

start_codon_unspliced_offsets¶: Offsets from start of unspliced pre-mRNA transcript of nucleotides in start codon.

stop_codon_positions¶: Chromosomal positions of nucleotides in stop codon.

stop_codon_spliced_offsets¶: Offsets from start of spliced mRNA transcript of nucleotides in stop codon.

stop_codon_unspliced_offsets¶: Offsets from start of unspliced pre-mRNA transcript of nucleotides in stop codon.

three_prime_utr_sequence¶: cDNA sequence of 3’ UTR (untranslated region at the end of the transcript)

to_dict()¶

Module contents¶

class pyensembl.MemoryCache¶

Bases: object

In-memory and on-disk caching of long-running queries and computations.

cached_dataframe(csv_path, compute_fn)¶

If a CSV path is in the _memory_cache, then return that cached value.

If we’ve already saved the DataFrame as a CSV then load it.

Otherwise run the provided compute_fn, store its result in memory, and save it as a CSV.

cached_object(path, compute_fn)¶

If cached_object has already been called for a value of path in this running Python instance, then it should have a cached value in the _memory_cache; return that value.

If this function was never called before with a particular value of path, then call compute_fn, and pickle it to path.

If path already exists, unpickle it and store that value in _memory_cache.

clear_cached_objects()¶

delete_file(path)¶

is_empty(filename)¶

remove_from_cache(key)¶

class pyensembl.DownloadCache(reference_name, annotation_name, annotation_version=None, decompress_on_download=False, copy_local_files_to_cache=False, install_string_function=None, cache_directory_path=None)¶

Bases: object

Downloads remote files to cache, optionally copies local files into cache, raises custom message if data is missing.

cache_directory_path¶

cached_path(path_or_url)¶: When downloading remote files, the default behavior is to name local files the same as their remote counterparts.

delete_cache_directory()¶

delete_cached_files(prefixes=[], suffixes=[])¶: Deletes any cached files matching the prefixes or suffixes given

download_or_copy_if_necessary(path_or_url, download_if_missing=False, overwrite=False)¶

Get the local path to a possibly remote file, downloading it or copying it into the cache first if necessary.

Download if file is missing from the cache directory and download_if_missing is True. Download even if local file exists if both download_if_missing and overwrite are True.

If the file is on the local file system then return its path, unless self.copy_local_to_cache is True, in which case copy it to the cache first.

path_or_url : str

download_if_missing : bool, optional: Download files if missing from local cache
overwrite : bool, optional: Overwrite existing copy if it exists

is_url_format(path_or_url)¶

local_path_or_install_error(field_name, path_or_url, download_if_missing=False, overwrite=False)¶

class pyensembl.EnsemblRelease(release=87, species=Species(latin_name='homo_sapiens', synonyms=['human'], reference_assemblies={'GRCh38': (76, 87), 'GRCh37': (55, 75), 'NCBI36': (54, 54)}), server='ftp://ftp.ensembl.org')¶

Bases: pyensembl.genome.Genome

Bundles together the genomic annotation and sequence data associated with a particular release of the Ensembl database.

classmethod cached(release=87, species=Species(latin_name='homo_sapiens', synonyms=['human'], reference_assemblies={'GRCh38': (76, 87), 'GRCh37': (55, 75), 'NCBI36': (54, 54)}), server='ftp://ftp.ensembl.org')¶: Construct EnsemblRelease if it’s never been made before, otherwise return an old instance.

classmethod from_dict(state_dict)¶: Deserialize EnsemblRelease without creating duplicate instances.

install_string()¶

classmethod normalize_init_values(release, species, server)¶: Normalizes the arguments which uniquely specify an EnsemblRelease genome.

to_dict()¶

pyensembl.cached_release(release, species='human')¶

Create an EnsemblRelease instance only if it hasn’t already been made; otherwise return the old instance.

This function is kept for backwards compatibility; its functionality has moved into the cached method of EnsemblRelease.

class pyensembl.Gene(gene_id, gene_name, contig, start, end, strand, biotype, genome)¶

Bases: pyensembl.locus_with_genome.LocusWithGenome

exons¶

id¶: Alias for gene_id necessary for backwards compatibility.

name¶: Alias for gene_name necessary for backwards compatibility.

to_dict()¶

transcripts¶: Property which dynamically constructs Transcript objects for all transcript IDs associated with this gene.

class pyensembl.Transcript(transcript_id, transcript_name, contig, start, end, strand, biotype, gene_id, genome)¶

Bases: pyensembl.locus_with_genome.LocusWithGenome

Transcript encompasses the locus, exons, and sequence of a transcript.

Lazily fetches the sequence: when constructing many Transcripts without using their sequences, this avoids the memory/performance overhead of fetching and storing sequences from a FASTA file.

coding_sequence¶: cDNA coding sequence (from start codon to stop codon, without any introns)

coding_sequence_position_ranges¶: Return absolute chromosome position ranges for CDS fragments of this transcript

complete¶: Consider a transcript complete if it has start and stop codons and a coding sequence whose length is divisible by 3

contains_start_codon¶: Does this transcript have an annotated start_codon entry?

contains_stop_codon¶: Does this transcript have an annotated stop_codon entry?

exon_intervals¶: List of (start,end) tuples for each exon of this transcript, in the order specified by the ‘exon_number’ column of the exon table.

exons¶

first_start_codon_spliced_offset¶: Offset of first nucleotide in start codon into the spliced mRNA (excluding introns)

five_prime_utr_sequence¶: cDNA sequence of 5’ UTR (untranslated region at the beginning of the transcript)

gene¶

id¶: Alias for transcript_id necessary for backward compatibility.

last_stop_codon_spliced_offset¶: Offset of last nucleotide in stop codon into the spliced mRNA (excluding introns)

name¶: Alias for transcript_name necessary for backward compatibility.

protein_id¶

protein_sequence¶

sequence¶: Spliced cDNA sequence of transcript (includes 5’ UTR, coding sequence, and 3’ UTR)

spliced_offset(position)¶

Convert from an absolute chromosomal position to the offset into this transcript’s spliced mRNA.

Position must be inside some exon (otherwise raise exception).

start_codon_positions¶: Chromosomal positions of nucleotides in start codon.

start_codon_spliced_offsets¶: Offsets from start of spliced mRNA transcript of nucleotides in start codon.

start_codon_unspliced_offsets¶: Offsets from start of unspliced pre-mRNA transcript of nucleotides in start codon.

stop_codon_positions¶: Chromosomal positions of nucleotides in stop codon.

stop_codon_spliced_offsets¶: Offsets from start of spliced mRNA transcript of nucleotides in stop codon.

stop_codon_unspliced_offsets¶: Offsets from start of unspliced pre-mRNA transcript of nucleotides in stop codon.

three_prime_utr_sequence¶: cDNA sequence of 3’ UTR (untranslated region at the end of the transcript)

to_dict()¶

class pyensembl.Exon(exon_id, contig, start, end, strand, gene_name, gene_id)¶

Bases: pyensembl.locus.Locus

to_dict()¶

class pyensembl.SequenceData(fasta_path, require_ensembl_ids=False, cache_directory_path=None)¶

Bases: object

Container for reference nucleotide and amino acid sequences.

clear_cache()¶

fasta_dictionary¶

get(sequence_id)¶: Get sequence associated with given ID or return None if missing

index(overwrite=False)¶

pyensembl.find_nearest_locus(start, end, loci)¶: Finds nearest locus (object with method distance_to_interval) to the interval defined by the given start and end positions. Returns the distance to that locus, along with the locus object itself.

pyensembl.find_species_by_name(species_name)¶

pyensembl.find_species_by_reference(reference_name)¶

pyensembl.which_reference(species_name, ensembl_release)¶

pyensembl.check_species_object(species_name_or_object)¶: Helper for validating user supplied species names or objects.

pyensembl.normalize_reference_name(name)¶

Search the dictionary of species-specific references to find a reference name that matches aside from capitalization.

If no matching reference is found, raise an exception.
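A minimal sketch of a case-insensitive lookup of this kind follows. It is not pyensembl's implementation: the function name, the explicit `known_references` argument (the library searches its own species-specific dictionary), and the use of `ValueError` are assumptions for the example.

```python
def normalize_reference_name_sketch(name, known_references):
    """Return the canonical spelling of a reference name.

    known_references : iterable of canonical names (assumed input;
                       pyensembl uses its own internal dictionary).
    """
    lowered = name.strip().lower()
    for reference in known_references:
        # Match while ignoring capitalization only.
        if reference.lower() == lowered:
            return reference
    raise ValueError("Reference genome '%s' not found" % name)
```

Given the names `GRCh37`, `GRCh38` and `GRCm38`, the input `grch38` is returned as `GRCh38`, while an unknown name raises.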

pyensembl.normalize_species_name(name)¶: Normalize a species name. For example, “Homo sapiens” has its spaces replaced with underscores and is returned as “homo_sapiens”. Common names like “human” are also replaced with “homo_sapiens”.
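The normalization can be illustrated with a short standalone sketch. The synonym table below is a small illustrative sample, not the library's actual list, and the lowercasing step is inferred from the “Homo sapiens” example rather than taken from the source code.

```python
# Illustrative synonym table (assumption; pyensembl keeps its own list).
COMMON_NAMES_SKETCH = {
    "human": "homo_sapiens",
    "mouse": "mus_musculus",
}


def normalize_species_name_sketch(name):
    """Lowercase, replace spaces with underscores, then map common names."""
    normalized = name.strip().lower().replace(" ", "_")
    return COMMON_NAMES_SKETCH.get(normalized, normalized)
```

Both `"Homo sapiens"` and `"human"` come out as `"homo_sapiens"` under these assumptions.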

class pyensembl.Genome(reference_name, annotation_name, annotation_version=None, gtf_path_or_url=None, transcript_fasta_path_or_url=None, protein_fasta_path_or_url=None, decompress_on_download=False, copy_local_files_to_cache=False, require_ensembl_ids=True, cache_directory_path=None)¶

Bases: serializable.serializable.Serializable

Bundles together the genomic annotation and sequence data associated with a particular genomic database source (e.g. a single Ensembl release) and provides a wide variety of helper methods for accessing this data.

clear_cache()¶: Clear any in-memory cached values and short-lived on-disk materializations from MemoryCache

contigs()¶: Returns all contig names for any gene in the genome (field called “seqname” in Ensembl GTF files)

db¶

delete_index_files()¶: Delete all data aside from source GTF and FASTA files

download(overwrite=False)¶

Download data files needed by this Genome instance.

overwrite : bool, optional: Download files regardless of whether a local copy already exists.

exon_by_id(exon_id)¶: Construct an Exon object from its ID by looking up the exon’s properties in the given Database.

exon_ids(contig=None, strand=None)¶

exon_ids_at_locus(contig, position, end=None, strand=None)¶

exon_ids_of_gene_id(gene_id)¶

exon_ids_of_gene_name(gene_name)¶

exon_ids_of_transcript_id(transcript_id)¶

exon_ids_of_transcript_name(transcript_name)¶

exons(contig=None, strand=None)¶: Create an Exon object for every exon in the database, optionally restricted to a particular chromosome using the contig argument.

exons_at_locus(contig, position, end=None, strand=None)¶

gene_by_id(gene_id)¶: Construct a Gene object for the given gene ID.

gene_by_protein_id(protein_id)¶: Get the gene ID associated with the given protein ID and return its Gene object.

gene_id_of_protein_id(protein_id)¶: What is the gene ID associated with a given protein ID?

gene_ids(contig=None, strand=None)¶: What are all the gene IDs (optionally restricted to a given chromosome/contig and/or strand)?

gene_ids_at_locus(contig, position, end=None, strand=None)¶

gene_ids_of_gene_name(gene_name)¶: What are the gene IDs associated with a given gene name? (due to copy events, there might be multiple genes per name)

gene_name_of_exon_id(exon_id)¶

gene_name_of_gene_id(gene_id)¶

gene_name_of_transcript_id(transcript_id)¶

gene_name_of_transcript_name(transcript_name)¶

gene_names(contig=None, strand=None)¶: Return all gene names in the database, optionally restricted to a chromosome and/or strand.

gene_names_at_locus(contig, position, end=None, strand=None)¶

genes(contig=None, strand=None)¶

Returns all Gene objects in the database. Can be restricted to a particular contig/chromosome and strand by the following arguments:

contig : str: Only return genes on the given contig.
strand : str: Only return genes on this strand.

genes_at_locus(contig, position, end=None, strand=None)¶

genes_by_name(gene_name)¶: Get all the unique genes with the given name (there might be multiple due to copies in the genome) and return a list containing a Gene object for each distinct ID.

gtf¶

index(overwrite=False)¶: Assuming that all necessary data for this Genome has been downloaded, generate the GTF database and save an efficient representation of the FASTA sequence files.

install_string()¶: Add every missing file to the install string shown to the user in an error message.

loci_of_gene_names(gene_name)¶

Given a gene name, returns a list of Locus objects with fields: chromosome, start, stop, strand

You can get multiple results since a gene might have multiple copies in the genome.

locus_of_exon_id(exon_id)¶: Given an exon ID returns Locus

locus_of_gene_id(gene_id)¶: Given a gene ID returns Locus with: chromosome, start, stop, strand

locus_of_transcript_id(transcript_id)¶

protein_ids(contig=None, strand=None)¶: What are all the protein IDs (optionally restricted to a given chromosome and/or strand)?

protein_ids_at_locus(contig, position, end=None, strand=None)¶

protein_sequence(protein_id)¶: Return the amino acid sequence of the protein, or None if no sequence is available for the given protein ID.

protein_sequences¶

to_dict()¶: Returns a dictionary of the essential fields of this Genome.

transcript_by_id(transcript_id)¶: Construct Transcript object with given transcript ID

transcript_by_protein_id(protein_id)¶

transcript_id_of_protein_id(protein_id)¶: What is the transcript ID associated with a given protein ID?

transcript_ids(contig=None, strand=None)¶

transcript_ids_at_locus(contig, position, end=None, strand=None)¶

transcript_ids_of_exon_id(exon_id)¶

transcript_ids_of_gene_id(gene_id)¶

transcript_ids_of_gene_name(gene_name)¶

transcript_ids_of_transcript_name(transcript_name)¶

transcript_name_of_transcript_id(transcript_id)¶

transcript_names(contig=None, strand=None)¶: What are all the transcript names in the database (optionally restricted to a given chromosome and/or strand)?

transcript_names_at_locus(contig, position, end=None, strand=None)¶

transcript_names_of_gene_name(gene_name)¶

transcript_sequence(transcript_id)¶: Return cDNA nucleotide sequence of transcript, or None if transcript doesn’t have cDNA sequence.

transcript_sequences¶

transcripts(contig=None, strand=None)¶: Construct Transcript object for every transcript entry in the database. Optionally restrict to a particular chromosome using the contig argument.

transcripts_at_locus(contig, position, end=None, strand=None)¶

transcripts_by_name(transcript_name)¶

class pyensembl.GTF(gtf_path, cache_directory_path=None)¶

Bases: object

Parse a GTF gene annotation file from a given local path. Represent its contents as a Pandas DataFrame (optionally filtered by locus, column, contig, etc.).

clear_cache()¶

data_subset_path(contig=None, feature=None, column=None, strand=None, distinct=False, extension='.csv')¶

Path to cached file for storing materialized views of the genomic data. Typically this is a CSV file whose filename reflects which filters have been applied to the entries of the database.

Parameters:

contig : str, optional: Path for subset of data restricted to given contig
feature : str, optional: Path for subset of data restricted to given feature
column : str, optional: Restrict to single column
strand : str, optional: Positive (“+”) or negative (“-”) DNA strand. Default = either.
distinct : bool, optional: Only keep unique values (default=False)
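One way a filename can encode the applied filters is sketched below. The naming scheme, the function name, and the `cache_dir` and `base` arguments are all hypothetical; the actual filenames produced by pyensembl may look different.

```python
import os


def data_subset_path_sketch(cache_dir, base, contig=None, feature=None,
                            column=None, strand=None, distinct=False,
                            extension=".csv"):
    """Build a cache path whose filename lists every active filter.

    Hypothetical naming scheme, shown only to illustrate the idea.
    """
    parts = [base]
    filters = [("contig", contig), ("feature", feature),
               ("column", column), ("strand", strand)]
    for label, value in filters:
        # Only filters that were actually supplied appear in the name.
        if value is not None:
            parts.append("%s.%s" % (label, value))
    if distinct:
        parts.append("distinct")
    return os.path.join(cache_dir, ".".join(parts) + extension)
```

Because every distinct combination of filters yields a distinct filename, each materialized view can be cached and reloaded independently.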

dataframe(contig=None, feature=None, strand=None, save_to_disk=False)¶: Load genome entries as a DataFrame, optionally restricted to particular contig or feature type.

dataframe_at_locus(contig, start, end=None, offset=None, strand=None)¶: Subset of entries which overlap an inclusive range of chromosomal positions

dataframe_column_at_locus(column_name, contig, start, end=None, offset=None, strand=None)¶: Subset of entries which overlap an inclusive range of loci

class pyensembl.Locus(contig, start, end, strand)¶

Bases: serializable.serializable.Serializable

Base class for any entity which can be localized at a range of positions on a particular strand of a chromosome/contig.

can_overlap(contig, strand=None)¶: Is this locus on the same contig and (optionally) on the same strand?

contains(contig, start, end, strand=None)¶

contains_locus(other_locus)¶

distance_to_interval(start, end)¶: Find the distance between intervals [start1, end1] and [start2, end2]. If the intervals overlap then the distance is 0.
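The distance rule can be written out as a standalone function. This is a sketch for illustration, assuming inclusive integer coordinates; the function name is hypothetical and the code is not taken from the library.

```python
def distance_between_intervals_sketch(start1, end1, start2, end2):
    """Gap between [start1, end1] and [start2, end2]; 0 if they overlap."""
    if start1 > end2:
        # First interval lies entirely after the second.
        return start1 - end2
    if end1 < start2:
        # First interval lies entirely before the second.
        return start2 - end1
    return 0
```

For example, 10..20 and 25..30 are 5 apart, while 10..20 and 20..30 share position 20 and so have distance 0.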

distance_to_locus(other)¶

length¶

offset(position)¶

Offset of given position from stranded start of this locus.

For example, if a Locus goes from 10..20 and is on the negative strand, then the offset of position 13 is 7, whereas if the Locus is on the positive strand, then the offset is 3.
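The example above corresponds to the following standalone sketch. The function name and the `ValueError` for out-of-range positions are assumptions; the code only mirrors the arithmetic described in the text.

```python
def stranded_offset_sketch(position, start, end, strand):
    """Offset of position from the stranded start of locus [start, end]."""
    if not (start <= position <= end):
        raise ValueError("Position %d outside %d..%d" % (position, start, end))
    if strand == "+":
        # Positive strand: the locus begins at its smallest coordinate.
        return position - start
    # Negative strand: the locus begins at its largest coordinate.
    return end - position
```

Calling it with position 13 on a locus spanning 10..20 gives 7 for the negative strand and 3 for the positive strand, matching the description.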

offset_range(start, end)¶: Database start/end entries are always ordered such that start < end. This makes computing a relative position (e.g. of a stop codon relative to its transcript) complicated since the “end” position of a backwards locus is actually earlier on the strand. This function correctly selects a start vs. end value depending on this locus’s strand and determines that position’s offset from the earliest position in this locus.
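A standalone sketch of this strand-aware range conversion is shown below. It assumes inclusive coordinates and a sub-range fully contained in the locus; the function name, argument names and error handling are hypothetical rather than copied from pyensembl.

```python
def offset_range_sketch(locus_start, locus_end, strand, start, end):
    """Convert a chromosomal sub-range into stranded (first, last) offsets."""
    if start > end:
        raise ValueError("Expected start <= end")
    if start < locus_start or end > locus_end:
        raise ValueError("Range %d..%d is outside the locus" % (start, end))
    if strand == "+":
        return (start - locus_start, end - locus_start)
    # On the negative strand the chromosomal "end" comes first.
    return (locus_end - end, locus_end - start)
```

For a locus at 10..20, the sub-range 12..15 yields offsets (2, 5) on the positive strand and (5, 8) on the negative strand.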

on_backward_strand¶

on_contig(contig)¶

on_forward_strand¶

on_negative_strand¶

on_positive_strand¶

on_strand(strand)¶

overlaps(contig, start, end, strand=None)¶

Does this locus overlap with a given range of positions?

Since locus position ranges are inclusive, we should make sure that e.g. chr1:10-10 overlaps with chr1:10-10
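The inclusive comparison can be sketched as a single boolean test. This standalone function ignores contig and strand checks and uses a hypothetical name; it illustrates only the coordinate comparison.

```python
def ranges_overlap_sketch(start1, end1, start2, end2):
    """True if inclusive ranges [start1, end1] and [start2, end2] intersect."""
    # Using <= (not <) makes single-base ranges overlap themselves.
    return start1 <= end2 and start2 <= end1
```

Under this test the single-base range 10..10 overlaps 10..10, and 10..20 overlaps 20..30, but 10..20 does not overlap 21..30.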

overlaps_locus(other_locus)¶

to_dict()¶
