pyensembl package¶
Submodules¶
pyensembl.biotypes module¶
pyensembl.common module¶
-
pyensembl.common.dump_pickle(obj, filepath)¶
-
pyensembl.common.is_valid_ensembl_id(ensembl_id)¶ Is the argument a valid ID for any Ensembl feature?
-
pyensembl.common.is_valid_human_protein_id(protein_id)¶ Is the argument a valid identifier for human Ensembl proteins?
-
pyensembl.common.is_valid_human_transcript_id(transcript_id)¶ Is the argument a valid identifier for human Ensembl transcripts?
-
pyensembl.common.load_pickle(filepath)¶
-
pyensembl.common.memoize(fn)¶ Simple reset-able memoization decorator for functions and methods, assumes that all arguments to the function can be hashed and compared.
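The idea of a reset-able memoization decorator can be sketched as below. This is a minimal illustration, not pyensembl's implementation: the `clear_cache` attribute and the `square` function are hypothetical names chosen for the example.

```python
import functools


def memoize(fn):
    """Cache results of fn, keyed by its (hashable) arguments."""
    cache = {}

    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        # Sort keyword arguments so the key does not depend on call order.
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = fn(*args, **kwargs)
        return cache[key]

    # Hypothetical reset hook: empties the cache so values are recomputed.
    wrapped.clear_cache = cache.clear
    return wrapped


calls = []


@memoize
def square(x):
    calls.append(x)
    return x * x


square(4)
square(4)                 # served from the cache, fn is not called again
calls_before_reset = len(calls)
square.clear_cache()
square(4)                 # recomputed after the reset
calls_after_reset = len(calls)
```

Because the arguments form a dictionary key, unhashable arguments such as lists would raise a TypeError, which is why the docstring requires hashable arguments.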
-
pyensembl.common.require_ensembl_id(ensembl_id)¶
-
pyensembl.common.require_human_protein_id(protein_id)¶
-
pyensembl.common.require_human_transcript_id(transcript_id)¶
pyensembl.database module¶
-
class
pyensembl.database.Database(gtf, install_string)¶ Bases:
object
Wrapper around sqlite3 database so that the rest of the library doesn’t have to worry about constructing the .db file or writing SQL queries directly.
-
PRIMARY_KEY_COLUMNS= {'gene': 'gene_id', 'transcript': 'transcript_id'}¶
-
column_exists(*args, **kwargs)¶
-
column_values_at_locus(column_name, feature, contig, position, end=None, strand=None, distinct=False, sorted=False)¶ Get the non-null values of a column from the database at a particular range of loci
-
columns(*args, **kwargs)¶
-
connect_or_create(overwrite=False)¶ Return a connection to the database if it exists, otherwise create it. Overwrite the existing database if overwrite is True.
-
connection¶ Get a connection to the database or raise an exception
-
create(overwrite=False)¶ Create the local database (including indexing) if it’s not already set up. If overwrite is True, always re-create the database from scratch.
Returns a connection to the database.
-
distinct_column_values_at_locus(column, feature, contig, position, end=None, strand=None)¶ Gather all the distinct values for a property/column at some specified locus.
- column : str
- Which property are we getting the values of.
- feature : str
- Which type of entry (e.g. transcript, exon, gene) is the property associated with?
- contig : str
- Chromosome or unplaced contig name
- position : int
- Chromosomal position
- end : int, optional
- End position of a range, if unspecified assume we’re only looking at the single given position.
- strand : str, optional
- Either the positive (‘+’) or negative strand (‘-’). If unspecified then check for values on either strand.
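A locus query of this kind can be sketched with the standard sqlite3 module. The table layout below (columns `seqname`, `start_pos`, `end_pos`, `strand`) and the function name are assumptions made for illustration; the schema pyensembl actually builds may differ.

```python
import sqlite3


def distinct_values_at_locus(connection, column, feature, contig, position,
                             end=None, strand=None):
    """Distinct non-null values of `column` for rows overlapping a locus."""
    if end is None:
        end = position  # a single position is a range of length one
    # Table and column names cannot be bound as '?' parameters, so they are
    # interpolated here; real code should validate them first.
    sql = (
        "SELECT DISTINCT {column} FROM {table} "
        "WHERE seqname = ? AND start_pos <= ? AND end_pos >= ? "
        "AND {column} IS NOT NULL"
    ).format(column=column, table=feature)
    params = [contig, end, position]
    if strand is not None:
        sql += " AND strand = ?"
        params.append(strand)
    return sorted(row[0] for row in connection.execute(sql, params))


# Hypothetical toy data: three transcripts on two contigs.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE transcript "
    "(transcript_id TEXT, seqname TEXT, start_pos INT, end_pos INT, strand TEXT)")
connection.executemany(
    "INSERT INTO transcript VALUES (?, ?, ?, ?, ?)",
    [("T1", "1", 100, 200, "+"),
     ("T2", "1", 150, 300, "-"),
     ("T3", "2", 100, 200, "+")])

either_strand = distinct_values_at_locus(
    connection, "transcript_id", "transcript", "1", 160)
plus_only = distinct_values_at_locus(
    connection, "transcript_id", "transcript", "1", 160, strand="+")
```

A row overlaps the query range when it starts at or before the query end and ends at or after the query start, which is what the two inequalities express.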
-
local_db_filename()¶
-
local_db_path()¶
-
query(*args, **kwargs)¶ Construct a SQL query and run against the sqlite3 database, filtered both by the feature type and a user-provided column/value.
-
query_distinct_on_contig(column_name, feature, contig)¶
-
query_feature_values(*args, **kwargs)¶ Run a SQL query against the sqlite3 database, filtered only on the feature type.
-
query_loci(filter_column, filter_value, feature)¶ Query for loci satisfying a given filter and feature type.
- filter_column : str
- Name of column to filter results by.
- filter_value : str
- Only return loci which have this value in their filter_column.
- feature : str
- Feature names such as ‘transcript’, ‘gene’, and ‘exon’
Returns list of Locus objects
-
query_locus(filter_column, filter_value, feature)¶ Query for unique locus, raises error if missing or more than one locus in the database.
- filter_column : str
- Name of column to filter results by.
- filter_value : str
- Only return loci which have this value in their filter_column.
- feature : str
- Feature names such as ‘transcript’, ‘gene’, and ‘exon’
Returns single Locus object.
-
query_one(select_column_names, filter_column, filter_value, feature, distinct=False, required=False)¶
-
run_sql_query(sql, required=False, query_params=[])¶ Given an arbitrary SQL query, run it against the database and return the results.
- sql : str
- SQL query
- required : bool
- Raise an error if no results found in the database
- query_params : list
- For each ‘?’ in the query there must be a corresponding value in this list.
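The pairing of ‘?’ placeholders with query_params follows the standard sqlite3 convention. The sketch below is a standalone function over an in-memory database, assuming a toy `gene` table; the `ValueError` is a placeholder, since the documentation does not say which exception type is raised.

```python
import sqlite3


def run_sql_query(connection, sql, required=False, query_params=()):
    """Run sql with positional '?' parameters and return all result rows."""
    results = connection.execute(sql, tuple(query_params)).fetchall()
    if required and not results:
        # Assumed error type, chosen only for this sketch.
        raise ValueError("No results found for query: %s" % sql)
    return results


# Hypothetical toy table standing in for the real annotation database.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE gene (gene_id TEXT, gene_name TEXT, seqname TEXT)")
connection.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [("G1", "ALPHA", "1"), ("G2", "BETA", "2")])

rows = run_sql_query(
    connection,
    "SELECT gene_name FROM gene WHERE seqname = ?",
    query_params=["1"])
```

Binding values through placeholders, rather than formatting them into the SQL string, leaves quoting and escaping to sqlite3.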
-
pyensembl.download_cache module¶
-
class
pyensembl.download_cache.DownloadCache(reference_name, annotation_name, annotation_version=None, decompress_on_download=False, copy_local_files_to_cache=False, install_string_function=None, cache_directory_path=None)¶ Bases:
object
Downloads remote files to the cache, optionally copies local files into the cache, and raises a custom message if data is missing.
-
cache_directory_path¶
-
cached_path(path_or_url)¶ When downloading remote files, the default behavior is to name local files the same as their remote counterparts.
-
delete_cache_directory()¶
-
delete_cached_files(prefixes=[], suffixes=[])¶ Deletes any cached files matching the prefixes or suffixes given
-
download_or_copy_if_necessary(path_or_url, download_if_missing=False, overwrite=False)¶ Get the local path to a possibly remote file, downloading or copying it into the cache first if necessary.
Download if file is missing from the cache directory and download_if_missing is True. Download even if local file exists if both download_if_missing and overwrite are True.
If the file is on the local file system then return its path, unless self.copy_local_to_cache is True, and then copy it to the cache first.
- path_or_url : str
- download_if_missing : bool, optional
- Download files if missing from local cache
- overwrite : bool, optional
- Overwrite existing copy if it exists
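The rules above can be read as a small decision table. The function below is only a sketch of that reading: `plan_fetch`, its arguments, and the returned labels are hypothetical names, and no file is actually downloaded or copied.

```python
def plan_fetch(is_remote, in_cache, download_if_missing=False,
               overwrite=False, copy_local_to_cache=False):
    """Return a label for the action the described rules would take."""
    if is_remote:
        # Download when allowed and either the file is absent or
        # overwrite forces a fresh copy.
        if download_if_missing and (overwrite or not in_cache):
            return "download"
        if in_cache:
            return "use_cached"
        # Neither cached nor allowed to download: the caller must report
        # the missing data.
        return "missing"
    # Local files are used in place unless copying was requested.
    if copy_local_to_cache:
        return "copy_to_cache"
    return "use_local_path"


fresh = plan_fetch(is_remote=True, in_cache=False, download_if_missing=True)
forced = plan_fetch(is_remote=True, in_cache=True,
                    download_if_missing=True, overwrite=True)
```

Note that overwrite alone does nothing in this reading: a re-download happens only when download_if_missing is also True.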
-
is_url_format(path_or_url)¶
-
local_path_or_install_error(field_name, path_or_url, download_if_missing=False, overwrite=False)¶
-
-
exception
pyensembl.download_cache.MissingLocalFile(path)¶ Bases:
exceptions.Exception
-
exception
pyensembl.download_cache.MissingRemoteFile(url)¶ Bases:
exceptions.Exception
-
pyensembl.download_cache.cache_subdirectory(reference_name=None, annotation_name=None, annotation_version=None)¶ Which cache subdirectory to use for a given annotation database over a particular reference. All arguments can be omitted to just get the base subdirectory for all pyensembl cached datasets.
pyensembl.ensembl_release module¶
Contains the EnsemblRelease class, which extends the Genome class to be specific to (a particular release of) Ensembl.
-
class
pyensembl.ensembl_release.EnsemblRelease(release=87, species=Species(latin_name='homo_sapiens', synonyms=['human'], reference_assemblies={'GRCh38': (76, 87), 'GRCh37': (55, 75), 'NCBI36': (54, 54)}), server='ftp://ftp.ensembl.org')¶ Bases:
pyensembl.genome.Genome
Bundles together the genomic annotation and sequence data associated with a particular release of the Ensembl database.
-
classmethod
cached(release=87, species=Species(latin_name='homo_sapiens', synonyms=['human'], reference_assemblies={'GRCh38': (76, 87), 'GRCh37': (55, 75), 'NCBI36': (54, 54)}), server='ftp://ftp.ensembl.org')¶ Construct EnsemblRelease if it’s never been made before, otherwise return an old instance.
-
classmethod
from_dict(state_dict)¶ Deserialize EnsemblRelease without creating duplicate instances.
-
install_string()¶
-
classmethod
normalize_init_values(release, species, server)¶ Normalizes the arguments which uniquely specify an EnsemblRelease genome.
-
to_dict()¶
-
classmethod
pyensembl.ensembl_release_versions module¶
-
pyensembl.ensembl_release_versions.check_release_number(release)¶ Check to make sure a release is in the valid range of Ensembl releases.
pyensembl.ensembl_url_templates module¶
Templates for URLs and paths to a specific release, species, and file type on the Ensembl FTP server.
For example, the human chromosomal DNA sequences for release 78 are in:
-
pyensembl.ensembl_url_templates.make_fasta_dna_url(ensembl_release, species, contig, server='ftp://ftp.ensembl.org')¶ Construct URL to FASTA file with full sequence of a particular chromosome. Returns server_url/subdir and filename as tuple result.
-
pyensembl.ensembl_url_templates.make_fasta_url(ensembl_release, species, sequence_type, server='ftp://ftp.ensembl.org')¶ Construct URL to FASTA file with cDNA transcript or protein sequences
- Parameter examples:
- ensembl_release = 75, species = “Homo_sapiens”, sequence_type = “cdna” (other option: “pep”)
-
pyensembl.ensembl_url_templates.make_gtf_url(ensembl_release, species, server='ftp://ftp.ensembl.org')¶ Returns a URL and a filename, which can be joined together.
pyensembl.exon module¶
-
class
pyensembl.exon.Exon(exon_id, contig, start, end, strand, gene_name, gene_id)¶ Bases:
pyensembl.locus.Locus
-
to_dict()¶
-
pyensembl.gene module¶
-
class
pyensembl.gene.Gene(gene_id, gene_name, contig, start, end, strand, biotype, genome)¶ Bases:
pyensembl.locus_with_genome.LocusWithGenome
-
exons¶
-
id¶ Alias for gene_id necessary for backwards compatibility.
-
name¶ Alias for gene_name necessary for backwards compatibility.
-
to_dict()¶
-
transcripts¶ Property which dynamically constructs transcript objects for all transcript IDs associated with this gene.
-
pyensembl.genome module¶
Contains the Genome class, with its millions of accessors and wrappers around an arbitrary genomic database.
-
class
pyensembl.genome.Genome(reference_name, annotation_name, annotation_version=None, gtf_path_or_url=None, transcript_fasta_path_or_url=None, protein_fasta_path_or_url=None, decompress_on_download=False, copy_local_files_to_cache=False, require_ensembl_ids=True, cache_directory_path=None)¶ Bases:
serializable.serializable.Serializable
Bundles together the genomic annotation and sequence data associated with a particular genomic database source (e.g. a single Ensembl release) and provides a wide variety of helper methods for accessing this data.
-
clear_cache()¶ Clear any in-memory cached values and short-lived on-disk materializations from MemoryCache
-
contigs()¶ Returns all contig names for any gene in the genome (field called “seqname” in Ensembl GTF files)
-
db¶
-
delete_index_files()¶ Delete all data aside from source GTF and FASTA files
-
download(overwrite=False)¶ Download data files needed by this Genome instance.
- overwrite : bool, optional
- Download files regardless of whether a local copy already exists.
-
exon_by_id(exon_id)¶ Construct an Exon object from its ID by looking up the exon’s properties in the given Database.
-
exon_ids(contig=None, strand=None)¶
-
exon_ids_at_locus(contig, position, end=None, strand=None)¶
-
exon_ids_of_gene_id(gene_id)¶
-
exon_ids_of_gene_name(gene_name)¶
-
exon_ids_of_transcript_id(transcript_id)¶
-
exon_ids_of_transcript_name(transcript_name)¶
-
exons(contig=None, strand=None)¶ Create an Exon object for every exon in the database, optionally restricted to a particular chromosome using the contig argument.
-
exons_at_locus(contig, position, end=None, strand=None)¶
-
gene_by_id(gene_id)¶ Construct a Gene object for the given gene ID.
-
gene_by_protein_id(protein_id)¶ Get the gene ID associated with the given protein ID, return its Gene object
-
gene_id_of_protein_id(protein_id)¶ What is the gene ID associated with a given protein ID?
-
gene_ids(contig=None, strand=None)¶ What are all the gene IDs (optionally restrict to a given chromosome/contig and/or strand)
-
gene_ids_at_locus(contig, position, end=None, strand=None)¶
-
gene_ids_of_gene_name(gene_name)¶ What are the gene IDs associated with a given gene name? (due to copy events, there might be multiple genes per name)
-
gene_name_of_exon_id(exon_id)¶
-
gene_name_of_gene_id(gene_id)¶
-
gene_name_of_transcript_id(transcript_id)¶
-
gene_name_of_transcript_name(transcript_name)¶
-
gene_names(contig=None, strand=None)¶ Return all gene names in the database, optionally restricted to a chromosome and/or strand.
-
gene_names_at_locus(contig, position, end=None, strand=None)¶
-
genes(contig=None, strand=None)¶ Returns all Gene objects in the database. Can be restricted to a particular contig/chromosome and strand by the following arguments:
- contig : str
- Only return genes on the given contig.
- strand : str
- Only return genes on this strand.
-
genes_at_locus(contig, position, end=None, strand=None)¶
-
genes_by_name(gene_name)¶ Get all the unique genes with the given name (there might be multiple due to copies in the genome) and return a list containing a Gene object for each distinct ID.
-
gtf¶
-
index(overwrite=False)¶ Assuming that all necessary data for this Genome has been downloaded, generate the GTF database and save efficient representation of FASTA sequence files.
-
install_string()¶ Add every missing file to the install string shown to the user in an error message.
-
loci_of_gene_names(gene_name)¶ - Given a gene name returns list of Locus objects with fields:
- chromosome, start, stop, strand
You can get multiple results since a gene might have multiple copies in the genome.
-
locus_of_exon_id(exon_id)¶ Given an exon ID returns Locus
-
locus_of_gene_id(gene_id)¶ Given a gene ID returns Locus with: chromosome, start, stop, strand
-
locus_of_transcript_id(transcript_id)¶
-
protein_ids(contig=None, strand=None)¶ What are all the protein IDs (optionally restrict to a given chromosome and/or strand)
-
protein_ids_at_locus(contig, position, end=None, strand=None)¶
-
protein_sequence(protein_id)¶ Return the amino acid sequence of the protein with the given ID, or None if no sequence is available for it.
-
protein_sequences¶
-
to_dict()¶ Returns a dictionary of the essential fields of this Genome.
-
transcript_by_id(transcript_id)¶ Construct Transcript object with given transcript ID
-
transcript_by_protein_id(protein_id)¶
-
transcript_id_of_protein_id(protein_id)¶ What is the transcript ID associated with a given protein ID?
-
transcript_ids(contig=None, strand=None)¶
-
transcript_ids_at_locus(contig, position, end=None, strand=None)¶
-
transcript_ids_of_exon_id(exon_id)¶
-
transcript_ids_of_gene_id(gene_id)¶
-
transcript_ids_of_gene_name(gene_name)¶
-
transcript_ids_of_transcript_name(transcript_name)¶
-
transcript_name_of_transcript_id(transcript_id)¶
-
transcript_names(contig=None, strand=None)¶ What are all the transcript names in the database (optionally, restrict to a given chromosome and/or strand)
-
transcript_names_at_locus(contig, position, end=None, strand=None)¶
-
transcript_names_of_gene_name(gene_name)¶
-
transcript_sequence(transcript_id)¶ Return cDNA nucleotide sequence of transcript, or None if transcript doesn’t have cDNA sequence.
-
transcript_sequences¶
-
transcripts(contig=None, strand=None)¶ Construct Transcript object for every transcript entry in the database. Optionally restrict to a particular chromosome using the contig argument.
-
transcripts_at_locus(contig, position, end=None, strand=None)¶
-
transcripts_by_name(transcript_name)¶
-
pyensembl.gtf module¶
-
class
pyensembl.gtf.GTF(gtf_path, cache_directory_path=None)¶ Bases:
object
Parse a GTF gene annotation file from a given local path. Represent its contents as a Pandas DataFrame (optionally filtered by locus, column, contig, etc.).
-
clear_cache()¶
-
data_subset_path(contig=None, feature=None, column=None, strand=None, distinct=False, extension='.csv')¶ Path to cached file for storing materialized views of the genomic data. Typically this is a CSV file, the filename reflects which filters have been applied to the entries of the database.
Parameters:
- contig : str, optional
- Path for subset of data restricted to given contig
- feature : str, optional
- Path for subset of data restricted to given feature
- column : str, optional
- Restrict to single column
- strand : str, optional
- Positive (“+”) or negative (“-”) DNA strand. Default = either.
- distinct : bool, optional
- Only keep unique values (default=False)
-
dataframe(contig=None, feature=None, strand=None, save_to_disk=False)¶ Load genome entries as a DataFrame, optionally restricted to particular contig or feature type.
-
dataframe_at_locus(contig, start, end=None, offset=None, strand=None)¶ Subset of entries which overlap an inclusive range of chromosomal positions
-
dataframe_column_at_locus(column_name, contig, start, end=None, offset=None, strand=None)¶ Subset of entries which overlap an inclusive range of loci
-
pyensembl.locus module¶
-
class
pyensembl.locus.Locus(contig, start, end, strand)¶ Bases:
serializable.serializable.Serializable
Base class for any entity which can be localized at a range of positions on a particular strand of a chromosome/contig.
-
can_overlap(contig, strand=None)¶ Is this locus on the same contig and (optionally) on the same strand?
-
contains(contig, start, end, strand=None)¶
-
contains_locus(other_locus)¶
-
distance_to_interval(start, end)¶ Find the distance between intervals [start1, end1] and [start2, end2]. If the intervals overlap then the distance is 0.
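The distance between two inclusive intervals can be sketched as a standalone function. The name `distance_between_intervals` is hypothetical; in the documented method one of the two intervals is the locus itself.

```python
def distance_between_intervals(start1, end1, start2, end2):
    """Gap between [start1, end1] and [start2, end2]; 0 if they overlap."""
    if end1 < start2:
        # First interval lies entirely before the second.
        return start2 - end1
    if end2 < start1:
        # Second interval lies entirely before the first.
        return start1 - end2
    return 0


gap_after = distance_between_intervals(10, 20, 25, 30)
gap_before = distance_between_intervals(10, 20, 1, 4)
overlapping = distance_between_intervals(10, 20, 15, 30)
```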
-
distance_to_locus(other)¶
-
length¶
-
offset(position)¶ Offset of given position from stranded start of this locus.
For example, if a Locus goes from 10..20 and is on the negative strand, then the offset of position 13 is 7, whereas if the Locus is on the positive strand, then the offset is 3.
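The worked example above can be checked with a short sketch. `stranded_offset` is a hypothetical standalone function, and raising ValueError for positions outside the locus is an assumption of this example.

```python
def stranded_offset(locus_start, locus_end, strand, position):
    """Offset of position from the strand-aware start of an inclusive locus."""
    if position < locus_start or position > locus_end:
        raise ValueError(
            "Position %d outside of locus %d..%d"
            % (position, locus_start, locus_end))
    if strand == "+":
        # Forward strand: count up from the smallest coordinate.
        return position - locus_start
    # Negative strand: the locus "starts" at its largest coordinate.
    return locus_end - position


negative_offset = stranded_offset(10, 20, "-", 13)
positive_offset = stranded_offset(10, 20, "+", 13)
```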
-
offset_range(start, end)¶ Database start/end entries are always ordered such that start < end. This makes computing a relative position (e.g. of a stop codon relative to its transcript) complicated since the “end” position of a backwards locus is actually earlier on the strand. This function correctly selects a start vs. end value depending on this locus’s strand and determines that position’s offset from the earliest position in this locus.
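A sketch of the strand-aware range conversion, assuming inclusive coordinates and a sub-range that lies entirely inside the locus. `stranded_offset_range` is a hypothetical standalone name.

```python
def stranded_offset_range(locus_start, locus_end, strand, start, end):
    """Convert an absolute range inside a locus to strand-relative offsets."""
    if start > end:
        raise ValueError("Expected start <= end")
    if start < locus_start or end > locus_end:
        raise ValueError("Range %d..%d is not inside the locus" % (start, end))
    if strand == "+":
        return (start - locus_start, end - locus_start)
    # On the negative strand the absolute end is the earlier position, so it
    # determines the first offset of the returned pair.
    return (locus_end - end, locus_end - start)


forward = stranded_offset_range(10, 20, "+", 12, 15)
backward = stranded_offset_range(10, 20, "-", 12, 15)
```

In both cases the returned pair is ordered from the smaller offset to the larger one, and its length matches the length of the input range.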
-
on_backward_strand¶
-
on_contig(contig)¶
-
on_forward_strand¶
-
on_negative_strand¶
-
on_positive_strand¶
-
on_strand(strand)¶
-
overlaps(contig, start, end, strand=None)¶ Does this locus overlap with a given range of positions?
Since locus position ranges are inclusive, we should make sure that e.g. chr1:10-10 overlaps with chr1:10-10
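With inclusive ranges, the overlap test needs non-strict comparisons on both sides. A minimal sketch, using the hypothetical function name `ranges_overlap` and ignoring contig and strand:

```python
def ranges_overlap(start1, end1, start2, end2):
    """True if inclusive ranges [start1, end1] and [start2, end2] intersect."""
    # Using <= rather than < is what makes 10-10 overlap with 10-10.
    return start1 <= end2 and start2 <= end1


single_base = ranges_overlap(10, 10, 10, 10)
shared_endpoint = ranges_overlap(10, 20, 20, 30)
adjacent = ranges_overlap(10, 20, 21, 30)
```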
-
overlaps_locus(other_locus)¶
-
to_dict()¶
-
pyensembl.memory_cache module¶
Caching and serializing the results of expensive computations. Used in pyensembl primarily to cache the heavy-weight parsing of GTF files and various filtering operations on Ensembl entries.
A piece of data is returned from one of three sources:
1) Cache cold. Run the user-supplied compute_fn.
2) Cache warm on disk. Parse or unpickle the serialized result into memory.
3) Cache warm in memory. Return cached object.
-
class
pyensembl.memory_cache.MemoryCache¶ Bases:
object
In-memory and on-disk caching of long-running queries and computations.
-
cached_dataframe(csv_path, compute_fn)¶ If a CSV path is in the _memory_cache, then return that cached value.
If we’ve already saved the DataFrame as a CSV then load it.
Otherwise run the provided compute_fn, store its result in memory, and save it as a CSV.
-
cached_object(path, compute_fn)¶ If cached_object has already been called for a value of path in this running Python instance, then it should have a cached value in the _memory_cache; return that value.
If this function was never called before with a particular value of path, then call compute_fn and pickle its result to path.
If path already exists, unpickle it and store that value in _memory_cache.
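The three cases (cold, warm on disk, warm in memory) can be sketched with the standard pickle module. `TinyCache` and `expensive` are hypothetical names; this is an illustration of the lookup order described above, not the MemoryCache source.

```python
import os
import pickle
import tempfile


class TinyCache(object):
    def __init__(self):
        self._memory_cache = {}

    def cached_object(self, path, compute_fn):
        # 1) Warm in memory: return the object already held by this instance.
        if path in self._memory_cache:
            return self._memory_cache[path]
        if os.path.exists(path):
            # 2) Warm on disk: unpickle the serialized result.
            with open(path, "rb") as f:
                obj = pickle.load(f)
        else:
            # 3) Cold: compute the value and pickle it for later runs.
            obj = compute_fn()
            with open(path, "wb") as f:
                pickle.dump(obj, f)
        self._memory_cache[path] = obj
        return obj


compute_calls = []


def expensive():
    compute_calls.append(1)
    return {"answer": 42}


path = os.path.join(tempfile.mkdtemp(), "result.pkl")
cache = TinyCache()
first = cache.cached_object(path, expensive)         # cold
second = cache.cached_object(path, expensive)        # warm in memory
third = TinyCache().cached_object(path, expensive)   # warm on disk
```

A fresh `TinyCache` instance stands in for a new Python process here: its memory cache is empty, so the value comes from the pickle file without calling `expensive` again.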
-
clear_cached_objects()¶
-
delete_file(path)¶
-
is_empty(filename)¶
-
remove_from_cache(key)¶
-
pyensembl.search module¶
Helper functions for searching over collections of PyEnsembl objects
-
pyensembl.search.find_nearest_locus(start, end, loci)¶ Finds nearest locus (object with method distance_to_interval) to the interval defined by the given start and end positions. Returns the distance to that locus, along with the locus object itself.
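A nearest-locus search only needs objects that expose distance_to_interval. The sketch below uses a hypothetical `Interval` class as a stand-in for Locus and a linear scan; it makes no claim about how ties are broken in pyensembl.

```python
class Interval(object):
    """Hypothetical stand-in for a Locus: an inclusive [start, end] range."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def distance_to_interval(self, start, end):
        if self.start > end:
            return self.start - end
        if self.end < start:
            return start - self.end
        return 0


def find_nearest(start, end, loci):
    """Return (distance, locus) for the locus closest to [start, end]."""
    best_distance = float("inf")
    best_locus = None
    for locus in loci:
        distance = locus.distance_to_interval(start, end)
        if distance < best_distance:
            best_distance, best_locus = distance, locus
    return best_distance, best_locus


loci = [Interval(1, 5), Interval(40, 50), Interval(100, 120)]
distance, nearest = find_nearest(20, 30, loci)
```

In this sketch an empty collection yields an infinite distance and None for the locus.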
pyensembl.sequence_data module¶
-
class
pyensembl.sequence_data.SequenceData(fasta_path, require_ensembl_ids=False, cache_directory_path=None)¶ Bases:
object
Container for reference nucleotide and amino acid sequences.
-
clear_cache()¶
-
fasta_dictionary¶
-
get(sequence_id)¶ Get sequence associated with given ID or return None if missing
-
index(overwrite=False)¶
-
pyensembl.shell module¶
Manipulate pyensembl’s local cache.
%(prog)s {install, delete, delete-sequence-cache} [--release XXX --species human...]
- To install particular Ensembl human release(s):
- %(prog)s install --release 75 77
- To install particular Ensembl mouse release(s):
- %(prog)s install --release 75 77 --species mouse
- To delete all downloaded and cached data for a particular Ensembl release:
- %(prog)s delete-all-files --release 75 --species human
- To delete only cached data related to transcript and protein sequences:
- %(prog)s delete-index-files --release 75
- To install any genome:
- %(prog)s install --reference-name "GRCh38" --gtf URL_OR_PATH --transcript-fasta URL_OR_PATH --protein-fasta URL_OR_PATH
-
pyensembl.shell.run()¶
pyensembl.species module¶
-
class
pyensembl.species.Species(latin_name, synonyms=[], reference_assemblies={})¶ Bases:
serializable.serializable.Serializable
Container for combined information about a species name, its synonym names, and which reference to use for this species in each Ensembl release.
-
classmethod
from_dict(state_dict)¶
-
classmethod
register(latin_name, synonyms, reference_assemblies)¶ Create a Species object from the given arguments and enter into all the dicts used to look the species up by its fields.
-
to_dict()¶
-
which_reference(ensembl_release)¶
-
classmethod
-
pyensembl.species.check_species_object(species_name_or_object)¶ Helper for validating user supplied species names or objects.
-
pyensembl.species.find_species_by_name(species_name)¶
-
pyensembl.species.find_species_by_reference(reference_name)¶
-
pyensembl.species.max_ensembl_release(reference_name)¶
-
pyensembl.species.normalize_reference_name(name)¶ Search the dictionary of species-specific references to find a reference name that matches aside from capitalization.
If no matching reference is found, raise an exception.
-
pyensembl.species.normalize_species_name(name)¶ If species name was “Homo sapiens” then replace spaces with underscores and return “homo_sapiens”. Also replace common names like “human” with “homo_sapiens”.
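The normalization described above can be sketched with a small synonym table. The `SYNONYMS` dictionary here is a hypothetical stand-in for the lookup tables that Species.register fills in, and lower-casing the input is an assumption based on the “Homo sapiens” example.

```python
# Hypothetical synonym table mapping common names to Latin names.
SYNONYMS = {
    "human": "homo_sapiens",
    "mouse": "mus_musculus",
}


def normalize_species_name(name):
    """Lower-case the name, replace spaces with underscores, map synonyms."""
    normalized = name.strip().lower().replace(" ", "_")
    return SYNONYMS.get(normalized, normalized)


from_latin = normalize_species_name("Homo sapiens")
from_common = normalize_species_name("human")
```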
-
pyensembl.species.which_reference(species_name, ensembl_release)¶
pyensembl.transcript module¶
-
class
pyensembl.transcript.Transcript(transcript_id, transcript_name, contig, start, end, strand, biotype, gene_id, genome)¶ Bases:
pyensembl.locus_with_genome.LocusWithGenome
Transcript encompasses the locus, exons, and sequence of a transcript.
Lazily fetches sequence in case we’re constructing many Transcripts and not using the sequence, avoiding the memory/performance overhead of fetching and storing sequences from a FASTA file.
-
coding_sequence¶ cDNA coding sequence (from start codon to stop codon, without any introns)
-
coding_sequence_position_ranges¶ Return absolute chromosome position ranges for CDS fragments of this transcript
-
complete¶ Consider a transcript complete if it has start and stop codons and a coding sequence whose length is divisible by 3
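The completeness criterion reduces to three checks. A minimal sketch with a hypothetical standalone function, where the coding sequence is passed as a string and a missing sequence is represented by None:

```python
def is_complete(has_start_codon, has_stop_codon, coding_sequence):
    """True if both codons are annotated and the CDS length is a multiple of 3."""
    return (
        has_start_codon
        and has_stop_codon
        and coding_sequence is not None
        and len(coding_sequence) % 3 == 0
    )


full_length = is_complete(True, True, "ATGAAATAA")   # 9 nucleotides
truncated = is_complete(True, True, "ATGAAATA")      # 8 nucleotides
```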
-
contains_start_codon¶ Does this transcript have an annotated start_codon entry?
-
contains_stop_codon¶ Does this transcript have an annotated stop_codon entry?
-
exon_intervals¶ List of (start,end) tuples for each exon of this transcript, in the order specified by the ‘exon_number’ column of the exon table.
-
exons¶
-
first_start_codon_spliced_offset¶ Offset of first nucleotide in start codon into the spliced mRNA (excluding introns)
-
five_prime_utr_sequence¶ cDNA sequence of 5’ UTR (untranslated region at the beginning of the transcript)
-
gene¶
-
id¶ Alias for transcript_id necessary for backward compatibility.
-
last_stop_codon_spliced_offset¶ Offset of last nucleotide in stop codon into the spliced mRNA (excluding introns)
-
name¶ Alias for transcript_name necessary for backward compatibility.
-
protein_id¶
-
protein_sequence¶
-
sequence¶ Spliced cDNA sequence of transcript (includes 5’ UTR, coding sequence, and 3’ UTR)
-
spliced_offset(position)¶ Convert from an absolute chromosomal position to the offset into this transcript’s spliced mRNA.
Position must be inside some exon (otherwise raise exception).
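Converting a chromosomal position to a spliced offset amounts to summing the lengths of the exons that precede it in transcript order. The sketch below assumes inclusive (start, end) exon intervals already listed in transcript order (so in descending coordinate order on the negative strand); the standalone function and the ValueError are assumptions of this example.

```python
def spliced_offset(exon_intervals, strand, position):
    """Offset of a chromosomal position within the spliced transcript."""
    offset = 0
    for start, end in exon_intervals:
        if start <= position <= end:
            if strand == "+":
                return offset + (position - start)
            # Negative strand: each exon is read from its largest coordinate.
            return offset + (end - position)
        # Position not in this exon: skip over its full length.
        offset += end - start + 1
    raise ValueError("Position %d is not inside any exon" % position)


forward_exons = [(100, 109), (200, 209)]
reverse_exons = [(200, 209), (100, 109)]

second_exon_forward = spliced_offset(forward_exons, "+", 205)
second_exon_reverse = spliced_offset(reverse_exons, "-", 100)
```

Intronic positions have no place in the spliced mRNA, which is why the sketch raises an error instead of returning a nearby offset.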
-
start_codon_positions¶ Chromosomal positions of nucleotides in start codon.
-
start_codon_spliced_offsets¶ Offsets from start of spliced mRNA transcript of nucleotides in start codon.
-
start_codon_unspliced_offsets¶ Offsets from start of unspliced pre-mRNA transcript of nucleotides in start codon.
-
stop_codon_positions¶ Chromosomal positions of nucleotides in stop codon.
-
stop_codon_spliced_offsets¶ Offsets from start of spliced mRNA transcript of nucleotides in stop codon.
-
stop_codon_unspliced_offsets¶ Offsets from start of unspliced pre-mRNA transcript of nucleotides in stop codon.
-
three_prime_utr_sequence¶ cDNA sequence of 3’ UTR (untranslated region at the end of the transcript)
-
to_dict()¶
-
Module contents¶
-
class
pyensembl.MemoryCache¶ Bases:
object
In-memory and on-disk caching of long-running queries and computations.
-
cached_dataframe(csv_path, compute_fn)¶ If a CSV path is in the _memory_cache, then return that cached value.
If we’ve already saved the DataFrame as a CSV then load it.
Otherwise run the provided compute_fn, store its result in memory, and save it as a CSV.
-
cached_object(path, compute_fn)¶ If cached_object has already been called for a value of path in this running Python instance, then it should have a cached value in the _memory_cache; return that value.
If this function was never called before with a particular value of path, then call compute_fn and pickle its result to path.
If path already exists, unpickle it and store that value in _memory_cache.
-
clear_cached_objects()¶
-
delete_file(path)¶
-
is_empty(filename)¶
-
remove_from_cache(key)¶
-
-
class
pyensembl.DownloadCache(reference_name, annotation_name, annotation_version=None, decompress_on_download=False, copy_local_files_to_cache=False, install_string_function=None, cache_directory_path=None)¶ Bases:
object
Downloads remote files to the cache, optionally copies local files into the cache, and raises a custom message if data is missing.
-
cache_directory_path¶
-
cached_path(path_or_url)¶ When downloading remote files, the default behavior is to name local files the same as their remote counterparts.
-
delete_cache_directory()¶
-
delete_cached_files(prefixes=[], suffixes=[])¶ Deletes any cached files matching the prefixes or suffixes given
-
download_or_copy_if_necessary(path_or_url, download_if_missing=False, overwrite=False)¶ Get the local path to a possibly remote file, downloading or copying it into the cache first if necessary.
Download if file is missing from the cache directory and download_if_missing is True. Download even if local file exists if both download_if_missing and overwrite are True.
If the file is on the local file system then return its path, unless self.copy_local_to_cache is True, and then copy it to the cache first.
- path_or_url : str
- download_if_missing : bool, optional
- Download files if missing from local cache
- overwrite : bool, optional
- Overwrite existing copy if it exists
-
is_url_format(path_or_url)¶
-
local_path_or_install_error(field_name, path_or_url, download_if_missing=False, overwrite=False)¶
-
-
class
pyensembl.EnsemblRelease(release=87, species=Species(latin_name='homo_sapiens', synonyms=['human'], reference_assemblies={'GRCh38': (76, 87), 'GRCh37': (55, 75), 'NCBI36': (54, 54)}), server='ftp://ftp.ensembl.org')¶ Bases:
pyensembl.genome.Genome
Bundles together the genomic annotation and sequence data associated with a particular release of the Ensembl database.
-
classmethod
cached(release=87, species=Species(latin_name='homo_sapiens', synonyms=['human'], reference_assemblies={'GRCh38': (76, 87), 'GRCh37': (55, 75), 'NCBI36': (54, 54)}), server='ftp://ftp.ensembl.org')¶ Construct EnsemblRelease if it’s never been made before, otherwise return an old instance.
-
classmethod
from_dict(state_dict)¶ Deserialize EnsemblRelease without creating duplicate instances.
-
install_string()¶
-
classmethod
normalize_init_values(release, species, server)¶ Normalizes the arguments which uniquely specify an EnsemblRelease genome.
-
to_dict()¶
-
classmethod
-
pyensembl.cached_release(release, species='human')¶ Create an EnsemblRelease instance only if it hasn’t already been made, otherwise return the old instance.
This function is kept for backwards compatibility, but its functionality has been moved into the cached method of EnsemblRelease.
-
class
pyensembl.Gene(gene_id, gene_name, contig, start, end, strand, biotype, genome)¶ Bases:
pyensembl.locus_with_genome.LocusWithGenome
-
exons¶
-
id¶ Alias for gene_id necessary for backwards compatibility.
-
name¶ Alias for gene_name necessary for backwards compatibility.
-
to_dict()¶
-
transcripts¶ Property which dynamically constructs transcript objects for all transcript IDs associated with this gene.
-
-
class
pyensembl.Transcript(transcript_id, transcript_name, contig, start, end, strand, biotype, gene_id, genome)¶ Bases:
pyensembl.locus_with_genome.LocusWithGenome
Transcript encompasses the locus, exons, and sequence of a transcript.
Lazily fetches sequence in case we’re constructing many Transcripts and not using the sequence, avoiding the memory/performance overhead of fetching and storing sequences from a FASTA file.
-
coding_sequence¶ cDNA coding sequence (from start codon to stop codon, without any introns)
-
coding_sequence_position_ranges¶ Return absolute chromosome position ranges for CDS fragments of this transcript
-
complete¶ Consider a transcript complete if it has start and stop codons and a coding sequence whose length is divisible by 3
-
contains_start_codon¶ Does this transcript have an annotated start_codon entry?
-
contains_stop_codon¶ Does this transcript have an annotated stop_codon entry?
-
exon_intervals¶ List of (start,end) tuples for each exon of this transcript, in the order specified by the ‘exon_number’ column of the exon table.
-
exons¶
-
first_start_codon_spliced_offset¶ Offset of first nucleotide in start codon into the spliced mRNA (excluding introns)
-
five_prime_utr_sequence¶ cDNA sequence of 5’ UTR (untranslated region at the beginning of the transcript)
-
gene¶
-
id¶ Alias for transcript_id necessary for backward compatibility.
-
last_stop_codon_spliced_offset¶ Offset of last nucleotide in stop codon into the spliced mRNA (excluding introns)
-
name¶ Alias for transcript_name necessary for backward compatibility.
-
protein_id¶
-
protein_sequence¶
-
sequence¶ Spliced cDNA sequence of transcript (includes 5’ UTR, coding sequence, and 3’ UTR)
-
spliced_offset(position)¶ Convert from an absolute chromosomal position to the offset into this transcript’s spliced mRNA.
Position must be inside some exon (otherwise raise exception).
-
start_codon_positions¶ Chromosomal positions of nucleotides in start codon.
-
start_codon_spliced_offsets¶ Offsets from start of spliced mRNA transcript of nucleotides in start codon.
-
start_codon_unspliced_offsets¶ Offsets from start of unspliced pre-mRNA transcript of nucleotides in start codon.
-
stop_codon_positions¶ Chromosomal positions of nucleotides in stop codon.
-
stop_codon_spliced_offsets¶ Offsets from start of spliced mRNA transcript of nucleotides in stop codon.
-
stop_codon_unspliced_offsets¶ Offsets from start of unspliced pre-mRNA transcript of nucleotides in stop codon.
-
three_prime_utr_sequence¶ cDNA sequence of 3’ UTR (untranslated region at the end of the transcript)
-
to_dict()¶
-
-
class
pyensembl.Exon(exon_id, contig, start, end, strand, gene_name, gene_id)¶ Bases:
pyensembl.locus.Locus
-
to_dict()¶
-
-
class
pyensembl.SequenceData(fasta_path, require_ensembl_ids=False, cache_directory_path=None)¶ Bases:
object
Container for reference nucleotide and amino acid sequences.
-
clear_cache()¶
-
fasta_dictionary¶
-
get(sequence_id)¶ Get sequence associated with given ID or return None if missing
-
index(overwrite=False)¶
-
-
pyensembl.find_nearest_locus(start, end, loci)¶ Finds nearest locus (object with method distance_to_interval) to the interval defined by the given start and end positions. Returns the distance to that locus, along with the locus object itself.
-
pyensembl.find_species_by_name(species_name)¶
-
pyensembl.find_species_by_reference(reference_name)¶
-
pyensembl.which_reference(species_name, ensembl_release)¶
-
pyensembl.check_species_object(species_name_or_object)¶ Helper for validating user supplied species names or objects.
-
pyensembl.normalize_reference_name(name)¶ Search the dictionary of species-specific references to find a reference name that matches aside from capitalization.
If no matching reference is found, raise an exception.
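The case-insensitive lookup can be illustrated as follows. This is a sketch under stated assumptions: the list `KNOWN_REFERENCES`, the function name, and the choice of `ValueError` are illustrative and are not taken from pyensembl, which searches its own dictionary of species-specific references.

```python
# Illustrative list only; the library keeps its own table of references.
KNOWN_REFERENCES = ["GRCh37", "GRCh38", "GRCm38"]


def normalize_reference_name_sketch(name):
    """Return the canonical spelling of a reference name, ignoring case."""
    lowered = name.strip().lower()
    for reference in KNOWN_REFERENCES:
        if reference.lower() == lowered:
            return reference
    # Exception type is an assumption; the text only says "an exception".
    raise ValueError("Reference genome '%s' not found" % name)


print(normalize_reference_name_sketch("grch38"))  # GRCh38
```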
-
pyensembl.normalize_species_name(name)¶ If species name was “Homo sapiens” then replace spaces with underscores and return “homo_sapiens”. Also replace common names like “human” with “homo_sapiens”.
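The two rules above (spaces to underscores, common names to Latin names) might look like this in plain Python. The table `COMMON_NAMES` and the function name are hypothetical, and lowercasing is inferred from the “Homo sapiens” example rather than from the library source.

```python
# Hypothetical mapping of common names to Latin names, for illustration.
COMMON_NAMES = {"human": "homo_sapiens", "mouse": "mus_musculus"}


def normalize_species_name_sketch(name):
    """Lowercase, replace spaces with underscores, resolve common names."""
    normalized = name.strip().lower().replace(" ", "_")
    return COMMON_NAMES.get(normalized, normalized)


print(normalize_species_name_sketch("Homo sapiens"))  # homo_sapiens
print(normalize_species_name_sketch("human"))         # homo_sapiens
```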
-
class
pyensembl.Genome(reference_name, annotation_name, annotation_version=None, gtf_path_or_url=None, transcript_fasta_path_or_url=None, protein_fasta_path_or_url=None, decompress_on_download=False, copy_local_files_to_cache=False, require_ensembl_ids=True, cache_directory_path=None)¶ Bases:
serializable.serializable.Serializable
Bundles together the genomic annotation and sequence data associated with a particular genomic database source (e.g. a single Ensembl release) and provides a wide variety of helper methods for accessing this data.
-
clear_cache()¶ Clear any in-memory cached values and short-lived on-disk materializations from MemoryCache
-
contigs()¶ Returns all contig names for any gene in the genome (field called “seqname” in Ensembl GTF files)
-
db¶
-
delete_index_files()¶ Delete all data aside from source GTF and FASTA files
-
download(overwrite=False)¶ Download data files needed by this Genome instance.
- overwrite : bool, optional
- Download files regardless of whether a local copy already exists.
-
exon_by_id(exon_id)¶ Construct an Exon object from its ID by looking up the exon’s properties in the given Database.
-
exon_ids(contig=None, strand=None)¶
-
exon_ids_at_locus(contig, position, end=None, strand=None)¶
-
exon_ids_of_gene_id(gene_id)¶
-
exon_ids_of_gene_name(gene_name)¶
-
exon_ids_of_transcript_id(transcript_id)¶
-
exon_ids_of_transcript_name(transcript_name)¶
-
exons(contig=None, strand=None)¶ Create an Exon object for every exon in the database, optionally restricted to a particular chromosome using the contig argument.
-
exons_at_locus(contig, position, end=None, strand=None)¶
-
gene_by_id(gene_id)¶ Construct a Gene object for the given gene ID.
-
gene_by_protein_id(protein_id)¶ Get the gene ID associated with the given protein ID, return its Gene object
-
gene_id_of_protein_id(protein_id)¶ What is the gene ID associated with a given protein ID?
-
gene_ids(contig=None, strand=None)¶ What are all the gene IDs (optionally restrict to a given chromosome/contig and/or strand)
-
gene_ids_at_locus(contig, position, end=None, strand=None)¶
-
gene_ids_of_gene_name(gene_name)¶ What are the gene IDs associated with a given gene name? (due to copy events, there might be multiple genes per name)
-
gene_name_of_exon_id(exon_id)¶
-
gene_name_of_gene_id(gene_id)¶
-
gene_name_of_transcript_id(transcript_id)¶
-
gene_name_of_transcript_name(transcript_name)¶
-
gene_names(contig=None, strand=None)¶ Return all gene names in the database, optionally restricted to a chromosome and/or strand.
-
gene_names_at_locus(contig, position, end=None, strand=None)¶
-
genes(contig=None, strand=None)¶ Returns all Gene objects in the database. Can be restricted to a particular contig/chromosome and strand by the following arguments:
- contig : str
- Only return genes on the given contig.
- strand : str
- Only return genes on this strand.
-
genes_at_locus(contig, position, end=None, strand=None)¶
-
genes_by_name(gene_name)¶ Get all the unique genes with the given name (there might be multiple due to copies in the genome) and return a list containing a Gene object for each distinct ID.
-
gtf¶
-
index(overwrite=False)¶ Assuming that all necessary data for this Genome has been downloaded, generate the GTF database and save efficient representation of FASTA sequence files.
-
install_string()¶ Add every missing file to the install string shown to the user in an error message.
-
loci_of_gene_names(gene_name)¶ Given a gene name, returns a list of Locus objects with fields: chromosome, start, stop, strand.
You can get multiple results since a gene might have multiple copies in the genome.
-
locus_of_exon_id(exon_id)¶ Given an exon ID returns Locus
-
locus_of_gene_id(gene_id)¶ Given a gene ID returns Locus with: chromosome, start, stop, strand
-
locus_of_transcript_id(transcript_id)¶
-
protein_ids(contig=None, strand=None)¶ What are all the protein IDs (optionally restrict to a given chromosome and/or strand)
-
protein_ids_at_locus(contig, position, end=None, strand=None)¶
-
protein_sequence(protein_id)¶ Return amino acid sequence of the protein with the given ID, or None if no sequence is available for that protein.
-
protein_sequences¶
-
to_dict()¶ Returns a dictionary of the essential fields of this Genome.
-
transcript_by_id(transcript_id)¶ Construct Transcript object with given transcript ID
-
transcript_by_protein_id(protein_id)¶
-
transcript_id_of_protein_id(protein_id)¶ What is the transcript ID associated with a given protein ID?
-
transcript_ids(contig=None, strand=None)¶
-
transcript_ids_at_locus(contig, position, end=None, strand=None)¶
-
transcript_ids_of_exon_id(exon_id)¶
-
transcript_ids_of_gene_id(gene_id)¶
-
transcript_ids_of_gene_name(gene_name)¶
-
transcript_ids_of_transcript_name(transcript_name)¶
-
transcript_name_of_transcript_id(transcript_id)¶
-
transcript_names(contig=None, strand=None)¶ What are all the transcript names in the database (optionally, restrict to a given chromosome and/or strand)
-
transcript_names_at_locus(contig, position, end=None, strand=None)¶
-
transcript_names_of_gene_name(gene_name)¶
-
transcript_sequence(transcript_id)¶ Return cDNA nucleotide sequence of transcript, or None if transcript doesn’t have cDNA sequence.
-
transcript_sequences¶
-
transcripts(contig=None, strand=None)¶ Construct Transcript object for every transcript entry in the database. Optionally restrict to a particular chromosome using the contig argument.
-
transcripts_at_locus(contig, position, end=None, strand=None)¶
-
transcripts_by_name(transcript_name)¶
-
-
class
pyensembl.GTF(gtf_path, cache_directory_path=None)¶ Bases:
object
Parse a GTF gene annotation file from a given local path. Represent its contents as a Pandas DataFrame (optionally filtered by locus, column, contig, &c).
-
clear_cache()¶
-
data_subset_path(contig=None, feature=None, column=None, strand=None, distinct=False, extension='.csv')¶ Path to cached file for storing materialized views of the genomic data. Typically this is a CSV file, the filename reflects which filters have been applied to the entries of the database.
Parameters:
- contig : str, optional
- Path for subset of data restricted to given contig
- feature : str, optional
- Path for subset of data restricted to given feature
- column : str, optional
- Restrict to single column
- strand : str, optional
- Positive (“+”) or negative (“-”) DNA strand. Default = either.
- distinct : bool, optional
- Only keep unique values (default=False)
-
dataframe(contig=None, feature=None, strand=None, save_to_disk=False)¶ Load genome entries as a DataFrame, optionally restricted to particular contig or feature type.
-
dataframe_at_locus(contig, start, end=None, offset=None, strand=None)¶ Subset of entries which overlap an inclusive range of chromosomal positions
-
dataframe_column_at_locus(column_name, contig, start, end=None, offset=None, strand=None)¶ Subset of entries which overlap an inclusive range of loci
-
-
class
pyensembl.Locus(contig, start, end, strand)¶ Bases:
serializable.serializable.Serializable
Base class for any entity which can be localized at a range of positions on a particular strand of a chromosome/contig.
-
can_overlap(contig, strand=None)¶ Is this locus on the same contig and (optionally) on the same strand?
-
contains(contig, start, end, strand=None)¶
-
contains_locus(other_locus)¶
-
distance_to_interval(start, end)¶ Find the distance between intervals [start1, end1] and [start2, end2]. If the intervals overlap then the distance is 0.
-
distance_to_locus(other)¶
-
length¶
-
offset(position)¶ Offset of given position from stranded start of this locus.
For example, if a Locus goes from 10..20 and is on the negative strand, then the offset of position 13 is 7, whereas if the Locus is on the positive strand, then the offset is 3.
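The worked example above can be checked with a few lines of Python. This is a sketch of the arithmetic only; the function name and its flat argument list are assumptions for illustration and do not mirror the `Locus` class.

```python
def stranded_offset_sketch(locus_start, locus_end, strand, position):
    """Offset of position from the stranded start of an inclusive locus."""
    if not locus_start <= position <= locus_end:
        raise ValueError("Position %d is outside the locus" % position)
    if strand == "+":
        # Forward strand: the locus begins at its smallest coordinate.
        return position - locus_start
    # Negative strand: the locus begins at its largest coordinate.
    return locus_end - position


print(stranded_offset_sketch(10, 20, "-", 13))  # 7
print(stranded_offset_sketch(10, 20, "+", 13))  # 3
```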
-
offset_range(start, end)¶ Database start/end entries are always ordered such that start < end. This makes computing a relative position (e.g. of a stop codon relative to its transcript) complicated, since the “end” position of a backwards locus is actually earlier on the strand. This function selects the start or end value according to this locus’s strand and determines that position’s offset from the earliest position in this locus.
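One way to picture the strand handling is the sketch below, which returns the stranded offsets of both ends of a sub-range. The function name, the tuple return value, and the `ValueError` on out-of-bounds input are assumptions made for the example rather than documented behaviour.

```python
def offset_range_sketch(locus_start, locus_end, strand, start, end):
    """Stranded (first, last) offsets of [start, end] inside a locus.

    All coordinates are inclusive and start <= end, as in the database.
    """
    if start > end:
        raise ValueError("Expected start <= end")
    if start < locus_start or end > locus_end:
        raise ValueError("Range is not contained in the locus")
    if strand == "+":
        return (start - locus_start, end - locus_start)
    # On the negative strand the database "end" is the earlier position.
    return (locus_end - end, locus_end - start)


print(offset_range_sketch(10, 20, "+", 12, 14))  # (2, 4)
print(offset_range_sketch(10, 20, "-", 12, 14))  # (6, 8)
```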
-
on_backward_strand¶
-
on_contig(contig)¶
-
on_forward_strand¶
-
on_negative_strand¶
-
on_positive_strand¶
-
on_strand(strand)¶
-
overlaps(contig, start, end, strand=None)¶ Does this locus overlap with a given range of positions?
Since locus position ranges are inclusive, we should make sure that e.g. chr1:10-10 overlaps with chr1:10-10
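The inclusive comparison can be sketched as follows. The tuple representation of a locus and the function name are hypothetical, and the optional strand check is left out to keep the example short.

```python
def overlaps_sketch(locus, contig, start, end):
    """locus is a (contig, start, end) tuple with inclusive coordinates."""
    locus_contig, locus_start, locus_end = locus
    if locus_contig != contig:
        return False
    # Inclusive ranges overlap when each starts no later than the other ends.
    return locus_start <= end and start <= locus_end


print(overlaps_sketch(("chr1", 10, 10), "chr1", 10, 10))  # True
print(overlaps_sketch(("chr1", 10, 20), "chr1", 21, 30))  # False
```

Using `<=` rather than `<` in both comparisons is what makes the single-base case chr1:10-10 overlap itself.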
-
overlaps_locus(other_locus)¶
-
to_dict()¶
-
-