# Example solutions to the problem of mapping gene and protein names to their preferred names

Installing Biopython into the notebook environment...

In [1]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.83-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m21.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: biopython
Successfully installed biopython-1.83


## Mapping gene names

First, we define a function for retrieving dictionary-like annotation records for a given list of gene IDs.

In [2]:
import sys

from Bio import Entrez

# *Always* tell NCBI who you are
Entrez.email = "novacek@fi.muni.cz"

def retrieve_annotation(id_list):

    """Annotates Entrez Gene IDs using Bio.Entrez, in particular epost (to
    submit the data to NCBI) and esummary to retrieve the information.
    Returns a nested dictionary-like structure with the annotations."""

    # posting a request to the NCBI gene DB via Entrez, based on a list of
    # gene IDs in the DB
    request = Entrez.epost("gene", id=",".join(id_list))
    try:
        result = Entrez.read(request)
    except RuntimeError as e:
        print("An error occurred while retrieving the annotations.")
        print("The error returned was %s" % e)
        sys.exit(-1)

    # contextual information for the web service Entrez API
    webEnv = result["WebEnv"]
    queryKey = result["QueryKey"]

    # actually getting the data, based on the DB, query and its metadata
    data = Entrez.esummary(db="gene", webenv=webEnv, query_key=queryKey)
    annotations = Entrez.read(data)

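    # note: len(annotations) counts the top-level entries of the result (the
    # single 'DocumentSummarySet'), not the individual gene records inside it
    print("Retrieved %d annotations for %d genes" % (len(annotations),
                                                     len(id_list)))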

    return annotations
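
A side note on the structure of the result: for the gene database, the records themselves sit under `['DocumentSummarySet']['DocumentSummary']` (the path used later in this notebook), so `len(annotations)` counts top-level keys rather than genes. Below is a minimal sketch of a helper that counts the records instead. The name `count_gene_records` and the mock data are illustrative assumptions, not part of the original solution.

```python
def count_gene_records(annotations):
    """Count the gene records nested inside a parsed esummary result.

    Assumes the layout used elsewhere in this notebook, i.e. that
    annotations['DocumentSummarySet']['DocumentSummary'] is a list of records.
    """
    return len(annotations["DocumentSummarySet"]["DocumentSummary"])


# a hand-made stand-in for the parsed esummary result (not real NCBI output)
mock_annotations = {
    "DocumentSummarySet": {
        "DocumentSummary": [
            {"Name": "IFNA1"},
            {"Name": "IFNB1"},
            {"Name": "NFKB1"},
        ]
    }
}

print(len(mock_annotations))                 # 1 (top-level keys)
print(count_gene_records(mock_annotations))  # 3 (gene records)
```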

Now we create a list of manually looked-up IDs for the genes of interest, as per the lab assignments.

In [3]:
# gene DB IDs for 'IFNA1', 'IFNB1', 'NFKB1'
id_list = ['3439', '3456', '4790']

# retrieving the annotations from the NCBI gene DB
dct = retrieve_annotation(id_list)

Retrieved 1 annotations for 3 genes
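
The IDs above were looked up by hand. As a possible alternative, they could be searched for programmatically. The sketch below is an illustration rather than part of the original solution: the helper names `build_gene_query` and `lookup_gene_ids` are made up, the restriction to human genes is an assumption, and the network call is only defined here, not executed.

```python
def build_gene_query(symbol, organism="Homo sapiens"):
    """Build an Entrez search term for a gene symbol in a given organism."""
    return "%s[Gene Name] AND %s[Organism]" % (symbol, organism)


def lookup_gene_ids(symbol, organism="Homo sapiens"):
    """Return the gene DB IDs matching a symbol (requires network access)."""
    # imported here so that the query builder above works without Biopython
    from Bio import Entrez

    # NOTE: Entrez.email should be set before calling this function
    handle = Entrez.esearch(db="gene", term=build_gene_query(symbol, organism))
    result = Entrez.read(handle)
    handle.close()
    return list(result["IdList"])


print(build_gene_query("IFNA1"))  # IFNA1[Gene Name] AND Homo sapiens[Organism]
```

A search may return more than one ID for a symbol, so the hits would still need to be checked against the intended gene.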


Now we are ready to generate the gene alias mappings from the annotations, printing the annotation contents along the way to see what's in there...

In [4]:
# the variable into which we'll be storing the mappings
name_mappings_genes = {}

# keeping track of the number of duplicate mappings
duplicates = 0

# iterating through the records retrieved from the gene DB
for record in dct['DocumentSummarySet']['DocumentSummary']:
  print('Record items for gene name:', record['Name']) # the preferred name

  # printing out the record contents as a list of key->value mappings
  for key, value in record.items():
    print(' ', key, '->', value)

  # getting aliases from the OtherAliases attribute
  for alias in record['OtherAliases'].split(','):
    new_key = alias.strip()
    if new_key not in name_mappings_genes:
      name_mappings_genes[new_key] = record['Name']
    else:
      duplicates += 1 # incrementing no. of duplicate mappings

  # getting more aliases from the OtherDesignations attribute
  for alias in record['OtherDesignations'].split('|'):
    new_key = alias.strip()
    if new_key not in name_mappings_genes:
      name_mappings_genes[new_key] = record['Name']
    else:
      duplicates += 1 # incrementing no. of duplicate mappings

  # still more aliases from the other name-like attributes of the record
  name_mappings_genes[record['NomenclatureName']] = record['Name']
  name_mappings_genes[record['NomenclatureSymbol']] = record['Name']

# printing the mappings and the no. of ambiguities
print('\n\nGene name mappings:\n'+'\n'.join(['  '+' : '.join(item) for item in\
                                         name_mappings_genes.items()]))
print('Number of duplicates:', duplicates)

Record items for gene name: IFNA1
  Name -> IFNA1
  Description -> interferon alpha 1
  Status -> 0
  CurrentID -> 0
  Chromosome -> 9
  GeneticSource -> genomic
  MapLocation -> 9p21.3
  OtherAliases -> IFL, IFN, IFN-ALPHA, IFN-alphaD, IFNA13, IFNA@, leIF D
  OtherDesignations -> interferon alpha-1/13|IFN-alpha-1/13|interferon alpha 1b|interferon alpha-D|interferon-alpha1
  NomenclatureSymbol -> IFNA1
  NomenclatureName -> interferon alpha 1
  NomenclatureStatus -> Official
  Mim -> ['147660']
  GenomicInfo -> [{'ChrLoc': '9', 'ChrAccVer': 'NC_000009.12', 'ChrStart': '21440438', 'ChrStop': '21441315', 'ExonCount': '1'}]
  GeneWeight -> 25589
  Summary -> This gene is a member of the alpha interferon gene cluster on chromosome 9. The encoded cytokine is a member of the type I interferon family that is produced in response to viral infection as a key part of the innate immune response with potent antiviral, antiproliferative and immunomodulatory properties. This cytokine, like other typ
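
The alias-collecting loop can also be packaged as a small function, which makes it easier to test without calling NCBI. This is a minimal sketch, assuming records shaped like the one printed above. The function name `gene_aliases` is an illustrative choice, and the sample record copies only the name-like fields from that output.

```python
def gene_aliases(record):
    """Return all name-like strings of a gene record, stripped, without blanks."""
    aliases = []
    # OtherAliases is comma-separated, OtherDesignations is pipe-separated
    aliases += record.get("OtherAliases", "").split(",")
    aliases += record.get("OtherDesignations", "").split("|")
    aliases.append(record.get("NomenclatureName", ""))
    aliases.append(record.get("NomenclatureSymbol", ""))
    # strip padding and drop empty strings left behind by missing fields
    return [a.strip() for a in aliases if a.strip()]


# name-like fields copied from the IFNA1 record printed above
sample_record = {
    "Name": "IFNA1",
    "OtherAliases": "IFL, IFN, IFN-ALPHA, IFN-alphaD, IFNA13, IFNA@, leIF D",
    "OtherDesignations": "interferon alpha-1/13|IFN-alpha-1/13|"
                         "interferon alpha 1b|interferon alpha-D|interferon-alpha1",
    "NomenclatureSymbol": "IFNA1",
    "NomenclatureName": "interferon alpha 1",
}

mapping = {alias: sample_record["Name"] for alias in gene_aliases(sample_record)}
print(len(mapping))  # 14
```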

## Mapping protein names

First, we define a mapping from the preferred names to their corresponding UniProt XML record URLs.

In [5]:
names2urls = {
    'IFNA1' : 'https://www.uniprot.org/uniprot/P01562.xml',
    'IFNB1' : 'https://www.uniprot.org/uniprot/P01574.xml',
    'NFKB1' : 'https://www.uniprot.org/uniprot/P19838.xml'
}
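
Since the three URLs differ only in the accession number, the dictionary could equally be generated from the accessions. The sketch below reuses the URL pattern from the cell above. The names `accessions`, `uniprot_xml_url` and `names2urls_generated` are illustrative, and whether this URL pattern stays valid depends on the UniProt website.

```python
# preferred name -> UniProt accession, as used in the URLs above
accessions = {"IFNA1": "P01562", "IFNB1": "P01574", "NFKB1": "P19838"}


def uniprot_xml_url(accession):
    """Build the XML record URL for an accession, following the pattern above."""
    return "https://www.uniprot.org/uniprot/%s.xml" % accession


names2urls_generated = {
    name: uniprot_xml_url(acc) for name, acc in accessions.items()
}
print(names2urls_generated["IFNA1"])
```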

Then we define a function for querying UniProt XML records and parsing their contents into a Biopython `SeqRecord` object using `SeqIO`.

In [6]:
from Bio import SeqIO
import urllib.request  # a plain `import urllib` does not guarantee urllib.request is loaded

def get_protein_aliases(url):
  # opening the URL and parsing its contents into a SeqRecord object
  handle = urllib.request.urlopen(url)
  record = SeqIO.read(handle, "uniprot-xml")

  aliases = [record.name]

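  # need to make sure the name-like attributes are there at all
  if 'alternativeName_fullName' in record.annotations:
    aliases += record.annotations['alternativeName_fullName']
  if 'recommendedName_shortName' in record.annotations:
    aliases += record.annotations['recommendedName_shortName']
  # NOTE: this repeats the first check, so the alternative full names are
  # added twice and later counted as protein-protein duplicates; the first
  # check was probably meant to use 'recommendedName_fullName'
  if 'alternativeName_fullName' in record.annotations:
    aliases += record.annotations['alternativeName_fullName']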
  if 'alternativeName_shortName' in record.annotations:
    aliases += record.annotations['alternativeName_shortName']
  if 'gene_name_primary' in record.annotations:
    aliases.append(record.annotations['gene_name_primary'])

  # returning a list of all name-like attributes of the record
  return aliases
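
The chain of `if` checks can also be written as a loop over the annotation keys of interest. The sketch below works on a plain dictionary standing in for `record.annotations`, so it runs without network access. The helper name `collect_aliases`, the inclusion of `recommendedName_fullName`, and the assignment of the sample names to particular keys are assumptions for illustration. The sample names themselves are taken from the IFNB1 output later in this notebook.

```python
# annotation keys that may hold name-like values (an assumed selection)
NAME_KEYS = (
    "recommendedName_fullName",
    "recommendedName_shortName",
    "alternativeName_fullName",
    "alternativeName_shortName",
    "gene_name_primary",
)


def collect_aliases(entry_name, annotations):
    """Collect name-like values in key order, skipping repeated names."""
    aliases = [entry_name]
    for key in NAME_KEYS:
        value = annotations.get(key, [])
        # some annotations hold a single string, others a list of strings
        values = [value] if isinstance(value, str) else list(value)
        for v in values:
            if v not in aliases:
                aliases.append(v)
    return aliases


# a plain dict standing in for record.annotations (key assignment is a guess)
sample_annotations = {
    "recommendedName_shortName": ["IFN-beta"],
    "alternativeName_fullName": ["Fibroblast interferon"],
    "gene_name_primary": "IFNB1",
}
print(collect_aliases("IFNB_HUMAN", sample_annotations))
# ['IFNB_HUMAN', 'IFN-beta', 'Fibroblast interferon', 'IFNB1']
```

Listing each key once avoids adding the same annotation twice, and skipping names already seen keeps repeated values within one record from showing up as duplicates.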

Ready to generate the protein name mappings, and see how many ambiguities we have now...

In [7]:
# a separate mapping variable for the protein names
name_mappings_proteins = {}

# counters for protein-protein duplicates and gene-protein duplicates
duplicates_pp, duplicates_gp = 0, 0

# processing the pre-defined dictionary of preferred name -> URL mappings,
# getting their aliases
for name, url in names2urls.items():
  for alias in get_protein_aliases(url):
    new_key = alias.strip() # stripping white space padding, just in case

    # updating the aliases
    if new_key not in name_mappings_proteins:
      name_mappings_proteins[new_key] = name
    else:
      duplicates_pp += 1 # incrementing no. of duplicate mappings (proteins)
    if new_key in name_mappings_genes:
      duplicates_gp += 1 # incrementing no. of duplicate mappings (genes)

# printing the mappings and the no. of ambiguities
print('Protein name mappings:\n'+'\n'.join(['  '+' : '.join(item) for item in \
                                            name_mappings_proteins.items()]))
print('Number of total duplicates:', duplicates + duplicates_gp + \
                                     duplicates_pp)
print('Gene-protein duplicates   :', duplicates_gp)
print('Protein-protein duplicates:', duplicates_pp)

Protein name mappings:
  IFNA1_HUMAN : IFNA1
  Interferon alpha-D : IFNA1
  IFN-alpha-1/13 : IFNA1
  LeIF D : IFNA1
  IFNA13 : IFNA1
  IFNB_HUMAN : IFNB1
  Fibroblast interferon : IFNB1
  IFN-beta : IFNB1
  IFNB1 : IFNB1
  NFKB1_HUMAN : NFKB1
  DNA-binding factor KBF1 : NFKB1
  EBP-1 : NFKB1
  Nuclear factor of kappa light polypeptide gene enhancer in B-cells 1 : NFKB1
  NFKB1 : NFKB1
Number of total duplicates: 14
Gene-protein duplicates   : 9
Protein-protein duplicates: 5


Quite an increase in duplicates, right? If you feel curious, try to dig into the sample code in this notebook and:
- run some more detailed analysis to see exactly which names are clashing,
- implement a possible solution to the problem.
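
As a starting point for both tasks, here is a minimal sketch that runs on small samples copied from the outputs in this notebook rather than on the full mappings. It first lists the names that clash exactly and those that clash once letter case is ignored. It then shows one possible remedy: an index from case-folded aliases to the set of preferred names they refer to, so that only aliases pointing to more than one preferred name count as ambiguous. The names `genes_sample`, `proteins_sample` and `build_alias_index` are illustrative assumptions.

```python
from collections import defaultdict

# small samples copied from the outputs above (not the full mappings)
genes_sample = {
    "IFN-alpha-1/13": "IFNA1",
    "interferon alpha-D": "IFNA1",
    "IFNA13": "IFNA1",
    "leIF D": "IFNA1",
}
proteins_sample = {
    "IFNA1_HUMAN": "IFNA1",
    "Interferon alpha-D": "IFNA1",
    "IFN-alpha-1/13": "IFNA1",
    "LeIF D": "IFNA1",
    "IFNA13": "IFNA1",
}

# 1) names that clash exactly, and names that clash when case is ignored
exact = sorted(set(genes_sample) & set(proteins_sample))
folded = sorted({k.casefold() for k in genes_sample}
                & {k.casefold() for k in proteins_sample})
print(exact)   # ['IFN-alpha-1/13', 'IFNA13']
print(folded)  # ['ifn-alpha-1/13', 'ifna13', 'interferon alpha-d', 'leif d']


# 2) one possible remedy: keep every preferred name an alias refers to
def build_alias_index(*mappings):
    """Map each case-folded alias to the set of its preferred names."""
    index = defaultdict(set)
    for mapping in mappings:
        for alias, preferred in mapping.items():
            index[alias.strip().casefold()].add(preferred)
    return index


index = build_alias_index(genes_sample, proteins_sample)
ambiguous = {a: sorted(n) for a, n in index.items() if len(n) > 1}
print(ambiguous)  # {} since every clash here points to the same preferred name
```

With this design, a name shared by a gene and its protein is not a problem as long as both resolve to the same preferred name. Only the entries in `ambiguous` would need manual attention.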