Tips for Remapping from NCBI36 to GRCh37 Genome Assembly


This might seem straightforward to some people, but I had to spend quite some time working out how to remap my array probes from NCBI36 to GRCh37. If you use the Ensembl genome browser, you might have noticed that since July 2009 the GRCh37 assembly has been in use. For DECIPHER (the database I help develop), this is a bit of a headache, because it means that all of the array CGH probes we use have to be remapped to the new assembly. If this does not interest you, I recommend that you stop reading here.

First I learned that there is a program from UCSC called liftOver that can do this remapping. Since the number of probes I have to map (around 6 million) is more than I would wish to throw at anyone’s server, I decided to do this in-house. The program can be downloaded from UCSC. I did not know which binary was the right one for me, as they had linux32 and linux64 versions. I decided to go for the former, since I am using Debian and it sounded like the conservative option.
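If you are unsure which of the two builds matches your machine, a small check like the one below may help. It is only a sketch: the function name `suggest_liftover_build` is my own, the lists of machine strings are assumptions that cover common x86 systems only, and anything else is rejected rather than guessed.

```python
import platform

# Machine strings as reported by platform.machine(); these sets are an
# assumption covering common x86 systems, not an exhaustive list.
X86_64_NAMES = {"x86_64", "amd64"}
X86_32_NAMES = {"i386", "i486", "i586", "i686"}


def suggest_liftover_build(machine: str) -> str:
    """Return 'linux64' or 'linux32' for a given machine string."""
    name = machine.lower()
    if name in X86_64_NAMES:
        return "linux64"
    if name in X86_32_NAMES:
        return "linux32"
    # Unknown architecture: better to stop than to guess.
    raise ValueError(f"unrecognised machine type: {machine!r}")
```

Calling `suggest_liftover_build(platform.machine())` on the machine where liftOver will run gives the build name to look for.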

Once I downloaded the program, I needed to make it executable:

chmod u+x liftOver

OK, so I was in a position to run it:

./liftOver

According to the usage information, I need several arguments and files to run this program correctly:

liftOver oldFile map.chain newFile unMapped

Next I learned that I also need a file referred to as map.chain. I was not sure what that meant. It turns out that a chain file describes how regions of one assembly line up with regions of another, which liftOver uses to convert coordinates, and that there is a different chain file for each remapping one wants to do. In my case, I want to remap from NCBI36 to GRCh37 in human. However, when I looked through the available remappings, I did not see NCBI names anywhere. I learned that the chain file I am looking for is called:

hg18toHg19.over.chain

Apparently hg18 refers to NCBI36 and hg19 to GRCh37. A Google search turned up the file.
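To get a feel for what is inside a chain file, it helps to look at a header line. As far as I understand the UCSC chain format, each chain starts with a line of 13 whitespace-separated fields: the word `chain`, a score, then name, size, strand, start and end for the first (target) sequence, the same five for the second (query) sequence, and an id. The sketch below parses such a line. The class and function names are my own, and the header used in the usage note is made up for illustration, not copied from the real file.

```python
from typing import NamedTuple


class ChainHeader(NamedTuple):
    score: int
    t_name: str
    t_size: int
    t_strand: str
    t_start: int
    t_end: int
    q_name: str
    q_size: int
    q_strand: str
    q_start: int
    q_end: int
    chain_id: str


def parse_chain_header(line: str) -> ChainHeader:
    """Parse one 'chain ...' header line into named fields."""
    fields = line.split()
    if len(fields) != 13 or fields[0] != "chain":
        raise ValueError(f"not a chain header: {line!r}")
    (_, score, t_name, t_size, t_strand, t_start, t_end,
     q_name, q_size, q_strand, q_start, q_end, chain_id) = fields
    return ChainHeader(
        int(score),
        t_name, int(t_size), t_strand, int(t_start), int(t_end),
        q_name, int(q_size), q_strand, int(q_start), int(q_end),
        chain_id,
    )
```

For example, `parse_chain_header("chain 1000 chrY 57772954 + 100 200 chrY 59373566 + 150 250 1")` returns a record whose `t_name` is `chrY` and whose `q_start` is 150.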

Looking further, I learned that my probes need to be in BED format to run liftOver. According to the UCSC FAQ on file formats, there are quite a few formats one can use. Here is an example of what my BED file looks like (chromosome, tab, start position, tab, end position):

chrY       12308579        12468100
chrY       12468101        12581699
chrY       12581700        12759636
chrY       12759637        12838587
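If your probes live in a database or a spreadsheet, you will have to write these lines yourself. A minimal sketch of reading and writing the three-column layout shown above follows. The helper names `parse_bed3` and `format_bed3` are my own, and the code assumes nothing beyond the chromosome, start and end columns.

```python
from typing import Tuple


def parse_bed3(line: str) -> Tuple[str, int, int]:
    """Split a BED line on whitespace and return (chrom, start, end)."""
    chrom, start, end = line.split()[:3]
    return chrom, int(start), int(end)


def format_bed3(chrom: str, start: int, end: int) -> str:
    """Build a tab-separated three-column BED line (no trailing newline)."""
    return f"{chrom}\t{start}\t{end}"
```

Splitting on any whitespace rather than on tabs alone makes the parser tolerant of lines where the tabs were turned into spaces along the way.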

Now I am in a position to run liftOver. Looking at the usage again, it reads:

liftOver oldFile map.chain newFile unMapped

‘newFile’ and ‘unMapped’ are the names of the output files that liftOver writes, so they do not need to exist beforehand. This can be confusing, as the user might think that these are further input files one has to get hold of.

OK, so now I am ready to convert our old NCBI36 array probe mapping to the new GRCh37 one:

./liftOver probes.ncbi36 hg18toHg19.over.chain probes.grch37 unmapped-to-grch37
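If you prefer to drive this from a script, the same call can be wrapped along these lines. This is a sketch only: the function names are my own, it assumes the liftOver binary sits in the current directory, and it passes exactly the four positional arguments from the usage line and nothing else.

```python
import subprocess
from typing import List


def liftover_command(old_file: str, chain_file: str, new_file: str,
                     unmapped_file: str, binary: str = "./liftOver") -> List[str]:
    """Build the argument list in the order given by the usage line."""
    return [binary, old_file, chain_file, new_file, unmapped_file]


def run_liftover(old_file: str, chain_file: str, new_file: str,
                 unmapped_file: str, binary: str = "./liftOver") -> int:
    """Run liftOver and return its exit code (0 normally means success)."""
    cmd = liftover_command(old_file, chain_file, new_file, unmapped_file, binary)
    # check=False: leave it to the caller to decide what a failure means.
    return subprocess.run(cmd, check=False).returncode
```

With the file names used above, the call would be `run_liftover("probes.ncbi36", "hg18toHg19.over.chain", "probes.grch37", "unmapped-to-grch37")`.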

I got the following output to console:

Reading liftover chains
Mapping coordinates
ERROR: start coordinate is after end coordinate (chromStart > chromEnd) on line 5171240 of bed file probes.decipher.ncbi36
ERROR: 4 2515512 2515453

…which is a bit worrying.

I went through my probes and found that some of them (just 44757!) had start coordinates greater than their end coordinates. I guess that if you encounter those, you’ll have to decide what to do with them. For the time being I just took them out and ran liftOver again.
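Separating the offending probes can be done in a few lines. The sketch below is one way to do it, not necessarily how I did it at the time: the function name `split_inverted` is made up, and it assumes every non-blank line has at least three whitespace-separated columns with integer coordinates in the second and third.

```python
from typing import Iterable, List, Tuple


def split_inverted(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (valid, inverted) BED lines, where inverted means start > end."""
    valid: List[str] = []
    inverted: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        if start > end:
            inverted.append(line)
        else:
            valid.append(line)
    return valid, inverted
```

Keeping the inverted lines in their own list, rather than throwing them away, leaves the decision about what to do with them for later. The line reported in the error above, `4 2515512 2515453`, would end up in the inverted list.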

This time it worked.
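A quick sanity check afterwards is to count how many probes ended up in each output file. As far as I can tell, the unMapped file mixes comment lines starting with `#` (giving the reason a probe could not be mapped) with the probe lines themselves, so the sketch below skips those. The function name `count_records` is my own.

```python
from typing import Iterable


def count_records(lines: Iterable[str]) -> int:
    """Count non-blank lines that are not '#' comment lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))
```

Opening `probes.grch37` and `unmapped-to-grch37` in turn and passing each file object to `count_records` gives two numbers that should add up to the number of probes in the input.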
