SIFT: Scale Invariant Feature Transform

Matching features across different images is a common problem in computer vision. When all images are similar in nature (same scale, orientation, etc.), simple corner detectors can work. But when you have images of different scales and rotations, you need to use the Scale Invariant Feature Transform.

Why care about SIFT

SIFT isn’t just scale invariant. You can change the following, and still get good results:

  • Scale (duh)
  • Rotation
  • Illumination
  • Viewpoint

Here’s an example. We’re looking for these:

And we want to find these objects in this scene:

Here’s the result:

Now that’s some real robust image matching going on. The big rectangles mark matched images. The smaller squares are for individual features in those regions. Note how the big rectangles are skewed. They follow the orientation and perspective of the object in the scene.

The algorithm

SIFT is quite an involved algorithm. It has a lot going on and can become confusing, so I’ve split the entire algorithm into multiple parts. Here’s an outline of what happens in SIFT.

  1. Constructing a scale space
    This is the initial preparation. You create internal representations of the original image to ensure scale invariance. This is done by generating a “scale space”.
  2. LoG Approximation
    The Laplacian of Gaussian is great for finding interesting points (or key points) in an image. But it’s computationally expensive. So we cheat and approximate it using the representation created earlier.
  3. Finding keypoints
    With the super fast approximation, we now try to find key points. These are maxima and minima in the Difference of Gaussian images we calculated in step 2.
  4. Get rid of bad key points
    Edges and low contrast regions are bad keypoints. Eliminating these makes the algorithm efficient and robust. A technique similar to the Harris Corner Detector is used here.
  5. Assigning an orientation to the keypoints
    An orientation is calculated for each key point. Any further calculations are done relative to this orientation. This effectively cancels out the effect of orientation, making it rotation invariant.
  6. Generate SIFT features
    Finally, with scale and rotation invariance in place, one more representation is generated. This helps uniquely identify features. Let’s say you have 50,000 features. With this representation, you can easily identify the feature you’re looking for (say, a particular eye, or a sign board).
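
The scale space and DoG part of the outline (steps 1–3) can be sketched in a few lines of Python. This is a toy illustration, not the actual code from this series: the function names, sigma values, and level count are all made up, and a real implementation would also downsample the image between octaves:

```python
import math

def gaussian_kernel(sigma, radius=None):
    """Normalized 1-D Gaussian kernel; SIFT's blurs are separable."""
    if radius is None:
        radius = int(3 * sigma)
    vals = [math.exp(-(x * x) / (2.0 * sigma * sigma))
            for x in range(-radius, radius + 1)]
    total = sum(vals)
    return [v / total for v in vals]

def blur(img, sigma):
    """Separable Gaussian blur of a 2-D list of floats (edges clamped)."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    h, w = len(img), len(img[0])
    # horizontal pass, then vertical pass
    tmp = [[sum(k[i + r] * row[min(max(x + i, 0), w - 1)]
                for i in range(-r, r + 1))
            for x in range(w)] for row in img]
    return [[sum(k[i + r] * tmp[min(max(y + i, 0), h - 1)][x]
                 for i in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def dog_octave(img, sigma0=1.6, k=math.sqrt(2), levels=4):
    """One octave: progressively blurred images, and the differences
    between consecutive blur levels (the DoG approximation of the LoG)."""
    blurred = [blur(img, sigma0 * (k ** i)) for i in range(levels)]
    return [[[b2[y][x] - b1[y][x] for x in range(len(img[0]))]
             for y in range(len(img))]
            for b1, b2 in zip(blurred, blurred[1:])]
```

A keypoint candidate would then be any pixel that is a maximum or minimum among its 26 neighbours across three adjacent DoG levels.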

That was an overview of the entire algorithm. Over the next few days, I’ll go through each step in detail. Finally, I’ll show you how to implement SIFT in OpenCV!

What do I do with SIFT features?

After you run through the algorithm, you’ll have SIFT features for your image. Once you have these, you can do whatever you want.

Track images, detect and identify objects (which can be partly hidden as well), or whatever you can think of. We’ll get into this later as well.

But the catch is, this algorithm is patented.


So, it’s good enough for academic purposes. But if you’re looking to make something commercial, look for something else! [Thanks to aLu for pointing out SURF is patented too]



  1. Ray
    Posted July 22, 2010 at 11:59 pm | Permalink

    I recently started looking into using SIFT and SURF for a personal project. I was interested in what you have to say about storing and comparing SIFT/SURF descriptors, methods or techniques.

    So, long story short: I want to save object descriptors in a database (of some sort) and find a match. I am not sure how I can do that.


    • Posted July 23, 2010 at 4:24 pm | Permalink

      Storing a descriptor should be easy... it’s just an array of numbers. Check the code I’ve included: each descriptor has a vector for the 128 numbers, and a few other values (scale, position, etc.).

      For comparing features, you can use the Euclidean distance. That’s the “normal” distance formula, but with 128 squared terms under the root. Calculate the distance between the feature in your database and the one you’re checking. If they are similar you’ll get small values; otherwise you’ll get large values.
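
      A minimal sketch of that comparison in Python (the descriptor values below are made up; only the 128-dimension length matches SIFT):

```python
import math

def euclid(d1, d2):
    """Euclidean distance between two 128-D SIFT descriptors."""
    assert len(d1) == len(d2) == 128
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

# two made-up descriptors: identical except in the first two dimensions
a = [0.1] * 128
b = [0.1] * 128
b[0], b[1] = 0.4, 0.5

print(euclid(a, a))  # 0.0 -- identical features
print(euclid(a, b))  # about 0.5 -- sqrt(0.3^2 + 0.4^2)
```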

      I hope this makes sense.

  2. Eric
    Posted August 12, 2010 at 9:54 pm | Permalink

    Recently I began to study the SIFT algorithm; your tutorial is very useful to me.
    When I read the code of MySift, in the part that calculates the DoG image, the code is
    “cvSub(m_gList[i][j-1], m_gList[i][j], m_dogList[i][j-1]);”, but both JIFT and Rob Hess’s code subtract in the opposite order:
    vil_math_image_difference(m_glist[i][j],m_glist[i][j-1],m_doglist[i][j-1]); --JIFT
    cvSub( gauss_pyr[o][i+1], gauss_pyr[o][i], dog_pyr[o][i], NULL ); --Rob Hess’s
    Maybe this is a little bug!

    • Posted August 16, 2010 at 12:31 pm | Permalink

      Hi Eric! Thanks for the tip… I’ll check if that causes problems and fix it.

  3. S.Amar Nath
    Posted August 13, 2010 at 11:00 am | Permalink

    absolute genius at work !!!! that’s all i gotta say about u… awesome explanation maan
    so, are u participatin in FIRA….. anyways, how do u actually extract the affine parameters of the
    object image in the target frame… that isn’t quite clicking atleast for me i guess…..

    • Posted August 16, 2010 at 12:29 pm | Permalink

      Thanks :D No, I’m not participating in FIRA… atleast not yet ;) Are you?

      About the affine parameters part… I’m having some doubts with it too. When I know how it works, I’ll write about it here. Why don’t you subscribe.. you’ll get to know as soon as I write about it :)

      • S.Amar Nath
        Posted August 16, 2010 at 12:45 pm | Permalink

        Yes I am participating dude… its in India… why won’t I

        You should have with the wealth of knowledge you’ve got…

        Anyways, yup we are participating in four events (Scrapped Robosot for now no time) Mirosot5v5, Mirosot 11v11, Simurosot5v5 and Simurosot11v11 and we also participated and by chance won the simulation league in Robocup@India last year at IIT-Kgp… our ssl entry was naive to say the least (

        And ya, by the way, i went through nearly 20 papers on these feature descriptors till now… should tell u why i decided to give up research on them at least for now.

        1. They are just complicated ways of saying… “Pick those features which are common to all transformations and remember em…” u might as well hand pick em using a mouse or something and still be sure of getting it to work… Using an object classifier(which by the way rocks euclidean, bayesian etc etc all of them do… intuitive concept)
        2. SURF apparently isn’t rotation invariant in reality (U-SURF or no U-SURF it doesn’t usually work as expected) and requires another detector called FAST to achieve true rotation invariance ( ha ha ha ha programming just became a nightmare)
        3. SIFT is decent except for the detection part … real time doesn’t seem possible given the no of scaling operations required…

        The method which according to me actually holds promise and should have been explored a lot more than it has been is the log-polar transform… my intuition suggests going for point correlation based method to detect the target object in the current frame and then going for IIFT to get the affine deviation parameters (haven’t tried it though… still thinking on how it can be made faster)

        Can you please please post or send a mail on how do u identify bots using the apparently trusty CMU certified :) butterfly pattern. and orientation as well. The logic ain’t quite clicking should i go for an r,theta approach or something else… right now, we are still sticking to a greedy approach and tracking em
        Hoping to hear asap
        S.Amar Nath
        Team Les Invincibles

        • Posted August 18, 2010 at 1:31 pm | Permalink

          Whoa Amar! Great that you’re participating in all those competitions! I got to know about the IIT-Kgp robocup only in August last year… so couldn’t participate. Was interested in Robocup for a few days, and then realized that it was waaayyy too complex. Way too many things to take care of (mechanics, electrical/electronics, vision, algorithm, and so on). Almost impossible to do unless you have a good team. So, at least for now, I’ve put it on hold. Sometime in the future maybe I’ll participate.

          And I think you’re wrong about the feature detection papers. The whole point of automatic feature detection is to eliminate human interaction and ensure that the features are good enough for tracking. If human interaction was “allowed” you’d rather input the displacement from one frame to the next… no need to check features at all. And I’ve heard of SIFT running on the GPU, it runs near real time.

          And can you tell me about the log-polar transform thing? Maybe we can work on it together?

          I’ll have to write code for the butterfly pattern :P Never had the need to control so many robots at once (atleast not yet ;) ) I’ve put it on the todo list… should be up soon enough!

          (and this was one massive comment :P )

  4. mehdi
    Posted September 12, 2010 at 11:30 pm | Permalink

    Hi Master,
    it’s great work!
    I need to implement the first two steps (constructing a scale space and the LoG approximation) in order to get robust level lines and strong edges. Do these steps achieve what I need?
    Could you indicate which pieces of the code implement these two steps?
    Thanks in advance,

    • Posted September 20, 2010 at 12:43 am | Permalink

      The BuildScaleSpace() function constructs the DoG for every octave... have a look at that function.

  5. Posted September 13, 2010 at 7:45 pm | Permalink

    Wow, simply wow. Until I found your site I had read so much literature that I was fit to burst, but I didn’t really understand the bigger picture; I didn’t really know what the moving parts were describing. However, you bring it together absolutely beautifully. I understood every single word with crystal clarity and had at least two a-ha! moments on every page. This was some very gifted tutorial/guide writing; well done and thank you.

    • Posted September 20, 2010 at 12:07 am | Permalink

      Hi Robert! Glad you found this series to be helpful! :D

  6. aLu
    Posted September 20, 2010 at 9:01 pm | Permalink

    afaik SURF is also patented.

    • Posted September 21, 2010 at 7:09 pm | Permalink

      Oh! I thought SURF was the “open” thing for SIFT… thanks for the info!

  7. Tony
    Posted October 12, 2010 at 1:05 pm | Permalink

    That’s awesome...
    I’m a newbie. I’m still confused about how to store the descriptor... can you give me an example?

    • Posted October 12, 2010 at 4:24 pm | Permalink

      Well… a SIFT descriptor is just an array of numbers. Does this help?

  8. Divo
    Posted November 9, 2010 at 1:06 am | Permalink

    The SIFT algorithm produces descriptors for one image. How do you match 2 or more images with these descriptors? Sorry for my English, I’m from Russia :)

    • Posted November 12, 2010 at 1:25 am | Permalink

      One way could be to check the “distance” between two features – the square root of (sum of squares of (differences of coordinates)). Have a look at Lowe’s paper – he describes a technique to match features in images.
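
      As a sketch, here is the kind of matching Lowe describes, with his ratio test (toy 2-D descriptors here instead of 128-D ones; the 0.8 threshold is the value suggested in the paper):

```python
def match_descriptors(desc1, desc2, ratio=0.8):
    """Nearest-neighbour matching with Lowe's ratio test: accept a match
    only if the best distance is well below the second-best distance.
    Returns (index_in_desc1, index_in_desc2) pairs."""
    def dist2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    matches = []
    for i, d in enumerate(desc1):
        dists = sorted((dist2(d, e), j) for j, e in enumerate(desc2))
        # we compare squared distances, so square the ratio threshold too
        if len(dists) >= 2 and dists[0][0] < (ratio ** 2) * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches

desc1 = [[0.0, 0.0], [5.0, 5.0]]
desc2 = [[0.1, 0.0], [10.0, 10.0], [5.0, 5.1]]
print(match_descriptors(desc1, desc2))  # [(0, 0), (1, 2)]
```

      The ratio test throws away ambiguous matches: if the second-best candidate is nearly as close as the best one, the match is probably unreliable.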

  9. Umar
    Posted November 23, 2010 at 3:46 am | Permalink

    Good Explanation Keep it up!

  10. Xiaojian Zhao
    Posted December 10, 2010 at 8:51 am | Permalink

    In your code, you set
    octaves = 4 and intervals = 2.

    I want to know the reason, and what the optimal values for them are. Thanks.

  11. Ashish
    Posted January 6, 2011 at 3:10 pm | Permalink

    Hello Utkarsh,

    I found this very useful for matching objects with different orientations. Do you have a C# implementation of this code?


  12. konstantinos
    Posted February 2, 2011 at 11:39 am | Permalink

    First of all, I would like to thank you for this wonderful presentation/explanation of how SIFT works. I’ve been searching for a long time to find solid information about this algorithm. I am new to OpenCV and I would like to ask you about the code that you wrote which implements SIFT.
    If I am not wrong, this code detects the keypoints of an image.
    How can I modify this code to compare these keypoints with another image (possibly a live feed from a camera)? I’ve done something similar with SURF, but I find little information about SIFT.
    Thank You!


  13. inbapuvi
    Posted February 25, 2011 at 9:46 am | Permalink

    How to subtract two images?

    • Posted February 25, 2011 at 4:58 pm | Permalink

      You just subtract corresponding pixel values. If I1 and I2 are two images, then the subtracted result will be I1(x,y)-I2(x,y).
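
      In code, it’s just an element-wise loop (toy 2-D lists standing in for images here; note that OpenCV’s cvSub saturates at 0 for unsigned 8-bit images, so the operand order matters there):

```python
def subtract(i1, i2):
    """Pixel-wise difference of two equally sized images (2-D lists)."""
    return [[a - b for a, b in zip(row1, row2)]
            for row1, row2 in zip(i1, i2)]

i1 = [[10, 20], [30, 40]]
i2 = [[1, 2], [3, 4]]
print(subtract(i1, i2))  # [[9, 18], [27, 36]]
```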

  14. Posted March 3, 2011 at 10:16 pm | Permalink

    Hi, Utkarsh:
    Your site is great; I learned a lot from it. I’d like to ask you a question about scale invariance. Suppose a keypoint has been described with a 128-D array {k1, k2, …, k128}, computed from a 32*32 region of a picture; another picture has a 37*37 region exactly corresponding to that 32*32 region, i.e. the latter is a resized version of the former. A second keypoint {j1, j2, …, j128} will be computed from the 37*37 region by the same algorithm, with the same pyramid octaves. Let’s say a 16*16 region in the former contributes the values {k3, k4, k5, k6}. Which octave contributes those in the latter? How do the octaves map to the values in the array?
    Your reply will be appreciated!

  15. Vahid
    Posted March 5, 2011 at 7:13 pm | Permalink

    Thanks alot for the tutorial

  16. mary
    Posted March 19, 2011 at 6:20 pm | Permalink

    Thank you for the tutorial; it is really helpful. My team is using SIFT on a Windows-based mobile device to build an app that detects features on paper money and matches them against our database. We are worried about the speed of the application: if it takes around 20 seconds on a computer, it will take even longer on a mobile device. Is there a way to decrease the computation of features while keeping it accurate? Or should we look at other algorithms, such as SURF?

    • Posted March 21, 2011 at 1:05 pm | Permalink

      Hi! These algorithms are computationally expensive. One way to reduce time would be to search for features in only certain regions of the image – like a person’s face on the bill, the corners, etc. Btw, I doubt if you can use those algorithms in a commercial application – they’re patented.

  17. Gio
    Posted March 30, 2011 at 9:27 pm | Permalink

    thanks a lot for the tutorial!! really very useful one.
    thanks again! :-)

  18. Nova
    Posted May 16, 2011 at 12:59 pm | Permalink

    Hi! Thank you very much for SIFT article
    It really helps me in understanding SIFT
    By the way, I’m also implementing SURF using OpenCV for processing a sequence of video images. When I compare the processing time, SURF runs faster and better than SIFT.
    When I run SIFT, the video doesn’t play very well. It is very slow; it seems almost like a series of still images, not video. Do you have any idea why SIFT behaves like this?
    Thank you so much ^_^)

  19. Daniel
    Posted June 10, 2011 at 9:37 am | Permalink

    Hi Utkarsh,

    Firstly, thank you for this wonderful guide. :-)

    I’ve been trying to make a Python version of your program, as a student of mine has requested info on how this works. You’ll get full credit, don’t worry. :-)

    Basically there’s a LOT of array cycling. A lot. I was curious if any OpenCV functions would be useful.

    Would this work…. use “features”=GoodFeaturesToTrack() in each image within an octave. Find matches in each image. i.e. Take features that are in all images for that octave.
    (Goodfeatures uses local maxima, repeat on Not(image) to find local minima)

    Then use cv.FindCornerSubPix()……. to get subpixel corners for each feature.

    Then conduct elimination based on this reduced feature set (contrast, edge, flat)?

    My main issue is the Taylor expansion. Although you have made it very clear, I find it difficult to get good performance in Python.

    Kind regards,


    • Daniel
      Posted June 10, 2011 at 9:51 am | Permalink

      Two other feature extraction methods…..

      GetStarKeypoints() – The function GetStarKeypoints extracts keypoints that are local scale-space extrema. The scale-space is constructed by computing approximate values of Laplacians with different sigmas at each pixel. Instead of using pyramids, a popular approach to save computing time, all of the Laplacians are computed at each pixel of the original high-resolution image. But each approximate Laplacian value is computed in O(1) time regardless of the sigma, thanks to the use of integral images. The algorithm is based on the paper Agrawal08, but instead of a square, hexagon or octagon it uses an 8-end star shape, hence the name, consisting of overlapping upright and tilted squares.

      ExtractSURF() – The function cvExtractSURF finds robust features in the image, as described in Bay06 . For each feature it returns its location, size, orientation and optionally the descriptor, basic or extended. The function can be used for object tracking and localization, image stitching etc.

      • Posted June 17, 2011 at 4:49 pm | Permalink

        Yup, SURF is another algorithm like SIFT. The Star one I’ll have a look at!

    • Posted June 17, 2011 at 4:48 pm | Permalink

      Hi Daniel! This sounds really interesting! Have you made any progress so far? I think you could do that, but I’m not sure. Would you find exactly those points again in another image?

  20. Levi
    Posted June 15, 2011 at 4:16 am | Permalink

    Thanks for the excellent tutorial! I had difficulty comprehending the paper by itself, but your explanation made things much clearer. Great use of images to illustrate your points.

  21. Posted August 31, 2011 at 8:46 am | Permalink

    Utkarsh, can you please direct me to an algorithm that isn’t patented?

  22. Richard
    Posted September 11, 2011 at 2:39 pm | Permalink

    We are studying SIFT in my Masters at NTU in Singapore and our lecturer asked us to read the original SIFT paper as preparation. The maths and explanations were way above my level of understanding but I found your tutorials to be really clear and they helped me to really appreciate the brilliance of the algorithm.

    Thanks for writing the tutorial! I intend to share the link with my entire class!

  23. Juanelo
    Posted January 13, 2012 at 10:18 pm | Permalink

    One question about the part “4. Finding keypoints”: I’m reading all your posts in chronological order (from older to newer), and the natural option for subpixel precision would have been using the method described here:

    So wouldn’t it be OK for this SIFT calculation?

    • Posted January 20, 2012 at 1:24 am | Permalink

      That should work as well. But I’ve never tried it out.

  24. Posted January 19, 2012 at 9:03 pm | Permalink

    Very well written. Keep it up!
