After step 4, we have legitimate key points. They’ve been tested to be stable. We already know the scale at which the keypoint was detected (it’s the same as the scale of the blurred image). So we have scale invariance. The next thing is to assign an orientation to each keypoint. This orientation provides rotation invariance. The more invariance you have the better it is. ![]()
The idea
The idea is to collect gradient directions and magnitudes around each keypoint. Then we figure out the most prominent orientation(s) in that region. And we assign this orientation(s) to the keypoint.
Any later calculations are done relative to this orientation. This ensures rotation invariance.

The size of the “orientation collection region” around the keypoint depends on it’s scale. The bigger the scale, the bigger the collection region.
The details
Now for the little details about collecting orientations.

Gradient magnitudes and orientations are calculated using these formulae:

The magnitude and orientation is calculated for all pixels around the keypoint. Then, A histogram is created for this.
In this histogram, the 360 degrees of orientation are broken into 36 bins (each 10 degrees). Lets say the gradient direction at a certain point (in the “orientation collection region”) is 18.759 degrees, then it will go into the 10-19 degree bin. And the “amount” that is added to the bin is proportional to the magnitude of gradient at that point.
Once you’ve done this for all pixels around the keypoint, the histogram will have a peak at some point.
Above, you see the histogram peaks at 20-29 degrees. So, the keypoint is assigned orientation 3 (the third bin)
Also, any peaks above 80% of the highest peak are converted into a new keypoint. This new keypoint has the same location and scale as the original. But it’s orientation is equal to the other peak.
So, orientation can split up one keypoint into multiple keypoints.

The Technical Details
Magnitudes
Saw the gradient magnitude image above? In SIFT, you need to blur it by an amount of 1.5*sigma.
Size of the window
The window size, or the “orientation collection region”, is equal to the size of the kernel for Gaussian Blur of amount 1.5*sigma.
Summary
To assign an orientation we use a histogram and a small region around it. Using the histogram, the most prominent gradient orientation(s) are identified. If there is only one peak, it is assigned to the keypoint. If there are multiple peaks above the 80% mark, they are all converted into a new keypoint (with their respective orientations).
Next, we generate a highly distinctive “fingerprint” for each keypoint. Here’s a little teaser. This fingerprint, or “feature vector”, has 128 different numbers.
Got any questions or suggestions? Leave a comment below!
More in this series


43 Comments
Hi,
I recently start looking in to using SIFT and SURF for a personal project. I was interested in what you have to say about storing and comparing SIFT/SURF descriptors, methods or techniques.
So long story short; i want to save object descroptors in database (of some sort) and find a match. I am not sure how I can do that.
Thank
Storing a descriptor should be easy.. its just an array of numbers. Check the code I’ve included, each descriptor has a vector for the 128 numbers, and a few other values (scale, position, etc).
For comparing features, you can use euclidean distance. That’s the “normal” distance formula, but with 128 squares in the underroot. Calculate the distance between the feature in your database and the one you’re checking. If they are “similar” you’ll get small values… or you’ll get large values.
I hope this makes sense.
Hello!
Recently, I begin to study the sift alogrithem, your tutorial is very useful to me.
When I read the code of MySift, in the process of Calculate the DoG image, the code is
“cvSub(m_gList[i][j-1], m_gList[i][j], m_dogList[i][j-1]);”, but both the JIFT and Rob Hess’s is
vil_math_image_difference(m_glist[i][j],m_glist[i][j-1],m_doglist[i][j-1]); --JIFT
cvSub( gauss_pyr[o][i+1], gauss_pyr[o][i], dog_pyr[o][i], NULL ); --Rob Hess’s
May be this is a little bug!
Hi Eric! Thanks for the tip… I’ll check if that causes problems and fix it.
Dude,
absolute genius at work !!!! that’s all i gotta say about u… awesome explanation maan
so, are u participatin in FIRA….. anyways, how do u actually extract the affine parameters of the
object image in the target frame… that isn’t quite clicking atleast for me i guess…..
Thanks
No, I’m not participating in FIRA… atleast not yet
Are you?
About the affine parameters part… I’m having some doubts with it too. When I know how it works, I’ll write about it here. Why don’t you subscribe.. you’ll get to know as soon as I write about it
Yes I am participating dude… its in India… why won’t I
You should have with the wealth of knowledge you’ve got…
Anyways, yup we are participating in four events (Scrapped Robosot for now no time) Mirosot5v5, Mirosot 11v11, Simurosot5v5 and Simurosot11v11 and we also participated and by chance won the simulation league in Robocup@India last year at IIT-Kgp… our ssl entry was naive to say the least (www.robocup.in)
And ya, by the way, i went through nearly 20 papers on these feature descriptors till now… should tell u why i decided to give up research on them at least for now.
The method which according to me actually holds promise and should have been explored a lot more than it has been is the log-polar transform… my intuition suggests going for point correlation based method to detect the target object in the current frame and then going for IIFT to get the affine deviation parameters (haven’t tried it though… still thinking on how it can be made faster)
Can you please please post or send a mail on how do u identify bots using the apparently trusty CMU certified
butterfly pattern. and orientation as well. The logic ain’t quite clicking should i go for an r,theta approach or something else… right now, we are still sticking to a greedy approach and tracking em
Hoping to hear asap
Regards
S.Amar Nath
Team Les Invincibles
Whoa Amar! Great that you’re participating in all those competitions! I got to know about the IIT-Kgp robocup only in August last year… so couldn’t participate. Was interested in Robocup for a few days, and then realized that it was waaayyy too complex. Way too many things to take care of (mechanics, electrical/electronics, vision, algorithm, and so on). Almost impossible to do unless you have a good team. So, at least for now, I’ve put it on hold. Sometime in the future maybe I’ll participate.
And I think you’re wrong about the feature detection papers. The whole point of automatic feature detection is to eliminate human interaction and ensure that the features are good enough for tracking. If human interaction was “allowed” you’d rather input the displacement from one frame to the next… no need to check features at all. And I’ve heard of SIFT running on the GPU, it runs near real time.
And can you tell me about the log-polar transform thing? Maybe we can work on it together?
I’ll have to write code for the butterfly pattern
Never had the need to control so many robots at once (atleast not yet
) I’ve put it on the todo list… should be up soon enough!
(and this was one massive comment
)
Hi Master,
it’s great work !,
I need to achieve the first two steps(Constructing a scale space
and LoG approximation), in order to get robust level lines and strong edge, do these steps achieve what I need ?
could you indicate on the code which are the pieces of code to achieve these two steps ?
thanx in advance,
The BuildScaleSpace() function constructs the DoG for every octave.. have a look at that function.
Wow, simply wow. Until I found your site I had read so much literature that I was fit to burst but I didn’t really understand the bigger picture; I didn’t really know what the moving parts were describing. However you bring it together absolutely beautifully, I understood every single word with crystal clear clarity and atleast two Uh-Ha! moments on every page. This was some very gifted tutorial/guide writing; well done and thankyou.
Hi Robert! Glad you found this series to be helpful!
afaik SURF is also patented.
Oh! I thought SURF was the “open” thing for SIFT… thanks for the info!
Hi,
That`s awesome..
i`m a newbie. i`m still confuse about how to strore the descriptor..can u give me an example?
thanks
Well… a SIFT descriptor is just an array of numbers. Does this help?
HI!
a SIFT algorithm produce a descriptors for one image. How to match 2 or more images with this descriptors? sorry for my english, just i’m from Russia:)
One way could be to check the “distance” between two features – the square root of (sum of squares of (differences of coordinates)). Have a look at Lowe’s paper – he describes a technique to match features in images.
Good Explanation Keep it up!
in your code, you set
octaves = 4 and intervals = 2.
I want to know the reason and I want to know the optimized values of them. Thx.
Hello Utkarsh,
I found this very useful for matching objects with different orientations. Do you have C# implementation of this code.
ThanKs,
Ashish
First of all I would like to thank you for this wonderful presentation/explanation on how SIFT works. I’ve been searching for a long time to find solid information about this algorithm. I am new in openCV and I would like to ask you about the code that you wrote which implements SIFT.
If i am not wrong, this code detects the keypoints of an image.
How can I convert this code in order to compare these keypoints with another image (possibly a live feed from a camera). I’ve done something similar with SURF but i find little information about SIFT.
Thank You!
Sincerely,
Konstantinos
How to subtract two images?
You just subtract corresponding pixel values. If I1 and I2 are two images, then the subtracted result will be I1(x,y)-I2(x,y).
Hi, Utkarsh:
Your web is great, in which I learned a lot. I’d like to ask you a question about the scale invarants. Suppose a keypoint has been done with a 128-D array {k1, k2, …, k128}, which is from a 32*32 region from a picture; another picture has a 37*37 region exactly corresponding to the 32*32 region, i.e. the latter is a resized version of the former. A second keypoint {j1,j2, … j128}, will be done from the 37*37 region by the same algorithm, by the same pyramid octaves. Let’s say the 16*16 region in the former contributes the values {k3,k4,k5,6}, then, Which octave contributes in the latter? How the octaves map to the values in the array?
Your reply will be appreciated!
TIA
gzdillon
Thanks alot for the tutorial
thank you for the tutorial. it is really helpful. my team is using the SIFT with windows base mobile to do a mobile app for detecting features on paper money and matching it with our database. we are just afraid of the speed of the application because if it take around 20 sec on computer so it will be more on mobile so is there a way to decrease the computation of features and at the same time keep it accurate? or should we search on other algorithm such as the surf?
Hi! These algorithms are computationally expensive. One way to reduce time would be to search for features in only certain regions of the image – like a person’s face on the bill, the corners, etc. Btw, I doubt if you can use those algorithms in a commercial application – they’re patented.
thanks a lot for the tutorial!! really very useful one.
thanks again!
Hi! Thank you very much for SIFT article
It really helps me in understanding SIFT
By the way, I’m also implementing SURF using openCV for processing a sequences of video images. When I compare the processing time, SURF run faster and better than SIFT
when i’m running SIFT, the video shown didn’t run very well. it was very slow, it seems almost like an images only, not video. Do you have any video why SIFT behave like this?
Thank you so much ^_^)
Hi Utkarsh,
Firstly, thank you for this wonderful guide.
I’ve been trying to make a Python version of your program, as a student of mine has requested info on how this works. You’ll get full credit, don’t worry.
Basically there’s ALOT of array cycling. Alot. I was curious if any OpenCv functions would be useful.
Would this work…. use “features”=GoodFeaturesToTrack() in each image within an octave. Find matches in each image. i.e. Take features that are in all images for that octave.
(Goodfeatures uses local maxima, repeat on Not(image) to find local minima)
Then use cv.FindCornerSubPix()……. to get subpixel corners for each feature.
Then conduct elimination based on this reduced feature set (contrast, edge, flat)?
My main issue is the Taylor expansion, although you have made it very clear, i find it difficult to get good performance in Python.
Kind regards,
Daniel
Two other feature extraction methods…..
GetStarKeypoints() – The function GetStarKeypoints extracts keypoints that are local scale-space extremas. The scale-space is constructed by computing approximate values of laplacians with different sigma’s at each pixel. Instead of using pyramids, a popular approach to save computing time, all of the laplacians are computed at each pixel of the original high-resolution image. But each approximate laplacian value is computed in O(1) time regardless of the sigma, thanks to the use of integral images. The algorithm is based on the paper Agrawal08 , but instead of a square, hexagon or octagon it uses an 8-end star shape, hence the name, consisting of overlapping upright and tilted squares.
ExtractSURF() – The function cvExtractSURF finds robust features in the image, as described in Bay06 . For each feature it returns its location, size, orientation and optionally the descriptor, basic or extended. The function can be used for object tracking and localization, image stitching etc.
Yup, SURF is another algorithm like SIFT. The Star one I’ll have a look at!
Hi Daniel! This sounds really interesting! Did you have a progress till now? I think you could do that, but I’m not sure. Would you find exactly those points again in another image?
Thanks for the excellent tutorial! I had difficulty comprehending the paper by itself, but your explanation made things much clearer. Great use of images to illustrate your points.
Utkarsh, Can you please direct me to an algorithm that isnt patented?
Haha! Tough question!
We are studying SIFT in my Masters at NTU in Singapore and our lecturer asked us to read the original SIFT paper as preparation. The maths and explanations were way above my level of understanding but I found your tutorials to be really clear and they helped me to really appreciate the brilliance of the algorithm.
Thanks for writing the tutorial! I intend to share the link with my entire class!
Awesome! Thanks!
One question about the part “4. Finding keypoints”: I’m reading all your post in chronological order (from older to newer), and the natural option for subpixel precission would habe been using the method describer here:
http://www.aishack.in/2010/05/subpixel-corners-increasing-accuracy/
http://www.aishack.in/2010/05/subpixel-corners-in-opencv/
So it wouldn’t be OK for this SIFT calculation?
Thanks
That should work as well. But I’ve never tried it out.
Very well written. Keep it up!
Thanks!
One Trackback
[...] 2.SIFT tutorial http://www.aishack.in/2010/05/sift-scale-invariant-feature-transform/ [...]