
Evaluating Scene Composition


In my mind, one of the biggest variables in photography is who’s behind the camera. No amount of smart HDR or machine-learning-powered scene detection can correct a bad shot if the camera is pointed at the wrong thing to begin with. I wanted to build a tool that could actively guide amateur photographers and teach them to take better pictures by automatically evaluating the quality of an image’s composition in real time.


I started by searching the ACM Digital Library to see whether anyone had tackled a similar problem. Although it isn’t exactly the same task, I stumbled upon a paper by researchers at Dartmouth who developed a machine learning pipeline to automatically crop images. Using Support Vector Regression (SVR) alongside simpler methods, such as measuring the density of content along the edges, they built a simple, effective system for suggesting good-looking crops of a larger image.


What interested me most about this paper was the dataset: the research team collected 500 poorly composed photos with crops suggested by Amazon Mechanical Turk workers, along with 3,000 ‘good’ photos. From these they trained an SVR that scores an input image on perceived aesthetic quality. For my problem, I chose to implement just that part. I wasn’t able to find the collection of 3,000 known-good images, so I used the Mechanical Turk crops as the ground truth and cropped the bad images randomly to represent poor composition.
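Generating those random “bad” crops is simple; here’s a minimal sketch of the idea. The helper name and the scale range are my own illustrative choices, not values from the paper or my exact code.

```python
import cv2
import numpy as np

def random_crop(image, min_scale=0.5, max_scale=0.9, rng=None):
    """Take a random crop of an image to serve as a poorly composed negative example.

    min_scale/max_scale are placeholder values, not anything from the paper.
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    scale = rng.uniform(min_scale, max_scale)
    ch, cw = int(h * scale), int(w * scale)
    y = rng.integers(0, h - ch + 1)  # upper bound is exclusive
    x = rng.integers(0, w - cw + 1)
    return image[y:y + ch, x:x + cw]

# Usage:
# img = cv2.imread("photo.jpg")
# bad_example = random_crop(img)
```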


To feed images into the SVR, Fang et al. created a spatial pyramid of saliency maps (SPSM) that can be represented as a vector of constant length. After computing a saliency map of the input, they built a histogram of its pixel values and flattened that into a vector. Then they subdivided the image into four quadrants, constructed the same histogram for each subdivision, and repeated, appending each new histogram onto the vector. These vectors were then used as training data for the SVR.
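Here’s a rough sketch of that feature construction as I understand it. The number of pyramid levels and histogram bins are guesses, not the paper’s values, and it assumes the saliency map has already been normalized to [0, 1].

```python
import numpy as np

def spsm_vector(saliency_map, levels=3, bins=16):
    """Build a spatial-pyramid-of-saliency-maps (SPSM) feature vector.

    At pyramid level l the map is split into a 2^l x 2^l grid, a normalized
    histogram is computed for every cell, and all histograms are concatenated.
    """
    sal = saliency_map.astype(np.float32)
    h, w = sal.shape[:2]
    features = []
    for level in range(levels):
        cells = 2 ** level
        ys = np.linspace(0, h, cells + 1, dtype=int)
        xs = np.linspace(0, w, cells + 1, dtype=int)
        for i in range(cells):
            for j in range(cells):
                cell = sal[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                hist, _ = np.histogram(cell, bins=bins, range=(0.0, 1.0))
                hist = hist / max(hist.sum(), 1)  # normalize so cell size doesn't matter
                features.append(hist)
    return np.concatenate(features)  # length: bins * (1 + 4 + 16 + ...)
```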


Although they didn’t mention it in the paper, I applied a scaler transformation followed by principal component analysis (PCA) to all the input vectors. This cut training time significantly and produced a much smaller model file.
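In scikit-learn, the whole preprocessing-plus-regression chain fits in a single pipeline. This is only a sketch: the PCA dimension, SVR hyperparameters, and labeling scheme are placeholders, not the values I actually tuned.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVR

model = make_pipeline(
    StandardScaler(),          # scale each histogram bin to zero mean, unit variance
    PCA(n_components=64),      # shrink the SPSM vectors before training
    SVR(kernel="rbf", C=1.0),  # regress a composition score from the reduced vector
)

# X: stacked SPSM vectors; y: composition scores
# (e.g. high for the Turk crops, low for the random crops)
# model.fit(X, y)
# score = model.predict(spsm_vector(saliency_map).reshape(1, -1))
```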


A complication I ran into was the saliency map, which is more or less the heart of the image simplification step. I believe the original authors wrote their code in Matlab, and according to their citations, Fang et al. relied on a complex saliency algorithm described in another paper. I read that paper but could only find its code as compiled, protected Matlab. Since I was doing all of my work in Python, I decided that recreating that paper was outside the scope of my project and stuck with OpenCV’s built-in saliency functions, one fast and one slow. Because I wanted to use my code in a somewhat real-time environment, I picked the fast one, though the slower one may produce better output quality.
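For reference, here’s roughly what that looks like with OpenCV’s two static saliency detectors (this needs the opencv-contrib-python package); the spectral residual detector is the faster of the two, and swapping the constructor switches to the slower fine-grained one.

```python
import cv2
import numpy as np

saliency = cv2.saliency.StaticSaliencySpectralResidual_create()
# saliency = cv2.saliency.StaticSaliencyFineGrained_create()  # slower alternative

def saliency_map(image):
    """Compute a saliency map normalized to [0, 1]."""
    ok, sal = saliency.computeSaliency(image)
    if not ok:
        raise RuntimeError("saliency computation failed")
    return cv2.normalize(sal.astype(np.float32), None, 0.0, 1.0, cv2.NORM_MINMAX)
```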


I did most of my work reimplementing the paper in a Jupyter notebook, with Python, Matplotlib, NumPy, and OpenCV.


I wasn’t entirely happy with the result: some pictures showed only a small change, or no change at all, in score whether I composed them carefully or just waved my camera around randomly, and the SVR shows a clear favorable bias toward images that came from the training dataset itself. It doesn’t help that I’m using the model described in the paper outside its intended purpose: instead of judging the quality of crops of images, which is what I trained it on, I’m having it evaluate entire images.


My original goal was to have something that could run on a camera or in a camera app, but I did all my work in Python. I did make a demo application that used my computer’s webcam and displayed a live composition score, but I recognize that if I want to move my code to Android or iOS, I’ll have to port it to C++. Porting probably won’t be too hard: OpenCV is written in C++ in the first place, and its ml module offers an SVR, so I wouldn’t need scikit-learn, which is a Python-only library.
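As a quick sanity check, the SVR variant is reachable from OpenCV’s ml module; the sketch below uses the Python bindings, but the C++ API mirrors it call for call. The hyperparameters are placeholders.

```python
import cv2
import numpy as np

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_EPS_SVR)   # epsilon-SVR, the regression variant
svm.setKernel(cv2.ml.SVM_RBF)
svm.setC(1.0)
svm.setP(0.1)                     # epsilon-tube width

# X: float32 feature matrix, y: float32 scores
# svm.train(X, cv2.ml.ROW_SAMPLE, y)
# _, predictions = svm.predict(X)
```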


For demo purposes, I wrapped my model in a Flask API and deployed it to Heroku. It’s reasonably fast once the Heroku dyno has spun up. I’m not storing any images server-side.
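The endpoint boils down to something like the sketch below; the route, the upload field name, and the score_image helper are illustrative rather than the exact deployed code, and it reuses the pieces sketched earlier in this post.

```python
import cv2
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

def score_image(image):
    # Chain the earlier sketches: saliency map -> SPSM vector -> model score.
    vec = spsm_vector(saliency_map(image))
    return float(model.predict(vec.reshape(1, -1))[0])

@app.route("/score", methods=["POST"])
def score():
    # Decode the upload directly from memory; nothing is written to disk.
    data = np.frombuffer(request.files["image"].read(), dtype=np.uint8)
    image = cv2.imdecode(data, cv2.IMREAD_COLOR)
    if image is None:
        return jsonify(error="could not decode image"), 400
    return jsonify(score=score_image(image))
```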


The API, the wrapper app, and the original notebook are all available on my GitHub here.


There is probably another interesting project somewhere in this API that I haven’t thought of yet. Picking good pictures from groups of bursts? Sifting through stock photos to find the best? Let me know if you make something interesting or have any ideas.