The Most Unique Snowflake – Cloudera Blog



Okay, I admit, the title is a bit click-baity, but it does hold some truth! I spent the holidays up in the mountains, and if you live in the northern hemisphere like me, you know that means I spent the holidays either celebrating or cursing the snow. When I was a kid, during this time of year we would always do an art project making snowflakes. We would bust out the scissors, glue, paper, string, and glitter, and go to work. At some point, the teacher would undoubtedly pull out the big guns and blow our minds with the fact that every snowflake in the entire world for all of time is different and unique (people just love to oversell unimpressive snowflake features).

Now that I’m a grown, mature adult who has everything figured out (pause for laughter), I’ve started to wonder about the uniqueness of snowflakes. We say they’re all unique, but some must be more unique than others. Is there some way that we could quantify the uniqueness of snowflakes and thus find the most unique snowflake?

Surely with modern ML technology, a task like this should not only be possible, but dare I say, trivial? It probably sounds like a novel idea to combine snowflakes with ML, but it’s about time someone did. At Cloudera, we provide our customers with an extensive library of prebuilt data science projects (complete with out-of-the-box models and apps) called Applied ML Prototypes (AMPs) to help them move the starting point of their project closer to the finish line.

One of my favorite things about AMPs is that they’re totally open source, meaning anyone can use any part of them to do whatever they want. Yes, they’re full ML solutions that are ready to deploy with a single click in Cloudera Machine Learning (CML), but they can also be repurposed for use in other projects. AMPs are developed by ML research engineers at Cloudera’s Fast Forward Labs, and as a result they’re a great source for ML best practices and code snippets. They are yet another tool in the data scientist’s toolbox that can be used to make life easier and help deliver projects faster.

Launch the AMP

In this blog we’ll dig into how the Deep Learning for Image Analysis AMP can be reused to find snowflakes that are less similar to one another. If you are a Cloudera customer and have access to CML or Cloudera Data Science Workbench (CDSW), you can start out by deploying the Deep Learning for Image Analysis AMP from the “AMPs” tab.

If you don’t have access to CDSW or CML, the AMP GitHub repo has a README with instructions for getting up and running in any environment.

Data Acquisition

Once you have the AMP up and running, we can get started from there. For the most part, we can reuse parts of the existing code. However, because we’re only interested in comparing snowflakes, we need to bring our own dataset consisting solely of snowflakes, and a lot of them.

It turns out that there aren’t very many publicly available datasets of snowflake images. This wasn’t a big surprise, as taking images of individual snowflakes is a manually intensive process with a relatively minimal return. However, I did find one good dataset from Eastern Indiana University that we’ll use in this tutorial.

You could go through and download each image from the website individually or use some other utility, but I opted to put together a quick notebook to download and store the images in the project directory. You’ll need to place it in the /notebooks subdirectory and run it. The code parses out all of the image URLs from the linked web pages that contain images of snowflakes and downloads the images. It will create a new subdirectory called snowflakes in /notebooks/images and populate this new folder with the snowflake images.
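The URL-parsing step described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook’s actual code: the class name, helper name, sample HTML, and example.com URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collects the absolute URL of every <img> tag found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the URL of the page itself
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return all image URLs referenced by the given HTML text."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.image_urls


# Hypothetical page content and URL, used only for illustration
sample_html = '<html><body><img src="flakes/a.jpg"><p>text</p><img src="/b.png"></body></html>'
urls = extract_image_urls(sample_html, "https://example.com/gallery/index.html")
print(urls)
```

In a real run, each collected URL could then be saved into the snowflakes folder with something like urllib.request.urlretrieve(url, local_path), where local_path is a file name of your choosing.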

Like any good data scientist, we should take some time to explore the dataset. You’ll notice that these images have a consistent format. They have very little color variation and a relatively constant background. A perfect playground for computer vision models.

Repurposing the AMP

Now that we have our data, and it appears to be reasonably suited for image analysis, let’s take a second to restate our goal. We want to quantify the uniqueness of an individual snowflake. According to its description, Deep Learning for Image Analysis is an AMP that “demonstrates how to build a scalable semantic search solution on a dataset of images.” Traditionally, semantic search is an NLP technique used to extract the contextual meaning of a search term, instead of just matching keywords. This AMP is unique in that it extends that concept to images instead of text to find images that are similar to one another.

The goal of this AMP is largely focused on educating users on how deep learning and semantic search work. Within the AMP there is a notebook located in /notebooks that is titled Semantic Image Search Tutorial. It offers a practical implementation guide for two of the main techniques underlying the overall solution: feature extraction and semantic similarity search. This notebook will be the foundation for our snowflake analysis. Go ahead and open it and run the entire notebook (because it takes a little while), and then we’ll take a look at what it contains.

The notebook is broken down into three main sections:

  1. A conceptual overview of semantic image search
  2. An explanation of extracting features with CNNs and demonstration code
  3. An explanation of similarity search with Facebook’s AI Similarity Search (FAISS) and demonstration code

Notebook Section 1

The first section contains background information on how the end-to-end process of semantic search works. There is no executable code in this section, so there is nothing for us to run or change, but if time permits and the topics are new to you, you should take the time to read it.

Notebook Section 2

Section 2 is where we’ll start to make our changes. In the first cell with executable code, we need to set the variable ICONIC_PATH equal to our new snowflake folder, so change

ICONIC_PATH = "../app/frontend/build/assets/semsearch/datasets/iconic200/"

to

ICONIC_PATH = "./images/snowflakes"

Now run this cell and the next one. You should see an image of a snowflake displayed where before there was an image of a car. The notebook will now use only our snowflake images to perform semantic search.

From here, we can actually run the rest of the cells in section 2 and leave the code as is up until section 3, Similarity Search with FAISS. If you have time though, I’d highly recommend reading the rest of the section to gain an understanding of what is happening. A pre-trained neural network is loaded, feature maps are saved at each layer of the neural network, and the feature maps are visualized for comparison.

Notebook Section 3

Section 3 is where we’ll make most of our changes. Usually with semantic search, you are trying to find things that are very similar to one another, but for our use case we are interested in the opposite: we want to find the snowflakes in this dataset that are the least similar to the others, aka the most unique.

The intro to this section in the notebook does a great job of explaining how FAISS works. In summary, FAISS is a library that allows us to store the feature vectors in a highly optimized database, and then query that database with other feature vectors to retrieve the vector (or vectors) that are most similar. If you want to dig deeper into FAISS, you should read this post from Facebook’s engineering site.
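To make the store-then-query idea concrete, here is a pure-Python stand-in for what an exhaustive (“flat”) L2 index does conceptually. It is not FAISS itself, and the function names and toy vectors are hypothetical. Like FAISS’s IndexFlatL2, it reports squared L2 distances.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(stored, query, k):
    """Return (distances, ids) of the k stored vectors closest to the query."""
    # Score every stored vector, then keep the k smallest distances
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(stored))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy "database" of three 2-dimensional feature vectors
stored_vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
distances, ids = flat_search(stored_vectors, [0.9, 0.1], k=2)
print(ids, distances)
```

A real index does the same comparison far more efficiently and over vectors with thousands of dimensions, but the inputs and outputs have the same meaning: a query vector goes in, and the nearest stored vectors and their distances come out.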

One of the lessons that the original notebook focuses on is how the features output from the last convolutional layer are a much more abstract and generalized representation of what features the model deems important, especially when compared to the output of the first convolutional layer. In the spirit of KISS (keep it simple, stupid), we’ll apply this lesson to our analysis and only focus on the feature index of the last convolutional layer, b5c3, in order to find our most unique snowflake.

The code in the first 3 executable cells needs to be slightly altered. We still want to extract the features of each image and then create a FAISS index for the set of features, but we’ll only do this for the features from convolutional layer b5c3.

# Cell 1
def get_feature_maps(model, image_holder):
    # Add dimension and preprocess to scale pixel values for VGG
    images = np.asarray(image_holder)
    images = preprocess_input(images)
    # Get feature maps
    feature_maps = model.predict(images)
    # Reshape to flatten feature tensor into feature vectors
    feature_vector = feature_maps.reshape(feature_maps.shape[0], -1)
    return feature_vector


# Cell 2

all_b5c3_features = get_feature_maps(b5c3_model, iconic_imgs)


# Cell 3
import faiss

feature_dim = all_b5c3_features.shape[1]
b5c3_index = faiss.IndexFlatL2(feature_dim)
# Load the feature vectors into the index so it can be searched
b5c3_index.add(all_b5c3_features)



Here is where we’ll start deviating significantly from the source material. In the original notebook, the author created a function that allows users to select a specific image from each index; the function returns the most similar images from each index and displays those images. We’re going to use parts of that code in order to achieve our new goal, finding the most unique snowflake, but for the purposes of this tutorial you can delete the rest of the cells and we’ll go through what to add in their place.

First off, we’ll create a function that uses the index to retrieve the second most similar feature vector to the one that was selected (because the most similar would be the same image). There also happen to be a couple of duplicate images in the dataset, so if the second most similar feature vector is also an exact match, we’ll use the third most similar.


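def get_most_similar(index, query_vec):
    # The closest hit is the query image itself, so look at the second hit
    distances, indices = index.search(query_vec, 2)
    if distances[0][1] > 0:
        return distances[0][1], indices[0][1]
    else:
        # A distance of 0 means a duplicate image, so fall back to the third hit
        distances, indices = index.search(query_vec, 3)
        return distances[0][2], indices[0][2]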


From there it’s just a matter of iterating through each feature vector, searching for the most similar image that isn’t the exact same image, and storing the results in a list:


distance_list = []
for x in range(b5c3_index.ntotal):
    dist, indic = get_most_similar(b5c3_index, all_b5c3_features[x:x+1])
    distance_list.append([x, dist, indic])
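The same scoring idea can be tried on toy data without FAISS. The sketch below is a pure-Python illustration with hypothetical function names and made-up vectors. Unlike the function above, it skips any number of exact duplicates rather than just one, which is a deliberate simplification for the example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_distinct(vectors, i):
    """Distance and id of the closest vector to vectors[i] that is not an exact copy."""
    best = None
    for j, other in enumerate(vectors):
        if j == i:
            continue
        d = squared_l2(vectors[i], other)
        # Ignore exact duplicates (distance 0) and keep the smallest distance seen
        if d > 0 and (best is None or d < best[0]):
            best = (d, j)
    return best


def rank_by_uniqueness(vectors):
    """Rows of (id, distance, nearest id), most isolated vector first."""
    rows = []
    for i in range(len(vectors)):
        match = nearest_distinct(vectors, i)
        if match is not None:
            rows.append((i, match[0], match[1]))
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Vectors 0 and 1 are duplicates; vector 3 sits far away from everything else
toy = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
ranking = rank_by_uniqueness(toy)
print(ranking[0])
```

The first row of the ranking is the vector whose closest non-identical neighbor is farthest away, which is the definition of “most unique” used in this post.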

Now we’ll import pandas and convert the list to a dataframe. This gives us a dataframe containing a row for every feature vector in the FAISS index, with the index of the feature vector, the index of the feature vector that is most similar to it, and the L2 distance between the two feature vectors. We’re curious about the snowflakes that are most distant from their most similar snowflake, so we should end this cell by sorting the dataframe in descending order by the L2 distance.

import pandas as pd

df = pd.DataFrame(distance_list, columns=['index', 'L2', 'similar_index'])
df = df.sort_values('L2', ascending=False)

Let’s take a look at the results by printing out the dataframe, as well as displaying the L2 values in a box-and-whisker plot.



Amazing stuff. Not only did we find the indexes of the snowflakes that are the least similar to their most similar snowflake, but we have a handful of outliers made evident in the box-and-whisker plot, one of which stands alone.

To finish things up, we should see what these super unique snowflakes actually look like, so let’s display the top 3 most unique snowflakes in a column on the left, along with their most similar snowflake counterparts in the column on the right.
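For readers who want the outliers as numbers rather than dots on a plot, the usual box-plot rule can be reproduced by hand: a point counts as an upper outlier when it lies more than 1.5 times the interquartile range above the third quartile. The sketch below uses made-up L2 values and a hypothetical helper name. The quartile convention chosen here is an assumption and may differ slightly from the one your plotting library uses.

```python
import statistics


def upper_outliers(values, whisker=1.5):
    """Return the upper fence and the values that lie above it."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Anything beyond q3 + whisker * IQR is drawn as an outlier point
    fence = q3 + whisker * (q3 - q1)
    return fence, [v for v in values if v > fence]


# Made-up distances: seven ordinary values and one that stands alone
l2_values = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 40.0]
fence, outliers = upper_outliers(l2_values)
print(fence, outliers)
```

With the real data, passing df['L2'].tolist() to the same helper would list the distances that the plot shows as isolated points.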

fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(12, 12))
i = 0
for row in df.head(3).itertuples():
    # column 1: the unique snowflake (assumes iconic_imgs holds displayable images)
    ax[i][0].imshow(iconic_imgs[row.index])
    ax[i][0].set_title('Unique Rank: %s' % (i+1), fontsize=12, loc='center')
    ax[i][0].text(0.5, -0.1, 'index = %s' % row.index, size=11, ha='center', transform=ax[i][0].transAxes)
    # column 2: its most similar counterpart
    ax[i][1].imshow(iconic_imgs[row.similar_index])
    ax[i][1].set_title('L2 Distance: %s' % (row.L2), fontsize=12, loc='center')
    ax[i][1].text(0.5, -0.1, 'index = %s' % row.similar_index, size=11, ha='center', transform=ax[i][1].transAxes)
    i += 1
fig.subplots_adjust(wspace=-.56, hspace=.5)


This is why ML methods are so great. No one would ever look at that first snowflake and think, that is one super unique snowflake, but according to our analysis it is by far the most dissimilar to its next most similar snowflake.


Now, there are a multitude of tools that you could have used and ML methodologies that you could have leveraged to find a unique snowflake, including one of those overhyped ones. The nice thing about using Cloudera’s Applied ML Prototypes is that we were able to leverage an existing, fully built, and functional solution, and alter it for our own purposes, resulting in a significantly faster time to insight than had we started from scratch. That, ladies and gentlemen, is what AMPs are all about!

For your convenience, I’ve made the final resulting notebook available on GitHub here. If you’re interested in finishing projects faster (better question: who isn’t?) you should also take the time to check out what code in the other AMPs could be used for your current projects. Just select the AMP you’re interested in and you’ll see a link to view the source code on GitHub. After all, who wouldn’t be interested in, legally, starting a race closer to the finish line? Take a test drive to try AMPs for yourself.


