This notebook applies the D-Wave Advantage quantum annealer to a classic unsupervised learning problem: we frame k-means clustering as a quadratic unconstrained binary optimization (QUBO) problem and evaluate the results against classical, iterative algorithms.
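For reference, a QUBO seeks a binary vector that minimizes a quadratic objective; the general form (standard background, not specific to this notebook) is

$$\min_{x \in \{0,1\}^n} \; \sum_{i} Q_{i,i}\, x_i + \sum_{i<j} Q_{i,j}\, x_i x_j,$$

where the off-diagonal coefficients $Q_{i,j}$ play the role of the pairwise interaction weights constructed below.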
In this formulation of the problem, we allocate three qubits on the D-Wave system for every instance in the dataset, where each qubit represents membership in one cluster. If, for example, the minimum-energy anneal returned the state {1, 0, 0} (equivalently, spins {↑, ↓, ↓}) for a data point, this would indicate that the instance belongs exclusively to the first cluster. To ensure this exclusivity, we also set a constraint restricting the valid arrangements of qubits to those in which exactly one of the three membership qubits is up and the others are down (i.e. {1, 0, 0}, {0, 1, 0}, or {0, 0, 1}).
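As a minimal sketch of this encoding (assuming three clusters, and mirroring the choose_one_group constraint defined in the code below), the valid one-hot states can be checked in plain Python:

# One-hot encoding sketch: each data point gets one membership qubit per
# cluster; only states with exactly one qubit up are valid assignments.
VALID_STATES = {(1, 0, 0), (0, 1, 0), (0, 0, 1)}

def is_valid_assignment(qubits):
    """Return True when exactly one membership qubit is set."""
    return tuple(qubits) in VALID_STATES  # equivalently: sum(qubits) == 1

print(is_valid_assignment((1, 0, 0)))  # True: belongs to the first cluster
print(is_valid_assignment((1, 1, 0)))  # False: ambiguous membership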
We then iterate over all pairs of instances in the dataset and compute the simple Euclidean distance for each pair. We use these (normalized) distances as the weights on the $Q_{i,j}$ quadratic interaction terms in the QUBO problem. Each distance $d$ is fed through $-\cos(\pi d)$ for same-cluster interaction terms and $-\tanh(d) \times 0.1$ for different-cluster interaction terms, so that similar instances grouped together are rewarded with low energy values and dissimilar pairs are penalized.
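These two functions appear later in the code as same_weight and diff_weight; the following standalone sketch (using the 0.1 scale factor from the 2-D example) shows how a normalized distance d in [0, 1] maps to interaction strengths:

import math

def same_cluster_weight(d):
    # -1 at d = 0 (strong reward for grouping nearby points together),
    # rising to +1 at d = 1 (penalizing distant points in the same cluster).
    return -math.cos(d * math.pi)

def different_cluster_weight(d, scale=0.1):
    # Grows more negative with distance: distant pairs are mildly
    # rewarded for landing in different clusters.
    return -math.tanh(d) * scale

for d in (0.0, 0.5, 1.0):
    print(f"d={d}: same={same_cluster_weight(d):+.3f}, diff={different_cluster_weight(d):+.3f}")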
You can find the associated write-up for this project at this link.
Conor McCormack, Fall 2020
EE 520: Quantum Information Processing
University of Southern California, Viterbi School of Engineering.
import math
import numpy as np
import dwavebinarycsp
import dwave.inspector
from dwave.system import EmbeddingComposite, DWaveSampler
from collections import defaultdict
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
Code adapted from: https://github.com/dwave-examples/clustering
class Coordinate:
def __init__(self, x, y):
self.x = x
self.y = y
# coordinate labels for groups red, green, and blue
label = "{0},{1}_".format(x, y)
self.r = label + "r"
self.g = label + "g"
self.b = label + "b"
def get_distance(coordinate_0, coordinate_1):
diff_x = coordinate_0.x - coordinate_1.x
diff_y = coordinate_0.y - coordinate_1.y
return math.sqrt(diff_x**2 + diff_y**2)
def get_max_distance(coordinates):
max_distance = 0
for i, coord0 in enumerate(coordinates[:-1]):
for coord1 in coordinates[i+1:]:
distance = get_distance(coord0, coord1)
max_distance = max(max_distance, distance)
return max_distance
def get_groupings(sample):
"""Grab selected items and group them by color"""
colored_points = defaultdict(list)
for label, bool_val in sample.items():
# Skip over items that were not selected
if not bool_val:
continue
        # Note: labels look like "<x_coord>,<y_coord>_<color>"
        # Skip auxiliary variables introduced by dwavebinarycsp.stitch
        if len(label.split("_")) < 2:
            continue
        coord, color = label.split("_")
coord_tuple = tuple(map(float, coord.split(",")))
colored_points[color].append(coord_tuple)
return dict(colored_points)
def visualize_groupings(groupings_dict, filename):
"""
Args:
groupings_dict: key is a color, value is a list of x-y coordinate tuples.
For example, {'r': [(0,1), (2,3)], 'b': [(8,3)]}
filename: name of the file to save plot in
"""
for color, points in groupings_dict.items():
# Ignore items that do not contain any coordinates
if not points:
continue
# Populate plot
point_style = color + "o"
plt.plot(*zip(*points), point_style)
plt.savefig(filename)
def visualize_scatterplot(x_y_tuples_list, filename):
"""Plotting out a list of x-y tuples
Args:
x_y_tuples_list: A list of x-y coordinate values. e.g. [(1,4), (3, 2)]
"""
plt.plot(*zip(*x_y_tuples_list), "o")
plt.savefig(filename)
def cluster_points(scattered_points, filename):
# Set up problem
# Note: max_distance gets used in division later on. Hence, the max(.., 1)
# is used to prevent a division by zero
coordinates = [Coordinate(x, y) for x, y in scattered_points]
max_distance = max(get_max_distance(coordinates), 1)
# Build constraints
csp = dwavebinarycsp.ConstraintSatisfactionProblem(dwavebinarycsp.BINARY)
# Apply constraint: coordinate can only be in one color group
choose_one_group = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
for coord in coordinates:
csp.add_constraint(choose_one_group, (coord.r, coord.g, coord.b))
# Build initial BQM
bqm = dwavebinarycsp.stitch(csp)
# Edit BQM to bias for close together points to share the same color
for i, coord0 in enumerate(coordinates[:-1]):
for coord1 in coordinates[i+1:]:
d = get_distance(coord0, coord1) / max_distance
same_weight = -math.cos(d*math.pi)
bqm.add_interaction(coord0.r, coord1.r, same_weight)
bqm.add_interaction(coord0.g, coord1.g, same_weight)
bqm.add_interaction(coord0.b, coord1.b, same_weight)
diff_weight = -math.tanh(d) * 0.1
bqm.add_interaction(coord0.r, coord1.b, diff_weight)
bqm.add_interaction(coord0.r, coord1.g, diff_weight)
bqm.add_interaction(coord0.b, coord1.r, diff_weight)
bqm.add_interaction(coord0.b, coord1.g, diff_weight)
bqm.add_interaction(coord0.g, coord1.r, diff_weight)
bqm.add_interaction(coord0.g, coord1.b, diff_weight)
# Submit problem to D-Wave sampler
sampler = EmbeddingComposite(DWaveSampler())
    # Note: we request 1000 anneal reads from the D-Wave system
sampleset = sampler.sample(bqm, chain_strength=4, num_reads=1000)
best_sample = sampleset.first.sample
# Open Inspector
dwave.inspector.show(bqm, sampleset)
# Visualize solution
groupings = get_groupings(best_sample)
visualize_groupings(groupings, filename)
# Print solution onto terminal
# Note: This is simply a more compact version of 'best_sample'
print(groupings)
We use numpy to randomly sample three points from each of three separate multivariate Gaussian distributions and run our clustering QUBO on the resulting data.
# Set up three different clusters of data points
covariance = [[3, 0], [0, 3]]
n_points = 3
x0, y0 = np.random.multivariate_normal([0, 0], covariance, n_points).T
x1, y1 = np.random.multivariate_normal([12, 14], covariance, n_points).T
x2, y2 = np.random.multivariate_normal([8, 3], covariance, n_points).T
xs = np.hstack([x0, x1, x2])
ys = np.hstack([y0, y1, y2])
xys = np.vstack([xs, ys]).T
scattered_points = list(map(tuple, xys))
visualize_scatterplot(scattered_points, "2d_points.png")
# Run clustering script with scattered_points
clustered_filename = "2d_points_clustered.png"
cluster_points(scattered_points, clustered_filename)
print("Your plots are saved to '{}' and '{}'.".format("2d_points.png",
clustered_filename))
from IPython.display import Image
Image(filename='2d_points.png')
Image(filename='2d_points_clustered.png')
import pandas as pd
iris_data = pd.read_csv('iris.data', names=['sepal length', 'sepal width', 'petal length', 'petal width', 'species'])
# Add species ID for convenience
species_ids = np.unique(iris_data['species'], return_inverse=True)[1]
iris_data['species id'] = species_ids
iris_data
We normalize each of the features to fit them to similar ranges. This matters because the most popular implementations of k-means fit clusters to data based on distance metrics; without normalization, any given feature may have an outsized influence on cluster formation.
from sklearn.preprocessing import StandardScaler
scaled = StandardScaler().fit_transform(iris_data.loc[:,:'petal width'])
scaled = pd.DataFrame(scaled, columns=['sepal length', 'sepal width', 'petal length', 'petal width'])
scaled['species id'] = iris_data['species id']
scaled['species'] = iris_data['species']
scaled['id'] = scaled.index
iris_data = scaled
iris_data
import plotly.express as px
fig = px.scatter_3d(iris_data, x='sepal length', y='sepal width', z='petal width', color='species')
fig.show()
fig = px.scatter_3d(iris_data, x='sepal length', y='sepal width', z='petal length', color='species')
fig.show()
fig = px.scatter_3d(iris_data, x='sepal length', y='petal length', z='petal width', color='species')
fig.show()
fig = px.scatter_3d(iris_data, x='sepal width', y='petal length', z='petal width', color='species')
fig.show()
fig = px.scatter_3d(iris_data, x='sepal width', y='sepal length', z='petal width', color="petal length", symbol="species",)
fig.show()
Note that k-means assigns cluster labels arbitrarily, so the integer labels need not line up with the species ids; see the label-alignment sketch after the accuracy score below.
from sklearn.cluster import KMeans
preds = KMeans(n_clusters=3).fit_predict(iris_data.loc[:,:'petal width'])
preds
iris_data['sklearn_preds'] = preds
iris_data
fig = px.scatter_3d(iris_data, x='sepal width', y='petal length', z='petal width', color='sklearn_preds')
fig.show()
from sklearn.metrics import accuracy_score
accuracy_score(iris_data['species id'], preds)
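Because the cluster indices are arbitrary, raw accuracy against species id can be misleadingly low when the labels come back permuted. The sketch below (an illustrative helper, not part of scikit-learn, assuming scipy is available) aligns predicted labels to species ids with the Hungarian algorithm before scoring:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import accuracy_score

def permutation_aligned_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one relabelings of the clusters."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    # counts[i, j] = number of points with true label i and predicted label j
    counts = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    # Hungarian algorithm: choose the relabeling that maximizes matches
    rows, cols = linear_sum_assignment(-counts)
    mapping = {c: r for r, c in zip(rows, cols)}
    return accuracy_score(y_true, [mapping[p] for p in y_pred])

permutation_aligned_accuracy(iris_data['species id'], preds)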
iris_data = iris_data.drop(['sklearn_preds'],axis=1)
class Iris:
def __init__(self, sepal_length, sepal_width, petal_length, petal_width, species_id, species, idx):
self.sepal_length = sepal_length
self.sepal_width = sepal_width
self.petal_length = petal_length
self.petal_width = petal_width
self.species_id = species_id
self.species = species
self.id = idx
label = f'{sepal_length},{sepal_width},{petal_length},{petal_width}_'
self.setosa = label + "set"
self.versicolor = label + "vers"
self.virginica = label + "virg"
# Compute Euclidean distance between flowers
def get_iris_distance(flower0, flower1):
diff_sl = flower0.sepal_length - flower1.sepal_length
diff_sw = flower0.sepal_width - flower1.sepal_width
diff_pl = flower0.petal_length - flower1.petal_length
diff_pw = flower0.petal_width - flower1.petal_width
    return math.sqrt(diff_sl**2 + diff_sw**2 + diff_pl**2 + diff_pw**2)
def get_max_iris_distance(coordinates):
max_distance = 0
for i, coord0 in enumerate(coordinates[:-1]):
for coord1 in coordinates[i+1:]:
distance = get_iris_distance(coord0, coord1)
max_distance = max(max_distance, distance)
return max_distance
def get_iris_groupings(sample):
"""Grab selected items and group them by color"""
colored_points = defaultdict(list)
for label, bool_val in sample.items():
# Skip over items that were not selected
if not bool_val:
continue
        # Parse selected items
        # Note: labels look like "<sep_len>,<sep_width>,<pet_len>,<pet_width>_<species>"
        # Skip auxiliary variables introduced by dwavebinarycsp.stitch
        if len(label.split("_")) < 2:
            continue
        coord, species = label.split("_")
coord_tuple = tuple(map(float, coord.split(",")))
colored_points[species].append(coord_tuple)
return dict(colored_points)
def cluster_iris(iris_data):
flower_list = list(map(tuple, list(iris_data.to_numpy())))
flowers = [Iris(sepal_length, sepal_width, petal_length, petal_width, species_id, species, idx) for sepal_length, sepal_width, petal_length, petal_width, species_id, species, idx in flower_list]
max_distance = max(get_max_iris_distance(flowers), 1)
# Build constraints
csp = dwavebinarycsp.ConstraintSatisfactionProblem(dwavebinarycsp.BINARY)
    # Apply constraint: each flower can belong to only one species group
choose_one_group = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
for iris in flowers:
csp.add_constraint(choose_one_group, (iris.setosa, iris.versicolor, iris.virginica))
# Build initial BQM
bqm = dwavebinarycsp.stitch(csp, min_classical_gap=3)
# Edit BQM to bias for close together points to share the same color
for i, flower0 in enumerate(flowers[:-1]):
for flower1 in flowers[i+1:]:
# Set up weight
d = get_iris_distance(flower0, flower1) / max_distance # rescale distance
same_weight = -math.cos(d*math.pi)
# Apply weights to BQM
bqm.add_interaction(flower0.setosa, flower1.setosa, same_weight)
bqm.add_interaction(flower0.versicolor, flower1.versicolor, same_weight)
bqm.add_interaction(flower0.virginica, flower1.virginica, same_weight)
diff_weight = -math.tanh(d) * 0.5
# Apply weights to BQM
bqm.add_interaction(flower0.setosa, flower1.virginica, diff_weight)
bqm.add_interaction(flower0.setosa, flower1.versicolor, diff_weight)
bqm.add_interaction(flower0.virginica, flower1.setosa, diff_weight)
bqm.add_interaction(flower0.virginica, flower1.versicolor, diff_weight)
bqm.add_interaction(flower0.versicolor, flower1.setosa, diff_weight)
bqm.add_interaction(flower0.versicolor, flower1.virginica, diff_weight)
# Submit problem to D-Wave sampler
sampler = EmbeddingComposite(DWaveSampler())
sampleset = sampler.sample(bqm, chain_strength=8, num_reads=1000)
best_sample = sampleset.first.sample
# Visualize graph problem
dwave.inspector.show(bqm, sampleset)
groupings = get_iris_groupings(best_sample)
    # Return the groupings
    # Note: this is simply a more compact version of 'best_sample'
    return groupings
test = iris_data.loc[:3, :].append(iris_data.loc[60:63, :]).append(iris_data.loc[120:123, :])
test
groups = cluster_iris(test)
results = pd.DataFrame()
for species, members in groups.items():
    for flower in members:
        results = results.append({'sepal length': flower[0], 'sepal width': flower[1], 'petal length': flower[2], 'petal width': flower[3], 'cluster': species}, ignore_index=True)
results
merged = results.merge(test, how="inner", on=["petal length", "petal width", "sepal length", "sepal width"])
merged = merged.replace('set',0).replace('vers',1).replace('virg',2)
merged
accuracy_score(merged['species id'], merged['cluster'])
fig = px.scatter_3d(results, x='sepal width', y='sepal length', z='petal width', color="cluster",)
fig.show()
fig = px.scatter_3d(results, x='sepal width', y='sepal length', z='petal length', color="cluster",)
fig.show()
fig = px.scatter_3d(results, x='sepal length', y='petal width', z='petal length', color="cluster",)
fig.show()
fig = px.scatter_3d(results, x='sepal width', y='petal width', z='petal length', color="cluster",)
fig.show()
Will selecting important features allow us to form larger clusters with greater stability?
class TinyIris:
def __init__(self, sepal_length, petal_width):
self.sepal_length = sepal_length
self.petal_width = petal_width
label = f'{sepal_length},{petal_width}_'
self.setosa = label + "set"
self.versicolor = label + "vers"
self.virginica = label + "virg"
# Compute Euclidean distance between flowers
def get_tiny_iris_distance(flower0, flower1):
diff_sl = flower0.sepal_length - flower1.sepal_length
diff_pw = flower0.petal_width - flower1.petal_width
    return math.sqrt(diff_sl**2 + diff_pw**2)
def get_max_tiny_iris_distance(coordinates):
max_distance = 0
for i, coord0 in enumerate(coordinates[:-1]):
for coord1 in coordinates[i+1:]:
distance = get_tiny_iris_distance(coord0, coord1)
max_distance = max(max_distance, distance)
return max_distance
def cluster_tiny_iris(iris_data):
flower_list = list(map(tuple, list(iris_data.to_numpy())))
flowers = [TinyIris(sepal_length, petal_width) for sepal_length, sepal_width, petal_length, petal_width, species_id, species, idx in flower_list]
    max_distance = max(get_max_tiny_iris_distance(flowers), 1)
# Build constraints
csp = dwavebinarycsp.ConstraintSatisfactionProblem(dwavebinarycsp.BINARY)
    # Apply constraint: each flower can belong to only one species group
choose_one_group = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
for iris in flowers:
csp.add_constraint(choose_one_group, (iris.setosa, iris.versicolor, iris.virginica))
# Build initial BQM
bqm = dwavebinarycsp.stitch(csp, min_classical_gap=3)
# Edit BQM to bias for close together points to share the same color
for i, flower0 in enumerate(flowers[:-1]):
for flower1 in flowers[i+1:]:
# Set up weight
d = get_tiny_iris_distance(flower0, flower1) / max_distance # rescale distance
same_weight = -math.cos(d*math.pi)
# Apply weights to BQM
bqm.add_interaction(flower0.setosa, flower1.setosa, same_weight)
bqm.add_interaction(flower0.versicolor, flower1.versicolor, same_weight)
bqm.add_interaction(flower0.virginica, flower1.virginica, same_weight)
diff_weight = -math.tanh(d) * 0.5
# Apply weights to BQM
bqm.add_interaction(flower0.setosa, flower1.virginica, diff_weight)
bqm.add_interaction(flower0.setosa, flower1.versicolor, diff_weight)
bqm.add_interaction(flower0.virginica, flower1.setosa, diff_weight)
bqm.add_interaction(flower0.virginica, flower1.versicolor, diff_weight)
bqm.add_interaction(flower0.versicolor, flower1.setosa, diff_weight)
bqm.add_interaction(flower0.versicolor, flower1.virginica, diff_weight)
# Submit problem to D-Wave sampler
sampler = EmbeddingComposite(DWaveSampler())
sampleset = sampler.sample(bqm, chain_strength=8, num_reads=1000)
best_sample = sampleset.first.sample
# Visualize graph problem
dwave.inspector.show(bqm, sampleset)
groupings = get_iris_groupings(best_sample)
    # Return the groupings
    # Note: this is simply a more compact version of 'best_sample'
    return groupings
tiny_test = iris_data.loc[:5, :].append(iris_data.loc[60:65, :]).append(iris_data.loc[120:125, :])
tiny_test
tiny_group = cluster_tiny_iris(tiny_test)
tiny_results = pd.DataFrame()
for species, members in tiny_group.items():
    for flower in members:
        tiny_results = tiny_results.append({'sepal length': flower[0], 'petal width': flower[1], 'cluster': species}, ignore_index=True)
tiny_results = tiny_results.drop_duplicates()
tiny_results
tiny_merged = tiny_results.merge(tiny_test, how="left", on=["petal width", "sepal length"])
tiny_merged = tiny_merged.replace('set',0).replace('vers',1).replace('virg',2)
tiny_merged
accuracy_score(tiny_merged['species id'], tiny_merged['cluster'])