Quantum Annealing Cluster Analysis

An application of the D-Wave Advantage Quantum Annealer to a classic unsupervised learning problem. We frame k-means clustering as a quadratic unconstrained binary optimization (QUBO) problem and evaluate the results against classic, iterative algorithms.

In this formulation of the problem, we allocate three qubits on the D-Wave system for every instance in the dataset, where each qubit represents membership in one cluster. If, for example, the minimum-energy anneal returned the states {1, 0, 0} (equivalently {↑, ↓, ↓}) for a data point, this would indicate that the instance belongs exclusively to the first cluster. To ensure this exclusivity, we also constrain the valid arrangements of qubits so that exactly one of the three membership qubits is up and the others are down (i.e. {1, 0, 0}, {0, 1, 0}, or {0, 0, 1}).
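As a minimal sketch of the one-hot membership encoding described above, the following checks whether a triple of membership bits satisfies the exactly-one constraint and, if so, which cluster it selects. The names `VALID_STATES`, `is_valid_membership`, and `decode_cluster` are hypothetical helpers for illustration and are not part of the notebook.

```python
# Hypothetical helpers illustrating the one-hot membership encoding.
VALID_STATES = {(1, 0, 0), (0, 1, 0), (0, 0, 1)}


def is_valid_membership(bits):
    """Return True if exactly one of the three membership bits is set."""
    return tuple(bits) in VALID_STATES


def decode_cluster(bits):
    """Return the index of the selected cluster, or None if the triple is invalid."""
    if not is_valid_membership(bits):
        return None
    return list(bits).index(1)


print(decode_cluster((1, 0, 0)))  # first cluster (index 0)
print(decode_cluster((1, 1, 0)))  # invalid: two bits set, so None
```

A check of this kind is useful when reading anneal results, since a returned sample is not guaranteed to satisfy the constraint for every data point.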

We then iterate over all pairs of instances in the dataset and calculate the Euclidean distance for each pair. We use these distances as weights in the a_{i,j} quadratic interaction terms of the QUBO problem. Each distance d (rescaled by the maximum pairwise distance) is fed through -cos(d * π) for same-cluster interaction terms and -tanh(d) * 0.1 for different-cluster interaction terms, so that similar instances sharing a cluster are rewarded with low energy values and dissimilar pairs are penalized.
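To make the two weight functions concrete, here is a small sketch that evaluates them at a few rescaled distances. The function names `same_cluster_weight` and `different_cluster_weight` are hypothetical; the formulas are the ones stated above, with d assumed to lie in [0, 1].

```python
import math


def same_cluster_weight(d):
    """Weight for a pair placed in the same cluster; d is a distance rescaled to [0, 1]."""
    return -math.cos(d * math.pi)


def different_cluster_weight(d, scale=0.1):
    """Weight for a pair placed in different clusters; scale=0.1 follows the text above."""
    return -math.tanh(d) * scale


# Evaluate both weights for a very close pair, a mid-range pair, and the most distant pair
for d in (0.0, 0.5, 1.0):
    print(d, same_cluster_weight(d), different_cluster_weight(d))
```

The closest pairs (d near 0) get a same-cluster weight near -1, which lowers the energy when they share a cluster, while the most distant pair (d = 1) gets +1, which penalizes sharing a cluster. The different-cluster weight is never positive and grows more negative with distance.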

You can find the associated write-up for this project at this link.

Conor McCormack, Fall 2020
EE 520: Quantum Information Processing
University of Southern California, Viterbi School of Engineering.

In [1]:
import math
import numpy as np
import dwavebinarycsp
import dwave.inspector
from dwave.system import EmbeddingComposite, DWaveSampler
from collections import defaultdict
In [2]:
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt

Clustering two-dimensional data with ConstraintSatisfactionProblem

Code adapted from: https://github.com/dwave-examples/clustering

Define data structures and distances

In [3]:
class Coordinate:
    def __init__(self, x, y):
        self.x = x
        self.y = y

        # coordinate labels for groups red, green, and blue
        label = "{0},{1}_".format(x, y)
        self.r = label + "r"
        self.g = label + "g"
        self.b = label + "b"

        

def get_distance(coordinate_0, coordinate_1):
    diff_x = coordinate_0.x - coordinate_1.x
    diff_y = coordinate_0.y - coordinate_1.y

    return math.sqrt(diff_x**2 + diff_y**2)


def get_max_distance(coordinates):
    max_distance = 0
    for i, coord0 in enumerate(coordinates[:-1]):
        for coord1 in coordinates[i+1:]:
            distance = get_distance(coord0, coord1)
            max_distance = max(max_distance, distance)

    return max_distance

Functions for creating plots, saving to pngs

In [4]:
def get_groupings(sample):
    """Grab selected items and group them by color"""
    colored_points = defaultdict(list)

    for label, bool_val in sample.items():
        # Skip over items that were not selected
        if not bool_val:
            continue

        # Note: labels look like "<x_coord>,<y_coord>_<color>"
        coord, color = label.split("_")
        coord_tuple = tuple(map(float, coord.split(",")))
        colored_points[color].append(coord_tuple)

    return dict(colored_points)


def visualize_groupings(groupings_dict, filename):
    """
    Args:
        groupings_dict: key is a color, value is a list of x-y coordinate tuples.
          For example, {'r': [(0,1), (2,3)], 'b': [(8,3)]}
        filename: name of the file to save plot in
    """
    for color, points in groupings_dict.items():
        # Ignore items that do not contain any coordinates
        if not points:
            continue

        # Populate plot
        point_style = color + "o"
        plt.plot(*zip(*points), point_style)

    plt.savefig(filename)


def visualize_scatterplot(x_y_tuples_list, filename):
    """Plotting out a list of x-y tuples

    Args:
        x_y_tuples_list: A list of x-y coordinate values. e.g. [(1,4), (3, 2)]
    """
    plt.plot(*zip(*x_y_tuples_list), "o")
    plt.savefig(filename)

Cluster Analysis: problem composition, sampling, and results

In [5]:
def cluster_points(scattered_points, filename):
    # Set up problem
    # Note: max_distance gets used in division later on. Hence, the max(.., 1)
    #   is used to prevent a division by zero
    coordinates = [Coordinate(x, y) for x, y in scattered_points]
    max_distance = max(get_max_distance(coordinates), 1)

    # Build constraints
    csp = dwavebinarycsp.ConstraintSatisfactionProblem(dwavebinarycsp.BINARY)

    # Apply constraint: coordinate can only be in one color group
    choose_one_group = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
    for coord in coordinates:
        csp.add_constraint(choose_one_group, (coord.r, coord.g, coord.b))

    # Build initial BQM
    bqm = dwavebinarycsp.stitch(csp)

    # Edit BQM to bias for close together points to share the same color
    for i, coord0 in enumerate(coordinates[:-1]):
        for coord1 in coordinates[i+1:]:
            d = get_distance(coord0, coord1) / max_distance 
            same_weight = -math.cos(d*math.pi)

            bqm.add_interaction(coord0.r, coord1.r, same_weight)
            bqm.add_interaction(coord0.g, coord1.g, same_weight)
            bqm.add_interaction(coord0.b, coord1.b, same_weight)
            
            diff_weight = -math.tanh(d) * 0.1
            bqm.add_interaction(coord0.r, coord1.b, diff_weight)
            bqm.add_interaction(coord0.r, coord1.g, diff_weight)
            bqm.add_interaction(coord0.b, coord1.r, diff_weight)
            bqm.add_interaction(coord0.b, coord1.g, diff_weight)
            bqm.add_interaction(coord0.g, coord1.r, diff_weight)
            bqm.add_interaction(coord0.g, coord1.b, diff_weight)
            
    # Submit problem to D-Wave sampler
    sampler = EmbeddingComposite(DWaveSampler())
    
    # Note: we request 1000 anneal reads from the D-Wave system
    sampleset = sampler.sample(bqm, chain_strength=4, num_reads=1000)
    best_sample = sampleset.first.sample

    # Open Inspector
    dwave.inspector.show(bqm, sampleset)

    # Visualize solution
    groupings = get_groupings(best_sample)
    visualize_groupings(groupings, filename)

    # Print solution onto terminal
    # Note: This is simply a more compact version of 'best_sample'
    print(groupings)

Multivariate Gaussians

We use NumPy to randomly sample three points from each of three separate multivariate Gaussian distributions and run our clustering QUBO on the resulting points.
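Because the samples are random, the points and the resulting clusters change on every run. As an optional sketch (an assumption for reproducibility, not what the notebook cell does), the same sampling can be done with a seeded NumPy generator; the seed value and the names `rng`, `means`, and `points` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # the seed value is an arbitrary choice
covariance = [[3, 0], [0, 3]]
means = [[0, 0], [12, 14], [8, 3]]  # the three cluster centres used in the notebook
n_points = 3

# Draw n_points samples around each centre and stack them into one (9, 2) array
samples = [rng.multivariate_normal(mean, covariance, n_points) for mean in means]
points = np.vstack(samples)
scattered_points = list(map(tuple, points))

print(points.shape)
```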

In [6]:
# Set up three different clusters of data points
covariance = [[3, 0], [0, 3]]
n_points = 3
x0, y0 = np.random.multivariate_normal([0, 0], covariance, n_points).T
x1, y1 = np.random.multivariate_normal([12, 14], covariance, n_points).T
x2, y2 = np.random.multivariate_normal([8, 3], covariance, n_points).T

xs = np.hstack([x0, x1, x2])
ys = np.hstack([y0, y1, y2])
xys = np.vstack([xs, ys]).T
scattered_points = list(map(tuple, xys))

visualize_scatterplot(scattered_points, "2d_points.png")

# Run clustering script with scattered_points
clustered_filename = "2d_points_clustered.png"
cluster_points(scattered_points, clustered_filename)

print("Your plots are saved to '{}' and '{}'.".format("2d_points.png",
                                                 clustered_filename))
{'b': [(0.4626375962932738, 1.3760697696607773), (10.163755324355211, 1.8015849070958372), (10.414026497304949, 4.101929224375383), (2.1086567835770555, 0.6087152134416797), (2.8045628349746634, -0.9004171809911884), (7.994266041929519, 4.142323453056345)], 'r': [(10.617146844831872, 13.596855300507102), (11.45672394272729, 12.492456505531585), (11.841340244210386, 16.238727485448855)]}
Your plots are saved to '2d_points.png' and '2d_points_clustered.png'.

Results

In [7]:
from IPython.display import Image
In [8]:
Image(filename='2d_points.png') 
Out[8]:
In [9]:
Image(filename='2d_points_clustered.png') 
Out[9]:

Iris Dataset

In [10]:
import pandas as pd
In [11]:
iris_data = pd.read_csv('iris.data', names=['sepal length', 'sepal width', 'petal length', 'petal width', 'species'])

# Add species ID for convenience
unique = np.unique(iris_data['species'], return_inverse=True)[1]
iris_data['species id'] = unique
iris_data
Out[11]:
     sepal length  sepal width  petal length  petal width  species         species id
0    5.1           3.5          1.4           0.2          Iris-setosa     0
1    4.9           3.0          1.4           0.2          Iris-setosa     0
2    4.7           3.2          1.3           0.2          Iris-setosa     0
3    4.6           3.1          1.5           0.2          Iris-setosa     0
4    5.0           3.6          1.4           0.2          Iris-setosa     0
...  ...           ...          ...           ...          ...             ...
145  6.7           3.0          5.2           2.3          Iris-virginica  2
146  6.3           2.5          5.0           1.9          Iris-virginica  2
147  6.5           3.0          5.2           2.0          Iris-virginica  2
148  6.2           3.4          5.4           2.3          Iris-virginica  2
149  5.9           3.0          5.1           1.8          Iris-virginica  2

150 rows × 6 columns

Data normalization

We standardize each feature to zero mean and unit variance so that all features span similar ranges. This matters because k-means implementations typically fit clusters using distance metrics; without normalization, a feature with a larger numeric range can have an outsized influence on cluster formation.
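As an illustration of what this normalization does (the notebook itself uses scikit-learn's StandardScaler in the next cells), the sketch below standardizes a toy feature matrix by hand with NumPy. The values in `X` are made up for the example.

```python
import numpy as np

# Toy feature matrix (made-up values): column 0 has a much larger range than column 1
X = np.array([[100.0, 0.1],
              [200.0, 0.2],
              [300.0, 0.3]])

# Standardize each column to zero mean and unit variance, using the
# population standard deviation (ddof=0), as StandardScaler does
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After scaling, both columns carry the same values, so neither dominates a Euclidean distance computed across them.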

In [12]:
from sklearn.preprocessing import StandardScaler
In [13]:
scaled = StandardScaler().fit_transform(iris_data.loc[:,:'petal width'])
scaled = pd.DataFrame(scaled, columns=['sepal length', 'sepal width', 'petal length', 'petal width'])
scaled['species id'] = iris_data['species id']
scaled['species'] = iris_data['species']
scaled['id'] = scaled.index
iris_data = scaled
iris_data
Out[13]:
     sepal length  sepal width  petal length  petal width  species id  species         id
0    -0.900681     1.032057     -1.341272     -1.312977    0           Iris-setosa     0
1    -1.143017     -0.124958    -1.341272     -1.312977    0           Iris-setosa     1
2    -1.385353     0.337848     -1.398138     -1.312977    0           Iris-setosa     2
3    -1.506521     0.106445     -1.284407     -1.312977    0           Iris-setosa     3
4    -1.021849     1.263460     -1.341272     -1.312977    0           Iris-setosa     4
...  ...           ...          ...           ...          ...         ...             ...
145  1.038005      -0.124958    0.819624      1.447956     2           Iris-virginica  145
146  0.553333      -1.281972    0.705893      0.922064     2           Iris-virginica  146
147  0.795669      -0.124958    0.819624      1.053537     2           Iris-virginica  147
148  0.432165      0.800654     0.933356      1.447956     2           Iris-virginica  148
149  0.068662      -0.124958    0.762759      0.790591     2           Iris-virginica  149

150 rows × 7 columns

In [14]:
import plotly.express as px

Visualizing the Iris Dataset

In [15]:
fig = px.scatter_3d(iris_data, x='sepal length', y='sepal width', z='petal width', color='species')
fig.show()
In [18]:
fig = px.scatter_3d(iris_data, x='sepal length', y='sepal width', z='petal length', color='species')
fig.show()
In [19]:
fig = px.scatter_3d(iris_data, x='sepal length', y='petal length', z='petal width', color='species')
fig.show()
In [20]:
fig = px.scatter_3d(iris_data, x='sepal width', y='petal length', z='petal width', color='species')
fig.show()
In [21]:
fig = px.scatter_3d(iris_data, x='sepal width', y='sepal length', z='petal width', color="petal length", symbol="species",)
fig.show()
[Figure: 3D scatter plot of sepal width, sepal length, and petal width. Symbol legend: species (Iris-setosa, Iris-versicolor, Iris-virginica). Color scale: petal length.]

Classical, iterative version of KMeans from sklearn

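Note that k-means assigns cluster labels arbitrarily, so the cluster ids below need not match the species ids. The raw accuracy score computed later in this section is affected by this labeling.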
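One way to compare arbitrary cluster ids against known classes is to try every relabeling of the clusters and keep the best agreement. The sketch below does this by brute force over permutations, which is cheap for three clusters. The function name `best_label_accuracy` and the example labels are hypothetical and not part of the notebook.

```python
from itertools import permutations


def best_label_accuracy(true_labels, cluster_labels, n_clusters=3):
    """Accuracy under the cluster-to-class relabeling that agrees best with true_labels."""
    best = 0.0
    for perm in permutations(range(n_clusters)):
        # perm[c] is the class id assigned to cluster id c under this relabeling
        matches = sum(perm[c] == t for t, c in zip(true_labels, cluster_labels))
        best = max(best, matches / len(true_labels))
    return best


true_labels = [0, 0, 1, 1, 2, 2]      # made-up example data
cluster_labels = [1, 1, 2, 2, 0, 0]   # same grouping, different ids
print(best_label_accuracy(true_labels, cluster_labels))
```

In this toy example a direct comparison of the two lists would score zero matches even though the grouping is identical. Permutation-invariant metrics such as `sklearn.metrics.adjusted_rand_score` are another option when the number of clusters is larger.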

In [16]:
from sklearn.cluster import KMeans
In [28]:
preds = KMeans(n_clusters=3).fit_predict(iris_data.loc[:,:'petal width'])
preds
Out[28]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2,
       0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)
In [31]:
iris_data['sklearn_preds'] = preds
iris_data
Out[31]:
     sepal length  sepal width  petal length  petal width  species id  species         id   sklearn_preds
0    -0.900681     1.032057     -1.341272     -1.312977    0           Iris-setosa     0    1
1    -1.143017     -0.124958    -1.341272     -1.312977    0           Iris-setosa     1    1
2    -1.385353     0.337848     -1.398138     -1.312977    0           Iris-setosa     2    1
3    -1.506521     0.106445     -1.284407     -1.312977    0           Iris-setosa     3    1
4    -1.021849     1.263460     -1.341272     -1.312977    0           Iris-setosa     4    1
...  ...           ...          ...           ...          ...         ...             ...  ...
145  1.038005      -0.124958    0.819624      1.447956     2           Iris-virginica  145  2
146  0.553333      -1.281972    0.705893      0.922064     2           Iris-virginica  146  0
147  0.795669      -0.124958    0.819624      1.053537     2           Iris-virginica  147  2
148  0.432165      0.800654     0.933356      1.447956     2           Iris-virginica  148  2
149  0.068662      -0.124958    0.762759      0.790591     2           Iris-virginica  149  0

150 rows × 8 columns

In [32]:
fig = px.scatter_3d(iris_data, x='sepal width', y='petal length', z='petal width', color='sklearn_preds')
fig.show()
In [24]:
from sklearn.metrics import accuracy_score, cluster
In [42]:
accuracy_score(iris_data['species id'], preds)
Out[42]:
0.24
In [49]:
iris_data = iris_data.drop(['sklearn_preds'],axis=1)

Mapping Clustering on the Iris Dataset as a QUBO

In [16]:
class Iris:
    def __init__(self, sepal_length, sepal_width, petal_length, petal_width, species_id, species, idx):
        self.sepal_length = sepal_length
        self.sepal_width = sepal_width
        self.petal_length = petal_length
        self.petal_width = petal_width
        self.species_id = species_id
        self.species = species
        self.id = idx
        
        label = f'{sepal_length},{sepal_width},{petal_length},{petal_width}_'
        self.setosa = label + "set"
        self.versicolor = label + "vers"
        self.virginica = label + "virg"
In [17]:
# Compute Euclidean distance between flowers
def get_iris_distance(flower0, flower1):
    diff_sl = flower0.sepal_length - flower1.sepal_length
    diff_sw = flower0.sepal_width - flower1.sepal_width
    diff_pl = flower0.petal_length - flower1.petal_length
    diff_pw = flower0.petal_width - flower1.petal_width
    
    return math.sqrt(diff_sl**2 + diff_sw**2 + diff_pl**2 + diff_pw **2)
In [18]:
def get_max_iris_distance(coordinates):
    max_distance = 0
    for i, coord0 in enumerate(coordinates[:-1]):
        for coord1 in coordinates[i+1:]:
            distance = get_iris_distance(coord0, coord1)
            max_distance = max(max_distance, distance)

    return max_distance
In [19]:
def get_iris_groupings(sample):
    """Grab selected variables and group them by species label"""
    colored_points = defaultdict(list)
    print(sample)  # debug output: the full sample

    for label, bool_val in sample.items():
        # Skip over items that were not selected
        if not bool_val:
            continue

        # Parse selected items
        # Note: labels look like "<sep_len>,<sep_width>,<pet_len>,<pet_width>_<species>"
        print(label)  # debug output: each selected variable
        # Skip auxiliary variables (e.g. "aux0"), which have no "_<species>" suffix
        if len(label.split("_")) < 2:
            continue
        coord, species = label.split("_")
        coord_tuple = tuple(map(float, coord.split(",")))
        colored_points[species].append(coord_tuple)

    return dict(colored_points)
In [68]:
def cluster_iris(iris_data):
    flower_list = list(map(tuple, list(iris_data.to_numpy())))
    flowers = [Iris(sepal_length, sepal_width, petal_length, petal_width, species_id, species, idx) for sepal_length, sepal_width, petal_length, petal_width, species_id, species, idx in flower_list]
    max_distance = max(get_max_iris_distance(flowers), 1)

    # Build constraints
    csp = dwavebinarycsp.ConstraintSatisfactionProblem(dwavebinarycsp.BINARY)

    # Apply constraint: coordinate can only be in one color group
    choose_one_group = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
    for iris in flowers:
        csp.add_constraint(choose_one_group, (iris.setosa, iris.versicolor, iris.virginica))

    # Build initial BQM
    bqm = dwavebinarycsp.stitch(csp, min_classical_gap=3)

    # Edit BQM to bias for close together points to share the same color
    for i, flower0 in enumerate(flowers[:-1]):
        for flower1 in flowers[i+1:]:
            # Set up weight
            d = get_iris_distance(flower0, flower1) / max_distance  # rescale distance
            same_weight = -math.cos(d*math.pi)

            # Apply weights to BQM
            bqm.add_interaction(flower0.setosa, flower1.setosa, same_weight)
            bqm.add_interaction(flower0.versicolor, flower1.versicolor, same_weight)
            bqm.add_interaction(flower0.virginica, flower1.virginica, same_weight)
            
            diff_weight = -math.tanh(d) * 0.5

            # Apply weights to BQM
            bqm.add_interaction(flower0.setosa, flower1.virginica, diff_weight)
            bqm.add_interaction(flower0.setosa, flower1.versicolor, diff_weight)
            bqm.add_interaction(flower0.virginica, flower1.setosa, diff_weight)
            bqm.add_interaction(flower0.virginica, flower1.versicolor, diff_weight)
            bqm.add_interaction(flower0.versicolor, flower1.setosa, diff_weight)
            bqm.add_interaction(flower0.versicolor, flower1.virginica, diff_weight)

    # Submit problem to D-Wave sampler
    sampler = EmbeddingComposite(DWaveSampler())
    sampleset = sampler.sample(bqm, chain_strength=8, num_reads=1000)
    best_sample = sampleset.first.sample

    # Visualize graph problem
    dwave.inspector.show(bqm, sampleset)

    groupings = get_iris_groupings(best_sample)

    # Return the groupings
    # Note: This is simply a more compact version of 'best_sample'
    return groupings
In [20]:
# Take four flowers from each species
test = pd.concat([iris_data.loc[:3], iris_data.loc[60:63], iris_data.loc[120:123]])
test
Out[20]:
sepal length sepal width petal length petal width species id species id
0 -0.900681 1.032057 -1.341272 -1.312977 0 Iris-setosa 0
1 -1.143017 -0.124958 -1.341272 -1.312977 0 Iris-setosa 1
2 -1.385353 0.337848 -1.398138 -1.312977 0 Iris-setosa 2
3 -1.506521 0.106445 -1.284407 -1.312977 0 Iris-setosa 3
60 -1.021849 -2.438987 -0.147093 -0.261193 1 Iris-versicolor 60
61 0.068662 -0.124958 0.250967 0.396172 1 Iris-versicolor 61
62 0.189830 -1.976181 0.137236 -0.261193 1 Iris-versicolor 62
63 0.310998 -0.356361 0.535296 0.264699 1 Iris-versicolor 63
120 1.280340 0.337848 1.103953 1.447956 2 Iris-virginica 120
121 -0.294842 -0.587764 0.649027 1.053537 2 Iris-virginica 121
122 2.249683 -0.587764 1.672610 1.053537 2 Iris-virginica 122
123 0.553333 -0.819166 0.649027 0.790591 2 Iris-virginica 123
In [21]:
groups = cluster_iris(test)
In [71]:
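# Collect one row per (flower, cluster) assignment, then build the DataFrame once
rows = []
for cluster, flowers in groups.items():
    for flower in flowers:
        rows.append({'sepal length': flower[0], 'sepal width': flower[1], 'petal length': flower[2], 'petal width': flower[3], 'cluster': cluster})
results = pd.DataFrame(rows, columns=['cluster', 'petal length', 'petal width', 'sepal length', 'sepal width'])

results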
Out[71]:
cluster petal length petal width sepal length sepal width
0 vers 0.649027 1.053537 -0.294842 -0.587764
1 vers 0.250967 0.396172 0.068662 -0.124958
2 vers 0.137236 -0.261193 0.189830 -1.976181
3 vers 0.535296 0.264699 0.310998 -0.356361
4 vers 0.649027 0.790591 0.553333 -0.819166
5 vers 1.103953 1.447956 1.280340 0.337848
6 vers 1.672610 1.053537 2.249683 -0.587764
7 set -1.341272 -1.312977 -0.900681 1.032057
8 set -0.147093 -0.261193 -1.021849 -2.438987
9 set -1.341272 -1.312977 -1.143017 -0.124958
10 set -1.398138 -1.312977 -1.385353 0.337848
11 set -1.284407 -1.312977 -1.506521 0.106445
12 virg -1.341272 -1.312977 -0.900681 1.032057
13 virg 1.103953 1.447956 1.280340 0.337848
In [98]:
merged = results.merge(test, how="inner", on=["petal length", "petal width", "sepal length", "sepal width"])
merged = merged.replace('set',0).replace('vers',1).replace('virg',2)
merged
Out[98]:
cluster petal length petal width sepal length sepal width species id species id
0 1 0.649027 1.053537 -0.294842 -0.587764 2 Iris-virginica 121
1 1 0.250967 0.396172 0.068662 -0.124958 1 Iris-versicolor 61
2 1 0.137236 -0.261193 0.189830 -1.976181 1 Iris-versicolor 62
3 1 0.535296 0.264699 0.310998 -0.356361 1 Iris-versicolor 63
4 1 0.649027 0.790591 0.553333 -0.819166 2 Iris-virginica 123
5 1 1.103953 1.447956 1.280340 0.337848 2 Iris-virginica 120
6 2 1.103953 1.447956 1.280340 0.337848 2 Iris-virginica 120
7 1 1.672610 1.053537 2.249683 -0.587764 2 Iris-virginica 122
8 0 -1.341272 -1.312977 -0.900681 1.032057 0 Iris-setosa 0
9 2 -1.341272 -1.312977 -0.900681 1.032057 0 Iris-setosa 0
10 0 -0.147093 -0.261193 -1.021849 -2.438987 1 Iris-versicolor 60
11 0 -1.341272 -1.312977 -1.143017 -0.124958 0 Iris-setosa 1
12 0 -1.398138 -1.312977 -1.385353 0.337848 0 Iris-setosa 2
13 0 -1.284407 -1.312977 -1.506521 0.106445 0 Iris-setosa 3
In [99]:
accuracy_score(merged['species id'], merged['cluster'])
Out[99]:
0.5714285714285714
In [74]:
fig = px.scatter_3d(results, x='sepal width', y='sepal length', z='petal width', color="cluster",)
fig.show()
In [75]:
fig = px.scatter_3d(results, x='sepal width', y='sepal length', z='petal length', color="cluster",)
fig.show()
In [76]:
fig = px.scatter_3d(results, x='sepal length', y='petal width', z='petal length', color="cluster",)
fig.show()
In [77]:
fig = px.scatter_3d(results, x='sepal width', y='petal width', z='petal length', color="cluster",)
fig.show()

Shrinking Iris Dataset to two dimensions

Here we keep only two features, sepal length and petal width. Will selecting important features allow us to form larger clusters with greater stability?

In [101]:
class TinyIris:
    def __init__(self, sepal_length, petal_width):
        self.sepal_length = sepal_length
        self.petal_width = petal_width
        
        label = f'{sepal_length},{petal_width}_'
        self.setosa = label + "set"
        self.versicolor = label + "vers"
        self.virginica = label + "virg"
In [102]:
# Compute Euclidean distance between flowers
def get_tiny_iris_distance(flower0, flower1):
    diff_sl = flower0.sepal_length - flower1.sepal_length
    diff_pw = flower0.petal_width - flower1.petal_width
    
    return math.sqrt(diff_sl**2 + diff_pw **2)
In [105]:
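# Note: this redefines get_max_iris_distance from the four-feature section above,
#   now using the two-feature distance
def get_max_iris_distance(coordinates):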
    max_distance = 0
    for i, coord0 in enumerate(coordinates[:-1]):
        for coord1 in coordinates[i+1:]:
            distance = get_tiny_iris_distance(coord0, coord1)
            max_distance = max(max_distance, distance)

    return max_distance
In [108]:
def cluster_tiny_iris(iris_data):
    flower_list = list(map(tuple, list(iris_data.to_numpy())))
    flowers = [TinyIris(sepal_length, petal_width) for sepal_length, sepal_width, petal_length, petal_width, species_id, species, idx in flower_list]
    max_distance = max(get_max_iris_distance(flowers), 1)

    # Build constraints
    csp = dwavebinarycsp.ConstraintSatisfactionProblem(dwavebinarycsp.BINARY)

    # Apply constraint: coordinate can only be in one color group
    choose_one_group = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
    for iris in flowers:
        csp.add_constraint(choose_one_group, (iris.setosa, iris.versicolor, iris.virginica))

    # Build initial BQM
    bqm = dwavebinarycsp.stitch(csp, min_classical_gap=3)

    # Edit BQM to bias for close together points to share the same color
    for i, flower0 in enumerate(flowers[:-1]):
        for flower1 in flowers[i+1:]:
            # Set up weight
            d = get_tiny_iris_distance(flower0, flower1) / max_distance  # rescale distance
            same_weight = -math.cos(d*math.pi)

            # Apply weights to BQM
            bqm.add_interaction(flower0.setosa, flower1.setosa, same_weight)
            bqm.add_interaction(flower0.versicolor, flower1.versicolor, same_weight)
            bqm.add_interaction(flower0.virginica, flower1.virginica, same_weight)
            
            diff_weight = -math.tanh(d) * 0.5

            # Apply weights to BQM
            bqm.add_interaction(flower0.setosa, flower1.virginica, diff_weight)
            bqm.add_interaction(flower0.setosa, flower1.versicolor, diff_weight)
            bqm.add_interaction(flower0.virginica, flower1.setosa, diff_weight)
            bqm.add_interaction(flower0.virginica, flower1.versicolor, diff_weight)
            bqm.add_interaction(flower0.versicolor, flower1.setosa, diff_weight)
            bqm.add_interaction(flower0.versicolor, flower1.virginica, diff_weight)

    # Submit problem to D-Wave sampler
    sampler = EmbeddingComposite(DWaveSampler())
    sampleset = sampler.sample(bqm, chain_strength=8, num_reads=1000)
    best_sample = sampleset.first.sample

    # Visualize graph problem
    dwave.inspector.show(bqm, sampleset)

    groupings = get_iris_groupings(best_sample)

    # Return the groupings
    # Note: This is simply a more compact version of 'best_sample'
    return groupings
In [110]:
# Take six flowers from each species
tiny_test = pd.concat([iris_data.loc[:5], iris_data.loc[60:65], iris_data.loc[120:125]])
tiny_test
Out[110]:
sepal length sepal width petal length petal width species id species id
0 -0.900681 1.032057 -1.341272 -1.312977 0 Iris-setosa 0
1 -1.143017 -0.124958 -1.341272 -1.312977 0 Iris-setosa 1
2 -1.385353 0.337848 -1.398138 -1.312977 0 Iris-setosa 2
3 -1.506521 0.106445 -1.284407 -1.312977 0 Iris-setosa 3
4 -1.021849 1.263460 -1.341272 -1.312977 0 Iris-setosa 4
5 -0.537178 1.957669 -1.170675 -1.050031 0 Iris-setosa 5
60 -1.021849 -2.438987 -0.147093 -0.261193 1 Iris-versicolor 60
61 0.068662 -0.124958 0.250967 0.396172 1 Iris-versicolor 61
62 0.189830 -1.976181 0.137236 -0.261193 1 Iris-versicolor 62
63 0.310998 -0.356361 0.535296 0.264699 1 Iris-versicolor 63
64 -0.294842 -0.356361 -0.090227 0.133226 1 Iris-versicolor 64
65 1.038005 0.106445 0.364699 0.264699 1 Iris-versicolor 65
120 1.280340 0.337848 1.103953 1.447956 2 Iris-virginica 120
121 -0.294842 -0.587764 0.649027 1.053537 2 Iris-virginica 121
122 2.249683 -0.587764 1.672610 1.053537 2 Iris-virginica 122
123 0.553333 -0.819166 0.649027 0.790591 2 Iris-virginica 123
124 1.038005 0.569251 1.103953 1.185010 2 Iris-virginica 124
125 1.643844 0.337848 1.274550 0.790591 2 Iris-virginica 125
In [111]:
tiny_group = cluster_tiny_iris(tiny_test)
{'-0.29484181807955234,0.13322594295296525_set': 1, '-0.29484181807955234,0.13322594295296525_vers': 1, '-0.29484181807955234,0.13322594295296525_virg': 0, '-0.29484181807955234,1.053536733088581_set': 1, '-0.29484181807955234,1.053536733088581_vers': 1, '-0.29484181807955234,1.053536733088581_virg': 1, '-0.537177558966854,-1.0500307872213979_set': 1, '-0.537177558966854,-1.0500307872213979_vers': 0, '-0.537177558966854,-1.0500307872213979_virg': 1, '-0.9006811702978088,-1.3129767272601454_set': 1, '-0.9006811702978088,-1.3129767272601454_vers': 0, '-0.9006811702978088,-1.3129767272601454_virg': 1, '-1.0218490407414595,-0.2611929671051558_set': 1, '-1.0218490407414595,-0.2611929671051558_vers': 0, '-1.0218490407414595,-0.2611929671051558_virg': 1, '-1.0218490407414595,-1.3129767272601454_set': 1, '-1.0218490407414595,-1.3129767272601454_vers': 0, '-1.0218490407414595,-1.3129767272601454_virg': 0, '-1.1430169111851105,-1.3129767272601454_set': 1, '-1.1430169111851105,-1.3129767272601454_vers': 0, '-1.1430169111851105,-1.3129767272601454_virg': 0, '-1.3853526520724133,-1.3129767272601454_set': 1, '-1.3853526520724133,-1.3129767272601454_vers': 0, '-1.3853526520724133,-1.3129767272601454_virg': 1, '-1.5065205225160652,-1.3129767272601454_set': 1, '-1.5065205225160652,-1.3129767272601454_vers': 0, '-1.5065205225160652,-1.3129767272601454_virg': 1, '0.06866179325140237,0.3961718829917126_set': 0, '0.06866179325140237,0.3961718829917126_vers': 1, '0.06866179325140237,0.3961718829917126_virg': 1, '0.18982966369505322,-0.2611929671051558_set': 0, '0.18982966369505322,-0.2611929671051558_vers': 0, '0.18982966369505322,-0.2611929671051558_virg': 1, '0.31099753413870407,0.2646989129723388_set': 0, '0.31099753413870407,0.2646989129723388_vers': 1, '0.31099753413870407,0.2646989129723388_virg': 1, '0.5533332750260068,0.7905907930498337_set': 0, '0.5533332750260068,0.7905907930498337_vers': 1, '0.5533332750260068,0.7905907930498337_virg': 1, 
'1.0380047568006125,0.2646989129723388_set': 0, '1.0380047568006125,0.2646989129723388_vers': 1, '1.0380047568006125,0.2646989129723388_virg': 1, '1.0380047568006125,1.1850097031079547_set': 0, '1.0380047568006125,1.1850097031079547_vers': 1, '1.0380047568006125,1.1850097031079547_virg': 1, '1.2803404976879151,1.4479556431467018_set': 0, '1.2803404976879151,1.4479556431467018_vers': 1, '1.2803404976879151,1.4479556431467018_virg': 1, '1.643844109018869,0.7905907930498337_set': 0, '1.643844109018869,0.7905907930498337_vers': 1, '1.643844109018869,0.7905907930498337_virg': 1, '2.2496834612371255,1.053536733088581_set': 1, '2.2496834612371255,1.053536733088581_vers': 1, '2.2496834612371255,1.053536733088581_virg': 0, 'aux0': 1, 'aux1': 1, 'aux10': 1, 'aux11': 0, 'aux12': 1, 'aux13': 1, 'aux14': 1, 'aux15': 0, 'aux16': 1, 'aux17': 0, 'aux18': 1, 'aux19': 0, 'aux2': 0, 'aux20': 0, 'aux21': 1, 'aux22': 0, 'aux23': 0, 'aux24': 0, 'aux25': 0, 'aux26': 1, 'aux27': 0, 'aux28': 0, 'aux29': 1, 'aux3': 1, 'aux30': 0, 'aux31': 0, 'aux32': 1, 'aux33': 0, 'aux34': 0, 'aux35': 0, 'aux4': 1, 'aux5': 1, 'aux6': 1, 'aux7': 0, 'aux8': 1, 'aux9': 1}
-0.29484181807955234,0.13322594295296525_set
-0.29484181807955234,0.13322594295296525_vers
-0.29484181807955234,1.053536733088581_set
-0.29484181807955234,1.053536733088581_vers
-0.29484181807955234,1.053536733088581_virg
-0.537177558966854,-1.0500307872213979_set
-0.537177558966854,-1.0500307872213979_virg
-0.9006811702978088,-1.3129767272601454_set
-0.9006811702978088,-1.3129767272601454_virg
-1.0218490407414595,-0.2611929671051558_set
-1.0218490407414595,-0.2611929671051558_virg
-1.0218490407414595,-1.3129767272601454_set
-1.1430169111851105,-1.3129767272601454_set
-1.3853526520724133,-1.3129767272601454_set
-1.3853526520724133,-1.3129767272601454_virg
-1.5065205225160652,-1.3129767272601454_set
-1.5065205225160652,-1.3129767272601454_virg
0.06866179325140237,0.3961718829917126_vers
0.06866179325140237,0.3961718829917126_virg
0.18982966369505322,-0.2611929671051558_virg
0.31099753413870407,0.2646989129723388_vers
0.31099753413870407,0.2646989129723388_virg
0.5533332750260068,0.7905907930498337_vers
0.5533332750260068,0.7905907930498337_virg
1.0380047568006125,0.2646989129723388_vers
1.0380047568006125,0.2646989129723388_virg
1.0380047568006125,1.1850097031079547_vers
1.0380047568006125,1.1850097031079547_virg
1.2803404976879151,1.4479556431467018_vers
1.2803404976879151,1.4479556431467018_virg
1.643844109018869,0.7905907930498337_vers
1.643844109018869,0.7905907930498337_virg
2.2496834612371255,1.053536733088581_set
2.2496834612371255,1.053536733088581_vers
aux0
aux1
aux10
aux12
aux13
aux14
aux16
aux18
aux21
aux26
aux29
aux3
aux32
aux4
aux5
aux6
aux8
aux9
In [120]:
# Collect one row per (flower, cluster) assignment, then build the DataFrame once
tiny_rows = []
for cluster, flowers in tiny_group.items():
    for flower in flowers:
        tiny_rows.append({'sepal length': flower[0], 'petal width': flower[1], 'cluster': cluster})
tiny_results = pd.DataFrame(tiny_rows, columns=['cluster', 'petal width', 'sepal length'])

tiny_results.drop_duplicates()
Out[120]:
cluster petal width sepal length
0 set 0.133226 -0.294842
1 set 1.053537 -0.294842
2 set -1.050031 -0.537178
3 set -1.312977 -0.900681
4 set -0.261193 -1.021849
5 set -1.312977 -1.021849
6 set -1.312977 -1.143017
7 set -1.312977 -1.385353
8 set -1.312977 -1.506521
9 set 1.053537 2.249683
10 vers 0.133226 -0.294842
11 vers 1.053537 -0.294842
12 vers 0.396172 0.068662
13 vers 0.264699 0.310998
14 vers 0.790591 0.553333
15 vers 0.264699 1.038005
16 vers 1.185010 1.038005
17 vers 1.447956 1.280340
18 vers 0.790591 1.643844
19 vers 1.053537 2.249683
20 virg 1.053537 -0.294842
21 virg -1.050031 -0.537178
22 virg -1.312977 -0.900681
23 virg -0.261193 -1.021849
24 virg -1.312977 -1.385353
25 virg -1.312977 -1.506521
26 virg 0.396172 0.068662
27 virg -0.261193 0.189830
28 virg 0.264699 0.310998
29 virg 0.790591 0.553333
30 virg 0.264699 1.038005
31 virg 1.185010 1.038005
32 virg 1.447956 1.280340
33 virg 0.790591 1.643844
In [121]:
tiny_merged = tiny_results.merge(tiny_test, how="left", on=["petal width", "sepal length"])
tiny_merged = tiny_merged.replace('set',0).replace('vers',1).replace('virg',2)
tiny_merged
Out[121]:
cluster petal width sepal length sepal width petal length species id species id
0 0 0.133226 -0.294842 -0.356361 -0.090227 1 Iris-versicolor 64
1 0 1.053537 -0.294842 -0.587764 0.649027 2 Iris-virginica 121
2 0 -1.050031 -0.537178 1.957669 -1.170675 0 Iris-setosa 5
3 0 -1.312977 -0.900681 1.032057 -1.341272 0 Iris-setosa 0
4 0 -0.261193 -1.021849 -2.438987 -0.147093 1 Iris-versicolor 60
5 0 -1.312977 -1.021849 1.263460 -1.341272 0 Iris-setosa 4
6 0 -1.312977 -1.143017 -0.124958 -1.341272 0 Iris-setosa 1
7 0 -1.312977 -1.385353 0.337848 -1.398138 0 Iris-setosa 2
8 0 -1.312977 -1.506521 0.106445 -1.284407 0 Iris-setosa 3
9 0 1.053537 2.249683 -0.587764 1.672610 2 Iris-virginica 122
10 1 0.133226 -0.294842 -0.356361 -0.090227 1 Iris-versicolor 64
11 1 1.053537 -0.294842 -0.587764 0.649027 2 Iris-virginica 121
12 1 0.396172 0.068662 -0.124958 0.250967 1 Iris-versicolor 61
13 1 0.264699 0.310998 -0.356361 0.535296 1 Iris-versicolor 63
14 1 0.790591 0.553333 -0.819166 0.649027 2 Iris-virginica 123
15 1 0.264699 1.038005 0.106445 0.364699 1 Iris-versicolor 65
16 1 1.185010 1.038005 0.569251 1.103953 2 Iris-virginica 124
17 1 1.447956 1.280340 0.337848 1.103953 2 Iris-virginica 120
18 1 0.790591 1.643844 0.337848 1.274550 2 Iris-virginica 125
19 1 1.053537 2.249683 -0.587764 1.672610 2 Iris-virginica 122
20 2 1.053537 -0.294842 -0.587764 0.649027 2 Iris-virginica 121
21 2 -1.050031 -0.537178 1.957669 -1.170675 0 Iris-setosa 5
22 2 -1.312977 -0.900681 1.032057 -1.341272 0 Iris-setosa 0
23 2 -0.261193 -1.021849 -2.438987 -0.147093 1 Iris-versicolor 60
24 2 -1.312977 -1.385353 0.337848 -1.398138 0 Iris-setosa 2
25 2 -1.312977 -1.506521 0.106445 -1.284407 0 Iris-setosa 3
26 2 0.396172 0.068662 -0.124958 0.250967 1 Iris-versicolor 61
27 2 -0.261193 0.189830 -1.976181 0.137236 1 Iris-versicolor 62
28 2 0.264699 0.310998 -0.356361 0.535296 1 Iris-versicolor 63
29 2 0.790591 0.553333 -0.819166 0.649027 2 Iris-virginica 123
30 2 0.264699 1.038005 0.106445 0.364699 1 Iris-versicolor 65
31 2 1.185010 1.038005 0.569251 1.103953 2 Iris-virginica 124
32 2 1.447956 1.280340 0.337848 1.103953 2 Iris-virginica 120
33 2 0.790591 1.643844 0.337848 1.274550 2 Iris-virginica 125
In [122]:
accuracy_score(tiny_merged['species id'], tiny_merged['cluster'])
Out[122]:
0.4411764705882353
In [ ]: