Tutorial: Data Collection

Following a set of idioms and using common utilities when running NISQy quantum experiments has several advantages:

  • It avoids duplication of effort for common tasks like data saving and loading.

  • It enables easy data sharing.

  • It reduces the cognitive load of onboarding onto a new experiment. The ‘science’ part is isolated from an idiomatic ‘infrastructure’ part.

  • It stays flexible. Idioms and conventions are looser than a strict framework, so you don’t need to do everything exactly as shown.

This notebook shows how to design the infrastructure to support a simple experiment.

[1]:
import os

import numpy as np
import sympy

import cirq
import recirq

Tasks

We organize our experiments around the concept of “tasks”. A task is a unit of work which consists of loading in input data, doing data processing or data collection, and saving results. Dividing your pipeline into tasks can be more of an art than a science. However, some rules of thumb can be observed:

  1. A task should be at least 30 seconds worth of work but less than ten minutes worth of work. Finer division of tasks can make your pipelines more composable, more resistant to failure, easier to restart from failure, and easier to parallelize. Coarser division of tasks can amortize the cost of input and output data serialization and deserialization.

  2. A task should be completely determined by a small-to-medium collection of primitive data type parameters. In fact, these parameters will represent instances of tasks and will act as “keys” in a database or on the filesystem.

Practically, a task consists of a TasknameTask (use your own name!) dataclass and a function which takes an instance of such a class as its argument, does the requisite data processing, and saves its results. Here, we define the ReadoutScanTask class with members that tell us exactly what data we want to collect.

[2]:
@recirq.json_serializable_dataclass(namespace='recirq.readout_scan',
                                    registry=recirq.Registry,
                                    frozen=True)
class ReadoutScanTask:
    """Scan over Ry(theta) angles from -pi/2 to 3pi/2 tracing out a sinusoid
    which is primarily affected by readout error.

    See Also:
        :py:func:`run_readout_scan`

    Attributes:
        dataset_id: A unique identifier for this dataset.
        device_name: The device to run on, by name.
        n_shots: The number of repetitions for each theta value.
        qubit: The qubit to benchmark.
        resolution_factor: We select the number of points in the linspace
            so that the special points: (-1/2, 0, 1/2, 1, 3/2) * pi are
            always included. The total number of theta evaluations
            is resolution_factor * 4 + 1.
    """
    dataset_id: str
    device_name: str
    n_shots: int
    qubit: cirq.GridQubit
    resolution_factor: int

    @property
    def fn(self):
        n_shots = _abbrev_n_shots(n_shots=self.n_shots)
        qubit = _abbrev_grid_qubit(self.qubit)
        return (f'{self.dataset_id}/'
                f'{self.device_name}/'
                f'q-{qubit}/'
                f'ry_scan_{self.resolution_factor}_{n_shots}')


# Define the following helper functions to make nicer `fn` keys
# for the tasks:

def _abbrev_n_shots(n_shots: int) -> str:
    """Shorter n_shots component of a filename"""
    if n_shots % 1000 == 0:
        return f'{n_shots // 1000}k'
    return str(n_shots)

def _abbrev_grid_qubit(qubit: cirq.GridQubit) -> str:
    """Formatted grid_qubit component of a filename"""
    return f'{qubit.row}_{qubit.col}'

There are some things worth noting with this TasknameTask class.

  1. We use the utility decorator @json_serializable_dataclass, which wraps the vanilla @dataclass decorator and additionally permits saving and loading instances of ReadoutScanTask using Cirq’s JSON serialization facilities. We give it an appropriate namespace to distinguish our class from top-level cirq objects.

  2. Data members are all primitive or near-primitive data types: str, int, GridQubit. This sets us up well to use ReadoutScanTask in a variety of contexts where it may be tricky to use too-abstract data types. First, these simple members allow us to map from a task object to a unique /-delimited string appropriate for use as a filename or a unique key. Second, these parameters are immediately suitable to serve as columns in a pd.DataFrame or a database table.

  3. There is a property named fn which provides a mapping from ReadoutScanTask instances to strings suitable for use as filenames. In fact, we will use this to save per-task data. Note that every dataclass member variable is used in the construction of fn. We also define some helper functions (_abbrev_n_shots and _abbrev_grid_qubit) to make more human-readable strings. There must be a 1:1 mapping from task attributes to filenames. In general it is easy to go from a Task object to a filename. It should be possible to go the other way, but filenames prioritize readability over parsability, so in practice the reverse mapping won’t be used.

  4. We begin with a dataset_id field. Remember, instances of ReadoutScanTask must completely capture a task. We may want to run the same qubit for the same number of shots on the same device on two different days, so we include dataset_id to capture the notion of time and/or the state of the universe for tasks. Each family of tasks should include dataset_id as its first parameter.
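
The mapping from task fields to a filename key can be tried without Cirq or ReCirq installed. The following is a minimal, stdlib-only sketch under stated assumptions: DemoScanTask is a hypothetical stand-in for ReadoutScanTask (it is not part of recirq), and the qubit is represented as a plain (row, col) tuple instead of a cirq.GridQubit.

```python
import dataclasses
from typing import Tuple


@dataclasses.dataclass(frozen=True)
class DemoScanTask:
    """Hypothetical stand-in for ReadoutScanTask (not part of recirq)."""
    dataset_id: str
    device_name: str
    n_shots: int
    qubit: Tuple[int, int]  # (row, col); the real task uses cirq.GridQubit
    resolution_factor: int

    @property
    def fn(self) -> str:
        # Every field appears in the key, so changing a field changes the key.
        n_shots = (f'{self.n_shots // 1000}k' if self.n_shots % 1000 == 0
                   else str(self.n_shots))
        row, col = self.qubit
        return (f'{self.dataset_id}/{self.device_name}/'
                f'q-{row}_{col}/ry_scan_{self.resolution_factor}_{n_shots}')


task = DemoScanTask(dataset_id='2020-02-tutorial',
                    device_name='Syc23-simulator',
                    n_shots=40_000, qubit=(3, 2), resolution_factor=6)
# Same task on a different qubit: the key differs only in the q-... component.
other = dataclasses.replace(task, qubit=(4, 1))

print(task.fn)
print(other.fn)

# The near-primitive fields also flatten into a record usable as a table row.
record = dataclasses.asdict(task)
```

Because the dataclass is frozen, a task cannot be mutated after its key has been used to save data; dataclasses.replace creates a new instance instead.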

Namespacing

A collection of tasks can be grouped into an “experiment” with a particular name. This defines a folder ~/cirq-results/[experiment_name]/ under which data will be stored. If you were storing data in a database, this might be the table name. The second level of namespacing comes from tasks’ dataset_id field which groups together an immutable collection of results taken at roughly the same time.

By convention, you can define the following global variables in your experiment scripts:

[3]:
EXPERIMENT_NAME = 'readout-scan'
DEFAULT_BASE_DIR = os.path.expanduser(f'~/cirq-results/{EXPERIMENT_NAME}')

All of the I/O functions take a base_dir parameter to support full control over where things are saved / loaded. Your script will use DEFAULT_BASE_DIR.
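
As a rough illustration of how the namespacing levels combine into a location on disk, the sketch below joins a base directory with a task key. The helper name task_path and the .json suffix are assumptions made for this example only; the actual file layout is decided by recirq.save().

```python
import os
from typing import Optional

EXPERIMENT_NAME = 'readout-scan'
DEFAULT_BASE_DIR = os.path.expanduser(f'~/cirq-results/{EXPERIMENT_NAME}')


def task_path(fn: str, base_dir: Optional[str] = None) -> str:
    """Hypothetical helper: where a task with key `fn` would be stored."""
    if base_dir is None:
        base_dir = DEFAULT_BASE_DIR
    # Assumed layout: <base_dir>/<dataset_id>/.../<last key component>.json
    return os.path.join(base_dir, f'{fn}.json')


default_path = task_path('2020-02-tutorial/Syc23-simulator/q-3_2/ry_scan_6_40k')
custom_path = task_path('2020-02-tutorial/demo', base_dir='scratch')
print(default_path)
print(custom_path)
```

Passing base_dir explicitly is what lets tests or one-off runs write somewhere other than the default results folder.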

Typically, data collection (i.e. the code in this notebook) would be in a script so you can run it headless for a long time. Typically, analysis is done in one or more notebooks because of their ability to display rich output. By saving data correctly, your analysis and plotting code can run fast and interactively.

Running a Task

Each task comprises not only the Task object, but also a function that executes the task. For example, here we define the process by which we collect data.

  • There should only be one required argument: task, whose type is the class defined to completely specify the parameters of a task. Why define a separate class instead of just using normal function arguments?

      • Remember this class has a fn property that gives a unique string for parameters. If there were more arguments to this function, there would be inputs not specified in fn and the data output path could be ambiguous.

      • By putting the arguments in a class, they can easily be serialized as metadata alongside the output of the task.

  • The behavior of the function must be completely determined by its inputs.

      • This is why we put a dataset_id field in each task that’s usually something resembling a timestamp. It captures the ‘state of the world’ as an input.

  • It’s recommended that each task function begin by checking whether the output file already exists. If it does and the output is completely determined by its inputs, then we can deduce that the task is already done. This can save time for expensive classical pre-computations, or it can be used to restart a collection of tasks where only some of them had completed.

  • In general, you have freedom to implement your own logic in these functions, especially between the beginning (which is code for loading in input data) and the end (which is always a call to recirq.save()). Don’t go crazy. If there’s too much logic in your task execution function, consider factoring out useful functionality into the main library.
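
The existence check recommended above can be sketched with the standard library alone. This is not how recirq.exists() or recirq.save() are implemented; run_once, its arguments, and the JSON layout are hypothetical names chosen for illustration. The sketch also stores the task parameters next to the data, as suggested for metadata.

```python
import json
import os
import tempfile
from typing import Callable


def run_once(key: str, params: dict, compute: Callable[[dict], list],
             base_dir: str) -> bool:
    """Run `compute(params)` unless output for `key` already exists.

    Returns True if work was done and False if the task was skipped.
    """
    path = os.path.join(base_dir, f'{key}.json')
    if os.path.exists(path):
        # Output is determined by the inputs, so an existing file means "done".
        return False
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Save the parameters as metadata alongside the results.
    payload = {'task': params, 'data': compute(params)}
    with open(path, 'w') as f:
        json.dump(payload, f)
    return True


with tempfile.TemporaryDirectory() as tmp:
    params = {'dataset_id': 'demo', 'n_shots': 1000}
    first = run_once('demo/scan_1k', params, lambda p: [p['n_shots']], tmp)
    second = run_once('demo/scan_1k', params, lambda p: [p['n_shots']], tmp)
    print(first, second)  # True False
```

Re-running a driver over a partially completed list of tasks then only executes the tasks whose output is missing.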

[4]:
def run_readout_scan(task: ReadoutScanTask,
                     base_dir=None):
    """Execute a :py:class:`ReadoutScanTask` task."""
    if base_dir is None:
        base_dir = DEFAULT_BASE_DIR

    if recirq.exists(task, base_dir=base_dir):
        print(f"{task} already exists. Skipping.")
        return

    # Create a simple circuit
    theta = sympy.Symbol('theta')
    circuit = cirq.Circuit([
        cirq.ry(theta).on(task.qubit),
        cirq.measure(task.qubit, key='z')
    ])

    # Use utilities to map sampler names to Sampler objects
    sampler = recirq.get_sampler_by_name(device_name=task.device_name)

    # Use a sweep over theta values.
    # Set up limits so we include (-1/2, 0, 1/2, 1, 3/2) * pi
    # The total number of points is resolution_factor * 4 + 1
    n_special_points: int = 5
    resolution_factor = task.resolution_factor
    theta_sweep = cirq.Linspace(theta, -np.pi / 2, 3 * np.pi / 2,
                                resolution_factor * (n_special_points - 1) + 1)
    thetas = np.asarray([v for ((k, v),) in theta_sweep.param_tuples()])
    flat_circuit, flat_sweep = cirq.flatten_with_sweep(circuit, theta_sweep)

    # Run the jobs
    print(f"Collecting data for {task.qubit}", flush=True)
    results = sampler.run_sweep(program=flat_circuit, params=flat_sweep,
                                repetitions=task.n_shots)

    # Save the results
    recirq.save(task=task, data={
        'thetas': thetas,
        'all_bitstrings': [
            recirq.BitArray(np.asarray(r.measurements['z']))
            for r in results]
    }, base_dir=base_dir)
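
The point-count arithmetic in the sweep can be checked by hand. The sketch below uses plain Python instead of cirq.Linspace and assumes an evenly spaced grid that includes both endpoints: with resolution_factor * 4 + 1 points between -pi/2 and 3pi/2, every resolution_factor-th point lands on one of the special angles.

```python
import math

resolution_factor = 6
n_special_points = 5
n_points = resolution_factor * (n_special_points - 1) + 1  # 6 * 4 + 1 = 25

start, stop = -math.pi / 2, 3 * math.pi / 2
step = (stop - start) / (n_points - 1)  # pi / (2 * resolution_factor)
thetas = [start + i * step for i in range(n_points)]

# Every resolution_factor-th grid point, expressed in units of pi.
special = [thetas[i * resolution_factor] / math.pi
           for i in range(n_special_points)]
print(n_points)
print([round(s, 6) for s in special])
```

The interval spans 2pi and the special angles are spaced pi/2 apart, so the number of steps must be a multiple of 4; that is why the formula has the form resolution_factor * 4 + 1.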

The driver script

Typically, the above classes and functions will live in a Python module; something like cirq/experiments/readout_scan/tasks.py. You can then have one or more “driver scripts” which are actually executed.

View the driver script as a configuration file that specifies exactly which parameters you want to run. You can see that below, we’ve formatted the construction of all the task objects to look like a configuration file. This is no accident! As noted in the docstring, the user can be expected to twiddle values defined in the script. Trying to factor this out into an ini file (or similar) is more effort than it’s worth.

[5]:
# Put in a file named run-readout-scan.py

import datetime
# Note: newer Cirq releases ship this module as the separate `cirq_google` package.
import cirq.google as cg

MAX_N_QUBITS = 5

def main():
    """Main driver script entry point.

    This function contains configuration options and you will likely need
    to edit it to suit your needs. Of particular note, please make sure
    `dataset_id` and `device_name`
    are set how you want them. You may also want to change the values in
    the list comprehension to set the qubits.
    """
    # Uncomment below for an auto-generated unique dataset_id
    # dataset_id = datetime.datetime.now().isoformat(timespec='minutes')
    dataset_id = '2020-02-tutorial'
    data_collection_tasks = [
        ReadoutScanTask(
            dataset_id=dataset_id,
            device_name='Syc23-simulator',
            n_shots=40_000,
            qubit=qubit,
            resolution_factor=6,
        )
        for qubit in cg.Sycamore23.qubits[:MAX_N_QUBITS]
    ]

    for dc_task in data_collection_tasks:
        run_readout_scan(dc_task)


if __name__ == '__main__':
    main()
Collecting data for (3, 2)
Collecting data for (4, 1)
Collecting data for (4, 2)
Collecting data for (4, 3)
Collecting data for (5, 0)

We additionally follow good Python convention by wrapping the entry point in a function (i.e. def main():) rather than putting the code directly under if __name__ == '__main__':. The latter strategy puts all variables in the global scope (bad!).
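
The commented-out dataset_id line in the driver can be tried in isolation to see what kind of identifier it produces. In this small sketch, make_dataset_id is a hypothetical helper name, and a fixed timestamp is used so the result is reproducible.

```python
import datetime


def make_dataset_id(now: datetime.datetime) -> str:
    """Hypothetical helper: a timestamp-like dataset_id, minute resolution."""
    return now.isoformat(timespec='minutes')


def main():
    # Variables defined here stay local to main() instead of becoming globals.
    example = make_dataset_id(datetime.datetime(2020, 2, 14, 9, 30, 15))
    print(example)  # 2020-02-14T09:30
    return example


example_id = main()
```

Since dataset_id becomes a directory name, keep in mind that the ':' character in such timestamps is not allowed in filenames on some platforms (Windows, for instance), so a hand-written identifier like '2020-02-tutorial' may be the more portable choice.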