Tutorial: Data Collection
Following a set of idioms and using common utilities when running NISQy quantum experiments helps you:
Avoid duplication of effort for common tasks like data saving and loading
Enable easy data sharing
Reduce the cognitive load of onboarding onto a new experiment, because the ‘science’ part is isolated from an idiomatic ‘infrastructure’ part
Idioms and conventions are more flexible than a strict framework: you don’t need to follow every one of them to the letter.
This notebook shows how to design the infrastructure to support a simple experiment.
[1]:
import os
import numpy as np
import sympy
import cirq
import recirq
Tasks
We organize our experiments around the concept of “tasks”. A task is a unit of work which consists of loading in input data, doing data processing or data collection, and saving results. Dividing your pipeline into tasks can be more of an art than a science. However, some rules of thumb can be observed:
A task should be at least 30 seconds worth of work but less than ten minutes worth of work. Finer division of tasks can make your pipelines more composable, more resistant to failure, easier to restart from failure, and easier to parallelize. Coarser division of tasks can amortize the cost of input and output data serialization and deserialization.
A task should be completely determined by a small-to-medium collection of primitive data type parameters. In fact, these parameters will represent instances of tasks and will act as “keys” in a database or on the filesystem.
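To make the idea of parameters acting as keys concrete, here is a minimal sketch using only the standard library. ExampleTask and its fields are hypothetical stand-ins and not part of ReCirq; the point is only that a frozen dataclass of primitive values is hashable and compares by value, so it can index results directly.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class ExampleTask:
    """Hypothetical task description built only from primitive values."""
    dataset_id: str
    n_shots: int


# frozen=True makes instances immutable and hashable, so a task can be
# used directly as a dictionary key.
results = {}
task = ExampleTask(dataset_id='2020-02-tutorial', n_shots=1000)
results[task] = 'placeholder for collected data'

# A task rebuilt from the same parameters compares equal and finds the
# same entry, which is what lets the parameters act as a key.
same_task = ExampleTask(dataset_id='2020-02-tutorial', n_shots=1000)
found = results[same_task]
```

The same property carries over to a filesystem or database: anything derived deterministically from the fields can serve as the lookup key.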
Practically, a task consists of a TasknameTask (use your own name!) dataclass and a function which takes an instance of such a class as its argument, does the requisite data processing, and saves its results. Here, we define the ReadoutScanTask class with members that tell us exactly what data we want to collect.
[2]:
@recirq.json_serializable_dataclass(namespace='recirq.readout_scan',
                                    registry=recirq.Registry,
                                    frozen=True)
class ReadoutScanTask:
    """Scan over Ry(theta) angles from -pi/2 to 3pi/2 tracing out a sinusoid
    which is primarily affected by readout error.

    See Also:
        :py:func:`run_readout_scan`

    Attributes:
        dataset_id: A unique identifier for this dataset.
        device_name: The device to run on, by name.
        n_shots: The number of repetitions for each theta value.
        qubit: The qubit to benchmark.
        resolution_factor: We select the number of points in the linspace
            so that the special points: (-1/2, 0, 1/2, 1, 3/2) * pi are
            always included. The total number of theta evaluations
            is resolution_factor * 4 + 1.
    """
    dataset_id: str
    device_name: str
    n_shots: int
    qubit: cirq.GridQubit
    resolution_factor: int

    @property
    def fn(self):
        n_shots = _abbrev_n_shots(n_shots=self.n_shots)
        qubit = _abbrev_grid_qubit(self.qubit)
        return (f'{self.dataset_id}/'
                f'{self.device_name}/'
                f'q-{qubit}/'
                f'ry_scan_{self.resolution_factor}_{n_shots}')


# Define the following helper functions to make nicer `fn` keys
# for the tasks:

def _abbrev_n_shots(n_shots: int) -> str:
    """Shorter n_shots component of a filename"""
    if n_shots % 1000 == 0:
        return f'{n_shots // 1000}k'
    return str(n_shots)


def _abbrev_grid_qubit(qubit: cirq.GridQubit) -> str:
    """Formatted grid_qubit component of a filename"""
    return f'{qubit.row}_{qubit.col}'
There are some things worth noting with this TasknameTask class.
We use the utility annotation @json_serializable_dataclass, which wraps the vanilla @dataclass annotation, except it permits saving and loading instances of ReadoutScanTask using Cirq’s JSON serialization facilities. We give it an appropriate namespace to distinguish between top-level cirq objects.
Data members are all primitive or near-primitive data types: str, int, GridQubit. This sets us up well to use ReadoutScanTask in a variety of contexts where it may be tricky to use too-abstract data types. First, these simple members allow us to map from a task object to a unique /-delimited string appropriate for use as a filename or a unique key. Second, these parameters are immediately suitable to serve as columns in a pd.DataFrame or a database table.
There is a property named fn which provides a mapping from ReadoutScanTask instances to strings suitable for use as filenames. In fact, we will use this to save per-task data. Note that every dataclass member variable is used in the construction of fn. We also define some utility methods to make more human-readable strings. There must be a 1:1 mapping from task attributes to filenames. In general it is easy to go from a Task object to a filename. It should be possible to go the other way, although filenames prioritize readability over parsability; so in general this relationship won’t be used.
We begin with a dataset_id field. Remember, instances of ReadoutScanTask must completely capture a task. We may want to run the same qubit for the same number of shots on the same device on two different days, so we include dataset_id to capture the notion of time and/or the state of the universe for tasks. Each family of tasks should include dataset_id as its first parameter.
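The mapping from task attributes to keys can be checked without any quantum libraries. The sketch below mirrors the fn property using hypothetical stand-ins (SketchQubit in place of cirq.GridQubit, SketchTask in place of ReadoutScanTask, and no JSON support), then confirms that a small grid of distinct parameter combinations yields distinct keys.

```python
import dataclasses
import itertools


@dataclasses.dataclass(frozen=True)
class SketchQubit:
    """Hypothetical stand-in for cirq.GridQubit (row and column only)."""
    row: int
    col: int


@dataclasses.dataclass(frozen=True)
class SketchTask:
    """Hypothetical mirror of ReadoutScanTask without the JSON support."""
    dataset_id: str
    device_name: str
    n_shots: int
    qubit: SketchQubit
    resolution_factor: int

    @property
    def fn(self) -> str:
        # Same layout as ReadoutScanTask.fn: every field appears in the key.
        n_shots = (f'{self.n_shots // 1000}k' if self.n_shots % 1000 == 0
                   else str(self.n_shots))
        return (f'{self.dataset_id}/{self.device_name}/'
                f'q-{self.qubit.row}_{self.qubit.col}/'
                f'ry_scan_{self.resolution_factor}_{n_shots}')


example = SketchTask('2020-02-tutorial', 'Syc23-simulator', 40_000,
                     SketchQubit(3, 2), 6)
example_key = example.fn

# Distinct parameter combinations should never share a key.
tasks = [
    SketchTask('2020-02-tutorial', 'Syc23-simulator', n_shots,
               SketchQubit(row, col), resolution)
    for n_shots, row, col, resolution in itertools.product(
        (500, 1000, 40_000), (3, 4), (1, 2), (2, 6))
]
keys = {t.fn for t in tasks}
```

For the example task the key is 2020-02-tutorial/Syc23-simulator/q-3_2/ry_scan_6_40k, and the 24 tasks in the grid produce 24 distinct keys. If a field were left out of fn, two different tasks could collide on the same key.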
Namespacing
A collection of tasks can be grouped into an “experiment” with a particular name. This defines a folder ~/cirq-results/[experiment_name]/ under which data will be stored. If you were storing data in a database, this might be the table name. The second level of namespacing comes from tasks’ dataset_id field which groups together an immutable collection of results taken at roughly the same time.
By convention, you can define the following global variables in your experiment scripts:
[3]:
EXPERIMENT_NAME = 'readout-scan'
DEFAULT_BASE_DIR = os.path.expanduser(f'~/cirq-results/{EXPERIMENT_NAME}')
All of the I/O functions take a base_dir parameter to support full control over where things are saved / loaded. Your script will use DEFAULT_BASE_DIR.
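As a rough picture of how the two namespace levels and the task key might combine into a location on disk, consider the sketch below. The helper task_path and the .json suffix are assumptions made for illustration; the actual file layout is whatever the recirq I/O functions produce.

```python
import os
from typing import Optional

EXPERIMENT_NAME = 'readout-scan'
DEFAULT_BASE_DIR = os.path.expanduser(f'~/cirq-results/{EXPERIMENT_NAME}')


def task_path(task_fn: str, base_dir: Optional[str] = None) -> str:
    """Hypothetical helper: where one task's output could live.

    `task_fn` is the '/'-delimited key from a task's `fn` property. The
    '.json' suffix is an assumption for this sketch.
    """
    if base_dir is None:
        base_dir = DEFAULT_BASE_DIR
    return os.path.join(base_dir, *task_fn.split('/')) + '.json'


key = '2020-02-tutorial/Syc23-simulator/q-3_2/ry_scan_6_40k'

# The default location sits under the experiment folder ...
default_path = task_path(key)

# ... while an explicit base_dir redirects everything, e.g. for testing.
scratch_path = task_path(key, base_dir=os.path.join('scratch', 'readout-scan'))
```

Passing base_dir explicitly is what makes it possible to point the same task at a scratch directory without touching the task definition itself.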
Typically, data collection (i.e. the code in this notebook) would be in a script so you can run it headless for a long time. Analysis, by contrast, is usually done in one or more notebooks because of their ability to display rich output. By saving data correctly, your analysis and plotting code can run fast and interactively.
Running a Task
Each task consists not only of the Task object, but also of a function that executes the task. For example, here we define the process by which we collect data.
There should only be one required argument: task, whose type is the class defined to completely specify the parameters of a task. Why define a separate class instead of just using normal function arguments?
Remember this class has a fn property that gives a unique string for parameters. If there were more arguments to this function, there would be inputs not specified in fn and the data output path could be ambiguous.
By putting the arguments in a class, they can easily be serialized as metadata alongside the output of the task.
The behavior of the function must be completely determined by its inputs.
This is why we put a dataset_id field in each task that’s usually something resembling a timestamp. It captures the ‘state of the world’ as an input.
It’s recommended that you add a check to the beginning of each task function to check if the output file already exists. If it does and the output is completely determined by its inputs, then we can deduce that the task is already done. This can save time for expensive classical pre-computations or it can be used to re-start a collection of tasks where only some of them had completed.
In general, you have freedom to implement your own logic in these functions, especially between the beginning (which is code for loading in input data) and the end (which is always a call to recirq.save()). Don’t go crazy. If there’s too much logic in your task execution function, consider factoring out useful functionality into the main library.
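The skip-if-done check and the idea of storing task parameters next to the data can be illustrated with the standard library alone. In the sketch below, SketchTask, run_sketch_task, the JSON layout, and the placeholder data are all hypothetical; they imitate the shape of a task function rather than reproduce what recirq.exists() and recirq.save() do internally.

```python
import dataclasses
import json
import os
import tempfile


@dataclasses.dataclass(frozen=True)
class SketchTask:
    """Hypothetical task; `fn` uses every field, as recommended above."""
    dataset_id: str
    n_shots: int

    @property
    def fn(self) -> str:
        return f'{self.dataset_id}/scan_{self.n_shots}'


def run_sketch_task(task: SketchTask, base_dir: str) -> bool:
    """Run `task` unless its output exists. Returns True if work was done."""
    path = os.path.join(base_dir, task.fn + '.json')
    if os.path.exists(path):
        # Output is fully determined by the task, so it is already done.
        return False

    data = {'counts': [task.n_shots, 0]}  # placeholder for real collection

    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w') as f:
        # Store the task parameters as metadata next to the data.
        json.dump({'task': dataclasses.asdict(task), 'data': data}, f)
    return True


with tempfile.TemporaryDirectory() as base_dir:
    task = SketchTask(dataset_id='2020-02-tutorial', n_shots=1000)
    first_run = run_sketch_task(task, base_dir)   # does the work
    second_run = run_sketch_task(task, base_dir)  # skipped on restart
    with open(os.path.join(base_dir, task.fn + '.json')) as f:
        saved = json.load(f)
```

Because the second call returns early, re-running a whole batch after a partial failure only repeats the tasks whose output is missing.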
[4]:
def run_readout_scan(task: ReadoutScanTask,
                     base_dir=None):
    """Execute a :py:class:`ReadoutScanTask` task."""
    if base_dir is None:
        base_dir = DEFAULT_BASE_DIR

    if recirq.exists(task, base_dir=base_dir):
        print(f"{task} already exists. Skipping.")
        return

    # Create a simple circuit
    theta = sympy.Symbol('theta')
    circuit = cirq.Circuit([
        cirq.ry(theta).on(task.qubit),
        cirq.measure(task.qubit, key='z')
    ])

    # Use utilities to map sampler names to Sampler objects
    sampler = recirq.get_sampler_by_name(device_name=task.device_name)

    # Use a sweep over theta values.
    # Set up limits so we include (-1/2, 0, 1/2, 1, 3/2) * pi
    # The total number of points is resolution_factor * 4 + 1
    n_special_points: int = 5
    resolution_factor = task.resolution_factor
    theta_sweep = cirq.Linspace(theta, -np.pi / 2, 3 * np.pi / 2,
                                resolution_factor * (n_special_points - 1) + 1)
    thetas = np.asarray([v for ((k, v),) in theta_sweep.param_tuples()])
    flat_circuit, flat_sweep = cirq.flatten_with_sweep(circuit, theta_sweep)

    # Run the jobs
    print(f"Collecting data for {task.qubit}", flush=True)
    results = sampler.run_sweep(program=flat_circuit, params=flat_sweep,
                                repetitions=task.n_shots)

    # Save the results
    recirq.save(task=task, data={
        'thetas': thetas,
        'all_bitstrings': [
            recirq.BitArray(np.asarray(r.measurements['z']))
            for r in results]
    }, base_dir=base_dir)
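The claim that resolution_factor * 4 + 1 evenly spaced points always include the special angles can be verified with a little exact arithmetic. The sketch below is an independent check using fractions in units of pi, not the code path used by cirq.Linspace; the function name theta_grid_in_units_of_pi is made up for this example.

```python
from fractions import Fraction


def theta_grid_in_units_of_pi(resolution_factor: int) -> list:
    """Evenly spaced thetas from -pi/2 to 3pi/2, as multiples of pi.

    Exact fractions avoid floating-point comparisons in this sketch.
    """
    n_points = resolution_factor * 4 + 1
    start, stop = Fraction(-1, 2), Fraction(3, 2)
    step = (stop - start) / (n_points - 1)
    return [start + i * step for i in range(n_points)]


special_points = [Fraction(-1, 2), Fraction(0), Fraction(1, 2),
                  Fraction(1), Fraction(3, 2)]

grid = theta_grid_in_units_of_pi(resolution_factor=6)
n_points = len(grid)
all_included = all(p in grid for p in special_points)
```

With resolution_factor=6 the grid has 25 points spaced pi/12 apart. The interval spans 2 pi and is cut into 4 * resolution_factor equal steps, so every multiple of pi/2 falls exactly on a grid point.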
The driver script
Typically, the above classes and functions will live in a Python module; something like cirq/experiments/readout_scan/tasks.py. You can then have one or more “driver scripts” which are actually executed.
View the driver script as a configuration file that specifies exactly which parameters you want to run. You can see that below, we’ve formatted the construction of all the task objects to look like a configuration file. This is no accident! As noted in the docstring, the user can be expected to twiddle values defined in the script. Trying to factor this out into an ini file (or similar) is more effort than it’s worth.
[5]:
# Put in a file named run-readout-scan.py

import datetime
# Note: newer Cirq releases ship this module as the separate `cirq_google` package.
import cirq.google as cg

MAX_N_QUBITS = 5


def main():
    """Main driver script entry point.

    This function contains configuration options and you will likely need
    to edit it to suit your needs. Of particular note, please make sure
    `dataset_id` and `device_name`
    are set how you want them. You may also want to change the values in
    the list comprehension to set the qubits.
    """
    # Uncomment below for an auto-generated unique dataset_id
    # dataset_id = datetime.datetime.now().isoformat(timespec='minutes')
    dataset_id = '2020-02-tutorial'
    data_collection_tasks = [
        ReadoutScanTask(
            dataset_id=dataset_id,
            device_name='Syc23-simulator',
            n_shots=40_000,
            qubit=qubit,
            resolution_factor=6,
        )
        for qubit in cg.Sycamore23.qubits[:MAX_N_QUBITS]
    ]

    for dc_task in data_collection_tasks:
        run_readout_scan(dc_task)


if __name__ == '__main__':
    main()
Collecting data for (3, 2)
Collecting data for (4, 1)
Collecting data for (4, 2)
Collecting data for (4, 3)
Collecting data for (5, 0)
We additionally follow good Python convention by wrapping the entry point in a function (i.e. def main():) rather than putting it directly under if __name__ == '__main__'. The latter strategy puts all variables in the global scope (bad!).
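The commented-out line in the driver script builds dataset_id from the current time. The sketch below shows what such an identifier looks like and keeps the configuration inside main(), as recommended. The helper make_dataset_id, the optional now argument, and the plain dicts standing in for ReadoutScanTask instances are assumptions for illustration only.

```python
import datetime
from typing import Optional


def make_dataset_id(now: datetime.datetime) -> str:
    """Hypothetical helper: a minute-resolution timestamp as dataset_id."""
    return now.isoformat(timespec='minutes')


def main(now: Optional[datetime.datetime] = None) -> list:
    """Build the task parameters inside a function, not at module level."""
    if now is None:
        now = datetime.datetime.now()
    dataset_id = make_dataset_id(now)
    # Plain dicts stand in for ReadoutScanTask instances in this sketch.
    return [
        {'dataset_id': dataset_id, 'n_shots': 40_000, 'qubit': (row, col)}
        for row, col in [(3, 2), (4, 1)]
    ]


# A fixed time makes the example reproducible; a real driver would call main().
example_tasks = main(now=datetime.datetime(2020, 2, 3, 14, 5, 59))
```

Here every task in the batch shares the dataset_id 2020-02-03T14:05, and dataset_id itself stays local to main(). Note that such an identifier contains a colon, which Windows does not accept in file or directory names, so a different separator may be needed there.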