{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial: Data Collection\n", "\n", "Following a set of idioms and using common utilities when running NISQy quantum\n", "experiments is advantageous to:\n", "\n", " - Avoid duplication of effort for common tasks like data saving and loading\n", " - Enable easy data sharing\n", " - Reduce cognitive load of onboarding onto a new experiment. The 'science'\n", " part is isolated from an idiomatic 'infrastructure' part.\n", " - Idioms and conventions are more flexible than a strict framework. You\n", " don't need to do everything exactly.\n", " \n", "This notebook shows how to design the infrastructure to support a simple experiment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import numpy as np\n", "import sympy\n", "\n", "import cirq\n", "import recirq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tasks\n", "\n", "We organize our experiments around the concept of \"tasks\". A task is a unit of work which consists of loading in input data, doing data processing or data collection, and saving results. Dividing your pipeline into tasks can be more of an art than a science. However, some rules of thumb can be observed:\n", "\n", " 1. A task should be at least 30 seconds worth of work but less than ten minutes worth of work. Finer division of tasks can make your pipelines more composable, more resistant to failure, easier to restart from failure, and easier to parallelize. Coarser division of tasks can amortize the cost of input and ouput data serialization and deserialization.\n", " \n", " 2. A task should be completely determined by a small-to-medium collection of primitive data type parameters. In fact, these parameters will represent instances of tasks and will act as \"keys\" in a database or on the filesystem.\n", "\n", "Practically, a task consists of a `TasknameTask` (use your own name!) dataclass and a function which takes an instance of such a class as its argument, does the requisite data processing, and saves its results. Here, we define the `ReadoutScanTask` class with members that tell us exactly what data we want to collect." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@recirq.json_serializable_dataclass(namespace='recirq.readout_scan', \n", " registry=recirq.Registry,\n", " frozen=True)\n", "class ReadoutScanTask:\n", " \"\"\"Scan over Ry(theta) angles from -pi/2 to 3pi/2 tracing out a sinusoid\n", " which is primarily affected by readout error.\n", "\n", " See Also:\n", " :py:func:`run_readout_scan`\n", "\n", " Attributes:\n", " dataset_id: A unique identifier for this dataset.\n", " device_name: The device to run on, by name.\n", " n_shots: The number of repetitions for each theta value.\n", " qubit: The qubit to benchmark.\n", " resolution_factor: We select the number of points in the linspace\n", " so that the special points: (-1/2, 0, 1/2, 1, 3/2) * pi are\n", " always included. 
The total number of theta evaluations\n", " is resolution_factor * 4 + 1.\n", " \"\"\"\n", " dataset_id: str\n", " device_name: str\n", " n_shots: int\n", " qubit: cirq.GridQubit\n", " resolution_factor: int\n", "\n", " @property\n", " def fn(self):\n", " n_shots = _abbrev_n_shots(n_shots=self.n_shots)\n", " qubit = _abbrev_grid_qubit(self.qubit)\n", " return (f'{self.dataset_id}/'\n", " f'{self.device_name}/'\n", " f'q-{qubit}/'\n", " f'ry_scan_{self.resolution_factor}_{n_shots}')\n", "\n", "\n", "# Define the following helper functions to make nicer `fn` keys\n", "# for the tasks:\n", " \n", "def _abbrev_n_shots(n_shots: int) -> str:\n", " \"\"\"Shorter n_shots component of a filename\"\"\"\n", " if n_shots % 1000 == 0:\n", " return f'{n_shots // 1000}k'\n", " return str(n_shots)\n", "\n", "def _abbrev_grid_qubit(qubit: cirq.GridQubit) -> str:\n", " \"\"\"Formatted grid_qubit component of a filename\"\"\"\n", " return f'{qubit.row}_{qubit.col}'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are some things worth noting with this TasknameTask class.\n", "\n", "1. We use the utility annotation `@json_serializable_dataclass`, which wraps the vanilla `@dataclass` annotation, except it permits saving and loading instances of `ReadoutScanTask` using Cirq's JSON serialization facilities. We give it an appropriate namespace to distinguish between top-level `cirq` objects.\n", "\n", "2. Data members are all primitive or near-primitive data types: `str`, `int`, `GridQubit`. This sets us up well to use `ReadoutScanTask` in a variety of contexts where it may be tricky to use too-abstract data types. First, these simple members allow us to map from a task object to a unique `/`-delimited string appropriate for use as a filename or a unique key. Second, these parameters are immediately suitable to serve as columns in a `pd.DataFrame` or a database table. \n", "\n", "3. There is a property named `fn` which provides a mapping from `ReadoutScanTask` instances to strings suitable for use as filenames. In fact, we will use this to save per-task data. Note that every dataclass member variable is used in the construction of `fn`. We also define some utility methods to make more human-readable strings. There must be a 1:1 mapping from task attributes to filenames. In general it is easy to go from a Task object to a filename. It should be possible to go the other way, although filenames prioritize readability over parsability; so in general this relationship won’t be used.\n", "\n", "4. We begin with a `dataset_id` field. Remember, instances of `ReadoutScanTask` must completely capture a task. We may want to run the same qubit for the same number of shots on the same device on two different days, so we include `dataset_id` to capture the notion of time and/or the state of the universe for tasks. Each family of tasks should include `dataset_id` as its first parameter." 
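] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make point 3 concrete, the following cell is a minimal sketch of the task-to-filename mapping. The particular attribute values (including the qubit) are illustrative placeholders, not prescribed by `recirq`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Construct a task with illustrative parameter values and inspect its\n", "# unique filesystem key. Every dataclass member appears in `fn`.\n", "task = ReadoutScanTask(\n", "    dataset_id='2020-02-tutorial',\n", "    device_name='Syc23-simulator',\n", "    n_shots=40_000,\n", "    qubit=cirq.GridQubit(3, 2),\n", "    resolution_factor=6,\n", ")\n", "\n", "# Prints: 2020-02-tutorial/Syc23-simulator/q-3_2/ry_scan_6_40k\n", "print(task.fn)"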
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Namespacing\n", "\n", "A collection of tasks can be grouped into an \"experiment\" with a particular name.\n", "This defines a folder `~/cirq-results/[experiment_name]/` under which data will be stored.\n", "If you were storing data in a database, this might be the table name.\n", "The second level of namespacing comes from tasks' `dataset_id` field which groups together an immutable collection of results taken at roughly the same time.\n", "\n", "By convention, you can define the following global variables in your experiment scripts:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "EXPERIMENT_NAME = 'readout-scan'\n", "DEFAULT_BASE_DIR = os.path.expanduser(f'~/cirq-results/{EXPERIMENT_NAME}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the I/O functions take a `base_dir` parameter to support full control\n", "over where things are saved / loaded. Your script will use `DEFAULT_BASE_DIR`.\n", "\n", "Typically, data collection (i.e. the code in this notebook) would be in a script so you can run it headless for a long time. Typically, analysis is done in one or more notebooks because of their ability to display rich output. By saving data correctly, your analysis and plotting code can run fast and interactively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running a Task\n", "\n", "Each task is comprised not only of the Task object, but also a function that executes the task. For example, here we define the process by which we collect data.\n", "\n", " - There should only be one required argument: `task` whose type is the class defined to completely specify the parameters of a task. Why define a separate class instead of just using normal function arguments?\n", " - Remember this class has a `fn` property that gives a unique string for parameters. If there were more arguments to this function, there would be inputs not specified in `fn` and the data output path could be ambiguous.\n", " - By putting the arguments in a class, they can easily be serialized as metadata alongside the output of the task.\n", " - The behavior of the function must be completely determined by its inputs.\n", " - This is why we put a `dataset_id` field in each task that's usually something resembling a timestamp. It captures the 'state of the world' as an input.\n", " - It's recommended that you add a check to the beginning of each task function to check if the output file already exists. If it does and the output is completely determined by its inputs, then we can deduce that the task is already done. This can save time for expensive classical pre-computations or it can be used to re-start a collection of tasks where only some of them had completed.\n", " - In general, you have freedom to implement your own logic in these functions, especially between the beginning (which is code for loading in input data) and the end (which is always a call to `recirq.save()`). Don't go crazy. If there's too much logic in your task execution function, consider factoring out useful functionality into the main library." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def run_readout_scan(task: ReadoutScanTask,\n", " base_dir=None):\n", " \"\"\"Execute a :py:class:`ReadoutScanTask` task.\"\"\"\n", " if base_dir is None:\n", " base_dir = DEFAULT_BASE_DIR\n", " \n", " if recirq.exists(task, base_dir=base_dir):\n", " print(f\"{task} already exists. Skipping.\")\n", " return\n", "\n", " # Create a simple circuit\n", " theta = sympy.Symbol('theta')\n", " circuit = cirq.Circuit([\n", " cirq.ry(theta).on(task.qubit),\n", " cirq.measure(task.qubit, key='z')\n", " ])\n", "\n", " # Use utilities to map sampler names to Sampler objects\n", " sampler = recirq.get_sampler_by_name(device_name=task.device_name)\n", "\n", " # Use a sweep over theta values.\n", " # Set up limits so we include (-1/2, 0, 1/2, 1, 3/2) * pi\n", " # The total number of points is resolution_factor * 4 + 1\n", " n_special_points: int = 5\n", " resolution_factor = task.resolution_factor\n", " theta_sweep = cirq.Linspace(theta, -np.pi / 2, 3 * np.pi / 2,\n", " resolution_factor * (n_special_points - 1) + 1)\n", " thetas = np.asarray([v for ((k, v),) in theta_sweep.param_tuples()])\n", " flat_circuit, flat_sweep = cirq.flatten_with_sweep(circuit, theta_sweep)\n", "\n", " # Run the jobs\n", " print(f\"Collecting data for {task.qubit}\", flush=True)\n", " results = sampler.run_sweep(program=flat_circuit, params=flat_sweep,\n", " repetitions=task.n_shots)\n", "\n", " # Save the results\n", " recirq.save(task=task, data={\n", " 'thetas': thetas,\n", " 'all_bitstrings': [\n", " recirq.BitArray(np.asarray(r.measurements['z']))\n", " for r in results]\n", " }, base_dir=base_dir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The driver script\n", "\n", "Typically, the above classes and functions will live in a Python module; something like `cirq/experiments/readout_scan/tasks.py`. You can then have one or more \"driver scripts\" which are actually executed.\n", "\n", "View the driver script as a configuration file that specifies exactly which parameters you want to run. You can see that below, we've formatted the construction of all the task objects to look like a configuration file. This is no accident! As noted in the docstring, the user can be expected to twiddle values defined in the script. Trying to factor this out into an ini file (or similar) is more effort than it's worth." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Put in a file named run-readout-scan.py\n", "\n", "import datetime\n", "import cirq.google as cg\n", "\n", "MAX_N_QUBITS = 5\n", "\n", "def main():\n", " \"\"\"Main driver script entry point.\n", "\n", " This function contains configuration options and you will likely need\n", " to edit it to suit your needs. Of particular note, please make sure\n", " `dataset_id` and `device_name`\n", " are set how you want them. 
You may also want to change the values in\n", " the list comprehension to set the qubits.\n", " \"\"\"\n", " # Uncomment below for an auto-generated unique dataset_id\n", " # dataset_id = datetime.datetime.now().isoformat(timespec='minutes')\n", " dataset_id = '2020-02-tutorial'\n", " data_collection_tasks = [\n", " ReadoutScanTask(\n", " dataset_id=dataset_id,\n", " device_name='Syc23-simulator',\n", " n_shots=40_000,\n", " qubit=qubit,\n", " resolution_factor=6,\n", " )\n", " for qubit in cg.Sycamore23.qubits[:MAX_N_QUBITS]\n", " ]\n", "\n", " for dc_task in data_collection_tasks:\n", " run_readout_scan(dc_task)\n", "\n", "\n", "if __name__ == '__main__':\n", " main()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We additionally follow good Python convention by wrapping the entry point in a function (i.e. `def main():` rather than putting it directly under `if __name__ == '__main__'`. The latter strategy puts all variables in the global scope (bad!)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }