Merged
4 changes: 2 additions & 2 deletions CLAUDE.md
@@ -31,7 +31,7 @@ rm -rf book/_build
pytest

# Run tests with coverage
-pytest --cov=src/BetterCodeBetterScience --cov-report term-missing
+pytest --cov=src/bettercode --cov-report term-missing

# Run specific test modules
pytest tests/textmining/
@@ -62,7 +62,7 @@ pre-commit run --all-files
## Project Structure

- `book/` - MyST markdown chapters (configured in myst.yml)
-- `src/BetterCodeBetterScience/` - Example Python code referenced in book chapters
+- `src/bettercode/` - Example Python code referenced in book chapters
- `tests/` - Test examples demonstrating testing concepts from the book
- `data/` - Data files for examples
- `scripts/` - Utility scripts
6 changes: 3 additions & 3 deletions book/AI_coding_assistants.md
@@ -96,7 +96,7 @@ def linear_regression_normal_eq(X: np.ndarray, y: np.ndarray) -> np.ndarray:
```

Unlike the previous examples, the code now includes type hints.
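For reference, a minimal implementation consistent with that signature could look like the following. This is a sketch using the normal equations, not necessarily the code that the model produced:

```python
import numpy as np


def linear_regression_normal_eq(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Estimate regression coefficients via the normal equations.

    Solves (X^T X) beta = X^T y with a linear solver rather than
    explicitly inverting X^T X, which is more numerically stable.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```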
-It's always a bad idea to generalize from a single result, so we ran these prompts through ChatGPT 10 times each (using the OpenAI API to generate them programmatically; see the [notebook](../src/BetterCodeBetterScience/incontext_learning_example.ipynb)).
+It's always a bad idea to generalize from a single result, so we ran these prompts through ChatGPT 10 times each (using the OpenAI API to generate them programmatically; see the [notebook](../src/bettercode/incontext_learning_example.ipynb)).
Here are the function signatures generated for each of the 10 runs without mentioning type hints:

```
@@ -272,7 +272,7 @@ In addition to the time and labor of running things by hand, it is also a recipe

You might be asking at this point, "What's an API?" The acronym stands for "Application Programming Interface": a mechanism for programmatically sending commands to, and receiving responses from, a computer system that may be local or remote[^1].
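As a toy illustration of this request/response pattern, the sketch below starts a tiny local API server and queries it, using only the Python standard library (this is not any particular model provider's interface):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """A tiny API endpoint that echoes back the message it receives."""

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        request = json.loads(self.rfile.read(length))
        response = json.dumps({"echo": request["message"]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)

    def log_message(self, *args):  # silence per-request logging
        pass


# Port 0 asks the OS for any free port; the server runs in a background thread
server = HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client sends a command (a JSON payload) and receives a structured response
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}",
    data=json.dumps({"message": "hello"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
server.shutdown()
```

A remote API such as a language model service works the same way, just over the open internet and with authentication.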
To understand this better, let's see how to send a chat command and receive a response from the Claude language model.
-The full outline is in [the notebook](https://github.com/poldrack/BetterCodeBetterScience/blob/main/src/BetterCodeBetterScience/language_model_api_prompting.ipynb).
+The full outline is in [the notebook](https://github.com/poldrack/BetterCodeBetterScience/blob/main/src/bettercode/language_model_api_prompting.ipynb).
Coding agents are very good at generating code to perform API calls, so I used Claude Sonnet 4 to generate the example code in the notebook:

```python
@@ -358,7 +358,7 @@ Let's see how we could get the previous example to return a JSON object containi
Here we will use a function called `send_prompt_to_claude()` that wraps the call to the model object and returns the text from the result:

```python
-from BetterCodeBetterScience.llm_utils import send_prompt_to_claude
+from bettercode.llm_utils import send_prompt_to_claude

json_prompt = """
What is the capital of France?
18 changes: 9 additions & 9 deletions book/data_management.md
@@ -471,7 +471,7 @@ df_merged = pd.concat([df1, df2, df3], ignore_index=True)

The most common file formats are *comma-separated value* (CSV) or *tab-separated value* (TSV) files. Both of these have the benefit of being represented in plain text, so their contents can be easily examined without any special software. I generally prefer to use tabs rather than commas as the separator (or *delimiter*), primarily because tabs more naturally accommodate longer pieces of text that may include commas. Such text can also be stored in CSV, but it requires additional processing to *escape* the embedded commas so that they are not interpreted as delimiters.
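A quick pandas sketch makes the difference concrete (the example data here is illustrative):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "response": ["I plan tasks carefully", "Yes, but only sometimes, honestly"],
})

# CSV: commas inside the text force the writer to quote those fields
csv_text = df.to_csv(index=False)

# TSV: the same text is written as-is, since it contains no tabs
tsv_text = df.to_csv(index=False, sep="\t")

# The TSV round-trips cleanly back into an identical data frame
df_roundtrip = pd.read_csv(StringIO(tsv_text), sep="\t")
```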

-Text file formats like CSV and TSV are nice for their ease of interpretability, but they are highly inefficient for large data compared to optimized file formats, such as the *Parquet* format. To see this in action, I loaded a brain image and saved all of the non-zero data points (857,785 to be exact) to a data frame, which I then saved to CSV and Parquet formats; see [the management notebook](src/BetterCodeBetterScience/data_management.ipynb) for details. Looking at the resulting files, we can see that the Parquet file is only about 20% the size of the CSV file:
+Text file formats like CSV and TSV are nice for their ease of interpretability, but they are highly inefficient for large data compared to optimized file formats, such as the *Parquet* format. To see this in action, I loaded a brain image and saved all of the non-zero data points (857,785 to be exact) to a data frame, which I then saved to CSV and Parquet formats; see [the management notebook](src/bettercode/data_management.ipynb) for details. Looking at the resulting files, we can see that the Parquet file is only about 20% the size of the CSV file:

```bash
➤ du -sk /tmp/brain_tabular.*
@@ -718,7 +718,7 @@ In this section we discuss data organization. The most important principle of da

### File granularity

-One common decision that we need to make when managing data is whether to save data in many smaller files versus fewer larger files. The right answer to this question depends in part on how we will have to access the data. If we only need to access a small portion of the data and we can easily determine which file to open to obtain those data, then it probably makes sense to save many small files. However, if we need to combine data across many small files, then it likely makes sense to save the data as one large file. For example, in the [data management notebook](src/BetterCodeBetterScience/data_management.ipynb) there is an example where we create a large (10000 x 100000) matrix of random numbers, and save them either to a single file or to a separate file for each row. When loading these data, the loading of the single file is about 5 times faster than loading the individual files.
+One common decision that we need to make when managing data is whether to save data in many smaller files versus fewer larger files. The right answer to this question depends in part on how we will have to access the data. If we only need to access a small portion of the data and we can easily determine which file to open to obtain those data, then it probably makes sense to save many small files. However, if we need to combine data across many small files, then it likely makes sense to save the data as one large file. For example, in the [data management notebook](src/bettercode/data_management.ipynb) there is an example where we create a large (10000 x 100000) matrix of random numbers, and save them either to a single file or to a separate file for each row. When loading these data, the loading of the single file is about 5 times faster than loading the individual files.
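The trade-off described above can be sketched at a much smaller scale with NumPy; the matrix size and file names here are illustrative, and actual timings will vary by filesystem:

```python
import tempfile
import time
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((200, 1000))
outdir = Path(tempfile.mkdtemp())

# Save once as a single file, and again as one file per row
np.save(outdir / "all_rows.npy", data)
for i, row in enumerate(data):
    np.save(outdir / f"row_{i:04d}.npy", row)

# Time loading the single large file
start = time.perf_counter()
single = np.load(outdir / "all_rows.npy")
t_single = time.perf_counter() - start

# Time loading and stacking the many small files
start = time.perf_counter()
many = np.vstack([np.load(outdir / f"row_{i:04d}.npy") for i in range(200)])
t_many = time.perf_counter() - start
```

On most systems `t_many` will substantially exceed `t_single`, since each small file incurs its own open/read/close overhead.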

Another consideration about the number of files has to do with storage systems that are commonly used on high-performance computing systems. On these systems, it is common to have separate quotas for total space used (e.g., in terabytes) as well as for the number of *inodes*, which are structures that store information about files and folders on a UNIX filesystem. Thus, generating many small files (e.g., millions) can sometimes cause problems on these systems. For this reason, we generally err on the side of generating fewer larger files versus more smaller files when working on high-performance computing systems.

@@ -1038,7 +1038,7 @@ unlock(ok): my_datalad_repo/data/demographics.csv (file)
We then use a Python script to make the change, which in this case is removing some columns from the dataset:

```bash
-➤ python src/BetterCodeBetterScience/modify_data.py my_datalad_repo/data/demographics.csv
+➤ python src/bettercode/modify_data.py my_datalad_repo/data/demographics.csv

```
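The script itself is not shown in the excerpt; a hypothetical version of such a column-dropping step might look like the sketch below (the function and the column names are assumptions, not the book's actual `modify_data.py`):

```python
import sys

import pandas as pd


def drop_columns(path: str, columns: list[str]) -> None:
    """Remove the given columns from a CSV file, rewriting it in place."""
    df = pd.read_csv(path)
    # errors="ignore" skips any requested column that is already absent
    df = df.drop(columns=columns, errors="ignore")
    df.to_csv(path, index=False)


if __name__ == "__main__":
    # Hypothetical identifying columns to strip from the demographics file
    drop_columns(sys.argv[1], ["address", "phone"])
```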

@@ -1074,15 +1074,15 @@ nothing to save, working tree clean
Although the previous example was meant to provide background on how DataLad works, in practice there is actually a much easier way to accomplish these steps, which is by using the [`datalad run`](https://docs.datalad.org/en/stable/generated/man/datalad-run.html) command. This command will automatically take care of fetching and unlocking the relevant files, running the command, and then committing the files back in, generating a commit message that tracks the specific command that was used:

```bash
-➤ datalad run -i my_datalad_repo/data/demographics.csv -o my_datalad_repo/data/demographics.csv -- uv run src/BetterCodeBetterScience/modify_data.py my_datalad_repo/data/demographics.csv
+➤ datalad run -i my_datalad_repo/data/demographics.csv -o my_datalad_repo/data/demographics.csv -- uv run src/bettercode/modify_data.py my_datalad_repo/data/demographics.csv
[INFO ] Making sure inputs are available (this may take some time)
unlock(ok): my_datalad_repo/data/demographics.csv (file)
[INFO ] == Command start (output follows) =====
Built bettercodebetterscience @ file:///Users/poldrack/Dropbox/code/BetterCode
Uninstalled 1 package in 1ms
Installed 1 package in 1ms
[INFO ] == Command exit (modification check follows) =====
-run(ok): /Users/poldrack/Dropbox/code/BetterCodeBetterScience (dataset) [uv run src/BetterCodeBetterScience/modif...]
+run(ok): /Users/poldrack/Dropbox/code/BetterCodeBetterScience (dataset) [uv run src/bettercode/modif...]
add(ok): data/demographics.csv (file)
save(ok): my_datalad_repo (dataset)
add(ok): my_datalad_repo (dataset)
@@ -1095,12 +1095,12 @@ commit 3ef3b94a0abffec6a8db7570a97339f48ee728ed (HEAD -> text/datamgmt-Nov3)
Author: Russell Poldrack <poldrack@gmail.com>
Date: Mon Dec 15 13:28:06 2025 -0800

-    [DATALAD RUNCMD] uv run src/BetterCodeBetterScience/modif...
+    [DATALAD RUNCMD] uv run src/bettercode/modif...

=== Do not change lines below ===
{
"chain": [],
-    "cmd": "uv run src/BetterCodeBetterScience/modify_data.py my_datalad_repo/data/demographics.csv",
+    "cmd": "uv run src/bettercode/modify_data.py my_datalad_repo/data/demographics.csv",
"exit": 0,
"extra_inputs": [],
"inputs": [
@@ -1220,7 +1220,7 @@ The question that I will ask is as follows: How well can the biological similari
- A dataset of genome-wide association study (GWAS) results for specific traits obtained from [here](https://www.ebi.ac.uk/gwas/docs/file-downloads).
- Abstracts that refer to each of the traits identified in the GWAS result, obtained from the [PubMed](https://pubmed.ncbi.nlm.nih.gov/) database.

-I will not present all of the code for each step; this can be found [here](src/BetterCodeBetterScience/database_example_funcs.py) and [here](src/BetterCodeBetterScience/database.py). Rather, I will show portions that are particularly relevant to the databases being used.
+I will not present all of the code for each step; this can be found [here](src/bettercode/database_example_funcs.py) and [here](src/bettercode/database.py). Rather, I will show portions that are particularly relevant to the databases being used.

### Adding GWAS data to a document store

@@ -1236,7 +1236,7 @@ In this case, looking at the data we see that several columns contain multiple v
gwas_data = get_exploded_gwas_data()
```
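While `get_exploded_gwas_data()` is defined elsewhere, the core pandas operation it relies on can be sketched as follows (the column names and values are illustrative, not the actual GWAS schema):

```python
import pandas as pd

# One row per trait, with multiple reported genes packed into a single field
gwas_raw = pd.DataFrame({
    "trait": ["height", "BMI"],
    "reported_genes": ["HMGA2, GDF5", "FTO"],
})

# Split the delimited field into lists, then explode to one gene per row
gwas_long = (
    gwas_raw
    .assign(reported_genes=gwas_raw["reported_genes"].str.split(", "))
    .explode("reported_genes")
    .reset_index(drop=True)
)
```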

-We can now import the data from this data frame into a MongoDB collection, mapping each unique trait to the genes that are reported as being associated with it. First I generated a separate function that sets up a MongoDB collection (see `setup_mongo_collection` [here](src/BetterCodeBetterScience/database.py)). We can then use that function to set up our gene set collection:
+We can now import the data from this data frame into a MongoDB collection, mapping each unique trait to the genes that are reported as being associated with it. First I generated a separate function that sets up a MongoDB collection (see `setup_mongo_collection` [here](src/bettercode/database.py)). We can then use that function to set up our gene set collection:


```python
4 changes: 2 additions & 2 deletions book/project_organization.md
@@ -237,7 +237,7 @@ A final way that one might use notebooks is as a way to create standalone progra

It's very common for researchers to use different coding languages to solve different problems. A common use case is the Python user who wishes to take advantage of the much wider range of statistical methods that are implemented in R. There is a package called `rpy2` that allows this within pure Python code, but it can be cumbersome to work with, particularly due to the need to convert complex data types. Fortunately, Jupyter notebooks provide a convenient solution to this problem, via [*magic* commands](https://scipy-ipython.readthedocs.io/en/latest/interactive/magics.html). These are commands that start with either a `%` (for line commands) or `%%` (for cell commands), and they enable additional functionality.
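For instance, once the rpy2 extension is loaded, a cell-level `%%R` magic can push a pandas data frame into R, run R code on it, and pull results back. The sketch below is illustrative only; the data frame `df`, the variable names, and the model formula are assumptions, not the book's example:

```
%%R -i df -o coefs
# df is transferred from Python into R; coefs is transferred back to Python
fit <- lm(total_score ~ age, data = df)
coefs <- coef(fit)
```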

-An example of this can be seen in the [mixing_languages.ipynb](src/BetterCodeBetterScience/notebooks/mixing_languages.ipynb) notebook, in which we load and preprocess some data using Python and then use R magic commands to analyze the data using a package only available within R. In this example, we will work with data from a study published by our laboratory (Eisenberg et al., 2019), in which 522 people completed a large battery of psychological tests and surveys. We will focus here on the responses to a survey known as the "Barratt Impulsiveness Scale" which includes 30 questions related to different aspects of the psychological construct of "impulsiveness"; for example, "I say things without thinking" or "I plan tasks carefully". Each participant rated each of these statements on a four-point scale from 'Rarely/Never' to 'Almost Always/Always'; the scores were coded so that the number 1 always represented the most impulsive choice and 4 represented the most self-controlled choice.
+An example of this can be seen in the [mixing_languages.ipynb](src/bettercode/notebooks/mixing_languages.ipynb) notebook, in which we load and preprocess some data using Python and then use R magic commands to analyze the data using a package only available within R. In this example, we will work with data from a study published by our laboratory (Eisenberg et al., 2019), in which 522 people completed a large battery of psychological tests and surveys. We will focus here on the responses to a survey known as the "Barratt Impulsiveness Scale" which includes 30 questions related to different aspects of the psychological construct of "impulsiveness"; for example, "I say things without thinking" or "I plan tasks carefully". Each participant rated each of these statements on a four-point scale from 'Rarely/Never' to 'Almost Always/Always'; the scores were coded so that the number 1 always represented the most impulsive choice and 4 represented the most self-controlled choice.

In order to enable the R magic commands, we first need to load the rpy2 extension for Jupyter:

@@ -526,7 +526,7 @@ test output from container
To create a reproducible software execution environment, we will often need to create our own new Docker image that contains the necessary dependencies and application code. AI coding tools are generally quite good at creating the required `Dockerfile` that defines the image. We use the following prompt to Claude Sonnet 4:

```
-I would like to generate a Dockerfile to define a Docker image based on the python:3.13.9 image. The Python package wonderwords should be installed from PyPi. A local Python script should be created that creates a random sentence using wonderwords.RandomSentence() and prints it. This script should be the entrypoint for the Docker container. Create this within src/BetterCodeBetterScience/docker-example inside the current project. Do not create a new workspace - use the existing workspace for this project.
+I would like to generate a Dockerfile to define a Docker image based on the python:3.13.9 image. The Python package wonderwords should be installed from PyPi. A local Python script should be created that creates a random sentence using wonderwords.RandomSentence() and prints it. This script should be the entrypoint for the Docker container. Create this within src/bettercode/docker-example inside the current project. Do not create a new workspace - use the existing workspace for this project.
```

Here is the content of the resulting `Dockerfile`:
18 changes: 9 additions & 9 deletions book/software_engineering.md
@@ -758,7 +758,7 @@ C = 299792458
We could then import this from our module within the iPython shell:

```
-In: from BetterCodeBetterScience.constants import C
+In: from bettercode.constants import C

In: C
Out: 299792458
@@ -793,7 +793,7 @@ class Constants:
Then within our iPython shell, we generate an instance of the Constants class, and see what happens if we try to change the value once it's instantiated:

```
-In: from BetterCodeBetterScience.constants import Constants
+In: from bettercode.constants import Constants

In: constants = Constants()

@@ -806,7 +806,7 @@ AttributeError Traceback (most recent call last)
Cell In[4], line 1
----> 1 constants.C = 42

-File ~/Dropbox/code/BetterCodeBetterScience/src/BetterCodeBetterScience/constants.py:11, in Constants.__setattr__(self, name, value)
+File ~/Dropbox/code/BetterCodeBetterScience/src/bettercode/constants.py:11, in Constants.__setattr__(self, name, value)
10 def __setattr__(self, name, value):
---> 11 raise AttributeError("Constants cannot be modified")

@@ -847,8 +847,8 @@ We see that `ruff` detects both formatting problems (such as the lack of spaces
We can also use `ruff` from the command line to detect and fix code problems:

```bash
-❯ ruff check src/BetterCodeBetterScience/formatting_example.py
-src/BetterCodeBetterScience/formatting_example.py:6:1: F403 `from numpy.random import *` used; unable to detect undefined names
+❯ ruff check src/bettercode/formatting_example.py
+src/bettercode/formatting_example.py:6:1: F403 `from numpy.random import *` used; unable to detect undefined names
|
4 | # Poorly formatted code for linting example
5 |
@@ -858,7 +858,7 @@ src/BetterCodeBetterScience/formatting_example.py:6:1: F403 `from numpy.random i
8 | mynum=randint(0,100)
|

-src/BetterCodeBetterScience/formatting_example.py:8:7: F405 `randint` may be undefined, or defined from star imports
+src/bettercode/formatting_example.py:8:7: F405 `randint` may be undefined, or defined from star imports
|
6 | from numpy.random import *
7 |
@@ -872,12 +872,12 @@ Found 2 errors.
Most linters can also automatically fix the issues that they detect in the code. `ruff` modifies the file in place, so we will first create a copy (so that our original remains intact) and then run the formatter on that copy:

```bash
-❯ cp src/BetterCodeBetterScience/formatting_example.py src/BetterCodeBetterScience/formatting_example_ruff.py
+❯ cp src/bettercode/formatting_example.py src/bettercode/formatting_example_ruff.py

-❯ ruff format src/BetterCodeBetterScience/formatting_example_ruff.py
+❯ ruff format src/bettercode/formatting_example_ruff.py
1 file reformatted

-❯ diff src/BetterCodeBetterScience/formatting_example.py src/BetterCodeBetterScience/formatting_example_ruff.py
+❯ diff src/bettercode/formatting_example.py src/bettercode/formatting_example_ruff.py
1,3d0
<
<