This repository was archived by the owner on May 17, 2024. It is now read-only.

Commit a920ed2

Merge branch 'master' into issue_479_2
2 parents 5454310 + 4c68554 commit a920ed2

22 files changed: 896 additions and 401 deletions.

README.md

Lines changed: 60 additions & 87 deletions
@@ -2,136 +2,109 @@
 <img alt="Datafold" src="https://user-images.githubusercontent.com/1799931/196497110-d3de1113-a97f-4322-b531-026d859b867a.png" width="50%" />
 </p>
 
-# **data-diff**
+<h1 align="center">
+data-diff
+</h1>
+
+<h2 align="center">
+Develop dbt models faster by testing as you code.
+</h2>
+<h4 align="center">
+See how every change to dbt code affects the data produced in the modified model and downstream.
+</h4>
+<br>
 
 ## What is `data-diff`?
-data-diff is a **free, open-source tool** that enables data professionals to detect differences in values between any two tables.
 
-## Documentation
+data-diff is an open source package that you can use to see the impact of your dbt code changes on your dbt models as you code.
 
-[**🗎 Documentation**](https://docs.datafold.com/guides/os_data_diff) - our detailed documentation has everything you need to start diffing.
+<div align="center">
 
-### Databases we support
+![development_testing_gif](https://user-images.githubusercontent.com/1799931/236354286-d1d044cf-2168-4128-8a21-8c8ca7fd494c.gif)
 
-- PostgreSQL >=10
-- MySQL
-- Snowflake
-- BigQuery
-- Redshift
-- Oracle
-- Presto
-- Databricks
-- Trino
-- Clickhouse
-- Vertica
-- DuckDB >=0.6
-- SQLite (coming soon)
+</div>
 
-For their corresponding connection strings, check out our [detailed table](https://github.com/datafold/data-diff/blob/master/docs/supported-databases.md).
+<br>
 
-#### Looking for a database not on the list?
-If a database is not on the list, we'd still love to support it. [Please open an issue](https://github.com/datafold/data-diff/issues) to discuss it, or vote on existing requests to push them up our todo list.
+## Getting Started
 
-## Get started
+**Install `data-diff`**
 
-### Installation
-
-#### First, install `data-diff` using `pip`.
+Install `data-diff` with the command that is specific to the database you use with dbt.
 
+### Snowflake
 ```
-pip install data-diff
+pip install data-diff 'data-diff[snowflake,dbt]' -U
 ```
 
-#### Then, install one or more driver(s) specific to the database(s) you want to connect to.
-
-- `pip install 'data-diff[mysql]'`
-
-- `pip install 'data-diff[postgresql]'`
-
-- `pip install 'data-diff[snowflake]'`
-
-- `pip install 'data-diff[presto]'`
-
-- `pip install 'data-diff[oracle]'`
-
-- `pip install 'data-diff[trino]'`
-
-- `pip install 'data-diff[clickhouse]'`
-
-- `pip install 'data-diff[vertica]'`
-
-- For BigQuery, see: https://pypi.org/project/google-cloud-bigquery/
-
-_Some drivers have dependencies that cannot be installed using `pip` and still need to be installed manually._
-
-### Run your first diff
+### BigQuery
+```
+pip install data-diff 'data-diff[dbt]' google-cloud-bigquery -U
+```
 
-Once you've installed `data-diff`, you can run it from the command line.
+### Redshift
+```
+pip install data-diff 'data-diff[redshift,dbt]' -U
+```
 
+### Postgres
 ```
-data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]
+pip install data-diff 'data-diff[postgres,dbt]' -U
 ```
 
-Be sure to read [the docs](https://docs.datafold.com/reference/open_source/cli) for detailed instructions how to build one of these commands depending on your database setup.
+### Databricks
+```
+pip install data-diff 'data-diff[databricks,dbt]' -U
+```
 
-#### Code Example: Diff Tables Between Databases
-Here's an example command for your copy/pasting, taken from the screenshot above when we diffed data between Snowflake and Postgres.
+### DuckDB
+```
+pip install data-diff 'data-diff[duckdb,dbt]' -U
+```
 
+**Update a few lines in your `dbt_project.yml`**.
 ```
-data-diff \
-postgresql://<username>:'<password>'@localhost:5432/<database> \
-<table> \
-"snowflake://<username>:<password>@<password>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
-<TABLE> \
--k activity_id \
--c activity \
--w "event_timestamp < '2022-10-10'"
+#dbt_project.yml
+vars:
+  data_diff:
+    prod_database: my_database
+    prod_schema: my_default_schema
 ```
 
-#### Code Example: Diff Tables Within a Database
+**Run your first data diff!**
 
 ```
-data-diff \
-"snowflake://<username>:<password>@<password>/<DATABASE>/<SCHEMA_1>?warehouse=<WAREHOUSE>&role=<ROLE>" <TABLE_1> \
-<SCHEMA_2>.<TABLE_2> \
--k org_id \
--c created_at -c is_internal \
--w "org_id != 1 and org_id < 2000" \
--m test_results_%t \
---materialize-all-rows \
---table-write-limit 10000
+dbt run && data-diff --dbt
 ```
 
-In both code examples, I've used `<>` carrots to represent values that **should be replaced with your values** in the database connection strings. For the flags (`-k`, `-c`, etc.), I opted for "real" values (`org_id`, `is_internal`) to give you a more realistic view of what your command will look like.
+We recommend you get started by walking through [our simple setup instructions](https://docs.datafold.com/development_testing/open_source) which contain examples and details.
+
+Please reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) if you have any trouble whatsoever getting started!
 
-### We're here to help!
+<br><br>
 
-We're here to help! Please post any questions in [GitHub Discussions](https://github.com/datafold/data-diff/discussions).
+### Diffing between databases
 
-## How to Use
+Check out our [documentation](https://github.com/datafold/data-diff/blob/master/docs/supported-databases.md) if you're looking to compare data across databases (for example, between Postgres and Snowflake).
 
-* [Examples with dbt, joindiff, and hashdiff](https://docs.datafold.com/reference/open_source/cli#examples)
-* [Examples with Python](https://data-diff.readthedocs.io/en/latest/python-api.html)
-* [How to use with TOML configuration file](https://docs.datafold.com/reference/open_source/cli#toml-config-file)
+<br>
 
-## How to Contribute
-* Feel free to open an issue or contribute to the project by working on an existing issue.
-* Please read the [contributing guidelines](https://github.com/datafold/data-diff/blob/master/CONTRIBUTING.md) to get started.
-* To add a new database driver, check out [docs](https://github.com/datafold/data-diff/blob/master/docs/new-database-driver-guide.rst).
+## Contributors
 
-Big thanks to everyone who contributed so far:
+We thank everyone who contributed so far!
 
 <a href="https://github.com/datafold/data-diff/graphs/contributors">
 <img src="https://contributors-img.web.app/image?repo=datafold/data-diff" />
 </a>
 
-## Technical Explanation
-
-Check out this [technical explanation](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md) of how data-diff works.
+<br>
 
 ## Analytics
+
 * [Usage Analytics & Data Privacy](https://github.com/datafold/data-diff/blob/master/docs/usage_analytics.md)
 
+<br>
+
 ## License
 
 This project is licensed under the terms of the [MIT License](https://github.com/datafold/data-diff/blob/master/LICENSE).
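
The `vars` block that the new README asks users to add to `dbt_project.yml` can be read as a nested mapping. The sketch below shows one way a tool might look up `prod_database` and `prod_schema` from an already-parsed project file. It is an illustration only, not data-diff's actual loader: the function name `resolve_prod_target` is hypothetical, the project is passed in as a plain dict so no YAML library is needed, and treating both keys as required is an assumption.

```python
from typing import Tuple


def resolve_prod_target(project: dict) -> Tuple[str, str]:
    """Return (prod_database, prod_schema) from a parsed dbt_project.yml mapping.

    Hypothetical helper: treating both keys as required is an assumption
    of this sketch, not something the README states.
    """
    # An empty `vars:` key parses to None in YAML, so fall back to {}.
    data_diff_vars = (project.get("vars") or {}).get("data_diff") or {}
    missing = [
        key for key in ("prod_database", "prod_schema") if key not in data_diff_vars
    ]
    if missing:
        raise ValueError(f"missing data_diff vars: {', '.join(missing)}")
    return data_diff_vars["prod_database"], data_diff_vars["prod_schema"]


# The mapping below mirrors the README's example values.
example_project = {
    "vars": {
        "data_diff": {
            "prod_database": "my_database",
            "prod_schema": "my_default_schema",
        }
    }
}
prod_database, prod_schema = resolve_prod_target(example_project)
```

Failing early on a missing key is a design choice of the sketch: it surfaces a misconfigured `dbt_project.yml` before any query runs.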

data_diff/__main__.py

Lines changed: 9 additions & 0 deletions
@@ -228,6 +228,13 @@ def write_usage(self, prog: str, args: str = "", prefix: Optional[str] = None) -
     metavar="PATH",
     help="Which directory to look in for the dbt_project.yml file. Default is the current working directory and its parents.",
 )
+@click.option(
+    "--select",
+    "-s",
+    default=None,
+    metavar="PATH",
+    help="select dbt resources to compare using dbt selection syntax",
+)
 def main(conf, run, **kw):
     if kw["table2"] is None and kw["database2"]:
         # Use the "database table table" form
@@ -264,6 +271,7 @@ def main(conf, run, **kw):
             profiles_dir_override=kw["dbt_profiles_dir"],
             project_dir_override=kw["dbt_project_dir"],
             is_cloud=kw["cloud"],
+            dbt_selection=kw["select"],
         )
     else:
         return _data_diff(**kw)
@@ -306,6 +314,7 @@ def _data_diff(
     cloud,
     dbt_profiles_dir,
     dbt_project_dir,
+    select,
     threads1=None,
     threads2=None,
     __conf__=None,
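
The three hunks above fit together as follows: the new option lands in `kw["select"]`, the dbt branch forwards it as `dbt_selection`, and `_data_diff` has to accept `select` because `main` calls it with `**kw`. The sketch below reproduces that flow with `argparse` from the standard library rather than `click`, so it runs without extra packages. The stub bodies of `dbt_diff` and `_data_diff`, their return values, the `--dbt` flag definition, and the `SELECTION` metavar are assumptions made for illustration (the commit itself uses `metavar="PATH"`).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="data-diff-sketch")
    parser.add_argument("--dbt", action="store_true", help="run in dbt mode")
    parser.add_argument(
        "--select",
        "-s",
        default=None,
        metavar="SELECTION",  # assumption: the commit uses metavar="PATH"
        help="select dbt resources to compare using dbt selection syntax",
    )
    return parser


def dbt_diff(dbt_selection=None):
    """Hypothetical stand-in for the dbt code path."""
    return {"mode": "dbt", "dbt_selection": dbt_selection}


def _data_diff(dbt, select):
    """Hypothetical stand-in for the plain diff path.

    `select` is unused here but must be in the signature, because
    main() forwards every parsed option via **kw.
    """
    return {"mode": "plain"}


def main(argv):
    kw = vars(build_parser().parse_args(argv))
    if kw["dbt"]:
        return dbt_diff(dbt_selection=kw["select"])
    return _data_diff(**kw)
```

Calling `main(["--dbt", "-s", "my_model+"])` takes the dbt branch with the selection string passed through unchanged, while `main([])` falls through to the plain path. Removing `select` from the `_data_diff` signature would make the second call raise a `TypeError`, which is why the third hunk is needed.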

data_diff/cloud/datafold_api.py

Lines changed: 2 additions & 0 deletions
@@ -103,6 +103,8 @@ class TCloudApiDataDiff(pydantic.BaseModel):
     pk_columns: List[str]
     filter1: Optional[str] = None
     filter2: Optional[str] = None
+    include_columns: Optional[List[str]]
+    exclude_columns: Optional[List[str]]
 
 
 class TSummaryResultPrimaryKeyStats(pydantic.BaseModel):
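
The two new fields are annotated `Optional[List[str]]` without an explicit default, unlike `filter1` and `filter2`. In pydantic v1 such a field implicitly defaults to `None`, whereas pydantic v2 treats it as required; the diff does not show which version the project pins. The sketch below uses a standard-library dataclass as a stand-in for the pydantic model, with the defaults written out explicitly. The class name `DiffRequestSketch` and the `to_payload` helper are hypothetical, and dropping unset fields from the payload is an assumption about how a client might serialize the request.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class DiffRequestSketch:
    """Stdlib stand-in for the pydantic model; not the project's class."""

    pk_columns: List[str]
    filter1: Optional[str] = None
    filter2: Optional[str] = None
    # Explicit defaults keep the fields optional regardless of library version.
    include_columns: Optional[List[str]] = None
    exclude_columns: Optional[List[str]] = None


def to_payload(request: DiffRequestSketch) -> dict:
    """Drop unset (None) fields; an assumption of this sketch."""
    return {key: value for key, value in asdict(request).items() if value is not None}


request = DiffRequestSketch(pk_columns=["id"], exclude_columns=["updated_at"])
payload = to_payload(request)
```

With explicit `= None` defaults, callers that never set `include_columns` or `exclude_columns` keep working whichever validation rules apply.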
