This repository was archived by the owner on May 17, 2024. It is now read-only.

Commit cf4ce2d

Update README.md
Add information about algorithms to README
1 parent b1f0780 commit cf4ce2d

1 file changed

+69
-26
lines changed

README.md

Lines changed: 69 additions & 26 deletions
Original file line number | Diff line number | Diff line change
@@ -8,33 +8,58 @@ data-diff: compare datasets fast, within or across SQL databases
88

99
<br>
1010

11+
# How it works
1112

12-
# Use cases
13+
When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:
1314

14-
## Data Migration & Replication Testing
15-
Compare source to target and check for discrepancies when moving data between systems:
16-
- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
17-
- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
18-
- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
15+
## joindiff
16+
- Recommended for comparing data within the same database
17+
- Uses the outer join operation to diff the rows as efficiently as possible within the same database
18+
- Fully relies on the underlying database engine for computation
19+
- Requires both datasets to be queryable with a single SQL query
20+
- Time complexity approximates that of a JOIN operation and is largely independent of the number of differences in the dataset
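
The join-based approach described above can be illustrated with a minimal sketch. This is not `data-diff`'s implementation: it uses Python's built-in `sqlite3` module, and the function `join_diff`, the tables `src` and `dst`, and their columns are hypothetical names chosen for the example. Because older SQLite versions lack `FULL OUTER JOIN`, the sketch emulates it with two `LEFT JOIN`s.

```python
import sqlite3


def join_diff(conn, table_a, table_b, key, column):
    """Return (key, value_in_a, value_in_b) for rows that differ or are missing.

    Identifiers are interpolated directly, so they must come from trusted input.
    """
    query = f"""
        -- Rows present in a: missing from b, or with a different value
        SELECT a.{key}, a.{column}, b.{column}
        FROM {table_a} AS a LEFT JOIN {table_b} AS b ON a.{key} = b.{key}
        WHERE b.{key} IS NULL OR a.{column} IS NOT b.{column}
        UNION ALL
        -- Rows present only in b
        SELECT b.{key}, NULL, b.{column}
        FROM {table_b} AS b LEFT JOIN {table_a} AS a ON a.{key} = b.{key}
        WHERE a.{key} IS NULL
        ORDER BY 1
    """
    return conn.execute(query).fetchall()


# Hypothetical sample data: both tables live in the same database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER PRIMARY KEY, activity TEXT);
    CREATE TABLE dst (id INTEGER PRIMARY KEY, activity TEXT);
    INSERT INTO src VALUES (1, 'run'), (2, 'swim'), (3, 'bike');
    INSERT INTO dst VALUES (1, 'run'), (2, 'walk'), (4, 'row');
""")
diff = join_diff(conn, "src", "dst", "id", "activity")
print(diff)
```

All of the comparison work happens inside the database engine in a single query, which is why this mode needs both datasets to be reachable from one connection.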
21+
22+
## hashdiff
23+
- Recommended for comparing datasets across different databases
24+
- Can also be helpful in diffing very large tables with few expected differences within the same database
25+
- Employs a divide-and-conquer algorithm based on hashing and binary search
26+
- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
27+
- Time complexity approximates that of a COUNT(*) operation when there are few differences
28+
- Performance degrades when datasets have a large number of differences
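
To make the divide-and-conquer idea above concrete, here is a simplified sketch under stated assumptions. It is not `data-diff`'s actual code: two in-memory dictionaries stand in for tables in separate databases, keys are assumed to be integers, and `checksum`, `hash_diff`, and the `threshold` parameter are hypothetical names. In a real setting, each `checksum` call would be an aggregate query executed by the database.

```python
import hashlib


def checksum(rows, lo, hi):
    """Return (row count, hash) for rows whose key falls in [lo, hi)."""
    segment = sorted((k, v) for k, v in rows.items() if lo <= k < hi)
    digest = hashlib.md5(repr(segment).encode()).hexdigest()
    return len(segment), digest


def hash_diff(a, b, lo, hi, threshold=2):
    """Return (key, value_in_a, value_in_b) for differing rows in [lo, hi)."""
    count_a, hash_a = checksum(a, lo, hi)
    count_b, hash_b = checksum(b, lo, hi)
    if hash_a == hash_b:
        return []  # Segments match, so nothing below this range is inspected
    if max(count_a, count_b) <= threshold or hi - lo <= 1:
        # Small segment: fetch the rows and compare them directly
        keys = sorted(k for k in set(a) | set(b) if lo <= k < hi)
        return [(k, a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)]
    mid = (lo + hi) // 2  # Split the key range in half and recurse
    return hash_diff(a, b, lo, mid, threshold) + hash_diff(a, b, mid, hi, threshold)


# Hypothetical sample data: one changed row, one missing row, one extra row.
table_a = {i: f"row-{i}" for i in range(1, 17)}
table_b = dict(table_a)
table_b[5] = "changed"
del table_b[12]
table_b[20] = "new"

result = hash_diff(table_a, table_b, 1, 21)
print(result)
```

When the tables are identical, the first checksum comparison ends the search. Every differing segment forces further splitting, which is consistent with the note that performance degrades as the number of differences grows.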
29+
30+
More information about the algorithm and performance considerations can be found [here](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md)
1931

32+
# Get started
2033

2134
Install `data-diff` with specific database adapters, e.g.:
2235

2336
```
2437
pip install data-diff 'data-diff[postgresql,snowflake]' -U
2538
```
26-
Run `data-diff` with connection URIs to compare tables:
39+
40+
Run `data-diff` with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
2741
```
2842
data-diff \
2943
postgresql://<username>:'<password>'@localhost:5432/<database> \
3044
<table> \
3145
"snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
3246
<TABLE> \
33-
-k activity_id \
34-
-c activity \
35-
-w "event_timestamp < '2022-10-10'"
47+
-k <primary key column> \
48+
-c <columns to compare> \
49+
-w <filter condition>
3650
```
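
When the command is assembled from a script, quoting the filter condition correctly is the easiest part to get wrong. The following sketch builds the argument list in Python using only the options shown above (`-k`, `-c`, `-w`). The helper `build_command`, the connection URIs, and the table names are hypothetical placeholders, and the sketch assumes a single compared column.

```python
import shlex


def build_command(db1, table1, db2, table2, key, column, where=None):
    """Build a data-diff argument list from the options shown in the example."""
    args = ["data-diff", db1, table1, db2, table2, "-k", key, "-c", column]
    if where:
        # Passed as one argument, so no manual shell quoting is needed
        args += ["-w", where]
    return args


cmd = build_command(
    "postgresql://user:secret@localhost:5432/db",  # hypothetical URI
    "activity_log",                                # hypothetical table
    "snowflake://user:secret@account/DB/SCHEMA?warehouse=WH&role=ROLE",
    "ACTIVITY_LOG",
    key="activity_id",
    column="activity",
    where="event_timestamp < '2022-10-10'",
)

# Shell-safe rendering for logs or copy-paste
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)  (requires data-diff installed)
```

Passing the list to `subprocess.run` avoids the shell entirely, so the `&` characters in the Snowflake URI and the quotes in the filter need no escaping.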
37-
Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for full command reference.
51+
52+
Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.
53+
54+
55+
# Use cases
56+
57+
## Data Migration & Replication Testing
58+
Compare source to target and check for discrepancies when moving data between systems:
59+
- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
60+
- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
61+
- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
62+
3863

3964
## Data Development Testing
4065
Test SQL code and preview changes by comparing development/staging environment data to production:
@@ -54,21 +79,39 @@ Test SQL code and preview changes by comparing development/staging environment d
5479

5580
Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support
5681

57-
## Supported databases
58-
59-
- PostgreSQL >=10
60-
- MySQL
61-
- Snowflake
62-
- BigQuery
63-
- Redshift
64-
- Oracle
65-
- Presto
66-
- Databricks
67-
- Trino
68-
- Clickhouse
69-
- Vertica
70-
- DuckDB >=0.6
71-
- SQLite (coming soon)
82+
# Supported databases
83+
84+
85+
| Database | Status | Connection string |
86+
|---------------|-------------------------------------------------------------------------------------------------------------------------------------|--------|
87+
| PostgreSQL >=10 | 💚 | `postgresql://<user>:<password>@<host>:5432/<database>` |
88+
| MySQL | 💚 | `mysql://<user>:<password>@<hostname>:3306/<database>` |
89+
| Snowflake | 💚 | `"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"` |
90+
| BigQuery | 💚 | `bigquery://<project>/<dataset>` |
91+
| Redshift | 💚 | `redshift://<username>:<password>@<hostname>:5439/<database>` |
92+
| Oracle | 💛 | `oracle://<username>:<password>@<hostname>/<database>` |
93+
| Presto | 💛 | `presto://<username>:<password>@<hostname>:8080/<database>` |
94+
| Databricks | 💛 | `databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>` |
95+
| Trino | 💛 | `trino://<username>:<password>@<hostname>:8080/<database>` |
96+
| Clickhouse | 💛 | `clickhouse://<username>:<password>@<hostname>:9000/<database>` |
97+
| Vertica | 💛 | `vertica://<username>:<password>@<hostname>:5433/<database>` |
98+
| DuckDB | 💛 | |
99+
| ElasticSearch | 📝 | |
100+
| Planetscale | 📝 | |
101+
| Pinot | 📝 | |
102+
| Druid | 📝 | |
103+
| Kafka | 📝 | |
104+
| SQLite | 📝 | |
105+
106+
* 💚: Implemented and thoroughly tested.
107+
* 💛: Implemented, but not thoroughly tested yet.
108+
* ⏳: Implementation in progress.
109+
* 📝: Implementation planned. Contributions welcome.
110+
111+
Your database not listed here?
112+
113+
- Contribute a [new database adapter](https://github.com/datafold/data-diff/blob/master/docs/new-database-driver-guide.rst). We accept pull requests!
114+
- [Get in touch](https://www.datafold.com/demo) about enterprise support and adding new adapters and features
72115

73116

74117
<br>

0 commit comments
