This repository was archived by the owner on May 17, 2024. It is now read-only.
README.md: 69 additions & 26 deletions
@@ -8,33 +8,58 @@ data-diff: compare datasets fast, within or across SQL databases
<br>

+# How it works

-# Use cases
+When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:

-## Data Migration & Replication Testing
-Compare source to target and check for discrepancies when moving data between systems:
-- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
-- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
-- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
+## joindiff
+- Recommended for comparing data within the same database
+- Uses the outer join operation to diff the rows as efficiently as possible within the same database
+- Fully relies on the underlying database engine for computation
+- Requires both datasets to be queryable with a single SQL query
+- Time complexity approximates that of a JOIN operation and is largely independent of the number of differences in the dataset
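
The outer-join idea in the list above can be illustrated with a small, self-contained sketch. It is not `data-diff`'s own implementation: the function `join_diff`, its parameters, and the single compared column are hypothetical names chosen for this example. It uses Python's built-in `sqlite3` module and emulates a full outer join with two `LEFT JOIN`s, because older SQLite versions lack `FULL OUTER JOIN`.

```python
import sqlite3


def join_diff(conn, table_a, table_b, key, column):
    """Return (key, value_in_a, value_in_b) for every row that differs.

    Hypothetical helper for illustration only. Table and column names are
    interpolated directly into the SQL, so they must come from trusted input.
    """
    query = f"""
        -- Rows missing from table_b, or present in both with different values.
        -- IS NOT is SQLite's NULL-safe inequality operator.
        SELECT a.{key}, a.{column}, b.{column}
        FROM {table_a} AS a
        LEFT JOIN {table_b} AS b ON a.{key} = b.{key}
        WHERE b.{key} IS NULL OR a.{column} IS NOT b.{column}
        UNION ALL
        -- Rows that exist only in table_b.
        SELECT b.{key}, NULL, b.{column}
        FROM {table_b} AS b
        LEFT JOIN {table_a} AS a ON a.{key} = b.{key}
        WHERE a.{key} IS NULL
        ORDER BY 1
    """
    return conn.execute(query).fetchall()
```

The database engine does all the comparison work in a single query, which is why this approach needs both tables to be reachable from the same connection.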
+
+## hashdiff
+- Recommended for comparing datasets across different databases
+- Can also be helpful in diffing very large tables with few expected differences within the same database
+- Employs a divide-and-conquer algorithm based on hashing and binary search
+- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
+- Time complexity approximates that of a COUNT(*) operation when there are few differences
+- Performance degrades when datasets have a large number of differences
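
The divide-and-conquer idea in the list above can be sketched in a few lines. This is a simplified in-memory illustration, not `data-diff`'s actual algorithm: each "table" is assumed to be a Python dict mapping an integer key to a row value, and the names `segment_checksum`, `hash_diff`, and `threshold` are hypothetical. In a real cross-database setting, each checksum would be computed by the database that holds the data, so only hashes travel over the network.

```python
import hashlib


def segment_checksum(rows, lo, hi):
    """Hash every (key, value) pair with lo <= key < hi, in key order."""
    digest = hashlib.md5()
    for key in sorted(k for k in rows if lo <= k < hi):
        digest.update(f"{key!r}={rows[key]!r};".encode())
    return digest.hexdigest()


def hash_diff(a, b, lo, hi, threshold=4):
    """Return (key, value_in_a, value_in_b) for differing keys in [lo, hi)."""
    # Matching checksums: treat the whole segment as identical and skip it.
    if segment_checksum(a, lo, hi) == segment_checksum(b, lo, hi):
        return []
    # Small segment: compare the rows directly.
    if hi - lo <= threshold:
        keys = sorted(k for k in set(a) | set(b) if lo <= k < hi)
        return [(k, a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)]
    # Otherwise bisect the key range and recurse into both halves.
    mid = (lo + hi) // 2
    return hash_diff(a, b, lo, mid, threshold) + hash_diff(a, b, mid, hi, threshold)
```

When few rows differ, most segments are dismissed after a single checksum comparison. When many rows differ, nearly every segment has to be split down to the threshold, which matches the performance caveat in the list above.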
+
+More information about the algorithm and performance considerations can be found in the [technical explanation](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md).

+# Get started

Install `data-diff` with specific database adapters, e.g.: