19 changes: 11 additions & 8 deletions docs/user_guides/fs/data_source/usage.md
@@ -22,7 +22,7 @@ We retrieve a data source simply by its unique name.
project = hopsworks.login()
feature_store = project.get_feature_store()
# Retrieve data source
-connector = feature_store.get_storage_connector('data_source_name')
+ds = feature_store.get_data_source('data_source_name')
```

=== "Scala"
@@ -119,17 +119,20 @@ Another important aspect of a data source is its ability to facilitate creation
The `Connector API` relies on data sources behind the scenes to integrate with external data sources.
This enables seamless integration with any external source, as long as a data source has been defined for it.

-To create an external feature group, we use the `create_external_feature_group` API, also known as `Connector API`, and simply pass the data source created before to the `storage_connector` argument.
+To create an external feature group, we use the `create_external_feature_group` API, also known as the `Connector API`, and simply pass the data source created earlier to the `data_source` argument.
Depending on the external source, we should set either the `query` argument for data warehouse based sources, or the `path` and `data_format` arguments for data lake based sources, similar to reading into dataframes as explained in the section above.

-Example for any data warehouse/SQL based external sources, we set the desired SQL to `query` argument, and set the `storage_connector` argument to the data source object of desired data source.
+For example, for any data warehouse/SQL based external source, we set the desired SQL in the `query` argument and set the `data_source` argument to the desired data source object.

=== "PySpark"
    ```python
+   ds.query = "SELECT * FROM TABLE"
+
    fg = feature_store.create_external_feature_group(name="sales",
        version=1,
        description="Physical shop sales features",
-       query="SELECT * FROM TABLE",
-       storage_connector=connector,
+       data_source=ds,
        primary_key=['ss_store_sk'],
        event_time='sale_date'
    )
@@ -141,8 +144,8 @@ For more information on `Connector API`, read detailed guide about [external fea

## Writing Training Data

Data Sources are also used while writing training data to external sources.
-While calling the [Feature View](../../../concepts/fs/feature_view/fv_overview.md) API `create_training_data` , we can pass the `storage_connector` argument which is necessary to materialise the data to external sources, as shown below.
+While calling the [Feature View](../../../concepts/fs/feature_view/fv_overview.md) API `create_training_data`, we can pass the `data_source` argument, which is necessary to materialise the data to external sources, as shown below.

=== "PySpark"
```python
@@ -151,7 +154,7 @@ While calling the [Feature View](../../../concepts/fs/feature_view/fv_overview.m
description = 'describe training data',
data_format = 'spark_data_format', # e.g., data_format = "parquet" or data_format = "csv"
write_options = {"wait_for_job": False},
-storage_connector = connector
+data_source = ds
)
```

2 changes: 1 addition & 1 deletion docs/user_guides/fs/feature_group/create.md
@@ -108,7 +108,7 @@ The currently supported values are "HUDI", "DELTA", "NONE" (which defaults to Parq

##### Data Source

-During the creation of a feature group, it is possible to define the `storage_connector` parameter, this allows for management of offline data in the desired table format outside the Hopsworks cluster.
+During the creation of a feature group, it is possible to define the `data_source` parameter. This allows offline data to be managed in the desired table format outside the Hopsworks cluster.
Currently, [S3](../data_source/creation/s3.md) and [GCS](../data_source/creation/gcs.md) connectors and the "DELTA" `time_travel_format` are supported.
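As a minimal sketch of this parameter (hypothetical: the login, the S3 data source name "s3_ds", and the "transactions" schema are illustrative assumptions, not taken from this guide):

=== "Python"

    ```python
    import hopsworks

    # Hypothetical sketch: "s3_ds" and the feature group schema are assumptions.
    project = hopsworks.login()
    fs = project.get_feature_store()

    # Retrieve a previously created S3 data source
    s3_ds = fs.get_data_source("s3_ds")

    # Offline data is then managed as a Delta table outside the Hopsworks cluster
    fg = fs.create_feature_group(
        name="transactions",
        version=1,
        primary_key=["tx_id"],
        time_travel_format="DELTA",  # required for external offline storage
        data_source=s3_ds,
    )
    ```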

##### Online Table Configuration
8 changes: 4 additions & 4 deletions docs/user_guides/fs/feature_group/create_external.md
@@ -22,7 +22,7 @@ To create an external feature group using the HSFS APIs you need to provide an e
=== "Python"

```python
-connector = feature_store.get_storage_connector("data_source_name")
+ds = feature_store.get_data_source("data_source_name")
```

### Create an External Feature Group
@@ -52,7 +52,7 @@ Once you have defined the metadata, you can
version=1,
description="Physical shop sales features",
query=query,
-storage_connector=connector,
+data_source=ds,
primary_key=['ss_store_sk'],
event_time='sale_date'
)
@@ -69,7 +69,7 @@ Once you have defined the metadata, you can
version=1,
description="Physical shop sales features",
data_format="parquet",
-storage_connector=connector,
+data_source=ds,
primary_key=['ss_store_sk'],
event_time='sale_date'
)
@@ -112,7 +112,7 @@ For an external feature group to be available online, during the creation of the
version=1,
description="Physical shop sales features",
query=query,
-storage_connector=connector,
+data_source=ds,
primary_key=['ss_store_sk'],
event_time='sale_date',
online_enabled=True)
12 changes: 6 additions & 6 deletions docs/user_guides/fs/provenance/provenance.md
@@ -35,28 +35,28 @@ You can inspect the relationship between data sources and feature groups using t

```python
# Retrieve the data source
-snowflake_sc = fs.get_storage_connector("snowflake_sc")
+ds = fs.get_data_source("snowflake_sc")
+ds.query = "SELECT * FROM USER_PROFILES"

# Create the user profiles feature group
user_profiles_fg = fs.create_external_feature_group(
name="user_profiles",
version=1,
-storage_connector=snowflake_sc,
-query="SELECT * FROM USER_PROFILES"
+data_source=ds
)
user_profiles_fg.save()
```

### Step 1, Using Python

Starting from a feature group metadata object, you can traverse the provenance graph upstream to retrieve the metadata objects of the data sources that are part of the feature group.
-To do so, you can use the [`FeatureGroup.get_storage_connector_provenance`][hsfs.feature_group.FeatureGroup.get_storage_connector_provenance] method.
+To do so, you can use the [`FeatureGroup.get_data_source_provenance`][hsfs.feature_group.FeatureGroup.get_data_source_provenance] method.

=== "Python"

```python
# Returns all data sources linked to the provided feature group
-lineage = user_profiles_fg.get_storage_connector_provenance()
+lineage = user_profiles_fg.get_data_source_provenance()

# List all accessible parent data sources
lineage.accessible
@@ -72,7 +72,7 @@ To do so, you can use the [`FeatureGroup.get_storage_connector_provenance`][hsfs

```python
# Returns an accessible data source linked to the feature group (if it exists)
-user_profiles_fg.get_storage_connector()
+user_profiles_fg.get_data_source()
```

To traverse the provenance graph in the opposite direction (i.e., from the data source to the feature group), you can use the [`StorageConnector.get_feature_groups_provenance`][hsfs.storage_connector.StorageConnector.get_feature_groups_provenance] method.
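A minimal sketch of this reverse traversal (hypothetical: assumes a live Hopsworks connection and the "snowflake_sc" data source from the example above; the `accessible` attribute mirrors the forward-direction example):

=== "Python"

    ```python
    # Retrieve the data source and traverse downstream to its feature groups
    ds = fs.get_data_source("snowflake_sc")
    lineage = ds.get_feature_groups_provenance()

    # List all accessible feature groups created from this data source
    for fg in lineage.accessible:
        print(fg.name, fg.version)
    ```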