Install additional dependencies
Some environments and Data Sources require additional Python libraries or third-party utilities that are not included in the base installation of GX Core. Use the information provided here to install the necessary dependencies for your environment and Data Sources.
- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- SQL databases
- Spark
GX Core uses the Python library boto3 to access objects stored in Amazon S3 buckets, but you must configure your Amazon S3 account and credentials through AWS and the AWS command line interface (CLI).
Prerequisites
- The AWS CLI. See Installing or updating the latest version of the AWS CLI.
- AWS credentials. See Configuring the AWS CLI.
- Python version 3.9 to 3.12
- Recommended. A Python virtual environment.
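The Python version prerequisite can be checked from the interpreter you plan to install into. The following is a minimal sketch, not part of GX Core; the function name `python_version_supported` and the two constants are hypothetical names chosen for illustration, and the range is taken from the prerequisite listed above.

```python
import sys

# Supported range taken from the prerequisites above: Python 3.9 through 3.12.
MIN_SUPPORTED = (3, 9)
MAX_SUPPORTED = (3, 12)


def python_version_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version falls within the supported range."""
    major_minor = (version_info[0], version_info[1])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


# Report on the interpreter that is running this script.
current = f"{sys.version_info[0]}.{sys.version_info[1]}"
print(f"Python {current} supported: {python_version_supported()}")
```

Run the script with the same interpreter (and virtual environment, if you use one) that you use for the pip commands.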
Installation
Python interacts with AWS through the boto3 library. GX Core uses the library in the background when working with AWS. Although you won't use boto3 directly, you must install it in your Python environment.
To set up boto3 with AWS, and use boto3 within Python, see the Boto3 documentation.
-
Run the following command to verify the AWS CLI version:
Terminal input:
aws --version
If this command does not return AWS CLI version information, reinstall or update the AWS CLI. See Install or update to the latest version of the AWS CLI.
-
Run the following terminal command to install boto3 in your Python environment:
Terminal input:
python -m pip install boto3
Tip: If the python -m pip install boto3 command does not work, try:
Terminal input:
python3 -m pip install boto3
If these pip commands don't work, verify that Python is installed correctly.
-
Run the following terminal command to verify that your AWS credentials are properly configured:
Terminal input:
aws sts get-caller-identity
When your credentials are properly configured, your UserId, Account, and Arn are returned. If your credentials are not configured correctly, an error message appears. If you receive an error message, or you can't verify your credentials, see Amazon's documentation for Configure the AWS CLI.
-
Install the Python dependencies for AWS S3 support.
Run the following terminal command to install the optional dependencies required by GX Core to work with AWS S3:
Info: If you installed GX in a virtual environment, your virtual environment should be active when you install these dependencies.
Terminal input:
python -m pip install 'great_expectations[s3]'
GX Core and the requirements for the boto3 Python library are installed.
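The credential check from the steps above can also be scripted. The following is a minimal sketch, assuming the AWS CLI is on your PATH and produces JSON output; the helper names `parse_caller_identity` and `fetch_caller_identity` are hypothetical, and the sample values are placeholders rather than real credentials.

```python
import json
import subprocess

# Fields that `aws sts get-caller-identity` returns when credentials are valid.
EXPECTED_FIELDS = ("UserId", "Account", "Arn")


def parse_caller_identity(cli_output):
    """Parse the CLI's JSON output and return the three identity fields.

    Raises ValueError if the output is not a JSON object or a field is missing.
    """
    try:
        identity = json.loads(cli_output)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc
    if not isinstance(identity, dict):
        raise ValueError("output is not a JSON object")
    missing = [field for field in EXPECTED_FIELDS if field not in identity]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return {field: identity[field] for field in EXPECTED_FIELDS}


def fetch_caller_identity():
    """Run the AWS CLI and parse its output (needs the AWS CLI and credentials)."""
    result = subprocess.run(
        ["aws", "sts", "get-caller-identity", "--output", "json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI reports an error
    )
    return parse_caller_identity(result.stdout)


# Placeholder values for illustration only; they are not real credentials.
sample = (
    '{"UserId": "EXAMPLEUSERID", "Account": "123456789012", '
    '"Arn": "arn:aws:iam::123456789012:user/example"}'
)
print(parse_caller_identity(sample)["Account"])
```

Calling `fetch_caller_identity()` raises an exception when the CLI is missing or the credentials are not configured, which mirrors the error message described in the step above.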
Azure Blob Storage stores unstructured data on the Microsoft cloud data storage platform. To validate Azure Blob Storage data with GX Core, you install additional Python libraries and define a connection string.
Prerequisites
- An Azure Storage account.
- Azure storage account access keys.
- Python version 3.9 to 3.12
- Recommended. A Python virtual environment.
Installation
-
Install the Python dependencies for Azure Blob Storage support.
Run the following command to install GX Core with the additional Python libraries needed to work with Azure Blob Storage:
Info: If you installed GX in a virtual environment, your virtual environment should be active when you install these dependencies.
Terminal input:
python -m pip install 'great_expectations[azure]'
-
Configure your Azure Blob Storage credentials.
To store your Azure Blob Storage credentials as an environment variable, replace <YOUR-STORAGE-ACCOUNT-NAME> and <YOUR-STORAGE-ACCOUNT-KEY> in the following terminal command with your Azure Blob Storage account values:
Terminal input:
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=<YOUR-STORAGE-ACCOUNT-NAME>;AccountKey=<YOUR-STORAGE-ACCOUNT-KEY>"
Info: You can manage your credentials for all environments and Data Sources by storing them as environment variables. To do this, enter export ENV_VARIABLE_NAME=env_var_value in the terminal or add the equivalent command to your ~/.bashrc file.
As an alternative to environment variables, you can also store credentials in the file config_variables.yml after you have created a File Data Context.
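The connection string format shown above can be assembled and inspected in Python. The following is a minimal sketch, not part of GX Core; the function names `build_azure_connection_string` and `parse_connection_string` are hypothetical, and the account name and key are placeholders.

```python
def build_azure_connection_string(account_name, account_key):
    """Assemble a connection string in the format used by the export command above."""
    if not account_name or not account_key:
        raise ValueError("account_name and account_key are both required")
    return (
        "DefaultEndpointsProtocol=https;"
        "EndpointSuffix=core.windows.net;"
        f"AccountName={account_name};"
        f"AccountKey={account_key}"
    )


def parse_connection_string(connection_string):
    """Split a connection string into a dict of its key/value segments."""
    parts = {}
    for segment in connection_string.split(";"):
        if not segment:
            continue  # tolerate a trailing semicolon
        # Split on the first "=" only: account keys can end in "=" padding.
        key, separator, value = segment.partition("=")
        if not separator:
            raise ValueError(f"malformed segment: {segment!r}")
        parts[key] = value
    return parts


# Placeholder values for illustration only.
connection_string = build_azure_connection_string("examplestorage", "ZXhhbXBsZQ==")
print(sorted(parse_connection_string(connection_string)))
```

Printing only the segment names, as in the last line, avoids echoing the account key to the terminal.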
To validate Google Cloud Platform (GCP) data with GX Core, you create your GX Python environment, configure your GCP credentials, and install GX Core locally with the additional dependencies to support GCP.
Prerequisites
- A GCP service account with permissions to access GCP resources and storage objects.
- The GOOGLE_APPLICATION_CREDENTIALS environment variable is set. See the Google documentation Set up Application Default Credentials.
- Google Cloud API authentication is set up. See the Google documentation Set up authentication.
- Python version 3.9 to 3.12
- Recommended. A Python virtual environment.
Installation
-
Ensure your GCP credentials are correctly configured. This process includes:
- Creating a Google Cloud Platform (GCP) service account.
- Setting the GOOGLE_APPLICATION_CREDENTIALS environment variable.
- Verifying authentication by running a Google Cloud Storage client library script.
For more information, see the GCP documentation on how to verify authentication for the Google Cloud API.
-
Install the Python dependencies for GCP support.
Run the following terminal command to install GX Core with the additional dependencies for GCP support:
Info: If you installed GX in a virtual environment, your virtual environment should be active when you install these dependencies.
Terminal input:
python -m pip install 'great_expectations[gcp]'
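Before installing, it can help to confirm that GOOGLE_APPLICATION_CREDENTIALS is set and points to a file. The following is a minimal sketch, not part of GX Core; the function name `check_gcp_credentials_path` is hypothetical. It only checks that the file exists and does not confirm that the credentials are valid, so it does not replace the authentication check described in the GCP documentation.

```python
import os
from pathlib import Path

ENV_VAR = "GOOGLE_APPLICATION_CREDENTIALS"


def check_gcp_credentials_path(env=None):
    """Return (ok, message) describing whether the variable points to a file.

    `env` defaults to the process environment; pass a dict to check other values.
    """
    env = os.environ if env is None else env
    value = env.get(ENV_VAR)
    if not value:
        return False, f"{ENV_VAR} is not set"
    path = Path(value).expanduser()
    if not path.is_file():
        return False, f"{ENV_VAR} points to a missing file: {path}"
    return True, f"{ENV_VAR} points to {path}"


ok, message = check_gcp_credentials_path()
print(message)
```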
To validate data stored on SQL databases with GX Core, you create your GX Python environment, install GX Core locally, and then configure the necessary dependencies.
Prerequisites
- Python version 3.9 to 3.12
- Recommended. A Python virtual environment.
Installation
-
Run the pip command to install the dependencies for your data's SQL dialect.
The following table lists the installation commands used to install GX Core dependencies for specific SQL dialects.
SQL Dialect    Command
AWS Athena     pip install 'great_expectations[athena]'
BigQuery       pip install 'great_expectations[bigquery]'
Databricks     pip install 'great_expectations[databricks]'
MSSQL          pip install 'great_expectations[mssql]'
PostgreSQL     pip install 'great_expectations[postgresql]'
Snowflake      pip install 'great_expectations[snowflake]'
To install dependencies for a specific SQL dialect, use the corresponding command from the table above.
-
Configure your SQL database credentials.
You can store your SQL database password by replacing <MY_PASSWORD> with your password in the following command:
Terminal input:
export MY_DB_PW=<MY_PASSWORD>
Or you can store your entire SQL database connection string by replacing <MY_CONNECTION_STRING> with it and running:
Terminal input:
export MY_DB_CONNECTION_STRING=<MY_CONNECTION_STRING>
Info: You can manage your credentials for all environments and Data Sources by storing them as environment variables. To do this, enter export ENV_VARIABLE_NAME=env_var_value in the terminal or add the equivalent command to your ~/.bashrc file.
You can reference environment variables in GX Core by including them in strings using the format ${ENV_VARIABLE_NAME}. For instance, to insert the password stored as MY_DB_PW into a PostgreSQL connection string, you would provide the string:
Example PostgreSQL Connection String:
"postgresql+psycopg2://<username>:${MY_DB_PW}@<host>:<port>/<database>"
As an alternative to environment variables, you can also store credentials in the file config_variables.yml after you have created a File Data Context.
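To see what the ${ENV_VARIABLE_NAME} placeholder format resolves to, the substitution can be imitated with Python's standard library. The following is a minimal sketch that uses `string.Template` purely for illustration; it is not how GX Core performs the substitution internally, and the function name `expand_env_placeholders` and the example password are hypothetical.

```python
import os
from string import Template


def expand_env_placeholders(text, env=None):
    """Replace ${NAME} placeholders in text with values from env.

    `env` defaults to the process environment. Raises KeyError if a
    referenced variable is not defined.
    """
    env = os.environ if env is None else env
    try:
        return Template(text).substitute(env)
    except KeyError as exc:
        raise KeyError(f"environment variable {exc.args[0]} is not set") from exc


# The connection string from the example above, with MY_DB_PW still a placeholder.
connection_template = (
    "postgresql+psycopg2://<username>:${MY_DB_PW}@<host>:<port>/<database>"
)

# A stand-in environment is used here so that the sketch runs without
# exporting MY_DB_PW first; omit the second argument to read os.environ.
print(expand_env_placeholders(connection_template, {"MY_DB_PW": "example-password"}))
```

A check like this is useful for catching a misspelled variable name before the string is passed to a Data Source, because the missing name appears in the raised error.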
To validate data while using Spark to read from dataframes or file formats such as .csv and .parquet with GX Core, you create your GX Python environment, install GX Core locally, and then configure the necessary dependencies.
Prerequisites
- Python version 3.9 to 3.12
- Recommended. A Python virtual environment.
Installation
-
Optional. Activate your virtual environment.
If you installed GX in a virtual environment, your virtual environment should be active when you install these dependencies.
-
Run the pip command to install the dependencies for Spark:
Terminal input:
python -m pip install 'great_expectations[spark]'
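After the install finishes, you can confirm which packages landed in the active environment. The following is a minimal sketch that uses the standard library's `importlib.metadata`; the function name `installed_version` is hypothetical, and the distribution names `great_expectations` and `pyspark` are assumptions about what the command above installs.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them to match your environment.
for name in ("great_expectations", "pyspark"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If a package is reported as not installed, check that the virtual environment you installed into is the one that is currently active.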