Connect to data
Once you have a DataContext, you’ll want to connect to data. In Great Expectations, Datasources simplify connections by managing configuration and providing a consistent, cross-platform API for referencing data.
Let’s configure your first Datasource: a connection to the data directory we’ve provided in the repo. This could also be a database connection, but for now we’re just using a simple file store:
Would you like to configure a Datasource? [Y/n]: <press enter>
What data would you like Great Expectations to connect to?
1. Files on a filesystem (for processing with Pandas or Spark)
2. Relational database (SQL)
: 1
What are you processing your files with?
1. Pandas
2. PySpark
: 1
Enter the path (relative or absolute) of the root directory where the data files are stored.
: data
Give your new Datasource a short name.
[data__dir]: <press enter>
... <some more output here> ...
Would you like to proceed? [Y/n]:
A new datasource 'data__dir' was added to your project.
Would you like to profile new Expectations for a single data asset
within your new Datasource? [Y/n]: n
That’s it! You just configured your first Datasource!
Make sure to choose n at this prompt to exit the init flow for now. Normally, the init flow takes you through another step to create sample Expectations, but we want to jump straight to creating an Expectation Suite using the scaffold method next.
Before continuing, let’s stop and unpack what just happened.
Configuring Datasources
When you completed those last few steps in great_expectations init, you told Great Expectations that:
You want to create a new Datasource called data__dir.
You want to use Pandas to read the data from CSV.
Based on that information, the CLI added the following entry into your great_expectations.yml file, under the datasources header:
data__dir:
  data_asset_type:
    class_name: PandasDataset
    module_name: great_expectations.dataset
  batch_kwargs_generators:
    subdir_reader:
      class_name: SubdirReaderBatchKwargsGenerator
      base_directory: ../data
  class_name: PandasDatasource
  module_name: great_expectations.datasource
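To make the nesting of that entry explicit, here is a minimal sketch that mirrors it as a plain Python dictionary and reads a few values back. It does not call Great Expectations: the dictionary is typed by hand from the entry above. It also assumes that base_directory is interpreted relative to the great_expectations/ directory that holds great_expectations.yml, which would explain why the path you entered as data was recorded as ../data.

```python
import posixpath

# Hand-written mirror of the data__dir entry; not loaded from the real file.
datasources = {
    "data__dir": {
        "data_asset_type": {
            "class_name": "PandasDataset",
            "module_name": "great_expectations.dataset",
        },
        "batch_kwargs_generators": {
            "subdir_reader": {
                "class_name": "SubdirReaderBatchKwargsGenerator",
                "base_directory": "../data",
            }
        },
        "class_name": "PandasDatasource",
        "module_name": "great_expectations.datasource",
    }
}

entry = datasources["data__dir"]
reader = entry["batch_kwargs_generators"]["subdir_reader"]

# Assumption: relative paths are resolved from the directory that contains
# great_expectations.yml, i.e. great_expectations/ inside the project.
config_dir = "great_expectations"
resolved_data_dir = posixpath.normpath(
    posixpath.join(config_dir, reader["base_directory"])
)

print(entry["class_name"])   # class that represents the Datasource
print(resolved_data_dir)     # directory the reader would scan
```

Under that assumption, the resolved directory is the data folder at the project root, the same one you typed at the prompt.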
This datasource does not require any credentials. However, if you were to connect to a database that requires connection credentials, those would be stored in great_expectations/uncommitted/config_variables.yml.
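As a rough illustration of why credentials are kept in a separate, uncommitted file, the sketch below fills a ${...} placeholder in a config line with a value taken from a separate mapping, using Python’s string.Template. This is not the library’s implementation. The variable name my_db_url and the connection string are hypothetical, and the placeholder style is an assumption about how such values are referenced.

```python
from string import Template

# Hypothetical line as it might appear in the committed great_expectations.yml.
committed_config = "url: ${my_db_url}"

# Hypothetical contents of uncommitted/config_variables.yml, shown as a dict.
config_variables = {
    "my_db_url": "postgresql://user:secret@localhost:5432/mydb",
}

# Fill in the placeholder; the secret itself never appears in the committed text.
resolved = Template(committed_config).substitute(config_variables)
print(resolved)
```

The point of the split is that the committed file carries only the placeholder, while the value lives in a file kept out of version control.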
In the future, you can modify or delete your configuration by editing your great_expectations.yml and config_variables.yml files directly.
For now, let’s move on to creating your first Expectations.