get_df
Retrieve data from configured sources as pandas DataFrames
The get_df function retrieves data from a configured source and returns it as a pandas DataFrame. For database sources (PostgreSQL, ClickHouse), a table name must be specified.
Parameters
- source_name (str): Name of the data source as configured in preswald.toml, or a path to a file (supports CSV, Parquet, and JSON)
- table_name (Optional[str]): Required for database sources; specifies which table to retrieve
Returns
pd.DataFrame: Data from the specified source as a pandas DataFrame
Usage Examples
Note: connect must be called before get_df can be used.
CSV Source
For CSV sources, table_name is not required, since the entire CSV file is treated as a single table.
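A minimal sketch of the CSV case follows. The source name sample_csv is a hypothetical entry in preswald.toml, the load_csv helper is illustrative rather than part of Preswald, and connect is assumed to take no arguments.

```python
def load_csv(source_name="sample_csv"):
    """Return the whole CSV source as a DataFrame (source name is hypothetical)."""
    # Imported inside the helper so this sketch can be defined
    # even where preswald is not installed.
    from preswald import connect, get_df

    connect()                   # must run before get_df
    return get_df(source_name)  # no table_name: the file is a single table
```

Per the parameter description above, a file path (for example a hypothetical "data/sales.csv") could be passed in place of a configured source name.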
PostgreSQL Source
For PostgreSQL sources, table_name is required.
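The PostgreSQL case might look like the sketch below. The source name my_postgres and the table name users are hypothetical, load_postgres_table is an illustrative helper rather than a Preswald function, and connect is assumed to take no arguments.

```python
def load_postgres_table(table_name, source_name="my_postgres"):
    """Fetch one table from a PostgreSQL source (names are hypothetical)."""
    # Fail early: database sources need a table name.
    if not table_name:
        raise ValueError("table_name is required for database sources")

    # Imported inside the helper so this sketch can be defined
    # even where preswald is not installed.
    from preswald import connect, get_df

    connect()  # must run before get_df
    return get_df(source_name, table_name)


# Example call inside a Preswald app:
# users_df = load_postgres_table("users")
```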
ClickHouse Source
Similarly, for ClickHouse sources, table_name is required.
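As an illustration for ClickHouse, the sketch below loads several tables from one source after a single connect call. The source name my_clickhouse, the table names, and the load_clickhouse_tables helper are all hypothetical; connect is assumed to take no arguments.

```python
def load_clickhouse_tables(table_names, source_name="my_clickhouse"):
    """Return a dict mapping each table name to its DataFrame."""
    # Imported inside the helper so this sketch can be defined
    # even where preswald is not installed.
    from preswald import connect, get_df

    connect()  # one call is enough before any number of get_df calls
    return {name: get_df(source_name, name) for name in table_names}


# Example call inside a Preswald app:
# frames = load_clickhouse_tables(["events", "sessions"])
```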
Error Handling
get_df handles errors in the following ways:
- Validates source existence
- Checks for required table_name parameter for database sources
- Handles connection and query errors
- Provides detailed error messages through logging
Best Practices
- Confirm that the source exists in preswald.toml before calling get_df
- For database sources, always provide table_name
- Use error handling when calling the function
- Consider memory limitations when retrieving large datasets
Example with error handling:
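The sketch below is one way to apply these practices. This page does not state whether get_df raises on failure or only logs the error, so the illustrative safe_get_df helper guards against both an exception and a missing result; the helper and its log messages are assumptions, not part of Preswald.

```python
import logging

logger = logging.getLogger(__name__)


def safe_get_df(source_name, table_name=None):
    """Return the DataFrame, or None if the source could not be loaded."""
    # Imported inside the helper so this sketch can be defined
    # even where preswald is not installed.
    from preswald import connect, get_df

    try:
        connect()  # must run before get_df
        df = get_df(source_name, table_name)
    except Exception as exc:  # exact exception types are not documented here
        logger.error("Could not load %r (table=%r): %s", source_name, table_name, exc)
        return None

    if df is None:
        # Assumption: a failed load may surface as None rather than an exception.
        logger.error("No data returned for %r (table=%r)", source_name, table_name)
        return None
    return df
```

Callers can then test the result for None before using it, instead of wrapping every get_df call in its own try block.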
Related Functions
query(): For custom SQL queries against data sources