Pipeline
Pipeline for synthetic data generation, training, quantization, and deployment.
Source code in textforge/pipeline.py
__init__(config)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `PipelineConfig` | Configuration for the pipeline. | required |
run(data, serve=False, save=False, skip_data_generation=False)
Runs the pipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | The input data. | required |
| `serve` | `bool` | Whether to serve the model after training. | `False` |
| `save` | `bool` | Whether to save the intermediate and final outputs. | `False` |
| `skip_data_generation` | `bool` | Whether to skip the data generation step. | `False` |
Returns:
| Type | Description |
|---|---|
| `str` | The output path where the results are saved. |
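As a minimal sketch of how `run` might be called, the helper below forwards the documented keyword arguments and wraps the returned string in a `pathlib.Path`. The name `run_and_locate` is hypothetical and not part of textforge; it only assumes that the object passed in exposes `run(data, serve, save, skip_data_generation)` as documented above.

```python
from pathlib import Path


def run_and_locate(pipeline, data, *, skip_data_generation=False):
    """Hypothetical helper: run with saving enabled and return the output path.

    `pipeline` is assumed to expose the documented
    run(data, serve=False, save=False, skip_data_generation=False) method.
    """
    output = pipeline.run(
        data,
        serve=False,  # do not start serving after training
        save=True,    # keep intermediate and final outputs
        skip_data_generation=skip_data_generation,
    )
    # run() is documented to return the output path as a str.
    return Path(output)
```

With the real library, `pipeline` would be a `Pipeline(config)` instance and `data` a pandas `DataFrame`. The columns that `data` must contain are not specified in this reference.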
PipelineConfig
Configuration class for the pipeline.
__init__(labels, query, api_key=None, use_local=False, data_gen_model='gpt-4o-mini', model_name='distilbert/distilbert-base-uncased', model_path=None, max_length=128, epochs=3, batch_size=8, save_steps=100, eval_steps=100, base_url=None, sync_client=False)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `api_key` | `str` | API key for data generation. | `None` |
| `labels` | `list` | List of labels for classification. | required |
| `query` | `str` | Query for data generation. | required |
| `use_local` | `bool` | Whether to use local data generation. | `False` |
| `data_gen_model` | `str` | Model name for synthetic data generation. | `'gpt-4o-mini'` |
| `model_name` | `str` | Model name for training. | `'distilbert/distilbert-base-uncased'` |
| `model_path` | `str` | Path to a pre-trained model. | `None` |
| `max_length` | `int` | Maximum sequence length. | `128` |
| `epochs` | `int` | Number of training epochs. | `3` |
| `batch_size` | `int` | Batch size for training and evaluation. | `8` |
| `save_steps` | `int` | Number of steps between model saves. | `100` |
| `eval_steps` | `int` | Number of steps between evaluations. | `100` |
| `base_url` | `str` | Base URL for API requests. | `None` |
| `sync_client` | `bool` | Whether to use a synchronous client. | `False` |
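The sketch below shows one way to assemble arguments for `PipelineConfig` from the defaults listed above. `DOCUMENTED_DEFAULTS`, `build_config_kwargs`, and `make_config` are hypothetical names introduced for illustration, and the import path `textforge.pipeline` is an assumption inferred from the source file location. The example labels and query are placeholders.

```python
# Defaults copied from the documented PipelineConfig.__init__ signature.
DOCUMENTED_DEFAULTS = {
    "api_key": None,
    "use_local": False,
    "data_gen_model": "gpt-4o-mini",
    "model_name": "distilbert/distilbert-base-uncased",
    "model_path": None,
    "max_length": 128,
    "epochs": 3,
    "batch_size": 8,
    "save_steps": 100,
    "eval_steps": 100,
    "base_url": None,
    "sync_client": False,
}


def build_config_kwargs(labels, query, **overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS)
    if unknown:
        # Catch misspelled argument names before they reach the library.
        raise TypeError(f"unknown PipelineConfig arguments: {sorted(unknown)}")
    return {"labels": list(labels), "query": query,
            **DOCUMENTED_DEFAULTS, **overrides}


def make_config(labels, query, **overrides):
    """Build a PipelineConfig; requires textforge to be installed."""
    from textforge.pipeline import PipelineConfig  # assumed import path
    return PipelineConfig(**build_config_kwargs(labels, query, **overrides))


# Placeholder labels and query; only epochs and max_length are overridden.
kwargs = build_config_kwargs(
    labels=["positive", "negative"],
    query="Classify the sentiment of a product review.",
    epochs=5,
    max_length=256,
)
```

Keeping the defaults in a single dictionary makes it easy to see which settings a given run changes. `make_config` defers the import, so the merging logic can be checked without textforge installed.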