Training hyperparameters

Parameter | Pre-training | Transfer learning |
---|---|---|
Batch size | 128 | 128 |
Learning rate | \(10^{-4}\) | \(5\times 10^{-5}\) |
Weight decay | 0.01 | 0.01 |
Dropout | 0.1 | 0.3 |
Initialization | Xavier | Xavier |
Optimizer | AdamW | AdamW |
Scheduler | None | None |
Frozen encoder | No | Yes |
Max sequence length | 175 | 175 |
Token space | 68 | 68 |
Embedding dimension | 512 | 512 |
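
The two columns differ in only three settings (learning rate, dropout, and whether the encoder is frozen), which is easy to miss when reading the table row by row. The sketch below is one possible way to encode the table as a configuration object and derive the transfer-learning setup from the pre-training one. The class name `TrainingConfig`, its field names, and the helper `changed_fields` are illustrative assumptions, not identifiers from the original work; only the values come from the table.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters from the table; field names are hypothetical."""
    batch_size: int = 128
    learning_rate: float = 1e-4
    weight_decay: float = 0.01
    dropout: float = 0.1
    initialization: str = "xavier"
    optimizer: str = "adamw"
    scheduler: Optional[str] = None   # no learning-rate scheduler in either stage
    freeze_encoder: bool = False
    max_sequence_length: int = 175
    token_space: int = 68
    embedding_dim: int = 512


# Defaults correspond to the pre-training column.
PRETRAINING = TrainingConfig()

# Transfer learning overrides only the three rows that differ.
TRANSFER_LEARNING = replace(
    PRETRAINING,
    learning_rate=5e-5,
    dropout=0.3,
    freeze_encoder=True,
)


def changed_fields(a: TrainingConfig, b: TrainingConfig) -> dict:
    """Return {field name: (value in a, value in b)} for fields that differ."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }


if __name__ == "__main__":
    for name, (pre, transfer) in changed_fields(PRETRAINING, TRANSFER_LEARNING).items():
        print(f"{name}: {pre} -> {transfer}")
```

Deriving the second configuration with `replace` keeps the shared values in one place, so a change to, say, the batch size cannot silently diverge between the two stages.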