- Migrate from deprecated langchainplus_sdk to `langsmith` package
- Update the `run_on_dataset()` API to use an eval config
- Update a number of evaluators, as well as the loading logic
- Update docstrings / reference docs
- Update tracer to share single HTTP session
Notebook shows preference scoring between two chains and reports wilson
score interval + p value
I think I'll add the option to insert ground truth labels but doesn't
have to be in this PR