Commit Graph

34 Commits

Author SHA1 Message Date
Ingo Kleiber fd9975dad7 add CoNLL-U document loader (#1297)
I've added a simple
[CoNLL-U](https://universaldependencies.org/format.html) document
loader. CoNLL-U is a common format for NLP tasks and is used, for
example, in the Universal Dependencies treebank corpora. The loader
reads a single file in standard CoNLL-U format and returns a document.
2023-02-26 17:27:00 -08:00
Harrison Chase d29f74114e copy paste loader (#1302) 2023-02-26 17:26:37 -08:00
Matt Robinson 2f15c11b87 feat: document loader for MS Word documents (#1282)
### Summary

Adds a document loader for MS Word Documents. Works with both `.docx`
and `.doc` files as longer as the user has installed
`unstructured>=0.4.11`.

### Testing

The follow workflow test the loader for both `.doc` and `.docx` files
using example docs from the `unstructured` repo.

#### `.docx`

```python
from langchain.document_loaders import UnstructuredWordDocumentLoader

filename = "../unstructured/example-docs/fake.docx"
loader = UnstructuredWordDocumentLoader(filename)
loader.load()
```

#### `.doc`

```python
from langchain.document_loaders import UnstructuredWordDocumentLoader

filename = "../unstructured/example-docs/fake.doc"
loader = UnstructuredWordDocumentLoader(filename)
loader.load()
```
2023-02-24 08:26:19 -08:00
Harrison Chase 96db6ed073 cleanup (#1274) 2023-02-24 07:38:24 -08:00
Harrison Chase 42167a1e24 Harrison/fb loader (#1277)
Co-authored-by: Vairo Di Pasquale <vairo.dp@gmail.com>
2023-02-24 07:22:48 -08:00
Klein Tahiraj 8a0751dadd adding .ipynb loader and documentation Fixes #1248 (#1252)
`NotebookLoader.load()` loads the `.ipynb` notebook file into a
`Document` object.

**Parameters**:

* `include_outputs` (bool): whether to include cell outputs in the
resulting document (default is False).
* `max_output_length` (int): the maximum number of characters to include
from each cell output (default is 10).
* `remove_newline` (bool): whether to remove newline characters from the
cell sources and outputs (default is False).
* `traceback` (bool): whether to include full traceback (default is
False).
2023-02-24 07:10:35 -08:00
Matt Robinson 3d5f56a8a1 docs: add quotes to unstructured[local-inference] install instructions (#1208)
### Summary

Corrects the install instruction for local inference to `pip install
"unstructured[local-inference]"`
2023-02-21 08:06:43 -08:00
Harrison Chase d90a287d8f Harrison/updating docs (#1196) 2023-02-20 22:54:26 -08:00
Dennis Antela Martinez 23243ae69c add gitbook document loader (#1180)
Added a GitBook document loader. It lets you both, (1) fetch text from
any single GitBook page, or (2) fetch all relative paths and return
their respective content in Documents.

I've modified the `scrape` method in the `WebBaseLoader` to accept
custom web paths if given, but happy to remove it and move that logic
into the `GitbookLoader` itself.
2023-02-20 20:05:04 -08:00
Harrison Chase 65cc81c479 directory loader improvements (#1162) 2023-02-19 20:47:08 -08:00
Harrison Chase fb3c73d194 add srt loader (#1140) 2023-02-18 10:58:39 -08:00
Harrison Chase 483821ea3b fix docs (#1133) 2023-02-18 08:13:54 -08:00
Harrison Chase d5f3dfa1e1 Harrison/hn loader (#1130)
Co-authored-by: William X <william.y.xuan@gmail.com>
2023-02-17 15:15:02 -08:00
Harrison Chase c60954d0f8 Harrison/telegram loader (#1080)
Co-authored-by: Maxime Vidal <max.vidal@hotmail.fr>
2023-02-15 23:24:32 -08:00
Harrison Chase 98186ef180 Harrison/evernote nb (#1078)
Co-authored-by: Akshay <64036106+akshayvkt@users.noreply.github.com>
2023-02-15 22:47:30 -08:00
Harrison Chase 7fb33fca47 chroma docs (#1012) 2023-02-12 23:02:01 -08:00
cragwolfe 05d8969c79 Unstructured example notebook: add a pdf, related deps (#1011)
Updates the Unstructured example notebook with a PDF example. Includes
additional dependencies for PDF processing (and images, etc).
2023-02-12 14:56:48 -08:00
Harrison Chase 0998577dfe Harrison/unstructured structured (#1004) 2023-02-12 07:36:11 -08:00
Harrison Chase bbb06ca4cf pdfminer (#1003) 2023-02-12 07:29:26 -08:00
Francisco Ingham 0b6aa6a024 Added initial capital letter to bullet points that had it missing (#1000)
Co-authored-by: Francisco Ingham <>
2023-02-11 20:31:34 -08:00
Harrison Chase 2e96704d59 Harrison/airbyte (#989)
Co-authored-by: zanderchase <zanderchase@gmail.com>
Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MacBook-Pro.local>
2023-02-10 18:08:00 -08:00
zanderchase c2d1d903fa Zander/online pdf loader (#984) 2023-02-10 15:42:30 -08:00
Matt Robinson 07a407d89a feat: adds UnstructuredURLLoader for loading data from urls (#979)
### Summary

Adds a `UnstructuredURLLoader` that supports loading data from a list of
URLs.


### Testing

```python
from langchain.document_loaders import UnstructuredURLLoader

urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023"
]
loader = UnstructuredURLLoader(urls=urls)
raw_documents = loader.load()
```
2023-02-10 10:18:38 -08:00
Harrison Chase c64f98e2bb Harrison/format agent instructions (#973)
Co-authored-by: Andrew White <white.d.andrew@gmail.com>
Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>
Co-authored-by: Peng Qu <82029664+pengqu123@users.noreply.github.com>
2023-02-10 10:07:26 -08:00
Harrison Chase 5469d898a9 Harrison/everynote (#974)
Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>
2023-02-10 08:02:35 -08:00
Harrison Chase 01fa2d8117 Harrison/youtube fixes (#955)
Co-authored-by: Ji <jizhang.work@gmail.com>
Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>
2023-02-09 08:12:22 -08:00
zanderchase 8e126bc9bd adding webpage loading logic (#942) 2023-02-09 07:52:50 -08:00
Harrison Chase 3e1901e1aa gutenberg books (#946)
Co-authored-by: zanderchase <zander@unfold.ag>
Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>
2023-02-08 12:00:47 -08:00
Harrison Chase 44ecec3896 Harrison/add roam loader (#939) 2023-02-08 00:35:33 -08:00
Harrison Chase 637c0d6508 Harrison/obsidian (#920) 2023-02-06 22:21:16 -08:00
Ankush Gola 6bd1529cb7 add GoogleDriveLoader (#914)
only deal with docs files for now
2023-02-06 21:44:35 -08:00
Harrison Chase 2ec25ddd4c add unstructured examples (#913) 2023-02-06 18:13:46 -08:00
Harrison Chase 71e662e88d update docs (#905) 2023-02-06 00:26:20 -08:00
Harrison Chase 53d56d7650 Harrison/unstructured support (#903) 2023-02-05 23:02:07 -08:00