Document loaders

Features

The following table shows the feature support for all document loaders.

| Document Loader | Description | Lazy loading | Native async support |
| --- | --- | --- | --- |
| AZLyricsLoader | Load AZLyrics webpages. | | |
| AcreomLoader | Load acreom vault from a directory. | | |
| AirbyteCDKLoader | Load with an Airbyte source connector implemented using the CDK. | | |
| AirbyteGongLoader | Load from Gong using an Airbyte source connector. | | |
| AirbyteHubspotLoader | Load from Hubspot using an Airbyte source connector. | | |
| AirbyteJSONLoader | Load local Airbyte json files. | | |
| AirbyteSalesforceLoader | Load from Salesforce using an Airbyte source connector. | | |
| AirbyteShopifyLoader | Load from Shopify using an Airbyte source connector. | | |
| AirbyteStripeLoader | Load from Stripe using an Airbyte source connector. | | |
| AirbyteTypeformLoader | Load from Typeform using an Airbyte source connector. | | |
| AirbyteZendeskSupportLoader | Load from Zendesk Support using an Airbyte source connector. | | |
| AirtableLoader | Load the Airtable tables. | | |
| AmazonTextractPDFLoader | Load PDF files from a local file system, HTTP or S3. | | |
| ApifyDatasetLoader | Load datasets from Apify web scraping, crawling, and data extraction platform. | | |
| ArcGISLoader | Load records from an ArcGIS FeatureLayer. | | |
| ArxivLoader | Load a query result from Arxiv. | | |
| AssemblyAIAudioLoaderById | | | |
| AssemblyAIAudioTranscriptLoader | Load AssemblyAI audio transcripts. | | |
| AstraDBLoader | [Deprecated] | | |
| AsyncChromiumLoader | Scrape HTML pages from URLs using a | | |
| AsyncHtmlLoader | Load HTML asynchronously. | | |
| AthenaLoader | Load documents from AWS Athena. | | |
| AzureAIDataLoader | Load from Azure AI Data. | | |
| AzureAIDocumentIntelligenceLoader | Load a PDF with Azure Document Intelligence. | | |
| AzureBlobStorageContainerLoader | Load from Azure Blob Storage container. | | |
| AzureBlobStorageFileLoader | Load from Azure Blob Storage files. | | |
| BSHTMLLoader | Load HTML files and parse them with beautiful soup. | | |
| BibtexLoader | Load a bibtex file. | | |
| BigQueryLoader | [Deprecated] Load from the Google Cloud Platform BigQuery. | | |
| BiliBiliLoader | | | |
| BlackboardLoader | Load a Blackboard course. | | |
| BlockchainDocumentLoader | Load elements from a blockchain smart contract. | | |
| BraveSearchLoader | Load with Brave Search engine. | | |
| BrowserbaseLoader | Load pre-rendered web pages using a headless browser hosted on Browserbase. | | |
| BrowserlessLoader | Load webpages with Browserless /content endpoint. | | |
| CSVLoader | Load a CSV file into a list of Documents. | | |
| CassandraLoader | | | |
| ChatGPTLoader | Load conversations from exported ChatGPT data. | | |
| CoNLLULoader | Load CoNLL-U files. | | |
| CollegeConfidentialLoader | Load College Confidential webpages. | | |
| ConcurrentLoader | Load and parse Documents concurrently. | | |
| ConfluenceLoader | Load Confluence pages. | | |
| CouchbaseLoader | Load documents from Couchbase. | | |
| CubeSemanticLoader | Load Cube semantic layer metadata. | | |
| DataFrameLoader | Load Pandas DataFrame. | | |
| DatadogLogsLoader | Load Datadog logs. | | |
| DiffbotLoader | Load Diffbot json file. | | |
| DirectoryLoader | Load from a directory. | | |
| DiscordChatLoader | Load Discord chat logs. | | |
| DocugamiLoader | [Deprecated] Load from Docugami. | | |
| DocusaurusLoader | Load from Docusaurus Documentation. | | |
| Docx2txtLoader | Load DOCX file using docx2txt and chunks at character level. | | |
| DropboxLoader | Load files from Dropbox. | | |
| DuckDBLoader | Load from DuckDB. | | |
| EtherscanLoader | Load transactions from Ethereum mainnet. | | |
| EverNoteLoader | Load from EverNote. | | |
| FacebookChatLoader | Load Facebook Chat messages directory dump. | | |
| FaunaLoader | Load from FaunaDB. | | |
| FigmaFileLoader | Load Figma file. | | |
| FireCrawlLoader | Load web pages as Documents using FireCrawl. | | |
| GCSDirectoryLoader | [Deprecated] Load from GCS directory. | | |
| GCSFileLoader | [Deprecated] Load from GCS file. | | |
| GeoDataFrameLoader | Load geopandas Dataframe. | | |
| GitHubIssuesLoader | Load issues of a GitHub repository. | | |
| GitLoader | Load Git repository files. | | |
| GitbookLoader | Load GitBook data. | | |
| GithubFileLoader | Load GitHub File. | | |
| GlueCatalogLoader | Load table schemas from AWS Glue. | | |
| GoogleApiYoutubeLoader | Load all Videos from a YouTube Channel. | | |
| GoogleDriveLoader | [Deprecated] Load Google Docs from Google Drive. | | |
| GoogleSpeechToTextLoader | [Deprecated] Loader for Google Cloud Speech-to-Text audio transcripts. | | |
| GutenbergLoader | Load from Gutenberg.org. | | |
| HNLoader | Load Hacker News data. | | |
| HuggingFaceDatasetLoader | Load from Hugging Face Hub datasets. | | |
| HuggingFaceModelLoader | | | |
| IFixitLoader | Load iFixit repair guides, device wikis and answers. | | |
| IMSDbLoader | Load IMSDb webpages. | | |
| ImageCaptionLoader | Load image captions. | | |
| IuguLoader | Load from IUGU. | | |
| JSONLoader | | | |
| JoplinLoader | Load notes from Joplin. | | |
| KineticaLoader | Load from Kinetica API. | | |
| LLMSherpaFileLoader | Load Documents using LLMSherpa. | | |
| LakeFSLoader | Load from lakeFS. | | |
| LarkSuiteDocLoader | Load from LarkSuite (FeiShu). | | |
| MHTMLLoader | Parse MHTML files with BeautifulSoup. | | |
| MWDumpLoader | Load MediaWiki dump from an XML file. | | |
| MastodonTootsLoader | Load the Mastodon 'toots'. | | |
| MathpixPDFLoader | Load PDF files using Mathpix service. | | |
| MaxComputeLoader | Load from Alibaba Cloud MaxCompute table. | | |
| MergedDataLoader | Merge documents from a list of loaders. | | |
| ModernTreasuryLoader | Load from Modern Treasury. | | |
| MongodbLoader | Load MongoDB documents. | | |
| NewsURLLoader | Load news articles from URLs using Unstructured. | | |
| NotebookLoader | Load Jupyter notebook (.ipynb) files. | | |
| NotionDBLoader | Load from Notion DB. | | |
| NotionDirectoryLoader | Load Notion directory dump. | | |
| OBSDirectoryLoader | Load from Huawei OBS directory. | | |
| OBSFileLoader | Load from the Huawei OBS file. | | |
| ObsidianLoader | Load Obsidian files from directory. | | |
| OneDriveFileLoader | Load a file from Microsoft OneDrive. | | |
| OneDriveLoader | Load from Microsoft OneDrive. | | |
| OnlinePDFLoader | Load online PDF. | | |
| OpenCityDataLoader | Load from Open City. | | |
| OracleAutonomousDatabaseLoader | | | |
| OracleDocLoader | Read documents using OracleDocLoader. | | |
| OutlookMessageLoader | | | |
| PDFMinerLoader | Load PDF files using PDFMiner. | | |
| PDFMinerPDFasHTMLLoader | Load PDF files as HTML content using PDFMiner. | | |
| PDFPlumberLoader | Load PDF files using pdfplumber. | | |
| PagedPDFSplitter | Load PDF using pypdf into list of documents. | | |
| PebbloSafeLoader | Pebblo Safe Loader class is a wrapper around document loaders enabling the data | | |
| PlaywrightURLLoader | Load HTML pages with Playwright and parse with Unstructured. | | |
| PolarsDataFrameLoader | Load Polars DataFrame. | | |
| PsychicLoader | Load from Psychic.dev. | | |
| PubMedLoader | Load from the PubMed biomedical library. | | |
| PyMuPDFLoader | Load PDF files using PyMuPDF. | | |
| PyPDFDirectoryLoader | Load a directory with PDF files using pypdf and chunks at character level. | | |
| PyPDFLoader | Load PDF using pypdf into list of documents. | | |
| PyPDFium2Loader | Load PDF using pypdfium2 and chunks at character level. | | |
| PySparkDataFrameLoader | Load PySpark DataFrames. | | |
| PythonLoader | Load Python files, respecting any non-default encoding if specified. | | |
| RSSFeedLoader | Load news articles from RSS feeds using Unstructured. | | |
| ReadTheDocsLoader | Load ReadTheDocs documentation directory. | | |
| RecursiveUrlLoader | Recursively load all child links from a root URL. | | |
| RedditPostsLoader | Load Reddit posts. | | |
| RoamLoader | Load Roam files from a directory. | | |
| RocksetLoader | Load from a Rockset database. | | |
| S3DirectoryLoader | Load from Amazon AWS S3 directory. | | |
| S3FileLoader | Load from Amazon AWS S3 file. | | |
| SQLDatabaseLoader | | | |
| SRTLoader | Load .srt (subtitle) files. | | |
| ScrapflyLoader | Turn a URL into LLM-accessible markdown with Scrapfly.io. | | |
| SeleniumURLLoader | Load HTML pages with Selenium and parse with Unstructured. | | |
| SharePointLoader | Load from SharePoint. | | |
| SitemapLoader | Load a sitemap and its URLs. | | |
| SlackDirectoryLoader | Load from a Slack directory dump. | | |
| SnowflakeLoader | Load from Snowflake API. | | |
| SpiderLoader | Load web pages as Documents using Spider AI. | | |
| SpreedlyLoader | Load from Spreedly API. | | |
| StripeLoader | Load from Stripe API. | | |
| SurrealDBLoader | Load SurrealDB documents. | | |
| TelegramChatApiLoader | Load Telegram chat json directory dump. | | |
| TelegramChatFileLoader | Load from Telegram chat dump. | | |
| TelegramChatLoader | Load from Telegram chat dump. | | |
| TencentCOSDirectoryLoader | Load from Tencent Cloud COS directory. | | |
| TencentCOSFileLoader | Load from Tencent Cloud COS file. | | |
| TensorflowDatasetLoader | Load from TensorFlow Dataset. | | |
| TextLoader | Load text file. | | |
| TiDBLoader | Load documents from TiDB. | | |
| ToMarkdownLoader | Load HTML using 2markdown API. | | |
| TomlLoader | Load TOML files. | | |
| TrelloLoader | Load cards from a Trello board. | | |
| TwitterTweetLoader | Load Twitter tweets. | | |
| UnstructuredAPIFileIOLoader | Load files using Unstructured API. | | |
| UnstructuredAPIFileLoader | Load files using Unstructured API. | | |
| UnstructuredCHMLoader | Load CHM files using Unstructured. | | |
| UnstructuredCSVLoader | Load CSV files using Unstructured. | | |
| UnstructuredEPubLoader | Load EPub files using Unstructured. | | |
| UnstructuredEmailLoader | Load email files using Unstructured. | | |
| UnstructuredExcelLoader | Load Microsoft Excel files using Unstructured. | | |
| UnstructuredFileIOLoader | Load files using Unstructured. | | |
| UnstructuredFileLoader | Load files using Unstructured. | | |
| UnstructuredHTMLLoader | Load HTML files using Unstructured. | | |
| UnstructuredImageLoader | Load PNG and JPG files using Unstructured. | | |
| UnstructuredMarkdownLoader | Load Markdown files using Unstructured. | | |
| UnstructuredODTLoader | Load OpenOffice ODT files using Unstructured. | | |
| UnstructuredOrgModeLoader | Load Org-Mode files using Unstructured. | | |
| UnstructuredPDFLoader | Load PDF files using Unstructured. | | |
| UnstructuredPowerPointLoader | Load Microsoft PowerPoint files using Unstructured. | | |
| UnstructuredRSTLoader | Load RST files using Unstructured. | | |
| UnstructuredRTFLoader | Load RTF files using Unstructured. | | |
| UnstructuredTSVLoader | Load TSV files using Unstructured. | | |
| UnstructuredURLLoader | Load files from remote URLs using Unstructured. | | |
| UnstructuredWordDocumentLoader | Load Microsoft Word file using Unstructured. | | |
| UnstructuredXMLLoader | Load XML file using Unstructured. | | |
| VsdxLoader | | | |
| WeatherDataLoader | Load weather data with Open Weather Map API. | | |
| WebBaseLoader | Load HTML pages using urllib and parse them with `BeautifulSoup`. | | |
| WhatsAppChatLoader | Load WhatsApp messages text file. | | |
| WikipediaLoader | Load from Wikipedia. | | |
| XorbitsLoader | Load Xorbits DataFrame. | | |
| YoutubeLoader | Load YouTube video transcripts. | | |
| YuqueLoader | Load documents from Yuque. | | |
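
To make the two feature columns concrete, here is a minimal sketch of what lazy loading and an async interface can look like in a loader. It uses only the Python standard library: `Document`, `LineLoader`, and `collect` are hypothetical stand-ins written for illustration, not the library's own classes, and the method names `lazy_load`, `load`, and `alazy_load` are assumed to mirror the usual loader interface. The async method here only wraps the synchronous generator, whereas a loader with native async support would await non-blocking I/O instead.

```python
import asyncio
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import AsyncIterator, Iterator, List


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document: text plus metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that produces one Document per non-empty line."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def lazy_load(self) -> Iterator[Document]:
        # Generator: a line is read and wrapped only when the caller asks for it,
        # so the whole file never has to sit in memory at once.
        with self.path.open(encoding="utf-8") as handle:
            for number, line in enumerate(handle, start=1):
                text = line.strip()
                if text:
                    yield Document(text, {"source": str(self.path), "line": number})

    def load(self) -> List[Document]:
        # Eager variant: materialize every document in a list.
        return list(self.lazy_load())

    async def alazy_load(self) -> AsyncIterator[Document]:
        # Illustrative async variant: reuses the sync generator and hands control
        # back to the event loop between documents. The file read itself still blocks.
        for doc in self.lazy_load():
            yield doc
            await asyncio.sleep(0)


async def collect(loader: LineLoader) -> List[Document]:
    """Drain the async iterator into a list."""
    return [doc async for doc in loader.alazy_load()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "notes.txt"
        sample.write_text("first\n\nsecond\nthird\n", encoding="utf-8")
        loader = LineLoader(str(sample))

        stream = loader.lazy_load()
        print(next(stream).page_content)  # only the first line has been read so far
        stream.close()  # closing the generator releases the file handle

        print(len(loader.load()))
        print(len(asyncio.run(collect(loader))))
```

The practical difference is in how the caller consumes results: `lazy_load` lets a pipeline process documents one at a time and stop early, while `load` is convenient when the whole collection fits in memory.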
