How it works

Starting from a URL, our agent extracts information from your site and prepares it for ClaudIA to use. The content is optimized so that ClaudIA can maintain performance.

What can be done and what should be avoided

What can be done:

URLs that start with "help" or "support"
URLs under the base website (www.example.com) that contain a path we want to extract, such as '/products'
URLs under the base website (www.example.com) where some paths, such as '/blog', should be excluded from extraction
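
The include and exclude rules above can be pictured as a simple URL filter. The following is a minimal sketch in Python, not the agent's actual implementation: the function name `should_extract`, the rule constants, and the reading of "start with help or support" as a hostname prefix (for example help.example.com) are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical rule set for illustration; not the product's real configuration.
INCLUDE_PATHS = ("/products",)
EXCLUDE_PATHS = ("/blog",)
HELP_PREFIXES = ("help.", "support.")  # assumed to mean hostname prefixes


def _under(path: str, prefix: str) -> bool:
    """True when path is the prefix itself or sits below it."""
    return path == prefix or path.startswith(prefix + "/")


def should_extract(url: str, base_host: str = "www.example.com") -> bool:
    """Return True when the URL matches the illustrative include rules."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path or "/"

    # Help-center or support hosts are extracted in full.
    if host.startswith(HELP_PREFIXES):
        return True

    # Anything outside the base website is ignored.
    if host != base_host:
        return False

    # Exclusions take priority over inclusions.
    if any(_under(path, p) for p in EXCLUDE_PATHS):
        return False
    return any(_under(path, p) for p in INCLUDE_PATHS)
```

With these assumed rules, `should_extract("https://www.example.com/products/widget")` is accepted, while a '/blog' URL on the same site is rejected.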

What to avoid:

URLs where all information is contained on a single page

How you can view this in the Content page

How to handle this content

You will be able to view and access the External Content that was used as the basis for generating this specific session.

Additionally, a filter can be applied to view all sources extracted from external origins.

Finally, you can use filters to disable or delete multiple IDS sessions at once, if desired.

External content is not editable, because it is overwritten each time it is synchronized with the content database.

Which tag gets filled

Initially, an arbitrary default tag is filled in (e.g., 'faq' or 'product').

⮕⮕⮕ We plan to evolve towards automatic content categorization in future versions.

How you will see it in the Conversation

Just like in the "Content" tab, in the Conversation tab you can also see whether the content is from an external source or not.

How to perform synchronization

IMPORTANT: If you choose to fetch external source content using our agent, it is ESSENTIAL to disable sessions that are identical or duplicated. Our team can analyze duplicates to support this process.
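
As a rough illustration of what spotting identical sessions could look like, the sketch below groups entries whose text matches after normalizing whitespace and letter case. It is an assumption-based example in Python, not the analysis our team performs: the names `fingerprint` and `find_duplicates` and the dictionary-of-texts input are hypothetical, and near-duplicates with different wording are not detected.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(sessions: dict[str, str]) -> list[list[str]]:
    """Return groups of session ids whose normalized text is identical."""
    groups = defaultdict(list)
    for session_id, text in sessions.items():
        groups[fingerprint(text)].append(session_id)
    # Keep only groups with more than one member: those are the duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]
```

Each returned group lists ids with the same content, so all but one entry in a group are candidates for disabling.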

⮕⮕⮕ Currently, you need to contact our support team to use this feature; it is in Beta.

📌 List of HTML elements and selectors ignored by default

During the refinement process, the extraction agent automatically discards many elements that are not usually useful for the knowledge base. These selectors are defined to reduce visual and structural noise (menus, ads, comments, forms, etc.), keeping only the main content.

Page structure

#footer
#header
#nav
nav
footer

Scripts and styles

script
style
noscript

Media

svg
img
audio
video

Navigation and menus

.sidebar
.menu
.navigation
.breadcrumb
.breadcrumbs
.pagination
.pager
.page-navigation

Ads and banners

.advertisement
.ads
.ad-banner
.cookie-banner
.cookie-notice
.gdpr-notice

Social and sharing

.social-share
.social-buttons
.share-buttons

Forms and subscriptions

.newsletter
.subscription
.signup-form
.search-box
.search-form
.search-bar

Comments and discussions

.comments
.comment-section
.discussion

Metadata and authorship

.tags
.tag-list
.categories
.author-bio
.author-info
.byline
.meta
.metadata
.post-meta

Widgets and sidebars

.widget
.widgets
.sidebar-widget

Popups and overlays

.popup
.modal
.overlay
.lightbox

Accessibility and hidden navigation

.skip-link
.screen-reader-text
.sr-only
.print-only
.no-print

Accessibility attributes

[role='alert']
[role='banner']
[role='navigation']
[role='complementary']
[role='dialog']
[role='alertdialog']
[role="region"][aria-label*="skip" i]
[aria-hidden='true']
[aria-modal='true']

Invisible elements

.hidden
.invisible
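
To show how an ignore list like the one above removes noise, here is a minimal sketch using only Python's standard library. It is not the extraction agent's code: the class `MainContentExtractor`, the function `extract_main_text`, and the small subset of rules are illustrative assumptions, and the sketch expects reasonably well-formed HTML where every non-void tag is closed.

```python
from html.parser import HTMLParser

# Small illustrative subset of the ignore list; the real list is longer.
IGNORED_TAGS = {"nav", "footer", "script", "style", "noscript", "svg"}
IGNORED_IDS = {"footer", "header", "nav"}
IGNORED_CLASSES = {"sidebar", "menu", "cookie-banner", "comments", "hidden"}
# Void elements never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MainContentExtractor(HTMLParser):
    """Collects text while skipping everything inside ignored elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an ignored element
        self.chunks = []

    def _is_ignored(self, tag, attrs):
        attr_map = dict(attrs)
        classes = set((attr_map.get("class") or "").split())
        return (
            tag in IGNORED_TAGS
            or attr_map.get("id") in IGNORED_IDS
            or bool(classes & IGNORED_CLASSES)
            or attr_map.get("aria-hidden") == "true"
        )

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside an ignored element, every nested tag deepens the skip.
        if self.skip_depth or self._is_ignored(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html: str) -> str:
    """Return the visible text that survives the ignore rules."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For a page with a header, a navigation menu, a sidebar, and a footer around an article, only the article's headings and paragraphs remain in the returned text.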

Synchronized FAQ or Product Page Content