How it works
Starting from a URL, our agent extracts information from your site and prepares it for ClaudIA to use. The content is optimized so that ClaudIA can maintain its performance.
What can be done and what should be avoided
What can be done:
- URLs that start with "help" or "support"
- URLs under the base website (www.example.com) that contain a path we want to extract, such as '/products'
- URLs under the base website (www.example.com) that include some paths we do not want to extract, such as '/blog'
What to avoid:
- URLs where all information is contained on a single page
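The URL rules listed under "What can be done" can be pictured as a simple include/exclude filter. The sketch below is only an illustration: the constant names, the `should_extract` function, and the choice to match paths by prefix are assumptions, not the agent's actual configuration.

```python
from urllib.parse import urlparse

# Hypothetical rule set for illustration; not the agent's real settings.
ALLOWED_HOST_PREFIXES = ("help.", "support.")  # e.g. help.example.com
BASE_HOST = "www.example.com"
INCLUDE_PATHS = ("/products",)  # paths we want to extract
EXCLUDE_PATHS = ("/blog",)      # paths we do not want to extract


def _under(path: str, prefix: str) -> bool:
    """True if path equals prefix or sits below it (so /blogger does not match /blog)."""
    return path == prefix or path.startswith(prefix.rstrip("/") + "/")


def should_extract(url: str) -> bool:
    """Decide whether a URL falls inside the extraction scope."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path or "/"

    # Help and support sites are extracted as a whole.
    if host.startswith(ALLOWED_HOST_PREFIXES):
        return True
    # Anything outside the base website is out of scope.
    if host != BASE_HOST:
        return False
    # Exclusions win over inclusions.
    if any(_under(path, p) for p in EXCLUDE_PATHS):
        return False
    return any(_under(path, p) for p in INCLUDE_PATHS)
```

Under these assumed rules, `should_extract("https://www.example.com/products/widget")` is accepted, while a URL under `/blog` on the same site is rejected.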
How you can view this in the Content page
How to handle this content
You will be able to view and access the External Content that was used as the basis for generating this specific session.
Additionally, a filter can be applied to view all sources extracted from external origins.
Finally, based on certain filters, it is possible to disable or delete multiple IDS sessions if desired.
This external content will not be editable, as it will always be overwritten when performing another synchronization with the content database.
Which tag gets filled
Initially, an arbitrary default tag will be filled in (e.g., 'faq' or 'product').
⮕⮕⮕ We plan to evolve towards automatic content categorization in future versions.
How you will see it in the Conversation
Just like in the "Content" tab, in the Conversation tab you can also see whether the content is from an external source or not.
How to perform synchronization
IMPORTANT: If you choose to fetch external source content using our agent, it is ESSENTIAL to disable sessions that are identical or duplicated. Our team can analyze duplicates to support this process.
⮕⮕⮕ This feature is currently in Beta; to use it, you need to contact our support team.
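As a rough aid for spotting identical sessions before disabling them, the sketch below groups session texts by a fingerprint of their normalized content. It is a minimal illustration, not the analysis our team performs: the `fingerprint` and `find_duplicates` names and the `{session_id: text}` input shape are hypothetical, and it only catches texts that match exactly after whitespace and case are normalized.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(sessions: dict[str, str]) -> list[list[str]]:
    """Return groups of session ids whose normalized text is identical.

    `sessions` maps a hypothetical session id to its text content.
    """
    groups = defaultdict(list)
    for session_id, text in sessions.items():
        groups[fingerprint(text)].append(session_id)
    # Only groups with more than one member are duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]
```

Sessions that are reworded but cover the same topic will not be grouped by this approach and still need manual review.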
📌 List of HTML tags and selectors ignored by default
During the refinement process, the extraction agent automatically discards many elements that are not usually useful for the knowledge base. These selectors are defined to reduce visual and structural noise (menus, ads, comments, forms, etc.), keeping only the main content.
Page structure
#footer
#header
#nav
nav
footer
Scripts and styles
script
style
noscript
Media
svg
img
audio
video
Navigation and menus
.sidebar
.menu
.navigation
.breadcrumb
.breadcrumbs
.pagination
.pager
.page-navigation
Ads and banners
.advertisement
.ads
.ad-banner
.cookie-banner
.cookie-notice
.gdpr-notice
Social and sharing
.social-share
.social-buttons
.share-buttons
Forms and subscriptions
.newsletter
.subscription
.signup-form
.search-box
.search-form
.search-bar
Related content
.related-posts
.recommended
.suggestions
Comments and discussions
.comments
.comment-section
.discussion
Metadata and authorship
.tags
.tag-list
.categories
.author-bio
.author-info
.byline
.meta
.metadata
.post-meta
Widgets and sidebars
.widget
.widgets
.sidebar-widget
Popups and overlays
.popup
.modal
.overlay
.lightbox
Accessibility and hidden navigation
.skip-link
.screen-reader-text
.sr-only
.print-only
.no-print
Accessibility attributes
[role='alert']
[role='banner']
[role='navigation']
[role='complementary']
[role='dialog']
[role='alertdialog']
[role="region"][aria-label*="skip" i]
[aria-hidden='true']
[aria-modal='true']
Invisible elements
.hidden
.invisible
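To make the effect of this list concrete, the sketch below drops elements matching a small subset of the selectors above and keeps the remaining text. It is an illustration only, not the extraction agent's implementation: the class and function names are hypothetical, it uses Python's standard `html.parser`, it covers only simple tag, id, class, and attribute matches, and it assumes reasonably well-formed HTML where non-void elements are explicitly closed.

```python
from html.parser import HTMLParser

# A subset of the ignored selectors listed above, for illustration.
IGNORED_TAGS = {"nav", "footer", "script", "style", "noscript",
                "svg", "img", "audio", "video"}
IGNORED_IDS = {"footer", "header", "nav"}
IGNORED_CLASSES = {"sidebar", "menu", "breadcrumb", "cookie-banner",
                   "comments", "hidden", "invisible"}
IGNORED_ROLES = {"alert", "banner", "navigation", "complementary",
                 "dialog", "alertdialog"}
# Void elements have no end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MainContentExtractor(HTMLParser):
    """Collects text outside of ignored elements (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside an ignored element
        self._chunks = []

    def _is_ignored(self, tag, attrs):
        attr = {k: (v or "") for k, v in attrs}
        classes = set(attr.get("class", "").split())
        return (
            tag in IGNORED_TAGS
            or attr.get("id") in IGNORED_IDS
            or bool(classes & IGNORED_CLASSES)
            or attr.get("role") in IGNORED_ROLES
            or attr.get("aria-hidden") == "true"
            or attr.get("aria-modal") == "true"
        )

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # carries no text and never closes
        if self._skip_depth or self._is_ignored(tag, attrs):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_main_text(html: str) -> str:
    """Return the text that remains after discarding ignored elements."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For a page with a header, a menu, and a cookie banner around an article, `extract_main_text` would return only the article's headings and paragraphs.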