Robots.txt Generator Pro
Create standard-compliant Robots.txt files to control Googlebot & other crawlers.
The Ultimate Master Guide to Robots.txt & Bot Control
Fundamentals & Basic Concepts
What is Robots.txt?
It is a simple text file placed in your website's root directory. Think of it as a "Do Not Enter" sign for web robots. It doesn't physically lock the door (password protection does that), but it politely asks "good" bots not to go inside specific rooms.
Why is it Mandatory?
Without this file, search engines will try to crawl everything, including duplicate pages, private scripts, and internal search result pages. This wastes your crawl budget and dilutes your SEO rankings.
Where is it Located?
It must always be at the root: example.com/robots.txt. If you put it in a subdirectory like example.com/blog/robots.txt, bots will ignore it.
User-agent Meaning
"User-agent" is the name of the robot. User-agent: * means the rule applies to ALL robots. User-agent: Googlebot means the rule applies ONLY to Google.
The Clean-Slate Rule
Many developers assume that if they don't mention a bot, it's blocked. The opposite is true. By default, bots assume Allow: /. You don't need to explicitly write "Allow" lines unless you are overriding a specific "Disallow" rule in a parent directory.
Syntax, Configuration & Logic
Allow vs. Disallow
Disallow: /private/ blocks access to that folder. Allow: /private/image.jpg can unlock a specific file inside a blocked folder. Specificity matters more than order: the more specific (longer) rule wins, as explained under "Priority of Directives" below.
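A short sketch of that pattern, assuming a /private/ folder containing one image you still want crawled:
User-agent: *
Disallow: /private/
Allow: /private/image.jpg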
The Power of Wildcards (*)
The asterisk (*) is a wildcard that matches any sequence of characters. For example, Disallow: /*.pdf tells bots to block every URL containing .pdf, no matter where the file is located.
The Dollar Sign ($) & Regex
Standard robots.txt doesn't support full Regex, but Googlebot supports limited pattern matching. You can use $ to match the end of a URL. For example, Disallow: /*.xls$ will block all Excel files, but allow /file.xls?id=123 because that URL doesn't end immediately after .xls.
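Putting both patterns together, a sketch that blocks any URL containing .pdf and any URL ending exactly in .xls:
User-agent: *
Disallow: /*.pdf
Disallow: /*.xls$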
The Trailing Slash Dilemma
This is a common cause of SEO disasters. Disallow: /admin blocks both /admin/login and /admin-style.css. However, Disallow: /admin/ (with the slash) only blocks the folder contents. Always use the trailing slash if you intend to block a directory.
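The two variants side by side (illustrative only, not meant to be used together):
Disallow: /admin     # also matches /admin-style.css and /administrator/
Disallow: /admin/    # matches only URLs inside the /admin/ folder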
Case Sensitivity Traps
Servers might be case-insensitive, but the paths in robots.txt rules are matched case-sensitively. Disallow: /Files/ will NOT block /files/. Ensure your rules match the exact casing of your URL structure as seen in the browser address bar.
Priority of Directives
Google uses the "Longest Match" rule. If you have Allow: /folder/page and Disallow: /folder/, Google will crawl the page because the "Allow" rule is longer (more specific), even if the "Disallow" rule appears first in the file.
Handling Crawl-delay & DDoS Protection
If bots are slowing down your server, use Crawl-delay: 10. This asks them to wait 10 seconds between requests. Aggressive bots can crash small servers by requesting too many pages at once; this acts as a polite rate limit. Note: Googlebot ignores this directive, but Bing and Yahoo respect it.
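A minimal example aimed at Bingbot (Googlebot will simply skip the directive):
User-agent: Bingbot
Crawl-delay: 10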
Managing AI & Scrapers
Why Block ChatGPT (GPTBot)?
OpenAI uses GPTBot to scrape the web to train their AI models. If you don't want your unique content to be used to train ChatGPT without credit, you should block this bot.
What is Common Crawl (CCBot)?
CCBot is a massive web scraper used by many AI companies (not just OpenAI) to build datasets. Blocking it is the most effective way to stop your site from feeding multiple AI models at once.
Future-Proofing for AI Search (SGE)
As Google moves to Search Generative Experience (SGE), allowing Google-Extended becomes a strategic choice. If you block AI scrapers, your content might not appear in AI-generated summaries, potentially lowering your visibility in the new era of search.
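A sketch that opts out of the AI crawlers discussed above; drop the Google-Extended group if you prefer to stay visible in Google's AI features:
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /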
Does Blocking AI Hurt SEO?
No. ChatGPT and Claude are not search engines. Blocking them does not affect your ranking on Google or Bing. It only protects your intellectual property.
CMS & Platform Specific Rules
WordPress Best Practices
Always block /wp-admin/. However, NEVER block /wp-content/uploads/ or /wp-includes/js/. WordPress often generates a virtual robots.txt dynamically. If you upload a real .txt file via FTP, it overrides the dynamic one. Be careful of conflicts between real files and plugins like Yoast SEO.
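A common WordPress sketch, assuming a default install; the admin-ajax.php exception keeps AJAX-driven front-end features crawlable:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php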
Shopify Requirements
Shopify handles robots.txt automatically, but you can override it. It's crucial to block /cart, /orders, and /checkout. Also, watch out for "oEmbed" issues; ensure you aren't blocking third-party app resources that load reviews or popups.
Magento & E-commerce
E-commerce sites generate thousands of URL parameters (filters like ?color=red). Use robots.txt to block these parameters to save your Crawl Budget.
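A sketch for blocking faceted filters; the parameter names (color, size) are placeholders for whatever your store actually generates:
User-agent: *
Disallow: /*?color=
Disallow: /*&color=
Disallow: /*?size=
Disallow: /*&size=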
Robots.txt for React & SPA
For Single Page Applications (React, Vue, Angular), bots need to execute JavaScript to see content. Never block your JS or API endpoints (e.g., /api/v1/products) if they deliver content. Blocking these resources prevents Google from rendering your page correctly.
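A hedged sketch of a safe SPA configuration: block a genuinely private area while leaving scripts and content APIs open (the paths are placeholders):
User-agent: *
Disallow: /account/
# /static/js/ and /api/v1/products are deliberately NOT disallowed so Google can render the page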
Wix & Squarespace Nuances
Wix generates a robots.txt automatically, but you can edit it in the SEO Dashboard. Squarespace is rigid; you cannot edit the physical file but can inject rules via SEO settings. Squarespace blocks built-in search pages by default with Noindex headers, so duplicating it in robots.txt is redundant.
Strategic SEO & Indexing Control
Robots.txt vs Noindex (The Golden Rule)
This is the #1 mistake. Robots.txt prevents crawling, but "noindex" prevents indexing. If a page is already indexed, blocking it in robots.txt will NOT remove it; it will just linger in the index flagged as "Indexed, though blocked by robots.txt". To de-index, you must Allow the page first, add a noindex meta tag, wait for Google to see it, and then block it.
Crawl Budget Optimization
For sites with 10,000+ pages, Crawl Budget is real. Block filters, sort parameters (?sort=price_asc), and internal search results (/search?q=). This forces Googlebot to spend its time on your high-value blog posts and product pages instead of junk URLs.
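A sketch targeting the junk patterns named above; adjust the parameter names to match your own URL structure:
User-agent: *
Disallow: /*?sort=
Disallow: /search?q=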
Handling Affiliate Links
If you run an affiliate site, you might have thousands of outgoing redirect links (e.g., /go/product-name). It is best practice to Disallow the /go/ directory to prevent bots from wasting resources following external redirects.
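A minimal sketch, assuming your affiliate redirects live under /go/ as in the example above:
User-agent: *
Disallow: /go/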
Blocking CSS and JS Files
Stop! Years ago, SEOs recommended blocking .css and .js files. Today, Google renders pages like a human. If you block style files, Google will see a broken page and may rank you lower.
Security & Privacy
Security Through Obscurity?
Never rely on robots.txt to hide private data. Listing Disallow: /secret-financial-report.pdf tells hackers exactly where your sensitive files are. Use server-side password protection (.htaccess) for real security.
GDPR & User Data Paths
If your site generates URLs containing user emails or IDs (e.g., /user-profile/john@doe.com), you must block these to prevent personal data from appearing in search results. This is crucial for GDPR compliance.
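A hedged sketch, assuming profile URLs sit under a /user-profile/ path as in the example above (remember that robots.txt only stops crawling; already-indexed personal data also needs a noindex or removal request):
User-agent: *
Disallow: /user-profile/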
Blocking Staging Environments
When building a new version of your site (e.g., dev.site.com), always use Disallow: /. If Google indexes your staging site, it can create serious duplicate-content problems that undermine your main live site.
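The whole staging site can be closed with two lines; make sure this file is served only on the staging domain, never on the live one:
User-agent: *
Disallow: /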
The "Allow" Trap for Hackers
Malicious bots ignore robots.txt. If you see traffic to disallowed folders in your analytics, it’s a sign of a bad actor. Use this file as a "tripwire"—if an IP accesses a disallowed folder, block that IP at the firewall level.
Sitemaps & Final Checks
Sitemap Location & Hacks
You can put your Sitemap link anywhere in the file. Interestingly, you can include sitemaps from different domains. If you host your sitemap on Amazon S3 or a CDN (e.g., https://cdn.mysite.com/sitemap.xml), you can still reference it in your main domain's robots.txt. Google accepts this cross-origin reference.
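For instance, using the CDN URL from the example above, a cross-origin reference is just one line anywhere in the file:
Sitemap: https://cdn.mysite.com/sitemap.xml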
How to Test Your File
After uploading, go to Google Search Console and open the robots.txt report (the successor to the old "Robots.txt Tester"). It will tell you instantly if you are accidentally blocking important pages.
Google Images & Videos
If you want your images to appear in Google Images, ensure you are NOT blocking the image folder. Conversely, to hide images, specifically disallow the images directory.
Comments & Trailing Slashes
You can use # to add comments for yourself. Also, be careful with slashes: /fish blocks anything that starts with /fish (including /fish-recipes.html), while /fish/ only blocks the contents of the /fish/ directory.