What is meant by robots.txt?
Important Guidelines By Zakyas Butt (@ZakyasButt)
Explanation of Each robots.txt Code
Answer:
Easy Definition of "What is robots.txt" & "How to Add robots.txt"
Robots.txt is a command file we create to instruct the Google search engine and other engines to crawl our website as per our instructions. It does not mean that our website's posts or URLs get added to Google Search Console; it is just a set of instructions for the Google bot and other engines' bots.
Complicated Definition of robots.txt
Actually, robots.txt is a simple text file with simple codes, used by the owner of a website to write specific commands for search engine spiders or crawlers. The file tells the crawlers which pages they are not permitted to index. Let me throw some extra light on this: whenever web crawlers crawl your website, before indexing your web pages they first look for the robots.txt file to check the exclusion rules. From it, search engine crawlers get an idea of which web pages to index and which pages not to index. Used properly, this is a great tool to increase the ranking of your blog and to get more traffic. Let us dig deeper and learn how to add a custom robots.txt file to a Blogspot blog.
Add a custom robots.txt file in your Blogger blog
1) Go to Blogger.com
2) Log in to your account
3) Select the blog (if you have multiple blogs; otherwise skip this step)
4) Look for Settings >> Search Preferences
5) Under Search Preferences, find the custom robots.txt setting and click its Edit option. You will see a text area that you can use to control content exclusion by web crawlers or spiders.
Now enter the following code in that box:
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://appjah.blogspot.com/feeds/posts/default?orderby=UPDATED
NOTE: Replace “appjah.blogspot.com” with the name of your blog.
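To check that these rules behave as intended, here is a small sketch using Python's standard urllib.robotparser module (the blog address appjah.blogspot.com is the sample domain used above):

```python
from urllib.robotparser import RobotFileParser

# The generic rules from the robots.txt above (the "*" section).
rules = """\
User-agent: *
Disallow: /search
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Blogger label/search pages are blocked for all crawlers...
search_ok = parser.can_fetch("*", "http://appjah.blogspot.com/search/label/seo")
# ...while ordinary post URLs remain crawlable.
post_ok = parser.can_fetch("*", "http://appjah.blogspot.com/2020/01/my-post.html")
print(search_ok, post_ok)  # False True
```

Because Disallow: /search appears before Allow: /, any path beginning with /search is rejected, and everything else falls through to the allow rule.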
You are now done adding a custom robots.txt file to your Blogspot blog.
Explanation of each code
You may now wonder about the many terms such as User-agent, Disallow, Disallow: /search, Allow, and Sitemap. If you were not aware of these terms, check my explanation of each code.
User-agent: Mediapartners-Google
This code is for Google AdSense. The AdSense bot crawls your blog so that it can serve better ads on it. So this code is only for the use of Google AdSense; if you are not using Google AdSense, you can simply ignore this part.
User-agent: *
The user agent marked with an asterisk matches the bots or crawlers of all search engines, whether Yahoo, Bing, or any other, and applies the rules that follow to all of them when they crawl your blog.
Disallow:
An empty Disallow line means nothing is blocked for that user agent: the crawler may access all of your web pages. (Be careful not to confuse it with Disallow: /, which blocks the entire site.)
Disallow: /search
This means you are not allowing crawlers into the search results of your blog. That is, you are blocking the search directory, which comes right after your domain name. Such a URL looks like
http://yoursite.blogspot.com/search/label/yourlabel
These pages will not be crawled and will never be indexed.
Allow: /
As its name states, this allows crawlers to crawl the home page of your blog.
Sitemap
This code points to the sitemap of our blog. It helps the crawlers index all web pages that can be accessed. Web crawlers will always follow the sitemap because it makes their job easier: the sitemap holds the links of all our published posts.
Note that this feed-based sitemap covers only the first 500 posts:
Sitemap: http://appjah.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=500
If you have more than 500 posts, say 1000, then you can add two sitemap lines, the second starting where the first ends:
Sitemap: http://appjah.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=500
Sitemap: http://appjah.blogspot.com/atom.xml?redirect=false&start-index=501&max-results=500
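The pattern generalizes: one sitemap line per block of 500 posts, with start-index advancing by 500 each time. A small Python sketch (the helper function name is my own) generates these lines for any post count:

```python
# Sketch: generate Blogger feed-sitemap lines for a blog with `total` posts,
# 500 posts per line (the feed's maximum page size).
def sitemap_lines(domain, total, page=500):
    lines = []
    for start in range(1, total + 1, page):
        lines.append(
            f"Sitemap: http://{domain}/atom.xml?"
            f"redirect=false&start-index={start}&max-results={page}"
        )
    return lines

lines = sitemap_lines("appjah.blogspot.com", 1000)
print("\n".join(lines))  # two lines: start-index=1 and start-index=501
```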
Easy Definition of Sitemap.xml
Sitemap.xml is a list of all the URLs of our website. The sitemap allows the search engine to find all your website's pages; it is a simple way for our website to announce all its web pages for indexing.
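As a sketch of what a sitemap file actually contains, the snippet below builds a tiny sitemap.xml in memory and extracts its URLs with Python's standard xml.etree module (the two post URLs are made up for illustration):

```python
import xml.etree.ElementTree as ET

# A minimal sitemap.xml with two illustrative post URLs.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://appjah.blogspot.com/2020/01/first-post.html</loc></url>
  <url><loc>https://appjah.blogspot.com/2020/02/second-post.html</loc></url>
</urlset>"""

# Sitemap elements live in the sitemaps.org namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
# Every <url><loc> pair is one page the search engine should know about.
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)
```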
Copy the text below, paste it into Notepad, and save it as "robots.txt". If you have a WordPress blog, you can upload it to your WordPress site, or you can paste it into your blog settings' crawl options. The commands are as under.
User-agent: *
Allow: /

User-agent: Mediapartners-Google
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /

Sitemap: https://appjah.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=500
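As a quick sanity check, Python's urllib.robotparser can confirm that an allow-everything file like this admits both named bots and any other crawler, and that the Sitemap line is picked up (a sketch, abbreviated to two agents; the domain is the sample one from the text):

```python
from urllib.robotparser import RobotFileParser

# The allow-everything robots.txt from above, abbreviated to two agents.
rules = """\
User-agent: *
Allow: /

User-agent: Mediapartners-Google
Allow: /

Sitemap: https://appjah.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=500
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Both a named agent and any other bot may fetch any page...
adsense_ok = parser.can_fetch("Mediapartners-Google", "https://appjah.blogspot.com/p/about.html")
other_ok = parser.can_fetch("SomeOtherBot", "https://appjah.blogspot.com/p/about.html")
# ...and the Sitemap line is exposed too (site_maps() needs Python 3.8+).
sitemaps = parser.site_maps()
print(adsense_ok, other_ok, sitemaps)
```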
If you have Disallow: /a, it blocks all URLs whose paths begin with /a. That could be /a.html, /a/b/c/hello, or /about.
Don't use the "Disallow: /" option: it blocks all URLs. Likewise, if you add any text or word after Disallow:, it will block every URL whose path begins with that text.
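The prefix matching described above can be demonstrated with urllib.robotparser from the Python standard library (the paths are the ones from the paragraph; example.com is a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

# Disallow: /a blocks every path that *begins with* /a, not just /a itself.
rules = """\
User-agent: *
Disallow: /a
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# All three example paths share the /a prefix, so all three are blocked.
blocked = [p for p in ("/a.html", "/a/b/c/hello", "/about")
           if not parser.can_fetch("*", "https://example.com" + p)]
# A path with a different prefix is unaffected.
allowed = parser.can_fetch("*", "https://example.com/blog")
print(blocked, allowed)
```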
Google Blogger robots.txt Related Issues
Do I need to disallow Mediapartners-Google in robots.txt for SEO?
I found that most websites disallow Mediapartners-Google in robots.txt. Why would I need to block Mediapartners-Google? Can blocking it help with better SEO, or does it not matter for SEO?
Answer:
It doesn't matter at all for SEO, because Mediapartners-Google is the AdSense crawler.
If you have AdSense ads on your site, then you shouldn't block the Mediapartners-Google user agent in robots.txt. But if you don't serve AdSense ads on your site, then you can surely block it, just as other webmasters do. If you see this user agent in your website logs and you think it consumes too much bandwidth, you're free to block it; however, Google has already configured the bot very well, and it crawls only the required and important pages.
About the AdSense ads crawler
What Is the AdSense Crawler?
Answer:
A crawler, also known as a spider or a bot, is the software Google uses
to process and index the content of web pages. The AdSense crawler visits your
site to determine its content in order to provide relevant ads.
Here are some important facts to know about the AdSense crawler:
- The crawler report is updated weekly. The crawl is performed automatically, and we're not able to accommodate requests for more frequent crawling.
- The AdSense crawler is different from the Google crawler. The two crawlers are separate, but they do share a cache. We do this to avoid both crawlers requesting the same pages, thereby helping publishers conserve their bandwidth. Similarly, the Search Console crawler is separate.
- Resolving AdSense crawl issues will not resolve issues with the Google crawl. Resolving the issues listed on your Crawler access page will have no impact on your placement within Google search results.
- The crawler indexes by URL. Our crawler will access site.com and www.site.com separately. However, our crawler will not count site.com and site.com/#anchor separately.
- The crawler won't access pages or directories prohibited by a robots.txt file. Both the Google and AdSense (Mediapartners) crawlers honor your robots.txt file. If your robots.txt file prohibits access to certain pages or directories, then they will not be crawled.
Note: If you're serving ads on pages that are blocked only by a generic User-agent: * rule, then the AdSense crawler will still crawl these pages. To prevent the AdSense crawler from accessing your pages, you need to name it explicitly with User-agent: Mediapartners-Google in your robots.txt file.
- The crawler will attempt to access URLs only where our ad tags are implemented. Only pages displaying Google ads should be sending requests to our systems and being crawled.
- The crawler will attempt to access pages that redirect. When you have "original pages" that redirect to other pages, our crawler must access the original pages to determine that a redirect is in place. Therefore, our crawler's visits to the original pages will appear in your access logs.
- There is no control over how often the crawler will index your site content. At this time, we have no control over re-crawling sites. Crawling is done automatically by our bots. If you make changes to a page, it may take up to 1 or 2 weeks before the changes are reflected in our index.
Give access to our crawler in your robots.txt file
Should We Give Access to the AdSense Crawler?
If you’ve modified your site’s robots.txt file to disallow the AdSense crawler from indexing your pages, then we are not able to serve Google ads on these pages. To update your robots.txt file and grant our crawler access to your pages, remove the following two lines of text from it:
User-agent: Mediapartners-Google
Disallow: /
This change will allow our crawler to index the content of your site and
provide you with Google ads. Please note that any changes you make to your
robots.txt file may not be reflected in our index until our crawlers attempt to
visit your site again.
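The effect of removing those two lines can be sketched with Python's urllib.robotparser, comparing the blocking file against an allow-all file (example.com is a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

# With the two lines present, the AdSense crawler is locked out entirely.
blocking = """\
User-agent: Mediapartners-Google
Disallow: /
"""

p1 = RobotFileParser()
p1.parse(blocking.splitlines())
adsense_blocked = p1.can_fetch("Mediapartners-Google", "https://example.com/post.html")

# With them removed (leaving, say, an allow-all default), access is restored.
open_rules = """\
User-agent: *
Allow: /
"""

p2 = RobotFileParser()
p2.parse(open_rules.splitlines())
adsense_allowed = p2.can_fetch("Mediapartners-Google", "https://example.com/post.html")
print(adsense_blocked, adsense_allowed)  # False True
```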
Create A robots.txt file
Important
Guidelines By Zakyas Butt (@ZakyasButt)
How Do We Create a robots.txt File?
Answer:
You can control which files crawlers may access on your site with a robots.txt file. A robots.txt file lives at the root of your site; so, for the site AppJah.blogspot.com, the robots.txt file lives at AppJah.blogspot.com/robots.txt. Robots.txt is a plain text file that follows the Robots Exclusion Standard. A robots.txt file consists of one or more rules. Each rule blocks or allows access for a given crawler to a specified file path on the domain or subdomain where the robots.txt file is hosted. Unless you specify otherwise in your robots.txt file, all files are implicitly allowed for crawling.
Here is a simple robots.txt file with two rules:
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://AppJah.blogspot.com/sitemap.xml
Here's what that robots.txt file means:
1. The user agent named Googlebot is not allowed to crawl any URL that starts with https://AppJah.blogspot.com/nogooglebot/.
2. All other user agents are allowed to crawl the entire site. This could have been omitted and the result would be the same; the default behavior is that user agents are allowed to crawl the entire site.
3. The site's sitemap file is located at https://AppJah.blogspot.com/sitemap.xml.
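Those points can be verified with Python's urllib.robotparser (a sketch; the two rules are restated inline and the URLs follow the example above):

```python
from urllib.robotparser import RobotFileParser

# The two-rule robots.txt from the example above.
rules = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot is shut out of /nogooglebot/ but nothing else...
googlebot_blocked = parser.can_fetch("Googlebot", "https://AppJah.blogspot.com/nogooglebot/page.html")
googlebot_rest = parser.can_fetch("Googlebot", "https://AppJah.blogspot.com/2020/01/post.html")
# ...while every other user agent may crawl the entire site.
other_ok = parser.can_fetch("SomeOtherBot", "https://AppJah.blogspot.com/nogooglebot/page.html")
print(googlebot_blocked, googlebot_rest, other_ok)
```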
Blogger's Default robots.txt and Sitemap
Do We Need to Create a Custom robots.txt File in Blogger?
Your robots.txt file already exists and contains everything you need (including a sitemap link). You don't have to create or enable a custom one.
The correct sitemap is in the yourdomain/sitemap.xml file.
The default robots.txt file is valid and does not need to be modified.
Do NOT change your robots.txt file unless you know exactly what you are doing. If you get things confused, you can block web crawlers from being able to visit your blog's posts.
Blogger automatically builds the sitemap file and updates it for you. You have to do nothing.
To see which of your pages are indexed and appear in Google Search, search in Google like this:
site:appjah.blogspot.com
or open the URL directly:
https://www.google.com/search?q=site%3Aappjah.blogspot.com
Google Sitemap Ping Link
https://www.google.com/ping?sitemap=https://appjah.blogspot.com/sitemap.xml
Google SEO Course
https://developers.google.com/search/docs