What is Robots.txt? Is This Necessary? All Questions | Google Blogger

What is meant by Robots.txt?

Important Guidelines By Zakyas Butt              (@ZakyasButt)



Explanation of Each Robots.txt Code

Answer:

Easy Definition of "What is Robots.txt" & "How To Add Robots.txt"

Robots.txt is an instruction file: we create it to tell Google and other search engine bots how to crawl our website. It does not mean that our website's posts or URLs are added to Google Search Console; it is simply a set of instructions for the Google bot and other search engine bots.

Detailed Definition of Robots.txt

Actually, robots.txt is a simple text file with simple codes, used by the owner of a website to write specific commands for search engine spiders or crawlers. The file tells crawlers which pages they may or may not index. Let me throw some extra light on this: whenever web crawlers visit your website, before indexing your web pages they first look for the robots.txt file to check the exclusion rules. From it, search engine crawlers get an idea of which web pages to index and which not to index. Used properly, this is a great tool to increase the ranking of your blog and to get more traffic. Let us dig deeper and learn how to add a custom robots.txt file to a Blogspot blog.
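For instance, the "first stop" a crawler makes can be computed from any page URL, because robots.txt always lives at the root of the host. A small sketch in Python's standard library (the blog URL and helper name here are just placeholders for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt URL a crawler checks before fetching page_url."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; robots.txt sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://appjah.blogspot.com/2020/01/my-post.html"))
# https://appjah.blogspot.com/robots.txt
```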

Add a custom robots.txt file to your Blogger blog

1)      Go to Blogger.com
2)      Login To your Account
3)      Select the blog if you have multiple blogs otherwise ignore this step
4)      Now go to Settings >> Search Preferences
5)      Next to "Custom robots.txt", click Edit and choose Yes to enable it. You will see a text area that you can use to control content exclusion by web crawlers or spiders.
6)      Enter the following code in that box:

User-agent: Mediapartners-Google
Disallow:
User-agent: *
Disallow: /search
Allow: /
Sitemap: http://appjah.blogspot.com/feeds/posts/default?orderby=UPDATED

 

NOTE: Replace “appjah.blogspot.com” with the name of your blog.
You are now done with adding a custom robots.txt file to your Blogspot blog.

Explanation of each code

Now you may wonder about the many terms here, such as User-agent, Disallow, Disallow: /search, Allow, and Sitemap. If you are not familiar with these terms, check the explanation of each code.

User-agent: Mediapartners-Google

This code is for Google AdSense. The AdSense bot will crawl your blog so that it can serve better ads. So this code is only for the use of Google AdSense. If you are not using Google AdSense, you can simply ignore this line.

User-agent: *

The User-agent marked with an asterisk applies to the bots or crawlers of every search engine, whether Yahoo, Bing, or any other, and here allows them all to crawl your blog.

Disallow:

A Disallow line with an empty value blocks nothing. Under User-agent: Mediapartners-Google above, it tells the AdSense bot that it may crawl every page.

Disallow: /search

This excludes your blog's search-result pages by default. That is, you are not allowing crawlers to crawl the /search directory, which comes immediately after your domain name, in URLs like
http://yoursite.blogspot.com/search/label/yourlabel
Such pages will not be crawled and will never be indexed.
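The effect of these rules can be sketched with Python's standard-library `urllib.robotparser` (the domain is a placeholder; note that Python's parser applies rules in first-match order while Google uses longest-match precedence, but both agree on this simple file):

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /search
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Label/search pages are excluded from crawling...
print(rp.can_fetch("Googlebot", "http://yoursite.blogspot.com/search/label/yourlabel"))  # False
# ...but the home page and ordinary post URLs are not.
print(rp.can_fetch("Googlebot", "http://yoursite.blogspot.com/"))                        # True
print(rp.can_fetch("Googlebot", "http://yoursite.blogspot.com/2020/01/my-post.html"))    # True
```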

Allow

As its name states, Allow: / permits crawlers to crawl the home page of your blog (and anything else not matched by a Disallow rule).

Sitemap

This code points crawlers to the sitemap of our blog. It helps crawlers index every accessible web page. Web crawlers will always find a path through the sitemap, which makes their job easier because the sitemap holds the links to all of our published posts.
Note that a feed-based sitemap URL like the one below covers at most the first 500 posts:

Sitemap: http://appjah.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=500

 

If you have more than 500 posts, say 1,000, then you can add two sitemaps as below:

Sitemap: http://appjah.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=500
Sitemap: http://appjah.blogspot.com/atom.xml?redirect=false&start-index=501&max-results=500
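Because Blogger's feeds return at most 500 entries per request, each additional sitemap should advance start-index by 500 (so the second one starts at 501). As a sketch, a hypothetical helper could generate the sitemap lines for any number of posts (the function name and domain are just for illustration):

```python
# Hypothetical helper: build Blogger feed-based sitemap lines, 500 posts each.
def blogger_sitemaps(domain, total_posts, page_size=500):
    lines = []
    for start in range(1, total_posts + 1, page_size):
        lines.append(
            f"Sitemap: http://{domain}/atom.xml?redirect=false"
            f"&start-index={start}&max-results={page_size}"
        )
    return lines

for line in blogger_sitemaps("appjah.blogspot.com", 1000):
    print(line)
```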

 

Easy Definition of Sitemap.xml

Sitemap.xml is a collection of all the URLs of our website. A sitemap lets search engines find all of your website's pages; it is a simple way to describe every web page for indexing.

 

If you have a WordPress blog, copy the text below, paste it into Notepad, and save it as "robots.txt". You can then upload it to your WordPress blog, or paste it into your blog settings' crawl options. The commands are as follows:


User-agent: *
Allow: /

User-agent: Mediapartners-Google
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: AdsBot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /

Sitemap: https://appjah.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=500

If you have Disallow: /a, it blocks all URLs whose paths begin with /a. That could be /a.html, /a/b/c/hello, or /about.

Don't use the "Disallow: /" rule unless you mean it: it blocks every URL on the site. Likewise, any text you add after Disallow: blocks every URL whose path begins with that text.
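A quick sketch with Python's standard-library `urllib.robotparser` confirms this prefix behaviour (the domain is a placeholder):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("User-agent: *\nDisallow: /a".splitlines())

# Disallow: /a blocks every path that begins with "/a"...
print(rp.can_fetch("*", "http://site.com/a.html"))       # False
print(rp.can_fetch("*", "http://site.com/a/b/c/hello"))  # False
print(rp.can_fetch("*", "http://site.com/about"))        # False
# ...but not paths that start differently.
print(rp.can_fetch("*", "http://site.com/blog"))         # True
```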


Google Blogger Robots.txt Related Issues


Do I need to disallow mediapartners-google in robots.txt for SEO?

I found that most of the websites disallow mediapartners-google in robots.txt.

Why would I need to block mediapartners-google? Can blocking it help with better SEO, or does it not matter for SEO?

Answer:

It doesn't matter at all for SEO, because it's the AdSense crawler.

If you serve AdSense ads on your site, then you shouldn't block the mediapartners-google user agent in robots.txt. But if you don't serve AdSense ads, then you can block it, just as other webmasters do. If you see this user agent in your website log and think it consumes too much bandwidth, you're free to block it; however, Google has configured the bot well, and it crawls only the required and important pages.

 

About the AdSense ads crawler


What Is AdSense Crawler

Answer:

A crawler, also known as a spider or a bot, is the software Google uses to process and index the content of web pages. The AdSense crawler visits your site to determine its content in order to provide relevant ads.

Here are some important facts to know about the AdSense crawler:

  • The crawler report is updated weekly.

The crawl is performed automatically and we're not able to accommodate requests for more frequent crawling.

  • The AdSense crawler is different from the Google crawler.

The two crawlers are separate, but they do share a cache. We do this to avoid both crawlers requesting the same pages, thereby helping publishers conserve their bandwidth. Similarly, the Search Console crawler is separate.

  • Resolving AdSense crawl issues will not resolve issues with the Google crawl.

Resolving the issues listed on your Crawler access page will have no impact on your placement within Google search results.

  • The crawler indexes by URL.

Our crawler will access site.com and www.site.com separately. However, our crawler will not count site.com and site.com/#anchor separately.

  • The crawler won't access pages or directories prohibited by a robots.txt file.

Both the Google and AdSense Mediapartners crawlers honor your robots.txt file. If your robots.txt file prohibits access to certain pages or directories, then they will not be crawled.

Note: If you’re serving ads on pages that are being roboted out with the line User-agent: *, then the AdSense crawler will still crawl these pages. To prevent the AdSense crawler from accessing your pages, you need to specify User-agent: Mediapartners-Google in your robots.txt file.

  • The crawler will attempt to access URLs only where our ad tags are implemented.

Only pages displaying Google ads should be sending requests to our systems and being crawled.

  • The crawler will attempt to access pages that redirect.

When you have "original pages" that redirect to other pages, our crawler must access the original pages to determine that a redirect is in place. Therefore, our crawler's visit to the original pages will appear in your access logs.

  • There is no control over how often the crawler will index your site content.

At this time, we have no control over re-crawling sites. Crawling is done automatically by our bots. If you make changes to a page, it may take up to 1 or 2 weeks before the changes are reflected in our index.

 

 

 

Give access to our crawler in your robots.txt file


Should We Give Access to AdSense Crawler?

If you’ve modified your site’s robots.txt file to disallow the AdSense crawler from indexing your pages, then we are not able to serve Google ads on these pages.

To update your robots.txt file to grant our crawler access to your pages, remove the following two lines of text from your robots.txt file:

User-agent: Mediapartners-Google
Disallow: /

This change will allow our crawler to index the content of your site and provide you with Google ads. Please note that any changes you make to your robots.txt file may not be reflected in our index until our crawlers attempt to visit your site again.
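As a sketch, Python's standard-library `urllib.robotparser` shows the effect of those two lines before they are removed (the domain and page path are placeholders):

```python
from urllib import robotparser

blocked = """\
User-agent: Mediapartners-Google
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(blocked.splitlines())

# With the Disallow: / rule present, the AdSense crawler is shut out...
print(rp.can_fetch("Mediapartners-Google", "https://example.com/page.html"))  # False
# ...while other crawlers, having no rule addressed to them, remain allowed.
print(rp.can_fetch("Googlebot", "https://example.com/page.html"))             # True
```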


Create A robots.txt file


How Do We Create a robots.txt File?

Answer:

You can control which files crawlers may access on your site with a robots.txt file. A robots.txt file lives at the root of your site. So, for the site AppJah.blogspot.com, the robots.txt file lives at AppJah.blogspot.com/robots.txt. Robots.txt is a plain text file that follows the Robots Exclusion Standard. A robots.txt file consists of one or more rules. Each rule blocks or allows access for a given crawler to a specified file path on the domain or subdomain where the robots.txt file is hosted. Unless you specify otherwise in your robots.txt file, all files are implicitly allowed for crawling.

Here is a simple robots.txt file with two rules:

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml

Here's what that robots.txt file means:

1.      The user agent named Googlebot is not allowed to crawl any URL that starts with https://www.example.com/nogooglebot/.

2.      All other user agents are allowed to crawl the entire site. This could have been omitted and the result would be the same; the default behavior is that user agents are allowed to crawl the entire site.

3.      The site's sitemap file is located at https://www.example.com/sitemap.xml.
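As a quick check, Python's standard-library `urllib.robotparser` can parse this same file. (Python's parser selects the first matching user-agent group rather than using Google's longest-match rule, but both agree on a simple file like this.)

```python
from urllib import robotparser

example = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /
Sitemap: https://www.example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(example.splitlines())

# Googlebot is barred only from the /nogooglebot/ directory.
print(rp.can_fetch("Googlebot", "https://www.example.com/nogooglebot/page"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/useful/page"))       # True
# Any other bot falls through to the * group and may crawl everything.
print(rp.can_fetch("Otherbot", "https://www.example.com/nogooglebot/page"))   # True
print(rp.site_maps())  # ['https://www.example.com/sitemap.xml']
```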

 


Do You Need a Custom robots.txt in Blogger?

Your robots.txt file already exists and contains everything you need (including a sitemap link). You don't have to create or enable a custom one.

The correct sitemap is at yourdomain/sitemap.xml.

The default robots.txt file is valid and does not need to be modified.

 

Do NOT change your robots.txt file unless you know exactly what you are doing. If you get it wrong, you can block web crawlers from being able to visit your blog's posts.

 

Blogger automatically builds a sitemap file and updates it for you. You have to do nothing.

 

To see which of your pages have appeared in Google Search, search Google with the site: operator:

site:appjah.blogspot.com

Or open the search URL directly:

https://www.google.com/search?q=site%3Aappjah.blogspot.com

Google Sitemap Ping Link

https://www.google.com/ping?sitemap=https://appjah.blogspot.com/sitemap.xml

Google SEO Course

https://developers.google.com/search/docs

 

 
